Master节点重启k8s集群崩溃解决办法

发布于:2025-04-02 ⋅ 阅读:(18) ⋅ 点赞:(0)

Master节点重启k8s集群崩溃解决办法

现象

[root@k8s-master ~]# kubectl get svc --all-namespaces
The connection to the server 192.168.2.129:6443 was refused - did you specify the right host or port?

这个错误表明 kubectl 无法连接到 Kubernetes API 服务器(地址为 https://192.168.2.129:6443),通常是因为 API 服务器未运行或相关服务在主机重启后未能正常启动。

排查

查看docker和kubelet状态

[root@k8s-master ~]# docker ps -a | grep kube-apiserver
f1a44074cd3d   e6bf5ddd4098                                        "kube-apiserver --ad…"   12 days ago   Exited (255) 2 minutes ago             k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_1
4696dd0c5d53   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                  12 days ago   Exited (255) 2 minutes ago             k8s_POD_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_1
c4c57a9857f6   e6bf5ddd4098                                        "kube-apiserver --ad…"   2 weeks ago   Exited (255) 12 days ago               k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_0
63da61b7df49   registry.aliyuncs.com/google_containers/pause:3.6   "/pause"                  2 weeks ago   Exited (255) 12 days ago               k8s_POD_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_0
[root@k8s-master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since 二 2025-04-01 11:12:47 CST; 8s ago
     Docs: https://kubernetes.io/docs/
  Process: 2274 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 2274 (code=exited, status=1/FAILURE)

4月 01 11:12:47 k8s-master systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
4月 01 11:12:47 k8s-master systemd[1]: Unit kubelet.service entered failed state.
4月 01 11:12:47 k8s-master systemd[1]: kubelet.service failed.

systemctl status kubelet 显示服务状态为 activating (auto-restart),主进程退出码为 status=1/FAILURE,表明 kubelet 在启动时遇到了错误并不断尝试重启。

查看日志发现关键错误

[root@k8s-master ~]# journalctl -u kubelet -n 100 --no-pager
-- Logs begin at 二 2025-04-01 11:09:05 CST, end at 二 2025-04-01 11:16:02 CST. --
4月 01 11:13:38 k8s-master systemd[1]: Unit kubelet.service entered failed state.
4月 01 11:13:38 k8s-master systemd[1]: kubelet.service failed.
4月 01 11:13:48 k8s-master systemd[1]: kubelet.service holdoff time over, scheduling restart.
4月 01 11:13:48 k8s-master systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
4月 01 11:13:48 k8s-master systemd[1]: Started kubelet: The Kubernetes Node Agent.
4月 01 11:13:49 k8s-master kubelet[2325]: E0401 11:13:49.044946    2325 run.go:74] "command failed" err="failed to parse kubelet flag: unknown flag: --network-plugin"

根据日志可以初步判断,应该是cni网络模块出了问题,kebelet重启后,启动网络插件的命令无法执行。
--network-plugin 是一个在较早版本 Kubernetes 中用于指定网络插件的标志,但在 Kubernetes 1.24 及更高版本中已被废弃。

之后分别在master和node节点上查看下kubelet的版本,结果发现了问题:

[root@k8s-master ~]# rpm -qa | grep kube
kubeadm-1.28.2-0.x86_64
kubernetes-cni-1.2.0-0.x86_64
kubectl-1.28.2-0.x86_64
kubelet-1.28.2-0.x86_64

即master节点的kubeadm、kubelet、kubernetes-cni版本不兼容

我们选择降级

yum downgrade kubeadm-1.23.6 kubelet-1.23.6 kubectl-1.23.6

systemctl restart kubelet.service
systemctl daemon-reload

最后成功

[root@k8s-master ~]# kubectl get pods --all-namespaces
NAMESPACE         NAME                                       READY   STATUS    RESTARTS        AGE
istio-system      istio-egressgateway-655c78bb9c-mrfxp       1/1     Running   0               12d
istio-system      istio-ingressgateway-7d76958b7c-m46hz      1/1     Running   0               12d
istio-system      istiod-55f774df9d-2dk6z                    1/1     Running   0               12d
istio-system      kiali-6b455fd9f9-jw4m9                     1/1     Running   2 (4h4m ago)    12d
istio-system      prometheus-7cc96d969f-sk5ww                2/2     Running   1 (4h22m ago)   12d
kube-system       calico-kube-controllers-64cc74d646-6kvbh   1/1     Running   1 (3h19m ago)   12d
kube-system       calico-node-f9zw6                          0/1     Running   1 (3h19m ago)   12d
kube-system       calico-node-v6kgf                          0/1     Running   0               12d