Fixing a k8s cluster that breaks after the master node reboots
Symptom
[root@k8s-master ~]# kubectl get svc --all-namespaces
The connection to the server 192.168.2.129:6443 was refused - did you specify the right host or port?
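This error means kubectl cannot connect to the Kubernetes API server at https://192.168.2.129:6443. The usual cause is that the API server is not running, or that the services it depends on did not come back up after the host rebooted.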
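A quick way to confirm that nothing is listening on the API server port is to probe it directly. The following is a minimal sketch and was not part of the original troubleshooting session: the helper names endpoint_from_error and port_open are made up for illustration, and port_open assumes that bash (for /dev/tcp) and the coreutils timeout command are installed.

```shell
# Hypothetical helpers, for illustration only.

# Extract "host:port" from kubectl's "connection refused" message.
endpoint_from_error() {
  printf '%s\n' "$1" |
    sed -n 's/.*connection to the server \([^ ]*\) was refused.*/\1/p'
}

# Succeed if a TCP connection to host ($1) port ($2) opens within 3 seconds.
# Assumes bash and coreutils `timeout` are available.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (run on the master):
#   ep=$(endpoint_from_error "$(kubectl get svc 2>&1)")
#   port_open "${ep%:*}" "${ep##*:}" && echo "port open" || echo "port closed"
```

If the port is closed, the problem is on the server side (API server container or kubelet), not in the kubectl configuration.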
Troubleshooting
Check the status of Docker and kubelet
[root@k8s-master ~]# docker ps -a | grep kube-apiserver
f1a44074cd3d e6bf5ddd4098 "kube-apiserver --ad…" 12 days ago Exited (255) 2 minutes ago k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_1
4696dd0c5d53 registry.aliyuncs.com/google_containers/pause:3.6 "/pause" 12 days ago Exited (255) 2 minutes ago k8s_POD_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_1
c4c57a9857f6 e6bf5ddd4098 "kube-apiserver --ad…" 2 weeks ago Exited (255) 12 days ago k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_0
63da61b7df49 registry.aliyuncs.com/google_containers/pause:3.6 "/pause" 2 weeks ago Exited (255) 12 days ago k8s_POD_kube-apiserver-k8s-master_kube-system_86e7f717b7b24cc090597078a3c967de_0
[root@k8s-master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since 二 2025-04-01 11:12:47 CST; 8s ago
Docs: https://kubernetes.io/docs/
Process: 2274 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
Main PID: 2274 (code=exited, status=1/FAILURE)
4月 01 11:12:47 k8s-master systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
4月 01 11:12:47 k8s-master systemd[1]: Unit kubelet.service entered failed state.
4月 01 11:12:47 k8s-master systemd[1]: kubelet.service failed.
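systemctl status kubelet shows the service in the activating (auto-restart) state, with the main process exiting with status=1/FAILURE. In other words, kubelet hits an error at startup and systemd keeps trying to restart it.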
The logs reveal the key error
[root@k8s-master ~]# journalctl -u kubelet -n 100 --no-pager
-- Logs begin at 二 2025-04-01 11:09:05 CST, end at 二 2025-04-01 11:16:02 CST. --
4月 01 11:13:38 k8s-master systemd[1]: Unit kubelet.service entered failed state.
4月 01 11:13:38 k8s-master systemd[1]: kubelet.service failed.
4月 01 11:13:48 k8s-master systemd[1]: kubelet.service holdoff time over, scheduling restart.
4月 01 11:13:48 k8s-master systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
4月 01 11:13:48 k8s-master systemd[1]: Started kubelet: The Kubernetes Node Agent.
4月 01 11:13:49 k8s-master kubelet[2325]: E0401 11:13:49.044946 2325 run.go:74] "command failed" err="failed to parse kubelet flag: unknown flag: --network-plugin"
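Because kubelet is restarted every few seconds, the journal fills up with systemd noise. A small filter can pull out only the startup errors. This is a sketch under the assumption that the failure is logged in the "command failed" err="..." form seen above; kubelet_command_errors is a hypothetical helper name.

```shell
# Hypothetical helper: read kubelet journal lines on stdin and print only
# the text inside err="..." of "command failed" entries.
kubelet_command_errors() {
  sed -n 's/.*"command failed" err="\(.*\)"$/\1/p'
}

# Example (run on the affected node):
#   journalctl -u kubelet -n 100 --no-pager | kubelet_command_errors | sort | uniq -c
```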
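The log points at the network plugin configuration: when kubelet restarts, it is started with the --network-plugin flag, fails to parse it, and exits. Since kubelet never comes up, the kube-apiserver container is not started either.
--network-plugin is a kubelet flag that older Kubernetes releases used to select the network plugin. It was removed in Kubernetes 1.24, so kubelet 1.24 and later rejects it as an unknown flag.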
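To see where the flag is being passed from, it helps to search the files that feed kubelet's command line. On a kubeadm-managed, RPM-based node these are typically /var/lib/kubelet/kubeadm-flags.env, /etc/sysconfig/kubelet, and the systemd drop-in shown in the status output above. The sketch below assumes that layout; files_with_flag is a hypothetical helper, not a kubeadm tool.

```shell
# Hypothetical helper: print every readable file among the arguments
# that contains the given flag as a fixed string.
files_with_flag() {
  flag=$1
  shift
  for f in "$@"; do
    if [ -r "$f" ] && grep -qF -e "$flag" "$f"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example (paths are assumptions about a typical kubeadm install):
#   files_with_flag --network-plugin \
#     /var/lib/kubelet/kubeadm-flags.env \
#     /etc/sysconfig/kubelet \
#     /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
```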
Next, check the installed kubelet version on the master and on the worker nodes. This is where the problem shows up:
[root@k8s-master ~]# rpm -qa | grep kube
kubeadm-1.28.2-0.x86_64
kubernetes-cni-1.2.0-0.x86_64
kubectl-1.28.2-0.x86_64
kubelet-1.28.2-0.x86_64
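In other words, the kubeadm, kubelet, and kubernetes-cni packages installed on the master (1.28.2 and 1.2.0) are not compatible with the existing cluster setup, which still passes --network-plugin to kubelet.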
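The version check can be made explicit. The sketch below encodes the single rule used in this diagnosis, namely that kubelet 1.24 and later no longer accepts --network-plugin; rejects_network_plugin is a hypothetical helper name, and the rpm query in the comment assumes an RPM-based host like the one here.

```shell
# Hypothetical helper: succeed if the given kubelet version (for example
# 1.28.2 or v1.28.2) is 1.24 or newer, i.e. a release without --network-plugin.
rejects_network_plugin() {
  ver=${1#v}            # drop an optional leading "v"
  major=${ver%%.*}      # text before the first dot
  rest=${ver#*.}        # text after the first dot
  minor=${rest%%.*}     # text before the next dot
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 24 ]; }
}

# Example (run on each node):
#   v=$(rpm -q --qf '%{VERSION}\n' kubelet)
#   rejects_network_plugin "$v" && echo "kubelet $v will not accept --network-plugin"
```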
We chose to downgrade:
yum downgrade kubeadm-1.23.6 kubelet-1.23.6 kubectl-1.23.6
systemctl daemon-reload
systemctl restart kubelet.service
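After the downgrade it is worth confirming that the three packages really ended up at the same version. A minimal sketch, assuming an RPM-based host; versions_match is a hypothetical helper name.

```shell
# Hypothetical helper: read one version per line on stdin and succeed only
# if there is at least one line and all lines are identical.
versions_match() {
  [ "$(sort -u | wc -l | tr -d ' ')" = "1" ]
}

# Example:
#   rpm -q --qf '%{VERSION}\n' kubeadm kubelet kubectl | versions_match \
#     && echo "versions consistent" || echo "version mismatch"
```

kubernetes-cni is left out of the check on purpose because it follows its own version numbering (1.2.0 in the listing above).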
After that, kubectl can reach the API server again:
[root@k8s-master ~]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system istio-egressgateway-655c78bb9c-mrfxp 1/1 Running 0 12d
istio-system istio-ingressgateway-7d76958b7c-m46hz 1/1 Running 0 12d
istio-system istiod-55f774df9d-2dk6z 1/1 Running 0 12d
istio-system kiali-6b455fd9f9-jw4m9 1/1 Running 2 (4h4m ago) 12d
istio-system prometheus-7cc96d969f-sk5ww 2/2 Running 1 (4h22m ago) 12d
kube-system calico-kube-controllers-64cc74d646-6kvbh 1/1 Running 1 (3h19m ago) 12d
kube-system calico-node-f9zw6 0/1 Running 1 (3h19m ago) 12d
kube-system calico-node-v6kgf 0/1 Running 0 12d