Troubleshooting a K8s etcd Service That Fails to Start

Published: 2025-03-19

1. Environment and version information

k8s:v1.19.16

etcdctl version: 3.5.1

A cluster made up of three etcd nodes (10.xxx.xx.129, 10.xxx.xx.130, 10.xxx.xx.131).

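2. Root cause

The etcd data on node 129 had diverged from the data on the other two nodes. The cluster consistency check failed, so the node could not rejoin the cluster endpoints. The fix is to remove the faulty node from the cluster first, modify /etc/etcd/etcd.conf, and then add the node back to the cluster.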

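3. During a routine inspection, the etcd service on node 129 was found unable to start

Several attempts to start etcd.service all failed. The service status showed the following.

Key log messages: stopped remote peer..., stopped TCP streaming connection with remote peer...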
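To pull just these messages out of a longer log, a small filter along the following lines may help. This is only a sketch: the function name `peer_stop_lines` is made up for illustration, and the usage comment assumes the unit's logs are held by journald.

```shell
# Keep only the peer-related lines quoted above from a log stream on stdin.
peer_stop_lines() {
    grep -E 'stopped (remote peer|TCP streaming connection with remote peer)'
}

# Hypothetical usage on the affected node (assumes journald holds the logs):
#   journalctl -u etcd.service --no-pager -n 200 | peer_stop_lines
```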

Checking the etcd endpoint status at this point showed that 10.xxx.xx.129 could no longer be queried:

Command:

etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" endpoint status --write-out table  
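The same TLS and endpoint flags recur in every etcdctl call in this write-up. As a convenience, etcdctl v3 can read them from `ETCDCTL_`-prefixed environment variables instead. The sketch below reuses the paths and the redacted endpoints exactly as they appear in the command above.

```shell
# Set the connection options once per shell session.
export ETCDCTL_CACERT="/etc/kubernetes/ssl/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/ssl/etcd_client.crt"
export ETCDCTL_KEY="/etc/kubernetes/ssl/etcd_client.key"
export ETCDCTL_ENDPOINTS="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159"

# The commands then shrink to, for example:
#   etcdctl endpoint status --write-out table
#   etcdctl member list --write-out table
```

If you go this route, it is probably best not to pass the same option as a flag as well, since etcdctl may reject a variable that is shadowed by its command-line flag.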

Checking the etcd member list, however, 10.xxx.xx.129 was still listed:

Command:

etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" member list --write-out table

Result:

4. Remove the damaged node etcd_129

Command:

etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159"  member remove b572c7cf1e338c4d

Member b572c7cf1e338c4d removed from cluster  af90dec9e9ee777
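The member ID passed to `member remove` comes from the `member list` output. A minimal sketch for looking it up by peer URL follows. It assumes the default (non-table) output format, where fields are separated by a comma and a space in the order ID, status, name, peer addresses, client addresses, learner flag. The helper name `member_id_by_peer` is hypothetical.

```shell
# Print the ID of the member whose peer address field contains the given URL.
# Reads `etcdctl member list` output (default "simple" format) on stdin.
member_id_by_peer() {
    awk -F', ' -v peer="$1" 'index($4, peer) > 0 { print $1 }'
}

# Hypothetical usage, with the TLS and endpoint flags omitted for brevity:
#   etcdctl member list | member_id_by_peer "https://10.xxx.xx.129:2380"
```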

Confirm the removal:

Result:

5. Re-add the etcd_129 node

1) Add etcd_129

Command:

etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --peer-urls="https://10.xxx.xx.129:2380" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" member add  etcd15_129

Member 552d3942486a87c6 added to cluster  af90dec9e9ee777

2) Modify the /etc/etcd/etcd.conf configuration:

Change the original ETCD_INITIAL_CLUSTER_STATE="new" to ETCD_INITIAL_CLUSTER_STATE="existing".

Command:

sed -i 's#ETCD_INITIAL_CLUSTER_STATE="new"#ETCD_INITIAL_CLUSTER_STATE="existing"#g' /etc/etcd/etcd.conf

Result:

At this point the cluster status shows the newly added node as unstarted.
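Because this edit decides whether the node tries to bootstrap a new cluster or join the existing one, it may be worth keeping a backup and checking the result. The following is a sketch under the assumption that the setting sits at the start of a line in the file. The function name `set_cluster_state_existing` is illustrative only.

```shell
# Switch ETCD_INITIAL_CLUSTER_STATE from "new" to "existing" in the given
# file, keeping a <file>.bak copy. Returns non-zero if the expected line is
# not present afterwards (assumes the setting starts at column 1).
set_cluster_state_existing() {
    conf="$1"
    sed -i.bak 's#ETCD_INITIAL_CLUSTER_STATE="new"#ETCD_INITIAL_CLUSTER_STATE="existing"#g' "$conf" || return 1
    grep -q '^ETCD_INITIAL_CLUSTER_STATE="existing"' "$conf"
}

# Usage on the affected node:
#   set_cluster_state_existing /etc/etcd/etcd.conf && echo "state updated"
```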

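3) Delete the old data directory of the newly added member and update the related configuration.

If the old etcd data directory of the damaged node is not deleted, the etcd service on that node will fail to start.

Command: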

cp -a /apps/etcd_data{,_bak} && rm -rf /apps/etcd_data/etcd/*
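Since this step deletes data, a slightly more defensive variant could clear the directory only after the backup copy has succeeded. The sketch below is one possible shape, not the procedure used in this incident. The function name `backup_and_clear` and the timestamped backup suffix are assumptions.

```shell
# Copy the whole data directory to a timestamped backup, then empty the
# etcd subdirectory. Nothing is deleted unless the copy succeeded.
# Prints the backup path on success.
backup_and_clear() {
    src="$1"      # e.g. /apps/etcd_data
    target="$2"   # e.g. /apps/etcd_data/etcd
    backup="${src}_bak_$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$src" ] || [ ! -d "$target" ]; then
        echo "missing directory: $src or $target" >&2
        return 1
    fi
    cp -a "$src" "$backup" || return 1
    # Remove everything inside the target, including hidden files,
    # but keep the target directory itself.
    find "$target" -mindepth 1 -delete || return 1
    echo "$backup"
}

# Usage on the affected node, with etcd stopped:
#   backup_and_clear /apps/etcd_data /apps/etcd_data/etcd
```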

6. Start the service and verify that the cluster is healthy

Command:

systemctl start etcd.service && systemctl status etcd.service     # start etcd, then check the service status

Command:

etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" endpoint status --write-out table         # check the etcd endpoint status

Result:
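A rejoining member can take a little while to catch up, so a single status check right after the start may be too early. One option is to poll until a health check passes. The sketch below is a generic retry helper. The name `wait_until` is made up, and pairing it with `etcdctl endpoint health` assumes that command exits non-zero while any endpoint is unhealthy.

```shell
# Run a command once per second until it succeeds or the timeout (in
# seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_s="$1"
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Hypothetical usage, with the TLS and endpoint flags omitted for brevity:
#   wait_until 60 etcdctl endpoint health && echo "cluster healthy"
```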