1. Environment and version information
Kubernetes: v1.19.16
etcdctl version: 3.5.1
A 3-member etcd cluster (10.xxx.xx.129, 10.xxx.xx.130, 10.xxx.xx.131).
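2. Root cause
The etcd data on node 129 had diverged from the data on the other two members. The cluster consistency check failed, so the node could not rejoin the cluster. The fix is to remove the faulty member from the cluster, update /etc/etcd/etcd.conf, and then add the node back to the cluster.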
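3. During a routine inspection, etcd on node 129 was found unable to start
Several attempts to start etcd.service all failed. The service status showed the following.
Key log messages: stopped remote peer..., stopped TCP streaming connection with remote peer...
At this point, etcd endpoint status no longer returned 10.xxx.xx.129:
Command: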
etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" endpoint status --write-out table
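The connection flags above are repeated in every command that follows. As a convenience sketch, the same settings can be exported once, since etcdctl v3 reads flag values from ETCDCTL_-prefixed environment variables. The paths and endpoints are copied from the command above; the usage lines are left commented out because they need a reachable cluster.

```shell
# Connection settings copied from the commands in this write-up.
export ETCDCTL_API=3
export ETCDCTL_CACERT="/etc/kubernetes/ssl/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/ssl/etcd_client.crt"
export ETCDCTL_KEY="/etc/kubernetes/ssl/etcd_client.key"
export ETCDCTL_ENDPOINTS="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159"

# With the variables exported, the flags can be dropped:
# etcdctl endpoint status --write-out table
# etcdctl member list --write-out table
```

Setting a variable and also passing the equivalent flag may be rejected or trigger a warning depending on the etcdctl version, so it is safer to use one form or the other.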
etcd member list, however, still showed 10.xxx.xx.129:
Command:
etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" member list --write-out table
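The removal step needs the hexadecimal ID of the faulty member, which is read from the member list output. A minimal sketch of extracting it by peer URL follows. It assumes the default (simple) output format of etcdctl 3.5, one comma-separated line per member with the peer URL in the fourth field; member_id_for_peer is a hypothetical helper name, not an etcdctl feature.

```shell
# Print the ID of the member whose peer URL equals $1.
# Reads `etcdctl member list` output (default simple format) on stdin.
# Assumed field order: ID, status, name, peer URLs, client URLs, is-learner.
member_id_for_peer() {
  awk -F', ' -v url="$1" '$4 == url { print $1 }'
}

# Usage against a live cluster (connection flags omitted here):
# etcdctl member list | member_id_for_peer "https://10.xxx.xx.129:2380"
```

Matching on the peer URL rather than the member name also works for a member that has been added but not yet started, since such a member has no name in the listing.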
4. Remove the damaged member etcd_129
Command:
etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" member remove b572c7cf1e338c4d
Member b572c7cf1e338c4d removed from cluster af90dec9e9ee777
Confirm the removal, for example by running member list again; 10.xxx.xx.129 should no longer be listed.
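Removing a member changes the quorum arithmetic, which is worth keeping in mind while the cluster runs on two members. etcd needs a majority of its members, floor(n/2) + 1, to commit writes. The sketch below only does that arithmetic; quorum and fault_tolerance are illustrative function names.

```shell
# Majority needed for a cluster of $1 members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a majority remains.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 2; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=3 quorum=2 tolerated_failures=1
# members=2 quorum=2 tolerated_failures=0
```

With two members, both must stay up until node 129 has rejoined, so it is prudent not to restart 130 or 131 during the repair.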
5. Re-add the etcd_129 node
1) Add etcd_129
Command:
etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --peer-urls="https://10.xxx.xx.129:2380" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" member add etcd15_129
Member 552d3942486a87c6 added to cluster af90dec9e9ee777
2) Update /etc/etcd/etcd.conf:
Change ETCD_INITIAL_CLUSTER_STATE="new" to ETCD_INITIAL_CLUSTER_STATE="existing".
Command:
sed -i 's#ETCD_INITIAL_CLUSTER_STATE="new"#ETCD_INITIAL_CLUSTER_STATE="existing"#g' /etc/etcd/etcd.conf
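Before editing the live file, the substitution can be rehearsed on a scratch copy. The sketch below builds a throwaway file containing the one relevant line, applies the same sed expression, and prints the result. The temporary file is an assumption for illustration; on the real node the target is /etc/etcd/etcd.conf. Using -i.bak instead of -i additionally keeps a copy of the unedited file next to it.

```shell
# Scratch file standing in for /etc/etcd/etcd.conf (illustration only).
tmp_conf="$(mktemp)"
printf '%s\n' 'ETCD_INITIAL_CLUSTER_STATE="new"' > "$tmp_conf"

# Same substitution as above; -i.bak leaves the original in "$tmp_conf.bak".
sed -i.bak 's#ETCD_INITIAL_CLUSTER_STATE="new"#ETCD_INITIAL_CLUSTER_STATE="existing"#g' "$tmp_conf"

# Show the line after the edit.
grep '^ETCD_INITIAL_CLUSTER_STATE=' "$tmp_conf"
```

Running the same grep against /etc/etcd/etcd.conf after the real edit is a quick way to confirm that the value is now "existing".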
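At this point the cluster status shows the newly added member as unstarted.
3) Delete the old data directory of the re-added member and adjust the related configuration.
If the old etcd data directory of the damaged node is not cleared, etcd.service on that node will fail to start.
Command:
cp -a /apps/etcd_data{,_bak} && rm -rf /apps/etcd_data/etcd/*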
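Because this step deletes data, a slightly more defensive variant may be worth considering. The sketch below wraps the backup and the delete in a function that refuses to run if the expected directory is missing or a backup already exists, and deletes only after the copy has succeeded. reset_etcd_data is a hypothetical helper; the etcd/ subdirectory layout is taken from the command above.

```shell
# Back up an etcd data root, then empty its etcd/ subdirectory.
# $1: data root, e.g. /apps/etcd_data in this write-up.
reset_etcd_data() {
  data_root="$1"
  backup="${data_root}_bak"

  if [ ! -d "$data_root/etcd" ]; then
    echo "missing directory: $data_root/etcd" >&2
    return 1
  fi
  if [ -e "$backup" ]; then
    echo "backup already exists: $backup" >&2
    return 1
  fi

  # Delete only after the copy has succeeded.
  cp -a "$data_root" "$backup" || return 1
  rm -rf "${data_root:?}/etcd/"*
}

# Usage on the node, with etcd.service stopped:
# reset_etcd_data /apps/etcd_data
```

The ${data_root:?} expansion aborts the rm if the variable is empty, which guards against the glob expanding relative to the filesystem root.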
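6. Start the service and verify the cluster
Command:
systemctl start etcd.service && systemctl status etcd.service # start etcd, then check the service status
Command: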
etcdctl --cacert="/etc/kubernetes/ssl/ca.crt" --cert="/etc/kubernetes/ssl/etcd_client.crt" --key="/etc/kubernetes/ssl/etcd_client.key" --endpoints="https://10.xxx.xx.129:1159,https://10.xxx.xx.130:1159,https://10.xxx.xx.131:1159" endpoint status --write-out table # check etcd endpoint status
If the recovery succeeded, 10.xxx.xx.129 is listed in the endpoint status table again.
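A freshly re-added member needs some time to catch up with the leader, so a single status check right after the start can be misleading. The sketch below polls etcdctl endpoint health for one endpoint until it succeeds or a retry limit is reached. wait_for_healthy and its defaults (10 attempts, 3 seconds apart) are assumptions for illustration, and the TLS settings are assumed to be supplied through ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY.

```shell
# Poll one endpoint until `etcdctl endpoint health` succeeds.
# $1: endpoint URL, $2: max attempts (default 10), $3: seconds between attempts (default 3).
wait_for_healthy() {
  endpoint="$1"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # The endpoint is passed via the environment for this one call.
    if ETCDCTL_ENDPOINTS="$endpoint" etcdctl endpoint health >/dev/null 2>&1; then
      echo "healthy after $i attempt(s): $endpoint"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s): $endpoint" >&2
  return 1
}

# Usage:
# wait_for_healthy "https://10.xxx.xx.129:1159"
```

The function returns non-zero when the limit is reached, so it can gate later steps in a script, for example the final endpoint status check.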