GaussDB cluster failure: cm_ctl: can't connect to cm_server

Published: 2025-09-02

1. Problem description

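The environment is a GaussDB cluster in a 3-AZ, 3-replica architecture. After the node servers were rebooted, cm_ctl reported that it could not connect to cm_server: "cm_ctl: can't connect to cm_server."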

[omm@gaussdb03 ~]$ cm_ctl query -Cvpid
[  CMServer State   ]

node             node_ip         instance                             state
-----------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   1    /data/cluster/data/cm/cm_server Down
2  172.16.60.227 172.16.60.227   2    /data/cluster/data/cm/cm_server Down
3  172.16.60.228 172.16.60.228   3    /data/cluster/data/cm/cm_server Standby

[    ETCD State     ]

node             node_ip         instance                     state
---------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   7001 /data/cluster/data/etcd Down
2  172.16.60.227 172.16.60.227   7002 /data/cluster/data/etcd Down
3  172.16.60.228 172.16.60.228   7003 /data/cluster/data/etcd Down

cm_ctl: can't connect to cm_server. 
Maybe cm_server is not running, or timeout expired. Please try again.

2. Problem analysis

  • Check every machine: the cluster component processes (CM, ETCD, GTM, CN, DN) are all still running.
[root@gaussdb03 ~]# ps -ef |grep cluster
omm         5198       1  0 13:43 ?        00:00:06 /data/cluster/core/app/bin/om_monitor -L /data/cluster/logs/gaussdb/omm/cm/om_monitor
omm         5202    5198  9 13:43 ?        00:01:32 /data/cluster/core/app/bin/cm_agent
omm         5214       1  0 13:43 ?        00:00:03 /data/cluster/core/app/bin/etcd -name etcd_7003 --data-dir /data/cluster/data/etcd --client-cert-auth --trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --cert-file /data/cluster/data/etcd/etcd.crt --key-file /data/cluster/data/etcd/etcd.key --peer-client-cert-auth --peer-trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --peer-cert-file /data/cluster/data/etcd/etcd.crt --peer-key-file /data/cluster/data/etcd/etcd.key -initial-advertise-peer-urls https://172.16.60.228:30320 -listen-peer-urls https://172.16.60.228:30320 -listen-client-urls https://172.16.60.228:30300 -advertise-client-urls https://172.16.60.228:30300 --election-timeout 5000 --heartbeat-interval 1000 --log-outputs stdout --quota-backend-bytes 8589934592 --auto-compaction-mode periodic --auto-compaction-retention 1h -initial-cluster-token etcd-cluster-omm --enable-v2=false -initial-cluster etcd_7001=https://172.16.60.226:30320,etcd_7002=https://172.16.60.227:30320,etcd_7003=https://172.16.60.228:30320 -initial-cluster-state new
omm         5362       1  0 13:43 ?        00:00:00 /data/cluster/core/app/bin/gs_gtm -D /data/cluster/data/gtm -M pending
omm         5369       1 41 13:43 ?        00:06:57 /data/cluster/core/app/bin/gaussdb --coordinator -D /data/cluster/data/cn
omm         5385       1  2 13:43 ?        00:00:29 /data/cluster/core/app/bin/cm_server
omm         5576       1 23 13:43 ?        00:03:56 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6003 -M pending
omm         6225       1 23 13:43 ?        00:03:54 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6006 -M pending
omm         6482       1 23 13:43 ?        00:03:57 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6007 -M pending
root       23084   23031  0 13:59 pts/0    00:00:00 grep cluster
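The process check above can be scripted. What follows is a minimal sketch, assuming the binary names visible in the `ps` listing (om_monitor, cm_agent, etcd, gs_gtm, cm_server, gaussdb). The helper name `missing_components` is hypothetical, and it only scans text captured from `ps -ef`:

```python
# Hypothetical helper: report which expected cluster binaries are absent
# from captured `ps -ef` output. Binary names are taken from the listing above.
EXPECTED = ("om_monitor", "cm_agent", "etcd", "gs_gtm", "cm_server", "gaussdb")


def missing_components(ps_output, expected=EXPECTED):
    """Return the expected binary names that no listed process is running."""
    running = set()
    for line in ps_output.splitlines():
        # Columns of `ps -ef`: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]            # first token of CMD
        running.add(executable.rsplit("/", 1)[-1])   # drop the directory part
    return [name for name in expected if name not in running]
```

An empty result only shows that the processes were started. As the next steps show, a running process can still be cut off from its peers.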
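  • No CM Server is Primary and every ETCD instance shows Down. According to the official documentation, ETCD must be healthy first, because CM relies on ETCD to elect its primary.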
  • Check the ETCD log.
[omm@gaussdb01 etcd]$ pwd
/data/cluster/logs/gaussdb/omm/cm/etcd
[omm@gaussdb01 etcd]$ view etcd_7001-current.log
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to d354b9b181618c10 at term 5"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
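With many such entries, grouping the warnings by peer makes the pattern easier to see. Below is a minimal sketch that assumes the log is one JSON object per line, as in the excerpt above, and uses only the field names that appear there (`msg`, `remote-peer-id`, `error`). The function name `unhealthy_peers` is hypothetical:

```python
import json
from collections import Counter


def unhealthy_peers(log_lines):
    """Count (peer id, error) pairs from 'prober detected unhealthy status' entries."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "prober detected unhealthy status":
            continue
        counts[(entry.get("remote-peer-id"), entry.get("error"))] += 1
    return counts
```

Passing an open file handle for `etcd_7001-current.log` as `log_lines` would work, since the function only iterates over lines. A "connect: no route to host" error towards a peer whose machine is up is often what a rejecting host firewall produces, which points at the next check.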
  • Check the firewall configuration: firewalld is still enabled and running, so stop and disable it.
[root@gaussdb03 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2025-09-01 13:42:30 CST; 29min ago
     Docs: man:firewalld(1)
 Main PID: 1334 (firewalld)
    Tasks: 2
   Memory: 34.6M
   CGroup: /system.slice/firewalld.service
           └─1334 /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid

Sep 01 13:42:29 gaussdb03 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 01 13:42:30 gaussdb03 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@gaussdb03 ~]# systemctl stop firewalld.service
[root@gaussdb03 ~]# systemctl disable firewalld.service
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
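Once the firewall is stopped on every node, reachability of the ETCD peer port can be confirmed directly. Here is a minimal sketch using only the Python standard library. The function name `probe` and its return strings are hypothetical, and port 30320 is the peer port taken from the etcd command line shown earlier:

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # EHOSTUNREACH is the errno behind "no route to host".
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"
        return "error: %s" % exc
```

For example, `probe("172.16.60.227", 30320)` run from gaussdb01 should return "open" once the peer is reachable. The same call can be repeated for the other node addresses and for the client port 30300.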
  • Check the cluster state again: it is back to normal.
[omm@gaussdb01 ~]$ cm_ctl query -Cv
[  CMServer State   ]

node             instance state
---------------------------------
1  172.16.60.226 1        Standby
2  172.16.60.227 2        Standby
3  172.16.60.228 3        Primary

[    ETCD State     ]

node             instance state
---------------------------------------
1  172.16.60.226 7001     StateFollower
2  172.16.60.227 7002     StateLeader
3  172.16.60.228 7003     StateFollower

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[ Coordinator State ]

node             instance state
---------------------------------
1  172.16.60.226 5001     Normal
2  172.16.60.227 5002     Normal
3  172.16.60.228 5003     Normal

[ Central Coordinator State ]

node             instance state
---------------------------------
2  172.16.60.227 5002     Normal

[     GTM State     ]

node             instance state                    sync_state
-----------------------------------------------------------------
1  172.16.60.226 1001     P Primary Connection ok  Sync
2  172.16.60.227 1002     S Standby Connection ok  Sync
3  172.16.60.228 1003     S Standby Connection ok  Sync

[  Datanode State   ]

node             instance state            | node             instance state            | node             instance state
---------------------------------------------------------------------------------------------------------------------------------------
1  172.16.60.226 6001     P Primary Normal | 2  172.16.60.227 6002     S Standby Normal | 3  172.16.60.228 6003     S Standby Normal
2  172.16.60.227 6004     P Primary Normal | 1  172.16.60.226 6005     S Standby Normal | 3  172.16.60.228 6006     S Standby Normal
3  172.16.60.228 6007     P Primary Normal | 2  172.16.60.227 6008     S Standby Normal | 1  172.16.60.226 6009     S Standby Normal
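The ETCD part of this check can be automated for monitoring. The sketch below assumes the column layout of the `cm_ctl query -Cv` output shown above (node, IP, instance, state) and would need adjusting for the wider `-Cvpid` layout. Both function names are hypothetical:

```python
def etcd_states(query_output):
    """Map ETCD instance id -> state from `cm_ctl query -Cv` text."""
    states = {}
    in_etcd = False
    for line in query_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Section headers look like "[    ETCD State     ]".
            in_etcd = "ETCD State" in stripped
            continue
        fields = stripped.split()
        # Data rows start with a numeric node id: node, ip, instance, state.
        if in_etcd and len(fields) >= 4 and fields[0].isdigit():
            states[fields[2]] = fields[-1]
    return states


def etcd_healthy(states):
    """True when exactly one member is leader and the rest are followers."""
    values = list(states.values())
    return values.count("StateLeader") == 1 and all(
        v in ("StateLeader", "StateFollower") for v in values
    )
```

Applied to the output above, `etcd_states` yields one leader (7002) and two followers, so `etcd_healthy` returns True. With all three instances Down, it returns False.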

3. Summary

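The operating-system firewall had never been disabled, so it started again when the OS rebooted. With the firewall up, the ETCD members could not reach one another and stayed unhealthy. That in turn left the CMS in an abnormal state, and cm_ctl could not connect to the instances.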
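The reason a blocked peer port takes down all of ETCD is that etcd uses Raft, which can only elect a leader when a majority of members can communicate. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def quorum(members):
    """Smallest majority of a Raft cluster with the given member count."""
    return members // 2 + 1


def can_elect_leader(members, reachable):
    """`reachable` counts the member itself plus the peers it can talk to."""
    return reachable >= quorum(members)
```

With three members the quorum is two. When every node can reach only itself, `can_elect_leader(3, 1)` is False on every node, so no leader emerges and CM has nothing to base its own primary election on.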

