环境
系统平台:N/A
版本:N/A
症状
集群状态:
[root@bthbj-hgywsjkjq-ip28-cen76 ~]# hghactl list
- Cluster: highgo-ee-cluster —±-------------±--------±—±----------+
| Member | Host | Role | State | TL | Lag in MB |
±---------±-------------------±-------------±--------±—±----------+
| pgsql_1 | x.x.100.1:5866 | leader | running | 84 | |
| pgsql_2 | x.x.100.2:5866 | crash | running | 84 | 0 |
±---------±-------------------±-------------±--------±—±----------+
注意:虽然状态显示为crash,但是实际状态并未停止,仍然是running,数据库正常访问。
问题原因
检查hghac日志:
2020-04-03 14:40:35,434 INFO: Lock owner: None; I am pgsql_2
2020-04-03 14:40:35,524 INFO: PAUSE: postgres is not running
2020-04-03 14:40:45,433 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:45,434 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:45,434 WARNING: Postgresql is not running.
2020-04-03 14:40:45,434 INFO: Lock owner: None; I am pgsql_2
2020-04-03 14:40:45,523 INFO: PAUSE: postgres is not running
2020-04-03 14:40:55,435 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:55,435 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:55,435 WARNING: Postgresql is not running.
原因:
由于服务器时间发生跳变,则进程在操作系统中的启动时间戳也会发生改变,不再与$PGDATA/postmaster.pid文件中记录的时间相同,
hghac(patroni)会认为有集群以外其他的postmaster正在该$PGDATA运行,并尝试停止postmaster进程,但是由于操作系统pid一致,未能成功停止该进程,所以会不断切换节点状态。
解决方案
解决方案:
(稳妥推荐)条件允许情况下,在问题节点手动重启集群服务和数据库服务,让数据库的postmaster.pid中记录的时间戳更新。或者:
手动修改postmaster.pid,修改:
[root@localhost data]# cat postmaster.pid
77888
/opt/highgo/hgdb-see-4.5.8/data
1565058213 <—修改为hghac日志中的记录时间,
5866
/tmp
*
5866001 48
ready
注意:一些版本中,hghac日志会把进程时间和pid创建时间位置反了,所以将postmaster.pid的时间改为日志中另一个时间,让两个时间相同即可。