测试环境 PostgreSQL 库连接不上—案例分享

发布于:2025-08-03 ⋅ 阅读:(13) ⋅ 点赞:(0)

业务有问题, 业务开发反馈数据库链接不上

查看问题

命令行当前可以链接, 查看日志
发现日志文件没有生成, 查看最新可看的日志

2025-07-28 14:49:46.065 UTC     124306:PANIC:  could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device
2025-07-28 14:49:46.069 UTC     124303:LOG:  checkpointer process (PID 124306) was terminated by signal 6: Aborted
2025-07-28 14:49:46.069 UTC     124303:LOG:  terminating any other active server processes
2025-07-28 14:49:46.105 UTC     124303:LOG:  all server processes terminated; reinitializing
2025-07-28 14:49:46.529 UTC     810665:LOG:  database system was interrupted; last known up at 2025-07-28 14:34:56 UTC
2025-07-28 14:49:46.533 UTC [unknown] postgres 192.117.15.140(42692) imos 810669:FATAL:  the database system is in recovery mode

发现磁盘空间满了

查看 log_directory , 发现值是 log, 与正确的 /pg_log 不一致

postgres=# show log_directory ;
 log_directory 
---------------
 log
(1 row)

查看 postgresql.conf 文件

[root@vm110 database]# ll /pgdata/postgresql.conf 
-rw------- 1 postgres postgres 0 Jul 29 10:23 /pgdata/postgresql.conf

发现文件被修改过了

[root@vm110 pg_log]# stat /pgdata/postgresql.conf
  File: /pgdata/postgresql.conf
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 8,83	Inode: 7077966     Links: 1
Access: (0600/-rw-------)  Uid: ( 1001/postgres)   Gid: ( 1001/postgres)
Access: 2025-07-29 10:23:12.336922653 +0800
Modify: 2025-07-29 10:23:10.000917984 +0800
Change: 2025-07-29 14:12:53.973472792 +0800
 Birth: 2025-07-17 11:50:26.404484263 +0800
 

查看系统启动日志

# 查看最近的关机和启动记录
journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
  0 66930e8981b74fa7a7118e05983b950c Tue 2025-07-29 10:22:40 CST Tue 2025-07-29 15:18:03 CST


# 查看当前启动的日志
journalctl -b 0   
发现 PG 没有启动成功

Jul 29 10:23:12 vm110 rc.local[2059]:  stopped waiting
Jul 29 10:23:12 vm110 rc.local[2059]: pg_ctl: could not start server
Jul 29 10:23:12 vm110 rc.local[2059]: Examine the log output.
Jul 29 10:23:12 vm110 su[2043]: pam_unix(su-l:session): session closed for user postgres
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], ST500DM009-2F110A, S/N:ZA47TTGM, WWN:5-000c50-0a46bd034, FW:CC43, 500 GB
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], found in smartd database 7.3/5319: Seagate BarraCuda 3.5 (CMR)
Jul 29 10:23:12 vm110 rc.local[1639]: [10:23:12 initme] Check DB...
Jul 29 10:23:12 vm110 su[2234]: (to postgres) root on none
Jul 29 10:23:12 vm110 su[2234]: pam_unix(su-l:session): session opened for user postgres(uid=1001) by (uid=0)
Jul 29 10:23:12 vm110 dockerd[2047]: time="2025-07-29T10:23:12.624298340+08:00" level=info msg="Loading containers: start."
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], is SMART capable. Adding to "monitor" list.


# 查看系统日志中是否有 postgresql.conf 配置文件相关信息,
果然发现了,开机启动时候, 会从 /postgresql/script/pgsql/postgresql.conf 拷贝并复制到 /pgdata/postgresql.conf
此时磁盘空间满了, 所以清空正常, 但是写入失败了

# journalctl -b 0   |grep postgresql
Jul 29 10:23:10 vm110 rc.local[1896]: cp: error copying '/postgresql/script/pgsql/postgresql.conf' to '/pgdata/postgresql.conf': No space left on device
Jul 29 10:23:11 vm110 rc.local[1639]: [10:23:11 initme] Start postgresql...
Jul 29 10:24:35 vm110 rc.local[1639]: [10:24:35 initme] postgresql started ERR!

紧接着数据库起来了, 但是日志文件没有正常生成.
查看重启信息

[root@vm110 pg_log]# last reboot |head -3
reboot   system boot  6.4.0            Tue Jul 29 10:23   still running
reboot   system boot  6.4.0            Mon Jul 28 15:24   still running
reboot   system boot  6.4.0            Thu Jul 17 11:56   still running
[root@vm110 pg_log]# 

可以看到 07.29 10:23 有过 reboot 系统, 同一分钟内, $PGDATA/postgresql.conf 文件被改过, 数据库启动了,但是有点异常

查看 PG 何时启动的 , 可以看到是 11:40 才启动的

[root@vm110 database]# ps -ef |grep 499848 |grep database
postgres  499848       1  0 11:40 ?        00:00:05 /program/bin/postgres -D /pgdata
[root@vm110 database]# 
[root@vm110 database]# 
[root@vm110 database]# ps -eo 'lstart,cmd,pid' | grep 499848 |grep -v grep 
Tue Jul 29 11:40:22 2025 /program/bin/postgres -D /v  499848

回过来看, 为啥业务无法访问数据库, 因为 $PGDATA/postgresql.conf 文件内容被清空了, 所以 listen_address相当于是默认值 127.0.0.1
所以无法正常接受 tcp/ip 访问了

但是为啥 $PGDATA/postgresql.conf文件被清空了, 这个是由于 initme.sh代码中导致的

[root@vm110 pg_log]# grep postgresql.conf /program/bin/initme.sh -in
205:        su - postgres -c "export LD_LIBRARY_PATH=/lib:$ROOTPATH/program/lib:/usr/lib/; export PGDATA=$DATABASEDIR; cp $DB_SQLPATH/postgresql.conf $DATABASEDIR/ -rf"

# 大概是如下这样一个过程
昨晚磁盘空间满 --> PG 挂了 --> 今天系统 reboot --> PG 配置文件被清空(initme.sh脚本在空间不够情况下导致) --> PG 以空配置(默认配置)启动(失败)[硬盘空间满了]  --> 人为清理过文件(使得空间有一定空闲) ?
–> 重启 server.sh 服务 --> PG 起来了但是不接受连接 --> 业务服务异常 --> 产生大量 core dump 文件 --> DBA 发现 PG 配置异常,然后手动恢复 PG 配置 --> 重启 server.sh restart --> 服务恢复 --> 测试清理冗余文件


网站公告

今日签到

点亮在社区的每一天
去签到