Ceph scrub error notes

Published: 2024-06-16

Purpose

  1. Record a Ceph scrub error
  2. Resolve the Ceph scrub failure

Fault information (cluster status)

  cluster:
    id:     xxx-xxx-xxx
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 2 pg inconsistent

Entries in /var/log/messages

# egrep -i 'medium|i\/o error|sector|Prefailure' /var/log/messages
Jun 15 00:23:37 my-ceph-osd-host kernel: sd 0:2:6:0: [sdg] tag#0 Sense Key : Medium Error [current]
Jun 15 00:23:37 my-ceph-osd-host kernel: blk_update_request: critical medium error, dev sdg, sector 7541632
Jun 15 00:23:37 my-ceph-osd-host kernel: megaraid_sas 0000:1c:00.0: 63816 (771726199s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x41/s5) at 731440
Jun 15 00:23:37 my-ceph-osd-host kernel: megaraid_sas 0000:1c:00.0: 63817 (771726201s/0x0001/FATAL) - Uncorrectable medium error logged for VD 06/6 at 731440 (on PD 05(e0x41/s5) at 731440)
Jun 15 00:30:55 my-ceph-osd-host kernel: sd 0:2:6:0: [sdg] tag#1 Sense Key : Medium Error [current]
Jun 15 00:30:55 my-ceph-osd-host kernel: blk_update_request: critical medium error, dev sdg, sector 7509376
Jun 15 00:30:55 my-ceph-osd-host kernel: megaraid_sas 0000:1c:00.0: 63822 (771726637s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x41/s5) at 7296a0
Jun 15 00:30:55 my-ceph-osd-host kernel: megaraid_sas 0000:1c:00.0: 63823 (771726639s/0x0001/FATAL) - Uncorrectable medium error logged for VD 06/6 at 7296a0 (on PD 05(e0x41/s5) at 7296a0)
Jun 15 00:36:06 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 7728512
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 11491457792
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 11491458304
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 77630336
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 77630848
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 77631360
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 77631872
Jun 15 00:36:07 my-ceph-osd-host kernel: blk_update_request: I/O error, dev sdg, sector 77632384
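
When the log is long, it can help to count the errors per block device before deciding which disk to look at. The following is a minimal sketch, not part of the original troubleshooting session; count_disk_errors is a hypothetical helper name, and it assumes the kernel messages have the "dev NAME," shape shown above.

```shell
#!/bin/sh
# Hypothetical helper: count kernel medium/I/O errors per block device.
# Reads syslog-style lines on stdin and prints "<device> <count>" per device.
# Assumes lines of the form "... medium error, dev sdg, sector N"
# or "... I/O error, dev sdg, sector N", as in the log above.
count_disk_errors() {
    grep -Ei 'medium error|i/o error' \
        | sed -n 's/.*dev \([a-z0-9]*\),.*/\1/p' \
        | sort | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage (log path as used in this article):
#   count_disk_errors < /var/log/messages
```

Lines that match the error pattern but do not name a block device, such as the megaraid_sas entries, are skipped on purpose. Mapping the device back to an OSD id depends on the deployment; on LVM-based OSDs, running ceph-volume lvm list on the OSD host is one option.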

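Fault explanation

  1. When Ceph detects that one or more replicas of an object are inconsistent, it marks the corresponding PG as inconsistent
  2. In practice this can mean, for example:
    2.1 The replicas of an object differ in size
    2.2 After recovery completes, an object is missing from one or more replicas
  3. The inconsistency is usually discovered while the PG is being scrubbed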
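
To see which objects scrub flagged in a given PG, the rados tool can list them. This is a small sketch rather than the procedure used in the original incident; show_inconsistent_objects is a hypothetical wrapper name, and the PG id has to come from the cluster (for example from ceph health detail).

```shell
#!/bin/sh
# Hypothetical wrapper: show what scrub recorded for one inconsistent PG.
# $1 is a PG id as reported by the cluster.
show_inconsistent_objects() {
    pgid=$1
    if [ -z "$pgid" ]; then
        echo "usage: show_inconsistent_objects PGID" >&2
        return 2
    fi
    # Lists the objects whose replicas disagree, with the errors
    # recorded for each shard, as pretty-printed JSON.
    rados list-inconsistent-obj "$pgid" --format=json-pretty
}

# Usage:
#   show_inconsistent_objects PGID
```

With a failing disk like the one in the log above, one would expect the shard on the affected OSD to carry the errors, but that is an expectation to verify against the actual output.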

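Solution

  1. The messages log above shows that the disk behind one OSD (sdg on this host) has failed
  2. Take the corresponding OSD offline
  3. Use ceph health detail to find the affected PGs
  4. Run ceph pg repair PGID for each affected PG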
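
Steps 3 and 4 above can be sketched as a small script. This is an illustration under assumptions, not the exact commands from the original incident: inconsistent_pgs and repair_inconsistent_pgs are hypothetical helper names, and the parsing assumes that ceph health detail prints lines of the form "pg PGID is ...inconsistent...".

```shell
#!/bin/sh
# Step 2 is not automated here. One common way to take the OSD out is:
#   ceph osd out OSD_ID
# where OSD_ID is the id of the OSD backed by the failing disk.

# Print the ids of PGs reported as inconsistent.
# Reads `ceph health detail` output on stdin.
inconsistent_pgs() {
    awk '$1 == "pg" && $0 ~ /inconsistent/ { print $2 }'
}

# Steps 3 and 4: find every inconsistent PG and ask Ceph to repair it.
repair_inconsistent_pgs() {
    ceph health detail | inconsistent_pgs | while read -r pgid; do
        echo "repairing $pgid"
        # </dev/null keeps the command from consuming the loop's input.
        ceph pg repair "$pgid" </dev/null
    done
}

# Usage:
#   repair_inconsistent_pgs
```

ceph pg repair only schedules the repair, so it is worth re-checking ceph health detail afterwards to confirm that the scrub errors have cleared.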