Troubleshooting and Resolving Ceph PG unfound/lost Objects

Published: 2025-05-09

Background

A Ceph cluster went into HEALTH_ERR, reporting unfound objects in one PG. A repair was unable to fix them automatically.

Symptoms

  • ceph health detail showed:

    HEALTH_ERR 4/213107278 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 36/2130898991 objects degraded (0.000%), 1 pg degraded
    OBJECT_UNFOUND 4/213107278 objects unfound (0.000%)
        pg 2.f06 has 4 unfound objects
    PG_DAMAGED Possible data damage: 1 pg recovery_unfound
        pg 2.f06 is active+recovery_unfound+degraded+repair, acting [520,454,563,300,70,59,243,166,422,333], 4 unfound
    PG_DEGRADED Degraded data redundancy: 36/2130898991 objects degraded (0.000%), 1 pg degraded
        pg 2.f06 is active+recovery_unfound+degraded+repair, acting [520,454,563,300,70,59,243,166,422,333], 4 unfound
    
  • The repair log showed:

    repair 4 missing, 0 inconsistent objects
    repair 36 errors, 36 fixed
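
On a large cluster the health report can run to many lines, so it helps to filter out just the PGs that report unfound objects. The following is a minimal sketch that assumes the "pg <pgid> has <n> unfound objects" line format seen in the health output above; unfound_pgs is a hypothetical helper name, not part of the Ceph CLI.

```shell
# Hypothetical helper: print "<pgid> <count>" for every PG that reports
# unfound objects. Reads `ceph health detail` text on stdin.
unfound_pgs() {
    # Matches lines such as: "pg 2.f06 has 4 unfound objects"
    awk '$1 == "pg" && $3 == "has" && $5 == "unfound" { print $2, $4 }'
}

# Typical use on a live cluster:
#   ceph health detail | unfound_pgs
```

For the report shown above, this would print the single PG 2.f06 together with its count of 4.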
    

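Troubleshooting

  1. Confirm OSD status

    • All OSDs in the PG's acting set were up, with no process or hardware faults.
  2. Analyze the repair/scrub logs

    • The repair fixed 36 errors, but 4 objects could not be found on any available copy or shard (unfound).
  3. Try mark_unfound_lost revert

    • It failed with: mode must be 'delete' for ec pool. On an erasure-coded (EC) pool, delete is the only accepted mode.
  4. Run the final command
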
    ceph pg 2.f06 mark_unfound_lost delete
    
    • The cluster responded: pg has 4 objects unfound and apparently lost marking
  5. Health recovers

    • Shortly afterwards, ceph health detail reported HEALTH_OK and the PG returned to a normal state.
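
The steps above can be sketched as two small shell functions: one to inspect the unfound objects before giving up on them, and one to mark them lost. This is only a sketch under stated assumptions. The function names and the CEPH override variable are hypothetical; the subcommands themselves (list_unfound, query, mark_unfound_lost) are standard ceph pg subcommands, although older releases name the first one list_missing.

```shell
# CEPH may be overridden (for example with "echo" for a dry run);
# by default the real CLI is used. The variable name is an assumption.
CEPH="${CEPH:-ceph}"

# Hypothetical helper: show what is unfound and where it might still be.
inspect_unfound() {
    pgid="$1"
    # Objects the PG knows about but cannot locate.
    $CEPH pg "$pgid" list_unfound
    # The recovery_state section of the query output contains a
    # might_have_unfound list: OSDs that could still hold the data.
    $CEPH pg "$pgid" query
}

# Hypothetical helper: give up on the unfound objects of one PG.
give_up_on_unfound() {
    pgid="$1"
    mode="$2"   # "delete" on EC pools; "revert" or "delete" on replicated pools
    case "$mode" in
        revert|delete) ;;
        *) echo "mode must be revert or delete" >&2; return 2 ;;
    esac
    $CEPH pg "$pgid" mark_unfound_lost "$mode"
}
```

Running with CEPH=echo prints the commands instead of executing them, which is a cheap way to double-check the PG id before an irreversible delete.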

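Lessons learned

  • An unfound object is one the cluster knows should exist but cannot locate on any OSD it can currently reach, so recovery cannot rebuild it automatically. On an EC pool, that means too few shards are available to reconstruct it.
  • On an EC pool, unfound objects can only be discarded with delete; revert is rejected.
  • repair only reconciles inconsistencies between copies that are still available; it cannot recreate a lost object from nothing.
  • Marking the objects lost restores cluster health, but the objects themselves are permanently gone, so the application owners need to assess the impact.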
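
The revert/delete rule can be checked up front instead of by trial and error. The sketch below assumes that each pool line in the output of ceph osd pool ls detail has the form "pool <id> '<name>' <type> ..." with a pool name free of spaces, and that the part of a PG id before the dot is the pool id. Both function names are hypothetical.

```shell
# Hypothetical helper: print "erasure" or "replicated" for the pool that
# owns a PG. Reads `ceph osd pool ls detail` text on stdin.
pool_type_of_pg() {
    pool_id="${1%%.*}"          # "2.f06" -> "2"
    awk -v id="$pool_id" '$1 == "pool" && $2 == id { print $4 }'
}

# Hypothetical helper: modes mark_unfound_lost accepts for a pool type,
# following the rule described in the article.
allowed_modes() {
    case "$1" in
        erasure)    echo "delete" ;;
        replicated) echo "revert delete" ;;
        *)          return 1 ;;
    esac
}

# Typical use on a live cluster:
#   allowed_modes "$(ceph osd pool ls detail | pool_type_of_pg 2.f06)"
```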

Reference commands

# Show cluster health and details
ceph health detail
ceph status

# Mark unfound objects as lost (EC pools accept only delete)
ceph pg <pgid> mark_unfound_lost delete

# Check the PG state
ceph pg <pgid> query
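
After marking objects lost, the health check can be wrapped in a polling loop rather than re-run by hand. This is a minimal sketch: wait_for_health_ok, CEPH, and POLL_SECONDS are hypothetical names, and it relies only on ceph health printing a status word such as HEALTH_OK at the start of its output.

```shell
# CEPH may be overridden for testing; by default the real CLI is used.
CEPH="${CEPH:-ceph}"

# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# $1 is the maximum number of attempts (default 30).
# Returns 0 once healthy, 1 if the attempts run out.
wait_for_health_ok() {
    tries="${1:-30}"
    while [ "$tries" -gt 0 ]; do
        status=$($CEPH health 2>/dev/null) || status=""
        case "$status" in
            HEALTH_OK*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "${POLL_SECONDS:-10}"   # seconds between attempts
    done
    return 1
}

# Typical use:
#   wait_for_health_ok 60 || ceph health detail
```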

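Conclusion

When a PG reports unfound/lost objects, investigate calmly first. Only once it is confirmed that the objects cannot be recovered should you mark them lost, which restores the overall health of the cluster. Back up important data regularly to guard against unrecoverable loss in extreme cases.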

