[SOLVED] ceph: corrupt disk image

luphi

Hello all,

I have a corrupted disk image located in one of my ceph pools:

Code:
rbd ls -l -p ceph_hdd_images
rbd: error opening vm-173-disk-1: (2) No such file or directory
NAME           SIZE    PARENT  FMT  PROT  LOCK
vm-165-disk-0  50 GiB            2        excl
vm-173-disk-0  10 GiB            2        excl
rbd: listing images failed: (2) No such file or directory

If I try to delete it, it hangs at:
Code:
# rbd rm vm-173-disk-1 -p ceph_hdd_images
Removing image: 15% complete...

Any idea how to get rid of it?

Cheers,
luphi
 
Hi,

May I ask whether your cluster is healthy, i.e. what does ceph -s report? If it is, can you try the --force flag, i.e.:


Bash:
rbd rm vm-173-disk-1 -p ceph_hdd_images --force
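
If it still hangs, it may also be worth checking whether some client holds a stale watch on the image, which is a common reason rbd rm blocks; a minimal sketch (rbd status lists current watchers):

Bash:
# list clients still watching the image header; a stale watcher
# can block "rbd rm" until it times out or is evicted
rbd status vm-173-disk-1 -p ceph_hdd_images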
 
Code:
# ceph -s
  cluster:
    id:     607db34c-b13e-47a3-8a73-48fc46bdc941
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive, 2 pgs peering
            2 daemons have recently crashed
            106 slow ops, oldest one blocked for 3302 sec, daemons [osd.22,osd.41] have slow ops.

  services:
    mon: 3 daemons, quorum FRGRSDXCI028,FRGRSDXCI029,FRGRSDXCI030 (age 49m)
    mgr: FRGRSDXCI028(active, since 8w), standbys: FRGRSDXCI029, FRGRSDXCI030
    mds: 1/1 daemons up, 2 standby
    osd: 80 osds: 80 up (since 48m), 80 in (since 9M); 1 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   26 pools, 822 pgs
    objects: 877.17k objects, 3.2 TiB
    usage:   9.1 TiB used, 119 TiB / 128 TiB avail
    pgs:     0.365% pgs not active
             818 active+clean
             1   activating
             1   peering
             1   active+clean+scrubbing+deep
             1   remapped+peering

  io:
    client:   9.1 MiB/s rd, 4.2 MiB/s wr, 80 op/s rd, 271 op/s wr
 
Code:
Reduced data availability: 3 pgs inactive, 2 pgs peering

pg 118.34 is stuck peering for 104m, current state remapped+peering, last acting [22,48,66]
pg 129.7 is stuck peering for 2h, current state peering, last acting [41,22,71]
pg 143.17 is stuck inactive for 104m, current state activating, last acting [22,78,41]
 
Code:
# ceph pg ls|grep -v clean
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG    STATE             SINCE  VERSION            REPORTED           UP             ACTING         SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
118.34     5989         0          0        0  23934855186            0           0   8586  remapped+peering   101m   2515716'34833356   2516137:39057611  [70,41,17]p70  [22,48,66]p22  2023-05-19T03:10:00.874194+0200  2023-05-19T03:10:00.874194+0200                  421  queued for scrub
129.7       155         0          0        0    645923490            0           0   8583           peering   101m    2515651'3613093    2516137:5418986  [41,22,71]p41  [41,22,71]p41  2023-05-18T15:25:51.985242+0200  2023-05-12T12:56:51.253454+0200                    1  queued for deep scrub
143.17        0         0          0        0            0            0           0      0        activating   101m                0'0     2516137:152986  [22,78,41]p22  [22,78,41]p22  2023-05-18T17:04:59.006931+0200  2023-05-14T19:59:59.048072+0200                    1  queued for deep scrub
 
What do you get if you run ceph pg {pg id} query?
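
For example, for the first stuck PG from the list above:

Bash:
ceph pg 118.34 query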
 
The queries for the other two PGs (118.34 and 143.17) just hang; here is the output for pg 129.7:

Code:
# ceph pg 129.7 query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "peering",
    "epoch": 2516138,
    "up": [
        41,
        22,
        71
    ],
    "acting": [
        41,
        22,
        71
    ],
    "info": {
        "pgid": "129.7",
        "last_update": "2515651'3613093",
        "last_complete": "2515651'3613093",
        "log_tail": "2514433'3604510",
        "last_user_version": 3613093,
        "last_backfill": "MAX",
        "purged_snaps": [],
        "history": {
            "epoch_created": 731363,
            "epoch_pool_created": 731363,
            "last_epoch_started": 2188118,
            "last_interval_started": 2188117,
            "last_epoch_clean": 2188118,
            "last_interval_clean": 2188117,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 2516103,
            "same_interval_since": 2516103,
            "same_primary_since": 2188057,
            "last_scrub": "2511210'3578684",
            "last_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
            "last_deep_scrub": "2471818'3534459",
            "last_deep_scrub_stamp": "2023-05-12T12:56:51.253454+0200",
            "last_clean_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
            "prior_readable_until_ub": 0
        },
        "stats": {
            "version": "2515651'3613093",
            "reported_seq": 5418987,
            "reported_epoch": 2516138,
            "state": "peering",
            "last_fresh": "2023-05-22T13:04:31.013486+0200",
            "last_change": "2023-05-22T11:25:51.310923+0200",
            "last_active": "2023-05-22T11:19:15.454400+0200",
            "last_peered": "2023-05-22T11:02:08.705234+0200",
            "last_clean": "2023-05-22T11:02:08.705234+0200",
            "last_became_active": "2023-03-27T11:15:41.288153+0200",
            "last_became_peered": "2023-03-27T11:15:41.288153+0200",
            "last_unstale": "2023-05-22T13:04:31.013486+0200",
            "last_undegraded": "2023-05-22T13:04:31.013486+0200",
            "last_fullsized": "2023-05-22T13:04:31.013486+0200",
            "mapping_epoch": 2516103,
            "log_start": "2514433'3604510",
            "ondisk_log_start": "2514433'3604510",
            "created": 731363,
            "last_epoch_clean": 2188118,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "2511210'3578684",
            "last_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
            "last_deep_scrub": "2471818'3534459",
            "last_deep_scrub_stamp": "2023-05-12T12:56:51.253454+0200",
            "last_clean_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
            "objects_scrubbed": 80,
            "log_size": 8583,
            "ondisk_log_size": 8583,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": false,
            "snaptrimq_len": 0,
            "last_scrub_duration": 1,
            "scrub_schedule": "queued for deep scrub",
            "scrub_duration": 0.026275316,
            "objects_trimmed": 0,
            "snaptrim_duration": 0,
            "stat_sum": {
                "num_bytes": 645923490,
                "num_objects": 155,
                "num_object_clones": 0,
                "num_object_copies": 465,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 155,
                "num_whiteouts": 0,
                "num_read": 21982,
                "num_read_kb": 13702642,
                "num_write": 3612751,
                "num_write_kb": 43262027,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 20,
                "num_bytes_recovered": 1187840,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0,
                "num_omap_bytes": 0,
                "num_omap_keys": 0,
                "num_objects_repaired": 0
            },
            "up": [
                41,
                22,
                71
            ],
            "acting": [
                41,
                22,
                71
            ],
            "avail_no_missing": [],
            "object_location_counts": [],
            "blocked_by": [
                22
            ],
            "up_primary": 41,
            "acting_primary": 41,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 2188118,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [
        {
            "peer": "71",
            "pgid": "129.7",
            "last_update": "2515651'3613093",
            "last_complete": "2515651'3613093",
            "log_tail": "2514433'3604510",
            "last_user_version": 3613093,
            "last_backfill": "MAX",
            "purged_snaps": [],
            "history": {
                "epoch_created": 731363,
                "epoch_pool_created": 731363,
                "last_epoch_started": 2188118,
                "last_interval_started": 2188117,
                "last_epoch_clean": 2188118,
                "last_interval_clean": 2188117,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 2516103,
                "same_interval_since": 2516103,
                "same_primary_since": 2188057,
                "last_scrub": "2511210'3578684",
                "last_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
                "last_deep_scrub": "2471818'3534459",
                "last_deep_scrub_stamp": "2023-05-12T12:56:51.253454+0200",
                "last_clean_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
                "prior_readable_until_ub": 0
            },
            "stats": {
                "version": "2515644'3613092",
                "reported_seq": 5418497,
                "reported_epoch": 2515651,
                "state": "active+clean",
                "last_fresh": "2023-05-19T14:20:16.572312+0200",
                "last_change": "2023-05-18T15:25:51.985290+0200",
                "last_active": "2023-05-19T14:20:16.572312+0200",
                "last_peered": "2023-05-19T14:20:16.572312+0200",
                "last_clean": "2023-05-19T14:20:16.572312+0200",
                "last_became_active": "2023-03-27T11:15:41.288153+0200",
                "last_became_peered": "2023-03-27T11:15:41.288153+0200",
                "last_unstale": "2023-05-19T14:20:16.572312+0200",
                "last_undegraded": "2023-05-19T14:20:16.572312+0200",
                "last_fullsized": "2023-05-19T14:20:16.572312+0200",
                "mapping_epoch": 2516103,
                "log_start": "2514433'3604510",
                "ondisk_log_start": "2514433'3604510",
                "created": 731363,
                "last_epoch_clean": 2188118,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "2511210'3578684",
                "last_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
                "last_deep_scrub": "2471818'3534459",
                "last_deep_scrub_stamp": "2023-05-12T12:56:51.253454+0200",
                "last_clean_scrub_stamp": "2023-05-18T15:25:51.985242+0200",
                "objects_scrubbed": 80,
                "log_size": 8582,
                "ondisk_log_size": 8582,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "manifest_stats_invalid": false,
                "snaptrimq_len": 0,
                "last_scrub_duration": 1,
                "scrub_schedule": "periodic deep scrub scheduled @ 2023-05-19T17:46:18.827505+0000",
                "scrub_duration": 0.026275316,
                "objects_trimmed": 0,
                "snaptrim_duration": 0,
                "stat_sum": {
                    "num_bytes": 645923490,
                    "num_objects": 155,
                    "num_object_clones": 0,
                    "num_object_copies": 465,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 155,
                    "num_whiteouts": 0,
                    "num_read": 21982,
                    "num_read_kb": 13702642,
                    "num_write": 3612751,
                    "num_write_kb": 43262027,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 20,
                    "num_bytes_recovered": 1187840,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0,
                    "num_large_omap_objects": 0,
                    "num_objects_manifest": 0,
                    "num_omap_bytes": 0,
                    "num_omap_keys": 0,
                    "num_objects_repaired": 0
                },
                "up": [
                    41,
                    22,
                    71
                ],
                "acting": [
                    41,
                    22,
                    71
                ],
                "avail_no_missing": [],
                "object_location_counts": [],
                "blocked_by": [],
                "up_primary": 41,
                "acting_primary": 41,
                "purged_snaps": []
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 2188118,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        }
    ],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2023-05-22T11:25:51.310912+0200",
            "requested_info_from": [
                {
                    "osd": "22"
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2023-05-22T11:25:51.310910+0200",
            "past_intervals": [
                {
                    "first": "2188117",
                    "last": "2516102",
                    "all_participants": [
                        {
                            "osd": 22
                        },
                        {
                            "osd": 41
                        },
                        {
                            "osd": 71
                        }
                    ],
                    "intervals": [
                        {
                            "first": "2516081",
                            "last": "2516102",
                            "acting": "22,41"
                        }
                    ]
                }
            ],
            "probing_osds": [
                "22",
                "41",
                "71"
            ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2023-05-22T11:25:51.310889+0200"
        }
    ],
    "agent_state": {}
}
 
Hmm, what all PGs have in common is that OSD 22 is in the acting set and OSD 41 is in the target (up) set; the 129.7 query above even reports "blocked_by": [22]. You could try restarting these two OSDs one by one, with some time in between, to see if that helps.
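
A minimal sketch of that restart sequence, assuming a systemd-managed Ceph deployment (as on Proxmox VE), run on the node hosting each OSD:

Bash:
# on the node hosting osd.41
systemctl restart ceph-osd@41
# give peering a few minutes to settle before touching the next one
ceph -s
# then, on the node hosting osd.22
systemctl restart ceph-osd@22
ceph -s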
 
Strange:
Restarting osd.41: no improvement.
Restarting osd.22: I lost the SSH connection to the node hosting osd.22 and was not able to reconnect, so I decided to reboot it from the local console.

Currently it is recovering; I will leave it running until tomorrow...
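
Recovery progress can be followed live in the meantime; a minimal sketch with standard tooling:

Bash:
# stream cluster status/log changes, including peering and recovery
ceph -w
# or poll the summary every few seconds
watch -n 5 ceph -s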
At least, I was able to delete the corrupted image.

Thank you for your support so far. Final confirmation (hopefully) will follow...

Cheers,
luphi
 
Everything looks normal again.
Thank you again for your professional support.

Cheers,
luphi
 
I had some self-created issues with Ceph a while ago. I did everything to avoid a reboot and tried for hours to solve it. At some point I was desperate and just rebooted the cluster three times; after the third reboot, Ceph came up again. (This has happened twice.)
Since then, I reboot right away.
 
