Hi all, I have several problems with a cluster.
The cluster has:
6 nodes, each with 10 OSDs (3 TB each)
1 node with 3 OSDs (10 TB each)

The cluster was left unmanaged for approximately a year and a half. After a failure in the air conditioning system, many Ceph OSDs went down and the cluster stopped working properly. I can start some VMs, but many of them will not start.
I updated each node to the latest version the repositories allowed. All nodes are now on pve-manager/7.4-18/b1f94095 except one; on that node the updates seem to have failed, and the only way I can bring it up is to boot an old kernel, 5.11.27.
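In case it matters, this is roughly how I am keeping that node on the old kernel (just a sketch; the kernel version string below is a placeholder, and I'm assuming proxmox-boot-tool kernel pin is available on that node):
Code:
# show the kernels proxmox-boot-tool knows about on that node
proxmox-boot-tool kernel list

# pin the old 5.11 kernel so the node keeps booting it
# (the version string is a placeholder, use the one shown by "kernel list")
proxmox-boot-tool kernel pin 5.11.27-1-pve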
I also tried to map a VM disk with the rbd map command, but when I then try to export the data nothing happens; it just stays stuck.
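Roughly what I am doing is this (pool and image names are placeholders for the real ones):
Code:
# map the VM disk through the kernel RBD client
rbd map vm-pool/vm-100-disk-0

# then try to copy the data off the mapped block device
dd if=/dev/rbd0 of=/mnt/backup/vm-100-disk-0.raw bs=4M status=progress

# or export it directly without mapping
rbd export vm-pool/vm-100-disk-0 /mnt/backup/vm-100-disk-0.raw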
This is the current Ceph status:
Code:
  cluster:
    id:     f17eb6a9-5bfd-4a22-b064-eaa0204a4892
    health: HEALTH_WARN
            clock skew detected on mon.CAARPVE4, mon.CAARPVE5, mon.CAARPVE6
            2/6 mons down, quorum CAARPVE3,CAARPVE4,CAARPVE5,CAARPVE6
            4 osds down
            6 nearfull osd(s)
            all OSDs are running pacific or later but require_osd_release < pacific
            Reduced data availability: 123 pgs inactive, 7 pgs down, 1 pg stale
            Low space hindering backfill (add storage if this doesn't resolve itself): 16 pgs backfill_toofull
            Degraded data redundancy: 6714845/33588930 objects degraded (19.991%), 533 pgs degraded, 584 pgs undersized
            6 pgs not deep-scrubbed in time
            3 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            96651 slow ops, oldest one blocked for 7752 sec, mon.CAARPVE4 has slow ops

  services:
    mon: 6 daemons, quorum CAARPVE3,CAARPVE4,CAARPVE5,CAARPVE6 (age 71m), out of quorum: CAARPVE2, CAARPVE7
    mgr: CAARPVE5(active, since 2h), standbys: CAARPVE1, CAARPVE3
    osd: 66 osds: 45 up (since 42m), 49 in (since 32m); 713 remapped pgs

  data:
    pools:   2 pools, 1088 pgs
    objects: 11.20M objects, 27 TiB
    usage:   88 TiB used, 59 TiB / 146 TiB avail
    pgs:     11.305% pgs not active
             6714845/33588930 objects degraded (19.991%)
             5838798/33588930 objects misplaced (17.383%)
             300 active+undersized+degraded+remapped+backfill_wait
             245 active+clean
             189 active+remapped+backfill_wait
             107 active+undersized+degraded
             87  undersized+degraded+remapped+backfill_wait+peered
             58  active+clean+remapped
             27  active+undersized+remapped+backfill_wait
             12  active+undersized+remapped
             10  undersized+degraded+remapped+backfilling+peered
             8   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             8   active+undersized
             7   undersized+degraded+peered
             7   active+undersized+degraded+remapped+backfilling
             7   down
             6   undersized+degraded+remapped+backfill_wait+backfill_toofull+peered
             5   undersized+remapped+backfill_wait+peered
             2   active+remapped+backfill_wait+backfill_toofull
             1   stale+active+clean
             1   undersized+peered
             1   active+recovery_wait+degraded+remapped

  io:
    recovery: 230 MiB/s, 57 objects/s

  progress:
    Global Recovery Event (2h)
      [=======.....................] (remaining: 5h)
Is there anything I can do to fix this?
Also, many times the monitor goes into error (500) and leaves me stuck.
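For example, these are the kinds of commands I was considering to diagnose and try to fix things, but I am not sure they are safe to run while the cluster is in this state (the ratio values are just examples):
Code:
# check why the GUI returns error 500 on the Ceph pages
journalctl -u pveproxy -u pvestatd --since "1 hour ago"

# fix the clock skew on the monitors (assuming chrony is in use)
systemctl restart chrony

# acknowledge that all OSDs run Pacific, to clear the require_osd_release warning
ceph osd require-osd-release pacific

# temporarily raise the full ratios so the backfill_toofull PGs can make progress (example values)
ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92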