Hi all, I have several problems with a cluster.
The cluster has:
6 nodes, each with 10 OSDs (3 TB each)
1 node with 3 OSDs (10 TB each)

The cluster was left unmanaged for approximately a year and a half. After a failure in the air conditioning system, many Ceph OSDs went down and the cluster stopped working properly. I can start some VMs, but many of them will not start.
I updated each node to the latest version the repositories allowed. All nodes are now on pve-manager/7.4-18/b1f94095 except one; on that node the updates seem to have failed, and the only way I can bring it up is to boot an old kernel, 5.11.27.
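In case it matters, this is roughly how I am keeping that node on the old kernel (just a sketch; the kernel version string below is a placeholder, and I'm assuming proxmox-boot-tool kernel pin is available on that node):
Code:
# show the kernels proxmox-boot-tool knows about on that node
proxmox-boot-tool kernel list

# pin the old 5.11 kernel so the node keeps booting it
# (the version string is a placeholder, use the one shown by "kernel list")
proxmox-boot-tool kernel pin 5.11.27-1-pve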
I also tried to map a VM disk with the rbd map command, but when I then try to export the data nothing happens; it just stays stuck.
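Roughly what I am doing is this (pool and image names are placeholders for the real ones):
Code:
# map the VM disk through the kernel RBD client
rbd map vm-pool/vm-100-disk-0

# then try to copy the data off the mapped block device
dd if=/dev/rbd0 of=/mnt/backup/vm-100-disk-0.raw bs=4M status=progress

# or export it directly without mapping
rbd export vm-pool/vm-100-disk-0 /mnt/backup/vm-100-disk-0.raw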
This is the current Ceph status:
Code:
  cluster:
    id:     f17eb6a9-5bfd-4a22-b064-eaa0204a4892
    health: HEALTH_WARN
            clock skew detected on mon.CAARPVE4, mon.CAARPVE5, mon.CAARPVE6
            2/6 mons down, quorum CAARPVE3,CAARPVE4,CAARPVE5,CAARPVE6
            4 osds down
            6 nearfull osd(s)
            all OSDs are running pacific or later but require_osd_release < pacific
            Reduced data availability: 123 pgs inactive, 7 pgs down, 1 pg stale
            Low space hindering backfill (add storage if this doesn't resolve itself): 16 pgs backfill_toofull
            Degraded data redundancy: 6714845/33588930 objects degraded (19.991%), 533 pgs degraded, 584 pgs undersized
            6 pgs not deep-scrubbed in time
            3 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            96651 slow ops, oldest one blocked for 7752 sec, mon.CAARPVE4 has slow ops

  services:
    mon: 6 daemons, quorum CAARPVE3,CAARPVE4,CAARPVE5,CAARPVE6 (age 71m), out of quorum: CAARPVE2, CAARPVE7
    mgr: CAARPVE5(active, since 2h), standbys: CAARPVE1, CAARPVE3
    osd: 66 osds: 45 up (since 42m), 49 in (since 32m); 713 remapped pgs

  data:
    pools:   2 pools, 1088 pgs
    objects: 11.20M objects, 27 TiB
    usage:   88 TiB used, 59 TiB / 146 TiB avail
    pgs:     11.305% pgs not active
             6714845/33588930 objects degraded (19.991%)
             5838798/33588930 objects misplaced (17.383%)
             300 active+undersized+degraded+remapped+backfill_wait
             245 active+clean
             189 active+remapped+backfill_wait
             107 active+undersized+degraded
             87  undersized+degraded+remapped+backfill_wait+peered
             58  active+clean+remapped
             27  active+undersized+remapped+backfill_wait
             12  active+undersized+remapped
             10  undersized+degraded+remapped+backfilling+peered
             8   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             8   active+undersized
             7   undersized+degraded+peered
             7   active+undersized+degraded+remapped+backfilling
             7   down
             6   undersized+degraded+remapped+backfill_wait+backfill_toofull+peered
             5   undersized+remapped+backfill_wait+peered
             2   active+remapped+backfill_wait+backfill_toofull
             1   stale+active+clean
             1   undersized+peered
             1   active+recovery_wait+degraded+remapped

  io:
    recovery: 230 MiB/s, 57 objects/s

  progress:
    Global Recovery Event (2h)
      [=======.....................] (remaining: 5h)
Is there anything I can do to fix this?
Also, many times the monitor goes into error (500) and leaves me stuck.
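For example, these are the kinds of commands I was considering to diagnose and try to fix things, but I am not sure they are safe to run while the cluster is in this state (the ratio values are just examples):
Code:
# check why the GUI returns error 500 on the Ceph pages
journalctl -u pveproxy -u pvestatd --since "1 hour ago"

# fix the clock skew on the monitors (assuming chrony is in use)
systemctl restart chrony

# acknowledge that all OSDs run Pacific, to clear the require_osd_release warning
ceph osd require-osd-release pacific

# temporarily raise the full ratios so the backfill_toofull PGs can make progress (example values)
ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92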