Dear all,
Yesterday we did some maintenance on one of our nodes (swapping the mainboard) and brought it back online afterwards. We quickly realized that "something" was wrong: everything was painfully slow and kept getting worse. During the maintenance we had set the noout flag.
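For reference, the flag was set with the standard command before the work started; the counterpart to clear it again afterwards is listed as well:
Code:
# set before taking the node down for maintenance
ceph osd set noout
# clear again once the node is back up
ceph osd unset noout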
There are three nodes with 4 disks each; however, on all three nodes (including the two we did not touch during the maintenance), the same two disks are no longer available:
Code:
# Node 1
root@proxmox01:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 894.3G 0 disk
└─sda4 8:4 0 37G 0 part
sdb 8:16 0 894.3G 0 disk
└─sdb4 8:20 0 37G 0 part
sdc 8:32 0 894.3G 0 disk
└─ceph--f86b1366--7404--46fe--b9d5--d27b1af6dfd6-osd--block--5bc096cf--96d5--40fa--a289--e35907275994 252:1 0 894.3G 0 lvm
sdd 8:48 0 894.3G 0 disk
└─ceph--f09229cd--fcab--4547--bda6--7342b5c138fa-osd--block--409c3800--6299--4528--93ec--745e3dbee671 252:0 0 894.3G 0 lvm
# Node 2
root@proxmox02:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 894.3G 0 disk
└─sda4 8:4 0 37G 0 part
sdb 8:16 0 894.3G 0 disk
└─sdb4 8:20 0 37G 0 part
sdc 8:32 0 894.3G 0 disk
└─ceph--fa8929ee--974d--4e0c--926a--c0a4885d7f8c-osd--block--434171d1--d900--4b57--abdc--e44dc1dfccac 252:1 0 894.3G 0 lvm
sdd 8:48 0 894.3G 0 disk
└─ceph--359c8ff3--e1d2--4680--9293--128c13dc4b4c-osd--block--d5109736--8162--4444--af70--b33e099449a3 252:0 0 894.3G 0 lvm
# Node 3
root@proxmox03:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 894.3G 0 disk
└─sda4 8:4 0 37G 0 part
sdb 8:16 0 894.3G 0 disk
└─sdb4 8:20 0 37G 0 part
sdc 8:32 0 894.3G 0 disk
└─ceph--819cc7d0--5259--4f48--97c2--0b7354245529-osd--block--c787bf62--555c--467b--930d--ead361c99001 252:1 0 894.3G 0 lvm
sdd 8:48 0 894.3G 0 disk
└─ceph--6ea5c166--73d7--4ced--98bd--1385f770a875-osd--block--b3077adf--ee58--4bae--85e1--483aef62bc59 252:0 0 894.3G 0 lvm
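If it helps, I can also post the following from each node, to show how ceph-volume sees the disks and which OSD services are actually running:
Code:
# LVM-based OSDs that ceph-volume knows about on this node
ceph-volume lvm list
# state of the local OSD units
systemctl list-units 'ceph-osd@*'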
Current Ceph Status:
Code:
root@proxmox01:~# ceph -s
  cluster:
    id:     d88b5090-dde9-41c0-a3d4-e04a1e212112
    health: HEALTH_WARN
            2 filesystems are degraded
            clock skew detected on mon.proxmox03, mon.proxmox01
            1/366705 objects unfound (0.000%)
            norebalance,norecover flag(s) set
            2 osds down
            Reduced data availability: 57 pgs inactive, 13 pgs down
            Degraded data redundancy: 341985/1100115 objects degraded (31.086%), 188 pgs degraded, 198 pgs undersized
            5 daemons have recently crashed
            1 slow ops, oldest one blocked for 134 sec, osd.6 has slow ops

  services:
    mon: 3 daemons, quorum proxmox02,proxmox03,proxmox01 (age 5m)
    mgr: proxmox02(active, since 5m), standbys: proxmox01, proxmox03
    mds: 2/2 daemons up, 1 standby
    osd: 12 osds: 6 up (since 5m), 8 in (since 16h); 69 remapped pgs
         flags norebalance,norecover

  data:
    volumes: 0/2 healthy, 2 recovering
    pools:   7 pools, 289 pgs
    objects: 366.70k objects, 1.3 TiB
    usage:   2.5 TiB used, 2.7 TiB / 5.2 TiB avail
    pgs:     3.114% pgs unknown
             16.609% pgs not active
             341985/1100115 objects degraded (31.086%)
             6305/1100115 objects misplaced (0.573%)
             1/366705 objects unfound (0.000%)
             99 active+undersized+degraded
             67 active+clean
             48 active+undersized+degraded+remapped+backfill_wait
             19 undersized+degraded+peered
             13 down
             12 undersized+degraded+remapped+backfill_wait+peered
             9  unknown
             8  active+undersized
             6  active+recovery_wait+undersized+degraded+remapped
             3  undersized+peered
             1  active+recovery_wait+degraded
             1  active+remapped+backfill_wait
             1  undersized+degraded+remapped+backfilling+peered
             1  active+recovering+undersized+degraded
             1  active+undersized+degraded+remapped+backfilling
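Since the status also mentions recently crashed daemons and an unfound object, I'm happy to add the output of the following if that helps narrow things down:
Code:
# more detail on the warnings and affected PGs
ceph health detail
# which OSDs are down, and on which host
ceph osd tree
# list of the recent daemon crashes
ceph crash ls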
I'll admit I'm not as familiar with Ceph as I probably should be to fix this issue myself. We do have backups for most VMs, so it wouldn't be the end of the world. Any idea whether this situation can somehow be salvaged?