We have a 4-node Ceph cluster that is normally quite stable.
In the last three weeks, one of the nodes has failed in an abnormal way (something like a complete shutdown of the server).
Even with console-level access to the machine, I couldn't do anything on this server and had to power-reset it to get it back on track.
The logs below are the ones that were displayed on screen before I rebooted the server.
[1597559.876019] rbd: rbd0: capacity 42949672960 features 0x3d
[1597559.912675] EXT4-fs (rbd0): write access unavailable, skipping orphan cleanup
[1597559.913252] EXT4-fs (rbd0): mounted filesystem 4689182e-18ec-4129-afc9-8ddbefebb9d3 ro without journal. Quota mode: none.
[1597586.984605] EXT4-fs (rbd0): unmounting filesystem 4689182e-18ec-4129-afc9-8ddbefebb9d3.
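Since these messages were only visible on the console, one thing that might help with future incidents (a small sketch, assuming a systemd-based node; nothing here is specific to my setup) is making the journal persistent, so the kernel log of a frozen boot can be read back after the power-reset:

# Create the persistent journal directory with correct ownership/ACLs,
# then restart journald so it writes to /var/log/journal instead of RAM.
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# After the next freeze and power-reset, the previous boot's kernel
# messages can be read back with:
journalctl -k -b -1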
After the reboot, the server seems "ok" aside from the deep scrubbing that is taking place:
root@pve1:~# ceph -s
  cluster:
    id:     379d559a-3bb3-48bd-8c0f-b027c7672d1b
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum pve1,pve3,pve2,pve (age 4h)
    mgr: pve2(active, since 2w), standbys: pve3, pve, pve1
    mds: 1/1 daemons up, 3 standby
    osd: 11 osds: 11 up (since 4h), 11 in (since 4h); 1 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 169 pgs
    objects: 681.39k objects, 2.6 TiB
    usage:   7.7 TiB used, 38 TiB / 45 TiB avail
    pgs:     2871/2044176 objects misplaced (0.140%)
             164 active+clean
             3   active+clean+scrubbing+deep
             1   active+clean+scrubbing
             1   active+remapped+backfilling

  io:
    client:   503 KiB/s rd, 10 MiB/s wr, 6 op/s rd, 133 op/s wr
    recovery: 101 MiB/s, 25 objects/s
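The remaining scrubbing and backfill can be watched until everything is active+clean again with the plain Ceph CLI (nothing specific to this cluster):

# Re-run the status view every 10 seconds
watch -n 10 ceph -s

# Or follow the cluster log live instead of re-running ceph -s
ceph -w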
I have checked all VM and CT image files and they seem "ok".
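One read-only way to do that kind of check on a CT volume, sketched here under stated assumptions (the container is stopped, CephData is the pool visible in the /dev/rbd listing below, and vm-101-disk-0 is just a placeholder image name), is to map the image and fsck it without repairing:

# Map the image through krbd, run a read-only ext4 check, then unmap it.
# The pool name comes from this cluster; the image name is hypothetical.
rbd map CephData/vm-101-disk-0
fsck.ext4 -n /dev/rbd/CephData/vm-101-disk-0
rbd unmap CephData/vm-101-disk-0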
One thing that seems quite strange is that under /dev/rbd* on this server I only have:
root@pve1:~# ls /dev/rbd*
/dev/rbd1 /dev/rbd2
/dev/rbd:
CephData
/dev/rbd-pve:
379d559a-3bb3-48bd-8c0f-b027c7672d1b
Whereas on the other servers I seem to have access to more devices:
root@pve3:~# ls /dev/rbd*
/dev/rbd0 /dev/rbd1 /dev/rbd10 /dev/rbd2 /dev/rbd3 /dev/rbd4 /dev/rbd5 /dev/rbd6 /dev/rbd7 /dev/rbd8 /dev/rbd9
/dev/rbd:
CephData
/dev/rbd-pve:
379d559a-3bb3-48bd-8c0f-b027c7672d1b
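To compare the two nodes without relying on what happens to be under /dev, the mapped images can also be listed directly on each node (standard rbd command):

# Show which RBD images this node currently has mapped and as which /dev/rbdN
rbd showmapped

As far as I understand, /dev/rbdN entries only exist for images that are currently mapped on that particular node (typically CT volumes, since VM disks go through librbd and never show up there), so the difference between pve1 and pve3 may simply reflect which guests are running where, but I'm not sure.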
I need to stabilize this server and avoid any freezing in the future, so any advice would be welcome.