Node freezing with no known reason [CEPH]

DynFi User

We have a 4-node Ceph cluster that's normally quite stable.

In the last three weeks, one of the nodes has failed in an abnormal way: it froze completely, as if the server had shut down.
Even with console-level access to the machine, I couldn't do anything on this server and had to power-reset it to get it back on track.

The logs below are the ones that were displayed on screen before I rebooted the server.

[1597559.876019] rbd: rbd0: capacity 42949672960 features 0x3d
[1597559.912675] EXT4-fs (rbd0): write access unavailable, skipping orphan cleanup
[1597559.913252] EXT4-fs (rbd0): mounted filesystem 4689182e-18ec-4129-afc9-8ddbefebb9d3 ro without journal. Quota mode: none.
[1597586.984605] EXT4-fs (rbd0): unmounting filesystem 4689182e-18ec-4129-afc9-8ddbefebb9d3.
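
For reference, the bracketed values are kernel timestamps in seconds since boot. Here is a small sketch to turn one into an uptime offset (the `boot_offset` helper name is my own, nothing that ships with Proxmox or Ceph):

```shell
# Convert a dmesg-style timestamp (seconds since boot) into days/hours/minutes/seconds.
# boot_offset is a hypothetical helper name used only for this illustration.
boot_offset() {
    secs=${1%%.*}    # drop the fractional part of the timestamp
    printf '%dd %02dh %02dm %02ds\n' \
        $((secs / 86400)) \
        $((secs % 86400 / 3600)) \
        $((secs % 3600 / 60)) \
        $((secs % 60))
}

boot_offset 1597559.876019   # first line of the log above
boot_offset 1597586.984605   # last line of the log above
```

So the node had been up for roughly 18.5 days when these lines appeared, and the mount and unmount of that filesystem were about 27 seconds apart.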

After the reboot, the server seems "ok", apart from the deep scrubbing taking place:

root@pve1:~# ceph -s
  cluster:
    id:     379d559a-3bb3-48bd-8c0f-b027c7672d1b
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum pve1,pve3,pve2,pve (age 4h)
    mgr: pve2(active, since 2w), standbys: pve3, pve, pve1
    mds: 1/1 daemons up, 3 standby
    osd: 11 osds: 11 up (since 4h), 11 in (since 4h); 1 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 169 pgs
    objects: 681.39k objects, 2.6 TiB
    usage:   7.7 TiB used, 38 TiB / 45 TiB avail
    pgs:     2871/2044176 objects misplaced (0.140%)
             164 active+clean
             3 active+clean+scrubbing+deep
             1 active+clean+scrubbing
             1 active+remapped+backfilling

  io:
    client:   503 KiB/s rd, 10 MiB/s wr, 6 op/s rd, 133 op/s wr
    recovery: 101 MiB/s, 25 objects/s
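
To keep an eye on what the cluster is still working on after the reboot, a rough filter like the one below lists only the PG states that are not plain `active+clean`. It is a sketch that assumes the plain-text layout of `ceph -s` shown above (which is not a stable interface), and `non_clean_pgs` is a name I made up:

```shell
# Read plain-text `ceph -s` output on stdin and print every PG state line
# except "active+clean". non_clean_pgs is a hypothetical helper name.
non_clean_pgs() {
    # The first state can share a line with the "pgs:" label, so strip the label,
    # then keep only "<count> <state>" pairs.
    sed 's/^ *pgs: *//' |
        awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 != "active+clean" { print $1, $2 }'
}

# Example using the PG lines from the output above.
non_clean_pgs <<'EOF'
pgs: 2871/2044176 objects misplaced (0.140%)
164 active+clean
3 active+clean+scrubbing+deep
1 active+clean+scrubbing
1 active+remapped+backfilling
EOF
```

On a live node the same filter can be fed directly with `ceph -s | non_clean_pgs`.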

I have checked all VM and CT image files and they seem "ok".

One thing that seems quite strange is that under /dev/rbd* on this server I only have:

root@pve1:~# ls /dev/rbd*
/dev/rbd1 /dev/rbd2

/dev/rbd:
CephData

/dev/rbd-pve:
379d559a-3bb3-48bd-8c0f-b027c7672d1b

Whereas on the other servers I seem to have access to more devices:

root@pve3:~# ls /dev/rbd*
/dev/rbd0 /dev/rbd1 /dev/rbd10 /dev/rbd2 /dev/rbd3 /dev/rbd4 /dev/rbd5 /dev/rbd6 /dev/rbd7 /dev/rbd8 /dev/rbd9

/dev/rbd:
CephData

/dev/rbd-pve:
379d559a-3bb3-48bd-8c0f-b027c7672d1b
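
To make the difference between the two listings explicit, here is a minimal sketch that prints the devices present on one node but not on the other. The helper name `rbd_only_in_second` is hypothetical, and the two lists are simply copied from the output above:

```shell
# Print every device from the second list that does not appear in the first.
# Both arguments are whitespace-separated lists of device paths.
# rbd_only_in_second is a hypothetical helper name.
rbd_only_in_second() {
    # Unquoted echo is intentional: it collapses any whitespace to single spaces.
    first=" $(echo $1) "
    for dev in $2; do
        case "$first" in
            *" $dev "*) ;;                 # mapped on both nodes
            *) printf '%s\n' "$dev" ;;     # only on the second node
        esac
    done
}

pve1_devs="/dev/rbd1 /dev/rbd2"
pve3_devs="/dev/rbd0 /dev/rbd1 /dev/rbd10 /dev/rbd2 /dev/rbd3 /dev/rbd4 /dev/rbd5 /dev/rbd6 /dev/rbd7 /dev/rbd8 /dev/rbd9"

rbd_only_in_second "$pve1_devs" "$pve3_devs"
```

This only compares the numbered device nodes. As far as I understand, the numbers are assigned locally in mapping order, so the same `/dev/rbdN` name on two nodes does not necessarily refer to the same image.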

I need to stabilize this server and avoid any freezing in the future, so any advice would be welcome.
 
