Ceph x daemons have recently crashed

Oct 11, 2020
Since the last Ceph update (to the current 17.2.5), we have noticed that every node reboot marks the OSDs on that node as crashed. However, they come back up normally once the server has booted.

I checked the Ceph and journalctl logs and did not find anything relevant about the daemons crashing (timeout, segfault, etc.).

Is this CEPH HEALTH_WARN normal after a reboot? It seems not, because this is new for us.
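The "daemons have recently crashed" warning is raised by Ceph's crash module, which keeps its own record of crash reports separately from journalctl. Below is a minimal sketch of how those reports could be listed, inspected, and acknowledged. The `run` helper, the `DRY_RUN` variable, and the function names are hypothetical additions; by default the commands are only printed, and they are executed only when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default), run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List crash reports that have not been archived yet.
list_new_crashes() { run ceph crash ls-new; }

# Show the details of one report; $1 is a crash ID taken from the listing.
show_crash() { run ceph crash info "$1"; }

# Acknowledge all reports so that the health warning clears.
archive_crashes() { run ceph crash archive-all; }
```

Looking at the output of `ceph crash info` for one of the reported OSDs before archiving may show what was actually recorded at shutdown, even when the regular logs look clean.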
 
I am also seeing this issue in my cluster on most of my nodes after a reboot, and I can't find anything related in the logs either. Some nodes don't show crashed daemons after the update. I'm not sure what the cause is, since everything works fine after the reboot.
 
CEPH OSD FLAGS

noout -- If an OSD stays down for longer than the mon osd down out interval, it gets marked out. The “noout” flag tells the Ceph monitors not to mark any OSDs out, so no recovery or rebalancing is started to restore the replica count while those OSDs are down.

nobackfill -- If you need to take an OSD or node down temporarily (e.g., to upgrade daemons), you can set nobackfill so that Ceph will not backfill while the OSD(s) are down.

norecover -- Ceph will prevent new recovery operations. If you need to replace an OSD disk and don’t want the PGs to recover to another OSD while you are hot-swapping disks, you can set norecover to prevent the other OSDs from copying a new set of PGs to other OSDs.

norebalance -- Data rebalancing is suspended.

nodown -- Prevents OSDs from getting marked down. Networking issues may interrupt Ceph heartbeat processes, and an OSD may be up but still get marked down. You can set nodown to prevent OSDs from getting marked down while troubleshooting the issue. If something (such as a network issue) is causing OSDs to ‘flap’ (repeatedly getting marked down and then up again), you can force the monitors to stop the flapping by temporarily freezing their states with nodown.

pause -- Ceph will stop processing read and write operations, but will not affect OSD in, out, up or down statuses. If you need to troubleshoot a running Ceph cluster without clients reading and writing data, you can set the cluster to pause to prevent client operations.
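Before and after maintenance it helps to see which of these flags are actually set. The sketch below assumes that `ceph osd dump` prints a line starting with `flags` followed by a comma-separated list. `list_flags` and `flag_is_set` are hypothetical helper names, not Ceph commands.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd dump` output on stdin and print one flag per line.
list_flags() {
    grep '^flags ' | sed 's/^flags //' | tr ',' '\n'
}

# Hypothetical helper: succeed if the flag named in $1 appears in the dump on stdin.
flag_is_set() {
    list_flags | grep -qxF "$1"
}
```

Typical use would be `ceph osd dump | list_flags`, or `ceph osd dump | flag_is_set noout && echo "noout is set"`.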

Try setting the Ceph flags you need before rebooting a node in the cluster. Works like a charm.

# Node maintenance

# stop new scrub and deep-scrub operations and wait for running ones to finish

ceph osd set noscrub
ceph osd set nodeep-scrub

ceph status

# set the cluster in maintenance mode (I used the flags below to bring the entire cluster down when we were physically migrating the entire setup to a different datacentre)

ceph -s    # check cluster status
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph osd set nodown
ceph osd set pause


UNSET FLAGS ONCE ACTIVITY IS COMPLETED.
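Clearing the flags mirrors setting them: each `ceph osd set <flag>` has a matching `ceph osd unset <flag>`. The sketch below assumes that all eight flags from the listing above were set and that undoing them in reverse order suits your cluster. `run`, `DRY_RUN`, and `unset_maintenance_flags` are hypothetical names; by default the commands are only printed, and they are executed only when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default), run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: clear the maintenance flags in the reverse of the order they were set.
unset_maintenance_flags() {
    for flag in pause nodown norebalance norecover nobackfill noout nodeep-scrub noscrub; do
        run ceph osd unset "$flag"
    done
    # Check the cluster state once the flags are cleared.
    run ceph status
}

unset_maintenance_flags
```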
 
Generally I set the noout flag on the cluster before rebooting a node, to prevent the cluster from needing to do a lot of work when the node comes back online. The strangest part is that the daemons are reported as crashed while the cluster still has the noout flag set and the node is back online.

Setting all those flags is not needed for a single-node reboot, since noout will prevent backfilling, recovery, and rebalancing as the node is coming back. Once it returns, the cluster will rebalance, backfill, and recover if needed, but since all the OSDs return after the reboot, this process only takes seconds. I don't know if I would set nodown for a node reboot: the OSDs really are down, so that might cause an issue.
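For the single-node case described here, the procedure reduces to setting one flag, rebooting, and clearing the flag again. The sketch below assumes the reboot itself is done by hand between the two functions. `run`, `DRY_RUN`, `before_reboot`, and `after_reboot` are hypothetical names, and nothing is executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default), run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on any cluster node before rebooting the target node.
before_reboot() {
    run ceph osd set noout
}

# Run once the rebooted node is back online.
after_reboot() {
    run ceph status            # check the cluster state and OSD counts first
    run ceph osd unset noout   # then allow OSDs to be marked out again
}
```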

