Ceph Quincy "Daemons have recently crashed" after node reboots

AllanM

Wondering if anyone else has observed this, or if I missed a memo on how to fix it (or maybe I'm doing something wrong!)

Since updating my homelab and office production clusters to Ceph Quincy earlier this year, we get "Daemons have recently crashed" warnings after routine cluster updates and node reboots. The crash reports cover some daemons but not all. Out of the 60 OSDs on the production cluster, we see 5-20 daemon crashes over a round of node reboots.

After I finish rebooting all the nodes of each cluster (after kernel updates), I usually just archive the crash reports and move on with life. It doesn't seem to have any negative impact, other than looking bad on the ceph status display.
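For reference, the review-then-archive routine described above can be sketched with the `ceph crash` subcommands. This is a minimal sketch rather than an official procedure, and the function name `review_and_archive_crashes` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show unarchived crash reports, then archive them all.
review_and_archive_crashes() {
    # List crash reports that have not been archived yet.
    ceph crash ls-new || return 1
    # Archiving keeps the reports on record but clears the
    # "daemons have recently crashed" health warning.
    ceph crash archive-all
}
```

Run it on a node with an admin keyring once the reboots are done. `ceph crash info <id>` shows the details of a single report if you want to look before archiving.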

When I do cluster maintenance (updates/reboots), I just set the "noout" flag, then reboot the nodes one at a time, waiting for Ceph to fully recover between reboots. Should I be doing something else?
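The routine above can be sketched roughly as follows. One thing to watch when scripting the "wait for recovery" step: `ceph health` will not return `HEALTH_OK` while `noout` is set (the flag itself raises a warning), so this sketch checks placement group states instead. The helper names are hypothetical, and the parsing assumes `ceph pg stat` prints a single summary line like the one shown in the comment.

```shell
#!/bin/sh
# Hypothetical helpers for a rolling reboot; names are illustrative only.

# Succeeds only when every placement group is active+clean.
# Assumes `ceph pg stat` prints one line such as:
#   129 pgs: 129 active+clean; 1.2 TiB data, 3.6 TiB used, ...
all_pgs_clean() {
    ceph pg stat | awk '
        { sub(/;.*/, ""); ok = (NF == 4 && $1 == $3 && $4 == "active+clean") }
        END { exit !ok }'
}

# Poll until all PGs are clean; the first argument is the poll interval
# in seconds (default 30).
wait_for_clean() {
    interval="${1:-30}"
    until all_pgs_clean; do
        sleep "$interval"
    done
}

# Intended use, run by hand or from a wrapper script:
#   ceph osd set noout
#   (reboot one node, wait for it to come back)
#   wait_for_clean
#   (repeat for the remaining nodes)
#   ceph osd unset noout
```

Checking PG states rather than overall health also keeps the wait loop working when a crash warning is already showing.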

Thanks!
 
I had the same issue. I only have 3 servers in my cluster. It started after upgrading to Ceph Quincy.

Over the past few days I did a lot of debugging and found that the ceph-crash service tries to read a keyring file that did not exist. So I created

Code:
/etc/pve/priv/ceph.client.crash.keyring

and, for every Ceph node, the file

Code:
/etc/pve/priv/ceph.client.crash.<hostname>.keyring

using

Code:
ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash'
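The post does not show how the command output ends up in the keyring file. One way it could be done, as a sketch only: `ceph auth get-or-create` prints the keyring to stdout, so it can be written to whatever path is passed in. The function name `write_crash_keyring` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write the client.crash keyring to the given path.
write_crash_keyring() {
    target="$1"
    [ -n "$target" ] || { echo "usage: write_crash_keyring <path>" >&2; return 2; }
    tmp="$(mktemp)" || return 1
    # Fetch (or create) the client.crash key with the crash profile caps.
    if ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' > "$tmp"; then
        # Only touch the target once the key was fetched successfully.
        cat "$tmp" > "$target"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

For the first file mentioned above this would be called as `write_crash_keyring /etc/pve/priv/ceph.client.crash.keyring`.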


I also got error messages because Ceph tried to find the mon servers using SRV records in DNS. I created these SRV records following these instructions: https://docs.ceph.com/en/quincy/rados/configuration/mon-lookup-dns/
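To check that such records resolve, something like the following could be used. It is a sketch that assumes `dig` is installed and that the default service name `ceph-mon` is in use; the function names are made up, and `example.com` in the usage note is a placeholder domain.

```shell
#!/bin/sh
# Print the SRV records Ceph would look up for the given DNS domain.
mon_srv_records() {
    dig +short SRV "_ceph-mon._tcp.$1"
}

# Succeeds if at least one mon SRV record is returned.
has_mon_srv_records() {
    [ -n "$(mon_srv_records "$1")" ]
}
```

`mon_srv_records example.com` should print one line per monitor (priority, weight, port, target); no output means the lookup found nothing.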

Rebooting the nodes now goes smoothly, without any more OSD crashes.
 
Interesting! Thanks for sharing that brudy!

I prefer not to modify anything on production Proxmox servers unless the modification is part of the administration guide / documentation. In other words, any change I make to fix something like this should be one the developers know about and expect as a possible configuration, so that a future update doesn't "break" something because I changed things that I shouldn't have.

Not to say these changes would ever cause that, but I'd like to see those little "fixes" either included in a Proxmox update or adopted into the Proxmox documentation/guides before using them.

Regards,
-Eric
 
