Wondering if anyone else has observed this, or if I missed a memo on how to fix it (or maybe I'm doing something wrong!).
Since updating my homelab and office production server clusters to Ceph Quincy earlier this year, we get "Daemons have recently crashed" health warnings after routine cluster updates and node reboots. The crash reports are for some, but not all, daemons. Out of the 60 OSDs on the production cluster, we'll see 5-20 daemon crashes per round of node reboots.
After I finish rebooting all the nodes of each cluster (after kernel updates), I usually just archive the crash reports and move on with life. It doesn't seem to be impacting anything negatively, other than making the ceph status output look bad.
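In case it helps, by "archive" I just mean the standard ceph crash commands (the crash ID on the info line is a placeholder from the ls output):

```
ceph crash ls                  # list recent and archived crash reports
ceph crash info <crash-id>     # skim one report if it looks interesting
ceph crash archive-all         # archive everything; clears the RECENT_CRASH warning
```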
When I do cluster maintenance (updates/reboots), I just set the noout flag, then reboot the nodes one at a time, waiting for everything on the Ceph side to recover between reboots (concrete commands below). Should I be doing something else?
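For completeness, the whole procedure is roughly this; nothing fancy, and the "wait" step is just me watching ceph -s until PGs are active+clean again:

```
# before any reboots: keep OSDs from being marked out (avoids rebalancing)
ceph osd set noout

# for each node, one at a time:
#   reboot it, wait for it to rejoin, then watch status until
#   all PGs report active+clean before moving to the next node
ceph -s

# once the last node is back and the cluster has recovered
ceph osd unset noout
```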
Thanks!