Ceph Managers Segfaulting Post-Upgrade (PVE 8 -> 9)

My setup uses only ceph-csi-rbd, together with Volsync and the snapshot-controller. I'm unable to run rbd trash purge without it failing with:
Code:
Removing images: 29% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group and were moved to the trash successfully.

I've managed to get the mgr daemons running again briefly (they still crashed later) with this:
Code:
rbd -c /etc/pve/ceph.conf --cluster ceph --pool <pool> ls <pool> |grep snap | xargs -l rbd -c /etc/pve/ceph.conf --cluster ceph --pool <pool> snap purge

Even changing the ReplicationSource copyMethod to Direct doesn't help in the long run.
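
For readability, the snap purge one-liner above can also be written as a loop (just a sketch: like the one-liner, it assumes the affected images have "snap" in their name and that you substitute your pool name):
Bash:
#!/usr/bin/env bash
# Purge all snapshots of every image in the pool whose name contains "snap".
POOL="<pool>"
for img in $(rbd -c /etc/pve/ceph.conf --pool "$POOL" ls | grep snap); do
    echo "Purging snapshots of $POOL/$img"
    rbd -c /etc/pve/ceph.conf --pool "$POOL" snap purge "$img"
done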
 
Reported this upstream in the meantime: https://tracker.ceph.com/issues/72713

This is quite tricky to track down, unfortunately; but hey, we've got a bit of an idea now at least.

Just out of curiosity, are there any users here that are only using either one of the Ceph CSI drivers, but not both? (So either only ceph-csi-rbd or only ceph-csi-cephfs.)

We are also seeing this issue in our TEST environment.
We had actually planned to upgrade our PROD next week (there we have «Basic» support, not just Community like the TEST environment).
We will hold off on the upgrade until this issue is well understood and resolved!

I'd suggest sending out a warning to other people trying to upgrade!

Regards, Urs
 
To create temporary backups of my Kubernetes workloads, I set up a Debian Bookworm container within Proxmox, in which I installed Ceph-MGR and added it to the cluster. The version is also 19.2.3, matching the Proxmox Ceph cluster. The manager running in the Debian Bookworm container does not experience these Segfault crashes, allowing me to temporarily backup my workloads.
 
Fascinating. Would you be so kind to share your installation procedure here, please? Did you build Ceph 19.2.3 from source, or did you pull it from our repos? Or did you install it some other way?
 
I installed the debian-12-standard_12.7-1_amd64.tar.zst container image and added the official Ceph Debian repository. Then I installed ceph-mgr and started it:

Bash:
# add the upstream Ceph Squid repository and its signing key
apt-get install software-properties-common
apt-add-repository 'deb https://download.ceph.com/debian-squid/ bookworm main'
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E84AC2C0460F3994

# install the manager daemon
apt update
apt install ceph-mgr
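
To confirm the container ended up on the same release as the cluster (not from the original post, just a quick sanity check):
Bash:
ceph-mgr --version    # should report 19.2.3 (squid), matching the Proxmox nodes
apt policy ceph-mgr   # shows the package came from download.ceph.com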

Then I copied the admin keyring and Ceph config over and started the mgr:

Bash:
cd /etc/ceph/
scp source-pve-ip:/etc/ceph/ceph.client.admin* .
scp source-pve-ip:/etc/ceph/ceph.conf .

export name=cephmgr1
ceph auth get-or-create mgr.$name mon 'allow profile mgr' osd 'allow *' mds 'allow *'
mkdir /var/lib/ceph/mgr/ceph-cephmgr1
nano /var/lib/ceph/mgr/ceph-cephmgr1/keyring   # <-- paste the key printed above

# the keyring file should then contain:
# [mgr.cephmgr1]
#         key = xxxxxxxxxxxxxxxxxxxxxxxxxx

# start the mgr daemon
ceph-mgr -i $name
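
You can then check from any cluster node that the extra mgr has actually joined (standard status commands, not part of the recipe above):
Bash:
ceph mgr stat      # shows the currently active mgr and the number of standbys
ceph -s | grep mgr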
 
I have the same issue using Volsync. I use the following cronjobs as a workaround until it is fixed.

On one Proxmox node:
Code:
10 * * * * /usr/bin/rbd trash purge k8s-prod && sleep 60 && /usr/bin/systemctl reset-failed && /usr/bin/systemctl restart ceph-mgr.target

On the others:
Code:
15 * * * * /usr/bin/systemctl reset-failed && /usr/bin/systemctl restart ceph-mgr.target

My Volsync jobs run on the hour and usually finish in under 5 minutes, so by the time the cron jobs run they are done.
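
If you prefer not to restart the mgr blindly, the second cron entry can be wrapped in a small script that only acts when the local mgr unit has actually failed (a sketch, assuming the Proxmox-created unit name ceph-mgr@<nodename>.service):
Bash:
#!/usr/bin/env bash
# Only reset and restart the local mgr if its systemd unit is in the failed state.
UNIT="ceph-mgr@$(hostname).service"

if systemctl is-failed --quiet "$UNIT"; then
    systemctl reset-failed "$UNIT"
    systemctl restart "$UNIT"
fi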

There is also the option to add
Code:
ceph crash archive-all
or
Code:
ceph crash prune 0
to the first job to clear the warnings from the Ceph dashboard, but I don't like the idea of removing possibly unrelated crashes.
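
A middle ground (a sketch, not from the posts above; it assumes ceph crash ls-new prints ID and ENTITY columns, as recent releases do) is to archive only the crash reports coming from mgr daemons and leave everything else visible:
Bash:
# Archive only new crash reports whose entity is a mgr daemon,
# keeping other, possibly unrelated, crashes on the dashboard.
ceph crash ls-new | awk 'NR>1 && $2 ~ /^mgr\./ {print $1}' | while read -r id; do
    ceph crash archive "$id"
done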
 
Would like to drop in as well.

PVE 9.0.5 cluster, Ceph 19.2.3

I have k3s installed on VMs with an external Ceph mount on the Proxmox cluster, and it's crashing the managers. I'm using the ceph-csi-rbd plugin as well.

Since Ceph froze, I actually couldn't access the nodes; my steps to fix it were:
1. Manually modify the Ceph auth keyring used for Kubernetes, revoking most privileges. (Since empty caps don't work, I just allowed rbd on the monitors; see the sketch after this list.)
2. rbd --pool k3s trash purge (I had to purge the metadata pool separately as well, since I was using an EC pool)
3. systemctl reset-failed
4. Start ceph managers again
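
For step 1, something along these lines works (a sketch; the client name client.k3s is an assumption, check ceph auth ls for yours, and save the old caps first since ceph auth caps replaces all of them):
Bash:
# Back up the current caps, then strip the Kubernetes client down to mon access only.
ceph auth get client.k3s > /root/client.k3s.caps.bak
ceph auth caps client.k3s mon 'profile rbd'
# Once the managers are stable again, restore the original caps from the backup.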

Would appreciate updates if anyone has a fix.

Good thing the k3s cluster was a fresh staging environment; the only PVC I had was a Rancher backup, so that could be what's causing the issues.
 
I tried adding a manager from a Debian VM install as well, and it has stayed alive for the last couple of days. My original 3 managers show they're still in standby, and the new one is active and working. If one of the old ones becomes the active manager, it crashes, and then the new one crashes as well when it tries to take over.