Ceph Managers Seg Faulting After Upgrade (PVE 8 -> 9)

Here are the Ceph logs and the outputs of the commands requested earlier.
ceph -s:
Code:
  cluster:
    id:     18acc013-3ecb-4f72-a025-86882a2a39a4
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 3 daemons, quorum sol-ceres-pve,sol-eris-pve,sol-pluto-pve (age 2d)
    mgr: no daemons active (since 38m)
    mds: 1/1 daemons up, 2 standby
    osd: 8 osds: 8 up (since 2d), 8 in (since 2d)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 129 pgs
    objects: 7.78k objects, 26 GiB
    usage:   75 GiB used, 14 TiB / 15 TiB avail
    pgs:     129 active+clean

pveversion -v:
Code:
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.3 (running version: 9.0.3/025864202ebb6109)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
ceph: 19.2.3-pve1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
intel-microcode: 3.20250512.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.10
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.11-1
proxmox-backup-file-restore: 4.0.11-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.5
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.16
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1

dpkg --list | grep -e 'ceph' -e 'rbd' -e 'rados':
Code:
ii  ceph                                 19.2.3-pve1                     amd64        distributed storage and file system
ii  ceph-base                            19.2.3-pve1                     amd64        common ceph daemon libraries and management tools
ii  ceph-common                          19.2.3-pve1                     amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse                            19.2.3-pve1                     amd64        FUSE-based client for the Ceph distributed file system
ii  ceph-mds                             19.2.3-pve1                     amd64        metadata server for the ceph distributed file system
ii  ceph-mgr                             19.2.3-pve1                     amd64        manager for the ceph distributed storage system
ii  ceph-mgr-modules-core                19.2.3-pve1                     all          ceph manager modules which are always enabled
ii  ceph-mon                             19.2.3-pve1                     amd64        monitor server for the ceph storage system
ii  ceph-osd                             19.2.3-pve1                     amd64        OSD server for the ceph storage system
ii  ceph-volume                          19.2.3-pve1                     all          tool to facilidate OSD deployment
ii  libcephfs2                           19.2.3-pve1                     amd64        Ceph distributed file system client library
ii  librados2                            19.2.3-pve1                     amd64        RADOS distributed object store client library
ii  librados2-perl                       1.5.0                           amd64        Perl bindings for librados
ii  libradosstriper1                     19.2.3-pve1                     amd64        RADOS striping interface
ii  librbd1                              19.2.3-pve1                     amd64        RADOS block device client library
ii  libsqlite3-mod-ceph                  19.2.3-pve1                     amd64        SQLite3 VFS for Ceph
ii  python3-ceph-argparse                19.2.3-pve1                     all          Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common                  19.2.3-pve1                     all          Python 3 utility libraries for Ceph
ii  python3-cephfs                       19.2.3-pve1                     amd64        Python 3 libraries for the Ceph libcephfs library
ii  python3-rados                        19.2.3-pve1                     amd64        Python 3 libraries for the Ceph librados library
ii  python3-rbd                          19.2.3-pve1                     amd64        Python 3 libraries for the Ceph librbd library

ps faxl | grep ceph:
Code:
1     0   23562       2   0 -20      0     0 rescue I<   ?          0:00  \_ [kworker/R-ceph-msgr]
1     0   23592       2   0 -20      0     0 rescue I<   ?          0:00  \_ [kworker/R-ceph-watch-notify]
1     0   23593       2   0 -20      0     0 rescue I<   ?          0:00  \_ [kworker/R-ceph-completion]
0     0 2328367 2313577  20   0   6528  2108 pipe_r S+   pts/0      0:00  |                   \_ grep ceph
4 64045    8994       1  20   0  21984 14120 hrtime Ss   ?          0:01 /usr/bin/python3 /usr/bin/ceph-crash
4 64045    9498       1  20   0 727416 491468 futex_ Ssl ?         17:22 /usr/bin/ceph-mon -f --cluster ceph --id sol-ceres-pve --setuser ceph --setgroup ceph
4 64045   10211       1  20   0 189748 41776 futex_ Ssl  ?          0:44 /usr/bin/ceph-mds -f --cluster ceph --id sol-ceres-pve --setuser ceph --setgroup ceph
4 64045   12355       1  20   0 1249604 591580 futex_ Ssl ?        11:54 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
4 64045   14046       1  20   0 1209484 549916 futex_ Ssl ?        12:20 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph

EDIT: I just saw the trash purge commands mentioned earlier and, after running them, my managers are back online, so that does indeed seem to be the issue. How can I prevent these crashes automatically in the future?
 

Purging the trash worked for me too (also using K8s). Of course, if you're using rbd namespaces, don't forget to clear out all of them!

Code:
rbd --pool rbd trash purge
rbd --pool rbd --namespace dev trash purge
rbd --pool rbd --namespace prod trash purge

I look forward to a permanent solution!
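
For anyone who wants to automate the purge as a stop-gap until there is a real fix, a minimal sketch using a cron entry; the pool and namespace names below are just the examples from my setup, and the 5-minute interval is arbitrary, so adjust both to your environment:

Code:
# /etc/cron.d/rbd-trash-purge - run on one node that has a Ceph admin keyring
*/5 * * * * root rbd --pool rbd trash purge >/dev/null 2>&1
*/5 * * * * root rbd --pool rbd --namespace dev trash purge >/dev/null 2>&1
*/5 * * * * root rbd --pool rbd --namespace prod trash purge >/dev/null 2>&1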
 
Yeah, this works until ceph-csi / VolSync creates or deletes a snapshot, then it breaks again immediately. I've set VolSync to use copyMethod: Direct in the meantime.

Ceph-CSI definitely seems to be the common factor here, although this was working fine on PVE 8.x (I'm not sure on exactly which version).
 
I'm using Rook with external Ceph and Velero; if there's a way to band-aid the issue in either of those, I'd love to hear it.
 
As far as I know, Rook uses ceph-csi under the hood, so I believe the same underlying issue between ceph-csi and Ceph on Proxmox could be the cause.
 
Same issue here, with a couple of notes:
  • I am using Kubernetes as well, and can relate the issue to something being in the rbd trash
  • I've upgraded to the latest Ceph-CSI 3.14.2 and the problem still exists; nothing in the release notes looked related, so this is perhaps not surprising.
  • Although I have the snapshot controller etc. configured, I don't use snapshots or clones in my environment.
Reproducing the issue is very simple: creating an RBD-based PVC (by applying the YAML below), waiting until the PV has been created, and then deleting the PVC is enough to trigger it for me. The RBD image ends up in the trash and the Ceph manager crashes.

YAML:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-csi-fs-test-pvc
  labels:
    app: ceph-csi-fs-test
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-csi-rbd
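
In case it helps anyone, this is roughly the loop I use to trigger it, assuming the manifest above is saved as test-pvc.yaml and ceph-csi provisions into a pool named kubernetes (both names are specific to my setup):

Code:
kubectl apply -f test-pvc.yaml
kubectl get pvc ceph-csi-fs-test-pvc      # wait until STATUS shows Bound, i.e. the PV/RBD image exists
kubectl delete pvc ceph-csi-fs-test-pvc   # the backing RBD image is moved to the trash
rbd trash ls --pool kubernetes            # verify on a Ceph node; shortly after this the active mgr crashes for me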
 
Although I have the snapshot controller etc. configured, I don't use snapshots or clones in my environment.
That's interesting, but have you used them at some point? Since I stopped using them I don't get any issues anymore. Not a solution, but definitely a temporary workaround.
 
Interesting. I've never used snapshots other than a quick test when I first set it up about 4-5 years ago - I just noticed I don't even have a VolumeSnapshotClass on my cluster anymore...

I did notice that when I first delete a PVC, the RBD image is moved to the trash, but there is a watcher on it which prevents it from being removed. Running rbd info shows snapshot_count = 0, yet emptying the trash fails, stating that there are still watchers on the image. In an interesting twist of fate, it looks like it is the ceph-mgr itself that holds the watcher blocking the image from being removed from the trash.

Did you remove or disable anything (like the external-snapshotter or the VolumeSnapshotClass), or just stop creating snapshots? Whilst I can recreate the issue at will by just creating a PVC and deleting it, I'm wondering if there's some sort of event hook "on delete of an RBD in the kubernetes pool".
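
For reference, this is roughly how I looked at the watchers; the pool name is just my example, and as far as I understand the header object of a (format 2) RBD image is called rbd_header.<image id>:

Code:
rbd trash ls --pool kubernetes --long                         # lists trashed images together with their image ids
rados --pool kubernetes listwatchers rbd_header.<image-id>    # shows which client (here: the mgr) still watches the image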
 
That's interesting, but have you used them at some point? Since I stopped using them I don't get any issues anymore. Not a solution, but definitely a temporary workaround.
One possibility for our differing experience: I had a job that creates an ephemeral volume to do its work. It runs semi-regularly, creating and deleting a PVC each time, and therefore seemingly crashing the ceph-mgr on every run. If it's something you can realistically test in your environment, I'm curious: if you create a test RBD-based PVC and then delete it, does that still crash your ceph-mgr? Anyway, I've changed the job to use an in-memory tmpfs instead of an ephemeral volume, so fingers crossed the ceph-mgr doesn't crash as often for me now.
 
I don't have much of value to contribute at this point beyond mentioning that this is affecting us as well, with exactly the same setup: Ceph-CSI, snapshots, etc.

Is there a possibility of downgrading to Ceph 19.2.2 on PVE 9?

Edit 1:

I just attempted to downgrade to 19.2.2 on one of my manager nodes. It worked flawlessly but did not solve the issue.
I am still seeing the same mgr crash:

Note this line:

Code:
Aug 15 22:19:50 srv01 ceph-mgr[593278]:    -45> 2025-08-15T22:19:50.805+0200 7e5656ae06c0 -1 librbd::image::PreRemoveRequest: 0x5af55f5513b0 check_image_watchers: image has watchers - not removing


Code:
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  23: /lib/librados.so.2(+0xd7fb1) [0x7e567cae7fb1]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  24: /lib/librados.so.2(+0xee15e) [0x7e567cafe15e]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  25: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xe1224) [0x7e567cee1224]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  26: /lib/x86_64-linux-gnu/libc.so.6(+0x92b7b) [0x7e567cc9cb7b]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  27: /lib/x86_64-linux-gnu/libc.so.6(+0x1107b8) [0x7e567cd1a7b8]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2326> 2025-08-15T22:19:50.149+0200 7e5643a6e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2324> 2025-08-15T22:19:50.149+0200 7e5643a6e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2322> 2025-08-15T22:19:50.149+0200 7e5643a6e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2320> 2025-08-15T22:19:50.149+0200 7e5643a6e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2318> 2025-08-15T22:19:50.149+0200 7e5643a6e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2265> 2025-08-15T22:19:50.150+0200 7e564a2af6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2263> 2025-08-15T22:19:50.150+0200 7e564a2af6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2260> 2025-08-15T22:19:50.150+0200 7e564a2af6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2256> 2025-08-15T22:19:50.150+0200 7e564a2af6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  -2253> 2025-08-15T22:19:50.150+0200 7e564a2af6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 15 22:19:50 srv01 ceph-mgr[593278]:    -45> 2025-08-15T22:19:50.805+0200 7e5656ae06c0 -1 librbd::image::PreRemoveRequest: 0x5af55f5513b0 check_image_watchers: image has watchers - not removing
Aug 15 22:19:50 srv01 ceph-mgr[593278]:      0> 2025-08-15T22:19:50.819+0200 7e56572e16c0 -1 *** Caught signal (Segmentation fault) **
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  in thread 7e56572e16c0 thread_name:io_context_pool
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  ceph version 19.2.2 (e3f44088ed9d68b8cb1628a6d5d474c2010b29a0) squid (stable)
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7e567cc49df0]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  2: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1598b0) [0x7e567e3598b0]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  3: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1a1843) [0x7e567e3a1843]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  4: _PyType_LookupRef()
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  5: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1a216b) [0x7e567e3a216b]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  6: PyObject_GetAttr()
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  7: _PyEval_EvalFrameDefault()
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  8: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1109dd) [0x7e567e3109dd]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  9: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x3d3442) [0x7e567e5d3442]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  10: /lib/python3/dist-packages/rbd.cpython-313-x86_64-linux-gnu.so(+0xab9cd) [0x7e567013f9cd]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  11: /lib/librbd.so.1(+0x42a44a) [0x7e566fa2a44a]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  12: /lib/librbd.so.1(+0x42ab4d) [0x7e566fa2ab4d]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  13: /lib/librbd.so.1(+0x40add5) [0x7e566fa0add5]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  14: /lib/librbd.so.1(+0x40b5f0) [0x7e566fa0b5f0]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  15: /lib/librbd.so.1(+0x2e4c23) [0x7e566f8e4c23]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  16: /lib/librbd.so.1(+0x13169d) [0x7e566f73169d]
Aug 15 22:19:50 srv01 ceph-mgr[593278]:  17: /lib/librbd.so.1(+0x2cb644) [0x7e566f8cb644]

Edit 2:
We had not fully upgraded our cluster yet and still had some Proxmox 8 nodes left.
I started a manager on one of them, using ceph-mgr=19.2.2-pve1~bpo12+1

Unlike the 19.2.2-pve5 package, which I had tried on a Proxmox 9 node previously, this package seems to work perfectly fine; the manager is running stably.
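
For anyone wanting to do the same, a rough sketch of what this boils down to on such a node (assuming the PVE 8 / Bookworm Ceph Squid repository is still configured there; apt may want to pull the matching versions of ceph-base and the librados/librbd libraries as well):

Code:
apt install --allow-downgrades ceph-mgr=19.2.2-pve1~bpo12+1
apt-mark hold ceph-mgr            # keep apt from upgrading the mgr again on the next dist-upgrade
systemctl restart ceph-mgr.target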

By the way, in terms of immediate impact: I believe ceph-csi-cephfs requires a manager to be available to fetch storage stats, otherwise any mount request will hang.

This results in cephfs volumes not being mountable inside K8s clusters when no managers are available, making this an even more critical issue.
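
(For a quick check whether an active manager is currently available at all, before chasing hanging mounts:)

Code:
ceph mgr stat    # prints the active mgr (if any) and the number of standbys
ceph -s          # the mgr line in the services section shows the same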
 
We had not fully upgraded our cluster yet and still had some Proxmox 8 nodes left. I started a manager on one of them, using ceph-mgr=19.2.2-pve1~bpo12+1 [...] Unlike the 19.2.2-pve5 package, which I had tried on a Proxmox 9 node previously, this package seems to work perfectly fine; the manager is running stably.
How can we downgrade to this, though?
 
Unlike the 19.2.2-pve5 package, which I had tried on a Proxmox 9 node previously, this package seems to work perfectly fine; the manager is running stably. [...] This results in cephfs volumes not being mountable inside K8s clusters when no managers are available, making this an even more critical issue.
Thank you for sharing this. It certainly resonates with me: I was fully patched on PVE 8.4.x for 3-4 days, including Ceph 19.2.x, and didn't have any issues with ceph-mgr crashing until upgrading to PVE 9. Unfortunately, I don't have careful notes of versions or rigorous testing to offer anything beyond speculation, but if this were a PVE 9-specific issue (i.e. not present on PVE 8, regardless of Ceph version), that would be consistent with my experience.

And agreed, working with the cephfs CSI driver would hang for me as well when no managers were running.
 
Hey everybody,

sorry for leaving you in the dark for a bit there; I was unfortunately sick in between. However, I just managed to reproduce the bug on my fresh PVE 9 w/ Ceph 19.2.3 test cluster, which my new Kubernetes cluster running on some Debian 13 instances is using for storage. It took me a little while to deploy k8s from scratch, as I had never done that before, and to set up ceph-csi afterwards. (That took some acrobatics, man.) Thankfully, @cheiss had already set up ceph-csi before and came in clutch, so a big thanks to him!

Anyways, just wanted to let all of you know that I'll continue debugging and working on this in the coming days. Will have to rebuild Ceph locally in order to get the latest debug symbols and then gather some coredumps.
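
For anyone who wants to check whether their node has already captured a coredump of a crashing manager in the meantime, a quick sketch assuming systemd-coredump is installed (as far as I know it is not by default):

Code:
apt install systemd-coredump                      # start capturing crashes from now on
coredumpctl list ceph-mgr                         # list captured ceph-mgr crashes
coredumpctl dump ceph-mgr -o /tmp/ceph-mgr.core   # export the most recent core for analysis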

I'll keep you posted and let you know once I find out more.