Ceph Managers Segfaulting After Upgrade (Proxmox VE 8 -> 9)

daemonslayer2047

Aug 7, 2025
I upgraded my Proxmox cluster from the latest 8.4 release to 9.0, and most things went well post-upgrade. The Ceph cluster, however, has not fared as well. All monitors, OSDs, and metadata servers upgraded to Ceph 19.2.3, but all of my manager services have failed. They ran fine for a while and then started crashing. Looking at the logs, I first saw this error:

Code:
Aug 05 21:00:43 pve-02 ceph-mgr[32601]: ERROR:root:Module 'xmltodict' is not installed.
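In case it helps anyone else hitting this, a minimal sketch of the fix (assuming the standard Debian/Proxmox repositories):

Code:
apt install python3-xmltodict
# restart the manager so it reloads its Python modules
systemctl restart ceph-mgr.target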

Installing python3-xmltodict resolved that error, but not the main one. My manager service is still failing with the following log:

Code:
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  -1022> 2025-08-06T21:11:20.227-0500 7457d18586c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:   -203> 2025-08-06T21:11:21.165-0500 7457ce83e6c0 -1 client.0 error registering admin socket command: (17) File exists
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:      0> 2025-08-06T21:11:22.819-0500 7457e10a36c0 -1 *** Caught signal (Segmentation fault) **
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  in thread 7457e10a36c0 thread_name:io_context_pool
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  ceph version 19.2.3 (ad1eecf4042e0ce72f382f60c97b709fd6f16a51) squid (stable)
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x745809249df0]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  2: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1598b0) [0x74580a9598b0]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  3: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1a1843) [0x74580a9a1843]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  4: _PyType_LookupRef()
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  5: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1a216b) [0x74580a9a216b]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  6: PyObject_GetAttr()
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  7: _PyEval_EvalFrameDefault()
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  8: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x1109dd) [0x74580a9109dd]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  9: /lib/x86_64-linux-gnu/libpython3.13.so.1.0(+0x3d3442) [0x74580abd3442]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  10: /lib/python3/dist-packages/rbd.cpython-313-x86_64-linux-gnu.so(+0xacfed) [0x7457f864ffed]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  11: /lib/librbd.so.1(+0x3cc8ea) [0x7457f7dcc8ea]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  12: /lib/librbd.so.1(+0x3ccfed) [0x7457f7dccfed]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  13: /lib/librbd.so.1(+0x3afec6) [0x7457f7dafec6]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  14: /lib/librbd.so.1(+0x3b0560) [0x7457f7db0560]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  15: /lib/librbd.so.1(+0x2cac93) [0x7457f7ccac93]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  16: /lib/librbd.so.1(+0x12e7bd) [0x7457f7b2e7bd]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  17: /lib/librbd.so.1(+0x2b1c9e) [0x7457f7cb1c9e]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  18: /lib/librbd.so.1(+0x2b4379) [0x7457f7cb4379]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  19: /lib/librados.so.2(+0xd2716) [0x7458090e4716]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  20: /lib/librados.so.2(+0xd3705) [0x7458090e5705]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  21: /lib/librados.so.2(+0xd3f8a) [0x7458090e5f8a]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  22: /lib/librados.so.2(+0xea598) [0x7458090fc598]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  23: /lib/librados.so.2(+0xd7a71) [0x7458090e9a71]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  24: /lib/librados.so.2(+0xedf63) [0x7458090fff63]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  25: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xe1224) [0x7458094e1224]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  26: /lib/x86_64-linux-gnu/libc.so.6(+0x92b7b) [0x74580929cb7b]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  27: /lib/x86_64-linux-gnu/libc.so.6(+0x1107b8) [0x74580931a7b8]
Aug 06 21:11:22 pve-01 ceph-mgr[617162]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 06 21:11:22 pve-01 systemd[1]: ceph-mgr@pve-01.service: Main process exited, code=killed, status=11/SEGV
Aug 06 21:11:22 pve-01 systemd[1]: ceph-mgr@pve-01.service: Failed with result 'signal'.
Aug 06 21:11:22 pve-01 systemd[1]: ceph-mgr@pve-01.service: Consumed 4.515s CPU time, 368.7M memory peak.
Aug 06 21:11:33 pve-01 systemd[1]: ceph-mgr@pve-01.service: Scheduled restart job, restart counter is at 3.
Aug 06 21:11:33 pve-01 systemd[1]: ceph-mgr@pve-01.service: Start request repeated too quickly.
Aug 06 21:11:33 pve-01 systemd[1]: ceph-mgr@pve-01.service: Failed with result 'signal'.
Aug 06 21:11:33 pve-01 systemd[1]: Failed to start ceph-mgr@pve-01.service - Ceph cluster manager daemon.
Aug 06 21:12:59 pve-01 systemd[1]: ceph-mgr@pve-01.service: Start request repeated too quickly.
Aug 06 21:12:59 pve-01 systemd[1]: ceph-mgr@pve-01.service: Failed with result 'signal'.
Aug 06 21:12:59 pve-01 systemd[1]: Failed to start ceph-mgr@pve-01.service - Ceph cluster manager daemon.
Aug 06 21:21:45 pve-01 systemd[1]: ceph-mgr@pve-01.service: Start request repeated too quickly.
Aug 06 21:21:45 pve-01 systemd[1]: ceph-mgr@pve-01.service: Failed with result 'signal'.
Aug 06 21:21:45 pve-01 systemd[1]: Failed to start ceph-mgr@pve-01.service - Ceph cluster manager daemon.

I admit I am stumped. I have now started running into strange behavior: attempting to delete a VM that has no storage on Ceph at all fails because Proxmox cannot access Ceph:
Code:
  Logical volume "snap_vm-118-disk-0_no-os" successfully removed.
  Logical volume "snap_vm-118-disk-0_minimal-config" successfully removed.
  Logical volume "vm-118-disk-0" successfully removed.
  Logical volume "snap_vm-118-disk-1_minimal-config" successfully removed.
  Logical volume "snap_vm-118-disk-1_no-os" successfully removed.
  Logical volume "vm-118-disk-1" successfully removed.
TASK ERROR: rbd error: rbd: listing images failed: (2) No such file or directory

I tried to purge a single Proxmox node of all Ceph material (NOT including `/etc/pve/`) and then reinstall, but the segfault remains.
 
Please post the output of "pveversion -v" and "dpkg --list | grep -e 'ceph' -e 'rbd' -e 'rados'".
 
No problem:
pveversion -v:
Code:
root@pve-01:~# pveversion -v
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.3 (running version: 9.0.3/025864202ebb6109)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.3-pve1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.7
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.11-1
proxmox-backup-file-restore: 4.0.11-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.5
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.16
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1

dpkg --list | grep -e 'ceph' -e 'rbd' -e 'rados':
Code:
root@pve-01:~# dpkg --list | grep -e 'ceph' -e 'rbd' -e 'rados'
ii  ceph                                 19.2.3-pve1                         amd64        distributed storage and file system
ii  ceph-base                            19.2.3-pve1                         amd64        common ceph daemon libraries and management tools
ii  ceph-common                          19.2.3-pve1                         amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse                            19.2.3-pve1                         amd64        FUSE-based client for the Ceph distributed file system
ii  ceph-mds                             19.2.3-pve1                         amd64        metadata server for the ceph distributed file system
ii  ceph-mgr                             19.2.3-pve1                         amd64        manager for the ceph distributed storage system
ii  ceph-mgr-dashboard                   19.2.3-pve1                         all          dashboard module for ceph-mgr
ii  ceph-mgr-modules-core                19.2.3-pve1                         all          ceph manager modules which are always enabled
ii  ceph-mon                             19.2.3-pve1                         amd64        monitor server for the ceph storage system
ii  ceph-osd                             19.2.3-pve1                         amd64        OSD server for the ceph storage system
ii  ceph-volume                          19.2.3-pve1                         all          tool to facilidate OSD deployment
ii  libcephfs2                           19.2.3-pve1                         amd64        Ceph distributed file system client library
ii  librados2                            19.2.3-pve1                         amd64        RADOS distributed object store client library
ii  librados2-perl                       1.5.0                               amd64        Perl bindings for librados
ii  libradosstriper1                     19.2.3-pve1                         amd64        RADOS striping interface
ii  librbd1                              19.2.3-pve1                         amd64        RADOS block device client library
ii  libsqlite3-mod-ceph                  19.2.3-pve1                         amd64        SQLite3 VFS for Ceph
ii  python3-ceph-argparse                19.2.3-pve1                         all          Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common                  19.2.3-pve1                         all          Python 3 utility libraries for Ceph
ii  python3-cephfs                       19.2.3-pve1                         amd64        Python 3 libraries for the Ceph libcephfs library
ii  python3-rados                        19.2.3-pve1                         amd64        Python 3 libraries for the Ceph librados library
ii  python3-rbd                          19.2.3-pve1                         amd64        Python 3 libraries for the Ceph librbd library

All nodes appear to have the same versions of everything.
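For completeness, the running daemon versions can also be checked cluster-wide from any node (assuming the monitors are reachable):

Code:
ceph versions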
 
Also, I don't know if it is related, but since the Proxmox upgrade the Ceph Performance section has been reporting obviously false information in its graphs.
 
Hi,
Can you share more of the log from before the segfault?
I am attempting to delete a VM that has no storage on Ceph at all, and it fails because Proxmox cannot access Ceph.
When you destroy a VM with "Remove unreferenced disks" selected, Proxmox scans all configured storages for disks belonging to that VM.
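If you want to destroy the VM without that scan, the CLI equivalent should be something along these lines (a sketch; the VMID is taken from your task log and the qm destroy option name is assumed):

Code:
qm destroy 118 --destroy-unreferenced-disks 0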

What does ceph -s say? What about ps faxl | grep ceph? Anything interesting in the /var/log/ceph/ceph-mgr.pve-01.log? How many managers do you have in the cluster, just this one?
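For example (assuming the default unit names and log paths):

Code:
ceph -s
ps faxl | grep ceph
journalctl -b -u ceph-mgr@pve-01.service
tail -n 200 /var/log/ceph/ceph-mgr.pve-01.log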
 
The logs are very long, so I attached them as text files instead; syslog.log is the service log from when I started the daemon again this morning all the way to the restart counter ending (i.e. the service is no longer willing to restart). I also have 4 managers; as far as I can tell they are all experiencing the exact same issue, and not one is able to remain online. As for the output of ceph -s, as you can imagine my cluster is not happy. One of the manager nodes went down overnight and has not been able to come back online; I am debugging that now and am not sure whether it is related at the moment:
Code:
root@pve-01:~# ceph -s
  cluster:
    id:     76a77648-177d-42ef-ab08-7c3dcc753993
    health: HEALTH_WARN
            no active mgr
            Degraded data redundancy: 616125/6281385 objects degraded (9.809%), 62 pgs degraded, 67 pgs undersized
            109 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum pve-02,pve-04,pve-03 (age 16m)
    mgr: no daemons active (since 9m)
    mds: 2/2 daemons up, 1 standby
    osd: 9 osds: 9 up (since 7h), 9 in (since 3M); 67 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   7 pools, 289 pgs
    objects: 2.09M objects, 7.6 TiB
    usage:   21 TiB used, 45 TiB / 65 TiB avail
    pgs:     616125/6281385 objects degraded (9.809%)
             179584/6281385 objects misplaced (2.859%)
             218 active+clean
             58  active+undersized+degraded+remapped+backfill_wait
             5   active+undersized+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling
             2   active+clean+scrubbing+deep
             1   active+remapped+backfill_wait
             1   active+clean+scrubbing
 
  io:
    client:   104 KiB/s rd, 101 KiB/s wr, 18 op/s rd, 11 op/s wr
    recovery: 83 MiB/s, 21 objects/s

and I don't see anything interesting in the output of ps faxl | grep ceph:
Code:
1     0   32294       2   0 -20      0     0 rescue I<   ?          0:00  \_ [kworker/R-ceph-msgr]
4 64045    1062       1  20   0  21984 14276 hrtime Ss   ?          0:01 /usr/bin/python3 /usr/bin/ceph-crash
4 64045    1394       1  20   0 191792 47556 futex_ Ssl  ?          0:18 /usr/bin/ceph-mds -f --cluster ceph --id pve-01 --setuser ceph --setgroup ceph
0     0  847393  845942  20   0   6528  2344 pipe_r S+   pts/0      0:00  |                   \_ grep ceph
4 64045    2015       1  20   0 3702192 2803844 futex_ Ssl ?       65:21 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
4 64045    2016       1  20   0 3667584 2473048 futex_ Ssl ?       44:27 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
4 64045  703650       1  20   0 2584696 1915624 futex_ Ssl ?       12:41 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
 


I'm also having this issue. I upgraded from 8.4 to 9.0.3 and now my mgr services will not start. I even deleted all the OSDs, purged Ceph from my cluster, and reinstalled, and I'm still unable to get the managers to start.

Code:
Aug 09 02:56:26 pve1 systemd[1]: ceph-mgr@pve1.service: Main process exited, code=killed, status=11/SEGV
Aug 09 02:56:26 pve1 systemd[1]: ceph-mgr@pve1.service: Failed with result 'signal'.
Aug 09 02:56:26 pve1 systemd[1]: ceph-mgr@pve1.service: Consumed 1.907s CPU time, 318.2M memory peak.
Aug 09 02:56:37 pve1 systemd[1]: ceph-mgr@pve1.service: Scheduled restart job, restart counter is at 3.
Aug 09 02:56:37 pve1 systemd[1]: ceph-mgr@pve1.service: Start request repeated too quickly.
Aug 09 02:56:37 pve1 systemd[1]: ceph-mgr@pve1.service: Failed with result 'signal'.
Aug 09 02:56:37 pve1 systemd[1]: Failed to start ceph-mgr@pve1.service - Ceph cluster manager daemon.
 


I am seeing the same problem. I upgraded 2-3 days ago, and since yesterday all Ceph managers have crashed and cannot be started without crashing again.

I have 3 of everything: physical nodes, OSDs, monitors, MDS daemons, and managers.

EDIT: Also on Ceph 19.2.3 and Proxmox 9.0.3.
 

@daemonslayer2047 @CaptainRedHat @joriskt @fma965 Could you run the following command on any of the nodes with a crashing MGR and attach the ceph-crash.tar.gz file here, please?

Code:
# tar -I 'gzip --best' -cvf ceph-crash.tar.gz /var/lib/ceph/crash/*

Before posting, check if any of the logs contain sensitive information, of course.

This seems very similar to an issue I've recently investigated.

Also, can you see any images when you run the following command for each pool you have on your Ceph cluster? (If any, that is.)

Code:
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool <POOL> trash ls
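If you have more than a handful of pools, a small loop covers them all; a sketch, assuming ceph osd pool ls returns every pool:

Code:
for pool in $(ceph osd pool ls); do
    echo "== ${pool} =="
    rbd -c /etc/pve/ceph.conf --cluster ceph --pool "${pool}" trash ls
done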
 
Also, is there anything else you did in the meantime regarding Ceph? Any configuration changes, new OSDs, etc.? Anything in that regard helps, thanks!
 
Hi @Max Carrara , thanks for your reply! Attached is the generated archive.

For the second question, yes I'm seeing some images:
Code:
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool proxmox-vms trash ls
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging trash ls
80e18fcf296441 csi-vol-262b74c6-a48e-44e1-b766-e17dce820ba3
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging_data trash ls
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging_metadata trash ls
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production trash ls
7fbe1874697cb3 csi-vol-107c1ef3-3180-4466-8a2a-bddaa7b2137b
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production_data trash ls
root@nuc3:~# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production_metadata trash ls
root@nuc3:~#

I suppose the relevant context is that I'm using Ceph as the storage backend for my K8s clusters, via ceph-csi-cephfs and ceph-csi-rbd.

I have not made any direct or manual changes to OSDs or configuration around the time of my first crashes. The only possibility in my case is that my ArgoCD was updating something on my clusters, but that (to me) seems unlikely to produce a new interaction with a Ceph mgr as those seem to be limited to provisioning new volumes in K8s. (Existing volumes still work fine, even now.)


Kind regards,

Joris.
 

Hi @Max Carrara , thanks for your reply! Attached is the generated archive.

Thanks a bunch!

It seems that you have something in common with what was reported over at #6635—you're using Kubernetes, too, and there's also something stuck in your RBD trash. In particular, it seems that tasks for removing k8s-staging/80e18fcf296441 and k8s-production/7fbe1874697cb3 from the trash are being started, but they also immediately fail.

So, it seems like there's something up with Kubernetes / ceph-csi-rbd (and maybe ceph-csi-cephfs, too). I'll see if I can reproduce this on my end (might take a while to set up Kubernetes though).
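(For reference, the crash reports themselves can also be listed directly on the cluster, assuming the crash module is enabled:)

Code:
# ceph crash ls
# ceph crash info <crash-id>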

In the meantime, since those images are trashed, can you try to remove them by hand? I'm assuming you don't need them anymore, given that they're in the trash.

Code:
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging trash rm 80e18fcf296441
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production trash rm 7fbe1874697cb3

If that doesn't work, please post the errors in your reply and try purging the images instead:

Code:
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging trash purge
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production trash purge

Afterwards, you should hopefully be able to restart the MGRs like so:

Code:
# systemctl reset-failed
# systemctl restart ceph-mgr.target
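To verify that a manager actually becomes active again afterwards (the node name is a placeholder):

Code:
# systemctl status ceph-mgr@<node>.service
# ceph -s | grep mgr: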

For the others: Please check if you have anything in your RBD trash as well. If y'all are using Kubernetes too, we've got a smoking gun.

Thanks a lot for bringing this to our attention!
 
Hi Max,

Code:
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-staging trash rm 80e18fcf296441
# rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s-production trash rm 7fbe1874697cb3

Worked perfectly, no errors.

Code:
# systemctl reset-failed
# systemctl restart ceph-mgr.target

Worked perfectly, no errors.

Lo and behold: the managers stay online!

It does look a lot like a smoking gun! Thanks for your attention to this Max :)

If there's anything I can do to help debug or troubleshoot this beyond this point please let me know.


Kind regards,

Joris
 
Excellent! Thanks a lot, Joris!

If the issue reappears, please ping me again here.

Just for curiosity's sake, does Kubernetes usually put something in the trash there? For example when you delete a volume (in Kubernetes) or something?
 
Could you run the following command on any of the nodes with a crashing MGR and attach the ceph-crash.tar.gz file here, please?
Sorry, I had to get systems back up. I was very close to reinstalling PVE 8.4, but I managed to get Ceph working again. I ended up uninstalling Ceph completely, cleaning up the disks and all configuration, rebooting each node, and then reinstalling and reconfiguring Ceph. I am now able to use Ceph again, and it seems to be working fine at this point.
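For anyone wanting to do the same, the stock tooling for a full wipe is roughly the following; a destructive sketch, not necessarily the exact sequence I ran, so double-check device names first:

Code:
# after stopping/removing the Ceph services on the node
pveceph purge
# wipe a former OSD disk (placeholder device name)
ceph-volume lvm zap /dev/sdX --destroy
reboot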
 

Alright, no problem! I'm glad you sorted it out nevertheless. Should the issue surface again, please don't hesitate to ping me.
 
For the others: Please check if you have anything in your RBD trash as well. If y'all are using Kubernetes too, we've got a smoking gun.
Yes, I have stuff in my trash, lots of it, and yes, I am using Kubernetes. You can see my entire GitOps-powered K8s cluster config here: https://github.com/fma965/f9-homelab

Code:
rbd -c /etc/pve/ceph.conf --cluster ceph --pool k8s trash purge
Removing images: 0% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group and were moved to the trash successfully.

UPDATE: I managed to get it to work; I had to clean out my VolumeSnapshots and VolumeSnapshotContents in K8s. Hopefully it continues to work.
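For anyone else needing the same cleanup, it was roughly along these lines (not the exact commands I ran; resource names and namespaces are placeholders):

Code:
kubectl get volumesnapshot -A
kubectl get volumesnapshotcontent
kubectl delete volumesnapshot <name> -n <namespace>
kubectl delete volumesnapshotcontent <name>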
 
Nope, @Max Carrara, I spoke too soon; the mgrs are crashing again. Not sure what to do from here.

Is it the same issue as this or a different one?


EDIT:

Could this be a workaround for now?
`ceph config set client rbd_move_to_trash_on_remove false`

With ceph-csi in K8s, I believe it would be this:
https://github.com/fma965/f9-homela...eph-csi/ceph-csi-rbd/app/helmrelease.yaml#L39
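Either way, whether the option actually got applied cluster-wide can be checked in the config dump:

Code:
ceph config dump | grep rbd_move_to_trash_on_remove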

EDIT2: That doesn't seem to make any difference, unfortunately.
 

This issue is happening for me as well. I do have a k8s-rbd pool that's being accessed by ceph-csi, but the pool was created using the Proxmox GUI. The only command I have run outside of the GUI was adding a Ceph user: ceph auth get-or-create client.k8s mon 'profile rbd' osd 'profile rbd pool=k8s' mgr 'profile rbd pool=k8s'

However, my managers just keep crashing.

This is a completely new cluster, both Proxmox and Ceph.
 
