ceph manager crashing on 19.2.3

cfgmgr

Greetings!

I have two clusters running PVE 8.4.5. Both had crashing managers on 19.2.2, and I was hoping that patching up to 19.2.3 would fix everything, but alas it does not. Does anyone have any idea whether this is a bug or a user-caused issue (i.e. me)?

Code:
# pveversion --verbose
proxmox-ve: 8.4.0 (running kernel: 6.8.12-13-pve)
pve-manager: 8.4.5 (running version: 8.4.5/57892e8e686cb35b)
proxmox-kernel-helper: 8.1.4
pve-kernel-5.15: 7.4-7
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph: 19.2.3-1~bpo12+1
ceph-fuse: 19.2.3-1~bpo12+1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.2
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.2
libpve-cluster-perl: 8.1.2
libpve-common-perl: 8.3.2
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.3-1
proxmox-backup-file-restore: 3.4.3-1
proxmox-backup-restore-image: 0.7.0
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.4
proxmox-mail-forward: 0.3.3
proxmox-mini-journalreader: 1.5
proxmox-widget-toolkit: 4.3.12
pve-cluster: 8.1.2
pve-container: 5.3.0
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-4~bpo12+1
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.2
pve-firmware: 3.16-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.5
pve-qemu-kvm: 9.2.0-7
pve-xtermjs: 5.5.0-2
qemu-server: 8.4.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.8-pve1


Code:
# ceph crash ls
ID                                                                ENTITY               NEW
2025-11-12T22:00:38.645572Z_42b0a1c1-25e1-4bb0-bd04-2df29d05d46a  mgr.hv03   *
2025-11-17T17:54:18.963675Z_ac9e7c60-04b7-4810-a129-28e7714ed917  mgr.hv03   *


Code:
# ceph crash info 2025-11-17T17:54:18.963675Z_ac9e7c60-04b7-4810-a129-28e7714ed917
{
    "assert_condition": "nref == 0",
    "assert_file": "./src/common/RefCountedObj.cc",
    "assert_func": "virtual ceph::common::RefCountedObject::~RefCountedObject()",
    "assert_line": 14,
    "assert_msg": "./src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 76c525d796c0 time 2025-11-17T17:54:18.961365+0000\n./src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)\n",
    "assert_thread_name": "OpHistorySvc",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x76c53f65b050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8aeec) [0x76c53f6a9eec]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x178) [0x76c53e8c98d3]",
        "/usr/lib/ceph/libceph-common.so.2(+0x2c9a16) [0x76c53e8c9a16]",
        "/usr/lib/ceph/libceph-common.so.2(+0x3b9885) [0x76c53e9b9885]",
        "(MMgrCommand::~MMgrCommand()+0x7a) [0x76c53ebad43a]",
        "(ceph::common::RefCountedObject::put() const+0x113) [0x76c53e9b9aa3]",
        "(TrackedOp::put()+0x25a) [0x5d7e81a5b5aa]",
        "(OpHistoryServiceThread::entry()+0x143) [0x5d7e81ad13a3]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x76c53f6a81f5]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x76c53f72889c]"
    ],
    "ceph_version": "19.2.3",
    "crash_id": "2025-11-17T17:54:18.963675Z_ac9e7c60-04b7-4810-a129-28e7714ed917",
    "entity_name": "mgr.hv03",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-mgr",
    "stack_sig": "34573e4c3543433958d462fb8fbe67add0880797713bcf5217f9638195366242",
    "timestamp": "2025-11-17T17:54:18.963675Z",
    "utsname_hostname": "hv03",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.12-13-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.12-13 (2025-07-22T10:00Z)"
}


Thanks!
 
This is still happening on two clusters. Has anyone seen this before or have any ideas? Thanks!
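Not a fix, but while waiting for a patched release, these are the standard ceph CLI commands I'd use to clear the warnings and recover a wedged manager (run on a node with an admin keyring; adjust for your cluster):

```shell
# Inspect the crash reports, then acknowledge them so HEALTH_WARN clears
ceph crash ls
ceph crash archive-all

# If the active mgr is wedged, fail over to a standby
ceph mgr fail

# Confirm a standby took over
ceph mgr stat
```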
 
Hi @cfgmgr,

I stumbled upon your post while searching for "ceph-mgr crashes" in this forum. We have a three-node cluster that has shown the same behaviour over the last couple of days:

Code:
ceph crash info 2026-04-03T21:58:55.327972Z_43ce14a1-d104-4175-9580-2d803a1e3800
{
    "assert_condition": "nref == 0",
    "assert_file": "./src/common/RefCountedObj.cc",
    "assert_func": "virtual ceph::common::RefCountedObject::~RefCountedObject()",
    "assert_line": 14,
    "assert_msg": "./src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 7d7685ad56c0 time 2026-04-03T21:58:55.325495+0000\n./src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)\n",
    "assert_thread_name": "OpHistorySvc",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7d769e04adf0]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x9495c) [0x7d769e09f95c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17a) [0x7d769e87cb33]",
        "/usr/lib/ceph/libceph-common.so.2(+0x27ccb1) [0x7d769e87ccb1]",
        "/usr/lib/ceph/libceph-common.so.2(+0x3a8b39) [0x7d769e9a8b39]",
        "(MMgrCommand::~MMgrCommand()+0x7a) [0x7d769ebd2d1a]",
        "(ceph::common::RefCountedObject::put() const+0x115) [0x7d769e9a8cf5]",
        "(MgrOpRequest::~MgrOpRequest()+0x5e) [0x55b0d5ce687e]",
        "(OpHistoryServiceThread::entry()+0x13a) [0x55b0d5ceb14a]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x92b7b) [0x7d769e09db7b]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x1107b8) [0x7d769e11b7b8]"
    ],
    "ceph_version": "19.2.3",
    "crash_id": "2026-04-03T21:58:55.327972Z_43ce14a1-d104-4175-9580-2d803a1e3800",
    "entity_name": "mgr.pve-net1-hv-1-prod",
    "os_id": "13",
    "os_name": "Debian GNU/Linux 13 (trixie)",
    "os_version": "13 (trixie)",
    "os_version_id": "13",
    "process_name": "ceph-mgr",
    "stack_sig": "89f9e598a07ad60edfde527c603be96a7e3dbbe56b1ea7f97f082142e835ae9b",
    "timestamp": "2026-04-03T21:58:55.327972Z",
    "utsname_hostname": "pve-net1-hv-1-prod",
    "utsname_machine": "x86_64",
    "utsname_release": "6.17.4-2-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.17.4-2 (2025-12-19T07:49Z)"
}

We have several other clusters with identical configurations running which have not shown these ceph-mgr crashes so far.

Have you found any interesting information since your posts?

Since the ceph-mgr recovered "auto-magically", we are not too concerned right now, but we will monitor our clusters for ceph-mgr problems over the next couple of weeks.

Regards,
Martin
 
As far as I know, the fix is supposed to be backported to 19.2.4. I also think it's fixed in 20.2.something. I was hoping to get to 19.2.4 before going to 20.2.x...