Ceph Initialization Crash

scyto

Active Member
Aug 8, 2023
I have been moving my PoC cluster to a rack and, through my own mis-cabling, Ceph was down for a few days (there was nothing critical on it).
I was restarting nodes and doing all the usual things that come with a move, and noticed two nodes came up nicely, but the cluster was undersized.
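
By "undersized" I mean the usual Ceph health warning about undersized placement groups; the standard status commands show it (nothing Proxmox-specific, just the plain ceph CLI):

Code:
# Overall cluster state; with one node's OSDs down I'd expect
# HEALTH_WARN and degraded/undersized PGs here
ceph -s

# Narrow it down to the undersized PG warnings specifically
ceph health detail | grep -i undersized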

In the logs on the failed node I saw this:

Code:
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash.pve2 failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
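
If I read the error right, librados is failing at conf_read_file, i.e. it cannot read a ceph.conf at all. On a PVE node that file is a symlink into the cluster filesystem, so a quick sanity check (paths assume a standard pveceph setup) would be:

Code:
# ceph.conf should be a symlink into /etc/pve (pmxcfs)
ls -l /etc/ceph/ceph.conf

# and pmxcfs itself must be mounted, otherwise /etc/pve is empty
mount | grep /etc/pve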

It was easy to fix: I just had to restart the monitors and the OSDs by hand (clicking the start buttons in the UI).

There seem to have been no ill effects.

I can see that the issue is that the conf file could not be read. I assume this is the one on the corosync filesystem?

Would this indicate that all that happened was that the corosync filesystem was unavailable during Ceph start on that node (which, given how I was bouncing nodes, is not a surprise)?

This is more an academic question to help me understand dependencies.
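
One way to check the ordering side of this (unit names are my assumption based on the default PVE setup; pve2 is the node from the log above):

Code:
# /etc/pve is provided by pve-cluster (pmxcfs); see whether the
# monitor unit is ordered after it - if not, a boot-time race where
# ceph starts before the config is visible seems plausible
systemctl show ceph-mon@pve2 -p After -p Requires
systemctl status pve-cluster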
 
Ah, more detail:

Code:
root@pve2:/var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd# cat meta
{
    "crash_id": "2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd",
    "timestamp": "2023-09-04T00:02:34.087712Z",
    "process_name": "ceph-mgr",
    "entity_name": "mgr.pve2",
    "ceph_version": "17.2.6",
    "utsname_hostname": "pve2",
    "utsname_sysname": "Linux",
    "utsname_release": "6.2.16-10-pve",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.2.16-10 (2023-08-18T11:42Z)",
    "utsname_machine": "x86_64",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_id": "12",
    "os_version_id": "12",
    "os_version": "12 (bookworm)",
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "sqlite3.OperationalError: disk I/O error"
    ],
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError"
}
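
So the crash itself was the mgr devicehealth module hitting a SQLite "disk I/O error" (as I understand it, that database lives in a RADOS pool, so it going away while the cluster was down would make sense). Once everything is healthy again, the stale reports can be acknowledged with the standard crash commands (the id is the one from my log):

Code:
# list pending crash reports
ceph crash ls

# inspect the one above
ceph crash info 2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd

# acknowledge everything so the RECENT_CRASH warning clears
ceph crash archive-all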
 
