Ceph Initialization Crash

scyto

Active Member
Aug 8, 2023
I have been moving my PoC cluster to a rack and, through my own mis-cabling, Ceph was down for a few days (there was nothing critical on it).
I was restarting nodes and doing all the usual things that come with a move, and noticed two nodes came up nicely, but the cluster was undersized.
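
By "undersized" I mean the usual Ceph health warning about undersized placement groups; the standard status commands show it (nothing Proxmox-specific, just the plain ceph CLI):

Code:
# Overall cluster state; with one node's OSDs down I'd expect
# HEALTH_WARN and degraded/undersized PGs here
ceph -s

# Narrow it down to the undersized PG warnings specifically
ceph health detail | grep -i undersized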

In the logs on the failed node I saw this:

Code:
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash.pve2 failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
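
If I read the error right, librados is failing at conf_read_file, i.e. it cannot read a ceph.conf at all. On a PVE node that file is a symlink into the cluster filesystem, so a quick sanity check (paths assume a standard pveceph setup) would be:

Code:
# ceph.conf should be a symlink into /etc/pve (pmxcfs)
ls -l /etc/ceph/ceph.conf

# and pmxcfs itself must be mounted, otherwise /etc/pve is empty
mount | grep /etc/pve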

It was easy to fix: I just had to restart the monitors and the OSDs by hand (clicking the start buttons in the UI).

There seem to have been no ill effects.

I can see that the issue is that the conf file could not be read. I assume this is the one on the corosync filesystem?

Would this indicate that all that happened was that the corosync filesystem was unavailable during Ceph start on that node (which, given how I was bouncing nodes, is not a surprise)?

This is more an academic question to help me understand dependencies.
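
One way to check the ordering side of this (unit names are my assumption based on the default PVE setup; pve2 is the node from the log above):

Code:
# /etc/pve is provided by pve-cluster (pmxcfs); see whether the
# monitor unit is ordered after it - if not, a boot-time race where
# ceph starts before the config is visible seems plausible
systemctl show ceph-mon@pve2 -p After -p Requires
systemctl status pve-cluster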
 
Ah, more detail:

Code:
root@pve2:/var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd# cat meta
{
    "crash_id": "2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd",
    "timestamp": "2023-09-04T00:02:34.087712Z",
    "process_name": "ceph-mgr",
    "entity_name": "mgr.pve2",
    "ceph_version": "17.2.6",
    "utsname_hostname": "pve2",
    "utsname_sysname": "Linux",
    "utsname_release": "6.2.16-10-pve",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.2.16-10 (2023-08-18T11:42Z)",
    "utsname_machine": "x86_64",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_id": "12",
    "os_version_id": "12",
    "os_version": "12 (bookworm)",
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "sqlite3.OperationalError: disk I/O error"
    ],
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError"
}
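
So the crash itself was the mgr devicehealth module hitting a SQLite "disk I/O error" (as I understand it, that database lives in a RADOS pool, so it going away while the cluster was down would make sense). Once everything is healthy again, the stale reports can be acknowledged with the standard crash commands (the id is the one from my log):

Code:
# list pending crash reports
ceph crash ls

# inspect the one above
ceph crash info 2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd

# acknowledge everything so the RECENT_CRASH warning clears
ceph crash archive-all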
 
