I have been moving my PoC cluster to a rack, and through my own mis-cabling Ceph was down for a few days (there was nothing critical on it).
I was restarting nodes and doing all the other things that come with moving, and noticed that two nodes came up nicely but the cluster was undersized.
In the logs on the failed node I saw this:
Code:
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash.pve2 failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.crash failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
Sep 13 11:52:51 pve2 ceph-crash[742]: WARNING:ceph-crash:post /var/lib/ceph/crash/2023-09-04T00:02:34.087712Z_f6365ca6-2636-4531-b10f-432edb9e87bd as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
It was easily fixed: I just had to restart the monitors and the OSDs by hand (clicking the start buttons in the UI).
There seem to have been no ill effects.
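For reference, the same fix from the shell would look roughly like the following; this is just a sketch, and the OSD ID is a placeholder for whatever is actually on the node (the mon ID is normally the node name, pve2 here from the log above):

Code:
# restart the monitor for this node (on Proxmox the mon ID is the node name)
systemctl restart ceph-mon@pve2

# restart an OSD by its ID (0 is a placeholder; repeat per OSD on the node)
systemctl restart ceph-osd@0

# confirm the cluster comes back healthy
ceph -s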
I can see that the issue is that the conf file could not be read. I assume this is the one in the corosync file system?
Would this indicate that all that happened was that the corosync filesystem was unavailable during the Ceph start on that node (which, given how I was bouncing nodes, is not a surprise)?
This is more an academic question to help me understand dependencies.
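In case it helps frame the question: to check that assumption myself, I could confirm where the conf actually lives and whether pmxcfs (the /etc/pve filesystem that sits on top of corosync) was up at the time. A quick check, assuming the standard Proxmox layout:

Code:
# on Proxmox, /etc/ceph/ceph.conf is normally a symlink into pmxcfs
ls -l /etc/ceph/ceph.conf        # expected to point at /etc/pve/ceph.conf

# pmxcfs is provided by the pve-cluster service; if it was not mounted yet,
# anything trying to read /etc/pve (including ceph.conf) would fail
systemctl status pve-cluster

# corosync/quorum state as Proxmox sees it
pvecm status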