ceph-crash problem

In my opinion, the better option is to run either ceph crash archive-all or ceph crash rm <crashid> on one node only.

ceph crash archive-all - the reports are no longer considered for the RECENT_CRASH health check and no longer appear in the crash ls-new output (they will still appear in the crash ls output, so you can analyze them in the future).

ceph crash rm <crashid> - removes a specific crash report.

Both commands also invoke the refresh_health_checks() function.
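
For example, on any one node (the crash ID below is a placeholder; real IDs come from ceph crash ls-new):
Code:
# list the crashes that still trigger RECENT_CRASH
ceph crash ls-new
# archive all of them at once ...
ceph crash archive-all
# ... or remove a single report for good
ceph crash rm <crashid>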
 
Done all of that, but it does not solve the remaining issue.
So manual deleting was the only way to get rid of the annoying messages in syslog.
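
In practice that means clearing the posted reports from disk on each node; roughly like this (assuming the default crash directory, and only for reports you no longer need):
Code:
# the reports ceph-crash has already submitted live here
ls /var/lib/ceph/crash/posted/
# deleting them only frees disk space; use ceph crash rm / archive-all
# to clear the cluster-side records as described above
rm -rf /var/lib/ceph/crash/posted/<crash-dir>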
 
How do I delete all of these? They are taking up all the space in /var/lib/ceph/crash/posted:


2023-03-26_14:46:57.391125Z_27286eae-78c1-4f23-ad64-42b23403cd42 2023-10-18_09:28:35.630719Z_12f663c8-de5a-4b31-be09-52dfba67e369
2023-03-26_15:03:41.483267Z_6b6b22f2-ad09-471e-a44b-b36b99de011f 2023-10-18_09:41:11.153720Z_2f66b343-8ef6-42ce-8dbf-80f7b8860e0e
2023-03-26_15:20:50.403741Z_e2da09bd-dc4f-473a-85e3-fc701a79c35a 2023-10-18_09:53:43.467781Z_4416d532-2f01-47b3-b524-d6c8e60eef41
2023-03-26_15:37:19.519444Z_5b07b0da-9207-4f89-b3bc-5b146977a956 2023-10-18_10:05:55.115032Z_7b612d81-9f57-4ecb-9796-5ef26ed1409d
2023-03-26_15:52:59.275860Z_00e43e8a-5757-4273-b3b2-6a83efdf3aaf 2023-10-18_10:19:12.930650Z_8be02204-bc95-47b8-88be-6e522370464f
2023-03-26_16:08:44.063938Z_3ac5e1d0-81c7-4e12-b94a-029ab1076a00 2023-10-18_10:34:39.344234Z_498de363-7ca0-4041-ba86-9c5afcc07d97
2023-03-26_16:24:19.822101Z_650f0bbf-6162-4dcf-8cc6-ed169037c59f 2023-10-18_10:50:39.033067Z_c16ddb7c-a15a-4adb-bfe3-7775e857b4be
2023-03-26_16:39:21.376618Z_c2e868e3-618a-4ed8-b475-e529b9fcfefe 2023-10-18_11:04:56.304781Z_f51489cd-4874-491a-8113-ebd7bbfa9fe2
2023-03-26_16:54:32.491550Z_23e274a6-8d7b-4c9c-ab63-1e4e838e966a 2023-10-18_11:17:14.777913Z_eee9ebc9-c613-4677-8769-6c87e0ba9e87
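
One way to clear both sides (the 30-day cutoff is just an example): ceph crash prune drops the cluster-side records, and the posted/ directories themselves still have to be removed by hand, as mentioned above.
Code:
# drop cluster-side crash records older than 30 days
ceph crash prune 30
# free the disk space on each node (pattern matches the 2023-03 entries above)
rm -rf /var/lib/ceph/crash/posted/2023-03-26_*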
 
I added a separate auth for ceph-crash with ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' and put it in a dedicated keyring, /etc/pve/ceph.client.crash.keyring.
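
Roughly like this (redirecting the output into /etc/pve is just one way to write the keyring file):
Code:
# create the crash client and store its key where all nodes can read it
ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' \
    > /etc/pve/ceph.client.crash.keyring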

Then I added the following to ceph.conf to make sure ceph-crash could pick it up.
Code:
[client.crash]
        keyring = /etc/pve/$cluster.$name.keyring
I also had to change ownership on /var/lib/ceph/crash/ on all the nodes since the posted/ subfolder was still owned by root.
Code:
chown -R ceph: /var/lib/ceph/crash/

The key will be readable on all cluster nodes, but at least the client is limited to the crash profile. Crash reporting works now, including archiving and pruning.
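
To check it, restarting the service and looking at its recent log lines should be enough (unit name assumes the standard ceph-crash systemd unit):
Code:
systemctl restart ceph-crash.service
journalctl -u ceph-crash.service -n 50   # should be free of keyring/auth errors
ceph crash ls                            # posted reports end up here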
 
@ahovda: With your changes I no longer get the
auth: unable to find a keyring on /etc/pve/priv/ceph.client.crash.keyring:
message. Thanks for that.

But I still get
Jan 18 16:47:06 proxh5 ceph-crash[34121]: 2024-01-18T16:47:06.782+0100 7f8dc7e926c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.admin.keyring: (13) Permission denied
Jan 18 16:47:06 proxh5 ceph-crash[34121]: 2024-01-18T16:47:06.782+0100 7f8dc7e926c0 -1 monclient: keyring not found

Any idea what that could be? Did you have these messages as well?

[6 days later]

OK, the two messages only show up on startup of the ceph-crash service. The running service does not throw any errors anymore with the workaround, and it reports the crashes correctly.
 
Is it advisable to upgrade Ceph from Quincy to Reef with this error? Any advice? I have a critical production cluster; will upgrading directly, rather than troubleshooting this first, have any impact?
 
I checked in today. The two messages only show up on startup of the ceph-crash service. The running service does not throw any errors anymore (with the workaround), and it reports the crashes correctly.