Hi,
yesterday one OSD went down and dropped out of my cluster; systemd stopped the service after it "crashed" 4 times. I tried restarting the OSD manually, but it keeps crashing immediately, so the OSD is effectively dead.
Here's the first ceph crash info (the later ones look the same):
JSON:
{
    "assert_condition": "r == 0",
    "assert_file": "./src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_apply_kv(TransContext*, bool)",
    "assert_line": 12869,
    "assert_msg": "./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 79a8540006c0 time 2024-11-10T18:38:11.976893+0100\n./src/os/bluestore/BlueStore.cc: 12869: FAILED ceph_assert(r == 0)\n",
    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x79a86bdf2050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x79a86be40e2c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5e104ccfe89c]",
        "/usr/bin/ceph-osd(+0x5939dc) [0x5e104ccfe9dc]",
        "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x594) [0x5e104d3c7534]",
        "(BlueStore::_kv_sync_thread()+0xfb3) [0x5e104d406e33]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0x5e104d43445d]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x79a86be3f134]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x79a86bebf7dc]"
    ],
    "ceph_version": "17.2.7",
    "crash_id": "[...]",
    "entity_name": "osd.[...]",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-osd",
    "stack_sig": "[...]",
    "timestamp": "2024-11-10T17:38:12.010665Z",
    "utsname_hostname": "[...]",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.8-3-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.8-3 (2024-07-16T16:16Z)"
}
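For reference, this is how I pulled the crash dump above (the actual crash id is elided, <crash_id> is just a placeholder):
Code:
ceph crash ls
ceph crash info <crash_id>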
It seems what fails here is a RocksDB commit (https://github.com/ceph/ceph/blob/v17.2.7/src/os/bluestore/BlueStore.cc#L12868). According to the systemd journal, the failing transactions contain DeleteCFs (sometimes PutCFs) for the later crashes, and more DeleteCFs for the first one. About 3h before the first crash, lots of scrub errors had been reported:
Code:
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
[... lots of those ...]
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 missing, 0 inconsistent objects
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 errors
[... nothing for about 3h ...]
2024-11-10T18:38:11.974+0100 [...] -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 2324967102, computed = 2953824923 in db/012021.sst offset 17365953 size 3841 code = #002 Rocksdb transaction
DeleteCF( prefix = P key = [...])
[... a bunch of those ...]
DeleteCF( prefix = P key = [...])
[... assert -> boom ...]
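That checksum mismatch in db/012021.sst reads to me like the RocksDB itself is corrupted on disk. In case it helps, this is roughly what I'd run (with the OSD stopped) to confirm that offline; the OSD id is a placeholder, and I haven't dared to try the repair yet:
Code:
# offline consistency check of the broken OSD's BlueStore / RocksDB
ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-<id>

# a repair subcommand exists as well, but I'd treat it as a last resort here
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>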
Over the last few days, I've written a few TB of data to an image in this RBD pool (pool 11); the transfer was in progress when all of this happened. The disk itself looks fine and there's nothing unusual in the logs (besides the above), so I'm not sure what's going on.
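For completeness, this is roughly what I checked on the disk side (device name is a placeholder):
Code:
# SMART health and error counters of the backing disk
smartctl -a /dev/sdX

# kernel log around the time of the crash: no I/O or controller errors visible
dmesg -T | grep -iE 'sdX|i/o error|ata|nvme'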
I've found a few ceph tracker issues with basically the same stacktrace:
* https://tracker.ceph.com/issues/51875 -> closed as "Won't Fix" with comment "rocksdb returns error code"
* https://tracker.ceph.com/issues/52196, https://tracker.ceph.com/issues/54925 -> new/open, nobody seems to care
So my assumption is that the RocksDB on that OSD somehow got corrupted, and I now have to re-create this OSD? Do you agree? Or is there any way to rescue the OSD? (Probably safer not to try, since the other OSDs are all fine, but what if they weren't?)
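If re-creating it really is the answer, this is roughly what I'd plan to do, assuming the rest of the cluster stays healthy (OSD id and device are placeholders; corrections welcome):
Code:
# take the broken OSD out and let the cluster backfill off it
ceph osd out <id>
systemctl stop ceph-osd@<id>.service

# once all PGs are active+clean again, remove the OSD entirely
ceph osd purge <id> --yes-i-really-mean-it

# wipe the old LVM / bluestore labels and create a fresh OSD on the same disk
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --data /dev/sdX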
This is of course rather concerning: if the RocksDB can just die under minor load and take the complete OSD with it, that puts the stability of Ceph into question for me. In particular, if you check https://tracker.ceph.com/issues/52196, this seems to have happened on various systems and versions from 15 to 17 (interestingly not later?).
Thanks for any ideas!