Ceph OSD crash loop, RocksDB corruption

Hi,

yesterday one OSD went down and dropped out of my cluster; systemd stopped the service after it "crashed" 4 times. I tried restarting the OSD manually, but it keeps crashing immediately, so the OSD looks effectively dead.

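For reference, this is roughly what I ran to look at the service and the recorded crashes (OSD id redacted, commands from memory):

Code:
# why did systemd give up on the OSD service?
systemctl status ceph-osd@<id>
journalctl -u ceph-osd@<id> --since yesterday

# list the recorded crashes and dump the details of one
ceph crash ls
ceph crash info <crash_id>
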
Here's the first ceph crash info (the later ones look the same):

JSON:
{   
    "assert_condition": "r == 0",
    "assert_file": "./src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_apply_kv(TransContext*, bool)",
    "assert_line": 12869,
    "assert_msg": "./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 79a8540006c0 time 2024-11-10T18:38:11.976893+0100\n./src/os/bluestore/BlueStore.cc: 12869: FAILED ceph_assert(r == 0)\n",
    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x79a86bdf2050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x79a86be40e2c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5e104ccfe89c]",
        "/usr/bin/ceph-osd(+0x5939dc) [0x5e104ccfe9dc]",
        "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x594) [0x5e104d3c7534]",
        "(BlueStore::_kv_sync_thread()+0xfb3) [0x5e104d406e33]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0x5e104d43445d]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x79a86be3f134]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x79a86bebf7dc]"
    ],
    "ceph_version": "17.2.7",
    "crash_id": "[...]",
    "entity_name": "osd.[...]",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-osd",
    "stack_sig": "[...]",
    "timestamp": "2024-11-10T17:38:12.010665Z",
    "utsname_hostname": "[...]",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.8-3-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.8-3 (2024-07-16T16:16Z)"
}

It seems what fails here is a RocksDB commit (https://github.com/ceph/ceph/blob/v17.2.7/src/os/bluestore/BlueStore.cc#L12868). According to the systemd journal, the failing transactions contain DeleteCFs and sometimes PutCFs for the later crashes, and mostly DeleteCFs for the first one. About 3 hours before the first crash, lots of scrub errors were reported:

Code:
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
[... lots of those ...]
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 missing, 0 inconsistent objects
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 errors
[... nothing for about 3h ...]
2024-11-10T18:38:11.974+0100 [...] -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 2324967102, computed = 2953824923  in db/012021.sst offset 17365953 size 3841 code = #002 Rocksdb transaction
DeleteCF( prefix = P key = [...])
[... a bunch of those ...]
DeleteCF( prefix = P key = [...])
[... assert -> boom ...]

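(For reference, roughly how the scrub errors can be inspected; the pg id is the one from the log above:)

Code:
ceph health detail
rados list-inconsistent-obj 11.4 --format=json-pretty
# a repair could be requested with the following, though that is moot with the OSD dead:
# ceph pg repair 11.4
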
Over the last few days I've written a few TB of data to an image in this RBD pool 11; the transfer was still in progress when all of this happened. The disk itself looks fine, and there's nothing unusual in the logs (besides the above), so I'm not sure what's going on.

I've found a few ceph tracker issues with basically the same stacktrace:
* https://tracker.ceph.com/issues/51875 -> closed as "Won't Fix" with comment "rocksdb returns error code"
* https://tracker.ceph.com/issues/52196, https://tracker.ceph.com/issues/54925 -> new/open, nobody seems to care

So my assumption is that the RocksDB on that OSD somehow got corrupted and I now have to re-create the OSD. Do you agree? Or is there any way to rescue it? (Probably safer not to try, since the other OSDs are all fine, but what if they weren't?)
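
If rescuing were worth attempting, I'd probably start with something like this (with the OSD stopped; just a sketch, I have no idea whether fsck/repair can do anything against a corrupted SST):

Code:
systemctl stop ceph-osd@<id>
# consistency check of the BlueStore metadata (add --deep to also read and validate object data)
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
# last resort before re-creating the OSD
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>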

This is of course rather concerning: if the RocksDB can just die under minor load and takes the complete OSD with it, that puts the stability of Ceph into question for me. In particular, if you check https://tracker.ceph.com/issues/52196, this seems to have happened on various systems and versions from 15 to 17 (interestingly, not later?).

Thanks for any ideas!
 
I didn't look too deeply into the bug reports on the Ceph tracker. But overall, if you have the replicas, destroying and re-creating an OSD is an operation that should be fine.

Out of curiosity, what kind of disks are you using for the OSDs?
 
The HDDs are all from the Toshiba MG line (MG08 and MG10); the disk in question is an MG08ACA16TE. Write cache disabled, encryption (dmcrypt) enabled.

(Just a small 3-node home cluster. The rest is, I guess, not uncommon home server equipment: Supermicro mainboard, ECC memory, and according to the graphs there was no memory shortage either. Before this, the system, including the disk in question, had been running for years without any problem afaik; I never lost an OSD before. On the other hand, the load is usually low, so the long transfer might have increased the chance of triggering a possible hardware issue. If that's what it was...)
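
For what it's worth, this is roughly how I checked the disk and its write-cache setting (device name is just an example):

Code:
# SMART health and error counters
smartctl -a /dev/sdX
# confirm the volatile write cache is really off
hdparm -W /dev/sdX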

Recreating the OSD should indeed not be a problem; I'm just surprised that it seemingly died so "easily". The OSD survives things like a bad PG shard, but a single wrong bit in the DB and the OSD is done? I guess Ceph is more about high-level resiliency, but I was hoping for a bit more low-level resiliency as well; maybe it was just bad luck. Unless it happens again, it's probably not worth a deeper dive.
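
For anyone finding this later, the re-create itself should be roughly the following (OSD id and device are placeholders; double-check the pveceph options against the current docs, and wait for recovery where noted):

Code:
# take the dead OSD out and check the cluster can cope without it
ceph osd out <id>
ceph osd safe-to-destroy <id>

# wait for recovery/backfill to finish, then destroy it including the LVM/dm-crypt volumes
pveceph osd destroy <id> --cleanup

# re-create it on the same disk (encrypted, as before)
pveceph osd create /dev/sdX --encrypted 1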
 
