Ceph OSD crash loop, RocksDB corruption

Hi,

Yesterday one OSD went down and dropped out of my cluster; systemd stopped the service after it "crashed" 4 times. I tried restarting the OSD manually, but it crashes again immediately; the OSD looks effectively dead.

Here's the first ceph crash info (the later ones look the same):

JSON:
{   
    "assert_condition": "r == 0",
    "assert_file": "./src/os/bluestore/BlueStore.cc",
    "assert_func": "void BlueStore::_txc_apply_kv(TransContext*, bool)",
    "assert_line": 12869,
    "assert_msg": "./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 79a8540006c0 time 2024-11-10T18:38:11.976893+0100\n./src/os/bluestore/BlueStore.cc: 12869: FAILED ceph_assert(r == 0)\n",
    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x79a86bdf2050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x79a86be40e2c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5e104ccfe89c]",
        "/usr/bin/ceph-osd(+0x5939dc) [0x5e104ccfe9dc]",
        "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x594) [0x5e104d3c7534]",
        "(BlueStore::_kv_sync_thread()+0xfb3) [0x5e104d406e33]",
        "(BlueStore::KVSyncThread::entry()+0xd) [0x5e104d43445d]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x79a86be3f134]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x79a86bebf7dc]"
    ],
    "ceph_version": "17.2.7",
    "crash_id": "[...]",
    "entity_name": "osd.[...]",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-osd",
    "stack_sig": "[...]",
    "timestamp": "2024-11-10T17:38:12.010665Z",
    "utsname_hostname": "[...]",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.8-3-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.8-3 (2024-07-16T16:16Z)"
}
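
(For reference, the dump above comes from the crash module and the journal excerpts below from systemd; roughly along these lines, with <crash-id> and <osd-id> as placeholders:)

Code:
# list recorded crashes and dump the details of one of them
ceph crash ls
ceph crash info <crash-id>
# follow the journal of the affected OSD around the crash
journalctl -u ceph-osd@<osd-id> --since "2024-11-10 15:00" --until "2024-11-10 19:00"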

What fails here seems to be a RocksDB commit (https://github.com/ceph/ceph/blob/v17.2.7/src/os/bluestore/BlueStore.cc#L12868); according to the systemd journal, the failing transactions contain DeleteCFs and sometimes PutCFs for the later crashes, and mostly DeleteCFs for the first one. About 3 hours before the first crash, lots of scrub errors were reported:

Code:
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
[... lots of those ...]
2024-11-10T15:44:51.571+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 shard 10 11:[...]:::rbd_data.[...]:head : missing
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 missing, 0 inconsistent objects
2024-11-10T15:44:52.478+0100 [...] -1 log_channel(cluster) log [ERR] : 11.4 scrub 16883 errors
[... nothing for about 3h ...]
2024-11-10T18:38:11.974+0100 [...] -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 2324967102, computed = 2953824923  in db/012021.sst offset 17365953 size 3841 code = #002 Rocksdb transaction
DeleteCF( prefix = P key = [...])
[... a bunch on those ...]
DeleteCF( prefix = P key = [...])
[... assert -> boom ...]

Over the last few days I've written a few TB of data to an image in this RBD pool (pool 11); the transfer was in progress when all of this happened. The disk itself looks fine, nothing unusual in the logs (besides the above), so I'm not sure what's going on.

I've found a few Ceph tracker issues with basically the same stack trace:
* https://tracker.ceph.com/issues/51875 -> closed as "Won't Fix" with comment "rocksdb returns error code"
* https://tracker.ceph.com/issues/52196, https://tracker.ceph.com/issues/54925 -> new/open, nobody seems to care

So my assumption is that the RocksDB on that OSD somehow got corrupted, and I now have to re-create this OSD? Do you agree? Or is there any way to rescue the OSD? (Probably safer not to try, since the other OSDs are all fine, but what if they weren't?)
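
(If rescuing were worth a try, I guess the first stop would be the offline BlueStore tooling, something like the sketch below; <id> is a placeholder, the OSD has to be stopped, and it likely won't help if the RocksDB itself is corrupted:)

Code:
# offline consistency check of the OSD's BlueStore / RocksDB (OSD stopped)
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
# deeper check and an attempted repair also exist
ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-<id>
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>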

This is of course rather concerning: if the RocksDB can just die under minor load and take the complete OSD with it, that puts the stability of Ceph into question for me. In particular, if you check https://tracker.ceph.com/issues/52196, this seems to have happened on various systems and versions from 15 to 17 (interestingly, not later?).

Thanks for any ideas!
 
I didn't look too deeply into the bug reports on the Ceph tracker. But overall, if you have the replicas, destroying and re-creating an OSD is an operation that should be fine.
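
Roughly like this (a sketch, not exact commands for your setup; <id> and /dev/sdX are placeholders, wait for the cluster to be healthy again before purging, and on Proxmox the GUI or pveceph wraps the same steps):

Code:
# take the dead OSD out, stop it, and remove it from the cluster
ceph osd out <id>
systemctl stop ceph-osd@<id>
ceph osd purge <id> --yes-i-really-mean-it
# wipe the old LVM volume and create a fresh (encrypted) OSD on the same disk
ceph-volume lvm zap /dev/sdX --destroy
pveceph osd create /dev/sdX --encrypted 1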

Out of curiosity, what kind of disks are you using for the OSDs?
 
Out of curiosity, what kind of disks are you using for the OSDs?
The HDDs are all from the Toshiba MG line (MG08 and MG10); the disk in question is an MG08ACA16TE. Write cache disabled, encryption (dmcrypt) enabled.
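
(For reference, the drive write cache can be checked and toggled with hdparm, roughly like this; /dev/sdX is a placeholder, and the setting may not survive a power cycle depending on the drive:)

Code:
# show the current volatile write cache setting of the drive
hdparm -W /dev/sdX
# 0 disables, 1 enables the drive's write cache
hdparm -W 0 /dev/sdX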

(Just a small 3-node home cluster. The rest is, I guess, not uncommon home server equipment: Supermicro mainboard, ECC memory; according to the graphs there was also no memory shortage. Before this, the system incl. the disk in question had been running for years without any problem afaik, and I never lost an OSD before. On the other hand, the load is usually low, so the long transfer might have increased the chance of triggering a possible hardware issue. If that's what it was...)

Recreating the OSD should indeed not be a problem, I'm just surprised that it seemingly died so "easily". The OSD should survive something like a bad PG shard, but a single wrong bit in the DB and the OSD is done? I guess Ceph is more about high-level resiliency, but I was hoping for a bit more low-level resiliency too; maybe it was just "bad luck". But unless it happens again, it's probably not worth a deeper dive.
 
The performance is better with the (OS) write cache disabled, due to the "Persistent Write Cache", just like for enterprise SSDs. I did of course benchmark that; IOPS are pretty much the same except for sync writes, which are way faster with the write cache disabled. (The Ceph documentation you linked basically says the same, right?)
 
I never had any experience with HDDs. With SSDs one definitely wants to use the cache of the disk if they have PLP; then they can ACK the sync writes a lot faster.
I remember watching a talk about the experience CERN gained running Ceph, and they do mention enabling the cache of the disk with hdparm, even for HDDs.
That's why I mentioned it.

But if you tested it with it on and off, of course, choose what performs better :)

I did of course benchmark that; IOPS are pretty much the same except for sync writes, which are way faster with the write cache disabled.
How did sync writes behave? Because OSDs should benefit from fast sync writes.
 
How did sync writes behave? Because OSDs should benefit from fast sync writes.
In general these drives are not so easy to benchmark, because they have a relatively large cache/internal write buffer compared to their speed, which can lead to weird results if you mix operations. But to answer the question: for random 4K writes with iodepth=1 (roughly the fio setup sketched after the lists below), sync writes with the OS write cache disabled are about as fast as non-sync writes:
  • OS write cache disabled: around 400 IOPS (starts faster at 2000 IOPS, then drops as the drive cache fills up), whereas reads are around 80 IOPS
  • OS write cache enabled: around 90 IOPS (so more like reads; I assume because they are flushed immediately; non-sync writes are still ~400 IOPS)
For the journal benchmark (4K sequential sync writes, iodepth=1):
  • OS write cache disabled: around 4000 IOPS, independently of numjobs I guess; it starts faster at around 10000-20000 IOPS, but I think that's just because the cache is empty in the beginning
  • OS write cache enabled: starts around 120 IOPS with numjobs=1, but jumps to 2000 IOPS with numjobs=2, growing to 3500 IOPS with numjobs=16 (not sure why the difference between 1 and 2 is so big; it seems like the OS does buffer some writes here)
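
(For reference, tests of this shape can be reproduced with fio along these lines; a sketch only, not the exact invocations used here, /dev/sdX is a placeholder, and writing to the raw device is destructive:)

Code:
# 4K random sync writes at queue depth 1 (the first comparison above)
fio --name=rand-sync-write --filename=/dev/sdX --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --direct=1 --sync=1 --runtime=60 --time_based
# 4K sequential sync writes (the "journal" pattern); vary numjobs to compare
fio --name=seq-sync-write --filename=/dev/sdX --rw=write --bs=4k \
    --ioengine=libaio --iodepth=1 --direct=1 --sync=1 --numjobs=1 --runtime=60 --time_based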
 