Ceph OSD crashing after upgrade to 8.3.2

bitbass

New Member
Aug 17, 2023
3-node cluster, all nodes configured similarly, each with a single NVMe stick for the OSD. On one node, the Ceph OSD process keeps crashing. To my untrained eye, it looks like the 8.3.2 update may have failed to install something correctly, but I really don't know. No further updates are available at this point. Here's the journalctl output:

Code:
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7c3a58c006c0 time 2024-12-23T10:45:28.891258-0500
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: ./src/os/bluestore/BlueStore.cc: 12889: FAILED ceph_assert(r == 0)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: [236B blob data]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: PutCF( prefix = O key = 0x7F800000000000000278000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 33)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x5a55a16ec307]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: *** Caught signal (Aborted) **
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  in thread 7c3a58c006c0 thread_name:bstore_kv_sync
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: 2024-12-23T10:45:28.892-0500 7c3a58c006c0 -1 ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7c3a58c006c0 time 2024-12-23T10:45:28.891258-0500
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: ./src/os/bluestore/BlueStore.cc: 12889: FAILED ceph_assert(r == 0)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x5a55a16ec307]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c3a71e5b050]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7c3a71ea9ebc]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: gsignal()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: abort()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5a55a16ec362]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  8: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  9: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  11: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: 2024-12-23T10:45:28.896-0500 7c3a58c006c0 -1 *** Caught signal (Aborted) **
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  in thread 7c3a58c006c0 thread_name:bstore_kv_sync
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c3a71e5b050]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7c3a71ea9ebc]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: gsignal()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: abort()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5a55a16ec362]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  8: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  9: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  11: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: [244B blob data]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: PutCF( prefix = O key = 0x7F800000000000000278000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 33)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:     -1> 2024-12-23T10:45:28.892-0500 7c3a58c006c0 -1 ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7c3a58c006c0 time 2024-12-23T10:45:28.891258-0500
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: ./src/os/bluestore/BlueStore.cc: 12889: FAILED ceph_assert(r == 0)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x5a55a16ec307]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:      0> 2024-12-23T10:45:28.896-0500 7c3a58c006c0 -1 *** Caught signal (Aborted) **
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  in thread 7c3a58c006c0 thread_name:bstore_kv_sync
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c3a71e5b050]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7c3a71ea9ebc]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: gsignal()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: abort()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5a55a16ec362]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  8: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  9: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  11: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: [244B blob data]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: PutCF( prefix = O key = 0x7F800000000000000278000000'!!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 33)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:     -1> 2024-12-23T10:45:28.892-0500 7c3a58c006c0 -1 ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(TransContext*, bool)' thread 7c3a58c006c0 time 2024-12-23T10:45:28.891258-0500
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]: ./src/os/bluestore/BlueStore.cc: 12889: FAILED ceph_assert(r == 0)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x5a55a16ec307]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:      0> 2024-12-23T10:45:28.896-0500 7c3a58c006c0 -1 *** Caught signal (Aborted) **
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  in thread 7c3a58c006c0 thread_name:bstore_kv_sync
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c3a71e5b050]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  2: /lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7c3a71ea9ebc]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  3: gsignal()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  4: abort()
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5a55a16ec362]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  6: /usr/bin/ceph-osd(+0x6334a2) [0x5a55a16ec4a2]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  7: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x454) [0x5a55a1d12724]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  8: (BlueStore::_kv_sync_thread()+0xed3) [0x5a55a1d88ce3]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  9: (BlueStore::KVSyncThread::entry()+0xd) [0x5a55a1db514d]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7c3a71ea81c4]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  11: /lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7c3a71f2885c]
Dec 23 10:45:28 nuc7i7 ceph-osd[4901]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 23 10:45:29 nuc7i7 systemd[1]: ceph-osd@2.service: Main process exited, code=killed, status=6/ABRT
 
I can destroy the OSD and create a new one, if that's the answer.
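
The journal output above repeats the same backtrace several times, so the distinctive lines are easy to miss. As a rough sketch (the helper name summarize_osd_crash is made up for illustration and is not a Ceph tool), a small filter can reduce such a dump to the failed assertion, the RocksDB operation, the Ceph version, the signal, and the systemd exit line:

```shell
#!/bin/sh
# Condense a ceph-osd crash dump from the journal to its distinctive lines.
# summarize_osd_crash is a hypothetical helper name, not part of Ceph.
summarize_osd_crash() {
    # 1. keep only the lines that identify the crash
    # 2. strip the journal prefix (date, host, unit[pid]:)
    # 3. strip Ceph's own "index> timestamp thread -1" prefix, if present
    # 4. drop repeated lines while preserving the original order
    grep -E 'FAILED ceph_assert|PutCF\(|ceph version|Caught signal|Main process exited' \
        | sed -E 's/^.*(ceph-osd\[[0-9]+\]|systemd\[[0-9]+\]): *//' \
        | sed -E 's/^(-?[0-9]+> )?[0-9T:.+-]+ [0-9a-f]+ +-1 //' \
        | awk '!seen[$0]++'
}
```

It could be fed from the journal, for example with journalctl -u ceph-osd@2.service --no-pager | summarize_osd_crash (the unit name is taken from the last log line above). If the crash module is enabled, ceph crash ls and ceph crash info <id> should show the same assertion in a more compact form.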
I have never seen that type of crash, but I did have OSDs with problems a while ago. My take-it-easy approach was to mark the OSD out, stop it and destroy it. Ceph will (or at least should) rebalance; it has probably done that already. After that:
  • run "memtest86" for some hours (overnight), at least if you are using non-ECC RAM and haven't done that before
  • <pvenode> --> Disks --> <disk> --> "Wipe Disk"
  • run a SMART long self-test: smartctl -t long <device>
  • check the result after some hours: smartctl -l [selftest|error] <device>
  • add the same device again via "Create: OSD"
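
For anyone who prefers the command line over the GUI, the steps above could look roughly like the following. This is only a sketch under assumptions: plan_osd_replace is a made-up helper that prints the commands instead of running them, OSD id 2 is taken from the log, and /dev/nvme0n1 is a guess at the device name. The memtest86 run is left out because it needs a reboot into the tester, and a long self-test on NVMe needs a reasonably recent smartmontools.

```shell
#!/bin/sh
# Print (do NOT run) the commands for retiring, testing and re-adding one OSD.
# plan_osd_replace is a hypothetical helper; review the output before
# executing anything, since "destroy --cleanup" wipes the disk.
plan_osd_replace() {
    osd_id=$1   # numeric OSD id, e.g. 2
    dev=$2      # block device backing the OSD (assumed, e.g. /dev/nvme0n1)
    cat <<EOF
ceph osd out ${osd_id}
systemctl stop ceph-osd@${osd_id}.service
ceph -s
pveceph osd destroy ${osd_id} --cleanup
smartctl -t long ${dev}
smartctl -l selftest ${dev}
smartctl -l error ${dev}
pveceph osd create ${dev}
EOF
}
```

Calling plan_osd_replace 2 /dev/nvme0n1 prints the eight commands in order. The ceph -s line is there as a reminder to check cluster health before destroying anything, and the two smartctl -l lines only make sense some hours after the self-test was started.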
 
So, I royally screwed things up like a first-year tech. I think I've hosed the entire Ceph cluster. I have backups in PBS, so I'll probably just start restoring. However, I'm having trouble wiping the first disk; it fails with this error message:

"Disk has a holder(500)". Not sure what that means.
 
I looked up the holder error and sorted that out.
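
The thread does not say how the holder error was resolved. For reference, that message usually means something (typically a leftover LVM or device-mapper volume from the old OSD) still sits on top of the disk, which the kernel exposes under /sys/block/<disk>/holders. A minimal sketch for checking that, with a made-up helper name list_holders and an assumed device name:

```shell
#!/bin/sh
# List the kernel-level holders of a block device (e.g. dm-0 for an LVM
# volume that still sits on top of it).
# list_holders is a hypothetical helper; the optional second argument only
# exists so the sysfs root can be replaced, for example in a test.
list_holders() {
    dev=$1                    # kernel name without /dev/, e.g. nvme0n1
    sysfs=${2:-/sys/block}
    for h in "$sysfs/$dev/holders"/*; do
        # An unmatched glob stays literal, so skip entries that do not exist.
        [ -e "$h" ] || [ -L "$h" ] || continue
        basename "$h"
    done
}
```

If list_holders nvme0n1 prints something like dm-0, then lsblk /dev/nvme0n1 should show which volume it is. A leftover Ceph volume can normally be cleared with ceph-volume lvm zap /dev/nvme0n1 --destroy, which destroys all data on that disk, after which the wipe should go through.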

Long road of restoring now...

I appreciate you responding @UdoB! If not for my own impatience you might have sorted me all out!
 