Hi all,
After an upgrade (on Friday night) to Proxmox 7.x and Ceph 16.2, everything seemed to work perfectly.
Sometime early this morning (Sunday), the cluster crashed.
17 out of 24 OSDs will no longer start.
Most of them will complete a successful
Code:
ceph-bluestore-tool fsck
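For completeness, this is the per-OSD invocation I mean (run against a stopped OSD; the path below is the standard OSD mount point, adjust the OSD id for your setup):
Code:
# check a stopped OSD's bluestore; /var/lib/ceph/osd/ceph-0 is the usual mount point
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0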
but some fail with an assertion (the same one as when trying to start them):
Code:
./src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f95ca004240 time 2021-07-11T07:53:55.341113+0000
./src/os/bluestore/BlueFS.cc: 2340: FAILED ceph_assert(r == 0)
2021-07-11T07:53:55.337+0000 7f95ca004240 -1 bluefs _allocate allocation failed, needed 0x400000
ceph version 16.2.4 (a912ff2c95b1f9a8e2e48509e602ee008d5c9434) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f95cac1d1b6]
2: /usr/lib/ceph/libceph-common.so.2(+0x24f341) [0x7f95cac1d341]
3: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1a4d) [0x555a1c7e05ed]
4: (BlueFS::sync_metadata(bool)+0x115) [0x555a1c7e0985]
5: (BlueFS::umount(bool)+0x1b4) [0x555a1c7e0e34]
6: (BlueStore::_close_bluefs(bool)+0x14) [0x555a1c7ff284]
7: (BlueStore::_close_db_and_around(bool)+0xd) [0x555a1c8360dd]
8: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x258) [0x555a1c88cfb8]
9: main()
10: __libc_start_main()
11: _start()
*** Caught signal (Aborted) **
in thread 7f95ca004240 thread_name:ceph-bluestore-
2021-07-11T07:53:55.341+0000 7f95ca004240 -1 ./src/os/bluestore/BlueFS.cc: In function 'void BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)' thread 7f95ca004240 time 2021-07-11T07:53:55.341113+0000
./src/os/bluestore/BlueFS.cc: 2340: FAILED ceph_assert(r == 0)
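The trigger seems to be the bluefs allocation failure right above the assert (needed 0x400000, i.e. 4 MiB, for the async log compaction). If anyone wants to inspect free space on an affected OSD without starting it, these ceph-bluestore-tool subcommands exist (same path as above; just how I'd look at it, not a fix):
Code:
# report sizes and free space of the bluefs devices on a stopped OSD
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
# dump the free extents as seen by the allocator
ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-0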
Some of the last non-assertion messages were that OSDs were running full, which would make sense if enough of them died to fill the rest (cluster usage was around 65-70%, so it is conceivable).
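(Those usage figures are from memory; on a running cluster they're easy to confirm:)
Code:
# overall cluster usage
ceph df
# per-OSD utilisation, to spot the full ones
ceph osd df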
Anyway, I have to go back to backups and will downgrade Ceph to Octopus again and reformat the OSDs (some services are critical and cannot wait for extensive recovery attempts).
However, I thought I'd let people know - maybe there is still something off with Ceph 16.2.
If any of the logs I have could be useful for analysis, please let me know and I'll send them.