Hello,
I am running a PVE 8.2.7 Ceph cluster. Sporadically, about once a day, one node fails. Since I do not see the message "Caught signal (Segmentation fault)" in the logs of the other nodes, I suspect that this segfault is the cause. Failover of the VMs and containers works without problems.
I am somewhat at a loss as to what is causing this. I always have to power the node off and back on; it then boots without problems and an object recovery takes place.
The three nodes are identical in terms of hardware.
What could be the cause?
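If it helps, this is roughly how I would collect more detail on the next crash (just a sketch; the crash ID is only a placeholder):

Code:
# crashes recorded by the Ceph crash module
ceph crash ls
# full report for a single crash (placeholder ID)
ceph crash info <crash-id>
# journal of the affected OSD from the previous boot
journalctl -u ceph-osd@5.service -b -1
# check for memory / machine-check errors around the crash time
dmesg -T | grep -iE 'mce|ecc|error'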
Best regards and many thanks in advance
Sebastian
----
Code:
Oct 14 08:03:52 pve3 ceph-osd[1226]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 14 08:03:52 pve3 ceph-osd[1226]: 0> 2024-10-14T08:03:52.519+0200 7d932b4006c0 -1 *** Caught signal (Segmentation fault) **
Oct 14 08:03:52 pve3 ceph-osd[1226]: in thread 7d932b4006c0 thread_name:bstore_kv_sync
Oct 14 08:03:52 pve3 ceph-osd[1226]: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
Oct 14 08:03:52 pve3 ceph-osd[1226]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7d934285b050]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 2: tc_memalign()
Oct 14 08:03:52 pve3 ceph-osd[1226]: 3: posix_memalign()
Oct 14 08:03:52 pve3 ceph-osd[1226]: 4: (ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x14b) [0x57eec5a5ed3b]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 5: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x22) [0x57eec5a5ee32]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 6: (ceph::buffer::v15_2_0::create(unsigned int)+0x22) [0x57eec5a5ee72]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 7: (ceph::buffer::v15_2_0::list::obtain_contiguous_space(unsigned int)+0xa3) [0x57eec5a5f083]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 8: (BlueFS::_compact_log_dump_metadata_NF(unsigned long, bluefs_transaction_t*, int, unsigned long)+0x16d) [0x57eec57e0e4d]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 9: (BlueFS::_compact_log_async_LD_LNF_D()+0x464) [0x57eec57f1a14]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 10: (BlueFS::_maybe_compact_log_LNF_NF_LD_D()+0xcd) [0x57eec57f2c2d]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 11: (BlueFS::fsync(BlueFS::FileWriter*)+0x12f) [0x57eec57f33cf]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 12: (BlueRocksWritableFile::Sync()+0x14) [0x57eec5807ad4]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 13: /usr/bin/ceph-osd(+0x13e3d76) [0x57eec5e98d76]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 14: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x681) [0x57eec5ecd401]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 15: (rocksdb::WritableFileWriter::Sync(bool)+0x393) [0x57eec5ed1ec3]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 16: (rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long, rocksdb::DBImpl::LogFileNumberSize&)+0x51d) [0x57eec5d318dd]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 17: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*, rocksdb::PostMemTableCallback*)+0x17ea) [0x57eec5d3a14a]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 18: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x74) [0x57eec5d3ac04]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 19: (RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x85) [0x57eec5cb9965]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 20: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0x57eec5cba711]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 21: (BlueStore::_kv_sync_thread()+0x1218) [0x57eec5785028]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 22: (BlueStore::KVSyncThread::entry()+0xd) [0x57eec57b114d]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 23: /lib/x86_64-linux-gnu/libc.so.6(+0x89144) [0x7d93428a8144]
Oct 14 08:03:52 pve3 ceph-osd[1226]: 24: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x7d93429287dc]
Oct 14 08:03:52 pve3 ceph-osd[1226]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 14 08:03:53 pve3 systemd[1]: ceph-osd@5.service: Main process exited, code=killed, status=11/SEGV
Oct 14 08:03:53 pve3 systemd[1]: ceph-osd@5.service: Failed with result 'signal'.
Oct 14 08:03:53 pve3 systemd[1]: ceph-osd@5.service: Consumed 31min 27.802s CPU time.
Oct 14 08:03:53 pve3 kernel: libceph (81290533-0f9d-4009-a368-3b7e9262cbb3 e1584): osd5 down
Oct 14 08:04:03 pve3 systemd[1]: ceph-osd@5.service: Scheduled restart job, restart counter is at 1.
Oct 14 08:04:03 pve3 systemd[1]: Stopped ceph-osd@5.service - Ceph object storage daemon osd.5.
Oct 14 08:04:03 pve3 systemd[1]: ceph-osd@5.service: Consumed 31min 27.802s CPU time.
Oct 14 08:04:03 pve3 systemd[1]: Starting ceph-osd@5.service - Ceph object storage daemon osd.5...
Oct 14 08:04:03 pve3 systemd[1]: Started ceph-osd@5.service - Ceph object storage daemon osd.5.
Oct 14 08:04:31 pve3 pveproxy[487557]: worker exit
Oct 14 08:04:31 pve3 pveproxy[1383]: worker 487557 finished
Oct 14 08:04:31 pve3 pveproxy[1383]: starting 1 worker(s)
Oct 14 08:04:31 pve3 pveproxy[1383]: worker 507864 started
Oct 14 08:04:33 pve3 ceph-osd[507265]: 2024-10-14T08:04:33.761+0200 752b23d8e840 -1 osd.5 1583 log_to_monitors true
Oct 14 08:04:34 pve3 ceph-osd[507265]: 2024-10-14T08:04:34.003+0200 752b16a006c0 -1 osd.5 1583 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
Oct 14 08:04:36 pve3 kernel: libceph (81290533-0f9d-4009-a368-3b7e9262cbb3 e1587): osd5 up
Oct 14 08:05:01 pve3 CRON[508077]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 14 08:05:01 pve3 CRON[508078]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 14 08:05:01 pve3 CRON[508077]: pam_unix(cron:session): session closed for user root
Oct 14 08:05:18 pve3 pmxcfs[823]: [status] notice: received log
Oct 14 08:06:03 pve3 pveproxy[493126]: worker exit
Oct 14 08:06:03 pve3 pveproxy[1383]: worker 493126 finished
Oct 14 08:06:03 pve3 pveproxy[1383]: starting 1 worker(s)
Oct 14 08:06:03 pve3 pveproxy[1383]: worker 508442 started
-- Reboot --
Code:
HEALTH_WARN: Degraded data redundancy: 26814/8525754 objects degraded (0.315%), 33 pgs degraded, 33 pgs undersized
pg 3.0 is stuck undersized for 32m, current state active+recovery_wait+undersized+degraded+remapped, last acting [4,3]
HEALTH_WARN: 1 daemons have recently crashed
osd.5 crashed on host pve3 at 2024-10-14T06:03:52.520471Z
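Once the cause is understood, I would watch the recovery and then acknowledge the crash roughly like this (again only a sketch; ceph crash archive-all merely clears the warning, it does not fix anything):

Code:
# overall cluster state and recovery progress
ceph -s
# PGs that are still undersized or degraded
ceph pg dump_stuck undersized
ceph pg dump_stuck degraded
# acknowledge the recorded crash so the
# "daemons have recently crashed" warning clears
ceph crash archive-all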