Hi,
There is no SSH access, and the VNC console does not respond to input; the screen just shows a frozen picture. The log on the machine itself breaks off abruptly and only resumes after the restart:
Code:
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,285] INFO [RaftManager nodeId=2] Vote request VoteRequestDa>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,368] INFO [RaftManager nodeId=2] Completed transition to Fo>
Jun 16 20:12:22 dev-kafka-2-kraft kafka-server-start.sh[23704]: [2023-06-16 20:12:22,370] INFO [BrokerToControllerChannelManager broker=2 name=h>
-- Boot fbd87f21176742cb8ab0717732d2b6bc --
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Linux version 5.15.0-73-generic (buildd@bos03-amd64-060) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0,>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: Command line: BOOT_IMAGE=/vmlinuz-5.15.0-73-generic root=/dev/mapper/ap--vg-ap--lv--root ro ip>
Jun 16 21:27:04 dev-kafka-2-kraft kernel: KERNEL supported cpus:

On the Proxmox node itself I see this in the syslog:
Code:
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 18:36:32 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 19:36:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Currently unreadable (pending) sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_18] [SAT], 16 Offline uncorrectable sectors
Jun 16 20:06:33 petr-stor4 smartd[1489]: Device: /dev/bus/0 [megaraid_disk_24], SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED
Jun 16 20:08:14 petr-stor4 pvestatd[2860]: VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - got timeout
Jun 16 20:18:49 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:08 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:19:28 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:24 petr-stor4 pvedaemon[1727264]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:21:43 petr-stor4 pvedaemon[1717223]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
Jun 16 20:22:05 petr-stor4 pvedaemon[1718506]: VM 121 qmp command failed - VM 121 qmp command 'guest-ping' failed - got timeout
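For reference, the full SMART details of those two disks behind the MegaRAID controller can be pulled like this (a sketch; the device path and ids are taken from the smartd lines above, adjust them if your controller is addressed differently):

Code:
# physical disk id 18 (the one with pending / uncorrectable sectors)
smartctl -a -d megaraid,18 /dev/bus/0
# physical disk id 24 (the one reporting FAILURE PREDICTION THRESHOLD EXCEEDED)
smartctl -a -d megaraid,24 /dev/bus/0
# attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and
# 198 (Offline_Uncorrectable) are the ones to watch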
Maybe my osd.54 is slowly dying and that is why the machines freeze? But I have a replication factor of 2/3 in my Ceph... Yesterday I had a Ceph warning in the PVE UI which said "1 daemons have recently crashed: osd.54 crashed on host *****". For now, though, the osd.54 service works fine. Backtrace of the osd.54 crash:
Code:{ "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f69cb483140]", "(BlueStore::Extent::~Extent()+0x27) [0x55d6e1ebb8e7]", "(BlueStore::Onode::put()+0x2c5) [0x55d6e1e32f25]", "(std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x67) [0x55d6e1ebc2c7]", "(LruOnodeCacheShard::_trim_to(unsigned long)+0xca) [0x55d6e1ebfb5a]", "(BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x15d) [0x55d6e1e3371d]", "(BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x399) [0x55d6e1e3a309]", "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x154d) [0x55d6e1e814dd]", "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2e0) [0x55d6e1e82430]", "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x52) [0x55d6e1aa8412]", "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7b4) [0x55d6e1cb8ef4]", "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x53d) [0x55d6e1a2418d]", "(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd46) [0x55d6e1a80326]", "(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x334a) [0x55d6e1a87c6a]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bc) [0x55d6e18f789c]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x65) [0x55d6e1b77505]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa27) [0x55d6e1924367]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x55d6e1fcd3da]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d6e1fcf9b0]", "/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f69cb477ea7]", "clone()" ], "ceph_version": "16.2.9", "crash_id": "2023-06-15T20:43:02.275025Z_aad0cf01-3839-41a3-b8bd-d516080722b1", "entity_name": "osd.54", "os_id": "11", "os_name": "Debian GNU/Linux 11 (bullseye)", "os_version": "11 (bullseye)", "os_version_id": "11", "process_name": "ceph-osd", "stack_sig": "f33237076f54d8500909a0c8c279f6639d4e914520f35b288af4429eebfd958e", "timestamp": "2023-06-15T20:43:02.275025Z", "utsname_hostname": "petr-stor4", "utsname_machine": "x86_64", "utsname_release": "5.15.35-2-pve", "utsname_sysname": "Linux", "utsname_version": "#1 SMP PVE 5.15.35-5 (Wed, 08 Jun 2022 15:02:51 +0200)" }
You have two disks with issues!
Please replace megaraid_disk_18 first, and quickly - this disk has unrecoverable read errors, which is not a good sign! If you can't replace it right away, set the OSD down/out so that its data is moved to other OSDs (see the sketch below).
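A minimal sketch of that, assuming the failing disk backs one of the OSDs on petr-stor4 (first check which OSD id actually sits on megaraid_disk_18; the id 54 below is only an example):

Code:
# on petr-stor4: map OSD ids to their physical devices
ceph-volume lvm list
ceph osd tree
# mark the affected OSD out so Ceph rebalances its data to the remaining OSDs
ceph osd out 54            # replace 54 with the OSD id on megaraid_disk_18
ceph -w                    # watch the recovery/backfill progress
# once recovery is done, stop the daemon before pulling the disk
systemctl stop ceph-osd@54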
And after the replacement (and rebuild), replace megaraid_disk_24 as well.
BTW: Ceph should access the disks directly. "megaraid_disk_18" sounds a bit like a single-disk RAID-0 per drive to simulate an HBA?
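If storcli (or the vendor's perccli equivalent) is installed, this is one way to check how the controller presents the disks (a sketch, assuming controller 0; the binary name depends on the vendor package):

Code:
# list the virtual drives - per-disk RAID-0 VDs would show up here,
# a real HBA / JBOD pass-through setup would not
storcli64 /c0/vall show
# list the physical drives and their state
storcli64 /c0/eall/sall show
# how the block devices finally appear to Linux / Ceph
lsblk -o NAME,TYPE,SIZE,MODEL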
Udo