Using PVE 9.0.5 on a 5-node Ceph cluster. Nodes have a mix of ZFS and non-ZFS root/boot disks, along with one large NVMe formatted ext4 for vzdumps. We also use PBS.
I have a cron script we have used for years that checks for this:
Code:
dmesg -T | grep hung | grep -v vethXChung ## **URGENT** probably need to restart node. find cause
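For context, that one-liner is the whole check; a minimal sketch of the cron wrapper around it might look like the following (the mail command, subject, and recipient here are assumptions for illustration, not our actual script):
Code:
#!/bin/bash
# Hypothetical sketch: run the same dmesg check and mail any hits.
# "vethXChung" exclusion is copied from the one-liner above; "mail" and
# the "root" recipient are assumptions, adjust to the local mail setup.
HITS="$(dmesg -T | grep hung | grep -v vethXChung)"
if [ -n "$HITS" ]; then
    printf '%s\n' "$HITS" | mail -s "URGENT: hung tasks on $(hostname) - probably need to restart node, find cause" root
fi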
Most or all nodes just started sending these emails, for KVMs and PCTs too.
I'll put an example of the full dmesg hung section below.
Today I replaced a node; during the process I moved 8 OSDs.
Usually, if we get hangs, rebooting the node and the KVMs showing the message fixes the issue. Not this time: the hangs keep coming back.
Code:
[Tue Aug 19 22:26:22 2025] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:22 2025] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd37 (1)10.11.12.2:6822 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:43 2025] INFO: task kworker/u193:1:314 blocked for more than 122 seconds.
[Tue Aug 19 22:26:43 2025] Tainted: P O 6.14.8-2-pve #1
[Tue Aug 19 22:26:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug 19 22:26:43 2025] task:kworker/u193:1 state:D stack:0 pid:314 tgid:314 ppid:2 task_flags:0x4248060 flags:0x00004000
[Tue Aug 19 22:26:43 2025] Workqueue: writeback wb_workfn (flush-251:0)
[Tue Aug 19 22:26:43 2025] Call Trace:
[Tue Aug 19 22:26:43 2025] <TASK>
[Tue Aug 19 22:26:43 2025] ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025] __schedule+0x466/0x13f0
[Tue Aug 19 22:26:43 2025] ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025] schedule+0x29/0x130
[Tue Aug 19 22:26:43 2025] io_schedule+0x4c/0x80
[Tue Aug 19 22:26:43 2025] rq_qos_wait+0xc9/0x170
[Tue Aug 19 22:26:43 2025] ? __pfx_wbt_cleanup_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025] ? __pfx_rq_qos_wake_function+0x10/0x10
[Tue Aug 19 22:26:43 2025] ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025] wbt_wait+0xb6/0x100
[Tue Aug 19 22:26:43 2025] __rq_qos_throttle+0x25/0x40
[Tue Aug 19 22:26:43 2025] blk_mq_submit_bio+0x21e/0x800
[Tue Aug 19 22:26:43 2025] __submit_bio+0x75/0x290
[Tue Aug 19 22:26:43 2025] ? get_page_from_freelist+0x35a/0x13e0
[Tue Aug 19 22:26:43 2025] submit_bio_noacct_nocheck+0x30f/0x3e0
[Tue Aug 19 22:26:43 2025] submit_bio_noacct+0x28c/0x580
[Tue Aug 19 22:26:43 2025] submit_bio+0xb1/0x110
[Tue Aug 19 22:26:43 2025] ext4_io_submit+0x24/0x50
[Tue Aug 19 22:26:43 2025] ext4_do_writepages+0x39d/0xef0
[Tue Aug 19 22:26:43 2025] ext4_writepages+0xc0/0x190
[Tue Aug 19 22:26:43 2025] ? timerqueue_add+0x72/0xe0
[Tue Aug 19 22:26:43 2025] ? ext4_writepages+0xc0/0x190
[Tue Aug 19 22:26:43 2025] do_writepages+0xde/0x280
[Tue Aug 19 22:26:43 2025] ? sched_clock_noinstr+0x9/0x10
[Tue Aug 19 22:26:43 2025] __writeback_single_inode+0x44/0x350
[Tue Aug 19 22:26:43 2025] ? pick_eevdf+0x175/0x1b0
[Tue Aug 19 22:26:43 2025] writeback_sb_inodes+0x255/0x550
[Tue Aug 19 22:26:43 2025] __writeback_inodes_wb+0x54/0x100
[Tue Aug 19 22:26:43 2025] ? queue_io+0x113/0x120
[Tue Aug 19 22:26:43 2025] wb_writeback+0x1ac/0x330
[Tue Aug 19 22:26:43 2025] ? get_nr_inodes+0x41/0x70
[Tue Aug 19 22:26:43 2025] wb_workfn+0x351/0x410
[Tue Aug 19 22:26:43 2025] process_one_work+0x172/0x350
[Tue Aug 19 22:26:43 2025] worker_thread+0x34a/0x480
[Tue Aug 19 22:26:43 2025] ? __pfx_worker_thread+0x10/0x10
[Tue Aug 19 22:26:43 2025] kthread+0xf9/0x230
[Tue Aug 19 22:26:43 2025] ? __pfx_kthread+0x10/0x10
[Tue Aug 19 22:26:43 2025] ret_from_fork+0x44/0x70
[Tue Aug 19 22:26:43 2025] ? __pfx_kthread+0x10/0x10
[Tue Aug 19 22:26:43 2025] ret_from_fork_asm+0x1a/0x30
[Tue Aug 19 22:26:43 2025] </TASK>
[Tue Aug 19 22:26:43 2025] INFO: task kworker/u193:3:639 blocked for more than 122 seconds.
[Tue Aug 19 22:26:43 2025] Tainted: P O 6.14.8-2-pve #1
[Tue Aug 19 22:26:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug 19 22:26:43 2025] task:kworker/u193:3 state:D stack:0 pid:639 tgid:639 ppid:2 task_flags:0x4248060 flags:0x00004000
[Tue Aug 19 22:26:43 2025] Workqueue: writeback wb_workfn (flush-251:0)
[Tue Aug 19 22:26:43 2025] Call Trace:
..