INFO: task kworker blocked for more than 122 seconds

RobFantini
Using PVE 9.0.5 on a 5-node Ceph cluster. The nodes have a mix of ZFS and non-ZFS root/boot disks, along with one large NVMe drive formatted ext4 for vzdumps. We also use PBS.

I have a cron script we have used for years that checks this:
Code:
dmesg -T |  grep hung |  grep -v vethXChung ##  **URGENT** probably need to restart node.  find cause

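For reference, a slightly broader sketch of that check which also matches the "blocked for more than" lines directly (same veth exclusion as above; just a sketch, untested on these nodes):
Code:
# broader variant: catch both the hung_task lines and the "blocked for more than" lines,
# keeping the same veth exclusion as the original one-liner
dmesg -T | grep -E 'hung|blocked for more than' | grep -v vethXChung \
  && echo "URGENT: hung/blocked tasks found - find cause, node may need a restart"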
Most or all nodes just started sending emails; KVMs and PCTs too.

I'll put an example of the full dmesg hung output below.

Today I replaced a node; during the process I moved 8 OSDs.

Usually if we get hangs, a reboot of the node and of the KVMs showing the message fixes the issue. Not this time: the hangs keep coming back.

Code:
[Tue Aug 19 22:26:22 2025] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:22 2025] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd37 (1)10.11.12.2:6822 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:43 2025] INFO: task kworker/u193:1:314 blocked for more than 122 seconds.
[Tue Aug 19 22:26:43 2025]       Tainted: P           O       6.14.8-2-pve #1
[Tue Aug 19 22:26:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug 19 22:26:43 2025] task:kworker/u193:1  state:D stack:0     pid:314   tgid:314   ppid:2      task_flags:0x4248060 flags:0x00004000
[Tue Aug 19 22:26:43 2025] Workqueue: writeback wb_workfn (flush-251:0)
[Tue Aug 19 22:26:43 2025] Call Trace:
[Tue Aug 19 22:26:43 2025]  <TASK>
[Tue Aug 19 22:26:43 2025]  ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025]  __schedule+0x466/0x13f0
[Tue Aug 19 22:26:43 2025]  ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025]  schedule+0x29/0x130
[Tue Aug 19 22:26:43 2025]  io_schedule+0x4c/0x80
[Tue Aug 19 22:26:43 2025]  rq_qos_wait+0xc9/0x170
[Tue Aug 19 22:26:43 2025]  ? __pfx_wbt_cleanup_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025]  ? __pfx_rq_qos_wake_function+0x10/0x10
[Tue Aug 19 22:26:43 2025]  ? __pfx_wbt_inflight_cb+0x10/0x10
[Tue Aug 19 22:26:43 2025]  wbt_wait+0xb6/0x100
[Tue Aug 19 22:26:43 2025]  __rq_qos_throttle+0x25/0x40
[Tue Aug 19 22:26:43 2025]  blk_mq_submit_bio+0x21e/0x800
[Tue Aug 19 22:26:43 2025]  __submit_bio+0x75/0x290
[Tue Aug 19 22:26:43 2025]  ? get_page_from_freelist+0x35a/0x13e0
[Tue Aug 19 22:26:43 2025]  submit_bio_noacct_nocheck+0x30f/0x3e0
[Tue Aug 19 22:26:43 2025]  submit_bio_noacct+0x28c/0x580
[Tue Aug 19 22:26:43 2025]  submit_bio+0xb1/0x110
[Tue Aug 19 22:26:43 2025]  ext4_io_submit+0x24/0x50
[Tue Aug 19 22:26:43 2025]  ext4_do_writepages+0x39d/0xef0
[Tue Aug 19 22:26:43 2025]  ext4_writepages+0xc0/0x190
[Tue Aug 19 22:26:43 2025]  ? timerqueue_add+0x72/0xe0
[Tue Aug 19 22:26:43 2025]  ? ext4_writepages+0xc0/0x190
[Tue Aug 19 22:26:43 2025]  do_writepages+0xde/0x280
[Tue Aug 19 22:26:43 2025]  ? sched_clock_noinstr+0x9/0x10
[Tue Aug 19 22:26:43 2025]  __writeback_single_inode+0x44/0x350
[Tue Aug 19 22:26:43 2025]  ? pick_eevdf+0x175/0x1b0
[Tue Aug 19 22:26:43 2025]  writeback_sb_inodes+0x255/0x550
[Tue Aug 19 22:26:43 2025]  __writeback_inodes_wb+0x54/0x100
[Tue Aug 19 22:26:43 2025]  ? queue_io+0x113/0x120
[Tue Aug 19 22:26:43 2025]  wb_writeback+0x1ac/0x330
[Tue Aug 19 22:26:43 2025]  ? get_nr_inodes+0x41/0x70
[Tue Aug 19 22:26:43 2025]  wb_workfn+0x351/0x410
[Tue Aug 19 22:26:43 2025]  process_one_work+0x172/0x350
[Tue Aug 19 22:26:43 2025]  worker_thread+0x34a/0x480
[Tue Aug 19 22:26:43 2025]  ? __pfx_worker_thread+0x10/0x10
[Tue Aug 19 22:26:43 2025]  kthread+0xf9/0x230
[Tue Aug 19 22:26:43 2025]  ? __pfx_kthread+0x10/0x10
[Tue Aug 19 22:26:43 2025]  ret_from_fork+0x44/0x70
[Tue Aug 19 22:26:43 2025]  ? __pfx_kthread+0x10/0x10
[Tue Aug 19 22:26:43 2025]  ret_from_fork_asm+0x1a/0x30
[Tue Aug 19 22:26:43 2025]  </TASK>
[Tue Aug 19 22:26:43 2025] INFO: task kworker/u193:3:639 blocked for more than 122 seconds.
[Tue Aug 19 22:26:43 2025]       Tainted: P           O       6.14.8-2-pve #1
[Tue Aug 19 22:26:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug 19 22:26:43 2025] task:kworker/u193:3  state:D stack:0     pid:639   tgid:639   ppid:2      task_flags:0x4248060 flags:0x00004000
[Tue Aug 19 22:26:43 2025] Workqueue: writeback wb_workfn (flush-251:0)
[Tue Aug 19 22:26:43 2025] Call Trace:
..
 
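For context, those warnings come from the kernel hung-task watchdog (khungtaskd). The related knobs can be inspected like this (read-only check, not a suggested change):
Code:
# show the current hung-task watchdog settings on a node
sysctl kernel.hung_task_timeout_secs kernel.hung_task_warnings kernel.hung_task_panic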
From the new node. Note we have not moved the subscription over, so it is using the testing repo.
Code:
# pveversion -v
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.5 (running version: 9.0.5/9c5600b249dbfd2f)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
ceph: 19.2.3-pve1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
dnsmasq: 2.91-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
intel-microcode: 3.20250512.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.10
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.14-1
proxmox-backup-file-restore: 4.0.14-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.5
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.18
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1
 
The other 4 nodes use the enterprise repo:
Code:
# pveversion -v
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.5 (running version: 9.0.5/9c5600b249dbfd2f)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
ceph: 19.2.3-pve1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
dnsmasq: 2.91-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
intel-microcode: 3.20250512.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.7
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.14-1
proxmox-backup-file-restore: 4.0.14-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.5
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
pve-zsync: 2.4.0
qemu-server: 9.0.16
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1
 
In the last 6 hours no new hangs have occurred [hangs = "blocked for more than 122 seconds"]. I call it a hang because when it occurs the keyboard hangs, certainly inside a KVM; I'm not sure if it also does at the PVE CLI.


There are 3 nodes and a few KVMs with hangs in dmesg.

In all cases the hang occurred at the time another node was rebooted.

All 3 PVE nodes have a note of OSD sockets closing from the rebooting system's IP [times differ by a few seconds]:
Code:
[Tue Aug 19 22:26:07 2025] [  +0.000287] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:07 2025] [  +0.000194] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:07 2025] [  +0.000160] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:17 2025] [ +10.036340] libceph: mon0 (1)10.11.12.11:6789 socket closed (con state OPEN)
[Tue Aug 19 22:26:22 2025] [  +5.836551] libceph: osd37 (1)10.11.12.2:6822 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:22 2025] [  +0.000013] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:22 2025] [  +0.000004] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:22 2025] [  +0.000021] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] [ +15.873474] libceph: osd37 (1)10.11.12.2:6822 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] [  +0.000000] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] [  +0.000031] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] [  +0.000249] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:43 2025] [  +4.609017] INFO: task kworker/u193:1:314 blocked for more than 122 seconds.
[Tue Aug 19 22:26:43 2025] [  +0.000232]       Tainted: P           O       6.14.8-2-pve #1

Code:
[Tue Aug 19 22:23:59 2025] [  +3.263920] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:05 2025] [Aug19 22:24] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:11 2025] [  +5.944333] libceph: mon3 (1)10.11.12.5:6789 socket closed (con state OPEN)
[Tue Aug 19 22:24:41 2025] [ +29.511345] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:44 2025] [  +3.073078] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:47 2025] [  +3.199110] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:50 2025] [  +3.264012] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:53 2025] [  +3.072032] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:24:59 2025] [  +6.271794] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:09 2025] [Aug19 22:25] libceph: mon2 (1)10.11.12.4:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:13 2025] [  +4.032227] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:16 2025] [  +3.264836] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:20 2025] [  +3.071127] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:23 2025] [  +3.072208] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:32 2025] [  +9.472083] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state V1_BANNER)
[Tue Aug 19 22:25:36 2025] [  +3.583522] INFO: task jbd2/rbd0-8:195438 blocked for more than 122 seconds.
[Tue Aug 19 22:25:36 2025] [  +0.000006]       Tainted: P           O       6.14.8-2-pve #1
Code:
[Tue Aug 19 22:26:38 2025] [  +1.023783] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:38 2025] [  +0.000024] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:44 2025] [  +6.143954] libceph: osd37 (1)10.11.12.2:6822 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:44 2025] [  +0.000121] libceph: osd3 (1)10.11.12.2:6833 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:47 2025] [  +2.863998] libceph: mon1 (1)10.11.12.2:6789 socket closed (con state OPEN)
[Tue Aug 19 22:26:47 2025] [  +0.208100] libceph: osd36 (1)10.11.12.2:6849 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:52 2025] [  +5.119981] libceph: osd26 (1)10.11.12.2:6841 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:52 2025] [  +0.512258] libceph: osd16 (1)10.11.12.2:6817 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:52 2025] [  +0.000000] libceph: osd19 (1)10.11.12.2:6809 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:53 2025] [  +0.511595] libceph: osd8 (1)10.11.12.2:6857 socket closed (con state V1_BANNER)
[Tue Aug 19 22:26:53 2025] [  +0.511876] INFO: task kworker/u322:9:1673 blocked for more than 122 seconds.
[Tue Aug 19 22:26:53 2025] [  +0.000014]       Tainted: P           O       6.14.8-2-pve #1
 
So with no node restarts there have been no hangs on PVE nodes or KVMs.
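For future planned node reboots we may also try the usual Ceph maintenance flags before shutting a node down, so the remaining cluster does not start reacting to its OSDs going away (a standard-practice sketch, not verified to prevent this particular hang):
Code:
# before rebooting a node
ceph osd set noout
ceph osd set norebalance
# ... reboot the node and wait for its OSDs to rejoin ...
ceph osd unset norebalance
ceph osd unset noout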

I have had these settings in sysctl.d since 2019, per
https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#Sample-sysctlconf
Could these be causing an issue?

Code:
fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000
 
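To help rule the sysctl.d file in or out, it may be worth dumping what is actually in effect for those keys on each node (read-only check; the key list below is just a subset of the file above):
Code:
# print the live value of a few of the tuned keys for comparison across nodes
for k in fs.file-max net.ipv4.tcp_fin_timeout net.core.rmem_max net.core.wmem_max \
         net.core.somaxconn net.core.netdev_max_backlog net.ipv4.tcp_max_tw_buckets; do
    printf '%-32s %s\n' "$k" "$(sysctl -n "$k")"
done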
Looks to me like ext4 was having trouble writing to something. Have you done a SMART long test lately?
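Something like this, assuming the backup NVMe shows up as /dev/nvme0 (the device name is just a placeholder, and the drive needs to support NVMe self-tests):
Code:
# start a long (extended) self-test on the suspect drive
smartctl -t long /dev/nvme0
# later, check the self-test log and overall health
smartctl -a /dev/nvme0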

I also noticed the ext4 part. However, the issue starts 100% of the time when a node is in the process of shutting down for a reboot: shortly after the point at which it turns off its OSDs, the hang starts on some or all of the remaining nodes.
I did not think that ext4 had anything to do with Ceph, but there does seem to be some relationship based on the dmesg output.

We do have a large ext4-formatted NVMe for backups, for use in case Ceph is not available, but I do not think it is causing the hang issue. All systems have been rebooted a couple of times as part of trying to solve this, and the normal mount process would complain if there were an fsck issue.
I will do a long SMART test just in case.
Thank you for the reply!
 