Since upgrading to the new kernel 6.11.0-1-pve, I see hung-task reports in the dmesg output of my Proxmox host, along with hangs and dropped RDP sessions to my VMs.
For example, this one on the host:
Code:
[10201.836445] INFO: task kcompactd0:317 blocked for more than 122 seconds.
[10201.836455] Tainted: P O 6.11.0-1-pve #1
[10201.836457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10201.836458] task:kcompactd0 state:D stack:0 pid:317 tgid:317 ppid:2 flags:0x00004000
[10201.836464] Call Trace:
[10201.836468] <TASK>
[10201.836473] __schedule+0x400/0x15d0
[10201.836489] schedule+0x29/0x130
[10201.836492] io_schedule+0x4c/0x80
[10201.836495] folio_wait_bit_common+0x138/0x310
[10201.836504] ? __pfx_wake_page_function+0x10/0x10
[10201.836508] folio_wait_bit+0x18/0x30
[10201.836512] folio_wait_writeback+0x2b/0xa0
[10201.836515] nfs_wb_folio+0xa5/0x1f0 [nfs]
[10201.836579] nfs_release_folio+0x75/0x140 [nfs]
[10201.836606] filemap_release_folio+0x68/0xa0
[10201.836609] split_huge_page_to_list_to_order+0x1f1/0xea0
[10201.836615] migrate_pages_batch+0x580/0xce0
[10201.836619] ? __pfx_compaction_alloc+0x10/0x10
[10201.836624] ? __pfx_compaction_free+0x10/0x10
[10201.836627] ? __mod_memcg_lruvec_state+0x9f/0x190
[10201.836631] ? __pfx_compaction_free+0x10/0x10
[10201.836633] migrate_pages+0xabb/0xd50
[10201.836636] ? __pfx_compaction_free+0x10/0x10
[10201.836638] ? __pfx_compaction_alloc+0x10/0x10
[10201.836642] compact_zone+0xad3/0x1140
[10201.836646] compact_node+0xa4/0x120
[10201.836651] kcompactd+0x2cf/0x460
[10201.836654] ? __pfx_autoremove_wake_function+0x10/0x10
[10201.836658] ? __pfx_kcompactd+0x10/0x10
[10201.836661] kthread+0xe4/0x110
[10201.836664] ? __pfx_kthread+0x10/0x10
[10201.836666] ret_from_fork+0x47/0x70
[10201.836677] ? __pfx_kthread+0x10/0x10
[10201.836679] ret_from_fork_asm+0x1a/0x30
[10201.836683] </TASK>
[10201.836783] INFO: task task UPID:linus:96733 blocked for more than 122 seconds.
[10201.836785] Tainted: P O 6.11.0-1-pve #1
[10201.836786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10201.836787] task:task UPID:linus state:D stack:0 pid:96733 tgid:96733 ppid:2110 flags:0x00000002
[10201.836790] Call Trace:
[10201.836791] <TASK>
[10201.836792] __schedule+0x400/0x15d0
[10201.836797] schedule+0x29/0x130
[10201.836800] io_schedule+0x4c/0x80
[10201.836803] folio_wait_bit_common+0x138/0x310
[10201.836807] ? __pfx_wake_page_function+0x10/0x10
[10201.836809] __folio_lock+0x17/0x30
[10201.836813] writeback_iter+0x1ee/0x2d0
[10201.836815] ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[10201.836860] write_cache_pages+0x4c/0xb0
[10201.836863] nfs_writepages+0x17b/0x310 [nfs]
[10201.836890] ? crypto_shash_update+0x19/0x30
[10201.836895] ? ext4_inode_csum+0x1f8/0x270
[10201.836914] do_writepages+0x7e/0x270
[10201.836917] ? jbd2_journal_stop+0x155/0x2f0
[10201.836922] filemap_fdatawrite_wbc+0x75/0xb0
[10201.836924] __filemap_fdatawrite_range+0x6d/0xa0
[10201.836929] filemap_write_and_wait_range+0x59/0xc0
[10201.836932] nfs_wb_all+0x27/0x120 [nfs]
[10201.836960] nfs4_file_flush+0x7b/0xd0 [nfsv4]
[10201.837018] filp_flush+0x38/0x90
[10201.837021] __x64_sys_close+0x33/0x90
[10201.837023] x64_sys_call+0x1a84/0x24e0
[10201.837026] do_syscall_64+0x7e/0x170
[10201.837032] ? __f_unlock_pos+0x12/0x20
[10201.837035] ? ksys_write+0xd9/0x100
[10201.837039] ? syscall_exit_to_user_mode+0x4e/0x250
[10201.837043] ? do_syscall_64+0x8a/0x170
[10201.837046] ? syscall_exit_to_user_mode+0x4e/0x250
[10201.837049] ? do_syscall_64+0x8a/0x170
[10201.837051] ? ptep_set_access_flags+0x4a/0x70
[10201.837057] ? wp_page_reuse+0x97/0xc0
[10201.837059] ? do_wp_page+0x84b/0xb90
[10201.837062] ? __pte_offset_map+0x1c/0x1b0
[10201.837067] ? __handle_mm_fault+0xbdc/0x1120
[10201.837071] ? __count_memcg_events+0x7d/0x130
[10201.837074] ? count_memcg_events.constprop.0+0x2a/0x50
[10201.837077] ? handle_mm_fault+0xaf/0x2e0
[10201.837080] ? do_user_addr_fault+0x5ec/0x830
[10201.837083] ? irqentry_exit_to_user_mode+0x43/0x250
[10201.837086] ? irqentry_exit+0x43/0x50
[10201.837088] ? exc_page_fault+0x96/0x1e0
[10201.837105] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[10201.837109] RIP: 0033:0x71103703a8e0
[10201.837122] RSP: 002b:00007fff544adda8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
[10201.837124] RAX: ffffffffffffffda RBX: 000059c36bd312a0 RCX: 000071103703a8e0
[10201.837125] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000013
[10201.837127] RBP: 0000000000000013 R08: 0000000000000000 R09: 0000000000000000
[10201.837128] R10: 0000000000000000 R11: 0000000000000202 R12: 000059c37401c890
[10201.837129] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[10201.837131] </TASK>
[10201.837134] INFO: task zstd:98699 blocked for more than 122 seconds.
[...]
[12536.577551] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
This causes symptoms like the following inside the guests, e.g. in a Red Hat Enterprise Linux VM:
Code:
[root@rh92 ~]# dnf update
Updating Subscription Management repositories.
Message from syslogd@rh92 at Nov 5 09:51:51 ...
kernel:Uhhuh. NMI received for unknown reason 30 on CPU 0.
Message from syslogd@rh92 at Nov 5 09:51:51 ...
kernel:Do you have a strange power saving mode enabled?
Message from syslogd@rh92 at Nov 5 09:51:51 ...
kernel:Dazed and confused, but trying to continue
Message from syslogd@rh92 at Nov 5 10:01:48 ...
kernel:Uhhuh. NMI received for unknown reason 20 on CPU 1.
Message from syslogd@rh92 at Nov 5 10:01:48 ...
kernel:Do you have a strange power saving mode enabled?
Message from syslogd@rh92 at Nov 5 10:01:48 ...
kernel:Dazed and confused, but trying to continue
Message from syslogd@rh92 at Nov 5 10:02:09 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [in:imjournal:1051]
Message from syslogd@rh92 at Nov 5 10:02:29 ...
kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [khugepaged:51]
Message from syslogd@rh92 at Nov 5 10:03:49 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [khugepaged:51]
Message from syslogd@rh92 at Nov 5 10:04:14 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [khugepaged:51]
Message from syslogd@rh92 at Nov 5 10:04:42 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 48s! [khugepaged:51]
Message from syslogd@rh92 at Nov 5 10:06:07 ...
kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 28s! [khugepaged:51]
Last metadata expiration check: 0:45:38 ago on Tue 05 Nov 2024 09:28:25 CET.
Dependencies resolved.
==========================================================================================================================================================================
Package Architecture Version Repository Size
==========================================================================================================================================================================
Installing:
kernel x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 4.6 M
kernel-core x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 19 M
kernel-modules x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 38 M
kernel-modules-core x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 33 M
Upgrading:
bpftool x86_64 7.3.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 5.4 M
firefox x86_64 128.4.0-1.el9_4 rhel-9-for-x86_64-appstream-rpms 123 M
kernel-headers x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-appstream-rpms 6.3 M
kernel-tools x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 4.8 M
kernel-tools-libs x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 4.6 M
python3-perf x86_64 5.14.0-427.42.1.el9_4 rhel-9-for-x86_64-baseos-rpms 4.7 M
thunderbird x86_64 128.4.0-1.el9_4 rhel-9-for-x86_64-appstream-rpms 118 M
tzdata noarch 2024b-2.el9 rhel-9-for-x86_64-baseos-rpms 841 k
tzdata-java noarch 2024b-2.el9 rhel-9-for-x86_64-appstream-rpms 228 k
Removing:
kernel x86_64 5.14.0-427.35.1.el9_4 @rhel-9-for-x86_64-baseos-rpms 0
kernel-core x86_64 5.14.0-427.35.1.el9_4 @rhel-9-for-x86_64-baseos-rpms 64 M
kernel-modules x86_64 5.14.0-427.35.1.el9_4 @rhel-9-for-x86_64-baseos-rpms 33 M
kernel-modules-core x86_64 5.14.0-427.35.1.el9_4 @rhel-9-for-x86_64-baseos-rpms 27 M
Transaction Summary
==========================================================================================================================================================================
Install 4 Packages
Upgrade 9 Packages
Remove 4 Packages
Total download size: 362 M
Is this ok [y/N]:
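Since the suppression notice in the host log means later reports are hidden, it helps to summarize which tasks are hanging before filing a report. A minimal sketch, assuming the standard dmesg hung-task line format shown in the trace above:

```shell
#!/bin/sh
# List hung-task reports from the kernel log, counting occurrences per
# task name, to see whether NFS writeback is the common factor.
dmesg | grep -E 'INFO: task .* blocked for more than' \
      | sed -E 's/.*INFO: task (.*):[0-9]+ blocked.*/\1/' \
      | sort | uniq -c | sort -rn
```

Re-enabling reporting after suppression is possible with `sysctl -w kernel.hung_task_warnings=-1`.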
My host is an older server machine: two Ivy Bridge Xeon E5-2697 v2 CPUs (2 sockets, 24C/48T total) on an Asus Z9PE-D16/2L motherboard (Intel C-602A chipset), with the BIOS updated to the latest version available from Asus. All memory slots are populated, for 256 GB RAM in total.
When this happens, the workload is the 17 VMs listed below (16 running), and a Proxmox backup of this server's Linux VMs to NFS-mounted storage is in progress (hence the zstd process in the hung-task report above).
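Since both host traces stall in NFS writeback (nfs_wb_folio, nfs_writepages), the mount options of the backup storage are useful context. A quick way to capture them, assuming the nfs-common package is installed for nfsstat:

```shell
#!/bin/sh
# Record the NFS mounts and their negotiated options (vers, proto,
# rsize/wsize) -- useful context for an NFS writeback stall report.
grep ' nfs' /proc/mounts
nfsstat -m    # per-mount option summary, from nfs-common
```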
Code:
root@linus:~# qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
100 black running 8192 50.00 98672
101 w11 running 16384 300.00 8361
102 w10 running 16384 300.00 8021
103 srv19e running 16384 256.00 2917
104 srv22 running 16384 256.00 2866
105 ucs50 running 16384 80.00 103582
106 kali running 8192 50.00 116175
107 fed41-xfce running 16384 80.00 126410
108 emcc running 24576 80.00 158724
109 rh92 running 16384 64.00 173195
110 db1 running 24576 80.00 177487
111 db2 running 24576 80.00 194532
112 ora-appsrv running 24576 80.00 4859
113 ora-VM-tmpl stopped 24576 80.00 0
114 fed41-mate running 16384 80.00 4623
115 fed41-kde running 16384 80.00 4206
116 srv25 running 16384 256.00 3046
117 osus-tumble running 8192 64.00 3679
root@linus:~#
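Summing the MEM(MB) column for the running VMs gives roughly 280 GB configured against 256 GB of physical RAM, so memory is overcommitted and heavy kcompactd/khugepaged activity is to be expected. A quick sketch to compute it, assuming the qm list column layout above:

```shell
#!/bin/sh
# Sum the configured memory (MEM(MB), 4th column) of all running VMs
# from 'qm list' output and print it in GB.
qm list | awk '$3 == "running" { mb += $4 }
               END { printf "%.0f GB configured for running VMs\n", mb/1024 }'
```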