Kernel panic, machine stuck, task khugepaged:796 blocked for more than 120 seconds

May 19, 2021
Hello Everybody,

Not sure if this thread (https://forum.proxmox.com/threads/kernel-panic-since-update-to-proxmox-7-1.101164/#post-437435) is related, but since we updated to PVE 7.2 we have repeated crashes with the kernel message

Code:
INFO: task khugepaged:796 blocked for more than 120 seconds.

followed by multiple

Code:
INFO: task iou-wrk-9815:650094 blocked for more than 120 seconds.

leading to a stuck VM (PID 9815 in this case). We cannot run
Bash:
ps ax
anymore; it gets stuck just before displaying PID 9815.

Even after killing this process hard with kill -9, the behaviour of ps stays the same, and a kworker/36:3+events process keeps running with high load.

Ultimately we have to hard-reset (!!) the machine, since a shutdown never manages to kill the blocked process.
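
A note on why ps wedges there (a reading consistent with the traces below, not a verified root cause): ps reads /proc/<pid>/cmdline, which has to take the task's mmap_lock, so it blocks on a task that is stuck holding or waiting on that lock, while /proc/<pid>/status and /proc/<pid>/stat do not need it. A minimal probe sketch, using PID 9815 from this incident:

Bash:
# these do not take mmap_lock and should still return
grep -E '^(Name|State)' /proc/9815/status
# this is what ps reads per process; guard it with a timeout
timeout 2 cat /proc/9815/cmdline || echo "cmdline read timed out"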

The kernel version is 5.15.35-1-pve, the PVE version is 7.2-3. The system is a pure KVM compute machine served by Ceph from separate Ceph nodes, but the stuck VM (the same one as last time) has its storage on a local NVMe RAID (via a Dell PERC H755 NVMe controller), if that matters.
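
One experiment worth noting (hedged, not a confirmed fix): khugepaged is the kernel's THP collapse daemon, so you can check and, for testing, restrict transparent hugepages on the host. The sysfs path and values are standard; whether this avoids the hang here is untested:

Bash:
# show the current mode, e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/enabled
# restrict THP for testing (not persistent across reboots)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled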
 
Bash:
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7543 32-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         2800.000
CPU max MHz:                     3737.8899
CPU min MHz:                     1500.0000
BogoMIPS:                        5589.66
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-15,64-79
NUMA node1 CPU(s):               16-31,80-95
NUMA node2 CPU(s):               32-47,96-111
NUMA node3 CPU(s):               48-63,112-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
                                 aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ib
                                 s skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed
                                  adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_sca
                                 le vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
 
Code:
[1286374.147087] INFO: task khugepaged:796 blocked for more than 120 seconds.
[1286374.154005]       Tainted: P           O      5.15.35-1-pve #1
[1286374.160027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1286374.168042] task:khugepaged      state:D stack:    0 pid:  796 ppid:     2 flags:0x00004000
[1286374.168047] Call Trace:
[1286374.168049]  <TASK>
[1286374.168052]  __schedule+0x33d/0x1750
[1286374.168056]  ? enqueue_entity+0x17d/0x750
[1286374.168060]  ? page_counter_try_charge+0x34/0xb0
[1286374.168063]  schedule+0x4e/0xb0
[1286374.168064]  rwsem_down_write_slowpath+0x217/0x4d0
[1286374.168067]  down_write+0x43/0x50
[1286374.168069]  collapse_huge_page+0x1d8/0x11a0
[1286374.168070]  ? flush_work+0x49/0x90
[1286374.168072]  ? worker_detach_from_pool+0xc0/0xc0
[1286374.168074]  ? __lru_add_drain_all+0x1dc/0x390
[1286374.168077]  khugepaged+0xfab/0x1580
[1286374.168078]  ? wait_woken+0x70/0x70
[1286374.168080]  ? collapse_pte_mapped_thp+0x450/0x450
[1286374.168081]  kthread+0x12a/0x150
[1286374.168083]  ? set_kthread_struct+0x50/0x50
[1286374.168084]  ret_from_fork+0x22/0x30
[1286374.168088]  </TASK>
[1286374.168393] INFO: task iou-wrk-9815:650094 blocked for more than 120 seconds.
[1286374.175727]       Tainted: P           O      5.15.35-1-pve #1
[1286374.182064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1286374.190097] task:iou-wrk-9815    state:D stack:    0 pid:650094 ppid:     1 flags:0x00004000
[1286374.190101] Call Trace:
[1286374.190103]  <TASK>
[1286374.190104]  __schedule+0x33d/0x1750
[1286374.190106]  schedule+0x4e/0xb0
[1286374.190108]  rwsem_down_read_slowpath+0x318/0x370
[1286374.190109]  ? __cond_resched+0x1a/0x50
[1286374.190111]  down_read+0x43/0x90
[1286374.190113]  mmap_read_lock+0x15/0x40
[1286374.190115]  do_exit+0x1f3/0xa20
[1286374.190118]  io_wqe_worker+0x2b8/0x360
[1286374.190120]  ? finish_task_switch.isra.0+0xa6/0x2a0
[1286374.190123]  ? io_worker_handle_work+0x5f0/0x5f0
[1286374.190125]  ? io_worker_handle_work+0x5f0/0x5f0
[1286374.190127]  ret_from_fork+0x22/0x30
[1286374.190129] RIP: 0033:0x0
[1286374.190132] RSP: 002b:0000000000000000 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[1286374.190134] RAX: 0000000000000000 RBX: 00007fbf23efe860 RCX: 00007fc7a18e39b9
[1286374.190135] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000015
[1286374.190137] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
[1286374.190138] R10: 0000000000000000 R11: 0000000000000216 R12: 000055aef31caa68
[1286374.190138] R13: 000055aef31cab20 R14: 000055aef31caa60 R15: 0000000000000001
[1286374.190139]  </TASK>
[1286374.190141] INFO: task iou-wrk-9815:651458 blocked for more than 120 seconds.
[1286374.197464]       Tainted: P           O      5.15.35-1-pve #1
[1286374.203486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1286374.211530] task:iou-wrk-9815    state:D stack:    0 pid:651458 ppid:     1 flags:0x00004000
[1286374.211534] Call Trace:
[1286374.211536]  <TASK>
[1286374.211537]  __schedule+0x33d/0x1750
[1286374.211539]  schedule+0x4e/0xb0
[1286374.211541]  rwsem_down_read_slowpath+0x318/0x370
[1286374.211543]  ? __cond_resched+0x1a/0x50
[1286374.211544]  down_read+0x43/0x90
[1286374.211546]  mmap_read_lock+0x15/0x40
[1286374.211548]  do_exit+0x1f3/0xa20
[1286374.211550]  io_wqe_worker+0x2b8/0x360
[1286374.211552]  ? finish_task_switch.isra.0+0xa6/0x2a0
[1286374.211553]  ? io_worker_handle_work+0x5f0/0x5f0
[1286374.211555]  ? io_worker_handle_work+0x5f0/0x5f0
[1286374.211557]  ret_from_fork+0x22/0x30
[1286374.211559] RIP: 0033:0x0
[1286374.211560] RSP: 002b:0000000000000000 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[1286374.211562] RAX: 0000000000000000 RBX: 000055aef2f59a00 RCX: 00007fc7a18de4f6
[1286374.211564] RDX: 00007fff5b8be810 RSI: 000000000000009e RDI: 000055aef46fec00
[1286374.211565] RBP: 00007fff5b8be87c R08: 0000000000000008 R09: 0000000000000000
[1286374.211566] R10: 0000000000000000 R11: 0000000000000293 R12: 00007fff5b8be810
[1286374.211567] R13: 000055aef2f59a00 R14: 00007fff5b8be880 R15: 0000000000000000
[1286374.211569]  </TASK>
 
This also happens to me with a Debian VM with 64 GB RAM and a 512 GB disk on ZFS storage. It doesn't happen on another Debian VM with 32 GB RAM and a 256 GB disk.

The host is an AMD EPYC 7443P 24-core processor with 512 GB RAM and 8x Intel SSD DC S4610 drives plus an Intel DC P4800X log device, set up as ZFS RAID 10.

I've tried setting the following in the VM:
Code:
vm.swappiness=10
vm.dirty_ratio=3
vm.dirty_background_ratio=1

And on the host:
Code:
options zfs zfs_arc_max=34359738368
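
For reference, a minimal sketch of applying these at runtime; the sysctl keys are standard, and zfs_arc_max is the standard OpenZFS module parameter (34359738368 bytes = 32 GiB):

Bash:
# inside the VM
sysctl -w vm.swappiness=10 vm.dirty_ratio=3 vm.dirty_background_ratio=1
# on the host: the ARC cap is also writable at runtime (bytes, 32 GiB here)
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max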

But it still happens every few days, usually during backup of the VM.

The host only has 4 VMs and 4 LXC containers, nothing crazy that would overload anything. It is running the latest proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve).

 
Unfortunately, we are back in the same situation: the BIOS update did not solve the problem. We can, however, narrow it down: the blocking VM has its data on the local RAID controller (PERC H755, 4 x NVMe RAID 10), while other VMs (running purely on Ceph storage) are not affected.
 
PS: some more findings.

Code:
ls -la /etc/pve/qemu-server/
gives you an idea which VMs are still running on this host.
Code:
ls -la /var/run/qemu-server/
still gives you access to the VNC, serial and QEMU sockets (also useful for debugging).

Code:
# still working
qm migrate <id> <target server>
qm terminal <id>
qm shutdown <id>
ps ax blocks just before the stale VM, but you can still loop over procfs to get process info, e.g.
Code:
for i in /proc/[0-9]*; do echo "${i##*/}: $(grep '^Name:' "$i/status")"; done
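
Building on that loop, a rough sketch (run as root, since /proc/<pid>/stack is root-only) that flags tasks in uninterruptible sleep (state D) and dumps their kernel stacks:

Bash:
for i in /proc/[0-9]*; do
  # second field of the "State:" line, e.g. "D" for uninterruptible sleep
  state=$(awk '/^State:/ {print $2}' "$i/status" 2>/dev/null)
  if [ "$state" = "D" ]; then
    echo "=== ${i##*/} ($(awk '/^Name:/ {print $2}' "$i/status"))"
    cat "$i/stack" 2>/dev/null
  fi
done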
 
Was this ever solved and how?

We have been observing this since a recent upgrade to PVE 8 and Ceph Quincy.

All VM disks are on Ceph.



Code:
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 32
cpu: x86-64-v2-AES
memory: 65536
name: prod-lws141-dbcl41
net0: virtio=46:8D:E1:6D:4F:D6,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: ceph:vm-4141-disk-0,discard=on,iothread=1,size=32G,ssd=1
scsi1: ceph:vm-4141-disk-1,discard=on,iothread=1,size=100G,ssd=1
scsi2: ceph:vm-4141-disk-2,discard=on,iothread=1,size=40G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f50aafdb-6fa0-4a38-8fef-0e7d57c1d8a7
sockets: 1
vmgenid: b05f52c0-4203-43c7-a4be-f841fe7fdae5
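
Worth noting: the iou-wrk-* tasks in these traces are io_uring workers, and aio=io_uring is the default for such disks on recent PVE. A hedged experiment (not a confirmed fix) is to switch a disk to the thread-pool backend, e.g. for scsi0 of this VM (ID 4141, per the disk names):

Bash:
# re-set the disk with aio=threads instead of the io_uring default
qm set 4141 --scsi0 ceph:vm-4141-disk-0,discard=on,iothread=1,size=32G,ssd=1,aio=threads
# a full stop/start of the VM (not just a reboot from inside the guest)
# is needed for the new aio setting to take effect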
 
We had the same problem with a Debian VM two days ago. The time of the crash was determined from log entries and falls within quiet hours (no CPU / no disk load). The VM storage is on an NVMe ZFS mirror. All significant processes were blocked (webserver, database, ...).

Edit: PVE 8, PBS 3. The VM was blocked in such a way that neither console login nor SSH was possible.
 
PVE 8 + Ceph Reef + PBS 3
I've had a similar issue recently.
The issue is triggered whenever PBS is used for backup operations.
An NFS service is attached to the VM in the backup job.
The VM is CentOS Linux release 7.8.2003.
 
