Nested PVE (on PVE host): kernel panic "Host injected async #PF in kernel mode"

That’s your guest’s guest. Does your primary guest also have ballooning disabled?
That’s the nested PVE config, the L1 guest.
The L2 Ubuntu guest VM has ballooning enabled, yes.

My bad, I forgot to check whether it's already available in the pve-no-subscription repository.
Don't worry, not a big deal

Thanks @Neobin !!


Anyway, the change in the kernel that causes the crash was added in kernel 5.8 - see GitHub (for easier readability in the browser) and the Linux kernel mailing list. You'd thus need kernel 5.7 or older, and those are not available on PVE 7 either. Looking at the Proxmox VE Roadmap, you'd need to go back to the even older Proxmox VE 6.4 (download here) to see whether it works. The thing is, async page faults in kernel mode are disallowed for a good reason, so you'll have to see whether it actually works better. Last but not least, I just want to mention that the suggestion of trying out PVE 6.4 is for testing purposes only, since that version reached its end of life in September 2022.
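As a quick sanity check, you can verify which of your guests run an affected kernel. This is just an illustrative sketch; the 5.8 cutoff comes from the commit linked above:

```shell
# Run inside each guest: warn if the kernel is >= 5.8, i.e. post the change
# that panics on an async #PF injected in kernel mode.
kver=$(uname -r | cut -d. -f1,2)   # e.g. "6.14" from "6.14.8-2-pve"
major=${kver%%.*}
minor=${kver##*.}
if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 8 ]; }; then
    echo "kernel $kver: async #PF in kernel mode triggers a panic"
else
    echo "kernel $kver: predates the 5.8 change"
fi
```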

Also, would it be possible to temporarily disable swap on the host to check whether it improves the situation? Again, this is not a general recommendation, since using swap has its benefits, but at least this would confirm our current assumptions about your issue.
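For the swap test, something along these lines (a diagnostic sketch, run as root on the host; the `swapoff` itself is commented out, since you need enough free RAM to hold everything currently swapped):

```shell
# Check whether swap is in use and, only for testing, disable it temporarily.
swapcount=$(swapon --show --noheadings 2>/dev/null | wc -l)
if [ "$swapcount" -gt 0 ]; then
    echo "swap is active ($swapcount device(s)/file(s)):"
    swapon --show
    # swapoff -a   # disable all swap for the test; needs enough free RAM
else
    echo "no active swap"
fi
# To restore afterwards: swapon -a
```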
YOU FOUND IT!!!

Yeah, probably all my other VMs have kernels older than 5.8.
I will check later.

But yeah, it makes total sense to restrict the async #PF mechanism to user space, where it is actually useful.
I don't see it being as useful in kernel space.

The thing is, when async #PF was disabled in kernel space, were any changes made to QEMU/KVM so that it knows which faults should be handled with the async #PF approach and which with the halt approach?

Because the commit you mentioned implies that every distro running kernel 5.8+ onward is susceptible to the same problem as PVE 8, and that this problem has little to do with nested virtualization; it's all about memory management between host and guest.

And yes, I can totally try with PVE 6 to confirm it really isn't affected, but that's not really a solution to this problem. Neither would be disabling swap on the host, since swap is very useful: not all guest memory regions must be kept active and in RAM all the time...


I will also try running an Ubuntu VM on the host with the latest stable kernel to see whether it exhibits the same behavior as PVE.


So in which layer should the problem be tackled? QEMU, the Linux kernel on the host, or the Linux kernel in the guest?
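For what it's worth, one knob already exists at the guest-kernel layer: the kernel documents a `no-kvmapf` boot parameter that disables paravirtualized async page fault handling in the guest entirely. An untested sketch of enabling it in the affected (L1) guest, as a workaround rather than a fix:

```shell
# Untested sketch: opt the L1 guest kernel out of KVM async page faults.
# In /etc/default/grub inside the affected guest, append the parameter:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet no-kvmapf"
# then regenerate the grub config and reboot:
#   update-grub && reboot
```

Whether that merely masks the underlying host/guest memory-management issue is exactly the open question here.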


Again, thank you all so much for your time, I really appreciate it!
 
For the record: on a Debian 13 host (kernel 6.12.38) I ran a KVM/QEMU (10.0.2) VM with PVE 9:

Code:
qemu-system-x86_64 \
  -machine type=pc,accel=kvm -cpu host -smp 4 -m 8192 \
  -drive file=vm1.qcow2,format=qcow2,if=virtio \
  -k fr \
  -netdev user,id=net0,hostfwd=tcp:127.0.0.1:8066-:8006,hostfwd=tcp:127.0.0.1:8022-:22 \
  -device virtio-net-pci,netdev=net0,addr=0x08 \
  -serial stdio -vga none -display none \
  -cdrom pve9.iso

Then I ran a cloud-init Debian 13 nested VM inside the PVE, did some things inside it, and after a while I got the "injected async #PF" panic on the PVE 9 console.

The Debian host has swap and is doing lots of other things.

@l.leahu-vladucu any idea other than turning off swap (which I'd rather not do on this particular host)?
 
Code:
[32029.460800] Kernel panic - not syncing: Host injected async #PF in kernel mode
[32029.472728] CPU: 2 UID: 0 PID: 136167 Comm: pvestatd Tainted: P           O       6.14.8-2-pve #1
[32029.476915] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[32029.481049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[32029.484024] Call Trace:
[32029.486574]  <TASK>
[32029.490233]  dump_stack_lvl+0x5f/0x90
[32029.490983]  dump_stack+0x10/0x18
[32029.493376]  panic+0x12b/0x2fa
[32029.493874]  ? early_xen_iret_patch+0xc/0xc
[32029.494539]  __kvm_handle_async_pf+0xc3/0xe0
[32029.500711]  exc_page_fault+0xb8/0x1e0
[32029.501528]  asm_exc_page_fault+0x27/0x30
[32029.510582] RIP: 0010:__put_user_4+0xd/0x20
[32029.511285] Code: 66 89 01 31 c9 0f 01 ca e9 90 a0 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 01 cb <89> 01 31 c9 0f 01 ca c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90
[32029.519047] RSP: 0018:ffffbf488aaabf00 EFLAGS: 00050206
[32029.522482] RAX: 00000000000213e7 RBX: 0000000000000000 RCX: 0000760692482e50
[32029.525880] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[32029.531179] RBP: ffffbf488aaabf10 R08: 0000000000000000 R09: 0000000000000000
[32029.535160] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[32029.537327] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[32029.538862]  ? schedule_tail+0x42/0x70
[32029.541446]  ret_from_fork+0x1c/0x70
[32029.543239]  ret_from_fork_asm+0x1a/0x30
[32029.544870] RIP: 0033:0x76069259d202
[32029.547517] Code: Unable to access opcode bytes at 0x76069259d1d8.
[32029.552400] RSP: 002b:00007ffe1c00b500 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[32029.559229] RAX: 0000000000000000 RBX: 00007ffe1c00b500 RCX: 000076069259d202
[32029.566832] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[32029.573968] RBP: 00006415fb85a450 R08: 0000000000000000 R09: 0000000000000000
[32029.580870] R10: 0000760692482e50 R11: 0000000000000246 R12: 00006415f63f22e8
[32029.586908] R13: 0000000000000002 R14: 0000000000000000 R15: 00006415d44df5b0
[32029.593439]  </TASK>
[32029.596630] Kernel Offset: 0x23800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[32029.602331] ---[ end Kernel panic - not syncing: Host injected async #PF in kernel mode ]---
 
I'm seeing the same problem running the PVE 9 beta in a nested VM on PVE 8.2.4.
I'm running a Debian 13 VM in the nested PVE 9, and after a while I get this crash:

Code:
pve-yvr login: [ 2089.981351] INFO: task cfs_loop:1122 blocked for more than 122 seconds.
[ 2089.981899]       Tainted: P           O       6.14.8-2-pve #1
[ 2089.982220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2089.982654] task:cfs_loop        state:D stack:0     pid:1122  tgid:1120  ppid:1      task_flags:0x400040 flags:0x00000002
[ 2089.983257] Call Trace:
[ 2089.983526]  <TASK>
[ 2089.983673]  __schedule+0x466/0x13f0
[ 2089.984215]  schedule+0x29/0x130
[ 2089.984427]  kvm_async_pf_task_wait_schedule+0x186/0x1c0
[ 2089.984876]  __kvm_handle_async_pf+0x5c/0xe0
[ 2089.985126]  exc_page_fault+0xb8/0x1e0
[ 2089.985359]  asm_exc_page_fault+0x27/0x30
[ 2089.985736] RIP: 0033:0x7f87864b8474
[ 2089.985960] RSP: 002b:00007f8784fefcf0 EFLAGS: 00010246
[ 2089.986310] RAX: 0000000000000001 RBX: 000000000000025c RCX: 00005b2085773f68
[ 2089.986697] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005b208578a938
[ 2089.987061] RBP: 0000000000000001 R08: 0000000000000001 R09: 00005b2085773fa8
[ 2089.987435] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f878673a064
[ 2089.987820] R13: 00005b2085789f28 R14: 000000000000025e R15: 00007f878673a000
[ 2089.988286]  </TASK>
[ 2107.954191] Kernel panic - not syncing: Host injected async #PF in kernel mode
[ 2107.956364] CPU: 1 UID: 109 PID: 23170 Comm: saunafs-uraft-h Tainted: P           O       6.14.8-2-pve #1
[ 2107.958006] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[ 2107.958408] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
[ 2107.958893] Call Trace:
[ 2107.959136]  <TASK>
[ 2107.959355]  dump_stack_lvl+0x5f/0x90
[ 2107.959720]  dump_stack+0x10/0x18
[ 2107.959990]  panic+0x12b/0x2fa
[ 2107.960317]  ? early_xen_iret_patch+0xc/0xc
[ 2107.960617]  __kvm_handle_async_pf+0xc3/0xe0
[ 2107.960934]  exc_page_fault+0xb8/0x1e0
[ 2107.961231]  asm_exc_page_fault+0x27/0x30
[ 2107.961529] RIP: 0010:__put_user_4+0xd/0x20
[ 2107.961833] Code: 66 89 01 31 c9 0f 01 ca c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 01 cb <89> 01 31 c9 0f 01 ca c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90
[ 2107.962880] RSP: 0018:ffffc0e1a4e1ff00 EFLAGS: 00050202
[ 2107.963233] RAX: 0000000000005a82 RBX: 0000000000000000 RCX: 00007816a5721a10
[ 2107.963660] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2107.964100] RBP: ffffc0e1a4e1ff10 R08: 0000000000000000 R09: 0000000000000000
[ 2107.964530] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2107.964960] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2107.965390]  ? schedule_tail+0x42/0x70
[ 2107.965690]  ret_from_fork+0x1c/0x70
[ 2107.966066]  ret_from_fork_asm+0x1a/0x30
[ 2107.966366] RIP: 0033:0x7816a5801202
[ 2107.966651] Code: Unable to access opcode bytes at 0x7816a58011d8.
[ 2107.967041] RSP: 002b:00007ffd950a56a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 2107.967487] RAX: 0000000000000000 RBX: 00007ffd950a56a0 RCX: 00007816a5801202
[ 2107.967913] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 2107.968350] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[ 2107.968772] R10: 00007816a5721a10 R11: 0000000000000246 R12: 00007ffd950a5810
[ 2107.969194] R13: 0000000000000000 R14: 0000000000000000 R15: 00005f21d4d03c10
[ 2107.969611]  </TASK>
[ 2107.970131] Kernel Offset: 0x38800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2109.010662] Rebooting in 120 seconds..

in PVE 9. The host running PVE 8 is not affected, and it is running a bunch of other things, with swap enabled too.
 
Thanks for the reports, everyone. It would still be interesting to see whether temporarily disabling swap on the host improves the situation, if you would like to try. Again, this is not a general recommendation, since using swap has its benefits, but at least this would confirm our current assumptions about your issue.

I'm still in the process of investigating under which conditions this bug occurs, but I was not yet able to reproduce it myself. So if you have any information that might be useful, please let me know.