Nested PVE (on PVE host): Kernel panic "Host injected async #PF in kernel mode"

That's your guest's guest. Does your primary guest also have ballooning disabled?
That's the nested PVE config, the L1 guest.
The L2 Ubuntu guest VM has ballooning enabled, yes.

My bad, I forgot to check whether it's already available in the pve-no-subscription repository.
Don't worry, not a big deal

Thanks @Neobin !!


Anyway, the change in the kernel that causes the crash was added in kernel 5.8 - see GitHub (for easier readability in the browser) and the Linux kernel mailing list. You'll thus need kernel 5.7 or older, and these are not available on PVE 7 either. Looking at the Proxmox VE Roadmap, you'll need to use the even older Proxmox VE 6.4 (download here) to see whether it works. The thing is, async page faults in kernel mode are not allowed for a good reason, so you'll have to see whether it actually works better. Last but not least, I just want to mention that the suggestion of trying out PVE 6.4 is for testing purposes only, since that version reached its end of life in September 2022.

Also, would it be possible to temporarily disable swap on the host to check whether it improves the situation? Again, this is not a general recommendation, since using swap has its benefits, but at least this would confirm our current assumptions about your issue.
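For completeness, a minimal sketch of what that test would look like, using the standard swapoff/swapon tools (nothing Proxmox-specific; swap comes back with swapon -a or a reboot):

Code:
# list the currently active swap devices/files
swapon --show
# temporarily disable all swap (re-enabled on reboot or with swapon -a)
swapoff -a
# ... try to reproduce the crash ...
# re-enable all swap entries from /etc/fstab
swapon -a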
YOU FOUND IT!!!

Yeah, probably all my other VMs have older kernels than 5.8.
I will check later.
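(Checking is quick, in each guest:)

Code:
uname -r   # anything reporting 5.8 or newer includes the kernel change mentioned above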

But yeah, it makes total sense to restrict the async #PF mechanism to user space, where it is actually useful.
I don't see it being as useful in kernel space.

The thing is, when async #PF was disabled in kernel space, were any changes made to QEMU/KVM so it knows which faults should be handled with the async #PF approach and which ones with the halt approach?

Because the commit you mentioned means that every distro using kernel 5.8 onward will be susceptible to the same problem as PVE 8, and that this problem has little to do with nested virtualization; it's all about memory management between host and guest.

And yes, I can certainly try PVE 6 to confirm that it really isn't affected, but that is not a real solution to this problem; neither would be disabling swap on the host, since swap is very useful: not all guest memory regions need to be kept active and in RAM all the time...


I will also try running an Ubuntu VM on the host with the latest stable kernel and see if it exhibits the same behavior as PVE.


So in which layer should the problem be tackled? QEMU, the Linux kernel on the host, or the Linux kernel in the guest?


Again, thank you all so much for your time, I really appreciate it!
 
For the record, on a Debian 13 host (kernel 6.12.38) I ran a KVM guest (QEMU 10.0.2) with PVE 9:

Code:
qemu-system-x86_64 \
  -machine type=pc,accel=kvm \
  -cpu host -smp 4 -m 8192 \
  -drive file=vm1.qcow2,format=qcow2,if=virtio \
  -k fr \
  -netdev user,id=net0,hostfwd=tcp:127.0.0.1:8066-:8006,hostfwd=tcp:127.0.0.1:8022-:22 \
  -device virtio-net-pci,netdev=net0,addr=0x08 \
  -serial stdio -vga none -display none \
  -cdrom pve9.iso

Then I ran a cloud-init Debian 13 nested VM inside the PVE, did some things inside it, and after a while I got the "injected async #PF" panic on the PVE 9 console.

The Debian host has swap enabled and is doing lots of other things.

@l.leahu-vladucu any idea other than turning off swap (which I'd rather not do on this particular host)?
 
Code:
[32029.460800] Kernel panic - not syncing: Host injected async #PF in kernel mode
[32029.472728] CPU: 2 UID: 0 PID: 136167 Comm: pvestatd Tainted: P           O       6.14.8-2-pve #1
[32029.476915] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[32029.481049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[32029.484024] Call Trace:
[32029.486574]  <TASK>
[32029.490233]  dump_stack_lvl+0x5f/0x90
[32029.490983]  dump_stack+0x10/0x18
[32029.493376]  panic+0x12b/0x2fa
[32029.493874]  ? early_xen_iret_patch+0xc/0xc
[32029.494539]  __kvm_handle_async_pf+0xc3/0xe0
[32029.500711]  exc_page_fault+0xb8/0x1e0
[32029.501528]  asm_exc_page_fault+0x27/0x30
[32029.510582] RIP: 0010:__put_user_4+0xd/0x20
[32029.511285] Code: 66 89 01 31 c9 0f 01 ca e9 90 a0 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 01 cb <89> 01 31 c9 0f 01 ca c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90
[32029.519047] RSP: 0018:ffffbf488aaabf00 EFLAGS: 00050206
[32029.522482] RAX: 00000000000213e7 RBX: 0000000000000000 RCX: 0000760692482e50
[32029.525880] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[32029.531179] RBP: ffffbf488aaabf10 R08: 0000000000000000 R09: 0000000000000000
[32029.535160] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[32029.537327] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[32029.538862]  ? schedule_tail+0x42/0x70
[32029.541446]  ret_from_fork+0x1c/0x70
[32029.543239]  ret_from_fork_asm+0x1a/0x30
[32029.544870] RIP: 0033:0x76069259d202
[32029.547517] Code: Unable to access opcode bytes at 0x76069259d1d8.
[32029.552400] RSP: 002b:00007ffe1c00b500 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[32029.559229] RAX: 0000000000000000 RBX: 00007ffe1c00b500 RCX: 000076069259d202
[32029.566832] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[32029.573968] RBP: 00006415fb85a450 R08: 0000000000000000 R09: 0000000000000000
[32029.580870] R10: 0000760692482e50 R11: 0000000000000246 R12: 00006415f63f22e8
[32029.586908] R13: 0000000000000002 R14: 0000000000000000 R15: 00006415d44df5b0
[32029.593439]  </TASK>
[32029.596630] Kernel Offset: 0x23800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[32029.602331] ---[ end Kernel panic - not syncing: Host injected async #PF in kernel mode ]---
 
I'm seeing the same problem running the PVE 9 beta in a nested VM on PVE 8.2.4.
I'm running a Debian 13 VM in the nested PVE 9, and after a while, I get this crash:

Code:
pve-yvr login: [ 2089.981351] INFO: task cfs_loop:1122 blocked for more than 122 seconds.
[ 2089.981899]       Tainted: P           O       6.14.8-2-pve #1
[ 2089.982220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2089.982654] task:cfs_loop        state:D stack:0     pid:1122  tgid:1120  ppid:1      task_flags:0x400040 flags:0x00000002
[ 2089.983257] Call Trace:
[ 2089.983526]  <TASK>
[ 2089.983673]  __schedule+0x466/0x13f0
[ 2089.984215]  schedule+0x29/0x130
[ 2089.984427]  kvm_async_pf_task_wait_schedule+0x186/0x1c0
[ 2089.984876]  __kvm_handle_async_pf+0x5c/0xe0
[ 2089.985126]  exc_page_fault+0xb8/0x1e0
[ 2089.985359]  asm_exc_page_fault+0x27/0x30
[ 2089.985736] RIP: 0033:0x7f87864b8474
[ 2089.985960] RSP: 002b:00007f8784fefcf0 EFLAGS: 00010246
[ 2089.986310] RAX: 0000000000000001 RBX: 000000000000025c RCX: 00005b2085773f68
[ 2089.986697] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005b208578a938
[ 2089.987061] RBP: 0000000000000001 R08: 0000000000000001 R09: 00005b2085773fa8
[ 2089.987435] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f878673a064
[ 2089.987820] R13: 00005b2085789f28 R14: 000000000000025e R15: 00007f878673a000
[ 2089.988286]  </TASK>
[ 2107.954191] Kernel panic - not syncing: Host injected async #PF in kernel mode
[ 2107.956364] CPU: 1 UID: 109 PID: 23170 Comm: saunafs-uraft-h Tainted: P           O       6.14.8-2-pve #1
[ 2107.958006] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[ 2107.958408] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
[ 2107.958893] Call Trace:
[ 2107.959136]  <TASK>
[ 2107.959355]  dump_stack_lvl+0x5f/0x90
[ 2107.959720]  dump_stack+0x10/0x18
[ 2107.959990]  panic+0x12b/0x2fa
[ 2107.960317]  ? early_xen_iret_patch+0xc/0xc
[ 2107.960617]  __kvm_handle_async_pf+0xc3/0xe0
[ 2107.960934]  exc_page_fault+0xb8/0x1e0
[ 2107.961231]  asm_exc_page_fault+0x27/0x30
[ 2107.961529] RIP: 0010:__put_user_4+0xd/0x20
[ 2107.961833] Code: 66 89 01 31 c9 0f 01 ca c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 cb 48 c1 fb 3f 48 09 d9 0f 01 cb <89> 01 31 c9 0f 01 ca c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90
[ 2107.962880] RSP: 0018:ffffc0e1a4e1ff00 EFLAGS: 00050202
[ 2107.963233] RAX: 0000000000005a82 RBX: 0000000000000000 RCX: 00007816a5721a10
[ 2107.963660] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 2107.964100] RBP: ffffc0e1a4e1ff10 R08: 0000000000000000 R09: 0000000000000000
[ 2107.964530] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2107.964960] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2107.965390]  ? schedule_tail+0x42/0x70
[ 2107.965690]  ret_from_fork+0x1c/0x70
[ 2107.966066]  ret_from_fork_asm+0x1a/0x30
[ 2107.966366] RIP: 0033:0x7816a5801202
[ 2107.966651] Code: Unable to access opcode bytes at 0x7816a58011d8.
[ 2107.967041] RSP: 002b:00007ffd950a56a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 2107.967487] RAX: 0000000000000000 RBX: 00007ffd950a56a0 RCX: 00007816a5801202
[ 2107.967913] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 2107.968350] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[ 2107.968772] R10: 00007816a5721a10 R11: 0000000000000246 R12: 00007ffd950a5810
[ 2107.969194] R13: 0000000000000000 R14: 0000000000000000 R15: 00005f21d4d03c10
[ 2107.969611]  </TASK>
[ 2107.970131] Kernel Offset: 0x38800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2109.010662] Rebooting in 120 seconds..

in PVE 9. The host running PVE 8 is not affected, and it is running a bunch of other things, with swap enabled too.
 
Thanks for the reports, everyone. It would still be interesting to see whether temporarily disabling swap on the host improves the situation, if you would like to try. Again, this is not a general recommendation, since using swap has its benefits, but at least this would confirm our current assumptions about your issue.

I'm still in the process of investigating under which conditions this bug occurs, but I was not yet able to reproduce it myself. So if you have any information that might be useful, please let me know.
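If anyone wants to attach data, a snapshot of the host's memory and swap state around the time of a crash would already help (standard Linux tools, nothing Proxmox-specific):

Code:
free -h           # overall memory and swap usage
swapon --show     # which swap devices are active
vmstat 5 5        # the si/so columns show swap-in/swap-out activity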
 
disabling swap on the host improves the situation
I am having the same issue with the kernel panic. Additionally, the guest in the nested PVE also hangs randomly.
...
I disabled swap (i.e. swapoff -a) and restarted the guest with PVE inside. I then started the guest in the nested PVE and did some load tests, which previously triggered the kernel panic after a few minutes.
Well, after 30 minutes of testing: no kernel panic, no nested guest hangs.
The PVE kernel on the hardware host and in the guest is: Linux 6.14.11-4-pve
...
To check vice versa, I rebooted the hardware host (so swap was enabled again) and did the whole procedure again.
Hm... sadly (or luckily) the kernel panic in the nested PVE does not happen again :rolleyes:
But the nested guest hangs again.

@l.leahu-vladucu I hope this helps you to proceed.
 
I'm having the same issues here. One thing I did to be able to see some debug information on the guest PVE was to use `qm terminal <vmid>` to view the serial console of the guest PVE, and leave it running in a byobu/tmux session on the host PVE, so I could scroll back through the output (over SSH) once I saw the guest PVE reboot.

And I did notice the reboot always happens following a `hrtimer: interrupt took #####ns` message. Sometimes I see a kernel `Oops: general protection fault, probably for non-canonical address blablabla PREEMPT SMP NOPTI`, and sometimes just a kernel panic right after that message. Sometimes there's only the `BdsDxe: loading Boot0008 "proxmox" from HD(2,GPT,5B5D98BC-45F7-4251-8318-0A3EC1E71133,0x800,0x100000)/\EFI\proxmox\shimx64.efi` reboot message saying it's loading EFI, right after the `hrtimer` message.

To set up the serial terminal so you can use `qm terminal`, just add a serial port to the PVE VM in the host PVE (see the CLI sketch below these steps), and add:

Code:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX console=tty0 console=ttyS0,115200"
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"

to /etc/default/grub, and run `update-grub`.
On the next reboot, you will see GRUB showing up in `qm terminal <guest pve vm id>`, followed by the kernel output and the login prompt.
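For the host-side part (adding the serial port to the guest PVE VM), this can also be done from the CLI; a sketch, with 100 as a placeholder VM ID:

Code:
qm set 100 -serial0 socket   # add a socket-backed serial port to VM 100
qm terminal 100              # attach to it once the guest is up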

BTW, don't use the browser web UI xterm.js console... it resets the text buffer when the VM reboots, so it's completely useless for this. Log in to the host PVE via SSH, and run `qm terminal <guest pve vm id>` in a byobu/tmux session, so byobu will keep the text buffer for you and you can come back to it later over SSH again.
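For example (session name and VM ID are placeholders):

Code:
tmux new -s pve-console   # or: byobu
qm terminal 100
# detach with Ctrl+b d, re-attach later with: tmux attach -t pve-console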

I'm going to disable swap on my host PVE to see if that improves the problem for me, and I'll update this thread with the result ASAP.
Hope this helps you guys debug the reboots.

-H
 