VM freezes irregularly

gyrex

New Member
Jul 19, 2022
26
0
1
I just experienced another kernel panic on the same VM, two hours after the previous one. I've included the log below, but it looks like a different panic.

I'm going to try changing the machine type from i440fx to q35. Out of interest, what machine types is everyone else using?
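
For anyone wanting to check which machine type a VM currently uses: Proxmox stores it as a `machine:` line in the VM's config, and as far as I know a VM with no such line runs on the default i440fx. Here is a minimal Python sketch, assuming the config text has already been read (for example from the output of `qm config <vmid>`). The function name `machine_type` and the sample values are made up for illustration.

```python
def machine_type(vm_config_text):
    """Return the value of the 'machine:' line in a Proxmox VM config.

    Assumption: a config without a 'machine:' line uses the i440fx default.
    """
    for line in vm_config_text.splitlines():
        if line.startswith("["):
            # Snapshot sections follow the current config in the file; stop there.
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "machine":
            return value.strip()
    return "i440fx (assumed default)"


# Hypothetical config snippets, not taken from a real VM.
print(machine_type("cores: 2\nmachine: q35\nmemory: 4096\n"))
print(machine_type("cores: 2\nmemory: 4096\n"))
```

The function returns the raw value rather than normalising it, since versioned names such as `pc-q35-...` can also appear there.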

Code:
[ 7720.804438] BUG: #DF stack guard page was hit at 00000000d9071369 (stack is 000000002e08a9df..0000000059db9875)
[ 7720.804460] stack guard page: 0000 [#1] SMP PTI
[ 7720.804464] CPU: 0 PID: 809 Comm: dockerd Not tainted 5.15.0-46-generic #49-Ubuntu
[ 7720.804473] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 7720.804475] RIP: 0010:error_entry+0xc/0x130
[ 7720.804498] Code: ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 fc 56 48 8b 74 24 08 48 89 7c 24 08 <52> 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 41 57 56
[ 7720.804500] RSP: 0000:fffffe0000009000 EFLAGS: 00010087
[ 7720.804503] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804504] RDX: 0000000000000000 RSI: ffffffff87000b48 RDI: fffffe0000009078
[ 7720.804505] RBP: fffffe0000009068 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804506] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009078
[ 7720.804507] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804508] FS:  00007fd11cff9640(0000) GS:ffff8988bbc00000(0000) knlGS:0000000000000000
[ 7720.804509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7720.804511] CR2: fffffe0000008ff8 CR3: 00000001292be000 CR4: 00000000000006f0
[ 7720.804514] Call Trace:
[ 7720.804519]  <#DF>
[ 7720.804534]  ? exc_page_fault+0x1c/0x170
[ 7720.804538]  asm_exc_page_fault+0x26/0x30
[ 7720.804541] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804543] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804545] RSP: 0000:fffffe0000009128 EFLAGS: 00010087
[ 7720.804546] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804547] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009158
[ 7720.804548] RBP: fffffe0000009148 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804548] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009158
[ 7720.804549] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804550]  ? native_iret+0x7/0x7
[ 7720.804562]  asm_exc_page_fault+0x26/0x30
[ 7720.804564] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804566] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804567] RSP: 0000:fffffe0000009208 EFLAGS: 00010087
[ 7720.804568] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804569] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009238
[ 7720.804570] RBP: fffffe0000009228 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804570] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009238
[ 7720.804571] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804592]  ? native_iret+0x7/0x7
[ 7720.804594]  asm_exc_page_fault+0x26/0x30
[ 7720.804597] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804598] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804608] RSP: 0000:fffffe00000092e8 EFLAGS: 00010087
[ 7720.804610] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804610] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009318
[ 7720.804611] RBP: fffffe0000009308 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804612] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009318
[ 7720.804612] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804614]  ? native_iret+0x7/0x7
[ 7720.804616]  asm_exc_page_fault+0x26/0x30
[ 7720.804618] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804620] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804629] RSP: 0000:fffffe00000093c8 EFLAGS: 00010087
[ 7720.804630] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804631] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000093f8
[ 7720.804632] RBP: fffffe00000093e8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804632] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000093f8
[ 7720.804633] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804634]  ? native_iret+0x7/0x7
[ 7720.804637]  asm_exc_page_fault+0x26/0x30
[ 7720.804639] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804640] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804641] RSP: 0000:fffffe00000094a8 EFLAGS: 00010087
[ 7720.804642] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804643] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000094d8
[ 7720.804643] RBP: fffffe00000094c8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804644] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000094d8
[ 7720.804645] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804645]  ? native_iret+0x7/0x7
[ 7720.804647]  asm_exc_page_fault+0x26/0x30
[ 7720.804649] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804650] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804651] RSP: 0000:fffffe0000009588 EFLAGS: 00010087
[ 7720.804652] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804653] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000095b8
[ 7720.804653] RBP: fffffe00000095a8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804654] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000095b8
[ 7720.804655] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804656]  ? native_iret+0x7/0x7
[ 7720.804657]  asm_exc_page_fault+0x26/0x30
[ 7720.804659] RIP: 0010:irqentry_enter+0xf/0x50
[ 7720.804661] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 f6 87 88 00 00 00 03 75 17 31 c0 <65> 48 8b 14 25 c0 fb 01 00 f6 42 2c 02 75 13 5d c3 cc cc cc cc e8
[ 7720.804661] RSP: 0000:fffffe0000009668 EFLAGS: 00010046
[ 7720.804662] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804663] RDX: 0000000000000000 RSI: ffffffff87000aea RDI: fffffe0000009698
[ 7720.804677] RBP: fffffe0000009668 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804678] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009698
[ 7720.804679] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804680]  ? native_iret+0x7/0x7
[ 7720.804681]  ? asm_exc_invalid_op+0xa/0x20
[ 7720.804684]  exc_invalid_op+0x25/0x70
[ 7720.804686]  asm_exc_invalid_op+0x1a/0x20
[ 7720.804688] RIP: 0010:asm_exc_invalid_op+0x0/0x20
[ 7720.804690] Code: 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff ff e8 ea 7f f9 ff e9 a5 0a 00 00 0f 1f 44 00 00 <0f> 1f 00 6a ff e8 66 09 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 e8
[ 7720.804691] RSP: 0000:fffffe0000009748 EFLAGS: 00010002
[ 7720.804692] RAX: 000000c0009b6600 RBX: 000000c0008ba750 RCX: 0000000000000028
[ 7720.804693] RDX: 0000000000000090 RSI: 0000000000203000 RDI: 00007fd1244e3138
[ 7720.804694] RBP: 00007fd11cff8af8 R08: 0000000000000003 R09: 00007fd1260cdd3b
[ 7720.804695] R10: 00000000000fbeb0 R11: 00007fd126287fff R12: 000000c0008ba750
[ 7720.804695] R13: 000000c0009b6600 R14: 000000c0009cf860 R15: 0000000000000000
[ 7720.804700] WARNING: stack recursion on stack type 5
[ 7720.804703]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804902]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804904]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804906]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804908]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804910]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804912]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804915]  ? asm_exc_stack_segment+0x10/0x30
[ 7720.804917]  ? vsnprintf+0x359/0x550
[ 7720.804935]  ? vsnprintf+0x359/0x550
[ 7720.804936]  ? sprintf+0x56/0x80
[ 7720.804938]  ? __sprint_symbol.constprop.0+0xee/0x110
[ 7720.804964]  ? symbol_string+0xa2/0x140
[ 7720.804966]  ? symbol_string+0xa2/0x140
[ 7720.804968]  ? vsnprintf+0x397/0x550
[ 7720.804969]  ? vscnprintf+0xd/0x40
[ 7720.804970]  ? printk_sprint+0x79/0xa0
[ 7720.804978]  ? pointer+0x230/0x4f0
[ 7720.804980]  ? vsnprintf+0x397/0x550
[ 7720.804982]  ? vscnprintf+0xd/0x40
[ 7720.804983]  ? printk_sprint+0x5e/0xa0
[ 7720.804985]  ? vprintk_store+0x2fe/0x5b0
[ 7720.804987]  ? defer_console_output+0x3b/0x50
[ 7720.804989]  ? vprintk+0x4a/0x90
[ 7720.804991]  ? is_bpf_text_address+0x17/0x30
[ 7720.805002]  ? kernel_text_address+0xf7/0x100
[ 7720.805011]  ? unwind_next_frame.part.0+0x86/0x200
[ 7720.805020]  ? __kernel_text_address+0x12/0x50
[ 7720.805022]  ? show_trace_log_lvl+0x1cb/0x2df
[ 7720.805033]  ? show_trace_log_lvl+0x1cb/0x2df
[ 7720.805035]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.805038]  ? show_regs.part.0+0x23/0x29
[ 7720.805039]  ? __die_body.cold+0x8/0xd
[ 7720.805056]  ? __die+0x2b/0x37
[ 7720.805057]  ? die+0x30/0x60
[ 7720.805067]  ? handle_stack_overflow+0x4e/0x60
[ 7720.805069]  ? exc_double_fault+0x155/0x190
[ 7720.805071]  ? asm_exc_double_fault+0x1e/0x30
[ 7720.805073]  ? native_iret+0x7/0x7
[ 7720.805074]  ? asm_exc_page_fault+0x8/0x30
[ 7720.805077]  ? error_entry+0xc/0x130
[ 7720.805078]  </#DF>
[ 7720.805083] Modules linked in: tcp_diag udp_diag inet_diag veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay sch_fq_codel cp210x input_leds usbserial cdc_acm joydev serio_raw mac_hid qemu_fw_cfg dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua mtd pstore_blk ramoops netconsole pstore_zone reed_solomon ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper hid_generic syscopyarea sysfillrect sysimgblt fb_sys_fops cec usbhid rc_core virtio_net net_failover hid drm psmouse virtio_scsi failover i2c_piix4 pata_acpi floppy
[ 7720.901966] ---[ end trace b7f1a532a0e81c78 ]---
[ 7720.901991] RIP: 0010:error_entry+0xc/0x130
[ 7720.901998] Code: ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 fc 56 48 8b 74 24 08 48 89 7c 24 08 <52> 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 41 57 56
[ 7720.901999] RSP: 0000:fffffe0000009000 EFLAGS: 00010087
[ 7720.902001] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.902002] RDX: 0000000000000000 RSI: ffffffff87000b48 RDI: fffffe0000009078
[ 7720.902003] RBP: fffffe0000009068 R08: 0000000000000000 R09: 0000000000000000
[ 7720.902004] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009078
[ 7720.902004] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.902005] FS:  00007fd11cff9640(0000) GS:ffff8988bbc00000(0000) knlGS:0000000000000000
[ 7720.902007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7720.902008] CR2: fffffe0000008ff8 CR3: 00000001292be000 CR4: 00000000000006f0
[ 7720.902013] Kernel panic - not syncing: Fatal exception in interrupt
[ 7720.902108] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
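
To check whether two panics are really different, one option is to pull the headline fields out of each log and compare them. This is a minimal Python sketch, assuming the log is in the dmesg format shown above. The function name `panic_signature` and the choice of fields are illustrative and not part of any kernel tooling.

```python
import re

# Matches the leading "[ 7720.804438] " dmesg timestamp, if present.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def panic_signature(log_text):
    """Return headline fields of the first oops/panic found in log_text."""
    sig = {"bug": None, "rip": None, "comm": None, "kernel": None, "panic": None}
    for raw in log_text.splitlines():
        line = _TIMESTAMP.sub("", raw.strip())
        if sig["bug"] is None and line.startswith("BUG:"):
            sig["bug"] = line[len("BUG:"):].strip()
        elif sig["rip"] is None and line.startswith("RIP:"):
            # "RIP: 0010:error_entry+0xc/0x130" -> "error_entry+0xc/0x130"
            sig["rip"] = line.split(":", 2)[-1].strip()
        elif sig["comm"] is None and line.startswith("CPU:"):
            comm = re.search(r"Comm: (\S+)", line)
            kernel = re.search(r"(\d+\.\d+\.\d+\S*)", line)
            sig["comm"] = comm.group(1) if comm else None
            sig["kernel"] = kernel.group(1) if kernel else None
        elif sig["panic"] is None and line.startswith("Kernel panic"):
            sig["panic"] = line.split("not syncing:", 1)[-1].strip()
    return sig
```

Only the first `RIP:` line is kept, because the later ones in a trace like this belong to nested exception frames. When comparing two logs, `rip` and `panic` are probably the most useful fields; the `bug` line contains hashed addresses that differ from one crash to the next.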
 

Holger Huo

New Member
Aug 8, 2022
4
0
1
I'm running 2 VMs on my Proxmox server, pfSense and Ubuntu 22.04 running docker. Both have locked up at various points although not for the past 5 or so days - as usual with Murphy's law, they haven't locked up since running their kernels in verbose mode and running remote logging services in order to try and diagnose the freezes/lockups.

Kernel versions below:

Ubuntu: Linux 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
pfSense/FreeBSD: FreeBSD 12.3-STABLE FreeBSD 12.3-STABLE RELENG_2_6_0-n226742-1285d6d205f pfSense amd64



I also wonder if this has some impact on the VMs running under it. It makes sense that this could potentially be an issue.
This apparently invalidates my theory of a Linux kernel bug, as does my own 5.19 kernel on Alma Linux 8, which crashed today.

I consulted the manufacturer today, and they said there seem to be some drawbacks in Intel's 11th-gen Celeron units, and that we might have to wait for software to patch the problem.

They recommended moving Linux VMs to LXC, as the problem shouldn't occur outside a KVM environment, and they also said ESXi was more stable than PVE (not tested). Besides moving my Linux workloads to LXC, I've also updated my PVE base system via the no-subscription repo and installed intel-microcode from Debian's non-free repo.

My only VM that cannot be migrated to LXC is OpenWrt running kernel 5.10 (this instance has never frozen, as it barely has any load; that may also be kernel related). I'll add a test VM running Alma Linux 9 with kernel 5.14 (which previously tended to freeze after a few hours) to see if the intel-microcode works.
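
One way to confirm that the intel-microcode package actually took effect is to compare the microcode revision the kernel reports before and after the reboot. On x86 Linux, `/proc/cpuinfo` has a `microcode` field for each logical CPU. This is a minimal Python sketch, assuming it is fed the contents of `/proc/cpuinfo` from the Proxmox host rather than from inside a guest. The function name and the sample revision value are invented for illustration.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


# Made-up sample in /proc/cpuinfo layout; real revision values will differ.
SAMPLE = "processor\t: 0\nmicrocode\t: 0x1d\nprocessor\t: 1\nmicrocode\t: 0x1d\n"
print(sorted(microcode_revisions(SAMPLE)))
```

On a real host, pass in `open("/proc/cpuinfo").read()` instead of `SAMPLE`. If the revision is unchanged after installing the package and rebooting, the update was probably not loaded, or the package carries no newer revision for this CPU.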
 

rzv

New Member
Aug 1, 2022
11
1
3
While it's nice that someone finally acknowledged the existence of this problem, I don't hold out much hope, as this CPU is now over a year old.
What "software" are we supposed to wait for? A kernel patch? Microcode?

I already tested the microcode from the Debian non-free repo and it doesn't fix anything.
 

gyrex

Why does it invalidate your theory of a kernel bug? Proxmox also runs a Linux kernel, and there are a lot of similar bugs and kernel panics logged at kernel.org for KVM issues. I'm almost certain this is a kernel bug. The logs point that way, but I'm not an expert, which is why I'm reporting the bug and attaching the requisite log files for kernel experts to look at.
 

gyrex

Have you considered running VMware ESXi? It's free to run.
 

rzv

I don't like ESXi. Nothing against VMware, but I prefer Proxmox because of the built-in web UI and the ease of device passthrough. The same applies to XCP-ng.
For now I've moved my workloads to LXC containers, and I will change my hardware to something else in the future.
For me this issue is closed for now, but I will keep watching these threads for a solution if we ever get one.
 

Holger Huo

Maybe it's something related to the virtualization layer, and Proxmox can develop a fix for it.
I meant that it invalidated my theory of a bug introduced between Linux versions 5.10 and 5.14, as this issue persists in kernel version 5.19 and also in BSD-based systems like pfSense. I do hope there is some way to fix it at the software layer, as the N5105 is the most performant low-power CPU right now.
 
