VM freezes irregularly

gyrex · Aug 10, 2022

I just experienced another kernel panic on the same VM 2 hours after the previous one. I've included the log below but it seems like a different panic.

I'm going to try and change the machine type from i440fx to q35 - out of interest, what machine types is everyone else using?

Code:

[ 7720.804438] BUG: #DF stack guard page was hit at 00000000d9071369 (stack is 000000002e08a9df..0000000059db9875)
[ 7720.804460] stack guard page: 0000 [#1] SMP PTI
[ 7720.804464] CPU: 0 PID: 809 Comm: dockerd Not tainted 5.15.0-46-generic #49-Ubuntu
[ 7720.804473] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 7720.804475] RIP: 0010:error_entry+0xc/0x130
[ 7720.804498] Code: ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 fc 56 48 8b 74 24 08 48 89 7c 24 08 <52> 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 41 57 56
[ 7720.804500] RSP: 0000:fffffe0000009000 EFLAGS: 00010087
[ 7720.804503] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804504] RDX: 0000000000000000 RSI: ffffffff87000b48 RDI: fffffe0000009078
[ 7720.804505] RBP: fffffe0000009068 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804506] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009078
[ 7720.804507] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804508] FS:  00007fd11cff9640(0000) GS:ffff8988bbc00000(0000) knlGS:0000000000000000
[ 7720.804509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7720.804511] CR2: fffffe0000008ff8 CR3: 00000001292be000 CR4: 00000000000006f0
[ 7720.804514] Call Trace:
[ 7720.804519]  <#DF>
[ 7720.804534]  ? exc_page_fault+0x1c/0x170
[ 7720.804538]  asm_exc_page_fault+0x26/0x30
[ 7720.804541] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804543] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804545] RSP: 0000:fffffe0000009128 EFLAGS: 00010087
[ 7720.804546] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804547] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009158
[ 7720.804548] RBP: fffffe0000009148 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804548] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009158
[ 7720.804549] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804550]  ? native_iret+0x7/0x7
[ 7720.804562]  asm_exc_page_fault+0x26/0x30
[ 7720.804564] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804566] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804567] RSP: 0000:fffffe0000009208 EFLAGS: 00010087
[ 7720.804568] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804569] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009238
[ 7720.804570] RBP: fffffe0000009228 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804570] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009238
[ 7720.804571] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804592]  ? native_iret+0x7/0x7
[ 7720.804594]  asm_exc_page_fault+0x26/0x30
[ 7720.804597] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804598] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804608] RSP: 0000:fffffe00000092e8 EFLAGS: 00010087
[ 7720.804610] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804610] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe0000009318
[ 7720.804611] RBP: fffffe0000009308 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804612] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009318
[ 7720.804612] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804614]  ? native_iret+0x7/0x7
[ 7720.804616]  asm_exc_page_fault+0x26/0x30
[ 7720.804618] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804620] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804629] RSP: 0000:fffffe00000093c8 EFLAGS: 00010087
[ 7720.804630] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804631] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000093f8
[ 7720.804632] RBP: fffffe00000093e8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804632] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000093f8
[ 7720.804633] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804634]  ? native_iret+0x7/0x7
[ 7720.804637]  asm_exc_page_fault+0x26/0x30
[ 7720.804639] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804640] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804641] RSP: 0000:fffffe00000094a8 EFLAGS: 00010087
[ 7720.804642] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804643] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000094d8
[ 7720.804643] RBP: fffffe00000094c8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804644] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000094d8
[ 7720.804645] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804645]  ? native_iret+0x7/0x7
[ 7720.804647]  asm_exc_page_fault+0x26/0x30
[ 7720.804649] RIP: 0010:exc_page_fault+0x1c/0x170
[ 7720.804650] Code: 07 01 eb c4 e8 b5 01 00 00 cc cc cc cc cc 55 48 89 e5 41 57 41 56 49 89 f6 41 55 41 54 49 89 fc 0f 20 d0 0f 1f 40 00 49 89 c5 <65> 48 8b 04 25 c0 fb 01 00 48 8b 80 98 08 00 00 0f 18 48 78 66 90
[ 7720.804651] RSP: 0000:fffffe0000009588 EFLAGS: 00010087
[ 7720.804652] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804653] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffe00000095b8
[ 7720.804653] RBP: fffffe00000095a8 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804654] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe00000095b8
[ 7720.804655] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804656]  ? native_iret+0x7/0x7
[ 7720.804657]  asm_exc_page_fault+0x26/0x30
[ 7720.804659] RIP: 0010:irqentry_enter+0xf/0x50
[ 7720.804661] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 f6 87 88 00 00 00 03 75 17 31 c0 <65> 48 8b 14 25 c0 fb 01 00 f6 42 2c 02 75 13 5d c3 cc cc cc cc e8
[ 7720.804661] RSP: 0000:fffffe0000009668 EFLAGS: 00010046
[ 7720.804662] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.804663] RDX: 0000000000000000 RSI: ffffffff87000aea RDI: fffffe0000009698
[ 7720.804677] RBP: fffffe0000009668 R08: 0000000000000000 R09: 0000000000000000
[ 7720.804678] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009698
[ 7720.804679] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 7720.804680]  ? native_iret+0x7/0x7
[ 7720.804681]  ? asm_exc_invalid_op+0xa/0x20
[ 7720.804684]  exc_invalid_op+0x25/0x70
[ 7720.804686]  asm_exc_invalid_op+0x1a/0x20
[ 7720.804688] RIP: 0010:asm_exc_invalid_op+0x0/0x20
[ 7720.804690] Code: 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff ff e8 ea 7f f9 ff e9 a5 0a 00 00 0f 1f 44 00 00 <0f> 1f 00 6a ff e8 66 09 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 e8
[ 7720.804691] RSP: 0000:fffffe0000009748 EFLAGS: 00010002
[ 7720.804692] RAX: 000000c0009b6600 RBX: 000000c0008ba750 RCX: 0000000000000028
[ 7720.804693] RDX: 0000000000000090 RSI: 0000000000203000 RDI: 00007fd1244e3138
[ 7720.804694] RBP: 00007fd11cff8af8 R08: 0000000000000003 R09: 00007fd1260cdd3b
[ 7720.804695] R10: 00000000000fbeb0 R11: 00007fd126287fff R12: 000000c0008ba750
[ 7720.804695] R13: 000000c0009b6600 R14: 000000c0009cf860 R15: 0000000000000000
[ 7720.804700] WARNING: stack recursion on stack type 5
[ 7720.804703]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804902]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804904]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804906]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804908]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804910]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804912]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.804915]  ? asm_exc_stack_segment+0x10/0x30
[ 7720.804917]  ? vsnprintf+0x359/0x550
[ 7720.804935]  ? vsnprintf+0x359/0x550
[ 7720.804936]  ? sprintf+0x56/0x80
[ 7720.804938]  ? __sprint_symbol.constprop.0+0xee/0x110
[ 7720.804964]  ? symbol_string+0xa2/0x140
[ 7720.804966]  ? symbol_string+0xa2/0x140
[ 7720.804968]  ? vsnprintf+0x397/0x550
[ 7720.804969]  ? vscnprintf+0xd/0x40
[ 7720.804970]  ? printk_sprint+0x79/0xa0
[ 7720.804978]  ? pointer+0x230/0x4f0
[ 7720.804980]  ? vsnprintf+0x397/0x550
[ 7720.804982]  ? vscnprintf+0xd/0x40
[ 7720.804983]  ? printk_sprint+0x5e/0xa0
[ 7720.804985]  ? vprintk_store+0x2fe/0x5b0
[ 7720.804987]  ? defer_console_output+0x3b/0x50
[ 7720.804989]  ? vprintk+0x4a/0x90
[ 7720.804991]  ? is_bpf_text_address+0x17/0x30
[ 7720.805002]  ? kernel_text_address+0xf7/0x100
[ 7720.805011]  ? unwind_next_frame.part.0+0x86/0x200
[ 7720.805020]  ? __kernel_text_address+0x12/0x50
[ 7720.805022]  ? show_trace_log_lvl+0x1cb/0x2df
[ 7720.805033]  ? show_trace_log_lvl+0x1cb/0x2df
[ 7720.805035]  ? asm_exc_alignment_check+0x30/0x30
[ 7720.805038]  ? show_regs.part.0+0x23/0x29
[ 7720.805039]  ? __die_body.cold+0x8/0xd
[ 7720.805056]  ? __die+0x2b/0x37
[ 7720.805057]  ? die+0x30/0x60
[ 7720.805067]  ? handle_stack_overflow+0x4e/0x60
[ 7720.805069]  ? exc_double_fault+0x155/0x190
[ 7720.805071]  ? asm_exc_double_fault+0x1e/0x30
[ 7720.805073]  ? native_iret+0x7/0x7
[ 7720.805074]  ? asm_exc_page_fault+0x8/0x30
[ 7720.805077]  ? error_entry+0xc/0x130
[ 7720.805078]  </#DF>
[ 7720.805083] Modules linked in: tcp_diag udp_diag inet_diag veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay sch_fq_codel cp210x input_leds usbserial cdc_acm joydev serio_raw mac_hid qemu_fw_cfg dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua mtd pstore_blk ramoops netconsole pstore_zone reed_solomon ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper hid_generic syscopyarea sysfillrect sysimgblt fb_sys_fops cec usbhid rc_core virtio_net net_failover hid drm psmouse virtio_scsi failover i2c_piix4 pata_acpi floppy
[ 7720.901966] ---[ end trace b7f1a532a0e81c78 ]---
[ 7720.901991] RIP: 0010:error_entry+0xc/0x130
[ 7720.901998] Code: ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 fc 56 48 8b 74 24 08 48 89 7c 24 08 <52> 51 50 41 50 41 51 41 52 41 53 53 55 41 54 41 55 41 56 41 57 56
[ 7720.901999] RSP: 0000:fffffe0000009000 EFLAGS: 00010087
[ 7720.902001] RAX: 000000000001fbc0 RBX: 0000000000000000 RCX: ffffffff87001187
[ 7720.902002] RDX: 0000000000000000 RSI: ffffffff87000b48 RDI: fffffe0000009078
[ 7720.902003] RBP: fffffe0000009068 R08: 0000000000000000 R09: 0000000000000000
[ 7720.902004] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009078
[ 7720.902004] R13: 000000000001fbc0 R14: 0000000000000000 R15: 0000000000000000
[ 7720.902005] FS:  00007fd11cff9640(0000) GS:ffff8988bbc00000(0000) knlGS:0000000000000000
[ 7720.902007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7720.902008] CR2: fffffe0000008ff8 CR3: 00000001292be000 CR4: 00000000000006f0
[ 7720.902013] Kernel panic - not syncing: Fatal exception in interrupt
[ 7720.902108] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

gyrex · Aug 10, 2022

I've logged a bug report on kernel.org if anyone's interested: https://bugzilla.kernel.org/show_bug.cgi?id=216349

Holger Huo · Aug 11, 2022

gyrex said:
I'm running 2 VMs on my Proxmox server, pfSense and Ubuntu 22.04 running docker. Both have locked up at various points although not for the past 5 or so days - as usual with Murphy's law, they haven't locked up since running their kernels in verbose mode and running remote logging services in order to try and diagnose the freezes/lockups.

Kernel versions below:

Ubuntu: Linux 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
pfSense/FreeBSD: FreeBSD 12.3-STABLE FreeBSD 12.3-STABLE RELENG_2_6_0-n226742-1285d6d205f pfSense amd64

I also wonder if this has some impact on the VMs running under it. It makes sense that this could potentially be an issue.

This apparently invalidates my theory of a Linux kernel bug, as does my own 5.19 kernel on Alma Linux 8, which crashed today.

I consulted the manufacturer today and they said there seem to be some issues with Intel's 11th gen Celeron units, and that we might have to wait for software to patch the problem.

They recommended moving Linux VMs to LXC, as the problem shouldn't occur outside a KVM environment, and they also said ESXi was more stable than PVE (not tested). Besides moving my Linux workloads to LXC, I've also updated my PVE base system via the no-subscription repo and installed intel-microcode from Debian's non-free repo.

My only VM that cannot be migrated to LXC is OpenWrt running kernel 5.10 (this instance has never frozen, as it barely has any load; that may also be kernel related). I'll add a test VM running Alma Linux 9 with kernel 5.14 (which previously froze after a few hours, usually) to see if intel-microcode works.

rzv · Aug 12, 2022

Holger Huo said:
This apparently invalidates my theory of a Linux kernel bug, as does my own 5.19 kernel on Alma Linux 8, which crashed today.

I consulted the manufacturer today and they said there seem to be some issues with Intel's 11th gen Celeron units, and that we might have to wait for software to patch the problem.

They recommended moving Linux VMs to LXC, as the problem shouldn't occur outside a KVM environment, and they also said ESXi was more stable than PVE (not tested). Besides moving my Linux workloads to LXC, I've also updated my PVE base system via the no-subscription repo and installed intel-microcode from Debian's non-free repo.

My only VM that cannot be migrated to LXC is OpenWrt running kernel 5.10 (this instance has never frozen, as it barely has any load; that may also be kernel related). I'll add a test VM running Alma Linux 9 with kernel 5.14 (which previously froze after a few hours, usually) to see if intel-microcode works.

While it's nice that someone finally acknowledged the existence of this problem, I don't hold out much hope as this CPU is now over a year old.
What "software" are we supposed to wait for? Kernel patch? Microcode?

I already tested the microcode from the Debian non-free repo and it doesn't fix anything.

gyrex · Aug 12, 2022

Holger Huo said:
This apparently invalidates my theory of a Linux kernel bug

Why does it invalidate your theory of a kernel bug? Proxmox also runs a Linux kernel, and there are a lot of similar bugs/kernel panics logged at kernel.org for KVM issues. I'm almost certain this is a kernel bug, and the logs demonstrate this, but I'm not an expert, hence reporting the bug and attaching the requisite log files for kernel experts to look at.

gyrex · Aug 12, 2022

rzv said:
While it's nice that someone finally acknowledged the existence of this problem, I don't hold out much hope as this CPU is now over a year old.
What "software" are we supposed to wait for? Kernel patch? Microcode?

I already tested the microcode from the Debian non-free repo and it doesn't fix anything.

Have you considered running VMware ESXi? It's free to run.

rzv · Aug 12, 2022

gyrex said:
Have you considered running VMware ESXi? It's free to run.

I don't like ESXi. Nothing against VMware, but I prefer Proxmox because of the built-in web UI and the ease of device passthrough. The same applies to XCP-ng.
For now I moved my workloads to LXC containers and I will change my hardware to something else in the future.
For me this issue is closed for now, but I will keep watching these threads for a solution if we ever get one.

Holger Huo · Aug 12, 2022

rzv said:
While it's nice that someone finally acknowledged the existence of this problem, I don't hold out much hope as this CPU is now over a year old.
What "software" are we supposed to wait for? Kernel patch? Microcode?

I already tested the microcode from the Debian non-free repo and it doesn't fix anything.

Maybe it's something relating to the virtualization layer and Proxmox can develop some fixes for it.

gyrex said:
Why does it invalidate your theory of a kernel bug? Proxmox also runs a Linux kernel, and there are a lot of similar bugs/kernel panics logged at kernel.org for KVM issues. I'm almost certain this is a kernel bug, and the logs demonstrate this, but I'm not an expert, hence reporting the bug and attaching the requisite log files for kernel experts to look at.

I meant that it invalidated my theory of a bug introduced between Linux versions 5.10 and 5.14, as this issue persists in kernel version 5.19 and also in BSD-based systems like pfSense. I do hope there is some way to fix it at the software layer, as the N5105 is the most performant low-power CPU right now.

gyrex · Aug 13, 2022

Another freeze today. Log below and attached. Will post the log to the Proxmox (https://bugzilla.proxmox.com/show_bug.cgi?id=4188) and kernel.org (https://bugzilla.kernel.org/show_bug.cgi?id=216349) bugzilla reports I've created.

Code:

[32846.996729] invalid opcode: 0000 [#1] SMP PTI
[32846.997520] CPU: 0 PID: 2951 Comm: cron Not tainted 5.15.0-46-generic #49-Ubuntu
[32846.998310] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[32847.000030] RIP: 0010:asm_exc_page_fault+0x1/0x30
[32847.001067] Code: 28 ff 74 24 28 ff 74 24 28 ff 74 24 28 e8 27 09 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 e8 b7 86 f9 ff e9 42 0a 00 00 66 90 0f <1f> 00 e8 08 09 00 00 48 89 c4 48 8d 6c 24 01 48 89 e7 48 8b 74 24
[32847.003280] RSP: 0018:ffffb688031c7b08 EFLAGS: 00010086
[32847.004521] RAX: ffffffffc07f5360 RBX: 0000000000000000 RCX: 0000000000000002
[32847.005772] RDX: 0000000000000081 RSI: ffff9cb54955cb60 RDI: ffffffff91681300
[32847.007063] RBP: ffffb688031c7ba8 R08: 0000000000000000 R09: 0000000000000000
[32847.008418] R10: ffff9cb54eab3000 R11: 0000000000000000 R12: ffff9cb54955cb60
[32847.009782] R13: 0000000000000081 R14: ffffffff91681300 R15: d0d0d0d0d0d0d0d0
[32847.011199] FS:  00007fe632c78840(0000) GS:ffff9cb57bc00000(0000) knlGS:0000000000000000
[32847.012623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32847.013711] CR2: ffffffffc63418c3 CR3: 0000000001c14000 CR4: 00000000000006f0
[32847.014847] Call Trace:
[32847.015811]  <TASK>
[32847.016616]  ? asm_sysvec_spurious_apic_interrupt+0x20/0x20
[32847.017488]  ? ovl_verify_inode+0xd0/0xd0 [overlay]
[32847.018341]  ? ovl_verify_inode+0xd0/0xd0 [overlay]
[32847.018939]  ? inode_permission+0xef/0x1a0
[32847.019561]  link_path_walk.part.0.constprop.0+0xc9/0x370
[32847.020165]  ? path_init+0x2c0/0x3f0
[32847.020779]  path_lookupat+0x3e/0x1c0
[32847.021416]  ? generic_fillattr+0x4e/0xe0
[32847.021941]  filename_lookup+0xcf/0x1d0
[32847.022468]  ? __check_object_size+0x1d/0x30
[32847.023003]  ? strncpy_from_user+0x44/0x150
[32847.023583]  ? getname_flags.part.0+0x4c/0x1b0
[32847.024200]  user_path_at_empty+0x3f/0x60
[32847.024880]  vfs_statx+0x7a/0x130
[32847.025450]  __do_sys_newstat+0x3e/0x80
[32847.026194]  ? __secure_computing+0xa9/0x120
[32847.026825]  ? syscall_trace_enter.constprop.0+0xa7/0x1c0
[32847.027410]  __x64_sys_newstat+0x16/0x20
[32847.028025]  do_syscall_64+0x5c/0xc0
[32847.028621]  ? syscall_exit_to_user_mode+0x27/0x50
[32847.029197]  ? __x64_sys_newstat+0x16/0x20
[32847.029748]  ? do_syscall_64+0x69/0xc0
[32847.030298]  ? do_syscall_64+0x69/0xc0
[32847.030853]  ? syscall_exit_to_user_mode+0x27/0x50
[32847.031415]  ? __x64_sys_newstat+0x16/0x20
[32847.032002]  ? do_syscall_64+0x69/0xc0
[32847.032591]  ? do_syscall_64+0x69/0xc0
[32847.033193]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[32847.033786] RIP: 0033:0x7fe632e643a6
[32847.034390] Code: 00 00 75 05 48 83 c4 18 c3 e8 66 f3 01 00 66 0f 1f 44 00 00 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 29 b8 04 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 c3 90 48 8b 15 b9 fa 0c 00 f7 d8 64 89 02
[32847.035617] RSP: 002b:00007ffdce8c1f98 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[32847.036257] RAX: ffffffffffffffda RBX: 00005558fae76690 RCX: 00007fe632e643a6
[32847.036879] RDX: 00007ffdce8c2190 RSI: 00007ffdce8c2190 RDI: 00007ffdce8c2320
[32847.037492] RBP: 00007ffdce8c4380 R08: 0000000000000001 R09: 0000000000000012
[32847.038105] R10: 00005558fae764e8 R11: 0000000000000246 R12: 00007ffdce8c1fe0
[32847.038699] R13: 00007ffdce8c2320 R14: 00007ffdce8c2070 R15: 00005558fa816186
[32847.039280]  </TASK>
[32847.039875] Modules linked in: tls xt_nat xt_tcpudp veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay sch_fq_codel joydev input_leds serio_raw qemu_fw_cfg mac_hid ramoops dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua reed_solomon pstore_blk mtd netconsole pstore_zone ipmi_devintf ipmi_msghandler efi_pstore msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea hid_generic sysfillrect sysimgblt xhci_pci fb_sys_fops cec rc_core psmouse xhci_pci_renesas virtio_net usbhid net_failover hid failover drm virtio_scsi i2c_piix4 pata_acpi floppy
[32847.045209] ---[ end trace 3eb46e5a4c095231 ]---

gyrex · Aug 13, 2022

Another panic:

Code:

[38049.665307] traps: PANIC: double fault, error_code: 0x0
[38049.665352] double fault: 0000 [#1] SMP PTI
[38049.665362] CPU: 1 PID: 3295 Comm: lighttpd Not tainted 5.15.0-46-generic #49-Ubuntu
[38049.665388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[38049.665395] RIP: 0010:error_entry+0x0/0x130
[38049.665466] Code: de eb 0a f3 48 0f ae db e9 21 fd ff ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 <fc> 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41
[38049.665471] RSP: 0018:ffffb507830aa35d EFLAGS: 00010002
[38049.665480] RAX: 00007ffe5be0c194 RBX: 00007ffe5be0c1e0 RCX: 00007ffe5be0c194
[38049.665484] RDX: 0000000000000070 RSI: 0000000000000010 RDI: ffffb507830a3cf8
[38049.665487] RBP: ffffb507830a3cd8 R08: 0000000000000001 R09: 0000000000000000
[38049.665491] R10: 0000000000000001 R11: 0000000000000000 R12: ffff91c59f1a9880
[38049.665494] R13: ffff91c4d1be2d80 R14: 0000000000080800 R15: ffff91c49d517700
[38049.665499] FS:  00007f96e38ef680(0000) GS:ffff91c5bbd00000(0000) knlGS:0000000000000000
[38049.665503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38049.665507] CR2: ffffb507830aa348 CR3: 00000000090c0000 CR4: 00000000000006e0
[38049.665518] Call Trace:
[38049.665542] Modules linked in: cp210x usbserial cdc_acm tls veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay sch_fq_codel joydev input_leds serio_raw qemu_fw_cfg mac_hid dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole ipmi_devintf pstore_blk mtd ramoops pstore_zone reed_solomon ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper hid_generic syscopyarea sysfillrect sysimgblt fb_sys_fops usbhid cec rc_core xhci_pci hid psmouse xhci_pci_renesas drm i2c_piix4 pata_acpi virtio_net net_failover failover virtio_scsi floppy
[38049.687192] ---[ end trace e501d4c27d1b1728 ]---
[38049.687196] RIP: 0010:error_entry+0x0/0x130
[38049.687203] Code: de eb 0a f3 48 0f ae db e9 21 fd ff ff 85 db 0f 85 19 fd ff ff 0f 01 f8 e9 11 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 <fc> 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41
[38049.687227] RSP: 0018:ffffb507830aa35d EFLAGS: 00010002
[38049.687229] RAX: 00007ffe5be0c194 RBX: 00007ffe5be0c1e0 RCX: 00007ffe5be0c194
[38049.687230] RDX: 0000000000000070 RSI: 0000000000000010 RDI: ffffb507830a3cf8
[38049.687231] RBP: ffffb507830a3cd8 R08: 0000000000000001 R09: 0000000000000000
[38049.687241] R10: 0000000000000001 R11: 0000000000000000 R12: ffff91c59f1a9880
[38049.687242] R13: ffff91c4d1be2d80 R14: 0000000000080800 R15: ffff91c49d517700
[38049.687243] FS:  00007f96e38ef680(0000) GS:ffff91c5bbd00000(0000) knlGS:0000000000000000
[38049.687244] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38049.687245] CR2: ffffb507830aa348 CR3: 00000000090c0000 CR4: 00000000000006e0
[38049.687249] Kernel panic - not syncing: Fatal exception in interrupt
[38049.687476] Kernel Offset: 0x34400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
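
The three guest traces above each report a different fault (stack guard page, invalid opcode, double fault). For anyone comparing such traces side by side, the following is a minimal Python sketch that pulls the headline fields out of dmesg-style text. The function name `summarize_panics` and the regular expressions are assumptions written to match the log lines quoted in this thread; this is not an official kernel or Proxmox tool.

```python
import re

# Assumed patterns, modelled on the log lines quoted in this thread.
TS = r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
# Oops headline, e.g. "invalid opcode: 0000 [#1] SMP PTI"
OOPS_LINE = re.compile(TS + r"(?P<what>.+?): [0-9a-f]{4} \[#\d+\]")
# e.g. "CPU: 0 PID: 809 Comm: dockerd Not tainted 5.15.0-46-generic #49-Ubuntu"
CPU_LINE = re.compile(
    TS + r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r".*?(?P<kernel>\d+\.\d+\.\d+\S*)"
)
# e.g. "RIP: 0010:error_entry+0xc/0x130"
RIP_LINE = re.compile(TS + r"RIP: [0-9a-f]{4}:(?P<rip>\S+)")


def summarize_panics(text):
    """Return one dict per oops headline found in dmesg-style text."""
    panics = []
    current = None
    for line in text.splitlines():
        m = OOPS_LINE.match(line)
        if m:
            current = {"time": float(m["ts"]), "fault": m["what"]}
            panics.append(current)
            continue
        if current is None:
            continue
        m = CPU_LINE.match(line)
        if m and "comm" not in current:
            current.update(cpu=int(m["cpu"]), pid=int(m["pid"]),
                           comm=m["comm"], kernel=m["kernel"])
            continue
        m = RIP_LINE.match(line)
        if m and "rip" not in current:
            # Keep only the first RIP after the headline (the faulting one).
            current["rip"] = m["rip"]
    return panics


# Short excerpts copied from the traces in this thread.
sample = """\
[ 7720.804460] stack guard page: 0000 [#1] SMP PTI
[ 7720.804464] CPU: 0 PID: 809 Comm: dockerd Not tainted 5.15.0-46-generic #49-Ubuntu
[ 7720.804475] RIP: 0010:error_entry+0xc/0x130
[32846.996729] invalid opcode: 0000 [#1] SMP PTI
[32846.997520] CPU: 0 PID: 2951 Comm: cron Not tainted 5.15.0-46-generic #49-Ubuntu
[32847.000030] RIP: 0010:asm_exc_page_fault+0x1/0x30
"""

for p in summarize_panics(sample):
    print(p["time"], p["fault"], p.get("comm"), p.get("rip"))
```

A summary like this only shows that the faulting process and instruction differ between crashes; it does not by itself say whether the cause is the guest kernel, KVM, or the hardware.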

BarTouZ · Aug 13, 2022

Hello,

For my part, I just had a kernel panic on pfSense...

"The advantage" is that the pfSense VM restarts itself, but it's still the same problem with pfSense and with Ubuntu 22.04.1 on kernel 5.19.1...
It drives me crazy because Proxmox itself runs for days while the VMs crash randomly...

Code:

Aug 13 14:41:35    kernel        ---<<BOOT>>---
Aug 13 14:41:35    kernel        KDB: enter: panic
Aug 13 14:41:35    kernel        time = 1660394416
Aug 13 14:41:35    kernel        cpuid = 0
Aug 13 14:41:35    kernel        panic: page fault
Aug 13 14:41:35    kernel        timeout stopping cpus
Aug 13 14:41:35    kernel        trap number = 12
Aug 13 14:41:35    kernel        current process = 48546 (awk)
Aug 13 14:41:35    kernel        processor eflags = resume, IOPL = 0
Aug 13 14:41:35    kernel        = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 13 14:41:35    kernel        code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 13 14:41:35    kernel        frame pointer = 0x28:0xfffffe002a6d1300
Aug 13 14:41:35    kernel        stack pointer = 0x28:0xfffffe002a6d12c0
Aug 13 14:41:35    kernel        instruction pointer = 0x20:0xffffffff80daae47
Aug 13 14:41:35    kernel        fault code = supervisor read data, page not present
Aug 13 14:41:35    kernel        fault virtual address = 0x7835778b0
Aug 13 14:41:35    kernel        cpuid = 0; apic id = 27fff0
Aug 13 14:41:35    kernel        Fatal trap 12: page fault while in kernel mode
Aug 13 14:41:35    kernel        kernel trap 12 with interrupts disabled
Aug 13 14:41:35    syslogd        kernel boot file is /boot/kernel/kernel

gyrex · Aug 13, 2022

BarTouZ said:

Hello,

For my part, I just had a kernel panic on pfSense...

"The advantage" is that the pfSense VM restarts itself, but it's still the same problem with pfSense and with Ubuntu 22.04.1 on kernel 5.19.1...
It drives me crazy because Proxmox itself runs for days while the VMs crash randomly...

Code:

Aug 13 14:41:35    kernel        ---<<BOOT>>---
Aug 13 14:41:35    kernel        KDB: enter: panic
Aug 13 14:41:35    kernel        time = 1660394416
Aug 13 14:41:35    kernel        cpuid = 0
Aug 13 14:41:35    kernel        panic: page fault
Aug 13 14:41:35    kernel        timeout stopping cpus
Aug 13 14:41:35    kernel        trap number = 12
Aug 13 14:41:35    kernel        current process = 48546 (awk)
Aug 13 14:41:35    kernel        processor eflags = resume, IOPL = 0
Aug 13 14:41:35    kernel        = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 13 14:41:35    kernel        code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 13 14:41:35    kernel        frame pointer = 0x28:0xfffffe002a6d1300
Aug 13 14:41:35    kernel        stack pointer = 0x28:0xfffffe002a6d12c0
Aug 13 14:41:35    kernel        instruction pointer = 0x20:0xffffffff80daae47
Aug 13 14:41:35    kernel        fault code = supervisor read data, page not present
Aug 13 14:41:35    kernel        fault virtual address = 0x7835778b0
Aug 13 14:41:35    kernel        cpuid = 0; apic id = 27fff0
Aug 13 14:41:35    kernel        Fatal trap 12: page fault while in kernel mode
Aug 13 14:41:35    kernel        kernel trap 12 with interrupts disabled
Aug 13 14:41:35    syslogd        kernel boot file is /boot/kernel/kernel

My pfSense VM has also crashed in the past but unfortunately it froze as well and required a hard reset from Proxmox. I'll keep running Proxmox for a while but I'm pretty close to reluctantly moving my VMs across to VMware ESXi until the problem is identified and fixed.

BarTouZ · Aug 13, 2022

Are you sure that VMware ESXi solves the problem we are currently encountering with the Proxmox VMs?

gyrex · Aug 15, 2022

BarTouZ said:
Are you sure that VMware ESXi solves the problem we are currently encountering with the Proxmox VMs?

I have no idea what the issue is but I can't have my router constantly lock up with kernel panics - my wife is complaining non-stop.

I've moved my VMs onto VMware ESXi so I'll report back if I see them crash on it. If I don't see anything in a week, that's probably a good sign.

BarTouZ · Aug 15, 2022

I am in exactly the same situation as you. Having received my new N5105 two days before the holidays, I figured I would migrate to the new box and, if it went badly, that would be OK. Except that, of course, while we were on vacation everything crashed and the whole setup was called into question by my wife.

So for the moment I've put my production back on my old J1900. They are the same VMs, and they have been running for 2 weeks now without a single crash. I launched the same VMs on the N5105 and I struggle to make them last 24 hours...

It's really a strange problem. On this forum (https://forums.servethehome.com/index.php?threads/topton-jasper-lake-quad-i225v-mini-pc-report.36699/page-26) someone apparently managed to run pfSense for 8 days...

Anyway, we never have peace of mind...

BarTouZ · Aug 16, 2022

After applying the parameters given here, my pfSense and Ubuntu 22.04 VMs have just passed the 24-hour mark, which I had never reached before...

I'm continuing the test and hope it lasts the week...

And I have a constant temperature for now:

gyrex · Aug 16, 2022

BarTouZ said:
After applying the parameters given here, my pfSense and Ubuntu 22.04 VMs have just passed the 24-hour mark, which I had never reached before...

I'm continuing the test and hope it lasts the week...

What did you change? If you changed the machine type from i440fx to q35, I already tried that and it made no difference. Nothing in there looks different from what I tried. If you can let me know exactly what you changed, I can tell you whether it made a difference for me, because I tried many permutations.

It's not a temperature issue. I cleaned the CPU and reapplied new thermal paste, and after also changing the CPU governor from performance to powersave, I was sitting at a comfortable and constant 40-45C.

I'm running VMware ESXi and will report back on stability.
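
For readers who want to confirm which CPU governor a Linux host is actually using before and after a change like the one described above, here is a small read-only Python sketch. It assumes the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name `read_governors` is hypothetical, and the demo runs against a throwaway directory tree so it works on any machine.

```python
import tempfile
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its current scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


# Demo against a fake sysfs tree (an assumption for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    for n in range(2):
        cpufreq_dir = Path(tmp, f"cpu{n}", "cpufreq")
        cpufreq_dir.mkdir(parents=True)
        (cpufreq_dir / "scaling_governor").write_text("powersave\n")
    demo = read_governors(tmp)

print(demo)
```

On a real host, calling `read_governors()` with no argument reads the live sysfs values; the sketch never writes to them.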

BarTouZ · Aug 16, 2022

I changed these settings here:

And I get:

pfSense

Ubuntu:

Whether it works or not, I don't know, but in any case it's the first time my 3 VMs have lasted 24 hours...

jarodmerle · Aug 17, 2022

I was pointed to this thread from a similar one I had created reporting the same issue. Lots of good info for me as a newb; much appreciated.

I was curious if anyone experiencing this problem has tried something as simple as scheduling a reboot of their VMs (or even the Proxmox host itself) daily in the middle of the night as a stop-gap measure? I'm sure that may not be feasible for some scenarios, but given my limited use-case in a basic home setup (OPNSense and a couple of Ubuntu server VMs), it seems to me that might be preferable to having the VMs hang randomly in the middle of the day after a few days of uptime (I've never seen mine crash after less than 24 hours of uptime, unlike some others here).
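
The scheduled-reboot stop-gap suggested above could be sketched roughly as follows for a Linux guest. This is only an illustration under stated assumptions: the function names, the 20-hour threshold, and the 03:00-05:00 quiet window are all hypothetical choices, and the sketch only makes the decision; it does not reboot anything.

```python
from datetime import datetime


def parse_uptime_seconds(proc_uptime_text):
    """/proc/uptime holds two floats; the first is seconds since boot."""
    return float(proc_uptime_text.split()[0])


def should_reboot(uptime_seconds, now, min_uptime_hours=20, window=(3, 5)):
    """True when uptime exceeds the threshold and `now` is in the quiet window.

    `min_uptime_hours` and `window` (start hour inclusive, end hour exclusive)
    are illustrative defaults, not recommendations.
    """
    start_hour, end_hour = window
    in_window = start_hour <= now.hour < end_hour
    return in_window and uptime_seconds >= min_uptime_hours * 3600


# Example with fixed inputs: 26 hours of uptime, checked at 04:00.
uptime = parse_uptime_seconds("93600.12 180000.50")
print(should_reboot(uptime, datetime(2022, 8, 17, 4, 0)))  # True
```

In practice the text would come from reading `/proc/uptime`, and a caller running as root on a systemd-based guest could follow a `True` result with `systemctl reboot`. A plain cron entry that reboots at a fixed time achieves much the same thing without any script, and FreeBSD-based guests such as OPNsense would need a different uptime source.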

gyrex · Aug 17, 2022

BarTouZ said:
I changed these settings here:

View attachment 40024

And I get:

pfSense

View attachment 40025

Ubuntu:

View attachment 40026
View attachment 40027

Whether it works or not, I don't know, but in any case it's the first time my 3 VMs have lasted 24 hours...

I already tried passing through the host CPU type and using q35, but the VMs still froze. I'm not sure about ballooning.

The VMs freeze at random intervals (from hours to days). My pfSense VM managed to run for a week before it froze again, but that's just not stable enough for me.

I'm reading through the thread over at ServeTheHome; there might be some interesting info in there.
