[SOLVED] J3455 Compulab Fitlet 2 - Kernel crash / OOPS since latest 5.11.22-4-pve kernel

funtowne · Sep 9, 2021

Hi everyone,

First post; I'll try to provide as much as I can.

A stock install of 7.0, fresh off the ISO, was running stable for me for a few weeks. I then did the usual process of updating packages and rebooting. Now my prox host crashes randomly about once every 6-18 hours, even with minimal workload and for no apparent reason.

First, some background on my hardware:

CompuLab Fitlet2
Intel Celeron J3455
8 Gigs DDR 1866 RAM
Transcend mSATA M.2 SSD
Additional igb network adapters (via FACET card) beyond the two on board
Latest BIOS installed

And my Prox environment:
I am running two LXC containers, both on Debian 11
The only VM running is opnsense on the near-latest version - my box has 1.5 - 2 gigs free RAM at all times
lvm-thin provisioning; XFS for root (I also had similar crashes when using root on zfs) - 4 gig swap partition
5.11.22-4-pve #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) x86_64 GNU/Linux
Latest intel microcode package from debian non-free (also crashed without microcode update)
processor.max_cstate=1 intel_idle.max_cstate=1 set as boot options - when the system was stable, it was only stable with these enabled... similar to the braswell c-state bug
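
To confirm that boot options like the ones above actually reached the running kernel, one option is to read /proc/cmdline and look the parameters up. Below is a minimal Python sketch of that check. The helper name parse_cmdline is made up for illustration, and the sketch does not handle quoted values that contain spaces.

```python
def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' (for example 'nopti') map to None.
    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# On the host itself (assumption: a Linux system with /proc mounted):
#   from pathlib import Path
#   running = parse_cmdline(Path("/proc/cmdline").read_text())
#   print(running.get("processor.max_cstate"), running.get("intel_idle.max_cstate"))
```

A parameter missing from the returned dict means the running kernel never received it, which usually points to the bootloader config not having been regenerated after the edit.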

The pain

My most recent crash is below. I have tried setting aio=native for my sole VM based on a few other posts I saw, however that has not helped the situation much. My two crash logs are attached - I managed to capture both via netconsole without issue. I've also posted my vm config, dmesg output and pveversion --verbose output. Again, the crashes are happening as the VM (and prox host) sits mostly idle as I have moved my production opnsense install to another box.

Crash #1 is attached for length considerations; aio=native was not set.

[96996.156090] BUG: unable to handle page fault for address: ffffffff9b32e5e0
[96996.156138] #PF: supervisor instruction fetch in kernel mode
[96996.156151] #PF: error_code(0x0010) - not-present page
[96996.156162] PGD 1f5215067 P4D 1f5215067 PUD 1f5216063 PMD 0
[96996.156182] Oops: 0010 [#1] SMP NOPTI
[96996.156197] CPU: 0 PID: 6940 Comm: kvm Tainted: P W O 5.11.22-4-pve #1
[96996.156213] Hardware name: N/A N/A/N/A, BIOS FLT2.NBR.0.46.02.01 03/07/2021
[96996.156224] RIP: 0010:0xffffffff9b32e5e0
[96996.156242] Code: Unable to access opcode bytes at RIP 0xffffffff9b32e5b6.
[96996.156253] RSP: 0018:ffffb58c05397f00 EFLAGS: 00010246
[96996.156266] RAX: 0000000000000000 RBX: ffff8bae8a38de01 RCX: 0000000000000000
[96996.156278] RDX: 000000000000ae80 RSI: 000000000000001a RDI: ffff8bae8a38de00
[96996.156290] RBP: ffffb58c05397f30 R08: 0000000000004000 R09: 000000000000001a
[96996.156301] R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000001a
[96996.156312] R13: 000000000000ae80 R14: 0000000000000000 R15: ffff8bae8a38de00
[96996.156324] FS: 00007f4ccbfff700(0000) GS:ffff8baff7c00000(0000) knlGS:0000000000000000
[96996.156339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[96996.156351] CR2: ffffffff9b32e5b6 CR3: 0000000103db6000 CR4: 00000000003526f0
[96996.156366] Call Trace:
[96996.156377] ? __x64_sys_ioctl+0x6f/0xc0
[96996.156397] do_syscall_64+0x38/0x90
[96996.156414] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[96996.156430] RIP: 0033:0x7f4cdaf98cc7
[96996.156443] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[96996.156464] RSP: 002b:00007f4ccbffa288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[96996.156479] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4cdaf98cc7
[96996.156490] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001a
[96996.156501] RBP: 000055e6a1485c90 R08: 000055e69f810b38 R09: 00000000ffffffff
[96996.156512] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[96996.156523] R13: 000055e69fc61e60 R14: 0000000000000000 R15: 0000000000000000
[96996.156539] Modules linked in: nft_counter nft_chain_nat cfg80211 nft_compat nf_tables 8021q garp mrp veth tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core x86_pkg_temp_thermal kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel mei_hdcp aesni_intel crypto_simd cryptd glue_helper pcspkr efi_pstore at24 rapl intel_cstate 8250_dw i915 drm_kms_helper cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me intel_xhci_usb_role_switch mei mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp drm sunrpc ip_tables x_tables
[96996.156746] autofs4 xfs btrfs blake2b_generic xor raid6_pq netconsole dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c lpc_ich crc32_pclmul i2c_i801 i2c_smbus ahci igb xhci_pci xhci_pci_renesas i2c_algo_bit intel_lpss_pci intel_lpss idma64 xhci_hcd virt_dma libahci dca video intel_pmc_bxt pinctrl_broxton
[96996.156888] CR2: ffffffff9b32e5e0
[96996.156900] ---[ end trace d2ba67674b61f60c ]---
[96996.156901] BUG: unable to handle page fault for address: ffffffff9b32e5e0
[96996.156995] #PF: supervisor instruction fetch in kernel mode
[96996.159804] #PF: error_code(0x0010) - not-present page
[96996.159816] PGD 1f5215067 P4D 1f5215067 PUD 1f5216063 PMD 0
[96996.159836] Oops: 0010 [#2] SMP NOPTI
[96996.159849] CPU: 1 PID: 6941 Comm: kvm Tainted: P D W O 5.11.22-4-pve #1
[96996.159864] Hardware name: N/A N/A/N/A, BIOS FLT2.NBR.0.46.02.01 03/07/2021
[96996.159875] RIP: 0010:0xffffffff9b32e5e0
[96996.159891] Code: Unable to access opcode bytes at RIP 0xffffffff9b32e5b6.
[96996.159902] RSP: 0018:ffffb58c0532bf00 EFLAGS: 00010246
[96996.159916] RAX: 0000000000000000 RBX: ffff8bae8a38dd01 RCX: 0000000000000000
[96996.159928] RDX: 000000000000ae80 RSI: 000000000000001b RDI: ffff8bae8a38dd00
[96996.159940] RBP: ffffb58c0532bf30 R08: 0000000000004000 R09: 000000000000001b
[96996.159951] R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000001b
[96996.159962] R13: 000000000000ae80 R14: 0000000000000000 R15: ffff8bae8a38dd00
[96996.159974] FS: 00007f4ccb7fe700(0000) GS:ffff8baff7c80000(0000) knlGS:0000000000000000
[96996.159988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[96996.160000] CR2: ffffffff9b32e5b6 CR3: 0000000103db6000 CR4: 00000000003526e0
[96996.160012] Call Trace:
[96996.160022] ? __x64_sys_ioctl+0x6f/0xc0
[96996.160039] do_syscall_64+0x38/0x90
[96996.162276] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[96996.165078] RIP: 0033:0x7f4cdaf98cc7
[96996.165093] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[96996.165114] RSP: 002b:00007f4ccb7f9288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[96996.165132] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4cdaf98cc7
[96996.165144] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001b
[96996.165156] RBP: 000055e6a14bca90 R08: 000055e69f810b38 R09: 00000000ffffffff
[96996.165167] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[96996.165179] R13: 000055e69fc61e60 R14: 0000000000000000 R15: 0000000000000000
[96996.165194] Modules linked in: nft_counter nft_chain_nat cfg80211 nft_compat nf_tables 8021q garp mrp veth tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core x86_pkg_temp_thermal kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel mei_hdcp aesni_intel crypto_simd cryptd glue_helper pcspkr efi_pstore at24 rapl intel_cstate 8250_dw i915 drm_kms_helper cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me intel_xhci_usb_role_switch mei mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp drm sunrpc ip_tables x_tables
[96996.170304] autofs4 xfs btrfs blake2b_generic xor raid6_pq netconsole dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c lpc_ich crc32_pclmul i2c_i801 i2c_smbus ahci igb xhci_pci xhci_pci_renesas i2c_algo_bit intel_lpss_pci intel_lpss idma64 xhci_hcd virt_dma libahci dca video intel_pmc_bxt pinctrl_broxton
[96996.175287] CR2: ffffffff9b32e5e0
[96996.175302] ---[ end trace d2ba67674b61f60d ]---
[96996.238182] RIP: 0010:0xffffffff9b32e5e0
[96996.238202] RIP: 0010:0xffffffff9b32e5e0
[96996.238228] Code: Unable to access opcode bytes at RIP 0xffffffff9b32e5b6.
[96996.238244] Code: Unable to access opcode bytes at RIP 0xffffffff9b32e5b6.
[96996.238250] RSP: 0018:ffffb58c05397f00 EFLAGS: 00010246
[96996.238262] RSP: 0018:ffffb58c05397f00 EFLAGS: 00010246
[96996.238276] RAX: 0000000000000000 RBX: ffff8bae8a38de01 RCX: 0000000000000000
[96996.238286] RAX: 0000000000000000 RBX: ffff8bae8a38de01 RCX: 0000000000000000
[96996.238298] RDX: 000000000000ae80 RSI: 000000000000001a RDI: ffff8bae8a38de00
[96996.238308] RDX: 000000000000ae80 RSI: 000000000000001a RDI: ffff8bae8a38de00
[96996.238318] RBP: ffffb58c05397f30 R08: 0000000000004000 R09: 000000000000001a
[96996.238329] RBP: ffffb58c05397f30 R08: 0000000000004000 R09: 000000000000001a
[96996.238339] R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000001a
[96996.238348] R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000001a
[96996.238358] R13: 000000000000ae80 R14: 0000000000000000 R15: ffff8bae8a38de00
[96996.238368] R13: 000000000000ae80 R14: 0000000000000000 R15: ffff8bae8a38de00
[96996.238379] FS: 00007f4ccbfff700(0000) GS:ffff8baff7c00000(0000) knlGS:0000000000000000
[96996.238388] FS: 00007f4ccb7fe700(0000) GS:ffff8baff7c80000(0000) knlGS:0000000000000000
[96996.238399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[96996.238410] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[96996.238419] CR2: ffffffff9b32e5b6 CR3: 0000000103db6000 CR4: 00000000003526f0
[96996.238428] CR2: ffffffff9b32e5b6 CR3: 0000000103db6000 CR4: 00000000003526e0
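
When comparing several captures like the one above, it can help to reduce each report to a few header fields. The following Python sketch does that. The function name summarize_oops and the regular expressions are illustrative assumptions that match the wording in this particular log; other kernel versions may format these lines differently.

```python
import re

# Patterns for a few header lines of an x86 oops report, modelled on the
# log above. They are assumptions about the format, not a kernel interface.
FAULT_RE = re.compile(r"BUG: unable to handle page fault for address: ([0-9a-f]+)")
OOPS_RE = re.compile(r"Oops: ([0-9a-f]+) \[#(\d+)\]")
CPU_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_oops(log: str) -> list:
    """Return one dict per page-fault oops found in the log text."""
    reports = []
    current = None
    for line in log.splitlines():
        m = FAULT_RE.search(line)
        if m:
            # A new "BUG:" line starts a new report.
            current = {"fault_address": m.group(1)}
            reports.append(current)
            continue
        if current is None:
            continue
        m = OOPS_RE.search(line)
        if m:
            current["error_code"] = m.group(1)
            current["oops_number"] = int(m.group(2))
            continue
        m = CPU_RE.search(line)
        if m:
            current["cpu"] = int(m.group(1))
            current["pid"] = int(m.group(2))
            current["comm"] = m.group(3)
    return reports
```

Run against the capture above, this yields two reports, both for the kvm process and both faulting on the same address, which makes it easier to see at a glance that the two vCPU threads hit the same problem.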

agent: 1
balloon: 0
boot: order=scsi0;net0
cores: 3
cpu: host
hotplug: 0
machine: q35
memory: 4096
name: opnsense
net0: virtio=[MACADDRESS],bridge=vmbr1
net1: virtio=[MACADDRESS],bridge=vmbr2
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-100-disk-0,cache=none,discard=on,iothread=1,size=26G,ssd=1,aio=native
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=632d2c94-4981-4810-8151-68ed2c125b70
sockets: 1
startup: order=1
vmgenid: f24a78c1-d0ea-464d-9719-5c0baf54730b

proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.11.22-4-pve: 5.11.22-8
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

I've run stress-ng (stressing CPU, HDD, and memory) and Passmark's stress testing tool to try to trigger the crash early - both run super stable and the system remains completely operational. This leads me to believe, with a very limited education on the matter, that the VM is doing something cheeky and OOPSing or panicking the kernel.

I am currently going on a whim and temporarily disabling all Spectre / PTI mitigations (mitigations=off on the kernel command line via grub) on the chance I have some sort of BIOS or microcode bug biting me. My next debug step, if the mitigation thing doesn't work out, is to downgrade to the oldest 5.11 kernel that I can, e.g. the stock one on the installer ISO. The kernel on the prox ISO was super stable, which leads me to believe that some sort of backported fix is causing my issues.

Edit: The VM in question has been running stable for a week on another, differently-specced proxmox install with the same package list as the crashing host.

Edit 2: Noted 4 gig swap partition

The reason for this post: if anyone on the forum recognizes anything in the crash logs above that could be helpful.

Edit 3: Update: Solved! nopti needed to be set as a kernel boot option, as the kernel mitigation and the microcode were seemingly in conflict with each other. The kernel should have disabled its own PTI mitigation based on the capabilities the CPU reports, but this wasn't happening, it seems..? Microcode revision 0x32 (for Apollo Lake) or newer needs to be either baked into the BIOS of the PC or loaded by installing intel-microcode from the Debian non-free repo to negate the need for kernel PTI.
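
For anyone wanting to check the same two things on their own box, here is a rough Python sketch. It reads the Meltdown status the kernel exposes under /sys/devices/system/cpu/vulnerabilities/meltdown and the microcode revision lines from /proc/cpuinfo. The function names are made up for illustration, and the sketch assumes the usual status strings ("Not affected", "Mitigation: PTI", "Vulnerable").

```python
from pathlib import Path

MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def pti_active(meltdown_status: str) -> bool:
    """True if the sysfs Meltdown status reports PTI as the active mitigation."""
    return meltdown_status.strip() == "Mitigation: PTI"


def microcode_revisions(cpuinfo: str) -> set:
    """Collect the distinct microcode revisions listed in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # /proc/cpuinfo prints the revision in hex, e.g. "0x32".
            revisions.add(int(value.strip(), 16))
    return revisions


# On the host itself (assumption: Linux with /sys and /proc mounted):
#   print(pti_active(MELTDOWN_SYSFS.read_text()))
#   print({hex(r) for r in microcode_revisions(Path("/proc/cpuinfo").read_text())})
```

With nopti on the command line, the first check should come back False, and the second shows whether the loaded microcode is at or above the revision mentioned above.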

mira · Sep 9, 2021

Is your BIOS up-to-date? How does the hardware of this one differ from your other install?

funtowne · Sep 9, 2021

mira said:
Is your BIOS up-to-date? How does the hardware of this one differ from your other install?

BIOS is 100% up to date with the latest from Compulab (as of my post yesterday).

The other install is 100% identical in terms of OS packages and partitioning. It is, however, on a Supermicro Xeon-D-based motherboard with an NVME boot drive. More or less apples to oranges with the hardware.

Supermicro Xeon 1518-D-based motherboard
32 gigs ECC ram
NVME boot drive
Latest prox updates as of this post, non-subscription repo
Latest Intel microcode from non-free Debian repo
LVM-thin disk; XFS root; 4 gigs swap partition (same as the crashing little box)

proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.11.22-4-pve: 5.11.22-8
ceph-fuse: 15.2.13-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
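
Whether two hosts really carry the same package set can be checked mechanically by diffing their pveversion --verbose output. A small Python sketch under that assumption follows. The helper names are hypothetical, and the parser only expects "name: version" lines like the ones above.

```python
def parse_versions(text: str) -> dict:
    """Map package name -> version string from 'name: version' lines."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(":")
        if sep and name.strip():
            versions[name.strip()] = version.strip()
    return versions


def diff_versions(a: str, b: str) -> dict:
    """Return {package: (version_a, version_b)} for every package that differs.

    A package missing on one side is reported with None for that side.
    """
    va, vb = parse_versions(a), parse_versions(b)
    return {
        pkg: (va.get(pkg), vb.get(pkg))
        for pkg in sorted(set(va) | set(vb))
        if va.get(pkg) != vb.get(pkg)
    }
```

An empty result from diff_versions means the two listings match line for line, which is what the two outputs in this thread are expected to show.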

The dmesg of the stable box (attached, it's long) is a bit noisy from the IOMMU feedback. To wit, running in parallel to opnsense on this box is a long-running FreeBSD install with my AHCI host adapter passed through - I don't have the time or heart to migrate away from this setup yet! In theory, that extra complexity should make this box the more crash-prone of the two if the opnsense VM itself were the issue.

The crashing box is my "in the wiring closet" router, the Xeon-D-based other box is acting as my nas and now temporary router until I get this crash sorted.

Edit: the opnsense VM was backed up from the crashing box and moved / restored to the stable other box as-is.

mira · Sep 9, 2021

Could you try the latest PVE 6.4 with the 5.4 kernel? If it works fine with the 5.4 kernel, we at least know that it's a regression.
And it seems Compulab provides instructions for some things for Linux Mint, which is based on Ubuntu 20.04 which the PVE 6.4 kernel is also based on.

funtowne · Sep 9, 2021

mira said:
Could you try the latest PVE 6.4 with the 5.4 kernel? If it works fine with the 5.4 kernel, we at least know that it's a regression.
And it seems Compulab provides instructions for some things for Linux Mint, which is based on Ubuntu 20.04 which the PVE 6.4 kernel is also based on.

Happy to do this if my mitigations=off test fails. Given past crashes, I should know by tomorrow.

Is there any way to force the 5.4 kernel to install on 7.0 by chance?

mira · Sep 9, 2021

No, the 5.4 kernel is only available for PVE 6.
It might be possible to install it manually, but haven't tested it here.

You can find the packages here: http://download.proxmox.com/debian/pve/dists/buster/pve-no-subscription/binary-amd64/

funtowne · Sep 10, 2021

mira said:
No, the 5.4 kernel is only available for PVE 6.
It might be possible to install it manually, but haven't tested it here.

You can find the packages here: http://download.proxmox.com/debian/pve/dists/buster/pve-no-subscription/binary-amd64/

Some good news.

Turning off the Spectre etc. mitigations and setting nopti resulted in my first 24h uptime since these issues started. I’ll continue to let things run as long as they can before I fall back to the 5.4 kernel.

If the mitigations or nopti are the issue, what data would be useful for the proxmox team? I saw the crashes both with my BIOS’s microcode (revision 40) and with the one from the intel-microcode package.

funtowne · Sep 11, 2021

Jumping in on my own thread here:

Based on the stability with the mitigations=off and nopti, I've moved my routing traffic back to the (formerly?) crash-prone box. This is in an effort to stress the environment and make it do _something_ naughty, if possible. I'm still keen to provide whatever I can to help get to the root cause, but for now I am happy to have my network seemingly sorted.

I'm not too keen to run with these mitigations turned off; however, the risk in reality is pretty low given the machine is only running an SSH jump host in an LXC and unifi also in an LXC (both Debian 11, updated from 10).

funtowne · Sep 13, 2021

I'm now running stable with only nopti; mitigations=off has been disabled. Per Intel, the Apollo Lake series chips are microcode-protected from Meltdown, making kernel PTI redundant. I'll post a final update in a few days should this fix stick.

funtowne · Sep 15, 2021

I'll mark this as solved. 3 days of solid uptime with nopti set at boot time. @mira could you confirm whether the kernel is supposed to disable the kernel-level PTI mitigation for Meltdown if the microcode / CPU being used is already safe? A thread on the Intel community site seems to claim such.
