Hello everyone,
Unfortunately, my first forum thread has to be about an issue. Any help is greatly appreciated.
Three weeks ago I purchased 2x CWWK/Topton quad-NIC Intel i226-V, Intel N100 Mini PCs.
Each has been specced with the following components:
Crucial P3 Plus 2TB NVMe (ZFS)
Crucial BX500 SATA SSD (Boot drive, EXT4)
Crucial DDR5-4800 SODIMM 32GB
One of the nodes doesn't survive a full 24 hours without crashing, while the other runs buttery smooth with 80+ hours of uptime during the testing phase.
Testing phase consists of:
Cluster
2x CWWK/Topton quad-NIC Intel i226-V, Intel N100 Mini PCs
1x Corosync-Qdevice to maintain quorum
2x OPNsense VMs (one on each node), with local LVM storage on the boot drive, no Proxmox HA; the two instances run simultaneously as Master/Slave via pfSync and CARP VIPs.
2x distinct Oracle Linux VMs (one on each node), running on ZFS storage with replication (this is to test replication and live migration), fresh install, no extra packages (rough commands below).
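For reference, the replication jobs and the live-migration test are set up roughly like this; the VMIDs, node names and schedule below are just examples from my setup:
Code:
# storage replication job for one of the Oracle Linux VMs, replicating to the other node every 15 minutes
root@pve2:~# pvesr create-local-job 104-0 pve1 --schedule "*/15"
root@pve2:~# pvesr status

# online migration test of the same VM (its disk lives on the replicated ZFS storage)
root@pve2:~# qm migrate 104 pve1 --online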
Network Topology:
Node1 Node2
eth0--------------------------------------------------WAN switch-------------------------------------------------eth0
eth1--------------------------------------------------LAN switch--------------------------------------------------eth1
eth2--------------------------------------------------------------------------------------------------------------------eth2 (directly connected, dedicated Cluster interface/network)
eth3--------------------------------------------------------------------------------------------------------------------eth3 (directly connected, dedicated pfSync interface/network,
also secondary Cluster Link)
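For completeness, the two dedicated links translate to a two-link corosync setup; the excerpt below shows roughly what the nodelist looks like (node names and addresses are placeholders for the eth2/eth3 networks):
Code:
# /etc/pve/corosync.conf (excerpt) - link0 on eth2, link1 on eth3; addresses are examples
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # eth2, dedicated cluster network
    ring1_addr: 10.10.20.1   # eth3, pfSync network / secondary cluster link
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2   # eth2
    ring1_addr: 10.10.20.2   # eth3
  }
}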
Behavior:
1. Sometimes the node doesn't fully crash, but within 6-18 h the node itself, as well as all VMs and storage, is displayed with a gray question mark. The node GUI is still reachable and responsive, as are the VMs and the VM/node console outputs. Restarting pvestatd fixes it for about 5-7 minutes; after that, the gray question mark returns (the exact commands are sketched after this list). Rebooting the node via the shell doesn't go smoothly at all: the node becomes unresponsive and has to be powered off via hardware.
2. More often than not, the affected node just crashes without a single journalctl entry. Power LED on, NIC LEDs on, no ping, no video.
3. Within the 6-18 h of uptime I see two recurring error logs; the full snippets are below. The node doesn't always crash when these logs start appearing, but sometimes they are the last entries I see before a crash.
ERROR LOG 1 - BUG: unable to handle page fault for address: (most common)
ERROR LOG 2 - segfault(s)
Full output submitted in first comment.
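For behavior 1: restarting pvestatd and checking on it is just the standard systemd routine, roughly:
Code:
root@pve2:~# systemctl restart pvestatd
root@pve2:~# systemctl status pvestatd
# the gray question marks disappear, but return after ~5-7 minutes
root@pve2:~# journalctl -u pvestatd --since "1 hour ago"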
What I have tried:
1. Multiple kernels for Proxmox 8.1.4, namely 6.5.11-4, 6.5.11-7, and 6.5.11-8 (the kernel-pinning commands are sketched after this list)
2. Running the node isolated and unclustered, without ZFS: 1x OPNsense VM, 1x Oracle Linux on the NVMe drive (configured as EXT4, local LVM), 1x Oracle Linux on the SATA SSD (configured as EXT4, local LVM)
3. Swapping components (NVMe, SATA, RAM) between the nodes; the error stays with the host and doesn't migrate with the components.
4. Reinstalling PVE, 5-6 times
5. Memtest overnight with one of the DIMMs; it passed with flying colours
6. I am aware of the post below, but the BIOS doesn't offer any options for On-Die ECC or anything similar: https://forum.proxmox.com/threads/pve-freezes-during-backup-job.134848/#post-613511
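Regarding point 1: switching between kernels can be done with proxmox-boot-tool; this is roughly what I ran (the version strings are just examples of what is installed on the node):
Code:
root@test:~# proxmox-boot-tool kernel list
root@test:~# proxmox-boot-tool kernel pin 6.5.11-4-pve   # boot the older kernel by default
root@test:~# reboot
# later, to return to the newest installed kernel:
root@test:~# proxmox-boot-tool kernel unpin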
Output of pveversion -v
Code:
root@test:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
Code:
root@pve2:~# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
zfs                55.2G  1.70T    96K  /zfs
zfs/vm-103-disk-0  2.20G  1.70T  2.20G  -
zfs/vm-104-disk-0  53.0G  1.75T  2.21G  -