Hello everyone,
Unfortunately, my first forum thread has to be about an issue. Any help is greatly appreciated.
Three weeks ago I purchased 2x CWWK/Topton quad-NIC Intel i226-V, Intel N100 Mini PCs.
Each has been specced with the following components:
Crucial P3 Plus 2TB NVMe (ZFS)
Crucial BX500 SATA SSD (Boot drive, EXT4)
Crucial DDR5-4800 SODIMM 32GB
One of the nodes doesn't survive a full 24 hours without crashing, while the other runs buttery smooth with 80+ hours of uptime during the testing phase.
Testing phase consists of:
Cluster
2x CWWK/Topton quad-NIC Intel i226-V, Intel N100 Mini PCs
1x Corosync-Qdevice to maintain quorum
2x OPNsense VMs (one on each node), with local LVM storage on the boot drive, no Proxmox HA; the two instances run simultaneously as Master/Slave via pfSync and CARP VIPs.
2x distinct Oracle Linux VMs (one on each node), running on ZFS storage with replication (this is to test replication and live migration), fresh install, no extra packages (rough commands below).
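For reference, the replication jobs and the live-migration test are set up roughly like this; the VMIDs, node names and schedule below are just examples from my setup:
Code:
# storage replication job for one of the Oracle Linux VMs, replicating to the other node every 15 minutes
root@pve2:~# pvesr create-local-job 104-0 pve1 --schedule "*/15"
root@pve2:~# pvesr status

# online migration test of the same VM (its disk lives on the replicated ZFS storage)
root@pve2:~# qm migrate 104 pve1 --online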
Network Topology:
Node1 Node2
eth0--------------------------------------------------WAN switch-------------------------------------------------eth0
eth1--------------------------------------------------LAN switch--------------------------------------------------eth1
eth2--------------------------------------------------------------------------------------------------------------------eth2 (directly connected, dedicated Cluster interface/network)
eth3--------------------------------------------------------------------------------------------------------------------eth3 (directly connected, dedicated pfSync interface/network,
also secondary Cluster Link)
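For completeness, the two dedicated links translate to a two-link corosync setup; the excerpt below shows roughly what the nodelist looks like (node names and addresses are placeholders for the eth2/eth3 networks):
Code:
# /etc/pve/corosync.conf (excerpt) - link0 on eth2, link1 on eth3; addresses are examples
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # eth2, dedicated cluster network
    ring1_addr: 10.10.20.1   # eth3, pfSync network / secondary cluster link
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2   # eth2
    ring1_addr: 10.10.20.2   # eth3
  }
}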
Behavior:
1. Sometimes the node doesn't fully crash, but within 6-18 h the node itself, as well as all VMs and storage, is displayed with a gray question mark. The node GUI is still reachable and responsive, as are the VMs and the VM/node console outputs. Restarting pvestatd fixes it for about 5-7 minutes; after that, the gray question mark returns (the exact commands are sketched after this list). Rebooting the node via the shell doesn't go smoothly at all: the node becomes unresponsive and has to be powered off via hardware.
2. More often than not, the affected node just crashes without a single journalctl entry. Power LED on, NIC LEDs on, no ping, no video.
3. Within the 6-18 h of uptime I see two recurring error logs; the full snippets are below. The node doesn't always crash when these logs start appearing, but sometimes they are the last entries I see before a crash.
ERROR LOG 1 - BUG: unable to handle page fault for address: (most common)
ERROR LOG 2 - segfault(s)
Full output submitted in first comment.
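For behavior 1: restarting pvestatd and checking on it is just the standard systemd routine, roughly:
Code:
root@pve2:~# systemctl restart pvestatd
root@pve2:~# systemctl status pvestatd
# the gray question marks disappear, but return after ~5-7 minutes
root@pve2:~# journalctl -u pvestatd --since "1 hour ago"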
What I have tried:
1. Multiple kernels for Proxmox 8.1.4, namely 6.5.11-4, 6.5.11-7, and 6.5.11-8 (the kernel-pinning commands are sketched after this list)
2. Running the node isolated and unclustered, without ZFS: 1x OPNsense VM, 1x Oracle Linux on the NVMe drive (configured as EXT4, local LVM), 1x Oracle Linux on the SATA SSD (configured as EXT4, local LVM)
3. Swapping components (NVMe, SATA, RAM) between the nodes; the error stays with the host and doesn't migrate with the components.
4. Reinstalling PVE, 5-6 times
5. Memtest overnight with one of the DIMMs; it passed with flying colours
6. I am aware of the post below, but the BIOS doesn't offer any options for On-Die ECC or anything similar: https://forum.proxmox.com/threads/pve-freezes-during-backup-job.134848/#post-613511
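Regarding point 1: switching between kernels can be done with proxmox-boot-tool; this is roughly what I ran (the version strings are just examples of what is installed on the node):
Code:
root@test:~# proxmox-boot-tool kernel list
root@test:~# proxmox-boot-tool kernel pin 6.5.11-4-pve   # boot the older kernel by default
root@test:~# reboot
# later, to return to the newest installed kernel:
root@test:~# proxmox-boot-tool kernel unpin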
Output of pveversion -v
Code:
root@test:~# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
Code:
root@pve2:~# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
zfs                55.2G  1.70T    96K  /zfs
zfs/vm-103-disk-0  2.20G  1.70T  2.20G  -
zfs/vm-104-disk-0  53.0G  1.75T  2.21G  -