System lock up at the PVE level.

Jun 17, 2022
System:
NUC8 i7
32 GB RAM
2 TB disk
PVE 7.3-3
One VM: Windows 11, new install

Problem:
I installed the system two weeks ago. While the system is running without VMs running, it is rock stable. It can run for days without issues. Once the VM is booted and idle for more than a couple of hours, the system locks up. I have some of the kernel log entries from this I can share. The behavior I noticed was that the web interface was not responding. The only way I could get it back was to power cycle the PC. This is an unsubscribed version as I am testing some functionality.

Kern.Log
Dec 19 10:01:48 PVEYoda kernel: [ 8.010958] e1000e 0000:00:1f.6 eno1: NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
Dec 19 10:01:48 PVEYoda kernel: [ 8.010962] e1000e 0000:00:1f.6 eno1: 10/100 speed: disabling TSO
Dec 19 10:01:48 PVEYoda kernel: [ 8.011033] vmbr0: port 1(eno1) entered blocking state
Dec 19 10:01:48 PVEYoda kernel: [ 8.011035] vmbr0: port 1(eno1) entered forwarding state
Dec 19 10:01:48 PVEYoda kernel: [ 8.011163] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Dec 19 10:01:58 PVEYoda kernel: [ 17.528375] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Dec 20 15:24:30 PVEYoda kernel: [105769.791171] device tap200i0 entered promiscuous mode
Dec 20 15:24:30 PVEYoda kernel: [105769.801950] vmbr0: port 2(tap200i0) entered blocking state
Dec 20 15:24:30 PVEYoda kernel: [105769.801953] vmbr0: port 2(tap200i0) entered disabled state
Dec 20 15:24:30 PVEYoda kernel: [105769.802054] vmbr0: port 2(tap200i0) entered blocking state
Dec 20 15:24:30 PVEYoda kernel: [105769.802055] vmbr0: port 2(tap200i0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.147596] vmbr0: port 2(tap200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.230799] vmbr0: port 2(fwpr200p0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.230802] vmbr0: port 2(fwpr200p0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.230844] device fwpr200p0 entered promiscuous mode
Dec 20 15:35:37 PVEYoda kernel: [106436.230867] vmbr0: port 2(fwpr200p0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.230868] vmbr0: port 2(fwpr200p0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.238750] fwbr200i0: port 1(fwln200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.238753] fwbr200i0: port 1(fwln200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.238798] device fwln200i0 entered promiscuous mode
Dec 20 15:35:37 PVEYoda kernel: [106436.238825] fwbr200i0: port 1(fwln200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.238826] fwbr200i0: port 1(fwln200i0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.246701] fwbr200i0: port 2(tap200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.246703] fwbr200i0: port 2(tap200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.246762] fwbr200i0: port 2(tap200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.246763] fwbr200i0: port 2(tap200i0) entered forwarding state
Dec 21 07:11:05 PVEYoda kernel: [ 0.000000] Linux version 5.15.74-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) ()
Messages:
Dec 19 10:01:48 PVEYoda kernel: [ 8.010962] e1000e 0000:00:1f.6 eno1: 10/100 speed: disabling TSO
Dec 19 10:01:48 PVEYoda kernel: [ 8.011033] vmbr0: port 1(eno1) entered blocking state
Dec 19 10:01:48 PVEYoda kernel: [ 8.011035] vmbr0: port 1(eno1) entered forwarding state
Dec 19 10:01:48 PVEYoda kernel: [ 8.011163] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Dec 19 10:01:58 PVEYoda kernel: [ 17.528375] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Dec 20 15:24:30 PVEYoda kernel: [105769.791171] device tap200i0 entered promiscuous mode
Dec 20 15:24:30 PVEYoda kernel: [105769.801950] vmbr0: port 2(tap200i0) entered blocking state
Dec 20 15:24:30 PVEYoda kernel: [105769.801953] vmbr0: port 2(tap200i0) entered disabled state
Dec 20 15:24:30 PVEYoda kernel: [105769.802054] vmbr0: port 2(tap200i0) entered blocking state
Dec 20 15:24:30 PVEYoda kernel: [105769.802055] vmbr0: port 2(tap200i0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.147596] vmbr0: port 2(tap200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.230799] vmbr0: port 2(fwpr200p0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.230802] vmbr0: port 2(fwpr200p0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.230844] device fwpr200p0 entered promiscuous mode
Dec 20 15:35:37 PVEYoda kernel: [106436.230867] vmbr0: port 2(fwpr200p0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.230868] vmbr0: port 2(fwpr200p0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.238750] fwbr200i0: port 1(fwln200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.238753] fwbr200i0: port 1(fwln200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.238798] device fwln200i0 entered promiscuous mode
Dec 20 15:35:37 PVEYoda kernel: [106436.238825] fwbr200i0: port 1(fwln200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.238826] fwbr200i0: port 1(fwln200i0) entered forwarding state
Dec 20 15:35:37 PVEYoda kernel: [106436.246701] fwbr200i0: port 2(tap200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.246703] fwbr200i0: port 2(tap200i0) entered disabled state
Dec 20 15:35:37 PVEYoda kernel: [106436.246762] fwbr200i0: port 2(tap200i0) entered blocking state
Dec 20 15:35:37 PVEYoda kernel: [106436.246763] fwbr200i0: port 2(tap200i0) entered forwarding state
Dec 21 07:11:05 PVEYoda lvm[422]: Monitoring thin pool pve-data-tpool.
Dec 21 07:11:05 PVEYoda kernel: [ 0.000000] Linux version 5.15.74-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) ()
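
Both logs go quiet at Dec 20 15:35:37 and pick up again with the boot banner at Dec 21 07:11:05, so nothing was written at the moment of the lockup itself. A rough Python sketch for spotting that kind of silent gap in a kern.log is below. The function names are made up for illustration, the year is an assumption (syslog timestamps do not carry one), and the gap is only an upper bound on how long the machine was actually hung.

```python
from datetime import datetime

def parse_ts(line, year=2022):
    # The first 15 characters of a syslog line look like "Dec 20 15:35:37".
    # The year is NOT in the log; it is an assumption supplied by the caller.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_silent_gaps(lines, year=2022):
    """Return (last_kernel_entry, boot_time) for every boot banner
    that follows earlier kernel entries."""
    gaps = []
    prev = None
    for line in lines:
        if " kernel: " not in line:
            continue  # skip non-kernel entries such as lvm
        ts = parse_ts(line, year)
        if "] Linux version" in line and prev is not None:
            gaps.append((prev, ts))
        prev = ts
    return gaps

# Two lines taken from the log above (boot banner shortened).
sample = [
    "Dec 20 15:35:37 PVEYoda kernel: [106436.246763] fwbr200i0: port 2(tap200i0) entered forwarding state",
    "Dec 21 07:11:05 PVEYoda lvm[422]: Monitoring thin pool pve-data-tpool.",
    "Dec 21 07:11:05 PVEYoda kernel: [ 0.000000] Linux version 5.15.74-1-pve (build@proxmox)",
]

for last, boot in find_silent_gaps(sample):
    print(f"last kernel entry {last}, next boot {boot}, silent for {boot - last}")
```

On the sample lines this reports a silent period of 15:35:28 between the last kernel entry and the next boot.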
 
Hi,

Have you checked whether the issue persists on other kernel versions?

The behavior I noticed was that the web interface was not responding.
Can you log in over SSH or through physical access? Or is the system frozen completely?
 
SSH fails, and a full Nmap scan doesn't show the system as alive (no ping response).
I have not tried other kernel versions. Do you have a link to instructions on how to try out other kernel versions?

Regards,

John
 
e1000e 0000:00:1f.6 eno1: NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
This looks suspicious; it should negotiate 1000 Mbit. Where is it plugged in? For old routers and switches it would be normal to have only 100 Mbit ports. If the device is not that old, it could be a power-saving mode (which can go as far as switching off the port entirely), and that could lead to the problem. Check whether the device offers something like that in its web UI and switch it off if possible.

Is the cable OK and plugged in tight? Breaks inside the cable (not visible from outside) can cause a downgrade to 100 Mbit and timeouts.
 
It is plugged into a 100 Mbit device. The system is currently set up with some old network gear I had sitting around. Another clue pointing away from networking: while the system is idle, there is some HDD activity. When it is locked up, there is none.
 
This is normal. There is always minimal activity, mostly logging, whether from the hypervisor or from Windows (indexing, automatic updates, etc.).


Can you post the output of smartctl -q noserial -x /dev/sdX for the disk?
root@PVEYoda:/dev# smartctl -q noserial -x /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KINGSTON SNVS2000G
Firmware Version: S8442105
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 685eba0125
Local Time is: Wed Dec 21 14:07:30 2022 PST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0016): Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     3.00W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     8000    3000
 4 -   0.0040W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 22 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 219,580 [112 GB]
Data Units Written: 272,000 [139 GB]
Host Read Commands: 2,567,867
Host Write Commands: 5,503,261
Controller Busy Time: 629
Power Cycles: 15
Power On Hours: 297
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
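
One detail worth pulling out of the output above is how many of the power cycles ended without a clean shutdown, since that matches the forced power cycles described earlier. A minimal Python sketch that reads those counters from smartctl text follows. `read_counter` is a hypothetical helper name, and it only works on pasted text; it does not call smartctl itself.

```python
import re

def read_counter(report, name):
    """Return the integer that follows 'name:' in smartctl text, or None."""
    match = re.search(rf"{re.escape(name)}:\s*([\d,]+)", report)
    if match is None:
        return None
    # smartctl prints thousands separators, e.g. "219,580".
    return int(match.group(1).replace(",", ""))

# Excerpt copied from the SMART data section above.
report = """\
Data Units Read: 219,580 [112 GB]
Power Cycles: 15
Power On Hours: 297
Unsafe Shutdowns: 8
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
"""

cycles = read_counter(report, "Power Cycles")
unsafe = read_counter(report, "Unsafe Shutdowns")
print(f"{unsafe} of {cycles} power cycles were unsafe shutdowns")
```

A high share of unsafe shutdowns with zero media errors fits a host that hangs and gets power cycled, rather than a disk that is reporting faults, though it does not prove either.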
 
Ok, that looks healthy.

Can you try to update the firmware?
When googling for "KINGSTON SNVS2000G linux bug", I immediately get hits describing exactly your problem (only a hard reset helps, etc.).
Maybe you have a translator on hand: https://www.heise.de/news/Kingston-SSD-A2000-Firmware-Update-behebt-Linux-Abstuerze-6030819.html
It would not be unusual for your model to have the same bug.

If this is already the latest firmware, then you could try switching off power saving. The bug appears in combination with deeper power states.
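
The deeper power states can be read from the "Supported Power States" table in the smartctl output posted earlier in this thread: states with `-` in the Op column are non-operational. A rough Python sketch that lists them is below. `PowerState` and `parse_power_states` are illustrative names, and the latencies are taken to be microseconds, as smartctl reports them. As an assumption that is not stated anywhere in this thread, the usual Linux knob for limiting these states is the `nvme_core.default_ps_max_latency_us` kernel parameter, where 0 disables autonomous transitions.

```python
from typing import NamedTuple

class PowerState(NamedTuple):
    index: int
    operational: bool      # "+" in the Op column
    max_power: str
    entry_latency_us: int  # Ent_Lat column
    exit_latency_us: int   # Ex_Lat column

def parse_power_states(table):
    """Parse the data rows of smartctl's 'Supported Power States' table."""
    states = []
    for row in table.strip().splitlines():
        # Columns: St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
        cols = row.split()
        states.append(PowerState(
            index=int(cols[0]),
            operational=(cols[1] == "+"),
            max_power=cols[2],
            entry_latency_us=int(cols[9]),
            exit_latency_us=int(cols[10]),
        ))
    return states

# Rows copied from the smartctl output in this thread.
TABLE = """\
0 + 6.00W - - 0 0 0 0 0 0
1 + 3.00W - - 1 1 1 1 0 0
2 + 1.50W - - 2 2 2 2 0 0
3 - 0.0250W - - 3 3 3 3 8000 3000
4 - 0.0040W - - 4 4 4 4 25000 25000
"""

deep = [s for s in parse_power_states(TABLE) if not s.operational]
for s in deep:
    total = s.entry_latency_us + s.exit_latency_us
    print(f"PS{s.index}: {s.max_power}, entry+exit latency {total} us")
```

For this drive the non-operational states are 3 and 4, so those are the ones a power-saving workaround would need to keep the drive out of.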
 
Were you able to get this issue fixed? I have been having a very similar issue and think I am running into the same thing.
 
Yes, I found it was most likely the firmware on the computer. Ultimately I took the SSD and memory out of the NUC that was locking up and put them into a newer NUC. The older computer was a 2019 i7 NUC8, model number BOXNUC8i7BEH. Simply moving from this computer to a new one made the problem disappear. The firmware for the old computer hadn't been updated in a long time, which is why I suspect it was the cause.
 
