VM network freeze

kloklo

New Member
Jun 28, 2024
Hello,
My virtual machines' network adapters freeze intermittently, roughly once every 2-3 minutes. The affected VM stops responding to SSH and ping.
The connection recovers on its own after 20-30 seconds. There are no kernel error messages in either the VM or the host. This behavior occurs on both nodes.
Meanwhile, the Proxmox nodes themselves remain reachable over the network, and the GUI is accessible.

The VMs remain accessible via noVNC/SPICE. I found several threads with similar symptoms on the forum and tried the solutions from them, without success.
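To put exact timestamps and durations on the freezes, something like the following could be run from another machine on the same network. It is only a sketch: the function names are made up for illustration, the address in the usage comment is a placeholder, and the `-W 1` timeout assumes the Linux iputils `ping`.

```bash
#!/usr/bin/env bash
# Log when a VM stops and resumes answering ping, with the outage duration.

# Print one line describing a finished outage.
# $1 = epoch second the outage began, $2 = epoch second it ended.
report_outage() {
    local start=$1 end=$2
    printf 'outage start=%s duration=%ss\n' "$start" "$((end - start))"
}

# Ping the target once per second and report each outage when it ends.
watch_target() {
    local target=$1 down_since=""
    while true; do
        if ping -c 1 -W 1 "$target" >/dev/null 2>&1; then
            if [ -n "$down_since" ]; then
                report_outage "$down_since" "$(date +%s)"
                down_since=""
            fi
        elif [ -z "$down_since" ]; then
            # First missed reply: remember when the outage started.
            down_since=$(date +%s)
        fi
        sleep 1
    done
}

# Run only when an address is given, e.g.: ./watch.sh 192.0.2.10
if [ "$#" -ge 1 ]; then
    watch_target "$1"
fi
```

The logged start times can then be compared against the host and guest kernel logs for the same seconds.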

My installation:

2 nodes Proxmox 8.4.1 kernel 6.11.11-2 (no-subscription)
5-10 different VMs on each node (Windows, Linux, FreeBSD)
VirtIO drivers and Guest-agent are installed and available on all VMs

Hardware:
Node1:
CPU: 12th Gen Intel(R) Core(TM) i5-12500
NET: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05) ; driver: r8169
00:00.0 Host bridge [0600]: Intel Corporation 12th Gen Core Processor Host Bridge [8086:4650] (rev 05)
DeviceName: Onboard - Other
Subsystem: Gigabyte Technology Co., Ltd 12th Gen Core Processor Host Bridge [1458:5000]
libkmod: ERROR ../libkmod/libkmod-config.c:712 kmod_config_parse: /etc/modprobe.d/zfs.conf line 1: ignoring bad line starting with '7516192768'
00:02.0 VGA compatible controller [0300]: Intel Corporation Alder Lake-S GT1 [UHD Graphics 770] [8086:4690] (rev 0c)
DeviceName: Onboard - Video
Subsystem: Gigabyte Technology Co., Ltd Alder Lake-S GT1 [UHD Graphics 770] [1458:d000]
Kernel driver in use: i915
Kernel modules: i915, xe
00:06.0 PCI bridge [0604]: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 [8086:464d] (rev 05)
Kernel driver in use: pcieport
00:14.0 USB controller [0c03]: Intel Corporation Raptor Lake USB 3.2 Gen 2x2 (20 Gb/s) XHCI Host Controller [8086:7a60] (rev 11)
DeviceName: Onboard - Other
Subsystem: Gigabyte Technology Co., Ltd Raptor Lake USB 3.2 Gen 2x2 (20 Gb/s) XHCI Host Controller [1458:5007]
Kernel driver in use: xhci_hcd
Kernel modules: mei_me, xhci_pci
00:14.2 RAM memory [0500]: Intel Corporation Raptor Lake-S PCH Shared SRAM [8086:7a27] (rev 11)
DeviceName: Onboard - Other
00:14.3 Network controller [0280]: Intel Corporation Raptor Lake-S PCH CNVi WiFi [8086:7a70] (rev 11)
DeviceName: Onboard - Ethernet
Subsystem: Intel Corporation Raptor Lake-S PCH CNVi WiFi [8086:0094]
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
00:15.0 Serial bus controller [0c80]: Intel Corporation Raptor Lake Serial IO I2C Host Controller [8086:7a4c] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:15.1 Serial bus controller [0c80]: Intel Corporation Raptor Lake Serial IO I2C Host Controller [8086:7a4d] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:15.2 Serial bus controller [0c80]: Intel Corporation Raptor Lake Serial IO I2C Host Controller [8086:7a4e] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:15.3 Serial bus controller [0c80]: Intel Corporation Device [8086:7a4f] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:16.0 Communication controller [0780]: Intel Corporation Raptor Lake CSME HECI [8086:7a68] (rev 11)
DeviceName: Onboard - Other
Subsystem: Gigabyte Technology Co., Ltd Raptor Lake CSME HECI [1458:1c3a]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Raptor Lake SATA AHCI Controller [8086:7a62] (rev 11)
DeviceName: Onboard - SATA
Subsystem: Gigabyte Technology Co., Ltd Raptor Lake SATA AHCI Controller [1458:b005]
Kernel driver in use: ahci
Kernel modules: ahci
00:19.0 Serial bus controller [0c80]: Intel Corporation Device [8086:7a7c] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:19.1 Serial bus controller [0c80]: Intel Corporation Device [8086:7a7d] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-lpss
Kernel modules: intel_lpss_pci
00:1a.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port [8086:7a48] (rev 11)
Kernel driver in use: pcieport
00:1c.0 PCI bridge [0604]: Intel Corporation Raptor Lake PCI Express Root Port [8086:7a38] (rev 11)
Kernel driver in use: pcieport
00:1c.2 PCI bridge [0604]: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 [8086:7a3a] (rev 11)
Subsystem: Gigabyte Technology Co., Ltd Raptor Point-S PCH - PCI Express Root Port 3 [1458:5001]
Kernel driver in use: pcieport
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:7a06] (rev 11)
DeviceName: Onboard - Other
Subsystem: Gigabyte Technology Co., Ltd Device [1458:5001]
00:1f.3 Audio device [0403]: Intel Corporation Raptor Lake High Definition Audio Controller [8086:7a50] (rev 11)
DeviceName: Onboard - Sound
Subsystem: Gigabyte Technology Co., Ltd Raptor Lake High Definition Audio Controller [1458:a194]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel, snd_sof_pci_intel_tgl
00:1f.4 SMBus [0c05]: Intel Corporation Raptor Lake-S PCH SMBus Controller [8086:7a23] (rev 11)
DeviceName: Onboard - Other
Subsystem: Gigabyte Technology Co., Ltd Raptor Lake-S PCH SMBus Controller [1458:5001]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.5 Serial bus controller [0c80]: Intel Corporation Raptor Lake SPI (flash) Controller [8086:7a24] (rev 11)
DeviceName: Onboard - Other
Kernel driver in use: intel-spi
Kernel modules: spi_intel_pci
01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD [15b7:5006]
Subsystem: Sandisk Corp SanDisk Extreme Pro / WD Black SN750 / PC SN730 / Red SN700 NVMe SSD [15b7:5006]
Kernel driver in use: nvme
Kernel modules: nvme
02:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD [15b7:5006]
Subsystem: Sandisk Corp SanDisk Extreme Pro / WD Black SN750 / PC SN730 / Red SN700 NVMe SSD [15b7:5006]
Kernel driver in use: nvme
Kernel modules: nvme
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
Subsystem: Gigabyte Technology Co., Ltd RTL8125 2.5GbE Controller [1458:e000]
Kernel driver in use: r8169
Kernel modules: r8169

Node2:
CPU: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
NET: Intel Corporation Ethernet Connection (11) I219-V [8086:0d4d] (rev 11); driver e1000e
00:00.0 Host bridge [0600]: Intel Corporation Comet Lake-S 6c Host Bridge/DRAM Controller [8086:9b53] (rev 05)
Subsystem: ASRock Incorporation Comet Lake-S 6c Host Bridge/DRAM Controller [1849:9b53]
Kernel driver in use: skl_uncore
00:02.0 VGA compatible controller [0300]: Intel Corporation CometLake-S GT2 [UHD Graphics 630] [8086:9bc5] (rev 05)
Subsystem: ASRock Incorporation CometLake-S GT2 [UHD Graphics 630] [1849:9bc5]
Kernel driver in use: i915
Kernel modules: i915
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
Subsystem: ASRock Incorporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [1849:1911]
00:14.0 USB controller [0c03]: Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller [8086:43ed] (rev 11)
Subsystem: ASRock Incorporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller [1849:43ed]
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
00:14.2 RAM memory [0500]: Intel Corporation Tiger Lake-H Shared SRAM [8086:43ef] (rev 11)
00:16.0 Communication controller [0780]: Intel Corporation Tiger Lake-H Management Engine Interface [8086:43e0] (rev 11)
Subsystem: ASRock Incorporation Tiger Lake-H Management Engine Interface [1849:43e0]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:43d2] (rev 11)
Subsystem: ASRock Incorporation Device [1849:43d2]
Kernel driver in use: ahci
Kernel modules: ahci
00:1f.0 ISA bridge [0601]: Intel Corporation B560 LPC/eSPI Controller [8086:4387] (rev 11)
Subsystem: ASRock Incorporation B560 LPC/eSPI Controller [1849:4387]
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:f0c8] (rev 11)
Subsystem: ASRock Incorporation Device [1849:1897]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
00:1f.4 SMBus [0c05]: Intel Corporation Tiger Lake-H SMBus Controller [8086:43a3] (rev 11)
Subsystem: ASRock Incorporation Tiger Lake-H SMBus Controller [1849:43a3]
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.5 Serial bus controller [0c80]: Intel Corporation Tiger Lake-H SPI Controller [8086:43a4] (rev 11)
Subsystem: ASRock Incorporation Tiger Lake-H SPI Controller [1849:43a4]
Kernel driver in use: intel-spi
Kernel modules: spi_intel_pci
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (11) I219-V [8086:0d4d] (rev 11)
Subsystem: ASRock Incorporation Ethernet Connection (11) I219-V [1849:0d4d]
Kernel driver in use: e1000e
Kernel modules: e1000e

What was tried:
Changing Proxmox host kernels:
6.11.11-2-pve
6.8.12-10-pve
6.8.12-8-pve

Changing the network adapter type on the VM:
Intel E1000, VirtIO, vmxnet
Disabling TCP segmentation offload and generic segmentation offload on the Proxmox hosts:
Bash:
ethtool -K <interface> tso off gso off
Bash:
ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
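To check that the offload settings actually took effect, the output of `ethtool -k` (lowercase `-k` lists the current features) can be filtered. This is a rough sketch: `offloads_still_on` is a made-up helper name, and it only looks at the three top-level features named in its pattern.

```bash
# Read `ethtool -k` output on stdin and print any of the listed
# offloads that are still reported as "on".
offloads_still_on() {
    awk -F': ' '
        /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):/ {
            # $2 may look like "off [fixed]"; keep only the first word.
            split($2, state, " ")
            if (state[1] == "on") print $1
        }'
}

# Usage on the host (eno1 taken from the command above):
#   ethtool -k eno1 | offloads_still_on
```

No output means all three are off. Settings made with `ethtool -K` do not survive a reboot, so it is worth re-running the check after each kernel change.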
I am attaching the kernel logs.

The same happens on a Mac mini 2014 with 3 NICs (all Broadcom). All NICs stall for 20-50 seconds every 2-8 hours, regardless of the actual system or network load. Two NICs are passed through via PCI passthrough to a pfSense firewall VM. The other is the standard built-in NIC, used only for accessing Proxmox.

When the NICs freeze, there is no apparent system or network load. It even happens at night when there is barely any traffic. Neither the time of day nor the interval between incidents follows a regular pattern.

I tried pinning several kernels. At the moment I'm on 6.8.12-4-pve. Within 7 hours of the reboot, all NICs froze twice.
As I write this, the system seems to be running fine and has been for around 16 hours, longer than before.
Still, when the system was on PVE 8.3.5, everything worked fine for months.

After pinning 6.8.12-4 (with PVE 8.4.1), I noticed the lines below in the system log, which I hadn't seen with the newer kernels over the last two days. At the logged times there was no high load on the node, neither on the network nor in the system.

Code:
14:26:32 pve kernel: perf: interrupt took too long (2602 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
15:12:09 pve kernel: perf: interrupt took too long (3293 > 3252), lowering kernel.perf_event_max_sample_rate to 60000
18:13:22 pve kernel: perf: interrupt took too long (4126 > 4116), lowering kernel.perf_event_max_sample_rate to 48000
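To line these messages up against the freeze times, they can be reduced to one short record per event. This is just a sketch: `perf_throttle_summary` is a made-up helper, and it assumes each line contains an HH:MM:SS timestamp and the exact message wording shown above.

```bash
# Read kernel log lines on stdin and print, for each perf throttling
# message: the time, the measured interrupt duration, the limit it
# exceeded, and the new sample rate.
perf_throttle_summary() {
    sed -n 's/^.*\([0-9][0-9]:[0-9][0-9]:[0-9][0-9]\) .*interrupt took too long (\([0-9]*\) > \([0-9]*\)).*sample_rate to \([0-9]*\)$/\1 took=\2 limit=\3 new_rate=\4/p'
}

# Usage on the host:
#   journalctl -k | perf_throttle_summary
```

For the first line above this prints `14:26:32 took=2602 limit=2500 new_rate=76000`.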