Proxmox kernel 6.x intermittent vm freeze

Inglebard

Hello,

We have had an issue since moving to Linux kernel 6.x.

We have a VM with GPU passthrough.
Everything works great with kernel 5.15.131-2-pve.
However, with kernels 6.5.13-6-pve and 6.8.12-4-pve, we encounter strange spikes in processor load.

During these spikes, the VM is unresponsive.

Any idea what is happening?

proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
pve-kernel-5.13: 7.1-9
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

Edit: I had a similar issue a year ago, which was never resolved: https://forum.proxmox.com/threads/proxmox-high-cpu-usage-after-upgrade-to-version-8.140402/
 

Attachments

  • Capture d’écran du 2025-01-06 11-22-03.png
However, with kernels 6.5.13-6-pve and 6.8.12-4-pve, we encounter strange spikes in processor load.
"processor load" of the Proxmox host?

We have a VM with GPU passthrough.
How do you know/assume this is the cause?
What else is running on that Proxmox node? Have you tried test-running the node without that VM running?

You don't provide much detail about your hardware or that VM, which makes it hard for others to help.
 
@gfngfn256
This is the processor load of the VM.

I suppose it is related to the GPU passthrough because we have 2 VMs on this server with the same OS, and only the VM with GPU passthrough has this issue. We also have 20+ Proxmox hosts (same Proxmox version but different hardware) running VMs without GPU passthrough, and none of them have this issue.
So I am not 100% sure that the issue comes from GPU passthrough, but it is a strange coincidence.
I am only 100% sure that it comes from Proxmox and the Linux kernel, because the issue appears when the Proxmox host kernel changes, without any changes inside the VM.

Here are the VM details:
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=ide2;ide0;net0
cores: 28
cpu: host
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:af:00,pcie=1,x-vga=1
ide0: local-lvm:vm-101-disk-1,size=800G,ssd=1
ide2: none,media=cdrom
machine: pc-q35-7.0
memory: 450560
meta: creation-qemu=6.2.0,ctime=1661334363
name: VM
net0: virtio={hidden},bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: local-lvm:vm-101-disk-3,discard=on,size=3500G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid={hidden}
sockets: 1
startup: up=120
tpmstate0: local-lvm:vm-101-disk-2,size=4M,version=v2.0
usb0: host=1-12
usb1: host=1-10
vga: none
vmgenid: {hidden}

Here are the host details:
Code:
Motherboard: Intel S2600STBR
CPU: 2x Intel Xeon Gold 6226R
RAM: 8x M393A8G40AB2-CWE (total: 512GB)
GPU: PNY RTX A4500 20GB
Network: Intel 10Gb 2-Port LAN Riser Accessory Kit (2x SFP+)
RAID: BC MegaRAID 9560-8i PCIe x8 SAS/NVMe sgl.
 
Things I note in your VM's config:

- You also have 2 USB passthroughs in addition to the GPU (keyboard and mouse?). If you comment these out, does the VM's CPU load go down? Are you sure about those ports (Bus 1, Ports 10 and 12)?
- I notice you have numa disabled, yet you have 2 socketed CPUs. Have you ever considered enabling it?
- I notice you are running an older machine version (pc-q35-7.0). Have you tried the latest?
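
The three points above can also be pulled out of a config file mechanically. This is only a rough sketch, assuming the config is available as plain `key: value` text like the one posted; the helper names `parse_vm_config` and `review_points` are made up for illustration and are not part of any Proxmox tool.

```python
def parse_vm_config(text):
    """Parse 'key: value' lines (as in the posted VM config) into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only: values such as PCI addresses contain colons.
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def review_points(config):
    """Return notes for the three things raised above (hypothetical helper)."""
    notes = []
    usb = sorted(k for k in config if k.startswith("usb") and k[3:].isdigit())
    if usb:
        notes.append("USB passthrough entries: " + ", ".join(usb))
    if config.get("numa", "0") == "0":
        notes.append("numa is disabled")
    machine = config.get("machine")
    if machine:
        notes.append("pinned machine type: " + machine)
    return notes


# A few lines copied from the config posted in this thread.
SAMPLE = """\
hostpci0: 0000:af:00,pcie=1,x-vga=1
machine: pc-q35-7.0
numa: 0
usb0: host=1-12
usb1: host=1-10
"""

if __name__ == "__main__":
    for note in review_points(parse_vm_config(SAMPLE)):
        print("-", note)
```

It only reports what is in the config; whether any of these settings is related to the freezes still has to be tested by changing them one at a time.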
 
@gfngfn256
- Yes, the 2 USB passthroughs are for a keyboard and a mouse, just in case. As you point out, this is also the only VM where I have USB passthrough. I have not tested commenting them out. I am sure about those ports under Linux kernel 5; if Linux kernel 6 changes the ordering, then I am not. I will check next time.

- I tried to enable NUMA but I ran into an issue about hugepagesize (I don't remember the error message, but I think I just need to enable it in GRUB). I was planning to enable it after this update. Since the update is not going well, I would rather not risk adding more problems and making it harder to identify where the original issue comes from. But if it is the solution, I will try it.

- Couldn't changing the machine type alter the virtual hardware and then cause issues with Windows license activation?
 
- Couldn't changing the machine type alter the virtual hardware and then cause issues with Windows license activation?
Possibly, although it is probably unlikely when only moving to a newer q35 machine version. Also, AFAIK, if you just restore that VM from a backup afterwards you should be alright.

I take no responsibility.
 
Hi,
I updated to the latest kernel available for Proxmox 8.

I removed the 2 USB passthroughs and enabled NUMA (with the hugepage size setting).

The issue is still present.

With the new kernel (when the issue appears), I see the following in the log, which doesn't look good:

Code:
Aug 20 17:02:50 proxmox kernel: blacklist: Problem with revocation key (-65)
Aug 20 17:02:50 proxmox systemd-modules-load[685]: Failed to find module 'vfio_virqfd'
Aug 20 17:38:02 proxmox QEMU[2587]: kvm: vhost_set_mem_table failed: Argument list too long (7)
Aug 20 17:38:02 proxmox QEMU[2587]: kvm: unable to start vhost net: 7: falling back on userspace virtio

Could this explain my issue?
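
To check whether such messages line up with the freezes, the log lines can be split into fields and filtered by keyword. This is a rough sketch only: the regular expression is an assumption based on the four lines above, and `parse_log` and `filter_entries` are illustrative names, not part of any existing tool.

```python
import re

# Assumed layout, inferred from the lines above:
# "<Mon> <day> <HH:MM:SS> <host> <source>[<pid>]: <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<source>[^\[:]+)(?:\[(?P<pid>\d+)\])?:\s(?P<msg>.*)$"
)

SAMPLE_LOG = """\
Aug 20 17:02:50 proxmox kernel: blacklist: Problem with revocation key (-65)
Aug 20 17:02:50 proxmox systemd-modules-load[685]: Failed to find module 'vfio_virqfd'
Aug 20 17:38:02 proxmox QEMU[2587]: kvm: vhost_set_mem_table failed: Argument list too long (7)
Aug 20 17:38:02 proxmox QEMU[2587]: kvm: unable to start vhost net: 7: falling back on userspace virtio
"""


def parse_log(text):
    """Return one dict per line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        match = LOG_RE.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def filter_entries(entries, keywords):
    """Keep entries whose message contains any of the keywords."""
    return [e for e in entries if any(k in e["msg"] for k in keywords)]


if __name__ == "__main__":
    for entry in filter_entries(parse_log(SAMPLE_LOG), ("vhost", "vfio")):
        print(entry["ts"], entry["source"], "->", entry["msg"])
```

Comparing the timestamps of the filtered entries with the times of the load spikes would show whether the messages only appear at VM start or also during each freeze.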

EDIT:
GRUB_CMDLINE_LINUX_DEFAULT:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream,multifunction video=efifb:off video=vesa:off vfio-pci.ids=10de:2232,10de:1aef vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 modprobe.blacklist=radeon,nouveau,nvidia,nvidiafb,nvidia-gpu"
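
To confirm that these parameters are really active on the running kernel after a reboot, they can be compared with the contents of `/proc/cmdline`. A minimal sketch, assuming a Linux host; `parse_cmdline` and `missing_tokens` are hypothetical helper names used only for this example.

```python
import shlex

# The value of GRUB_CMDLINE_LINUX_DEFAULT as posted above.
WANTED = (
    "quiet intel_iommu=on pcie_acs_override=downstream,multifunction "
    "video=efifb:off video=vesa:off vfio-pci.ids=10de:2232,10de:1aef "
    "vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 "
    "modprobe.blacklist=radeon,nouveau,nvidia,nvidiafb,nvidia-gpu"
)


def parse_cmdline(cmdline):
    """Map each parameter name to the list of values given for it.

    Flags without '=' (such as 'quiet') get the value None. A list is used
    because a name can repeat, as 'video=' does here.
    """
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else None)
    return params


def missing_tokens(wanted, running):
    """Return the tokens of 'wanted' that do not appear in 'running'."""
    running_tokens = set(shlex.split(running))
    return [t for t in shlex.split(wanted) if t not in running_tokens]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            running = handle.read().strip()
    except OSError:
        running = ""  # not on Linux, or /proc is not readable
    for token in missing_tokens(WANTED, running):
        print("not active on the running kernel:", token)
```

An empty result only means the parameters were passed to the kernel; it does not show that they have the intended effect.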
 