PROXMOX 8.0.4 Host-side problems with Win10 Pro guest stalls, also mlx4 VF compat

pipe2null

Feb 26, 2023
PVE 8.0.4 with 2x E5-2697v4.

I have 2 (or more?) issues to fix; the biggest problem is #2:

1) Is there an updated mlx4_core/en/ib for PVE 8.0.4/kernel 6.2.16-6-pve available somewhere?
- Current modinfo mlx4_core => 4.0-0
- Windows Error 43 on "VPI" adapter (ConnectX3 Ethernet-mode VF, PCIe passthrough)
- Not sure how to apply the patch that fixes the problems with Windows guests
- I need to get finished transitioning all VMs to ConnectX3 SRIOV VFs, and there is a known issue with Windows guests due to host-side ConnectX driver implementation.
- AFAIK something like that. If I'm headed in the wrong direction for fixing, please correct me.
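
For anyone comparing driver versions across hosts, here is a minimal Python sketch that reads the `version` and `vermagic` fields from `modinfo` output. The helper names `parse_modinfo` and `driver_summary` are made up for illustration, and it assumes `modinfo` is on the PATH of the PVE host.

```python
import subprocess


def parse_modinfo(text):
    """Turn modinfo output ("key:   value" per line) into a dict.

    Repeated keys such as "alias" or "parm" keep only their first value.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def driver_summary(module="mlx4_core"):
    """Run modinfo on the host and return (version, vermagic) for the module."""
    result = subprocess.run(
        ["modinfo", module], capture_output=True, text=True, check=True
    )
    info = parse_modinfo(result.stdout)
    return info.get("version"), info.get("vermagic")
```

Calling `driver_summary()` on the host should return the driver version (4.0-0 in my case) together with the kernel the module was built for, which makes it easier to tell whether a rebuilt module actually got loaded.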

Referenced threads:
https://forums.servethehome.com/ind...x-host-and-windows-guest-via-kvm.28956/page-3
https://forum.proxmox.com/threads/h...ox-connectx-3-cards-for-sriov-and-vfs.121927/
https://forum.proxmox.com/threads/pve-kernel-6-2-16-6-pve-build-issue.131779/

2) Windows 10 Pro "whole" guest stalls/momentary freeze
- The whole guest system stalls for ~200-500ms regularly but with non-periodic pattern.
- Timing of stall is anywhere from every few seconds to every few minutes.
- Issue is most evident when playing audio through PCIe-passed-through hardware: the audio buzzes for the duration of the guest stall (the hardware repeatedly replays its ~50ms buffer because the OS supplies no new data). But even when nothing other than the guest OS is running, everything freezes, including the mouse pointer.
- Stall does not appear to be directly CPU load dependent since stall occurs even when system is idle, but does appear to occur more frequently when under load.
- Stall is not obviously correlated with any virtual or physical IO: it occurs with or without any significant IO load, and significant IO load does not automatically trigger a stall.
- Resource Monitor shows a SIMULTANEOUS CPU spike across ALL cores AFTER the stall ends and monitor telemetry resumes. That is, there is no significant CPU usage prior to the moment of stall, but the first data point obtained during/immediately after the stall shows a single momentary CPU spike on every core, no matter the previous or subsequent load. EVERY stall has some amount of simultaneous CPU spike across all cores.
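
To put timestamps on the stalls described above, something like the following could run inside the guest. It is only a sketch: `sample_clock` and `find_stalls` are hypothetical helper names, and the 5ms sampling interval and 100ms threshold are assumptions chosen to sit well below the ~200-500ms stalls. The idea is that if the whole guest freezes, consecutive clock samples end up much further apart than the sleep interval.

```python
import time


def find_stalls(timestamps, threshold_s=0.1):
    """Return (start_time, gap_seconds) for each pair of consecutive
    samples that are further apart than threshold_s."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold_s:
            stalls.append((prev, gap))
    return stalls


def sample_clock(duration_s=60.0, interval_s=0.005):
    """Sample a high-resolution clock roughly every interval_s seconds
    for duration_s seconds and return the list of readings."""
    samples = [time.perf_counter()]
    while samples[-1] - samples[0] < duration_s:
        time.sleep(interval_s)
        samples.append(time.perf_counter())
    return samples
```

Running `find_stalls(sample_clock())` gives a list of gaps that can be lined up against host-side logs. Note that normal scheduler jitter in the guest also produces small gaps, so the threshold may need tuning.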

[Attached image: Windows10GuestStall_Cropped.png]


Any ideas on this one?

Thanks!
 
Also:

agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0;ide0
cores: 18
cpu: host
efidisk0: local-dir:666/vm-666-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:81:00,pcie=1,x-vga=1
hostpci2: 0000:03:00,pcie=1
ide0: local-dir:iso/virtio-win-0.1.229.iso,media=cdrom,size=522284K
machine: q35
memory: 16000
meta: creation-qemu=8.0.2,ctime=1692051742
name: Win10
net0: virtio=42:AF:31:71:92:B8,bridge=vmbr0040
numa: 0
ostype: l26
scsi0: local-zfs:vm-666-disk-0,iothread=1,size=256G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=5d39371a-8298-42f4-9d79-e16ab78e411f
sockets: 1
tpmstate0: local-dir:666/vm-666-disk-2.raw,size=4M,version=v2.0
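
In case it helps anyone reviewing the config above, here is a small Python sketch that parses the `key: value` lines and prints a few notes. The function names and the checks are illustrative assumptions, not Proxmox rules. The only check with substance is `ostype`: in PVE, `l26` denotes a Linux guest while `win10` is the value meant for Windows 10. Whether that has anything to do with the stalls is not established here.

```python
def parse_vm_config(text):
    """Parse Proxmox-style 'key: value' lines into a dict.

    Only the first ':' separates key from value, so PCI addresses such as
    0000:81:00 stay intact in the value.
    """
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            config[key.strip()] = value.strip()
    return config


def review(config):
    """Return a list of notes about the config (illustrative checks only)."""
    notes = []
    # Assumption: a VM whose name contains "win" is meant to be a Windows guest.
    if config.get("ostype") == "l26" and "win" in config.get("name", "").lower():
        notes.append("ostype is l26 (Linux) but the VM name suggests Windows")
    for key in sorted(config):
        if key.startswith("hostpci"):
            address = config[key].split(",")[0]
            notes.append(f"{key} passes through {address}")
    return notes
```

Feeding the config text above through `review(parse_vm_config(text))` would list the two `hostpci` entries and flag the `ostype` value.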


NOTE: "hostpci1" is usually the ConnectX3 SRIOV VF, but I removed it for testing the Guest Stall problem (makes no difference to stall).
"hostpci0:" = RTX4090
"hostpci2:" = PCIe USB controller card
"CPUs:" = Intel Xeon 2x E5-2697v4

Afterthought:
Monitor and usb keyboard/mouse are connected to pcie-passthrough'ed guest hardware.
 