VM loses console output & slow boot with PCI passthrough

MisterDeeds

Dear all

We are running a PVE server with a VM that we use for a local AI instance. The VM has the following configuration:

VM Configuration:

Code:
cat /etc/pve/qemu-server/10050.conf

agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=ide2;scsi0
cores: 64
cpu: host
efidisk0: data:vm-10050-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:21:00
hostpci1: 0000:a1:00
hostpci2: 0000:c1:00
ide2: none,media=cdrom
machine: q35
memory: 524288
meta: creation-qemu=8.1.5,ctime=1721896847
name: AIBOT1
net0: virtio=BC:24:11:23:BF:D5,bridge=vmbr0,firewall=1,tag=10
numa: 1
ostype: l26
scsi0: data:vm-10050-disk-1,discard=on,iothread=1,size=1000G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0d337412-675f-4736-afbc-d1038ee39652
sockets: 2
tags: Deb12
vga: std
vmgenid: 9a484e20-8155-4daf-9376-ad9f0a2356ea

Code:
pveversion --verbose
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-6
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.5.13-5-pve: 6.5.13-5
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

Problem:
As soon as the GPUs are passed through via PCI passthrough, the VM no longer shows any output on the console (even though the "Primary GPU" checkbox is not ticked for any of them). No matter which Display option I select, there is no output during the boot process. In addition, the boot time increases significantly: with PCI passthrough it takes about 10 minutes, whereas without it the VM boots in around 20 seconds.
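
For reference, these are the host-side checks I can run and post output from if it helps narrow things down; they are only a sketch of where one might look (the device addresses are taken from the config above, and I am assuming the issue is related to VFIO binding or IOMMU grouping):

Code:
# Confirm each GPU is bound to vfio-pci and not claimed by a host driver
lspci -nnk -s 21:00
lspci -nnk -s a1:00
lspci -nnk -s c1:00

# Check IOMMU grouping - each passed-through device should sit in its own group
find /sys/kernel/iommu_groups/ -type l | sort -V

# Follow kernel messages while the VM starts, watching for VFIO/IOMMU errors
dmesg -w | grep -iE 'vfio|iommu|AMD-Vi|DMAR'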

Does anyone have an idea what could be causing this or what further troubleshooting steps I could take?

Thanks for your help!

Best regards
 
Same problem here... working with 8x H200 (141 GB) that I finally got running, but the boot time is insane (30+ minutes) if I pass 4 GPUs to a single VM.
I did a mirror setup on identical hardware but with libvirt on RHEL 9; boot time is still slow there, but more like 5 minutes with the same VM setup.

Tried fiddling a bit with hugepages and NUMA, but it made no difference.
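
In case anyone wants to compare notes, a 1 GiB hugepage setup on PVE looks roughly like the below. Treat the values as an example for a ~512 GiB guest rather than what I actually used or a recommendation, and note the kernel cmdline part assumes GRUB (on a ZFS-root install it goes into /etc/kernel/cmdline instead):

Code:
# Host: reserve 1 GiB hugepages at boot (example count for a ~512 GiB guest)
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=1G hugepagesz=1G hugepages=520"
# then: update-grub && reboot

# Guest: /etc/pve/qemu-server/<vmid>.conf - back guest RAM with 1 GiB hugepages
hugepages: 1024
keephugepages: 1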
 
Bumping with the same issue (fewer cards, 2x H100). After setting up PCI passthrough, boot is very slow, over 4 minutes.

Monitoring on the host with journalctl -f, I see multiple instances of:
watchdog: BUG: soft lockup - CPU#35 stuck for 26s! [CPU 0/KVM:5531]
with different CPU numbers (could be a red herring).

I have also set up hugepages and fiddled with NUMA, but no progress. Hoping for a resolution, because everything else seems to work once the VM is booted.
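
One thing that might help confirm whether the long boot is simply VFIO pinning all of the guest RAM up front: watch the QEMU process RSS on the host while the VM starts and see whether it climbs slowly toward the configured memory size. Rough sketch below; the pgrep pattern assumes PVE's usual "kvm -id <vmid>" command line, so adjust the VMID and verify the PID manually:

Code:
VMID=10050
PID=$(pgrep -f "kvm -id $VMID")        # verify with: ps -fp "$PID"
watch -n 2 "grep VmRSS /proc/$PID/status"

If RSS grows steadily for minutes before the guest even reaches OVMF, the time is being spent pinning memory rather than inside the guest itself.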