A6000 passthrough on Supermicro/EPYC Milan stalling badly on VM boot, then working

superpower

On an H12SSL-NT Supermicro SuperServer with an AMD EPYC 7543 CPU and 3x A6000, passing them through to Linux VMs stalls the VM badly at the beginning, but it eventually manages to boot (found out by accident).
I can install the NVIDIA drivers inside and run various compute workloads (LLMs, etc.).
The problem is that booting takes about 4 to 5 minutes while keeping one core on the host at 100%, and I'm not sure I can pass through more than one A6000 to the same VM, because the last time I tried, it still hadn't booted after more than 10 minutes.
The journalctl output looks like this:
Code:
Jan 16 02:34:06 rtc pvedaemon[5953]: VM 102 started with PID 5964.
Jan 16 02:34:06 rtc pvedaemon[2178]: <root@pam> end task UPID:rtc:00001741:0000CBFE:678853FB:qmstart:102:root@pam: OK
Jan 16 02:34:06 rtc pvedaemon[6157]: starting vnc proxy UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[2179]: <root@pam> starting task UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[6159]: starting vnc proxy UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[2180]: <root@pam> starting task UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pveproxy[2244]: proxy detected vanished client connection
Jan 16 02:34:11 rtc qm[6161]: VM 102 qmp command failed - VM 102 qmp command 'set_password' failed - got timeout
Jan 16 02:34:11 rtc pvedaemon[6159]: Failed to run vncproxy.
Jan 16 02:34:11 rtc pvedaemon[2180]: <root@pam> end task UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam: Failed to run vncproxy.
Jan 16 02:34:16 rtc pvedaemon[6157]: connection timed out
Jan 16 02:34:16 rtc pvedaemon[2179]: <root@pam> end task UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam: connection timed out
Jan 16 02:34:18 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:34:18 rtc pvestatd[2158]: status update time (8.089 seconds)
Jan 16 02:34:19 rtc pvedaemon[2178]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:34:38 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:34:38 rtc pvestatd[2158]: status update time (8.083 seconds)
Jan 16 02:34:58 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:34:58 rtc pvestatd[2158]: status update time (8.079 seconds)
Jan 16 02:35:19 rtc sshd[6492]: Accepted publickey for root from 192.168.100.211 port 53218 ssh2: ED25519 SHA256:QyEmaPbAvyRKR7IxdrN4M0aYs2V3j7KoY/v1KtLf4MI
Jan 16 02:35:19 rtc sshd[6492]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 16 02:35:19 rtc systemd-logind[1780]: New session 3 of user root.
░░ Subject: A new session 3 has been created for user root
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ Documentation: sd-login(3)
░░
░░ A new session with the ID 3 has been created for the user root.
░░
░░ The leading process of the session is 6492.
Jan 16 02:35:19 rtc systemd[1]: Started session-3.scope - Session 3 of User root.
░░ Subject: A start job for unit session-3.scope has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit session-3.scope has finished successfully.
░░
░░ The job identifier is 360.
Jan 16 02:35:19 rtc sshd[6492]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 16 02:35:23 rtc pvedaemon[2180]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:35:28 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:35:28 rtc pvestatd[2158]: status update time (8.083 seconds)
Jan 16 02:35:48 rtc pvedaemon[2179]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:35:48 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:35:48 rtc pvestatd[2158]: status update time (8.086 seconds)
Jan 16 02:36:00 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:36:00 rtc pvestatd[2158]: status update time (9.786 seconds)
Jan 16 02:36:08 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:08 rtc pvestatd[2158]: status update time (8.089 seconds)
Jan 16 02:36:13 rtc pvedaemon[2180]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:19 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:19 rtc pvestatd[2158]: status update time (9.391 seconds)
Jan 16 02:36:28 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:28 rtc pvestatd[2158]: status update time (8.092 seconds)
Jan 16 02:36:38 rtc pvedaemon[2178]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:38 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:39 rtc pvestatd[2158]: status update time (8.084 seconds)
Jan 16 02:36:48 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:48 rtc pvestatd[2158]: status update time (8.090 seconds)
Jan 16 02:36:58 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:37:04 rtc pvedaemon[2179]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:37:07 rtc pvestatd[2158]: status update time (7.495 seconds)
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x3a data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0xd90 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x570 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x571 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x572 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x560 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x561 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x580 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x581 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x582 data 0x0
Jan 16 02:37:31 rtc pvedaemon[7296]: starting vnc proxy UPID:rtc:00001C80:00011D5D:678854CB:vncproxy:102:root@pam:
Jan 16 02:37:31 rtc pvedaemon[2178]: <root@pam> starting task UPID:rtc:00001C80:00011D5D:678854CB:vncproxy:102:root@pam:


After all of the above, the VM can be logged into, Ollama or ComfyUI started, whatever you need, and nvidia-smi reports normal output:
[ This nvidia-smi output is from inside VM 102 ]
Code:
nvidia-smi
Wed Jan 15 19:42:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P8              5W /  300W |      13MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       701      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

The VM config file:
Code:
bios: ovmf
boot: order=scsi0;net0
cores: 8
cpu: host
efidisk0: zfs-sas-ssd:vm-102-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=A600081,pcie=1,x-vga=1
ide2: local:iso/debian-12.9.0-amd64-netinst.iso,media=cdrom,size=632M
machine: pc-q35-9.0,viommu=virtio
memory: 16000
meta: creation-qemu=9.0.2,ctime=1736902284
name: vm102
net0: virtio=BC:24:11:AB:BF:90,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: zfs-sas-ssd:vm-102-disk-1,iothread=1,size=132G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=e8c3569c-1ace-4e46-92ba-3e82dc0fa48c
sockets: 1
vga: virtio,memory=256
vmgenid: f829f4d5-7342-4e13-8b5d-ab098220653d
args: -D /var/log/qemu/vm-102-debug.log -d cpu_reset,guest_errors,page,mmu,cpu

It's the exact same problem on both the 6.11.0-2-pve and 6.8.12-5-pve kernels, and attached is the debug log as configured above.
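
For reference, the args line in that config maps straight to QEMU's -D (log file) and -d (log items) flags. One caveat if you want to reproduce the logging, sketched below assuming default PVE paths: the log directory must already exist, since QEMU will not create it.
Code:
# create the directory QEMU writes the -D log file into (path from the args line above)
mkdir -p /var/log/qemu
# list the log items accepted by -d
qemu-system-x86_64 -d help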
 

You could try setting "aio=native" for your hard drive scsi0. With "io_uring" (the default), it took 5 minutes to get past the first second (!) of kernel messages on our setup.
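
A sketch of how that could be applied from the CLI, reusing the scsi0 line from the VM 102 config above (the whole volume spec has to be restated, since qm set replaces the option as a whole):
Code:
# switch scsi0 from the io_uring default to native AIO
qm set 102 --scsi0 zfs-sas-ssd:vm-102-disk-1,aio=native,iothread=1,size=132G,ssd=1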
 
What does the disk have to do with it? In any case, with aio=native it behaves exactly the same way:

Code:
Jan 17 17:15:33 rtc pvedaemon[809139]: VM 102 started with PID 809149.
Jan 17 17:15:33 rtc pvedaemon[236828]: <root@pam> end task UPID:rtc:000C58B3:00D55508:678A7412:qmstart:102:root@pam: OK
Jan 17 17:15:42 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:15:43 rtc pvestatd[2158]: status update time (8.085 seconds)
Jan 17 17:15:46 rtc pvedaemon[236828]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 17 17:15:46 rtc pvedaemon[809436]: starting vnc proxy UPID:rtc:000C59DC:00D55B52:678A7422:vncproxy:102:root@pam:
Jan 17 17:15:46 rtc pvedaemon[236828]: <root@pam> starting task UPID:rtc:000C59DC:00D55B52:678A7422:vncproxy:102:root@pam:
Jan 17 17:15:47 rtc pvedaemon[233149]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 17 17:15:51 rtc pvestatd[2158]: status update time (7.794 seconds)
Jan 17 17:16:01 rtc pvedaemon[236828]: <root@pam> end task UPID:rtc:000C59DC:00D55B52:678A7422:vncproxy:102:root@pam: OK
Jan 17 17:16:02 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:16:02 rtc pvestatd[2158]: status update time (8.091 seconds)
Jan 17 17:16:08 rtc pveproxy[2242]: worker 473082 finished
Jan 17 17:16:08 rtc pveproxy[2242]: starting 1 worker(s)
Jan 17 17:16:08 rtc pveproxy[2242]: worker 809562 started
Jan 17 17:16:09 rtc pvedaemon[233149]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 17 17:16:09 rtc pveproxy[809561]: proxy detected vanished client connection
Jan 17 17:16:10 rtc pvedaemon[809563]: starting vnc proxy UPID:rtc:000C5A5B:00D564C4:678A743A:vncproxy:102:root@pam:
Jan 17 17:16:10 rtc pvedaemon[236828]: <root@pam> starting task UPID:rtc:000C5A5B:00D564C4:678A743A:vncproxy:102:root@pam:
Jan 17 17:16:10 rtc pvestatd[2158]: status update time (5.798 seconds)
Jan 17 17:16:11 rtc pveproxy[809561]: worker exit
Jan 17 17:16:22 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:16:22 rtc pvestatd[2158]: status update time (8.090 seconds)
Jan 17 17:16:51 rtc pvedaemon[236828]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:16:56 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:16:56 rtc pvestatd[2158]: status update time (11.694 seconds)
Jan 17 17:17:01 rtc CRON[809900]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 17 17:17:01 rtc CRON[809901]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 17 17:17:01 rtc CRON[809900]: pam_unix(cron:session): session closed for user root
Jan 17 17:17:01 rtc pveproxy[2242]: worker 473083 finished
Jan 17 17:17:01 rtc pveproxy[2242]: starting 1 worker(s)
Jan 17 17:17:01 rtc pveproxy[2242]: worker 809909 started
Jan 17 17:17:02 rtc pveproxy[809903]: got inotify poll request in wrong process - disabling inotify
Jan 17 17:17:14 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 17 17:17:14 rtc pvestatd[2158]: status update time (8.087 seconds)
Jan 17 17:17:18 rtc pvedaemon[236828]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries

Just to make sure we are on the same page: this is an issue with passthrough of one or more A6000 GPUs. Otherwise, when no PCI passthrough is attempted, the VM functions perfectly.
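
That A/B comparison is quick to reproduce from the CLI; a sketch, reusing the hostpci0 line from the VM 102 config earlier in the thread:
Code:
# boot once without the GPU to confirm the clean case...
qm set 102 --delete hostpci0
qm start 102
# ...then restore the passthrough entry to reproduce the stall
qm set 102 --hostpci0 mapping=A600081,pcie=1,x-vga=1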
 
What does the disk have to do with it? In any case, with aio=native it behaves exactly the same way.
I honestly have no idea... All I can say is that in our case it helped avoid the stalling when booting a VM with an NVIDIA A40 passed through. Without any GPU passthrough, boot times are as expected.
 
I honestly have no idea... All I can say is that in our case it helped avoid the stalling when booting a VM with an NVIDIA A40 passed through. Without any GPU passthrough, boot times are as expected.
Could you be so kind as to share your QEMU config for one of those specific VMs?
I am very curious to see what other general parameters you have configured (the /etc/pve/qemu-server/<id>.conf file in question). Thanks in advance.
 
Sure, see below.

The host is a Dell 7525 with 2x AMD 75F3 CPUs and 2x NVIDIA A40, running PVE 8.3.2.
Note that passthrough of more than one GPU per VM does not work (anymore) on this host. It used to work flawlessly until some PVE update broke things (I cannot pinpoint the exact version), but this is not super critical for our use case, where one GPU per VM is fine.

Code:
root@idm-vhost01:~# qm config 153
agent: 1
args: -global q35-pcihost.pci-hole64-size=2048G
boot: order=ide2;scsi0
cipassword: **********
ciuser: user
cores: 16
cpu: host
hostpci1: mapping=vhost01-a40-02,pcie=1
ide2: none,media=cdrom
ipconfig0: ip=10.220.14.255/20,gw=10.220.0.1
machine: q35
memory: 65000
meta: creation-qemu=8.1.5,ctime=1720520352
name: hpc-tmp-gpu-worker-3-vhost01
net0: virtio=BC:24:11:21:88:9C,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
scsi0: pool01:vm-153-disk-0,aio=native,discard=on,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=00ac497f-ef18-484e-a2d9-7628975c97c2
sockets: 1
sshkeys: ssh-ed25519...
vmgenid: ab9f484b-255f-47df-a208-1eeaf8ec194a
 
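
One detail worth highlighting in that config is the args line, which enlarges the 64-bit PCI hole so that GPUs with large BARs can be mapped. A sketch of applying the same setting to VM 102 from earlier in the thread (the 2048G value is simply what this config uses; the right size depends on the setup):
Code:
# enlarge the 64-bit PCI MMIO window for large GPU BARs
# note: qm set --args replaces any existing args line (e.g. the debug flags above)
qm set 102 --args '-global q35-pcihost.pci-hole64-size=2048G'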
