On a Supermicro H12SSL-NT SuperServer with an AMD EPYC 7543 CPU and 3x RTX A6000, passing the GPUs through to Linux VMs stalls the VM badly at the start, but it eventually manages to boot (found out by accident).
I can install the NVIDIA drivers inside the guest and run various compute workloads (LLMs, etc.).
The problem is that booting takes about 4 to 5 minutes, keeping one core on the host at 100%, and I'm not sure I can pass through more than one A6000 to the same VM, because the last time I tried, it still hadn't booted after more than 10 minutes.
The journalctl output looks like this:
Code:
Jan 16 02:34:06 rtc pvedaemon[5953]: VM 102 started with PID 5964.
Jan 16 02:34:06 rtc pvedaemon[2178]: <root@pam> end task UPID:rtc:00001741:0000CBFE:678853FB:qmstart:102:root@pam: OK
Jan 16 02:34:06 rtc pvedaemon[6157]: starting vnc proxy UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[2179]: <root@pam> starting task UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[6159]: starting vnc proxy UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pvedaemon[2180]: <root@pam> starting task UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam:
Jan 16 02:34:06 rtc pveproxy[2244]: proxy detected vanished client connection
Jan 16 02:34:11 rtc qm[6161]: VM 102 qmp command failed - VM 102 qmp command 'set_password' failed - got timeout
Jan 16 02:34:11 rtc pvedaemon[6159]: Failed to run vncproxy.
Jan 16 02:34:11 rtc pvedaemon[2180]: <root@pam> end task UPID:rtc:0000180F:0000CD25:678853FE:vncproxy:102:root@pam: Failed to run vncproxy.
Jan 16 02:34:16 rtc pvedaemon[6157]: connection timed out
Jan 16 02:34:16 rtc pvedaemon[2179]: <root@pam> end task UPID:rtc:0000180D:0000CD21:678853FE:vncproxy:102:root@pam: connection timed out
Jan 16 02:34:18 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:34:18 rtc pvestatd[2158]: status update time (8.089 seconds)
Jan 16 02:34:19 rtc pvedaemon[2178]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:34:38 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:34:38 rtc pvestatd[2158]: status update time (8.083 seconds)
Jan 16 02:34:58 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:34:58 rtc pvestatd[2158]: status update time (8.079 seconds)
Jan 16 02:35:19 rtc sshd[6492]: Accepted publickey for root from 192.168.100.211 port 53218 ssh2: ED25519 SHA256:QyEmaPbAvyRKR7IxdrN4M0aYs2V3j7KoY/v1KtLf4MI
Jan 16 02:35:19 rtc sshd[6492]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 16 02:35:19 rtc systemd-logind[1780]: New session 3 of user root.
░░ Subject: A new session 3 has been created for user root
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ Documentation: sd-login(3)
░░
░░ A new session with the ID 3 has been created for the user root.
░░
░░ The leading process of the session is 6492.
Jan 16 02:35:19 rtc systemd[1]: Started session-3.scope - Session 3 of User root.
░░ Subject: A start job for unit session-3.scope has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit session-3.scope has finished successfully.
░░
░░ The job identifier is 360.
Jan 16 02:35:19 rtc sshd[6492]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 16 02:35:23 rtc pvedaemon[2180]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:35:28 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:35:28 rtc pvestatd[2158]: status update time (8.083 seconds)
Jan 16 02:35:48 rtc pvedaemon[2179]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:35:48 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:35:48 rtc pvestatd[2158]: status update time (8.086 seconds)
Jan 16 02:36:00 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - got timeout
Jan 16 02:36:00 rtc pvestatd[2158]: status update time (9.786 seconds)
Jan 16 02:36:08 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:08 rtc pvestatd[2158]: status update time (8.089 seconds)
Jan 16 02:36:13 rtc pvedaemon[2180]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:19 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:19 rtc pvestatd[2158]: status update time (9.391 seconds)
Jan 16 02:36:28 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:28 rtc pvestatd[2158]: status update time (8.092 seconds)
Jan 16 02:36:38 rtc pvedaemon[2178]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:38 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:39 rtc pvestatd[2158]: status update time (8.084 seconds)
Jan 16 02:36:48 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:36:48 rtc pvestatd[2158]: status update time (8.090 seconds)
Jan 16 02:36:58 rtc pvestatd[2158]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:37:04 rtc pvedaemon[2179]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 51 retries
Jan 16 02:37:07 rtc pvestatd[2158]: status update time (7.495 seconds)
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x3a data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0xd90 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x570 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x571 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x572 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x560 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x561 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x580 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x581 data 0x0
Jan 16 02:37:15 rtc kernel: kvm: kvm [5964]: ignored rdmsr: 0x582 data 0x0
Jan 16 02:37:31 rtc pvedaemon[7296]: starting vnc proxy UPID:rtc:00001C80:00011D5D:678854CB:vncproxy:102:root@pam:
Jan 16 02:37:31 rtc pvedaemon[2178]: <root@pam> starting task UPID:rtc:00001C80:00011D5D:678854CB:vncproxy:102:root@pam:
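For scale, the journal above puts the stall at a bit over three minutes between the "VM 102 started" line and the vncproxy task that finally comes up. A quick sketch, with the two timestamps copied from the log above:

```python
from datetime import datetime

# Timestamps copied from the journal: the pvedaemon "VM 102 started" line
# and the vncproxy task that finally starts once the guest responds.
fmt = "%b %d %H:%M:%S"
started = datetime.strptime("Jan 16 02:34:06", fmt)
responsive = datetime.strptime("Jan 16 02:37:31", fmt)
stall = (responsive - started).total_seconds()
print(f"stall ~ {stall:.0f}s")  # -> stall ~ 205s
```

That ~3.5 minutes of QMP timeouts matches the 4-to-5-minute boot time reported above.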
After all the above, the VM can be logged into, ollama started, ComfyUI, whatever you need, and nvidia-smi reports normal output:
[ This nvidia-smi output is inside VM 102 ]
Code:
nvidia-smi
Wed Jan 15 19:42:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142 Driver Version: 550.142 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 32C P8 5W / 300W | 13MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 701 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
The VM config file:
Code:
bios: ovmf
boot: order=scsi0;net0
cores: 8
cpu: host
efidisk0: zfs-sas-ssd:vm-102-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=A600081,pcie=1,x-vga=1
ide2: local:iso/debian-12.9.0-amd64-netinst.iso,media=cdrom,size=632M
machine: pc-q35-9.0,viommu=virtio
memory: 16000
meta: creation-qemu=9.0.2,ctime=1736902284
name: vm102
net0: virtio=BC:24:11:AB:BF:90,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: zfs-sas-ssd:vm-102-disk-1,iothread=1,size=132G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=e8c3569c-1ace-4e46-92ba-3e82dc0fa48c
sockets: 1
vga: virtio,memory=256
vmgenid: f829f4d5-7342-4e13-8b5d-ab098220653d
args: -D /var/log/qemu/vm-102-debug.log -d cpu_reset,guest_errors,page,mmu,cpu
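For the multi-GPU attempt, a second card would presumably go in as another hostpci entry. A sketch using the Proxmox CLI (the mapping name `A600082` is a placeholder by analogy with the `A600081` mapping above; substitute the real resource-mapping name):

```shell
# Sketch: attach a second mapped GPU as hostpci1 on VM 102.
# 'A600082' is a guessed mapping name; only one device should keep x-vga=1.
qm set 102 -hostpci1 mapping=A600082,pcie=1
```

Since VFIO pins all guest RAM and maps each GPU's BARs at startup, it would not be surprising if the stall grows with each added device.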
It's the exact same problem on both kernels:
6.11.0-2-pve
and 6.8.12-5-pve
Attached is the debug log as configured above.