Dual 3080 GPUs work in a single VM, but not if I split them to have one each in two VMs.

Nathan Stratton

System Setup
Proxmox 8.0.4
Supermicro H12SSL
1 Nvidia 4090
3 Nvidia 3080

Machine q35

virt101 - 3080 PCI Device 0000:02:00
virt103 - 4090 PCI Device 0000:01:00

I had virt105 with two 3080s, PCI Devices 0000:81:00 and 0000:82:00.

Everything works great with that setup. I then shut down 105, cloned it as 106, and put 0000:81:00 on 105 and 0000:82:00 on 106. If I start 105 first, 106 will not start; if I swap the devices and start 106 first, 106 is fine, but then 105 won't start.

Does anyone have any ideas? How can I get better logs out of KVM or QEMU? This is what I see in the journal:

Code:
Sep 29 11:10:55 virt1 pvedaemon[667600]: start VM 106: UPID:virt1:000A2FD0:0CDF7EA5:6516E8FF:qmstart:106:root@pam:
Sep 29 11:10:55 virt1 pvedaemon[651968]: <root@pam> starting task UPID:virt1:000A2FD0:0CDF7EA5:6516E8FF:qmstart:106:root@pam:
Sep 29 11:10:56 virt1 systemd[1]: Started 106.scope.
Sep 29 11:10:56 virt1 systemd-udevd[617]: Configuration file /etc/udev/rules.d/60-persistent-storage-hptblock.rules is marked executable. Please remove executable permission bits. Proceeding anyway.
Sep 29 11:10:56 virt1 kernel: device tap106i0 entered promiscuous mode
Sep 29 11:10:56 virt1 kernel: vmbr2: port 8(tap106i0) entered blocking state
Sep 29 11:10:56 virt1 kernel: vmbr2: port 8(tap106i0) entered disabled state
Sep 29 11:10:56 virt1 kernel: vmbr2: port 8(tap106i0) entered blocking state
Sep 29 11:10:56 virt1 kernel: vmbr2: port 8(tap106i0) entered forwarding state
Sep 29 11:10:57 virt1 kernel: device tap106i1 entered promiscuous mode
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered disabled state
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered disabled state
Sep 29 11:10:57 virt1 kernel: device fwln106i1 left promiscuous mode
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered disabled state
Sep 29 11:10:57 virt1 kernel: device fwpr106p1 left promiscuous mode
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered disabled state
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered blocking state
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered disabled state
Sep 29 11:10:57 virt1 kernel: device fwpr106p1 entered promiscuous mode
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered blocking state
Sep 29 11:10:57 virt1 kernel: vmbr1: port 5(fwpr106p1) entered forwarding state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered blocking state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered disabled state
Sep 29 11:10:57 virt1 kernel: device fwln106i1 entered promiscuous mode
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered blocking state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 1(fwln106i1) entered forwarding state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 2(tap106i1) entered blocking state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 2(tap106i1) entered disabled state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 2(tap106i1) entered blocking state
Sep 29 11:10:57 virt1 kernel: fwbr106i1: port 2(tap106i1) entered forwarding state
Sep 29 11:11:05 virt1 pvedaemon[651969]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:07 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:07 virt1 pvestatd[3884]: status update time (8.078 seconds)
Sep 29 11:11:17 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:17 virt1 pvestatd[3884]: status update time (8.083 seconds)
Sep 29 11:11:17 virt1 pvedaemon[668116]: stop VM 106: UPID:virt1:000A31D4:0CDF8720:6516E915:qmstop:106:root@pam:
Sep 29 11:11:17 virt1 pvedaemon[651969]: <root@pam> starting task UPID:virt1:000A31D4:0CDF8720:6516E915:qmstop:106:root@pam:
Sep 29 11:11:27 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:27 virt1 pvestatd[3884]: status update time (8.090 seconds)
Sep 29 11:11:27 virt1 pvedaemon[668116]: can't lock file '/var/lock/qemu-server/lock-106.conf' - got timeout
Sep 29 11:11:27 virt1 pvedaemon[651969]: <root@pam> end task UPID:virt1:000A31D4:0CDF8720:6516E915:qmstop:106:root@pam: can't lock file '/var/lock/qemu-server/lock-106.conf' - got timeout
Sep 29 11:11:30 virt1 pvedaemon[651970]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:37 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:37 virt1 pvestatd[3884]: status update time (8.088 seconds)
Sep 29 11:11:47 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:47 virt1 pvestatd[3884]: status update time (8.089 seconds)
Sep 29 11:11:55 virt1 pvedaemon[651969]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:57 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:11:57 virt1 pvestatd[3884]: status update time (8.086 seconds)
Sep 29 11:12:00 virt1 pvedaemon[667600]: start failed: command '/usr/bin/kvm -id 106 -name 'gpu1,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=619cd9b6-6474-4493-8c9b-33556553d455' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga none -nographic -cpu 'host,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt' -m 65536 -object 'iothread,id=iothread-virtioscsi0' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=8da5648e-d72a-405e-833e-1ae3cce1e881' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:82:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,x-vga=on,multifunction=on' -device 'vfio-pci,host=0000:82:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:eac3be58846' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/dev/zvol/nvme_virt1/vm-106-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2A:96:50:42:E3:F6,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=101' -netdev 'type=tap,id=net1,ifname=tap106i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=DE:A3:76:A8:FC:05,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256' -machine 'smm=off,type=q35+pve0'' failed: got timeout
Sep 29 11:12:00 virt1 pvedaemon[651968]: <root@pam> end task UPID:virt1:000A2FD0:0CDF7EA5:6516E8FF:qmstart:106:root@pam: start failed: command '/usr/bin/kvm -id 106 -name 'gpu1,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=619cd9b6-6474-4493-8c9b-33556553d455' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga none -nographic -cpu 'host,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt' -m 65536 -object 'iothread,id=iothread-virtioscsi0' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=8da5648e-d72a-405e-833e-1ae3cce1e881' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:82:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,x-vga=on,multifunction=on' -device 'vfio-pci,host=0000:82:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:eac3be58846' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/dev/zvol/nvme_virt1/vm-106-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2A:96:50:42:E3:F6,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=101' -netdev 'type=tap,id=net1,ifname=tap106i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=DE:A3:76:A8:FC:05,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256' -machine 'smm=off,type=q35+pve0'' failed: got timeout
Sep 29 11:12:07 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:12:07 virt1 pvestatd[3884]: status update time (8.089 seconds)
Sep 29 11:12:17 virt1 pvestatd[3884]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
Sep 29 11:12:17 virt1 pvestatd[3884]: status update time (8.089 seconds)
Sep 29 11:12:20 virt1 pvedaemon[651968]: VM 106 qmp command failed - VM 106 qmp command 'query-proxmox-support' failed - unable to connect to VM 106 qmp socket - timeout after 51 retries
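For reference, one way to get more detail than the journal provides is to dump the full QEMU command line Proxmox generates and run it by hand in the foreground, so any vfio error prints straight to the terminal. A rough sketch, assuming VM 106 as above:

Code:
# Print the exact QEMU command Proxmox would run for this VM:
qm showcmd 106

# Paste that command into a shell and remove the -daemonize flag so QEMU
# stays in the foreground; its own error messages then appear on stderr
# instead of being reduced to "start failed: got timeout".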
 
Check your IOMMU groups (without using pcie_acs_override). Maybe they are both in the same group and therefore cannot be shared between VMs (and/or the Proxmox host).
Remember that all VM memory must be pinned into actual host memory, and make sure that you have enough memory available (try starting the VMs with much less memory to test this).
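In case it helps, a quick sketch for both checks (the VM ID is the one from this thread; adjust as needed):

Code:
# Print every PCI function together with its IOMMU group:
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}
    echo "group ${g%%/*}: ${d##*/}"
done

# Temporarily shrink the VM to 4 GiB to rule out memory-pinning failures:
qm set 106 --memory 4096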
 
Thanks, they are now in different groups (with pcie_acs_override), but without it they are not. Yes, plenty of RAM; it's something with passthrough.
 
Be aware that the VMs are no longer securely isolated and can read/write each other's memory. If some of the devices in the IOMMU group are still on the Proxmox host, a VM can read things like passwords from all of that memory, including the Proxmox host's and other VMs'. I'm not sure from your reply whether it is working now.
 
Sorry, you're right, I was not clear. It is still in the same state as the original post: I can start one VM or the other, but not both when the cards are split across two VMs. I am using pcie_acs_override and see them each in a different group.

Code:
/sys/kernel/iommu_groups/48/devices/0000:81:00.0
/sys/kernel/iommu_groups/48/devices/0000:81:00.1
/sys/kernel/iommu_groups/49/devices/0000:82:00.0
/sys/kernel/iommu_groups/49/devices/0000:82:00.1

I am aware of the security risk; all the VMs are mine, so it's not a huge deal.
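For completeness, one thing that may be worth ruling out before both VMs start: each card (including its audio function) should already be bound to vfio-pci rather than to nvidia/nouveau or snd_hda_intel. A minimal check, using the addresses above:

Code:
lspci -nnk -s 81:00
lspci -nnk -s 82:00
# Every function should report "Kernel driver in use: vfio-pci"; if a host
# driver is still attached, blacklist it or bind the device IDs to vfio-pci
# via /etc/modprobe.d/ before starting the VMs.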
 
So, a bit more info: the four GPUs are on two x16 slots, each bifurcated into two x8 slots, one per GPU. When I boot without pcie_acs_override=downstream, the first two cards are in group 13 and the last two in group 49, so that won't work with four VMs each using one card. With pcie_acs_override=downstream I get all four cards in their own groups:

Code:
root@virt1:~# find /sys/kernel/iommu_groups/ -type l |grep 000:01
/sys/kernel/iommu_groups/99/devices/0000:01:00.0
/sys/kernel/iommu_groups/99/devices/0000:01:00.1
root@virt1:~# find /sys/kernel/iommu_groups/ -type l |grep 000:02
/sys/kernel/iommu_groups/100/devices/0000:02:00.0
/sys/kernel/iommu_groups/100/devices/0000:02:00.1
root@virt1:~# find /sys/kernel/iommu_groups/ -type l |grep 000:81
/sys/kernel/iommu_groups/48/devices/0000:81:00.0
/sys/kernel/iommu_groups/48/devices/0000:81:00.1
root@virt1:~# find /sys/kernel/iommu_groups/ -type l |grep 000:82
/sys/kernel/iommu_groups/49/devices/0000:82:00.0
/sys/kernel/iommu_groups/49/devices/0000:82:00.1

The odd part for me is that the cards in groups 99/100 work fine with pcie_acs_override=downstream, but the cards in groups 48/49 can't both be in use at the same time for some reason.
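Since 0000:81:00 and 0000:82:00 share a bifurcated slot, it may also be worth looking at what sits between them in the PCI topology and at what the kernel logs the moment the second VM fails to start. A rough sketch (the grep pattern is just a starting point):

Code:
# The sysfs path shows every bridge between the root port and each card:
readlink -f /sys/bus/pci/devices/0000:81:00.0
readlink -f /sys/bus/pci/devices/0000:82:00.0

# Right after the failed start, look for vfio / IOMMU messages:
dmesg | grep -iE 'vfio|iommu|amd-vi' | tail -50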
 
