Help troubleshooting: a VM with GPU won't start, but another VM with GPU has been running fine for over a month

mirek186

Hi,

I wonder if someone can help me with troubleshooting steps for a VM with GPU passthrough that won't start. I've got a server with 8 NVIDIA Tesla M40 GPU cards, and GPU passthrough has been configured and working fine for over a year. Recently I had a request to create a new VM with a GPU, but I couldn't get it to start. After a reboot of the server, the VM started OK and it has been working since. However, I now have another request for a GPU VM, and I can't keep restarting the server and hoping that sorts out the issue. There is not much in the logs other than a timeout message when I try to start the VM:
Code:
Mar 03 10:37:48 proxgpu pvestatd[9079]: VM 145 qmp command failed - VM 145 qmp command 'query-proxmox-support' failed - unable to connect to VM 145 qmp socket - timeout after 31 retries
Mar 03 10:37:48 proxgpu pvestatd[9079]: status update time (6.316 seconds)
Mar 03 10:37:57 proxgpu pvedaemon[2024118]: start failed: command '/usr/bin/kvm -id 145 -name ub2204-gpu-template -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/145.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/145.pid -daemonize -smbios 'type=1,uuid=039fc4b6-a44c-4e47-a699-5b56296b7343' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/145.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 8192 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=525ecadd-5fd9-4653-a214-06ed788dc92c' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:83:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -drive 'file=/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-145-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap145i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=A2:1F:8C:47:53:EA,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 10:37:57 proxgpu pvedaemon[2776800]: <root@pam> end task UPID:proxgpu:001EE2B6:0CBA7DBC:6401CDE6:qmstart:145:root@pam: start failed: command '/usr/bin/kvm -id 145 -name ub2204-gpu-template -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/145.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/145.pid -daemonize -smbios 'type=1,uuid=039fc4b6-a44c-4e47-a699-5b56296b7343' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/145.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 8192 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=525ecadd-5fd9-4653-a214-06ed788dc92c' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:83:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -drive 'file=/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-145-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap145i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=A2:1F:8C:47:53:EA,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 10:37:58 proxgpu pvestatd[9079]: VM 145 qmp command failed - VM 145 qmp command 'query-proxmox-support' failed - unable to connect to VM 145 qmp socket - timeout after 31 retries

Any help is much appreciated.
Thanks
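
With command lines this long it is easy to lose track of which GPU each failed start was trying to grab. The following is only a rough sketch, and vfio_hosts is a made-up helper name, not part of Proxmox. It pulls the VM ID and the vfio-pci host address out of log lines like the ones above:

```shell
# Sketch only: vfio_hosts is a hypothetical helper, not a Proxmox tool.
# Reads log lines on stdin and prints "VMID PCI-address" for every
# vfio-pci device found in a logged kvm command line.
vfio_hosts() {
    while IFS= read -r line; do
        # VM ID from the "-id NNN" argument ("?" if the line has none)
        vmid=$(printf '%s\n' "$line" | grep -oE -e '-id [0-9]+' | head -n 1 | cut -d ' ' -f 2)
        # every "vfio-pci,host=ADDR" becomes "VMID ADDR"
        printf '%s\n' "$line" \
            | grep -oE 'vfio-pci,host=[0-9a-fA-F:.]+' \
            | sed "s/^vfio-pci,host=/${vmid:-?} /"
    done
}

# Example (assumes the journal still holds the failed starts):
#   journalctl -u pvedaemon | grep 'start failed' | vfio_hosts | sort -u
```

The output makes it quicker to compare the addresses against the cards that are already assigned to running VMs.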
 
If there is no actual error and just a timeout, then Proxmox cannot find enough free memory for the VM. Because of PCI(e) passthrough (and therefore possible device-initiated DMA), all VM memory must be pinned into actual host memory, so ballooning and KSM won't work (but memory hotplug can). Try starting your VMs that use passthrough with less memory (like half).
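
A quick way to test the memory theory before starting the VM is to compare the VM's configured memory with what the kernel reports as available. This is a minimal sketch: the function names are made up for illustration, and MemAvailable is only the kernel's estimate, so a pass here does not guarantee that pinning will succeed.

```shell
# Sketch only: hypothetical helpers, not part of Proxmox.
# Print MemAvailable in MiB from a meminfo-style file
# (defaults to /proc/meminfo).
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if at least $1 MiB is reported as available.
# $2 optionally names a meminfo-style file.
mem_fits() {
    need_mib=$1
    avail_mib=$(mem_available_mib "${2:-/proc/meminfo}")
    [ "$avail_mib" -ge "$need_mib" ]
}

# Example: check for a VM configured with 32768 MiB
#   mem_fits 32768 && echo "looks OK" || echo "probably too little free memory"
```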

EDIT: I guess it's something else but without any error message (except timeout) I have no clue, sorry.
 
Hi, I don't think it's a memory issue. This VM only has 16GB allocated, and on the summary page I can see:
Code:
RAM usage   82.49% (467.61 GiB of 566.84 GiB)
The free command shows:
Code:
free -m
              total        used        free      shared  buff/cache   available
Mem:         580441      477494       98696         233        4250       98974
Swap:             0           0           0

I've also tried reducing the VM to 4GB of memory and 1 CPU, but I still get the same error:
Code:
root@proxgpu:~# journalctl -f
-- Logs begin at Mon 2023-02-06 17:26:12 GMT. --
Mar 03 12:11:14 proxgpu login[2504420]: pam_unix(login:session): session opened for user root by root(uid=0)
Mar 03 12:11:14 proxgpu login[2504425]: ROOT LOGIN  on '/dev/pts/1' from '172.16.220.1'
Mar 03 12:11:17 proxgpu pvedaemon[2502427]: start failed: command '/usr/bin/kvm -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -daemonize -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:08:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 12:11:17 proxgpu pvedaemon[2776800]: <root@pam> end task UPID:proxgpu:00262F1B:0CC3092C:6401E3C6:qmstart:144:root@pam: start failed: command '/usr/bin/kvm -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -daemonize -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:08:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 12:11:18 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:19 proxgpu pvestatd[9079]: status update time (6.292 seconds)
Mar 03 12:11:28 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:28 proxgpu pvestatd[9079]: status update time (6.321 seconds)
Mar 03 12:11:38 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:38 proxgpu pvestatd[9079]: status update time (6.316 seconds)
Mar 03 12:11:48 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:49 proxgpu pvestatd[9079]: status update time (6.294 seconds)
 
I've also tried to make sure I allocate enough memory for the PCI GPU card. I wasn't sure if it matters, but since my cards have 24GB each, I tried allocating at least 32GB to the VM. That still leaves over 60GB free on the server, but the result is the same: the VM won't even try to boot.

Does anyone know any good tutorials on how to get meaningful logs when troubleshooting PCI passthrough of GPU cards? At the moment I'm shooting in the dark and can't find the culprit.
 
I've tried a few more things, mainly to get extra logging / debugging messages from KVM. To rule out a timeout caused by the big memory allocation, I ran KVM as a standalone command:
Code:
/usr/bin/kvm -D /tmp/debug.log -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 32768 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:84:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'

Nothing happens, even when I left it for more than 30 minutes. I then added -D to get debug output. The log file gets created, but it's empty. I added -d trace:pci_cfg_read, but the log file is still empty. I tried a few switches to enable the serial console, but there is still nothing on stdout or stderr.
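
When QEMU itself logs nothing, it may help to look at what the hung process is doing from the kernel's side. This is just a sketch with made-up helper names. It reads the process state and wait channel from /proc, using the PID from the pidfile path in the command above:

```shell
# Sketch only: hypothetical helpers that read Linux /proc entries.
# Print the one-letter process state (R, S, D, ...) for PID $1.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print the kernel function PID $1 is sleeping in, if the kernel exposes it.
proc_wchan() {
    cat "/proc/$1/wchan" 2>/dev/null
    echo
}

# Example, assuming the pidfile from the kvm command line exists:
#   pid=$(cat /var/run/qemu-server/144.pid)
#   proc_state "$pid"; proc_wchan "$pid"
```

A state of D (uninterruptible sleep) would suggest the process is stuck inside the kernel rather than in QEMU. In that case, reading /proc/<pid>/stack as root may show where, if the kernel provides that file.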

I've also looked at the IOMMU mapping to make sure there is no mix-up there, and it all looks good.
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done | grep Tesla

I've tried every free GPU card I have, just to rule that out as well. Still nothing.

I've also checked every log on the system (kern.log, messages, dmesg, daemon.log), and there is absolutely nothing about any PCI or I/O errors. When starting KVM, the only related message I found is:
Code:
Mar  3 12:12:48 proxgpu kernel: [2141167.211554] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Mar  3 12:12:48 proxgpu kernel: [2141167.211573] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x19@0x900

But I couldn't find anything about these messages online, and they don't look like errors.
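
As far as I can tell, these lines are informational: vfio-pci reports PCIe extended capabilities that it does not expose to the guest. The numbers are extended capability IDs. Below is a small sketch that decodes them; the helper names are made up and the table covers only a handful of IDs, with 0x19 and 0x1e being the two seen here:

```shell
# Sketch only: hypothetical helpers with a deliberately incomplete table.
# Map a PCIe extended capability ID to a readable name.
ecap_name() {
    case "$1" in
        0x01) echo "Advanced Error Reporting" ;;
        0x0d) echo "Access Control Services" ;;
        0x10) echo "SR-IOV" ;;
        0x19) echo "Secondary PCI Express" ;;
        0x1e) echo "L1 PM Substates" ;;
        *)    echo "unknown" ;;
    esac
}

# Read kernel log lines on stdin and decode each "hiding ecap ID@OFFSET".
decode_hidden_ecaps() {
    grep -oE 'hiding ecap 0x[0-9a-f]+@0x[0-9a-f]+' | while read -r w1 w2 cap; do
        id=${cap%@*}     # part before the "@": capability ID
        off=${cap#*@}    # part after the "@": config space offset
        printf '%s at offset %s: %s\n' "$id" "$off" "$(ecap_name "$id")"
    done
}

# Example:
#   dmesg | decode_hidden_ecaps
```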

The moment I remove the GPU, the VM boots fine.

My Proxmox version is 6.4-13, but I can't just upgrade it now; I would have to schedule a proper maintenance window.

Does anyone know anything that could help me figure it out?

Thanks,
Mirek
 
Latest update: I decided to upgrade from 6.4 to 7.2 to see if it helps at all. Before starting the upgrade I rebooted the server as a test. After the reboot I was able to allocate the PCI devices just fine and, as suspected, both VMs started with no issues at all. One has 2x GPU and 160GB RAM and the other has 1x GPU and 32GB RAM. I think something is locking or not releasing resources once the server has been running for a while; I just have no idea how to troubleshoot that.

I'll continue with the upgrade and see if I have the same issue on 7.2.
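
For the "something not releasing resources" theory, one thing that can be checked without a reboot is which driver a GPU is currently bound to and which IOMMU group it sits in. The sketch below uses made-up helper names and only reads the standard sysfs symlinks. The optional second argument exists purely so the functions can be pointed at a test directory instead of /sys.

```shell
# Sketch only: hypothetical helpers that read standard sysfs symlinks.
# Print the driver bound to PCI device $1 (e.g. 0000:08:00.0), or "none".
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Print the IOMMU group number of PCI device $1.
pci_iommu_group() {
    link="${2:-/sys}/bus/pci/devices/$1/iommu_group"
    [ -L "$link" ] && basename "$(readlink "$link")"
}

# Example:
#   pci_driver 0000:08:00.0
#   pci_iommu_group 0000:08:00.0
```

If the device is bound to vfio-pci, running fuser -v on /dev/vfio/<group> as root should show whether some process still has that group open.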
 
I think this is the final update. I've upgraded Proxmox to 7.3-6 with kernel 5.19, and so far I've managed to start and stop multiple VMs with GPU cards attached two days in a row. It looks like whatever was holding resources in the past has been fixed now.
 
Note that kernel 5.19 is no longer supported by Proxmox; it won't get updates and will become insecure (if it isn't already). Please upgrade to kernel 6.1, which will be the default for the next (major) Proxmox version.
 
Thanks for pointing it out. I've made a big jump from 6.4 to 7.3 and read about a few issues with amdgpu on 6.1, so I thought I'd stay on a stable, tested kernel and do another upgrade at the next maintenance window.
 
I understand, but I do want to point out that 5.19 for Proxmox is not actually stable and was made available temporarily for testing.
I didn't know that either. Thanks for pointing that out. I'll try to set up a maintenance window a bit earlier and switch to 6.1. I assume that, from a Proxmox point of view, it doesn't matter if you upgrade the kernel across a cluster sequentially and run two different versions for a short period of time.
 
Hey, sorry for the necro, but I've been having a similar issue recently and am wondering if you gained any other insights in the meantime?

When I reboot my Proxmox node, the VM with GPU passthrough starts fine. But if I shut down just the VM, it is unable to start again and shows the same task log as in your OP. The only way to start the VM is to reboot the whole node again.
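
One thing that might be worth ruling out in this situation is a leftover QEMU process from the previous run that still holds the GPU after the VM was shut down. This is only a sketch with a made-up function name. It assumes the pidfile location seen in the kvm command lines earlier in the thread (/var/run/qemu-server/<vmid>.pid):

```shell
# Sketch only: hypothetical helper, not a Proxmox command.
# Succeed if the pidfile for VM $1 exists and names a live process.
# $2 optionally overrides the pidfile directory.
vm_pid_alive() {
    pidfile="${2:-/var/run/qemu-server}/$1.pid"
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the PID exists
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Example, after shutting the VM down:
#   vm_pid_alive 144 && echo "old kvm process is still around"
```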

I'm on pve 8.0.4.
 
Hi all,

Does anyone out there have any updates on this topic?
I have the same issue.
Here is my config:
pve: 7.4-17
kernel: 5.19.17-2-pve and 5.15.116-1-pve
 
