Help troubleshooting a VM with GPU that won't start, while another VM with GPU has been running fine for over a month

mirek186 · Mar 3, 2023

Hi,

I wonder if someone can help me with troubleshooting steps for a VM with a GPU that won't start. I've got a server with 8 NVIDIA Tesla M40 GPU cards, and GPU passthrough has been configured and working fine for over a year. Recently I had a request to create a new VM with a GPU, but I couldn't get it to start. After a reboot of the server, the VM started OK, and it's been working OK since. However, now I have another request for a GPU VM, and I can't keep restarting the server in the hope that it sorts out the issue. There is not much in the logs other than a timeout message when I try to start the VM:

Code:

Mar 03 10:37:48 proxgpu pvestatd[9079]: VM 145 qmp command failed - VM 145 qmp command 'query-proxmox-support' failed - unable to connect to VM 145 qmp socket - timeout after 31 retries
Mar 03 10:37:48 proxgpu pvestatd[9079]: status update time (6.316 seconds)
Mar 03 10:37:57 proxgpu pvedaemon[2024118]: start failed: command '/usr/bin/kvm -id 145 -name ub2204-gpu-template -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/145.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/145.pid -daemonize -smbios 'type=1,uuid=039fc4b6-a44c-4e47-a699-5b56296b7343' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/145.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 8192 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=525ecadd-5fd9-4653-a214-06ed788dc92c' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:83:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -drive 'file=/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-145-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap145i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=A2:1F:8C:47:53:EA,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 10:37:57 proxgpu pvedaemon[2776800]: <root@pam> end task UPID:proxgpu:001EE2B6:0CBA7DBC:6401CDE6:qmstart:145:root@pam: start failed: command '/usr/bin/kvm -id 145 -name ub2204-gpu-template -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/145.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/145.pid -daemonize -smbios 'type=1,uuid=039fc4b6-a44c-4e47-a699-5b56296b7343' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/145.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 8192 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=525ecadd-5fd9-4653-a214-06ed788dc92c' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:83:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -drive 'file=/var/lib/vz/template/iso/ubuntu-22.04-live-server-amd64.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-145-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap145i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=A2:1F:8C:47:53:EA,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 10:37:58 proxgpu pvestatd[9079]: VM 145 qmp command failed - VM 145 qmp command 'query-proxmox-support' failed - unable to connect to VM 145 qmp socket - timeout after 31 retries

Any help is much appreciated.
Thanks

leesteken · Mar 3, 2023

If there is no actual error and just a timeout, then Proxmox cannot find enough free memory for the VM. Because of PCI(e) passthrough (and therefore possible device-initiated DMA) all VM memory must be pinned into actual host memory and ballooning and KSM won't work (but memory hotplug can). Try starting your VMs that use passthrough with less memory (like half).
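
To make the memory argument concrete, here is a minimal sketch for comparing the VM's configured memory against what the host reports as available. The function names mem_available_mib and vm_memory_fits are made up for illustration, and MemAvailable is only the kernel's own estimate, so a pass is a hint rather than proof that the memory can actually be pinned.

```shell
#!/bin/sh
# Print MemAvailable in MiB from a meminfo-style file.
# $1 = file to read (assumption: defaults to /proc/meminfo on the host).
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if the requested VM memory (in MiB) is below the available estimate.
# $1 = VM memory in MiB, $2 = optional meminfo-style file.
vm_memory_fits() {
    vm_mib=$1
    avail_mib=$(mem_available_mib "${2:-/proc/meminfo}")
    [ -n "$avail_mib" ] && [ "$vm_mib" -lt "$avail_mib" ]
}
```

For example, `vm_memory_fits 8192 && echo "8 GiB should fit"` checks the 8192 MiB from the failing command line against the live host.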

EDIT: I guess it's something else but without any error message (except timeout) I have no clue, sorry.

mirek186 · Mar 3, 2023

leesteken said:
If there is no actual error and just a timeout, then Proxmox cannot find enough free memory for the VM. Because of PCI(e) passthrough (and therefore possible device-initiated DMA) all VM memory must be pinned into actual host memory and ballooning and KSM won't work (but memory hotplug can). Try starting your VMs that use passthrough with less memory (like half).

Hi, I don't think it's a memory issue. This VM only has 16GB allocated, and on the summary page I can see:

Code:

RAM usage   82.49% (467.61 GiB of 566.84 GiB)

The free command shows:

Code:

free -m
              total        used        free      shared  buff/cache   available
Mem:         580441      477494       98696         233        4250       98974
Swap:             0           0           0

I've also tried reducing the VM to 4GB of memory and 1 CPU, but I still get the same error:

Code:

root@proxgpu:~# journalctl -f
-- Logs begin at Mon 2023-02-06 17:26:12 GMT. --
Mar 03 12:11:14 proxgpu login[2504420]: pam_unix(login:session): session opened for user root by root(uid=0)
Mar 03 12:11:14 proxgpu login[2504425]: ROOT LOGIN  on '/dev/pts/1' from '172.16.220.1'
Mar 03 12:11:17 proxgpu pvedaemon[2502427]: start failed: command '/usr/bin/kvm -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -daemonize -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:08:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 12:11:17 proxgpu pvedaemon[2776800]: <root@pam> end task UPID:proxgpu:00262F1B:0CC3092C:6401E3C6:qmstart:144:root@pam: start failed: command '/usr/bin/kvm -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -daemonize -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:08:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'' failed: got timeout
Mar 03 12:11:18 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:19 proxgpu pvestatd[9079]: status update time (6.292 seconds)
Mar 03 12:11:28 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:28 proxgpu pvestatd[9079]: status update time (6.321 seconds)
Mar 03 12:11:38 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:38 proxgpu pvestatd[9079]: status update time (6.316 seconds)
Mar 03 12:11:48 proxgpu pvestatd[9079]: VM 144 qmp command failed - VM 144 qmp command 'query-proxmox-support' failed - unable to connect to VM 144 qmp socket - timeout after 31 retries
Mar 03 12:11:49 proxgpu pvestatd[9079]: status update time (6.294 seconds)

mirek186 · Mar 3, 2023

I've tried to make sure I allocate enough memory for the PCI GPU card. I wasn't sure if it matters, but since my cards are 24GB I tried allocating at least 32GB. That still leaves over 60GB free on the server, but I get the same thing: the VM won't even try to boot.

Does anyone know any good tutorials on how to get meaningful logs when troubleshooting PCI passthrough of GPU cards? At the moment I'm shooting in the dark and can't find the culprit.
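
One cheap check in the meantime is which kernel driver currently owns the card. Below is a minimal sketch, assuming the standard sysfs layout; the function name pci_driver and its optional sysfs-root argument exist only for illustration. For a card that is handed to a VM I'd expect to see vfio-pci here, so anything else would be worth a closer look.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1 = PCI address such as 0000:83:00.0
# $2 = sysfs root (assumption: defaults to /sys)
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The driver entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

Running `pci_driver 0000:83:00.0` on the host prints the driver for the card from the failing start command.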

mirek186 · Mar 3, 2023

I've tried a few more things, mainly to get extra logging / debugging messages from KVM. To rule out a timeout caused by the big memory allocation, I ran KVM as a standalone command.

Code:

/usr/bin/kvm -D /tmp/debug.log -id 144 -name gaurav-lab -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/144.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/144.pid -smbios 'type=1,uuid=cc232c3a-04da-48c5-a5ce-9f5e8192b4b0' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on' -vnc unix:/var/run/qemu-server/144.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 32768 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=e670323b-6353-4ee0-b2fc-03ce1636b87e' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:84:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:26c85950ee8b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/zvol/localdata-zfs/vm-144-disk-0,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap144i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=2E:E7:E6:06:E5:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=101' -machine 'type=q35+pve0'

Nothing happens, even when I leave it for more than 30 minutes. I then added -D to get some debug output; the log file gets created but it's empty. I added -d trace:pci_cfg_read, still an empty log file. I tried a few switches to enable the serial console, but there is still nothing on stdout or stderr.
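
Since QEMU itself prints nothing, it may be worth asking the kernel what the stuck process is doing. Below is a minimal sketch, assuming a Linux /proc layout; proc_state is a hypothetical helper name. A state of D (uninterruptible sleep) would suggest the process is blocked inside the kernel rather than in QEMU's own code.

```shell
#!/bin/sh
# Print the one-letter state of a process (R, S, D, Z, ...).
# $1 = PID of the hung kvm process
# $2 = proc root (assumption: defaults to /proc)
proc_state() {
    awk '/^State:/ { print $2 }' "${2:-/proc}/$1/status"
}
```

The PID can be taken from something like `pgrep -f -- '-id 144 '`. As root, `cat /proc/<pid>/stack` then shows where in the kernel that process is waiting, if the kernel exposes it.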

I've also looked at the IOMMU mapping to make sure there is no mix-up there; it all looks good.

Code:

for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done | grep Tesla
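
The one-liner above lists every group. When only one card is of interest, a more targeted sketch could look like this, assuming the standard sysfs layout; iommu_group_of and iommu_group_members are illustrative names, not existing tools. Devices that share a group generally cannot be split between the host and a VM, so an unexpected extra device in the output would be a red flag.

```shell
#!/bin/sh
# Print the IOMMU group number of a PCI device.
# $1 = PCI address such as 0000:83:00.0
# $2 = sysfs root (assumption: defaults to /sys)
iommu_group_of() {
    basename "$(readlink "${2:-/sys}/bus/pci/devices/$1/iommu_group")"
}

# List every device that shares the IOMMU group of the given PCI device.
iommu_group_members() {
    group=$(iommu_group_of "$1" "$2")
    ls "${2:-/sys}/kernel/iommu_groups/$group/devices"
}
```

For a card that sits alone in its group, `iommu_group_members 0000:83:00.0` should print only that one address.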

I've tried every free GPU card I have, just to rule that out as well. Still nothing.

I've also checked every log on the system (kern.log, messages, dmesg, daemon.log) and there is absolutely nothing about PCI or I/O errors. When starting KVM, the only related message I found is:

Code:

Mar  3 12:12:48 proxgpu kernel: [2141167.211554] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Mar  3 12:12:48 proxgpu kernel: [2141167.211573] vfio-pci 0000:08:00.0: vfio_ecap_init: hiding ecap 0x19@0x900

But I couldn't find anything on Google related to this message, and it does not look like an error.

The moment I remove the GPU, the VM boots fine.

My Proxmox is 6.4-13, but I can't just upgrade it now; I would have to schedule a proper maintenance window.

Does anyone know anything that could help me figure it out?

Thanks,
Mirek

mirek186 · Mar 4, 2023

Latest update. I've decided to upgrade from 6.4 to 7.2 to see if it helps at all. Before starting the upgrade I rebooted the server as a test, and as suspected, after the reboot I was able to allocate the PCI devices just fine and both VMs started with no issues at all. One has 2x GPU and 160GB RAM and the other has 1x GPU and 32GB RAM. I think something is locking / not releasing resources once the server has been running for a while; I just have no idea how to troubleshoot that.

I'll continue with the upgrade and see if I have the same issue on 7.2.

mirek186 · Mar 5, 2023

I think this is the final update. I've upgraded Proxmox to 7.3-6 with kernel 5.19, and so far I've managed to start/stop multiple VMs with GPU cards attached two days in a row. It looks like whatever was holding resources in the past has been fixed now.

leesteken · Mar 5, 2023

mirek186 said:
I think final update. I've upgraded proxmox to 7.3-6 with Kernel 5.19, and so far, I've managed to start/stop multiple VM with GPU cards attached two days in a row. Looks like whatever has been holding resources in the past has been fixed now.

Note that kernel 5.19 is no longer supported by Proxmox: it won't get updates and will become insecure (if it isn't already). Please upgrade to kernel 6.1, which will be the default for the next (major) Proxmox version.

mirek186 · Mar 5, 2023

leesteken said:
Note that kernel 5.19 is already out of support by Proxmox and won't get updates and will become insecure (if it's not already). Please upgrade to kernel 6.1, which will be the default for the next (major) Proxmox version.

Thanks for pointing it out. I've made a big jump from 6.4 to 7.3 and read about a few issues with amdgpu on 6.1, so I thought I'd stay on a stable, tested kernel and do another upgrade at the next maintenance window.

leesteken · Mar 5, 2023

mirek186 said:
Thanks for pointing it out. I've made a big jump from 6.4 to 7.3 and read about a few issues with amdgpu on 6.1 so I thought I'll stay on a stable tested kernel and at the next maintenance window do another upgrade.

I understand but I do want to point out that 5.19 for Proxmox is not actually stable and was made available temporarily for testing.

mirek186 · Mar 6, 2023

leesteken said:
I understand but I do want to point out that 5.19 for Proxmox is not actually stable and was made available temporarily for testing.

I didn't know that either. Thanks for pointing that out. I'll try to set up a maintenance window a bit earlier than that and switch to 6.1. I assume that, from a Proxmox point of view, it does not matter if you upgrade the kernel in a cluster node by node and run two different versions for a short period of time.

Odinos · Sep 30, 2023

mirek186 said:
Latest update. I've decided to upgrade from 6.4 to 7.2 to see if it helps at all. Before I went with the upgrade I rebooted the server just to test that after a reboot, I was able to allocate PCI just fine and as suspected both VM started with no issues at all. One has 2x GPU and 160GB RAM and another one with 1x GPU and 32GB RAM. I think it's something locking / not releasing resources once the server is running for a while, just have no idea how to troubleshoot that.

I'll continue with the upgrade and see if I'll have the same issue on 7.2

Hey, sorry for the necro, but I've been having a similar issue recently and am wondering if you gained any other insights in the meantime?

When I reboot my Proxmox node, the VM with GPU passthrough starts fine. But if I shut down just the VM, it is unable to start again and shows the same task log as in your OP. The only way to start the VM is to reboot the whole node again.

I'm on PVE 8.0.4.
