Hi community,

We have run into the following problem multiple times now: the PCIe GPU gets stuck and becomes unavailable, both on the host and, of course, in any virtual machine.
Setup: a Proxmox host with an NVIDIA Tesla T4 and the NVIDIA GRID driver version 450.89 installed:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
This setup worked fine, and we were able to assign the mediated device (the GPU) to different VMs. After we removed those mediated devices, they became unavailable; in other words, they were not freed on removal so that they could be used in another VM (see the sysfs check further down). The GPU now shows "Mediated Devices: No" in the Proxmox GUI. We then tried to add the GPU directly as a PCIe device (PCIe passthrough mode) and received this error in the Proxmox GUI:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 193 -name nvidia-docker-test-andy -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/193.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/193.pid -daemonize -smbios 'type=1,uuid=37501dc1-e4e5-4520-b24c-838157d49d4d' -smp '20,sockets=2,cores=10,maxcpus=20' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/193.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 262144 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=214b3fbd-072f-4fbb-9bba-6c93504c01f8' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'vfio-pci,host=0000:37:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/193.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4c155aa8f549' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/ironcluster/VM/images/193/vm-193-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap193i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=AE:C3:27:A7:XX:XX,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout
and in the error log:
pvedaemon[88369]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
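To make the "not freed" part more concrete: the mediated-device state can be inspected through the kernel's mdev sysfs interface. A rough sketch of how this can be checked (assuming the standard sysfs layout; 0000:37:00.0 is the T4 from the lspci output above, and the UUID in the last command is only a placeholder):
Code:
# How many instances of each vGPU type the T4 still reports as free
for t in /sys/class/mdev_bus/0000:37:00.0/mdev_supported_types/*; do
    echo "$(basename "$t"): $(cat "$t/available_instances") available"
done

# Mediated devices that still exist on the host (should be empty after removal)
ls -l /sys/bus/mdev/devices/

# A leftover mediated device can be removed by hand via its UUID (placeholder shown)
# echo 1 > /sys/bus/mdev/devices/<uuid>/remove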
Now, running nvidia-smi, we get this error (it worked fine before):
Code:
# nvidia-smi
Failed to initialize NVML: Unknown Error
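In this state it would probably also help to capture the host-side driver logs and the state of the vGPU daemons. A rough sketch of what we could collect next time it happens (the service names are an assumption based on the GRID host driver package; adjust if they differ on your installation):
Code:
# NVIDIA vGPU host daemons (names assumed from the GRID host driver)
systemctl status nvidia-vgpud.service nvidia-vgpu-mgr.service

# Kernel messages from the NVIDIA driver (NVRM / Xid errors) since boot
dmesg -T | grep -iE 'nvrm|xid'
journalctl -k -b | grep -i nvidia | tail -n 50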
lspci still returns the correct GPU:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
and the modules are loaded:
Code:
# lsmod | grep ^nvidia
nvidia_vgpu_vfio 53248 0
nvidia 19750912 10 nvidia_vgpu_vfio
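Since we switched between mediated devices and plain passthrough, it is probably also worth confirming which driver currently owns the card (nvidia vs. vfio-pci). A quick check, as a sketch:
Code:
# Show the kernel driver currently bound to the T4
lspci -ks 37:00.0
readlink /sys/bus/pci/devices/0000:37:00.0/driver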
It will probably run fine again after a reboot of the host (it always did previously), but 43 virtual machines are currently running on this host, so that is not an ideal option.

Does anyone know how to solve this issue without restarting the host every couple of weeks?
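In case it helps the discussion: would a driver reload plus PCI remove/rescan be a sane alternative to a full reboot? A rough, untested sketch (only once no VM or process is still holding 0000:37:00.0):
Code:
# Untested idea: re-initialize the card without rebooting the whole host.
# Unload the NVIDIA modules (nvidia_vgpu_vfio first, since it depends on nvidia)
rmmod nvidia_vgpu_vfio nvidia

# Drop the device from the PCI tree and rescan the bus
echo 1 > /sys/bus/pci/devices/0000:37:00.0/remove
echo 1 > /sys/bus/pci/rescan

# Reload the vGPU module (pulls in nvidia again as a dependency)
modprobe nvidia_vgpu_vfio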
Thanks in advance for any input!