GPU unavailable / stuck

DerDieDas

Active Member
Dec 3, 2019
Hi community

we have run into the following problem multiple times now: the PCIe GPU gets stuck and becomes unavailable, both on the host and, of course, in any virtual machine.
Setup: a Proxmox host with an NVIDIA Tesla T4:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
NVIDIA GRID driver version 450.89 is installed; pveversion -v output:

proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

This setup worked fine, and we were able to assign the mediated device (the GPU) to different VMs. After we removed those mediated devices, they became unavailable; put differently, they were not freed on removal and could not be used in another VM. The GPU's "Mediated Devices" flag changed to "No". We then tried to add the GPU as a PCIe device (PCIe passthrough mode) and received this error in the Proxmox GUI:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 193 -name nvidia-docker-test-andy -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/193.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/193.pid -daemonize -smbios 'type=1,uuid=37501dc1-e4e5-4520-b24c-838157d49d4d' -smp '20,sockets=2,cores=10,maxcpus=20' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/193.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 262144 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=214b3fbd-072f-4fbb-9bba-6c93504c01f8' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'vfio-pci,host=0000:37:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/193.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4c155aa8f549' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/ironcluster/VM/images/193/vm-193-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap193i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=AE:C3:27:A7:XX:XX,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout

and in the error log:
pvedaemon[88369]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries

Now nvidia-smi returns this error (it worked fine before):
Code:
# nvidia-smi
Failed to initialize NVML: Unknown Error

lspci still shows the correct GPU:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

and the modules are loaded:
Code:
# lsmod | grep ^nvidia
nvidia_vgpu_vfio       53248  0
nvidia              19750912  10 nvidia_vgpu_vfio

It would probably run fine again after a reboot of the host (it always did previously), but 43 virtual machines are currently running on this host, so that is not an ideal option.

Does anyone know how to solve this issue without restarting every couple of weeks?
Thanks in advance for any input!
 
After we removed those mediated devices, they became unavailable; put differently, they were not freed on removal and could not be used in another VM. The GPU's "Mediated Devices" flag changed to "No".
How exactly did you remove it? Just removing it from the config does nothing with the device; stopping the VM already removes the vGPU.

Can you post dmesg from that time period?
The nvidia-smi error indicates that the driver somehow gets confused and refuses to work...
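For reference, any mediated devices still lingering on the host can be inspected and removed by hand through the standard mdev sysfs interface. This is only a sketch; the UUID below is one taken from the logs in this thread, and the script assumes the usual `/sys/bus/mdev` layout:

```shell
# Sketch: remove a stale NVIDIA mdev via sysfs (example UUID from this thread).
remove_stale_mdev() {
    uuid="$1"
    mdev="/sys/bus/mdev/devices/$uuid"
    if [ -e "$mdev" ]; then
        # Writing 1 to 'remove' asks the mdev core to tear the device down.
        echo 1 > "$mdev/remove"
    else
        echo "no mdev $uuid present"
    fi
}

remove_stale_mdev 00000000-0000-0000-0000-000000000193
```

If the write to `remove` hangs or fails, that in itself is a useful data point, since it suggests the vendor driver is refusing to release the device.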
 
Thanks for your reply. Not only is the driver confused, so am I :D
I removed them via the Proxmox GUI. Additionally, I shut down the VM.

[717417.977684] vfio_mdev 00000000-0000-0000-0000-000000000191: Removing from iommu group 164
[717417.978326] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: detaching iommu
[717489.764855] vfio_mdev 00000000-0000-0000-0000-000000000189: Adding to iommu group 164
[717489.765505] vfio_mdev 00000000-0000-0000-0000-000000000189: MDEV: group_id = 164
[...]
[717502.773001] vfio_mdev 00000000-0000-0000-0000-000000000190: Adding to iommu group 165
[717502.773601] vfio_mdev 00000000-0000-0000-0000-000000000190: MDEV: group_id = 165
[...]
[717504.400229] vfio_mdev 00000000-0000-0000-0000-000000000191: Adding to iommu group 166
[717504.400997] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: group_id = 166
[...]
[717668.570517] vfio_mdev 00000000-0000-0000-0000-000000000191: Removing from iommu group 166
[717668.571740] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: detaching iommu
[717699.334008] vfio_mdev 00000000-0000-0000-0000-000000000191: Adding to iommu group 166
[717699.334771] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: group_id = 166
[...]
[718736.743003] vfio_mdev 00000000-0000-0000-0000-000000000191: Removing from iommu group 166
[718736.743693] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: detaching iommu
[718742.783049] vfio_mdev 00000000-0000-0000-0000-000000000190: Removing from iommu group 165
[718742.783716] vfio_mdev 00000000-0000-0000-0000-000000000190: MDEV: detaching iommu
[718747.582981] vfio_mdev 00000000-0000-0000-0000-000000000189: Removing from iommu group 164
[718747.583720] vfio_mdev 00000000-0000-0000-0000-000000000189: MDEV: detaching iommu
[718805.629600] vfio_mdev 00000000-0000-0000-0000-000000000189: Adding to iommu group 164
[718805.630271] vfio_mdev 00000000-0000-0000-0000-000000000189: MDEV: group_id = 164
[...]
[718807.317827] vfio_mdev 00000000-0000-0000-0000-000000000190: Adding to iommu group 165
[718807.318502] vfio_mdev 00000000-0000-0000-0000-000000000190: MDEV: group_id = 165
[...]
[718809.577328] vfio_mdev 00000000-0000-0000-0000-000000000191: Adding to iommu group 166
[718809.577971] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: group_id = 166
[...]
[1225605.640323] vfio_mdev 00000000-0000-0000-0000-000000000189: Removing from iommu group 164
[1225605.640966] vfio_mdev 00000000-0000-0000-0000-000000000189: MDEV: detaching iommu
[1225608.626406] vmbr0: port 30(tap190i0) entered disabled state
[1225608.876421] vfio_mdev 00000000-0000-0000-0000-000000000190: Removing from iommu group 165
[1225608.877085] vfio_mdev 00000000-0000-0000-0000-000000000190: MDEV: detaching iommu
[1225612.083088] vmbr0: port 32(tap191i0) entered disabled state
[1225612.344345] vfio_mdev 00000000-0000-0000-0000-000000000191: Removing from iommu group 166
[1225612.345334] vfio_mdev 00000000-0000-0000-0000-000000000191: MDEV: detaching iommu
[...]
======== BREAK POINT =========
[1240385.450317] vfio_mdev 00000000-0000-0000-0000-000000000193: Adding to iommu group 164
[1240385.451002] vfio_mdev 00000000-0000-0000-0000-000000000193: MDEV: group_id = 164
[1240385.453562] vfio_mdev 00000001-0000-0000-0000-000000000193: Adding to iommu group 165
[1240385.454269] vfio_mdev 00000001-0000-0000-0000-000000000193: MDEV: group_id = 165
[1240385.456702] vfio_mdev 00000002-0000-0000-0000-000000000193: Adding to iommu group 166
[1240385.457191] vfio_mdev 00000002-0000-0000-0000-000000000193: MDEV: group_id = 166
[1240385.459245] vfio_mdev 00000003-0000-0000-0000-000000000193: Adding to iommu group 167
[1240385.459740] vfio_mdev 00000003-0000-0000-0000-000000000193: MDEV: group_id = 167
[...]
[1240396.903497] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000193: start failed. status: 0x65 Timeout Occured
[...]
[1243044.289447] nvidia 0000:37:00.0: MDEV: Unregistering
[1243044.290282] vfio_mdev 00000000-0000-0000-0000-000000000193: Removing from iommu group 164
[1243044.290914] vfio_mdev 00000000-0000-0000-0000-000000000193: MDEV: detaching iommu
[1243044.291495] vfio_mdev 00000001-0000-0000-0000-000000000193: Removing from iommu group 165
[1243044.291937] vfio_mdev 00000001-0000-0000-0000-000000000193: MDEV: detaching iommu
[1243044.292413] vfio_mdev 00000002-0000-0000-0000-000000000193: Removing from iommu group 166
[1243044.292822] vfio_mdev 00000002-0000-0000-0000-000000000193: MDEV: detaching iommu
[1243044.293266] vfio_mdev 00000003-0000-0000-0000-000000000193: Removing from iommu group 167
[1243044.293660] vfio_mdev 00000003-0000-0000-0000-000000000193: MDEV: detaching iommu
[...]
[1256109.736792] vfio-pci 0000:37:00.0: vfio_cap_init: hiding cap 0x0@0x68
[1256109.737401] vfio-pci 0000:37:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[1256109.737874] vfio-pci 0000:37:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[...]
I added the break point marker into the dmesg output, with entries from when it was still working above that point. It broke after adding the mdev device to VMID 193.
 
To add to this:
I think that [1243044.289447] nvidia 0000:37:00.0: MDEV: Unregistering is the problem.
dmesg for mdev:
Code:
[   51.073175] nvidia 0000:37:00.0: MDEV: Registered
[...]
[1243044.289447] nvidia 0000:37:00.0: MDEV: Unregistering
Now I wonder why it got unregistered.
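One way to confirm whether the driver has really dropped its mdev registration is to check whether the GPU still exposes any supported mdev types in sysfs. A minimal sketch, assuming the standard mdev sysfs layout and using the T4's PCI address from this thread:

```shell
# Check whether the NVIDIA driver still registers mdev types for a GPU.
list_mdev_types() {
    gpu="$1"
    types="/sys/class/mdev_bus/$gpu/mdev_supported_types"
    if [ -d "$types" ]; then
        # Each entry here is a vGPU profile that can still be instantiated.
        ls "$types"
    else
        echo "mdev types not registered for $gpu"
    fi
}

list_mdev_types 0000:37:00.0
```

After the "MDEV: Unregistering" message, this directory would presumably be gone, matching the NVML failure.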
 
Hmm... no idea, sadly. I also do not have such a card to test here, so I am not able to reproduce it. Maybe you can post the whole dmesg/journal?
 
Well, then we have to wait until the next maintenance window to restart the host :-(
Here are the relevant journal entries (from the time I was working on said VM):

Mär 08 17:09:24 --HOST-- pvedaemon[47187]: <root@pam> update VM 193: -name nvidia-docker-test-andy
Mär 08 17:09:30 --HOST-- pveproxy[5008]: worker 1093 finished
Mär 08 17:09:30 --HOST-- pveproxy[5008]: starting 1 worker(s)
Mär 08 17:09:30 --HOST-- pveproxy[5008]: worker 18083 started
Mär 08 17:09:31 --HOST-- pveproxy[18082]: worker exit
Mär 08 17:09:48 --HOST-- pvedaemon[47187]: <root@pam> update VM 193: -delete hostpci0
Mär 08 17:09:50 --HOST-- pvedaemon[47187]: <root@pam> update VM 193: -delete hostpci1
Mär 08 17:09:52 --HOST-- pvedaemon[47187]: <root@pam> update VM 193: -delete hostpci2
Mär 08 17:09:54 --HOST-- pvedaemon[47187]: <root@pam> update VM 193: -delete hostpci3
Mär 08 17:10:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:10:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:10:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:10:10 --HOST-- pvedaemon[53275]: <root@pam> starting task UPID:--HOST--:00006020:0767B302:60464C62:qmstart:193:root@pam:
Mär 08 17:10:10 --HOST-- pvedaemon[24608]: start VM 193: UPID:--HOST--:00006020:0767B302:60464C62:qmstart:193:root@pam:
Mär 08 17:10:10 --HOST-- systemd[1]: Started 193.scope.
Mär 08 17:10:10 --HOST-- systemd-udevd[24620]: Using default interface naming scheme 'v240'.
Mär 08 17:10:10 --HOST-- systemd-udevd[24620]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mär 08 17:10:10 --HOST-- systemd-udevd[24620]: Could not generate persistent MAC address for tap193i0: No such file or directory
Mär 08 17:10:10 --HOST-- kernel: device tap193i0 entered promiscuous mode
Mär 08 17:10:10 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:10:10 --HOST-- kernel: vmbr0: port 30(tap193i0) entered disabled state
Mär 08 17:10:10 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:10:10 --HOST-- kernel: vmbr0: port 30(tap193i0) entered forwarding state
Mär 08 17:10:11 --HOST-- pvedaemon[53275]: <root@pam> end task UPID:--HOST--:00006020:0767B302:60464C62:qmstart:193:root@pam: OK
Mär 08 17:11:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:11:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:11:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:11:02 --HOST-- pvedaemon[53275]: <root@pam> update VM 193: resize --disk scsi0 --size +18G
Mär 08 17:12:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:12:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:12:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:12:15 --HOST-- QEMU[24629]: kvm: terminating on signal 15 from pid 4035 (/usr/sbin/qmeventd)
Mär 08 17:12:16 --HOST-- kernel: vmbr0: port 30(tap193i0) entered disabled state
Mär 08 17:12:16 --HOST-- systemd[1]: 193.scope: Succeeded.
Mär 08 17:12:16 --HOST-- qmeventd[4034]: Starting cleanup for 193
Mär 08 17:12:16 --HOST-- qmeventd[4034]: Finished cleanup for 193
Mär 08 17:13:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:13:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:13:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:13:42 --HOST-- pvestatd[4878]: auth key pair too old, rotating..
Mär 08 17:14:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:14:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:14:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:14:30 --HOST-- pvedaemon[47187]: worker exit
Mär 08 17:14:30 --HOST-- pvedaemon[4901]: worker 47187 finished
Mär 08 17:14:30 --HOST-- pvedaemon[4901]: starting 1 worker(s)
Mär 08 17:14:30 --HOST-- pvedaemon[4901]: worker 67702 started
Mär 08 17:14:30 --HOST-- pvedaemon[85626]: <root@pam> starting task UPID:--HOST--:00010886:076818A2:60464D66:qmstart:193:root@pam:
Mär 08 17:14:30 --HOST-- pvedaemon[67718]: start VM 193: UPID:--HOST--:00010886:076818A2:60464D66:qmstart:193:root@pam:
Mär 08 17:14:30 --HOST-- systemd[1]: Started 193.scope.
Mär 08 17:14:30 --HOST-- systemd-udevd[67745]: Using default interface naming scheme 'v240'.
Mär 08 17:14:30 --HOST-- systemd-udevd[67745]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mär 08 17:14:30 --HOST-- systemd-udevd[67745]: Could not generate persistent MAC address for tap193i0: No such file or directory
Mär 08 17:14:31 --HOST-- kernel: device tap193i0 entered promiscuous mode
Mär 08 17:14:31 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:14:31 --HOST-- kernel: vmbr0: port 30(tap193i0) entered disabled state
Mär 08 17:14:31 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:14:31 --HOST-- kernel: vmbr0: port 30(tap193i0) entered forwarding state
Mär 08 17:14:31 --HOST-- pvedaemon[85626]: <root@pam> end task UPID:--HOST--:00010886:076818A2:60464D66:qmstart:193:root@pam: OK
Mär 08 17:15:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:15:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:15:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:15:43 --HOST-- pvedaemon[53275]: <root@pam> successful auth for user 'x@y'
Mär 08 17:16:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:16:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:16:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:16:14 --HOST-- pvedaemon[53275]: <root@pam> starting task UPID:--HOST--:00014A49:07684125:60464DCE:qmshutdown:193:root@pam:
Mär 08 17:16:14 --HOST-- pvedaemon[84553]: shutdown VM 193: UPID:--HOST--:00014A49:07684125:60464DCE:qmshutdown:193:root@pam:
Mär 08 17:16:15 --HOST-- QEMU[67761]: kvm: terminating on signal 15 from pid 4035 (/usr/sbin/qmeventd)
Mär 08 17:16:15 --HOST-- kernel: vmbr0: port 30(tap193i0) entered disabled state
Mär 08 17:16:15 --HOST-- pvedaemon[53275]: <root@pam> end task UPID:--HOST--:00014A49:07684125:60464DCE:qmshutdown:193:root@pam: OK
Mär 08 17:16:15 --HOST-- systemd[1]: 193.scope: Succeeded.
Mär 08 17:16:15 --HOST-- qmeventd[4034]: Starting cleanup for 193
Mär 08 17:16:15 --HOST-- qmeventd[4034]: Finished cleanup for 193
Mär 08 17:17:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:17:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:17:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:17:01 --HOST-- CRON[93033]: pam_unix(cron:session): session opened for user root by (uid=0)
Mär 08 17:17:01 --HOST-- CRON[93034]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mär 08 17:17:01 --HOST-- CRON[93033]: pam_unix(cron:session): session closed for user root
Mär 08 17:17:18 --HOST-- pveproxy[50542]: worker exit
Mär 08 17:17:18 --HOST-- pveproxy[5008]: worker 50542 finished
Mär 08 17:17:18 --HOST-- pveproxy[5008]: starting 1 worker(s)
Mär 08 17:17:18 --HOST-- pveproxy[5008]: worker 95744 started
Mär 08 17:17:37 --HOST-- pveproxy[5008]: worker 61311 finished
Mär 08 17:17:37 --HOST-- pveproxy[5008]: starting 1 worker(s)
Mär 08 17:17:37 --HOST-- pveproxy[5008]: worker 1287 started
Mär 08 17:17:37 --HOST-- pveproxy[1286]: got inotify poll request in wrong process - disabling inotify
Mär 08 17:17:39 --HOST-- pveproxy[1286]: worker exit
Mär 08 17:17:57 --HOST-- pvedaemon[67702]: <root@pam> successful auth for user 'root@pam'
Mär 08 17:18:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:18:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:18:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:18:45 --HOST-- pvedaemon[53275]: <root@pam> successful auth for user 'root@pam'
Mär 08 17:19:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:19:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:19:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:19:00 --HOST-- sshd[15295]: Accepted publickey for root from 10.3.38.239 port 44730 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxx
Mär 08 17:19:00 --HOST-- sshd[15295]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mär 08 17:19:00 --HOST-- systemd[1]: Created slice User Slice of UID 0.
Mär 08 17:19:00 --HOST-- systemd[1]: Starting User Runtime Directory /run/user/0...
Mär 08 17:19:00 --HOST-- systemd-logind[4036]: New session 423 of user root.
Mär 08 17:19:00 --HOST-- systemd[1]: Started User Runtime Directory /run/user/0.
Mär 08 17:19:00 --HOST-- systemd[1]: Starting User Manager for UID 0...
Mär 08 17:19:00 --HOST-- systemd[15300]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Mär 08 17:19:00 --HOST-- systemd[15300]: Listening on GnuPG network certificate management daemon.
Mär 08 17:19:00 --HOST-- systemd[15300]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Mär 08 17:19:00 --HOST-- systemd[15300]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Mär 08 17:19:00 --HOST-- systemd[15300]: Reached target Timers.
Mär 08 17:19:00 --HOST-- systemd[15300]: Listening on GnuPG cryptographic agent and passphrase cache.
Mär 08 17:19:00 --HOST-- systemd[15300]: Reached target Paths.
Mär 08 17:19:00 --HOST-- systemd[15300]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Mär 08 17:19:00 --HOST-- systemd[15300]: Reached target Sockets.
Mär 08 17:19:00 --HOST-- systemd[15300]: Reached target Basic System.
Mär 08 17:19:00 --HOST-- systemd[15300]: Reached target Default.
Mär 08 17:19:00 --HOST-- systemd[15300]: Startup finished in 62ms.
Mär 08 17:19:00 --HOST-- systemd[1]: Started User Manager for UID 0.
Mär 08 17:19:00 --HOST-- systemd[1]: Started Session 423 of user root.
Mär 08 17:19:31 --HOST-- pvedaemon[67702]: <root@pam> successful auth for user 'xy'
Mär 08 17:20:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:20:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:20:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:21:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:21:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
Mär 08 17:21:00 --HOST-- systemd[1]: Started Proxmox VE replication runner.
Mär 08 17:21:19 --HOST-- pveproxy[18083]: worker exit
Mär 08 17:21:19 --HOST-- pveproxy[5008]: worker 18083 finished
Mär 08 17:21:19 --HOST-- pveproxy[5008]: starting 1 worker(s)
Mär 08 17:21:19 --HOST-- pveproxy[5008]: worker 38084 started
Mär 08 17:21:20 --HOST-- pvedaemon[67702]: <root@pam> update VM 193: -hostpci0 37:00.0
Mär 08 17:21:23 --HOST-- pvedaemon[67702]: <root@pam> starting task UPID:--HOST--:00009741:0768BA0F:60464F03:qmstart:193:root@pam:
Mär 08 17:21:23 --HOST-- pvedaemon[38721]: start VM 193: UPID:--HOST--:00009741:0768BA0F:60464F03:qmstart:193:root@pam:
Mär 08 17:21:23 --HOST-- kernel: nvidia 0000:37:00.0: MDEV: Unregistering
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000000-0000-0000-0000-000000000193: Removing from iommu group 164
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000000-0000-0000-0000-000000000193: MDEV: detaching iommu
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000001-0000-0000-0000-000000000193: Removing from iommu group 165
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000001-0000-0000-0000-000000000193: MDEV: detaching iommu
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000002-0000-0000-0000-000000000193: Removing from iommu group 166
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000002-0000-0000-0000-000000000193: MDEV: detaching iommu
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000003-0000-0000-0000-000000000193: Removing from iommu group 167
Mär 08 17:21:23 --HOST-- kernel: vfio_mdev 00000003-0000-0000-0000-000000000193: MDEV: detaching iommu
Mär 08 17:21:25 --HOST-- systemd[1]: Started 193.scope.
Mär 08 17:21:25 --HOST-- systemd-udevd[38765]: Using default interface naming scheme 'v240'.
Mär 08 17:21:25 --HOST-- systemd-udevd[38765]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mär 08 17:21:25 --HOST-- systemd-udevd[38765]: Could not generate persistent MAC address for tap193i0: No such file or directory
Mär 08 17:21:25 --HOST-- kernel: device tap193i0 entered promiscuous mode
Mär 08 17:21:25 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:21:25 --HOST-- kernel: vmbr0: port 30(tap193i0) entered disabled state
Mär 08 17:21:25 --HOST-- kernel: vmbr0: port 30(tap193i0) entered blocking state
Mär 08 17:21:25 --HOST-- kernel: vmbr0: port 30(tap193i0) entered forwarding state
Mär 08 17:21:31 --HOST-- pvedaemon[53275]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - got timeout
Mär 08 17:21:38 --HOST-- pvestatd[4878]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
Mär 08 17:21:38 --HOST-- pvestatd[4878]: status update time (6.208 seconds)
Mär 08 17:21:48 --HOST-- pvestatd[4878]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
Mär 08 17:21:48 --HOST-- pvestatd[4878]: status update time (6.210 seconds)
Mär 08 17:21:50 --HOST-- pvedaemon[85626]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
Mär 08 17:21:58 --HOST-- pvestatd[4878]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
Mär 08 17:21:59 --HOST-- pvestatd[4878]: status update time (6.205 seconds)
Mär 08 17:22:00 --HOST-- systemd[1]: Starting Proxmox VE replication runner...
Mär 08 17:22:00 --HOST-- systemd[1]: pvesr.service: Succeeded.
=============
SPAMS THE MESSAGE
--HOST-- pvestatd[4878]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
UNTIL I MOVED THE VM TO ANOTHER NODE
=============
 
Hello again

sorry to revive this topic, but unfortunately, the exact same error occurred again.
We shut down a VM with an attached GPU, set the CPU type to "host" instead of kvm64, and rebooted it, only to face the same problem again: the VM fails to boot, and afterwards nvidia-smi shows Failed to initialize NVML: Unknown Error. (edit: setting the CPU back to kvm64 does not solve the problem)

This time, though, dmesg looks normal.
Code:
[Aug12 13:58] vmbr0: port 27(tap200i0) entered disabled state
[Aug12 13:59] vmbr0: port 26(tap217i0) entered disabled state
[ +39.745989] device tap217i0 entered promiscuous mode
[  +0.014235] vmbr0: port 26(tap217i0) entered blocking state
[  +0.000519] vmbr0: port 26(tap217i0) entered disabled state
[  +0.001119] vmbr0: port 26(tap217i0) entered blocking state
[  +0.000403] vmbr0: port 26(tap217i0) entered forwarding state
[Aug12 14:00] vfio-pci 0000:37:00.0: vfio_cap_init: hiding cap 0x0@0x68
[  +0.000610] vfio-pci 0000:37:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[  +0.000480] vfio-pci 0000:37:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
0000:37:00.0 is the GPU.
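Since the vfio_cap_init/vfio_ecap_init lines mean vfio-pci touched the device, it may be worth checking which kernel driver currently owns the GPU (nvidia vs vfio-pci). A small sketch, reading the standard PCI sysfs driver symlink:

```shell
# Show which kernel driver currently owns a PCI device (nvidia vs vfio-pci).
bound_driver() {
    link="/sys/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The driver symlink points at e.g. .../drivers/vfio-pci.
        basename "$(readlink -f "$link")"
    else
        echo "no driver bound to $1"
    fi
}

bound_driver 0000:37:00.0
```

`lspci -nnk -s 37:00.0` shows the same information in its "Kernel driver in use" line.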

And the journal with the error:
Code:
Aug 12 14:15:30 ictesxdev02 pvedaemon[66836]: VM 217 qmp command failed - VM 217 qmp command 'query-proxmox-support' failed - unable to connect to VM 217 qmp socket - timeout after 31 retries
Aug 12 14:15:30 ictesxdev02 pveproxy[91343]: worker exit
Aug 12 14:15:30 ictesxdev02 pveproxy[54626]: worker 91343 finished
Aug 12 14:15:30 ictesxdev02 pveproxy[54626]: starting 1 worker(s)
Aug 12 14:15:30 ictesxdev02 pveproxy[54626]: worker 73099 started
Aug 12 14:15:31 ictesxdev02 pvestatd[93469]: VM 217 qmp command failed - VM 217 qmp command 'query-proxmox-support' failed - unable to connect to VM 217 qmp socket - timeout after 31 retries
Aug 12 14:15:31 ictesxdev02 pvestatd[93469]: status update time (6.587 seconds)
Aug 12 14:15:41 ictesxdev02 pvestatd[93469]: VM 217 qmp command failed - VM 217 qmp command 'query-proxmox-support' failed - unable to connect to VM 217 qmp socket - timeout after 31 retries
Aug 12 14:15:42 ictesxdev02 pvestatd[93469]: status update time (6.589 seconds)
Aug 12 14:15:49 ictesxdev02 pvedaemon[66836]: VM 217 qmp command failed - VM 217 qmp command 'query-proxmox-support' failed - unable to connect to VM 217 qmp socket - timeout after 31 retries
Aug 12 14:15:51 ictesxdev02 pvestatd[93469]: VM 217 qmp command failed - VM 217 qmp command 'query-proxmox-support' failed - unable to connect to VM 217 qmp socket - timeout after 31 retries
Aug 12 14:15:51 ictesxdev02 pvestatd[93469]: status update time (6.712 seconds)
Aug 12 14:15:54 ictesxdev02 pvedaemon[71505]: start failed: command '/usr/bin/kvm -id 217 -name rndmarkotest01 -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/217.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/217.pid -daemonize -smbios 'type=1,uuid=fed1736c-5e62-46e7-a2f8-041eda7f739d' -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/217.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=47e6c499-0fc3-481a-bdfc-04a9c04d6a41' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'vfio-pci,host=0000:37:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/217.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4c155aa8f549' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/ironcluster/VM/images/217/vm-217-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap217i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=06:AF:F7:61:6D:51,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout
Aug 12 14:15:54 ictesxdev02 pvedaemon[62675]: <root@pam> end task UPID:ictesxdev02:00011751:2DEFAA0B:611510DA:qmstart:217:root@pam: start failed: command '/usr/bin/kvm -id 217 -name rndmarkotest01 -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/217.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/217.pid -daemonize -smbios 'type=1,uuid=fed1736c-5e62-46e7-a2f8-041eda7f739d' -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/217.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 4096 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=47e6c499-0fc3-481a-bdfc-04a9c04d6a41' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'vfio-pci,host=0000:37:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/217.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4c155aa8f549' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/ironcluster/VM/images/217/vm-217-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap217i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 
'virtio-net-pci,mac=06:AF:F7:61:6D:51,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout

proxmox-ve: 6.4-1 (running kernel: 5.4.114-1-pve)
pve-manager: 6.4-6 (running version: 6.4-6/be2fa32c)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-4
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Does anyone have a similar problem?

Edit 1:
I unloaded the module with rmmod nvidia and tried to reload it. Now modprobe reports:
Code:
# modprobe nvidia -vv
modprobe: INFO: ../libkmod/libkmod.c:364 kmod_set_log_fn() custom logging function 0x5595bfb77150 registered
insmod /lib/modules/5.4.114-1-pve/kernel/drivers/video/nvidia.ko
modprobe: INFO: ../libkmod/libkmod-module.c:886 kmod_module_insert_module() Failed to insert module '/lib/modules/5.4.114-1-pve/kernel/drivers/video/nvidia.ko': No such device
modprobe: ERROR: could not insert 'nvidia': No such device
modprobe: INFO: ../libkmod/libkmod.c:331 kmod_unref() context 0x5595c02c7460 released

dmesg:
Code:
[7708840.865723] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[7708840.866431] NVRM: This can occur when a driver such as:
                 NVRM: nouveau, rivafb, nvidiafb or rivatv
                 NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[7708840.868105] NVRM: Try unloading the conflicting kernel module (and/or
                 NVRM: reconfigure your kernel without the conflicting
                 NVRM: driver(s)), then try loading the NVIDIA kernel module
                 NVRM: again.
[7708840.870169] NVRM: No NVIDIA devices probed.
[7708840.870872] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234

No other conflicting modules (nouveau, rivafb, nvidiafb) are loaded.
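
For anyone hitting the same "No such device": it can also help to check which driver, if any, still claims the card at this point, instead of only looking at loaded modules. A quick check, assuming the bus address 37:00.0 from the lspci output above:

```shell
# Show the device together with the kernel driver currently bound to it
lspci -nnk -s 37:00.0

# Same information via sysfs: the "driver" symlink points at the
# owning driver, and does not exist if nothing is bound
readlink /sys/bus/pci/devices/0000:37:00.0/driver
```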

Is it possible that Proxmox somehow puts the GPU in power save mode?
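
To rule that theory in or out, the kernel's runtime power management state for the device can be read from sysfs. This is only a diagnostic sketch, not a confirmed cause:

```shell
# "suspended" here would mean the kernel has put the idle GPU into a
# low-power state; "active" means it is awake
cat /sys/bus/pci/devices/0000:37:00.0/power/runtime_status

# "auto" allows runtime suspend, "on" keeps the device awake;
# forcing it to "on" takes runtime PM out of the equation
echo on > /sys/bus/pci/devices/0000:37:00.0/power/control
cat /sys/bus/pci/devices/0000:37:00.0/power/control
```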

Edit 2: So, after unloading vfio-pci via rmmod vfio-pci, modprobe nvidia was possible again. Now, the host at least sees the GPU:

Code:
# nvidia-smi
Thu Aug 12 15:13:16 2021     
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.89       Driver Version: 450.89       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:37:00.0 Off |                    0 |
| N/A   71C    P0    25W /  70W |      0MiB / 15109MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                            
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If we reload vfio-pci (modprobe vfio-pci; nvidia-smi still works) and then try to start the VM with the GPU attached, everything breaks again.
So, it looks like something's fishy with the vfio-pci module.

Edit 3: Removing the GPU from the host (echo 1 > /sys/bus/pci/devices/0000:37:00.0/remove) and then rescanning the bus (echo 1 > /sys/bus/pci/rescan) does not help either.

Edit 4: When I manually bind the GPU to vfio-pci (echo "vfio-pci" > /sys/bus/pci/devices/0000\:37\:00.0/driver_override) and use the command line to boot the virtual machine, the kernel logs the error
kernel: vfio-pci 0000:37:00.0: no suspend buffer for LTR; ASPM issues possible after resume
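
For reference, writing driver_override alone only takes effect on the next probe; the full manual rebind looks roughly like this (a sketch, assuming the device is currently bound to some other driver such as nvidia):

```shell
ADDR=0000:37:00.0

# Tell the PCI core which driver should claim the device next
echo vfio-pci > /sys/bus/pci/devices/$ADDR/driver_override

# Release the device from whichever driver currently owns it
echo $ADDR > /sys/bus/pci/devices/$ADDR/driver/unbind

# Trigger a re-probe; with the override set, vfio-pci now binds
echo $ADDR > /sys/bus/pci/drivers_probe
```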

I think that this problem is similar to this one: https://forum.proxmox.com/threads/gpu-passthrough-only-works-after-suspend-resume-of-the-host.58486/


Best,
Andy
 
Hi all,

this problem still persists to this day :mad:
Today, the problem occurred again after we restored a backup of a virtual machine with a GPU attached (the VM was shut down first). We get the exact same error messages as before.
We're going crazy with this damn error!!!!

Best,
Andy
 
Sorry if I spam my own thread, but I'll report any findings here in the hope that one day someone with a similar issue finds help.

This being said, I think we have a new clue: most of the time, the issue occurs when some sort of backup mechanism was at work. Either when restoring a snapshot or backup, or when backup jobs are running somewhere else in the cluster. We'll investigate this further, as it is the hottest trail as of now.

Best,
Andy
 
Hi guys. Sorry for the late update @damarges @Robert.H

The error did not appear in the last year or two.
Steps we try to always do with the GPU VMs:
- Shut down the VM properly before removing the GPU
- No ballooning, always fixed RAM
- Assign the GPU only to a powered-off VM
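
On the CLI the same routine looks roughly like this (a sketch; VM ID 217 is just the example from the log above, and hostpci0 is assumed to be the GPU entry):

```shell
# Shut the guest down cleanly before touching the GPU assignment
qm shutdown 217

# Remove the GPU from the now powered-off VM
qm set 217 --delete hostpci0

# Fixed RAM, no ballooning, for VMs with passed-through devices
qm set 217 --balloon 0 --memory 4096

# Re-assign the GPU only while the VM is still powered off
qm set 217 --hostpci0 0000:37:00.0
```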

Maybe this helps someone.
 