Hi community,

We have run into the following problem multiple times now: the PCIe GPU gets stuck and becomes unavailable, both on the host and, of course, in any virtual machine.
Setup: a Proxmox host with an NVIDIA Tesla T4 and the NVIDIA GRID driver version 450.89 installed:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
This setup worked fine, and we were able to assign the mediated device (the GPU) to different VMs. After we removed those mediated devices, they became unavailable; in other words, they were not freed on removal so that they could be used in another VM (see the sysfs check further down). The GPU now shows "Mediated Devices: No" in the Proxmox GUI. We then tried to add the GPU directly as a PCIe device (PCIe passthrough mode) and received this error in the Proxmox GUI:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 193 -name nvidia-docker-test-andy -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/193.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/193.pid -daemonize -smbios 'type=1,uuid=37501dc1-e4e5-4520-b24c-838157d49d4d' -smp '20,sockets=2,cores=10,maxcpus=20' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/193.vnc,password -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 262144 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=214b3fbd-072f-4fbb-9bba-6c93504c01f8' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'vfio-pci,host=0000:37:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/193.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:4c155aa8f549' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/ironcluster/VM/images/193/vm-193-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap193i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=AE:C3:27:A7:XX:XX,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout
and in the error log:
pvedaemon[88369]: VM 193 qmp command failed - VM 193 qmp command 'query-proxmox-support' failed - unable to connect to VM 193 qmp socket - timeout after 31 retries
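To make the "not freed" part more concrete: the mediated-device state can be inspected through the kernel's mdev sysfs interface. A rough sketch of how this can be checked (assuming the standard sysfs layout; 0000:37:00.0 is the T4 from the lspci output above, and the UUID in the last command is only a placeholder):
Code:
# How many instances of each vGPU type the T4 still reports as free
for t in /sys/class/mdev_bus/0000:37:00.0/mdev_supported_types/*; do
    echo "$(basename "$t"): $(cat "$t/available_instances") available"
done

# Mediated devices that still exist on the host (should be empty after removal)
ls -l /sys/bus/mdev/devices/

# A leftover mediated device can be removed by hand via its UUID (placeholder shown)
# echo 1 > /sys/bus/mdev/devices/<uuid>/remove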
Now, running nvidia-smi, we get this error (it worked fine before):
Code:
# nvidia-smi
Failed to initialize NVML: Unknown Error
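In this state it would probably also help to capture the host-side driver logs and the state of the vGPU daemons. A rough sketch of what we could collect next time it happens (the service names are an assumption based on the GRID host driver package; adjust if they differ on your installation):
Code:
# NVIDIA vGPU host daemons (names assumed from the GRID host driver)
systemctl status nvidia-vgpud.service nvidia-vgpu-mgr.service

# Kernel messages from the NVIDIA driver (NVRM / Xid errors) since boot
dmesg -T | grep -iE 'nvrm|xid'
journalctl -k -b | grep -i nvidia | tail -n 50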
lspci still returns the correct GPU:
Code:
# lspci | grep NVIDIA
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
and the modules are loaded:
Code:
# lsmod | grep ^nvidia
nvidia_vgpu_vfio 53248 0
nvidia 19750912 10 nvidia_vgpu_vfio
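Since we switched between mediated devices and plain passthrough, it is probably also worth confirming which driver currently owns the card (nvidia vs. vfio-pci). A quick check, as a sketch:
Code:
# Show the kernel driver currently bound to the T4
lspci -ks 37:00.0
readlink /sys/bus/pci/devices/0000:37:00.0/driver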
It will probably run fine again after a reboot of the host (it always did previously), but 43 virtual machines are currently running on this host, so that is not an ideal option.

Does anyone know how to solve this issue without restarting the host every couple of weeks?
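In case it helps the discussion: would a driver reload plus PCI remove/rescan be a sane alternative to a full reboot? A rough, untested sketch (only once no VM or process is still holding 0000:37:00.0):
Code:
# Untested idea: re-initialize the card without rebooting the whole host.
# Unload the NVIDIA modules (nvidia_vgpu_vfio first, since it depends on nvidia)
rmmod nvidia_vgpu_vfio nvidia

# Drop the device from the PCI tree and rescan the bus
echo 1 > /sys/bus/pci/devices/0000:37:00.0/remove
echo 1 > /sys/bus/pci/rescan

# Reload the vGPU module (pulls in nvidia again as a dependency)
modprobe nvidia_vgpu_vfio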
Thanks in advance for any input!