gvt-g mdev missing after updating to 6.2-6

everwisher · Jul 3, 2020

I just ran apt upgrade and GVT-g stopped working afterwards. Here's the log:

Use of uninitialized value $vmid in sprintf at /usr/share/perl5/PVE/SysFSTools.pm line 358.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/00000001-0000-0000-0000-000000000000,id=hostpci1,bus=pci.0,addr=0x11: vfio /sys/bus/pci/devices/0000:00:02.0/00000001-0000-0000-0000-000000000000: no such host device: No such file or directory
TASK ERROR: start failed: QEMU exited with code 1

I checked the associated files and found that the folder "/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_8/devices" is still present but empty, which was not the case before the latest upgrade.

I manually created an mdev from the command line:

Code:

echo '00000001-0000-0000-0000-000000000000' > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_8/create

and the VM using GVT-g works as it did before.

How can I fix this glitch?

dcsapak · Jul 3, 2020

Is there anything in dmesg? Can you try with an older kernel?

BobMccapherey · Jul 12, 2020

I have the same error. The mdev device does seem to get created, but the VM still fails to start.

The kernel log shows the device being created with the right value:

Code:

[ 1277.732498] vfio_mdev 00000000-0000-0000-0000-000000000401: Adding to iommu group 18
[ 1277.732499] vfio_mdev 00000000-0000-0000-0000-000000000401: MDEV: group_id = 18
[ 1278.087861] device tap401i0 entered promiscuous mode
[ 1278.093354] custnet: port 2(tap401i0) entered blocking state
[ 1278.093355] custnet: port 2(tap401i0) entered disabled state
[ 1278.093414] custnet: port 2(tap401i0) entered blocking state
[ 1278.093415] custnet: port 2(tap401i0) entered forwarding state
[ 1278.238873] custnet: port 2(tap401i0) entered disabled state
[ 1278.688167] vfio_mdev 00000000-0000-0000-0000-000000000401: Removing from iommu group 18
[ 1278.688170] vfio_mdev 00000000-0000-0000-0000-000000000401: MDEV: detaching iommu

However, QEMU is given the wrong path when attempting to start the VM:

Code:

Use of uninitialized value $vmid in sprintf at /usr/share/perl5/PVE/SysFSTools.pm line 358.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000,id=hostpci0,bus=pci.0,addr=0x10,rombar=0: vfio /sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000: no such host device: No such file or directory
TASK ERROR: start failed: QEMU exited with code 1

The sysfsdev path passed to QEMU ends in an all-zero UUID instead of the ...000000000401 device that was created above.

Now here's the weird thing - I have 3 other hosts in the cluster that work fine. This host that is having issues is a newly-provisioned one that was recently configured, updated, and added to the existing cluster. When it fails to start on this one node, HA migrates it to another node where it starts with no issues. All hosts are on the newest version of packages from the same repo.

jollyrogr · Jul 12, 2020

My error message was:

root@pve02:~# qm start 100
Use of uninitialized value $vmid in sprintf at /usr/share/perl5/PVE/SysFSTools.pm line 358.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000,id=hostpci0,bus=pci.0,addr=0x10,rombar=0: vfio /sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000: no such host device: No such file or directory
start failed: QEMU exited with code 1

I tried:

Code:

echo '00000000-0000-0000-0000-000000000000' > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create

The VM still won't start; the error is now:

Use of uninitialized value $vmid in sprintf at /usr/share/perl5/PVE/SysFSTools.pm line 358.
pci device '0000:00:02.0' has no available instances of 'i915-GVTg_V5_4'

BobMccapherey · Jul 12, 2020

jollyrogr said:
my error message was:

I tried

Code:

echo '00000000-0000-0000-0000-000000000000' > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create

The VM still won't start; the error is now:

Well, the problem there is that you can't just keep making virtual GPUs: once too many exist, it won't let you create more. Likewise, it won't let you create a second vGPU with a GUID that is already in use. You'd have to delete the vGPU and re-add it (or just reuse the existing one).

The problem is that Proxmox generates the GUID from the VM ID, which it puts at the end of the GUID. For some reason that variable is not getting defined, so the GUID comes out as all zeroes. For example, if your VM ID is 100, the GUID should end in 100. In my case the VM ID is 401, but there's no 401 in the GUID that QEMU is given. Either way, something broke so that the VM ID is no longer passed through when the vGPU's GUID is generated, and the VM fails to start.
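
A minimal shell sketch of the check-and-delete step, assuming the sysfs layout described in the kernel's mediated-device (mdev) documentation: each type directory has an available_instances file, and each created instance has a remove file under the parent device. The helper names mdev_available and mdev_remove are hypothetical, not Proxmox commands.

```shell
# Sketch only: helper names are made up for illustration.

# Print how many more instances of a type can still be created.
# $1 = parent PCI device directory, $2 = mdev type name
mdev_available() {
    cat "$1/mdev_supported_types/$2/available_instances"
}

# Destroy an existing mdev instance by writing 1 to its "remove" file.
# $1 = parent PCI device directory, $2 = UUID of the instance
mdev_remove() {
    echo 1 > "$1/$2/remove"
}

# Usage on the host (as root), with the paths from the posts above:
# mdev_available /sys/bus/pci/devices/0000:00:02.0 i915-GVTg_V5_4
# mdev_remove /sys/bus/pci/devices/0000:00:02.0 00000000-0000-0000-0000-000000000000
```

If mdev_available prints 0, that would match the "has no available instances" error; removing the manually created all-zero instance should free a slot again.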

jollyrogr · Jul 13, 2020

For giggles I tried:

echo '00000000-0000-0000-0000-000000000100' > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create

and the error that came back was:

Use of uninitialized value $vmid in sprintf at /usr/share/perl5/PVE/SysFSTools.pm line 358.
mdev instance '00000000-0000-0000-0000-000000000100' already existed, using it.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000,id=hostpci0,bus=pci.0,addr=0x10,rombar=0: vfio /sys/bus/pci/devices/0000:00:02.0/00000000-0000-0000-0000-000000000000: no such host device: No such file or directory
start failed: QEMU exited with code 1

So Bob I think you're on to something but I dunno what the problem is. It's over my head.

dmesg:

[19427.926188] vfio_mdev 00000000-0000-0000-0000-000000000100: Removing from iommu group 14
[19427.926194] vfio_mdev 00000000-0000-0000-0000-000000000100: MDEV: detaching iommu

BobMccapherey · Jul 13, 2020

jollyrogr said:
For giggles I tried:

and the error that came back was:

So Bob I think you're on to something but I dunno what the problem is. It's over my head.

dmesg:

So at first glance it looks like it wants to use the correct mdev GUID, but when it actually runs QEMU it uses the wrong one. I looked at the conf files in /etc/pve/qemu-server, but the GUID isn't defined there; it seems to be generated by Proxmox. So there's definitely a bug: whatever calls the code at /usr/share/perl5/PVE/SysFSTools.pm line 358 isn't passing the vmid. The problem is that there's no clear indication of what calls this function (generate_mdev_uuid), because the command log output doesn't include a complete stack trace.

Perl:

# encode the hostpci index and vmid into the uuid
sub generate_mdev_uuid {
    my ($vmid, $index) = @_;

    my $string = sprintf("%08d-0000-0000-0000-%012d", $index, $vmid);

    return $string;
}
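
To see why a missing vmid produces the all-zero UUID from the error messages, here is a rough shell equivalent of the function above. It is only a sketch: the empty-argument fallback to 0 is my stand-in for how Perl treats an undefined value in a %d format (it warns and uses 0), and the shell function is not part of Proxmox.

```shell
# Shell imitation of generate_mdev_uuid, for illustration only.
# $1 = vmid, $2 = hostpci index (same argument order as the Perl sub).
generate_mdev_uuid() {
    vmid=${1:-0}     # empty/unset vmid becomes 0, like undef in Perl's sprintf
    index=${2:-0}
    printf '%08d-0000-0000-0000-%012d\n' "$index" "$vmid"
}

generate_mdev_uuid 401 0   # prints 00000000-0000-0000-0000-000000000401
generate_mdev_uuid "" 0    # prints 00000000-0000-0000-0000-000000000000
```

The second call is what the failing hosts appear to be doing: the UUID is built without a vmid, so QEMU is pointed at a device that was never created.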

BobMccapherey · Jul 13, 2020

Is there an easy way to downgrade to 6.2-4? Seems that newer versions have completely hosed starting mdev gvt-g machines.

BobMccapherey · Jul 13, 2020

I think I figured out the issue:

/usr/share/perl5/PVE/QemuServer/PCI.pm line 425 references a variable $vmid, but it's not declared or defined anywhere else in the file.

BobMccapherey · Jul 13, 2020

Had to do this in order to fix things:

Code:

apt install libpve-common-perl=6.1-3 qemu-server=6.2-3 pve-manager=6.2-6 pve-container=3.1-8 pve-qemu-kvm=5.0.0-4

followed by a reboot of every node in the cluster.

I based this on my recent apt logs to downgrade to a known working version.
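
For anyone who wants to find their own last known-good versions, a sketch of how the apt log can be read. It assumes the usual Debian history log at /var/log/apt/history.log, where each upgrade is recorded as a line of the form "Upgrade: pkg:arch (old, new), ...". The helper name apt_upgrades is hypothetical, and rotated logs (history.log.1.gz and so on) would need to be decompressed first.

```shell
# Hypothetical helper: print "package old -> new" for every recorded upgrade.
# $1 = path to an apt history log (defaults to the standard Debian location)
apt_upgrades() {
    awk '/^Upgrade: / {
        sub(/^Upgrade: /, "")
        n = split($0, items, /\), /)        # one item per upgraded package
        for (i = 1; i <= n; i++) {
            entry = items[i]
            gsub(/[()]/, "", entry)         # now "pkg:arch old, new"
            split(entry, f, /[ ,]+/)        # f[1]=pkg:arch f[2]=old f[3]=new
            sub(/:.*/, "", f[1])            # drop the architecture suffix
            print f[1], f[2], "->", f[3]
        }
    }' "${1:-/var/log/apt/history.log}"
}

# Usage: apt_upgrades | grep -e qemu-server -e libpve-common-perl
```

The "old" column gives the versions to pin in an apt install pkg=version command like the one above.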

jollyrogr · Jul 15, 2020

I just updated the node today, and the issue appears to be fixed by the new updates.
