[SOLVED] PCIe card falls off the bus after VM reboot, how can I automatically reset it?

AlfredFranklin

Feb 17, 2023
I have an NVIDIA P4 card that I'm passing through to a VM. It works fine after a host reboot: the VM autostarts and the PCIe card is usable.

If I reboot the VM, though, the VM's dmesg is constantly spammed with "the nvidia gpu at (passthrough pcie location) has fallen off the bus" and the GPU is unavailable in the VM.

In the Proxmox host's dmesg, the only related message I see is: "vfio-pci 0000:21:00.0: can't change power state from D0 to D3hot (config space inaccessible)". It appears only the first time the VM is restarted, when the problem first occurs, and not on subsequent VM reboots while the problem is active.

I can solve this manually by shutting the VM down and then running the following commands in the Proxmox host shell:

Code:
 echo "1" > /sys/bus/pci/devices/{physical card address}/remove
 sleep 1
 echo "1" > /sys/bus/pci/rescan

When I start the VM again, it sees the GPU fine until the next reboot, when the same problem occurs.

Is there a way to add the reset commands above to the VM's reboot process?

Sorry if this is obvious; I am new to Proxmox and can't find a solution for this.
 
Yes, you can put such commands in a hookscript and run it after the VM shuts down (or before it starts). The example the manual refers to is written in Perl, but you can also use a Bash script.
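For anyone reading along, a Bash hookscript could be laid out roughly as follows. Proxmox calls the script with the VM ID as the first argument and the phase name as the second (pre-start, post-start, pre-stop or post-stop). The function name phase_action and the messages are placeholders made up for this sketch, not anything Proxmox defines.

```shell
#!/bin/bash
# Sketch of a hookscript skeleton. Proxmox passes the VM ID as $1 and
# the phase name as $2. phase_action is a hypothetical helper name.

phase_action() {
    local vmid="$1" phase="$2"
    case "$phase" in
        pre-start)  echo "VM $vmid: reset the device before start" ;;
        post-stop)  echo "VM $vmid: reset the device after stop" ;;
        post-start|pre-stop) echo "VM $vmid: nothing to do in $phase" ;;
        *)          echo "VM $vmid: unknown phase '$phase'" >&2; return 1 ;;
    esac
}

# Only act when called with both arguments, as Proxmox does.
if [[ $# -ge 2 ]]; then
    phase_action "$1" "$2"
fi
```

In a real script, the echo lines in the pre-start or post-stop branch would be replaced by the actual reset commands; only one of the two branches is normally needed.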
 
Awesome, that works perfectly, thank you!

Here is the code in case someone else ever needs it (most likely me in a year). I put it in /var/lib/vz/snippets and added it to the VM with qm set {vmid} --hookscript local:snippets/resetpcie.sh

Code:
#!/bin/bash
# Proxmox hookscript: re-enumerate the passed-through PCIe card before the VM starts.

VM_ID=$1              # VM ID, passed by Proxmox as the first argument
EXECUTION_PHASE=$2    # phase name, passed by Proxmox as the second argument
LOGGING=/var/log/pciereset.log

date >> "$LOGGING"

if [[ "$EXECUTION_PHASE" == "pre-start" ]]; then
        echo "Phase is $EXECUTION_PHASE , Resetting PCIe device 21" >> "$LOGGING"
        # Remove the card from the bus, then rescan so the kernel finds it again
        echo "1" > /sys/bus/pci/devices/0000:21:00.0/remove
        sleep 1
        echo "1" > /sys/bus/pci/rescan
        echo "PCIe device 21 has been reset" >> "$LOGGING"
else
        echo "Phase is $EXECUTION_PHASE , skipping reset of PCIe device 21..." >> "$LOGGING"
fi

echo "##################" >> "$LOGGING"

It produces a log file at /var/log/pciereset.log that looks like this:

Code:
Fri Feb 17 11:40:03 EST 2023
Phase is pre-start , Resetting PCIe device 21
PCIe device 21 has been reset
##################
Fri Feb 17 11:40:09 EST 2023
Phase is post-start , skipping reset of PCIe device 21...
##################

Thanks again!
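For other cards, the same remove/rescan reset can be written with the PCI address as a parameter. This is only a sketch under a few assumptions: reset_pci_device and SYSFS_ROOT are names invented for this example, and SYSFS_ROOT exists purely so the function can be pointed at a scratch directory for a dry run instead of the real /sys. The address 0000:21:00.0 is the one from this thread and has to be changed for other hardware.

```shell
#!/bin/bash
# Sketch: the same remove/rescan reset with the address as a parameter.
# SYSFS_ROOT and reset_pci_device are hypothetical names for this example;
# pointing SYSFS_ROOT at a scratch directory allows a dry run.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

reset_pci_device() {
    local addr="$1"
    local dev="$SYSFS_ROOT/bus/pci/devices/$addr"

    # Skip the removal if the device is not currently on the bus.
    if [[ -e "$dev/remove" ]]; then
        echo 1 > "$dev/remove" || return 1
        sleep 1
    else
        echo "device $addr not found, rescanning only" >&2
    fi

    # Ask the kernel to re-enumerate the bus so the card comes back.
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}

# Proxmox passes the phase as the second argument.
if [[ "${2:-}" == "pre-start" ]]; then
    reset_pci_device "0000:21:00.0"
fi
```

Checking for the remove file first means the script still rescans the bus if the card was already removed by an earlier, interrupted run.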
 

@AlfredFranklin So far I have just read posts here but I just registered in order to say THANK YOU for sharing the script. It has solved my Debian 11 VM Frigate Google Coral PCI issue which kept me busy for several days without any real solution. After a VM restart Frigate successfully finds the Google Coral PCI device without the need of a full host reboot.
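To adapt the script to a different card such as the Coral, the PCI address in it has to be changed. One way to find the address, assuming the card is currently bound to vfio-pci on the host (as the dmesg line in the first post suggests for the P4), is to look at the driver's directory in sysfs. The names list_vfio_devices and SYSFS_ROOT are made up for this sketch.

```shell
#!/bin/bash
# Sketch: list PCI addresses currently bound to vfio-pci.
# SYSFS_ROOT and list_vfio_devices are hypothetical names for this example.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

list_vfio_devices() {
    local drv="$SYSFS_ROOT/bus/pci/drivers/vfio-pci"
    local entry
    # Bound devices appear as entries named domain:bus:slot.function.
    for entry in "$drv"/????:??:??.?; do
        [[ -e "$entry" ]] || continue   # glob matched nothing
        basename "$entry"
    done
}

# Prints nothing if no device is bound or the driver is not loaded.
list_vfio_devices
```

If the card is not bound to vfio-pci at that moment, the list will be empty and the address has to come from the VM's passthrough configuration instead.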
 
