Proxmox crashes: web UI and VM unavailable after a few hours

Michel47388

New Member
Aug 12, 2024
Hi,
I'm running a Windows VM in Proxmox with physical GPU passthrough and IOMMU enabled.
Proxmox works fine for a few hours, with everything working perfectly (I can use my VM remotely with no issue).
But after around 2 hours, my VM becomes unavailable and the Proxmox web UI is unreachable.
It seems to be a hard crash, because the only way to regain access to the node is to physically reboot it (the power button is unresponsive, so I have to force a shutdown by long-pressing it).

Do you have any clue why this is happening? It makes my server unusable.

Best regards,
Michel Melhem

System information:
pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.12-1-pve)
Ryzen 9 5950X

 
Welcome to the Proxmox forum, Michel!

It would be helpful to know how you set up your GPU passthrough (e.g. cat /proc/cmdline and any /etc/modprobe.d/* files you created) and what the system log (journalctl) on your PVE host recorded before the host became unavailable.
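For reference, a minimal shell sketch of how that information could be collected on the PVE host. The helper names list_modprobe_settings and previous_boot_kernel_log are made up for illustration, and the previous-boot log is only available if the journal is stored persistently.

```shell
#!/bin/sh
# Sketch: gather the passthrough-related settings and the last kernel log.

# Print the active options/blacklist/softdep lines from a modprobe.d
# directory (defaults to /etc/modprobe.d), prefixed with the file name.
# Commented-out lines are skipped.
list_modprobe_settings() {
    dir="${1:-/etc/modprobe.d}"
    grep -HE '^[[:space:]]*(options|blacklist|softdep)[[:space:]]' "$dir"/*.conf 2>/dev/null
}

# Print the kernel messages of the previous boot, i.e. the boot that hung.
# Needs a persistent journal; otherwise journalctl reports no such boot.
previous_boot_kernel_log() {
    journalctl -k -b -1 --no-pager
}
```

After the forced reboot, running cat /proc/cmdline, list_modprobe_settings and previous_boot_kernel_log | tail -n 100 should give the three pieces of output worth posting.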
 
Hello, thank you for your answer!

These are the logs from just before the system became unresponsive:

Code:
Aug 11 23:43:15 pve kernel: kvm_amd: kvm [1709]: vcpu0, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:16 pve kernel: kvm_amd: kvm [1709]: vcpu1, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:16 pve kernel: kvm_amd: kvm [1709]: vcpu2, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:16 pve kernel: kvm_amd: kvm [1709]: vcpu3, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:16 pve kernel: kvm_amd: kvm [1709]: vcpu4, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:16 pve kernel: kvm_amd: kvm [1709]: vcpu5, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:17 pve kernel: kvm_amd: kvm [1709]: vcpu6, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:17 pve kernel: kvm_amd: kvm [1709]: vcpu7, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:17 pve kernel: kvm_amd: kvm [1709]: vcpu8, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:17 pve kernel: kvm_amd: kvm [1709]: vcpu9, guest rIP: 0xfffff857fd977d29 Unhandled WRMSR(0xc0010115) = 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored rdmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored wrmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored rdmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored wrmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored rdmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored wrmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored rdmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored wrmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored rdmsr: 0xc001100d data 0x0
Aug 11 23:43:17 pve kernel: kvm: kvm [1570]: ignored wrmsr: 0xc001100d data 0x0


These are the modprobe.d files:
Code:
options vfio-pci ids=10de:10f0,1002:aaf0 disable_vga=1

options vfio_iommu_type1 allow_unsafe_interrupts=1

options kvm ignore_msrs=1


I don't think the issue is related to the way I passed through my GPU, because it works fine at first.
I mostly followed this guide and modified the steps based on my setup: https://gist.github.com/KasperSkytte/6a2d4e8c91b7117314bceec84c30016b
 
Thank you for the extra information! Unfortunately the syslog doesn't tell us much about the problem directly, but it seems that there is a kernel stack trace that couldn't be written to the syslog in time. Your configuration seems fine, but just to check: did you also add your GPU's driver to the blacklist?

What are the circumstances when the host and VM start to hang (e.g. a sudden jump in memory usage, running out of memory, etc.)? Does the PVE host also crash when you reboot or shut it down and boot again? Do you know if your card might suffer from the AMD vendor reset bug?

EDIT: I'm curious about your vfio-pci options, as both IDs seem to be audio interfaces (one from Nvidia and one from AMD). Could you also provide more information about your motherboard and the graphics card you want to pass through?
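A quick way to check what the IDs in the vfio-pci options actually refer to is to look each one up with lspci. The sketch below assumes the options live in /etc/modprobe.d/vfio.conf (a hypothetical file name; adjust it to your setup), and extract_vfio_ids and describe_vfio_ids are illustrative helpers, not Proxmox tools.

```shell
#!/bin/sh
# Sketch: list the vendor:device IDs bound to vfio-pci and show what they are.

# Print one vendor:device ID per line from the "options vfio-pci ... ids=..."
# line of the given modprobe.d file.
extract_vfio_ids() {
    sed -n 's/^options vfio-pci .*ids=\([^ ]*\).*/\1/p' "$1" | tr ',' '\n'
}

# Look up every extracted ID with lspci; -nn prints the device class and the
# numeric IDs, -d restricts the output to one vendor:device pair.
describe_vfio_ids() {
    extract_vfio_ids "$1" | while read -r id; do
        lspci -nn -d "$id"
    done
}

# Example (file name is an assumption):
#   describe_vfio_ids /etc/modprobe.d/vfio.conf
```

In the lspci output, a class such as "VGA compatible controller [0300]" is the graphics function, while "Audio device [0403]" is the HDMI audio function of a card. Normally both functions of the passed-through card should show up, and nothing that belongs to the GPU the host itself uses.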
 
As my CPU doesn't have integrated graphics, I have a cheap Nvidia GPU for Proxmox to use, and for my VM I have a more powerful RX 590. I blacklisted the AMD drivers to make sure that the RX 590 is not picked up by the host. Maybe I messed up the IDs in the vfio-pci options. I'm going to double-check.
 
I updated the vfio-pci options, but after three hours the system crashed again and became unresponsive. Any clue whether it could come from Proxmox itself?
 
Instability is usually not caused by forgetting to add the proxmoxStable=1month kernel parameter ;). It's always a hardware issue (unless it's a Linux kernel incompatibility with the hardware) or a BIOS/overclock setting. Maybe stress-test your system with the Ubuntu installer (without installing it) to rule out Proxmox? Maybe start replacing hardware parts until the instability goes away? Maybe try (completely) different hardware and see if Proxmox runs stable on that (like it does for most people), and search the forum for known issues with your current hardware? My 5950X runs fine (up to 5084 MHz, with 4 DIMMs of ECC memory and multiple GPU and USB passthroughs on an X570S AERO G).
 
Hello, thank you for your feedback :)

I followed your advice and did the following steps:
- I shut down the Windows VM on my Proxmox server
- Once the VM was shut down, I disabled start on boot for it and rebooted the system
- The only VM still running was a TrueNAS one, and I then ran a CPU stress test on Proxmox for the whole night (it ran for 11 hours)

This morning when I checked, the system was still running with no issues at all, and the remaining TrueNAS VM worked fine.

I also stress-tested the RX 590 used by the VM in another computer, and the card held up with no issues.

So it now seems that the issue I'm facing is not hardware related but rather on the software side.
By the way, my motherboard is an MSI MPG B550 with the latest BIOS. The top PCIe x16 slot is used for passing the GPU to the VM, as it is the only one that has a separate IOMMU group.

Is there a known issue with this chipset?

If that is not the case, do you know of any logging tools that could help me catch the crash, so that I have an error message to post?

Thank you :D
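On the logging question: since a hard hang may prevent the last kernel messages from reaching the disk, one option is to follow the kernel log from a second machine over SSH, so that whatever is printed right before the freeze is kept elsewhere. This is only a sketch; the host name pve.example and the helper stream_kernel_log are placeholders, and it assumes root SSH access to the node.

```shell
#!/bin/sh
# Sketch: keep a live copy of the node's kernel log on another machine.

# Follow the kernel messages of the given host over SSH and append them to a
# local file, so the last lines before a hang survive on this machine.
stream_kernel_log() {
    host="$1"
    logfile="$2"
    ssh "root@$host" 'journalctl -k -f --no-pager' | tee -a "$logfile"
}

# Example (run on a second machine and leave it open until the node hangs):
#   stream_kernel_log pve.example pve-kernel.log
```

When the node freezes, the SSH session stalls or drops, and the tail of the local file holds the last kernel messages that made it over the network.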

 
If you disable the passthrough, or don't start the VM at all, and let the system run for a while (preferably still with some kind of load on it), does it still crash?
Do you have any clue how I can troubleshoot further?
 
Are you able to run the VM for a couple of hours with the card inserted, but with the passthrough itself disabled in the settings of the VM (and only there)?

I'm not all that proficient with GPU passthrough (I have only done it for a network card before), but the two things I'm thinking about are either Proxmox trying to access the GPU in some way and failing (although the driver blacklist should prevent that), or the VM causing some weird interaction with it. If the VM runs without the graphics card (and with some load / normal processes) and the host still crashes, we can exclude the passthrough itself and should look into what the VM itself is doing. If it does not crash, the passthrough is a likely culprit.
 
