VMs having issues booting

gobble45

New Member
Sep 7, 2022
I'll start off with my hardware and software, then the backstory and a timeline of the events that led to this point.

Hardware/software:
HP DL580 G7 - 4x Xeon X7550 - 96GB RAM
NVIDIA Quadro P600
Proxmox 7.2-4
kernel version - 5.15.35-2


Backstory:
So yesterday my new P600 GPU arrived. I shut down my Proxmox VE server via the web GUI, opened up the chassis, and physically installed the P600. I put the chassis back together, reconnected all the cables, and booted it up. I had no display on my monitor, but that's because the HP BIOS was set to use the P600 for output rather than the onboard graphics - no issue, as I just waited for the web GUI to become available again.
I sign back in, all my VMs fire up, and everything looks good.

I immediately try to assign the GPU to my VM, but get an error: 'TASK ERROR: IOMMU not present'. I dive down the PCI passthrough rabbit hole, following the guides outlined here: https://pve.proxmox.com/wiki/Pci_passthrough#Enable_the_IOMMU and here: https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/
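For anyone reading along, the passthrough prep from those guides boils down to roughly the following on an Intel box. This is just a sketch of the usual steps, not my exact config, and the flags can vary by hardware:
Code:
# /etc/default/grub - enable the IOMMU on the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

# /etc/modules - load the VFIO modules at boot
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

# apply the changes and reboot
update-grub
update-initramfs -u -k all

# after the reboot, verify the IOMMU actually came up
dmesg | grep -e DMAR -e IOMMU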
After a reboot I started to experience some weird behavior: the PVE web GUI would load for about 60 seconds and then crash beyond usability. I was also unable to SSH into the server after this 'crash'.
After removing the GPU from the system, in the hope that it was the culprit, I could now see my original display. After Proxmox finished booting, I could see a crapload of errors on screen. I stupidly didn't snap a photo of these errors, but one of the errors I noted down was:
PTE Read Access is not set
Very similar to what is outlined here: https://forum.proxmox.com/threads/p...with-dmar-error-read-access-is-not-set.67074/

At this point I was pretty lost and unsure what to do. I rebooted my server once more, and when it came time to boot into Proxmox, I instead went into the advanced boot options and chose an earlier kernel version. I found kernel 5.4.189-1 was available, and chose it.

Proxmox VE now boots - hooray. The web GUI no longer crashes, and I can manage my VMs once more.

I spent some more time working on the IOMMU issues, trying to get my GPU passed through to the VM, but I've had no luck getting this to work. I believe my issue relates to HP and RMRR - a 'fix' is outlined here: https://forum.proxmox.com/threads/c...ntel-iommu-driver-to-remove-rmrr-check.36374/ - however, given the issues I'm already working through, I don't want to pile even more on top.

I made a change to my GRUB config to make Proxmox automatically boot the older kernel - https://forum.proxmox.com/threads/revert-to-prior-kernel.100310/ - and my system now automatically boots into the 5.4.189-1 kernel, which seems stable and working for the most part.
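For reference, the change from that thread is along these lines. The exact menu entry text differs per system, so treat this as a sketch and check /boot/grub/grub.cfg for the real entry names:
Code:
# /etc/default/grub - point GRUB_DEFAULT at the older kernel's submenu entry
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.4.189-1-pve"

# regenerate the GRUB config so the new default takes effect
update-grub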

The only issue I have now is that a couple of my Ubuntu VMs won't boot. I have two VMs stuck with the following output:
[attached screenshot: Screenshot 2022-09-08 134805.png]
I've tried stopping and starting them. I've tried rebooting the host. Nothing seems to work. I've spent a bit of time googling solutions, but none of them have worked, and I don't have any idea where to go next with this.
I also don't want to start chopping and changing too much on my Proxmox install, in case I make things worse.

What options do I have to repair this?

Here is what my pveversion -v outputs:
Code:
root@gobblemox:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.4.189-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.4: 6.4-17
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.4.189-1-pve: 5.4.189-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libqb0: 1.0.5-1~bpo9+2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-9
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
root@gobblemox:~#

Any help is appreciated :)
 
Okay, so an update to this: I got impatient and decided to try to fix some things myself.

I reverted every change I made, in reverse order.
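
Roughly speaking, that meant undoing the passthrough setup again. Something like the following, though your exact edits may differ:
Code:
# /etc/default/grub - drop intel_iommu=on from the kernel command line again
GRUB_CMDLINE_LINUX_DEFAULT="quiet"

# /etc/default/grub - remove the GRUB_DEFAULT pin so the newest kernel boots
GRUB_DEFAULT=0

# /etc/modules - remove the vfio / vfio_iommu_type1 / vfio_pci / vfio_virqfd lines

# apply the changes and reboot
update-grub
update-initramfs -u -k all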

And now I am booting off the newer kernel with no issues.

My VMs are now also booting correctly, as far as I can tell.

So I guess the issue was simply the changes made while trying to get PCI passthrough working.
 
