I'll start off with my hardware and software, then the backstory and a timeline of events that led to this point.
Hardware/software:
HP DL580 G7 - 4x Xeon X7550 - 96GB RAM
NVIDIA Quadro P600
Proxmox 7.2-4
Kernel version - 5.15.35-2
Backstory:
So yesterday my new P600 GPU arrived. I shut down my Proxmox PVE server via the web GUI, opened up the chassis and physically installed the P600. I put the chassis back together, reconnected all cables, and booted it up. I had no display on my monitor, but that's because the HP BIOS was set to use the P600 for output rather than the onboard graphics. No issue, I just waited for the web GUI to become available again.
I signed back in, all my VMs fired up, and everything looked good.
I immediately tried to assign the GPU to my VM, but got the error 'TASK ERROR: IOMMU not present'. I dove down the rabbit hole of PCI passthrough, following the guides here: https://pve.proxmox.com/wiki/Pci_passthrough#Enable_the_IOMMU and here: https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/
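For reference, the IOMMU step in the wiki guide boils down to adding a kernel flag and checking that it took effect. The snippet below is only a rough sketch of that check, assuming an Intel CPU and GRUB as the bootloader; the helper name has_intel_iommu_flag is made up for illustration and this is not a record of the exact commands I ran.

```shell
#!/bin/sh
# Sketch: check that the IOMMU kernel flag from the Proxmox wiki is active.
# Assumes an Intel CPU and GRUB as the bootloader.

# Hypothetical helper: returns 0 if the given kernel command line
# contains the exact token intel_iommu=on.
has_intel_iommu_flag() {
    case " $1 " in
        *" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# The flag is normally added in /etc/default/grub, for example:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# followed by `update-grub` and a reboot.

if [ -r /proc/cmdline ]; then
    if has_intel_iommu_flag "$(cat /proc/cmdline)"; then
        echo "intel_iommu=on is present on the running kernel command line"
    else
        echo "intel_iommu=on is NOT present; check /etc/default/grub and run update-grub"
    fi
fi

# The wiki then suggests confirming via the kernel log (run as root):
#   dmesg | grep -e DMAR -e IOMMU
```

If the flag is present but the task error persists, the kernel log lines matching DMAR or IOMMU are usually the next place to look.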
After a reboot I started to experience some weird behavior: the PVE web GUI would load for about 60 seconds and then become completely unusable. I was also unable to SSH into the server after this 'crash'.
After removing the GPU from the system, in the hope that it was the culprit, I could see my original display again. Once Proxmox finished booting, there was a crapload of errors on screen. I stupidly didn't snap a photo of them, but one of the errors I noted down was 'PTE Read Access is not set'.
This is very similar to what is outlined here: https://forum.proxmox.com/threads/p...with-dmar-error-read-access-is-not-set.67074/
At this point I was pretty lost and unsure what to do. I rebooted the server once more, and when it came time to boot into Proxmox PVE, I went into the manual settings instead and chose to boot an earlier kernel version. Kernel version 5.4.189-1 was available, so I chose it.
Proxmox PVE now boots - hooray. The web GUI doesn't crash instantly, and I can manage my VMs once more.
I spent some more time working on the IOMMU issues, trying to get my GPU passed through to the VM, but I have had no luck getting this to work. I believe my issue relates to HP and RMRR. A 'fix' is outlined here: https://forum.proxmox.com/threads/c...ntel-iommu-driver-to-remove-rmrr-check.36374/ but given the issues I'm already working through, I don't want to add even more on top.
I made a change to my GRUB file so that Proxmox automatically boots the older kernel, following https://forum.proxmox.com/threads/revert-to-prior-kernel.100310/ - my system now boots into the 5.4.189-1 kernel by default, which seems to be stable and working for the most part.
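For anyone following along, the GRUB change looks roughly like the fragment below. Treat it as a sketch: the menu entry titles are my assumption of what a stock Proxmox 7 install generates, so check them against your own /boot/grub/grub.cfg before copying anything.

```shell
# /etc/default/grub (fragment)
# Boot a specific submenu entry instead of the newest installed kernel.
# The titles below are an assumption; list the real ones with:
#   grep -E "menuentry |submenu " /boot/grub/grub.cfg
# The part before ">" is the submenu title, the part after is the entry title.
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.4.189-1-pve"

# Afterwards, regenerate the GRUB configuration (as root):
#   update-grub
```

To go back to the newest kernel later, set GRUB_DEFAULT=0 again and rerun update-grub.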
The only issue I have now is that a couple of my Ubuntu VMs won't boot. I have two VMs stuck with the following output:
I've tried stopping and starting them. I've tried rebooting the host. Nothing seems to work. I've spent a bit of time googling solutions, but none of them have worked, and I have no idea where to go next with this.
I also don't want to start chopping and changing too much of my Proxmox PVE setup, in case I make things worse.
What options do I have to repair this?
Here is the output of pveversion -v:
Code:
root@gobblemox:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.4.189-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.4: 6.4-17
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.4.189-1-pve: 5.4.189-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libqb0: 1.0.5-1~bpo9+2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-9
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
root@gobblemox:~#
Any help is appreciated.