I have a multi-GPU workstation that I spent weeks setting up as a Proxmox host, with VMs on a ZFS pool and PCI devices, including GPUs, passed through. The setup had been down for a few months for some upgrades and reconfiguration. Now that it is back up, I updated to the latest Proxmox via apt update.
The problem:
My Ampere GPUs show a blank screen after VM startup. They don't even show the Proxmox splash/logo/BIOS screen, just a blank display. At the kernel module level the passthrough appears to be happening, since I see this in the host log (PCI IDs 03:00 and 49:00 correspond to the 2 GPUs passed through):
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)
Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.1: enabling device (0000 -> 0002)
Jun 20 13:18:35 designare kernel: vfio-pci 0000:49:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
Jun 20 13:18:35 designare kernel: vfio-pci 0000:46:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Jun 20 13:18:35 designare kernel: vfio-pci 0000:46:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
Jun 20 13:18:35 designare kernel: vfio-pci 0000:46:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Jun 20 13:18:35 designare kernel: vfio-pci 0000:46:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Jun 20 13:18:35 designare kernel: vfio-pci 0000:46:00.3: enabling device (0000 -> 0002)
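For anyone who wants to compare a log like this between a working and a non-working boot, here is a minimal Python sketch that groups the vfio-pci lines by PCI address. The function name summarize_vfio and the output layout are my own illustration, not part of any Proxmox tool, and the regular expressions assume the exact message format shown above. Note that a third device, 46:00, also appears in the log alongside the two GPUs.

```python
import re
from collections import defaultdict

# Assumes the kernel message layout shown in the log above:
#   "... vfio-pci <domain:bus:dev.fn>: <message>"
LINE_RE = re.compile(
    r"vfio-pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (?P<msg>.*)$"
)
ECAP_RE = re.compile(r"hiding ecap (0x[0-9a-f]+)@(0x[0-9a-f]+)")


def summarize_vfio(lines):
    """Group vfio-pci log lines by PCI address (hypothetical helper)."""
    summary = defaultdict(lambda: {"hidden_ecaps": [], "enabled": False})
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a vfio-pci line
        entry = summary[match.group("dev")]
        msg = match.group("msg")
        ecap = ECAP_RE.search(msg)
        if ecap:
            # Record the extended capability ID that vfio hid from the guest.
            entry["hidden_ecaps"].append(ecap.group(1))
        elif msg.startswith("enabling device"):
            entry["enabled"] = True
    return dict(summary)


if __name__ == "__main__":
    sample = [
        "Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258",
        "Jun 20 13:18:35 designare kernel: vfio-pci 0000:03:00.1: enabling device (0000 -> 0002)",
    ]
    for dev, info in summarize_vfio(sample).items():
        print(dev, info)
```

Running it on the saved log from each boot and comparing the two summaries would at least show whether the same functions are being claimed by vfio-pci in both cases. It says nothing about why the display stays blank.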
After many hours of searching the internet and scouring the resources on this website, I decided to boot the existing installation from a rescue Proxmox USB, and everything worked again! Figuring the issue might be some sort of corruption, I then reinstalled Proxmox and reconfigured it like the original, following the guide here
https://pve.proxmox.com/wiki/PCI(e)_Passthrough and making sure to note my previous config and restore it.
I was forced to use the kernel 5.11 image, as the regular ISO would fail to install ("Starting a root shell on TTY3" being the last message):
http://download.proxmox.com/temp/proxmox-ve-6.4-iso-with-5.11-kernel/
The 5.11 install finished (I guess the Ampere GPUs might be too new for 5.4) and I reconfigured everything following my notes. Unfortunately, the blank screen is now back, and it persists whether or not I use the rescue boot. I no longer have the original Proxmox VE 6.2/kernel 5.4 installation and can't recreate it, so I am stuck. I am not a complete beginner, but the lack of clues or error messages is very frustrating. There is definitely a problem with this upgrade, as my experience shows.