[SOLVED] 3090Ti passthrough not working - Stuck at "Guest has not initialized...."

Sandbo

Member
Jul 4, 2019
Hi,

I recently purchased a new GPU for numerical work on my workstation.
The workstation was set up roughly 1.5 years ago and has been running PVE 6.4 on an AMD Threadripper 3970X, with a Corsair AX1600i PSU.
Here is the kernel version:
Linux polaris 5.4.189-1-pve #1 SMP PVE 5.4.189-1 (Wed, 11 May 2022 07:10:20 +0200) x86_64

I followed this page:
https://pve.proxmox.com/wiki/Pci_passthrough

I was able to pass through an Nvidia GT710 as a test, and it shows up under lspci -v in the Ubuntu guest.
However, when I pass through the 3090Ti, the VM fails to start and stays stuck at the screen showing "Guest has not initialized..."
While it was stuck, the guest's CPU summary showed no load, so I believe it was not booting correctly.

I can find the 3090Ti on the host, though the model name of the GPU is not shown the way it is for the GT710:
Code:
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2203 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5090
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 255
        NUMA node: 0
        Region 0: Memory at e6000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at e800000000 (64-bit, prefetchable) [size=32G]
        Region 3: Memory at f000000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 3000 [size=128]
        Expansion ROM at e7000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 1048576ns
                Max no snoop latency: 1048576ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=32768ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Capabilities: [bb0 v1] #15
        Capabilities: [c1c v1] #26
        Capabilities: [d00 v1] #27
        Capabilities: [e00 v1] #25
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau

01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5090
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 255
        NUMA node: 0
        Region 0: Memory at e7080000 (32-bit, non-prefetchable) [disabled] [size=16K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [160 v1] #25
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
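
For output like the above, the relevant detail is that both functions (01:00.0 and 01:00.1) report vfio-pci as the driver in use. A minimal sketch for pulling out just that information; drivers_in_use is a hypothetical helper name (not part of pciutils), and it only assumes the "Kernel driver in use:" lines that lspci -k and lspci -v print:

```shell
# Print "<slot> <driver>" for each PCI function found in lspci -k / lspci -v
# style text read from stdin. drivers_in_use is a hypothetical helper name.
drivers_in_use() {
    awk '
        # A device header starts with the slot, e.g. 01:00.0 or 0000:01:00.0
        /^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-9a-f]/ { slot = $1 }
        # The driver name is the last field on this line
        /Kernel driver in use:/ { print slot, $NF }
    '
}

# On the host, assuming the GPU sits at 01:00 as in the output above:
# lspci -k -s 01:00 | drivers_in_use
```

Fed the output above, this prints "01:00.0 vfio-pci" and "01:00.1 vfio-pci". A function that does not appear in the result has no driver bound at all.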

Could it be because the kernel is too old for the new GPU?

If so, is it best to just upgrade Proxmox?
Unfortunately, this is a busy production system that I cannot simply upgrade, so if upgrading Proxmox is the only solution, I will need to plan for it (and it cannot be done at the moment).

For reference, here is what I changed on the host.
The GRUB command line (with update-grub applied):
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on video=efifb:off iommu=pt vfio_iommu_ty$

Solution: see #4 and #5.
 

dcsapak

Proxmox Staff Member
Staff member
Feb 1, 2016
"Could it be because the kernel is too old for the new GPU?"
Unlikely, since the kernel does not interact much with the card once it is passed through.

Can you post the VM config, as well as the journal/dmesg output from when you start the VM?
 

Sandbo

Here is the config:
Code:
balloon: 8192
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 32
cpu: host
efidisk0: local-lvm:vm-105-disk-1,size=4M
hostpci0: 0000:01:00,pcie=1
ide2: local:iso/ubuntu-22.04-live-server-amd64.iso,media=cdrom
machine: q35
memory: 65536
name: jupyter-polaris
net0: virtio=82:DC:E7:F5:73:0B,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-105-disk-0,size=128G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=77c9902c-80b3-44b0-b464-bf546d843267
sockets: 1
startup: order=3
vmgenid: 98655edb-7e9f-4e62-a9d3-1c77db07b5de

And below is the dmesg output from when I tried to boot the VM:
Code:
[   98.149277] fwbr106i0: port 2(tap106i0) entered blocking state
[   98.149278] fwbr106i0: port 2(tap106i0) entered forwarding state
[   99.014096] show_signal: 51 callbacks suppressed
[   99.014097] traps: light-locker[9244] trap int3 ip:7ff2b7384295 sp:7ffefc0c0c50 error:0 in libglib-2.0.so.0.6400.6[7ff2b7348000+84000]
[   99.518991] traps: light-locker[9959] trap int3 ip:7f07f1d5e295 sp:7ffd4f49ce50 error:0 in libglib-2.0.so.0.6400.6[7f07f1d22000+84000]
[   99.530181] traps: light-locker[10018] trap int3 ip:7fa0a4a94295 sp:7ffec39e3ba0 error:0 in libglib-2.0.so.0.6400.6[7fa0a4a58000+84000]
[   99.530235] traps: light-locker[10043] trap int3 ip:7fe09e768295 sp:7fffc5acd730 error:0 in libglib-2.0.so.0.6400.6[7fe09e72c000+84000]
[   99.535359] traps: light-locker[10082] trap int3 ip:7f70eb75a295 sp:7ffd4fa10450 error:0 in libglib-2.0.so.0.6400.6[7f70eb71e000+84000]
[  101.464860] traps: light-locker[11100] trap int3 ip:7f8b44442295 sp:7ffe17d95fe0 error:0 in libglib-2.0.so.0.6400.6[7f8b44406000+84000]
[11325.114598] device tap105i0 entered promiscuous mode
[11325.134809] fwbr105i0: port 1(fwln105i0) entered blocking state
[11325.134810] fwbr105i0: port 1(fwln105i0) entered disabled state
[11325.134877] device fwln105i0 entered promiscuous mode
[11325.134931] fwbr105i0: port 1(fwln105i0) entered blocking state
[11325.134931] fwbr105i0: port 1(fwln105i0) entered forwarding state
[11325.137267] vmbr0: port 7(fwpr105p0) entered blocking state
[11325.137269] vmbr0: port 7(fwpr105p0) entered disabled state
[11325.137374] device fwpr105p0 entered promiscuous mode
[11325.137402] vmbr0: port 7(fwpr105p0) entered blocking state
[11325.137403] vmbr0: port 7(fwpr105p0) entered forwarding state
[11325.139797] fwbr105i0: port 2(tap105i0) entered blocking state
[11325.139799] fwbr105i0: port 2(tap105i0) entered disabled state
[11325.139922] fwbr105i0: port 2(tap105i0) entered blocking state
[11325.139923] fwbr105i0: port 2(tap105i0) entered forwarding state
[11331.316862] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[11331.316882] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[11331.316889] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
[11331.316890] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
[11331.316892] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
[11331.352704] vfio-pci 0000:01:00.1: enabling device (0000 -> 0002)
[11331.352822] vfio-pci 0000:01:00.1: vfio_ecap_init: hiding ecap 0x25@0x160

While I hope it isn't the case, the GPU itself could also be faulty. I will test it with another OS, maybe Windows, just to be sure.
Since another GPU (the GT710) seems to work (at least it passes through, the VM boots, and the card is detected), I might just have gotten a lemon, or the installation was bad (I have already tried unplugging and replugging the card).
 

Sandbo

I have a quick update:

Today I was trying to upgrade another system that also runs Proxmox 6.4 with kernel 5.4.
I tried to pass through a card (a Vega FE) that I had passed through before (same system, same Proxmox 6.x, maybe a slightly older release, around a year ago), but it no longer worked and I saw the same black-screen issue.

I was puzzled, but since I am the only user of this system, I was able to test by upgrading to Proxmox 7.2 and kernel 5.15. That did not really change anything. I then experimented with various settings and, to my surprise, switching the VM BIOS to SeaBIOS let me pass the card through with no problem. I have no idea why: I was definitely using OVMF when I previously did the passthrough, but now only SeaBIOS works.

For fun, I tried the same on my 3090Ti system, and it also just works! I am not sure what changed such that only SeaBIOS works now.
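
The SeaBIOS switch described above corresponds to a single line in the VM config. A minimal sketch, assuming the config of VM 105 posted earlier in this thread; the thread does not state whether any other option was changed at the same time, so only the bios line is shown:

```
bios: seabios
```

This takes the place of the bios: ovmf line. SeaBIOS is the Proxmox default, so removing the bios line entirely should have the same effect.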
 

Sandbo

Final update:

I think I have nailed down the reason, and it is in fact mentioned in the wiki; I had simply missed it.
https://pve.proxmox.com/wiki/Pci_passthrough#The_.27romfile.27_Option

For some reason, on the same system where passthrough worked in the past (I have passed through GPUs in every slot), I did not have to supply the VBIOS for passthrough to work with the OVMF BIOS, but now I do. I have not checked the 3090Ti yet, but I saw the same behavior with the Vega FE on my old system, and it was resolved once I supplied a romfile from TechPowerUp.

I confirmed this with a fresh VM by comparing the result with and without that particular romfile option line in the conf.
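
For reference, a minimal sketch of what the romfile option looks like when added to the hostpci0 line from the config posted earlier. It is shown on the 3090Ti VM's line purely as an illustration, since the thread only confirms the fix for the Vega FE. The file name vbios.rom is a placeholder (hypothetical), and per the Proxmox documentation the file is looked up under /usr/share/kvm/ on the host:

```
hostpci0: 0000:01:00,pcie=1,romfile=vbios.rom
```

Dropping the romfile=vbios.rom part restores the original line, which makes the with/without comparison easy to repeat.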
 