Hi,
Recently I purchased a new GPU for numerical work for my workstation.
It was installed roughly 1.5 year ago and has been running PVE 6.4, running on a AMD Threadripper 3970X, with PSU being Corsair AX1600i.
Here is the kernel version:
I followed this page:
https://pve.proxmox.com/wiki/Pci_passthrough
And I was able to pass an Nvidia GT710 as a test, which I can see under lspci -v in the ubuntu guest.
However, when it comes to passing the 3090Ti, the VM simply failed to start, stuck at the screen showing "Guest has not initialized..."
I checked the CPU summary of the Guest when it got stuck and there wasn't any loading so I believe it wasn't booting up correctly.
On the host I could find the 3090Ti, though I realized I could not see the model of the GPU like I could for GT710.
Could it be because the kernel is too old for the new GPU?
If so, is it the best I just upgrade Proxmox?
Unfortunately it is a busy production system which I cannot simply upgrade, so if upgrading Proxmox is the only solution, I should plan a bit for it (which cannot be done at the moment).
Just adding what I have added on the host:
For grub with update-grub applied:
Solution: see #4 and #5.
Recently I purchased a new GPU for numerical work for my workstation.
It was installed roughly 1.5 year ago and has been running PVE 6.4, running on a AMD Threadripper 3970X, with PSU being Corsair AX1600i.
Here is the kernel version:
Linux polaris 5.4.189-1-pve #1 SMP PVE 5.4.189-1 (Wed, 11 May 2022 07:10:20 +0200) x86_64
I followed this page:
https://pve.proxmox.com/wiki/Pci_passthrough
And I was able to pass an Nvidia GT710 as a test, which I can see under lspci -v in the ubuntu guest.
However, when it comes to passing the 3090Ti, the VM simply failed to start, stuck at the screen showing "Guest has not initialized..."
I checked the CPU summary of the Guest when it got stuck and there wasn't any loading so I believe it wasn't booting up correctly.
On the host I could find the 3090Ti, though I realized I could not see the model of the GPU like I could for GT710.
Code:
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2203 (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5090
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
NUMA node: 0
Region 0: Memory at e6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at e800000000 (64-bit, prefetchable) [size=32G]
Region 3: Memory at f000000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 3000 [size=128]
Expansion ROM at e7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 1048576ns
Max no snoop latency: 1048576ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=32768ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] #19
Capabilities: [bb0 v1] #15
Capabilities: [c1c v1] #26
Capabilities: [d00 v1] #27
Capabilities: [e00 v1] #25
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5090
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin B routed to IRQ 255
NUMA node: 0
Region 0: Memory at e7080000 (32-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [160 v1] #25
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
Could it be because the kernel is too old for the new GPU?
If so, is it the best I just upgrade Proxmox?
Unfortunately it is a busy production system which I cannot simply upgrade, so if upgrading Proxmox is the only solution, I should plan a bit for it (which cannot be done at the moment).
Just adding what I have added on the host:
For grub with update-grub applied:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on video=efifb:off iommu=pt vfio_iommu_ty$
Solution: see #4 and #5.
Last edited: