VM random restart

konjan

Member
Nov 9, 2022
Hello everyone,
I managed to install Windows on Proxmox with NVIDIA drivers and Parsec. The VM has an RTX 3090 passed through. It boots and seems to work well, but the VM only stays stable for 5-15 minutes (whether idle, in a game, or in a web browser), then it crashes and restarts. Parsec shows the message "connection lost" and the VM uses the full CPU range, which is typical blue-screen behaviour.
This does not depend on the Windows version (10 or 11) or on the NVIDIA driver version (526.86 or 512.59).

Does anybody have hints on how I can troubleshoot from here?


Server spec:
- HP DL380 g9 (2x E5-2683 v4, 64gb ram, 2x PSU 500W)
- INNO3D RTX 3090 X3
- debian 11.5 bullseye
- proxmox 7.2-11

/etc/default/grub:
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
GRUB_CMDLINE_LINUX=""
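
A rough way to double-check that the IOMMU flag really is in this file is sketched below in Python. This is only an illustration and not part of the original setup: the helper name `kernel_params` and the trimmed `SAMPLE` text are made up for the example, and it assumes the host boots through GRUB as shown above.

```python
import shlex

def kernel_params(grub_text, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return the kernel parameters assigned to `key` in /etc/default/grub text."""
    for line in grub_text.splitlines():
        line = line.strip()
        if line.startswith(key + "="):
            # shlex removes the surrounding quotes the way a shell would
            value = shlex.split(line.split("=", 1)[1])
            return value[0].split() if value else []
    return []

# Trimmed copy of the file posted above (hypothetical constant name).
SAMPLE = '''GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
GRUB_CMDLINE_LINUX=""
'''

params = kernel_params(SAMPLE)
print("intel_iommu=on present:", "intel_iommu=on" in params)
```

The file only describes what the next boot should use. What the running kernel actually received can be read from /proc/cmdline after update-grub and a reboot.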

/etc/modprobe.d/blacklist.conf
Code:
blacklist nouveau
blacklist nvidia
blacklist radeon
blacklist nvidiafb
blacklist snd_hda_intel

/etc/modprobe.d/iommu_unsafe_interupts.conf
Code:
options vfio_iommu_type1 allow_unsafe_interrupts=1

/etc/modprobe.d/kvm.conf
Code:
options kvm ignore_msrs=1

/etc/modprobe.d/vfio.conf
Code:
options vfio-pci ids=10de:2204,10de:1aef disable_vga=1
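
The IDs in this file are meant to match the [vendor:device] pairs that lspci -nn prints for both functions of the card. A minimal Python sketch of that comparison follows. The function names are hypothetical, and the sample text is copied from the configuration and lspci header lines in this post.

```python
import re

def vfio_ids(modprobe_text):
    """Collect vendor:device IDs from 'options vfio-pci ids=...' lines."""
    ids = set()
    for line in modprobe_text.splitlines():
        fields = line.split()
        # modprobe treats '-' and '_' in module names as equivalent
        if len(fields) >= 2 and fields[0] == "options" and fields[1] in ("vfio-pci", "vfio_pci"):
            for field in fields[2:]:
                if field.startswith("ids="):
                    ids.update(field[len("ids="):].split(","))
    return ids

def lspci_ids(lspci_nn):
    """Collect the [vendor:device] pair from each device header line of lspci -nn."""
    ids = set()
    for line in lspci_nn.splitlines():
        if not line or line[0].isspace():
            continue  # indented lines (Subsystem, drivers) are not device headers
        found = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", line)
        if found:
            ids.add(found[-1])
    return ids

VFIO_CONF = "options vfio-pci ids=10de:2204,10de:1aef disable_vga=1\n"
LSPCI_NN = (
    "0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)\n"
    "        Subsystem: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:1454]\n"
    "0b:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)\n"
    "        Subsystem: NVIDIA Corporation Device [10de:1454]\n"
)

missing = lspci_ids(LSPCI_NN) - vfio_ids(VFIO_CONF)
print("functions not covered by vfio.conf:", sorted(missing))
```

With the values from this thread nothing is missing, so both the VGA and the audio function are listed for vfio-pci.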

lspci -knn
Code:
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
        Subsystem: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:1454]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
0b:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:1454]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
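
The same output can be checked mechanically. The Python sketch below (hypothetical helper name, sample text copied from the output above) maps each PCI function to its "Kernel driver in use" value, so a function that is not bound to vfio-pci stands out.

```python
import re

def drivers_in_use(lspci_knn):
    """Map each PCI address to its 'Kernel driver in use' value (None if absent)."""
    result = {}
    current = None
    for line in lspci_knn.splitlines():
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", line)
        if header:
            current = header.group(1)
            result[current] = None
        elif current and "Kernel driver in use:" in line:
            result[current] = line.split(":", 1)[1].strip()
    return result

LSPCI_KNN = (
    "0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)\n"
    "        Kernel driver in use: vfio-pci\n"
    "        Kernel modules: nvidiafb, nouveau\n"
    "0b:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)\n"
    "        Kernel driver in use: vfio-pci\n"
    "        Kernel modules: snd_hda_intel\n"
)

bound = drivers_in_use(LSPCI_KNN)
not_vfio = [addr for addr, drv in bound.items() if drv != "vfio-pci"]
print(bound, "not on vfio-pci:", not_vfio)
```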

VM hardware
1668353239013.png

VM options
1668353270776.png

VM config file
Code:
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
audio0: device=ich9-intel-hda,driver=spice
bios: ovmf
boot: order=ide2;scsi0;net0
cores: 16
cpu: host,hidden=1,flags=+pcid
efidisk0: local:102/vm-102-disk-0.raw,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:0b:00,pcie=1,x-vga=1
ide2: local:iso/Win11_22H2_Polish_x64.iso,media=cdrom,size=5311480K
machine: pc-q35-7.0
memory: 16384
meta: creation-qemu=7.0.0,ctime=1668287862
name: gaming-3
net0: rtl8139=3A:33:EC:8E:C9:90,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: gaming:102/vm-102-disk-0.raw,size=160G
scsihw: virtio-scsi-single
smbios1: uuid=cdf74ef2-6486-45f9-ba81-0a61bd0c8940
sockets: 1
tpmstate0: local:102/vm-102-disk-1.raw,size=4M,version=v2.0
vga: none
vmgenid: 0b1630e9-bdd9-488a-9469-fb481908da9c
 
Isn't a 500W power supply much too light for an RTX 3090 (even with only one CPU)? Just because it is redundant (if one fails, it uses the other) does not make it a 1000W PSU.
 
The power supply is not the problem. I installed Windows 11 as the main system on a second disk and booted from it. Everything works well there and I have no problems. Only the Proxmox VM has the restart problem. I think it is connected with the NVIDIA drivers. The VM cannot see the NVIDIA HDMI audio device, whether I select "All Functions" in the VM configuration or add it as a second PCI device. The NVIDIA installer also cannot see the audio device and installs only the graphics driver.
Maybe the problem is in the configuration, but I can't find it.
 
So, when I browsed the Event Viewer in the Windows VM, I found 4 events connected with the NVIDIA driver.

Screenshot from 2022-11-14 11-53-52.png
Screenshot from 2022-11-14 11-54-06.png
Screenshot from 2022-11-14 11-54-18.png
Screenshot from 2022-11-14 11-54-34.png

Maybe this information helps to resolve the issue.

lspci -vv
Code:
0b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation GA102 [GeForce RTX 3090]
    Physical Slot: 2
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 16
    NUMA node: 0
    IOMMU group: 37
    Region 0: Memory at 93000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 39fe0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 39ff0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at 2000 [size=128]
    Expansion ROM at 94080000 [virtual] [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl:    CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl:    ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 8GT/s (ok), Width x8 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:    ArbSelect=Fixed
        Status:    InProgress-
        VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status:    NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap:    First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver <?>
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau

0b:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
    Subsystem: NVIDIA Corporation Device 1454
    Physical Slot: 2
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin B routed to IRQ 17
    NUMA node: 0
    IOMMU group: 37
    Region 0: Memory at 94000000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [78] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl:    CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl:    ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 8GT/s (ok), Width x8 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v2] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap:    First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [160 v1] Data Link Feature <?>
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel
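
Two details in this dump are easy to miss by eye: the link runs at x8 although the card advertises x16, and the power management status reads D3. A small Python sketch (hypothetical function name, excerpt copied from the dump above) pulls both out. Whether either one is related to the restarts is not established in this thread, and a D3 status may simply reflect that the dump was taken while no VM was using the card.

```python
import re

def link_report(lspci_vv):
    """Extract advertised vs. negotiated link width and the PM state of one device."""
    cap = re.search(r"LnkCap:.*?Width x(\d+)", lspci_vv)
    sta = re.search(r"LnkSta:.*?Width x(\d+)", lspci_vv)
    power = re.search(r"Status:\s+(D\d\w*)", lspci_vv)
    cap_width = int(cap.group(1)) if cap else None
    sta_width = int(sta.group(1)) if sta else None
    return {
        "cap_width": cap_width,
        "sta_width": sta_width,
        "downgraded": None not in (cap_width, sta_width) and sta_width < cap_width,
        "power_state": power.group(1) if power else None,
    }

# Excerpt of the 0b:00.0 section posted above.
EXCERPT = """
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        LnkCap:    Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
        LnkSta:    Speed 8GT/s (ok), Width x8 (downgraded)
"""

report = link_report(EXCERPT)
print(report)
```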
 
Do you maybe have other ideas?
No, still the same idea: Proxmox + Windows VM + Parsec + GPU is a higher load than just Windows + GPU, the power supply is insufficient (during peaks in the load) and power is temporarily interrupted. Do you happen to have a watt-meter to check (wall) power usage just before the restarts?

PS: E5-2683 v4 is already up to 120W each, RTX 3090 takes up to 360W.
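
As a back-of-the-envelope version of that argument, here is the sum in Python. It uses only the maximum figures quoted in this thread and ignores RAM, disks and fans. It also assumes a single 500W supply has to carry the load, which is the premise of the question and not something confirmed here.

```python
def worst_case_draw(cpu_watts, cpu_count, gpu_watts):
    """Sum the quoted maximum draw of the CPUs and the GPU, in watts."""
    return cpu_watts * cpu_count + gpu_watts

PSU_RATING_W = 500  # one of the two supplies listed in the server spec

total_w = worst_case_draw(cpu_watts=120, cpu_count=2, gpu_watts=360)
headroom_w = PSU_RATING_W - total_w
print(f"worst case {total_w} W, headroom on one PSU {headroom_w} W")
```

A negative headroom only shows that the quoted maxima add up to more than one supply's rating. It says nothing about what the machine actually draws, which is why a wall measurement is the useful next step.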
 
So, I tested the server with a watt-meter while the VM ran at full load on Heaven Benchmark 4.0 for 10 minutes. At peak the server drew 610W, and typically around 460W. (The server is set to full performance on the redundant PSUs.) After the test the VM sat idle for the next 10 minutes and then restarted with the watt-meter showing 150W.
 
Because of this, I suspect something is missing in the VM configuration.
As a test I connected a monitor directly to the GPU instead of using the emulator. After 8 minutes idle I got a black screen and the VM restarted.

[Edit]
Of course, there is no sound over HDMI in the VM.
 
Passthrough works and the VM configuration looks fine. You have not yet installed the VirtIO drivers and the QEMU Guest Agent, but they are optional and only increase performance. It still feels like a hardware or heat issue. Have you tried an Ubuntu VM (with passthrough, though I don't know how well Linux supports the RTX 3090) to see if it also restarts the whole system after some minutes of being idle?
 
I tried Linux (Ubuntu), and the VM stayed stable during a 6-hour test (Heaven Benchmark). Even the HDMI audio device started working, but with artifacts.

Code:
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 16
cpu: host,hidden=1,flags=+pcid
efidisk0: local:101/vm-101-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:0b:00.0;0000:0b:00.1,pcie=1,romfile=RTX3090.rom,x-vga=1
ide2: none,media=cdrom
machine: q35
memory: 8192
meta: creation-qemu=7.0.0,ctime=1668453631
name: ubuntu-test
net0: virtio=76:EB:D1:12:CD:25,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local:101/vm-101-disk-1.qcow2,size=60G
scsihw: virtio-scsi-pci
smbios1: uuid=8a8fbee5-e895-4e99-8dcf-c1e014e642a6
sockets: 1
usb0: host=045e:0745,usb3=1
vmgenid: 2c437617-7a0e-4b1b-baa6-3f7f4633b927
 
hostpci0: 0000:0b:00.0;0000:0b:00.1,pcie=1,romfile=RTX3090.rom,x-vga=1
This looks weird. I would expect hostpci0: 0b:00,pcie=1,romfile=RTX3090.rom,x-vga=1
If your problems are (most likely) related to Windows and/or NVidia drivers, I don't know what to do.

EDIT: Make sure to also early-bind the audio function (0b:00.1, but use the vendor:device numbers from lspci -nn instead of the address) to vfio-pci.
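
To see how the two hostpci0 lines in this thread differ, they can be split into device addresses and options. The Python below is a rough sketch: `parse_hostpci` is a made-up helper that handles only the two forms posted here and is not a Proxmox API.

```python
def parse_hostpci(value):
    """Split a hostpci value into (list of device addresses, dict of options)."""
    first, *rest = value.split(",")
    devices = first.split(";")
    options = dict(item.split("=", 1) for item in rest)
    return devices, options

# Values copied from the Windows and Ubuntu VM configs in this thread.
WINDOWS_VM = "0000:0b:00,pcie=1,x-vga=1"
UBUNTU_VM = "0000:0b:00.0;0000:0b:00.1,pcie=1,romfile=RTX3090.rom,x-vga=1"

win_devs, win_opts = parse_hostpci(WINDOWS_VM)
ubu_devs, ubu_opts = parse_hostpci(UBUNTU_VM)
only_ubuntu = sorted(set(ubu_opts) - set(win_opts))

print("Windows VM devices:", win_devs)
print("Ubuntu VM devices: ", ubu_devs)
print("options only on the Ubuntu VM:", only_ubuntu)
```

The Windows VM passes the whole device (0000:0b:00), while the Ubuntu VM lists both functions explicitly and is the only one with a romfile. Whether either difference explains the stable Ubuntu run is an open question here.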
 