GPU Passthrough Fan 100% Drivers Recognized X570

en4ble

Hello.

I'm having an issue with one of the GPUs when the VM (22.04) starts. The fan on that GPU hits 100% during boot (the other GPUs default to 30%) and stays at that speed.

nvidia-smi shows that the driver recognizes the card, but the fan reads 0%. The other two GPUs do not show this symptom, and the settings are the same on all three.
Code:
nvidia-smi
Wed Dec 18 23:55:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                Off |   00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8             12W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
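To compare what the driver reports with what the fan is actually doing, the same readings can be pulled in query mode. This is only a minimal sketch: `flag_zero_fan` is a hypothetical helper name (not part of nvidia-smi), and it assumes the CSV column order produced by the query shown in the comment.

```shell
# Hypothetical helper: reads "index, name, fan, temp" CSV lines on stdin
# and prints one line for every GPU whose reported fan speed is exactly 0.
flag_zero_fan() {
  awk -F', ' '$3 == "0" { printf "GPU %s (%s): fan reads 0%% at %sC\n", $1, $2, $4 }'
}

# Intended use inside a VM with the NVIDIA driver loaded:
# nvidia-smi --query-gpu=index,name,fan.speed,temperature.gpu \
#   --format=csv,noheader,nounits | flag_zero_fan
```

A reading of 0% while the fan is audibly at full speed suggests the driver's view of the fan and the card's actual fan state disagree.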

The GPU is in the primary PCIe slot (CPU lanes).

HW System overview:
  1. X570 Taichi
    1. It was running an older BIOS, so it was flashed from L4.82 [Beta] (2022/6/13) to the newest* Lb.61 (02/27/2024).
    2. IOMMU wasn't enabled by default. I enabled it following the recommendation from the VFIO group:
      1. IOMMU: enabled
      2. AER Cap: enabled
      3. ACS enable: Auto
  2. Three Quadro RTX 4000 cards on driver 550.14
    1. Tried different driver versions on the affected VM, but the issue persists
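With IOMMU enabled in the BIOS, one quick host-side check is how the PCI devices are grouped. The following is a rough sketch, assuming the standard sysfs layout under `/sys/kernel/iommu_groups`; `list_iommu_groups` is a hypothetical helper name, and the optional directory argument exists only so the function can be pointed at a test tree.

```shell
# Hypothetical helper: prints "group <N>: <pci-address>" for every device
# found under the IOMMU groups directory (default: the real sysfs path).
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue          # glob matched nothing
    rel="${dev#"$root"/}"              # e.g. 7/devices/0000:0f:00.0
    printf 'group %s: %s\n' "${rel%%/*}" "${dev##*/}"
  done
}

# On the Proxmox host:
# list_iommu_groups | grep -e 03:00 -e 0e:00 -e 0f:00
```

If the directory is empty, the function prints nothing, which would indicate that the kernel did not bring up the IOMMU.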

Proxmox Overview:
  1. PVE 8.3.2
    GRUB updated per the guide (pastebin):
    Code:
    GRUB_DEFAULT=0
    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    #GRUB_CMDLINE_LINUX_DEFAULT="quiet"
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,e>
    GRUB_CMDLINE_LINUX=""
  2. GPU recognized by the system:
    Code:
    pve01:~# lspci -vvv -s 03:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                    LnkSta: Speed 8GT/s, Width x4 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
    pve01:~# lspci -vvv -s 0f:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                    LnkSta: Speed 8GT/s, Width x8 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
    pve01:~# lspci -vvv -s 0e:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                    LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
  3. VM Hardware Settings:
    1734566730671.png
    1734566754840.png
    1734566885868.png
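The three `lspci` checks above can be condensed so that only the negotiated link speed and width are shown per card. This is a minimal sketch: `link_status` is a hypothetical helper name, and it assumes the `LnkSta:` line format seen in the output above.

```shell
# Hypothetical helper: reads `lspci -vvv` output on stdin and prints the
# negotiated speed and width from the LnkSta line (LnkSta2 is ignored).
link_status() {
  sed -n 's/^[[:space:]]*LnkSta:[[:space:]]*Speed \([^,]*\), Width \(x[0-9]*\).*/speed=\1 width=\2/p'
}

# On the Proxmox host, using the bus addresses from this post:
# for dev in 03:00.0 0f:00.0 0e:00.0; do
#   printf '%s: ' "$dev"; lspci -vvv -s "$dev" | link_status
# done
```

Running this while the VMs are under load and again at idle makes it easier to tell a slot wiring limit (width) apart from a link that has merely dropped speed to save power.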

Things I've tried so far (will update as I try more):
  1. BIOS updated and IOMMU enabled
  2. vIOMMU changed to VirtIO - the fan no longer goes to 100%, but the drivers are not recognized
  3. vIOMMU changed to Intel - drivers recognized, but the fan goes to 100%. Both 2 and 3 were run with version "latest"

Any thoughts on what else I could try to get this fixed? The other two GPUs are working fine, so I'm not sure why the third one is acting strangely with fan control. I haven't tried a Windows VM yet.
 
Quick update.

Installed Windows. The GPU drivers were installed along with the OS update. A couple of strange observations:
  1. As soon as Windows updated and the drivers were installed, the GPU fan started running at 100%
  2. The card is recognized as a GPU in Task Manager
  3. The NVIDIA driver update fails - it can't find compatible hardware
  4. Device Manager shows an unknown device under Other devices (PCI Device)
  5. MSI Afterburner shows the GPU set to 30% fan, and setting it manually makes no difference
1734575472028.png
 
More updates:
  1. Disabled CSM (needed in order to enable Above 4G Decoding)
  2. Above 4G Decoding enabled - still the same issue
  3. SR-IOV - same issue
  4. Swapped the GPUs between slots - the issue follows the GPU
 
Moved GPU to another system - issue follows the GPU.

It seems like something is wrong with that GPU, even though default drivers somehow get installed. No idea.
 
