GPU Passthrough Fan 100% Drivers Recognized X570

en4ble

Hello.

I'm having an issue with one of the GPUs when the VM (22.04) starts. The fan on that GPU hits 100% during boot and stays at that speed (the other GPUs default to 30%).

When checking nvidia-smi, the driver is recognized but the fan shows 0%. The other two GPUs do not have this symptom, and the settings are the same on all three.
Code:
nvidia-smi
Wed Dec 18 23:55:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                Off |   00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8             12W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
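
For a quicker check than the full table, the fan reading can be pulled with nvidia-smi's query mode. This is only a sketch: the helper name flag_zero_fan is made up for illustration, and it assumes the CSV layout produced by the query below (index, name, fan, temperature, pstate).

```shell
#!/bin/sh
# flag_zero_fan (hypothetical helper): read nvidia-smi CSV lines on stdin and
# print a line for every GPU whose reported fan speed is exactly 0.
# Expected input, one GPU per line: "0, Quadro RTX 4000, 0, 45, P8"
flag_zero_fan() {
  awk -F', *' '$3 == "0" { print "GPU " $1 " (" $2 ") reports 0% fan at " $4 "C" }'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,fan.speed,temperature.gpu,pstate \
             --format=csv,noheader,nounits | flag_zero_fan
fi
```

A GPU that reports 0% while the fan is audibly at full speed suggests the driver is not reading or controlling the fan on that card.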

The GPU is in the primary PCIe slot (CPU lanes).

HW System overview:
  1. X570 Taichi
    1. It was running an older BIOS, so it was flashed to the newest* Lb.61 (02/27/2024) from L4.82 [Beta] 2022/6/13
    2. IOMMU wasn't enabled by default. I followed the VFIO group's recommendation and enabled it.
      1. IOMMU: enabled
      2. AER Cap: enabled
      3. ACS enable: Auto
  2. Triple Quadro RTX 4000 on 550.14
    1. Tried different drivers on the impacted VM, but still the same issue
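
Since the BIOS settings above are about IOMMU and ACS, it may help to see how the host actually groups the devices. A minimal sketch that walks the standard sysfs layout (/sys/kernel/iommu_groups); the function name list_iommu_groups and its optional root argument are my own, added so it can be pointed at a test directory.

```shell
#!/bin/sh
# list_iommu_groups [ROOT]: print "group N: PCI-address" for every device
# found under ROOT (defaults to the kernel's sysfs IOMMU group directory).
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    # If IOMMU is off, the glob matches nothing and stays literal: skip it.
    [ -e "$dev" ] || continue
    group="${dev%/devices/*}"   # strip "/devices/<addr>"
    group="${group##*/}"        # keep only the group number
    echo "group $group: ${dev##*/}"
  done
}

list_iommu_groups
```

If this prints nothing on the host, the IOMMU is probably not active at all. Each GPU (with its audio and USB functions) should ideally sit in its own group.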

Proxmox Overview:
  1. PVE 8.3.2
    GRUB updated per the guide (pastebin):
    Code:
    GRUB_DEFAULT=0
    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    #GRUB_CMDLINE_LINUX_DEFAULT="quiet"
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,e>
    GRUB_CMDLINE_LINUX=""
  2. GPU recognized by the system:
    Code:
    pve01:~# lspci -vvv -s 03:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                    LnkSta: Speed 8GT/s, Width x4 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
    pve01:~# lspci -vvv -s 0f:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
                    LnkSta: Speed 8GT/s, Width x8 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
    pve01:~# lspci -vvv -s 0e:00.0 | grep "LnkCap\|LnkSta"
                    LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                    LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                    LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
  3. VM Hardware Settings:
    1734566730671.png
    1734566754840.png
    1734566885868.png
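
The three lspci outputs above can be condensed into a capable vs. negotiated summary per card. This is a rough sketch; link_summary is a hypothetical helper name, and the bus addresses are the ones from this host. As far as I know, a 2.5GT/s "downgraded" speed on an idle card in P8 is normal power saving, and the reduced widths usually come from how the slots share lanes, so neither by itself points to a fault.

```shell
#!/bin/sh
# link_summary (hypothetical helper): read `lspci -vv` output for one device
# on stdin and print the capable link next to the currently negotiated one.
link_summary() {
  awk '
    /LnkCap:/ { match($0, /Speed [^,]+, Width x[0-9]+/)
                cap = substr($0, RSTART, RLENGTH) }
    /LnkSta:/ { sub(/^.*LnkSta:[ \t]*/, ""); sta = $0 }
    END { print "capable: " cap; print "current: " sta }
  '
}

# Bus addresses taken from the post; adjust for other hosts.
if command -v lspci >/dev/null 2>&1; then
  for addr in 03:00.0 0f:00.0 0e:00.0; do
    echo "== $addr"
    lspci -vv -s "$addr" | link_summary
  done
fi
```

Run it as root on the Proxmox host, since lspci hides the link capability fields from unprivileged users.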

Things I've tried so far (will update as I try more):
  1. BIOS updated and IOMMU enabled
  2. vIOMMU changed to VirtIO - fan no longer goes to 100%, but the drivers are not recognized
  3. vIOMMU changed to Intel - drivers recognized, but the fan goes to 100%. Both 2 and 3 were run with version "latest"
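
To confirm that the host really booted with the IOMMU options and that the kernel initialized the IOMMU, something like the following can be run on the Proxmox host. It is a sketch under assumptions: has_param is a made-up helper, and the parameter names are the ones from the GRUB file in this post. X570 is an AMD platform, so as far as I know intel_iommu=on is simply ignored there and the IOMMU shows up as AMD-Vi in the kernel log.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole
# space-separated word in CMDLINE.
has_param() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Kernel command line of the running system (empty if unreadable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)
for p in intel_iommu=on iommu=pt; do
  if has_param "$cmdline" "$p"; then
    echo "present: $p"
  else
    echo "absent:  $p"
  fi
done

# IOMMU initialization messages; reading dmesg usually needs root.
dmesg 2>/dev/null | grep -i -e DMAR -e IOMMU -e AMD-Vi || true
```

If a parameter shows as absent, the GRUB change was most likely not applied (update-grub and a reboot are needed after editing /etc/default/grub).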

Any thoughts on what else I could try to get this fixed? The other two GPUs are working fine, so I'm not sure why the third one is acting strangely with fan control. I haven't tried a Windows VM yet.
 
Quick update.

Installed Windows. The GPU drivers were updated along with the OS update. A couple of strange observations:
  1. As soon as Windows updated and the drivers were installed, the GPU fan went to 100%
  2. The card is recognized as a GPU in Task Manager
  3. The NVIDIA driver update fails: it can't find compatible hardware
  4. Device Manager shows an unknown "Other device" (PCI Device)
  5. MSI Afterburner shows the GPU set to 30% fan, and manual control doesn't make any difference.
1734575472028.png
 
More updates:
  1. disabled CSM (needed in order to enable Above 4G Decoding)
  2. Above 4G Decoding enabled - still the same issue
  3. SR-IOV - same issue
  4. swapped GPUs between slots - issue follows the GPU
 
Moved GPU to another system - issue follows the GPU.

It seems like something is wrong with that GPU, even though default drivers somehow get installed. No idea.