Hello.
I'm having an issue with one of my GPUs when its VM (22.04) starts. The fan on that GPU goes to 100% during boot (the other GPUs default to 30%) and stays at that speed.
When I check nvidia-smi, the driver is loaded and the GPU is detected, but the fan shows 0%. The other two GPUs do not show this symptom, and the settings are the same on all three.
Code:
nvidia-smi
Wed Dec 18 23:55:28 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                Off |   00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8             12W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
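To compare the fan reading across the three VMs without reading the whole table, a query along these lines might help. This is only a sketch: `fan.speed` and `temperature.gpu` are standard `nvidia-smi --query-gpu` fields, but `flag_fan_readings` is a hypothetical helper name, and treating a 0% reading as suspicious is my own assumption.

```shell
#!/bin/sh
# flag_fan_readings (hypothetical helper): reads CSV rows of the form
#   index, name, fan.speed, temperature.gpu
# on stdin and annotates each GPU's reported fan speed.
flag_fan_readings() {
    awk -F', *' '{
        if ($3 !~ /^[0-9]+$/)  note = "fan speed not reported"
        else if ($3 + 0 == 0)  note = "fan reads 0%"
        else                   note = "ok"
        printf "GPU %s (%s): fan=%s%% temp=%sC -> %s\n", $1, $2, $3, $4, note
    }'
}

# Only call nvidia-smi where it is installed (i.e. inside the guest VM).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,fan.speed,temperature.gpu \
               --format=csv,noheader,nounits | flag_fan_readings
fi
```

Running the same script in each of the three VMs should make it obvious whether only the affected card reports 0% while its fan is audibly at full speed.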
The affected GPU is in the primary/main PCIe slot (the one wired to the CPU).
HW System overview:
- X570 Taichi
- It was running an older BIOS, so I flashed it from L4.82 [Beta] (2022/6/13) to the newest*, Lb.61 (02/27/2024)
- IOMMU wasn't enabled by default. I followed the recommendation from the VFIO group and enabled it.
- IOMMU: enabled
- AER Cap: enabled
- ACS enable: Auto
- Triple Quadro RTX 4000 on 550.14
- Tried different driver versions on the affected VM, but the issue stays the same
Proxmox Overview:
- PVE 8.3.2
- GRUB updated per the guide (pastebin):
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
#GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,e>
GRUB_CMDLINE_LINUX=""
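The GRUB settings above only take effect after `update-grub` and a reboot, so it may be worth confirming what the running kernel actually received. A rough sketch, assuming the host boots through GRUB; `has_kernel_param` and `report_params` are hypothetical helper names, and the parameter list is just the subset of my GRUB line that is not cut off:

```shell
#!/bin/sh
# has_kernel_param (hypothetical helper): succeeds if the exact parameter
# given as $1 appears as a whole word in the command line string $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# report_params (hypothetical helper): $1 is a kernel command line,
# the remaining arguments are parameters to look for.
report_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        if has_kernel_param "$p" "$cmdline"; then
            echo "present: $p"
        else
            echo "missing: $p"
        fi
    done
}

# On the Proxmox host, check the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    report_params "$(cat /proc/cmdline)" intel_iommu=on iommu=pt \
        pcie_acs_override=downstream,multifunction nofb nomodeset
fi
```

Any parameter reported as missing would suggest the edited file was not picked up on the last boot.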
- GPU recognized by the system:
Code:
pve01:~# lspci -vvv -s 03:00.0 | grep "LnkCap\|LnkSta"
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
LnkSta: Speed 8GT/s, Width x4 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
pve01:~# lspci -vvv -s 0f:00.0 | grep "LnkCap\|LnkSta"
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
LnkSta: Speed 8GT/s, Width x8 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
pve01:~# lspci -vvv -s 0e:00.0 | grep "LnkCap\|LnkSta"
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
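The three outputs above are easier to compare when condensed to one line per card. A small sketch of how that could be done; the bus addresses are the ones from this host, `summarize_link` is a hypothetical helper name, and `lspci` generally needs to run as root to show the link capabilities:

```shell
#!/bin/sh
# summarize_link (hypothetical helper): reads `lspci -vvv` output on stdin
# and prints the negotiated speed and width from the LnkSta line, with a
# marker when lspci reports the link as downgraded. $1 is a label (bus id).
summarize_link() {
    awk -v dev="$1" '/LnkSta:/ {
        speed = ""; width = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Speed") speed = $(i + 1)
            if ($i == "Width") width = $(i + 1)
        }
        gsub(/,/, "", speed); gsub(/,/, "", width)
        flag = ($0 ~ /downgraded/) ? " [downgraded]" : ""
        printf "%s: %s %s%s\n", dev, speed, width, flag
    }'
}

# On the Proxmox host, summarize each of the three GPUs.
if command -v lspci >/dev/null 2>&1; then
    for dev in 03:00.0 0e:00.0 0f:00.0; do
        lspci -vvv -s "$dev" | summarize_link "$dev"
    done
fi
```

The pattern matches `LnkSta:` only, so the `LnkSta2:` line is skipped.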
- VM Hardware Settings:
Things I've tried so far (I will update this as I try more):
- BIOS updated and IOMMU enabled
- vIOMMU changed to VirtIO - fan no longer goes to 100%, but the drivers are not recognized
- vIOMMU changed to Intel - drivers are recognized, but the fan goes to 100%. Items 2 and 3 were both run with version "latest"
Any thoughts on what else I could try to get this fixed? The other two GPUs are working fine, and I'm not sure why the third one is acting strangely with fan control. I haven't tried a Windows VM yet.