AMD RX 6650 XT - High temperatures when idle

mrish

New Member
Apr 7, 2023
Hi,

I am running Proxmox on my desktop with the following specs:
- Intel 13600K
- MSI Z690 A WiFi D4
- AMD RX 6650 XT
- 2x32 GB of RAM
- 2x Samsung 870 Evo 500 GB as boot drives in ZFS RAID1 configuration
- XPG S70 Blade 1 TB in LVM-thin to run my VMs

Usage scenario: VM (running Windows or Linux) with GPU passthrough

What works: no blacklisting of drivers, no VFIO IDs. Only 'driverctl' is used to pass the GPU to vfio-pci and back, via a Perl hookscript (attached for reference).
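
For context, the driverctl approach boils down to two calls (a simplified sketch of what the attached hookscript does; the PCI address is my GPU's):

Bash:
# pre-start: hand the GPU to vfio-pci for the VM
driverctl set-override 0000:03:00.0 vfio-pci
# post-stop: drop the override so the kernel rebinds the default driver (amdgpu)
driverctl unset-override 0000:03:00.0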
Here is an extract from the log after the VM is shut down, to show that the host (Proxmox) gets control back and starts using the amdgpu driver, so VFIO is no longer controlling it:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
Kernel driver in use: amdgpu
Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

So, what's the issue: regardless of the approach I use for passing through the GPU or the distro running in the VM, its temperature starts climbing every time the card is returned to the host (Proxmox), and it gets really hot without any usage.

I have no issues passing the GPU to different VMs; the hookscript works perfectly. The temperatures are fine as long as a VM, not the host, is controlling the card through its own drivers (amdgpu for Linux VMs, the proprietary drivers in Windows). But this GPU heat-up is driving me mad; I am even contemplating bidding farewell to Proxmox if I can't find a solution. I have been struggling with this for several months now. I know I could keep a VM running just to manage the card, but running a VM for no other purpose is not a satisfying solution, especially when the very same amdgpu driver would be actively managing the card on the host, just as a Linux VM does when the GPU is passed through to it.

Wondering if there is a better way to handle this on the host itself, or am I missing something here?

Thanks,
 

Attachments

  • driverctl-hookscript.pl.txt
    7.2 KB
Unless there is a driver loaded, GPUs are usually in a high-power VGA compatibility mode. Either have a VM with passthrough running, or unbind the vfio-pci driver and bind the amdgpu driver. There are examples on this forum on how to do this with the help of hookscripts.
 
Unless there is a driver loaded, GPUs are usually in a high-power VGA compatibility mode. Either have a VM with passthrough running, or unbind the vfio-pci driver and bind the amdgpu driver. There are examples on this forum on how to do this with the help of hookscripts.
Thanks, I had gone through previous threads on this topic and had a feeling you would reply :)

Not sure if my post explained it, but that's exactly what I have been doing: the vfio driver is unbound and amdgpu does take over. Please refer to the log I posted above, taken after the VM is shut down and the host takes over. I also posted the hookscript I use to perform this operation.
 
Thanks, I had gone through previous threads on this topic and had a feeling you would reply :)
I attempted to reply earlier (in more detail), but your post was stuck in the spam filter for some time.
Not sure if my post explained it, but that's exactly what I have been doing: the vfio driver is unbound and amdgpu does take over. Please refer to the log I posted above, taken after the VM is shut down and the host takes over. I also posted the hookscript I use to perform this operation.
I missed that, sorry (and I have a hard time reading Perl, as I simply use a Bash script). If you bind the amdgpu driver on the host, it should not get warm.
I don't know whether driverctl actually does that (I don't use it) or whether you still have this issue. I simply use echo "0000:0b:00.0" >/sys/bus/pci/drivers/amdgpu/bind in post-stop (after unbinding vfio-pci, of course), and I also do that for the audio and USB functions.
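Spelled out for a single PCI function, that sequence is just the following (a sketch; substitute your GPU's address):

Bash:
# Release the device from vfio-pci...
echo "0000:0b:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
# ...then hand it to amdgpu so the driver can put it into a low-power state
echo "0000:0b:00.0" > /sys/bus/pci/drivers/amdgpu/bind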
I don't think there is a better way (but maybe a simpler script), as only the device driver knows how to put the device in idle/low-power mode.
 
I did try a shell script and a Python one, but all had the same result.

Would you mind sharing your script? I am also passing through the audio device and some USB devices by their identifiers (not ports).

Thanks,
 
I did try a shell script and a Python one, but all had the same result.
It's still not clear to me if you still have an issue, sorry.
Would you mind sharing your script? I am also passing through the audio device and some USB devices by their identifiers (not ports).
Bash:
#!/bin/bash
# Unbind device $1 from whatever driver currently owns it on bus $2 (if any).
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 1
    fi
}
# Bind device $1 on bus $2 to driver $3.
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
# Convenience wrapper: release the device, then hand it to the given driver.
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
#    echo device_specific >"/sys/bus/pci/devices/0000:0e:00.0/reset_method"
    # Detach the virtual consoles so the framebuffer releases the GPU.
    echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
elif [ "$2" == "post-stop" ]
then
    sleep 1
    # Hand the GPU and its sibling PCI functions back to their native drivers.
    RebindTo "0000:0e:00.0" pci amdgpu
    RebindTo "0000:0e:00.1" pci snd_hda_intel
    RebindTo "0000:0e:00.2" pci xhci_hcd
    RebindTo "0000:0e:00.3" pci i2c-designware-pci
    RebindTo "0000:10:00.3" pci xhci_hcd
    # Reattach the virtual console so the host keyboard/display work again.
    echo '1' >/sys/class/vtconsole/vtcon0/bind
    sleep 1
fi
This is for a 6950XT (0e:00.*) and a Ryzen USB controller (10:00.3). That way, I can use the Proxmox host console (keyboard & display) when the VM is not running.
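In case it helps anyone following along: the script goes under /var/lib/vz/snippets and is attached to the VM with qm set (the VMID and filename here are just examples):

Bash:
# Make the hookscript executable and attach it to the VM (example VMID 102)
chmod +x /var/lib/vz/snippets/hookscript-amdgpu.sh
qm set 102 --hookscript local:snippets/hookscript-amdgpu.sh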
 
Thanks, I had used a similar script earlier, but I customized yours and used it instead. Since mine is an RX 6000-series model, which supposedly had the reset bug, I uncommented and changed the line in your script to:

echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"

But the hookscript fails with the following error:

/var/lib/vz/snippets/hookscript-amdgpu.sh 102 pre-start
/var/lib/vz/snippets/hookscript-amdgpu.sh: line 20: echo: write error: Invalid argument
hookscript error for 102 on pre-start: command '/var/lib/vz/snippets/hookscript-amdgpu.sh 102 pre-start' failed: exit code 1
 
/var/lib/vz/snippets/hookscript-amdgpu.sh: line 20: echo: write error: Invalid argument
What is line 20? And did you change the other lines to match your setup as well?

EDIT: Assuming it's echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method", then the cause is most likely that vendor-reset does not support the Radeon 6000-series: you cannot use vendor-reset to fix your reset issue.
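You can also ask the kernel which reset methods it supports for the device before writing one; writing a method that is not in the list fails with exactly the "Invalid argument" you saw (sysfs interface of recent kernels; the address is taken from your error):

Bash:
# List the reset methods the kernel supports for this device (e.g. "flr bus")
cat /sys/bus/pci/devices/0000:03:00.0/reset_method
# "device_specific" only shows up if a module such as vendor-reset provides one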
 
What is line 20? And did you change the other lines to match your setup as well?
Line 20 is the line that I uncommented and changed:
echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"

Yes, I changed it, but so far only for the GPU (I commented out everything else):

Bash:
#!/bin/bash
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 3
    fi
}
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
        echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
        echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"
elif [ "$2" == "post-stop" ]
then
        sleep 3
        RebindTo "0000:03:00.0" pci amdgpu
        RebindTo "0000:03:00.1" pci snd_hda_intel
        #RebindTo "0000:0e:00.2" pci xhci_hcd
        #RebindTo "0000:0e:00.3" pci i2c-designware-pci
        #RebindTo "0000:10:00.3" pci xhci_hcd
        echo "1" > /sys/class/vtconsole/vtcon0/bind
        sleep 3
fi
 
Update: So, I removed the vendor-reset line and changed the config to:
Bash:
#!/bin/bash
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 3
    fi
}
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
        echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
        #echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"
elif [ "$2" == "post-stop" ]
then
        sleep 3
        RebindTo "0000:03:00.0" pci amdgpu
        RebindTo "0000:03:00.1" pci snd_hda_intel
        #RebindTo "0000:0e:00.2" pci xhci_hcd
        #RebindTo "0000:0e:00.3" pci i2c-designware-pci
        #RebindTo "0000:10:00.3" pci xhci_hcd
        echo "1" > /sys/class/vtconsole/vtcon0/bind
        sleep 3
fi

The script worked, albeit with the same result as my earlier Perl script: the GPU still gets hot when it's passed back to the host.

Thanks.
 
The script worked, albeit with the same result as my earlier Perl script: the GPU still gets hot when it's passed back to the host.
That's surprising. Does lspci -knns 03:00 show amdgpu in use? Maybe the driver support for the 6650 XT is not yet fully finished, as Proxmox does not run the latest kernel version? Does sensors (maybe install lm-sensors first?) show high temperatures? Do other Linux distributions also have this issue?
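For reference, those checks in one place (lm-sensors is the Debian package name; the PCI address is taken from your posts):

Bash:
apt install lm-sensors              # provides the sensors command
sensors                             # hwmon readings, including amdgpu once bound
lspci -knns 03:00                   # shows which kernel driver is bound to the GPU
journalctl -b -k | grep -i amdgpu   # kernel messages from the driver this boot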
 
That's surprising. Does lspci -knns 03:00 show amdgpu in use? Maybe the driver support for the 6650 XT is not yet fully finished, as Proxmox does not run the latest kernel version?

Output of 'lspci -knns 03:00' BEFORE VM start:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
Kernel driver in use: amdgpu
Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

Output of 'lspci -knns 03:00' AFTER VM shutdown:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
Kernel driver in use: amdgpu
Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

--------------------------------
Does sensors (maybe install lm-sensors first?) show high temperatures?

Output of 'sensors':
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A
nvme-pci-0400
Adapter: PCI adapter
Composite: +31.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +31.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +95.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +36.0°C (high = +70.0°C, crit = +90.0°C)
Core 0: +31.0°C (high = +70.0°C, crit = +90.0°C)
Core 4: +31.0°C (high = +70.0°C, crit = +90.0°C)
Core 8: +29.0°C (high = +70.0°C, crit = +90.0°C)
Core 12: +30.0°C (high = +70.0°C, crit = +90.0°C)
Core 16: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 20: +32.0°C (high = +70.0°C, crit = +90.0°C)
Core 24: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 25: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 26: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 27: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 28: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 29: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 30: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 31: +34.0°C (high = +70.0°C, crit = +90.0°C)
nvme-pci-0700
Adapter: PCI adapter
Composite: +39.9°C (low = -40.1°C, high = +99.8°C)
(crit = +109.8°C)
Sensor 1: +39.9°C (low = -40.1°C, high = +99.8°C)

Do other Linux distributions also have this issue?
- It does not happen with other distributions as long as the GPU is used by the host. I have not checked what happens when virt-manager is used on those distributions. I tried Ubuntu, Debian 12, and Garuda Linux as the host OS.
- Generally, it does not seem to be a distribution problem, as the GPU is handled well by ANY distro running inside the VMs, even Windows for that matter. I have tried Ubuntu 22.04, Debian 12, Garuda Linux, Pop!_OS, and Windows 11, to name a few.

Thanks,
 
I don't see the 6650 XT show up in the sensors output, even though amdgpu appears to be loaded. Does anything stand out in journalctl when unbinding vfio-pci and binding amdgpu? Does the GPU work fine when you start a VM with passthrough (after VM shutdown)? Do you have a usable Proxmox host console (after VM shutdown)? Maybe the amdgpu driver does not handle the state in which the GPU is left (after VM shutdown) very well?
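One way to catch anything would be to follow the kernel log live while re-running the post-stop part of the hookscript by hand (a sketch; script path and VMID taken from your earlier post):

Bash:
# Terminal 1: follow kernel messages while the rebind happens
journalctl -kf
# Terminal 2: trigger the rebind manually
/var/lib/vz/snippets/hookscript-amdgpu.sh 102 post-stop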
 
I don't see the 6650 XT show up in the sensors output, even though amdgpu appears to be loaded.
I noticed that too after posting the log, so I suspected an issue with my BIOS setting for PCIe Native Power Management. I enabled it, and the GPU appears in the 'sensors' output now.
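(In case someone retraces this later: with that BIOS option enabled, the GPU's runtime power-management state is also visible directly in sysfs; the address is from my system:)

Bash:
# Should report "active" or "suspended" rather than "unsupported"
cat /sys/bus/pci/devices/0000:03:00.0/power/runtime_status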

Does anything stand out in journalctl when unbinding vfio-pci and binding amdgpu?
Nothing suspicious that I could find.

Does the GPU work fine when you start a VM with passthrough (after VM shutdown)?
Yes, it does, every time I start the VM after a shutdown.

Do you have a usable Proxmox host console (after VM shutdown)?
Yes, every time the VM is shut down.

Maybe the amdgpu driver does not handle the state in which the GPU is left (after VM shutdown) very well?
Not sure about that, but my best guess is that's not the issue.

I figured out something that is better put in its own post, as it seems the problem lies somewhere else: not in VMs and passthrough at all, but in Proxmox itself. More details to follow in my next post soon.

Thanks,
 
Okay, so I realized that we need to take the 'passthrough' discussion out of the picture, as that's not the culprit here.

Here are some temperature observations (no GPU passthrough or VM involved):

Proxmox Boot without HDMI connected:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +37.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +37.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +34.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 4.00 W (cap = 150.00 W)

Proxmox Boot with HDMI connected:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +55.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +55.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +56.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 13.00 W (cap = 150.00 W)

HDMI removed:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +56.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +56.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +57.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 14.00 W (cap = 150.00 W)

Summary:
- If Proxmox is booted without HDMI connected, the GPU stays in low-power mode (~4 W) at sub-40°C temperatures.
- If Proxmox is booted with HDMI connected, the GPU draws ~14 W to display the console, and temperatures climb upwards of 55°C. The fans never turn on.
- Even removing the HDMI cable afterwards does not cause a drop in temperature or power draw. The fans still don't turn on.

The only way to restore the low-power/low-temperature state is to restart the server.

Not sure what can help here. We are not even talking about passthrough at this point, as amdgpu is always in charge.
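
For the record, amdgpu does expose a couple of power-management knobs in sysfs that might be worth poking before resorting to a reboot (a speculative sketch, not a confirmed fix; the card index may differ on other systems):

Bash:
# Current DPM level; "auto" is the default
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# Force the lowest clocks (revert with "auto")
echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level
# Allow the PCI core to runtime-suspend the idle device
echo auto > /sys/bus/pci/devices/0000:03:00.0/power/control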

Thanks,
 
