AMD RX 6650 XT - High temperatures when idle

mrish · Mar 10, 2024

Hi,

I am running Proxmox on my desktop with the following specs:
- Intel 13600K
- MSI Z690 A WiFi D4
- AMD RX 6650 XT
- 2x32 GB of RAM
- 2x Samsung 870 Evo 500 GB as boot drives in ZFS RAID1 configuration
- XPG S70 Blade 1 TB in LVM-thin to run my VMs

Usage scenario: VM (running Windows/Linux) with GPU passthrough

What works: No blacklisting of drivers, no VFIO IDs. Only 'driverctl' to pass the GPU to vfio and back, using a Perl hookscript (attached for reference).
Here is an extract from the log after the VM is shut down, to show that the host (Proxmox) gets control back and starts using the amdgpu driver, so VFIO is no longer controlling it:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
    Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel

So, what's the issue: regardless of the approach I use for passing the GPU or the distro used for the VM, its temperature starts going up every time it's returned to the host (Proxmox), and it gets really hot without any usage.

I have no issues passing the GPU back to different VMs, as the hookscript works perfectly. The temperatures are fine as long as it's not the host but a VM controlling it via its own drivers (amdgpu for Linux VMs or proprietary drivers in Windows). But this GPU heat-up is driving me mad. I am even contemplating bidding farewell to Proxmox if I can't find a solution. I have been struggling with this for several months now. I know I can keep a VM running just to manage it, but running a VM for no other purpose is not a very intuitive solution, especially when the same amdgpu driver is actively managing the GPU on the host, just as a Linux VM would when the GPU is passed through to it.

I am wondering if there is a better way to handle this on the host itself, or whether I am missing something here.

Thanks,

leesteken · Mar 10, 2024

Unless there is a driver loaded, GPUs are usually in a high-power VGA compatibility mode. Either have a VM with passthrough running or unbind the vfio-pci driver and bind the amdgpu driver. There are examples on this forum on how to do this with the help of hookscripts.

mrish · Mar 10, 2024

leesteken said:
Unless there is a driver loaded, GPUs are usually in a high-power VGA compatibility mode. Either have a VM with passthrough running or unbind the vfio-pci driver and bind the amdgpu driver. There are examples on this forum on how to do this with the help of hookscripts.

Thanks, I had gone through previous threads on this topic and had a feeling you would reply

Not sure if my post explained it, but that's exactly what I have been doing. The vfio driver is unbound and amdgpu does take over. Please refer to the log I posted above, taken after the VM is shut down and the host takes over. I also posted the hookscript I used to perform this operation.

leesteken · Mar 10, 2024

mrish said:
Thanks, I had gone through previous threads on this topic and had a feeling you would reply

I attempted to reply before (which was more detailed) but your post got stuck in the spam filter for some time.

mrish said:
Not sure if my post explained it, but that's exactly what I have been doing. The vfio driver is unbound and amdgpu does take over. Please refer to the log I posted above, taken after the VM is shut down and the host takes over. I also posted the hookscript I used to perform this operation.

I missed that, sorry (and I have a hard time reading Perl as I simply use a Bash-script). If you bind the amdgpu driver on the host, it should not get warm.
I don't know if driverctl actually does that (because I don't use it) and whether you still have this issue. I simply use echo "0000:0b:00.0" >/sys/bus/pci/drivers/amdgpu/bind (in post-stop and after unbinding vfio-pci of course and I also do that for the audio and USB functions).
I don't think there is a better way (but maybe a simpler script), as only the device driver knows how to put the device in idle/low-power mode.
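To confirm which driver actually holds a device after such a bind, the `driver` symlink in sysfs can be read directly. The following is a minimal sketch, not part of anyone's script in this thread; the function name `current_driver` and the optional second argument (an alternative sysfs root, useful for testing) are assumptions made for illustration.

```shell
#!/bin/bash
# Print the name of the driver currently bound to a PCI device, or "none".
# $1 = PCI address (e.g. 0000:03:00.0)
# $2 = sysfs devices directory (optional; defaults to the real sysfs path)
current_driver() {
    local dev="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$dev/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink to /sys/bus/pci/drivers/<name>.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example address taken from the first post; adjust it for your system.
echo "0000:03:00.0 is bound to: $(current_driver 0000:03:00.0)"
```

Calling it before and after a VM run should show the device moving between vfio-pci and amdgpu, which is the same information `lspci -k` reports as "Kernel driver in use".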

mrish · Mar 10, 2024

I did try with a shell script and a python one, but all had the same result.

Would you mind sharing your script? I am also passing the audio and some USB devices by their identifiers (not ports).

Thanks,

leesteken · Mar 10, 2024

mrish said:
I did try with a shell script and a python one, but all had the same result.

It's still not clear to me if you still have an issue, sorry.

mrish said:
Would you mind sharing your script? I am also passing the audio and some USB devices by their identifiers (not ports).

Bash:

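#!/bin/bash
# Proxmox calls a hookscript as: <script> <vmid> <phase>

# Unbind device $1 on bus $2 from its current driver (if it has one).
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 1
    fi
}
# Bind device $1 on bus $2 to driver $3.
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
# Unbind device $1 on bus $2, then bind it to driver $3.
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
    # Disabled: select the device_specific reset method (used with vendor-reset).
#    echo device_specific >"/sys/bus/pci/devices/0000:0e:00.0/reset_method"
    # Detach the host virtual consoles before the VM takes the GPU.
    echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
elif [ "$2" == "post-stop" ]
then
    sleep 1
    # Give each passed-through function back to its host driver.
    RebindTo "0000:0e:00.0" pci amdgpu
    RebindTo "0000:0e:00.1" pci snd_hda_intel
    RebindTo "0000:0e:00.2" pci xhci_hcd
    RebindTo "0000:0e:00.3" pci i2c-designware-pci
    RebindTo "0000:10:00.3" pci xhci_hcd
    # Re-attach the host console.
    echo '1' >/sys/class/vtconsole/vtcon0/bind
    sleep 1
fi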

This is for a 6950XT (0e:00.*) and a Ryzen USB controller (10:00.3). That way, I can use the Proxmox host console (keyboard & display) when the VM is not running.

mrish · Mar 10, 2024

Thanks, I had used a similar script earlier, but I customized and used yours instead. Since mine is an RX 6000 series model, which supposedly has the reset bug, I uncommented and changed the line in your script to:

echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"

But the hookscript fails with the following error:

/var/lib/vz/snippets/hookscript-amdgpu.sh 102 pre-start
/var/lib/vz/snippets/hookscript-amdgpu.sh: line 20: echo: write error: Invalid argument

hookscript error for 102 on pre-start: command '/var/lib/vz/snippets/hookscript-amdgpu.sh 102 pre-start' failed: exit code 1

leesteken · Mar 10, 2024

mrish said:
/var/lib/vz/snippets/hookscript-amdgpu.sh: line 20: echo: write error: Invalid argument

What is line 20? And did you change other lines to match your setup also?

EDIT: Assuming it's echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method", then the cause is most likely that vendor-reset does not support the Radeon 6000-series. You cannot use vendor-reset to fix your reset issue.
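The "Invalid argument" error can be avoided by checking what the kernel offers before writing. On kernels that expose the `reset_method` attribute, reading the file lists the reset methods currently available for that device. A minimal sketch, assuming that behaviour; the function name `supports_reset_method` is made up for illustration:

```shell
#!/bin/bash
# Return success if reset method $2 is listed in the reset_method file $1.
supports_reset_method() {
    local file="$1" wanted="$2" method
    [ -r "$file" ] || return 1
    # The file holds a space-separated list such as "flr bus".
    for method in $(cat "$file"); do
        [ "$method" = "$wanted" ] && return 0
    done
    return 1
}

# Address from this thread; adjust it for your system.
dev="0000:03:00.0"
file="/sys/bus/pci/devices/$dev/reset_method"
if supports_reset_method "$file" device_specific; then
    echo "device_specific is offered for $dev"
else
    echo "device_specific is not offered for $dev; skipping the write"
fi
```

In a hookscript, guarding the write this way would let pre-start continue instead of failing with exit code 1 when the method is not available.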

mrish · Mar 10, 2024

leesteken said:
What is line 20? And did you change other lines to match your setup also?

Line 20 is the line that I uncommented and changed:
echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"

Yes, I changed it, so far for the GPU only (commented out everything else):

Bash:

#!/bin/bash
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 3
    fi
}
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
        echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
        echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"
elif [ "$2" == "post-stop" ]
then
        sleep 3
        RebindTo "0000:03:00.0" pci amdgpu
        RebindTo "0000:03:00.1" pci snd_hda_intel
        #RebindTo "0000:0e:00.2" pci xhci_hcd
        #RebindTo "0000:0e:00.3" pci i2c-designware-pci
        #RebindTo "0000:10:00.3" pci xhci_hcd
        echo "1" > /sys/class/vtconsole/vtcon0/bind
        sleep 3
fi

mrish · Mar 10, 2024

Update: So, I removed the vendor-reset line and changed the script to:

Bash:

#!/bin/bash
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 3
    fi
}
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
        echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
        #echo 'device_specific' >"/sys/bus/pci/devices/0000:03:00.0/reset_method"
elif [ "$2" == "post-stop" ]
then
        sleep 3
        RebindTo "0000:03:00.0" pci amdgpu
        RebindTo "0000:03:00.1" pci snd_hda_intel
        #RebindTo "0000:0e:00.2" pci xhci_hcd
        #RebindTo "0000:0e:00.3" pci i2c-designware-pci
        #RebindTo "0000:10:00.3" pci xhci_hcd
        echo "1" > /sys/class/vtconsole/vtcon0/bind
        sleep 3
fi

The script worked, albeit with the same result as my earlier Perl script. The GPU still gets hot when it's passed back to the host.

Thanks.

leesteken · Mar 10, 2024

mrish said:
The script worked, albeit with the same result as my earlier Perl script. The GPU still gets hot when it's passed back to the host.

That's surprising. Does lspci -knns 03:00 show amdgpu in use? Maybe the driver is not yet fully finished for the 6650XT, as Proxmox does not run the latest kernel version? Does sensors (maybe install lm_sensors first?) show high temperatures? Maybe other Linux distributions also have this issue?
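If `sensors` is not installed, the same reading can be taken straight from the hwmon class in sysfs. This is a minimal sketch under two assumptions: the amdgpu driver registers a hwmon device whose `name` file contains `amdgpu`, and `temp1_input` is its edge sensor. The function name `amdgpu_temps` is hypothetical.

```shell
#!/bin/bash
# Print the temp1 reading (whole degrees C) of every hwmon device named "amdgpu".
# $1 = hwmon class directory (optional; defaults to /sys/class/hwmon)
amdgpu_temps() {
    local root="${1:-/sys/class/hwmon}" dir milli
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        [ "$(cat "$dir/name")" = "amdgpu" ] || continue
        [ -r "$dir/temp1_input" ] || continue
        # hwmon reports temperatures in millidegrees Celsius.
        milli=$(cat "$dir/temp1_input")
        echo "$(basename "$dir"): $((milli / 1000)) C"
    done
}

amdgpu_temps
```

If this prints nothing while `lspci` shows amdgpu in use, the driver has not registered a hwmon device, which would match a `sensors` output without an amdgpu section.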

mrish · Mar 11, 2024

leesteken said:
That's surprising. Does lspci -knns 03:00 show amdgpu in use? Maybe the driver is not yet fully finished for 6650XT, as Proxmox does not run the latest kernel version?

Output of 'lspci -knns 03:00' BEFORE VM start:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
    Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel

Output of 'lspci -knns 03:00' AFTER VM shutdown:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1002:73ef] (rev c1)
    Subsystem: ASUSTeK Computer Inc. Navi 23 [Radeon RX 6650 XT / 6700S / 6800S] [1043:05e3]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller [1002:ab28]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel

--------------------------------

leesteken said:
Does sensors (maybe install lm_sensors first?) show high temperatures?

Output of 'sensors':
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A
nvme-pci-0400
Adapter: PCI adapter
Composite: +31.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +31.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +95.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +36.0°C (high = +70.0°C, crit = +90.0°C)
Core 0: +31.0°C (high = +70.0°C, crit = +90.0°C)
Core 4: +31.0°C (high = +70.0°C, crit = +90.0°C)
Core 8: +29.0°C (high = +70.0°C, crit = +90.0°C)
Core 12: +30.0°C (high = +70.0°C, crit = +90.0°C)
Core 16: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 20: +32.0°C (high = +70.0°C, crit = +90.0°C)
Core 24: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 25: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 26: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 27: +34.0°C (high = +70.0°C, crit = +90.0°C)
Core 28: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 29: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 30: +33.0°C (high = +70.0°C, crit = +90.0°C)
Core 31: +34.0°C (high = +70.0°C, crit = +90.0°C)
nvme-pci-0700
Adapter: PCI adapter
Composite: +39.9°C (low = -40.1°C, high = +99.8°C)
(crit = +109.8°C)
Sensor 1: +39.9°C (low = -40.1°C, high = +99.8°C)

leesteken said:
Maybe other Linux distributions also have this issue?

- It does not happen with other distributions as long as the GPU is used by the host. I have not checked what happens when virt-manager is used in those distributions. I tried Ubuntu, Debian 12 and Garuda Linux as host OS.
- Generally, it does not seem to be a distribution problem, as the GPU is handled well by ANY distro running inside the VMs, even Windows for that matter. I have tried Ubuntu 22.04, Debian 12, Garuda Linux, Pop!_OS and Windows 11, to name a few.

Thanks,

leesteken · Mar 11, 2024

I don't see the 6650XT show up in the sensors output, even though amdgpu appears to be loaded. Does anything stand out in journalctl when unbinding vfio-pci and binding amdgpu? Does the GPU work fine when you start a VM with passthrough (after VM shutdown)? Do you have a usable Proxmox host console (after VM shutdown)? Maybe the amdgpu driver does not handle the state in which the GPU is left (after VM shutdown) very well?

mrish · Mar 14, 2024

leesteken said:
I don't see the 6650XT show up in the sensors output, even though amdgpu appears to be loaded.

I noticed that too after posting the log. I suspected an issue with my BIOS setting for PCIe Native Power Management, so I enabled it, and the GPU appears in the 'sensors' output now.

leesteken said:
Anything stands out in journalctl when unbinding vfio-pci and binding amdgpu?

Nothing suspicious that I could find.

leesteken said:
Does the GPU work fine when you start a VM with passthrough (after VM shutdown)?

Yes, it does, every time I start the VM after a shutdown.

leesteken said:
Do you have a usable Proxmox host console (after VM shutdown)?

Yes, every time the VM is shut down.

leesteken said:
Maybe the amdgpu driver does not handle the state in which the GPU is left (after VM shutdown) very well?

I'm not sure about that, but my best guess is that this is not the issue.

I figured out something that is better put in its own post, as it seems the problem lies somewhere else: not VMs and passthrough at all, but Proxmox itself. More details to follow in my next post soon.

Thanks,

mrish · Mar 14, 2024

Okay, so I realized that we need to set the 'passthrough' discussion aside, as that's not the culprit here.

Here are some temperature observations: (No GPU Passthrough or VM)

Proxmox Boot without HDMI connected:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +37.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +37.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +34.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 4.00 W (cap = 150.00 W)

Proxmox Boot with HDMI connected:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +55.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +55.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +56.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 13.00 W (cap = 150.00 W)

HDMI removed:
amdgpu-pci-0300
vddgfx: 6.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3630 RPM)
edge: +56.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +56.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +57.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
PPT: 14.00 W (cap = 150.00 W)

Summary:
- If Proxmox is booted without HDMI connected, the GPU stays in low-power mode (~4 W) at temperatures below 40°C.
- If Proxmox is booted with HDMI connected, the GPU starts drawing ~14 W for displaying the console and temperatures rise above 55°C. The fans are never turned on.
- Even when the HDMI cable is removed, there is no drop in temperature or power draw. The fans still don't turn on.

The only way to restore the low-power, low-temperature state is to restart the server.

Not sure what can help here. We are not even talking of any passthrough stuff here, as it's always amdgpu in charge.
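For anyone wanting to repeat these observations without `sensors`, the PPT value can be sampled from the amdgpu hwmon device. This is a minimal sketch, assuming the driver exposes `power1_average` in microwatts (some kernel and GPU combinations may expose a different file); the function names are made up for illustration.

```shell
#!/bin/bash
# Convert a power reading in microwatts to watts, truncated to two decimals.
microwatts_to_watts() {
    local uw="$1"
    printf '%d.%02d\n' "$((uw / 1000000))" "$(((uw % 1000000) / 10000))"
}

# Print the average power of the first hwmon device named "amdgpu".
# $1 = hwmon class directory (optional; defaults to /sys/class/hwmon)
amdgpu_power() {
    local root="${1:-/sys/class/hwmon}" dir
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] && [ "$(cat "$dir/name")" = "amdgpu" ] || continue
        [ -r "$dir/power1_average" ] || continue
        echo "PPT: $(microwatts_to_watts "$(cat "$dir/power1_average")") W"
        return 0
    done
    echo "PPT: unavailable"
}

amdgpu_power
```

Running it once after boot and again after plugging and unplugging HDMI would give the same three data points as the listings above.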

Thanks,

henrysand · Oct 13, 2024

Hi, can you provide me with the i2c ir35217 dump of that graphics card? I need to reprogram mine.

dizzydre21 · Dec 26, 2024

leesteken said:

It's still not clear to me if you still have an issue, sorry.

Bash:

#!/bin/bash
function Unbind {
    if [ -e "/sys/bus/$2/devices/$1/driver/." ]
    then
        echo "Unbind $1 from bus $2..."
        echo "$1" > "/sys/bus/$2/devices/$1/driver/unbind" && sleep 1
    fi
}
function BindTo {
    echo "Bind $1 to driver $3 on bus $2..."
    echo "$1" >"/sys/bus/$2/drivers/$3/bind"
}
function RebindTo {
    Unbind "$1" "$2" && BindTo "$1" "$2" "$3"
}
echo "$0 $*"
if [ "$2" == "pre-start" ]
then
#    echo device_specific >"/sys/bus/pci/devices/0000:0e:00.0/reset_method"
    echo 0 | tee /sys/class/vtconsole/vtcon*/bind >/dev/null
elif [ "$2" == "post-stop" ]
then
        sleep 1
        RebindTo "0000:0e:00.0" pci amdgpu
        RebindTo "0000:0e:00.1" pci snd_hda_intel
        RebindTo "0000:0e:00.2" pci xhci_hcd
        RebindTo "0000:0e:00.3" pci i2c-designware-pci
        RebindTo "0000:10:00.3" pci xhci_hcd
        echo '1' >/sys/class/vtconsole/vtcon0/bind
        sleep 1
fi

This is for a 6950XT (0e:00.*) and a Ryzen USB controller (10:00.3). That way, I can use the Proxmox host console (keyboard & display) when the VM is not running.

@leesteken I am planning on using this for an RX 6400 so that it can be used for the Proxmox console when one of my VMs is not in use. Would you mind elaborating to a script noob on how exactly this works? I have been able to bind/unbind manually by using similar commands, but I don't understand what the $1, $2, $3 stuff means. It looks like they are just dynamic variables, but what is getting pushed into them for the commands?

leesteken · Dec 26, 2024

dizzydre21 said:
Would you mind elaborating to a script noob on how exactly this works? I have been able to bind/unbind manually by using similar commands, but I don't understand what the $1, $2, $3 stuff means.

Although it is unrelated to Proxmox, here is an explanation about functions and their parameters in Bash (but there are many more on the internet): https://linuxsimply.com/bash-scripting-tutorial/parameters/function-parameters .

dizzydre21 · Dec 26, 2024

leesteken said:
Although it is unrelated to Proxmox, here is an explanation about functions and their parameters in Bash (but there are many more on the internet): https://linuxsimply.com/bash-scripting-tutorial/parameters/function-parameters .

@leesteken Thank you! I figured something like that was happening with the PCIe devices at the bottom. It makes more sense now.

I do have a couple questions, though, if you have the time to explain or point me to any other relevant literature.

1) Argument $2 is pci in this case, so what determines whether it is equal to pre-start or post-stop? I know this refers to the VM starting/stopping, but looking at the actual location on my Proxmox host, it's just a directory and not a file containing those values.

2) What is $0? The link you shared just mentions starting with arg1 and it being $1.

leesteken · Dec 26, 2024

dizzydre21 said:
1) Argument $2 is pci in this case, so what determines whether it is equal to pre-start or post-stop? I know this refers to the VM starting/stopping, but looking at the actual location on my Proxmox host, it's just a directory and not a file containing those values.

$2 is pci in a function call like RebindTo "0000:0e:00.0" pci amdgpu but when not inside a function declaration, $2 is the second parameter of the script. See also the link below.

dizzydre21 said:
2) What is $0? The link you shared just mentions starting with arg1 and it being $1.

I found this by searching for "bash what is $0": https://tecadmin.net/bash-special-variables/
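The difference between script arguments and function arguments can be seen in a small stand-alone sketch. It is not part of the hookscript above; the function names `show_function_args` and `describe_phase` are made up for illustration. It relies only on the calling convention visible in the error log earlier in this thread, where the hookscript is run as `<script> <vmid> <phase>`.

```shell
#!/bin/bash
# Inside a function, $1 $2 $3 are the arguments of that function call.
show_function_args() {
    echo "function: \$1=$1 \$2=$2 \$3=$3"
}

# Takes the phase as its own first argument and describes it.
describe_phase() {
    case "$1" in
        pre-start) echo "VM is about to start" ;;
        post-stop) echo "VM has stopped" ;;
        *)         echo "other phase: $1" ;;
    esac
}

# Outside any function, $0 is the script name and $1 $2 are the script's
# own arguments (for a hookscript: the VM ID and the phase).
echo "script: \$0=$0 \$1=${1:-} \$2=${2:-}"
show_function_args "0000:0e:00.0" pci amdgpu
describe_phase "${2:-}"
```

Saved as a file and run with the arguments `102 post-stop`, the first line shows the script name, `102` and `post-stop`, while the function line still shows the PCI address, `pci` and `amdgpu`, because each function call gets its own set of positional parameters.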
