[SOLVED] Host freezes (mostly while gaming) - how to debug

firobad66 · May 1, 2022

Hi,

I encounter, more or less randomly, freezes of the complete host (Debian with Proxmox); however, they seem to be related to gaming on the Windows 10 VM.

Host System:

CPU	AMD Phenom II X6 1090T @ 3.2 GHz (6 cores)
Board	ASRock 890FX Deluxe4
GPU (passed through)	AMD Radeon R9 285
RAM	32 GB @ 667 MHz

The config of the windows VM:

Code:

audio0: device=ich9-intel-hda,driver=spice
balloon: 0
boot: order=ide0;ide2;net0
cores: 4
cpu: host
hostpci0: 0000:07:00,pcie=1,x-vga=1
ide0: local:100/vm-100-disk-0.qcow2,size=120G
ide2: none,media=cdrom
machine: pc-q35-6.1
memory: 12288
meta: creation-qemu=6.1.0,ctime=1637058178
name: pWin10
net0: e1000=3E:4D:9B:B7:98:1B,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=c9547027-5d60-4960-9280-8ee54a480cd7
sockets: 1
usb0: host=0a12:0001,usb3=1
vga: none
vmgenid: e3b10431-1cb4-4e7b-8a96-0fd21a3e71f4

The whole system freezes completely and the only thing I can do is a hard reset. So far it has happened mostly while I was gaming on the Windows VM; apart from that it runs flawlessly, with only an occasional freeze.

I have watched the dmesg output while running (-w switch), but it doesn't show anything of value (at least I don't see anything).
Here, as an example, is the output from the last crash:

Code:

[42355.726529] fwbr100i0: port 2(tap100i0) entered disabled state
[42355.749345] fwbr100i0: port 1(fwln100i0) entered disabled state
[42355.749535] vmbr0: port 2(fwpr100p0) entered disabled state
[42355.749680] device fwln100i0 left promiscuous mode
[42355.749690] fwbr100i0: port 1(fwln100i0) entered disabled state
[42355.768896] device fwpr100p0 left promiscuous mode
[42355.768905] vmbr0: port 2(fwpr100p0) entered disabled state
[42356.096438] usb 9-2: reset full-speed USB device number 2 using ohci-pci
[42483.976444] device tap100i0 entered promiscuous mode
[42484.030109] vmbr0: port 2(fwpr100p0) entered blocking state
[42484.030119] vmbr0: port 2(fwpr100p0) entered disabled state
[42484.030222] device fwpr100p0 entered promiscuous mode
[42484.030299] vmbr0: port 2(fwpr100p0) entered blocking state
[42484.030302] vmbr0: port 2(fwpr100p0) entered forwarding state
[42484.036676] fwbr100i0: port 1(fwln100i0) entered blocking state
[42484.036685] fwbr100i0: port 1(fwln100i0) entered disabled state
[42484.036814] device fwln100i0 entered promiscuous mode
[42484.036882] fwbr100i0: port 1(fwln100i0) entered blocking state
[42484.036885] fwbr100i0: port 1(fwln100i0) entered forwarding state
[42484.044604] fwbr100i0: port 2(tap100i0) entered blocking state
[42484.044614] fwbr100i0: port 2(tap100i0) entered disabled state
[42484.044747] fwbr100i0: port 2(tap100i0) entered blocking state
[42484.044750] fwbr100i0: port 2(tap100i0) entered forwarding state
[42489.564400] vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[42489.564415] vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[42503.634902] usb 9-2: reset full-speed USB device number 2 using ohci-pci

The last line was logged well before the freeze. journalctl doesn't show anything of value to me either:

Code:

May 01 12:16:57 center sudo[90219]:    matze : TTY=pts/4 ; PWD=/home/matze ; USER=root ; COMMAND=/usr/bin/dmesg -w
May 01 12:16:57 center sudo[90219]: pam_unix(sudo:session): session opened for user root(uid=0) by matze(uid=1000)
May 01 12:17:01 center CRON[90240]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 01 12:17:01 center CRON[90241]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 01 12:17:01 center CRON[90240]: pam_unix(cron:session): session closed for user root
May 01 12:17:13 center sshd[90264]: Accepted password for matze from 192.168.1.103 port 47466 ssh2
May 01 12:17:13 center sshd[90264]: pam_unix(sshd:session): session opened for user matze(uid=1000) by (uid=0)
May 01 12:17:13 center systemd-logind[730]: New session 24 of user matze.
May 01 12:17:13 center systemd[1]: Started Session 24 of user matze.
May 01 12:17:19 center gnome-software[82915]: Only 0 apps for recent list, hiding
May 01 12:17:19 center gnome-software[82915]: hiding category games featured applications: found only 0 to show, need at least 9
May 01 12:17:19 center gnome-software[82915]: hiding category productivity featured applications: found only 0 to show, need at least 9
May 01 12:17:19 center gnome-software[82915]: automatically prevented from changing kind on system/package/debian-stable-main/generic/org.gphoto.libgphoto2/* from generic>
May 01 12:17:19 center gnome-software[2265]: hiding category productivity featured applications: found only 0 to show, need at least 9
May 01 12:17:19 center gnome-software[2265]: hiding category games featured applications: found only 0 to show, need at least 9
May 01 12:17:19 center gnome-software[2265]: Only 0 apps for recent list, hiding
May 01 12:17:19 center gnome-software[2265]: automatically prevented from changing kind on system/package/debian-stable-main/generic/org.gphoto.libgphoto2/* from generic >
May 01 12:17:20 center PackageKit[2194]: get-updates transaction /4441_aecbabce from uid 1000 finished with success after 669ms
May 01 12:17:20 center PackageKit[2194]: get-updates transaction /4442_bdeaaaea from uid 1001 finished with success after 657ms
May 01 12:23:28 center pvedaemon[89324]: <root@pam> successful auth for user 'matze@pam'
May 01 12:23:36 center smartd[720]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 47
May 01 12:23:36 center smartd[720]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 60
May 01 12:24:03 center pvedaemon[2109]: <root@pam> successful auth for user 'matze@pam'
May 01 12:24:11 center pveproxy[88476]: worker exit
May 01 12:24:11 center pveproxy[2329]: worker 88476 finished
May 01 12:24:11 center pveproxy[2329]: starting 1 worker(s)
May 01 12:24:11 center pveproxy[2329]: worker 91183 started
May 01 12:28:35 center pveproxy[88958]: worker exit
May 01 12:28:35 center pveproxy[2329]: worker 88958 finished
May 01 12:28:35 center pveproxy[2329]: starting 1 worker(s)
May 01 12:28:35 center pveproxy[2329]: worker 91702 started
May 01 12:33:08 center pveproxy[89669]: worker exit
May 01 12:33:08 center pveproxy[2329]: worker 89669 finished
May 01 12:33:08 center pveproxy[2329]: starting 1 worker(s)
May 01 12:33:08 center pveproxy[2329]: worker 92248 started
May 01 12:38:29 center pvedaemon[89324]: <root@pam> successful auth for user 'matze@pam'
May 01 12:39:03 center pvedaemon[89324]: <root@pam> successful auth for user 'matze@pam'

The htop that was open showed a CPU temperature of 61 °C and a load average of 3.8-4.2.

IOMMU itself works fine and the groups are also fine.
Nevertheless, here are my configs:

Code:

cat /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on vfio-pci.ids=1002:6939,1002:aad8,10de:13c2,10de:0fbb"

Code:

cat /etc/modules
vfio
vfio_iommu_type1
vfio_pci ids=1002:6939,1002:aad8,10de:13c2,10de:0fbb
vfio_virqfd

Code:

cat /etc/modprobe.d/vfio_pci.conf

options vfio-pci ids=1002:6939,1002:aad8,10de:13c2,10de:0fbb
softdep amdgpu pre: vfio-pci
softdep nvidia pre: vfio-pci
softdep nouveau pre: vfio-pci

Don't mind the NVIDIA entries; they are from when I added an NVIDIA GPU. The problem also showed up when I did not have the entries for the NVIDIA card.

Any ideas how to debug this? The only idea I have found in similar threads is the recommendation to run a memtest.
Help appreciated, thanks in advance.

leesteken · May 1, 2022

Sounds like some component can't handle the stress after a while. Can you try another power supply (PSU)?

firobad66 · May 1, 2022

leesteken said:
Sounds like some component can't handle the stress after a while. Can you try another power supply (PSU)?

Unfortunately I don't have a spare one to test with at the moment. The PSU is, however, only about 2 months old (Thermaltake, I believe) and is supposed to deliver 700 W.

What's funny, though, is that the screen (an old Radeon HD 6x GPU for the Debian Proxmox host) still shows an image. However, the host no longer answers pings, and the LED on the NIC (onboard LAN) stays solid orange.

I just tested with Unigine Heaven on the highest settings on the connected screen. It ran for roughly 10 minutes, then it crashed: a few green dots were shown on the screen and after some time it went black.

Could it be a memory issue as well?

In the BIOS I disabled all OC options; CPU C-states and all power-saving features are disabled as well. Only IOMMU and Secure Virtual Machine are enabled.

Is there any way to get more logs on the software side?
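
On the question of more logs: one option, sketched here under assumptions, is to read the kernel messages of the boot that froze once the host is back up. journalctl -k -b -1 prints them, provided the journal is stored persistently (otherwise nothing from before the reset survives). The helper names and the keyword list below are hypothetical, and a hard freeze can still leave no trace at all.

```python
import subprocess

# Substrings that often accompany hardware trouble in kernel logs.
# This list is an illustrative assumption, not an exhaustive catalogue.
KEYWORDS = ("mce", "hardware error", "aer:", "vfio", "iommu",
            "io_page_fault", "nmi", "hung")


def previous_boot_kernel_log():
    """Return the kernel messages of the previous boot as a list of lines.

    Needs a persistent journal and enough privileges to read it (e.g. root).
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def suspicious(lines, keywords=KEYWORDS):
    """Keep only the lines that contain one of the keywords (case-insensitive)."""
    return [line for line in lines
            if any(key in line.lower() for key in keywords)]


# Example use on the host after a reset:
#   for line in suspicious(previous_boot_kernel_log()):
#       print(line)
```

A match is only a hint about where to look further; the vfio_ecap_init lines from the dmesg output above would match "vfio" even though they are logged on every VM start.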

leesteken · May 1, 2022

A PCI(e) device can take down the whole system but I don't think this is specific to passthrough or Proxmox.
You might keep an eye on a running watch -n 1 sensors to check temperatures/voltages while stressing different parts of the system.
You could run a Memtest86, if it fails then it was the RAM. Also try prime95 to stress the CPU (and optionally memory) without putting stress on the GPU. If both succeed then it's likely the GPU or PSU, otherwise it was the CPU or motherboard (VRMs) or the PSU.
EDIT: Run sensors on the Proxmox host for CPU/motherboard and run sensors inside the VM for GPU temps.
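
The watch -n 1 sensors approach only helps while someone is looking at the terminal. A rough sketch of a logger that keeps the last readings on disk across a hard reset could look like this. It assumes lm-sensors is installed on the host and that its output has the usual "chip name, Adapter line, label: value" layout; the function names and the log path are made up for illustration.

```python
import os
import re
import subprocess
import time

# Matches lines such as "temp1:        +53.4°C  (high = +70.0°C)".
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)°C")


def parse_sensors(text):
    """Return {chip: {label: temperature_in_celsius}} from `sensors` text output."""
    readings = {}
    chip = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            chip = None            # a blank line ends the current chip block
            continue
        if chip is None:
            chip = line            # the first line of a block is the chip name
            readings[chip] = {}
            continue
        match = TEMP_LINE.match(line)
        if match:
            readings[chip][match.group("label")] = float(match.group("value"))
    return readings


def log_sensors(path, interval=1.0, rounds=None):
    """Append timestamped readings to `path`, forcing each record to disk."""
    done = 0
    with open(path, "a", encoding="utf-8") as log:
        while rounds is None or done < rounds:
            out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {parse_sensors(out)}\n")
            log.flush()
            os.fsync(log.fileno())  # otherwise the last lines may be lost in a freeze
            time.sleep(interval)
            done += 1


# Example use on the host (runs until interrupted):
#   log_sensors("/root/sensors.log")
```

The fsync after every record costs some disk I/O, but it is the point of the exercise: after the next freeze, the tail of the log shows the temperatures from the last second or so before the host stopped responding.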

firobad66 · May 1, 2022

Thanks a lot for the hints and guide.
Will test out the components that way.

Yes, sensors is on my monitoring list; however, on my board it only finds two sensors.

These are my sensors stats (right now, with the GPU benchmark running for 25 min):

Code:

Every 2.0s: sensors                                                                  center: Sun May  1 15:00:25 2022

radeon-pci-0600
Adapter: PCI adapter
temp1:        +60.5°C  (crit = +120.0°C, hyst = +90.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +53.4°C  (high = +70.0°C)

radeon-pci-0600 is the other Radeon card (passively cooled) that I use as the display on the host.
k10temp is the CPU.

After 20 minutes the passed-through GPU is now at 77 °C (I see that in the MSI Afterburner overlay in the VM).

Hmm, one more thing: Afterburner allows overclocking the GPU. Could this be an issue too?
I never thought about that, but I think I often run it overclocked.

firobad66 · May 2, 2022

So prime95 ran fine for over an hour in the Blend test.
What's left is the Memtest; however, my assumption is that the freezes are caused by the GPU itself, especially when it is overclocked a bit. I ran a game yesterday for several hours without overclocking and it didn't crash.

Too bad that this PCI-E device can bring down the complete host....

Apart from the reset bug that affects this card (this makes automatic backups hard, as I always need to shut it down from within the VM), you also can't overclock the GPU without worries or issues. I might replace it with an NVIDIA one in the future.

Anyway a big thanks once again for your help. I will mark the thread as solved.

leesteken · May 2, 2022

firobad66 said:
Apart from the reset bug that affects this card (this makes automatic backups hard, as I always need to shut it down from within the VM), you also can't overclock the GPU without worries or issues. I might replace it with an NVIDIA one in the future.

Many recent AMD GPUs don't have this issue (in case you are going to buy a new card). Unfortunately, your current GPU appears to be too old for vendor-reset.
