Random freezes due to host CPU type

Decco1337

Member
Apr 12, 2023
Hey guys,

I am facing random freezes of the whole host on Proxmox 8.2.4 with the newest kernel (older kernels too), the newest microcode updates, C-states disabled, and Eco mode on a Ryzen 9 7950X. I found out that the issue only occurs if VMs use the host CPU type, especially with Windows VMs. I have to reboot the host manually to get it back; I cannot type anything, it just doesn't respond anymore. Does anyone have the same issue and a solution for it?

I also replaced the whole hardware, without any changes.
 
Update: This does not seem to be Windows-only. I also got the same issue with Linux VMs (Debian, Ubuntu…)
 
2 of my servers would freeze at least 3 times per day before I ran that script.
Unfortunately that did not help... It crashed again. I moved all VMs to my second host, which has no issues, and ran memtest again. It passed without errors.

The second host crashed during that time, so it still seems to be VM related.
 
Hello,

Please never run random scripts from the internet as root.

If you need to get the latest microcode updates for your CPU we have instructions in our documentation. Not having the microcode installed properly could explain the issue you mention.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_firmware_cpu
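
To confirm that a microcode update is actually loaded, rather than only installed as a package, the revision each logical CPU reports can be read from /proc/cpuinfo. This is a minimal sketch; the helper name `microcode_revisions` is made up for illustration, and the revision value will be whatever your host reports.

```shell
#!/usr/bin/env bash
# Print the distinct microcode revisions found in cpuinfo-formatted text on stdin.
# (microcode_revisions is a hypothetical helper, not a Proxmox tool.)
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' | sort -u
}

# On an x86 Linux host every logical CPU carries a "microcode" field.
if [ -r /proc/cpuinfo ]; then
    microcode_revisions < /proc/cpuinfo
fi
```

A single line of output means all cores report the same revision. Comparing that value before and after installing the microcode package and rebooting shows whether the update took effect.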
Already did before and I checked the script, I am not dumb. Your answer is not helpful… Still having issues, seems to be Proxmox related. You should fix your software!
 
I experience the same issue randomly. I tried using standard Debian, and it runs fine on the same hardware.
 
If you are experiencing "random" freezes, I would recommend installing the CPU microcode and, if that does not help, trying different kernel versions: start with the latest version available in the repositories (6.8.12-4 as of today), then the opt-in 6.11, and finally 6.5 if none of the previous ones help.

Debian comes with an older kernel version which might not have had the same bug.

You can find instructions on how to pin a kernel version in our documentation [1].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_kernel_pin
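
As a rough sketch of what pinning looks like with `proxmox-boot-tool`: the wrapper below only prints the commands unless `APPLY=1` is set, so it is safe to paste. The `run` and `pin_kernel` helpers are made up for this example, and the version string assumes the `-pve` naming that `proxmox-boot-tool kernel list` shows on the host.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to really run them (as root, on the Proxmox host).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Hypothetical helper: list the installed kernels, then pin one of them.
pin_kernel() {
    local version="$1"
    run proxmox-boot-tool kernel list
    run proxmox-boot-tool kernel pin "$version"
}

# Version taken from the post above; use a name from your own kernel list.
pin_kernel "6.8.12-4-pve"
```

The pin only takes effect on the next reboot, and `proxmox-boot-tool kernel unpin` removes it again.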
 
I have freezes too.
Ryzen 3950X on a x570 motherboard.
Ran memtest without issues.
Latest BIOS, with AGESA at version ComboV2PI 1.2.0.Cc.

I ran the mobo and CPU + same memory on Win10 for 5 years without freezing issues.
Although I sometimes had issues rebooting, which I think had to do with the memory voltage not ramping up quickly enough,
but never any freezes.

The system freezes within about 24 hours: a total freeze where the prompt stops blinking, with no visible kernel panic.

I have installed kdump-tools, but it doesn't dump anything (it does when I force a test kernel panic). I am not sure how to make it dump when it is actually a freeze and not a kernel panic.
I am also not sure if there is any microcode available for the 3950X, or where to find it.
 
Hi,

I am having the following issue: with a Ryzen 7950X3D or 7800X3D on a B650E or X670E motherboard, I get the same behaviour.

As long as I use the CPU type "host", I get bluescreens on my Windows 11 Pro VM when I open Device Manager and run "Scan for hardware changes". Sometimes Windows triggers the scan by itself when installing specific software or drivers, and I get the bluescreen then too.

If I use the CPU type "x86-64-v4", I can do whatever I want and there is no crash.
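
For anyone who wants to try the same workaround from the command line, a sketch with `qm set` follows. It prints the command unless `APPLY=1` is set. The VM ID 100 and the helper names are placeholders for illustration, and `x86-64-v4` assumes a Proxmox VE 8 host whose CPU has the AVX-512 features that model expects.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to really run them (as root, on the Proxmox host).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Hypothetical helper: change the virtual CPU model of one VM.
set_cpu_type() {
    local vmid="$1" cputype="$2"
    run qm set "$vmid" --cpu "$cputype"
}

# 100 is a placeholder VM ID; replace it with your own.
set_cpu_type 100 x86-64-v4
```

The new CPU type is only picked up after the VM has been fully stopped and started again; a reboot from inside the guest is not enough. Running the same helper with `host` switches back.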

I updated GRUB with the line GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt", since those parameters should help with CPU reset bugs. I also tried changing the BIOS boot type from UEFI to CSM. Both have the same effect: they make the VM stable when I scan for hardware changes.
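
An edit to /etc/default/grub only takes effect after `update-grub` and a reboot, and hosts that boot through systemd-boot (typically ZFS root on UEFI) take their parameters from /etc/kernel/cmdline followed by `proxmox-boot-tool refresh` instead. So it is worth checking what the running kernel was really booted with. A small sketch, with `has_param` being a made-up helper:

```shell
#!/usr/bin/env bash
# Return success if parameter $1 appears as a whole word in command line $2.
# (has_param is a hypothetical helper for this example.)
has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters of the currently running kernel.
cmdline="$(cat /proc/cmdline 2>/dev/null)"
for p in amd_iommu=on iommu=pt; do
    if has_param "$p" "$cmdline"; then
        echo "$p: active"
    else
        echo "$p: not on the running command line"
    fi
done
```

If a parameter shows up as not active after a reboot, the host is most likely reading its command line from the other location.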

But that only holds until I start software like Passmark (for CPU benchmarks) or MSI Afterburner, basically anything that accesses the CPU sensors. As soon as those start, they show a CPU temperature of "0 degrees", and if I then scan for hardware changes again I get the bluescreen, even if I close the software first. I noticed that with the CPU type "x86-64-v4" the same tools just show "N/A" instead of a temperature, and then there are no crashes on hardware scanning. I tried a lot of things but could not get a stable VM with the CPU type "host" on those two CPUs. Does anyone have any suggestions? I would very much appreciate any advice!

I have all the latest drivers installed, the latest Proxmox version with the latest kernel 6.8.12-8, and the amd64-microcode package installed.

Maybe this helps; these are the flags of my CPU:

root@prox:/etc/default# cat /proc/cpuinfo | grep flags | head -n 1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
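
A flags line like this is hard to compare by eye. One way to make it diffable, for example between a host that freezes and one that does not, or between the host and a Linux guest running with the "host" CPU type, is to split it into one sorted flag per line. This is only a sketch; `flags_list` and the file names in the comments are made up.

```shell
#!/usr/bin/env bash
# Read cpuinfo-formatted text on stdin and print the flags of the first CPU,
# one per line, sorted. (flags_list is a hypothetical helper.)
flags_list() {
    grep -m1 '^flags' | cut -d: -f2 | tr -s ' ' '\n' | sed '/^$/d' | sort
}

# Typical use, run once on each machine to be compared:
#   flags_list < /proc/cpuinfo > host-a.flags
#   flags_list < /proc/cpuinfo > host-b.flags
#   comm -3 host-a.flags host-b.flags    # shows flags present on only one side
if [ -r /proc/cpuinfo ]; then
    flags_list < /proc/cpuinfo | head -n 5
fi
```

Both inputs to `comm` need to be sorted the same way, which the helper takes care of as long as both machines use the same locale.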

Updated to kernel 6.11.11-1 - problem still there.

Best Regards,
Dean
 
Hey Dean

Did you have more success?

I went from a
Ryzen 5 3600 + B450 + 64GB DDR4
to
Ryzen 9 7900 + X670E + 96GB DDR5
-
plus Nvidia 4060 Ti PCIe passthrough.

That is where my issues started. However a bit different.
If I run the resources in a Debian/FreeBSD VM all is fine even when I stress burn it to the ground.

Windows 10 VM used to work flawlessly with any game.
Normal setup, nothing fancy except for CPU as HOST.
If I run a stress on it with the usual suspects (prime + furmark) it can burn for hours no issue.
Running the CPU and GPU in my ML VM also just works.

Now when I game on Windows for a bit it resets my whole machine.

I found an interesting post about watchdog timers, sadly my motherboard (MSI X670E-GAMING-PLUS-WIFI) does not have the option to disable it.


To diagnose the problem I tried it all.

Memtest86 with and without XMP.
BIOS versions up to the latest, and back to the one my board was released with.
Windows repair + updates (thanks snapshots)
Previous kernel pinning etc.


I don't game much, thus the reason it exists.
Playing Satisfactory, it all works great except for the resets.
Then I started to think about what it could be inside the VM, since the VM has direct access to the host CPU and GPU.

Then, as one final test, I stopped Steam and got the game in another manner.
Issue gone. So it could have been something in Steam (anti-cheat etc.) that was causing the lockup and reset.

Hope some of this helps.

Cheers
Carl

Software/Server details:

pveperf
CPU BOGOMIPS: 177602.16
REGEX/SECOND: 2423057
HD SIZE: 642.33 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 391.94
DNS EXT: 173.69 ms
DNS INT: 33.43 ms


cat /proc/cpuinfo | grep flags | head -n 1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d

pveversion --verbose

proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.6
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.0
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1