ProxMox Freezing On AMD RYZEN Machines

Stevebaumgartner · Aug 8, 2019

I've been a Proxmox user for several years now, with quite a few one-off machines running alongside a cluster of 6 servers. I've had no reason to post until now, but today I'm out of options and looking for some help. I have several (about 30) Lenovo M715q machines. They all have AMD Ryzen CPUs and 8 GB of RAM, with an assortment of NVMe drives and SSDs. They all seem to freeze randomly and need to be force restarted. This happens during the Windows update process, or while the machine is just sitting idle after a reboot waiting for me to finish my sandwich. I haven't even put a load on any of them to place them in production yet. I've reinstalled Proxmox on several of them with the same results: a single VM running Windows Server 2012 with 4 cores and 4 GB of RAM. All the Intel machines are doing great; these new machines, not so much. They could be up for 20 minutes, or they could be up for 2 days.

Any suggestions would be greatly appreciated.

Stoiko Ivanov · Aug 8, 2019

Make sure to update all available firmware for the machines (BIOS, microcode package, NIC firmware, everything) to the latest released version. This quite often helps in cases like this.

Otherwise, check the output of `dmesg` after a boot; maybe there is a hint there.

Hope this helps.

Stevebaumgartner · Aug 8, 2019

So, nothing really in the dmesg data. The BIOS is already updated and no other firmware is available for anything else. I did a full wipe and reinstall with PVE 6.0 and added 1 VM. I was watching processes and saw nothing out of the ordinary besides KVM hitting 131%. It was running Windows updates, and that's the exact moment it all locked up: ping failed and SSH locked up. I had to reboot the machine. I attached the dmesg in case anyone sees something I missed.

Stevebaumgartner · Aug 8, 2019

This is the last entry in the syslog each time it locks up:

systemd[1]: Started Proxmox VE replication runner.

morph027 · Aug 8, 2019

Hm... interesting. I got an I/O lockup today on an EPYC server while ZFS was taking snapshots too (sanoid). It stayed reachable via SSH and so on, but I needed to reboot to get rid of the high I/O wait. Will keep an eye on it...

morph027 · Aug 8, 2019

Stevebaumgartner · Aug 9, 2019

Soooo, did some more testing. The NVMe drives are going to use up their 790 TB of writes by the time I'm done, LOL. Well, paint me green and call me a pickle: I don't get a crash if I use a VirtIO NIC instead of e1000 on the VM. Why in the world would that cause the entire host to lock up? Beats me. I've never had that happen before, but I can reproduce it on multiple hosts. It makes me worried that a crashing virtual device offered to a VM can bring down a whole host.

Thoughts?

Ryzen3600 · Aug 9, 2019

I am having issues with containers on a Ryzen 3600. VMs are working fine. The Proxmox installation did not work; I had to install Debian with much difficulty (hardware issues) and convert.

morph027 · Aug 9, 2019

Hm, another crash... not sure where to look.

Code:

[Fr Aug  9 13:23:53 2019] INFO: task zvol:619 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task txg_quiesce:1478 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:14072 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:29354 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:29355 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:8138 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:9485 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:21766 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task zvol:18104 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task zvol:18106 blocked for more than 120 seconds.
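When it is not obvious where to look, one option is to group the hung-task warnings by task name to see which subsystem dominates. The following is a minimal sketch, not something posted in the thread: `summarize_hung_tasks` is a hypothetical helper name, and it assumes the warnings keep the `task NAME:PID blocked for more than` wording shown above.

```shell
# Hypothetical helper: count "blocked for more than N seconds" warnings
# per task name. Reads kernel log text on stdin.
summarize_hung_tasks() {
    grep 'blocked for more than' \
        | sed -E 's/.*task (.+):[0-9]+ blocked.*/\1/' \
        | sort | uniq -c | sort -rn
}

# Usage (run as root on the affected host):
#   dmesg -T | summarize_hung_tasks
```

For the excerpt above, this would list `kvm` and `zvol` ahead of `txg_quiesce`, which is consistent with guest I/O being stuck behind ZFS.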

Dark26 · Aug 9, 2019

Did you try disabling C6 in the BIOS, if available?

See here:

https://forum.proxmox.com/threads/proxmox-ve-5-0-ryzen-7-1700x-crashes-daily.36123/page-3

morph027 · Aug 9, 2019

Thanks for the hint, I'll try.

gradinaruvasile · Aug 9, 2019

Try setting

Code:

rcu_nocbs=0-N

in the kernel command line, where N is the number of CPU threads minus 1.
For 8 threads it is

Code:

rcu_nocbs=0-7

I had to set this on my 2200G otherwise it would lock up randomly. With this setting it never hangs (it runs 24/7).
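To avoid working out N by hand, the value can be derived from the thread count. This is a small sketch under stated assumptions: `rcu_nocbs_param` is a hypothetical helper name, and the parameter only has an effect on kernels built with RCU callback offloading (`CONFIG_RCU_NOCB_CPU`), which is worth checking for your kernel.

```shell
# Hypothetical helper: print the rcu_nocbs parameter for a given
# number of CPU threads (N = threads - 1, as described above).
rcu_nocbs_param() {
    threads=$1
    echo "rcu_nocbs=0-$((threads - 1))"
}

# Usage on the host itself:
#   rcu_nocbs_param "$(nproc --all)"
```

With 8 threads this prints `rcu_nocbs=0-7`, matching the example above.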

morph027 · Aug 9, 2019

In my case it's not a lockup, it's just I/O stalling. In the end the result is the same: due to the heavy I/O load I can't reboot or power off the system (it hangs forever in the systemd shutdown process) and I need to hard reset.

Ryzen3600 · Aug 10, 2019

Dark26 said:
Did you try disabling C6 in the BIOS, if available?

See here:

https://forum.proxmox.com/threads/proxmox-ve-5-0-ryzen-7-1700x-crashes-daily.36123/page-3

Not a Ryzen 3000 issue.

Stevebaumgartner · Aug 21, 2019

After many hours with Lenovo support, 6 different BIOS versions, 4 different NVMe models, 6 test machines and 30 different installs: disabling C6 resolves the problem on my hardware. They all purr like a tiger now.

guletz · Aug 22, 2019

Stevebaumgartner said:
After many hours with Lenovo support, 6 different BIOS versions, 4 different NVMe models, 6 test machines and 30 different installs: disabling C6 resolves the problem on my hardware. They all purr like a tiger now.

Same in my case! Disabling C6 solved the problem!

morph027 · Aug 22, 2019

Same here

kladze · Aug 22, 2019

How do you add the kernel parameter? I'm having the exact same issues with my Ryzen 1700.

guletz · Aug 22, 2019

kladze said:
How do you add the kernel parameter? I'm having the exact same issues with my Ryzen 1700.

In /etc/default/grub:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

Then run:

Code:

update-grub

I have 2 nodes with an AMD Ryzen 5 1600X Six-Core Processor.
After this change, the problems have vanished (almost 7 days of uptime). Before this GRUB change, in the best case, after 2 days at most, ANY node would freeze or get stuck!

After rebooting, check that the parameter shows up in the kernel command line in the dmesg output, something like this:

[ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.0.18-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet processor.max_cstate=1

Good luck!
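The check described above can also be scripted against `/proc/cmdline`, which holds the command line of the running kernel. A minimal sketch, assuming a hypothetical helper name `has_kernel_param`; the optional second argument exists only so the function can be tried against a pasted command line.

```shell
# Hypothetical helper: succeed if the given parameter appears as a
# whole word on the kernel command line.
# $1 = parameter, $2 = command line (defaults to the running kernel's)
has_kernel_param() {
    param=$1
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage after the reboot:
#   has_kernel_param processor.max_cstate=1 && echo "parameter is active"
```

If the parameter is missing after `update-grub` and a reboot, the host may not be booting through the GRUB configuration that was edited, so the change never reached the kernel.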

kladze · Aug 22, 2019

guletz said:
In /etc/default/grub:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

I have 2 nodes with an AMD Ryzen 5 1600X Six-Core Processor.
After this change, the problems have vanished (almost 7 days of uptime). Before this GRUB change, in the best case, after 2 days at most, the node would freeze or get stuck!

Thanks! Is rcu_nocbs=0-15 the only required parameter, or is processor.max_cstate=1 also needed?
