Proxmox Freezing on AMD Ryzen Machines

Aug 8, 2019
I've been a Proxmox user for several years now, with quite a few one-off machines running alongside a cluster of 6 servers. I've never had a reason to post, but today I'm out of options and looking for some help. I have about 30 Lenovo M715q machines, all with AMD Ryzen CPUs, 8 GB of RAM, and an assortment of NVMe drives and SSDs. They all seem to freeze randomly and need to be force-restarted. This happens just during the Windows update process, or while waiting on me to finish my sandwich after a reboot. I haven't even gotten a load on the machines to place them in production yet. I've reinstalled Proxmox on several of them with the same results: a single VM running Windows Server 2012 with 4 cores and 4 GB of RAM. All the Intel machines are doing great; these new machines, not so much. They could be up for 20 minutes, or they could be up for 2 days.

Any suggestions would be greatly appreciated.
 
Make sure to update all available firmware for the machines (BIOS, microcode package, NIC firmware, ..., everything) to the latest released version. This quite often helps in those cases.

Otherwise, check the output of `dmesg` after a boot; maybe there is a hint there.

Hope this helps.
 
So, nothing really in the dmesg output. The BIOS is already updated and no other firmware is available for anything else. I did a full wipe and reinstall with PVE 6.0 and added 1 VM. I was watching processes and nothing was out of the ordinary, except that KVM hit 131%. It was doing Windows updates, and that's the exact moment it all locked up: ping failed and SSH locked up. I had to reboot the machine. I've attached the dmesg in case anyone sees something I missed.
 

Attachments

  • Capture.PNG (22.1 KB)
  • demsg.txt (73.8 KB)
Hm... interesting. I got an I/O lockup today on an EPYC server while ZFS was taking snapshots too (sanoid). It stayed reachable via SSH and so on, but I needed to reboot to get rid of the high I/O wait. Will keep an eye on it...
 
Soooo, did some more testing. The NVMe drives are going to use up their 790 TB of write endurance by the time I'm done, LOL. Well, paint me green and call me a pickle: I didn't get a crash if I use a VirtIO NIC instead of e1000 on the VM. Why in the world would that cause the entire host to lock up? Beats me. I've never had that happen before. But I can reproduce it on multiple hosts. This kind of makes me worried that a crashing virtual device offered to a VM can bring down a whole host.

Thoughts?
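For anyone wanting to try the same workaround, the NIC model of an existing VM can be switched from the host CLI. A sketch, assuming VM ID 100 and bridge vmbr0 (both placeholders; use your own values, and note the guest needs the VirtIO drivers installed first on Windows):

```shell
# Replace the VM's first NIC with a VirtIO (paravirtualized) model.
# This overwrites net0, so re-specify the bridge; without an explicit
# macaddr= the VM gets a new MAC, which Windows may treat as a new adapter.
qm set 100 --net0 virtio,bridge=vmbr0
```

The same change can be made in the GUI under the VM's Hardware > Network Device settings.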
 
I am having issues with containers on a Ryzen 3600; VMs are working fine. The installation did not work, so I had to install Debian with much difficulty (hardware issues) and convert.
 
Hm, another crash... not sure where to look:

Code:
[Fr Aug  9 13:23:53 2019] INFO: task zvol:619 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task txg_quiesce:1478 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:14072 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:29354 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:29355 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:8138 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:9485 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task kvm:21766 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task zvol:18104 blocked for more than 120 seconds.
[Fr Aug  9 13:23:53 2019] INFO: task zvol:18106 blocked for more than 120 seconds.

Attachment: Screenshot from 2019-08-09 13-44-41.png
 
Try setting
Code:
rcu_nocbs=0-N
in the kernel command line (where N is threads - 1).
For 8 threads it is
Code:
rcu_nocbs=0-7
I had to set this on my 2200G otherwise it would lock up randomly. With this setting it never hangs (it runs 24/7).
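To avoid counting threads by hand, the value can be derived from the CPU count. A small sketch using `nproc` (which reports the number of online logical CPUs):

```shell
# Build the rcu_nocbs=0-(threads-1) value from the online logical CPU
# count, e.g. "rcu_nocbs=0-7" on an 8-thread Ryzen.
THREADS=$(nproc)
RCU_NOCBS="rcu_nocbs=0-$((THREADS - 1))"
echo "$RCU_NOCBS"
```

The printed value is what gets appended to the kernel command line.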
 
In my case it's not a lockup, it's just I/O that is stalling. In the end the result is the same: due to heavy I/O load I can't reboot or power off the system (it hangs forever in the systemd shutdown process) and I need to hard reset.
 
After many hours with Lenovo support, 6 different BIOS versions, 4 different NVMe models, 6 test machines, and 30 different installs: disabling C6 resolves the problem on my hardware. They all purr like a tiger now.
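To verify from the Linux side whether C6 is actually gone after the BIOS change, the kernel's cpuidle sysfs tree can be inspected. A sketch, assuming cpuidle is enabled in your kernel (the path below is the standard sysfs location):

```shell
# List the idle states the kernel exposes for CPU 0; after disabling C6
# in the BIOS, no "C6" entry should appear here.
CPUIDLE=/sys/devices/system/cpu/cpu0/cpuidle
if [ -d "$CPUIDLE" ]; then
    STATES=$(cat "$CPUIDLE"/state*/name)
else
    STATES="cpuidle not available"
fi
echo "$STATES"
```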
 
How do you add the kernel parameter? I'm having the exact same issues with my Ryzen 1700.
 
How do you add the kernel parameter? I'm having the exact same issues with my Ryzen 1700.

In /etc/default/grub:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

Then run:

Code:
update-grub
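One caveat, as an assumption about your setup: PVE 6 installs with a ZFS root booting via UEFI use systemd-boot instead of GRUB, so editing /etc/default/grub has no effect there. On those systems the parameter goes on the single line in /etc/kernel/cmdline, for example:

```
# /etc/kernel/cmdline (one line; keep your existing entries, example values shown)
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet processor.max_cstate=1
```

followed by `pve-efiboot-tool refresh` to sync the boot entries.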




I have 2 nodes with an AMD Ryzen 5 1600X Six-Core Processor.
After this change, the problems have vanished (almost 7 days uptime). Before this grub change, in the best case, any node would freeze/get stuck after 2 days at most!

You can check that the kernel picked up the parameter by looking for something like this in the dmesg output:

[ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.0.18-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet processor.max_cstate=1

Good luck!
 
In /etc/default/grub:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

I have 2 nodes with an AMD Ryzen 5 1600X Six-Core Processor.
After this change, the problems have vanished (almost 7 days uptime). Before this grub change, in the best case, the node would freeze/get stuck after 2 days at most!

Thanks! Is rcu_nocbs=0-15 the only required parameter, or is processor.max_cstate=1 also needed?
 
