Proxmox VE 5.0 Ryzen 7 1700X crashes daily

Mellozor

New Member
You are running an Nvidia card. Did you read what I wrote to OP?

Anyway, that is your problem :)

See: https://forum.proxmox.com/threads/pve-5-0-locks-not-responding.35930/#post-176107

Thanks for the quick response. By remarkable coincidence, I tried doing that this morning after the crash. I stumbled on https://askubuntu.com/questions/841876/how-to-disable-nouveau-kernel-driver and, if this solves my problem, I'm going to be ecstatic!

After reading your link... Yes, that is the same GPU! Just a different vendor (Zotac).

28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 710B] [10de:128b] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GK208 [GeForce GT 710B] [19da:6326]
Kernel modules: nvidiafb, nouveau

Running "lsmod | grep -i nouveau" now shows nothing. Let's see how it goes. I can use the time to come up with creative ways to destroy the card while I wait to see if the system is stable. ;)
 

vooze

Member
The vendor does not matter; the nouveau driver is the problem. Just blacklist it and the problem is gone. Now we can go back to not blaming Ryzen and AMD :)
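
For anyone landing here later, this is a minimal sketch of one common way to blacklist nouveau on a Debian-based host such as Proxmox VE. The file name blacklist-nouveau.conf and the write_blacklist helper are illustrative choices, not something taken from this thread, and the optional directory argument exists only so the function can be tried without touching /etc.

```shell
#!/bin/sh
# write_blacklist [DIR]: create a modprobe config that stops nouveau from loading.
# DIR defaults to /etc/modprobe.d; pass another directory for a dry run.
write_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    # "blacklist" stops automatic loading by alias; modeset=0 disables
    # kernel mode setting in case the module gets loaded anyway.
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' \
        > "$dir/blacklist-nouveau.conf"
}

# Typical use, as root:
#   write_blacklist
#   update-initramfs -u   # rebuild the initramfs so the change applies at boot
#   reboot
```

After the reboot, "lsmod | grep -i nouveau" should print nothing if the blacklist took effect.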
 

Mellozor

New Member
Ashamed of myself (both my debugging and my forum-reading ability), angry at nouveau, and joyful that my system is stable. Thanks Vooze!

Code:
uptime
 17:22:16 up 1 day
 

vooze

Member
You are very welcome :)

Don't feel bad, debugging this was a bitch for me as well. I ended up in a forum thread with someone recommending to install nvidia-driver (which is bad! It will pull in a bunch of X11 dependencies) and figured I would try this. Mostly it was 50% luck on my part :p
 

stony999

New Member
I also have an Asus PRIME B350M-A motherboard (like Bogdan), and the system crashes twice a day with a core dump on Proxmox 5.1.
Initially I had BIOS 0611 and it crashed hourly. Then I updated to 1001 and it crashed twice daily, at idle with no VMs or CTs running.
Then I changed the graphics card from NVIDIA to ATI and upgraded the BIOS to 3203.

After upgrading the BIOS to 3203 it ran 5 days without a crash at idle.
But under slight load it still crashes twice a day.

I am lost. Does anybody have a hint?
 

wolfgang

Proxmox Staff Member
I can tell you Ryzen is working with PVE.
Maybe you have a Ryzen from the defective series.
I would RMA it.
But to be sure, run the Ryzen compile test.
 

Dark26

Active Member
Have you tried disabling the "C6" CPU mode in the BIOS?
I have had issues with this on some servers when they were idle.
 

stony999

New Member
I started a script in the background which produced 90% CPU load on 1 core. After that, my system became stable for about 3 days, then it crashed again.
Today, I installed pm-utils and executed
Code:
pm-powersave false

and disabled C6 states in grub by adding
Code:
processor.max_cstate=5 intel_idle.max_cstate=5 idle=poll
to GRUB_CMDLINE_LINUX in /etc/default/grub
and then executed
Code:
update-grub
and rebooted.

Let's see if this helps now.
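
A quick way to confirm that such parameters actually reached the running kernel after the reboot is to look at /proc/cmdline. The cmdline_has helper below is a hypothetical name used only for this sketch; the optional file argument is there so it can be tried against a sample file. As far as I know, intel_idle.max_cstate only affects the intel_idle driver, which AMD CPUs do not use, and idle=poll keeps the CPU in a polling loop instead of entering idle states at all, so treat the exact parameter set above as one poster's experiment rather than a recommendation.

```shell
#!/bin/sh
# cmdline_has PARAM [FILE]: succeed if PARAM appears as a whole word in the
# kernel command line (FILE defaults to /proc/cmdline).
cmdline_has() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Split the single-line command line into one parameter per line,
    # then look for an exact, literal match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Typical use after rebooting:
#   cmdline_has processor.max_cstate=5 && echo "max_cstate is active"
#   cmdline_has idle=poll && echo "idle=poll is active"
```

If a parameter is missing, the usual culprits are a typo in /etc/default/grub or forgetting to run update-grub before rebooting.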
 

FuriousGeorge

Active Member
I can report that my Threadripper workstation with PM installed is working fine, aside from the stupid name. I have an Asus Zenith motherboard. I had an issue with PCIe passthrough, but a smart guy on Reddit fixed it with a few lines of code for my (IIRC) vfio-amd and vfio-pci. (This fix is not in the kernel at this time, btw, but the hope is that it will be in the next major point release in a few months.) The recompile of both kernel and modules did not crash.

I do find that I have issues with /both/ the nvidia and nouveau drivers, but only when I have another one of my GPUs attached to a VM. I'm not sure if this is a Linux-wide problem or only on my platform, as most PCI passthrough tutorials recommend you blacklist those. It crashes the VM, or at least the GPU driver within it, and eventually locks the host with some cryptic messages in the logs about the host GPU. I want to take another look at this, but have not had time. I have not tried to use the nvidia or nouveau driver for very long on their own, so I can't comment on performance or stability there.

Other than that rare case (again, I don't even know it's the platform's fault, and it seems just as likely not to be), which will affect almost no one and barely affects me (I can just make a VM and give it a GPU if I want a KDE desktop), I highly recommend it.
 

Nightdefined

New Member
I have the same issue. I installed a Ryzen server with PVE 5.1-41 (kernel 4.13.13-32) for personal use, but I am experiencing weird freezes that happen randomly after 5-15 hours of uptime on average. The host becomes unreachable and needs a hard reset because the machine does not turn off by itself. This bug is quite difficult to solve because I have to wait several hours until it happens, and syslog says absolutely nothing.

I tried to solve it by replacing the CPU (same model) and the motherboard (ASRock AB350M Pro4 to Gigabyte AB350-Gaming 3) but the bug still happens.
I tried disabling IOMMU and SMT in the BIOS but nothing changed.
PSU voltage is OK.
I am currently trying the solution given by Stony999 (disable cstate in grub) and I will tell you if there is something new.

Specs:
- Ryzen 5 1500X
- Gigabyte AB350-Gaming 3 - with latest BIOS (F10, AGESA 1.0.7.2a)
- 2x4 GB Corsair Vengeance LPX 2666 MHz
- No GPU (but I used an NVIDIA GeForce 9400 GT for installation)

Attached my syslog (freeze at "Jan 31 02:24:00") + lscpu output.
(NB: sorry for my English)
 

Attachments

  • syslog.txt (842.9 KB)
  • lscpu.txt (1.4 KB)

Quindor

Member
It seems my recent build is also experiencing this problem. It's an Asus-X370-Pro with a "first few weeks" Ryzen 1700X. It has the segfault issue and I might RMA it for that, but other than that I also have the problem that the system will hang sometimes. The time between hangs varies from 1 to 14 days, very random.

I've installed the 4.15 test kernel and enabled the "rcu_nocbs=0-15" option in the grub boot file, hoping that this solves my problem. If I still experience issues, the next thing to add, from what I've read, is disabling C-state 6 and lower.

I'll try and keep this thread updated with my results. If you don't see a post for a while, that means it's probably remained stable. ;)
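
The 0-15 range above matches the 16 logical CPUs of a Ryzen 7 1700X (8 cores with SMT). For other CPUs, a minimal sketch for deriving the range from the CPU count follows. The rcu_nocbs_arg helper is a hypothetical name for illustration, and it assumes all logical CPUs are online and numbered contiguously from 0.

```shell
#!/bin/sh
# rcu_nocbs_arg [NCPUS]: print an rcu_nocbs= parameter covering CPUs 0..NCPUS-1.
# NCPUS defaults to the number of logical CPUs reported by nproc.
rcu_nocbs_arg() {
    ncpus="${1:-$(nproc)}"
    echo "rcu_nocbs=0-$((ncpus - 1))"
}

# Typical use: print the value, add it to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then run update-grub and reboot.
#   rcu_nocbs_arg
```

The parameter only has an effect on kernels built with CONFIG_RCU_NOCB_CPU enabled.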
 

wolfgang

Proxmox Staff Member
Hi @Quindor

Are you using the latest firmware?
It is important to use at least AGESA 1.0.0.1a because of memory compatibility.
 

Quindor

Member
I wish it was as simple as having memory issues. I was using the latest BIOS from sometime in 2018-03, and I recently upgraded to the newest one available at the time, 4008 (AGESA 1.0.0.2a), though I see they released 4011 today! This also did not solve my issue: during idle the system would hang at a random moment (could be 2 days, could be 30 minutes). If I put more load on the hardware it seemed to remain stable longer.

Researching this, I found that it probably has something to do with either the processor's C6 power state or changes in the kernel regarding "RCU_NOCB_CPU". In that thread they also mention a setting "idle power" and that it might fix the issue for some people. It seems a lot of people have been suffering from this issue since kernel 4.10+.

About 1.5 days ago I changed the setting called "Power Supply Idle Control" in the BIOS to "Typical Current". For those 1.5 days the system has remained stable even when fully idle. The setting uses about 4 W more on average, but that would be an acceptable fix for me. I'm leaving it running like this for a while longer to make sure that has actually fixed it.

For others, the following thread lists most details surrounding this issue and potential fixes for it: https://bugzilla.kernel.org/show_bug.cgi?id=196683
Those are mainly:
  • Changing RCU_NOCB_CPU kernel values (requires a recompiled kernel; the Proxmox kernel does not have it enabled)
  • Disabling the C6 power state (also disables turbo and peak frequency, though)
  • Changing the "Power Supply Idle Control" setting to give idle power a bit more juice (makes the rig consume 4 W more on average)

This seems to still be a "hot" issue with no real clear fix (Although people have reported success with the workarounds above!). Also, some systems are affected by it while others seem to be running fine so it's down to silicon and component combinations, but no one knows why or how exactly at this time.
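
To see whether a given kernel was built with RCU_NOCB_CPU support before trying the first workaround, the build config that Debian-based distributions install under /boot can be checked. This is a minimal sketch; kernel_config_state is a hypothetical helper name, and the /boot/config-$(uname -r) path is an assumption that holds on Debian-style installs but not necessarily elsewhere.

```shell
#!/bin/sh
# kernel_config_state NAME [FILE]: print "enabled", "disabled" or "unknown"
# for a kernel config option. FILE defaults to the running kernel's config.
kernel_config_state() {
    name="$1"
    file="${2:-/boot/config-$(uname -r)}"
    if grep -qxF -- "${name}=y" "$file" 2>/dev/null; then
        echo enabled
    elif grep -qxF -- "# ${name} is not set" "$file" 2>/dev/null; then
        echo disabled
    else
        # Option absent from the file, or the file could not be read.
        echo unknown
    fi
}

# Typical use:
#   kernel_config_state CONFIG_RCU_NOCB_CPU
```

If this prints "disabled", adding rcu_nocbs= to the kernel command line will not do anything until a kernel with the option enabled is installed.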
 

wolfgang

Proxmox Staff Member
You should disable all power-saving mechanisms when you virtualize, regardless of whether you have an AMD or an Intel CPU.
Dynamic CPU clocking always causes problems with any hypervisor.
 

Quindor

Member
Although I understand your remark, the added power cost and heat generation of disabling all power regulation isn't worth it for me (both at home and in the datacenters I run). I understand that power saving mechanisms such as CPU state switching can cause a little bit of extra latency (AMD Ryzen is actually very fast in doing so!), but I gladly take that hit. I haven't had any issues with the same BIOS settings running Windows on the same board, or on Intel boards in the past, either.

The issue we are seeing here is something different and more akin to a bug. From what I've read, it also isn't solved by just disabling BIOS options, since the true fix seems to vary from build to build, and it is related to the Linux kernel, since Windows runs with no issue on the same machines.

Since changing the setting above, though, the system has remained stable while completely idle. No definitive answer yet (it's been online this long before), but it's looking good.
 

wolfgang

Proxmox Staff Member
Run Windows with Hyper-V and running VMs and you will also have problems.

We also have Ryzen CPUs here and they all run fine without a problem. My home server has been running stable for over 2 months without a reboot.
 

Quindor

Member
I'm not really sure why we are debating this issue. It's known to happen with recent Linux kernels under the conditions I've mentioned above. It also happens during idle time and not specifically during load. But I've run Windows on the same machine without issue; it seems to use certain instructions for power saving differently.

Anyway, since changing "Power Supply Idle Control" in the BIOS to "Typical Current" I now have a system uptime of 7 days while the system was idle (the problem does not surface, or at least not as quickly, during load). So this is starting to look like a decent fix for anyone else experiencing the same issue.
 
