Proxmox VE 5.0 Ryzen 7 1700X crashes daily

Bogdan Popa

Aug 4, 2017
Hello,

I've got a new server running Proxmox VE 5.0, but it has been crashing randomly since day one. Any hints you guys can give me? Maybe disable SMT in the BIOS? Could it be a kernel bug?

This is a Hetzner remote machine, so seeing the exact kernel panic message is kind of hard (I have to request a KVM). The last time, it was scrolling messages non-stop.

I've run different stress tests on it with no problems. Some reboots happened when it was completely idle (no VMs running).

Versions:
proxmox-ve: 5.0-18 (running kernel: 4.10.17-1-pve)
pve-manager: 5.0-29 (running version: 5.0-29/6f01516)
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-5
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-14
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-5
libpve-storage-perl: 5.0-12
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-2
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: not correctly installed
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve2
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
 
There are different issues with the Ryzen platform, ranging from specific motherboard/BIOS bugs and instabilities to things like https://community.amd.com/thread/215773?start=0&tstart=0. Without a concrete error message and trace, it is impossible to tell what the problem might be.
 
Thank you kindly, Fabian, for your reply. I was afraid of that. I'll contact their support; maybe they can help. Otherwise I'll have to cancel the server and get an Intel one. Pity, I like its price/performance.
 
Is it possible that Hetzner has not updated the BIOS (or that none is available) to one as new as the one I use (AGESA version 1.0.0.6a)? I have been running Ryzen for over a month now (I think) without a single crash.
 
This was the first thing I checked.

DMI: System manufacturer System Product Name/PRIME B350M-A, BIOS 0805 06/20/2017

The latest one from ASUS:
PRIME B350M-A BIOS 0805
Update AGESA to 1.0.0.6a

I did ask support to check the BIOS settings. They applied some changes and I hope it works; the last crash was after 26 hours of uptime. I will ask what they changed.
 
I freaking love this CPU, I don't want to switch back to an Intel i7 :(

So the reply I got was "Usually we update the BIOS and disable the C-States, in most cases this helps to avoid unwanted crashes."

The BIOS was already up to date according to kern.log, so I guess we'll have to see if it still crashes with C-states off.
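
One way to double-check from the running system which idle states the kernel still sees is to read the cpuidle entries in sysfs. This is only a rough sketch: the function name list_idle_states is made up for illustration, and whether a C-state switch in the BIOS actually changes what shows up here depends on the platform and the idle driver, so treat that part as an assumption.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (state, name, disabled) tuples for one CPU, ordered by state index."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        # No cpuidle driver loaded, or not a Linux sysfs tree.
        return []
    states = []
    # Directories are named state0, state1, ...; sort numerically, not lexically.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states
```

Calling list_idle_states() with no argument reads cpu0 on the live system. An empty list simply means the directory is not there, not that C-states are proven to be off.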
 
TL;DR: Kernel 4.10.0 causes daily crashes on many Ryzen systems.

I thought about trying out Proxmox, but after reading about all the issues people are facing I'm having second thoughts. I have run Windows 10 on my Ryzen server since March 2017 without any issues. I switched to Fedora 26 a few days ago. It is the first Linux distribution that runs stably on my system.

I have personally tried many Linux distributions on my system. Ubuntu 17.04 crashed by far the quickest, and Gentoo did not complete kernel compilation, so I never got to finish the install. Ubuntu 16.04.2 also could not be installed on my system.

OS that did not work:
Arch Linux (2017-03 4.10.0)
Gentoo (Early 2017-04 4.10.0)
Ubuntu 16.04.2 LTS (4.8)
Ubuntu 17.04 (4.10.0)

OS that worked without modifications:
Windows 10 Pro 64 bit
Fedora 26

Reports related to kernel version:
https://www.reddit.com/r/Amd/comments/62yyh4/anyone_using_ryzen_in_servers/dg6pv78/

I know there are other issues with Ryzen systems, but this one does not seem to be related to the issues that have been officially announced (probably because it's already fixed in a newer kernel).

My system specs:
CPU: Ryzen 1700
M/B: Asus Prime X370-Pro (BIOS 0805 w/ AGESA 1.0.0.6a)
RAM: 32 GB RAM (2 x 16 GB running at 2133 MHz)
SSD: 512 GB Samsung 960 M.2 NVMe

processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700 Eight-Core Processor
stepping : 1
microcode : 0x8001126
cpu MHz : 1550.000
cache size : 512 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 5988.27
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
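
For anyone who wants to compare their chip against the dump above, a small sketch like this pulls out the few fields that are easiest to compare (family, model, stepping, microcode). The helper names parse_cpuinfo and summarize are invented here, and picking those particular fields is my assumption about what is worth comparing, not something established in this thread.

```python
def parse_cpuinfo(text):
    """Parse the first processor block of /proc/cpuinfo-style text into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def summarize(info):
    """Keep only the fields that are handy when comparing two machines."""
    fields = ("model name", "cpu family", "model", "stepping", "microcode")
    return {field: info.get(field) for field in fields}
```

On a live system the input would be the contents of /proc/cpuinfo, for example summarize(parse_cpuinfo(open("/proc/cpuinfo").read())).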

For now I am going to stick with Fedora 26 with libvirt. It seems to work, and I don't want to run Windows on it. I am now using this machine for production, so unfortunately I am unable to test further.
 
For what it's worth, it works perfectly on my Ryzen 7 1700 (OC to 3.5 GHz) with Proxmox 5.0.

I think something else is at play here; maybe it depends on the motherboard (mine is ASRock).
 
I really don't think that is what OP is experiencing. That bug will (as far as I can see) only occur when compiling, etc.

I'm really baffled that the 4.10 kernel is working for you. I have read about the kill-ryzen and save-ryzen scripts. You are correct, I don't think that issue is related to my problems (or those of the other Reddit users from my previous post). Today I also read a lot in the forum post that Fabian shared. That problem looks like a much bigger issue. I'm glad that it's not affecting me, touch wood.

Thanks for confirming that it's working for you. I hope you are right about it being related to the motherboard; this would mean not as many people are affected by this issue. I know that 4.11 is working for me. It's unfortunate, but it looks like I will need to wait for more people to adopt it. Stretch was released recently, so it might be a long wait. :(
 
The segfault issue is related to what Rhinox wrote. It's the same thing. It will ONLY happen on some early Ryzen chips when compiling on all 16 threads etc. Do you compile your own kernel/packages on Proxmox? If not, don't worry about it!

First off, I seem to remember you having an Asus board? I have had a lot of problems with Asus boards and Linux in the past. One would even start flashing a BIOS with no warning or anything. Asus support was like "wtf." and I just got a new board; it had the same problem, and 2-3 BIOS updates later it was fixed. I honestly think Asus only tests for Windows, and if that works, then it's good enough.

Also, Proxmox uses the Ubuntu kernel, so you will get a newer kernel (I guess) before Debian does. Don't worry :) You will get whatever kernel is "stable" around the time of 5.1 (I think); maybe they will wait for 5.2.
 
Oops, wrong thread.
 
I might compile my own kernel, but it will be a one time thing. If I run into trouble I can disable SMT/opcache/etc... If I'm stuck I can always use another machine for kernel compilation.

Yes, I have an Asus board. I think you're right again about them testing on Windows only. I'm glad to hear that Proxmox uses the Ubuntu kernel :)

Thanks for your time.
 
Yesterday I installed Proxmox 5.0 (fully updated 15 Aug 2017). The OS ran for half a day before it crashed.

I could not find any useful information in the logs. The server's screen showed some messages, one of them being "watchdog: BUG: soft lockup - CPU#6 stuck for 23s!", each time with a different CPU number and a time above 20 seconds. I tried "/var/log# grep -ir 'soft lockup' ." but got no results. Any idea where the messages that were printed to the console are logged?

The soft lockup messages kept coming, and I could not interact with the system over SSH or with the keyboard attached to the server. Not even the Num Lock light would change.
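
As far as I understand it, these are kernel messages rather than anything written to stdout. They go to the kernel ring buffer and only end up in files such as kern.log or syslog if the logging daemon can still run and write to disk, which may not be the case on a machine that is locked up this hard. If the text does get captured somewhere (a saved console dump, kern.log, journal output), a rough sketch like the following would pull out the CPU numbers and stall times. The function name and the regular expression are mine, modelled only on the one message quoted above.

```python
import re

# Pattern modelled on the message seen on the console in this post.
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def find_soft_lockups(lines):
    """Yield (cpu, seconds) for every soft-lockup message in an iterable of lines."""
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))
```

It takes any iterable of lines, so an open file object for whatever log or capture is available can be passed in directly.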

Syslogs show the following around the time of the crash:
Aug 15 22:43:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:44:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:44:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:45:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:45:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:46:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:46:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:47:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:47:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:48:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:48:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:49:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:49:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 15 22:50:00 mop systemd[1]: Starting Proxmox VE replication runner...
Aug 15 22:50:00 mop systemd[1]: Started Proxmox VE replication runner.
Aug 16 14:09:48 mop systemd-modules-load[402]: Inserted module 'iscsi_tcp'
Aug 16 14:09:48 mop systemd-modules-load[402]: Inserted module 'ib_iser'
Aug 16 14:09:48 mop systemd-modules-load[402]: Inserted module 'vhost_net'
Aug 16 14:09:48 mop systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 16 14:09:48 mop systemd-udevd[480]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Aug 16 14:09:48 mop systemd[1]: Started Flush Journal to Persistent Storage.
Aug 16 14:09:48 mop systemd[1]: Started udev Coldplug all Devices.
Aug 16 14:09:48 mop systemd[1]: Starting udev Wait for Complete Device Initialization...
Aug 16 14:09:48 mop systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Aug 16 14:09:48 mop systemd-udevd[595]: failed to execute '/etc/console-setup/cached_setup_font.sh' '/etc/console-setup/cached_setup_font.sh': No such file or directory
Aug 16 14:09:48 mop systemd-udevd[596]: failed to execute '/etc/console-setup/cached_setup_font.sh' '/etc/console-setup/cached_setup_font.sh': No such file or directory
Aug 16 14:09:48 mop kernel: [ 0.000000] Linux version 4.10.15-1-pve (root@stretchbuild) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP PVE 4.10.15-15 (Fri, 23 Jun 2017 08:57:55 +0200) ()
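
In the excerpt above the last entry before the hang is at Aug 15 22:50:00 and the next one is from the boot at Aug 16 14:09:48, so nothing was logged in between. To find that kind of hole in a longer syslog, something along these lines might help. It is a sketch only: the function names are invented, the year is an assumption because classic syslog lines do not carry one, and the month parsing relies on an English/C locale.

```python
from datetime import datetime


def parse_stamp(line, year=2017):
    """Parse the 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    # The first 15 characters hold the timestamp; the year is supplied by the caller.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2017):
    """Return (gap, last_line_before, first_line_after) for the biggest time gap."""
    best = None
    previous = None
    for line in lines:
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if previous is not None:
            gap = stamp - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (stamp, line)
    return best
```

The line returned as last_line_before gives a lower bound for when the machine stopped logging, which narrows down the time window to ask the hoster about.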

I'm going to try disabling some things in my BIOS to see if it makes any difference.

In case it's useful to someone:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Device [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1467]
03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b9] (rev 02)
03:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b5] (rev 02)
03:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b0] (rev 02)
1d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
1d:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
1d:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
1d:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
1d:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
1d:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b4] (rev 02)
25:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:1343]
26:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 710B] [10de:128b] (rev a1)
28:00.1 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
29:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:145a]
29:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Device [1022:1456]
29:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] USB3 Host Controller [1022:145c]
2a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:1455]
2a:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
2a:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Device [1022:1457]
 
