[SOLVED] PVE 6 RaidZ2 freeze every day Ryzen 7 1700

slave2anubis

Feb 29, 2020
Hello friends, I have an issue that is driving me crazy!

A few days ago i moved my main (and only) zpool from a mirror (two Samsung PM863 480GB drives) to a RaidZ2 configuration (4 x PM863 480GB).

This is when all my problems started :mad:

The issue I have is that since I moved to the RaidZ2, the server freezes every night (early morning, to be precise).
What I tested:
  1. Installed kdump-tools and configured it to dump 256M in case of a kernel panic;
  2. Ran a full Memtest86+ pass on the RAM (16GB DDR4 ECC);
  3. Connected a monitor and keyboard overnight to watch the kernel output;
  4. Reconnected all HW (SATA cables, RAM, power);
  5. Scrubbed the zpool two times (no errors);
  6. Reset the BIOS settings.
When the server freezes, the only way to recover it is a forced reset (holding the power button for 5 seconds). I had no luck with kdump-tools, since no kernel dump was written, and there was nothing on the monitor either.

The logs don't offer any information about the issue either; everything seems to be working fine until the sudden freeze.
[Screenshots, 2020-03-06: phoneresq - Proxmox Virtual Environment (three images)]

The SSDs are not new, but they seem to be fine from a health perspective.
I have a total of 10 CTs and 1 VM (where I run pfSense) on the system.
Before this I had no problems with this server; it was rock solid and had 30+ days of uptime without an incident.
[Screenshot, 2020-03-06: phoneresq - Proxmox Virtual Environment]
Code:
root@phoneresq:/var/log# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-5
pve-kernel-helper: 6.1-5
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-6
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Coincidence is not proof. I have seen many ZFS-related problems over the years, but nothing like what you have now.
However, I have seen the same symptoms as yours caused by a power problem or a bad PSU.

Good luck/Bafta
 
Had a similar issue here.
Bad 5V capacity of the PSU: I was using a lot of 2.5" drives and they overwhelmed the 5V rail.
Most PSUs are powerful on the 12V rail, but only have a "few amps" on the 5V rail.
Once you start summing things up, you can easily overload it (yes, I know, I shouldn't have used 20x 2.5" HDDs on that PSU).
Maybe you are right at that point as well, as @guletz pointed out.
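
To put numbers on that sum, here is a rough sketch. The function name `rail_headroom_ma` and the figures in the usage note are placeholders for illustration, not measurements from any system in this thread; take the per-drive 5V current from the drive label and the 5V rail limit from the PSU label.

```shell
# Headroom on the 5V rail in milliamps: rail capacity minus total drive draw.
# Arguments (all hypothetical inputs you supply yourself):
#   $1 = number of drives
#   $2 = 5V current per drive in mA (from the drive label)
#   $3 = 5V rail capacity in mA (from the PSU label)
# A negative result means the drives alone exceed the rail's rating.
rail_headroom_ma() {
    drives=$1
    ma_per_drive=$2
    rail_ma=$3
    echo $(( rail_ma - drives * ma_per_drive ))
}
```

With placeholder values, `rail_headroom_ma 20 1000 15000` prints `-5000`, i.e. 5 A over budget before anything else on the 5V rail is counted.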
 
So I did manage to resolve the issue after all :)
Thank you all for the help; it turns out it wasn't a HW issue.
What confused me was the timing, because until I changed the SSD zpool from Raid1 (with two drives) to RaidZ2 (with four drives), everything was rock stable.
The problem turns out to be a well-known bug in first-generation Ryzen processors. It has to do with the C6 power state: the processor lowers its voltage too much at idle and crashes.

This bug can be worked around by disabling the C6 state or by changing a setting named "Power Supply Idle Control" from "Auto" to "Typical Current Idle".
After this fix and an update to the latest BIOS I had no more problems with the system; it is rock solid.
As for the PSU angle, I don't think that's the case, since I have a 500W PSU from a known good brand and the power draw is less than 100W under load (65W CPU plus 4 SSDs).
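
For anyone who wants to see which idle states the kernel exposes and how often they are entered, the sketch below reads the standard Linux cpuidle files under sysfs. The function name is made up for illustration, and the state names the kernel reports may not map one-to-one onto the hardware C6 state, so treat this as a diagnostic aid only; the setting change described above is what resolved the freezes.

```shell
# List the idle states the kernel exposes for one CPU, with usage counters.
# $1 is an optional cpuidle directory; the default is cpu0's sysfs path.
list_cpuidle_states() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    if [ ! -d "$dir" ]; then
        echo "no cpuidle directory at $dir" >&2
        return 1
    fi
    for state in "$dir"/state*; do
        # Skip the unexpanded glob if the directory has no stateN entries.
        [ -d "$state" ] || continue
        name=$(cat "$state/name")         # state name as reported by the driver
        usage=$(cat "$state/usage")       # number of times the state was entered
        disabled=$(cat "$state/disable")  # 1 if the state is disabled via sysfs
        printf '%s name=%s usage=%s disabled=%s\n' \
            "${state##*/}" "$name" "$usage" "$disabled"
    done
}
```

Call it with no arguments on the host (`list_cpuidle_states`), or pass another CPU's cpuidle directory as the first argument.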

My two cents is that the increased R/W speed of the array (RaidZ2 vs Raid1) decreased the total CPU load/overhead so much that the CPU idled more and this bug started to manifest itself; otherwise I don't know...

Thank you all for the help. Maybe this information will help someone else with a problem like this.
 
