[SOLVED] PVE 6 RaidZ2 freeze every day Ryzen 7 1700

slave2anubis

Member
Feb 29, 2020
Hello friends, I have an issue that is driving me crazy!

A few days ago I moved my main (and only) zpool from a mirror (two Samsung PM863 480GB drives) to a RaidZ2 configuration (4 x PM863 480GB).
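For context, a mirror vdev cannot be reshaped into raidz2 in place, so a move like this is usually done by creating a new pool and replicating the data into it. A rough sketch of that pattern (pool and disk names are placeholders, not the exact commands used here):
Code:
# Create the new raidz2 pool (prefer /dev/disk/by-id paths in practice):
zpool create -o ashift=12 newpool raidz2 sdb sdc sdd sde
# Snapshot everything and replicate it into the new pool:
zfs snapshot -r rpool@move
zfs send -R rpool@move | zfs recv -F newpool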

This is when all my problems started :mad:

The issue is that since I moved to RaidZ2, the server freezes every night (early morning, to be precise).
What I tested:
  1. Installed kdump-tools and configured it to dump 256M in case of a kernel panic (a command sketch follows this list);
  2. Ran a full MemTest86+ pass on the RAM (16GB DDR4 ECC);
  3. Connected a monitor and keyboard overnight to watch the kernel output;
  4. Reseated all hardware (SATA cables, RAM, power);
  5. Scrubbed the zpool twice (no errors);
  6. Reset the BIOS settings.
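For reference, steps 1 and 5 typically look something like this on Debian/Proxmox (the crashkernel size matches the post; file locations and the pool name are the usual defaults, not copied from this box):
Code:
apt install kdump-tools
# add crashkernel=256M to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
update-grub && reboot
kdump-config show            # should report "ready to kdump"

zpool scrub rpool            # "rpool" is the default PVE pool name
zpool status -v rpool        # shows scrub result and any errors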
When the server freezes, the only way to recover it is a forced reset (5 seconds on the power button). I had no luck with kdump-tools since there was no kernel dump, and there was no information on the monitor either.

The logs also don't offer any information about the issue; everything seems to be working fine until the sudden freeze.
[Three screenshots of the Proxmox VE web UI attached]

The SSDs are not new, but they seem to be fine from a health perspective.
I have a total of 10 CTs and 1 VM (where I run pfSense) on the system.
Before this I had no problems with this server; it was rock solid, with 30+ days of uptime without an incident.
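smartmontools ships with PVE (see pveversion below), so a typical health spot check looks like this; the device path is a placeholder:
Code:
smartctl -H /dev/sda         # overall health self-assessment
smartctl -A /dev/sda         # attributes: wear leveling, reallocated sectors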
[Screenshot of the Proxmox VE web UI attached]
Code:
root@phoneresq:/var/log# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-5
pve-kernel-helper: 6.1-5
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-6
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Attachments

  • syslog.txt (258.9 KB)
  • kern.log (274.7 KB)
Coincidence is not proof. I have seen many ZFS-related problems over the years, but nothing like what you have now.
But I have seen the same problem as yours caused by power issues or a bad PSU.

Good luck/Bafta
 
Had a similar issue here.
Bad 5V capacity on the PSU: I was using a lot of 2.5" drives and they overwhelmed the 5V rail.
Most PSUs are powerful on the 12V rail but only have "a few amps" on the 5V rail.
Once you start summing things up, you can easily overload it (yes, I know, I shouldn't have used 20 x 2.5" HDDs on that PSU).
Maybe you are at exactly that point too, as @guletz pointed out.
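A rough back-of-envelope illustration of that summing (the per-drive currents are assumed round numbers, not measurements):
Code:
# 20 x 2.5" HDDs at ~0.7-1.0 A each on 5 V  ->  ~14-20 A total
# typical ATX PSU 5 V rail rating           ->  15-20 A
# => idle may fit, but seek/spin-up peaks can push past the rail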
 
So I did manage to resolve the issue after all :)
Thank you all for the help; it turns out it wasn't a hardware issue after all.
The thing that got me confused was the timing, because until I changed the SSD zpool from RAID1 (two drives) to RaidZ2 (four drives), everything was rock stable.
The problem turns out to be a well-known bug in the first generation of Ryzen processors. The bug has to do with the C6 power state: the processor lowers its voltages too much at idle and crashes.

This bug can be fixed by disabling the C6 state (a software-side sketch follows below) or by changing a BIOS setting named "Power Supply Idle Control" from "Auto" to "Typical Current Idle".
After this fix and an update to the latest BIOS I have had no more problems with the system; rock solid.
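For anyone who wants the software-side route, people often point to the community ZenStates-Linux script for disabling C6 on a running system. A sketch based on that project's documented usage (verify against its README; nothing below comes from this thread):
Code:
modprobe msr                       # the script reads/writes CPU MSRs
git clone https://github.com/r4m0n/ZenStates-Linux.git
cd ZenStates-Linux
./zenstates.py --list              # show current P-states / C6 status
./zenstates.py --c6-disable        # disable package C6 (not persistent
                                   # across reboots, unlike the BIOS setting)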
As for the PSU angle, I don't think that's the case, since I have a known good brand 500W PSU and the power draw is less than 100W under load (65W CPU plus 4 SSDs).

My two cents is that the increased R/W speed of the array (RaidZ2 vs. RAID1) decreased the total CPU load/overhead so much that the CPU started idling down into C6 and the bug began to manifest itself; otherwise, I don't know...

Thank you all for the help, maybe this information will help anyone having problems like this.
 
