4 months ago I installed PVE v4.4 on an EXT4-formatted SSD. I ran about 10 different LXC containers on it and was the only user. I didn't use Ceph or HA, and there was no cluster; it was a single node with those ~10 LXCs, all under 50% of their allotted storage. Then, for reasons still unknown to me, the PVE host reported logical data corruption, locked itself into read-only mode, and I had to back up what I could and kill the entire node.
Fast forward to 3 days ago: I installed PVE v5.1 on ZFS RAID1 (a mirror) across two brand-new SSDs on SATA ports. I figured a ZFS mirror would rule out EXT4 as the culprit in my previous scenario. It hadn't even been installed for two days before the rpool reported itself as DEGRADED due to logical corruption on one of the mirror members. I hadn't even installed any VMs or containers yet.
These SSDs were on SATA ports 0 and 1, whereas the previous EXT4 SSD was an NVMe stick on a PCIe lane.
I feel like this can't be a coincidence, but I don't know how to do forensics on when/why logical blocks become corrupted. I can resilver the rpool, but without knowing why it happened in the first place, I feel like I'm just asking for it to happen again. PVE syslog shows nothing notable.
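For what it's worth, the only ZFS-side digging I know how to do is the following (I'm not at all sure these are the right tools for this kind of forensics, so corrections welcome; rpool is just the default pool name the PVE installer created):

```
# Show which datasets/files the checksum errors actually landed in
zpool status -v rpool

# In-kernel ZFS event log with timestamps, hoping to see when the
# checksum errors first appeared (I believe this doesn't survive a reboot)
zpool events -v

# If I decide to just repair and move on: re-read everything, rewrite bad
# blocks from the healthy mirror member, then reset the error counters
zpool scrub rpool
zpool clear rpool
```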
I ran smartctl tests on both drives and neither reported any physical bad blocks. I saw nothing in the PVE syslog that would indicate the moment a logical block became corrupt. The only reason I found out about the issue at all was that I happened to run zpool status.
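Concretely, the SMART checks were roughly this (device names here are just examples; mine may differ):

```
# Long SMART self-test on each mirror member
smartctl -t long /dev/sda
smartctl -t long /dev/sdb

# After the tests finish: results, reallocated/pending sectors, overall health
smartctl -a /dev/sda
smartctl -a /dev/sdb
```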
I've now used different drives (all brand new), different ports (PCIe and SATA), and different filesystems (EXT4 and ZFS), so I'm not sure what else to try. I'm fairly confident the rest of my hardware isn't the issue:
- My motherboard (ASRock EPC612D4U) - has the latest available BIOS (cross-checked against the dmidecode output after this list)
- My 16x4GB ECC RAM (on the motherboard's QVL) - 2 passes of memtest86 showed no errors
- My CPU - Intel Xeon E5-2650L v3 (LGA 2011-3)
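For reference, this is how I pulled the BIOS and DIMM details from the running system (just dmidecode, nothing exotic):

```
# BIOS version as reported by the firmware, to compare against ASRock's site
dmidecode -s bios-version

# Installed DIMMs: size, part number, and whether ECC is actually in use
dmidecode -t memory | grep -E 'Size|Part Number|Error Correction'
```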