2 Different PVE Versions; 2 Different File Systems on 2 Different Drives, But Same PVE Failures

Jun 19, 2017
Four months ago I installed PVE v4.4 on an EXT4-formatted SSD. I put about 10 different containers on it and was the only user. I didn't use Ceph or HA, and I had no other clusters; it was just one node with about 10 different LXCs, each under 50% of its allotted storage. Then, for reasons still unknown to me, the PVE host reported logical data corruption, locked itself into read-only mode, and I had to back up what I could and kill the entire node.

Fast forward to 3 days ago, when I installed PVE v5.1 using ZFS RAID1 as the root filesystem across two brand-new SSDs on SATA ports. I figured ZFS mirroring would eliminate the possibility of EXT4 being the culprit in my previous scenario. It hadn't even been installed for two days before the rpool reported itself as degraded due to logical corruption on one of the mirrors, and I hadn't installed any VMs/containers yet.

These SSDs are on SATA ports 0 and 1, whereas the previous single EXT4 SSD was an NVMe stick on a PCIe lane.

I feel like this can't be a coincidence, but I don't know how to do forensics on when/why logical blocks become corrupted. I can resilver the rpool, but without knowing why it happened in the first place, I feel like I'm just asking for it to happen again. PVE syslog shows nothing notable.
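
To be concrete, the repair I have in mind is roughly the following (standard ZFS tooling, nothing PVE-specific; rpool is just the pool name the PVE installer created):

```
# show exactly which device and which blocks/files are affected
zpool status -v rpool

# re-read and verify every block in the mirror
zpool scrub rpool

# once the scrub completes cleanly, reset the error counters
zpool clear rpool
```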

I ran smartctl tests on both drives and there were no physical bad blocks. I saw no events in the PVE syslog that would indicate the moment a logical block became corrupt. The only reason I found out about the issue at all was that I happened to run zpool status.
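
For reference, the smartctl checks I mean are roughly these (device names are just examples for my two mirror members):

```
# long offline self-test on each mirror member
smartctl -t long /dev/sda
smartctl -t long /dev/sdb

# after the tests finish, review the results and the reallocated/pending sector counts
smartctl -a /dev/sda
smartctl -a /dev/sdb
```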

Since I've used two different drives (both brand new) on two different interfaces (PCIe and SATA) with two different filesystems (EXT4 and ZFS), I'm not sure what else to try. I'm fairly confident my other hardware isn't the issue.
Anyone else had similar experiences or know how I could figure out how this is happening?
 
If there's a way to still enjoy the security of mirroring the root file system without ZFS, I'd be open to hearing about that as well. I just assumed this was the path of least resistance.
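
For example, I assume a plain mdadm RAID1 with EXT4 on top would be the classic alternative, something along these lines (purely a sketch, partition names are illustrative, and as far as I know the PVE installer won't set this up for you):

```
# build a two-disk mirror from matching partitions (illustrative device names)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# put a conventional filesystem on the mirror and use it as the root device
mkfs.ext4 /dev/md0

# persist the array definition and rebuild the initramfs so it assembles at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
```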
 
Is zpool status showing errors? On which disk? You could take that disk offline and run badblocks -w -v -s on it to check whether there are bad blocks.
You could also install zfs-zed to get an email if there are errors on the pool, so you have a warning mail in the future if a problem arises.
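
Roughly like this, where the device and pool names are just examples (note that badblocks -w is destructive, so the disk has to be resilvered back into the mirror afterwards):

```
# detach one mirror member so it can be tested destructively
zpool offline rpool /dev/sdb2

# destructive read-write pattern test of that partition
badblocks -w -v -s /dev/sdb2

# if it comes back clean, resilver it back into the pool
zpool replace rpool /dev/sdb2
```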
 
As I said in my post, smartctl tests returned no bad blocks. I actually noticed the degraded rpool from `zpool status` while setting up the zfs-zed email notifications.
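
For anyone who finds this later, the email part of zfs-zed is just a couple of settings in /etc/zfs/zed.d/zed.rc, roughly like this (exact variable names can differ slightly between zfs-zed versions):

```
# address that receives ZED notifications
ZED_EMAIL_ADDR="root"

# throttle repeat notifications for the same event class (seconds)
ZED_NOTIFY_INTERVAL_SECS=3600
```

After editing the file, restarting the service with `systemctl restart zfs-zed` picks up the change.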
 
