4 months ago I installed PVE v4.4 on an EXT4-formatted SSD. I ran about 10 different LXC containers on it and was the only user. I didn't use Ceph or HA, and there was no cluster; it was a single node with those ~10 LXCs, all under 50% of their allotted storage. Then, for reasons still unknown to me, the PVE host reported logical data corruption, locked itself into read-only mode, and I had to back up what I could and kill the entire node.
Fast forward to 3 days ago: I installed PVE v5.1 on ZFS RAID1 (a mirror) across two brand-new SSDs on SATA ports. I figured a ZFS mirror would rule out EXT4 as the culprit in my previous scenario. It hadn't even been installed for two days before the rpool reported itself as DEGRADED due to logical corruption on one of the mirror members. I hadn't even installed any VMs or containers yet.
These SSDs were on SATA ports 0 and 1, whereas the previous EXT4 SSD was an NVMe stick on a PCIe lane.
I feel like this can't be a coincidence, but I don't know how to do forensics on when/why logical blocks become corrupted. I can resilver the rpool, but without knowing why it happened in the first place, I feel like I'm just asking for it to happen again. PVE syslog shows nothing notable.
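For what it's worth, the only ZFS-side digging I know how to do is the following (I'm not at all sure these are the right tools for this kind of forensics, so corrections welcome; rpool is just the default pool name the PVE installer created):

```
# Show which datasets/files the checksum errors actually landed in
zpool status -v rpool

# In-kernel ZFS event log with timestamps, hoping to see when the
# checksum errors first appeared (I believe this doesn't survive a reboot)
zpool events -v

# If I decide to just repair and move on: re-read everything, rewrite bad
# blocks from the healthy mirror member, then reset the error counters
zpool scrub rpool
zpool clear rpool
```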
I ran smartctl tests on both drives and neither reported any physical bad blocks. I saw nothing in the PVE syslog that would indicate the moment a logical block became corrupt. The only reason I found out about the issue at all was that I happened to run zpool status.
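Concretely, the SMART checks were roughly this (device names here are just examples; mine may differ):

```
# Long SMART self-test on each mirror member
smartctl -t long /dev/sda
smartctl -t long /dev/sdb

# After the tests finish: results, reallocated/pending sectors, overall health
smartctl -a /dev/sda
smartctl -a /dev/sdb
```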
I've now used different drives (all brand new), different ports (PCIe and SATA), and different filesystems (EXT4 and ZFS), so I'm not sure what else to try. I'm fairly confident the rest of my hardware isn't the issue:
- My motherboard (ASRock EPC612D4U) - has the latest available BIOS (cross-checked against the dmidecode output after this list)
- My 16x4GB ECC RAM (on the motherboard's QVL) - 2 passes of memtest86 showed no errors
- My CPU - Intel Xeon E5-2650L v3 (LGA 2011-3)
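For reference, this is how I pulled the BIOS and DIMM details from the running system (just dmidecode, nothing exotic):

```
# BIOS version as reported by the firmware, to compare against ASRock's site
dmidecode -s bios-version

# Installed DIMMs: size, part number, and whether ECC is actually in use
dmidecode -t memory | grep -E 'Size|Part Number|Error Correction'
```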