Hi,
TLDR:
- ZFS silent data corruption issue since ZFS 2.1.4 and especially since ZFS 2.2.0/PVE 8.1
- Run echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync to reduce the probability of this occurring.
- Actual PR to fix this issue: https://github.com/openzfs/zfs/pull/15571. Be careful about putting this in production!
- A script is available to check whether any files have this silent corruption. Note: a clean result from this script does not mean that your files are uncorrupted, and it can also produce false positives.
There is a long-standing ZFS issue that causes silent data corruption. It triggers only under very specific workloads, but a ZFS 2.2.0 feature called block cloning increased the probability of triggering it, which is why it was only noticed now. Proxmox VE 8.1 was released with ZFS 2.2.0, but I am unsure whether it uses block cloning.
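To see whether a particular host is in the situation described above, you can check the running ZFS version and the state of the block cloning pool feature. This is a minimal sketch, assuming the zfs and zpool tools are on the PATH. The function name block_cloning_state and the pool name rpool are only illustrative and should be adapted.

```shell
#!/bin/sh
# Sketch only: report the state of the block_cloning feature on one pool.
# The function name is hypothetical; pass your own pool name as $1.
block_cloning_state() {
    pool="$1"
    # -H drops the header line, -o value prints only the value column.
    # Expected values are: disabled, enabled, active.
    zpool get -H -o value feature@block_cloning "$pool"
}

# Usage (run manually on the host):
#   zfs version                  # shows userland and kernel module versions
#   block_cloning_state rpool    # "rpool" is an assumed pool name
```

A value of active would indicate that cloned blocks currently exist on the pool, while enabled only means the feature is available.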
Anyway, a quick fix has been found that reduces the probability of it being triggered, although silent data corruption could still occur:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
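Note that a value written under /sys does not survive a reboot or a reload of the zfs module. The sketch below shows one way to read the current value and to build a module option line that could make the setting persistent. The function names and the PARAM_FILE variable are illustrative, and writing to /etc/modprobe.d/zfs.conf is an assumption about your setup, so check how your system loads the zfs module first.

```shell
#!/bin/sh
# Sketch only: inspect the tunable and build a persistent modprobe option line.
# PARAM_FILE and both function names are illustrative, not from the original post.
PARAM_FILE="${PARAM_FILE:-/sys/module/zfs/parameters/zfs_dmu_offset_next_sync}"

# Print the value the running zfs module currently uses.
current_value() {
    cat "$PARAM_FILE"
}

# Print the option line for a modprobe config file, given the desired value.
modprobe_line() {
    printf 'options zfs zfs_dmu_offset_next_sync=%s\n' "$1"
}

# Usage (as root):
#   current_value
#   modprobe_line 0 >> /etc/modprobe.d/zfs.conf
#   update-initramfs -u   # assumption: needed if zfs is loaded from the initramfs
```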
A proper fix has been submitted as a PR here, but please be careful about putting it in production. A script to check whether any files have this silent corruption can be found here. It relies on the heuristic that the corruption tends to occur in the first block, so it can produce false positives and may miss some cases of silent data corruption.
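To illustrate the kind of heuristic described, here is a rough sketch that flags files whose leading bytes are all zero. It is not the linked script. The function name first_block_is_zero and the 4096-byte BLOCK_BYTES value are assumptions made for illustration, and the same caveats apply: legitimately zero-filled files will be flagged, and corruption elsewhere in a file will be missed.

```shell
#!/bin/sh
# Sketch only, NOT the linked script: flag files whose first bytes are all zero.
# BLOCK_BYTES is an assumed size chosen for illustration.
BLOCK_BYTES="${BLOCK_BYTES:-4096}"

# Returns 0 (true) if the file is non-empty and its leading block is all NUL bytes.
first_block_is_zero() {
    file="$1"
    # Skip empty or missing files.
    [ -s "$file" ] || return 1
    # Strip NUL bytes from the leading block and count what is left.
    nonzero=$(head -c "$BLOCK_BYTES" "$file" | tr -d '\000' | wc -c)
    [ "$nonzero" -eq 0 ]
}

# Usage: list suspicious files under a directory (path is an example).
#   find /tank/data -type f | while IFS= read -r f; do
#       first_block_is_zero "$f" && printf 'suspect: %s\n' "$f"
#   done
```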
A Proxmox user also commented that this issue manifested itself on her Proxmox VE host.