[PSA] ZFS Silent Data Corruption

garetht
Nov 26, 2023
Hi,

TLDR:
  • ZFS silent data corruption issue since ZFS 2.1.4 and especially since ZFS 2.2.0/PVE 8.1
  • Run echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync to reduce the probability of this occurring.
  • Actual PR to fix this issue: https://github.com/openzfs/zfs/pull/15571. Be careful about putting this in production!
  • Script to check if any files have this silent corruption. Note: a clean result from this script does not mean that other files are uncorrupted, and it can produce false positives.


There is a ZFS issue that has been around for a long time and causes silent data corruption. It triggers only in very specific workloads; however, a ZFS 2.2.0 feature called block cloning increased the probability of it being triggered, which is why it only got noticed now. Proxmox VE 8.1 was released with ZFS 2.2.0, but I am unsure if it uses any block cloning feature.

Anyway, a quick fix has been found that reduces the probability of it being triggered, although silent data corruption could still occur:

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
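
Note that writing to the sysfs parameter only lasts until the next reboot. A minimal sketch of making the workaround persistent, assuming the usual Debian/Proxmox modprobe.d mechanism (the drop-in filename here is arbitrary):

# Persist the module parameter across reboots (hypothetical drop-in name):
cat > /etc/modprobe.d/zfs-workaround.conf <<'EOF'
options zfs zfs_dmu_offset_next_sync=0
EOF
# On PVE the zfs module is loaded from the initramfs, so regenerate it so the
# option also applies early at boot:
update-initramfs -u -k all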

An actual PR to fix this has been issued here, but please be careful about putting this in production. A script to check if any files have this silent corruption can be found here, although it is based on the heuristic that the corruption tends to occur in the first block, so it can produce false positives and may not detect all cases of silent data corruption.

A Proxmox user also commented that this issue manifested itself on her Proxmox VE host.
 
PVE 8.1 ships with block cloning disabled by default, so the chances of triggering this should be low (note that the comment you linked is specifically about being able to trigger it with a custom reproducer script that tries to force the issue, not about running into it with regular workloads, which seems very rare unless block cloning is enabled, judging from the reports so far). Once the fix for the (old) issue has been reviewed, we will include it in our builds as well. If you want to be extra careful, deploy the workaround that disables the problematic behaviour entirely.
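
If you want to double-check whether block cloning is even enabled (or has ever been used) on a pool, something like the following should tell you; this is a sketch assuming the OpenZFS 2.2 feature/property names and the typical PVE root pool name "rpool":

# Feature state (disabled / enabled / active) on a hypothetical pool "rpool":
zpool get feature@block_cloning rpool
# How much data has actually been block-cloned, if any:
zpool get bcloneused,bclonesaved,bcloneratio rpool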
 
Thanks for the reply @fabian

One user has actually done a systematic analysis of this bug on the different combinations of factors that affect the probability of this bug occurring:
I'll let the updated tests run while I figure out how to present the data. In the meantime, here's a list of things that make it more likely to hit this bug, in approximate order of significance:

  • Extreme CPU/DRAM workloads parallel to and independent from the file I/O
  • Smaller files/less time writing and more time handling metadata, in relative terms
  • Slow disk I/O performance
I'm not sure yet if more parallel operations have an impact, I'll have to crunch this data down to something usable first.

I am unsure if Proxmox VE is more susceptible to the first use case. The reason this bug was noticed in the first place was that somebody was running a compilation in which files were written, read back concurrently, and then re-written. The original block was written correctly but read back incorrectly, so the re-written file ended up corrupted.
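
As a rough illustration of that pattern (this is not the actual reproducer from the upstream issue, just a sketch of the write-then-immediately-copy workload described above; on an affected system with a coreutils 9.x cp, some copies could come back with zeroed blocks even though the originals are intact):

# Write small files and copy them in parallel right away:
for i in $(seq 1 1000); do
    dd if=/dev/urandom of=orig.$i bs=128k count=1 status=none
    cp orig.$i copy.$i &    # coreutils >= 9.0 probes for holes while copying
done
wait
# Compare the copies against the originals:
for i in $(seq 1 1000); do
    cmp -s orig.$i copy.$i || echo "copy.$i differs from orig.$i"
done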

I am unsure if certain use cases of Proxmox VE (running VMs in parallel), or doing concurrent backups or snapshots of a live VM, may result in this getting triggered more often. It can be very nasty when silent corruption happens during a backup to PBS, for instance, as there is no way to verify that the backup is actually valid.

What is interesting is that, from the perspective of @robn (the author of the fix that addresses this issue), this bug might have existed since 2006 and is only starting to get noticed now.
 
I just read the article from The Register earlier today.

I think that this is something that's super important for people to note.

I also wonder if this bug affects the Oracle/Solaris (original) implementation of ZFS or if this bug ONLY affects OpenZFS.

Thanks.
 
The next pve-kernel update (6.5.11-6-pve) will contain the cherry-picked fix from 2.2.2 staging.
 
A good objective writeup of what this issue means from a technical perspective.

The good news is, this is very hard to hit without contrived examples, which is why it was around for so long. We hit a number of cases where someone made the window where it was incorrect much wider; those got noticed very fast and were undone, and we couldn't reproduce it with those contrived examples afterward.

It's also not very common because GNU coreutils only started using things like this by default with 9.0+, though that's just for the trivial case of using cp; things that read files outside of coreutils might be doing god knows what.

So your best bet, if you have a known-good source copy or a hash from one, is to compare your stuff against that. Anything else is going to be a heuristic with false positives and negatives.

Yes, that sucks. But life is messy sometimes.
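
A minimal sketch of that "compare against a known-good copy" approach, with placeholder paths:

# Record checksums of the trusted copy, then verify the suspect tree against them:
(cd /path/to/known-good && find . -type f -exec sha256sum {} + > /tmp/good.sha256)
cd /path/to/suspect && sha256sum -c /tmp/good.sha256 | grep -v ': OK$'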
 
I also wonder if this bug affects the Oracle/Solaris (original) implementation of ZFS or if this bug ONLY affects OpenZFS.

I think this comment in the link above answers this well:


The flaw, as written, was arguably in the original commit adding ZFS support in Solaris. (Maybe the expectations were different back then and it wasn't written down well? Dunno.)

But in practice, you couldn't hit this without something that would care about the distinction between the dnode being dirty and the buffers representing the modified contents being dirty, so then you can't hit it at least until 802e7b5fe (committed 2013, in 0.6.2 in April 2015), and even then, the code syncs txgs so often that at least in my quick experiments, it doesn't reproduce reliably until 905edb405d (committed 2015, in 0.6.5 in September 2015), which is probably the oldest example of "oops the gap got wider".
 
I think this comment in the link above answers this well:
Well... I dunno, because in the same post he writes that this issue was due in part to the commit that was made to OpenZFS in 2013.

So that's where I am a little bit confused as to how both statements can be true simultaneously.

(i.e. if it was in the original Solaris ZFS code, then the 2013 commit should, I would think, on the surface, have little to do with it. But in the post he writes explicitly about what changed in that commit, and how the check for the dnode being dirty was "incomplete", so that would suggest the commit is partially responsible for it. So I am trying to wrap my head around how both statements can be true at the same time.)
 
A better fix is in the works; the first version is already in our kernel packages as of proxmox-kernel-6.5.11-6-pve.
 
What's not totally clear to me: does this also apply to zvols, or only if I use ZFS as a POSIX filesystem?
 
Hi Fabian,

any chance you know with what release the second fix is deployed?

Thanks!
There were three iterations of a fix:
- disable block cloning (later turned out to be incomplete, but it already makes the issue very hard to trigger - this one we already included in 6.5.11-3-pve)
- first version of the dirty fix (complete, but not the most elegant - this one we included in 6.5.11-6-pve)
- second version of the dirty fix (not yet included in Proxmox kernels, but there is also no pressure, since the previous one already completely fixes the issue; it has also not yet been reviewed and accepted upstream!)
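
To see which kernel and ZFS build a node is actually running (and therefore which of these iterations it already has), a quick check could look like this (output formats assumed from PVE 8 / OpenZFS 2.x):

uname -r                                 # running kernel, e.g. 6.5.11-6-pve
zfs version                              # userland and kernel-module ZFS versions
pveversion -v | grep -E 'kernel|zfs'     # installed kernel and ZFS packages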
 
