[PSA] ZFS Silent Data Corruption

garetht
Nov 26, 2023
Hi,

TLDR:
  • ZFS silent data corruption issue since ZFS 2.1.4 and especially since ZFS 2.2.0/PVE 8.1
  • Run echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync to reduce the probability of this occurring.
  • Actual PR to fix this issue: https://github.com/openzfs/zfs/pull/15571. Be careful about putting this in production!
  • Script to check if any files have this silent corruption. Note: a clean result from this script does not mean that other files are uncorrupted, and it can produce false positives.


There is a ZFS issue that has been around for a long time and causes silent data corruption. It triggers only in very specific workloads; however, a ZFS 2.2.0 feature called block cloning increased the probability of it being triggered, which is why it only got noticed now. Proxmox VE 8.1 was released with ZFS 2.2.0, but I am unsure if it uses any block cloning feature.

Anyway, a quick fix has been found that reduces the probability of it being triggered, although silent data corruption could still occur:

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
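
Note that writing to the sysfs parameter only lasts until the next reboot. A minimal sketch of making the workaround persistent, assuming the usual Debian/Proxmox modprobe.d mechanism (the drop-in filename here is arbitrary):

# Persist the module parameter across reboots (hypothetical drop-in name):
cat > /etc/modprobe.d/zfs-workaround.conf <<'EOF'
options zfs zfs_dmu_offset_next_sync=0
EOF
# On PVE the zfs module is loaded from the initramfs, so regenerate it so the
# option also applies early at boot:
update-initramfs -u -k all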

An actual PR to fix this has been issued here, but please be careful about putting this in production. A script to check if any files have this silent corruption can be found here, although it is based on the heuristic that the corruption tends to occur in the first block, so it can produce false positives and may not detect all cases of silent data corruption.

A Proxmox user also commented that this issue manifested itself on her Proxmox VE host.
 
PVE 8.1 ships with block cloning disabled by default, so the chances of triggering this should be low (note that the comment you linked is specifically about being able to trigger it with a custom reproducer script that tries to force the issue, not about running into it with regular workloads, which seems very rare unless block cloning is enabled, judging from the reports so far). Once the fix for the (old) issue has been reviewed, we will include it in our builds as well. If you want to be extra careful, deploy the workaround that disables the problematic behaviour entirely.
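
If you want to double-check whether block cloning is even enabled (or has ever been used) on a pool, something like the following should tell you; this is a sketch assuming the OpenZFS 2.2 feature/property names and the typical PVE root pool name "rpool":

# Feature state (disabled / enabled / active) on a hypothetical pool "rpool":
zpool get feature@block_cloning rpool
# How much data has actually been block-cloned, if any:
zpool get bcloneused,bclonesaved,bcloneratio rpool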
 
Thanks for the reply @fabian

One user has actually done a systematic analysis of this bug on the different combinations of factors that affect the probability of this bug occurring:
I'll let the updated tests run while I figure out how to present the data. In the meantime, here's a list of things that make it more likely to hit this bug, in approximate order of significance:

  • Extreme CPU/DRAM workloads parallel to and independent from the file I/O
  • Smaller files/less time writing and more time handling metadata, in relative terms
  • Slow disk I/O performance
I'm not sure yet if more parallel operations have an impact, I'll have to crunch this data down to something usable first.

I am unsure if Proxmox VE is more susceptible to the first use case. The reason this bug was noticed in the first place was that somebody was running a compilation in which files were written, read back concurrently, and then re-written. The original block was written correctly but read back incorrectly, so the re-written file ended up corrupted.
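
As a rough illustration of that pattern (this is not the actual reproducer from the upstream issue, just a sketch of the write-then-immediately-copy workload described above; on an affected system with a coreutils 9.x cp, some copies could come back with zeroed blocks even though the originals are intact):

# Write small files and copy them in parallel right away:
for i in $(seq 1 1000); do
    dd if=/dev/urandom of=orig.$i bs=128k count=1 status=none
    cp orig.$i copy.$i &    # coreutils >= 9.0 probes for holes while copying
done
wait
# Compare the copies against the originals:
for i in $(seq 1 1000); do
    cmp -s orig.$i copy.$i || echo "copy.$i differs from orig.$i"
done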

I am unsure if certain use cases of Proxmox VE (running VMs in parallel), or doing concurrent backups or snapshots of a live VM, may result in this getting triggered more often. It can be very nasty when silent corruption happens during a backup to PBS, for instance, as there is no way to verify that the backup is actually valid.

What is interesting is that, from the perspective of @robn (the author of the fix that addresses this issue), this bug might have existed since 2006 and is only starting to get noticed now.
 
I just read the article from The Register earlier today.

I think that this is something that's super important for people to note.

I also wonder if this bug affects the Oracle/Solaris (original) implementation of ZFS or if this bug ONLY affects OpenZFS.

Thanks.
 
The next pve-kernel update (6.5.11-6-pve) will contain the cherry-picked fix from 2.2.2 staging.
 
A good objective writeup of what this issue means from a technical perspective.

The good news is, this is very hard to hit without contrived examples, which is why it was around for so long. We hit a number of cases where someone made the window where it was incorrect much wider; those got noticed very fast and were undone, and we couldn't reproduce it with those contrived examples afterward.

It's also not very common because GNU coreutils only started using things like this by default with 9.0+, though that's just for the trivial case of using cp; things that read files outside of coreutils might be doing god knows what.

So your best bet, if you have a known-good source copy or a hash from one, is to compare your stuff against that. Anything else is going to be a heuristic with false positives and negatives.

Yes, that sucks. But life is messy sometimes.
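
A minimal sketch of that "compare against a known-good copy" approach, with placeholder paths:

# Record checksums of the trusted copy, then verify the suspect tree against them:
(cd /path/to/known-good && find . -type f -exec sha256sum {} + > /tmp/good.sha256)
cd /path/to/suspect && sha256sum -c /tmp/good.sha256 | grep -v ': OK$'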
 
I also wonder if this bug affects the Oracle/Solaris (original) implementation of ZFS or if this bug ONLY affects OpenZFS.

I think this comment in the link above answers this well:


The flaw, as written, was arguably in the original commit adding ZFS support in Solaris. (Maybe the expectations were different back then and it wasn't written down well? Dunno.)

But in practice, you couldn't hit this without something that would care about the distinction between the dnode being dirty and the buffers representing the modified contents being dirty, so then you can't hit it at least until 802e7b5fe (committed 2013, in 0.6.2 in April 2015), and even then, the code syncs txgs so often that at least in my quick experiments, it doesn't reproduce reliably until 905edb405d (committed 2015, in 0.6.5 in September 2015), which is probably the oldest example of "oops the gap got wider".
 
I think this comment in the link above answers this well:
Well... I dunno, because in the same post he writes that this issue was due in part to the commit that was made to OpenZFS in 2013.

So that's where I am a little bit confused as to how both statements can be true simultaneously.

(i.e. if it was in the original Solaris ZFS code, then the 2013 commit should, I would think, on the surface, have little to do with it. But in the post he writes explicitly about what changed in that commit, and how the check for the dnode being dirty was "incomplete", so that would suggest the commit is partially responsible for it. So I am trying to wrap my head around how both statements can be true at the same time.)
 
A better fix is in the works; the first version is already in our kernel packages as of proxmox-kernel-6.5.11-6-pve.
 
What's not totally clear to me: does this also apply to zvols, or only if I use ZFS as a POSIX filesystem?
 
Hi Fabian,

any chance you know with what release the second fix is deployed?

Thanks!
There were three iterations of a fix:
- disable block cloning (later turned out to be incomplete, but it already makes the issue very hard to trigger - this one we already included in 6.5.11-3-pve)
- first version of the dirty fix (complete, but not the most elegant - this one we included in 6.5.11-6-pve)
- second version of the dirty fix (not yet included in Proxmox kernels, but there is also no pressure, since the previous one already completely fixes the issue; it has also not yet been reviewed and accepted upstream!)
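
To see which kernel and ZFS build a node is actually running (and therefore which of these iterations it already has), a quick check could look like this (output formats assumed from PVE 8 / OpenZFS 2.x):

uname -r                                 # running kernel, e.g. 6.5.11-6-pve
zfs version                              # userland and kernel-module ZFS versions
pveversion -v | grep -E 'kernel|zfs'     # installed kernel and ZFS packages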
 
