Understand and prevent (single drive) ZFS data loss

Mitmischer

New Member
May 18, 2024
Hello,
first of all, I'm looking for advice on a ZFS-related issue, and googling led me to this forum. In fact, I'm not using Proxmox but plain Ubuntu on my server.
I hope that's fine with you.

My dataset is called `storj`. It resides on a 16 TB HDD (which is connected via USB 3 due to my SATA ports all being in use) and has 5G of log and 100G of cache, both of which reside on an NVMe SSD.

To ensure high read performance, I chose the following slightly dangerous settings:

```
> cat /sys/module/zfs/parameters/zfs_txg_timeout
15
> zfs get all storj | grep sync
storj sync disabled local
```

I was completely fine with losing the last 15 seconds of data on unsafe shutdowns (https://unix.stackexchange.com/ques...ication-crash-consistency-with-sync-disabled; https://www.reddit.com/r/homelab/comments/7b0op7/zfs_sync_disabled/; https://www.reddit.com/r/truenas/comments/spwlk2/zillog_and_realistic_chances_of_catastrophe_with/; https://www.reddit.com/r/zfs/comments/c45z4i/assuming_a_proper_ups_setup_lots_of_ram_is/). But yesterday, my home lost power, I lost way more than the last 15 seconds, and I want to understand why. ZFS is praised for its resilience, yet this is my first major data loss with any file system. On my desktop PC, which went through a lot of hard shutdowns, I only ever lost fragments of files with NTFS or ext4. Are those file systems more resilient to power failure in the end?

As a short summary: after the power outage, I got an I/O error (`Destroy and re-create the pool from a backup source.`) when trying to import the pool. After trying `-fFm` and a read-only import without success, I went with `echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata` as suggested here (https://www.reddit.com/r/zfs/comments/ey67l5/pool_crashed_data_is_saved_but_still_have_some/; https://www.reddit.com/r/truenas/comments/1755eew/zdb_scrub_roll_back_transactions_what_when_and/) and got the pool back immediately.
I noticed that during the `zpool import`, `/proc/spl/kstat/zfs/dbgmsg` mentioned only 2 metadata errors.
I wonder: in what state and at which txg did my pool get imported, and how does ZFS make decisions about inconsistencies? Would it have been better to run `-X` instead of skipping the metadata verification (https://www.reddit.com/r/zfs/comments/13vmg1u/what_does_pool_import_fxn_do/)?
I ruled out hardware failure as a cause (the drive is 4 months old with good SMART values, and the RAM is ECC and tested on each boot).
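For anyone who finds this later, here is a rough sketch of the import options I touched, in increasing order of risk. The pool name is `storj`; treat this as notes rather than a recipe, since I'm not certain every step is advisable:

```
# dry run: with -F, -n only reports whether a rewind to an older txg would work
zpool import -F -n storj

# read-only recovery import: salvage data without writing anything to the pool
zpool import -o readonly=on -F storj

# what I ended up doing: skip metadata verification while loading the pool
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
zpool import storj

# extreme rewind: tries much older txgs and can discard recent transactions
zpool import -F -X storj
```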

I then ran a scrub (still running) and so far I have 76 data errors in folders (!), which are inaccessible, so I think all the files inside are lost.
Clearly, those folders were not fully populated in just 15 seconds, and I'm surprised to see so much data go.
> Is there any chance of recovering the folders, or are all the files gone?
> Suppose that only a few files in a folder were changed. Can this corrupt the whole folder?
> ZFS is CoW. In my mind, this means that old data should always stay consistent. But here, the corruption seems to have spilled over to old files and higher-level folders. Why didn't I benefit from CoW here?
> The scrub gave me roughly 50K checksum errors so far. I think that this amount of corruption also could not have happened in just 15 seconds. Is it likely that some "chained checksums" were corrupted here, or what is going on? That's maybe naive, but is there a way to fix the inconsistencies by throwing out the checksums instead of the folders?
I also noticed this during recovery - why is the txg given by `zdb -C | grep txg` so far behind the ones shown by `zdb -ul | grep txg`?
How could that happen when data is flushed to disk every 15 seconds?
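For completeness, this is the comparison I mean; the pool name is `storj` and the device path is only an example:

```
# txg recorded in the cached pool configuration
zdb -C storj | grep txg

# txgs of the uberblocks stored in the on-disk labels of the data device
zdb -ul /dev/sdX1 | grep txg
```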

By running a single disk, I'm violating one of ZFS's best practices. But I wonder - would mirroring have prevented this kind of error? Remember that the drive in question is fine.
Also, would having a recent snapshot have made the import more likely to succeed?

Would you recommend disabling the drive's write cache, as that might improve performance (https://forum.proxmox.com/threads/slow-zfs-performance.51717/) and resilience (https://www.klennet.com/notes/2018-12-30-damaged-zfs-pools.aspx)?
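In case it helps, a sketch of how the on-disk write cache could be toggled with hdparm; the device path is an example, and many USB-SATA bridges do not pass these ATA commands through, so this may simply fail over USB:

```
# show the drive's current write-cache setting
hdparm -W /dev/sdX

# disable the volatile write cache (usually costs some write performance)
hdparm -W 0 /dev/sdX
```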

I'm not here to rant about ZFS; I rather want to understand why it failed so badly. I want to discuss hardening approaches and possible flaws in the ecosystem (better recovery tools?), and to collect information about recovery (which is scattered all over the place).
So far, I have found that
* ZFS is very sensitive to the free space map, even though it is not needed for a pure read-only recovery (https://github.com/openzfs/zfs/issues/10085);
* the `zfs_recover` tunable seems to let zdb and pool imports get past certain fatal errors (https://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/)? See the sketch below.
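A sketch of what I understand the second point to mean on a current Linux/OpenZFS system (the parameter and flags exist there; whether using them is wise in any given situation is another question):

```
# relax certain otherwise-fatal errors to warnings (last-resort tunable)
echo 1 > /sys/module/zfs/parameters/zfs_recover

# inspect the exported/unimportable pool; -AAA ignores assertion failures
zdb -e -AAA storj
```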

Does anyone have experience with recovering other file systems? I have the impression that ZFS takes a lot of measures to prevent data loss but actual resilience relies on redundancy. When there is no redundant information, ZFS recovery appears rather weak to me (surely because of its complexity). The lack of an fsck hinders recovery in critical cases. I cannot recall losing so much data to ext4, for example - only ever a few inodes despite a lot of unsafe shutdowns. Why does ZFS appear so fragile without its redundancy?

To summarize my question: How did the disk state get inconsistent despite no hardware errors being involved? Did the CoW property not work here (in that it should leave existing data unharmed) - should ZFS not have simply discarded/rolled back the unfinished transactions? How could so much data get corrupted at once? What can be done to make power losses less terrifying (apart from a UPS, which still leaves kernel panics as a possible source of disaster)?
 
My dataset is called `storj`
How about this forum: https://forum.storj.io ? Or a ZFS or Ubuntu forum?
which is connected via USB 3
This was your first deadly sin.

5G of log and 100G of cache

First of all, you shouldn't put both on the same drive.
Second, a SLOG device only speeds up sync writes.
Third, there is no "cache" in ZFS. You mean L2ARC?
Generally speaking, L2ARC and SLOG are completely useless for most newcomers, because they misunderstand their workload or what these mechanisms actually do.
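For the record, this is roughly what the two vdev types look like when they are added; pool name and partition paths are placeholders:

```
# SLOG: only absorbs synchronous writes (the ZIL), it is not a write cache
zpool add storj log /dev/disk/by-id/nvme-SSD-part1

# L2ARC: a second-level read cache behind the in-RAM ARC
zpool add storj cache /dev/disk/by-id/nvme-SSD-part2
```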

I chose the following slightly dangerous settings
Only dangerous if you need sync. STORJ finally turned off sync writes in the latest update (apparently).
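If sync semantics do matter for a dataset after all, reverting to the default is a one-liner; the dataset name is taken from the post above:

```
# restore POSIX sync semantics for the dataset
zfs set sync=standard storj
```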

But yesterday, my home lost power, I lost way more than the last 15 seconds, and I want to understand why.
And you know that you lost more than 15 seconds because...?
On my desktop PC, which went through a lot of hard shutdowns, I only ever lost fragments of files with NTFS or ext4.
Yeah, that is what journaling file systems (and not lying about sync) are for.
Are those file systems more resilient to power failure in the end?
no
The scrub gave me roughly 50K checksum errors so far
And past scrubs gave you 0 errors? What makes you think these errors are due to your shutdown?
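One way to check would be to compare the current error list against the pool's command history (a sketch; pool name from the post above, and note that `zpool history` records the scrub commands, not their results):

```
# list affected files plus read/write/checksum error counters
zpool status -v storj

# when were previous scrubs started?
zpool history storj | grep scrub
```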

By running a single disk, I'm violating one of ZFS's best practices.
Not really. ZFS is totally fine with single disks or even stripes. What ZFS does not like is indirect access to HDDs, such as via USB.
would mirroring have prevented this kind of error?
Of course not. A mirror is for redundancy, nothing more, nothing less.
I'm not here to rant about ZFS; I rather want to understand why it failed so badly.
Because you violated the hardware requirements.
possible flaws in the ecosystem (better recovery tools?)
ZFS is rock solid, and in case of user error there are backups. ZFS does not need recovery tools, so there probably never will be any.

I have the impression that ZFS takes a lot of measures to prevent data loss but actual resilience relies on redundancy.
You are mixing up completely different things. High availability, data loss prevention, and data consistency are not the same thing. ZFS takes a lot of measures to prevent data corruption.
To summarize my question: How did the disk state get inconsistent despite no hardware errors being involved?

To summarize my answer: play stupid games, win stupid prizes.

If you want to use ZFS, I recommend reading this and especially this part.
 
> It resides on a 16 TB HDD (which is connected via USB 3 due to my SATA ports all being in use)

As mentioned previously, trying to run your storage off a single USB3 disk is not optimal.

There is no substitute for:

A) Redundancy - you get self-healing scrubs with at least a mirror - and
B) Backups.

You cannot possibly expect to implement a single large point of failure - on USB3 no less - and have no consequences if something goes sideways.

Sorry for your loss, but you kinda built on sand here instead of a solid foundation. The only thing you can do is take up best practices and rebuild better. This includes:

Use an actively-cooled HBA and a disk enclosure / shelf with NAS-rated (or SAS) disks for ZFS (requires a free PCIe slot)

Some recommended homelab-level equipment on a budget here:
https://github.com/kneutron/ansitest/blob/master/ZFS/zfs-parts-list-60TB-backup-raidz1.xlsx

ECC RAM if possible / affordable

Backups. Regularly.

Instead of 1x 16 TB disk, you really need 3: one for a mirror and at least one to back up to, in case something happens to the mirror. Or an equivalent-sized raidz2 array with smaller disks, preferably with extra free space.
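A sketch of what that layout could look like; pool, dataset, and device names are placeholders:

```
# two-way mirror: a scrub can then repair bad blocks from the healthy copy
zpool create tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B

# third disk as an independent backup pool, fed by snapshot replication
zpool create backup /dev/disk/by-id/ata-DISK_C
zfs snapshot tank/data@2024-05-18
zfs send tank/data@2024-05-18 | zfs recv backup/data
```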

The problem with ZFS servers is that you can never really have just one ;-) You should see my homelab. It's definitely an investment.
 
