Hello,
First of all, I'm looking for advice on ZFS-related issues, and googling led me to this forum. I'm not actually using Proxmox, though, but plain Ubuntu on my server.
I hope that's fine with you.
My dataset is called `storj`. It resides on a 16 TB HDD (connected via USB 3, since all my SATA ports are in use) and has a 5 GB log (SLOG) and a 100 GB cache (L2ARC), both of which reside on an NVMe SSD.
To ensure high read performance, I chose the following slightly dangerous settings:
```
> cat /sys/module/zfs/parameters/zfs_txg_timeout
15
> zfs get all storj | grep sync
storj sync disabled local
```
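For reference, this is roughly how those settings get applied (a sketch; the sysfs tunable is not persistent across reboots unless also set via `/etc/modprobe.d/`):

```shell
# Flush transaction groups every 15 s instead of the default 5 s.
# Not persistent -- add "options zfs zfs_txg_timeout=15" to
# /etc/modprobe.d/zfs.conf to keep it across reboots.
echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout

# Disable synchronous writes on the dataset. Applications lose the
# durability guarantee of fsync()/O_SYNC -- this is the dangerous part.
zfs set sync=disabled storj
```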
I was completely fine with losing the last 15 seconds of data on unsafe shutdowns (https://unix.stackexchange.com/ques...ication-crash-consistency-with-sync-disabled; https://www.reddit.com/r/homelab/comments/7b0op7/zfs_sync_disabled/; https://www.reddit.com/r/truenas/comments/spwlk2/zillog_and_realistic_chances_of_catastrophe_with/; https://www.reddit.com/r/zfs/comments/c45z4i/assuming_a_proper_ups_setup_lots_of_ram_is/). But yesterday, my home lost power, I lost far more than the last 15 seconds, and I want to understand why. ZFS is praised for its resilience, yet this is my first major data loss with any file system. On my desktop PC, which went through a lot of hard shutdowns, I only ever lost fragments of files with NTFS or ext4. Are those file systems more resilient to power failure after all?
As a short summary: after the power outage, the pool failed to import with an I/O error (`Destroy and re-create the pool from a backup source.`). After trying `zpool import -fFm` and a read-only import without success, I went with `echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata` as suggested here (https://www.reddit.com/r/zfs/comments/ey67l5/pool_crashed_data_is_saved_but_still_have_some/; https://www.reddit.com/r/truenas/comments/1755eew/zdb_scrub_roll_back_transactions_what_when_and/) and got the pool back immediately.
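Reconstructed from memory, the escalation looked roughly like this (`storj` is my pool name; the last step skips metadata verification during pool load and should be treated as a last resort):

```shell
# Normal import attempts, escalating:
zpool import storj                   # failed with the I/O error
zpool import -fFm storj              # force, rewind to last good txg, ignore missing log
zpool import -o readonly=on storj    # read-only attempt

# Last resort: skip metadata verification during pool load
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
zpool import -f storj

# Re-enable verification afterwards
echo 1 > /sys/module/zfs/parameters/spa_load_verify_metadata
```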
I noticed that upon `zpool import`, `/proc/spl/kstat/zfs/dbgmsg` mentioned only 2 metadata errors.
I wonder - in what state and at which txg did my pool get imported, and how does ZFS make decisions about inconsistencies? Would it have been better to run `zpool import -X` instead of skipping the metadata verification (https://www.reddit.com/r/zfs/comments/13vmg1u/what_does_pool_import_fxn_do/)?
I ruled out hardware failure as a cause (the drive is 4 months old with good SMART values; the RAM is ECC and tested on each boot).
I then started a scrub (still running), and so far it has reported 76 data errors on folders (!), which are now inaccessible, so I assume all the files inside are lost.
Clearly, those folders were not fully populated within just 15 seconds, and I'm surprised to see so much data go.
> Is there any chance of recovering the folders, or are all the files gone?
> Suppose that only a few files in the folder were changed. Can this corrupt the whole folder?
> ZFS is CoW. In my mind, this means that old data should always remain consistent. But here, the corruption seems to have spilled over into old files and higher-level folders. Why didn't I benefit from CoW here?
> The scrub has given me roughly 50K checksum errors so far. I don't think this amount of corruption could have happened in just 15 seconds either. Is it likely that some "chained checksums" were corrupted here, or what is going on? This may be naive, but is there a way to fix the inconsistencies by discarding the checksums instead of the folders?
I also noticed something during recovery - why is the txg reported by `zdb -C | grep txg` so far behind the ones in `zdb -ul | grep txg`?
How could that happen when data is flushed to disk every 15 seconds?
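The commands I compared, in case it matters (`/dev/sdX1` is a placeholder for the pool's partition):

```shell
# txg recorded in the cached pool configuration
zdb -C storj | grep txg

# txgs of the uberblocks stored in the on-disk labels
zdb -ul /dev/sdX1 | grep txg
```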
By running a single disk, I violate one of ZFS's best practices. But I wonder - would mirroring have prevented this kind of error? Remember that the drive in question is fine.
Also, would having a recent snapshot have made the import more likely to succeed?
Would you recommend disabling the drive's write cache? That might improve performance (https://forum.proxmox.com/threads/slow-zfs-performance.51717/) and resilience (https://www.klennet.com/notes/2018-12-30-damaged-zfs-pools.aspx).
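If I were to try that, I assume it would look something like this (`/dev/sdX` is a placeholder; USB bridges often ignore or mistranslate these commands, so the setting needs to be verified after issuing it):

```shell
# Check the current write-cache state of the drive
hdparm -W /dev/sdX

# Disable the on-drive volatile write cache (ATA/SATA)
hdparm -W 0 /dev/sdX

# For USB/SCSI-attached drives, sdparm may work where hdparm does not
sdparm --clear WCE /dev/sdX
```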
I'm not here to rant about ZFS; rather, I want to understand why it failed so badly, discuss hardening approaches and possible flaws in the ecosystem (better recovery tools?), and collect recovery information (which is scattered all over the place).
So far, I have found that:
* ZFS is very sensitive to the space map, which is unnecessary for pure recovery (https://github.com/openzfs/zfs/issues/10085);
* the `zfs_recover` module parameter empowers zdb (https://sigtar.com/2009/10/19/opensolaris-zfs-recovery-after-kernel-panic/).
Does anyone have experience with recovering other file systems? I have the impression that ZFS takes a lot of measures to prevent data loss, but its actual resilience relies on redundancy. When there is no redundant copy, ZFS recovery appears rather weak to me (surely because of its complexity). The lack of an fsck tool hinders recovery in critical cases. I cannot recall ever losing this much data to ext4, for example - only a few inodes, despite many unsafe shutdowns. Why does ZFS appear so fragile without redundancy?
To summarize my questions: How did the on-disk state become inconsistent despite no hardware errors being involved? Did the CoW property not work here (in that it leaves existing data unharmed) - shouldn't ZFS simply have discarded/rolled back the unfinished transactions? How could so much data get corrupted at once? And what can be done to make power losses less terrifying (apart from a UPS, which still leaves kernel panics as a possible source of disaster)?