Lost all data on ZFS RAID10

Just went to look at some of the most recent symptoms one can experience with this versatile filesystem; a fairly nice one is recent:

The HDD volumes can significantly affect the write performance of SSD volumes on the same server node

It's really by design...


Well, anyhow, this is a filesystem (and volume manager) that could not do reflinks till last year, after all:

- COW cp (--reflink) support

TRIM took till 2019 to get working:
- Add TRIM support

It took 10 years (since its inception) to fix its own death by SWAP:

- Support swap on zvol

And then it keeps coming back:

- Swap deadlock in 0.7.9
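
(For context, swap on a zvol is typically set up roughly like this, going by the OpenZFS docs recipe as I remember it; the pool name and size are placeholders:)
Code:
    # zvol tuned for swap: page-sized blocks, synchronous writes, no data caching
    zfs create -V 4G -b $(getconf PAGESIZE) \
        -o compression=zle -o logbias=throughput -o sync=always \
        -o primarycache=metadata -o secondarycache=none rpool/swap
    mkswap -f /dev/zvol/rpool/swap
    swapon /dev/zvol/rpool/swap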


Then there are the especially unlucky ZVOLs:

- ZVOL data corruption with zvol_use_blk_mq=1

- [Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs

Yes, solving this one caused the one above; also, some of the comments there are pure gold. Before and after the fix:
Code:
Sequential dd to one zvol, 8k volblocksize, no O_DIRECT:

    legacy submit_bio()     292MB/s write  453MB/s read
    this commit             453MB/s write  885MB/s read
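
For reference, such a sequential test boils down to something like this (illustrative only; tank/bench is a placeholder, and the commit's exact invocation isn't quoted above):
Code:
    # sequential write straight to the zvol block device, flushed at the end
    dd if=/dev/zero of=/dev/zvol/tank/bench bs=1M count=4096 conv=fsync status=progress
    # sequential read back
    dd if=/dev/zvol/tank/bench of=/dev/null bs=1M status=progress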

- Discard operations on empty zvols are "slow"


... I won't even go further back; this is just what I myself remember from attempting to use it as a backend for a hypervisor.

And every now and then there are mysteries like:


- Better performance with O_DIRECT on zvols

- ZVOL caused machine freeze

(yes, these come up on PVE; they might be other kernel-related issues, but that's what one gets for running "tainted" kernels)


@ubu And why do you prefer it?
 
I find it bizarre that the PVE install puts BTRFS on some "experimental" pedestal,
The dev team simply didn't put as much effort into the tooling and integration; consequently, it's just not that mature. It doesn't mean there is no maturity to the underlying file system. FWIW, I have a single-node deployment with BTRFS which seems to work fine, but I don't think I want to use it in anger.
Can you elaborate?
I'm sure he could; others (including myself) have, all over the forum. You seem to have some vendetta around this choice by the developers; just because you don't agree with the reasons for its (ZFS's) use and preferential status doesn't mean those reasons aren't there. Why is this such a passion for you?
 
The dev team simply didn't put as much effort into the tooling and integration; consequently, it's just not that mature.

Fair enough; I was mostly referring to the fact that ZFS by itself could be considered experimental, not really thinking of the tooling part. Then again, if the OP is not after the features (e.g. replication), what difference does it make to him (the better-tested tooling?).

I'm sure he could; others (including myself) have, all over the forum. You seem to have some vendetta around this choice by the developers;

When the OP starts a thread like this, what's wrong with asking him why he chose that particular filesystem? Just because it's popular on the forum? I don't remember off the top of my head what the default in the ISO installer is, probably not even ZFS, so it's not about me being upset about the choice of ZFS for e.g. replication support.

just because you don't agree with the reasons for its (ZFS's) use and preferential status doesn't mean those reasons aren't there. Why is this such a passion for you?

Listing a couple of factual links to trackers is passionate nowadays?
 
as ZFS or ext4?
Irrelevant in this case, since the data loss was in the striped mirror (aka RAID10).
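
For reference, a ZFS "RAID10" is simply a pool made of striped mirror vdevs, e.g. (device names are placeholders):
Code:
    # two 2-way mirror vdevs; writes are striped across the mirrors
    zpool create tank \
        mirror /dev/disk/by-id/ata-HDD1 /dev/disk/by-id/ata-HDD2 \
        mirror /dev/disk/by-id/ata-HDD3 /dev/disk/by-id/ata-HDD4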

I can accept that the PVE system HDD crashed, but I do not understand why it wiped all zvols on the zpool. If there were some power outage and my server went down, then maybe I could accept that ZFS could be broken
But a system crash is like a power outage!
It seems some precious data was in the disk's cache and was lost because the disk's cache isn't protected.
That's why a HW RAID controller disables the disk's cache and uses its own battery-protected cache instead.
Datacenter flash disks' caches are protected too, thanks to their PLP; that's why they're recommended.
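
Whether a drive's volatile write cache is even enabled can be checked with something like this (device name is a placeholder):
Code:
    # report the volatile write cache state (SATA/SAS)
    smartctl -g wcache /dev/sda
    # or, for ATA drives
    hdparm -W /dev/sda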
 
But a system crash is like a power outage!
It seems some precious data was in the disk's cache and was lost because the disk's cache isn't protected.

But the OP never mentioned that he used a SLOG on the SSD; it sounds more like he had a regular PVE ISO install on ZFS, which shredded the SSD due to its write amplification. The total pool failure should not have happened, not even on power loss. You don't lose even a regular journalled filesystem on a power loss.

You seem to have some vendetta around this choice by the developers; just because you don't agree with the reasons for its (ZFS's) use and preferential status doesn't mean those reasons aren't there.

Just last night, another ZFS-related post here (on why autoreplace does not work with /dev/disk/by-id) made me go and check how old some of the ZFS documentation actually is. And it is really old, 8 years:

https://pve.proxmox.com/mediawiki/index.php?title=ZFS_on_Linux&action=history
https://github.com/proxmox/pve-docs/blame/fe4a583789b272137034b6d43f3ca1a05df50961/local-zfs.adoc

I would not nitpick about using a 4-year-old name in the title (though it implies there's no time to update even that, so the quality must be low), but the "advantages" part is clearly vague, and any sort of reasoned rationale (that one could actually challenge on facts) is completely non-existent. E.g. "encryption" (not recommended to this day), "can use SSD for cache", or "Continuous integrity checking" (as if Red Hat has been missing out on this all along?). It just looks like the choice was made at the time because ZFS appeared to be a panacea for missing features, but that turned out to be quite a fallacy.
 
it sounds more like he had a regular PVE ISO install on ZFS, which shredded the SSD due to its write amplification
Perhaps; we will never know, as the PVE system is dead now. The OP wrote that PVE was installed as ext4 or xfs.
Due to the PVE system SSD crash, the other ZFS pool can be degraded.
Even though a loss should never happen, it seems to be possible.
 
Perhaps; we will never know, as the PVE system is dead now

Well, he can tell us.

The OP wrote that PVE was installed as ext4 or xfs.

You mean he did NOT write that? I think there was no mention of how root was installed. (I had assumed just a ZFS ISO install.)

Due to the PVE system SSD crash, the other ZFS pool can be degraded.

I don't quite get what you are implying with this one. Why would one (single-vdev) pool degrade anything about (any) other pool(s)?
 
Without a forensic analysis of the ZFS pool, we will never know for certain what happened in this case and the reason for it. It is a very unlikely problem, but very unlikely things happen. I am quite sure the data could be recovered, but it would probably be expensive and time-consuming.
I understand his frustration, I had an XFS filesystem die on me once, but that is no reason to bitch about ZFS, which is a very solid filesystem with a lot of great features and a long proven track record on very big storage systems.
It is definitely one of the best choices for Proxmox.
Btrfs has a similar feature set but less real-world usage, and I prefer the handling and tooling of ZFS; then again, I do not have much experience with btrfs, while I have been using ZFS since OpenSolaris, 2007 or so.
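
If anyone attempted a recovery, the usual first steps would be along these lines (pool name is a placeholder, and none of this is guaranteed to help):
Code:
    # dry run: check whether a rewind/recovery import could succeed, without changing anything
    zpool import -F -n tank
    # read-only import, discarding the last few transactions if necessary
    zpool import -o readonly=on -F tank
    # low-level inspection of the on-disk state of the (exported) pool
    zdb -e tank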
 
I really wish you would have suggested something (other than what e.g. the outdated PVE docs list, the old white paper, or marketing-like terms) that is of value to YOU in your hypervisor use case. The paper of course does not cover anything related to SSDs or hypervisors; the concepts were impressive in their day, and it would have made a great filesystem for e.g. archival purposes. Even then, put against e.g. XFS, it was a behemoth that does not scale performance-wise. I just feel it's possibly the worst choice for the use case being discussed, and the feature set is not relevant for a hypervisor (where e.g. you want to use the resources for your guests, not host overhead).
 
It is a very unlikely problem, but very unlikely things happen.

I do not use ZFS that much (anymore), but I have used it enough, also since the Solaris days. FWIW, I never had it fail on me on Solaris or BSD back when it had its own codebase. On Linux it's been buggy all along. There are too many corruption-related bugs still coming up that simply do not arise in mature code bases.

In my view, this attitude is what caused this whole thread: the marketing for ZFS, especially on a forum like this, is great, but then reality kicks in. In this case it kicked in early, so at least lessons were learned.
 
Just two more power outages today for ZFS, the safest filesystem ever on the planet, with the same end result of the data being gone, so you are not alone ...
https://forums.truenas.com/t/pool-online-unable-to-import/14129
https://www.reddit.com/r/selfhosted/comments/1fplu8v/just_lost_24tb_of_media/

I think I have added enough here to "support" the case, but just a factual note: I have also had ZFS lose pools NOT on a power outage. It's software, it has bugs. The more complex, the more likely.
 
I really wish you would have suggested something (other than what e.g. the outdated PVE docs list, the old white paper, or marketing-like terms) that is of value to YOU in your hypervisor use case. The paper of course does not cover anything related to SSDs or hypervisors; the concepts were impressive in their day, and it would have made a great filesystem for e.g. archival purposes. Even then, put against e.g. XFS, it was a behemoth that does not scale performance-wise. I just feel it's possibly the worst choice for the use case being discussed, and the feature set is not relevant for a hypervisor (where e.g. you want to use the resources for your guests, not host overhead).
You asked if ZFS was designed with virtualisation in mind; yes, it was designed for use with Solaris zones.
ZFS provides subvolumes (used on Proxmox for LXC containers) and zvols (used on Proxmox for KVM virtual machines).
ZFS provides snapshots and zfs send/receive, used on Proxmox for periodic storage replication to another node in a cluster.
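
By hand, that replication flow looks roughly like this (a minimal sketch; the dataset name, snapshot names and node2 are placeholders, and PVE's replication automates this kind of sequence):
Code:
    # point-in-time snapshot of a VM's zvol (or a container's dataset)
    zfs snapshot rpool/data/vm-100-disk-0@rep1
    # full send to another node
    zfs send rpool/data/vm-100-disk-0@rep1 | ssh node2 zfs receive -F rpool/data/vm-100-disk-0
    # later: incremental send of only the changes since the previous snapshot
    zfs snapshot rpool/data/vm-100-disk-0@rep2
    zfs send -i @rep1 rpool/data/vm-100-disk-0@rep2 | ssh node2 zfs receive rpool/data/vm-100-disk-0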

As for any filesystem:
- Avoid cheap Disks without PLP
- HAVE Backups, something will fail, it is just a matter of time

I have lost ext filesystems, XFS filesystems, and ReiserFS filesystems over the last 25 years; I have not yet lost a ZFS pool, fortunately, but I know it can (and will) happen if I get old enough ;)

Have backups ;) PBS makes that really easy.

I fully understand that, with this data loss happening to you, you might not trust ZFS. I do not have much experience with BTRFS, but it might be an alternative.

I hope in the future your data will be safe, good luck.
 
Thank you for your answer, first of all.

You asked if ZFS was designed with virtualisation in mind; yes, it was designed for use with Solaris zones.

Fair enough; I do not think it was mentioned in the paper, and I have only minimal experience with Solaris containers to contest this. It would all have been on spinning disks back then anyway.

ZFS provides subvolumes (used on Proxmox for LXC containers) and zvols (used on Proxmox for KVM virtual machines).

Just a nitpick: I think you used BTRFS terminology ("subvolumes") for what ZFS calls a dataset. I only mention this because I noticed lots of misconstrued terms in the docs, e.g. calling ZFS mirrors RAID1, or making a distinction between a dataset and a zvol (where a zvol is a special type of dataset). Yes, most of the time everyone knows what the others mean, but sometimes it makes a difference.
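
To illustrate the distinction (the names below are just made up for the example):
Code:
    # a filesystem dataset (the kind PVE uses for LXC container volumes)
    zfs create rpool/data/subvol-101-disk-0
    # a zvol, i.e. a dataset of type "volume", exposed as a block device (used for VM disks)
    zfs create -V 32G rpool/data/vm-100-disk-0
    # both are datasets; they differ only in type
    zfs list -o name,type,volsize -t filesystem,volume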

ZFS provides snapshots and zfs send/receive, used on Proxmox for periodic storage replication to another node in a cluster.

I would just add that it is a limitation of PVE to only support send/receive on ZFS. The same is possible with BTRFS, just the storage plugin is missing. But then, by their own admission, shared storage is preferred over replicas. And if replication was the selling point of ZFS for PVE, then everyone using shared storage would lack any reason to use ZFS at all.

As for any filesystem:
- Avoid cheap Disks without PLP

I just can't help but make one more comment on these "PLP only" suggestions. I think this reasoning does not work TODAY. For one, power loss does not really happen in UPS-backed environments, and if it does, one has backups; it's not like the old days when a power loss could actually kill an SSD. Secondly, we are now in a period where actually good SSDs without PLP have 2-4x the TBW of a PLP one that maxes out at a quarter of the capacity, and they cost around half. Finally, there isn't even a good choice of 2280 form factor NVMe SSDs with PLP at decent capacities nowadays.

PVE itself can get a corrupt config.db (the database backing /etc/pve) from software glitches alone; it's not a non-PLP-induced issue. Moreover, it suffers from terrible write amplification, which leads me to believe that was the original reason for suggesting PLP drives all along, plus then the choice of ZFS. But affordable PLP SSDs do not have higher TBWs anymore (case in point: the not-so-stellar but just-okay WD SN700 at 1TB has 2000 TBW).

When you look at hyperscalers, they are often on commodity-like hardware; everything fails all of the time. If I have ~10 nodes with commodity SSDs in a well set-up cluster and they keep failing (as I keep RMA'ing them), I would not mind that they are less reliable.

This also goes against the whole point behind ZFS and commodity hardware: no expensive RAID controllers, etc. It does not even make sense; with all the checksumming and backups, corruption will of course be detected and such a drive tossed away. Availability is then guaranteed by the cluster itself.

- HAVE Backups, something will fail, it is just a matter of time

Yes, and I actually agree with the rest of what you wrote too.
 
I am sorry, I had no notifications of the new posts, so I think what was lost is lost.
As I wrote, the OS was installed to a single consumer NVMe with LVM and ext4, and the 4x consumer HDDs worked as a ZFS RAID10 zpool (not RAID1 as I wrote before, that was a typo).

The zpool had the correct name, the import was OK, and the partitions on all disks looked OK too, but
zfs list -t snapshot and zfs list came back empty.
Yet zpool history -il shows the history of my zpool.
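
Roughly, the checks above as commands (the pool name here is a placeholder):
Code:
    zpool import tank          # the pool imports fine under its correct name
    zpool status -v tank       # all vdevs/partitions look healthy
    zfs list -t all -r tank    # ...yet no datasets or snapshots are listed
    zpool history -il tank     # but the pool's full command history is still there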
 
