ZFS on Debian or mdadm software RAID? Stability and reliability of ZFS

pgro

Oct 20, 2022
Hello Everyone

Some guys at work are concerned about the stability and reliability of ZFS on Linux, and on Debian in particular.
This is about installations on vessels (ships) that are traditionally expected to stay in service for about 10 years. We are now moving to Proxmox virtualization, and the Proxmox people fully support and endorse ZFS.

However, some people in the company are concerned, since their expertise lies in ext4, mdadm, and so on.

How stable is Proxmox/Debian with ZFS when it comes to updates? Is there a risk of losing or breaking things, given that ZFS is not built into the kernel?
Some users stopped using ZFS on root on Linux, mainly because it did not offer the same advantages as ZFS on root on FreeBSD (e.g. no boot environments).

There is also Linus Torvalds' statement from 2020: "do not use ZFS". LINK1 LINK2

Some others claim:

One should make a clear distinction between boot-on-ZFS and root-on-ZFS.
I don't consider Linux stable enough for boot-on-ZFS, i.e. the kernel itself on ZFS.
I prefer ext4 or ext2 for /boot.

For root, the old JFS, ext4, XFS and ZFS are all good options.

Some guy also claimed

Yes, Linus threw a tantrum because the OpenZFS guys wouldn't give him code under terms he likes, so he decided to spread FUD about ZFS.

The statement "(ZFS) was always more of a buzzword than anything else..." is so patently and obviously absurd that I can't believe someone as smart as Linus uttered it in good faith. ZFS is more than a filesystem. It's a volume manager and software RAID layer, and it makes the Linux md and LVM crapola look primitive.
 
At least the PVE development team decided to use ZFS instead of mdadm, because in their opinion ZFS is the more reliable software RAID solution. So ZFS is officially supported and mdadm is not (although it still works). I would stick with the solution that is supported and well tested (by the staff before releasing an update, and by nearly all PVE users after the release).
You could of course install Debian with mdadm and put the proxmox-ve package on top of that, but then you are running a custom setup that is more likely to run into complications, compared to a default, unmodified PVE installation.
 
Thank you

How am I protected in RAID 1 if, for some unknown reason, my pool becomes unusable or damaged?
 
That can also happen with mdadm. I've read enough threads here where people lost their mdadm arrays.

Maybe the developers can tell us something about their design decisions; I would like to hear that too.

The only other officially supported software RAID is btrfs, and that was only added recently, so it should be considered quite experimental on the PVE side. And btrfs is not as mature as ZFS, with features like RAID5 still marked unstable on the kernel side, according to this status list:
https://btrfs.wiki.kernel.org/index.php/Status
 
The first thing that comes to my mind when talking about md/dm RAID is this: the default caching mode we use for VMs is 'none', which means the VMs access their disks with the O_DIRECT flag. In this case the MD/DM RAID implementations simply forward a pointer to the memory region to each individual block device, and each of those copies the data from memory separately. If a second thread is writing to that memory at the same time, the underlying disks will sooner or later write different data, immediately corrupting the RAID.[1]

In [1] I also mentioned a real case where this can happen: an in-progress write to swap for memory that is simultaneously freed, so the swap entry is already discarded while the disk I/O is still happening, causing the RAID to become degraded.

Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches, consistency be damned, completely disregarding the one job that, for example, a RAID1 has: actually writing the *same* data to both disks...

And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.

If you take care not to use this caching mode, you'll be fine with md/dm RAID, though you'll be utilizing some more memory.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
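
To make the "take care not to use this mode" part concrete: the cache mode is a per-disk VM option in PVE. A hedged sketch (the VM ID 100 and the volume name are placeholders; check man qm for the exact syntax on your version):

  # show the current disk line; cache=none (the default) means O_DIRECT is used
  qm config 100 | grep scsi0

  # re-specify the disk with an explicit cache mode, e.g. writeback,
  # so writes go through the host page cache instead of O_DIRECT
  qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=writeback

That is the "utilizing some more memory" trade-off mentioned above.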
 
I'd choose ZFS over MDADM anytime.
The fact that ZFS can ensure your data's consistency is alone worth it. It is also much more versatile, in my opinion.

ZFS on Linux has come a long way and is, as we speak, already standard in many Linux distributions.
I had used it on PVE for years before I migrated my file service from mdadm to ZFS.

However, some people in the company are concerned, since their expertise lies in ext4, mdadm, and so on.
In which case we should all go back to horses and avoid these "new things" called cars.

Ask your colleagues how they handle (e.g. detect) bit rot on mdadm and ext4 filesystems.
Data is not worth anything if you can't trust its consistency...
 
Well, arguably those are not cases where virtualization is or should be used. It may be okay to use mdadm (and it is used in thousands of servers, as far as I know) in POS and similar servers, but for virtualization you want something rock stable.
 
There are use cases where LVM/ext are preferable.
I expressed my opinion; this is theirs.

If I can't rely on data consistency, it is not worth it. So your application has to detect bit errors itself, and then what? Can it recover?
ZFS needs resources because it provides benefits through them. IMHO this is a fair trade-off.

There are fewer ways to configure mdadm wrong (with ZFS you have to think about things like limiting the ARC usage), I have to admit that.
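
For reference, the ARC limit is the usual example of such a knob; a sketch of how it is typically set on PVE, with an arbitrary 8 GiB cap as illustration (size it to your host and workload):

  # /etc/modprobe.d/zfs.conf - cap the ZFS ARC at 8 GiB (8 * 1024^3 bytes)
  options zfs zfs_arc_max=8589934592

  # apply to the running system, then rebuild the initramfs so it survives a reboot
  echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
  update-initramfs -u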
 
I switched some guests from a ZFS RAID to single-disk LVM-Thin, as this cuts the writes to about a third, so roughly 3x the SSD lifespan. But I only use it for my monitoring server, logging server and blockchain databases, where I really don't care if some metrics or logs get corrupted.
But yes, I also wouldn't store important data on storage that is missing bit rot protection, or on a host without ECC RAM.
 
How stable is Proxmox/Debian with ZFS when it comes to updates? Is there a risk of losing or breaking things, given that ZFS is not built into the kernel?
Some users stopped using ZFS on root on Linux, mainly because it did not offer the same advantages as ZFS on root on FreeBSD (e.g. no boot environments).
If you use Proxmox VE, it already comes with ZFS out of the box, which addresses most of the concerns you might have as an admin. It also means that you won't have to deal with incompatibilities between newer kernel versions and ZFS, something that can happen if you use a distribution that does not ship a kernel with ZFS out of the box.

Even if you might not have boot environments out of the box, there are still a lot of reasons why I prefer ZFS as the root FS. As already mentioned, checksumming of everything (data and metadata) is pretty much the most important one. By doing scrubs on a regular basis, ZFS can detect bit rot on disks, and if the same disk regularly reports checksum mismatches, you know you should look into it, as the disk, the cable or something else might need to be replaced.
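
In practice that check boils down to two commands; a minimal sketch, assuming the default rpool created by the PVE installer (Debian's zfsutils-linux usually schedules a monthly scrub for you as well):

  # start a scrub, i.e. read and verify every block against its checksum
  zpool scrub rpool

  # check the result; growing CKSUM counters on one device point to a bad disk or cable
  zpool status -v rpool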

In the context of VMs, the behavior of device mapper block devs mentioned earlier by @wbumiller is also something you should consider. It is the main reason why Proxmox VE does not support MD RAIDs officially.

"Issues" of higher resource usages with ZFS can either be dealt with by throwing money at it (more or faster HW) or are worth taking since you get a very high level of data integrity. Other features of ZFS, such as easy separation with datasets, send/recv (used for VM replication in Proxmox VE), simple tooling to deal with it, etc. etc. are more in the field of personal preference or if you actually need it.
However, some people in the company are concerned, since their expertise lies in ext4, mdadm, and so on.
Absolutely understandable, as people usually like the stack they know and have experience with. But maybe they should try out something new to see how well it performs. By choosing a distribution that delivers ZFS out of the box (like Proxmox VE), you don't have to deal with possible incompatibilities between the Linux kernel and ZFS yourself and can enjoy the nice things it provides. ;)
 
Thank you everyone for your great contributions. The resources here helped me understand that ZFS is well supported on Proxmox. I still need to look at performance and verify the available RAM to make sure my data stays safe in RAID 1 mode. I also need to decide how to handle swap and whether a swap partition or device is actually needed for my Proxmox server. Some say to disable swap completely, others say to enable it at around 10% of RAM, while at the same time Proxmox suggests not putting swap on ZFS because of performance degradation. It's strange that I cannot map/partition my disk layout during the Proxmox installation in ZFS RAID 1 mode in order to leave some free space for swap.
 
It's strange that I cannot map/partition my disk layout during the Proxmox installation in ZFS RAID 1 mode in order to leave some free space for swap.
That's considered a custom disk setup, which is not possible with the PVE installer. It is a software appliance that tries to make the installation as simple as possible.
This is the reason I used a Debian installation and converted it into a PVE install. It has worked great for years, even through multiple major release upgrades.
I initially started with PVE 5; now I am at 7.
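
In case it helps, the rough shape of that conversion follows the "Install Proxmox VE on Debian 11 Bullseye" wiki article; something like the following (repository and package names are from the PVE 7 / Bullseye era, so double-check the wiki for your release):

  # add the PVE repository and its signing key
  echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
  wget https://enterprise.proxmox.com/debian/proxmox-release-bullseye.gpg -O /etc/apt/trusted.gpg.d/proxmox-release-bullseye.gpg
  apt update && apt full-upgrade

  # install the PVE kernel first and reboot into it, then pull in the rest
  apt install pve-kernel-5.15
  reboot
  apt install proxmox-ve postfix open-iscsi

  # finally drop the stock Debian kernel so grub boots the PVE one
  apt remove linux-image-amd64 'linux-image-5.10*'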
Some say to disable swap completely,
I am doing it that way, but mainly to avoid potentially sensitive data ending up on an unencrypted disk. Encrypted swap is too much of a hassle for me, and I have just thrown plenty of memory into the box.
 
That's considered a custom disk setup, which is not possible with the PVE installer.
It is possible. While installing PVE, you can click on the "Advanced" button and then lower the "hdsize" value so that not all of the unallocated space is used for the ZFS partition. That way you can later create a normal partition and format it for swap use, so your swap isn't on top of ZFS.
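
To make that concrete, a hedged sketch of the post-install steps, assuming the installer left free space at the end of /dev/sda and that partition 4 is the next free number (adjust device and partition numbers to your actual layout):

  # create a partition of type "Linux swap" in the leftover space
  sgdisk -n 4:0:0 -t 4:8200 /dev/sda

  # format and activate it, and make it persistent
  mkswap /dev/sda4
  swapon /dev/sda4
  echo '/dev/sda4 none swap sw 0 0' >> /etc/fstab

With a ZFS RAID1 you would repeat this on the second disk if you want swap there as well.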
 
Thank you Dunuin, but there is no such option in the advanced setup.

[screenshot: the "Harddisk options" dialog with ZFS selected]

If we pick a filesystem such as EXT3, EXT4, or XFS instead of ZFS, the Harddisk options dialog box will look like the following screenshot, with a different set of options:

[screenshot: the "Harddisk options" dialog with ext3/ext4/XFS selected]
 
That's considered a custom disk setup, which is not possible with the PVE installer. It is a software appliance that tries to make the installation as simple as possible.
This is the reason I used a Debian installation and converted it into a PVE install. It has worked great for years, even through multiple major release upgrades.
I initially started with PVE 5; now I am at 7.

I am doing it that way, but mainly to avoid potentially sensitive data ending up on an unencrypted disk. Encrypted swap is too much of a hassle for me, and I have just thrown plenty of memory into the box.

I'd like to go that way too, but Proxmox obviously prepares its own kernel and initramfs/initrd with the proper configuration. How can I later prepare my system to match the pre-made custom kernel distributed for PVE 7.2? I don't want to keep the Debian default kernel, but use the PVE one instead.
 
Thank you Dunuin, but there is no such option in the advanced setup.

Here is a part of a tutorial I wrote this week. Strange that it is missing for you. Here it is, as it appears when using the PVE 7.2-1 ISO:
The PVE Installer should show up and you then select "Install Proxmox VE". Read and accept the EULA. You then should see a page with a "Target Harddisk" dropdown.


Click the "Options" button [7] next to it. Then there should be a "Harddisk Options" popup.


There you select the ZFS raid mode of your choice, as described in chapter "4.) Choosing the right Disk layout", from the "Filesystem" dropdown [8]. Now make sure you only select the disks you really want to install PVE to [9] and go to the "Advanced Options" tab [10].


Here you will have to edit the "hdsize" [11], as we will later need some unallocated disk space to create the encrypted swap partition. Let's say you want an 8 GiB swap partition and the predefined value of the "hdsize" field was "60.0". Then lower it by 8 GiB and use "52.0". For more information see here.

Maybe your resolution is too small and the bottom of the "Harddisk Option" popup is cut off?
 
Wow!

I can confirm, it's there. I was using a different browser because of my iDRAC.
Old installer maybe?
I found it with the same installer. I don't know, maybe I just didn't notice it? I'll fire up my install again and adjust the partitioning for swap, with the rest going to the zvol. Other than that, is there any best practice for an optimal ZFS RAID 1 config that gives the best performance? E.g. I read that volblocksize can significantly improve vdev performance. My intention is to deploy PostgreSQL, but my disks are 10k SAS drives running in HBA mode, simply because I do not trust my PERC controller, especially without a BBU.
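
(From what I have read so far, volblocksize is fixed when a zvol is created and cannot be changed afterwards. The usual places to set it seem to be the storage definition or a manual zfs create; 16k is only a commonly cited starting point for PostgreSQL's 8 KiB pages, and I still need to benchmark it on my hardware. Storage, pool and disk names below are just the installer defaults / placeholders.)

  # per-storage default for newly created VM disks, in /etc/pve/storage.cfg
  zfspool: local-zfs
          pool rpool/data
          content images,rootdir
          sparse
          blocksize 16k

  # or when creating a zvol by hand
  zfs create -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1
  zfs get volblocksize rpool/data/vm-100-disk-1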
 
