Storage, checking data integrity, bit rot protection

hardek

Member
Dec 27, 2021
Hi All

For two weeks I have been searching the market for a good solution to the problem of data integrity, but I still can't decide due to insufficient experience. I think it is best to start a new thread about it. I believe someone has had a similar problem, and I hope they can share their experience :)

I focused on the following aspects:
  • security
  • flexibility (changing storage size, turn on/off encryption)
  • data integrity and bit rot protection
  • reporting + auto-healing of corrupted data
  • low resource consumption
  • relatively good performance
  • compression (in the future, when the data set becomes too large)
  • snapshot support (natively, I don't need GUI for administration)

Currently I am considering the following solutions:
  • block devices -> LUKS -> ZFS pool -> Mirroring / RAID Z2 (depends on protection level) -> ZFS filesystem
  • block devices -> dm-integrity -> mdadm (raid1/raid6) (depends on protection level) -> lvm -> LUKS -> ext4/xfs filesystem
  • block devices -> mdadm (raid1/raid6) (depends on protection level) -> lvm -> LUKS -> btrfs filesystem
I am searching for the following information:
  1. What is the behavior of each solution when data integrity issues occur?
  2. What is the performance of each solution?
  3. What is the stability and reliability of each solution?
  4. Does each solution support auto-healing without any intervention?
The first solution probably seems to be the best, but it needs additional RAM, which could potentially generate bit rot issues during caching, and it could potentially cause problems with licensing and support. Additionally, it offers less flexibility regarding encryption, because LUKS probably cannot be set up per partition or per volume (must I encrypt the whole underlying block device?). I know that ZFS supports native encryption, but it supports only one key and, moreover, some data (metadata) cannot be encrypted. I would rather lean towards one of the native solutions.

Currently, my server doesn't support ECC RAM, and I know this can cause related issues, but I don't know how to overcome this with a software solution (maybe emulation? I am aware of the additional overhead). I have read tons of documentation and examples, so I hope the architecture of each is OK, but... If someone has more experience, I am open to criticism and to another solution if one is available.

Thank you a lot for any help and for sharing knowledge.
 
I'd just go with native ZFS for everything:

block devices -> ZFS Pool on multiple vdevs RAIDz2 and encryption

All other solutions do not offer anything like bit-rot-protection and self-healing on all disks
 
flexibility (changing storage size, turn on/off encryption)
Not that easy with ZFS. Turning on encryption will, for example, only encrypt new data but not existing data. When using any raidz1/2/3 you can't simply add single disks; you have to add new vdevs (so, for example, adding 6 new disks when using a 6-disk raidz2). When working with raidz1/2/3 you also can't easily remove disks. It is easier to work with (striped) mirrors.

low resource consumption
Also not ZFS's selling point ;)
All those data integrity features come at a cost...

block devices -> LUKS -> ZFS pool -> Mirroring / RAID Z2 (depends on protection level) -> ZFS filesystem
Useful in case you want a cluster, as migration won't be possible when using ZFS native encryption.
 
Thank you @LnxBil and @Dunuin for your answers.

@Dunuin, could you write more about typical ZFS administration in the case of RAID1 and RAID6? Currently I have the following architecture:
  • block devices -> mdadm (raid1/raid6) -> LVM -> VM side [ LUKS -> EXT4] and
  • block devices -> mdadm (raid1/raid6) -> LVM -> VM side [ LVM -> LUKS -> EXT4]
How does ZFS handle typical cases in practice, like the ones I have done many times with mdadm:
  • failed disk issue
  • manually failing disk (for testing purpose)
  • adding new disk
  • manual raid assembling
  • expanding disk size and then growing raid size
  • shrinking raid
  • is there a need to add at least one partition as best practice, like in mdadm, or is the approach different here?
I am still searching for useful information, and a lot of it speaks in favor of ZFS, despite its disadvantages. I think I'm starting to understand why ZFS was chosen as the main solution for Proxmox. While reading, additional questions about ZFS came to mind. To achieve flexibility regarding encryption, I will probably follow one of the architectures below:
  • block devices -> ZFS pool -> ZFS Mirroring / RAID Z2 (depends on protection level) -> ZVOL -> VM side [ LVM -> LUKS -> EXT4 filesystem]
or
  • block devices -> ZFS pool -> ZFS Mirroring / RAID Z2 (depends on protection level) -> ZVOL -> VM side [ LUKS -> EXT4 filesystem]
Now questions:
  • Can a ZVOL be used as a replacement for an LVM volume? -> What about performance? -> Does a ZVOL still have the error checking feature, or use the ARC?
  • Is a ZVOL as flexible as LVM when it comes to growing and shrinking?
  • Is it possible to turn off all cache features to reduce RAM consumption, and how would the performance then compare to the EXT4 filesystem?
Thanks for all the tips and help.

P.S. In the background I read about DDR5 (my case) and its built-in (on-chip) error correction, which might help with in-memory data protection, but it doesn't offer full, true ECC. I will dig deeper for more information about this feature.
 
Is it possible to turn off all cache features to reduce RAM consumption, and how would the performance then compare to the EXT4 filesystem?
But ext4 also caches, so this is not a fair comparison and makes no sense.
 
How does ZFS handle typical cases in practice, like the ones I have done many times with mdadm:

  • failed disk issue
  • manually failing disk (for testing purpose)
  • adding new disk
  • manual raid assembling
  • expanding disk size and then growing raid size
  • shrinking raid
  • is there a need to add at least one partition as best practice, like in mdadm, or is the approach different here?
I found the answers to my questions. I needed to understand the ZFS architecture first and compare it to the solution I already know:
  • block devices -> VDEV -> ZPOOL -> DATASET
    • VDEV is similar to mdadm, because this is where I would set up a raidz layout
    • ZPOOL is similar to an LVM volume group
    • DATASET is similar to an LVM volume
So:
  • any disk operations like failing, adding and assembling are done on the VDEV side via simple commands
  • I have read that expanding disk size and then growing the raid is done by simply taking one device offline and replacing it with another
  • shrinking is done by creating a new pool containing smaller disks and moving the data (seems to be the best solution)
  • the approach to creating a raidz is different than in mdadm, because I need to create it on a whole disk instead of a partition
Additionally, I have found answers to my questions:
  • Can a ZVOL be used as a replacement for an LVM volume? -> What about performance? -> Does a ZVOL still have the error checking feature, or use the ARC?
Yes, it can be treated as a replacement for LVM. Performance is probably good due to pre-allocation. A ZVOL still has error checking (it seems to be set up on the pool side).
  • Is a ZVOL as flexible as LVM when it comes to growing and shrinking?
Yes.
  • Is it possible to turn off all cache features to reduce RAM consumption, and how would the performance then compare to the EXT4 filesystem?
Yes, by the below commands:
zfs set primarycache=none pool/dataset
zfs set secondarycache=none pool/dataset

P.S. In the background I read about DDR5 (my case) and its built-in (on-chip) error correction, which might help with in-memory data protection, but it doesn't offer full, true ECC. I will dig deeper for more information about this feature.
DDR5 seems to have built-in (on-chip, by design) error correction, but it isn't visible to the OS and errors can still occur (for example during transfer to the CPU). Still, the issue seems to be partially resolved and the chance of errors seems to be smaller compared to DDR4.

I have some last questions, which need more knowledge and experience with ZFS and for which I am not able to find answers:
  1. Does ZFS have a defrag tool to solve fragmentation issues on standard HDDs when the used space is above 80%?
  2. Does a (pre-allocated) ZVOL suffer from fragmentation?
  3. LZ4 compression seems to increase throughput in some cases, is that true? (for example storage for Proxmox VMs)
  4. A question out of pure curiosity: what will the performance be compared to EXT4 if I turn off the ARC and L2ARC caches and checksumming via the commands:
    • Code:
      zfs set primarycache=none pool/dataset
    • Code:
      zfs set secondarycache=none pool/dataset
    • Code:
      zfs set checksum=off pool/dataset
Thank you for sharing any knowledge.
 
the approach to creating a raidz is different than in mdadm, because I need to create it on a whole disk instead of a partition
You can use partitions for that.
DATASET is similar to an LVM volume
A "zvol" is more like an LV, as both are block devices. Datasets are filesystems, and you can't change the filesystem; a dataset is not a block device and therefore doesn't have a block size and so on.

shrinking is done by creating a new pool containing smaller disks and moving the data (seems to be the best solution)
Yes, moving data (even including existing snapshots and so on) between pools on block level is quite easy thanks to recursive replication.

DDR5 seems to have built-in (on-chip, by design) error correction, but it isn't visible to the OS and errors can still occur (for example during transfer to the CPU). Still, the issue seems to be partially resolved and the chance of errors seems to be smaller compared to DDR4.
Yes, but that's not real full ECC. DDR5 basically got to a point where you can only achieve more speed by accepting more errors, so some basic ECC is required there to be no worse than DDR4. There are still real ECC DDR5 DIMMs with full ECC coverage if you care about stability and data integrity. So you still need to buy proper DDR5 ECC DIMMs; the basic ECC capabilities of "non-ECC" DIMMs don't replace them.

Does ZFS have a defrag tool to solve fragmentation issues on standard HDDs when the used space is above 80%?
No, ZFS is copy-on-write, so you can't defrag it. Therefore you should make sure not to fill it too much (<80%) so the fragmentation rate will be slower. Defrag basically means moving all data off that pool and writing it back later.
LZ4 compression seems to increase throughput in some cases, is that true? (for example storage for Proxmox VMs)
LZ4 is super fast, and especially when using HDDs you might gain performance, because when reading/writing compressed blocks you need to read/write less data, and this might save more time than the compression/decompression adds.
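
The reasoning above can be illustrated with a back-of-the-envelope calculation. This is only a sketch: the disk speed and the compression ratio are hypothetical numbers picked for illustration, not measurements, and it assumes that LZ4 decompression is fast enough never to become the bottleneck.

```shell
#!/bin/sh
# Hypothetical inputs (assumptions for illustration, not measurements)
disk_mb_per_s=150        # raw sequential throughput of an HDD
compress_ratio_x100=150  # compression ratio of 1.50x, stored as an integer percentage

# With a 1.50x ratio, every 100 MB read from disk expands to 150 MB of logical data
effective_mb_per_s=$(( disk_mb_per_s * compress_ratio_x100 / 100 ))

echo "raw: ${disk_mb_per_s} MB/s, effective: ${effective_mb_per_s} MB/s"
```

On a real pool, the ratio actually achieved can be read with `zfs get compressratio pool/dataset`; data that is already compressed (media files, archives) will stay close to 1.00x and gain nothing.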

zfs set primarycache=none pool/dataset
I wouldn't do that. At least set it to "primarycache=metadata" so metadata would still be cached.
And I don't see the point of disabling checksumming. Data integrity is the whole point of why to use ZFS. If you don't care about your data there would be way faster alternatives like HW raid, mdadm and whatever.
And yes, you can use mdadm with PVE. It's not officially supported and not recommended because of problems like this one, but it will still work.
 
Thank you @Dunuin for answers and explanations. I really appreciate that!

I wouldn't do that. At least set it to "primarycache=metadata" so metadata would still be cached.
And I don't see the point of disabling checksumming. Data integrity is the whole point of why to use ZFS. If you don't care about your data there would be way faster alternatives like HW raid, mdadm and whatever.
And yes, you can use mdadm with PVE. It's not officially supported and not recommended because of problems like this one, but it will still work.
Regarding this, I am trying to avoid using another solution, because I would need to set up separate hardware and separate configuration layers. I want to achieve something like this: if I need data integrity checking, I will use standard ZFS; if I don't need data integrity checking, I will simply turn it off instead of using, for example, the EXT4 filesystem (flexibility).

Does that make sense? Will I achieve the same performance as on EXT4, or will it be worse anyway?

At this moment I am preparing to migrate from my current solution (mdadm + lvm + ext4) and am trying to work out solutions to potential problems that may occur.
I regret a bit that I didn't think about data security back then; I would definitely have less work now, but what I have learned is mine :)

No, ZFS is copy-on-write, so you can't defrag it. Therefore you should make sure not to fill it too much (<80%) so the fragmentation rate will be slower. Defrag basically means moving all data off that pool and writing it back later.
This will be the biggest problem for me, because I assume occupancy of around 80-90%. Moving terabytes of data will be time-consuming. Such a shame there is no online tool to do this :(
I am wondering if such a tool will come out for ZFS in the future.
A "zvol" is more like an LV, as both are block devices. Datasets are filesystems, and you can't change the filesystem; a dataset is not a block device and therefore doesn't have a block size and so on.
Do you know if the fragmentation issue affects the entire pool and the ZVOLs within it (I assume so)? Or is it related to datasets only?
 
Does that make sense? Will I achieve the same performance as on EXT4, or will it be worse anyway?
It is still copy-on-write and does things in a very complicated way for data integrity and feature reasons. So probably worse anyway.

This will be the biggest problem for me, because I assume occupancy of around 80-90%. Moving terabytes of data will be time-consuming. Such a shame there is no online tool to do this :(
I am wondering if such a tool will come out for ZFS in the future.
Have a read on how copy-on-write works. It works fundamentally differently than your typical filesystem like ext4 or xfs. Data is not stored on fixed locations. It's all just like a very big journal. Good short article to learn some of the ZFS fundamentals: https://arstechnica.com/information...01-understanding-zfs-storage-and-performance/

Do you know if the fragmentation issue affects the entire pool and the ZVOLs within it (I assume so)? Or is it related to datasets only?
Not totally sure but I think on a vdev level.
 
what will the performance be compared to EXT4 if I turn off the ARC and L2ARC caches and checksumming via the commands
As I already said, you cannot compare that because ext4 uses the buffer cache. Compare standard ext4 with standard ZFS and ext4 will be faster, because it lacks a lot of features. Those features, once you have grown accustomed to them, you will most definitely miss: snapshots, incremental backup, self-healing. mdadm is also sub-par to ZFS in a mirrored setup. I used mdadm+lvm for decades, but since ZFS became available, I switched EVERYTHING and never looked back. ZFS is superior in any setup; use good (enterprise) hardware and you won't complain about performance. Performance is overrated with respect to data integrity.
 
Thank you very much for the help. I ran tests in my environment and ZFS seems to be quite efficient compared to the previous solution (mdadm + lvm). I think the following optimizations and the ARC do the job in my case:
  • /etc/modprobe.d/zfs.conf
    • Code:
      # Description: Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at c_min as tuned by zfs_arc_min. In this case 4GB.
      options zfs zfs_arc_min=4294967296
      
      # Description: Maximum size of ARC in bytes. If set to 0 then the maximum size of ARC is determined by the amount of system memory installed (50% on Linux). In this case 24GB.
      options zfs zfs_arc_max=25769803776
      
      # Description: Sets the limit to ARC metadata, arc_meta_limit, as a percentage of the maximum size target of the ARC, c_max. Default is 75.
      options zfs zfs_arc_meta_limit_percent=75
      
      # Description: Disables writing prefetched, but unused, buffers to cache devices. Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are very slow, setting to 0 can improve some workloads such as backups.
      options zfs l2arc_noprefetch=0
      
      # Description: How far through the ARC lists to search for L2ARC cacheable content, expressed as a multiplier of l2arc_write_max. ARC persistence across reboots can be achieved with persistent L2ARC by setting this parameter to 0, allowing the full length of ARC lists to be searched for cacheable content.
      options zfs l2arc_headroom=2
      
      # Description: Seconds between L2ARC writing.
      options zfs l2arc_feed_secs=1
      
      # Description: Controls whether only MFU metadata and data are cached from ARC into L2ARC. This may be desirable to avoid wasting space on L2ARC when reading/writing large amounts of data that are not expected to be accessed more than once. By default both MRU and MFU data and metadata are cached in the L2ARC
      options zfs l2arc_mfuonly=1
      
      # Description: This tunable limits the maximum writing speed onto l2arc. The default is 8MB/s. So depending on the type of cache drives that the system used, it is desirable to increase this limit several times. But remember not to crank it too high to impact reading from the cache drives.
      options zfs l2arc_write_max=8388608
      
      # Description: This tunable increases the above writing speed limit after system boot and before ARC has filled up. The default is also 8MB/s. During the above period, there should be no read request on L2ARC. This should also be increased depending on the system.
      options zfs l2arc_write_boost=8388608
      
      # Description: Modern HDDs have uniform bit density and constant angular velocity. Therefore, the outer recording zones are faster (higher bandwidth) than the inner zones by the ratio of outer to inner track diameter. The difference in bandwidth can be 2:1, and is often available in the HDD detailed specifications or drive manual. For HDDs when metaslab_lba_weighting_enabled is true, write allocation preference is given to the metaslabs representing the outer recording zones. Thus the allocation to metaslabs prefers faster bandwidth over free space.
      options zfs metaslab_lba_weighting_enabled=1
  • /etc/default/grub
  • I use the below options for pools and zvols:
    • zpool create -f -o ashift=12 -O dedup=off -O recordsize=128K -O atime=on -O relatime=on -O compression=lz4 -O acltype=posixacl .....
    • zfs create -o volblocksize=16K
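
The byte values used for zfs_arc_min and zfs_arc_max above correspond to 4 GiB and 24 GiB. Below is a small sketch of how such values could be derived and printed in the `options <module> <parameter>=<value>` form that files in /etc/modprobe.d use. The helper name gib_to_bytes is made up for this example.

```shell
#!/bin/sh
# gib_to_bytes is a hypothetical helper: converts whole GiB to bytes
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

arc_min=$(gib_to_bytes 4)    # 4 GiB lower bound for the ARC
arc_max=$(gib_to_bytes 24)   # 24 GiB upper bound for the ARC

# Print the lines in modprobe.d syntax (nothing is written to disk here)
echo "options zfs zfs_arc_min=${arc_min}"
echo "options zfs zfs_arc_max=${arc_max}"
```

Once the module has been loaded with the new options, the active values should be readable from /sys/module/zfs/parameters/zfs_arc_min and /sys/module/zfs/parameters/zfs_arc_max. If the root filesystem is on ZFS, the initramfs probably has to be regenerated before the new values take effect at boot.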

It seems I will never switch back to the previous solution. I will try to move my biggest mdadm RAID6 to RAIDZ-2, but I am not sure how to do it without losing any data.
  • Will I be able to convert a single ZFS disk into RAIDZ-2?
    • I assume something like the following: remove one drive from the mdadm raid6 -> create a new ZFS pool with one drive -> move part of the data to the new ZFS pool -> remove a second drive and convert the single ZFS drive into RAIDZ-2 -> move the rest of the data -> add the remaining disks to rebuild a full-featured ZFS array
  • I am wondering how to move the rootfs instance (mdadm + lvm + raid1) to ZFS. I think it can be hard to do, so I am searching for another, less invasive solution.
I think the above questions will be the last ones, and the environment will then cover most data issues :D
I have a lot of fun with "new" technology (ZFS)!
 
l2arc_noprefetch=0
Are you planning on using an L2ARC device? If not, you can omit all l2arc-prefixed settings.

Will I be able to convert a single ZFS disk into RAIDZ-2?
AFAIK, a vdev cannot be changed, but in general vdevs can be added to or removed from a pool, so maybe you can work something out. It would be best to use other disks for storing your data, but keep in mind to have at least two disks or another mirrored zpool for your "move device".

I am wondering how to move the rootfs instance (mdadm + lvm + raid1) to ZFS. I think it can be hard to do, so I am searching for another, less invasive solution.
Best to start fresh. Much easier and faster.
 
Do you know if the fragmentation issue affects the entire pool and the ZVOLs within it (I assume so)? Or is it related to datasets only?
Fragmentation on ZFS is only an issue once the entire FS has been written. Then it impacts only the most recent writes, and progressively more as the filesystem fills up. As a rule of thumb, don't run your filesystem more than 80% full and you should have no issues. edit- @Dunuin already made that point ;)
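
The 80% rule of thumb can be watched with a small script. The following is a minimal sketch, not a finished monitoring tool: check_pool_capacity is a made-up helper, and it assumes input lines of the form "name<TAB>capacity%", which is what `zpool list -H -o name,capacity` is expected to print. The demo at the end feeds it invented sample data instead of a live pool.

```shell
#!/bin/sh
# check_pool_capacity is a hypothetical helper. It reads "<pool> <capacity>%"
# lines on stdin and prints a warning for every pool at or above the
# threshold given as first argument (default: 80).
check_pool_capacity() {
  threshold="${1:-80}"
  while read -r name cap; do
    cap="${cap%\%}"                      # strip the trailing percent sign
    if [ "$cap" -ge "$threshold" ]; then
      echo "WARNING: pool ${name} is ${cap}% full (threshold ${threshold}%)"
    fi
  done
}

# Demo with made-up sample data; pool names and fill levels are invented
printf 'tank\t85%%\nrpool\t40%%\n' | check_pool_capacity 80
```

On a real system the call would look like `zpool list -H -o name,capacity | check_pool_capacity 80`. The pool property `fragmentation` can be added to the `-o` list to see how fragmented the free space already is.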

And yes, you can use mdadm with PVE. It's not officially supported and not recommended because of problems like this one, but it will still work.
Curious, what's the use case for this? edit- misread "zfs over mdadm." ignore.

As I already said, you cannot compare that because ext4 uses buffer cache. Compare standard ext4 with standard ZFS and ext4 will be faster, because it lacks a lot of features.
For reads. Writes are often faster (depending on the write pattern), but only as long as you have lots of contiguous free space. See above.

Will I be able to convert a single ZFS disk into RAIDZ-2?
No. vdevs cannot be changed in place, and you cannot remove top level vdevs from a pool containing raidz vdevs. See https://openzfs.github.io/openzfs-docs/man/8/zpool-remove.8.html
 
No. vdevs cannot be changed in place, and you cannot remove top level vdevs from a pool containing raidz vdevs. see https://openzfs.github.io/openzfs-docs/man/8/zpool-remove.8.html
Correct. I personally would:
A.) in case you need that single disk as a part of that raidz2: Back up your pool's data into a file on some temporary storage, like "zfs send > file". Create your raidz2 and then restore it with "zfs recv < file".
B.) in case you don't need that single disk: copy all data from single disk pool to raidz2 pool using "zfs send | zfs recv".

And don't forget to test it first and make your backups...especially on option A.

Here are some zfs send examples: https://docs.oracle.com/cd/E18752_01/html/819-5461/gbchx.html
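
Options A and B could be sketched roughly as below. This is a dry run that only prints the commands instead of executing them, so it can be reviewed safely first. The pool names, the snapshot name and the dump file path are placeholders made up for this example; the flags used are `zfs snapshot -r` (recursive snapshot), `zfs send -R` (replication stream including child datasets and snapshots) and `zfs recv -F` (force the target to match the stream).

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands, executes nothing.
# All names below are placeholders (assumptions), adjust before real use.
src_pool="oldpool"
dst_pool="newpool"
snap="migrate"
dump_file="/mnt/temp/oldpool.zfs"

# Option A: go through a file on temporary storage
plan_option_a() {
  echo "zfs snapshot -r ${src_pool}@${snap}"
  echo "zfs send -R ${src_pool}@${snap} > ${dump_file}"
  echo "# destroy the old pool and create the raidz2 pool here"
  echo "zfs recv -F ${dst_pool} < ${dump_file}"
}

# Option B: both pools exist at the same time, pipe directly
plan_option_b() {
  echo "zfs snapshot -r ${src_pool}@${snap}"
  echo "zfs send -R ${src_pool}@${snap} | zfs recv -F ${dst_pool}"
}

echo "== Option A =="
plan_option_a
echo "== Option B =="
plan_option_b
```

With option A the only copy of the data sits in a single stream file between the two steps, which is why testing the restore beforehand matters most there.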
 
I am in the middle of migrating to a ZFS pool. I have copied as much data as possible to a second location (unfortunately, the second location is smaller). I have read a lot of documentation and many forum threads, and I am following the steps below (of course, this is risky for data that is not backed up). I hope it will be useful for others.
  • Remove two disks from mdadm array:
    • Code:
      mdadm --manage /dev/md3 --fail /dev/sdc3
    • Code:
      mdadm --manage /dev/md3 --fail /dev/sde3
    • Code:
      mdadm --manage /dev/md3 --remove /dev/sdc3
    • Code:
      mdadm --manage /dev/md3 --remove /dev/sde3
    • Code:
      mdadm --zero-superblock /dev/sdc3
    • Code:
      mdadm --zero-superblock /dev/sde3
  • Prepare removed disks by:
    • zero them and create a new GPT partition table:
      • Code:
        shred -v -n 1 /dev/sdc
      • Code:
        shred -v -n 1 /dev/sde
      • Code:
        fdisk /dev/disk/by-id/<name_by_wwn_id_sdc> -> g -> w
      • Code:
        fdisk /dev/disk/by-id/<name_by_wwn_id_for_sde> -> g -> w
  • Check size of removed disks:
    • Code:
      fdisk -l /dev/sdc
    • Code:
      fdisk -l /dev/sde
  • Create fake files for RAIDZ-2 purpose:
    • Code:
      truncate -s <disk_size_in_bytes> /home/disk1.img
    • Code:
      truncate -s <disk_size_in_bytes> /home/disk2.img
  • Create the raidz-2 data pool from sdc, sde and the new fake files:
    • Code:
      zpool create -o ashift=12 -O dedup=off -O recordsize=256K -O atime=on -O relatime=on -O compression=lz4 -O acltype=posixacl "<pool_name>" raidz2 <name_by_wwn_id_sdc> <name_by_wwn_id_for_sde> /home/disk1.img /home/disk2.img
  • Take the fake files offline so that no data is saved to them:
    • Code:
      zpool offline "<pool_name>" /home/disk1.img
    • Code:
      zpool offline "<pool_name>" /home/disk2.img
  • Move data to new raidz-2 pool
  • Stop and completely remove the old mdadm array with the remaining drives (a stopped array can no longer be managed, so the members are not failed or removed first):
    • Code:
      mdadm --stop /dev/md3
    • Code:
      mdadm --zero-superblock /dev/sdd3
    • Code:
      mdadm --zero-superblock /dev/sdf3
  • Repeat the zeroing and new GPT partition table steps for sdd and sdf
  • Replace the fake images with disks:
    • Code:
      zpool replace "<pool_name>" /home/disk1.img <name_by_wwn_id_for_sdd>
    • Code:
      zpool replace "<pool_name>" /home/disk2.img <name_by_wwn_id_for_sdf>
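
The fake-file trick in the steps above relies on truncate creating sparse files: they report the full size but allocate almost no space until something is written to them. This is presumably also why the files are taken offline right after the pool is created, before any data is copied. A minimal sketch that demonstrates the effect in a temporary directory, assuming GNU coreutils and a filesystem with sparse file support; the 1 GiB size is a stand-in for the real disk size, which can be read on a real system with `blockdev --getsize64 /dev/sdX`.

```shell
#!/bin/sh
# Stand-in size for the demo (assumption); use the real disk size in practice
size_bytes=$(( 1024 * 1024 * 1024 ))   # 1 GiB

# Work in a throwaway directory instead of /home
workdir="$(mktemp -d)"
for i in 1 2; do
  truncate -s "$size_bytes" "${workdir}/disk${i}.img"
done

# Apparent size (what ZFS would see) vs. space actually allocated on disk
apparent_bytes="$(stat -c %s "${workdir}/disk1.img")"
allocated_kib="$(du -k "${workdir}/disk1.img" | cut -f1)"
echo "apparent size: ${apparent_bytes} bytes, allocated: ${allocated_kib} KiB"

rm -r "$workdir"
```

Keep in mind that a raidz2 with two members offline has no redundancy left, so until both image files have been replaced by real disks and resilvering has finished, a single disk failure can lose the pool.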

Regarding
I am wondering how to move the rootfs instance (mdadm + lvm + raid1) to ZFS. I think it can be hard to do, so I am searching for another, less invasive solution.
BTRFS has a feature that allows migrating EXT4 to BTRFS. BTRFS has data corruption protection and seems to be suitable for rootfs and RAID1. Re-installing and starting fresh with ZFS is not possible for the node due to time constraints. I am considering the architectures below for rootfs:
  • disk -> mdadm (raid-1) -> lvm -> BTRFS (of course boot and efi will be handled in a separate, different way outside of raid and lvm)
    • Is it a good alternative?
    • Is BTRFS stable enough for RAID-1? (I have read that only RAID5/6 has issues.)
    • If someone has performed a similar migration, I would be very grateful for shared knowledge and the steps to do it (I am afraid I will forget one step and the OS simply won't boot)
  • (I am not really considering this one) disk -> dm-integrity -> mdadm (raid-1) -> lvm -> EXT4 (here the biggest issue is the lack of support for auto-starting dm-integrity devices during boot, so it is not good for me; moreover, I would need to move to lvmraid to achieve it)
 
Your plan sounds good.

Have you considered also moving the rootfs to your zfs pool and UEFI to the first two disks in your vdev? There are a lot of people advocating that it is good practice to split them, yet I have never had any problem running it on ZFS as well, and it is a supported configuration. I only split it if the zpool data is external (e.g. external shelves or SAN).

I converted ext4 to btrfs once to try it out and it worked. On the RAID1(0) vs. RAID5/6 debate with btrfs: RAID5/6 is, according to the developers, still not "production ready", so you're right. I don't see the point in running btrfs if you're already running ZFS. I also don't advise running multiple zpools. That is not what a pool was created for in the first place. The basic idea behind ZFS was that you have one pool that holds ALL your data, so it has the ARC to itself and you don't waste it on multiple pools. More vdevs mean more throughput.
 
Have you considered also moving the rootfs to your zfs pool and UEFI to the first two disks in your vdev? There are a lot of people advocating that it is good practice to split them, yet I have never had any problem running it on ZFS as well, and it is a supported configuration. I only split it if the zpool data is external (e.g. external shelves or SAN).
I like to split it for 3 reasons (but I also wouldn't do that for a server with just a few disks):
1.) in case I need more space for my raidz1/2/3 pool, I can just back up all guests, destroy the pool, create a bigger one and restore everything without needing to install and configure PVE again.
2.) easier when doing block level host backups, if you don't want to also back up all that other data
3.) heavy guest IO, like when doing backups, won't affect PVE that much...but you only see a difference when using really slow storage like HDDs
 
I converted ext4 to btrfs once to try it out and it worked. On the RAID1(0) vs. RAID5/6 debate with btrfs: RAID5/6 is, according to the developers, still not "production ready", so you're right. I don't see the point in running btrfs if you're already running ZFS. I also don't advise running multiple zpools. That is not what a pool was created for in the first place. The basic idea behind ZFS was that you have one pool that holds ALL your data, so it has the ARC to itself and you don't waste it on multiple pools. More vdevs mean more throughput.

I would like to split zpools, because:
  1. I would like to have separate pools for the OS, VM images and VM data
  2. Disks have different sizes, which potentially complicates the RAIDZ configuration
  3. Flexibility with moving from smaller to bigger pools and backup procedures
  4. Less impact on performance (greater isolation)
Could you write more regarding the conversion procedure? Did you have mdadm + lvm underneath, like me?

I care about the most trouble-free migration. I want to avoid a situation where the system won't boot. At the moment it is not possible to move virtual machines and storage to another node to prepare the previous one.
 
Disks have different sizes, which potentially complicates the RAIDZ configuration
Usually not that problematic, because ZFS will try to write new data in a way that different vdevs end up similarly filled. And creating a single big vdev should be avoided anyway, especially when using HDDs, as IOPS performance only scales with the number of vdevs and not the number of disks. So multiple smaller raidz vdevs striped together are better for performance and resilvering times than a single big raidz vdev.
 
Today I have read the following:
Layering btrfs volumes on top of LVM may be implicated in passive causes of filesystem corruption (Needs citation -- Nicholas D Steeves)
source: https://wiki.debian.org/Btrfs#Other_Warnings

This will probably change my current plan to:
  • [remove lvm layer] disk -> mdadm (raid-1) -> BTRFS filesystem (of course boot and efi will be handled in separate, different way outside of raid and lvm)
Alternatively:
  • disk -> built-in BTRFS raid-1 (it seems to be stable enough, judging by the opinions I have read) -> BTRFS filesystem (of course boot and efi will be handled in a separate, different way outside of raid and lvm)
Note: I would like to avoid using ZFS on rootfs due to lack of native support in kernel.

BTRFS allows growing and shrinking a partition online, but it seems to be less flexible than LVM (when I have several partitions and try to change the size of the first one, I need to move the free space first). I am trying to find examples of BTRFS sub-volumes with a specific size, but it seems not to be supported (of course it is not the same as a block device offered by LVM, but I need something similar).