ext4/ZFS mixed on a single disk

cave

Well-Known Member
I've got access to some USFF desktops and want to use them as Poor-Man's-Blades, or "Plates".

TL;DR at the End ;)



Explanation of why they are interesting:
Some comparisons [Passmark score / TDP in watts]:

HPE ProLiant MicroServer
  • Gen7 N40L AMD Turion 602/25W
  • Gen8 i3-3240 2295/55W
  • Gen8 E3-1220Lv2 2452/17W
  • Gen10 AMD Opteron X3421 3375/35W
  • Gen10+ E-2224 @ 3.40GHz 7268/71W

1-liter class USFF / 100-300€ price range with the CPUs below in use:
  • i5-3470T 2955/35W
  • i5-4570T 3167/35W
  • i3-7100T 3800/35W (ECC yes)
  • i3-6300T 4029/35W
  • i5-6500T 4758/35W
  • i5-7500T 5272/35W
  • i3-8100T 5303/35W (2x32GB DDR4-2400, ECC yes?)
  • i7-6700T 7233/35W
Pros:
  • Intel vPro is often included for i5/i7
  • incredibly small
  • stacks vertically like "Plates"
  • price



Limitations:
  • only 2 disks: 1x M.2/PCIe & 1x 2.5"
  • no ECC RAM
  • single GbE NIC
  • refurbished/old

Therefore I was thinking about which disks I should put in them.

Drawback: No redundancy possible with mdadm/RAID1/ext4 or zfs/mirror.

I was thinking of mixing ext4 and zfs on the NVMe to have the fast NVMe drive available for guests.

Code:
HP Poor-Mans-Blade Setup

/dev/sda                500GB    NVMe (WDS500G1R0C 55€)
/dev/sdb                  1TB    HDD (WD10JFCX 75€)    or 1TB SSD (WDS100T1R0A 95€)


/dev/sda1                /boot    ext4    1G        Linux filesystem
/dev/sda2                pve        lvm2    25G        Linux LVM pv0
/dev/vg0                        lvm2            Linux LVM vg0    (alloc 15G / free 10G)
/dev/mapper/pve-root    /        ext4    3G        Linux LVM lv0    (is 1,1G)
/dev/mapper/pve-usr     /usr    ext4    4G        Linux LVM lv1    (is 2,6G)
/dev/mapper/pve-var     /var    ext4    4G        Linux LVM lv2    (is 1,7G)
/dev/mapper/pve-swap    swap    4G                Linux LVM lv3   


/dev/sda5                 475G    ZFS vdev
/dev/sdb1               1000G    ZFS vdev

zpool0 sda5 (no mirror)
zpool0 datasets "pve-data" /var/lib/vz/
zpool1 sdb1 (no mirror)

Datasets for LXC Clients on zpool0
    PBS ZFS/zpool0 MountPoint as local Storage
    GlusterFS Replication on ZFS/zpool0
    GlusterFS Share on zpool1 as Storage to other PVE/PBS 
    DB Replication done via postgres, mariaDB on zpool1
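
A rough sketch of how the ZFS part of this layout could be created from the shell (device names, partition numbers and the pool/dataset names are assumptions; the ext4/LVM part is what the installer sets up anyway):

Code:
# hypothetical device names: NVMe = /dev/nvme0n1, SATA disk = /dev/sda
sgdisk -n5:0:0 -t5:BF01 /dev/nvme0n1    # partition 5: remaining NVMe space for ZFS
sgdisk -n1:0:0 -t1:BF01 /dev/sda        # partition 1: whole SATA disk for ZFS

# two separate single-vdev pools, ashift=12 for 4K sectors
zpool create -o ashift=12 -O compression=lz4 pve-pool0 /dev/nvme0n1p5
zpool create -o ashift=12 -O compression=lz4 pve-pool1 /dev/sda1

# dataset for guest storage, mounted where PVE expects its "local" directory storage
zfs create -o mountpoint=/var/lib/vz pve-pool0/data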


Each "Plate" should run:
  • PVE as Host
  • PBS/LXC on Pool2
  • LXC/GlusterFS Replication across the Plates for static WebServer Files, replicated HAPROXY Configs/Certs, etc ...
  • GlusterFS Storage from the other Plate for PBS, putting a Backup on the local and GlusterFS Storage.
  • MariaDB/PostgreSQL Slave Replication


ZFS could also provide an rpool for the host OS to boot from.
I'm unsure if I should stick to fdisk/LVM/ext4, which has served me well over the last years and where I now have my preferred filesystem layout, or put everything into ZFS with all its features.
Adding a spare to my md/RAID5, growing it, or replacing/RMA-ing a disk was never an issue. With ZFS it feels more like: think it through at the beginning, because changes are only possible with a pool rebuild.
Are there any pros/cons for having the root filesystem on ext4 compared to a ZFS rpool, or vice versa?

For storage I really want ZFS. My old hardware and mdadm RAIDs are going to be converted to ZFS raidz accordingly.
For PBS I really see the benefit of incremental backups, putting the whole LXC including its mount points into the backup.

Regarding bitrot detection and healing, there are some options:
  • Adding "copies=2" is also an option for a single-disk zpool (see the sketch after this list). But up to now, bitrot hasn't been an issue for me on my RAID5s.
  • Or splitting the disk into two partitions and building the zpool with two vdevs from the same disk.
  • Or splitting the disk into three partitions and building the zpool with three vdevs from the same disk as a raidz.
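
For the "copies=2" option, a minimal sketch (pool and dataset names are assumptions; the property only affects data written after it is set):

Code:
# store every block twice on the same vdev for the new dataset
zfs create -o copies=2 pve-pool0/important
# or enable it later on an existing dataset (only newly written data gets duplicated)
zfs set copies=2 pve-pool0/important
zfs get copies pve-pool0/important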

BUT:
It seems there is a long-standing bug where ZFS puts its zfs_member signature on the whole disk instead of only on the partition, confusing blkid:

critical bug: zpool messes up partition table : zfs_member occupy the whole disk instead of being constrained to a partition
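
A quick way to check whether the zfs_member signature ended up on the whole disk or only on the intended partition (a sketch, assuming the NVMe device name from above):

Code:
# the zfs_member TYPE should only show up on the ZFS partition, not on the parent disk
blkid /dev/nvme0n1 /dev/nvme0n1p*
lsblk -o NAME,FSTYPE /dev/nvme0n1
# wipefs -n only lists signatures, it does not erase anything
wipefs -n /dev/nvme0n1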

Is mixing ext4 and ZFS on the same disk with different partitions safe? Has anyone done that before, or is it just common sense and an unwritten law to always assign whole disks to ZFS?


TL;DR:
  • Is mixing ext4 & ZFS on a single disk possible and safe?
  • ZFS rpool or LVM/ext4 for the PVE root filesystem: pros and cons?
 
Are there any pros/cons for having the root filesystem on ext4 compared to a ZFS rpool, or vice versa?
With ext4 you are missing bitrot detection and compression.

For storage I really want ZFS. My old hardware and mdadm RAIDs are going to be converted to ZFS raidz accordingly.
Keep in mind that ZFS adds a lot of overhead and does sync writes for its metadata. For reliability and performance it is highly recommended to get enterprise/datacenter SSDs with power-loss protection and a good DWPD rating, which those WD Reds aren't.
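
If you want to see what the sync-write penalty looks like on a given SSD before relying on it, a quick fio comparison on a dataset of the pool gives a rough idea (a sketch; the test path and sizes are assumptions):

Code:
# async 4k random writes vs the same workload with an fsync after every write
fio --name=async4k --filename=/pve-pool0/fio.test --size=1G --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=1 --runtime=30 --time_based
fio --name=sync4k --filename=/pve-pool0/fio.test --size=1G --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=1 --runtime=30 --time_based --fsync=1
rm /pve-pool0/fio.test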

Adding "copies=2" is also an option for a single-disk-zpool. But up to now, bit-rot wasn't an issue for me on my raid5's.
You probably just didn't notice it because your filesystem wasn't able to identify corrupted data. ;)
copies=2 would help with bitrot, but I personally would really prefer to build a raid1, even if that means your writes are slowed down to SATA performance. You lose the same ~60% of raw capacity, but at least you don't lose data, there's no extra work, and no downtime in case an SSD fails.
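
For reference, a single-disk pool can later be turned into a mirror without a rebuild by attaching a second device (a sketch with assumed device names):

Code:
# attach the SATA SSD partition to the existing NVMe vdev -> two-way mirror
zpool attach pve-pool0 /dev/nvme0n1p5 /dev/sda1
zpool status pve-pool0    # shows the resilver, afterwards a mirror-0 vdev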

Or splitting the disk into three partitions and building the zpool with three vdevs from the same disk as a raidz
There is a thread about this from last year. Maybe you can find it.

Is mixing ext4 and ZFS on the same disk with different partitions safe? Has anyone done that before, or is it just common sense and an unwritten law to always assign whole disks to ZFS?
I've got ZFS, LVM and mdraid on the same disk and it has worked fine so far.
 
Is mixing ext4 & ZFS on a single disk possible and safe?
Yes and yes. It's actually the recommended method to have ZFS and swap on the same disk.
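
Roughly what such a layout looks like when swap gets its own partition next to the ZFS one (a sketch; partition numbers, sizes and the device name are assumptions):

Code:
sgdisk -n2:0:+8G -t2:8200 /dev/nvme0n1    # partition 2: 8G Linux swap
sgdisk -n3:0:0   -t3:BF01 /dev/nvme0n1    # partition 3: rest of the disk for ZFS
mkswap /dev/nvme0n1p2 && swapon /dev/nvme0n1p2
zpool create -o ashift=12 rpool /dev/nvme0n1p3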

ZFS rpool or LVM/ext4 for the PVE root filesystem: pros and cons?
It really depends on what you're trying to accomplish. These little things really don't perform or have many resources, so they're best used with very lightweight applications, e.g. Docker. In any case, if your applications are stateless then it really doesn't matter; if you ARE consuming/generating data that requires proper storage backing, it's probably best to house it off-machine, e.g. on a NAS.
 
@Dunuin & @alexskysilk thanks for your responses and advice, highly appreciated.

I personally would really prefer to build a raid1
That would defeat the whole point of the small & cheap "Plates" hardware. The pun on "Blades" is intended; the disk-free blades concept is ignored.
Hmm, maybe that would be the next step: let the PVE hypervisor boot its system over PXE ... :cool: disk-free Plates :eek: so the M.2 and SATA ports can be used differently.

For reliability and performance it is highly recommended to get enterprise/datacenter SSDs with power-loss protection and a good DWPD rating, which those WD Reds aren't.
I calculated more or less the same rating with TBW / expected lifetime and my expected actual usage, which is more or less equivalent to DWPD.
So I included some safety margin in my calculation: if a 250GB disk is rated at TBW = 500TB over its 5-year warranty, then a 500GB disk is rated at 1000TB over 5 years, which in both cases works out to roughly 1 DWPD (500,000 GB / 250 GB / ~1825 days ≈ 1.1 drive writes per day).

I'm not expecting to hit the TBW wall of the WD Red disks, but I did the math and I'll keep a close eye on how it behaves differently with the ZFS overhead.

These little things really don't perform or have many resources, so they're best used with very lightweight applications, e.g. Docker.
a) It depends on your point of view. My first NAS was a Linksys NSLU2, and if there were a PVE build for ARM, I would install it on all of my routers as well :D and let the OpenWRT firmware run in LXC.
b) Most of my stuff runs in LXC on bullseye anyway.
if you ARE consuming/generating data that requires proper storage backing, it's probably best to house it off-machine, e.g. on a NAS.
I'll consider it. At the moment, everything is pushed to my existing mdraid.
Though I'm already thinking about and reading up on how it would best be converted to a RAIDZ2 + 1 spare with a ZIL/SLOG on a PLP device.
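
That RAIDZ2 + spare + SLOG idea could look roughly like this (a sketch; disk names, the pool name and the PLP SSD partition used as log device are all assumptions):

Code:
# six data disks as raidz2, one hot spare, small PLP SSD partition as SLOG
zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde sdf sdg \
    spare sdh \
    log nvme0n1p4
zpool status tank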


Code:
root@pve2:~# df -hTl
Filesystem                Type      Size  Used Avail Use% Mounted on
udev                      devtmpfs  3.8G     0  3.8G   0% /dev
tmpfs                     tmpfs     784M  1.1M  783M   1% /run
/dev/mapper/vg0-pve--root ext4      2.8G  5.3M  2.6G   1% /
/dev/mapper/vg0-pve--usr  ext4      3.7G  3.1G  457M  88% /usr
tmpfs                     tmpfs     3.9G   66M  3.8G   2% /dev/shm
tmpfs                     tmpfs     5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p2            ext4      944M   78M  801M   9% /boot
/dev/nvme0n1p1            vfat      487M   11M  476M   3% /boot/efi
/dev/mapper/vg0-pve--var  ext4      3.7G  329M  3.2G  10% /var
pve-pool0                 zfs       423G  128K  423G   1% /pve-pool0
pve-pool0/data            zfs       423G  128K  423G   1% /var/lib/vz
tmpfs                     tmpfs     784M     0  784M   0% /run/user/0
/dev/fuse                 fuse      128M   36K  128M   1% /etc/pve

ashift for the pool defaulted to 9 instead of 12, so I already destroyed my first zpool and recreated it.
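
Since ashift can only be set at pool creation time, checking it right after creating the pool saves a rebuild later (a sketch with my pool name):

Code:
zpool get ashift pve-pool0
# recreate with 4K sectors explicitly (destroys the pool and all data on it!)
zpool destroy pve-pool0
zpool create -o ashift=12 pve-pool0 /dev/nvme0n1p5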
/usr is already unexpectedly dense... I guess the zfs backports install caused it. :confused: Good old LVM+ext4 resize to the rescue.
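
The /usr resize itself is quick with LVM + ext4, as long as the VG still has free extents (a sketch with my volume names):

Code:
# grow the /usr LV by 2G and resize the mounted ext4 filesystem online
lvextend -L +2G /dev/vg0/pve-usr
resize2fs /dev/mapper/vg0-pve--usr
df -hT /usr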
 
I calculated more or less the same rating with TBW / expected lifetime and my expected actual usage, which is more or less equivalent to DWPD.
So I included some safety margin in my calculation: if a 250GB disk is rated at TBW = 500TB over its 5-year warranty, then a 500GB disk is rated at 1000TB over 5 years, which in both cases works out to roughly 1 DWPD.

I'm not expecting to hit the TBW wall of the WD Red disks, but I did the math and I'll keep a close eye on how it behaves differently with the ZFS overhead.
It's not only the DWPD, it's also the power-loss protection. Without PLP those consumer SSDs can't cache sync writes, and the performance of those writes will be magnitudes slower. And because sync writes can't be cached, the SSD also can't optimize them, so they cause massive SSD wear. I, for example, got an average write amplification of factor 20 (factor 3 to 62, depending on configuration and workload), so 500 TBW would be reached after writing just 25 TB inside a VM.

And even if you don't run databases that cause a lot of sync writes, ZFS itself will do sync writes for its metadata.
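
One way to get a feeling for the actual write amplification on a given setup is to compare what the guests think they wrote with what the SSD reports having written (a sketch, assuming an NVMe drive and smartmontools; note the values down and compare the deltas after a week of normal use):

Code:
# host-side writes reported by the SSD (NVMe "Data Units Written" are 512,000-byte units)
smartctl -a /dev/nvme0 | grep -i "data units written"
# pool-level I/O statistics (averaged since import)
zpool iostat -v pve-pool0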
 
