ZFS Pool layout options for Boot and VM storage

TonyArr · Oct 27, 2021

Hi all,

I'm getting up and running on a "new" server to replace an array of MiniPCs and SBCs running the varying services and products I use in my quest to self-host.
At this stage I have the hardware, just testing different ways of setting it up to see what I get, how I can use it and using that to get used to Proxmox and teach myself some LXC tricks etc, before doing a final setup and moving everything into the one machine.

I'm wondering if there is a way to direct the ZFS pool layout during the install beyond just "zfs (RAID10/Mirror/RAID-Zx)" and selecting the total number of disks?
The drives I have for my boot disk and VM storage are 6x500GB Kingston A2000s, installed in pairs on PCIe > NVMe cards, using bifurcation on my Dell's x8 slots (so there is no controller nonsense to get in ZFS's way).
Originally I was going to install these in a RAID-Z2, so if a PCIe card failed, the system would stay running in a degraded state, allowing me to resolve the issue without losing access to whatever I'm running on it.
As I have studied and experimented more, however, I've learnt that in RAID-Z your IOPS tend to be help to a single drive's performance, which is fine for stuff like media storage for example, but less than ideal for VMs and databases.
In the endless compromise between Capacity, Redundancy and Performance, I'm pretty sure that RAID10 would be the sweet spot with what I have, and to ensure the right redundancy to remove the PCIe cards as single points of failure, I'd need to be selective about which SSD mirrors which other SSD. Is there a way to do that in the installer? I had through perhaps in debug mode it might drop to a shell at the disk setup step, or have an added button under advance to do so, but didn't see anything.
If not an option for the installer, if I partition up the disks right (so 1mb partition labelled as legacy bios, one 512mb partition labeled ESP and one large partition for the rest of the drive on each), then set up a empty zfs pool called 'rpool' across the drives in the right topology for what I'm after, will the proxmox installer see the free pool and offer to use it?

And yes I know I'm lacking enterprise SSDs, for anyone worried. For what I'm doing, and the capacities I need for it, and the budget I have, these drive's price point plus doing lots of regular backups is the way to go for now. Hopefully by the time I'm looking for more capacity I'll be able to find some enterprise drives second hand at a price point I can afford (or even better, I have a higher income by then! )

Thanks for anyone who has the time to chat! If you have any critique or advice on drive layout, or pool/dataset options I should be using, also welcome!
I'll be running email, contacts syncing, a influxDB (v1) instance, HomeAssistant, Plex, and probably add a federated social network node in at some point, though not really thought about it other than "hey, I should do that sometime"

Dunuin · Oct 27, 2021

TonyArr said:
Hi all,

I'm getting up and running on a "new" server to replace an array of MiniPCs and SBCs running the varying services and products I use in my quest to self-host.
At this stage I have the hardware, just testing different ways of setting it up to see what I get, how I can use it and using that to get used to Proxmox and teach myself some LXC tricks etc, before doing a final setup and moving everything into the one machine.

I'm wondering if there is a way to direct the ZFS pool layout during the install beyond just "zfs (RAID10/Mirror/RAID-Zx)" and selecting the total number of disks?
The drives I have for my boot disk and VM storage are 6x500GB Kingston A2000s, installed in pairs on PCIe > NVMe cards, using bifurcation on my Dell's x8 slots (so there is no controller nonsense to get in ZFS's way).
Originally I was going to install these in a RAID-Z2, so if a PCIe card failed, the system would stay running in a degraded state, allowing me to resolve the issue without losing access to whatever I'm running on it.
As I have studied and experimented more, however, I've learnt that in RAID-Z your IOPS tend to be help to a single drive's performance, which is fine for stuff like media storage for example, but less than ideal for VMs and databases.
In the endless compromise between Capacity, Redundancy and Performance, I'm pretty sure that RAID10 would be the sweet spot with what I have, and to ensure the right redundancy to remove the PCIe cards as single points of failure, I'd need to be selective about which SSD mirrors which other SSD. Is there a way to do that in the installer? I had through perhaps in debug mode it might drop to a shell at the disk setup step, or have an added button under advance to do so, but didn't see anything.
If not an option for the installer, if I partition up the disks right (so 1mb partition labelled as legacy bios, one 512mb partition labeled ESP and one large partition for the rest of the drive on each), then set up a empty zfs pool called 'rpool' across the drives in the right topology for what I'm after, will the proxmox installer see the free pool and offer to use it?

I don't think the Proxmox installer will support that. 2 workarounds come into my mind:
1.) get two 120 GB SATA SSDs and use them as your system disks as a ZFS mirror. First its always better if the systems runs on dedicated drives so if VMs fully utilize the VM storage this won't effect the operation of the hypervisor with its own dedicacted disks. Second point is that you can then use your 6 NVMes as a dedicated VM storage and create that pool using CLI with full control over how it should be created. And you can easily destroy and recreate your VM pool later (for example if you need more space and want to add some new disks) without needing to reinstall proxmox. And third benefit is that you can create a block level backup of your PVE host. PVE will make it easy to backup VMs/LXCs but there is no way to backup the host itself. Best way to back it up is to boot clonezilla from a USB stick and create a image of both complete system disks on blocklevel so that all partitions and even the bootloader gets backed up so you just need to put in a new drive and write the backup from image back to the new disk. So if you want to backup your hosts OS with 2x 120GB this would only need 240GB. If you want to do that with 6x 500GB you would need to backup 3TB. Even a 32GB system disk would be plenty enough space. But USB sticks won't last long because PVE will write around 30GB per day to the system disks and they got no wear leveling.
2.) Dont use ZFS and create your raid 10 using LVM-thin ontop of mdadm SW raid. You can do that by using the debian installer (install a normal debian) and later install the proxmox-ve package ontop of that. You won't get all the nice ZFS features making your storage more reliable but you will save alot of RAM and you SSDs will live way longer.

TonyArr said:
And yes I know I'm lacking enterprise SSDs, for anyone worried. For what I'm doing, and the capacities I need for it, and the budget I have, these drive's price point plus doing lots of regular backups is the way to go for now. Hopefully by the time I'm looking for more capacity I'll be able to find some enterprise drives second hand at a price point I can afford (or even better, I have a higher income by then! )

I hope you keep the write amplification in mind. With ZFS, raid, multiple filesystems ontop of each other and virtualization you get alot of overhead and write amplification. Here I got a write amplification of factor 3 to 82, depending on the workload, with an average of factor 20. So for every 1 GB I write inside a VM 20 GB will be written to the SSDs. So with a average write amplification of factor 20 your A2000s TBW of 350TB would be reached after only writing 17.5TB inside a VM. And a factor 20 write amplification also means only 1/20th of the write performance. So you need to check how bad your write amplification will be at the end after everything is migrated.

I also started with 6x 500GB consumer SSDs (600 TB TBW per 1TB) but needed to replace them with enterprise SSDs (21.000 TB TBW per 1TB) because I would have exceeded the TBW within 1 year. So keep an eye on your SMART attributes so you can replace them early if you see too much writes. Would be really a waste if you need to buy those 6 SSDs each year again and again. Because if you look at the price per TB TBW the enterprise SSDs are waaay cheaper. My second hand enterprise SSDs got a 30 times higher TBW and I paid the same as for my new consumer SSDs.

TonyArr · Oct 28, 2021

Dunuin said:
1.) get two 120 GB SATA SSDs and use them as your system disks as a ZFS mirror. First its always better if the systems runs on dedicated drives so if VMs fully utilize the VM storage this won't effect the operation of the hypervisor with its own dedicacted disks. Second point is that you can then use your 6 NVMes as a dedicated VM storage and create that pool using CLI with full control over how it should be created. And you can easily destroy and recreate your VM pool later (for example if you need more space and want to add some new disks) without needing to reinstall proxmox. And third benefit is that you can create a block level backup of your PVE host. PVE will make it easy to backup VMs/LXCs but there is no way to backup the host itself. Best way to back it up is to boot clonezilla from a USB stick and create a image of both complete system disks on blocklevel so that all partitions and even the bootloader gets backed up so you just need to put in a new drive and write the backup from image back to the new disk. So if you want to backup your hosts OS with 2x 120GB this would only need 240GB. If you want to do that with 6x 500GB you would need to backup 3TB. Even a 32GB system disk would be plenty enough space. But USB sticks won't last long because PVE will write around 30GB per day to the system disks and they got no wear leveling.

I do currently have the bootloader installed on a pair of SD cards mirrored in a IDSDM at the moment, which is also mirrored to spare USB drive I had sitting around and figured I could put to use, as the R730xd I got doesn't support booting NVMe unless it's through Dell's U.2 expander. So as I was reading your reply I was starting to think I could just move the OS entirely to be on that 3 way mirror, but yeah with that much writing to the boot disk, that really isn't an option... maybe I'll see if I can find a couple of 60 or less GB optanes I could throw in a mirror.
Setting the A2000s as dedicated VM/LXC storage does make sense though, and if I am using something over USB or SATA as the OS drive, then that means I don't need to remember to do the proxmox-boot-tool dance if doing clean setups later...

I'm actually preferring to step away from block-level clones in this project, and trying to do a lot of "infrastructure as code" setup so that if I need to setup again without a full backup I still get everything how I need it. And besides, zfs send/receive has been pretty effortless for me thus far. Before the R730 rocked up I was playing with proxmox on a single SATA SSD in a NUC, and had a few macOS VMs running as well as a Windows VM running, and their performance was barely touched by sending a snapshot via USB3 to another SSD (yes I have a lot of random capacity SSDs ). I was playing Freelancer (HD mod with widescreen and long-draw) through a VNC session and had no complaints. I expect that my backups overall will limited to running around midnight - 3am, and using ZFS send means they'll be incremental.
Restoring from the ZFS snapshots also wasn't terrible on the single drive setup:
1. Boot the proxmox install USB in debug mode
2. Create a new rpool
3. zfs send/receive
4. Reboot proxmox rescue mode off the usb, which picks up the existing install and uses the USB's bootloader but the install's kernel
5. use proxmox-boot-tool to format and init the ESP partition, which also pairs it to the install so future updates know where to look (on a multitask pool, I'd have to do this once for each disk)
7. reboot and it's all back.

If only one drive died in the pool, all I'd need to do is resilver and then run the boot tool for the replaced drive. And if I was doing that to move to a new storage medium (say, 8k sector disks or some new different form of flash with alignment differences etc) block copies may (of course not guaranteed to, but have in the past) leave me with a boot setup not efficiently using the drive advantages. I probably would take the occasional block backup as a just in case, like before a major software upgrade, but generally I'm a little adverse to them.

(and in terms of downtime, this is a low importance system compared to most. As long as I could get it back up within a day, there would be no issue, as my mail has a satellite system set up on a VPS that controls all the access to the actual server, and just gives out "try again in 4 hours" messages to incoming messages when the main host is down. I'm the only one actually relying on any of this, so plenty of freedom cause screw ups only effect me, not anyone else )

Dunuin said:
2.) Dont use ZFS and create your raid 10 using LVM-thin ontop of mdadm SW raid. You can do that by using the debian installer (install a normal debian) and later install the proxmox-ve package ontop of that. You won't get all the nice ZFS features making your storage more reliable but you will save alot of RAM and you SSDs will live way longer.

I kinda really like the nice ZFS features making the storage more reliable tho... but I could do as you're suggesting with ZFS on root with Debian, I already have done a Debian on ZFSroot install the NUC when I first got it and it was... a few steps to get rolling, but nothing terrible. Secure Boot gave me more trouble than anything else, but once I worked that out there was nothing standout difficult.
I am not RAM poor at this stage, this thing came along with 4x32GB DIMMs, and there are 20 more slots I can drop DIMMs into as needed.

Dunuin said:
I hope you keep the write amplification in mind. With ZFS, raid, multiple filesystems ontop of each other and virtualization you get alot of overhead and write amplification. Here I got a write amplification of factor 3 to 82, depending on the workload, with an average of factor 20. So for every 1 GB I write inside a VM 20 GB will be written to the SSDs. So with a average write amplification of factor 20 your A2000s TBW of 350TB would be reached after only writing 17.5TB inside a VM. And a factor 20 write amplification also means only 1/20th of the write performance. So you need to check how bad your write amplification will be at the end after everything is migrated.

I had no idea how far the write amplification could go, I was expecting a factor or 3 or 4 maybe

How am I best to measure that when things are running?

Dunuin · Oct 28, 2021

TonyArr said:
So as I was reading your reply I was starting to think I could just move the OS entirely to be on that 3 way mirror, but yeah with that much writing to the boot disk, that really isn't an option... maybe I'll see if I can find a couple of 60 or less GB optanes I could throw in a mirror.

The system drives don't need to be that fast. There are cheap (25$) very small M.2 SSDs like the "32GB Transcend 400S M.2 2242 M.2 6Gb/s MLC" that could be attached to an USB-SSD encosure. But yes, something with a Powerloss Protection would be better for reliability.

TonyArr said:
I had no idea how far the write amplification could go, I was expecting a factor or 3 or 4 maybe
How am I best to measure that when things are running?

The factor 82 write amplification was a debian 10 VM running on ext4 with virtio SCSI doing 4K random sync writes with a encrypted striped mirror as the VM storage. Basically all sync writes are really terrible when it comes to write amplification, so you should avoid using databases whenever possible if the SSDs should last long.
Measuring write amplification isn't that easy. I did it doing fio benchmarks inside the guests and measuring in parallel on the host how much data was written to the SSDs NAND using a script that logged the SMART attributes. I think NVMe SSDs will only log data written from the host to the SSD, not actual data that was written to the SSDs NAND chips so a absolute write amplification could be hard to find out.
So easiest would be to just use smartctl to look how much data was written to that SSD. After a week or month you do it again and compare the numbers so you can extrapolate how much data should be written to that SSD over 5 years (your warranty) and compare that to your drives TBW.
If these extrapolated writes are more than your TBW, then you will loose your warranty even if the 5 years aren't over.

TonyArr · Oct 29, 2021

Dunuin said:
The system drives don't need to be that fast. There are cheap (25$) very small M.2 SSDs like the "32GB Transcend 400S M.2 2242 M.2 6Gb/s MLC" that could be attached to an USB-SSD encosure. But yes, something with a Powerloss Protection would be better for reliability.

Oh I wasn't thinking about the speed, I was thinking about the fact most of them have much better write endurance, even at low capacities. Alas, I've only seen them round the 100 buck mark for 32GB, and those ones are 350-450TBW endurance so not much in the gains department anyway.
But at 25 bucks apiece, I could justify grabbing a pair for use and a pair for spares... And on a cursory search, 64GB ones are only a few bucks difference after currency conversion and shipping... Thanks!

Search

Search

ZFS Pool layout options for Boot and VM storage

TonyArr

Member

Dunuin

Distinguished Member

TonyArr

Member

Dunuin

Distinguished Member

TonyArr

Member

We value your privacy