Homelab cluster - disk setup best practice guide, ZFS

dp123

New Member
May 7, 2023
Hi - firstly, I have looked for a guide on how best to do this, but haven't found anything that I can make sense of, mostly because the majority appears to (understandably) be targeted towards enterprise environments - if I've missed one, I would be very grateful for a link, and totally accepting of being called a plonker with poor Google-fu. I'm looking for help with a home lab setup... My storage setup is below - all SATA SSDs are enterprise grade (Intel S3700). All my workloads (currently) will run on a single node's resources.

3 Node cluster made up of micro PCs (Fujitsu Esprimos).
- Node 1 - 400GB SATA & 100GB SATA; 32GB RAM
- Node 2 - 16 GB Optane NvME and 400GB SATA; 32GB RAM
- Node 3 - 200GB SATA and 500GB spinning platter HDD; 16GB RAM.
- NAS - WD-EX4100/4bay running RAID 5 with hot spare on WD gold disks.

(I know identical hardware is easier/better - but this way I get to learn about way more scenarios... and it was cheaper/what I could find cheaply on eBay.)

I run daily backups of all VMs to the NAS - an RPO (recovery point objective) of 24 hours is absolutely fine for me. I'm aware dual disks per node would be 'better', but I am fine with single-disk risk and a 24 hr RPO. I have this running with LVM storage, and it's great, but there's lots of chat about ZFS and how it's "better"... so I want to play, but I'm totally new to ZFS. I'm therefore looking for a guide on how to set it up, covering:

1. Optimal partition sizes? I don't really understand this about ZFS and how ZFS uses disk partitions
2. How do I add the second 400GB SATA disk and expand the rpool/data dataset?

3. How do i see the current allocations across:
```
rpool
rpool/ROOT
rpool/ROOT/pve-1
rpool/data
```
4. Optimal block size - granted this is workload dependent, though I understand the default of 128KB is generally recommended. Given I expect mostly logging, sensor data, and small writes to splunk/influxdb/syslog servers etc., I was going to go with 32KB?

5. I think the scenario I am aiming for is one zpool per node, with 3 datasets (ROOT, ROOT/pve-1 and data), with data spread across 2 disks (no replication), each node with the same basic config (ashift=12, block size, pool, vdev and dataset names) - with, for example, Node 1 having around 480GB for data and the rest for the OS etc.

Any help greatly appreciated!
 
1. Optimal partition sizes? I don't really understand this about ZFS and how ZFS uses disk partitions
Simply use a single big partition or the whole disk for ZFS. Then all your storages can dynamically share the whole space. It's not like LVM/LVM-Thin where you need to care about how much space you want for virtual disks and how much space for files.
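For example, with everything in one pool you can just run zfs list and see that every dataset shares the same free space (a minimal illustration; the dataset names are those of a default PVE install):

```
# every dataset reports the same AVAIL, because they all draw
# from the pool's shared free space
zfs list -r rpool
```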

2. How do I add the second 400GB SATA disk and expand the rpool/data dataset?
Depends on what you want to do with it. Raid0 or Raid1?

Optimal block size - granted this is workload dependent, though I understand the default of 128KB is generally recommended. Given I expect mostly logging, sensor data, and small writes to splunk/influxdb/syslog servers etc., I was going to go with 32KB?
There are two different "block sizes": the recordsize used by datasets and the volblocksize used by zvols. The volblocksize is fixed, defaults to 8K, and you really need to take care to choose a good value, because otherwise you will lose a lot of usable capacity or write/read amplification goes up.
The recordsize you are talking about, defaulting to 128K, is dynamic and only defines the upper limit. ZFS will decide for each record if it wants to write it as a 4K, 8K, 16K, 32K, 64K or 128K record. So with a 128K recordsize, when you write a 25KB file, it will create a 32K record. When you write a 60KB file it will create a 64K record. When you write a 320KB file it will create 2x 128K records + 1x 64K record.
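If you want to inspect or tune these, a minimal sketch (the zvol name vm-100-disk-0 is just an example):

```
# recordsize is a per-dataset upper limit and can be changed at any time
zfs get recordsize rpool/data
zfs set recordsize=32K rpool/data               # only affects newly written records

# volblocksize is fixed per zvol and can only be chosen when the zvol is created
zfs get volblocksize rpool/data/vm-100-disk-0
```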

for example Node 1 having around 480GB for data with the rest for OS etc.
Keep in mind that ZFS needs free space to be able to operate efficiently. You usually don't want to fill your ZFS pool more than 80-90% for best performance. Especially when using HDDs.
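You can check how full a pool currently is with:

```
# the CAP column shows how full the pool is; try to keep it below ~80-90%
zpool list rpool
```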
 
Thanks - really appreciate the help - it's starting to make sense... especially the block sizes, very well explained - that cleared up a lot of my confusion there!

In terms of

2. How do I add the second 400GB SATA disk and expand the rpool/data dataset?
Depends on what you want to do with it. Raid0 or Raid1?

I was aiming to have the 16GB Optane NVMe as the 'boot/OS' drive, with the SATA SSD as the 'data' drive for VM disks. At the moment the VM disk space is the default rpool/data dataset (I think) - so I either want to just move that to a different physical drive, or span that dataset across 2 drives (not sure if I'm making sense - in the LVM world, I just added the disk as a new PV and extended the LV across the second drive).


Not sure if this helps with current setup, but just in case:

Bash:
```
root@proxmox-node-2:~# pvesm zfsscan
rpool
rpool/ROOT
rpool/ROOT/pve-1
rpool/data
```

Bash:
```
root@proxmox-node-2:~# zdb -e -C rpool

MOS Configuration:
        version: 5000
        name: 'rpool'
        state: 0
        txg: 207
        pool_guid: 11461766048783398431
        errata: 0
        hostid: 3794294419
        hostname: '(none)'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 11461700330383398431
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 14536470055202514779
                path: '/dev/disk/by-id/nvme-eui.5cd2e88bb1b50100-part3'
                whole_disk: 0
                metaslab_array: 65
                metaslab_shift: 29
                ashift: 12
                asize: 13860339712
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 129
                com.delphix:vdev_zap_top: 130
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
```
 
At the moment the VM disk space is the default rpool/data dataset (I think) - so I either want to just move that to a different physical drive, or span that dataset across 2 drives (not sure if I'm making sense - in the LVM world, I just added the disk as a new PV and extended the LV across the second drive).
Moving datasets/zvols between pools is no problem. This can be done with "zfs send ... | zfs recv ...". When you "span" across drives you basically create a raid0, which means if one of the disks fails, all data on both disks is lost. So I only see downsides to doing that. And it's no problem to use the same disk for PVE + data + VM/LXC storage. A dedicated disk for the PVE system is primarily useful if you want to be able to easily destroy a ZFS pool (for example when replacing an SSD because you need a bigger SSD for your VM/LXC storage) without needing to reinstall PVE. So you could just install PVE to the SATA SSD and maybe use the Optane as a swap partition or SLOG (write cache for sync writes).
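A rough sketch of such a move (pool and zvol names are just examples; "tank" is a hypothetical second pool whose parent dataset already exists):

```
zfs snapshot rpool/data/vm-100-disk-0@move
zfs send rpool/data/vm-100-disk-0@move | zfs recv tank/data/vm-100-disk-0
# then point the VM's disk at the new storage and, once verified,
# remove the old zvol: zfs destroy -r rpool/data/vm-100-disk-0
```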
 
Many thanks - dare I say I might be starting to understand ZFS at a very superficial level. My understanding of options is below, if you would be so kind as to check it for any inaccuracies it would be most appreciated.

3 Node PVE Cluster. Target:
  • 1 'datacentre' zpool configuration
  • 1 local zpool per node (zpool)
  • 1 rpool/data dataset in each zpool on each node for VM storage
  • 2 vdevs (1 vdev per disk - redundancy/raid lives within a vdev, not across vdevs, so capacity matching, redundancy etc. need to be considered when adding one (I think?)). Note: while having mismatched vdevs (called a mutt) is perfectly viable, it's not 'best practice'. Each vdev here consists of 1 physical device/disk.

Node 2:

Option 1:

Install PVE on Optane and have 2nd SSD for additional capacity. No SLOG/support VDEVs. Add 2nd SSD as new VDEV.
  • Pros - Simpler setup as no 'support' vdevs (noting not using Optane at all simplest, though where's the fun in that?)
  • Cons - No separate SLOG device, theoretical risk of contention on SSD I/O (though something I would have to work really hard to actually make happen).
How to:
  1. Install PVE as normal: select the RAID0 type, select -- do not use -- for the second disk (Harddisk 1) and select the NVMe Optane as Harddisk 0. Note the RAID type is ignored when only 1 disk is used.
  2. Update repos & packages (not required, but you should do this)
  3. Add the second disk to the pool: zpool add <poolname> <diskname>, for example zpool add rpool sda (see the sketch below). This will create a new vdev, add it to the current zpool, and the extra space will be available to all datasets in the pool. Done.
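For reference, a minimal sketch of what that could look like (device names and the by-id path are placeholders, not my actual hardware):

```
lsblk -o NAME,SIZE,MODEL,SERIAL     # identify the second SSD
# prefer the stable /dev/disk/by-id path over a bare "sda"
zpool add rpool /dev/disk/by-id/ata-INTEL_SSDSC2BA400G3_EXAMPLESERIAL
zpool status rpool                  # the pool should now show two top-level vdevs
zfs list rpool                      # AVAIL grows for every dataset in the pool
# note: this stripes the two disks (RAID0-like) - losing either disk loses the whole pool
```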

Option 2:
Install PVE on the SSD and use the Optane for a SLOG - i.e. a dedicated device for the ZFS Intent Log (ZIL), so synchronous writes land there first. NB: could also use the Optane as a cache drive for read-heavy workloads.
  • Pros - Using NvME Optane for SLOG provides highest speed & endurance for sync writes.
  • Cons - Limited - I guess OS boot/operations may be slightly slower, but it's unlikely the disk would be the limiting factor here.
How to:
  1. Install PVE as normal: select the RAID0 type, select -- do not use -- for the second disk (Harddisk 1) and select the SATA SSD as Harddisk 0. Note the RAID type is ignored when only 1 disk is used.
  2. Update repos & packages (not required, but you should do this)
  3. Add the NVMe disk to the pool as a LOG type vdev: zpool add <poolname> log <diskname>, for example zpool add rpool log nvme0n1 (see the sketch below). This will add the disk nvme0n1 to the rpool ZFS pool as a dedicated log device.
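And a similar sketch for this option (again, the device path is a placeholder):

```
lsblk -o NAME,SIZE,MODEL,SERIAL     # identify the Optane device
zpool add rpool log /dev/disk/by-id/nvme-INTEL_OPTANE_EXAMPLE
zpool status rpool                  # the Optane now appears under a separate "logs" section
# a log device can be removed again later if needed:
# zpool remove rpool /dev/disk/by-id/nvme-INTEL_OPTANE_EXAMPLE
```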

For reference, disk names can be obtained by running lsblk. To add a different vdev type, just substitute log for one of mirror, raidz/raidz1/raidz2/raidz3, spare, cache, special or dedup (I think that's all the options).

Option 2 seems to be 'better' - though with the caveat that it's only better in a theoretical sense; actually hitting the performance limits of a SATA SSD without a dedicated Optane SLOG drive would be challenging.

Nodes 1/3:

Equivalent options as above. For my use case I will follow option 1 for nodes 2 & 3, noting that pool performance on node 3 will be noticeably slower due to the HDD (as ZFS will balance writes across both the SSD and HDD vdevs based on storage and IO capacity at the time of the write - hence data for a single VM disk will almost always be spread across SSD and HDD - hence pool performance is, in effect, limited to the speed of the slowest vdev).

If this is accurate, then many many thanks for helping me get my understanding to here. If it's not, then any corrections, most appreciated, and hopefully this thread will be useful to others.
 
Simply use a single big partition or the whole disk for ZFS. Then all your storages can dynamically share the whole space. It's not like LVM/LVM-Thin where you need to care about how much space you want for virtual disks and how much space for files.
This caused issues for me. If you do that, and for any reason the filesystem fills up, it causes a lot of problems (as you would expect).

It's always a best practice to separate the OS from your growing data. I was told to "use quotas" on a single disk because the installer doesn't have a default system partition. This is a large PITA as it's all CLI-based with Proxmox. The official best practice is to use a system drive just for the OS and then use other drives for data, which would be nice, but many home labs don't have that as an option.

I asked and was never able to get a straight answer on what the boot partition size should be. The wiki doesn't provide this info and a staff member wouldn't give a straight answer.
 
I have looked for a guide on how best to do this, but haven't found anything that I can make sense of,
The reason you don't find such a tutorial is that it does not make any sense. You would not want to have a cluster on local ZFS, not even for testing.
Of course you can do it, as @Dunuin wrote, but why simulate/test/learn something that is nothing but "Just for fun"? It would be total overkill for a homelab.

In a cluster environment (at least as the term is used by any other company), you want shared storage with near zero-loss failover, live migration and so on. This is NOT (easily, or nearly as well) possible with ZFS on each node. If you had ZFS on an external storage box and set it up via ZFS-over-iSCSI, that would be much better.

I also know that it's fun to play around with such things, but simulating a cluster is much simpler if you just simulate it while running inside PVE itself, e.g. a Cluster-in-a-Box kind of setup. You can play around with it, snapshot it etc. while using just one machine instead of at least two, and save energy.
 
This caused issues for me. If you do that, and for any reason the filesystem fills up, it causes a lot of problems (as you would expect).

It's always a best practice to separate the OS from your growing data.
In a cluster, yes. For your single machine clearly: "it depends". I, for example, never do this.

I was told to "use quotas" on a single disk because the installer doesn't have a default system partition. This is a large PITA as it's all CLI-based with Proxmox. The official best practice is to use a system drive just for the OS and then use other drives for data, which would be nice, but many home labs don't have that as an option.
Yes, I'm from the refreservation faction and I created the bugzilla report about it. ZFS is not easily configurable via a GUI, because it would be just as hard as the CLI; it is a beast, but worth learning. That being said, you need to understand the concepts behind ZFS and its unpredictability with anything besides mirroring. Most people learn this the hard way.

I asked and was never able to get a straight answer on what the boot partition size should be. The wiki doesn't provide this info and a staff member wouldn't give a straight answer.
Why do people always assume that there is a better answer than the one already built into the product? If in doubt, increase it.
 
Why do people always assume that there is a better answer than the one already built into the product? If in doubt, increase it.
My question was "what are recommended boot partition sizes"

Any filesystem you use and can install onto has sizing options. It's just silly that there are no size recommendations documented. If the people who make the product don't understand it well enough to give recommended sizes for a few basic scenarios, who else could have a hope of knowing?

Basic scenarios:

basic lab 1-5 systems
advanced lab 5-10 systems
production system up to 25 systems
production system up to 50 systems

This is all assuming that the number of systems means the boot/OS drive needs additional storage. If it doesn't, then it's even easier, with fewer options. It's like pulling teeth to get a real answer on boot/OS partition size.

The reason you don't find such a tutorial is that it does not make any sense. You would not want to have a cluster on local ZFS, not even for testing.
This statement doesn't make sense. There are lots of good reasons to use ZFS on a local node. The biggest one is that ZFS is pretty resilient. It won't corrupt itself easily, it has built in compression (yes, qcow2 does too, but that's putting your files on the more easily corrupted filesystems), it has snapshots, copy on write clones, etc etc.

If this is production, it's not ideal, but it is still a better option than running the system on EXT4 with/without LVM.
 
My question was "what are recommended boot partition sizes"
A common size for an ESP is 512MB, but some Linux distributions are considering/switching to 1GB, as kernels are getting bigger and you want to keep a few versions on there. The GRUB BIOS partition is less than 1MiB, which fits between the MBR/GPT and the first partition aligned to 1 MiB.
This is all assuming that the number of systems means the boot/OS drive needs additional storage. If it doesn't, then it's even easier, with fewer options. It's like pulling teeth to get a real answer on boot/OS partition size.
Luckily for you, the boot drive size (per node) does not depend on the number of nodes in a cluster.

Please note that people on this forum are volunteers (even staff, as far as I know), spending their time here freely and for free. Most of them are random strangers from the internet, so don't blindly follow their advice. If you want expert advice within a certain time limit: buy a subscription with support tickets.
 
A common size for an ESP is 512MB, but some Linux distributions are considering/switching to 1GB, as kernels are getting bigger and you want to keep a few versions on there. The GRUB BIOS partition is less than 1MiB, which fits between the MBR/GPT and the first partition aligned to 1 MiB.

Luckily for you, the boot drive size (per node) does not depend on the number of nodes in a cluster.

Please note that people on this forum are volunteers (even staff, as far as I know), spending their time here freely and for free. Most of them are random strangers from the internet, so don't blindly follow their advice. If you want expert advice within a certain time limit: buy a subscription with support tickets.
sorry, I should have mentioned that the conversation I was having on the forum was with a staff member.

Even today, when I poked that thread again, the only size I received was an 8GB minimum, but that would only be good for testing the install in a VM. Even then I couldn't get a "we recommend a 16GB partition for PVE". Neither the install docs nor the wiki give any useful info. This was something I pointed out 4 months ago and it hasn't changed. I don't expect someone to run out and update it the second I mention it's missing, but it's been more than a quarter of a year.

It's frustrating, especially when several threads have staff members being snarky and basically saying "RTFM", yet TFM doesn't have the answers and the staff won't actually give real and useful answers when directly engaged in a thread.

There is often the statement of "if you want answers, get a support subscription", but even with a 325 euro support plan you get a whole 3 tickets a year. Not exactly enticing. I get to use 1 of my 3 yearly tickets to check on something so basic that it should be well documented but isn't.

Why would anyone want to use this at the enterprise level with such basic info so hard to get?
 
This caused issues for me. If you do that, and for any reason the filesystem fills up, it causes a lot of problems (as you would expect).

It's always a best practice to separate the OS from your growing data. I was told to "use quotas" on a single disk because the installer doesn't have a default system partition. This is a large PITA as it's all CLI-based with Proxmox. The official best practice is to use a system drive just for the OS and then use other drives for data, which would be nice, but many home labs don't have that as an option.

I asked and was never able to get a straight answer on what the boot partition size should be. The wiki doesn't provide this info and a staff member wouldn't give a straight answer.
Yes, you should always set a quota, so you can't fill that pool to 100% by accident. But that is super simple. Let's say you've got a 1000GB disk and you want 32GB for the PVE system and the remaining space for VMs/LXCs. A ZFS pool will get slow when you fill it too much, so you probably only want to fill it to 80-90%. So I would first set a pool-wide quota of 900GB (90%): zfs set quota=900G rpool
Then a quota for the "data" dataset, which stores the virtual disks, of 900GB - 32GB = 868GB: zfs set quota=868G rpool/data
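Put together for that example (same illustrative numbers as above):

```
zfs set quota=900G rpool                        # never fill the pool past ~90%
zfs set quota=868G rpool/data                   # 900G minus the ~32G kept for the system
zfs get quota,used,available rpool rpool/data   # verify
```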

If you don't want to use the CLI, you shouldn't use PVE. A secure installation requires that you set up a lot of stuff using the CLI and only the most basic stuff is covered by the webUI.

Yes, I'm from the refreservation faction and I created the bugzilla report about it. ZFS is not easily configurable via a GUI, because it would be just as hard as the CLI; it is a beast, but worth learning. That being said, you need to understand the concepts behind ZFS and its unpredictability with anything besides mirroring. Most people learn this the hard way.
That's true, like with most things: if you don't understand what you are doing, then all the features available through the webUI won't help much. And if you have learned how ZFS works, then you also know the zfs commands, and using the CLI wouldn't be harder than using a GUI. I, for example, never use the webUI for anything ZFS related. With the CLI it is done faster and I can directly control what will happen, instead of clicking through some windows where the webUI then does some obfuscated stuff I can't see or directly control.

Basic scenarios:

basic lab 1-5 systems
advanced lab 5-10 systems
production system up to 25 systems
production system up to 50 systems

This is all assuming that the number of systems means the boot/OS drive needs additional storage. If it doesn't, then it's even easier, with fewer options. It's like pulling teeth to get a real answer on boot/OS partition size.
PVE is not an appliance. Some people use 8GB, some 16GB; I prefer 32GB to be safer. It really depends on what you want to do with it.
If I want to install a lot of additional packages I need way more space. Then there is the question of whether you want to keep the pool default, with the "local" storage on the system dataset, or whether you add another dataset to store your ISOs/backups/templates/snippets/...
If I don't want an additional storage for that stuff and maybe want to store 200GB of ISOs, then my root dataset also needs to be 200GB bigger.
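Such an extra dataset could look like this (a sketch only; the dataset name and storage ID are made up, and it assumes the default /rpool mountpoint of a PVE install):

```
zfs create rpool/isostore
pvesm add dir isostore --path /rpool/isostore --content iso,vztmpl
```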

This statement doesn't make sense. There are lots of good reasons to use ZFS on a local node. The biggest one is that ZFS is pretty resilient. It won't corrupt itself easily, it has built in compression (yes, qcow2 does too, but that's putting your files on the more easily corrupted filesystems), it has snapshots, copy on write clones, etc etc.

If this is production, it's not ideal, but it is still a better option than running the system on EXT4 with/without LVM.
For a single node ZFS is fine. For a 2-node + qdevice cluster too, where you can't run better-suited storage like Ceph and don't want the single point of failure of ZFS-over-iSCSI/NFS. But with enough nodes, ZFS and replication don't make much sense anymore. Ceph has all the ZFS features and more, is proper shared storage, and scales better.

A common size for an ESP is 512MB, but some Linux distributions are considering/switching to 1GB, as kernels are getting bigger and you want to keep a few versions on there. The GRUB BIOS partition is less than 1MiB, which fits between the MBR/GPT and the first partition aligned to 1 MiB.
PVE just switched with 7.4 from 512MB to 1GB.
 
sorry, I should have mentioned that the conversation I was having on the forum was with a staff member.
Sorry, I did not realize that this public thread was limited to you and a staff member.
Why would anyone want to use this at the enterprise level with such basic info so hard to get?
Enterprises pay for the idea of support for when (if ever) they run into problems. Try Proxmox for free (with all features) for your workload. If it works well, or if you can get it to work for your purpose with help from this forum: good for you. If not, try other solutions and vendors. You lost some time, but prevented an expensive mistake, and live happily ever after. No need to kick Proxmox if it does not meet your expectations; just move on.
 
Sorry, I did not realize that this public thread was limited to you and a staff member.

Enterprises pay for the idea of support for when (if ever) they run into problems. Try Proxmox for free (with all features) for your workload. If it works well, or if you can get it to work for your purpose with help from this forum: good for you. If not, try other solutions and vendors. You lost some time, but prevented an expensive mistake, and live happily ever after. No need to kick Proxmox if it does not meet your expectations; just move on.
See, more snark. It wasn't limited to just the staff member and I, obviously (having to state the obvious, since if I don't, you get snarky), but the question was very direct and received no specific answer from that staff member beyond the useless answer of "you can do this, but only to test the installer". Would it have hurt them to give a meaningful answer? It would have taken one more sentence at most, and yet it wasn't done.

It's not kicking Proxmox, it's called feedback. If a company never hears about things they could improve, or how they are failing, they have no incentive to change. If they hear the same thing often enough, one of a few things happens: they either change to meet the needs of the community they are trying to get to purchase/use their product, or they double down and sit at an extremely small market share. Personally, the product feels like it's still in a mostly testing and product-enhancement phase. It doesn't feel like a mature product. This isn't an insult; it's me seeing a lot of potential in the product and hoping they realize it, but there is some work they have to do to get there.
 
This statement doesn't make sense. There are lots of good reasons to use ZFS on a local node. The biggest one is that ZFS is pretty resilient. It won't corrupt itself easily, it has built in compression (yes, qcow2 does too, but that's putting your files on the more easily corrupted filesystems), it has snapshots, copy on write clones, etc etc.
I was - as the text mentioned - ONLY talking about ZFS in a cluster, and of course it makes total sense for a single node, but not for a cluster. ZFS is not a cluster filesystem and just by that not suitable for a cluster. That is everything I said and wanted to say. Therefore you won't find resources online that present a working solution for running ZFS in a cluster with ZFS on local nodes. It's not a cluster, it's a bunch of local nodes that combine all the negative things of a cluster with none of the benefits.
 
it's a bunch of local nodes that combine all the negative things of a cluster with none of the benefits.
Works for me. Very well! No need for a complex storage system with a whole new field of additional problems.

Disclaimer: this is a very small cluster, tolerating data loss up to the replication interval.

Just my 2€c...
 
There are two different "block sizes": the recordsize used by datasets and the volblocksize used by zvols. The volblocksize is fixed, defaults to 8K, and you really need to take care to choose a good value, because otherwise you will lose a lot of usable capacity or write/read amplification goes up.
The recordsize you are talking about, defaulting to 128K, is dynamic and only defines the upper limit. ZFS will decide for each record if it wants to write it as a 4K, 8K, 16K, 32K, 64K or 128K record. So with a 128K recordsize, when you write a 25KB file, it will create a 32K record. When you write a 60KB file it will create a 64K record. When you write a 320KB file it will create 2x 128K records + 1x 64K record.
I'm new to using Proxmox and I've been training a lot to put it into practice at the company where I work.

Is the information you mentioned in practice similar to the image?


[attached screenshot 1721853063431.png, showing a disk creation dialog with "Block Size: 16k"]
 
That "Block Size: 16k" is the volblocksize (meanwhile changed defaults from 8K to 16K) and only effects VMs. LXCs will use the recordsize which has to be set via CLI.
 
