Best Two Node Setup Sharing A ZFS Pool

liamlows

Active Member
Jan 9, 2020
Hey y'all,

So I wanted to gauge some opinions here on a homelab setup I am in the process of creating with 2 PVE nodes that have already been set up in a cluster. Here are the 2 nodes:

1. R430 (pve1) with 2x Xeon E5-2620 v3 CPUs, 32GB RAM, and 8x 800GB SAS SSDs in RAIDZ1 (~6TB usable) (cluster host)
2. R720xd (pve2) with 2x Xeon E5-2640 CPUs, 95GB RAM, and 11x 1.2TB SAS HDDs in RAIDZ2 (~10TB usable)

pve1 is used to run all applications (Emby server, game servers, personal development servers, etc.) due to its higher-end HW, while pve2 is really just a file storage server (I got it for ~$300 a few months back).

I am currently in the process of creating new containers to run the media server (Emby) and a seedbox that runs applications to obtain content (Sonarr, Radarr, Prowlarr, qBittorrent). I am trying to find the best way to give the LXCs on pve1 access to the ZFS pool on pve2 for storing media content. I do not plan to store pve1's guests on the pool, just use it for file storage for streaming and access. I have thought of a few solutions but am interested in other opinions before I set forth. I obviously strive for the most performant setup possible, but I understand there may be limitations and I don't have all the time in the world to work on this.

NOTE: I did not realize until recently that unprivileged containers can NOT mount NFS shares, which eliminates my initial idea of simply sharing the ZFS pool directly to the containers via NFS.

Solution 1 (preferred): share the pool on pve2 via NFS to pve1 and bind mount the NFS mount into the individual LXCs that need it. I like this the most as I could restrict the export to pve1's IP and get at least a small amount of access control.
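
Roughly what I have in mind for solution 1 (untested; the IPs, dataset/mount paths and container ID below are just placeholders):

Code:
# on pve2: export the dataset, restricted to pve1's IP
echo '/tank/media 10.0.0.11(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# on pve1: mount the export on the host itself
mkdir -p /mnt/pve2-media
echo '10.0.0.12:/tank/media /mnt/pve2-media nfs defaults,_netdev 0 0' >> /etc/fstab
mount /mnt/pve2-media

# then bind mount that host path into each LXC that needs it
pct set 101 -mp0 /mnt/pve2-media,mp=/mnt/media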

Solution 2: mark the media server and seedbox LXCs as privileged and allow them to mount NFS directly. Not ideal because, although the risk is relatively low in my setup, I don't like the idea of running privileged LXCs.
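
If I did go this route, I believe it would just be a matter of enabling the mount feature on the (privileged) container and then mounting from inside it, something like (container ID is a placeholder):

Code:
pct set 101 -features mount=nfs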

Solution 3 (not sure how I'd do this): use iSCSI to share storage from the pool on pve2 to pve1 and bind mount it into the LXCs. I think this could be more performant, but I have never done anything with iSCSI.
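
From what I have read so far, solution 3 would look roughly like the sketch below (target names, sizes and IPs are made up, and note that iSCSI hands pve1 a raw block device, so it would still need a filesystem on it before anything can be bind mounted):

Code:
# on pve2: carve a zvol out of the pool and export it with targetcli
zfs create -V 4T tank/iscsi-media
targetcli /backstores/block create media-lun /dev/zvol/tank/iscsi-media
targetcli /iscsi create iqn.2022-01.local.pve2:media
targetcli /iscsi/iqn.2022-01.local.pve2:media/tpg1/luns create /backstores/block/media-lun

# on pve1: discover and log in; the LUN then shows up as a new /dev/sdX
iscsiadm -m discovery -t sendtargets -p 10.0.0.12
iscsiadm -m node --login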

Let me know if y'all have any other ideas or what you think. Thanks!
 
  • Like
Reactions: KimboSlice
There is also a 4th solution: use VMs, which can directly mount NFS/SMB shares. For everything that is reachable from the internet or is very important I would personally use VMs anyway because of the better isolation, and therefore better security and fewer problems.
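
Inside the VM that is then just a normal NFS client mount, e.g. an fstab line like (IP and paths are only examples):

Code:
10.0.0.12:/tank/media  /mnt/media  nfs  defaults,_netdev  0  0
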
For LXCs that are only accessed from the LAN I use solution 1. Also keep in mind that you will need to edit the user remapping when bind-mounting into an unprivileged LXC: https://pve.proxmox.com/wiki/Unprivileged_LXC_containers
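
As a minimal sketch of that remapping, mapping the container's UID/GID 1000 straight through to the host (the exact numbers depend on your setup, see the wiki above; 101 is a placeholder VMID):

Code:
# /etc/pve/lxc/101.conf
lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: g 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535

# and /etc/subuid + /etc/subgid on the host each need an extra entry
root:1000:1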

And I wouldn't use raidz1 as VM/LXC storage if you don't really require all that capacity. Read this article explaining volblocksize and padding overhead: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz

Using 8x SSDs in a raidz1 pool created with ashift=12 (ZFS uses 4K as the smallest blocksize/sector per disk) and the default 8K volblocksize, you will lose 50% of your raw capacity when using VMs. And a ZFS pool should always keep 20% free space so it won't get slow, so with default values actually only 40% of the raw capacity can be used with zvols. To get the padding overhead down to a reasonable value you would need to increase the volblocksize to at least 32K (20% of raw capacity lost instead of 50%) or even 256K (14% lost instead of 50%), which would be really bad when running stuff like MySQL/Postgres DBs that need to write blocks smaller than the volblocksize.
With only LXCs it wouldn't be that much of a problem, as LXCs use datasets instead of zvols and these use the dynamic recordsize instead of the fixed volblocksize. But VM/LXC storage still primarily needs IOPS, and IOPS performance won't increase with the number of disks when using raidz1. So concerning IOPS, an 8-disk raidz1 pool isn't faster than a single disk. Using a striped mirror (raid10) would quadruple your IOPS performance, and the volblocksize could be as low as 16K.
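
For reference, the rough padding math behind that 50% figure, as I understand the linked article: an 8K volblock on an ashift=12 8-disk raidz1 is 2 data sectors plus 1 parity sector, and raidz1 pads each allocation up to a multiple of 2 sectors, so 16K of raw space gets consumed for every 8K written. The volblocksize itself is set per storage in Proxmox and only applies to newly created disks, something like (storage name is just an example):

Code:
pvesm set local-zfs --blocksize 16k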
 
  • Like
Reactions: liamlows
Awesome, thanks for the detailed response! I think I will use VMs for the seedbox and media server, and for most of the other internet-facing services as well, to make them more isolated and allow me to directly mount SMB/NFS. That being said, in my case the media server sits behind an nginx reverse proxy (LXC) and the seedbox is only accessible from the LAN, so they are already pretty isolated. Would you also recommend placing the nginx reverse proxy in a VM to provide better isolation (it's currently in an LXC)? Also, I hope that having WireGuard in an LXC is OK since I've been doing that since day 1.

Regarding the raidz1 pool I am using on the R430 with the 8x 800GB SSDs: you recommend I use a ZFS stripe + mirror for them, otherwise I will see a severe loss of usable capacity? I am OK with this, but I wanted to confirm because I would likely have to wipe the system and reinstall since Proxmox itself is installed on that pool; I am willing to do it if raidz1 is truly as bad as you say. This ZFS pool will have a lot of VMs and LXCs on it, so I want to make sure I get this right before I go and create them all.

Thanks again for the help!
 
That all depends on how well isolated you want your services to be. I personally also put my reverse proxy and WireGuard into a VM. If there is any port forward to a service, it is attackable from the internet, and I want such a service in a VM that is part of a DMZ so it's harder for an attacker to get control over the host, other guests, or even other hosts in your LAN. If you don't care that much about security, an LXC will be fine.
If you are fine with a minimum blocksize of 32K and only 1/4 of the IOPS performance, then you could keep it as it is. It really depends on your workload. Running a Postgres DB, for example, would be terrible, as each 8K block written by Postgres would be stored as an at-least-32K block that is in turn stored as 8x 4K blocks. This results in even more IO, and that multiplied with the 1/4 IOPS performance makes it even slower. Maybe you don't need all that performance and are fine with how it performs now, but then you still get the additional SSD wear.
If you only store some big files like movies on it, though, it wouldn't really matter.
 
  • Like
Reactions: liamlows
Gotcha, that totally makes sense. I'm glad you responded to this because I never really thought of it that way. Thanks!
That's fair, all those SSDs are running VM storage and nothing else, so it makes total sense to swap the ZFS pool to mirror + striping (RAID10). Glad I found that out now rather than later. I'll definitely be paying attention to block size when configuring storage, even just for my homelab!

Last question: the drives have a sector size of 512B. Does that change any of this? The model number is HUSMM8080ASS201.

Thanks again for all the help!
 
Not really. SSDs are always lying about the sector size. Internally they will work with something like an 8K, 16K or higher blocksize. And for writes it might be even bigger, because NAND flash can only erase complete rows of dozens of sectors, so writing a single 4K block might actually rewrite a full megabyte. That's why you see terrible write amplification when doing small random writes to an SSD.
But there is no way to find out what blocksize the SSD is really using internally, because not a single manufacturer puts this in the documentation or datasheets. All you can do is run benchmarks with different blocksizes and then guess what is used inside the SSD by comparing the performance.
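
E.g. with fio, sweeping the blocksize for random writes and watching where IOPS/latency fall off (destructive if pointed at a raw disk, so use a spare test device or a file):

Code:
for bs in 4k 8k 16k 32k 64k; do
  fio --name=bs-test --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=$bs --iodepth=1 --numjobs=1 --runtime=30 --time_based
done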
 
  • Like
Reactions: liamlows
Ahh, I was wondering about that. I kept looking around for the sector size of the disks, expecting to find it for the SSDs, and never did. What you say makes sense though.

So to close it all out, I think I will change the RAID config from raidz1 due to the factors you listed (wear and the issue of actually utilizing the capacity). I will likely choose RAID10 in the Proxmox installer when reinstalling and go from there. The loss in space is a little unfortunate, but it seems like I will get a substantial performance boost and I don't necessarily need all 6.4 TB.

If you don't mind me asking, do you think it would be better to do a HW RAID config or the software RAID config done by Proxmox? Now that we are talking about all this, I'm starting to feel that a hardware RAID config might be better since it could lift some of the work off Proxmox. I know that ZFS has much better failure recovery, but if I'm just using RAID10 then fixing it would be a breeze since there is no parity recalculation. I have also thought about potentially using RAID 5 or 6.

Sorry to drag this out, but with the 8x 800GB SAS SSDs, what RAID config (5, 6, 10, etc.) and what type (software/hardware) would you recommend if the workload is storage for a bunch of VM guests (media server VM, game server VM, dev/test/staging server VM, potentially a database server) and a few LXCs? I'm somewhat leaning towards HW RAID10 or SW RAID10 but curious to hear your final thoughts.

Thanks so much for helping me out with this!
 
ZFS isn't only about RAID. You get replication, deduplication, block-level compression, bit rot protection (so your data won't silently corrupt over time), caching, multi-tiered storage, encryption, combined filesystem and disk management... a lot of features you won't get with a traditional HW RAID.
And it's software RAID, so you can move your disks to any server and import the pool there for data recovery or when upgrading servers. If you don't need all these features and the additional data integrity, then a HW RAID with cache and BBU might be a valid choice. Otherwise I would stick with ZFS, as you already have the enterprise-grade hardware needed for a decent ZFS setup.
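
For what it's worth, the striped mirror layout discussed above would look roughly like this if created by hand on the CLI (the installer does the equivalent for the root pool; disk names are placeholders):

Code:
zpool create -o ashift=12 tank mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh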
 
