[SOLVED] ZFS newbie. Want to add 2x1TB drives (ZFS mirrored) to setup. Is it better to put VMs/CTs on zpool, or use pool for data?

May 21, 2020
This is my first time using ZFS. Currently, I have a single NVMe drive for both local and local-lvm storage (my VMs/CTs and all of their data live here).

I would like to add 2x1TB SSDs in a ZFS mirror. However, I'm torn about what to do with the extra storage:

1. Move my VMs/CTs and all of their data to that ZFS storage, leaving my Proxmox install on the NVMe drive

2. Leave the VMs/CTs on the NVMe drive, but create datasets on the ZFS pool (e.g., Nextcloud, backups, etc.) and mount them into the VMs/CTs (rough sketch of what I mean below)
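For option 2, something like this is what I have in mind (pool name, dataset name, and CT ID are just placeholders):

    # create a dataset for Nextcloud data on the mirrored pool
    zfs create tank/nextcloud

    # bind-mount it into the existing Nextcloud container (CT 101 here)
    pct set 101 -mp0 /tank/nextcloud,mp=/mnt/nextcloud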
 
You didn't tell us much about your SSDs. I guess they are all consumer and not enterprise SSDs, as is highly recommended for ZFS? And I guess the two new SSDs are SATA and not M.2?
In my opinion, every storage device in a server should have some kind of redundancy if it's not just a cache disk. So I personally would reinstall PVE on those new mirrored SSDs and use them for PVE + guests. But then the question would be what to do with the NVMe SSD, except for maybe backups, where it doesn't matter if that data gets lost, because you should have 3 copies of everything anyway.

Some ZFS recommendations (a rough command sketch follows the list):
  • enable relatime to reduce SSD wear: zfs set relatime=on YourPoolName
  • a ZFS pool should only be filled to 80%, otherwise it can't operate optimally. As soon as it reaches 90% it will get very slow, and when it gets completely full you might enter a state where you can't even delete anything to free up space again. So you might want to set a pool-wide quota of something like 90% and set up monitoring that alerts you when usage exceeds 70%, so you can delete data or add more disks before it reaches 80%. For monitoring you could set up something like Zabbix in an LXC. To set a quota on a 1TB pool you could use something like zfs set quota=900G YourPoolName
  • don't use consumer SSDs. ZFS has massive overhead, and the write amplification can kill new consumer SSDs within months when hit with a workload like sync writes. Performance can also be very bad with consumer SSDs, especially those using QLC NAND. With enterprise SSDs that's usually not such a big problem, as they are far more durable (compare the TBW or DWPD ratings of consumer QLC and enterprise SSDs of the same size... the latter are often 6-60 times more durable). So if you do use consumer SSDs, at least make sure to monitor their wear, so you can remove and replace them before too much damage is done.
  • if you set up the ZFS pool manually, don't forget to add a monthly scrub job to your crontab
  • there is a zfs-zed package that can send you alert emails in case you don't want to set up something like Zabbix for monitoring. But you need to configure it first, along with postfix.
  • don't store snapshots for too long, as they will grow. cv4pve-autosnap is a nice tool to automate snapshot creation and retention.
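To make that more concrete, roughly like this (only a sketch; "YourPoolName", the 900G figure for a 1TB pool, the mail address and the device path are placeholders):

    # fewer atime-induced writes
    zfs set relatime=on YourPoolName

    # pool-wide quota at ~90% of a 1TB pool, so it can never run completely full
    zfs set quota=900G YourPoolName

    # monthly scrub, e.g. via /etc/cron.d/zfs-scrub (for manually created pools):
    #   0 3 1 * * root /usr/sbin/zpool scrub YourPoolName

    # mail alerts: install postfix, then set your address in /etc/zfs/zed.d/zed.rc:
    #   ZED_EMAIL_ADDR="you@example.com"

    # check SSD wear from time to time (the attribute name differs per vendor)
    smartctl -a /dev/sda | grep -i wear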
 
Thanks for the reply! I've been reading a lot about ZFS the past few hours. I think I'm going to go with my option #1 and leave Proxmox on the NVMe drive, and add the SSDs in a mirror for VMs. Proxmox already does VM backups to a NAS, which itself is backed up to the cloud. If the VMs are on a separate drive, I can replace Proxmox on the NVMe drive as-needed without worry.

You're right, I didn't give enough info. The current NVMe drive is a 970 Pro 512GB (consumer) drive, but the SSDs will be 2x1TB Samsung PM893 SATA (datacenter) SSDs. The reason I'm trying to move away from the 970 Pro is that (as you pointed out) it only has a 600 TBW endurance warranty, whereas the PM893 has about 1752 TBW of life in it. I should add this is for my homelab, so I'll probably never hit that amount, but I'd like the overhead.

I appreciate the comments on relatime, the capacity, and the cronjobs/monitoring!
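For the record, my rough plan for the mirror looks something like this (disk IDs and names are placeholders, and I know the pool/storage can also be created from the web UI):

    # create the mirrored pool from the two SATA SSDs (by-id paths are placeholders)
    zpool create -o ashift=12 ssdpool mirror \
        /dev/disk/by-id/ata-SAMSUNG_PM893_1 \
        /dev/disk/by-id/ata-SAMSUNG_PM893_2

    # register it with Proxmox as storage for VM disks and containers
    pvesm add zfspool ssd-mirror -pool ssdpool -content images,rootdir

After that I'd move the existing virtual disks over with the disk move function in the web UI.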
 
Proxmox already does VM backups to a NAS, which itself is backed up to the cloud. If the VMs are on a separate drive, I can replace Proxmox on the NVMe drive as-needed without worry.
But don't forget to also back up your PVE host. If you lose your PVE disk, you won't be able to start any guest, because you will also have lost all the VM/LXC config files. The VM/LXC storage only stores the virtual disks, not the whole VMs. So always keep recent VM/LXC backups, because they also include the VM/LXC config files. Stuff like firewall rules, network config, backup jobs, ... is still lost if you don't also back up the "/etc" folder.
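Something as simple as this would cover the host config (the NAS path is just an example):

    # /etc/cron.d/pve-host-backup -- nightly tar of /etc (includes /etc/pve) to the NAS
    0 2 * * * root tar czf /mnt/pve/nas/pve-host-etc-$(date +\%F).tar.gz /etc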
I should add this is for my homelab, so I'll probably never hit that amount, but I'd like the overhead.
Like I said, ZFS has a lot of overhead. My homelab is idling most of the time (CPU usually at 4-7% utilization) and writing 0.3-0.9 TB per day, where most of the writes are just logs and metrics the guests produce while idling. I've seen write amplification here of up to a factor of 72: if I do 1TB of 4K random sync writes inside a VM, it causes 72TB of writes to the NAND of the SSDs. For 1M sequential async writes it's more like a factor of 3. So it's not that hard for a PVE server to write hundreds or even thousands of TB over the years while basically doing nothing.
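As a rough back-of-the-envelope with the numbers from this thread, taking the upper end of my idle write rate and ignoring any extra write amplification on top:

    0.9 TB/day x 365 days ≈ 330 TB/year
    970 Pro:  600 TBW / 330 TB per year ≈ 1.8 years to the rated endurance
    PM893:   1752 TBW / 330 TB per year ≈ 5.3 years to the rated endurance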
 
