ZFS Questions - Moving from old school ext4 install

Stacker

I am new to ZFS. I'm old school and I feel like I'll take ext4, mdadm, and lvm-thin to the grave with me. But I like to learn new tricks. I'd like to make the switch but want to make sure I am not F'ing myself over. I've done some searching where possible, but it sometimes leaves me with more questions than answers.

Situation:

I have 2x 1.9 TB NVMe drives in a brand new server that I'm trying to figure out the best storage setup for. I plan on migrating off an old Proxmox 6 box. I have installed 7 on the new box just to play around with ZFS until I'm comfortable.

Thoughts:

1) I have been using NVMe for a long time at home and they have not yet failed (or worn out). I checked the new drives and they are fairly new, with only a few dozen TB written.

2) I could get away with NOT putting these into a mirror and just keep backups... My brain kicked back in and said I should probably make a partition large enough to store production VMs and then partition 200-500 GB on each stick for non-critical VMs that are no big deal if lost. Sometimes I need a space to drop a few 100 GB temporarily, etc.

3) I think I came to the conclusion that for safety purposes I should probably create at least a ~1 TB RAID1 (mirror in ZFS terms) and then maybe allocate the remaining space on each drive as single drives... That gives me almost another 1.8 TB of mess-around storage that I could even use for some local backups (say SFTP from a hosting panel, or game server storage). Of course, I also offsite the main backups, but this would be a quick local backup spot; if I lose it, oh well, I still have the slow backup location.

Note: In terms of ZFS... I see vdevs as partitions which could span 1 or more drives. ZFS pools then combine vdevs as assigned.

Currently:

My pre-setup / mess-around stage... I used GParted (which can't be used to make ZFS itself, I learned) to help me make multiple "unallocated partitions" so Proxmox could then see the storage and make "ZFS pools":
-- ~80 GB "root" drive. *** The installer did this for me; I gave it the size.
-- 500 GB mirror
-- I left a ~300 GB BLANK unused area
-- Created a ~900 GB "partition" at the END of EACH drive, from which I then created 2 (two) 900 GB standalone ZFS areas in the GUI.

ZFS related Questions:

Question 1: Can I easily EXPAND my "mirror" pool / vdev into the BLANK unallocated blocks NEXT to it, "partition"-wise... or for that matter any of my "partitions"? I tried to go into GParted and expand the "partition" but it would not let me. I figure I could expand my 500 GB to 700 GB by adding 200 GB to both drives from my unused area and then telling ZFS to go use that space.

Question 1a: Related to the 1st... Could I in the future delete the stand-alone sections (on each drive) and expand my mirror? Might be a redundant question.

--- Maybe --- zpool set autoexpand=on <name of pool> ??

Question 2: I was reading about trim. I assume I should keep auto-trim OFF and run a weekly trim? Or how often would you recommend? I believe this kills SSD life.

Question 3: How will ZFS handle a drive failure on my mirror, even the BOOT mirror? I am used to mdadm (Linux raid) and I am very good at managing it / replacing drives etc... How would I replace / rebuild my mirrors? Can it be done ONLINE (online rebuild)?

Question 4: Will I have issues, or is it bad practice, creating multiple "partitions / vdevs" on the SAME drive like I did with GParted, where I have a mirror "pool" and then a stand-alone "pool"?

Question 5: The installer made 3 /rpool paths on the same pool... should I care? Are they used for different items?

Question 6: SWAP... Proxmox shows me N/A SWAP usage in the GUI. No swap partition exists from what I can see. Should I have a SWAP partition, or is it not needed? I've read some bad things happen with swap on ZFS. I also think SWAP would kill the SSD. https://forum.proxmox.com/threads/w...-disks-zfs-file-root-system.95115/post-414319

Question 7: I have a handful of .qcow2 VMs. I also sometimes need to deploy an OVA / VMware-type disk. I see some mixed messaging on whether ZFS "partitions / storage" can support this... will these types of VMs work? Reference -- https://forum.proxmox.com/threads/no-qcow2-on-zfs.37518/ What is a ZFS zvol and can it be enabled?

Question 8: Other incompatibilities / issues around ZFS... or things I should also be setting up? I'm trying not to get scared away from ZFS and fall back to my trusted ext4.


Conundrum: Should I just mirror the entire 2 drives? ... I was hoping to utilize the stand-alone part for non-critical storage. (It's also NVMe, failure is rare until worn out, plus I'd be OK with losing the data.)


Code:
# zfs list
NAME               USED  AVAIL     REFER  MOUNTPOINT
raid1data          960K   481G       96K  /raid1data
rpool             1.47G  75.1G      104K  /rpool
rpool/ROOT        1.46G  75.1G       96K  /rpool/ROOT
rpool/ROOT/pve-1  1.46G  75.1G     1.46G  /
rpool/data          96K  75.1G       96K  /rpool/data
single0data       1.43M   845G       96K  /single0data
single1data        876K   845G       96K  /single1data

~# lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME        FSTYPE       SIZE MOUNTPOINT LABEL
zd0                       32G
nvme1n1                  1.7T
├─nvme1n1p1             1007K
├─nvme1n1p2 vfat         512M
├─nvme1n1p3 zfs_member  79.5G            rpool
├─nvme1n1p4 zfs_member   500G            raid1data
└─nvme1n1p5 zfs_member 878.9G            single1data

nvme0n1                  1.7T
├─nvme0n1p1             1007K
├─nvme0n1p2 vfat         512M
├─nvme0n1p3 zfs_member  79.5G            rpool
├─nvme0n1p4 zfs_member   500G            raid1data
└─nvme0n1p5 zfs_member 878.9G            single0data

##### fdisk -l
Disk model: SAMSUNG MZQLB1T9HAJR-00007
Device              Start        End    Sectors   Size Type
/dev/nvme1n1p1         34       2047       2014  1007K BIOS boot
/dev/nvme1n1p2       2048    1050623    1048576   512M EFI System
/dev/nvme1n1p3    1050624  167772160  166721537  79.5G Solaris /usr & Apple ZFS
/dev/nvme1n1p4  167774208 1216350207 1048576000   500G Solaris /usr & Apple ZFS
/dev/nvme1n1p5 1907548160 3750748159 1843200000 878.9G Solaris /usr & Apple ZFS

Disk /dev/nvme0n1: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors

Disk model: SAMSUNG MZQLB1T9HAJR-00007
Device              Start        End    Sectors   Size Type
/dev/nvme0n1p1         34       2047       2014  1007K BIOS boot
/dev/nvme0n1p2       2048    1050623    1048576   512M EFI System
/dev/nvme0n1p3    1050624  167772160  166721537  79.5G Solaris /usr & Apple ZFS
/dev/nvme0n1p4  167774208 1216350207 1048576000   500G Solaris /usr & Apple ZFS
/dev/nvme0n1p5 1907548160 3750748159 1843200000 878.9G Solaris /usr & Apple ZFS
 

Attachment: och-zfs-gpart.PNG
Thoughts:
1) I have been using NVMe for a long time at home and they have not yet failed (or worn out). I checked the new drives and they are fairly new, with only a few dozen TB written.
But ZFS has a lot of overhead with its copy-on-write. Virtualization has overhead. Nested filesystems have overhead. Raid creates overhead. SSDs technically always have a lot of write amplification, especially when doing small sync writes like a database would do. And all this overhead isn't adding up, it is multiplying, causing something like exponential growth, which can kill SSDs really fast, especially with specific workloads like DBs. The average write amplification of my PVE server is for example factor 20. So for every 1 TB I write inside a VM to the guest filesystem, 20 TB will be written to the NAND cells of the SSDs. So here that means I only get 1/20th of the performance and the SSDs will die 20 times faster.
And this can be even worse. I've seen write amplification here of up to factor 62 (4K random sync writes written to ext4 over virtio SCSI to an encrypted zvol on a mirrored ZFS pool).
This is one of the reasons why you should buy way more durable enterprise SSDs for a server, especially when using ZFS. I did a comparison table some days ago showing the durability difference between different grades of NVMe SSDs, in case you care: https://forum.proxmox.com/threads/c...-for-vms-nvme-sata-ssd-hdd.118254/post-512801
So you will have to test how much wear your personal workload will cause. But it is not uncommon to kill a new consumer SSD within a few months when hitting it with a lot of writes and a hard workload.
2) I could get away with NOT putting these into a mirror and just keep backups... My brain kicked back in and said I should probably make a partition large enough to store production VMs and then partition 200-500 GB on each stick for non-critical VMs that are no big deal if lost. Sometimes I need a space to drop a few 100 GB temporarily, etc.
You should always keep backups, even when using a mirror. The benefit of the mirror is that you don't get any downtime and you won't lose the data you added/edited after the last backup happened.
Another benefit of ZFS is the bit rot protection, but that will only work when you have a mirror or parity data. All your data may corrupt over time and usually you won't even recognize it. Maybe you stored 10,000 home photos on a HDD/SSD 3 years ago. Now maybe 1 or 2 of those photos won't work anymore because bits flipped. The JPG is still there but the data is damaged and the bottom half of the picture is just black. You probably didn't look at those photos for a long time and most likely never looked at all of those 10,000 photos again in those 3 years. So you won't know if a picture is actually damaged or not, or which photo is damaged, in case something is damaged. Now let us say you only keep backups for 1 year, because HDDs are expensive and you can't keep your backups forever (...and your backups will corrupt over time too...). Those few photos might have been corrupted 2 or 3 years ago, so for the past year you only backed up the corrupted images and you no longer have a single uncorrupted copy of those damaged pictures.
Or in short: Backups will only help you in case a disk fails, but not against data silently corrupting over time.
What ZFS does is checksum every block of data and store the checksum together with the data. From time to time (I do it every month) you run a scrub job. This scrub job will read all the stored data and checksum it again. Then it will compare the checksum of what the data looks like now with the checksum from the time the data was written to the disk. If the checksums don't match anymore, then that data has been corrupted. Without parity data or a mirror, ZFS can only tell you that you lost data, but it can't do anything about it. In case you have a mirror or parity data, it can use that redundant data to repair the corrupted data. You basically get a self-healing filesystem/block device.
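Starting a scrub is a one-liner, for example (pool name taken from your zfs list output; Debian's zfsutils-linux usually ships a monthly scrub cron job already):

Code:
zpool scrub raid1data     # runs in the background
zpool status raid1data    # shows scrub progress and any repaired or unrecoverable blocks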
And with that, backups really make sense, as you can trust that you are backing up healthy and not corrupted data. If you want your backups to not degrade over time either, you should store them on another ZFS pool.
And ECC RAM is highly recommended. Because if you can't trust your RAM not to corrupt your data before it is written to disk, then you also can't really trust that the ZFS checksumming is always correct.
3) I think I came to the conclusion that for safety purposes I should probably create at least a ~1 TB RAID1 (mirror in ZFS terms) and then maybe allocate the remaining space on each drive as single drives... That gives me almost another 1.8 TB of mess-around storage that I could even use for some local backups (say SFTP from a hosting panel, or game server storage). Of course, I also offsite the main backups, but this would be a quick local backup spot; if I lose it, oh well, I still have the slow backup location.
For unimportant or redundant data that can make sense, to reduce the SSD wear and the capacity lost to redundancy. I for example always also create an unmirrored LVM-Thin pool for my Graylog and Zabbix. They only store unimportant data (metrics and logs for monitoring my homelab) but easily write 400 GB per day to the SSD (it isn't that much actual log/metric data, it is just the stupid write amplification, because all metrics/logs get written to MySQL/Elasticsearch/MongoDB databases).

But for your calculations... keep in mind that a ZFS pool should always have 20% free space. So if you mirror 2x 1TB partitions, you should only store 800GB of files or block devices on it. I also always recommend setting at least a pool-wide quota of 90%, so you can't completely fill up that pool by accident. Completely filling up the pool might cause data loss and the system will stop working; because ZFS is a copy-on-write filesystem, it needs to add new data first to be able to delete data. If it can't add any new data because it is completely full, you won't be able to recover from this state without replacing all your disks with bigger models. And after around 80% capacity usage your pool will start to get slower and fragment faster (which is bad, as you can't defragment a copy-on-write filesystem without moving all the data off that pool and copying it back). So best you monitor your pool and expand it when you see that you will have to exceed this 80%.

ZFS related Questions:

Question 1: Can I easily EXPAND my "mirror" pool / vdev into the BLANK unallocated blocks NEXT to it, "partition"-wise... or for that matter any of my "partitions"? I tried to go into GParted and expand the "partition" but it would not let me. I figure I could expand my 500 GB to 700 GB by adding 200 GB to both drives from my unused area and then telling ZFS to go use that space.

Question 1a: Related to the 1st... Could I in the future delete the stand-alone sections (on each drive) and expand my mirror? Might be a redundant question.


--- Maybe --- zpool set autoexpand=on <name of pool> ??
Yes, that is what the "autoexpand" option is for. You of course need to find a way to expand that partition first. I always thought GParted should be able to do that. Best you search this forum; this was already discussed and done by dozens of people, and maybe they wrote down how exactly they did it.
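Roughly, the whole thing could look like this; just a sketch, assuming the free space really sits directly behind partition 4 on both disks (as in your fdisk output) and that the mirror pool is raid1data:

Code:
# grow partition 4 on both disks into the free space (pick the new end yourself)
parted /dev/nvme0n1 resizepart 4 <new_end>
parted /dev/nvme1n1 resizepart 4 <new_end>
# then let ZFS grow onto the bigger partitions
zpool set autoexpand=on raid1data
zpool online -e raid1data /dev/nvme0n1p4
zpool online -e raid1data /dev/nvme1n1p4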
Question 2: I was reading about trim. I assume I should keep auto-trim OFF and run a weekly trim? Or how often would you recommend? I believe this kills SSD life.
Trim shouldn't kill SSD life, at least not noticeably. It is indeed recommended to TRIM SSDs frequently so the SSD knows which blocks it can free up. The less full/fragmented your SSD is, the better the wear leveling and write optimization of the SSD will work, which will extend the SSD's life. How often you should run a trim really depends on how much you write. I run it hourly for stuff that writes a lot and daily for everything else.
Trim is also important for thin provisioning. Without it, nothing you overwrite/delete will actually be deleted and it still consumes the full space.
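For ZFS pools the command is zpool trim, by the way (fstrim is for mounted filesystems like ext4 inside your guests). A simple cron sketch with your pool names; weekly is just an example:

Code:
# /etc/cron.d/zpool-trim (example schedule: Sundays at 03:00)
0 3 * * 0  root  /sbin/zpool trim rpool
0 3 * * 0  root  /sbin/zpool trim raid1data
# check the result afterwards with: zpool status -t <pool>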
Question 3: How will ZFS handle a drive failure on my mirror, even the BOOT mirror? I am used to mdadm (Linux raid) and I am very good at managing it / replacing drives etc... How would I replace / rebuild my mirrors? Can it be done ONLINE (online rebuild)?
Jup, online rebuild. See paragraph "Changing a failed bootable device": https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration
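In short, the procedure from that wiki paragraph looks roughly like this; device names are only examples, assuming nvme1n1 failed and the replacement shows up under the same name, so double-check against the wiki before running anything:

Code:
# copy the partition layout from the healthy disk to the new one, then give it new GUIDs
sgdisk /dev/nvme0n1 -R /dev/nvme1n1
sgdisk -G /dev/nvme1n1
# replace the failed member online; the pool stays usable while it resilvers
zpool replace -f rpool <old/failed member> /dev/nvme1n1p3
# make the new disk bootable again (UEFI / systemd-boot case)
proxmox-boot-tool format /dev/nvme1n1p2
proxmox-boot-tool init /dev/nvme1n1p2

For the pure data pools only the zpool replace step is needed; there is nothing bootable on those partitions.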
Question 4: Will I have issues, or is it bad practice, creating multiple "partitions / vdevs" on the SAME drive like I did with GParted, where I have a mirror "pool" and then a stand-alone "pool"?
I wouldn't put two partitions of the same disk into a single pool. But two pools, each with one partition of the same disk, should be fine. They will influence each other though: writing/reading to one pool will of course slow down the other pool. But I also do that sometimes. I for example like to have one partition for my ZFS pool with the PVE root filesystem and another partition for my ZFS pool that stores my VMs. Creating and destroying a pool is basically a one-liner and you can easily back up or restore all VMs with a few clicks in the webUI. So it is very convenient to be able to quickly destroy and recreate the VM storage pool while the root filesystem continues running.
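For example, recreating one of your single-partition pools really is just this (partition name from your lsblk output; this of course wipes whatever was on it):

Code:
zpool destroy single0data
zpool create -o ashift=12 single0data /dev/nvme0n1p5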
Question 5: The installer made 3 /rpool paths on the same pool... should I care? Are they used for different items?
What exactly do you mean? Can you give some examples?
Question 6: SWAP... Proxmox shows me N/A SWAP usage in the GUI. No swap partition exists from what I can see. Should I have a SWAP partition, or is it not needed? I've read some bad things happen with swap on ZFS. I also think SWAP would kill the SSD. https://forum.proxmox.com/threads/w...-disks-zfs-file-root-system.95115/post-414319
I also asked that and the question is still not answered. See here: https://forum.proxmox.com/threads/best-way-to-setup-swap-partition.116781/
PVE will only create a swap partition when using LVM. I guess the Proxmox team also wasn't able to find a solution on how to have a reliable mirrored swap, so they just skipped it.
Short: swap on ZFS is a bad idea. Swap on mdadm raid is a bad idea. Unmirrored swap is a bad idea in case you care about downtime. No swap is a bad idea, unless you want to waste a lot of RAM keeping it free as headroom so the OOM killer never triggers.
Question 7: I have a handful of .qcow2 VMs. I also sometimes need to deploy an OVA / VMware-type disk. I see some mixed messaging on whether ZFS "partitions / storage" can support this... will these types of VMs work? Reference -- https://forum.proxmox.com/threads/no-qcow2-on-zfs.37518/ What is a ZFS zvol and can it be enabled?
A storage of type "ZFSPool" will only allow you to store virtual disks of VMs in raw format on zvols (which are ZFS-native block devices). A storage of type "directory" points to any folder stored on a filesystem, and as qcow2 disks are files, you can store them on such a "directory" storage. ZFS also has filesystems (called datasets). So if you really want to store qcow2 disks on top of ZFS (which is usually not a great idea, because both are copy-on-write and the massive overhead will multiply again... I would only do that if I really needed one of those qcow2 features), you could create a dataset and then create a directory storage pointing to the mountpoint of that dataset. But don't forget to run a pvesm set YourDirStorageId --is_mountpoint yes afterwards, or you will fill up your root filesystem in case that mounting ever fails.
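A sketch of that, with made-up names (a dataset "qcow2" on your mirror pool and a storage ID "raid1-dir"):

Code:
zfs create raid1data/qcow2
pvesm add dir raid1-dir --path /raid1data/qcow2 --content images,iso
pvesm set raid1-dir --is_mountpoint yes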
Question 8: Other incompatibilities / issues around ZFS... or things I should also be setting up? I'm trying not to get scared away from ZFS and fall back to my trusted ext4.
ZFS is a complex topic and so much more than just software raid. It isn't easy to learn, because it is fundamentally different from the "old-school" filesystems. But the more you learn, the more you can make use of all its great features. And as soon as you are using all those features daily, you won't want to miss them.
You probably will run into some problems and some things can't be changed later (like the ashift, removing of special devices, adding more disks to a raidz, ... ) without destroying the whole pool and creating it again from scratch. But as long as you got your backups and you document what you have done, this all should be somehow fixable without losing data. In my opinion learning by doing is the best way to understand how ZFS works.

Edit:
Maybe keep in mind that rolling back a ZFS snapshot can't be reverted. It will wipe everything that was done after that snapshot was taken. This can be a fatal user error and should be avoided.
Conundrum: Should I just mirror the entire 2 drives? ... I was hoping to utilize the stand-alone part for non-critical storage. (It's also NVMe, failure is rare until worn out, plus I'd be OK with losing the data.)
Looks OK so far.
 
Wow Dunuin, what a great response. I don't think anyone has ever replied to a set of questions I've asked like that. I very much appreciate you taking the time to do that and I hope someone else benefits too.

One NEW Question:
I see compression options when I make the ZFS pool. I chose NONE... because... I don't feel I'd save much disk space based on the bulk of what's stored... I also don't want my CPU to work harder than it needs to. I'm not working with 32+ cores. Is this the right call?

On some follow up items:

Life Span / qcow2:

I have one very active VM doing a lot of database / file work. My old node is on 4x SATA drives in a RAID 10... it works just fine, but I have 1 VM for example that has 41 TB written in the last 3 months (according to Proxmox). I think that would be my hardest-hitting item and it's an LXC container. I only have 2 qcow2 VMs; they are rather static as most of their work runs inside RAM.

Perhaps I should make an ext4 LVM-thin partition to fit this heavy-hitting VM and also the couple of qcow2 VMs? Not sure how much that would help with SSD wear and tear; should I even bother?

This is the model of the drives in the server. I believe, based on what I read here, that they are at least not consumer-level drives. I see 2733 TB under specs for endurance. What are your thoughts on this drive? -- https://www.wiredzone.com/shop/prod...1-9tb-nvme-pcie3-0-x4-v4-tl-pm983-series-2435

Trim:
I will probably set cron to run nightly and see the results... maybe every 12 hrs or during off-peak times. I don't think I should use autotrim based on what I read, unless I can be convinced otherwise.

SWAP:
I'm going to go with 0 swap partitions. You're damned if you do, damned if you don't. I am pretty good about not over-allocating RAM.

zvols:
zvols (which are ZFS native block devices). A storage of type "directory"...
So I believe I did that when I created a FOLDER mount inside the ZFS area in the GUI... Storage > Add > Directory... and then gave it a path inside the ZFS pool. I could then store / upload ISOs in there etc... Is this what a zvol is? Just a folder path?

I take it this command
Code:
pvesm set YourDirStorageId --is_mountpoint yes
... tells Proxmox / the system to treat this as a real mount point, in case it's not actually there at some point and writes would instead end up filling my main / (root) partition. Like if the stand-alone NVMe I made it on failed.

Snapshots:
Maybe keep in mind that rolling back a ZFS snapshot can't be reverted. It will wipe everything that was done after that snapshot was taken. This can be a fatal user error and should be avoided.
Can I delete the snapshots once I am done with them? Easy example: I want to run a major upgrade. I take a snapshot, do the major upgrade, confirm all is working, then delete the snapshot... the VM will be in the upgraded state after deleting the snap? I feel like this is a dumb question, but I'm double-checking how ZFS does this; not sure if there has to be a merge etc.

Final thoughts for this post:

The NVMe drives I have (link above) are supplied by the data center... If they break / wear out I get a free replacement (I rent this server)... With that in mind:
1) Is it worth doing ZFS with an LXC that does 40+ TB over 3 months (the others are low, maybe 2 TB)?

2) Is it worth just running the qcow2 in the directory and taking the hit, or should I make an ext4 + LVM-thin area? I am running a load balancer / vhost pair in qcow2. It's very light on overall disk I/O.

3) The part where you talked about "Average write amplification of my PVE server is for example factor 20" concerns me that ZFS is really not a good system to be using on SSDs and is more of a HDD system... Would abandoning ZFS and falling back to good old ext4 / LVM-thin save me headaches with wear and TB written? Or am I looking at minimal gains here and overthinking this?

4) Assuming I stick with ZFS, if I'm averaging 40 TB written / month... I feel like I could get 2-3+ years out of these... I'm just looking to not have to watch drive health as a full-time job. Maybe a bi-monthly check-in or some automated email that sends TB-written SMART totals (any ideas on an already-made script for this?). Replacing a drive once a year is not bad; monthly would be a PIA.
 
I see compression options when I make the ZFS pool. I chose NONE... because... I don't feel I'd save much disk space based on the bulk of what's stored... I also don't want my CPU to work harder than it needs to. I'm not working with 32+ cores. Is this the right call?
Block-level compression is one of the great ZFS features and it is usually a good idea to keep the default LZ4 compression enabled. LZ4 needs very few CPU resources to compress/decompress data. Usually the performance of your disks is the bottleneck for IO operations, not your CPU. So losing a small bit of CPU performance is a good tradeoff for faster storage, as compressed data is smaller and the disks therefore have to read/write less data, which increases disk performance. And ZFS isn't compressing everything. It will sample your data and check whether it is worth compressing. If the data is well compressible it will compress it. If it is not (for example a file that is already a compressed video or zip archive), it will skip the compression and write that data uncompressed.
The only case where you might consider disabling LZ4 compression is when using super-fast NVMe disks, where storage performance won't be a bottleneck anyway. But best to benchmark your storage using fio to see what works best for you.
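If you later want to switch a pool from "off" to LZ4 and see what it actually gains you (only newly written data gets compressed), it is just:

Code:
zfs set compression=lz4 raid1data
zfs get compression,compressratio raid1data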
This is the model of the drives in the server. I believe, based on what I read here, that they are at least not consumer-level drives. I see 2733 TB under specs for endurance. What are your thoughts on this drive? -- https://www.wiredzone.com/shop/prod...1-9tb-nvme-pcie3-0-x4-v4-tl-pm983-series-2435
Jup, that is entry-level enterprise grade. Made for read-intensive workloads that don't write much (but still way better than any consumer/prosumer SSD, and it has power-loss protection, so sync write performance should be fine).
Perhaps I should make an ext4 LVM-thin partition to fit this heavy-hitting VM and also the couple of qcow2 VMs? Not sure how much that would help with SSD wear and tear; should I even bother?
Are you sure you have to run them as qcow2? Have a look at the "qm importdisk" command: https://pve.proxmox.com/pve-docs/qm.1.html
You can use that to import a qcow2 disk and then save it in raw format as a zvol on a ZFS pool. That should reduce the overhead.
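A sketch with made-up IDs and paths (VM 100 and a qcow2 file copied over from the old node; the target storage is your ZFS pool as configured in PVE):

Code:
qm importdisk 100 /mnt/tmp/vm-100-disk-0.qcow2 raid1data
# the imported disk shows up as an "unused disk" on VM 100 afterwards; attach it via the webUI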

And whether moving that DB LXC to an LVM-Thin without mirroring or bit rot protection makes sense for you really depends on how important that data is to you. If you don't care that much about losing it, this might be an option. You will have to decide if that additional data integrity and redundancy is worth the additional SSD wear (and so money).
So I believe I did that when I created a FOLDER mount inside the ZFS area in the GUI... Storage > Add > Directory... and then gave it a path inside the ZFS pool. I could then store / upload ISOs in there etc... Is this what a zvol is? Just a folder path?
No. Here is a paragraph of a tutorial I'm writing right now, where I'm trying to explain the basic ZFS terms. Datasets and zvols are also part of that. Maybe then you'll better understand what I mean by dataset and zvol:

4 A.) ZFS Definition

The first thing people often get wrong is that ZFS isn't just software raid. It's way more than that. It's software raid, it's a volume manager like LVM, and it's even a filesystem. It's a complete enterprise-grade all-in-one package that manages everything from the individual disks down to single files and folders.
You really have to read some books, or at least several tutorials, to really understand what it is doing and how it is doing it. It's very different compared to traditional raid or filesystems. So don't make the mistake of thinking it will work like the other things you have used so far and are familiar with.


Maybe I should explain some common ZFS terms so you can follow the tutorial a bit better (a short command sketch showing these terms in action follows after the list):

  • Vdev:
    Vdev is the short form of "virtual device" and means a single disk or a group of disks that are pooled together. So for example a single disk could be a vdev. A raidz1/raidz2 (aka raid5/raid6) of multiple disks could be a vdev. Or a mirror of 2 or more disks could be a vdev.
    All vdevs have in common that no matter how many disks that vdev consists of, the IOPS performance of that vdev won't be faster than the single slowest disk that is part of that vdev.
    So you can do a raidz1 (raid5) of 100 HDDs and get great throughput performance and data-to-parity-ratio, but IOPS performance will still be the same as a vdev that is just a single HDD. So think of a vdev like a single virtual device that can only do one thing at a time and needs to wait for all member disks to finish their things before the next operation can be started.
  • Stripe:
    When you want to get more IOPS performance you will have to stripe multiple vdevs. You could for example stripe multiple mirror vdevs (aka raid1) to form a striped mirror (aka raid10). Striping vdevs will add up the capacity of each of the vdevs and the IOPS performance will increase with the number of striped vdevs. So if you have 4 mirror vdevs of 2 disks each and stripe these 4 mirror vdevs together, then you will get four times the IOPS performance, as work will be split across all vdevs and done in parallel. But be aware that as soon as you lose a single complete vdev, the data on all vdevs is lost. So when you need IOPS performance it's better to have multiple small vdevs that are striped together than just a single big vdev. I wouldn't recommend it, but you could even stripe a mirror vdev (raid1) and a raidz1 vdev (raid5) to form something like a raid510 ;-).
  • Pool:
    A pool is the biggest possible ZFS construct and can consist of a single vdev or multiple vdevs that are striped together. But it can't be multiple vdevs that are not striped together. If you want multiple mirrors (raid1) but don't want a striped mirror (raid10) you will have to create multiple pools. All pools are completely independent.
  • Zvol:
    A zvol is a volume. A block device. Think of it like an LV if you are familiar with LVM. Or like a virtual disk. It can't store files or folders on its own, but you can format it with the filesystem of your choice and store files/folders on that filesystem. PVE will use these zvols to store the virtual disks of your VMs.
  • Volblocksize:
    Every block device got a fixed block size that it will work with. For HDDs this is called a sector which nowadays usually is 4KB in size. That means no matter how small or how big your data is, it has to be stored/read in full blocks that are a multiple of the block size. If you want to store 1KB of data on a HDD it will still consume the full 4KB as a HDD knows nothing smaller than a single block. And when you want to store 42KB it will write 11 full blocks, so 44KB will be consumed to store it. What's the sector size for a HDD is the volblocksize for a zvol. The bigger your volblocksize gets, the more capacity you will waste and the more performance you will lose when storing/accessing small amounts of data. Every zvol can use a different volblocksize but this can only be set once at the creation of the zvol and not changed later. And when using a raidz1/raidz2/raidz3 vdev you will need to change it, because the default volblocksize of 8K is too small for that.
  • Ashift:
    The ashift is defined pool wide at creation, can't be changed later and is the smallest block size a pool can work with. Usually, you want it to be the same as the biggest sector size of all your disks the pool consists of. Let's say you got some HDDs that report using a physical sector size of 512B and some that report using a physical sector size of 4K. Then you usually want the ashift to be 4K too, as everything smaller would cause massive read/write amplification when reading/writing from the disks that can't handle blocks smaller than 4K. But you can't just write ashift=4K. Ashift is noted as 2^X where you just set the X. So if you want your pool to use a 512B block size you will have to use an ashift of 9 (because 2^9 = 512). If you want a block size of 4K you need to write ashift=12 (because 2^12 = 4096) and so on.
  • Dataset:
    As I already mentioned before, ZFS is also a filesystem. This is where datasets come into play. The root of the pool itself is also handled like a dataset, so you can directly store files and folders on it. Each dataset is its own filesystem, so don't think of them as normal folders, even if you can nest them like this: YourPool/FirstDataset/SecondDataset/ThirdDataset.
    When PVE creates virtual disks for LXCs, it won't use zvols like it does for VMs; it will use datasets instead. The root filesystem PVE uses is also a dataset.
  • Recordsize: Everything a dataset stores is stored in records. The size of a record is dynamic; it will be a multiple of the ashift but will never be bigger than the recordsize. The default recordsize is 128K. So with an ashift of 12 (so 4K) and a recordsize of 128K, a record can be 4K, 8K, 16K, 32K, 64K or 128K. If you now want to save a 50K file it will store it as a 64K record. If you want to store a 6K file it will create an 8K record. So it will always use the next bigger possible record size. With files that are bigger than the recordsize this is a bit different: when storing a 1M file it will create eight 128K records. So the recordsize is usually not as critical as the volblocksize for zvols, as it is quite versatile because of its dynamic nature.
  • Copy-on-Write (CoW):
    ZFS is a Copy-on-Write filesystem and works quite differently from a classic filesystem like FAT32 or NTFS. With classic filesystems, the data of every file has fixed places spread across the disk, and then there is an index that tells you at which places the data of that file is stored. If you now edit that file, it will overwrite parts of the old data with new data. So after editing that file it isn't possible anymore to see what the file looked like in the past. But reading that file is super easy: just look at the index, go to that part of the disk and read the file.
    With Copy-on-Write this works differently. Think of it more like an empty log book where you can only add data line by line at the end. But you aren't allowed to ever erase and overwrite a line you have written in the past. If you want to change data you have written in the past, you have to write a new line at the end to correct what was written earlier. Let's say on page 10 line 1 of that book you have written the data of a file: "ZFS is easy". You did a snapshot on page 20, which works like a bookmark placed between the pages. Now you are on page 30 and want to change that file to "ZFS is difficult". You can't directly change what's written on page 10 line 1, so you write at the end of page 30: "Have to correct me. Page 10 line 1 word 3 should be 'difficult'." When you then want to know what that file now looks like, you have to read the whole book from start to end in chronological order and construct what the file should look like. First you read page 10 line 1, "ZFS is easy", and then the last line of page 30, which tells you to replace "easy" with "difficult". So now your file is "ZFS is difficult". But because it's a big chronological log book, nothing is ever lost when you change it, as long as there is a snapshot. Now the snapshot, or bookmark, on page 20 comes into play. If you want to know what the file looked like when you did that snapshot, you just have to read the book from the beginning to the bookmark. You read page 10 line 1, "ZFS is easy", and nothing changes it before you reach page 20 and stop reading. So at the time you created that snapshot, the file was "ZFS is easy". I hope you see that it is great to be able to go back in time and see what a file was in the past. But it's also way more work to find out what the file actually is now, as you need to read the whole book again and again and puzzle together what the file has to look like after multiple changes. Way more complex and much more work compared to just overwriting parts of a file and changing the index.
    This is a big part of why ZFS has such massive overhead. But it also explains why you can snapshot terabytes of data within a second: you don't have to write or read anything new, it's all already written there; you just have to place the bookmark between the pages of your choice.
    In reality this is way more complex, but I really like this book metaphor to make the difference from classical filesystems understandable.
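And the command sketch I promised above, with a hypothetical pool "tank" on two placeholder disks, just to show where those terms appear on the command line:

Code:
zpool create -o ashift=12 tank mirror /dev/sdX /dev/sdY   # ashift is fixed at pool creation
zfs create tank/files                                     # a dataset (filesystem), default recordsize 128K
zfs create -V 32G -o volblocksize=16K tank/vm-disk        # a zvol (block device), e.g. for a VM disk
zpool get ashift tank
zfs get recordsize tank/files
zfs get volblocksize tank/vm-disk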
 
I take it this command
Code:
pvesm set YourDirStorageId --is_mountpoint yes
... tells Proxmox / the system to treat this as a real mount point, in case it's not actually there at some point and writes would instead end up filling my main / (root) partition. Like if the stand-alone NVMe I made it on failed.
Jup. With the "is_mountpoint" option enabled, PVE will first check if that mountpoint is really mounted. If it is not, it will let the storage fail, so the storage isn't usable until it is mounted again. If you don't set that option and the mount fails, PVE will simply write stuff to that empty folder, filling up your system disk. This could easily crash your server if you, for example, only have a 32GB system partition and try to write a 100GB backup to that directory storage, which would then write everything onto that 32GB system partition.
Can I delete the snapshots once I am done with them? Easy example: I want to run a major upgrade. I take a snapshot, do the major upgrade, confirm all is working, then delete the snapshot... the VM will be in the upgraded state after deleting the snap? I feel like this is a dumb question, but I'm double-checking how ZFS does this; not sure if there has to be a merge etc.
Jup, you can create as many snapshots as you want and you can delete any snapshot without changing your most recent data. You just lose the ability to return to that old point in time when deleting a snapshot.

Also, keep in mind that snapshots will prevent freeing up space. So the older your snapshot gets, the bigger your dataset/zvol will grow. So they are great if you want to be able to do a super-fast "backup"/rollback and you don't want to keep that snapshot for long. For long-term backups, I would recommend setting up a Proxmox Backup Server (PBS). So snapshots are great on an hourly/daily basis and PBS backups on a daily/weekly/monthly/annual basis.
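On the ZFS level that workflow is just the following (the zvol name is hypothetical; PVE names VM disks like <pool>/vm-<vmid>-disk-<n>, and you can do the same via the webUI or qm snapshot):

Code:
zfs snapshot raid1data/vm-100-disk-0@pre-upgrade
# ...do the upgrade, verify everything works...
zfs destroy raid1data/vm-100-disk-0@pre-upgrade   # the zvol keeps its current (upgraded) state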
The NVMe drives I have (link above) are supplied by the data center... If they break / wear out I get a free replacement (I rent this server)... With that in mind:
1) Is it worth doing ZFS with an LXC that does 40+ TB over 3 months (the others are low, maybe 2 TB)?
If you don't have to pay for the SSD wear, I would simply use ZFS mirrors for everything. Then the wear isn't your problem, and a failing SSD wouldn't be a problem either, as you have the redundancy of the raid and everything will continue running as if nothing had happened. Just make sure to monitor the pools, so the datacenter can replace failed disks as fast as possible and you don't risk losing the remaining disk too, which would result in losing the whole pool.

2) Is it worth just running the qcow2 in the directory and taking the hit, or should I make an ext4 + LVM-thin area? I am running a load balancer / vhost pair in qcow2. It's very light on overall disk I/O.
I would convert them. See above.
3) The part where you talked about "Average write amplification of my PVE server is for example factor 20" concerns me that ZFS is really not a good system to be using on SSDs and is more of a HDD system... Would abandoning ZFS and falling back to good old ext4 / LVM-thin save me headaches with wear and TB written? Or am I looking at minimal gains here and overthinking this?
ZFS has all these great features and is triple-checking everything to ensure data integrity. These are heavy tasks that come at the cost of overhead. So yes, it is hitting the SSDs really hard, but only because otherwise this data integrity wouldn't be achievable. It really all depends on whether you are willing to pay more for better-quality SSDs (or to replace cheaper SSDs more often) so that your data is safer.
See it like this:
Would you store your food in a freezer or in a locker? The freezer will consume a lot of electricity and buying a freezer will cost you way more than a simple locker. But your food will slowly rot in the locker, while the freezer will prevent that. This is the same with ZFS vs other simple filesystems like ext4/xfs/ntfs/... . The food is your bits, the freezer is your ZFS and the locker is your EXT4.
If you don't want your bits to rot over time, get a copy-on-write filesystem with bit rot protection like ZFS/btrfs/ceph/ReFS. But all these got massive overhead.
4) Assuming I stick with ZFS, if I'm averaging 40 TB written / month... I feel like I could get 2-3+ years out of these... I'm just looking to not have to watch drive health as a full-time job. Maybe a bi-monthly check-in or some automated email that sends TB-written SMART totals (any ideas on an already-made script for this?). Replacing a drive once a year is not bad; monthly would be a PIA.
I've set up a Zabbix server LXC that uses a SMART template to monitor the wear of my SSDs, and I use a ZFS-on-Linux template to monitor my ZFS pools.
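If you just want the raw numbers without a whole monitoring stack, smartmontools already gives you the totals (drive names from your lsblk output):

Code:
smartctl -a /dev/nvme0n1 | grep -Ei 'percentage used|data units written'
# "Data Units Written" is also printed as a TB value in brackets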

If you don't want to set up a whole big monitoring stack, you can have a look at "zfs-zed". This package should come shipped with PVE and it can send you an email in case your pool degrades. But you need to set up a Postfix mail server first, so PVE is able to send emails.
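A minimal zed setup is basically just editing its config file (standard path; this assumes the zfs-zed package is installed and mail delivery on the host works):

Code:
# /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="root"       # where pool events (degraded, faulted, ...) get mailed to
ZED_NOTIFY_VERBOSE=1        # also notify when the pool is healthy, e.g. after a finished scrub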
 

Once again Dunuin you deserve a medal.

I think I'm gonna take the full plunge with ZFS.

With that in mind:
My drives are my drives... I can't expand the server... and I get free replacements of the drives when they die.

I think I'm going to go with the following unless you have any last comments or edits.


1) Would this be how to set a pool-wide quota of 90% for the 1.5 TB mirror?:
zfs set quota=1350G raid1data

2) If either of these drives dies... I assume I treat them both as bootable devices per the failed-device section, where I should use the sgdisk commands to clone the partitions over and then replace the drive in the pool.
https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration

I was going to then configure:

- 8K block size (this seems to be the default) for the mirror and the stand-alone pools (Should this be higher? Lower?)
- Enable LZ4 on all "pools" / ZFS storage (I'm banking on this barely touching the CPU)
- ashift 12
- No SWAP partition
- No cache partition
- 80 GB root/ZFS root default install partition
- Mirror with 1.5 TB space
- Maybe I leave 20-40 GB of space between the mirror and the general storage and another 10 at the end, just in case?
- 2x ~400GB+ stand-alone partitions (for general temp storage or data I don't care about losing)
---- I will create a Directory on one of these to store some ISOs / templates / a temp backup location etc. (I think that should be OK in terms of write wear)
- I will attempt "qm importdisk" for my qcow2 VMs so I don't have to put them in a Directory folder, then put them on mirror or single storage.
- Use: pvesm set YourDirStorageId --is_mountpoint yes -- as needed for a Directory situation so it doesn't fill root.
- Run trim every 12 hrs (check in on the status of this... perhaps more often)
- I'm going to trust that ARC will give up RAM if needed.
- I have a LibreNMS install that I should be able to pull drive data into via SNMP or an agent. I've seen mixed results on the reliability of SMART values. Maybe go by TB written?
 
My 5 cents about write intensity with ZFS/SSD:

I have 2x "Consumer" Western Digital SSDs (2x WDC WDS500G1R0A-68A4W0) as my "VM" ZFS Mirror Pool (about 20 Services, 11 Users, about 11 Postgres/MySQL databases), the total accumulated TBW over 1 year is 33.57, which was what I calculated ahead of time. The WD's are expected to last around 350 TBW, meaning I could go another 6.5 years (7.5 years total).

I have another SSD mirror for the Proxmox root (2x Samsung Evo 860 250GB SSD); this one has 5.05 TBW written after 1 year. The expected lifetime is 300 TBW.

All my services are pretty write-intensive, and still I end up with almost 40 TBW per year. That said, 40 TBW is not much even for consumer SSDs, and I would do it again this way instead of getting enterprise SSDs.
 
1) Would this be how to set a pool-wide quota of 90% for the 1.5 TB mirror?:
zfs set quota=1350G raid1data
I would run zfs list raid1data and add the "USED" to the "AVAIL" number. That is then the total amount of storage your ZFS pool has, which will probably be a bit smaller than 1.5 TB. Then use 90% of that for the quota. And when monitoring your pool usage, try to keep it under 80% of the total (or under ~89% of that quota).
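As a worked example (the numbers are hypothetical until you check your own output): if zfs list reports roughly 1.39T of USED + AVAIL in total, 90% of that is about 1.25T:

Code:
zfs list -o name,used,avail raid1data
zfs set quota=1250G raid1data
zfs get quota raid1data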

- 8K block size (this seems to be the default) for the mirror and the stand-alone pools (Should this be higher? Lower?)
- Enable LZ4 on all "pools" / ZFS storage (I'm banking on this barely touching the CPU)
- ashift 12
- No SWAP partition
- No cache partition
- 80 GB root/ZFS root default install partition
- Mirror with 1.5 TB space
- Maybe I leave 20-40 GB of space between the mirror and the general storage and another 10 at the end, just in case?
- 2x ~400GB+ stand-alone partitions (for general temp storage or data I don't care about losing)
---- I will create a Directory on one of these to store some ISOs / templates / a temp backup location etc. (I think that should be OK in terms of write wear)
- I will attempt "qm importdisk" for my qcow2 VMs so I don't have to put them in a Directory folder, then put them on mirror or single storage.
- Use: pvesm set YourDirStorageId --is_mountpoint yes -- as needed for a Directory situation so it doesn't fill root.
- Run trim every 12 hrs (check in on the status of this... perhaps more often)
- I'm going to trust that ARC will give up RAM if needed.
- I have a LibreNMS install that I should be able to pull drive data into via SNMP or an agent. I've seen mixed results on the reliability of SMART values. Maybe go by TB written?
Jup, sounds good. The PVE root filesystem itself will be totally fine with 32GB, unless you want to install any additional big packages or store ISOs/backups/templates there.
My 5 cents about write intensity with ZFS/SSD:

I have 2x "Consumer" Western Digital SSDs (2x WDC WDS500G1R0A-68A4W0) as my "VM" ZFS Mirror Pool (about 20 Services, 11 Users, about 11 Postgres/MySQL databases), the total accumulated TBW over 1 year is 33.57, which was what I calculated ahead of time. The WD's are expected to last around 350 TBW, meaning I could go another 6.5 years (7.5 years total).

I have another SSD mirror for the Proxmox root (2x Samsung Evo 860 250GB SSD); this one has 5.05 TBW written after 1 year. The expected lifetime is 300 TBW.

All my services are pretty write-intensive, and still I end up with almost 40 TBW per year. That said, 40 TBW is not much even for consumer SSDs, and I would do it again this way instead of getting enterprise SSDs.
Really depends on the workload:
[screenshot 1669712967356.png: SSD write statistics]
This for example is just my small PVE server running a fraction of my guests, and it's writing 144-190 GB per day to each SSD of the ZFS mirror... The guests are only 2x OPNsense as routers, 1x Zabbix LXC for monitoring, 1x Pihole LXC as DNS, 2x Debian VMs as proxies, 1x Nextcloud VM (where I don't upload any big files... it's just used to sync my password safe, contacts, calendar and notes), 1x Dokuwiki LXC and 1x HomeAssistant VM.
But my two SSDs each have 200GB capacity and 3600TB TBW, so even with that amount of writes they should survive some years. With a cheap 240GB consumer SSD with only 80-140TB TBW this would look way different; there I could exceed the rated durability in way under a year.

The bigger PVE server has even more guests and writes an additional 600GB per day. So in total it's roughly 0.9TB of writes per day while idling.
 
This for example is just my small PVE server running a fraction of my guests, and it's writing 144-190 GB per day to each SSD of the ZFS mirror... The guests are only 2x OPNsense as routers, 1x Zabbix LXC for monitoring, 1x Pihole LXC as DNS, 2x Debian VMs as proxies, 1x Nextcloud VM (where I don't upload any big files... it's just used to sync my password safe, contacts, calendar and notes), 1x Dokuwiki LXC and 1x HomeAssistant VM.

The bigger PVE server has even more guests and writes an additional 600GB per day. So in total it's roughly 0.9TB of writes per day while idling.
Really? I wonder what you do. I have two extra boxes for OPNsense and pfSense, apart from that, everything runs on my Proxmox:

- Gitlab (~100 Repositories currently)
- Nextcloud (9TB of data, the data folder is mounted from a spinning rust ZFS pool)
- Funkwhale
- Iris/Mopidy/Snapcast
- Miniflux
- Grafana
- Influx DB 2.0
- Home Assistant
- Invidious
- Photoview
- Rss-Bridge
- etc.

Instead of VMs, I use unprivileged LXC (9x) that run individual nested Docker systems, to separate concerns and keep resource usage low.
 
The biggest problem here is the small sync writes caused by the DBs, especially Zabbix writing tons of metrics to MariaDB every second. I also monitor NAND writes and not host writes, so the internal write amplification of the SSD is included. You probably also write more to your NAND flash; you just can't see it, as this happens as a black box inside your SSD, in case your SSD's firmware doesn't offer a way to monitor that.
And my ZFS pools are encrypted, which again doubles the write amplification.
 
Yes, I think I monitor host bytes, the Evo attribute is called "Total_LBAs_Written" and the WD reports "Host_Writes_GiB". Most of my DBs do 90% read and only rarely write, so this makes sense. I also have encryption+compression enabled on all my pools. Interesting..
 
Host writes vs NAND writes:
[screenshot 1669717307901.png]

And I prefer security over efficiency. If a service is attackable from the internet, I will run it in a VM, even if I could run it in an LXC. Here the filesystem of the VM can also add additional overhead. I for example once benchmarked the write amplification from guest OS to zvol. If I remember right, the ext4 of the guest OS caused a write amplification of factor 4 when doing random 4K sync writes. So for every 4K block of data written inside the guest OS it was writing 16K to the virtual disk. When only using LXCs you skip this too, as you need no additional filesystem on top of your ZFS; your LXCs can directly write to the ZFS datasets.

I've seen a total write amplification (from data written inside the guest down to the NAND of all SSDs) of between factor 3 (big sequential async writes) and factor 62 (random 4K sync writes), with an average of factor 20. This average was measured on my big node over months... not sure how big the total write amplification of this small node is. I just set it up last month and haven't measured that yet. But in case it's similar to my big node, the roughly 300GB per day that my small node is writing is probably just 15GB of actual data, amplified by factor 20 to 300GB per day.
 
Great, thank you. Very informative. My server is only available locally/IPsec/VPN, services are further separated into several VLANs and firewalled in a demilitarized zone (only responses allowed). Otherwise, I would also prefer VMs.
 
qm importdisk
I imported a BACKUP of my qcow2 VM that I took from my old node, and it indeed shows up as qcow2 on the old node... On restoring it to the (new) ZFS node, it apparently converted it to raw for me? Is this expected, and I guess I don't need to do the qm importdisk? The VM boots up, I see no issues.
 

Attachments: old-vm.PNG · new-vm.PNG
Jup, when restoring a VM/LXC from backup, PVE will automatically convert the disks so the target storage can store them. By default PVE uses zvols to store the virtual disks of VMs, and these are block devices and not filesystems, so they can only store raw data and not qcow2 files.
 
OK, so I think I've settled on a setup... I think this checks out.

Below... I believe the default value for compression ON is lz4. I just left it "on" in the GUI and didn't change it to lz4... should I have manually set lz4?

I also set quotas, and then the mount point flag on the "directory"-based storage I made.

Code:
root@node01:~# zfs get compression
NAME              PROPERTY     VALUE           SOURCE
data0             compression  on              local
data1             compression  on              local
raid1data         compression  lz4             local
rpool             compression  on              local
rpool/ROOT        compression  on              inherited from rpool
rpool/ROOT/pve-1  compression  on              inherited from rpool
rpool/data        compression  on              inherited from rpool
root@node01:~# zfs list
NAME               USED  AVAIL     REFER  MOUNTPOINT
data0              576K   231G       96K  /data0
data1              596K   231G      104K  /data1
raid1data          552K  1.39T       96K  /raid1data
rpool             7.05G  69.5G      104K  /rpool
rpool/ROOT        7.04G  69.5G       96K  /rpool/ROOT
rpool/ROOT/pve-1  7.04G  69.5G     7.04G  /
rpool/data          96K  69.5G       96K  /rpool/data
root@node01:~# zfs set quota=1250G raid1data
root@node01:~# zfs set quota=207G data0
root@node01:~# zfs set quota=207G data1
root@node01:~# zfs get quota
NAME              PROPERTY  VALUE  SOURCE
data0             quota     207G   local
data1             quota     207G   local
raid1data         quota     1.22T  local
rpool             quota     none   default
rpool/ROOT        quota     none   default
rpool/ROOT/pve-1  quota     none   default
rpool/data        quota     none   default
root@node01:~# pvesm set data1-storage --is_mountpoint yes
 
Below... I believe the default value for compression ON is lz4. I just left it "on" in the GUI and didn't change it to lz4... should I have manually set lz4?
Jup, compression=on is the same as compression=lz4, as lz4 is the default.
rpool             quota     none   default
rpool/ROOT        quota     none   default
rpool/ROOT/pve-1  quota     none   default
rpool/data        quota     none   default
I would also set a 90% quota for rpool.
 
Jup, compression=on is the same as compression=lz4, as lz4 is the default.

I would also set a 90% quota for rpool.
OK I will add that to rpool also.

I have a slight issue with the command pvesm set data1-storage --is_mountpoint yes... This ended up taking that storage offline. Am I doing this wrong?
 

Attachments: data-dir-post-command.PNG · data-dir-precommand.PNG · data-dir-make.PNG · data-dir-command.PNG
According to your screenshot's error message, your mountpoint isn't mounted, so that is exactly what it should do. Taking it offline is intended behavior, so you don't accidentally fill up your root filesystem. Better a storage that is unavailable than a killed PVE installation.

Check if /data1/storage is mounted by running lsblk and zfs get mounted,mountpoint data1/storage. When it is not mounted, you should fix that and make sure it always gets automounted.
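If the dataset exists but simply isn't mounted, something like this should bring it back (dataset name guessed from your storage path):

Code:
zfs get canmount,mounted,mountpoint data1/storage
zfs mount data1/storage   # with canmount=on and a mountpoint set, ZFS mounts it again at boot by itself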
 
According to your screenshot's error message, your mountpoint isn't mounted, so that is exactly what it should do. Taking it offline is intended behavior, so you don't accidentally fill up your root filesystem. Better a storage that is unavailable than a killed PVE installation.

Check if /data1/storage is mounted by running lsblk and zfs get mounted,mountpoint data1/storage. When it is not mounted, you should fix that and make sure it always gets automounted.
Are you saying a fstab mount?
 
