Proxmox ZFS 80% Rule, Pool or VM Storage?

Jun 30, 2021
I have a Proxmox server that houses an NVR running Blue Iris in a VM, and I have a ZFS pool with 3 x 18TB Seagate Exos in RAIDZ1. On that ZFS pool I have a 32000GB (32TB) VM disk for my Blue Iris VM to house recordings. Lately Blue Iris has been complaining about disk writes being too slow. I have been trying to figure out how to diagnose and test this, to no avail; however, I did read that ZFS has an 80% rule where you need to keep your usage below 80% to maintain performance, which could potentially be the cause of my issues.

However, I am trying to understand: does this 80% rule apply to the total space of the pool, or to the total space consumed on the pool? I am trying to understand whether I need to remake the pool or shrink the VM disk so it consumes <80%.

Thanks in advance for any advice you can give.
 
That 80% rule is not very good in my opinion. It is too strict for datasets and at the same time too relaxed for block storage.
If your pool is filled up to 80%, there is less of a chance that the HDDs find free contiguous space and can thus write without their heads jumping around. But performance begins to degrade much sooner. https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/
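For reference, you can read both numbers that matter here straight off the pool; a minimal check, with "tank" as a placeholder for your pool name:

Code:
zpool list tank
zpool get capacity,fragmentation tank

CAP in zpool list is the percentage of the pool that is actually allocated, and FRAG describes how fragmented the remaining free space is.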

Lately Blue Iris has been complaining about disk writes being too slow.
RAIDZ1 with 3 disks is not a fast pool layout.
On top of that, you have virtualization and block storage overhead.

Without knowing too much about your use case, I would probably set up a bare-metal Blue Iris host on a small NVMe SSD and then use one 18TB Exos as storage for the recordings. I don't know why you would need RAID. Intel Quick Sync will also be a lot easier to set up and will help a lot with performance.
 
That 80% rule is not very good in my opinion. It is too strict for datasets and at the same time too relaxed for block storage.
If your pool is filled up to 80%, there is less of a chance that the HDDs find free contiguous space and can thus write without their heads jumping around. But performance begins to degrade much sooner. https://www.truenas.com/community/threads/the-path-to-success-for-block-storage.81165/


RAIDZ1 with 3 disks is not a fast pool layout.
On top of that, you have virtualization and block storage overhead.

Without knowing too much about your use case, I would probably set up a bare-metal Blue Iris host on a small NVMe SSD and then use one 18TB Exos as storage for the recordings. I don't know why you would need RAID. Intel Quick Sync will also be a lot easier to set up and will help a lot with performance.
I am using RAID since one 18TB Exos is not enough and I need failure tolerance so my recordings don't disappear if a disk fails.
 
While @IsThisThingOn is right, I would like to add:

You get the IOPS of a single spindle with RAIDZ1, and this is considered very slow nowadays. To speed this up a little bit you could add a small (let's say 200GB) "Special Device" in a mirrored(!) vdev, SSD/NVMe of course. This will contain metadata only and it will increase the IOPS by at least a factor of two, or more.
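If you try that, the rough shape of the command would be something like the following; the pool name and the two disk IDs are placeholders, and the special vdev has to be redundant because losing it means losing the pool:

Code:
zpool add tank special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
zpool status tank

Keep in mind that only metadata written after the change lands on the special vdev; existing metadata stays on the HDDs until it is rewritten.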
 
I have a Proxmox server that houses an NVR running Blue Iris in a VM, and I have a ZFS pool with 3 x 18TB Seagate Exos in RAIDZ1. On that ZFS pool I have a 32000GB (32TB) VM disk for my Blue Iris VM to house recordings. Lately Blue Iris has been complaining about disk writes being too slow. I have been trying to figure out how to diagnose and test this, to no avail; however, I did read that ZFS has an 80% rule where you need to keep your usage below 80% to maintain performance, which could potentially be the cause of my issues.

However, I am trying to understand: does this 80% rule apply to the total space of the pool, or to the total space consumed on the pool? I am trying to understand whether I need to remake the pool or shrink the VM disk so it consumes <80%.

Thanks in advance for any advice you can give.
Where did you read about that 80% rule? Given what I understand of RAID and ZFS, maybe it is slow because many different parts of your files are split across multiple disks? And if fast reading and writing is required for large files, I'm guessing things could get very slow very quickly. You could try using just one disk and then possibly backing up the contents of that disk to the other two at a later time. (Just a suggestion.)
 
Where did you read about that 80% rule? Given what I understand of RAID and ZFS, maybe it is slow because many different parts of your files are split across multiple disks? And if fast reading and writing is required for large files, I'm guessing things could get very slow very quickly. You could try using just one disk and then possibly backing up the contents of that disk to the other two at a later time. (Just a suggestion.)
https://www.45drives.com/community/articles/zfs-80-percent-rule/; there are some threads about it as well.

As previously mentioned, I am not using RAID to mirror one disk. I have 3 disks, where one disk's worth of space goes to parity; 18TB is not enough for my use case. I could add a 4th drive though, and from what I gather I would stripe two mirrored vdevs to improve this issue.
 
I am using RAID since one 18TB Exo is not enough and I need failure tolerance so my recordings don't disappear if a disk fails.
Well, these disks will probably survive the hardware they are running on, or at least throw a S.M.A.R.T. error first.

Why do you need fault tolerance only at the disk level?

Is the PSU redundant?
The RAM?
Do you run a cluster, so one node can fail for whatever reason?
Are the cams redundant?
Do the cams have redundant LAN cables?
Are your Switches clustered?

You can of course run redundant disks for a nice warm feeling, but don't fool yourself. This is far, far away from having HA.


Where did you read about that 80% rule?
That rule is so old, it can almost drink alcohol.
Given what I understand of RAID and ZFS, maybe it is slow because many different parts of your files are split across multiple disks?
Almost: not split across multiple disks, but across sectors on your drive (so the disk heads have to jump around).
 
I could add a 4th drive though, and from what I gather I would stripe two mirrored vdevs to improve this issue.
Sure, that would give you a RAID10 in traditional terms and give you the write speed of two disks.
But I doubt it would be a huge improvement; a 3-wide RAIDZ1 should not be that much slower, because in theory it also has close to the write speed of two disks.

Of course, another potential problem could be a wrong pool geometry, depending on the volblocksize you are using. That isn't an issue with mirrors.
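For completeness: ZFS cannot reshape an existing RAIDZ1 into mirrors in place, so going to striped mirrors would mean backing up the recordings, destroying the pool and recreating it roughly along these lines (pool name and disk IDs are placeholders, and zpool create wipes those disks):

Code:
zpool create tank \
  mirror /dev/disk/by-id/ata-EXOS_1 /dev/disk/by-id/ata-EXOS_2 \
  mirror /dev/disk/by-id/ata-EXOS_3 /dev/disk/by-id/ata-EXOS_4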
 
You get the IOPS of a single spindle with RAIDZ1, and this is considered very slow nowadays. To speed this up a little bit you could add a small (let's say 200GB) "Special Device" in a mirrored(!) vdev, SSD/NVMe of course. This will contain metadata only and it will increase the IOPS by at least a factor of two, or more.
Do you think his workload is IOPS bound? My guess is that it is mostly sequential writes.
Proxmox and the VM itself are hopefully not on that RAIDZ1, but on an SSD!?
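One way to find out is to watch the pool while Blue Iris is complaining; a simple sketch, with "tank" standing in for the real pool name:

Code:
zpool iostat -v tank 5

Roughly speaking, lots of write operations with little bandwidth points at an IOPS problem, while bandwidth close to what three spindles can deliver points at plain sequential throughput as the limit.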
 
Well, these disks will probably survive the hardware they are running on, or at least throw a S.M.A.R.T. error first.

Why do you need fault tolerance only at the disk level?

Is the PSU redundant?
The RAM?
Do you run a cluster, so one node can fail for whatever reason?
Are the cams redundant?
Do the cams have redundant LAN cables?
Are your Switches clustered?

You can of course run redundant disks for a nice warm feeling, but don't fool yourself. This is far, far away from having HA.



That rule is so old, it can almost drink alcohol.

Almost: not split across multiple disks, but across sectors on your drive (so the disk heads have to jump around).

I have spare PSUs and RAM that I can swap the same day, so that is not an issue; a disk failure takes far longer to resolve. The rest you mentioned doesn't concern me as much, not everything needs to be HA. Streaming data to disks 24/7 is more likely to cause a disk failure than anything else listed.

Regardless, RAID is required because I need more storage than one drive can hold, so this is moot; otherwise I would have to have three separate drives and constantly move recordings among them as storage fills up.

Sure, that would give you a RAID10 in traditional terms and give you the write speed of two disks.
But I doubt it would be a huge improvement; a 3-wide RAIDZ1 should not be that much slower, because in theory it also has close to the write speed of two disks.

Of course, another potential problem could be a wrong pool geometry, depending on the volblocksize you are using. That isn't an issue with mirrors.
I created it in the Proxmox GUI; I didn't see an option anywhere to set a volblocksize, unless I am blind.

Do you think his workload is IOPS bound? My guess is that it is mostly sequential writes.
Proxmox and the VM itself are hopefully not on that RAIDZ1, but on an SSD!?
Correct, there is an NVMe SSD housing the VMs. The RAIDZ1 pool is mounted to the Blue Iris VM and is solely used to house recordings and alerts.
 
I have spare PSUs and RAM that I can swap the same day, so that is not an issue; a disk failure takes far longer to resolve.
So we went from "I don't want to lose recordings" to "I don't want more than a day of downtime" ;)

a disk failure takes far longer to resolve.
Why? New disk in, same drive letter in Windows, done. That is way easier than switching a PSU.
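For the current ZFS pool it is not much harder either; replacing a failed disk is roughly this, with pool and device names as placeholders:

Code:
zpool replace tank /dev/disk/by-id/ata-FAILED_DISK /dev/disk/by-id/ata-NEW_DISK
zpool status tank    # watch the resilver progress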

I created it in the Proxmox GUI; I didn't see an option anywhere to set a volblocksize, unless I am blind.
Since a few versions, Proxmox uses a default volblocksize of 16k. You can see this under the storage configuration.
16k should work perfectly for a 3-wide RAIDZ1, so the pool geometry isn't wrong.
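If you want to double-check what your existing zvol actually uses, something like this shows it (the grep assumes the default /etc/pve/storage.cfg location):

Code:
zfs list -t volume -o name,volsize,volblocksize
grep -A 6 zfspool /etc/pve/storage.cfg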


Regardless, RAID is required because I need more storage than one drive can hold, so this is moot; otherwise I would have to have three separate drives and constantly move recordings among them as storage fills up.

Got it. Then I would use the NVMe as the Windows boot drive and use the motherboard's software RAID for the three disks.
That gives Blue Iris direct NTFS access, instead of NTFS on a virtual image on top of ZFS.

Keep it Simple Stupid (KISS).

By the way, using software that is based on Windows is not really great for anything server-related.
Unless you air-gap the system, you will have many reboots for Windows updates.
 
Do you think his workload is IOPS bound? My guess is that it is mostly sequential writes.
No, probably not.

But a) ZFS tends to fragment in the long run. And while I do not know Blue Iris, and while a video stream is probably sequential, there may be "events" and "motion detection" resulting in a high number of much smaller files. (Or database operations?)

b) In any case a Special Device would store the metadata, so that the HDDs do not need to move their heads for it. In my understanding this also helps for large files / streaming, at least a bit :-)
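A small illustration of both points, with "tank" as a placeholder pool name; the dataset in the second line is hypothetical and only applies if the recordings ever live on a plain dataset rather than inside the VM's zvol:

Code:
zpool list -v tank                                 # the special mirror shows up with its own ALLOC/FREE
zfs set special_small_blocks=64K tank/recordings   # hypothetical dataset: records <=64K would also land on the SSDs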
 
I am using RAID since one 18TB Exos is not enough and I need failure tolerance so my recordings don't disappear if a disk fails.
Long-term, you should consider rebuilding this as at least a 6-disk RAIDZ2, especially with large disks over ~10-12TB. With RAIDZ1, if a disk fails, the whole pool is "at risk" until the drive is replaced and the resilver finishes. RAIDZ2 gives you an additional disk of parity and protection while that maintenance is being performed.

Then down the road, if you need more I/O speed or free space, add another 6-disk RAIDZ2 vdev. A SAS disk shelf is good for expansion.
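Roughly, that layout looks like this on the command line; the pool name and disk IDs are placeholders, and creating a pool destroys whatever is on those disks:

Code:
zpool create tank raidz2 DISK1 DISK2 DISK3 DISK4 DISK5 DISK6
zpool add tank raidz2 DISK7 DISK8 DISK9 DISK10 DISK11 DISK12   # later, to grow the pool with a second vdev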
 
I am trying to understand whether I need to remake the pool or shrink the VM disk so it consumes <80%.
This is a rule of thumb. Reality is a bit more nuanced.

A ZFS pool operates optimally for writes for the first 100% of TBW (total bytes written). This is due to the CoW nature of ZFS: all your writes are written sequentially to empty, contiguous space. Once 100% TBW has been exceeded, further writes require hunting for the next available set of blocks, starting at the beginning of the address space. Every subsequent pass ends up with more and more fragmentation, which is why your write performance continues to deteriorate as more time has to be spent literally looking for space to fit your write requests.

The idea behind keeping your pool under 80% utilized is to try to control the empty block space and reduce/limit fragmentation, but ultimately this has more to do with your data write patterns and the release of blocks (e.g. what your snapshot retention, trim schedule, etc. look like).
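In other words, what the rule primarily cares about on this setup is how much of the pool is physically allocated (and how fragmented it is), not the 32TB provisioned for the zvol. A quick way to see both, with "tank" and the zvol name as placeholders:

Code:
zpool list tank
zfs get volsize,refreservation,used,referenced tank/vm-100-disk-0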
 
