Looking for advice on how to configure ZFS with my drives and use case

strikeraj
Sep 30, 2023
Hi,
I am looking to set up Proxmox VE on a Dell R630 server:
Dual Xeon E5-2640 v4
256GB RAM
4x 512GB NVMe drives on a 4-port PCIe card (x4x4x4x4 bifurcation)
8x 2TB 3.5" HDD (running through extended cables from the front ports to a custom rack) (RAIDZ2 maybe?)
4TB USB external drive (for nightly backup of the Samba share dataset)
10GbE NIC

Use case:
1. Windows VM to perform read-heavy optimization (datasets ~20GB each, calculation done by the CPU)
2. Windows VM to perform video encoding (Adobe Media Encoder, H.265)
3. 2TB dataset for Samba sharing as a network drive (this part gets rsync'd to the external HDD nightly, and every quarter I bring in another USB drive that I store off-site for backup; see the sketch after this list)
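A minimal sketch of that nightly rsync job, assuming the share lives in a dataset mounted at /tank/share and the USB drive is mounted at /mnt/usb-backup (both paths are made up for illustration):

Code:
# /etc/cron.d/share-backup -- hypothetical paths, runs nightly at 02:00
0 2 * * * root rsync -aH --delete /tank/share/ /mnt/usb-backup/share/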

I understand that I should not put the SSDs in the same VDEV as the HDDs. The part I don't truly understand is how ZFS handles an SSD VDEV and an HDD VDEV under the same zpool.

And with the hardware I have, how should I set up the zpool so the system stays safe if any 2 HDDs and/or 2 SSDs fail at the same time? (I check zpool status weekly.)

Thanks
Tom
 
I understand that I should not put the SSDs in the same VDEV as the HDDs. The part I don't truly understand is how ZFS handles an SSD VDEV and an HDD VDEV under the same zpool.
When adding the SSDs as normal VDEVs to the pool with the HDDs, and not as "special", "log" or "cache" vdevs, the pool will be a stripe. So it will write some MBs to the SSDs, then some MBs to the HDDs, then some MBs to the SSDs, and so on. You basically slow your SSDs down to HDD performance. Either use two independent pools, or add the SSDs as a "special" vdev and then work with the "special_small_blocks" attribute.
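For illustration, a rough sketch of both options; the device names, the mirror layout and the 16K threshold below are placeholders, not a recommendation:

Code:
# Option A: two independent pools
zpool create -o ashift=12 fastpool raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
zpool create -o ashift=12 tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh

# Option B: SSDs as a mirrored "special" vdev inside the HDD pool,
# with blocks up to 16K redirected to the SSDs
# (losing the special vdev loses the pool, so it needs its own redundancy)
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=16K tank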

I check zpool status weekly
You really should set up monitoring so the pool isn't running in a degraded state for a week.
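A sketch of how that could look with the ZFS Event Daemon that ships with Proxmox, assuming mail delivery on the host already works (values are examples):

Code:
# /etc/zfs/zed.d/zed.rc (excerpt)
ZED_EMAIL_ADDR="root"
ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1
# apply the change:
systemctl restart zfs-zed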

Windows VM to perform video encoding (Adobe Media Encoder, H.265)
I don't know if AME supports GPU encoding, but if it does, it would make sense to pass a GPU through via PCIe passthrough.
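If a GPU does end up in the box, the Proxmox side of the passthrough is roughly a one-liner (VMID and PCI address are placeholders; IOMMU must be enabled and the VM should be q35/OVMF):

Code:
qm set 101 -hostpci0 0000:03:00.0,pcie=1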

4x 512GB NVMe drives on a 4-port PCIe card (x4x4x4x4 bifurcation)
8x 2TB 3.5" HDD (running through extended cables from the front ports to a custom rack) (RAIDZ2 maybe?)
Be aware that it is highly recommended not to use any SMR HDDs, and no consumer SSDs without power-loss protection, especially not with QLC NAND, when using ZFS.
With M.2 there are only a handful of SSD models that would actually be reasonable to use with ZFS.
 
First of all, your CPU will become a performance bottleneck if you want to achieve the full performance of this setup.

I see here two possible scenarios which I would do personally:

First scenario
simplest way

1. RAIDZ2 pool with 4 x 512GB = 1024GB n+2
2. RAIDZ2 pool with 8 x 2TB = 12TB n+2
3. ARC min. 64GB

Second Scenario
simple but different

1. (hot storage) MIRROR pool with 2 x 512GB = 512GB n+1
2. (hybrid warm/cold storage) RAIDZ2 pool with 8 x 2TB + MIRROR pool with 2 x 512GB (504GB read / 8GB write cache, needs to be calculated) = 12TB + 512GB cache, n+2/n+1 (see the sketch after this list)
3. ARC min. 64GB
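To make point 2 of the second scenario concrete, a sketch assuming the two SSDs are each split into a small partition for a mirrored SLOG and a large partition for L2ARC (partition names are made up; the SLOG only helps sync writes):

Code:
# mirrored log device (sync write cache) from the small partitions
zpool add tank log mirror /dev/nvme2n1p1 /dev/nvme3n1p1
# striped read cache (L2ARC) from the large partitions
zpool add tank cache /dev/nvme2n1p2 /dev/nvme3n1p2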

Your Questions:

Why shouldn't you mix silicon with mechanical? Simply put: if you build a RAIDZ2 with, for example, 4 x 512GB and 8 x 2TB drives, you end up with 12 x 512GB, because the smallest drive defines the usable capacity per disk, and you get horrible performance, because the slowest device defines the overall performance of the array, and that will be your mechanical drives. So your NVMe drives are ready for retirement because they can't play to their strengths.

About reliability, it depends on what kind of NVMe drives and HDDs you are using. From my personal experience: I don't use enterprise hardware in my homelab and have gotten quite far with it so far. I definitely wouldn't recommend WD Green SSDs for 24/7 operation, or really in general, because their cache is garbage. I used Samsung NVMe drives for over five years, but they died fairly soon after I put some sustained workload on them. They got replaced by Lexar NM790 2TB NVMe drives. A 6GB/s NVMe mirror is neat, by the way. For my "cheap" storage I use Crucial MX500 2TB SSDs. Currently I have 4 x 2TB in RAIDZ2 and watch them slowly dying, but 1GB/s+ is also neat for "cheap" and "slow" storage. For the OS I will replace my WD Greens with Silicon Power A55 256GB drives, because they have SLC caching, which made a much better impression in my secondary Proxmox host. The issue with the Greens is that if you copy a small ISO of 500-1000MB onto the host, or install an OS on the local-lvm, the whole Proxmox host freezes, which is not nice in my opinion.

But I am also not normal, because I use SHA512 hashing with AES256 encryption on all my ZFS pools for VMs... My Ryzen 5 3600 was too weak for this; since I got a Ryzen 9 5950X we can go a little bit further, but it is still too weak lmao.
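For reference, the checksum/encryption combination mentioned above is just a pair of ZFS properties; a sketch with a hypothetical pool and dataset name:

Code:
# encrypted dataset for VM disks with SHA-512 checksums (names and values are examples)
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/vms
zfs set checksum=sha512 tank/vms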
 
Thanks for the replies!

@Dunuin So am I understanding correctly that even if I group the SSDs under one VDEV and the HDDs under a separate VDEV, if I don't use a special VDEV, ZFS would not be able to utilize the extra speed of the SSDs?

@m4k5ym Thanks for your reply! One quick question: does a RAIDZ2 pool or a mirror pool perform more writes to the SSDs?
 
So am I understanding correctly that even if I group the SSDs under one VDEV and the HDDs under a separate VDEV, if I don't use a special VDEV, ZFS would not be able to utilize the extra speed of the SSDs?
Yes, ZFS will just stripe all the normal vdevs, so you shouldn't mix normal vdevs with different performance characteristics, as the pool will be as slow as the slowest vdev.
does a RAIDZ2 pool or a mirror pool perform more writes to the SSDs?
That depends on the layout, workload and ZFS options. Comparing 4 disks in RAIDZ2 with 4 disks as a striped mirror, I would guess that the striped mirror writes less, as you can work with a lower volblocksize, so small writes aren't amplified as much. Especially when doing small random sync writes, like when running DBs.
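On Proxmox the volblocksize of newly created VM disks is set per storage; a sketch of the relevant bits (storage name and the 16k value are only examples, and existing zvols keep whatever they were created with):

Code:
# /etc/pve/storage.cfg (excerpt)
zfspool: fastpool
        pool fastpool
        blocksize 16k
        content images,rootdir

# check an existing zvol (hypothetical name):
zfs get volblocksize fastpool/vm-100-disk-0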
 
Thanks @Dunuin !

One more question regarding storage:

Does ZFS performance scale with ARC RAM size indefinitely? I am not talking about a linear speed increase, but my thought is that a larger ARC allows more commonly read blocks to be kept there, to the point that if I have 200G of ARC and 180G of files that my code repeatedly reads, it can put everything into ARC and run at the limit of the CPU.

Another question that is not related to storage:

So I have it roughly set up and I am testing the use cases at the moment. I was experimenting to see if I can actually do video editing remotely on it, using Adobe CC. Right now it is on the LAN and I am connecting through RDP. I have a Quadro P1000 passed through to the Windows guest and confirmed that it is working (it can use NVENC for the encoding). But the user experience is still not close to what I get on my local laptop, which runs on an iGPU with 16GB of RAM.

Do you have any recommendation on what I can tweak to improve the user experience?
 
Does ZFS performance scale with ARC RAM size indefinitely? I am not talking about a linear speed increase, but my thought is that a larger ARC allows more commonly read blocks to be kept there, to the point that if I have 200G of ARC and 180G of files that my code repeatedly reads, it can put everything into ARC and run at the limit of the CPU.
Depends on your workload. But if you have 180GB of data that is being read all the time and your ARC is 200GB, it would be way faster, as the reads would then hit the RAM and not the disks.
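If you want to cap or pin the ARC to a size like that, it is done via module parameters; a sketch with example values in bytes (~200 GiB max, ~64 GiB min):

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=214748364800
options zfs zfs_arc_min=68719476736
# make it persistent, then reboot:
update-initramfs -u -k all
# or change the limit at runtime:
echo 214748364800 > /sys/module/zfs/parameters/zfs_arc_max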

Do you have any recommendation on what I can tweak to improve the user experience?
I prefer Parsec over RDP. Fewer compression artifacts and better latency. It looks better and feels way more snappy.
For better utilization of the host's hardware, make sure to use VirtIO SCSI and a VirtIO NIC, and in case you are running a single node, set the CPU type to "host" so the VM can actually make use of all features of the CPU. And in case you trust all your guests (for example, no port forwards and only people you trust accessing the guests), you could edit GRUB to disable mitigations, making your server more vulnerable but also faster.
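As a rough illustration of those knobs (VMID, storage, MAC and bridge are placeholders; if the host boots via systemd-boot instead of GRUB, the kernel option goes into /etc/kernel/cmdline instead):

Code:
# excerpt of a hypothetical /etc/pve/qemu-server/101.conf
cpu: host
scsihw: virtio-scsi-single
scsi0: fastpool:vm-101-disk-0,discard=on,iothread=1
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0

# /etc/default/grub -- disables CPU vulnerability mitigations; only if you trust all guests
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
# then:
update-grub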
 
I am not sure what I was doing wrong, but with Parsec I got disconnected a lot, and it was not smooth while running either...
 
I used Samsung NVMe drives for over five years, but they died fairly soon after I put some sustained workload on them. They got replaced by Lexar NM790 2TB NVMe drives. A 6GB/s NVMe mirror is neat, by the way.
Hi,
I have the same type of SSD, a Lexar NM790 2TB, which seems to be quite unstable.
How stable is yours?
Which Linux kernel are you using?
 
Define unstable. I have had them running for a few months now and have had no issues yet. They are just one thing: hella fast.

Linux localhost 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
 
Define unstable. I have had them running for a few months now and have had no issues yet. They are just one thing: hella fast.

Linux localhost 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
Runs ok for around a day or less, then crashes.
Described in detail in the following post:
https://forum.proxmox.com/threads/lexar-nm790-2tb.134293/

The SSD model is often mentioned online for having limited compatibility with Linux, but it may also be something with my desktop (HP ProDesk G4 Mini)
 
Depends on your workload. But if you have 180GB of data that is being read all the time and your ARC is 200GB, it would be way faster, as the reads would then hit the RAM and not the disks.
So I have been running the code for 5 hours now; the code is reading the files (as I can see from Windows Task Manager, it is reading from the disk), but so far they still have not been moved into ARC. I have checked and I still have free memory. arc_summary is also attached below.

Code:
ZFS Subsystem Report                            Wed Oct 04 23:48:10 2023
Linux 6.2.16-3-pve                                           2.1.12-pve1
Machine: strikerajr630 (x86_64)                              2.1.12-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    32.8 %   84.0 GiB
        Target size (adaptive):                        50.0 %  128.0 GiB
        Min size (hard limit):                         50.0 %  128.0 GiB
        Max size (high water):                            2:1  256.0 GiB
        Most Frequently Used (MFU) cache size:         72.6 %   60.7 GiB
        Most Recently Used (MRU) cache size:           27.4 %   22.9 GiB
        Metadata cache size (hard limit):              75.0 %  192.0 GiB
        Metadata cache size (current):                  0.2 %  480.7 MiB
        Dnode cache size (hard limit):                 10.0 %   19.2 GiB
        Dnode cache size (current):                   < 0.1 %    1.9 MiB

ARC hash breakdown:
        Elements max:                                               1.2M
        Elements current:                              98.8 %       1.2M
        Collisions:                                                59.0k
        Chain max:                                                     2
        Chains:                                                    10.0k

ARC misc:
        Deleted:                                                      14
        Mutex misses:                                                  0
        Eviction skips:                                                1
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                 247.0 KiB
        L2 eligible MFU evictions:                      0.0 %    0 Bytes
        L2 eligible MRU evictions:                    100.0 %  247.0 KiB
        L2 ineligible evictions:                                 4.0 KiB

ARC total accesses (hits + misses):                               201.4M
        Cache hit ratio:                              100.0 %     201.3M
        Cache miss ratio:                             < 0.1 %      35.2k
        Actual hit ratio (MFU + MRU hits):            100.0 %     201.3M
        Data demand efficiency:                       100.0 %      92.1M
        Data prefetch efficiency:                       1.9 %      34.1k

Cache hits by cache type:
        Most frequently used (MFU):                    95.4 %     192.0M
        Most recently used (MRU):                       4.6 %       9.3M
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0
        Anonymously used:                             < 0.1 %       1.8k

Cache hits by data type:
        Demand data:                                   45.7 %      92.1M
        Prefetch data:                                < 0.1 %        633
        Demand metadata:                               54.2 %     109.2M
        Prefetch metadata:                            < 0.1 %      64.5k

Cache misses by data type:
        Demand data:                                    2.9 %       1.0k
        Prefetch data:                                 95.1 %      33.5k
        Demand metadata:                                1.8 %        633
        Prefetch metadata:                              0.3 %         95

[screenshot attachments]
 
After running for another 24 hours, those 18GB files are still not in ARC.
I am starting to wonder... is there a size limit on files that go into ARC? Or is ZFS not able to recognize the files inside the VM volume?

[screenshot attachment]
 
@m4k5ym @Dunuin
I am now even more confused
Update:
Today, just to try it, I copied the exact same dataset and code to my other W10 guest on the same node (the one that I plan to use for video editing), and I cannot find any disk I/O ANYWHERE, while arc_summary is still not showing the size of the data as in use.

From atop

Code:
    PID               TID             RDDSK            WRDSK            WCANCL             DSK            CMD
1101858                 -              5.9G            12.0K                0B            100%            kvm  <-- original guest
1426877                 -                0B           140.0K                0B              0%            kvm  <-- guest I tried today

From Windows

Original guest: [screenshot attached]
Guest I tried today: [screenshot attached]

On the PVE GUI

Original guest: [screenshot attached]
Guest I tried today: [screenshot attached]


TIA!
 
I can just tell you that my NM790s perform with zero issues. I can also hit 6GB/s since they are connected to PCIe 4.0, and even with my heavy encryption and a 16-core chip I reach a good 2GB/s with ease. Also, you must be aware that Windows is not known to be a performance wonder in its default configuration, and the readings from Task Manager are mostly garbage for evaluation. As well, you have a rig with two NUMA nodes, which also affects your performance if it is not handled correctly. There are so many variables to play with here, but first things first: be so kind and share your Proxmox VM config file.

The theory I have, for example, is that since you have an x4x4x4x4 card sitting in one x16 slot, that card is already fully bound to just one of your CPUs. Therefore I would make sure that my VM is running on that CPU to reduce the stress on the NUMA link, which results in lower latency etc.

Second thing: to optimize ZFS you need to study ZFS. There are so many settings and optimizations that you can really make a ton of money once you master them, hehehe. Also, you must remember that ZFS was not originally designed for SSDs, which already costs a little performance. You can play with the ashift, which can work wonders just by itself.
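A sketch of those points (VMID and core list are placeholders; the affinity option for pinning exists on recent PVE versions, and which cores belong to which socket can be checked with lscpu):

Code:
# print the VM config so it can be posted here
qm config 101
# expose NUMA topology to the guest and pin it to the socket that owns the NVMe card
qm set 101 --numa 1
qm set 101 --affinity 0-9,20-29
# ashift is fixed at pool creation time; check it with:
zpool get ashift tank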
 
There are so many variables to play with here, but first things first: be so kind and share your Proxmox VM config file.
Thanks for your reply! Newbie question here... how do I share my Proxmox VM config file?

Right now, in this test build, it is only running with 3x 512GB SATA SSDs. I am still waiting for the rest of the hardware to arrive.

Do you have a recommended source for learning ZFS? I have read the general guides available via Google, but when I started reading the OpenZFS documentation my head went spinning... I am not an IT professional, just a hobbyist lol
 
I can just tell you that my NM790s perform with zero issues. I can also hit 6GB/s since they are connected to PCIe 4.0, and even with my heavy encryption and a 16-core chip I reach a good 2GB/s with ease.
Hi,
What kernel are you using, and what is your kernel boot command line?
 
