ZFS vs Single disk configuration recommendation

hidagar

Member
Feb 3, 2022
Hello

I'm building a new PVE cluster.

2 servers have these specs:

CPU: 72 x Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
HDD: 8 x Samsung PM983
HDD: 2x SSD 128GB for OS
RAM: 376GB

My question is which filesystem I should use. I will also run a PBS on another server.

Thanks
 
Depends...
Is HA required? How important is your data, and would it be fine to, for example, lose 1 minute of data? How many NICs do you have and how fast are they? How many nodes will your cluster consist of? What models are those 128GB SSDs?
 
Depends...
Is HA required? How important is your data, and would it be fine to, for example, lose 1 minute of data? How many NICs do you have and how fast are they? How many nodes will your cluster consist of? What models are those 128GB SSDs?
We don't need HA. We have 2 nodes with 8 x 2TB NVMe disks each, and every node has 4 x 10Gbps NICs. Losing a minute is not a problem. Also, the 128GB SSDs are only there to run the OS.
 
Also, the 128GB SSDs are only there to run the OS.
But PVE is write-heavy. Especially when using ZFS for a software raid1, it can quickly kill cheap consumer SSDs. Here, those 120GB TLC consumer SSDs often don't survive a year. So make sure to always have a cold spare at hand.

We don't need HA. We have 2 nodes with 8 x 2TB NVMe disks each, and every node has 4 x 10Gbps NICs. Losing a minute is not a problem.
Then probably a striped mirror with ZFS replication would be a good option.
 
But PVE is write-heavy. Especially when using ZFS for a software raid1, it can quickly kill cheap consumer SSDs. Here, those 120GB TLC consumer SSDs often don't survive a year. So make sure to always have a cold spare at hand.


Then probably a striped mirror with ZFS replication would be a good option.
I will buy a better SSD; right now I have a consumer one. Do you recommend RAIDZ3?

Thanks
 
I will buy a better SSD; right now I have a consumer one. Do you recommend RAIDZ3?

Thanks
Any raidz won't be great if you need performance or a small block size (like when running DBs). An 8 disk raidz3 pool would require you to increase the block size from 8K (75% capacity loss) to 64K (43% capacity loss) or even 256K (38% capacity loss), or the padding overhead will be big. And IOPS performance only scales with the number of vdevs, not the number of disks, so you only get the IOPS of a single disk, while a striped mirror would be 4 times faster.
And a raidz is less flexible, as you for example won't be allowed to remove a vdev later, and you would have to add 8 more disks in case you ever want to extend it. Resilvering is also magnitudes faster with a striped mirror (but that is not as important with SSDs).
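
If you want to check those percentages yourself, the padding math can be sketched in a few lines of Python. This assumes ashift=12 (4K sectors), no compression, and the usual raidz allocation rule of rounding each block up to a multiple of parity+1 sectors; it's only an approximation, not an official calculator:

```python
# Rough estimate of raidz space efficiency for one written block.
# Assumptions: ashift=12 (4K sectors), no compression, and the usual raidz
# allocation rule: data sectors + parity per stripe, rounded up to a
# multiple of (parity + 1) sectors.
from math import ceil

def raidz_efficiency(disks: int, parity: int, volblocksize: int, sector: int = 4096) -> float:
    """Fraction of the allocated space that holds actual data."""
    data = ceil(volblocksize / sector)                 # data sectors needed
    stripes = ceil(data / (disks - parity))            # stripes the block spans
    total = data + stripes * parity                    # plus parity sectors
    total = ceil(total / (parity + 1)) * (parity + 1)  # pad to multiple of parity+1
    return data / total

for vbs in (8 * 1024, 64 * 1024, 256 * 1024):
    eff = raidz_efficiency(disks=8, parity=3, volblocksize=vbs)
    print(f"volblocksize {vbs // 1024:>3}K: {eff:.0%} usable, {1 - eff:.0%} lost")
# volblocksize   8K: 25% usable, 75% lost
# volblocksize  64K: 57% usable, 43% lost
# volblocksize 256K: 62% usable, 38% lost
```

That's also why a bigger volblocksize helps: the fixed per-block parity and padding get amortized over more data sectors.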
 
Any raidz won't be great if you need performance or a small block size (like when running DBs). An 8 disk raidz3 pool would require you to increase the block size from 8K (75% capacity loss) to 64K (43% capacity loss) or even 256K (38% capacity loss), or the padding overhead will be big. And IOPS performance only scales with the number of vdevs, not the number of disks, so you only get the IOPS of a single disk, while a striped mirror would be 4 times faster.
And a raidz is less flexible, as you for example won't be allowed to remove a vdev later, and you would have to add 8 more disks in case you ever want to extend it. Resilvering is also magnitudes faster with a striped mirror (but that is not as important with SSDs).
I'm quite a newbie with this kind of solution; I've always used single disks.

Is the best option to run ZFS with a mirror setup for the 8 NVMe drives?
 
If I understand things correctly, the relatively high capacity loss in RAID-Z configurations is a direct result of grouping only a small number of sectors together when doing I/O operations. Because of the way ZFS works, that results in excessive padding with sectors that are effectively "lost" to the user.

For "normal" file system access, this is determined by the "recordsize" parameter. It is set to 128kB by default, which generally presents a reasonable compromise between capacity loss and performance cost. If you write lots of really small files, it could lead to excessive I/O amplification. But when tuned properly, that's manageable. And in combination with data compression, it usually hits a good sweet spot. If you know how mostly have large writes and if you want to improve space utilization, you could increase the "recordsize"; and vice versa, if you are desperate for better performance for small writes and don't mind losing capacity, you can decrease the "recordsize" all the way to about 8kB. Any less than that doesn't really make sense. And at 8kB you have huge padding overhead.

Things get more complicated when you use your ZFS drives not to store files, but to carve them up for use by virtual disk devices. This is what happens when running virtual machines instead of containers. Instead of the "recordsize" parameter, you now tune things with the "volblocksize" parameter. If you set it to the same 128kB, you'd get the same capacity utilization as when storing files. But since the virtualized guest operating system uses its own file system implementation on top of the virtual disk device, it is not aware of the underlying allocations in ZFS. And that typically results in really bad I/O amplification and poor performance. All the tuning that you can do for file storage in order to minimize excessive I/O amplification is mostly ineffective for volumes, and this is the reason why PVE defaults to an 8kB "volblocksize" with the expected cost in capacity.

This is one of the reasons why I have very few virtual machines and mostly try to use containers instead. They access ZFS on the file system level, and that can be tuned more easily. But there are good reasons why people want to use virtual machines. So this is a trade-off everyone needs to weigh for themselves.

I think a lot of these concerns would go away once virtual machines can access host storage as a virtualized file system. There is in fact support in PVE to do so. But when I last tried it, Windows kept acting up; I think the Windows driver is still very unreliable at this stage. I am not sure whether the Linux driver is any better, as I haven't tried it myself. And I am not even sure there is a macOS driver.

But if I were a Proxmox engineer, I'd probably prioritize working on this code. It looks to me like something that would get considerably better performance out of existing hardware.

In the meantime, consider favoring containers over virtual machines, increasing the "volblocksize", or setting up LVM in parallel to ZFS and manually balancing where you keep your data.
 
For "normal" file system access, this is determined by the "recordsize" parameter. It is set to 128kB by default, which generally presents a reasonable compromise between capacity loss and performance cost. If you write lots of really small files, it could lead to excessive I/O amplification. But when tuned properly, that's manageable. And in combination with data compression, it usually hits a good sweet spot. If you know how mostly have large writes and if you want to improve space utilization, you could increase the "recordsize"; and vice versa, if you are desperate for better performance for small writes and don't mind losing capacity, you can decrease the "recordsize" all the way to about 8kB. Any less than that doesn't really make sense. And at 8kB you have huge padding overhead.
Recordsize is an "up to" value. Even with the default 128K recordsize, ZFS can write small files as, for example, a 4K, 8K, or 16K sized record. So no, IO amplification of small files shouldn't be that bad. It's more about IO amplification of big files, like when writing an 8MB file with an 8K recordsize, where it then has to write 1000x 8K records (+ 2000x metadata) instead of a single big 8MB record (+ 2x metadata).
And datasets are not affected by padding overhead. That's only a zvol thing when used in combination with raidz1/2/3.
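
Just to put numbers on that record-count difference, here is some back-of-the-envelope arithmetic (record count only, ignoring compression; the metadata overhead scales with it):

```python
# Records needed to store an 8 MiB file at different recordsize values
# (record count only; metadata overhead grows with the number of records).
from math import ceil

file_size = 8 * 1024 * 1024          # 8 MiB

for recordsize in (8 * 1024, 128 * 1024):
    records = ceil(file_size / recordsize)
    print(f"recordsize {recordsize // 1024:>3}K -> {records} records")
# recordsize   8K -> 1024 records
# recordsize 128K -> 64 records
```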

All the tuning that you can do for file storage in order to minimize excessive I/O amplification is mostly ineffective for volumes, and this is the reason why PVE defaults to an 8kB "volblocksize" with the expected cost in capacity.
PVE just doesn't optimize anything and uses the ZFS defaults everywhere. Optimizing things according to your hardware, pool layout and workload is totally up to the admin. So it's not plug-and-play, even if it might look like that to some people.
 
But PVE is write-heavy. Especially when using ZFS for a software raid1, it can quickly kill cheap consumer SSDs. Here, those 120GB TLC consumer SSDs often don't survive a year. So make sure to always have a cold spare at hand.


Then probably a striped mirror with ZFS replication would be a good option.
A striped mirror with ZFS replication is raid10?
 
Yes, a striped mirror is basically raid10. And with replication you sync the pools between the nodes at intervals down to every minute. So with, for example, 2 nodes with 8x 2TB disks each in a raid10, you get something like 6.4TB of usable capacity in total across all nodes: 32TB total raw capacity, minus 16TB because you store a copy of everything on both nodes, minus 8TB because of the local mirrors = 8TB pool capacity. And as ZFS pools become slow when filling up, it's usually recommended not to fill them more than 80%. So 8TB - 20% = 6.4TB of usable capacity.
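
Written out as a quick sketch, using just the numbers from this thread and the usual ~80% fill recommendation:

```python
# Usable capacity for 2 nodes, each with a striped mirror (raid10) of
# 8x 2TB NVMe, replicating everything to the other node.
nodes = 2
disks_per_node = 8
disk_tb = 2

raw_total = nodes * disks_per_node * disk_tb   # 32 TB raw across the cluster
pool_per_node = disks_per_node * disk_tb / 2   # mirrors halve it -> 8 TB per node
unique = pool_per_node                         # the other node only holds a replica
usable = unique * 0.8                          # keep ~20% free so the pool stays fast
print(f"{raw_total} TB raw -> {usable:.1f} TB usable")   # 32 TB raw -> 6.4 TB usable
```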
 
Depends on a lot of factors: workload, type of raid, pool layout, number of disks, size of disks, type of disks, ... Without you explaining more, that's hard to tell.
 
Hi,

I have this machine:
CPU: 72 x Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
HDD: 8 x Samsung PM983 1.92TB NVME
HDD: 2x SSD 460GB for OS
RAM: 376GB

I will make a RAID10 with all the NVMe drives.
 
Then something like 8-16GB for the ARC should be fine. But if you've got enough RAM, a bigger ARC would help with performance.
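
If you then want to cap the ARC at a value in that range: the zfs kernel module takes the limit in bytes via the zfs_arc_max parameter (typically set in /etc/modprobe.d/zfs.conf). A tiny Python helper just to get the byte value right; exact file and tuning are up to your setup:

```python
# Convert a GiB target into the byte value that the zfs_arc_max module
# parameter expects (e.g. "options zfs zfs_arc_max=<bytes>" in
# /etc/modprobe.d/zfs.conf -- adjust to your own setup).
def arc_max_bytes(gib: int) -> int:
    return gib * 1024 ** 3

for gib in (8, 16):
    print(f"{gib} GiB -> zfs_arc_max={arc_max_bytes(gib)}")
# 8 GiB -> zfs_arc_max=8589934592
# 16 GiB -> zfs_arc_max=17179869184
```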
 
