Best Practices for File System Alignment in Virtual Environments

mare19

Jul 20, 2021
Dear members

I would like to set up my server with Proxmox VE. I want to install the Proxmox host on a separate system disk (Samsung 860 500GB) and create a ZFS RAID 10 pool on 4 x 2TB Samsung 860 drives for my VMs.

My question is: how can I get correct drive alignment and the best performance? Sector alignment is apparently very important for optimal I/O performance.

Thank you and best regards
Marko
 
For ZFS, the 'ashift' value you set when you create the pool determines the block/sector alignment. The default value for ZFS pools in PVE is ashift=12, which means a 4k sector size (2^12 = 4096 bytes). This should already be optimal for 99% of SSDs, including your Samsung 860s.
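To make the relationship concrete, here is a minimal Python sketch of the arithmetic only: ashift is the base-2 logarithm of the sector size, and an offset is aligned when it is a multiple of that size. The function names are made up for illustration and are not part of ZFS or PVE.

```python
def ashift_for_sector_size(sector_bytes: int) -> int:
    """Return the ashift whose block size (2**ashift) equals sector_bytes."""
    # A valid sector size is a positive power of two (512, 4096, 8192, ...).
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


def is_aligned(offset_bytes: int, ashift: int) -> bool:
    """True if offset_bytes is a multiple of the 2**ashift block size."""
    return offset_bytes % (1 << ashift) == 0


print(ashift_for_sector_size(4096))   # 12
print(1 << 12)                        # 4096
print(is_aligned(1024 * 1024, 12))    # True: a 1 MiB offset is 4k-aligned
print(is_aligned(512, 12))            # False: 512 is not a multiple of 4096
```

So a pool created with ashift=12 issues I/O in 4096-byte units, and anything starting on a 4096-byte boundary lines up with it.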
 
Okay, perfect. Thank you, Stefan, for this helpful info.

Best regards,
Marko
 
Which SSDs would you prefer? We have always used the Samsung 860 PRO.
My old Proxmox installation had HDDs for data and two 850 Pros in a mirror for two Windows 10 VMs. These SSDs wore out quite quickly: one died after just 10 months due to wearout and had no more spare blocks to compensate.

My new Proxmox host is based on 4x Intel D3-S4510 3.84TB in a ZFS striped mirror, with great performance and still at 0% wearout after almost 1 year. An advantage of these SSDs is power-loss protection, which helps if a power outage happens while there is still data in the SSD cache. These SSDs also behave differently with sync writes from databases or just Windows in general: the GUI is more responsive and less laggy.

In my experience, virtual disks are far more detrimental to SSDs than LXC on ZFS, which uses a file-based approach. Just keep an eye on the drives' wearout; depending on your workload you may see it increase quickly. You can replace the disks one by one and resilver. Enterprise SSDs usually offer less usable capacity because they keep more spare blocks to replace dying flash cells.
 
ZFS, with its copy-on-write design and large amount of metadata, causes a lot of write amplification, and so does virtualization because of the nested filesystems and differing block sizes. These factors don't just add up, they multiply.

I did some benchmarks, and write amplification here ranges from around factor 3 (for async writes) up to factor 81 for 4k sync writes. On average, with real workloads over months, my write amplification is around factor 20. So for each 1 GB I write in a VM or LXC, 20 GB are written to the SSDs.
So a write amplification of factor 20 means the write performance will be only about 1/20th and the SSDs will wear out about 20 times faster.
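As a rough illustration of that arithmetic, here is a small Python sketch. The per-layer factors, the 1200 TB endurance rating and the 50 GB/day guest workload are hypothetical numbers picked for the example, not measurements or real drive specs; only the overall factor of 20 comes from the figures above.

```python
def total_amplification(*layer_factors: float) -> float:
    """Multiply per-layer write amplification factors into one overall factor."""
    total = 1.0
    for factor in layer_factors:
        total *= factor
    return total


def ssd_writes_gb(guest_writes_gb: float, amplification: float) -> float:
    """GB that actually reach the SSDs for a given amount written in the guest."""
    return guest_writes_gb * amplification


def years_until_tbw(rated_tbw_tb: float, guest_gb_per_day: float,
                    amplification: float) -> float:
    """Years until the rated TBW is reached at a constant daily guest workload."""
    ssd_tb_per_day = guest_gb_per_day * amplification / 1000
    return rated_tbw_tb / ssd_tb_per_day / 365


# Hypothetical layers (guest filesystem, virtual disk, ZFS) that multiply to 20.
print(total_amplification(2, 2.5, 4))   # 20.0
print(ssd_writes_gb(1, 20))             # 20

# Hypothetical drive rated for 1200 TBW with 50 GB/day written inside the guests.
print(years_until_tbw(1200, 50, 1))
print(years_until_tbw(1200, 50, 20))
```

The second lifetime is exactly 1/20th of the first, which is the "die 20 times faster" part. The sketch only models endurance, not throughput.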
 
Thanks for the helpful and detailed information. We have now decided to go with the enterprise version from Samsung, because we will also be working with ZFS pools and it makes sense for virtualized machines as well.

Best regards,
Marko
 
Thanks for the hint. Write amplification is an interesting topic to consider.
 
I am surprised the 860 Pros died like that. While the official TBW rating isn't great, the actual endurance is far higher, as tested by various reviewers.

Do you have data on the SMART expected life remaining when they died, or were you not monitoring that?

My 850 Pro has done circa 5 years in a PC (granted, only desktop usage) and then a year or so in a PS4 Pro, which does huge amounts of writes (it constantly records game footage as you play, so it is a server-type load). Even going by the official erase cycles it is at 98% left, with 63 erase cycles.
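For anyone wanting to sanity-check figures like these, a minimal Python sketch of the usual calculation follows. The rated cycle count of 3000 is purely an assumption chosen to be consistent with the numbers above; it is not a published spec for the 850 Pro, and the function name is made up.

```python
def percent_life_left(erase_cycles: int, rated_cycles: int) -> float:
    """Estimate remaining life as the unused share of the rated erase cycles."""
    if rated_cycles <= 0:
        raise ValueError("rated_cycles must be positive")
    # Clamp at 100% used so a drive past its rating reports 0% rather than negative.
    used_fraction = min(erase_cycles / rated_cycles, 1.0)
    return 100.0 * (1.0 - used_fraction)


# 63 average erase cycles against an assumed rating of 3000 cycles.
print(round(percent_life_left(63, 3000)))  # 98
```

A figure like this only reflects flash cell wear, so it would not show other kinds of failure.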

I suspect the load may have caused failure in another way, such as the allocation table getting worn down, which cannot be managed by wear levelling.

It sounds like you are using this in a business project, in which case I would go with an enterprise drive anyway.