Ashift for Intel S3700/S3710?

Dunuin

Distinguished Member
Jun 30, 2020
Germany
Hi,

I bought 2x used 100GB and 3x used 200GB Intel DC S3700/S3710 SSDs for my Proxmox server because of their great write endurance of a couple of petabytes per drive, and I'm not sure what the size of the flash pages is. I wasn't able to find anything in the datasheets nor on Google. Does anyone know what ashift I should use for minimum write amplification? Most people seem to use 4k or 8k as the logical block size for SSDs, so an ashift of 12 or 13. Any suggestions?
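For reference, the drives only report their logical/physical sector sizes to the OS, not the NAND page size, so the best I can do is check something like this (device path and pool name below are just examples, not my actual setup):

Code:
    # What the SSD reports to the OS (this is not the internal flash page size):
    smartctl -i /dev/sdb | grep -i 'sector size'
    cat /sys/block/sdb/queue/physical_block_size
    # Forcing the ashift at pool creation instead of relying on autodetection:
    zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-INTEL_SSD_A /dev/disk/by-id/ata-INTEL_SSD_B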

And I'm not sure what the best setup would be. I want my VMs to be encrypted and everything mirrored so I don't lose any data if a drive fails.
In my last setup I installed Proxmox to two ZFS-mirrored 1TB HDDs, leaving 900GB of free space on the disks, and created a 900GB partition on each. These partitions I used for a mirrored, encrypted ZFS pool to store my VMs.
Now I am planning to create a raidz1 of 4x or 5x 200GB Intel S3700/S3710 SSDs for my VMs, and I'm not sure if it is a good idea to install Proxmox to the same pool.
Are there any benefits to installing Proxmox to a dedicated pair of 100GB SSDs instead of just using the 4 or 5 200GB SSDs for everything?
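Just to make the plan concrete, I'm thinking of roughly something like this (pool/dataset names and device paths are made up, and the Proxmox installer would handle the root pool itself if I install to the same disks):

Code:
    # raidz1 of four SSDs plus an encrypted dataset for the VM disks (names are placeholders)
    zpool create -o ashift=12 vmpool raidz1 \
        /dev/disk/by-id/ata-INTEL_SSD_1 /dev/disk/by-id/ata-INTEL_SSD_2 \
        /dev/disk/by-id/ata-INTEL_SSD_3 /dev/disk/by-id/ata-INTEL_SSD_4
    zfs create -o encryption=aes-256-gcm -o keyformat=passphrase vmpool/encrypted
    # then add vmpool/encrypted as ZFS storage in Proxmox (Datacenter -> Storage)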

Another problem was the write amplification I saw on my last install. I ran a Zabbix server with MySQL in one VM, with ext4 as the guest filesystem on a zvol with "raw" format. The guest OS was writing around 700 KB/s to the virtual disk (so maybe 350 KB/s of real data, because of the journaling of ext4), and ZFS on the host was writing around 4000-5000 KB/s to the HDDs (with cache mode=none/writethrough/writeback on the guest) to store those 700 KB/s from the guest. Only setting the VM to cache mode=unsafe decreased the host's writes to around 2000 KB/s. So I think that extra amplification was caused by using the ZIL because of the sync writes of the MySQL DB.
What cache mode, ashift and filesystem would you use on the guest side, and what ashift and filesystem on the host side, so the write amplification wouldn't be that bad? 350 KB/s of real data causing 5000 KB/s of writes to disk is a bit much, even if the write amplification shouldn't be that high with the new SSDs.
Or is there no alternative if you want your data to be safe? Would xfs/lvm with mdraid or onboard raid and qcow2 be an option?
Do I really need a journaling filesystem in the guest if the host already uses journaling, CoW, has a UPS and SSDs with power-loss protection?
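For reference, this is roughly how I compared the numbers above; the device and pool names are only examples:

Code:
    # inside the guest: what the VM thinks it writes
    iostat -x sda 60
    # on the host: what hits the pool vs. the physical disks
    zpool iostat -v rpool 60
    iostat -x sdb sdc 60
    # on the SSDs themselves: lifetime host writes via SMART (attribute names vary by model)
    smartctl -A /dev/sdb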
 
Hi,

use ashift=12 (4k).
I'm not 100% sure, but if I remember correctly this SSD uses 4M blocks internally.
The controller, however, is optimized to receive 4k blocks and merge them into the internal block structure.
This SSD also has internal power-loss protection, so it can hold data longer in its DDR cache to optimize the write pattern.
ZFS can only do this with async data; sync data is written directly.
So if, for example, you set ashift to 16, which is the maximum (64KB), and then write 4k sync blocks, as is common with a DB, you force the SSD to write 64KB where only 4k is data.
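To put numbers on it (just an illustration, not from a real pool): the smallest write ZFS issues to a vdev is 2^ashift bytes, so a 4k sync write pads out accordingly:

Code:
    echo $((2**12))          # ashift=12 -> 4096 bytes (4k)
    echo $((2**16))          # ashift=16 -> 65536 bytes (64k)
    echo $((2**16 / 2**12))  # one 4k sync write on ashift=16 -> 16x padding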

Are there any benefits to installing Proxmox to a dedicated pair of 100GB SSDs instead of just using the 4 or 5 200GB SSDs for everything?
It is always better to separate the VM data from the OS data, because this way the OS and the guests can't block or disturb each other.
Also, disk management in a recovery scenario is easier.
So I think that extra amplification was caused by using the ZIL because of the sync writes of the MySQL DB.
No, the data is written to host memory and stays there until the system gets a sync command.
ZFS has its own cache management, and if you use any cache mode except none you just copy the data one more time in memory.
So for zvols, always use cache mode "none".
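In Proxmox that is simply the cache option of the virtual disk, for example (VM ID, storage and disk name are placeholders):

Code:
    qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none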

If you have a dedicated ZIL (SLOG), what disk do you use?
The disk must be an enterprise disk like the S3700; a consumer disk can even slow you down.
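If you do add one, it is a small log vdev on such an enterprise disk, for example (pool name and device paths are placeholders; mirroring the log device is optional but safer):

Code:
    zpool add tank log mirror /dev/disk/by-id/ata-INTEL_SLOG_A-part1 /dev/disk/by-id/ata-INTEL_SLOG_B-part1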

Or is there no alternative if you want your data to be safe? Would xfs/lvm with mdraid or onboard raid and qcow2 be an option?
If you use these S3700s you can use ZFS, with ext4 or XFS in the guest, and the VM will fly.
 
use ashift=12 (4k).
I'm not 100% sure, but if I remember correctly this SSD uses 4M blocks internally.
The controller, however, is optimized to receive 4k blocks and merge them into the internal block structure.
This SSD also has internal power-loss protection, so it can hold data longer in its DDR cache to optimize the write pattern.
ZFS can only do this with async data; sync data is written directly.
So if, for example, you set ashift to 16, which is the maximum (64KB), and then write 4k sync blocks, as is common with a DB, you force the SSD to write 64KB where only 4k is data.
Ok, I will use 4k/ashift=12. So sync writes also bypass the internal SSD cache? In my test I told the MySQL DB to only send fsync once a second, and I set the ZFS parameter "zfs_txg_timeout" to 60 so it only writes to the pool once a minute, in bigger chunks, to better use the bigger flash blocks inside the SSD. Does this "zfs_txg_timeout" option also hold back sync writes, or just async writes?
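For reference, this is how I changed it (the value 60 is just what I tested with, and it applies host-wide):

Code:
    # runtime change
    echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
    # persistent across reboots
    echo 'options zfs zfs_txg_timeout=60' >> /etc/modprobe.d/zfs.conf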

It is always better to separate the VM data from the OS data, because this way the OS and the guests can't block and disturb each other.
Also, disk management in a recovery scenario is easier.
I think the IOPS should be high enough that the guests shouldn't disturb the host if they share the same drives. The recovery scenario is also one thing I thought about in case a disk fails. It would be less likely that a boot drive fails if it is a separate mirror of SSDs instead of a raidz1 of 5 disks. But this guide should work in both cases, right?

No, the data is written to host memory and stays there until the system gets a sync command.
ZFS has its own cache management, and if you use any cache mode except none you just copy the data one more time in memory.
So for zvols, always use cache mode "none".
Thanks, I didn't know that. I thought the cache mode of the virtual disk controller would also change how the ZFS caches are handled for the VM.

If you have a dedicated ZIL (SLOG), what disk do you use?
The disk must be an enterprise disk like the S3700; a consumer disk can even slow you down.
I didn't define a dedicated SLOG drive for the pools because there aren't any affordable devices that are faster and comparably robust to the S3710 SSDs. As far as I understand, ZFS will always use a ZIL for sync writes, and if no dedicated drive is given for that, it will just use part of the pool being written to as the ZIL. So sync data should be written twice to the pool: first to the in-pool ZIL and later again to its final location. I thought that might explain why the data written to the pool is doubled if I don't use cache mode "unsafe", which just ignores any flushes.
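One way I could check that theory (pool/dataset names are just examples): watch the per-vdev writes while the DB runs and, purely as a short experiment, compare with sync disabled; not something to leave enabled for real data:

Code:
    zpool iostat -v vmpool 10
    zfs set sync=disabled vmpool/encrypted   # experiment only, unsafe for real data
    # ...compare the write rate...
    zfs set sync=standard vmpool/encrypted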
 
Hi,

Be aware, MySQL writes sync data with a 16K block size. Also be sure that you have a good UPS, because with a 60-second txg timeout, if the power disappears you can lose up to a whole transaction group of cached async data. The same will happen if you have a kernel crash.
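If the DB stays on a zvol, one option (just a sketch, the names are invented) is to match the volblocksize to InnoDB's 16K page size when the virtual disk is created; volblocksize cannot be changed afterwards:

Code:
    zfs create -V 32G -o volblocksize=16k vmpool/vm-100-disk-1
    # in Proxmox you can instead set the "Block size" of the ZFS storage to 16k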

MySQL, and I think any DB, writes a lot of data:
- first the data is written to the intent log / transaction log (a bigger block size is better here)
- then the same data is committed to disk (16K)
- then the data is marked as successful in the intent log again

Because of that you can get better performance using containers, where the block size (recordsize) can be variable (by default the max is 128K). And in this case your CT can use the ARC cache (not the case with a zvol and cache=none), at least for read operations.
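For a container that is just the recordsize of the dataset holding the DB files, for example (dataset name is made up):

Code:
    zfs set recordsize=16k vmpool/subvol-101-disk-0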


Good luck / Bafta!
 
