how to best benchmark SSDs?

Dunuin

Hi,

I want to rebuild my pool and I will most likely use 6x Intel S3710 200GB as a striped mirror for the new VM pool.
Right now I'm using 4x S3710 200GB + 1x S3700 200GB as a raidz and got a write amplification from guest to NAND of around factor 20 and I really would like to lower that.

There are several parameters I can think of that might influence performance and write amplification:

  • ashift of the pool
  • atime of the pool on/off
  • zfs_txg_timeout of the pool
  • volblocksize of the zvol
  • thin vs non thin
  • with and without SLOG
  • with and without L2ARC
  • ZFS native encryption on/off
  • encryption algorithm
  • ZFS compression on/off
  • compression algorithm
  • discard fstab option vs fstrim -a as cron
  • virtio SCSI vs virtio block
  • virtio SCSI blocksize of 512B vs 4K
  • cache mode of virtio
  • ssd emulation on/off
  • io thread on/off
  • blocksize of the guest OSs filesystem (4K for ext4 for example)
  • stride and stripe-width for ext4 inside guest
  • sync vs async writes
  • random 4k vs sequential 1M IO
  • ...
There are just so many combinations that I can't benchmark all.

Right now I have created two identical Debian 10 VMs with only fio and qemu-guest-agent installed. One VM uses "args: -global scsi-hd.physical_block_size=4k" so virtio uses a 4K block size, and the other one uses the default 512B block size. Now I wanted to create different partitions and format them with ext4, but with different values for stride and stripe-width. Then I would back up both VMs so I could import them later after destroying and recreating pools with different ZFS/virtio configs.

For the benchmark my idea was to collect SMART attributes on the host (my SSDs track real NAND writes in 32MiB units every second), run some fio tests inside the VM that write a fixed amount of data, and after all tests have finished collect the SMART attributes again, so I can see how much was actually written to the NAND and calculate the total write amplification.
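A minimal sketch of what such a host-side collection script could look like (the attribute names "Host_Writes_32MiB"/"NAND_Writes_32MiB" and the device paths are assumptions and may need adjusting for other drives):

Code:
#!/bin/bash
# Sum host and NAND writes (in MiB) over all SSDs of the pool.
# The raw SMART values of these Intel DC SSDs are counted in 32MiB units.
DISKS="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf"   # placeholders

host_total=0
nand_total=0
for d in $DISKS; do
    host=$(smartctl -A "$d" | awk '/Host_Writes_32MiB/ {print $10}')
    nand=$(smartctl -A "$d" | awk '/NAND_Writes_32MiB/ {print $10}')
    host_total=$(( host_total + host * 32 ))
    nand_total=$(( nand_total + nand * 32 ))
done
echo "Host writes: ${host_total} MiB"
echo "NAND writes: ${nand_total} MiB"

# Run once before and once after the benchmark; with the deltas:
#   WA(guest->host) = host_delta / guest_writes
#   WA(inside SSD)  = nand_delta / host_delta
#   WA(total)       = nand_delta / guest_writes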
Fio tests that sound useful would be:
  • 4K random sync writes
  • 4K random async writes
  • 1M sequential sync writes
  • 1M sequential async writes
  • 4K async random reads
  • 1M async sequential reads
So what do you think would be good fio command parameters to test this?

And what ZFS/virtio settings should I test with a 6x SSD striped mirror that would sound promising?

Anything I didn't think of that might make the benchmarks incomparable?

I would think these sound good:

1.) 4K volblocksize + ashift=12
ashift=12
atime=off
zfs_txg_timeout=default
volblocksize=4K
thin
without SLOG
without L2ARC
native encryption=on
encryption algorithm=aes-256-ccm
compression=lz4
discard using fstab
virtio SCSI
virtio SCSI blocksize=512B + 4K
cachemode=none
ssd emulation=on
io thread=on
ext4 blocksize=4K
stride and stripe-width: -b 4096 -E stride=1 -E stripe-width=3

2.) 16K volblocksize + ashift=12 + aes-256-ccm
ashift=12

atime=off
zfs_txg_timeout=default
volblocksize=16K
thin
without SLOG
without L2ARC
native encryption=on
encryption algorithm=aes-256-ccm
compression=lz4
discard using fstab
virtio SCSI
virtio SCSI blocksize=512B + 4K
cachemode=none
ssd emulation=on
io thread=on
ext4 blocksize=4K
stride and stripe-width: -b 4096 -E stride=4 -E stripe-width=12

3.) 16K volblocksize + ashift=12 + aes-256-gcm
ashift=12

atime=off
zfs_txg_timeout=default
volblocksize=16K
thin
without SLOG
without L2ARC
native encryption=on
encryption algorithm=aes-256-gcm
compression=lz4
discard using fstab
virtio SCSI
virtio SCSI blocksize=512B + 4K
cachemode=none
ssd emulation=on
io thread=on
ext4 blocksize=4K
stride and stripe-width: -b 4096 -E stride=4 -E stripe-width=12

4.) 16K volblocksize + ashift=12 + aes-128-ccm
ashift=12

atime=off
zfs_txg_timeout=default
volblocksize=16K
thin
without SLOG
without L2ARC
native encryption=on
encryption algorithm=aes-128-ccm
compression=lz4
discard using fstab
virtio SCSI
virtio SCSI blocksize=512B + 4K
cachemode=none
ssd emulation=on
io thread=on
ext4 blocksize=4K
stride and stripe-width: -b 4096 -E stride=4 -E stripe-width=12

5.) 16K volblocksize + ashift=13
ashift=13

atime=off
zfs_txg_timeout=default
volblocksize=16K
thin
without SLOG
without L2ARC
native encryption=on
encryption algorithm=aes-256-ccm
compression=lz4
discard using fstab
virtio SCSI
virtio SCSI blocksize=512B + 4K
cachemode=none
ssd emulation=on
io thread=on
ext4 blocksize=4K
stride and stripe-width: -b 4096 -E stride=4 -E stripe-width=12
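For reference, a rough sketch of how variant 1 could be set up on the host (pool name, storage ID, VMID and disk names are placeholders, not my actual commands):

Code:
# striped mirror of 6 SSDs, ashift=12, atime off, lz4, native encryption
zpool create -o ashift=12 \
  -O atime=off -O compression=lz4 \
  -O encryption=aes-256-ccm -O keyformat=passphrase \
  tank mirror sda sdb mirror sdc sdd mirror sde sdf

# volblocksize and thin provisioning are set on the Proxmox storage definition
pvesm add zfspool vmpool --pool tank --blocksize 4k --sparse 1

# per-VM disk options: io thread, ssd emulation, no cache, discard
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 vmpool:vm-100-disk-0,iothread=1,ssd=1,cache=none,discard=on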


Does someone know how to calculate the stride and stripe-width for VMs on top of ZFS? The manuals always refer to physical disks in a real SW/HW raid with a defined stripe size that the OS has direct access to. Here I only use ZFS, which is not a raid with a fixed stripe size, and there is also virtio between the host's ZFS and the guest OS.
Does anyone know if it is possible to calculate what stride and stripe-width to use? I've got a write amplification from guest to host of factor 7 and that is quite high. So I hoped I might optimize how the guest's ext4 writes data to virtio, so virtio isn't amplifying that much because of all the mixed blocksizes in that chain.
 
In vm for ext4:
volblocksize <= 4k x stripe-width
16k <= 4k x 4
32k <= 4k x 8
On hypervisor:
volblocksize >= 4k(for ashift=12) x number of stripes in pool
Etc...
 
So with 3 striped mirrors and an ashift of 12 I would need to try a 12K volblocksize, but because 12K isn't 2^x I would round up to 16K. And if the volblocksize is 16K I should use a stride of 1 and a stripe-width of 4 for ext4 in the guest?

Then block sizes would look like:
4K LBA (6x SSD) <- 4K ashift (pool) <- 16K volblocksize (zvol) <- 4K (virtio SCSI) <- 4K LBA (virtual disk) <- 4K blocksize (ext4)
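If that guess is right, the in-guest formatting would be something like this (the partition name is a placeholder; stripe-width = volblocksize / ext4 blocksize = 16K / 4K = 4):

Code:
mkfs.ext4 -b 4096 -E stride=1,stripe-width=4 /dev/sdb1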

Does someone know if it is suboptimal to use a striped mirror of 6 SSDs? I could also create a striped mirror of 8 disks (I've got two spare S3700 200GB that are a bit slower than the S3710). But I actually don't need that much space, so the two additional disks would be wasted.
A 4-disk striped mirror + a 2-disk mirror as two VM pools would also be an option, because no VM needs the full space. But I would think it isn't good to migrate VMs between the two pools, because a striped mirror needs a different volblocksize and stripe-width than a single mirror...
 
I will start tests today.

I shut down all VMs and then restore a test VM from my NAS, so this is the only VM running and always the same one. Then I boot the test VM and wait a minute.

Then I run a script on the host that reads the SMART attributes "host_writes_32MiB" and "NAND_writes_32MiB" of all SSDs, sums them up and outputs them as MiB, so I know how much was written to the SSDs and to the NAND before the benchmark.

Then I ssh into the TestVM and run my benchmark script, which runs some fio tests, logs the amount of data written using iostat before and after all the tests, and writes everything to a log. After the script has finished I download the log, run the script on the host again to see how much the SSDs have written by now, and then destroy the TestVM.
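The guest write difference then comes out of the two iostat snapshots in the log, roughly like this (device name and column position depend on the guest layout and sysstat version):

Code:
# cumulative kB written to the guest disk, first and last snapshot
grep '^sda ' /root/benchmark.log | awk '{print $6}'
# guest write difference = last value - first value (divide by 1024 for MiB)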

Then I do the same with the second test VM that is identical but uses a virtio blocksize of 4K instead of 512B to see if that makes a difference.

Then I will destroy the pool and recreate it, or just change some ZFS/virtio options if a destroy isn't required, and start all over again.

My test script looks like this:

benchmark.sh
Code:
#!/bin/bash

LOGFILE="/root/benchmark.log" #filename of the logfile

iostat | tee -a "${LOGFILE}"

# sync randwrite = (writes 1G)
fio --filename=/tmp/test.file --name=sync_randwrite --rw=randwrite --bs=4k --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=1G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

# sync randread (writes 1G)
fio --filename=/tmp/test.file --name=sync_randread --rw=randread --bs=4k --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=1G --loops=10 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

# seq sync seqwrite (writes 5G)
fio --filename=/tmp/test.file --name=sync_seqwrite --rw=write --bs=4M --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=5G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

# seq sync seqread (writes 1G)
fio --filename=/tmp/test.file --name=sync_seqread --rw=read --bs=4M --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=1G --loops=10 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async uncached randwrite (writes 4G)
fio --filename=/tmp/test.file --name=async_uncached_randwrite --rw=randwrite --bs=4k --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async cached randwrite (writes 4G)
fio --filename=/tmp/test.file --name=async_cached_randwrite --rw=randwrite --bs=4k --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async uncached randread (writes 4G)
fio --filename=/tmp/test.file --name=async_uncached_randread --rw=randread --bs=4k --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=10 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async cached randread (writes 4G)
fio --filename=/tmp/test.file --name=async_cached_randread --rw=randread --bs=4k --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=10 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async uncached seqwrite (writes 8G)
fio --filename=/tmp/test.file --name=async_uncached_seqwrite --rw=write --bs=4M --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=2G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async cached seqwrite (writes 8G)
fio --filename=/tmp/test.file --name=async_cached_seqwrite --rw=write --bs=4M --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=2G --loops=1 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async uncached seqread (writes 4G)
fio --filename=/tmp/test.file --name=async_uncached_seqread --rw=read --bs=4M --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=50 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

#async cached seqread (writes 4G)
fio --filename=/tmp/test.file --name=async_cached_seqread --rw=read --bs=4M --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=50 --group_reporting | tee -a "${LOGFILE}"
rm /tmp/test.file

fstrim -a

sleep 60

iostat | tee -a "${LOGFILE}"
 
First test is a raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=512B.

Host Writes Difference on host: 84,720 MiB
NAND Writes Difference on host: 94,800 MiB
Guest Read Difference: 312,131 MiB
Guest Write Difference: 29,400 MiB

Write amplification from guest to host: 2.88x
Write amplification inside SSD: 1.12x
Total write amplification: 3.22x
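For reference, these factors come straight from the deltas above:
84,720 / 29,400 ≈ 2.88 (guest -> host)
94,800 / 84,720 ≈ 1.12 (inside SSD)
94,800 / 29,400 ≈ 3.22 (total, guest -> NAND)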

sync_randwrite: 2.06 MiB/s
sync_randread: 87.6 MiB/s
sync_seqwrite: 264 MiB/s
sync_seqread: 2,941 MiB/s
async_uncached_randwrite: 221 MiB/s
async_cached_randwrite: 1,413 MiB/s
async_uncached_randread: 466 MiB/s
async_cached_randread: 760 MiB/s
async_uncached_seqwrite: 6,693 MiB/s
async_cached_seqwrite: 2,325 MiB/s
async_uncached_seqread: 9,803 MiB/s
async_cached_seqread: 3,518 MiB/s


Does anyone know why async_uncached_seqwrite and async_uncached_seqread are faster than async_cached_seqwrite and async_cached_seqread? The only difference is "--direct=0|1". Why is it 3 times faster with "--direct=1"? I thought "--direct=1" should disable caching in the guest, and caching should make stuff faster...
 
Second test is basically the same but with a virtio blocksize of 4K (args: -global scsi-hd.physical_block_size=4k used for that) instead of the default 512B.

Host is raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.

Host Writes Difference on host: 87,360 MiB
NAND Writes Difference on host: 96,120 MiB

Guest Read Difference: 311,567 MiB
Guest Write Difference: 30,969 MiB

Write amplification from guest to host: 2.82x
Write amplification inside SSD: 1.10x
Total write amplification: 3.10x

sync_randwrite: 2.07 MiB/s
sync_randread: 89.5 MiB/s
sync_seqwrite: 272 MiB/s
sync_seqread: 2,475 MiB/s
async_uncached_randwrite: 223 MiB/s
async_cached_randwrite: 1,646 MiB/s
async_uncached_randread: 470 MiB/s
async_cached_randread: 780 MiB/s
async_uncached_seqwrite: 7,269 MiB/s
async_cached_seqwrite: 2,111 MiB/s
async_uncached_seqread: 9,866 MiB/s
async_cached_seqread: 3,548 MiB/s

So it looks like the virtio blocksize doesn't make a big difference. With 4K the write amplification is a little lower and the performance a little better, but not really worth mentioning.
 
The third test is the same as the previous one, but I ran the benchmark on an ext4 partition that was formatted with a stride of 8 (so 32K) and a stripe-width of 32 (so 128K). And I forgot to mount it with discard,noatime,nodiratime, so this is missing.


Host is raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, ext4 parameters: stride=8,stripe-width=32. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.


Host Writes Difference on host: 84,360 MiB
NAND Writes Difference on host: 98,320 MiB

Guest Read Difference: 311,223 MiB
Guest Write Difference: 29,362 MiB

Write amplification from guest to host: 2.87x
Write amplification inside SSD: 1.17x
Total write amplification: 3.35x

sync_randwrite: 2.11 MiB/s
sync_randread: 87.5 MiB/s
sync_seqwrite: 270 MiB/s
sync_seqread: 2715 MiB/s
async_uncached_randwrite: 223 MiB/s
async_cached_randwrite: 1615 MiB/s
async_uncached_randread: 463 MiB/s
async_cached_randread: 793 MiB/s
async_uncached_seqwrite: 6440 MiB/s
async_cached_seqwrite: 2451 MiB/s
async_uncached_seqread: 9645 MiB/s
async_cached_seqread: 3512 MiB/s

I don't know if I used the correct stride and stripe-width, but it looks like there isn't much of a difference. With my custom stride and stripe-width the write amplification gets worse but async_uncached_seqwrite is a bit better. The differences aren't really worth mentioning though.
 
For the fourth test I did the same as before, but this time I didn't forget to mount the partition using discard,noatime,nodiratime, and I used a stripe-width of 8 (32K) instead of 32 (128K).

Host is raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, ext4 parameters: stride=8,stripe-width=8. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.

Host Writes Difference on host: 89,440 MiB
NAND Writes Difference on host: 103,920 MiB

Guest Read Difference: 312,500 MiB
Guest Write Difference: 31,262 MiB

Write amplification from guest to host: 2.86x
Write amplification inside SSD: 1.16x
Total write amplification: 3.32x

sync_randwrite: 2.08 MiB/s
sync_randread: 88.9 MiB/s
sync_seqwrite: 262 MiB/s
sync_seqread: 3022 MiB/s
async_uncached_randwrite: 220 MiB/s
async_cached_randwrite: 1646 MiB/s
async_uncached_randread: 462 MiB/s
async_cached_randread: 744 MiB/s
async_uncached_seqwrite: 6533 MiB/s
async_cached_seqwrite: 1722 MiB/s
async_uncached_seqread: 9772 MiB/s
async_cached_seqread: 3498 MiB/s

So again no big change. The write amplification is worse compared to not setting stride/stripe-width, and some results are a bit worse, some a bit better. But most of them are async, so the differences might come from RAM utilization or something similar.
So it looks like it is not really worth tuning the stride/stripe-width.
 
Highly interesting! Will you repeat the tests on a mirrored setup?
Yes, I'm first testing different settings on my existing pool to see which settings make a difference at all and why my write amplification is that high.
Because if I just test with the single test VM running on the pool, it looks like the total write amplification is only about factor 3. But if I run all my production VMs at the same time I see a much higher write amplification. It's more like factor 7 from guest to host and factor 3 inside the SSD, so a total amplification of around 21x. So the question is really what causes this.

I collect and calculate the SMART attributes the same way using a Zabbix template I made, and if I look at the numbers I have collected over the past months I clearly see an internal SSD write amplification that varies between factor 2.5 and 3.
The write amplification from guest to host of factor 7 I get by running iostat inside the guest and on the host. Then I sum up all writes inside the guest to the virtual disks, and on the host I sum up the writes to the /dev/zdX devices that correspond to that VM.
Worst are the Zabbix and Graylog VMs. Together they write several MB/s and they just collect logs/metrics of about 30 hosts/VMs. So it looks like DBs like MySQL/MongoDB/ElasticSearch are the main problem.
And I already optimized the MySQL caching to reduce the writes.
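For anyone who wants to reproduce that mapping: the zd device that backs a given zvol can be looked up via the /dev/zvol symlinks (pool/dataset names are placeholders):

Code:
# the symlink target shows which /dev/zdX device backs this VM disk
ls -l /dev/zvol/rpool/data/vm-100-disk-0
# then read the per-device counters, e.g. (zd48 is just an example):
iostat -m /dev/zd48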

Maybe I should additionally run iostat on the host too for the next tests.
 
Looks like iostat isn't monitoring zd devices anymore. Does someone know why that is? If I run iostat -m -p ALL I see all the zdX devices, but they all show zero everywhere. In the past that worked fine.

Code:
Linux 5.4.124-1-pve (Hypervisor)        07/29/2021      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.80    0.00    3.31    0.17    0.00   93.72

Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             0.00         0.00         0.00          0          0
loop3             0.00         0.00         0.00          0          0
loop4             0.00         0.00         0.00          0          0
loop5             0.00         0.00         0.00          0          0
loop6             0.00         0.00         0.00          0          0
loop7             0.00         0.00         0.00          0          0
sda               3.92         0.10         0.03       2471        844
sda1              0.02         0.00         0.00        118          0
sda2              0.02         0.00         0.00        121          0
sda3              0.03         0.01         0.00        152          0
sda4              3.41         0.08         0.03       1988        844
md0               0.03         0.01         0.00        123          0
md1               7.06         0.13         0.03       3186        842
sdb               7.11         0.08         0.03       2026        844
sdb1              0.02         0.00         0.00        118          0
sdb2              0.03         0.00         0.00        122          0
sdb3              0.04         0.01         0.00        207          0
sdb4              6.58         0.06         0.03       1487        844
sdc               0.14         0.01         0.00        298          0
sdc1              0.13         0.01         0.00        236          0
sdd               0.14         0.01         0.00        294          0
sdd1              0.12         0.01         0.00        233          0
sdh             349.53         2.60         3.29      64103      81129
sdh1            247.52         2.59         3.29      63863      81129
sdh9              0.03         0.01         0.00        179          0
sdf             349.89         2.60         3.29      64037      81128
sdf1            247.88         2.59         3.29      63797      81128
sdf9              0.03         0.01         0.00        179          0
sde             346.72         2.51         3.35      61761      82540
sde1            244.71         2.50         3.35      61520      82540
sde9              0.03         0.01         0.00        179          0
sdi               0.05         0.01         0.00        241          0
sdi1              0.03         0.01         0.00        150          0
sdg             347.24         2.60         3.29      64094      81121
sdg1            245.24         2.59         3.29      63854      81121
sdg9              0.03         0.01         0.00        179          0
dm-0              7.04         0.12         0.03       3060        842
dm-1              6.98         0.11         0.04       2760        948
dm-2              0.03         0.00         0.00        121          0
sdj             344.83         2.58         3.29      63512      81133
sdj1            242.86         2.58         3.29      63507      81133
sdj9              0.00         0.00         0.00          3          0
zd0               0.00         0.00         0.00          0          0
zd0p1             0.00         0.00         0.00          0          0
zd0p2             0.00         0.00         0.00          0          0
zd0p3             0.00         0.00         0.00          0          0
zd0p4             0.00         0.00         0.00          0          0
zd16              0.00         0.00         0.00          0          0
zd16p1            0.00         0.00         0.00          0          0
zd48              0.00         0.00         0.00          0          0
zd48p1            0.00         0.00         0.00          0          0
zd48p5            0.00         0.00         0.00          0          0
zd64              0.00         0.00         0.00          0          0
zd64p1            0.00         0.00         0.00          0          0
zd64p5            0.00         0.00         0.00          0          0
zd80              0.00         0.00         0.00          0          0
zd80p1            0.00         0.00         0.00          0          0
zd80p2            0.00         0.00         0.00          0          0
zd80p3            0.00         0.00         0.00          0          0
zd96              0.00         0.00         0.00          0          0
zd96p1            0.00         0.00         0.00          0          0
zd112             0.00         0.00         0.00          0          0
zd112p1           0.00         0.00         0.00          0          0
zd112p2           0.00         0.00         0.00          0          0
zd112p3           0.00         0.00         0.00          0          0
zd128             0.00         0.00         0.00          0          0
zd128p1           0.00         0.00         0.00          0          0
zd144             0.00         0.00         0.00          0          0
zd144p1           0.00         0.00         0.00          0          0
zd160             0.00         0.00         0.00          0          0
zd160p1           0.00         0.00         0.00          0          0
zd160p2           0.00         0.00         0.00          0          0
zd176             0.00         0.00         0.00          0          0
zd176p1           0.00         0.00         0.00          0          0
zd192             0.00         0.00         0.00          0          0
zd192p1           0.00         0.00         0.00          0          0
zd192p2           0.00         0.00         0.00          0          0
zd192p3           0.00         0.00         0.00          0          0
zd208             0.00         0.00         0.00          0          0
zd208p1           0.00         0.00         0.00          0          0
zd208p2           0.00         0.00         0.00          0          0
zd208p3           0.00         0.00         0.00          0          0
zd224             0.00         0.00         0.00          0          0
zd224p1           0.00         0.00         0.00          0          0
zd240             0.00         0.00         0.00          0          0
zd240p1           0.00         0.00         0.00          0          0
zd240p2           0.00         0.00         0.00          0          0
zd240p3           0.00         0.00         0.00          0          0
zd272             0.00         0.00         0.00          0          0
zd272p1           0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd288p1           0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd304p1           0.00         0.00         0.00          0          0
zd320             0.00         0.00         0.00          0          0
zd320p1           0.00         0.00         0.00          0          0
zd320p2           0.00         0.00         0.00          0          0
zd320p3           0.00         0.00         0.00          0          0
zd336             0.00         0.00         0.00          0          0
zd336p1           0.00         0.00         0.00          0          0
zd352             0.00         0.00         0.00          0          0
zd352p1           0.00         0.00         0.00          0          0
zd368             0.00         0.00         0.00          0          0
zd384             0.00         0.00         0.00          0          0
zd384p1           0.00         0.00         0.00          0          0
zd400             0.00         0.00         0.00          0          0
zd400p1           0.00         0.00         0.00          0          0
zd400p2           0.00         0.00         0.00          0          0
zd400p3           0.00         0.00         0.00          0          0
zd416             0.00         0.00         0.00          0          0
zd416p1           0.00         0.00         0.00          0          0
zd416p2           0.00         0.00         0.00          0          0
zd416p3           0.00         0.00         0.00          0          0
zd432             0.00         0.00         0.00          0          0
zd432p1           0.00         0.00         0.00          0          0
zd448             0.00         0.00         0.00          0          0
zd448p1           0.00         0.00         0.00          0          0
zd448p2           0.00         0.00         0.00          0          0
zd448p3           0.00         0.00         0.00          0          0
zd464             0.00         0.00         0.00          0          0
zd464p1           0.00         0.00         0.00          0          0
zd480             0.00         0.00         0.00          0          0
zd480p1           0.00         0.00         0.00          0          0
zd496             0.00         0.00         0.00          0          0
zd496p1           0.00         0.00         0.00          0          0
zd512             0.00         0.00         0.00          0          0
zd512p1           0.00         0.00         0.00          0          0
zd528             0.00         0.00         0.00          0          0
zd528p1           0.00         0.00         0.00          0          0
zd528p2           0.00         0.00         0.00          0          0
zd528p3           0.00         0.00         0.00          0          0
zd544             0.00         0.00         0.00          0          0
zd544p1           0.00         0.00         0.00          0          0
zd560             0.00         0.00         0.00          0          0
zd560p1           0.00         0.00         0.00          0          0
zd576             0.00         0.00         0.00          0          0
zd576p1           0.00         0.00         0.00          0          0
zd592             0.00         0.00         0.00          0          0
zd624             0.00         0.00         0.00          0          0
zd624p1           0.00         0.00         0.00          0          0
zd640             0.00         0.00         0.00          0          0
zd640p1           0.00         0.00         0.00          0          0
zd640p2           0.00         0.00         0.00          0          0
zd640p3           0.00         0.00         0.00          0          0
zd640p4           0.00         0.00         0.00          0          0
zd656             0.00         0.00         0.00          0          0
zd656p1           0.00         0.00         0.00          0          0
zd656p2           0.00         0.00         0.00          0          0
zd656p3           0.00         0.00         0.00          0          0
zd672             0.00         0.00         0.00          0          0
zd672p1           0.00         0.00         0.00          0          0
zd704             0.00         0.00         0.00          0          0
zd704p1           0.00         0.00         0.00          0          0
zd704p2           0.00         0.00         0.00          0          0
zd704p3           0.00         0.00         0.00          0          0
zd720             0.00         0.00         0.00          0          0
zd720p1           0.00         0.00         0.00          0          0
zd736             0.00         0.00         0.00          0          0
zd736p1           0.00         0.00         0.00          0          0
zd752             0.00         0.00         0.00          0          0
zd752p1           0.00         0.00         0.00          0          0
zd768             0.00         0.00         0.00          0          0
zd768p1           0.00         0.00         0.00          0          0
zd768p2           0.00         0.00         0.00          0          0
zd768p3           0.00         0.00         0.00          0          0
zd784             0.00         0.00         0.00          0          0
zd784p1           0.00         0.00         0.00          0          0
zd784p2           0.00         0.00         0.00          0          0
zd800             0.00         0.00         0.00          0          0
zd800p1           0.00         0.00         0.00          0          0
zd816             0.00         0.00         0.00          0          0
zd816p1           0.00         0.00         0.00          0          0
zd32              0.00         0.00         0.00          0          0
zd608             0.00         0.00         0.00          0          0
zd256             0.00         0.00         0.00          0          0
zd688             0.00         0.00         0.00          0          0
zd832             0.00         0.00         0.00          0          0
 
Do not use the stride parameter when formatting ext4 in the guest. It is not a real array from the guest's point of view.
Use for 32k:
mkfs.ext4 -b 4k -E stripe-width=8
 
Thanks, I will try that next.

I have now shut down all VMs except for my Zabbix VM, which was only monitoring 8 hosts, so really not that many metrics to capture.
I ran iostat on the host, logged the SMART attributes on the host, and ran iostat inside the guest.
After 1 hour I did the same again, subtracted the first from the second measurement, and this is the result:

Zabbix VM running for 60 minutes monitoring 8 clients:

Host Writes Difference (SMART): 10,760 MiB
NAND Writes Difference (SMART): 30,480 MiB

Guest Read Difference (iostat): 38 MiB
Guest Write Difference (iostat): 1,232 MiB

Host Read Difference (iostat): 57 MiB
Host Write Difference (iostat): 10,756 MiB

Write amplification from guest to host: 8.73x
Write amplification inside SSD: 2.83x
Total write amplification: 24.74x

So in this 1 hour the guest wrote 1,232 MiB, the host's VM pool wrote 10,756 MiB (which nearly perfectly matches the values I got from SMART) and 30,480 MiB were written to the SSDs' NAND chips. So I indeed got a super high write amplification, as described before.

Does someone know why that VM gets such a bad write amplification of factor 24.74 while the other test VMs only got a write amplification of around 3?

Host is a raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=512B.
So the settings are basically the same as in my first test, just without the discard mount option because I run a daily fstrim -a cron.
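The daily trim is just a root cron entry along these lines (the time is arbitrary):

Code:
# /etc/cron.d/fstrim (sketch)
0 3 * * * root /sbin/fstrim -a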

The Zabbix VM is running PHP7-FPM, MariaDB and nginx.
 
The fifth test is with the recommended "mkfs.ext4 -b 4k -E stripe-width=8". So it's the same as test no. 4, but without any stride set and with an explicit 4K blocksize instead (though I think 4K is the default and should have been used before as well?).

Host is raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, ext4 parameters: b=4K,stripe-width=8. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.

Host Writes Difference on host: 82,440 MiB
NAND Writes Difference on host: 91,800 MiB

Guest Read Difference: 311,982 MiB
Guest Write Difference: 29,391 MiB

Write amplification from guest to host: 2.80x
Write amplification inside SSD: 1.11x
Total write amplification: 3.12x

sync_randwrite: 2.10 MiB/s
sync_randread: 82 MiB/s
sync_seqwrite: 266 MiB/s
sync_seqread: 3,303 MiB/s
async_uncached_randwrite: 223 MiB/s
async_cached_randwrite: 1,631 MiB/s
async_uncached_randread: 455 MiB/s
async_cached_randread: 757 MiB/s
async_uncached_seqwrite: 6,522 MiB/s
async_cached_seqwrite: 2,438 MiB/s
async_uncached_seqread: 9,470 MiB/s
async_cached_seqread: 3,528 MiB/s

Doesn't look like the stripe-width made a big difference.

Edit:
I made a spreadsheet with diagrams because it was getting hard to compare numbers:
https://docs.google.com/spreadsheet...nashpbwqszHOSNTo2KKY55SDP4/edit#gid=462945904
Test2: default
Test1: - virtio blocksize 4K; + virtio blocksize 512B
Test3: - discard,noatime,nodiratime; + ext4 stride=8,stripe-width=32
Test4: + ext4 stride=8,stripe-width=8
Test5: + ext4 stripe-width=8
Test6: - compression=lz4; + compression=none
Test7: - encryption=aes-256-gcm; + encryption=none


I also think I should have added some pauses between the individual fio tests so the caches get enough time to be written to disk. But I can't change that now without making the results no longer comparable.
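Such a pause would probably just be a flush plus a wait between the fio calls, roughly:

Code:
# between the individual fio tests (sketch)
sync          # flush the guest page cache
sleep 60      # give ZFS time to commit outstanding TXGs before reading the counters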
 
6th test is basically the same as the 2nd one just without lz4 compression:

Host is a raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=none, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.


Host Writes Difference on host: 91,680 MiB
NAND Writes Difference on host: 100,880 MiB

Guest Read Difference: 310,760 MiB
Guest Write Difference: 29,398 MiB

Write amplification from guest to host: 3.11x
Write amplification inside SSD: 1.10x
Total write amplification: 3.43x

sync_randwrite: 2.08 MiB/s
sync_randread: 86 MiB/s
sync_seqwrite: 272 MiB/s
sync_seqread: 2,378 MiB/s
async_uncached_randwrite: 230 MiB/s
async_cached_randwrite: 1,390 MiB/s
async_uncached_randread: 465 MiB/s
async_cached_randread: 810 MiB/s
async_uncached_seqwrite: 6,781 MiB/s
async_cached_seqwrite: 2,283 MiB/s
async_uncached_seqread: 9,571 MiB/s
async_cached_seqread: 3,498 MiB/s
 
Test Nr. 7 is the same as test Nr. 2 but without ZFS native encryption:

Host is a raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=none, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.

Host Writes Difference on host: 82,560 MiB
NAND Writes Difference on host: 93,160 MiB

Guest Read Difference: 312,242 MiB
Guest Write Difference: 29,405 MiB

Write amplification from guest to host: 2.81x
Write amplification inside SSD: 1.13x
Total write amplification: 3.17x

sync_randwrite: 2.31 MiB/s
sync_randread: 89.1 MiB/s
sync_seqwrite: 332 MiB/s
sync_seqread: 3,989 MiB/s
async_uncached_randwrite: 239 MiB/s
async_cached_randwrite: 1,515 MiB/s
async_uncached_randread: 470 MiB/s
async_cached_randread: 761 MiB/s
async_uncached_seqwrite: 6,748 MiB/s
async_cached_seqwrite: 2,429 MiB/s
async_uncached_seqread: 10,650 MiB/s
async_cached_seqread: 3,571 MiB/s

Disabling encryption was the biggest performance boost so far.
 
Maybe it isn't the best idea to run all 12 fio tests at once, because most writes should be async. So if I just look at the data written before and after all 12 tests have finished, the write amplification will mostly reflect the async writes.

So I edited the fio tests so that they all write the same amount of data (12G per write test + 1G io file) and ran them individually to see what the write amplification of each individual test would be.

VM and pool are the same as in test 2:
Host is a raidz1 of 4x S3710 200GB + 1x S3700 200GB. atime=off, ashift=12, thin, compression=lz4, encryption=aes-256-gcm, volblocksize=32K.
Guest is a Debian 10 with ext4, discard, noatime, nodiratime, default ext4 parameters. cachemode=none, io thread=yes, discard=yes, ssd emulation=yes, virtio SCSI, SCSI, virtio blocksize=4K.

fio commands are:
Code:
#Test 1 - sync_randwrite:
fio --filename=/tmp/test.file --name=sync_randwrite --rw=randwrite --bs=4k --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=1G --loops=12 --group_reporting
#Test 2 - sync_seqwrite:
fio --filename=/tmp/test.file --name=sync_seqwrite --rw=write --bs=4M --direct=1 --sync=1 --numjobs=1 --ioengine=psync --iodepth=1 --refill_buffers --size=1G --loops=12 --group_reporting
#Test 3 - async_uncached_randwrite:
fio --filename=/tmp/test.file --name=async_uncached_randwrite --rw=randwrite --bs=4k --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=3 --group_reporting
#Test 4 - async_uncached_seqwrite:
fio --filename=/tmp/test.file --name=async_uncached_seqwrite --rw=write --bs=4M --direct=1 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=3 --group_reporting
#Test 5 - async_cached_randwrite:
fio --filename=/tmp/test.file --name=async_cached_randwrite --rw=randwrite --bs=4k --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=3 --group_reporting
#Test 6 - async_cached_seqwrite:
fio --filename=/tmp/test.file --name=async_cached_seqwrite --rw=write --bs=4M --direct=0 --sync=0 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --size=1G --loops=3 --group_reporting

I ran the tests in reverse order (test 6 to test 1) on the same VM without rebooting/restoring, with 4 to 6 minute pauses between the tests.

Here are the results:
Test                     | Guest Writes (MiB) | Host Writes (MiB) | NAND Writes (MiB) | W.A. guest->host | W.A. inside SSD | W.A. total
Test 1 (sync rand)       | 25,545 | 312,320 | 349,760 | 12.23 | 1.12 | 13.69
Test 2 (sync seq)        | 12,325 | 31,800 | 35,520 | 2.58 | 1.12 | 2.88
Test 3 (async rand)      | 12,302 | 42,760 | 46,920 | 3.48 | 1.10 | 3.81
Test 4 (async seq)       | 12,288 | 5,720 | 7,680 | 0.46 | 0.63 | 1.34
Test 5 (async buff rand) | 2,050 | 2,920 | 4,480 | 1.42 | 1.53 | 2.19
Test 6 (async buff seq)  | 5,446 | 8,280 | 10,160 | 1.52 | 1.23 | 1.87
Sum                      | 69,956 | 403,800 | 454,520 | 5.77 | 1.13 | 6.50
Avg                      | 11,659 | 67,300 | 75,753 | | |

There must be something wrong:
1.) Tests 5 and 6 should also show 12 GB of guest writes, because fio reports "io=12.0GiB (12.9GB)" and "Laying out IO file (1 file / 1024MiB)" for both of them.
2.) Test 4 has written less on the host than inside the guest.
3.) In test 1 the guest should have only written 12 GB too.

Does someone know what happened there? I can only think of the 4 to 6 minute pause between tests not being enough, so the guest's write cache or the ZIL might be too slow to write everything down within that pause. But by default the ZIL should only cache async writes for about 5-30 seconds, right? What about the Debian write cache? Do I need to manually flush after a test or something like that? Apart from the ZIL the host shouldn't cache anything because I used cachemode=none. Because the host's write average is nearly 12 GiB and test 1 ran last and was running for an hour, it looks like cached writes from tests 3-6 mainly ended up in test 1?

But it really looks like virtio can't handle 4K random sync writes very well.
 
I think I see what the problem is...
async_cached_randwrite and async_cached_seqwrite just write 12x 1G to the guest's RAM. The guest realizes that the same 1G file was overwritten 11 times, so after 1 or 2 minutes it just writes the last 1G from RAM to disk.
So cached writes are quite useless to benchmark, at least with --loops or --numjobs greater than 1.

But an interesting question is why cached reads are slower than uncached reads. Cached reads would use the Linux page cache and uncached reads just the ARC? Then it is strange that reading data from the ARC is 3-4 times faster than reading from the guest's page cache. I thought reads from the ARC should be slower because they would need to go through virtio and produce additional overhead.

Edit:
I rewrote my host+guest monitoring scripts to output data written/read since start of the script in 1 minute intervals.
That way I can monitor when SSD, host and guest are actually finished doing stuff. If I then also run only one fio test at a time the data should be better.
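A rough sketch of what such a 1-minute loop could look like on the host (device and attribute name are placeholders again):

Code:
#!/bin/bash
# print NAND writes since script start, once per minute
read_nand() { smartctl -A /dev/sda | awk '/NAND_Writes_32MiB/ {print $10 * 32}'; }

START=$(read_nand)
while true; do
    NOW=$(read_nand)
    echo "$(date +%H:%M:%S) NAND written since start: $(( NOW - START )) MiB"
    sleep 60
done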
 
Looks like the small 4K random writes are the main problem causing a lot of write amplification from guest to host. I would guess that virtio writing with 512B/4K to the 32K pool is causing this.
In that case it would be nice to benchmark whether a 4-disk striped mirror + a 2-disk mirror would be more useful than a single 6-disk striped mirror. Then I could run the 2-disk mirror with a 4K volblocksize to host my DB-heavy VMs and the 4-disk striped mirror with an 8K volblocksize for all the other VMs.
If I understand it right, a 6-disk striped mirror should use a 12K volblocksize (ashift * number of mirrors), but that isn't possible, so I would need to use 8K or 16K. With a 16K volblocksize two mirrors would get one 4K block each and one mirror would need to write two 4K blocks, so the other mirrors have to wait? Similar problem with an 8K volblocksize: only two mirrors get 4K blocks and one mirror has nothing to do. But one mirror with nothing to do sounds better than two waiting mirrors.
 
...If I understand it right a 6 disk striped mirror should use 12K volblocksize (ashift * number of mirrors)...
volblocksize = ashift * number of stripes, not mirrors.
For ashift = 12 you have 4k blocks on disk:
  • 2 disk mirror: ZFS writes 2 * 4k blocks with the same content
  • 2 disk stripe: ZFS writes 2 * 4k blocks with different contents (8k)
  • 6 disk striped mirror (3 stripes * 2 mirrors): ZFS writes 3 stripes * (2 mirrors * 4k), and this means the minimal block that ZFS can write to the array is 3 * 4k = 12k
  • 6 disk striped mirror (2 stripes * 3 mirrors): ZFS writes 2 stripes * (3 mirrors * 4k), and this means the minimal block that ZFS can write to the array is 2 * 4k = 8k
  • 6 disk RAIDZ-2: ZFS writes 4 stripes * 4k + 2 * 4k parity, and this means the minimal block that ZFS can write to the array is 4 * 4k = 16k
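Applied to the 3 stripes * 2 mirrors layout discussed above, the minimal 12k is not a power of two, so in practice the zvol would get the next power of two, e.g. (names are placeholders):

Code:
# minimal full stripe = 3 * 4k = 12k -> rounded up to 16k
zfs create -V 32G -o volblocksize=16k tank/vm-100-disk-0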
 