High I/O wait with SSDs

@spirit Thank you very much for your input. I read the page you linked and tried disabling the write cache, with surprising results.
I had 8013 kB/s with the cache enabled and between 16.2 MB/s and 19.8 MB/s after disabling it. Why is my disk faster with the cache disabled? :oops:
I am not aware of any ZIL or SLOG storage. Are they created by default? Also, I thought I read somewhere in the forum that they are (mostly) only relevant if you have deduplication turned on? Anyway, you're probably right, and my consumer-grade SSDs suck when hammered by ZFS. I guess I need to run it on ext4 (or maybe Btrfs?) instead and keep ZFS for the spinning disks, as it works perfectly fine on those.
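(For reference: toggling a drive's on-device write cache on Linux is usually done with hdparm; a minimal sketch, the device name is only a placeholder.)

Code:
# show the current write-cache setting of the drive
hdparm -W /dev/sdX
# disable the on-drive write cache
hdparm -W 0 /dev/sdX
# re-enable it
hdparm -W 1 /dev/sdX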
ZFS always uses a ZIL for sync writes. If you don't specify a SLOG to store the ZIL on a dedicated drive, the ZIL lives on the pool drives themselves, so each drive ends up writing everything twice.
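A quick way to check this on your system (sketch only; replace rpool with your pool name):

Code:
# a separate "logs" section in the output means a dedicated SLOG device exists
zpool status rpool
# shows whether sync writes are standard, always or disabled
zfs get sync rpool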
 
thanks

I must say the iostats from the zpool are abysmal.

just so you have something to compare:
I have 3x SanDisk Ultra 3D SSDs (also consumer drives) in a mirrored zpool, and this is what I get from fio.

Code:
Run status group 0 (all jobs):
  WRITE: bw=672MiB/s (704MB/s), 83.0MiB/s-197MiB/s (88.0MB/s-207MB/s), io=2048MiB (2147MB), run=1297-3049msec

I can only assume that there is a problem somewhere in the layers below: firmware, BIOS, the SSD itself, etc.

Regardless of whether it's an enterprise or consumer SSD, the Crucial SSD should easily outperform your HDD zpool.
 
I can only assume that there is a problem somewhere in the layers below: firmware, BIOS, the SSD itself, etc.
I wouldn't bet on that. SSDs are basically USB sticks with more advanced controller chips, fancier caching and more computation going on in the background. I also benchmark my USB sticks, and 95% of them can't exceed 0.01 MB/s write speed with sync 4K writes. If you hit consumer SSDs with small sync writes they can't use their caching tricks, and the performance can be very bad. NAND flash isn't fast in all situations, and depending on the workload an HDD might be way faster.
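If you want to reproduce that effect, a minimal fio sketch for small sync writes could look like this (file name and size are only examples):

Code:
fio --name=sync4k --filename=fio_sync4k.fio --rw=write --bs=4k --size=1G --sync=1 --direct=1 --ioengine=libaio --runtime=60 --group_reporting && rm fio_sync4k.fio

On a consumer SSD this usually drops far below the advertised sequential numbers, because every 4K write has to be acknowledged as durable before the next one is issued.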
 
and depending on the workload a HDD might be way faster.
Sequential writes of ~1.2k, for instance.
No problem for an HDD: just position the head once and keep writing.
On an SSD there is a heck of a lot of wear levelling and cell reprogramming going on. Once you have used up the empty/fresh cells, good night ;)
 
I must say the iostats from the zpool are abysmal.

just so you have something to compare:
I have 3x SanDisk Ultra 3D SSDs (also consumer drives) in a mirrored zpool, and this is what I get from fio.

Code:
Run status group 0 (all jobs):
  WRITE: bw=672MiB/s (704MB/s), 83.0MiB/s-197MiB/s (88.0MB/s-207MB/s), io=2048MiB (2147MB), run=1297-3049msec

I can only assume that there is a problem somewhere in the layers below: firmware, BIOS, the SSD itself, etc.

Regardless of whether it's an enterprise or consumer SSD, the Crucial SSD should easily outperform your HDD zpool.

You are running the test for less than 3 seconds, so you are only testing the cache.

Technically your drives can't exceed about 400 MB/s in a 3-way mirror: every write goes to all three drives, so the pool writes at single-drive speed.

I would like you to test with the following and report back:
Code:
fio --name=seqwrite --filename=fio_seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=3G --runtime=60 --group_reporting && rm fio_seqwrite.fio
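If the SSDs finish the 3 GiB well before the 60 seconds are over, you can additionally use --time_based and --end_fsync=1 (both standard fio options) so the run lasts the full minute and the drive cache is flushed at the end, for example:

Code:
fio --name=seqwrite --filename=fio_seqwrite.fio --refill_buffers --rw=write --direct=1 --ioengine=libaio --bs=1m --size=3G --runtime=60 --time_based --end_fsync=1 --group_reporting && rm fio_seqwrite.fio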
 
Code:
fio --name=seqwrite --filename=fio_seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=3G --runtime=60 --group_reporting && rm fio_seqwrite.fio
I also ran that here, in case you want something to compare against:

A.) 2x Intel DC S3700 100GB SATA (mdraid raid1, luks, lvm thin, ext4):
Code:
WRITE: bw=133MiB/s (139MB/s), 133MiB/s-133MiB/s (139MB/s-139MB/s), io=7957MiB (8344MB), run=60005-60005msec

B.) 4x Intel DC S3710 200GB SATA + 1x Intel DC S3700 200GB SATA (ZFS raidz1, sync=standard):
Code:
WRITE: bw=1129MiB/s (1184MB/s), 1129MiB/s-1129MiB/s (1184MB/s-1184MB/s), io=9216MiB (9664MB), run=8160-8160msec

C.) 2x ST3000DM001 3TB 7200RPM Consumer HDDs (ZFS mirror, sync=standard):
Code:
WRITE: bw=188MiB/s (197MB/s), 188MiB/s-188MiB/s (197MB/s-197MB/s), io=9216MiB (9664MB), run=49022-49022msec

Does test B need higher values for "size" to run longer? With 1184 MB/s on raidz1 I only tested the cache, right?

Edit: Tried test B again with some other parameters:
Code:
fio --name=seqwrite --filename=fio_seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --iodepth=32 --numjobs=4 --ioengine=libaio --bs=1m --size=30G --runtime=60 --group_reporting && rm fio_seqwrite.fio

seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
...
fio-3.12
Starting 4 processes
seqwrite: Laying out IO file (1 file / 30720MiB)
seqwrite: Laying out IO file (1 file / 30720MiB)
seqwrite: Laying out IO file (1 file / 30720MiB)
seqwrite: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=4607MiB/s][w=4607 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=4): err= 0: pid=1530: Fri Jan 22 21:33:29 2021
  write: IOPS=3922, BW=3922MiB/s (4113MB/s)(230GiB/60001msec); 0 zone resets
    slat (usec): min=138, max=602640, avg=714.11, stdev=8444.69
    clat (usec): min=4, max=645673, avg=31605.58, stdev=48352.23
     lat (usec): min=392, max=646069, avg=32320.75, stdev=49144.06
    clat percentiles (msec):
     |  1.00th=[   16],  5.00th=[   18], 10.00th=[   20], 20.00th=[   22],
     | 30.00th=[   23], 40.00th=[   24], 50.00th=[   24], 60.00th=[   26],
     | 70.00th=[   27], 80.00th=[   30], 90.00th=[   37], 95.00th=[   47],
     | 99.00th=[  347], 99.50th=[  439], 99.90th=[  600], 99.95th=[  625],
     | 99.99th=[  642]
   bw (  KiB/s): min=26570, max=1540096, per=25.03%, avg=1005271.21, stdev=363382.17, samples=475
   iops        : min=   25, max= 1504, avg=981.64, stdev=354.93, samples=475
  lat (usec)   : 10=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.03%, 20=11.66%, 50=84.36%
  lat (msec)   : 100=2.56%, 250=0.01%, 500=1.06%, 750=0.31%
  cpu          : usr=25.89%, sys=31.30%, ctx=240811, majf=0, minf=46
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,235324,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=3922MiB/s (4113MB/s), 3922MiB/s-3922MiB/s (4113MB/s-4113MB/s), io=230GiB (247GB), run=60001-60001msec

Shouldn't that be way slower? 3922MiB/s is way faster than the bandwidth of 5 SATA ports...and the ARC is only 8GB, so it shouldn't be able to cache 247GB of writes.
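One way to sanity-check this (sketch only; replace the pool name) is to watch the per-vdev write rates in a second shell while fio runs. If the disks only show a few hundred MB/s each while fio reports several GB/s, the difference is being absorbed by RAM buffering:

Code:
# prints per-vdev read/write throughput every second
zpool iostat -v <your-pool> 1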

Edit:
And with 4k random writes
Code:
fio --name=randwrite --filename=fio_seqwrite.fio --refill_buffers --rw=randwrite --direct=1 --loops=3 --iodepth=1 --numjobs=1 --ioengine=libaio --bs=4k --size=30G --runtime=60 --group_reporting && rm fio_seqwrite.fio

WRITE: bw=34.4MiB/s (36.1MB/s), 34.4MiB/s-34.4MiB/s (36.1MB/s-36.1MB/s), io=2065MiB (2165MB), run=60001-60001msec
 
So, I gave up using ZFS on my boot disk. My SSD performs significantly better with other file systems. Using ext4 with the same settings as above (8K, 8 jobs) the fio test results in 432 MB/s, and with 8K and 16 jobs it is still 117 MB/s (repeatedly). Using Btrfs I get around 60 MB/s with 8K and 8 jobs and 40 MB/s with 8K and 16 jobs, which is still significantly better than ZFS (8-10 times).

I assume, as many of you have mentioned, that the performance is mostly down to some clever caching done by the SSD. I don't know what is different about ZFS that makes this caching work less well. But be that as it may, the performance penalty of ZFS is huge: around 6-7 MB/s for 8K and 8 jobs and approx. 4 MB/s for 8K and 16 jobs.
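For reference, the exact command used for those runs isn't quoted in this post, but an 8K / 8-job random-write test of that kind would typically look something like the following (every parameter here is an assumption; adjust it to match the original run):

Code:
fio --name=randwrite8k --filename=fio_randwrite8k.fio --rw=randwrite --bs=8k --numjobs=8 --iodepth=1 --sync=1 --direct=1 --size=4G --runtime=60 --group_reporting && rm fio_randwrite8k.fio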

Since I don't want to invest in new SSDs right now, and the current ones are not that old so I don't have another use for them, I think I will give Btrfs a chance, as it also allows for software RAID1 and is supported by PVE. Maybe I keep ZFS on the 4 HDDs for now, IDK.

Anyhow, I really want to thank everybody who shared their thoughts and opinions with me - it helped me a lot and was very much appreciated.
 
I don't know what is different to ZFS
It is a copy-on-write (CoW) filesystem and hence works completely differently compared to ext4, Btrfs, etc.
Anyhow, I really want to thank everybody who shared their thoughts and opinions with me - it helped me a lot and was very much appreciated.
You are welcome. I find such discussions interesting and I also gain some knowledge from them, so it is a win-win.
 
I also ran that here, in case you want something to compare against:

B.) 4x Intel DC S3710 200GB SATA + 1x Intel DC S3700 200GB SATA (ZFS raidz1, sync=standard):
Code:
WRITE: bw=1129MiB/s (1184MB/s), 1129MiB/s-1129MiB/s (1184MB/s-1184MB/s), io=9216MiB (9664MB), run=8160-8160msec

Does test B need higher values for "size" to run longer? With 1184 MB/s on raidz1 I only tested the cache, right?

Nah, that seems fine, given that raidz splits the data across the drives.

Your S3710 200GB does ~300 MB/s sequential write; x4 = 1200 MB/s. The S3700 is excluded to account for parity.

Shouldn't that be way slower? 3922MiB/s is way faster than the bandwidth of 5 SATA ports...and the ARC is only 8GB, so it shouldn't be able to cache 247GB of writes.

keep numjobs=1
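That is, the same command as your second run but with a single job, e.g.:

Code:
fio --name=seqwrite --filename=fio_seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --iodepth=32 --numjobs=1 --ioengine=libaio --bs=1m --size=30G --runtime=60 --group_reporting && rm fio_seqwrite.fio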
 
