ZFS slow writes on Samsung PM893

limone

Hello,

I have a Dell R630 server (without a hardware storage controller) with 4x Samsung PM893 480GB SSDs, on which I run ZFS.

Unfortunately I have very poor write performance:
fio --ioengine=libaio --filename=/ZFS-2TB_RAID0_SSD/fiofile --direct=1 --sync=1 --rw=write --bs=8K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=50G
Code:
fio: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=39.3MiB/s][w=5027 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2192386: Wed Aug  9 18:10:26 2023
  write: IOPS=7794, BW=60.9MiB/s (63.8MB/s)(3654MiB/60001msec); 0 zone resets
    slat (usec): min=86, max=51158, avg=125.70, stdev=273.54
    clat (nsec): min=1226, max=394109, avg=1580.25, stdev=1241.95
     lat (usec): min=87, max=51171, avg=127.28, stdev=273.64
    clat percentiles (nsec):
     |  1.00th=[ 1288],  5.00th=[ 1304], 10.00th=[ 1320], 20.00th=[ 1336],
     | 30.00th=[ 1352], 40.00th=[ 1368], 50.00th=[ 1384], 60.00th=[ 1400],
     | 70.00th=[ 1448], 80.00th=[ 1624], 90.00th=[ 1784], 95.00th=[ 2352],
     | 99.00th=[ 3216], 99.50th=[ 7712], 99.90th=[15552], 99.95th=[15808],
     | 99.99th=[16768]
   bw (  KiB/s): min=17360, max=79936, per=100.00%, avg=62388.71, stdev=13542.87, samples=119
   iops        : min= 2170, max= 9992, avg=7798.57, stdev=1692.84, samples=119
  lat (usec)   : 2=93.59%, 4=5.79%, 10=0.14%, 20=0.47%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=2.54%, sys=22.57%, ctx=483282, majf=0, minf=91
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,467661,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=60.9MiB/s (63.8MB/s), 60.9MiB/s-60.9MiB/s (63.8MB/s-63.8MB/s), io=3654MiB (3831MB), run=60001-60001msec



Read performance, on the other hand, is quite good, over 100K IOPS / 800 MiB/s in this run:
fio --ioengine=libaio --filename=/ZFS-2TB_RAID0_SSD/fiofile --direct=1 --sync=1 --rw=read --bs=8K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=200G
Code:
fio: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
fio: Laying out IO file (1 file / 204800MiB)
Jobs: 1 (f=0): [f(1)][100.0%][r=850MiB/s][r=109k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=2175577: Wed Aug  9 18:07:22 2023
  read: IOPS=105k, BW=821MiB/s (861MB/s)(48.1GiB/60001msec)
    slat (usec): min=3, max=2601, avg= 8.29, stdev=20.81
    clat (nsec): min=806, max=142281, avg=925.76, stdev=311.68
     lat (usec): min=3, max=2605, avg= 9.22, stdev=20.93
    clat percentiles (nsec):
     |  1.00th=[  844],  5.00th=[  860], 10.00th=[  868], 20.00th=[  876],
     | 30.00th=[  876], 40.00th=[  884], 50.00th=[  884], 60.00th=[  892],
     | 70.00th=[  892], 80.00th=[  900], 90.00th=[  924], 95.00th=[ 1368],
     | 99.00th=[ 1496], 99.50th=[ 1592], 99.90th=[ 2096], 99.95th=[ 2768],
     | 99.99th=[ 8768]
   bw (  KiB/s): min=594432, max=875735, per=99.99%, avg=840533.60, stdev=33502.01, samples=119
   iops        : min=74304, max=109466, avg=105066.70, stdev=4187.77, samples=119
  lat (nsec)   : 1000=92.89%
  lat (usec)   : 2=6.99%, 4=0.08%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=14.49%, sys=83.98%, ctx=3195, majf=0, minf=44
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=6304753,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=821MiB/s (861MB/s), 821MiB/s-821MiB/s (861MB/s-861MB/s), io=48.1GiB (51.6GB), run=60001-60001msec

It makes no difference whether I arrange them as RAID0, RAID1, RAID5 (RAIDZ) or RAID10.
Currently I'm running them in RAID0 and would like to keep it that way, since I mainly need lots of fast storage and extended downtime is not a problem.
I had deliberately bought enterprise SSDs for this, admittedly probably the cheapest ones, but at least they have power-loss protection.
When I run the fio test without sync (--sync=0) they are very fast, but unfortunately that doesn't help me, because a VM doesn't care whether I enable ZFS sync or not.

Now to my real question: Am I doing something wrong, or are the SSDs simply not suitable for my purpose?
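
A quick way to check whether the sync-write path itself is the bottleneck would be to toggle the sync property on the dataset and rerun the test. This is only a sketch (dataset name taken from the pool above), and sync=disabled is for testing only, since it risks losing the last few seconds of writes on a crash:
Code:
# check the current setting
zfs get sync ZFS-2TB_RAID0_SSD

# treat sync writes as async -- TEST ONLY, weakens crash consistency
zfs set sync=disabled ZFS-2TB_RAID0_SSD

# ...rerun fio / the VM workload...

# restore the default afterwards
zfs set sync=standard ZFS-2TB_RAID0_SSD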

Additional info (let me know if something is missing):
Code:
NAME               PROPERTY                       VALUE                          SOURCE
ZFS-2TB_RAID0_SSD  size                           1.73T                          -
ZFS-2TB_RAID0_SSD  capacity                       54%                            -
ZFS-2TB_RAID0_SSD  altroot                        -                              default
ZFS-2TB_RAID0_SSD  health                         ONLINE                         -
ZFS-2TB_RAID0_SSD  guid                           11975000417080453703           -
ZFS-2TB_RAID0_SSD  version                        -                              default
ZFS-2TB_RAID0_SSD  bootfs                         -                              default
ZFS-2TB_RAID0_SSD  delegation                     on                             default
ZFS-2TB_RAID0_SSD  autoreplace                    off                            default
ZFS-2TB_RAID0_SSD  cachefile                      -                              default
ZFS-2TB_RAID0_SSD  failmode                       wait                           default
ZFS-2TB_RAID0_SSD  listsnapshots                  off                            default
ZFS-2TB_RAID0_SSD  autoexpand                     off                            default
ZFS-2TB_RAID0_SSD  dedupratio                     1.00x                          -
ZFS-2TB_RAID0_SSD  free                           810G                           -
ZFS-2TB_RAID0_SSD  allocated                      966G                           -
ZFS-2TB_RAID0_SSD  readonly                       off                            -
ZFS-2TB_RAID0_SSD  ashift                         12                             local
ZFS-2TB_RAID0_SSD  comment                        -                              default
ZFS-2TB_RAID0_SSD  expandsize                     -                              -
ZFS-2TB_RAID0_SSD  freeing                        0                              -
ZFS-2TB_RAID0_SSD  fragmentation                  7%                             -
ZFS-2TB_RAID0_SSD  leaked                         0                              -
ZFS-2TB_RAID0_SSD  multihost                      off                            default
ZFS-2TB_RAID0_SSD  checkpoint                     -                              -
ZFS-2TB_RAID0_SSD  load_guid                      2918797835124086661            -
ZFS-2TB_RAID0_SSD  autotrim                       off                            default
ZFS-2TB_RAID0_SSD  compatibility                  off                            default
ZFS-2TB_RAID0_SSD  feature@async_destroy          enabled                        local
ZFS-2TB_RAID0_SSD  feature@empty_bpobj            active                         local
ZFS-2TB_RAID0_SSD  feature@lz4_compress           active                         local
ZFS-2TB_RAID0_SSD  feature@multi_vdev_crash_dump  enabled                        local
ZFS-2TB_RAID0_SSD  feature@spacemap_histogram     active                         local
ZFS-2TB_RAID0_SSD  feature@enabled_txg            active                         local
ZFS-2TB_RAID0_SSD  feature@hole_birth             active                         local
ZFS-2TB_RAID0_SSD  feature@extensible_dataset     active                         local
ZFS-2TB_RAID0_SSD  feature@embedded_data          active                         local
ZFS-2TB_RAID0_SSD  feature@bookmarks              enabled                        local
ZFS-2TB_RAID0_SSD  feature@filesystem_limits      enabled                        local
ZFS-2TB_RAID0_SSD  feature@large_blocks           enabled                        local
ZFS-2TB_RAID0_SSD  feature@large_dnode            enabled                        local
ZFS-2TB_RAID0_SSD  feature@sha512                 enabled                        local
ZFS-2TB_RAID0_SSD  feature@skein                  enabled                        local
ZFS-2TB_RAID0_SSD  feature@edonr                  enabled                        local
ZFS-2TB_RAID0_SSD  feature@userobj_accounting     active                         local
ZFS-2TB_RAID0_SSD  feature@encryption             enabled                        local
ZFS-2TB_RAID0_SSD  feature@project_quota          active                         local
ZFS-2TB_RAID0_SSD  feature@device_removal         enabled                        local
ZFS-2TB_RAID0_SSD  feature@obsolete_counts        enabled                        local
ZFS-2TB_RAID0_SSD  feature@zpool_checkpoint       enabled                        local
ZFS-2TB_RAID0_SSD  feature@spacemap_v2            active                         local
ZFS-2TB_RAID0_SSD  feature@allocation_classes     enabled                        local
ZFS-2TB_RAID0_SSD  feature@resilver_defer         enabled                        local
ZFS-2TB_RAID0_SSD  feature@bookmark_v2            enabled                        local
ZFS-2TB_RAID0_SSD  feature@redaction_bookmarks    enabled                        local
ZFS-2TB_RAID0_SSD  feature@redacted_datasets      enabled                        local
ZFS-2TB_RAID0_SSD  feature@bookmark_written       enabled                        local
ZFS-2TB_RAID0_SSD  feature@log_spacemap           active                         local
ZFS-2TB_RAID0_SSD  feature@livelist               enabled                        local
ZFS-2TB_RAID0_SSD  feature@device_rebuild         enabled                        local
ZFS-2TB_RAID0_SSD  feature@zstd_compress          enabled                        local
ZFS-2TB_RAID0_SSD  feature@draid                  enabled                        local
 
I have now removed the zpool to be able to test different things.
When I run the fio test directly on the SSD, the speed is actually quite good, or at least much better than through ZFS.

But what I still don't quite understand: ZFS writes are just as fast on a single disk as on a ZFS RAID. Shouldn't the writes be distributed across the individual disks, i.e. shouldn't the write speed scale with the number of disks?

fio --ioengine=libaio --filename=/dev/sdi --bs=8K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio --size=50G

Speeds (where they settled):
Code:
result                        --direct=  --sync=  --rw=
[w=154MiB/s][w=19.8k IOPS]    1          1        write
[w=106MiB/s][w=13.6k IOPS]    0          1        write
[w=223MiB/s][w=28.5k IOPS]    1          0        write
[w=391MiB/s][w=50.1k IOPS]    0          0        write
[w=155MiB/s][w=19.8k IOPS]    1          1        randwrite
[w=111MiB/s][w=14.2k IOPS]    0          1        randwrite
[w=216MiB/s][w=27.6k IOPS]    1          0        randwrite
[w=370MiB/s][w=47.4k IOPS]    0          0        randwrite
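
For comparison, a run against the pool with more parallelism should show whether the stripe scales at all, since with a single job at queue depth 1 the writes never fan out across the disks. A rough example (pool path as before, the job count and queue depth are just example values):
Code:
# 4 jobs, queue depth 16 -- lets writes overlap across the vdevs
fio --ioengine=libaio --filename=/ZFS-2TB_RAID0_SSD/fiofile \
    --direct=1 --sync=1 --rw=write --bs=8K \
    --numjobs=4 --iodepth=16 --group_reporting \
    --runtime=60 --time_based --name=fio --size=50G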
 
@Falk R. No. Would this only affect the benchmark, or could it also be used to tune the datastore itself? Because the VMs run very slowly on a ZFS pool backed by the PM893s.

I ended up using mdadm software RAID instead, and now the VMs are running very fast.
 
Yes, this affects the benchmark. A real OS normally uses a larger iodepth. But these SATA SSDs have limited write I/O by design, and ZFS adds significant write amplification. With a fast special device you can use these SSDs for VM ZFS pools; without a special device they are better suited for backups. ;)
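
As a rough sketch of what that could look like (pool name from earlier in the thread, NVMe device names are placeholders, not from this thread); for sync-write latency specifically, a separate log device (SLOG) is the more direct lever:
Code:
# mirrored special vdev for metadata / small blocks -- device names are placeholders
zpool add ZFS-2TB_RAID0_SSD special mirror /dev/nvme0n1 /dev/nvme1n1

# a separate log device (SLOG) to absorb sync writes -- also a placeholder device
zpool add ZFS-2TB_RAID0_SSD log /dev/nvme2n1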
 
Well, I meant whether it only affects the benchmark, or whether it also changes anything for the ZFS pool itself. A fast benchmark is useless if the VMs are still slow.
But I don't think I'll put any more effort into it, since I don't get any benefit from ZFS anyway.
 
Benefits of ZFS:

  1. Error Detection
  2. Error Correction
  3. You typically gain about 20-30% storage space thanks to compression
  4. Snapshots and storage replication (see the sketch below)
But in the end it all depends on your use case.
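
A minimal sketch of points 3 and 4 (dataset, snapshot and host names are only examples):
Code:
# compression (lz4 is the usual low-overhead choice) and the ratio actually achieved
zfs set compression=lz4 ZFS-2TB_RAID0_SSD
zfs get compressratio ZFS-2TB_RAID0_SSD

# cheap point-in-time snapshot, which can also be replicated to another host
zfs snapshot ZFS-2TB_RAID0_SSD@before-maintenance
zfs send ZFS-2TB_RAID0_SSD@before-maintenance | ssh backuphost zfs recv backup/vmpool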
 
Interesting. I'm speculating here, but I think it could boil down to the classic trade-off between latency and throughput. If your write operations are sync, and therefore inherently dependent on the SSD completing the write, there is always a certain overhead irrespective of the amount of data written, and that overhead stays the same even when many SSDs are involved in parallel, since they all need to finish before the operation completes. In addition, SATA is slow by today's standards and can only sustain a certain number of operations per second. With only 1 job and queue depth 1 you are really exposing your results to that overhead, as everything is done serially. My guess is that if you experiment with larger block writes and more parallelism, other factors will come into play (e.g. total writable bandwidth) and you will also see more differences between the RAID configurations. That may help you get a better sense of where the real bottlenecks are.

Edit: PS, as @ubu outlines, maybe don't disregard ZFS so quickly; personally I wouldn't use anything else. Sync writes with checksums, redundant metadata, etc. are a real belt-and-braces approach, so they will invariably come with some overhead. If you go with another file system you may gain speed, but you will lose features and data protection mechanisms. BUT for a fair comparison you can switch those off in ZFS too, should you wish...
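
For such a comparison, a throwaway dataset with the safety features switched off would do, something like this (dataset name is just an example, and this is for benchmarking only, never for real data):
Code:
# apples-to-apples benchmark dataset -- NOT for real data
zfs create -o compression=off -o checksum=off -o sync=disabled ZFS-2TB_RAID0_SSD/benchtest

# ...run fio against /ZFS-2TB_RAID0_SSD/benchtest, then clean up...
zfs destroy ZFS-2TB_RAID0_SSD/benchtest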
 