Terrible IO performance on SSDs

mind.overflow

Hello!
I have recently started working with Proxmox VE and I'm truly loving it. The Web UI is wonderful to use and very intuitive.
I have a very simple machine with a Xeon E3-1240v2 and 16GB of RAM, and the whole system is running on a ZFS rpool with two mirrored SSDs. The two SSDs are of different makes, but I extensively tested them before installing the system and they had satisfactory performance (~500 MB/s on one and ~350 MB/s on the other). The pool does not have any issues; the disks are fairly new and have been working more than fine as standalone drives in other PCs. The disks are directly attached to the motherboard's SATA ports.
However, even on the Proxmox host, I'm having terrible IO performance, to the point where everything is almost unusable.

[Attachment: proxmox-IO-issue.png]

I tried writing /dev/zero to a random temp file, and I'm getting wildly fluctuating performance - or, to describe it better, it sits at ~10 MB/s with spikes up to ~200 MB/s. That ~10 MB/s is shared among all VMs, meaning that if I run the test on two VMs at the same time, I get less than 5 MB/s each.
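For reference, the test I'm running is roughly something like this (the path and size are just examples):

Code:
# crude sequential write test; the file path and size are only examples
dd if=/dev/zero of=/tmp/ddtest bs=1M count=4096 status=progress
rm /tmp/ddtest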

I'm pasting some info that I think might help investigate the issue, but please tell me if more information is needed. I'm still learning Proxmox VE and I'm sure I have done something wrong - I just can't figure out where, especially because in the first few weeks of uptime the machine was perfectly fine; this issue slowly arose over time (I'm about 5 months in). Thank you so much in advance!

PS: This machine is running in a cluster with another, very low-performance one that I just use as a DNS and test machine. That one runs on slow old HDDs and has a 2-core CPU, although I don't see why it would keep the main one from performing at its best.
Funnily enough, while the main one has "I/O wait" varying between 5% and 20%, the slower machine is always at around 0-1%.

Code:
root@pve:~# pveperf
CPU BOGOMIPS:      54400.72
REGEX/SECOND:      2221915
HD SIZE:           187.58 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     336.98
DNS EXT:           47.06 ms
DNS INT:           60.54 ms (redacted.net)

Code:
root@pve:~# uname -a
Linux pve 5.13.19-6-pve #1 SMP PVE 5.13.19-14 (Thu, 10 Mar 2022 16:24:52 +0100) x86_64 GNU/Linux

Code:
root@pve:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:21 with 0 errors on Sun Apr 10 00:27:22 2022
config:

        NAME                                                   STATE     READ WRITE CKSUM
        rpool                                                  ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            ata-KINGSTON_SUV400S37240G_50026B7675033560-part3  ONLINE       0     0     0
            ata-SSD_240GB_YS202010331654AA-part3               ONLINE       0     0     0

errors: No known data errors
root@pve:~#

Code:
root@pve:~# pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active       196689024         5828992       190860032    2.96%
local-zfs     zfspool     active       219670420        28810380       190860040   13.12%
root@pve:~#

Thank you again!
 
What make and model is the second SSD? The Kingston one shows up with enough details in the zpool status output, but I have no idea what the other one is.

Benchmarking by dd'ing from /dev/zero is not really useful for ZFS, especially if compression is enabled, since zeros compress down to almost nothing.
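To see whether compression is actually active on the pool, a quick check is:

Code:
zfs get compression,compressratio rpool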

Rather, go check out fio. Our ZFS benchmarking paper has the commands showing how it was used at the various levels (the disk itself, ZFS, inside VMs).

Another thing: if you write with 4k blocks, you typically run into the IOPS (input/output operations per second) limit and not into the bandwidth limit. To check the bandwidth limit, use a larger block size such as 1M or 4M.
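For example, a bandwidth-oriented run on the host could look roughly like this (the test file location, size and runtime are just examples - adapt them to your setup):

Code:
fio --ioengine=psync --direct=1 --sync=1 --iodepth=1 --rw=write \
    --bs=4M --numjobs=1 --size=4G --runtime=60 --time_based \
    --name=seq-write-bw --filename=/rpool/data/fio-testfile
# drop --direct=1 if the filesystem you test on rejects O_DIRECT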
 
Thank you so much for the extremely quick and helpful reply!
IIRC, the other SSD is just a cheaper, probably DRAM-less model from Kingston, but I can't check right now as I'm far away from the server. I remember trying it with a quick Windows installation and it was giving consistent results in CrystalDiskMark, though, even with fairly large test file sizes.

I looked at the PDF you sent (thank you!) and tried a very quick run of fio with a size of 1GB on an Ubuntu VM, and the results are HORRIBLE. I/O delay is around 25% during the operation. The write speed fluctuates between 1 MB/s and 8 KB/s.

[Screenshots: fio write speed and I/O delay during the test]

Unless I did something wrong with the command, there is definitely something I need to fix. I'm starting to believe that the issue might be that second, slower SSD, but I wasn't expecting this much of a slowdown. I would've been more than OK with 100 MB/s - not great, but this machine doesn't have to be fast. 1 MB/s though? That makes everything absolutely unusable - an HDD from 2008 would be faster than that.

Do you have any idea about anything that I can try to make things better? Other things to check?
Thank you very much!

EDIT: The fio run finally finished; these are the results:
Code:
mind-overflow@services:~$ fio --ioengine=psync --filename=test_fio --size=1G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --threads --bs=4M --numjobs=4
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.16
Starting 4 processes
fio: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=4): [W(4)][100.0%][w=996KiB/s][w=249 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=4): err= 0: pid=303805: Fri May  6 11:19:09 2022
  write: IOPS=116, BW=465KiB/s (476kB/s)(272MiB/600013msec); 0 zone resets
    clat (msec): min=6, max=2269, avg=34.42, stdev=121.65
     lat (msec): min=6, max=2269, avg=34.43, stdev=121.65
    clat percentiles (msec):
     |  1.00th=[   13],  5.00th=[   15], 10.00th=[   15], 20.00th=[   15],
     | 30.00th=[   15], 40.00th=[   16], 50.00th=[   16], 60.00th=[   16],
     | 70.00th=[   16], 80.00th=[   16], 90.00th=[   19], 95.00th=[   22],
     | 99.00th=[  718], 99.50th=[  961], 99.90th=[ 1368], 99.95th=[ 1586],
     | 99.99th=[ 1989]
   bw (  KiB/s): min=   28, max= 1256, per=100.00%, avg=563.19, stdev=122.95, samples=3959
   iops        : min=    4, max=  314, avg=140.60, stdev=30.74, samples=3959
  lat (msec)   : 10=0.86%, 20=91.79%, 50=4.05%, 100=0.36%, 250=0.10%
  lat (msec)   : 500=1.04%, 750=0.88%, 1000=0.49%, 2000=0.42%, >=2000=0.01%
  cpu          : usr=0.02%, sys=0.16%, ctx=278347, majf=0, minf=47
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,69713,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=465KiB/s (476kB/s), 465KiB/s-465KiB/s (476kB/s-476kB/s), io=272MiB (286MB), run=600013-600013msec

Disk stats (read/write):
  sda: ios=5/141092, merge=0/53448, ticks=22/688004, in_queue=532220, util=97.40%


EDIT 2: An example workflow of updating a Docker container (docker-compose pull & docker-compose prune -a) takes ~15 minutes for a ~600MB container and causes this much I/O delay:
[Screenshot: I/O delay graph during the container update]
 
Also keep in mind that ZFS does a lot of sync writes and consumer SSDs can't handle them well (you only got ~337 fsyncs/second, which isn't much better than HDD performance). Besides durability and not losing your pool on a power outage, that is another reason why you should use enterprise SSDs with ZFS.
SSDs also get slower as you fill them, because the SLC cache gets smaller. This is especially important if your second SSD turns out to be a QLC model, which gets terribly slow as soon as the SLC cache is full.
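You can check how full the pool already is, and what the disks report about themselves, directly on the host (the device names below are just examples):

Code:
zpool list -v rpool     # CAP column: the fuller the SSDs, the smaller their SLC cache
smartctl -a /dev/sda    # model, firmware, wear and error counters
smartctl -a /dev/sdb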
 
mind.overflow said:
IIRC, the other SSD is just a cheaper, probably DRAM-less model from Kingston, but I can't check right now as I'm far away from the server.
Let us know once you have access. Unfortunately, in the consumer SSD market, there are some models that are just utterly terrible. The recommendation to buy SSDs to get good performance needs to come with a big asterisk nowadays because of this.
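If you can still reach the host over SSH, something like this should already reveal the exact model and serial without opening the case (/dev/sdb is just an example device name):

Code:
lsblk -o NAME,MODEL,SERIAL,SIZE
smartctl -i /dev/sdb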

fio presents the min, max, avg, and stddev results at the end. I remember once benchmarking an SSD directly in a support ticket that reported a minimum of 2 IOPS and 8 MB/s in a 4M bandwidth test.

Also, for all the other people that might stumble across this thread: If you want to have good performance that actually matches the specs, go for datacenter SSDs that have real power loss protection. For example https://geizhals.eu/?cat=hdssd&xf=4643_Power-Loss+Protection&sort=p#productlist
 
Thank you so much to both @aaron and @Dunuin !!
You pointed me in a great direction and I was able to address the issue. The other SSD was a knock-off Kingston, and that was the cause of all these problems. I have since replaced it with a much better one, and I'm planning to move to enterprise-grade SSDs in the near future. For the moment, however, after replacing the fake SSD in the ZFS pool, I/O wait is back to 0-5% and everything is finally very responsive again.

As for power loss protection, I'm not extremely concerned as the server is behind a 1500VA UPS, but still, I can understand the concern in case of other hardware failures (including the UPS itself), so thanks for the heads up - I didn't know SSDs needed that.

My fsyncs/second from pveperf have more than doubled, so I think that's a pretty good sign:

Code:
FSYNCS/SECOND:     728.15

Also, this is a quick run of fio that I did with the same parameters as last time:
Code:
fio: (groupid=0, jobs=4): err= 0: pid=29782: Fri Jun  3 14:02:51 2022
  write: IOPS=26, BW=105MiB/s (111MB/s)(3108MiB/29483msec); 0 zone resets
    clat (msec): min=43, max=550, avg=151.29, stdev=115.42
     lat (msec): min=43, max=550, avg=151.51, stdev=115.42
    clat percentiles (msec):
     |  1.00th=[   61],  5.00th=[   73], 10.00th=[   74], 20.00th=[   75],
     | 30.00th=[   78], 40.00th=[   95], 50.00th=[  109], 60.00th=[  124],
     | 70.00th=[  146], 80.00th=[  165], 90.00th=[  409], 95.00th=[  439],
     | 99.00th=[  510], 99.50th=[  510], 99.90th=[  550], 99.95th=[  550],
     | 99.99th=[  550]
   bw (  KiB/s): min=32734, max=229376, per=100.00%, avg=108882.59, stdev=15883.77, samples=232
   iops        : min=    6, max=   56, avg=26.36, stdev= 3.87, samples=232
  lat (msec)   : 50=0.13%, 100=41.70%, 250=43.50%, 500=13.64%, 750=1.03%
  cpu          : usr=0.18%, sys=0.13%, ctx=3116, majf=0, minf=51
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,777,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=105MiB/s (111MB/s), 105MiB/s-105MiB/s (111MB/s-111MB/s), io=3108MiB (3259MB), run=29483-29483msec

Disk stats (read/write):
  sda: ios=78/4212, merge=130/3468, ticks=261/34519, in_queue=30920, util=94.46%

Thanks again for all the help!!
 
mind.overflow said:
As for power loss protection, I'm not extremely concerned as the server is behind a 1500VA UPS, but still, I can understand the concern in case of other hardware failures (including the UPS itself), so thanks for the heads up - I didn't know SSDs needed that.
It's not only that. An SSD has its own firmware, and that firmware can't know whether you are using a UPS, so it will assume there is none and therefore won't cache any sync writes in its internal RAM cache, as they would be lost on a power outage. The result is terrible sync write performance and a lot more wear, because write amplification goes up when sync I/O operations can't be optimized before being written to NAND.
It's the same as a RAID card with a cache but without a BBU.
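You can see the effect yourself by running the same small fio job once with sync writes and once without (the test file path is just an example); on a consumer SSD the difference is usually dramatic:

Code:
# sync writes: every write must be on stable storage before fio continues
fio --ioengine=psync --rw=write --bs=4k --size=512M --sync=1 \
    --name=sync-test --filename=/rpool/data/fio-sync-test
# async writes: the SSD and ZFS are free to cache and batch them
fio --ioengine=psync --rw=write --bs=4k --size=512M --sync=0 \
    --name=async-test --filename=/rpool/data/fio-async-test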
 
Thank you for the info! I'm learning a lot of interesting things. My next purchase is definitely going to be better storage (both in terms of capacity and quality). Do you have any specific suggestions, or anything to absolutely avoid? I hate how storage companies put out great products and then silently downgrade them in later revisions, so if you could point me anywhere, I'd really appreciate it.

Thank you again!
 
That's more of a problem with consumer hardware. As long as it has power loss protection (which all enterprise/datacenter SSDs have) and is durable enough for your workload (1, 3, or 10 DWPD rating), it should be fine. SSDs for mixed workloads (3 DWPD) and write-intensive workloads (10 DWPD) often have higher-quality NAND with better write performance and durability.
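As a rough example of what the DWPD rating means: a 240GB SSD rated for 1 DWPD over a 5-year warranty may be written with about 240GB x 365 x 5 ≈ 438TB in total before it is considered worn out. To see how much you actually write, check the SMART data; the attribute names differ per vendor (e.g. Total_LBAs_Written or Wear_Leveling_Count), and /dev/sda is just an example:

Code:
smartctl -A /dev/sda | grep -i -E 'written|wear|used'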
 