Performance issues on raid 1 SSD

puertorico

I run Proxmox 4.2 on an HP server with a Xeon 1220v2 CPU and 20 GB of ECC RAM. I am seeing massive performance issues and I can't find the reason why.

I run two mirrors on an Intel RS2WC080. Last week I decided to remove the RAID controller and try the onboard SATA, but the problem still exists. I benchmarked the drives before they were put into the Proxmox node, and there they performed as expected on ext4.

The issue is only on the rpool, all SMART data passes, and I have tested with no running VMs or containers.


The config for the pools:


RPOOL: sdc + sdd 2x SAMSUNG MZ7TE128HMGR

root@vmcluster01:~# zpool get all rpool
NAME PROPERTY VALUE SOURCE
rpool size 119G -
rpool capacity 50% -
rpool altroot - default
rpool health ONLINE -
rpool guid 5425977347035108410 default
rpool version - default
rpool bootfs rpool/ROOT/pve-1 local
rpool delegation on default
rpool autoreplace off default
rpool cachefile - default
rpool failmode wait default
rpool listsnapshots off default
rpool autoexpand off default
rpool dedupditto 0 default
rpool dedupratio 1.00x -
rpool free 58.8G -
rpool allocated 60.2G -
rpool readonly off -
rpool ashift 12 local
rpool comment - default
rpool expandsize - -
rpool freeing 0 default
rpool fragmentation 44% -
rpool leaked 0 default
rpool feature@async_destroy enabled local
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset enabled local
rpool feature@embedded_data active local
rpool feature@bookmarks enabled local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local


DATASTORE: sda + sdb 2 x 3TB Seagate disks

root@vmcluster01:~# zpool get all datastore
NAME PROPERTY VALUE SOURCE
datastore size 2.72T -
datastore capacity 48% -
datastore altroot - default
datastore health ONLINE -
datastore guid 6264853520651431196 default
datastore version - default
datastore bootfs - default
datastore delegation on default
datastore autoreplace off default
datastore cachefile - default
datastore failmode wait default
datastore listsnapshots off default
datastore autoexpand off default
datastore dedupditto 0 default
datastore dedupratio 1.00x -
datastore free 1.40T -
datastore allocated 1.32T -
datastore readonly off -
datastore ashift 12 local
datastore comment - default
datastore expandsize - -
datastore freeing 0 default
datastore fragmentation 11% -
datastore leaked 0 default
datastore feature@async_destroy enabled local
datastore feature@empty_bpobj active local
datastore feature@lz4_compress active local
datastore feature@spacemap_histogram active local
datastore feature@enabled_txg active local
datastore feature@hole_birth active local
datastore feature@extensible_dataset enabled local
datastore feature@embedded_data active local
datastore feature@bookmarks enabled local
datastore feature@filesystem_limits enabled local
datastore feature@large_blocks enabled local


Some tests:

I have tried creating different storage types to see if there is a difference in performance; there is no difference between running as "directory" or "zfs".

When copying a VM from rpool to datastore it looks okay, 100-152 MB/s write (which is okay for regular disks):

capacity operations bandwidth
pool alloc free read write read write

datastore 1.32T 1.40T 0 1.19K 0 152M
rpool 61.6G 57.4G 8.81K 0 67.9M 8.00K





But when cloning a VM from datastore to rpool, or from rpool to rpool, it looks horrible, like this: 5-12.6 MB/s on SSD.

capacity operations bandwidth
pool alloc free read write read write

datastore 1.32T 1.40T 756 0 93.9M 0
rpool 61.5G 57.5G 4 1.04K 60.0K 12.6M



This is the output from iostat when cloning a VM from rpool to rpool. sdc and sdd are utilized 100% with iowait/delay going from 5% to 50%, but the weird part is that my CPU is not going over 10% utilization. The question is why the disks are at 100% utilization and performing so badly.


Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 23.00 0.00 42.00 102.00 4.19 12.11 231.78 2.74 19.28 35.62 12.55 6.94 100.00
sdd 31.00 0.00 54.00 71.00 8.13 8.39 270.66 3.08 25.47 36.30 17.24 8.00 100.00
zd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zd16 0.00 0.00 0.00 4.00 0.00 0.01 4.00 0.32 79.00 0.00 79.00 79.00 31.60
zd32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zd48 0.00 0.00 96.00 0.00 12.00 0.00 256.00 0.99 10.46 10.46 0.00 10.29 98.80
zd64 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zd80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zd96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00



Another iostat view:

avg-cpu: %user %nice %system %iowait %steal %idle
0.75 0.00 5.51 37.59 0.00 56.14


When I look at zfs list I can see there is no compression on datastore?

root@vmcluster01:~# zfs list -o name,compression,recordsize
NAME COMPRESS RECSIZE
datastore off 128K
datastore/files off 128K
rpool lz4 128K
rpool/ROOT lz4 128K
rpool/ROOT/pve-1 lz4 128K
rpool/ct lz4 128K
rpool/ct/subvol-1910-disk-1 lz4 128K
rpool/ct/subvol-1920-disk-2 lz4 128K
rpool/ct/subvol-1930-disk-1 lz4 128K
rpool/ct/vm-112-disk-1 lz4 -
rpool/ct/vm-114-disk-1 lz4 -
rpool/ct/vm-1180-disk-1 lz4 -
rpool/ct/vm-1190-disk-1 lz4 -
rpool/ct/vm-1810-disk-1 lz4 -
rpool/swap lz4 -
 
Here are some tests, with a reboot in between to make changes in the BIOS.

1 - SATA 5+6 in legacy mode with cache

2 - SATA 5+6 in AHCI with cache

3 - SATA 5+6 without cache

4 - SATA 1+2 without cache


I did the same test on ports 1 + 2 on the SFF-8087 connector; there is no difference in read/write.


(benchmark results screenshot attached)
 
I tried booting the server with Ubuntu GNOME just to test the disks and found out a very weird thing: when a disk is formatted as ZFS it is painfully slow and I see the same result as on Proxmox, with both 10 MB and 100 MB chunks it is 10-50 MB/s in write.

But then I tried formatting it as ext4, and that made the difference: now I see the results that the disk should provide. To my surprise, the same thing is happening on both disks. I will try installing Proxmox on ZFS again.

I have attached screenshots of how the disk performs in gnome-disks when it is formatted as ZFS and right after it is formatted as ext4, no restart or anything, just a format in GNOME Disks.

When I install Proxmox on the same disk with ext4, I get this with iostat and a simple dd:
dd if=/dev/zero of=test.bin bs=1M count=5000 conv=sync


(iostat screenshot attached)
 

Update: I tried a new pool with 2 new Samsung 840 EVOs on AHCI and tried every combination of settings on the ssd2 pool: cache enabled/disabled in the BIOS, more RAM for ZFS, no checksums, with and without compression. I have also tried running a single drive (non-RAID), with the same performance issues.

When the 2 new SSD drives are formatted as ext4 and connected to the same controller/server, they perform at 500+ MB/s in read and write (in a non-RAID config). The controller is 6 Gbit and verified with smartctl.

When added as a zpool, the performance tops out at 100 MB/s in write; when running a VM with Windows or another OS as raw, the performance is even worse (like 25-30 MB/s writes). The read speed is in the 300 MB/s range, and when the same file is read multiple times and cached in RAM by ZFS it is almost 4000-5000 MB/s :)

When I copy a big file or clone a VM etc., the iodelay/iowait can go up to 50%+ but the CPU stays at 6-10%. I am completely lost as to what is causing this massive delay, and I have no idea how to fix it.

Printout from the pools:

(2x samsung 120 gb)
pool: rpool
state: ONLINE
scan: none requested
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda2 ONLINE 0 0 0
sdb2 ONLINE 0 0 0

errors: No known data errors


(2 x samsung 850 evo)
pool: ssd2
state: ONLINE
scan: resilvered 274G in 0h22m with 0 errors on Thu Aug 25 22:21:37 2016
config:

NAME STATE READ WRITE CKSUM
ssd2 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
 
I thought I would try a clean install with only the 2 Samsung 850 EVO SSDs.
I still see the same performance issues.

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda2 ONLINE 0 0 0
sdb2 ONLINE 0 0 0



I tried making a simple dd test again without compression. The read speed is as expected, but writes are still too slow, and the IO delay goes much higher when writing:

WRITE:
root@vmcluster01:~# dd if=/dev/zero of=/rpool/data/files/test.bin bs=1M count=50000 conv=sync
^C30861+0 records in
30861+0 records out
32360103936 bytes (32 GB) copied, 258.229 s, 125 MB/s


READ:
root@vmcluster01:~# dd if=/rpool/data/files/test.bin of=/dev/null bs=1M conv=sync
30861+0 records in
30861+0 records out
32360103936 bytes (32 GB) copied, 51.1426 s, 633 MB/s
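
For what it's worth, a plain dd like this reports its rate before all of the data has necessarily been flushed to the disks, and /dev/zero input compresses away to almost nothing when compression is on. A variant that at least forces a flush before dd prints the throughput (same file path as above, smaller count just to keep it short) would be something like:

dd if=/dev/zero of=/rpool/data/files/test.bin bs=1M count=10000 conv=fdatasync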


I also tried loading some data onto the server via SMB. When the transfer finishes on the client side, the iodelay takes almost 15 seconds to go down, but at no point during the transfer is the CPU stressed.
 

Update: I tried a clean install on a different server with 2 x 2 TB disks in a mirror. This is completely different hardware with 32 GB RAM and a Xeon 1231, and ZFS still creates massive IO delays when writing to disk. The disks are nowhere near the raw performance of a single disk, and the system still locks up when a single 4 GB file is copied via SMB / NFS / SSH / locally.

We will go back to using ext4, since we can't get ZFS to be usable for anything other than archiving or reading data.
 
About write performance:

ZFS (and also Ceph) needs fast sync writes for its journal.

The problem with consumer SSD drives is that they are pretty shitty for sync writes (sometimes slower than an HDD :/).

Check out this Ceph blog post with benchmarks of different SSDs:

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/


Better to stay on lvm-thin or ext4 if you don't have enterprise-grade hardware for ZFS.
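
The linked article measures exactly this: how many small synchronous (O_DSYNC) writes a drive can sustain, which is what the ZFS journal needs. A rough sketch of that kind of test with fio could look like the following (the device name is only an example, and the test writes directly to the raw device, so only run it against a disk that holds no data you care about):

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

A good enterprise SSD sustains thousands of IOPS here, while many consumer drives drop to a few hundred or less, which matches the behaviour described above.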
 
Can you add more RAM to your server? I see in your top post you have just 20 GB, which is a very low value for ZFS...
 
About write performance:

ZFS (and also Ceph) needs fast sync writes for its journal.

The problem with consumer SSD drives is that they are pretty shitty for sync writes (sometimes slower than an HDD :/).

Check out this Ceph blog post with benchmarks of different SSDs:

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/


Better to stay on lvm-thin or ext4 if you don't have enterprise-grade hardware for ZFS.
----

Thanks for this reply, you are on to something. I have tested with many different desktop SSDs, with almost the same performance, on different hardware, so the only thing I didn't change was the use of desktop SSDs. It is sad that the performance is so low on that type of hardware. Maybe I should just stick with ext4 and use ZFS for the backups.
 
Can you add more RAM to your server? I see in your top post you have just 20 GB, which is a very low value for ZFS...

I have no way to test with more than the 32 GB that I had in the last server, which also had IO issues.
 
Update: I got a new server to test on. :)

The new server is a 1U with a Supermicro X10SDV-4C-TLN2F motherboard, a Xeon D-1521 CPU, 32 GB of registered RAM and 2x brand new 1 TB Samsung 850 EVO drives, which I configured as a mirror.

I reinstalled everything from scratch with the drives in AHCI mode. This system performs a little better, but ZFS write performance is not much improved; read speed is better. I tried to feed the server with 2 old WD Raptor drives that I had lying around, and they perform more consistently at writes of 100 MB/s+. I think it is safe to assume that ZFS and cheap consumer SSD drives are a no-go. :rolleyes:

Does anyone have a setup with 2x enterprise SSDs, like Intel DCs or similar, in a mirror that they can benchmark?
 
Again EVO ... the Pro models are not good, but the EVOs are the worst.

I have multiple ZFS pools, including SSD-only pools with enterprise-grade SSDs. What benchmark do you have in mind? (And no, dd is not a benchmark - never has been, never will be.)
 
I think it is hard to point at one specific benchmark to get real-world performance.
My use case is regular file storage and VMs. For the regular storage I think bonnie++ gives a good indication of how it performs.
For the VMs I have no idea what a good test would be; I tried both bonnie++ and the GNOME benchmark from within a VM.

On the host I have tested with this bonnie++ command. By default it runs the test with a size of twice the amount of RAM in the machine, to ensure that the ZFS cache has no effect on the results.

bonnie++ -d /root/ -u root | bon_csv2html > /root/bonnietest.html
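
If you want to pin the test size yourself instead of relying on the default, bonnie++ also takes it explicitly; something along these lines should work (the size and target directory here are only examples, -q keeps the output machine-readable for bon_csv2html):

# 64 GB test size on the pool under test, quiet/CSV output piped to bon_csv2html
bonnie++ -d /rpool/data -s 64g -u root -q | bon_csv2html > /root/bonnietest.html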

(bonnie++ result screenshots attached)
 
Bonnie inside a VM is a reasonable benchmark if you use a ZVOL on the outside, yet I would not compare it to plain ZFS. I'd suggest you try fio on the outside, on a dedicated volume for testing.

Create a new zvol of a suitable size (512 GB here) with a block size of 4K for best block alignment:

Code:
root@proxmox4 ~ > zfs create -b 4K -V $(( 512 * 1024 * 1024 * 1024 )) rpool/test-fio

root@proxmox4 ~ > zfs list rpool/test-fio
NAME             USED  AVAIL  REFER  MOUNTPOINT
rpool/test-fio  512G  1,71T    64K  -

root@proxmox4 ~ > ls -l /dev/zvol/rpool/test-fio
lrwxrwxrwx 1 root root 11 Jan 23 19:23 /dev/zvol/rpool/test-fio -> ../../zd336

Then create this fio job file:

Code:
cat > /tmp/4KQD32.fio <<EOF
[global]
time_based
runtime=30

[4kqd32_read]
description=4K QD32
numjobs=1
group_reporting=1
blocksize=4K
rw=randread
direct=1
ioengine=libaio
iodepth=32
EOF

and then run the actual test:

Code:
root@proxmox4 ~ > fio --filename=/dev/zvol/rpool/test-fio /tmp/4KQD32.fio
4kqd32_read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [r(1)] [100.0% done] [293.6MB/0KB/0KB /s] [75.2K/0/0 iops] [eta 00m:00s]
4kqd32_read: (groupid=0, jobs=1): err= 0: pid=63307: Mon Jan 23 19:32:19 2017
  Description  : [4K QD32]
  read : io=8866.6MB, bw=302635KB/s, iops=75658, runt= 30001msec
    slat (usec): min=10, max=515, avg=10.94, stdev= 1.23
    clat (usec): min=11, max=943, avg=410.22, stdev=15.49
     lat (usec): min=22, max=955, avg=421.37, stdev=15.92
    clat percentiles (usec):
     |  1.00th=[  402],  5.00th=[  402], 10.00th=[  402], 20.00th=[  402],
     | 30.00th=[  402], 40.00th=[  406], 50.00th=[  406], 60.00th=[  406],
     | 70.00th=[  410], 80.00th=[  414], 90.00th=[  426], 95.00th=[  434],
     | 99.00th=[  486], 99.50th=[  498], 99.90th=[  524], 99.95th=[  524],
     | 99.99th=[  556]
    bw (KB  /s): min=273176, max=307792, per=100.00%, avg=302677.29, stdev=7997.55
    lat (usec) : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=99.56%
    lat (usec) : 750=0.44%, 1000=0.01%
  cpu          : usr=16.00%, sys=84.00%, ctx=15, majf=0, minf=360
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=2269840/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=8866.6MB, aggrb=302635KB/s, minb=302635KB/s, maxb=302635KB/s, mint=30001msec, maxt=30001msec

You then see the "Single Thread Random Read 4K Blocksize" IOPS - about 75k in the run above.

Just change the parameter to rw=randrw inside the config for the mixed read/write test, or to rw=randwrite for write only.

(My values are 12k for randrw and 17k for randwrite)

If you increase the numjobs parameter, you will get the aggregated bandwidth for all simultaneous jobs. This is where the SSD shines, because there is no moving head jumping all over the disk. With 16 threads I get 130k IOPS for randread, 30k IOPS for randwrite and 13k IOPS for randrw.
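
As an example, a write-only variant of the same job with several threads could look like this (file and section names are just placeholders, everything else follows the config above; note that it overwrites whatever is on the test zvol):

Code:
cat > /tmp/4KQD32-write.fio <<EOF
[global]
time_based
runtime=30

[4kqd32_randwrite]
description=4K QD32 random write
numjobs=16
group_reporting=1
blocksize=4K
rw=randwrite
direct=1
ioengine=libaio
iodepth=32
EOF

fio --filename=/dev/zvol/rpool/test-fio /tmp/4KQD32-write.fio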
 
Howdy! Since ZFS is so much more complex than ext4, it requires some tuning to perform well.

For my UPS backed & well-backed-up servers, I have the following settings:

sysctl.conf
vm.swappiness = 1
vm.min_free_kbytes = 131072

/etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592
options zfs zfs_arc_min=1073741824
#This limits the memory zfs uses
options zfs zfs_prefetch_disable=1
#virtual machines do not benefit sufficiently with prefetch on
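
One note on the module options above: on a root-on-ZFS install they are usually only picked up at boot after the initramfs has been refreshed, so something like this (standard Debian tooling) is typically needed before the reboot:

# rebuild the initramfs so the new zfs.conf options apply at the next boot
update-initramfs -u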

ZFS options

zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=off rpool/swap
zfs set sync=disabled rpool/swap
zfs set checksum=on rpool/swap
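
To double-check that the dataset properties took effect, something like this will show the current values:

zfs get primarycache,secondarycache,compression,sync,checksum rpool/swap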


And the controversial one - which of course made the biggest difference... :)

zfs set sync=disabled <pool name>
# ZFS's write throttle really slowed down writes

Good luck!
 
zfs set sync=disabled <pool name>
# ZFS's write throttle really slowed down writes

You forgot to mention that you turned off consistent synchronous writes and opened a side door to hell.

This option should only be used if you absolutely understand what you're doing and it is considered dangerous - not only by the manpage :-D
 
You forgot to mention that you turned off consistent synchronous writes and opened a side door to hell.

This option should only be used if you absolutely understand what you're doing and it is considered dangerous - not only by the manpage :-D


Yes, thus the "For my UPS backed & well-backed-up servers, I have the following settings:" and the
"And the controversial one - which of course made the biggest difference... :)" parts of my message.
 
Thanks lnxbill & Joshin for your answers.

I will definitely try the test that lnxbill posted on our hardware. As for your setup, is that a mirror, and what make/models?

If both of you were to give advice on buying 2x performance-grade solid-state drives to use in a mirror as the rpool, what type of SSD would you recommend? I have looked at the Intel DC S3500 and Intel DC S3700 - are they good? Or would it be better to look at an NVMe solution like the Intel DC P3520? The rpool will only be for running VMs, and we are willing to spend some money on these drives to get the ZFS performance up.
 
spirit's link to Sebastien Han's homepage is your guide (Chuck Norris approved), please have a look at it. Buy the drive that is on the "good" list and has the best price/GB ratio for you, or your favourite brand/manufacturer.
 
Yes, thus the "For my UPS backed & well-backed-up servers, I have the following settings:" and the
"And the controversial one - which of course made the biggest difference... :)" parts of my message.

Yes, but that risk does include software or hardware crashes, which a UPS does not protect against. I would not recommend using settings that even the manpage marks as dangerous and tells you not to use.
 
