my NVMEs suck

zedicus

Renowned Member
I have recently installed a pair of 970 Pro 512GB NVMe drives in a ZFS mirror because I was unhappy with the performance of my SATA SSD drives.

The NVMe drives seem to be SLOWER than the SATA SSDs, and none of my config changes have made any difference.

NVMe drives
Code:
root@serverminion:/# fio --filename=/zfsnvme/fiotest.fio --bs=4k --rw=write --name=test --direct=0 --sync=1 --size=1G --numjobs=1
test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/2710KB/0KB /s] [0/677/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=42366: Thu Jul  5 15:02:58 2018
  write: io=1024.0MB, bw=2831.9KB/s, iops=707, runt=370287msec
    clat (usec): min=963, max=14813, avg=1410.48, stdev=593.34
     lat (usec): min=963, max=14814, avg=1410.87, stdev=593.34
    clat percentiles (usec):
     |  1.00th=[ 1144],  5.00th=[ 1208], 10.00th=[ 1224], 20.00th=[ 1256],
     | 30.00th=[ 1288], 40.00th=[ 1304], 50.00th=[ 1304], 60.00th=[ 1320],
     | 70.00th=[ 1352], 80.00th=[ 1384], 90.00th=[ 1432], 95.00th=[ 1512],
     | 99.00th=[ 5216], 99.50th=[ 5408], 99.90th=[ 5664], 99.95th=[ 5728],
     | 99.99th=[ 7904]
    lat (usec) : 1000=0.01%
    lat (msec) : 2=97.55%, 4=0.08%, 10=2.37%, 20=0.01%
  cpu          : usr=0.14%, sys=2.59%, ctx=1047995, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=2831KB/s, minb=2831KB/s, maxb=2831KB/s, mint=370287msec, maxt=370287msec

SATA drives
Code:
root@serverminion:/# fio --filename=/zfsvm/fiotest.fio --bs=4k --rw=write --name=test --direct=0 --sync=1 --size=1G --numjobs=1
test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/4332KB/0KB /s] [0/1083/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12305: Thu Jul  5 14:55:46 2018
  write: io=1024.0MB, bw=4495.3KB/s, iops=1123, runt=233275msec
    clat (usec): min=687, max=10707, avg=887.93, stdev=371.57
     lat (usec): min=687, max=10707, avg=888.32, stdev=371.57
    clat percentiles (usec):
     |  1.00th=[  716],  5.00th=[  732], 10.00th=[  748], 20.00th=[  764],
     | 30.00th=[  772], 40.00th=[  788], 50.00th=[  812], 60.00th=[  844],
     | 70.00th=[  868], 80.00th=[  908], 90.00th=[  996], 95.00th=[ 1192],
     | 99.00th=[ 3824], 99.50th=[ 3984], 99.90th=[ 4128], 99.95th=[ 4448],
     | 99.99th=[ 8096]
    lat (usec) : 750=13.40%, 1000=76.92%
    lat (msec) : 2=8.32%, 4=0.99%, 10=0.36%, 20=0.01%
  cpu          : usr=0.23%, sys=5.14%, ctx=535863, majf=0, minf=64
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=4495KB/s, minb=4495KB/s, maxb=4495KB/s, mint=233275msec, maxt=233275msec

The NVMe drives deliver a little over half the performance of the SATA drives???
 
Yes, this is normal; consumer NAND is generally not good at 4K sync writes.
NVMe is only the interface, which brings low latency and very good parallel writes.
If you want high 4K sync write performance, you need enterprise NVMe disks.
 
Fair enough; however, I have tried over a half dozen different brands and connection styles of drives, from Intel to Western Digital (some of them with 'enterprise' ratings). I have yet to break 2000 fsyncs in fio or pveperf.
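(For reference, the pveperf number I keep quoting comes from running it against the pool's mountpoint, roughly as below; the path is just my NVMe pool from the test above, and the FSYNCS/SECOND line is the figure I mean.)

Code:
# path is the mountpoint of the pool being tested (example)
pveperf /zfsnvme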

If you could recommend a product in a 4x SATA config OR a 2x NVMe config that will get me to AT LEAST 2000 fsyncs, that would be helpful. I have asked a couple of times, keep getting told 'that is not the correct hardware', and never get a recommendation for what I should be using.

I have not tried Optane yet, mostly due to the size you get for the cost, but at this point I will consider anything.
 
I would take any sort of helpful recommendation.

Does anyone have a current configuration with pveperf data?
 
The server is kind of a do-it-all for a small business environment. It hosts a MySQL database, a complete Windows domain environment, several virtualized desktops, and 3 web servers. A lot of media is also stored on a mini-SAN that connects to and interacts with this environment.

Honestly, FSYNC performance seems to directly correlate with the virtualized desktops' responsiveness. The old server currently in production has a simple Adaptec hardware controller with 4x 10k drives; for its age it handles the workload well. In planning this upgrade I was convinced by other parties to use, as you said, a simple HBA with ZFS for the system. (I have used ZFS on the storage environment for years; with large spinning disks and a few vdevs it handles those tasks fine.)

In my testing and my previous environments, the higher the FSYNC rate, the snappier the virtualized desktops were. Maybe this is because there is a database on the same storage; I do not know why, it is just how it is for me. The new server will have the same design: local storage for the VMs to sit on, and they can connect to the storage network for other tasks.

In loading the new server I had only gotten the domain environment installed and was about halfway through loading the virtualized desktops when I noticed how sluggish they were, and then I started performance testing. Initially I just assumed a bunch of high-end consumer SSDs should be several times faster than a nearly 10-year-old PCIe Gen1 hardware controller.

In my mind that sounds like it should be a super easy question.
 
You might want to look into re-creating the NVMe namespace with a 4K sector size and recreating the partitions with alignment in mind.
SSDs tend to under-perform when they receive unaligned writes.
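If you want to check and, where supported, switch the namespace to a 4K LBA format, a rough sketch with nvme-cli looks like this (the device name and the --lbaf index are placeholders; the drive must actually list a 4K format, and the format command erases the namespace):

Code:
# list the LBA formats the namespace supports; look for one with "Data Size: 4096 bytes"
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
# reformat the namespace to that LBA format index -- THIS DESTROYS ALL DATA
nvme format /dev/nvme0n1 --lbaf=1
# confirm the new logical sector size
cat /sys/block/nvme0n1/queue/logical_block_size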
And since you are testing, try the same test on an XFS- or ext4-formatted partition on a single SSD, then an LVM mirror, then ZFS on a single SSD, and then a mirrored ZFS volume. Each of these tests can expose a potential problem that might be affecting performance.
Make sure to pre-condition the SSD before running any of these tests. SSDs usually perform very fast when they are new or have not received a lot of writes, but then performance falls rapidly and stabilizes at the "steady-state" level.
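A rough sketch of that sequence (device names, mount point and pool name are placeholders, every step is destructive, and each layout assumes the previous one has been torn down first):

Code:
# pre-condition: fill the raw device once with sequential writes before measuring
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M --ioengine=libaio --iodepth=32 --direct=1

# single SSD with ext4 (use mkfs.xfs for the XFS variant)
mkfs.ext4 /dev/nvme0n1 && mount /dev/nvme0n1 /mnt/ssdtest

# ZFS on a single SSD, then a mirrored ZFS pool (-f overwrites leftover filesystems)
zpool create -f -o ashift=12 testpool /dev/nvme0n1
zpool destroy testpool
zpool create -f -o ashift=12 testpool mirror /dev/nvme0n1 /dev/nvme1n1

# re-run the same 4k sync write test from the first post against each layout
fio --filename=/mnt/ssdtest/fiotest.fio --bs=4k --rw=write --name=test --direct=0 --sync=1 --size=1G --numjobs=1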

Cheers,
-Ali
 
Is there a guide for doing an LVM mirror on Proxmox? I have not found a lot of info on it, and it does not seem to be an 'out of the box' Proxmox config.
 
I don't think Proxmox offers an out-of-the-box method for doing an LVM mirror, but it's fairly straightforward.
I'm a new user on this forum and I can't post links, but searching for debian lvm mirror on Google gave a good result living on linoxide dot com that was quite clear and easy to follow.
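In short, the Debian-style steps are roughly the following (disk names, sizes, paths and the storage ID are placeholders; the last line, registering the mounted mirror as directory storage, is just one way to make it usable from Proxmox):

Code:
# two whole disks (or partitions) become PVs in one volume group
pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate nvmevg /dev/nvme0n1 /dev/nvme1n1
# create a mirrored (raid1) logical volume and put a filesystem on it
lvcreate --type raid1 -m1 -L 400G -n vmdata nvmevg
mkfs.ext4 /dev/nvmevg/vmdata
mkdir -p /mnt/nvme-mirror && mount /dev/nvmevg/vmdata /mnt/nvme-mirror
# register it in Proxmox as a directory storage (ID and path are examples)
pvesm add dir nvme-mirror --path /mnt/nvme-mirror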
 
@zedicus: I just tested a standard-grade Samsung NVMe M.2 SSD and it gives 270 fsyncs per second, and as it is nearly 60% full, even the IOPS are far from impressive.

Here is a fio result:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

Which gives us:

Code:
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/28244KB/0KB /s] [0/7061/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=547030: Sat Jul  7 10:16:55 2018
  write: io=2048.0MB, bw=31521KB/s, iops=7880, runt= 66532msec
  cpu          : usr=2.16%, sys=23.40%, ctx=34268, majf=0, minf=118
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=2048.0MB, aggrb=31520KB/s, minb=31520KB/s, maxb=31520KB/s, mint=66532msec, maxt=66532msec

Disk stats (read/write):
    dm-1: ios=3/602116, merge=0/0, ticks=20/9906876, in_queue=9909364, util=99.73%, aggrios=297/524119, aggrmerge=0/78767, aggrticks=2800/3611660, aggrin_queue=3615716, aggrutil=99.67%
  sdi: ios=297/524119, merge=0/78767, ticks=2800/3611660, in_queue=3615716, util=99.67%

Then add an fsync for every I/O, which is a worst-case scenario:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=2G --readwrite=randwrite --fsync=1

I stopped the test as it was going to last forever at 278 IOPS.

Then, about ZFS and synchronous writes: the --direct=1 flag used in the command will return an error on ZFS, as it does not support O_DIRECT. So the cool thing would maybe be to use a SLOG device, so you get an instant ACK for writes, even those with a sync directive. I might test this if I find a little time.
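If I get to it, adding the SLOG would be something like this (assuming a pool named zfsnvme as in the first post, and a spare low-latency SSD or partition as the log device):

Code:
# add a dedicated log (SLOG) device -- device name is an example
zpool add zfsnvme log /dev/nvme2n1
# or mirror the SLOG for safety:
# zpool add zfsnvme log mirror /dev/nvme2n1 /dev/nvme3n1
# it should then appear under "logs" in:
zpool status zfsnvme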

You said you had just put your domain on the server and it was very sluggish. What kind of disk driver do you use for your VMs if it is a Windows Server box?

For a sample config tested this morning: 8x 2.4TB 10K drives in RAIDZ2 gives 1049 fsyncs per second, and 8x 480GB Samsung SSDs in RAIDZ2 gives no more than 1497 fsyncs. Without the fsync directive, IOPS are good.

Best regards
 
For ZFS, don't forget that sync writes are written to the ZFS log (ZIL) device first. If you don't have an external ZFS log device, then the pool itself acts as the log device too, which means double writes.

If sync performance is important and you want good write performance, then add a single good SATA/NVMe SSD as the ZFS pool's log device.
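You can watch where the sync writes land by monitoring the pool during a sync-heavy fio run; with a separate log device, the "logs" section absorbs them instead of the data vdevs taking them twice (pool name is an example):

Code:
zpool iostat -v zfsnvme 1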
 
We ordered 2.5-inch HGST NVMe drives to fit our Intel servers, but they also manufacture PCIe add-in-card drives with even higher throughput.

My fio command was slightly different from yours; it is from benchmarking prior to adding the drives to a Ceph cluster:

Code:
  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test
      numjobs=1         bw=139248KB/s, iops=34812, runt= 60001msec
      numjobs=2         bw=264803KB/s, iops=66200, runt= 60001msec
      numjobs=3         bw=402371KB/s, iops=100592, runt= 60001msec
      numjobs=4         bw=522527KB/s, iops=130631, runt= 60001msec
      numjobs=8         bw=968537KB/s, iops=242134, runt= 60003msec
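Those rows were just the same run repeated with a higher --numjobs each time, roughly like this (writes go straight to the raw device, so it is destructive):

Code:
for j in 1 2 3 4 8; do
  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=$j --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test
done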
 
So I put 4x Intel S3700 drives in a RAIDZ1 and have good performance: 3200 fsyncs and roughly 25% better throughput across the board than the old PCIe hardware array, proving that doing the correct research on the SSD for the task at hand IS in fact quite important.
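For anyone wanting to compare, the pool itself is nothing fancy; roughly this (disk names and pool name are examples, ashift=12 for the usual 4K alignment):

Code:
zpool create -o ashift=12 vmpool raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
# fsync rate checked against the pool's mountpoint
pveperf /vmpool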

Last question: does it matter which VM hard disk bus/driver I use on top of ZFS? All of my backups are VirtIO, but so far it seems that anything will work? Is there a benefit to selecting something else when making a new VM?
 
