[SOLVED] ZFS Raid 10 with 4 SSD and cache...SLOW.

GarrettB

I had a RAID 10 ZFS pool over 4 HDDs with an SSD cache, and I thought, "I wish I had known about ZFS sooner."

So I put together a RAID 10 ZFS pool over 4 SSDs (and later added an SSD cache), and now I'm thinking to myself: what did I miss?

[Attached screenshot: upload_2018-12-30_23-38-16.png]

I have compression on. These numbers are worse than the 7200 RPM drives!
HP SSD S700 500GB drives x 4.

I created the first mirror, and then added the second mirror.
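For reference, the commands were roughly along these lines (the pool and device names below are placeholders, not my exact setup):
Code:
zpool create YOURPOOL mirror /dev/sda /dev/sdb   # first mirror vdev
zpool add YOURPOOL mirror /dev/sdc /dev/sdd      # second mirror vdev, added later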

Individually (before using them in the pool), I have seen read/write speeds of around 90MB/s from one SSD to another.

I created an mdadm RAID 0 across partitions from two different SSDs and was very pleased with the speed (I understand the risk of RAID 0). I thought RAID 10 on ZFS would deliver a similar experience.
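That mdadm RAID 0 was built with something like this (device and partition names are placeholders):
Code:
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda4 /dev/sdb4   # stripe across one partition from each SSD
mkfs.ext4 /dev/md0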

For root I have a Crucial 240GB SSD partitioned, and it shows:
[Attached screenshot: upload_2018-12-30_23-48-13.png]

For comparison (I think I ran it right) the zpool:
[Attached screenshot: upload_2018-12-30_23-48-38.png]


What can I check? I am so ignorant on this.
PS: I also picked up a LSI MegaRAID SAS card but I haven't tried it out yet.
 
What SSD model do you use as a cache?

A 40GB partition on the Crucial root drive, but it didn't really change the performance numbers. I think it's an MX200.
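If it matters, the cache was added with something like this (pool name and partition are placeholders):
Code:
zpool add YOURPOOL cache /dev/sdc4   # 40GB partition on the root SSD used as L2ARC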

There's something about ZFS I don't understand. Before, I had all the data on just one HP SSD, and of course the IO was bottlenecked, but it was still faster than with 4 drives on ZFS. This part I don't get.

RAID 0 with mdadm was noticeably faster, and I had it across two 50GB partitions from different drives. I thought that if I threw in a mirror of the same, it would be like RAID 10 on ZFS, plus I could replace a drive, etc., and all the good things.

I know these are relatively low-performing drives compared to what's out there, but I thought there wouldn't be anything in ZFS that would make them worse, relatively speaking, compared to the same hardware in different setups.

Is that thinking correct, or does ZFS introduce something different than mdadm?
 
I think I might have just started to understand my problematic thinking.

Maybe I was expecting a RAID01 on ZFS?

And of course I have a RAID 10. I have backups of regular backups... My goal was to boost speed without a single drive failure bringing the whole thing down. I understand the trade-off of striping being double the probability of failure... and with a mirror added, it's no riskier than having everything on one drive, but it would be faster.

I think I was imagining a RAID01. Can this be achieved with ZFS?

Edit: please ignore the last question. Rather, is there a way to achieve what I would like with better performance and some kind of in-place, temporary solution? (I know you're thinking "get a better drive"), but for learning's sake, RAID 01 seems to be it... but I read that it is no faster than RAID 10??

An explanation in simple terms would be great. I already understand the limitations of the hardware, so I'm after explanations "relative" to the same hardware, for learning's sake, thanks. Why is my experience with the ZFS RAID 10 so unlike what I expected?
 
You may want to check out:
https://www.phoronix.com/scan.php?page=article&item=freebsd-12-zfs&num=1

Every filesystem has a use case where it shines. If you're looking for raw sequential throughput, no CoW filesystem is going to compete with ext4 in RAID0. You can try these safe tuning options that may improve performance:
Code:
zfs set compression=lz4 YOURPOOL
zfs set atime=off YOURPOOL
zfs set xattr=sa YOURPOOL

Can you tolerate ~5 seconds of data loss if your server crashes, loses power, etc.? If you can, this will significantly improve performance:
Code:
zfs set sync=disabled YOURPOOL

Note: There is no risk of filesystem corruption, only data loss, and only if the server stops unexpectedly.
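If it helps, you can check the current values afterwards and switch sync back to the default at any time, e.g.:
Code:
zfs get compression,atime,xattr,sync YOURPOOL   # verify the settings
zfs set sync=standard YOURPOOL                  # revert to the default sync behaviour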
 

Thanks,

atime was already off, but I changed the other noted settings. It all seems to have made a slight difference.

Thanks for the link. It helps to describe what I'm experiencing.
 
When you have an all-SSD zpool, your L2ARC is useless. Remove it. You're only capping the ARC.

I don't really see the actual problem, to be honest. Do you experience latency in some VMs? Slow write IO?
You're also referring to zpool iostat, but what were the numbers with the HDD pool, and was it the same workload?
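Removing the cache device should be something along these lines (substitute your pool name and the actual cache partition):
Code:
zpool status YOURPOOL              # the cache device is listed under "cache"
zpool remove YOURPOOL /dev/sdX4    # detach the L2ARC; pool data is untouched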
 
The numbers with the HDD ZFS RAID 10 pool were probably around 70-100MB/s.

I would need someone to confirm whether this is a relevant issue, but from what I have observed, when reading off the SSD zpool and writing to another drive I'm sometimes getting ~70MB/s and sometimes ~210MB/s, depending on what data is being read.

The point: when I said earlier that I created a mirror and then added another mirror, I failed to mention that in between those two steps I added data to the first mirror. So, is it possible the striping is off? The two mirrors were never evenly striped. Over time, as I move data around between folders on the pool itself, the speeds are getting slower and slower.

I think I should have waited to write data until the second mirror was added. Yes? Or, does this not matter?
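(If someone can confirm: I believe something like this shows how full and how busy each mirror vdev is, which should reveal whether the data ended up unevenly spread:)
Code:
zpool list -v YOURPOOL       # per-vdev ALLOC / FREE / CAP columns
zpool iostat -v YOURPOOL 5   # per-vdev read/write activity every 5 seconds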

At the moment, I'm moving to a mdadm setup to compare.

I also managed to get an old LSI card going, and other than it running hot while I wait for a fan to arrive, it worked! I succeeded in installing and setting up a RAID card for the first time (yay for me), using the old spinners in a RAID 10, and they were producing around 70-80MB/s with no issues.
 
from what I have observed, when reading off the SSD zpool and writing to another drive I'm sometimes getting ~70MB/s and sometimes ~210MB/s, depending on what data is being read

How did you observe the speed, what speed do you expect, and with what workload?

I'm sorry, but can you post an actual scenario? Copy one file from A to B and observe with zpool iostat.
Also, please tell us your current config - do you have an HBA? arc_summary output? The usual ZFS stuff :)
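Something along these lines would already help (adjust the pool name to yours):
Code:
zpool status -v YOURPOOL      # pool layout and health
arc_summary                   # ARC / L2ARC statistics
zpool iostat -v YOURPOOL 1    # watch per-vdev throughput while copying a file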
 
Here's where I ended up, for what it's worth:

I proceeded to create a Raid01 using mdadm with the 4 SSDs, and it now shows:
[Attached screenshot: upload_2019-1-4_14-20-12.png]
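Roughly how the array was put together (device names are placeholders): two RAID 0 stripes, mirrored on top.
Code:
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb   # first stripe
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd   # second stripe
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md0 /dev/md1   # mirror the two stripes (RAID 0+1)
mkfs.ext4 /dev/md2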


The other 4 spinners ended up on the LSI RAID card, which worked great once I figured out the commands:
[Attached screenshot: upload_2019-1-4_14-23-4.png]
When I was transferring large files and folders to the LSIRaid mount, it was moving around 120MB/s which I thought was pretty good. But man, that LSI 9260-8i card gets hot!


I also realized I never trimmed the root, so I did that and the numbers appear better than before:
[Attached screenshot: upload_2019-1-4_14-23-59.png]
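The trim itself was just something like:
Code:
fstrim -v /   # discard unused blocks on the root filesystem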


In the end, not currently using ZFS but a happy camper nonetheless.
 
You have to start performance testing right. Please use fio and read about sequential and random disk performance - both can be done right here in the forums.
 
That sounds great - can you suggest the fio parameters? I have used it, but am not sure what the most appropriate settings are. The examples I found online are for rigs with higher specs, and when I searched the Proxmox forum postings regarding fio, everyone just says "use fio", but no examples are given.

Or maybe direct me to an example?
 
The examples I found online are for rigs with higher specs

The thing with benchmarks is that they have to use the same settings to be comparable, so there are no special settings for higher-spec rigs. The settings are normally sequential read/write and random read/write with block sizes of 4K and 8K. All dd "tests" are mostly useless, because ZFS with the PVE default settings does not actually write zeros (compression optimizes them away), so you end up benchmarking your memory and CPU instead of the disks. pveperf is a good tool to get I/O response times that works for all PVE users without any further knowledge, but fio is the one tool to rule them all.

There are plenty of examples in the forum, e.g. this one:
https://forum.proxmox.com/threads/io-delay-probleme.49382/#post-230609
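For example, a 60-second 4K random write test and a matching read test against a file on the pool could look roughly like this (the filename, size, and exact parameters here are only an illustration, not the exact command from the linked thread):
Code:
fio --name=randwrite --filename=/path/to/testfile --size=4G \
    --rw=randwrite --bs=4k --ioengine=psync --sync=1 \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
fio --name=randread --filename=/path/to/testfile --size=4G \
    --rw=randread --bs=4k --ioengine=psync \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting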
 
I am starting to understand fio better, thanks for the link.

When I took one of the drives offline from the LSI Raid I discovered the alarm works quite nicely! But the kids are in bed at the moment, so I will attend to that test in the future.

I ran 60-second read and write tests.
LSIRaid:
Code:
fio: (groupid=0, jobs=1): err= 0: pid=32283: Sat Jan  5 20:05:48 2019
  write: io=1286.8MB, bw=21960KB/s, iops=5490, runt= 60001msec
    slat (usec): min=5, max=2206, avg= 7.72, stdev= 4.45
    clat (usec): min=140, max=392256, avg=173.04, stdev=716.39
     lat (usec): min=147, max=392265, avg=180.76, stdev=716.42
    clat percentiles (usec):
     |  1.00th=[  149],  5.00th=[  151], 10.00th=[  153], 20.00th=[  155],
     | 30.00th=[  157], 40.00th=[  159], 50.00th=[  161], 60.00th=[  161],
     | 70.00th=[  163], 80.00th=[  167], 90.00th=[  173], 95.00th=[  187],
     | 99.00th=[  237], 99.50th=[  490], 99.90th=[ 1960], 99.95th=[ 2064],
     | 99.99th=[ 8512]
    lat (usec) : 250=99.24%, 500=0.26%, 750=0.04%, 1000=0.03%
    lat (msec) : 2=0.36%, 4=0.05%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (msec) : 500=0.01%
  cpu          : usr=1.23%, sys=7.08%, ctx=329441, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=329409/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

fio: (groupid=0, jobs=1): err= 0: pid=5975: Sat Jan  5 20:41:47 2019
  read : io=5093.3MB, bw=86923KB/s, iops=21730, runt= 60001msec
    slat (usec): min=4, max=48, avg= 6.63, stdev= 1.23
    clat (usec): min=10, max=16689, avg=38.07, stdev=54.37
     lat (usec): min=29, max=16693, avg=44.70, stdev=54.40
    clat percentiles (usec):
     |  1.00th=[   26],  5.00th=[   26], 10.00th=[   27], 20.00th=[   28],
     | 30.00th=[   28], 40.00th=[   29], 50.00th=[   29], 60.00th=[   30],
     | 70.00th=[   35], 80.00th=[   47], 90.00th=[   58], 95.00th=[   87],
     | 99.00th=[   92], 99.50th=[   95], 99.90th=[  112], 99.95th=[  117],
     | 99.99th=[  322]
    lat (usec) : 20=0.01%, 50=83.74%, 100=16.01%, 250=0.24%, 500=0.01%
    lat (usec) : 750=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=4.94%, sys=25.03%, ctx=1303935, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=1303873/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

MDADMRaid01:
Code:
fio: (groupid=0, jobs=1): err= 0: pid=2434: Sat Jan  5 20:20:54 2019
  write: io=3448.2MB, bw=58845KB/s, iops=14711, runt= 60001msec
    slat (usec): min=12, max=19687, avg=20.09, stdev=25.27
    clat (usec): min=30, max=21248, avg=46.47, stdev=110.08
     lat (usec): min=48, max=21263, avg=66.56, stdev=112.98
    clat percentiles (usec):
     |  1.00th=[   36],  5.00th=[   37], 10.00th=[   41], 20.00th=[   42],
     | 30.00th=[   42], 40.00th=[   43], 50.00th=[   43], 60.00th=[   43],
     | 70.00th=[   44], 80.00th=[   44], 90.00th=[   48], 95.00th=[   62],
     | 99.00th=[   80], 99.50th=[   88], 99.90th=[  137], 99.95th=[  446],
     | 99.99th=[ 5728]
    lat (usec) : 50=91.78%, 100=7.92%, 250=0.24%, 500=0.01%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=3.54%, sys=36.64%, ctx=882952, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=882691/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

fio: (groupid=0, jobs=1): err= 0: pid=4888: Sat Jan  5 20:35:15 2019
  read : io=6891.1MB, bw=117621KB/s, iops=29405, runt= 60001msec
    slat (usec): min=1, max=101, avg= 7.59, stdev= 2.22
    clat (usec): min=0, max=5259, avg=25.18, stdev=12.59
     lat (usec): min=2, max=5273, avg=32.77, stdev=13.95
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[   23], 20.00th=[   25],
     | 30.00th=[   26], 40.00th=[   27], 50.00th=[   27], 60.00th=[   27],
     | 70.00th=[   27], 80.00th=[   28], 90.00th=[   28], 95.00th=[   31],
     | 99.00th=[   52], 99.50th=[   55], 99.90th=[   57], 99.95th=[   62],
     | 99.99th=[  258]
    lat (usec) : 2=9.40%, 4=0.01%, 10=0.01%, 20=0.01%, 50=89.29%
    lat (usec) : 100=1.27%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=6.71%, sys=35.22%, ctx=1598542, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=1764337/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

How's that look?

EDIT: I should note that LSIRaid is 4 WD Blue 250GB 7200rpm drives. MDADMRaid01 is 4 HP S700 500GB SSDs.
 
