2. NVMe is not an _actual_ requirement for good/great SLOG performance. There are plenty of systems or budgets where that's not an option, and you can get blazing fast performance without NVMe.
Agreed, and I don't think I ever said that it was a requirement; the above is a straw man you've introduced. For any synchronous random write workload, adding an SSD SLOG to spinning disks is going to increase performance hugely, even if it's SATA. But that doesn't mean NVMe won't bring still better performance, which I thought was the question here?
Furthermore PCIe NVMe devices are typically NOT going to be hot-swap, so that kind of an avenue has its own pitfalls.
Fair point, for situations where the server is not in a cluster and planned downtime windows are not available, and particularly where the SLOG is not mirrored.
3. There are plenty of situations where the latency difference between NVMe and SATA SSD is irrelevant or unnoticeable, so typically it's just not worth the added cost.
Well, yes. Anything that's CPU-intensive, read-I/O-intensive, or dominated by async writes, for a start. Mind you, those situations won't require a SLOG anyway. For situations where your workload is predominantly I/O-bound on small synchronous writes (i.e. the main SLOG use case), I'd be surprised if NVMe didn't make at least some difference. Maybe you have benchmarks which show otherwise, in which case I'd be interested to see them.
Whether it's worth the cost is, of course, dependent on the user. In all cases I'd suggest running application performance tests and analysis over general advice and synthetic benchmarks before making decisions. But if you genuinely believe that this is the right *general* advice, maybe you should hop over to the FreeNAS forums and try to convince them to change the hardware recommendations doc?
4. SSD write performance is way more important than latency for SLOG functionality, as lower write speeds will increase any wait time a hypervisor or other system would be doing for sync writes.
There are at least three main dimensions of write performance, with the importance of each dependent on workload: throughput, random IOPS and write latency. Which did you mean? Without that, the statement above is meaningless. Of these, I understand that write latency is the main one for SLOG, as the access pattern is small sequential writes that block the calling process. You're highly unlikely to hit throughput limits on the interface with the small writes that get directed to the SLOG: in my benchmarks below with an SM863a, total write throughput is 3.3MB/s with 4k sync writes, nowhere near the theoretical throughput of either the drive or the SATA interface it's on.
Latency is basically *defined as* wait time, so I just don't understand the statement above.
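If it helps, a quick way to see what a SLOG device is actually doing during a sync-heavy test is to watch the log vdev directly rather than reasoning from interface specs. A minimal sketch, assuming a pool called tank (the pool name is a placeholder):
Code:
# Per-vdev bandwidth and IOPS, refreshed every second; the 'logs' section
# shows what is actually being written to the SLOG device
zpool iostat -v tank 1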
5. Comparison between sync on and off is a fallacy for benchmarking. You're fooling yourself by doing that. Benchmarking ZFS is nothing like benchmarking other storage systems.
Of course there are some gotchas with benchmarking ZFS. Sequential read benchmarking with dd if=/dev/zero is a very bad idea because the data is perfectly compressible and ZFS does compression, so you end up testing your CPU rather than your pool. With random read benchmarks you need to make sure your working set matches your intended workload, otherwise you won't be stressing the right combination of ARC, L2ARC and spinning disks. But I'm not aware of any specific difficulties with benchmarking random small writes - could you share? I'm genuinely interested.
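To illustrate what I mean, here's a rough sketch (paths and sizes are placeholders, not what I ran): fio fills its buffers with effectively incompressible data by default, so it sidesteps the /dev/zero compression trap, and --size lets you choose a working set that deliberately does or does not fit in ARC.
Code:
# Bad: /dev/zero is perfectly compressible, so with compression enabled this
# largely measures the CPU rather than the pool
dd if=/dev/zero of=/tank/test/zeros bs=1M count=4096

# Better: fio's buffers are not trivially compressible, and the working set
# size is explicit
fio --name=readtest --ioengine=libaio --rw=randread --bs=4k \
    --size=16g --runtime=120 --time_based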
pveperf is admittedly a very basic test, so here are some more test results from inside a VM using fio, with sync enabled and disabled on the host dataset, showing a very significant difference in IOPS (846 vs 3565) below for a 4K fsync'd random write scenario - feel free to provide specific criticism of the methodology. This is on an old server (an E5-5645, first generation, with DDR3-1333 RAM); if anything I'd expect the difference to be greater on a modern CPU, as setting sync=disabled basically takes a critical, I/O-bound code path and makes it CPU-bound by lying to the calling process.
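For reference, the sync property on the host dataset backing the VM disk was toggled along these lines (pool/dataset names are placeholders):
Code:
zfs set sync=standard tank/vmdata   # default: honour sync write requests
zfs set sync=disabled tank/vmdata   # treat sync writes as async (testing only)
zfs get sync tank/vmdata            # confirm the current setting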
Sync enabled:
Code:
martin@sync:~$ sudo fio --name=synctest --ioengine=libaio --rw=randwrite --bs=4k --size=1g --fsync=1
synctest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3388KB/0KB /s] [0/847/0 iops] [eta 00m:00s]
synctest: (groupid=0, jobs=1): err= 0: pid=2196: Fri Jul 14 14:58:28 2017
  write: io=1024.0MB, bw=3387.7KB/s, iops=846, runt=309534msec
    slat (usec): min=8, max=2437, avg=15.82, stdev=10.87
    clat (usec): min=0, max=174, avg= 1.46, stdev= 1.06
     lat (usec): min=9, max=2440, avg=17.89, stdev=11.03
    clat percentiles (usec):
     | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
     | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 2],
     | 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 2],
     | 99.00th=[ 2], 99.50th=[ 3], 99.90th=[ 11], 99.95th=[ 25],
     | 99.99th=[ 34]
    bw (KB /s): min= 2104, max= 4296, per=100.00%, avg=3390.48, stdev=229.09
    lat (usec) : 2=57.96%, 4=41.56%, 10=0.38%, 20=0.02%, 50=0.07%
    lat (usec) : 100=0.01%, 250=0.01%
  cpu : usr=1.51%, sys=7.83%, ctx=711000, majf=0, minf=11
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=3387KB/s, minb=3387KB/s, maxb=3387KB/s, mint=309534msec, maxt=309534msec

Disk stats (read/write):
    dm-0: ios=0/675027, merge=0/0, ticks=0/278808, in_queue=278936, util=90.19%, aggrios=0/675214, aggrmerge=0/75442, aggrticks=0/277328, aggrin_queue=276896, aggrutil=89.50%
  vda: ios=0/675214, merge=0/75442, ticks=0/277328, in_queue=276896, util=89.50%
Sync disabled:
Code:
martin@async:~$ sudo fio --name=synctest --ioengine=libaio --rw=randwrite --bs=4k --size=1g --fsync=1
synctest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/16332KB/0KB /s] [0/4083/0 iops] [eta 00m:00s]
synctest: (groupid=0, jobs=1): err= 0: pid=2083: Fri Jul 14 15:01:25 2017
  write: io=1024.0MB, bw=14264KB/s, iops=3565, runt= 73514msec
    slat (usec): min=8, max=975, avg=11.30, stdev= 8.29
    clat (usec): min=0, max=96, avg= 1.09, stdev= 0.79
     lat (usec): min=9, max=979, avg=12.86, stdev= 8.43
    clat percentiles (usec):
     | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
     | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
     | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 2], 95.00th=[ 2],
     | 99.00th=[ 2], 99.50th=[ 2], 99.90th=[ 5], 99.95th=[ 16],
     | 99.99th=[ 29]
    bw (KB /s): min= 6680, max=17464, per=100.00%, avg=14271.10, stdev=1485.76
    lat (usec) : 2=88.88%, 4=10.99%, 10=0.07%, 20=0.02%, 50=0.03%
    lat (usec) : 100=0.01%
  cpu : usr=3.84%, sys=21.62%, ctx=768221, majf=0, minf=11
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=14263KB/s, minb=14263KB/s, maxb=14263KB/s, mint=73514msec, maxt=73514msec

Disk stats (read/write):
    dm-0: ios=0/560378, merge=0/0, ticks=0/56192, in_queue=56252, util=76.52%, aggrios=0/560633, aggrmerge=0/18169, aggrticks=0/55176, aggrin_queue=54892, aggrutil=74.59%
  vda: ios=0/560633, merge=0/18169, ticks=0/55176, in_queue=54892, util=74.59%
I stress again, though: I'm not recommending sync=disabled for any production workload, as the performance improvement comes at the expense of data integrity and durability.
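If you do experiment with it, it's worth checking afterwards that nothing was left disabled; something along these lines does the job (pool name is a placeholder):
Code:
# Show the sync setting for every dataset in the pool
zfs get -r sync tank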