ZFS vs other filesystems using an NVMe SSD as cache

2. NVMe is not an _actual_ requirement for good/great SLOG performance. There's plenty of systems or budgets where that's not an option, and you can get blazing fast performance without NVMe.

Agreed, and I don't think I ever said that it was a requirement; the above is a straw man you've introduced. For any synchronous random write workload, adding an SSD SLOG to spinning disks is going to increase performance hugely, even if it's SATA. But that doesn't mean NVMe won't bring still better performance, which I thought was the question here?

Furthermore, PCIe NVMe devices are typically NOT going to be hot-swap, so that kind of avenue has its own pitfalls.

Fair point for situations where the server is not in a cluster and planned downtime windows are not available, and particularly where the SLOG is not mirrored.

3. There are plenty of situations where the latency difference between NVMe and SATA SSD is irrelevant or unnoticeable, so typically it's just not worth the added cost.

Well, yes. Anything that's CPU-intensive, read-I/O-intensive, or dominated by async writes, for a start. Mind you, those situations won't require a SLOG anyway. For situations where your workload is predominantly I/O bound on small synchronous writes (i.e. the main SLOG use case), I'd be surprised if NVMe didn't make at least some difference. Maybe you have benchmarks which show otherwise, in which case I'd be interested to see them.

Whether it's worth the cost is, of course, dependent on the user. In all cases I'd suggest running application performance tests and analysis over general advice and synthetic benchmarks before making decisions. But if you genuinely believe that this is the right *general* advice, maybe you should hop over to the FreeNAS forums and try to convince them to change the hardware recommendations doc?

4. SSD write performance is way more important than latency for SLOG functionality, as lower write speeds will increase any wait time a hypervisor or other system would be doing for sync writes.

There are at least three main dimensions of write performance, with the importance of each dependent on workload: throughput, random IOPS and write latency. Which did you mean? Otherwise the statement above is meaningless. Of these, I understand that write latency is the main one for a SLOG, as the access pattern is small sequential writes that block the calling process. You're highly unlikely to hit throughput limits on the interface with the small writes that get directed to the SLOG. In my benchmarks below with an SM863a, total write throughput is 3.3 MB/s with 4k sync writes, nowhere near either the theoretical throughput of the drive or the SATA interface it's on.
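As a quick sanity check on that figure (simple arithmetic using the numbers from the benchmark further down; the SATA ceiling quoted is the interface's theoretical limit, not a measured value):

Code:
# 846 IOPS x 4 KiB per write ~= 3.3 MB/s of log traffic
# SATA 3 tops out around 600 MB/s, so the interface is well under 1% utilised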

Latency is basically *defined as* wait time, so I just don't understand the statement above.

5. Comparison between sync on and off is a fallacy for benchmarking. You're fooling yourself by doing that. ZFS benchmarking is nowhere near typical to benchmarking other storage systems.

Of course there are some gotchas with benchmarking ZFS. Sequential read benchmarking with dd if=/dev/zero is a very bad idea, because the data is perfectly compressible and ZFS does compression, so you end up testing your CPU rather than your ZFS pool. With random read benchmarks you need to make sure your working set matches your intended workload, otherwise you won't be stressing the right combination of ARC, L2ARC and spinning disks. But I'm not aware of any specific difficulties with benchmarking random small writes - could you share? I'm genuinely interested.
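For what it's worth, fio sidesteps the compressible-data trap because it fills its write buffers with pseudo-random data by default. A minimal sketch of a sequential test on the pool (the /tank/bench path and 4g size are placeholders; make the size comfortably larger than ARC if you want the read pass to actually hit the disks):

Code:
# lay the file down with incompressible contents, then read it back
fio --name=seqtest --filename=/tank/bench/fio.dat --rw=write --bs=1M --size=4g --end_fsync=1
fio --name=seqtest --filename=/tank/bench/fio.dat --rw=read --bs=1M --size=4g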

pveperf is admittedly a very basic test. So here are some more test results from inside a VM using fio, with sync enabled and disabled on the host dataset, showing some very significant differences in IOPS (846 vs 3565) below for a 4K fsync'd random write scenario - feel free to provide specific criticism of the methodology. This is on an old server - E5-5645 first generation with DDR3-1333 RAM. If anything I'd expect the difference to be greater on a modern CPU, as basically setting sync=disabled takes a critical process that's I/O-bound and makes it CPU-bound by lying to the calling process.

Sync enabled:

Code:
martin@sync:~$ sudo fio  --name=synctest --ioengine=aio --rw=randwrite --bs=4k --size=1g --fsync=1
synctest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3388KB/0KB /s] [0/847/0 iops] [eta 00m:00s]
synctest: (groupid=0, jobs=1): err= 0: pid=2196: Fri Jul 14 14:58:28 2017
  write: io=1024.0MB, bw=3387.7KB/s, iops=846, runt=309534msec
    slat (usec): min=8, max=2437, avg=15.82, stdev=10.87
    clat (usec): min=0, max=174, avg= 1.46, stdev= 1.06
     lat (usec): min=9, max=2440, avg=17.89, stdev=11.03
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    2],
     | 70.00th=[    2], 80.00th=[    2], 90.00th=[    2], 95.00th=[    2],
     | 99.00th=[    2], 99.50th=[    3], 99.90th=[   11], 99.95th=[   25],
     | 99.99th=[   34]
    bw (KB  /s): min= 2104, max= 4296, per=100.00%, avg=3390.48, stdev=229.09
    lat (usec) : 2=57.96%, 4=41.56%, 10=0.38%, 20=0.02%, 50=0.07%
    lat (usec) : 100=0.01%, 250=0.01%
  cpu          : usr=1.51%, sys=7.83%, ctx=711000, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=3387KB/s, minb=3387KB/s, maxb=3387KB/s, mint=309534msec, maxt=309534msec

Disk stats (read/write):
    dm-0: ios=0/675027, merge=0/0, ticks=0/278808, in_queue=278936, util=90.19%, aggrios=0/675214, aggrmerge=0/75442, aggrticks=0/277328, aggrin_queue=276896, aggrutil=89.50%
  vda: ios=0/675214, merge=0/75442, ticks=0/277328, in_queue=276896, util=89.50%

Sync disabled:

Code:
martin@async:~$ sudo fio  --name=synctest --ioengine=aio --rw=randwrite --bs=4k --size=1g --fsync=1
synctest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/16332KB/0KB /s] [0/4083/0 iops] [eta 00m:00s]
synctest: (groupid=0, jobs=1): err= 0: pid=2083: Fri Jul 14 15:01:25 2017
  write: io=1024.0MB, bw=14264KB/s, iops=3565, runt= 73514msec
    slat (usec): min=8, max=975, avg=11.30, stdev= 8.29
    clat (usec): min=0, max=96, avg= 1.09, stdev= 0.79
     lat (usec): min=9, max=979, avg=12.86, stdev= 8.43
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    2], 95.00th=[    2],
     | 99.00th=[    2], 99.50th=[    2], 99.90th=[    5], 99.95th=[   16],
     | 99.99th=[   29]
    bw (KB  /s): min= 6680, max=17464, per=100.00%, avg=14271.10, stdev=1485.76
    lat (usec) : 2=88.88%, 4=10.99%, 10=0.07%, 20=0.02%, 50=0.03%
    lat (usec) : 100=0.01%
  cpu          : usr=3.84%, sys=21.62%, ctx=768221, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=262144/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=14263KB/s, minb=14263KB/s, maxb=14263KB/s, mint=73514msec, maxt=73514msec

Disk stats (read/write):
    dm-0: ios=0/560378, merge=0/0, ticks=0/56192, in_queue=56252, util=76.52%, aggrios=0/560633, aggrmerge=0/18169, aggrticks=0/55176, aggrin_queue=54892, aggrutil=74.59%
  vda: ios=0/560633, merge=0/18169, ticks=0/55176, in_queue=54892, util=74.59%

I stress again, though - I'm not recommending sync=disabled for any production workloads, as the performance improvement comes at the expense of data integrity and durability.
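For reference, the setting being compared above is a per-dataset property; a quick sketch (the dataset name rpool/data is just an example, and sync=standard is the default):

Code:
# check the current setting
zfs get sync rpool/data
# honour sync requests as the application asked (default)
zfs set sync=standard rpool/data
# acknowledge sync writes immediately - benchmarking/testing only, unsafe for real data
zfs set sync=disabled rpool/data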
 
Hi gvalverde, I will write you a guide ASAP. For now I would say it works fine; I'm very satisfied.

I'd really appreciate your guide; it would help me a lot in setting this up. I've been trying lately, but this is an advanced subject that would benefit from some guidance on the commands (steps) to follow.
 
Hi everyone;

I know it's more or less an old topic, but as pointed out by @mir:

ZIL, also called SLOG

should be at least mirrored (RAID1, for people who speak mdadm).
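For completeness, adding a mirrored log device to an existing pool looks roughly like this (the pool name tank and the device paths are placeholders; in practice use stable /dev/disk/by-id paths):

Code:
# attach a mirrored SLOG
zpool add tank log mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B
# log vdevs can be removed again later with 'zpool remove' (check the vdev name in 'zpool status')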

But how do you calculate the size of this?
First: the ZIL is used temporarily as a cache for hard-drive transactions.
A few facts about it:
  1. the data is only kept there for about 5 seconds;
  2. the size should be calculated from your I/O bandwidth;
  3. by default it is not used by every transaction type;
  4. it is worth it for speeding up VMs and databases (but again, you have to configure it);
So ZILs are typically very small. Just 1-2 GB of ZIL storage is already overkill for a server with Gb LAN (see the rough worked example below).
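As a rough illustration of that bandwidth-based sizing (assuming the default 5-second transaction group interval and a fully saturated link; your workload will differ):

Code:
# 1 Gb/s LAN ~= 125 MB/s of incoming sync writes
# 125 MB/s x 5 s = 625 MB per transaction group
# x2 for groups in flight ~= 1.25 GB, so a couple of GB of SLOG already covers gigabit LAN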

ref:
  1. http://www.freenas.org/blog/zfs-zil-and-slog-demystified/
  2. https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/
  3. https://www.45drives.com/wiki/index.php?title=FreeNAS_-_What_is_ZIL_&_L2ARC
 
ZIL, also called SLOG


This is wrong. The ZIL is not the same thing as the SLOG. The ZIL is a short-term area on each vdev member (if you do not have a SLOG device) where any sync write will land. After that, the data from the ZIL is written again into the pool, so every sync write is written twice (ZIL and pool).
If a SLOG is present, then the only place sync writes land is the SLOG, not any other vdev. The ZIL data from the SLOG is then written to the pool like any other async write.
 
The ZIL is not the same thing as the SLOG. The ZIL is a short-term area
My bad :)
So how do you calculate your ZIL size then?
As 45drives points out, based on your I/O bandwidth? Because to me that sounds more like it applies to the SLOG?

  • ZFS Intent Log, or ZIL- A logging mechanism where all of the data to be written is stored, then later flushed as a transactional write. Similar in function to a journal for journaled filesystems, like ext3 or ext4. Typically stored on platter disk. Consists of a ZIL header, which points to a list of records, ZIL blocks and a ZIL trailer. The ZIL behaves differently for different writes. For writes smaller than 64KB (by default), the ZIL stores the write data. For larger writes, the data is not stored in the ZIL; instead, the ZIL maintains pointers to the synched data that is stored in the log record.
  • Separate Intent Log, or SLOG- A separate logging device that caches the synchronous parts of the ZIL before flushing them to slower disk. This would either be a battery-backed DRAM drive or a fast SSD. The SLOG only caches synchronous data, and does not cache asynchronous data. Asynchronous data will flush directly to spinning disk. Further, blocks are written a block-at-a-time, rather than as simultaneous transactions to the SLOG. If the SLOG exists, the ZIL will be moved to it rather than residing on platter disk. Everything in the SLOG will always be in system memory.
ref: https://pthree.org/2012/12/06/zfs-administration-part-iii-the-zfs-intent-log/
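On a ZFS-on-Linux host you can look at the related knobs directly; a small sketch (the dataset name tank/data is just an example, and the 64 KB threshold quoted above maps, as far as I understand it, onto the zfs_immediate_write_sz module parameter):

Code:
# per-dataset hint: latency (use the SLOG) vs throughput (bypass it)
zfs get logbias tank/data
# size threshold (bytes) below which sync write data is embedded in the log record
cat /sys/module/zfs/parameters/zfs_immediate_write_sz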
 
