ZFS vs other filesystems using NVMe SSD as cache

tane

Member
Mar 27, 2017
10
1
8
45
Hi People,

We would like to move our KVM host to a single Proxmox server on OVH.

But we are wondering which filesystem we should set up as primary.

This is the server configuration:

Code:
CPU: Intel Xeon E5-1650v3 - 6c/12t - 3.5 GHz/3.8 GHz
RAM: 128 GB DDR4 ECC 2133 MHz
Disks: 2x 2 TB SATA + 2x 450 GB NVMe SSD


Our idea is to use software RAID 1 with ZFS, with the NVMe SSDs as L2ARC cache and ZIL, or to use some other setup that increases the qcow2 performance of the KVM host using the NVMe SSDs.

Recommendations?

Keep in mind this is a single server, not an HA system.

Thank you.
 
Depending on the workload of the disks, what about two pools? The speed difference between the two is really huge. Do you have spare time to test configurations? How much main memory will you give to ZFS out of the 128 GB?

If you have, I'd also suggest using a compressed, deduplicated pool on the SSDs to get the most capacity out of them, but whether deduplication pays off depends strongly on the number of systems you want to virtualize; e.g. if it'll be hundreds of Debian LX(C) containers, you would save a lot of space there.
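
For reference, a minimal sketch of what that would look like on ZFS; the pool name "ssdpool" is just a placeholder:

Code:
# Enable lz4 compression and deduplication on the SSD pool (pool name is a placeholder)
zfs set compression=lz4 ssdpool
zfs set dedup=on ssdpool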
 
@Lnxbill thanks for the reply.

Well, we are thinking of testing this in detail. For now, probably about 32 GB to 48 GB max for ZFS. Unfortunately we cannot use LXC; we need Docker inside VMs.
So we have two options: do RAID 1 on the NVMe drives and run the more IO-intensive apps there, or run ZFS with a 200 GB ZIL / 200 GB L2ARC on the NVMe drives.

But this needs to be tested, I am sure.
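
For reference, capping the ARC at e.g. 32 GiB on ZFS on Linux is usually done via the zfs_arc_max module parameter (value in bytes); a sketch:

Code:
# /etc/modprobe.d/zfs.conf -- cap the ARC at 32 GiB (34359738368 bytes)
options zfs zfs_arc_max=34359738368

# or change it at runtime without a reboot:
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max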
 
Recommendations?

Your approach sounds a bit backwards to me. You talk about hardware (which can be changed) instead of your workload (which is a given). So why not tell us how many VMs you have, what the sizes of your virtual disks are, what kind of IO is expected on them, and how much data you can safely lose in case of a hardware failure, so we can give you ideas on how to build your server in terms of hardware and software.

I can only give you some general advice about ZFS: the more spinning disks the better. In our experience a 4-disk RAID10 was very slow on ZFS for just a few VMs, so I can't imagine a 2-disk mirror being useful. Also, the ZIL + L2ARC are not a very effective all-around disk cache; they will need a lot of manual tuning to actually become useful, and even that's not guaranteed. You are much better off using only SSDs as your main pool (mirrored, if you need the redundancy) and using the hard drives for backups or a secondary pool (but expect low performance).
 
Well, that was the question. I thought ZFS would not benefit much from NVMe but was not sure. Mostly we run a lot of small Docker containers, about 50+, at 768 MB per container. They are small in memory and do not have big data IO usage, only some peaks where we need high IO so that one container finishes as fast as possible; in most cases they sit and wait.

LXC is not an option, we need Docker unfortunately. Also, RAID 1 is required; backups are done to remote NFS.

So this is my idea, not sure if it's OK:


The SLOW ZFS pool would consist of:
2x 2 TB HDD -> VMDATA, RAID 1 (mirror)
- ZIL: NVMe SSD, 50 GB, RAID 1 -- write cache
- L2ARC: NVMe 1 SSD, 25 GB -- read cache
- L2ARC: NVMe 2 SSD, 25 GB -- read cache
- ZFS memory limit: 32 GB

The rest of the NVMe would be a mirror for the fast Docker containers / VMs / databases, in software RAID 1 with qcow2.

Hope this clears things up a little.

I am not sure this is a good idea, but I am open to suggestions.
 
ZIL should be mirrored too to prevent data loss. So I would do:
- L2ARC NVME SSD 50 GB -- Read Cache
- ZIL NVME1 SSD 25 GB -- Mirrored Write Cache
- ZIL NVME2 SSD 25 GB -- Mirrored Write Cache
 
Hi Mir,
The ZIL is mirrored, 2x 50 GB; I wrote RAID 1:
ZIL NVMe SSD 50 GB RAID 1 <--- maybe I was not clear
 
Yes, I saw that after my reply. But you don't need 50 GB for ZIL. 25 GB is more than adequate since the size of ZIL will peak around 5 - 10 GB.
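
One way to verify that on a running pool is to watch how much of the log device is actually allocated; a sketch, assuming the pool is called SLOWPOOL:

Code:
# Per-vdev statistics, including the log device, refreshed every 5 seconds
zpool iostat -v SLOWPOOL 5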
 

Yes, I saw that after my reply. But you don't need 50 GB for ZIL. 25 GB is more than adequate since the size of ZIL will peak around 5 - 10 GB.

Mirroring the ZIL and using L2ARC from both SSDs is how it should be done. But I'm not sure how on Earth the ZIL would ever reach even 10 GB when used in a pool of 2 mirrored hard drives. I would say a 5 GB ZIL (mirrored) would be more than enough; even that would cache 25+ seconds of 200 MB/s sync writes (which is closer to 100-120 MB/s in reality on a HDD). If you have more sync writes than that, use the faster pool.

Another problem I see with this model is that if you partition your SSDs, ZFS will have to mirror partitions instead of whole drives for your fast pool, which is not advised (although possible).

So anyway, here is my recommendation: partition one SSD in Debian installer or GParted LIVE first, install Debian and Proxmox after (not sure if the Proxmox ZFS installer can work with partitions), then create the slow and fast pools later.

40 GB mirrored: root
10 GB on both: swap (do not put swap on ZFS, it's unstable even with tweaks)
10 GB mirrored: ZIL for SLOWPOOL
40 GB on both: L2ARC for SLOWPOOL (80 GB of L2ARC all together)
300 GB mirrored: FASTPOOL
50 GB on both: leave unallocated for performance/reliability (not necessary if using enterprise SSD)

SLOWPOOL will be your mirrored HDDs; try to tweak the L2ARC options, because it might not do much caching by default. You also want to test sync=always for your SLOWPOOL; it might help performance, since then all writes (not just sync writes) would go to the ZIL first and the pool later.
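
A rough sketch of how that layout could be assembled afterwards; all device names, partition numbers, and pool names below are assumptions and depend on how you actually partition the SSDs:

Code:
# Slow pool on the two mirrored HDDs
zpool create SLOWPOOL mirror /dev/sda /dev/sdb

# Mirrored SLOG (ZIL) on the 10 GB partitions, L2ARC striped across both 40 GB partitions
zpool add SLOWPOOL log mirror /dev/nvme0n1p3 /dev/nvme1n1p3
zpool add SLOWPOOL cache /dev/nvme0n1p4 /dev/nvme1n1p4

# Fast pool mirrored across the two 300 GB partitions
zpool create FASTPOOL mirror /dev/nvme0n1p5 /dev/nvme1n1p5

# Optional: route all writes (not just sync writes) through the ZIL on the slow pool
zfs set sync=always SLOWPOOL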
 
Hi Guys,
Thank you for your answers. I did the setup exactly as @gkovacs said; it makes the most sense to me.


Here are some benchmarks I did; maybe someone can use them as a reference.

Slow storage, random R/W 75% read, ZIL + L2ARC

Code:
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test6 --filename=test7 --bs=128k --size=10G --readwrite=randrw --rwmixread=75
test6: (g=0): rw=randrw, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1
fio-2.1.11
Starting 1 process
test6: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [1576MB/527.9MB/0KB /s] [12.7K/4223/0 iops] [eta 00m:00s]
test6: (groupid=0, jobs=1): err= 0: pid=955: Fri Apr  7 16:34:47 2017
  read : io=7673.7MB, bw=1617.6MB/s, iops=12940, runt=  4744msec
  write: io=2566.4MB, bw=553956KB/s, iops=4327, runt=  4744msec
  cpu          : usr=2.28%, sys=15.94%, ctx=82001, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=61389/w=20531/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=7673.7MB, aggrb=1617.6MB/s, minb=1617.6MB/s, maxb=1617.6MB/s, mint=4744msec, maxt=4744msec
  WRITE: io=2566.4MB, aggrb=553956KB/s, minb=553956KB/s, maxb=553956KB/s, mint=4744msec, maxt=4744msec

Disk stats (read/write):
  sda: ios=58438/19585, merge=0/0, ticks=2772/996, in_queue=3720, util=80.54%

Fast storage, random R/W 75% read, NVMe SSD only

Code:
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test6 --filename=test7 --bs=128k --size=10G -
test6: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1

fio: option <-readwrite=randrw --rwmixread=75> outside of [] job section
fio-2.1.11
Starting 1 process
test6: Laying out IO file(s) (1 file(s) / 10240MB)

Jobs: 1 (f=1): [R(1)] [75.0% done] [2697MB/0KB/0KB /s] [21.6K/0/0 iops] [eta 00m:01s]
test6: (groupid=0, jobs=1): err= 0: pid=832: Fri Apr  7 16:40:02 2017
  read : io=10240MB, bw=2689.8MB/s, iops=21512, runt=  3808msec
  cpu          : usr=1.58%, sys=18.28%, ctx=81995, majf=0, minf=39
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=81920/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=2689.8MB/s, minb=2689.8MB/s, maxb=2689.8MB/s, mint=3808msec, maxt=3808msec

Disk stats (read/write):
  sda: ios=77311/0, merge=0/0, ticks=2908/0, in_queue=2872, util=77.64%
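
Note that in the second run fio rejected the --readwrite/--rwmixread options ("outside of [] job section"), so it effectively measured a sequential read. For a like-for-like random 75/25 comparison, the options have to stay on the same command line, for example:

Code:
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test6 --filename=test7 \
    --bs=128k --size=10G --readwrite=randrw --rwmixread=75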
 
So anyway, here is my recommendation: partition one SSD in Debian installer or GParted LIVE first, install Debian and Proxmox after (not sure if the Proxmox ZFS installer can work with partitions), then create the slow and fast pools later.


What are the steps to accomplish this partitioning?

I'd really appreciate some guidance on doing this.
 
L2ARC is only going to be useful if you anticipate needing an ARC much larger than the RAM in the system. Otherwise it is useless and a waste of your money and time. If you have 128 GB of RAM available for your ARC (which I think you do?), that's going to give you very large headroom, and it's probably going to be sufficient. But if it isn't, you can always add L2ARC devices later.

SLOG devices are where SSDs can have the biggest immediate impact, by moving ZIL from spinning disk to SSD on a dedicated device. This is primarily useful for sync writes, and/or creating more of a buffer for fast writes in fluctuating demand windows. I'd recommend you look at SLOG before L2ARC, and for SLOG you want PLP, so PM863a's are probably of significant interest to you.
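
As a rough way to check that before buying or partitioning anything (pool name and device below are placeholders):

Code:
# Look at the current ARC size and hit ratio before spending SSD space on L2ARC
arc_summary | less            # may be named arc_summary.py on older installs
cat /proc/spl/kstat/zfs/arcstats   # raw counters as an alternative

# L2ARC can always be added to a live pool later:
zpool add SLOWPOOL cache /dev/nvme0n1p4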
 
I'd also suggest using a compressed, deduplicated pool on the SSDs to get the most capacity out of them, but whether deduplication pays off depends strongly on the number of systems you want to virtualize,

Wrong! Deduplication is only usable if the deduplication table can fit in RAM; if that is not the case, it makes things worse. As a basic rule, use deduplication when your files do not change very often.
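
A rough back-of-the-envelope for that, using the commonly quoted figure of ~320 bytes of RAM per unique block (the pool name is a placeholder):

Code:
# Dedup table (DDT) memory estimate:
#   1 TiB of data at 64 KiB average block size -> ~16.8 million unique blocks
#   16.8M blocks x ~320 bytes                  -> ~5 GB of RAM for the DDT alone
# Simulate dedup on an existing pool to see the expected ratio and table size:
zdb -S ssdpool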
 
ZIL should be mirrored too to prevent data loss

.... if you do not have a UPS. And it is not entirely true. Any ZFS data write goes to RAM first. When that data needs to be flushed, it goes to the vdev devices, and sync writes also go to the ZIL. When the server goes down, at startup ZFS will check whether there are writes in the ZIL that have not yet been written to the pool disks.

So the ZIL is useful only if the server goes down unexpectedly. In any case, by default ZFS flushes its buffers to disk within at most 5 seconds.

So the ZIL space you need is at most roughly your maximum write throughput x 5 seconds.

Another case is sync I/O operations, where data is considered to be on disk only when the ZIL says OK. But in most cases you will have async I/O.
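
A quick worked example along those lines (the throughput figure is hypothetical):

Code:
# Default transaction group flush interval: zfs_txg_timeout = 5 seconds
# A slow pool that sustains ~120 MB/s of sync writes:
#   5 s x 120 MB/s ≈ 600 MB of ZIL in flight at any one time
# so even a few GB of SLOG is generous for a 2-disk HDD mirror.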
 
SLOG devices are where SSDs can have the biggest immediate impact, by moving ZIL from spinning disk to SSD on a dedicated device. This is primarily useful for sync writes, and/or creating more of a buffer for fast writes in fluctuating demand windows. I'd recommend you look at SLOG before L2ARC, and for SLOG you want PLP, so PM863a's are probably of significant interest to you.

Agreed in general, and that you definitely need Power Loss Protection, as the SLOG is only read following an unclean shutdown or crash. If your slog doesn't have PLP, you might as well run without it and with sync=disabled on the ZFS dataset, which will also risk data loss but at least will be much faster.

SLOG is highly write intensive, so if you're looking at the Samsung enterprise SATA drives you might want to consider the SM863a rather than the PM863a, as it is designed for this type of workload. However, SATA has higher latency than NVMe, which is less good for SLOG. So if you can afford NVMe with PLP, go with that.

I've got an SM863a (no spare PCIe slots) and pveperf reports about 2000 fsyncs/s. For lab workloads where I don't care about data, I run sync=disabled and that gets me about 15000 fsyncs/s.
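
A sketch of how such a comparison can be run; the dataset name is a placeholder, and sync=disabled risks losing the last few seconds of writes on a crash:

Code:
pveperf /tank/vmdata                 # reports FSYNCS/SECOND with the SLOG in place
zfs set sync=disabled tank/vmdata    # lab only: sync writes are acknowledged from RAM
pveperf /tank/vmdata                 # re-run to compare
zfs set sync=standard tank/vmdata    # restore the default behaviour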
 
1. I'm not seeing any difference (based on Samsung info) between the PM863a and the SM863a, apart from marketing that one is better for write-intensive workloads. The speeds and IOPS between the two (again based on Samsung docs) show almost identical numbers. So...?
2. Have you actually seen any scenarios where SATA "latency" (this I'm rather dubious about) has actual tangible impacts? I'd like to hear more of what you have to say on this particular detail.

 
.... if you do not have a UPS.

Not absolutely the full picture, as a UPS won't save you if an OS crash occurs.

But you're certainly right on the general point. Even with an unmirrored SLOG, you have to get very unlucky to lose data. You basically need to have a drive failure and an unmanaged shutdown of some form within a very short space of time (seconds).

As soon as ZFS notices that a SLOG device has gone bad it will just revert to writing the ZIL to the main pool.

The real reasons to mirror SLOG are:

a/ If you're running performance critical workloads (typically busy VMs), your sync write speeds will take a major hit if an unmirrored SLOG fails, especially if your main pool uses traditional HDDs rather than SSDs. This may be inconvenient.

b/ The pool might not auto mount on boot until you tell ZFS to ignore the failed SLOG device. That can be the case on FreeBSD - not sure if it happens on Linux.
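
On current ZFS, a failed unmirrored log device can simply be dropped from the pool once it is noticed (pool and device names are assumptions):

Code:
zpool status SLOWPOOL              # the dead log device shows up as FAULTED/UNAVAIL
zpool remove SLOWPOOL nvme0n1p3    # drop it; the ZIL falls back to the main vdevs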
 
1. I'm not seeing any difference (based on Samsung info) between the PM863a and the SM863a, apart from marketing that one is better for write-intensive workloads. The speeds and IOPS between the two (again based on Samsung docs) show almost identical numbers. So...?

Write endurance is the key difference, not speed. See http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/PM863a_and_SM863a_Brochure.pdf . The SM863a is rated for 3 DWPD over 5 years, the PM863 for 1.3 DWPD over 3 years. To be fair, this would only be a worry with very busy servers - if you aren't going to write more than 340TB of small synchronous writes over the lifetime of the drive (assuming the 240GB model) both drives are probably roughly equivalent. Generally, I prefer using a drive that is explicitly designed for write intensive workloads for write intensive workloads, but YMMV :)
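
The ~340 TB figure works out roughly like this, using the brochure ratings quoted above:

Code:
# PM863a 240 GB: 240 GB x 1.3 DWPD x 365 days x 3 years ≈ 341 TB written
# SM863a 240 GB: 240 GB x 3.0 DWPD x 365 days x 5 years ≈ 1.3 PB written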

2. Have you actually seen any scenarios where SATA "latency" (this I'm rather dubious about) has actual tangible impacts? I'd like to hear more of what you have to say on this particular detail.

I haven't personally, no, because I have not had a chance to put SATA and NVMe SLOG drives side by side; SATA is good enough for me on a home budget. But this is fairly standard advice; e.g. the FreeNAS hardware recommendations guide (and they do know their ZFS stuff in that forum) says:

"SLOG devices devices should be high end PCIe NVMe SSDs, such as the Intel P3700. The latency benefits of the NVMe specification have rendered SATA SSDs obsolete as SLOG devices, with the additional bandwidth being a nice bonus."
(https://forums.freenas.org/index.php?attachments/hardware-2016-r1e-pdf.18222/&version=58)

Looking at specs on Intel ARK, the write latency of the DC S3710 is 66 microseconds, and the DC P3700 is 20 microseconds.

I would theoretically expect lower latencies to be a primary driver of SLOG performance. After all, the whole point is to persist a sync write ASAP so that ZFS can ack back to the application. The order of magnitude difference I have observed in pveperf fsync performance between my SM863a slog and sync=disabled suggests that there's likely scope for optimisation of performance at the drive write step if it can be made to go faster.
 
1. Write endurance is one thing I did not find a comparison on. I'll check that out, thanks!
2. I actually provide support for FreeNAS on the IRC channel, so I'm well aware of such things, and NVMe is not an _actual_ requirement for good/great SLOG performance. There are plenty of systems or budgets where that's not an option, and you can get blazing fast performance without NVMe. Furthermore, PCIe NVMe devices are typically NOT going to be hot-swappable, so that avenue has its own pitfalls.
3. There are plenty of situations where the latency difference between NVMe and SATA SSD is irrelevant or unnoticeable, so typically it's just not worth the added cost.
4. SSD write performance is way more important than latency for SLOG functionality, as lower write speeds will increase the time a hypervisor or other system spends waiting on sync writes.
5. Comparing sync on and off is a fallacy for benchmarking; you're fooling yourself by doing that. ZFS benchmarking is nothing like benchmarking other storage systems. Again, I provide support and implementation for these systems.


