ZFS SSD Pool with NVMe Cache (ZIL & L2ARC)

andy77

Hi all,

we are currently planning our new Proxmox servers and are thinking about the best-performing and most cost-efficient setup.

We are a bit stuck on the question of putting the ZIL and L2ARC on separate devices.

Here is our currently planned setup:

Xeon 6134 Gold
254GB ECC DDR4 RAM
2 x 60GB SATA SSD - mdadm RAID 1 as system disk
4 x 960GB SATA SSD - ZFS RAID 10 (or two ZFS RAID 1 pools) as VM pool
1 x 128GB M.2 PCIe NVMe - as L2ARC and/or ZIL?

So now the question is: what makes sense to use for caching (write and read)?
Should we use one NVMe as L2ARC and a second NVMe as ZIL? If we use a ZIL device, it should be mirrored, right? And if we use separate NVMe disks for the ZIL, how large should they be (as I read, around 1GB per TB of pool)?

As you can see, many questions and no real answers :)

I would really appreciate any advice on that.

Thank you a lot
 
Hi,

Because you have an SSD pool, in most cases a SLOG (not ZIL, as you write; the ZIL is not the same thing as a SLOG) and an L2ARC on separate devices are not needed. So my advice is to start using PVE without any SLOG/L2ARC. After some time in production you can find out whether you need a SLOG device (for sure you will not need an L2ARC, 99.9% probability).
Maybe it would be more useful to invest in one or two more SSDs (so you can have a 3 x striped mirror instead of only a 2 x striped mirror) and get more IOPS and throughput.
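
For reference, a rough sketch of what those steps could look like on the command line, assuming a pool named tank and placeholder device names (use /dev/disk/by-id paths on a real system):

Code:
# create a striped mirror (the ZFS equivalent of RAID 10) from the four SATA SSDs
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# later: grow it with a third mirror vdev for more IOPS and throughput
zpool add tank mirror /dev/sde /dev/sdf

# only if monitoring shows you really need it: attach the NVMe as SLOG ...
zpool add tank log /dev/nvme0n1
# ... or as L2ARC (a read cache needs no redundancy)
zpool add tank cache /dev/nvme0n1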
 
In addition to @guletz's great comments, I'd also drop the system SSDs and go with one big pool; just add more disks there. I'd also not mix non-ZFS and ZFS disks unless it is absolutely necessary, because you'll end up with multiple disjoint caches for your data.
 
@guletz
Thanks for the thoughts. So later, during test use, how could I determine whether an L2ARC or SLOG would improve performance?

@LnxBil
Hmm, I read somewhere that for booting Linux it is better not to use ZFS RAID 1. That is why I ended up using mdadm RAID 1 for the system.
 
Where did you read that? mdadm is not supported in PVE; ZFS, however, is.
Difficult to say, because I have read a lot of stuff over the last few days :). But somewhere it was suggested not to boot from ZFS and to use an mdadm RAID instead.

So you think that we should not have separate system disks?
(using all disks in one ZFS pool that includes the system and the VM storage)
 
So you think that we should not have separate system disks?
(using all disks in one ZFS pool that includes the system and the VM storage)

Why should you separate this? The only reason I could think of is being able to reinstall PVE without losing everything, but that only applies to the ISO installer and to people who do not know how to fix their Debian/Linux problems.

If you use a separate pool, you have to buy the disks and you will not use them well: PVE does write stuff, but no fast disk is required and you end up with a lot of unused space. PVE on ZFS only uses approx. 2 GB of already compressed space on disk, and the ARC is shared across all pools. If you combine the two, you either save the money, disk slots and electricity, or you could add another two data disks and get more throughput.

I would, however, install zram-config to get zram-compressed swap and drop the ZFS-based swap immediately. Do not use swap on ZFS. This is the only reason, if any, not to use ZFS; just don't use swap there and the problem is gone.
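
As a rough sketch of that swap change, assuming the installer created the usual swap zvol at rpool/swap (package and service names differ per distro: zram-tools/zramswap on Debian, zram-config on Ubuntu):

Code:
# stop and remove the ZFS-backed swap (assumes the default rpool/swap zvol)
swapoff /dev/zvol/rpool/swap
zfs destroy rpool/swap
# also remove the matching swap line from /etc/fstab

# compressed swap in RAM instead
apt install zram-tools
systemctl enable --now zramswap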
 
In addition to @guletz's great comments, I'd also drop the system SSDs and go with one big pool; just add more disks there. I'd also not mix non-ZFS and ZFS disks unless it is absolutely necessary, because you'll end up with multiple disjoint caches for your data.


Thanks a lot. I always appreciate your comments.
 
Thanks for the thoughts. So later, during test use, how could I determine whether an L2ARC or SLOG would improve performance?

Test both variants with the same load, in production, for many weeks, and use a monitoring system to compare the two solutions. This is the only way to be sure. There are many test tools (fio and others), but it is very difficult to emulate your real load with them. I use LibreNMS to compare different solutions. The storage layer is only one aspect of your landscape: with even the most wonderful test tool you cannot emulate CPU/memory/errors/networking/load at the same time for, say, a whole week, for YOUR landscape. Working hours are one thing, the weekend is another story, and so on.
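
On the storage side specifically, OpenZFS also ships a few counters you can watch next to your monitoring system (tool names as shipped with zfsutils-linux; older releases call the first one arc_summary.py):

Code:
# live ARC hit/miss ratio every 5 seconds; a consistently high hit rate
# means an extra L2ARC would add very little
arcstat 5

# one-shot summary of ARC (and L2ARC, if present) statistics
arc_summary

# per-vdev IOPS and bandwidth while the real workload is running
zpool iostat -v 5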


At a basic level, a SLOG is useful if you have applications that do synchronous disk writes (like any database, or NFS). If you do not have such applications, a SLOG will not be useful at all.
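
If you want a rough synthetic approximation of that case, a small fio run with synchronous 4k writes is one way to see how the pool behaves with and without a SLOG (path and sizes are just placeholders; run it against a scratch dataset, not production data):

Code:
fio --name=syncwrite --directory=/tank/fiotest \
    --rw=randwrite --bs=4k --size=1G \
    --iodepth=1 --numjobs=1 --sync=1 \
    --time_based --runtime=60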
 
Be careful installing on SSDs. The Proxmox cluster service writes every 4 seconds. I built a test server with 4 x 500GB SSDs in RAIDZ1, no VMs running, just Proxmox, and within a week it was writing 50GB/day to every disk. So until this problem is fixed, I usually configure 2 SATA disks for Proxmox and low-performance machines, and another SSD-only pool just for the VMs.

Or, if you have a standalone host, you can stop the two services pve-ha-lrm and pve-ha-crm.
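
On a standalone host that boils down to (don't do this on a cluster, HA stops working):

Code:
# single, non-clustered host only: stop and disable the two HA services
systemctl stop pve-ha-lrm pve-ha-crm
systemctl disable pve-ha-lrm pve-ha-crm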
 
Why should you separate this? The only reason I could think of is being able to reinstall PVE without losing everything, but that only applies to the ISO installer and to people who do not know how to fix their Debian/Linux problems.

If you use a separate pool, you have to buy the disks and you will not use them well: PVE does write stuff, but no fast disk is required and you end up with a lot of unused space. PVE on ZFS only uses approx. 2 GB of already compressed space on disk, and the ARC is shared across all pools. If you combine the two, you either save the money, disk slots and electricity, or you could add another two data disks and get more throughput.

I would, however, install zram-config to get zram-compressed swap and drop the ZFS-based swap immediately. Do not use swap on ZFS. This is the only reason, if any, not to use ZFS; just don't use swap there and the problem is gone.

How many systems did you configure to say that?
It is best practice to use separate disks for the system. VM I/O load should never slow down the system or any of its components. SSDs don't cost much anymore, so it's fine this way.
 
@tuonoazzurro Are you sure this is something that happens in a standard environment? I never noticed anything like that.

@DerDanilo Well, that's what I thought too, until LnxBil "convinced" me... or maybe not convinced... :confused:

This is exactly what I wanted, a discussion about the correct way of doing it. ;)
 
@tuonoazzurro Are you sure this is something that happens in a standard environment? I never noticed anything like that.

The problem is real:
https://forum.proxmox.com/threads/proxmox-4-x-is-killing-my-ssds.29732/
https://forum.proxmox.com/threads/high-ssd-wear-after-a-few-days.24840/
https://forum.proxmox.com/threads/zfs-with-ssds-am-i-asking-for-a-headache-in-the-near-future.25967/

The answer is to disable the two services pve-ha-lrm and pve-ha-crm. Obviously that is not a solution for anyone using two or more nodes.
 
I noticed the SSD wear quite a while ago, but since the hoster exchanges the disks if they get damaged, it is fine for me as long as the system is fast and works reliably, which it totally does. :)

This is one of the reasons we decided to use separate SSDs for the system (in a RAID 1) rather than putting them into the ZFS pool.
Swap, the ZFS read (not write) cache (no RAID required), plus some space for ISOs and the like also live on those SSDs.

So the VM data gets dedicated SSD RAIDZ1 pools. RAID 10 might make sense, but we usually start with RAID 1 pools and extend if required (2 x 1.95TB is usually enough for a start).
 
@fabian Would you be so kind as to give us some info from your side about this problem?

Yes, the problem is real for consumer-grade hardware (I also ran into it on a laptop with PVE). Please don't use consumer SSDs with PVE (and no, the Pro version of e.g. Samsung is still consumer grade). If you use enterprise-grade SSDs, this is not a problem (and this is the overall experience of many, many users here in the forum and of the PVE staff, who always refer to this).
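
Whatever SSDs end up in the box, it is worth watching the wear counters, e.g. with smartmontools (device names are placeholders; the exact attribute names differ per vendor):

Code:
apt install smartmontools

# SATA SSD: look at attributes such as Wear_Leveling_Count and Total_LBAs_Written
smartctl -A /dev/sda

# NVMe: "Percentage Used" and "Data Units Written" in the health log
smartctl -a /dev/nvme0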

How many systems did you configure to say that?

About 20 in this manner so far. Often the problem is fitting in as many disks as you want: a standard 1U server can only hold, depending on the model, 4 to 8 disks in the front, so you run into serious space problems.

The problem with mixing an OS disk (e.g. hardware or non-ZFS software RAID) with ZFS is that you end up with multiple disk caches: the default Linux page cache for all block devices, and the ARC from the Solaris compatibility layer for ZFS. That is at least sub-optimal. The same holds for multiple ZFS pools: the idea behind ZFS was to have only one pool that holds everything with all the available speed, so why split it? The ARC gets filled from two pools, and so on. I'm not saying it's bad or a no-go, but it is sub-optimal resource utilisation.
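
If you do end up running ZFS next to non-ZFS storage anyway, one common mitigation is to cap the ARC so the two caches don't fight over RAM; a sketch, with the limit value only as an example (64 GiB):

Code:
# /etc/modprobe.d/zfs.conf -- cap the ARC (value is in bytes)
options zfs zfs_arc_max=68719476736

# make sure the parameter is picked up at boot
update-initramfs -u

# or apply immediately without a reboot
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max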
 