Official solution for SWAP in PM 5.* and ZFS?

Hi,

now that we are aware of the problem with SWAP on ZVOLs, what is the official and supported way to get SWAP in PM 5.3+ when using ZFS?

I have installed a few nodes with 5.2 and 5.1 and have SWAP on ZVOLs.
Should I disable swap there? According to the 5.3 installer, I should.
Should I create a SW RAID 1 partition with mdadm and put SWAP there?
But that configuration will not be officially supported anymore, right?
 
Or we could use mirrored LVM devices...

I really want an official opinion on how to do it in this use case.
 
Why do you want to have SWAP on disks at all? You can use zram if you really need swap space. It's faster and does not require disk space.
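A minimal manual setup could look roughly like this (device size and swap priority are just placeholder assumptions, adjust to your RAM; a packaged service such as zram-tools does the same thing automatically):

Code:
# load the zram module with a single device
modprobe zram num_devices=1
# size of the compressed swap device (example value only)
echo 4G > /sys/block/zram0/disksize
mkswap /dev/zram0
# give it a higher priority than any disk-backed swap, so it is used first
swapon -p 100 /dev/zram0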
 
Will look into zram. Thank you for the hint.

I need swap space for the same reasons we have always used SWAP and for the same reasons SWAP was invented.
Basically, when there is high memory pressure (for various reasons, be it that we cannot add more RAM, app problems, etc.) and we want to avoid invoking the kernel's OOM-killing logic. While swapping drastically reduces speed, at least it allows us to shut down gracefully. Also, seldom-used parts of memory can get swapped out so we can use RAM where it's needed most.

I still wonder what is the official position on the matter.
All questions i asked still stand unanswered.
Hopefully we will get an official response. :)
 
Normally, your ARC will be purged until the minimum is reached before you start to swap, so you lose ZFS performance before actually swapping out. This means that if you swap, you will already have reduced ZFS performance. If you then swap to ZFS, it'll be even slower.

You can test this by monitoring your ARC and starting memory-intensive applications.
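For example, something along these lines lets you watch the ARC shrink while a memory hog runs (the tool name varies between ZoL versions, e.g. arcstat.py on older releases; the /proc path is always available):

Code:
# live ARC statistics, refreshed every 5 seconds
arcstat 5
# or read the raw counters directly
grep -E '^(size|c_min|c_max)' /proc/spl/kstat/zfs/arcstats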
 
Now I understand what you are saying.
I always set zfs_arc_min and zfs_arc_max, so I know how much memory ZFS will use and how performant it will be.
SWAP is sometimes (depending on the swappiness sysctl) populated with less-used pages even before the system runs out of RAM.

Anyway, we went a bit off topic, and an official answer is still needed.
I wonder why they ignore the question...

In the meantime, I will start by enabling swap on MD RAID mirrored devices on new installs, and probably remove SWAP from the ZVOLs on all older installs, replacing it with the same MD RAID setup. Luckily I have SSD disks for cache and log, which have lots of free space, and I will just put SWAP there.
 
An "official" recommendation for handling swap with ZFS depends on your needs and your acceptance of potential downtime:

* You can install enough (ECC) memory in your system, which, given that ZFS performance is very tightly linked to having enough memory available, is generally a good idea. However, for some users this comes at a prohibitive cost.

* You can create swap space directly on a block device/partition and use that (the installer has been adapted to make it possible to leave empty space on a disk). The downside here is that you lose the swap space and the data it contains if the disk it resides on breaks, which most likely will lead to a crash/downtime. I personally would probably use a fast enterprise SSD, monitor its wearout, and live with the risk of downtime when it fails.
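A rough sketch of what that could look like, assuming /dev/sdX4 stands in for the free space left by the installer:

Code:
# create and activate swap on the spare partition (placeholder device name)
mkswap /dev/sdX4
swapon /dev/sdX4
# persist it in /etc/fstab, preferably by UUID
echo "UUID=$(blkid -s UUID -o value /dev/sdX4) none swap sw 0 0" >> /etc/fstab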

So YMMV as to the best approach w.r.t. swapping in general (apart from trying to avoid it altogether, if at all possible).

Apart from that, best practices change over time (e.g. not using ZVOLs for swap anymore because of ZoL code changes), and posts giving official recommendations still get quoted a lot despite being completely out of date ;)

Hope that helps!
 
This is what I just did:
Code:
root@p28:/var/log# cat /etc/fstab | grep swap
/dev/md/swap none swap sw 0 0
root@p28:/var/log# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sdd3[1] sdc3[0]
      8380416 blocks super 1.2 [2/2] [UU]
Beforehand I installed mdadm; sdc and sdd are Intel DC SSDs.
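For reference, a mirror like that could roughly be created as follows (the partition names are taken from the output above; double-check them against your own layout before running anything):

Code:
apt install mdadm
# build a RAID 1 mirror named "swap" from the two SSD partitions
mdadm --create /dev/md/swap --level=1 --raid-devices=2 /dev/sdc3 /dev/sdd3
mkswap /dev/md/swap
swapon /dev/md/swap
# persist the array definition and rebuild the initramfs
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u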

This is how I avoid downtime in case one of the SSDs fails before wearing out.
I could do it with LVM mirroring also.

@Stoiko Ivanov
Why would you not rather "officially" recommend putting SWAP on a SW RAID mirror like I did, or on mirrored LVM?
Am I missing something? Am I on a path to future problems because I am using SW RAID?
 
That is easy: mdadm is officially not supported. There are two supported RAID options: ZFS and hardware RAID.
This sums it up more or less.

Using mdraid (mdadm)/LVM mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption), and it adds complexity when it comes to replacing a broken disk.

Currently I'm not aware that this particular problem would affect using an mdraid as swap; however, I haven't read all the code paths regarding swapping in the kernel.
 
I know :-/, I battled with ZFS along the way and it works fine for us now.
We needed replication and differential backups, which only ZFS brings.
 
Sadly, md RAID is not officially supported, while ZFS is definitely not production ready, IMHO.

The thing is ... do you need an "officially supported" setup? You wrote that you have used mdadm for decades, so you probably won't need any help with that. I personally don't care what is supported and what is not, but that is just my personal opinion. I'm able to fix my own bugs, and I assume you are too. As I wrote recently in your other thread, try to use it on bigger machines (mainly more disks) and it'll grow on you. The features outweigh any possible slowness by far.
 
LnxBil, it was Kurgan who wrote that about using mdadm for decades, but it is true for me as well.
I am happy with ZFS. :)
 
I understand that, performance aside, you managed to "tame" ZFS so it does not crash the host when it eventually ends up eating all of the available RAM.

What did you do?

I mean, what's your tuning procedure for a freshly installed PVE with ZFS? (I assume you are using RAIDZ-1 as disk configuration)

I have tried (in older versions of PVE)

zfs set primarycache=metadata rpool/swap
zfs set logbias=throughput rpool/swap

These settings should be the default as of PVE 5, if I remember correctly.

I have also tried setting swappiness to 1 or to 0 on the host to prevent a (supposed) runaway condition where using swap causes ZFS to allocate more RAM, thus making the issue worse (you try to swap out memory, and you end up using more RAM instead of less). This is an issue I have read about here on this forum.
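For completeness, this is roughly how swappiness can be changed (the value and the file name are just examples):

Code:
# take effect immediately
sysctl -w vm.swappiness=1
# persist across reboots (file name is arbitrary)
echo "vm.swappiness = 1" > /etc/sysctl.d/99-swappiness.conf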

I have not yet tried to set options zfs zfs_arc_max and options zfs zfs_arc_min in /etc/modprobe.d/zfs.conf. I also don't know what values should be used (how to calculate them based on disk space and available RAM).
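As a sketch only, an /etc/modprobe.d/zfs.conf for, say, a 32 GB host capping the ARC at 8 GB might look like this (the values are placeholders in bytes; the right numbers depend on your RAM and workload):

Code:
# 2 GiB minimum, 8 GiB maximum ARC (example values only)
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_max=8589934592

With ZFS on root, the initramfs has to be rebuilt afterwards (update-initramfs -u) and the host rebooted so the limits apply at boot.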
 
What did you do?

Honestly: Nothing.

I'm using ZFS on over 10 systems now, ranging from a Raspberry Pi over laptop/desktop systems and multiple LUKS-encrypted single-node PVE servers on the internet to two-figure-TB pools, all without any crashes in recent years due to ZFS memory problems. They're all rock solid, and not all of them use the PVE ZFS packages; some use the ZFS included in Debian Stretch and Stretch-Backports.

On some systems I tune the zfs_arc_max and min settings to get MORE memory to ZFS, because "only" half is often not enough for a big ZFS-only system (without virtualization).

One side question: did you enable deduplication? The only time I experienced massive OOM was with deduplication enabled. Be aware that once enabled, it cannot be completely disabled; only recreating the pool gets rid of it.
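Checking is quick, for example:

Code:
# dedup property per dataset and the pool-wide dedup ratio
zfs get -r dedup rpool
zpool list -o name,size,alloc,dedupratio rpool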
 
I am really baffled. I have not enabled dedup. I have just installed PVE from its ISO image, setting up the disks to use RAIDZ-1. I am running 5 servers in 5 different environments. No clustering, no "fanciness" at all, just simple single servers with local storage, on different hardware, with 16 to 64 GB RAM. All of the servers using ZFS experience issues with RAM management. The ones that use LVM on hardware RAID (installed from the PVE ISO), or md RAID 1 with simple ext4 (installed with plain Debian 9 and then the PVE repositories), have never shown any issues at all with RAM management, OOM, etc.

I have just checked on one of my servers and dedup is off, while compression is on. And this is the default PVE installation, because the only thing I have done is:

zfs set primarycache=metadata rpool/swap
zfs set logbias=throughput rpool/swap

And I did that after the OOM reboot issues appeared, not before.
 
Kurgan, ZFS is slow for us too when used on just a few HDDs (fast enough only with SSDs or many HDDs), but it is stable once the ARC max is limited and the host does not swap too much, which one can set up using the swappiness sysctl parameter. But I moved away from SWAP on ZVOLs to SWAP on MD RAID where I actually need SWAP.
 
