New installation with ZFS RAID 1 - what about swap?

Feb 2, 2021
Hello everyone,

I am in the process of planning a new server at Hetzner: it comes with 2x 1 TB NVMe drives, which I would like to use as a ZFS RAID-1 directly from the setup.

I ran through this yesterday in VMware Workstation. If I select ZFS RAID-1 directly in the setup, the server is installed without swap.
Now I think I have two options:
- A swapfile under /. But I think I've read that you shouldn't do this under ZFS because of the load on the NVMe drives...
- Reserve some space at the end of the NVMe drives in the Proxmox installer, then create a small software RAID over two partitions in Debian and use that as swap.

How do the pros do it?

Thanks for your ideas!

Bastian
 
I would not worry too much about writes from swap; swap usage should be fairly low. If you have a lot of swap usage, you should add more RAM.
 
Code:
root@pxh1:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           64220       40554        7142         430       17448       23666
Swap:          17256        2730       14525
root@pxh1:~# cat /etc/sysctl.conf|grep swap
vm.swappiness = 5
root@pxh1:~# uptime
 10:08:17 up 18 days, 13:59,  1 user,  load average: 0.32, 0.49, 0.55

RAM usage is not maxed out, but even with a low swappiness setting, there's always some swap in use. Nothing to worry about?
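For reference, a minimal sketch of how to set a low swappiness yourself (the value 5 is just what I use here, adjust as needed):

Code:
# apply immediately (not persistent across reboots)
sysctl vm.swappiness=5

# make it persistent
echo "vm.swappiness = 5" >> /etc/sysctl.conf
sysctl -p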
 
- A swapfile under /. But I think I've read that you shouldn't do this under ZFS because of the load on the NVMe drives...
It should be avoided, but because of this:
This isn't great because of the feedback loop where swapping will cause ZFS to write more to the RAM, which will increase swapping, which will cause ZFS to write even more to RAM, which will increase swapping ... until you OOM.
See for example here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189

- Reserve some space at the end of the NVMe drives in the Proxmox installer, then create a small software RAID over two partitions in Debian and use that as swap.
Still not sure if this is OK. It would be great if one of the staff (@Stoiko Ivanov, @wbumiller) could finally answer whether this is OK or not. It looks like swap on Linux software RAID (mdraid) can degrade the array, but I'm not sure whether this is only the case when running guest disks on top of mdraid or also when using mdraid for swap on the host:
Using mdraid (mdadm)/lvm-mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption).
and adds complexity when it comes to replacing a broken disk.

Currently I'm not aware that this particular problem would affect using mdraid for swap; however, I haven't read all code paths regarding swapping in the kernel.
The first thing that comes to my mind when talking about md/dm raid is this: The default caching mode for VMs we use is none, that is, the VMs will access disks with the O_DIRECT flag. The MD/DM raid implementations, for this case, will simply forward a pointer to the memory region to each individual block device, and each of those will copy the data from memory separately. If a 2nd thread is currently writing to that data, the underlying disks will sooner or later write different data, immediately corrupting the raid.[1]

In [1] I also mentioned a real case where this can happen: an in-progress write to swap happening for memory that is simultaneously freed, so the swap entry is already discarded while the disk I/O is still happening, causing the raid to be degraded.

Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches, consistency be damned, completely disregarding the one job that, for example, a RAID1 has: actually writing the *same* data on both disks...

And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.

If you take care not to use this mode, though, you'll be fine with it, but you'll be using some more memory.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
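For illustration, a sketch of how one could check and change the cache mode of a VM disk with the qm CLI (the VM ID 100 and the disk/volume names are placeholders, not taken from this thread):

Code:
# show the current disk line of a VM (ID 100 is a placeholder)
qm config 100 | grep scsi0

# switch the disk to writeback caching so it is no longer opened with O_DIRECT;
# reuse the storage:volume part reported by the command above
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback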
 
It should be avoided, but because of this:

Still not sure if this is OK. It would be great if one of the staff (@Stoiko Ivanov, @wbumiller) could finally answer whether this is OK or not. It looks like swap on Linux software RAID (mdraid) can degrade the array, but I'm not sure whether this is only the case when running guest disks on top of mdraid or also when using mdraid for swap on the host:

Running swap on an mdadm RAID 1 on the last partition of the NVMe drives has been working here on multiple servers for at least two years, without a single hiccup.
Of course you have more manual work to do if an NVMe dies, but if it is well documented I don't see a problem.
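For reference, such a setup looks roughly like this (device and partition names are just examples, assuming the installer left unpartitioned space at the end of both drives):

Code:
# create a RAID1 over the two spare partitions at the end of the NVMe drives
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p4 /dev/nvme1n1p4

# format it as swap and enable it
mkswap /dev/md0
swapon /dev/md0

# make the array and the swap entry persistent
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
echo "/dev/md0 none swap sw 0 0" >> /etc/fstab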

But I wondered if I just got lucky, or if there are much smarter ways to deal with SWAP.
 
But I wondered if I just got lucky, or if there are much smarter ways to deal with SWAP.
Besides the already mentioned NVMe points and splitting ZFS and OS/swap with e.g. a local hardware RAID, no. On some machines, I also run only with zram and no physical swap at all.
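For the zram route, a minimal sketch using zramctl from util-linux (size and compression algorithm are just examples, and this is not persistent across reboots; for that you'd use e.g. a systemd unit or zram-generator):

Code:
# load the zram module and configure one device as compressed swap in RAM
modprobe zram
zramctl /dev/zram0 --size 8G --algorithm zstd
mkswap /dev/zram0

# give it a higher priority than any disk-backed swap
swapon --priority 100 /dev/zram0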
 
