Best way to set up a swap partition?

Dunuin

Distinguished Member
Jun 30, 2020
Germany
Hi,

Right now I'm writing a tutorial on how to best set up an encrypted PVE node. The open question is how to best set up the encrypted swap.

As far as I can see there are 5 options, and none of them is really great:

Option 1.)
Just a LUKS-encrypted swap partition on a single disk. Not that great, because there is no redundancy. As soon as the disk containing the swap partition fails, all swapped-out data is lost and the PVE node, or at least some guests, may crash.

Option 2.)
Swap partition on top of a LUKS container on top of an mdadm raid1. This is how I do it right now. Is there any new info on this?
Using mdraid (mdadm)/lvm-mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption), and adds complexity when it comes to replacing a broken disk.

Currently I'm not aware that this particular problem would affect using a mdraid as swap, however I didn't read all codepaths regarding swapping in the kernel.
Edit: And here is a newer post, so it looks like this is still not a great solution:
The first thing that comes to my mind when talking about md/dm raid is this: The default caching mode for VMs we use is none, that is, the VMs will access disks with the O_DIRECT flag. The MD/DM raid implementations, for this case, will simply forward a pointer to the memory region to each individual block device, and each of those will copy the data from memory separately. If a 2nd thread is currently writing to that data, the underlying disks will sooner or later write different data, immediately corrupting the raid.[1]

In [1] I also mentioned a real case where this can happen: An in-progress write to swap happening for memory that is simultaneously freed, therefore the swap entry is already discarded while the disk I/O is still happening, causing the raid to be degraded.

Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches, consistency be damned, completely disregarding the one job that for example a RAID1 has: actually writing the *same* data on both disks...

And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.

If you take care not to use this mode, you'll be fine with it though, but you'll be utilizing some more memory.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
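
To make the quoted explanation more concrete, here is a minimal toy model in Python. It is not kernel code and does not touch any real disks; the function names and the byte patterns are made up for illustration. It only shows why two independent copies from the same live buffer can diverge, while copying from a single snapshot cannot.

```python
def mirror_write(buffer: bytearray, mutate_between_copies: bool):
    """Toy model: each RAID1 leg copies from the same live buffer on its own."""
    # Leg A reads the caller's memory directly, no private snapshot is taken.
    disk_a = bytes(buffer)
    if mutate_between_copies:
        # Stand-in for a second thread changing the page while I/O is in flight.
        buffer[0:4] = b"BBBB"
    # Leg B reads the same memory later and may see different contents.
    disk_b = bytes(buffer)
    return disk_a, disk_b


def snapshot_mirror_write(buffer: bytearray):
    """Toy model of the buffered case: snapshot once, write that to both legs."""
    snapshot = bytes(buffer)
    buffer[0:4] = b"BBBB"  # a concurrent modification no longer matters
    return snapshot, snapshot


if __name__ == "__main__":
    a, b = mirror_write(bytearray(b"AAAAAAAA"), mutate_between_copies=True)
    print("legs identical without snapshot:", a == b)
    a, b = snapshot_mirror_write(bytearray(b"AAAAAAAA"))
    print("legs identical with snapshot:", a == b)
```

The snapshot variant corresponds to the quoted remark that avoiding this mode is safe but costs some extra memory.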

Option 3.)
Use a zvol as swap on an encrypted ZFS mirror. This isn't great because of the feedback loop where swapping will cause ZFS to write more to RAM, which will increase swapping, which will cause ZFS to write even more to RAM, which will increase swapping ... until you OOM.
See for example here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189

Option 4.)
Use a hardware raid card. But this isn't great, as PCIe slots are scarce here and I don't like to waste them on just a mirrored swap when everything else is running with ZFS on HBAs.

Option 5.)
Pseudo hardware raid of the mainboard's chipset. This isn't great either, as it combines the downsides of software and hardware raid without the benefits.


No swap at all might also be a choice, but I like my nodes to have enough swap space plus "vm.swappiness=0". That way the swap won't be used during normal operation, but PVE can still make use of it when it would otherwise have to OOM-kill some guests. And disks may fail, and will fail sooner or later.
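
To check whether swap really stays unused in normal operation, the active swap areas can be read from /proc/swaps. Below is a small Python sketch; the helper names are made up for illustration, and it assumes the usual column order of /proc/swaps (Filename, Type, Size, Used, Priority, with sizes in KiB).

```python
from pathlib import Path


def parse_proc_swaps(text: str) -> list:
    """Parse the contents of /proc/swaps into one dict per active swap area."""
    areas = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        areas.append({
            "device": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return areas


def swap_used_fraction(areas: list) -> float:
    """Return used/total over all swap areas, or 0.0 if there is no swap."""
    total = sum(a["size_kib"] for a in areas)
    used = sum(a["used_kib"] for a in areas)
    return used / total if total else 0.0


if __name__ == "__main__":
    swappiness = Path("/proc/sys/vm/swappiness")
    swaps = Path("/proc/swaps")
    if swappiness.exists() and swaps.exists():
        print("vm.swappiness =", swappiness.read_text().strip())
        areas = parse_proc_swaps(swaps.read_text())
        print("swap areas:", [a["device"] for a in areas])
        print("used fraction: %.3f" % swap_used_fraction(areas))
```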

So, any recommendations on which of these 5 options would be the most reliable? Or is there even a sixth option I can't see?
 
I am using LUKS-encrypted disks below a volume group (LVM). AFAIK a VG can create a simple mirror, so technically that should be an alternative to your ZFS mirror.
I don't like any of these ideas though, because they all add multiple layers of complexity.
Anyway, maybe this idea helps ;)
 
Does someone on the staff maybe know whether btrfs or LVM would be better for an encrypted mirrored swap?

I'm writing a tutorial on how to do a PVE installation with full system encryption, and I'm now at the point where I need to decide what storage to use for the swap. Right now it looks like using no swap is the only option, because every software raid1 can cause either system crashes or degraded arrays. And swap without software raid1 is not an option either, as this would make the whole raid1 useless when the server still crashes because swapped-out data is lost when a disk fails.
 
Did you ever manage to decide on this? Perhaps staff replied, or development has moved on and provided new answers since?
Were you able to complete the tutorial/guide?
 
There might be even another option. It just popped into my mind.
Swap does not necessarily need to be a partition. You could also use a file as the swap device.
Maybe this is something easier to accomplish?
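
As a rough sketch of what the swap file route usually involves, the Python snippet below only builds and prints the typical command sequence (dd, chmod, mkswap, swapon) and does not execute anything. The path and size are hypothetical placeholders, and filesystem-specific requirements for swap files are not covered here.

```python
import shlex


def swapfile_commands(path: str, size_mib: int) -> list:
    """Return the usual commands for creating and enabling a swap file."""
    if size_mib <= 0:
        raise ValueError("size_mib must be positive")
    return [
        # Allocate the file with real blocks by writing zeros to it.
        ["dd", "if=/dev/zero", f"of={path}", "bs=1M", f"count={size_mib}"],
        # Swap files should only be readable and writable by root.
        ["chmod", "600", path],
        # Write the swap signature, then activate the file as swap.
        ["mkswap", path],
        ["swapon", path],
    ]


if __name__ == "__main__":
    # Hypothetical example values: a 4 GiB swap file at /var/swapfile.
    for cmd in swapfile_commands("/var/swapfile", 4096):
        print(shlex.join(cmd))
```

Printing instead of running keeps the sketch safe to try; the commands would need root and a filesystem that supports swap files.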
 
Just to mention something that has not been mentioned in this thread: zram / zswap. I use this very often and seldom rely on "real" disk swap.
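
For anyone who wants to see how much zram actually saves, the kernel exposes per-device statistics in /sys/block/zramN/mm_stat. The following Python sketch parses the first three fields (orig_data_size, compr_data_size and mem_used_total, all in bytes); the device name zram0 and the helper names are assumptions for illustration.

```python
from pathlib import Path


def parse_mm_stat(text: str) -> dict:
    """Parse the leading fields of a zram mm_stat line (values in bytes)."""
    fields = [int(x) for x in text.split()]
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields in mm_stat")
    return {
        "orig_data_size": fields[0],   # uncompressed size of stored data
        "compr_data_size": fields[1],  # compressed size of stored data
        "mem_used_total": fields[2],   # memory used, including overhead
    }


def effective_ratio(stats: dict) -> float:
    """Uncompressed data per byte of RAM actually used, 0.0 if unused."""
    used = stats["mem_used_total"]
    return stats["orig_data_size"] / used if used else 0.0


if __name__ == "__main__":
    mm_stat = Path("/sys/block/zram0/mm_stat")  # assumes a device named zram0
    if mm_stat.exists():
        stats = parse_mm_stat(mm_stat.read_text())
        print("effective compression ratio: %.2f" % effective_ratio(stats))
    else:
        print("no zram0 device found")
```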
 
Does someone on the staff maybe know whether btrfs or LVM would be better for an encrypted mirrored swap?

I know this is really old, but BTRFS can't do mirrors with swap, and in fact the swap file has to use nodatacow, so you are better off with mdraid.

LVM should work too, but what was wrong with the mdraid?
 
LVM should work too, but what was wrong with the mdraid?
See the first post where I quoted @wbumiller (one of the PVE staff members):
In [1] I also mentioned a real case where this can happen: An in-progress write to swap happening for memory that is simultaneously freed, therefore the swap entry is already discarded while the disk I/O is still happening, causing the raid to be degraded.
But I never got an answer on whether this only happens to swap on virtual disks or to swap on physical disks as well.
 
My understanding was that the solution was already there, it just needed documenting:
https://bugzilla.proxmox.com/show_bug.cgi?id=5235

I did not follow up on whether it made it into the Proxmox docs (supposedly nobody cares to support it), but it is by all means the best option for mirrored swap after hardware raid.
 
Just to mention something that has not been mentioned in this thread: zram / zswap. I use this very often and seldom rely on "real" disk swap.
Do you inter-mix them or are you settled on one over the other? I looked at some of the debates on them and it seems there's no clear winner.
 
A swap file on an encrypted filesystem that is not ZFS. A zswap config will begin to swap earlier, because it uses memory itself, and that memory is then not free when memory allocations from VMs/LXCs should be served without an OOM.
 
I know this started as a thread about swap (and not the lack of it), but am I the only one who actually likes to run swap-less? At the end of the day, the OOM killer will get rid of the "likely" culprit and leave every other well-behaved process running as intended, with no performance impact.
 
One problem might be the missing RAM defragmentation when data is never swapped out from RAM to disk and later read back in a reordered/compacted layout. You might end up with "free" RAM that can't be used because it is too fragmented, and then the OOM killer kicks in while there is still lots of free RAM left that can't be used. But yes, this is usually not such a big problem if you regularly reboot the server to update your kernel. Keep in mind, though, that we have people here with uptimes of 7+ years. ;)
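
One way to look at how fragmented the free memory is: /proc/buddyinfo lists, per zone, how many free blocks of each order (2^order contiguous pages) exist. The Python sketch below, with made-up helper names, sums the free pages overall and the free pages that sit in blocks of at least a given order. It only inspects fragmentation and says nothing about whether swap would improve it.

```python
from pathlib import Path


def parse_buddyinfo(text: str) -> list:
    """Parse /proc/buddyinfo into dicts with node, zone and per-order counts."""
    zones = []
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        zones.append({
            "node": int(fields[1].rstrip(",")),
            "zone": fields[3],
            "counts": [int(x) for x in fields[4:]],
        })
    return zones


def free_pages(counts: list, min_order: int = 0) -> int:
    """Free pages held in blocks of at least 2**min_order contiguous pages."""
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)


if __name__ == "__main__":
    path = Path("/proc/buddyinfo")
    if path.exists():
        for z in parse_buddyinfo(path.read_text()):
            total = free_pages(z["counts"])
            large = free_pages(z["counts"], min_order=4)
            print(f"node {z['node']} {z['zone']}: {total} free pages, "
                  f"{large} of them in blocks of order >= 4")
```

A zone with many free pages in total but almost none in the higher orders is the "free but fragmented" situation described above.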
 
Uptime of 7+ years for migrated VMs, or for the PVE nodes themselves? Either way, that takes good luck with the hardware and a well-prepared environment. :)
If those are node years, you can be sure they won't keep running for that long again ...
 