Dunuin

Hi,

Right now I'm writing a tutorial on how best to set up an encrypted PVE node. The question now is: how best to set up the encrypted swap?

As far as I can see there are 5 options, and none of them is really great:

Option 1.)
Just a LUKS-encrypted swap partition on a single disk. Not that great, because there is no redundancy. As soon as the disk containing the swap partition fails, all swapped-out data is lost and the PVE node, or at least some guests, may crash.
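For reference, a minimal sketch of what this single-disk variant could look like, assuming a dedicated swap partition addressed by its PARTUUID (the path below is a placeholder; get the real one from `blkid`). The partition is re-keyed with a fresh random key on every boot, so nothing has to be unlocked interactively:

```shell
# /etc/crypttab — re-encrypt the swap partition with a random key on each boot
# (placeholder device path; replace with the real /dev/disk/by-partuuid/... entry)
cryptswap  /dev/disk/by-partuuid/PLACEHOLDER  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

# /etc/fstab — use the mapped device as swap
/dev/mapper/cryptswap  none  swap  sw  0  0
```

Note that with a random per-boot key, hibernation is impossible, which is usually fine for a PVE node.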

Option 2.)
Swap partition on top of a LUKS container on top of a mdadm raid1. This is how I do it right now. Is there any new info on this?:
Using mdraid (mdadm)/lvm-mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption).
and adds complexity when it comes to replacing a broken disk.

Currently I'm not aware that this particular problem affects using mdraid as swap; however, I haven't read all code paths regarding swapping in the kernel.
Edit: And here's a newer post, so it looks like this is still not a great solution:
The first thing that comes to my mind when talking about md/dm raid is this: The default caching mode for VMs we use is none, that is, the VMs will access disks with the O_DIRECT flag. The MD/DM raid implementations, for this case, will simply forward a pointer to the memory region to each individual block device, and each of those will copy the data from memory separately. If a 2nd thread is currently writing to that data, the underlying disks will sooner or later write different data, immediately corrupting the raid.[1]

In [1] I also mentioned a real case where this can happen: an in-progress write to swap for memory that is simultaneously freed, so the swap entry is already discarded while the disk I/O is still in flight, causing the raid to become degraded.

Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches consistency be damned, completely disregarding the one job that for example a RAID1 has: actually writing the *same* data on both disks...

And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.

If you take care not to use this mode, you'll be fine with it, though you'll be using some more memory.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
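For completeness, this is roughly how option 2 gets assembled (all device and mapper names below are placeholders; a real setup would also need a crypttab entry, or a random per-boot key as in option 1):

```shell
# Sketch: swap on LUKS on an mdadm RAID1 (placeholder device names)
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
cryptsetup luksFormat /dev/md1        # or plain mode with a /dev/urandom key
cryptsetup open /dev/md1 cryptswap
mkswap /dev/mapper/cryptswap
swapon /dev/mapper/cryptswap
```

The O_DIRECT issue quoted above doesn't apply to this exact stack as long as nothing writes to the md device with O_DIRECT, but as the quote says, that's hard to guarantee for every kernel code path.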

Option 3.)
Use a zvol as swap on an encrypted ZFS mirror. This isn't great because of the feedback loop: swapping to a zvol causes ZFS itself to allocate more RAM, which increases memory pressure, which causes more swapping, which makes ZFS allocate even more RAM ... until you OOM.
See for example here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
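For reference, this is roughly how such a swap zvol would be created; the property set follows what the OpenZFS FAQ suggests for swap devices (the pool/dataset name rpool/swap and the 8G size are placeholders):

```shell
# Sketch: swap zvol with swap-friendly properties (placeholder names/sizes)
zfs create -V 8G -b "$(getconf PAGESIZE)" \
    -o logbias=throughput -o sync=always \
    -o primarycache=metadata -o secondarycache=none \
    -o com.sun:auto-snapshot=false rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
```

These properties reduce, but don't eliminate, the memory amplification: every swapped-out page still goes through ZFS's allocation and checksumming machinery, which is exactly where the feedback loop comes from.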

Option 4.)
Use a hardware raid card. But this isn't great as PCIe slots are rare here, and I don't like wasting one on just a mirrored swap when everything else runs with ZFS on HBAs.

Option 5.)
Pseudo hardware raid via the mainboard's chipset (fake raid). This isn't great either, as it combines the downsides of software and hardware raid without the benefits of either.


No swap at all might also be a choice, but I like my nodes to have enough swap space plus "vm.swappiness=0". That way the swap won't be used during normal operation, but PVE could still make use of it when it would otherwise have to OOM-kill some guests. Disks can fail and will fail sooner or later.
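For reference, making that swappiness setting persistent could look like this (the drop-in file name is arbitrary):

```shell
# Persist vm.swappiness=0 so swap is only touched under real memory pressure
echo 'vm.swappiness = 0' > /etc/sysctl.d/99-swappiness.conf
sysctl --system    # reload all sysctl configuration files
```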

So any recommendations on which of these 5 options would be the most reliable? Or is there even a sixth option I can't see?
 
I am using LUKS-encrypted disks below a volume group (LVM). AFAIK a VG can create a simple mirror, so technically that could be an alternative to your ZFS mirror.
I don't like any of the ideas though because it really adds multiple layers of complexity.
Anyways. Maybe this idea helps ;)
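A sketch of that idea, assuming the two LUKS containers are already open as /dev/mapper/crypt1 and /dev/mapper/crypt2 (all names and sizes below are placeholders):

```shell
# Sketch: mirrored swap LV inside a VG built on two opened LUKS devices
pvcreate /dev/mapper/crypt1 /dev/mapper/crypt2
vgcreate vg0 /dev/mapper/crypt1 /dev/mapper/crypt2
lvcreate --type raid1 -m 1 -L 8G -n swap vg0
mkswap /dev/vg0/swap
swapon /dev/vg0/swap
```

Note that LVM's raid1 uses the same dm-raid kernel code discussed above, so the O_DIRECT caveat from the mdadm discussion would presumably apply here as well.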
 
Does anyone from the staff know whether btrfs or LVM would be better for an encrypted mirrored swap?

I'm writing a tutorial on how to do a PVE installation with full system encryption, and I'm now at the point where I need to decide what storage to use for the swap. Right now it looks like using no swap is the only option, because every software raid1 variant can cause either system crashes or degraded arrays. But unmirrored swap is also not an option, as it would make the whole raid1 pointless if the server can still crash due to lost swapped-out data when a disk fails.
 
