Hi,
Right now I'm writing a tutorial on how to best set up an encrypted PVE node. But the question now is how to best set up the encrypted swap?
As far as I can see there are five options and none of them is really great:
Option 1.)
Just a LUKS encrypted swap partition on a single disk. Not that great, because there is no redundancy. As soon as the disk containing the swap partition fails, all swapped-out data is lost and the PVE node, or at least some guests, may crash.
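For reference, a minimal sketch of this option, assuming the swap partition is /dev/sdX2 (a placeholder; better use a stable /dev/disk/by-id/... path). With /dev/urandom as the key source the swap gets a throwaway key on every boot, so no passphrase is needed (but hibernation won't work):

    # /etc/crypttab: map the swap partition with a fresh random key on each boot
    swap_crypt  /dev/sdX2  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

    # /etc/fstab: use the resulting mapping as swap
    /dev/mapper/swap_crypt  none  swap  sw  0  0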
Option 2.)
Swap partition on top of a LUKS container on top of an mdadm raid1. This is how I do it right now (a minimal sketch of the layering follows below, after the quoted replies). Is there any new info on this?
Edit: And here is a newer post, so it looks like this is still not a great solution:
Using mdraid (mdadm)/lvm-mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption) and adds complexity when it comes to replacing a broken disk.
Currently I'm not aware that this particular problem would affect using an mdraid as swap; however, I didn't read all code paths regarding swapping in the kernel.
The first thing that comes to my mind when talking about md/dm raid is this: The default caching mode for VMs we use is 'none', that is, the VMs will access disks with the O_DIRECT flag. The MD/DM raid implementations, for this case, will simply forward a pointer to the memory region to each individual block device, and each of those will copy the data from memory separately. If a 2nd thread is currently writing to that data, the underlying disks will sooner or later write different data, immediately corrupting the raid. [1]
In [1] I also mentioned a real case where this can happen: an in-progress write to swap for memory that is simultaneously freed, so the swap entry is already discarded while the disk I/O is still happening, causing the raid to be degraded.
Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches, consistency be damned, completely disregarding the one job that for example a RAID1 has: actually writing the *same* data on both disks...
And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.
If you take care not to use this mode, though, you'll be fine with it, but you'll be using some more memory.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
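For completeness, here is the layering of this option as a minimal sketch (device names are placeholders; the crypttab/fstab wiring for unlocking at boot is omitted):

    # RAID1 from two partitions of equal size
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX2 /dev/sdY2

    # LUKS container on top of the array, then swap inside the mapping
    cryptsetup luksFormat /dev/md0
    cryptsetup open /dev/md0 swap_crypt
    mkswap /dev/mapper/swap_crypt
    swapon /dev/mapper/swap_crypt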
Option 3.)
Use a zvol as swap partition on an encrypted ZFS mirror. This isn't great because of the feedback loop: swapping to the zvol causes ZFS itself to allocate more RAM, which increases memory pressure and therefore swapping, which causes ZFS to allocate even more RAM, and so on ... until you OOM.
See for example here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
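If someone wants to try this option anyway, the OpenZFS FAQ suggests creating the swap zvol with properties that mitigate (but don't eliminate) this feedback loop. A sketch, assuming the pool is called rpool and 8 GiB of swap:

    # swap zvol with the properties suggested by the OpenZFS FAQ
    zfs create -V 8G -b $(getconf PAGESIZE) \
        -o logbias=throughput -o sync=always \
        -o primarycache=metadata -o secondarycache=none \
        -o compression=zle rpool/swap
    mkswap /dev/zvol/rpool/swap
    swapon /dev/zvol/rpool/swap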
Option 4.)
Use a hardware RAID card. But this isn't great, as PCIe slots are scarce here and I don't like wasting one just for mirrored swap when everything else is running with ZFS on HBAs.
Option 5.)
Pseudo hardware RAID ("fake RAID") provided by the mainboard's chipset. This isn't great either, as it combines the downsides of software and hardware RAID without the benefits of either.
No swap at all might also be a choice, but I like my nodes to have enough swap space plus "vm.swappiness=0". That way the swap won't be used during normal operation, but PVE can still make use of it when it would otherwise have to OOM-kill some guests. Disks may fail and will fail sooner or later.
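For reference, a minimal way to set this persistently (the file name is arbitrary):

    # /etc/sysctl.d/99-swappiness.conf
    vm.swappiness = 0

    # apply without a reboot
    sysctl --system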
So any recommendations on which of these five options would be the most reliable? Or is there even a sixth option I can't see?