Hi,
Right now I'm writing a tutorial on how to best set up an encrypted PVE node. But the question now is how to best set up the encrypted swap?
As far as I can see there are five options and none of them is really great:
Option 1.)
Just a LUKS encrypted swap partition on a single disk. Not that great, because there is no redundancy. As soon as the disk containing the swap partition fails, all swapped-out data is lost and the PVE node, or at least some guests, may crash.
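For reference, a minimal sketch of this option, assuming the swap partition is /dev/sdX2 (a placeholder; better use a stable /dev/disk/by-id/... path). With /dev/urandom as the key source the swap gets a throwaway key on every boot, so no passphrase is needed (but hibernation won't work):

    # /etc/crypttab: map the swap partition with a fresh random key on each boot
    swap_crypt  /dev/sdX2  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

    # /etc/fstab: use the resulting mapping as swap
    /dev/mapper/swap_crypt  none  swap  sw  0  0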
Option 2.)
Swap partition on top of a LUKS container on top of an mdadm raid1. This is how I do it right now (a minimal sketch of the layering follows below, after the quoted replies). Is there any new info on this?
Edit: And here is a newer post, so it looks like this is still not a great solution:
Using mdraid (mdadm)/lvm-mirror/dmraid adds yet another layer of complexity, which might lead to problems (see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c5 for a case where this can lead to guest-disk corruption) and adds complexity when it comes to replacing a broken disk.
Currently I'm not aware that this particular problem would affect using an mdraid as swap; however, I didn't read all code paths regarding swapping in the kernel.
The first thing that comes to my mind when talking about md/dm raid is this: The default caching mode for VMs we use is 'none', that is, the VMs will access disks with the O_DIRECT flag. The MD/DM raid implementations, for this case, will simply forward a pointer to the memory region to each individual block device, and each of those will copy the data from memory separately. If a 2nd thread is currently writing to that data, the underlying disks will sooner or later write different data, immediately corrupting the raid. [1]
In [1] I also mentioned a real case where this can happen: an in-progress write to swap for memory that is simultaneously freed, so the swap entry is already discarded while the disk I/O is still happening, causing the raid to be degraded.
Ideally the kernel would just ignore O_DIRECT here, since it is in fact documented as *trying* to minimize cache effects... not forcibly skipping caches, consistency be damned, completely disregarding the one job that for example a RAID1 has: actually writing the *same* data on both disks...
And yes, writing data which is being modified *normally* means you need to expect garbage on disk. However, the point of a RAID is to at least have the *same* garbage on *both* disks, not give userspace a trivial way to degrade the thing.
If you take care not to use this mode, though, you'll be fine with it, but you'll be using some more memory.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=99171
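For completeness, here is the layering of this option as a minimal sketch (device names are placeholders; the crypttab/fstab wiring for unlocking at boot is omitted):

    # RAID1 from two partitions of equal size
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX2 /dev/sdY2

    # LUKS container on top of the array, then swap inside the mapping
    cryptsetup luksFormat /dev/md0
    cryptsetup open /dev/md0 swap_crypt
    mkswap /dev/mapper/swap_crypt
    swapon /dev/mapper/swap_crypt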
Option 3.)
Use a zvol as swap partition on an encrypted ZFS mirror. This isn't great because of the feedback loop: swapping to the zvol causes ZFS itself to allocate more RAM, which increases memory pressure and therefore swapping, which causes ZFS to allocate even more RAM, and so on ... until you OOM.
See for example here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
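If someone wants to try this option anyway, the OpenZFS FAQ suggests creating the swap zvol with properties that mitigate (but don't eliminate) this feedback loop. A sketch, assuming the pool is called rpool and 8 GiB of swap:

    # swap zvol with the properties suggested by the OpenZFS FAQ
    zfs create -V 8G -b $(getconf PAGESIZE) \
        -o logbias=throughput -o sync=always \
        -o primarycache=metadata -o secondarycache=none \
        -o compression=zle rpool/swap
    mkswap /dev/zvol/rpool/swap
    swapon /dev/zvol/rpool/swap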
Option 4.)
Use a hardware RAID card. But this isn't great, as PCIe slots are scarce here and I don't like wasting one just for mirrored swap when everything else is running with ZFS on HBAs.
Option 5.)
Pseudo hardware RAID ("fake RAID") provided by the mainboard's chipset. This isn't great either, as it combines the downsides of software and hardware RAID without the benefits of either.
No swap at all might also be a choice, but I like my nodes to have enough swap space plus "vm.swappiness=0". That way the swap won't be used during normal operation, but PVE can still make use of it when it would otherwise have to OOM-kill some guests. Disks may fail and will fail sooner or later.
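For reference, a minimal way to set this persistently (the file name is arbitrary):

    # /etc/sysctl.d/99-swappiness.conf
    vm.swappiness = 0

    # apply without a reboot
    sysctl --system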
So any recommendations on which of these five options would be the most reliable? Or is there even a sixth option I can't see?