Recommendations on SWAP (especially on ZFS)

shad0th
Jul 30, 2024
Hi esteemed Proxmox-users,

[background]
I'm migrating a few ESXi and Hyper-V hypervisors to Proxmox.
One thing that has been bothering me is the question of the swap partition.
I am currently setting up two environments, one having swap (around 32GB on an SSD) and one completely without swap. Both PVE nodes are standalone (no cluster) and have 64GB ECC RAM.
In the one without swap I have already seen a container die from too little provisioned RAM, but at least that gave an indicator of the container's needs, and upping the RAM has not caused trouble since.
Now for this learning experience I was hoping to learn what the role of swap in a modern hypervisor really is, and maybe learn a thing about overprovisioning and the real usage of RAM. I usually don't run out of RAM; I usually see failures because I'm too strict on the containers/VMs.
[/background]

My questions are:
- Is a non-swap environment ever going to be attractive? In terms of stability and performance.
- (and if so) in what environment would you drop swap? What would be needed?
- Can a swap file/partition now be put in a ZFS dataset, or is it best left to an mdadm-type volume?
- Does setting the RAM for an LXC-container really reserve all that RAM or is it more of a ballooning-type reservation where excess RAM is given back to the hypervisor?
- Swap seems to be involved with a lot more than running out of memory; why does the hypervisor write to swap before utilizing available RAM in many cases?
- I'd like my VMs and containers to just grab the memory they need and, if necessary, put restrictions up, but I have yet to discover a way to do this besides trial and error.

I don't expect an answer to address all of those questions; if there is a good resource for this already, I'm sorry my search-fu failed me.

Best,
 
VictorSTS
- Is a non-swap environment ever going to be attractive? In terms of stability and performance.
- (and if so) in what environment would you drop swap? What would be needed?
It's attractive if you have enough real RAM for everything, so you will never need the help of swap space.
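A quick way to check whether a node actually touches its swap under normal load, using standard Linux tools (nothing Proxmox-specific here):

Code:
# Current memory and swap usage on the host:
free -h
# List active swap devices and how much of each is in use:
swapon --show

If swap usage stays at or near zero over time, a swapless setup is less of a gamble.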


- Can a swap file/partition now be put in a ZFS dataset, or is it best left to an mdadm-type volume?
Swap on ZFS is not supported [1]. It should work, but deadlocks are possible, so swap on ZFS is not recommended. I set swap on an individual disk (just leave a small partition free on the system drives used as the ZFS mirror for the OS). In the event of a swap drive failure, any process that tries to access memory paged out to that swap would get a SIGBUS error and (probably) just get killed, but the system as a whole would keep running. My servers are dimensioned to not need any swap, use enterprise drives, and are properly monitored, all of which lowers the risk of downtime due to this event.
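As a sketch of that layout (the device name is an example; adjust to whatever partition you actually left free outside the ZFS mirror):

Code:
# /dev/sda4 is a hypothetical small partition on one of the OS drives.
mkswap /dev/sda4
swapon /dev/sda4
# Persist across reboots:
echo '/dev/sda4 none swap sw 0 0' >> /etc/fstab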

- Does setting the RAM for an LXC-container really reserve all that RAM or is it more of a ballooning-type reservation where excess RAM is given back to the hypervisor?
For LXC it's a limit: processes inside an LXC can use up to the RAM and swap set in the LXC config. If they don't need that much RAM, they will not reserve it, and the host can use it for something else if needed.
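Those limits are set per container, for example (container ID 101 is just an example):

Code:
# Cap container 101 at 2 GiB RAM and 512 MiB swap:
pct set 101 --memory 2048 --swap 512
# The same values show up in /etc/pve/lxc/101.conf as:
#   memory: 2048
#   swap: 512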

In contrast, a QEMU VM requests memory from the host as the guest OS needs it, but it does not return the memory to the host automatically once the guest OS no longer needs it. If the host needs memory, the kernel will ask processes to return unused RAM, so QEMU will try to recover some, but it depends on the guest OS releasing it first. More often than not, the guest OS ends up using all its RAM for things like disk cache, so it will not be able to release memory back to the host. The ballooning driver tries to force the guest OS to release memory, at the cost of CPU usage.
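On the Proxmox side, ballooning is configured per VM (VM ID 100 is an example; the guest also needs the virtio balloon driver installed):

Code:
# Give VM 100 up to 4 GiB, but let the balloon driver reclaim memory
# down to 2 GiB when the host is under pressure:
qm set 100 --memory 4096 --balloon 2048
# Setting --balloon 0 disables ballooning for that VM entirely.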

- Swap seems to be involved with a lot more than running out of memory; why does the hypervisor write to swap before utilizing available RAM in many cases?
By default, Linux will swap opportunistically as soon as a memory page hasn't been accessed for some time. To make it swap less, adjust vm.swappiness to 1 or even 0. Be aware that the swap amount you set in the LXC config will be backed by the host's swap, no matter the vm.swappiness setting.
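For example:

Code:
# Lower swappiness for the running system:
sysctl vm.swappiness=1
# Make it persistent across reboots (the file name is an example):
echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf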

Also remember that having something in swap is not a problem at all: the kernel may park some data in swap that the system seldom, if ever, accesses again. The issues arise when the system is actively swapping: moving data into and out of swap frequently (check with e.g. vmstat 1).
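To tell the two situations apart:

Code:
# Print stats once per second; watch the si (swapped in) and so
# (swapped out) columns. Sustained non-zero values mean the system
# is actively swapping, not just holding stale pages in swap.
vmstat 1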

- I'd like my VMs and containers to just grab the memory they need and, if necessary, put restrictions up, but I have yet to discover a way to do this besides trial and error.
Unless you have deep knowledge of the apps running in your LXC/VM, that's the only way. Always keep in mind that VMs will end up using all their memory due to disk caches, and that the host itself needs RAM too for the kernel and for things like ZFS, Ceph, backups, shells, etc. Also be careful when using low memory for VMs, as they may start swapping inside the guest OS too!
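One way to narrow down the trial and error is to watch actual usage before tightening a limit (the container ID is an example, and the cgroup path below is the usual cgroup v2 location, which may differ on your setup):

Code:
# What does container 101 actually use right now?
pct exec 101 -- free -m
# Or read the cgroup accounting directly on the host:
cat /sys/fs/cgroup/lxc/101/memory.current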



[1] https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#zfs_swap
 
shad0th
I was not expecting this good an answer from a single source; thank you so much for your time @VictorSTS, it was really helpful and insightful. I'd buy you a beer if I could. I feel that with this information I can make better-informed designs for my experimental nodes.
Edit:
I'd like to share my modus operandi following this:
- Don't ever use a VM when a container will do; containers can be overprovisioned with RAM and VMs cannot
- Swap needs more research on my part, but I'll test with minimal swappiness. Also, not all use of a swap partition is bad; I'll endeavour to find the cases where it is agreeable while flagging those that are not
- Don't EVER put swap on a ZFS dataset
 
VictorSTS
- Don't ever use a VM when a container will do; containers can be overprovisioned with RAM and VMs cannot
I avoid LXC at all costs for two main reasons: they don't support dirty bitmaps when backing up to PBS (I've had 3TB LXCs and backing them up wasn't fun), and they can't be live-migrated, forcing some downtime every time I want to do something on a host. Not to mention the PITA it is to make them behave with NFS/SMB, or the risks of a privileged LXC. I do use them if they are small, if they will be clustered at the app level (i.e. a web application cluster), and some scale is needed, making the ~500MB of RAM that the VM kernel would use for each one a significant saving. But that's just me ;)
 
shad0th
I feel you on this, it's a good reminder. Most of my containers are built using Ansible and are volatile/not backup material, so I can afford to lose them, but I note your insight. I've learned a great deal from googling some of your terms.
 
