Recommendations on SWAP (especially on ZFS)

shad0th

Jul 30, 2024
Hi esteemed Proxmox-users,

[background]
I'm migrating a few ESXi and Hyper-V hypervisors to Proxmox.
One thing that has been bothering me is the question of the swap partition.
I am currently setting up two environments, one with swap (around 32GB on an SSD) and one completely without swap. Both PVE nodes are standalone (no cluster) and have 64GB of ECC RAM.
In the one without swap I have already seen a container die from too little provisioned RAM, but I'm thinking that at least gave an indicator of the container's needs, and upping the RAM has not caused trouble since.
Now, for this learning experience, I was hoping to learn what the role of swap in a modern hypervisor really is, and maybe learn a thing about overprovisioning and the real usage of RAM. I usually don't run out of RAM; I usually see failures because I'm too strict on the containers/VMs.
[/background]

My questions are:
- Is a non-swap environment ever going to be attractive, in terms of stability and performance?
- (and if so) In what environment would you drop swap? What would be needed?
- Can a swap file/partition now be put in a ZFS dataset, or is it best left to an mdadm-type volume?
- Does setting the RAM for an LXC container really reserve all that RAM, or is it more of a ballooning-type reservation where excess RAM is given back to the hypervisor?
- Swap seems to be involved with a lot more than running out of memory; why does the hypervisor write to swap before utilizing available RAM in many cases?
- I'd like my VMs and containers to just grab the memory they need and, if necessary, put restrictions up, but I have yet to discover a way to do this besides trial and error.

I don't expect a single answer to address all of those questions; if there is a good resource for this already, I'm sorry my search-fu failed me.

Best,
 
- Is a non-swap environment ever going to be attractive, in terms of stability and performance?
- (and if so) In what environment would you drop swap? What would be needed?
It's attractive if you have enough real RAM for everything, so you will not need the help of swap space.


- Can a swap file/partition now be put in a ZFS dataset, or is it best left to an mdadm-type volume?
Swap on ZFS is not supported [1]. It should work, but deadlocks are possible, so using swap on ZFS is not recommended. I put swap on an individual disk (I just leave a small partition on the system drives used as a ZFS mirror for the OS). In the event of a swap drive failure, any process that tries to access memory paged out to that swap would get a SIGBUS error and (probably) just get killed, but the system as a whole would keep running. My servers are dimensioned not to need any swap, use enterprise drives, and are properly monitored, which lowers the risk of downtime from such an event.
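If you go that route, a minimal sketch of setting up swap on a leftover partition (the device name /dev/sdb4 is just a placeholder for whatever partition you left free):

    # Write a swap signature to the hypothetical spare partition
    mkswap /dev/sdb4
    # Enable it immediately
    swapon /dev/sdb4
    # Persist across reboots, using the UUID that mkswap printed
    echo 'UUID=<uuid-from-mkswap> none swap sw 0 0' >> /etc/fstab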

- Does setting the RAM for an LXC container really reserve all that RAM, or is it more of a ballooning-type reservation where excess RAM is given back to the hypervisor?
For LXC it's a limit: processes inside an LXC can use up to the RAM and swap set in the LXC config. If they don't need that much RAM, they will not reserve it, and the host can use it for something else if needed.
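For example, a sketch using Proxmox's pct tool (the container ID 101 is just an example):

    # Cap container 101 at 2 GiB of RAM and 512 MiB of swap (values in MiB)
    pct set 101 --memory 2048 --swap 512
    # Check what the container currently reports
    pct status 101 --verbose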

In contrast, a QEMU VM requests memory from the host as the guest OS needs it, but it does not automatically return that memory to the host when the guest OS no longer needs it. If the host needs memory, the kernel will ask processes to return unused RAM, so QEMU will try to recover some, but that depends on the guest OS releasing it first. More often than not, the guest OS ends up using all its RAM for things like disk cache, so it will not be able to release memory back to the host. The ballooning driver tries to force the guest OS to release memory, at the cost of some CPU usage.
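A sketch of enabling ballooning on a VM (VM ID 100 is an example, and the guest needs the virtio balloon driver installed for this to do anything):

    # Let VM 100 use up to 4 GiB, but allow the balloon to shrink it to 1 GiB
    qm set 100 --memory 4096 --balloon 1024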

- Swap seems to be involved with a lot more than running out of memory; why does the hypervisor write to swap before utilizing available RAM in many cases?
By default, Linux swaps opportunistically as soon as a memory page hasn't been accessed for some time. To make it swap less, adjust vm.swappiness to 1 or even 0. Be aware that the swap amount you set in the LXC config will be backed by the host's swap space, regardless of the vm.swappiness setting.
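A sketch of lowering swappiness (the file name under /etc/sysctl.d/ is just a convention):

    # Apply immediately
    sysctl vm.swappiness=1
    # Persist across reboots
    echo 'vm.swappiness = 1' > /etc/sysctl.d/90-swappiness.conf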

Also remember that having something in swap is not a problem in itself: the kernel may park some data in swap and seldom or never access it again. The issues arise when the system is actively swapping, i.e. moving data in and out of swap space frequently (check with e.g. vmstat 1).
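For reference, what to run and which columns matter:

    # Print stats every second; watch the si/so columns under "swap"
    vmstat 1
    # si = KiB/s paged in from swap, so = KiB/s paged out to swap.
    # Both staying at 0 is fine even if some swap space is in use;
    # sustained non-zero values mean the system is actively swapping.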

- I'd like my VMs and containers to just grab the memory they need and, if necessary, put restrictions up, but I have yet to discover a way to do this besides trial and error.
Unless you have deep knowledge of the apps running in your LXC/VM, that's the only way. Always keep in mind that VMs will end up using all their memory due to disk caches, and that the host itself needs RAM too, for the kernel and for processes like ZFS, Ceph, backups, shells, etc. Also be careful when giving VMs too little memory, as they may start swapping inside the guest OS too!
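A quick way to gauge real usage before tightening limits (a sketch; container 101 and VM 100 are examples, and the VM stats need the guest agent/balloon driver in the guest):

    # Memory use inside an LXC, as constrained by its cgroup limits
    pct exec 101 -- free -m
    # Memory and balloon stats for a VM, as reported to the host
    qm status 100 --verbose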



[1] https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#zfs_swap
 
I was not expecting this good an answer from a single source; thank you so much for your time @VictorSTS, it was really helpful and insightful. I'd buy you a beer if I could. I feel that with this information I can make better-informed designs for my experimental nodes.
Edit:
I'd like to share my modus operandi following this:
- Don't ever use a VM when a container will do; containers can be overprovisioned with RAM and VMs cannot.
- Swap needs more research on my part, but I'll test with minimal swappiness. Also, not all use of a swap partition is bad; I'll endeavour to find the cases where it is agreeable while flagging those that are not.
- Don't EVER put swap on a ZFS dataset.
 
- Don't ever use a VM when a container will do; containers can be overprovisioned with RAM and VMs cannot.
I avoid LXC at all costs for two very good reasons: they don't support dirty bitmaps when backing up to PBS (I've had 3TB LXCs, and backing them up wasn't fun) and they can't be live-migrated, forcing some downtime every time I want to do something on a host. Not to mention the PITA it is to make them behave with NFS/SMB, or the risks of a privileged LXC. I do use them if they are small and if they will be clustered at the app level (e.g. a web application cluster) where some scale is needed, making the ~500MB of RAM that the VM kernel would use for each a significant saving. But that's just me ;)
 
I feel you on this; it's a good reminder. Most of my containers are built using Ansible and are volatile / not backup material, so I can afford to lose them, but I note your insight. I've learned a great deal from googling some of your terms.
 
