zram - why bother?

BloodyIron

On PVE Nodes with plenty of RAM (not even close to running out), why even bother with zram?

I've inherited an environment with zram present on some of the PVE Nodes in the cluster, and it seems completely redundant. RAM used to... provide RAM when you're out of RAM? What?

So far all the justifications I see are for memory-starved scenarios, like laptops with 8GB of RAM or something. But there's no RAM resource contention at all that I can see, and the swap-function usage of the zram in this environment is barely anything.

It feels like adding unwarranted fragility to a base distro (Proxmox VE OS) that already addresses things like this with KSM and... swap as a file on disk (or partition?).

Am I somehow missing something here???
 
I wasn't talking about running without swap; where did you get that impression? To me, using RAM to replace RAM when you're in an OOM state is a liability for operations. I'd rather use a local SSD for swap when absolutely necessary (last resort) and, day to day, just install more RAM or address capacity planning in similar ways.

What happens if, using ZRAM, I'm 100% out of RAM? Because to me that sounds like a scenario where everything dies horribly.

And also, doesn't KSM do compression already out of the box?

I'm still not seeing it as warranted.
 
And also, doesn't KSM do compression already out of the box?
No. "Kernel Samepage Merging" doesn't do compression but it looks "only" for identical pages and cross-references them. Only one page will reside in Ram until one of the not-more-existing copies is going to get modified. https://docs.kernel.org/admin-guide/mm/ksm.html does not mention compression at all.

ZRAM does compression. If it compresses data down to 66% of its original size, you gain back 33% of that memory. Of course this costs some CPU cycles. See https://docs.kernel.org/admin-guide/blockdev/zram.html
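
For the curious, activating a zram swap device by hand looks roughly like this (a minimal sketch; device name, algorithm and size are examples, and packages like zram-tools automate all of it):

```
# load the zram module and create one device (/dev/zram0)
modprobe zram num_devices=1

# choose a compression algorithm, then set an uncompressed size cap
# (the algorithm must be set before disksize)
echo zstd > /sys/block/zram0/comp_algorithm
echo 4G > /sys/block/zram0/disksize

# format it as swap and enable it with a high priority
mkswap /dev/zram0
swapon --priority 100 /dev/zram0
```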
 
No. "Kernel Samepage Merging" doesn't do compression but it looks "only" for identical pages and cross-references them. Only one page will reside in Ram until one of the not-more-existing copies is going to get modified. https://docs.kernel.org/admin-guide/mm/ksm.html does not mention compression at all.

ZRAM does compression. If it could compress data down to 66% you gain 33% of memory. Of course this requires some CPU cycles. See https://docs.kernel.org/admin-guide/blockdev/zram.html

Welp guess that was a misunderstanding on my part, thanks for clarifying. :)

Okay so apart from trying to get ahead of OOM scenarios... is there _any_ other reason to use zram instead of just swap on disk?
 
is there _any_ other reason to use zram instead of just swap on disk?
Disclaimer: I am not in a position to give a good or definitive answer.

For a node, RAM should not get over-committed (there should be plenty of "free" RAM used for caches, buffers, ARC...). CPUs can be over-committed, disk space can, bandwidth can - RAM should not, in my personal point of view. When the host starts swapping, a lot of weird things can/will happen. Following that logic, one would swap inside a guest instead; generally I do not do this, though.

That said... and while I try hard not to over-commit RAM, I use both methods. zram is easier to activate because you do not need to modify the partition table - but then there are swap files for that, too. On slow storage zram is a lot faster, as it handles "just" RAM without the need to traverse the complex storage stack, possibly over the network...

In my dayjob I have enough RAM to avoid tinkering. In my homelab (with mini PCs) RAM is always the limiting resource :-(

Ymmv - just try it!
 
For a node, RAM should not get over-committed (there should be plenty of "free" RAM used for caches, buffers, ARC...). CPUs can be over-committed, disk space can, bandwidth can - RAM should not, in my personal point of view. When the host starts swapping, a lot of weird things can/will happen. Following that logic, one would swap inside a guest instead; generally I do not do this, though.
Totally agree!

is there _any_ other reason to use zram instead of just swap on disk?
Apart from the obvious (compressed memory and reduced disk I/O): maybe easier disk resizing if a swap partition is in the way (the Debian default for BIOS installs puts swap on the first extended partition)... not that I can think of.
In my PVE hosts, I use it alongside regular swap, but with a higher priority, so that compressed memory is hit first and disk swap only if more is needed. Yet as @UdoB already stated, the hypervisor should not swap at all - at least for KVM-only usage.
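
To illustrate that tiering (a sketch only; device names and sizes are examples), the kernel fills the higher-priority swap first and only spills over to the lower-priority one:

```
# compressed memory first, disk swap as overflow
swapon --priority 100 /dev/zram0
swapon --priority 10 /dev/sda3    # example disk-backed swap partition

# check the resulting order
cat /proc/swaps
# Filename     Type       Size     Used  Priority
# /dev/zram0   partition  4194300  0     100
# /dev/sda3    partition  8388604  0     10
```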
 
Totally agree!


Apart from the obvious (compressed memory and reduced disk I/O): maybe easier disk resizing if a swap partition is in the way (the Debian default for BIOS installs puts swap on the first extended partition)... not that I can think of.
In my PVE hosts, I use it alongside regular swap, but with a higher priority, so that compressed memory is hit first and disk swap only if more is needed. Yet as @UdoB already stated, the hypervisor should not swap at all - at least for KVM-only usage.

Well, to clarify: the assumption, to me, is that swap should generally be empty day to day, and that it should be available only as a last resort when things really start getting angry. So I would rather not use any RAM at all that is managed by a service (zram, the service) and instead plan capacity so there's always enough day-to-day RAM to expand into, with swap served by a file on disk (on the PVE Node) in case you're asleep and things start getting out of control despite good planning.

In scenarios like that, disk IO matters less than "things keep running". And to me, running zram is one more thing I could forget to set up when adding/replacing a PVE Node. I am a fan of keeping PVE Nodes as uncustomised as possible under the hood, so that node replacement is as streamlined as possible.

I appreciate the insights here, but I'm still not seeing worthwhile value - more placebo, in my experience and opinion anyway. :)
 
Well, to clarify: the assumption, to me, is that swap should generally be empty day to day, and that it should be available only as a last resort when things really start getting angry.
Some software allocates memory that happens not to be used (in your specific system or workload), since no software is perfect. If that memory that is (mostly) never touched gets swapped out, that's a good thing, since it makes room for your VMs and cache and the stuff that does matter.

Swap that is mostly only written to is not a problem (but a work-around for over-allocation by software). Swap that is often read (and rewritten and read again) is when it starts causing more problems than it solves (or works around), and is then called thrashing.

People get nervous about swap getting full because of the red color, but slowly increasing swap usage (over days or weeks, without high IO!) is in my opinion better than having less and less memory available over time because of sub-optimal software. Just reboot on a kernel update to restart the cycle.

EDIT: Maybe zram can help but I prefer to have that memory available to VMs instead and have swap use disk space. To me, keeping swap (compressed) in memory is preparing for thrashing, which should be avoided by swapping the unused stuff out instead.
 
Some software allocates memory that happens not to be used (in your specific system or workload), since no software is perfect. If that memory that is (mostly) never touched gets swapped out, that's a good thing, since it makes room for your VMs and cache and the stuff that does matter.

Swap that is mostly only written to is not a problem (but a work-around for over-allocation by software). Swap that is often read (and rewritten and read again) is when it starts causing more problems than it solves (or works around), and is then called thrashing.

People get nervous about swap getting full because of the red color, but slowly increasing swap usage (over days or weeks, without high IO!) is in my opinion better than having less and less memory available over time because of sub-optimal software. Just reboot on a kernel update to restart the cycle.

EDIT: Maybe zram can help but I prefer to have that memory available to VMs instead and have swap use disk space. To me, keeping swap (compressed) in memory is preparing for thrashing, which should be avoided by swapping the unused stuff out instead.

I've exhaustively worked through proving, in multiple environments, that leaving data in swap tangibly reduces the performance of whatever system is doing it (whether it's Windows, Linux, or a hypervisor such as Proxmox VE), as well as impacting other systems in the same environment that may or may not have anything in their own swap.

I am aware that plenty of software likes to swap anyway as a matter of day-to-day operations. However, having actually observed the performance impact of letting swap be the wild west it normally is vs flushing swap with some regularity, I have no plans to ever go back to not flushing swap regularly (as in cron job stuff; see the sketch below). The performance and responsiveness gains were so substantial it was night and day.
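
For illustration, the flush itself is just a swapoff/swapon cycle; a minimal sketch of such a cron job (hypothetical path and schedule, and only safe when free RAM comfortably exceeds current swap usage):

```
#!/bin/sh
# e.g. /etc/cron.weekly/flush-swap (hypothetical)
# Disabling swap forces its contents back into RAM; re-enabling it
# leaves the system with empty swap. If free RAM cannot absorb what
# is currently swapped out, swapoff will fail (or worse, trigger
# the OOM killer), so check free memory first.
swapoff -a && swapon -a
```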

The premise that low amounts of data in swap have no negative impact is false, and I know a lot of people who are very smart (legitimately so) make that claim. I spent years of exhaustive testing exploring this topic.

Yes, I agree that a certain percentage of RAM should be kept available for elastic growth and other reasons. But I am completely convinced that swap should generally never be populated with data day to day, and that it does have performance impacts. Especially in a PVE cluster: even if you have many VMs each with a little bit of swap, the swap of one VM impacts the performance of another VM, and this becomes a compounding problem.

So at this point, for me and the environments I architect for, and am otherwise responsible for, swap is a last resort, and I don't plan to ever touch zram.
 
There are actually a lot of situations where having enough swap space available is crucial. Yes, having everything in RAM usually seems best, but that's not always the case. If you have a workload that allocates and deallocates a lot of memory continuously, you might find that having swap available can increase your performance significantly. Basically it comes down to memory pressure, and the fact that the kernel can manage memory more efficiently with swap available. All that being said, hypervisors usually don't see that kind of load in my experience, but on the other hand, the Linux kernel usually tries to avoid swapping unless it sees a significant advantage or has no other choice.
 
There are actually a lot of situations where having enough swap space available is crucial. Yes, having everything in RAM usually seems best, but that's not always the case. If you have a workload that allocates and deallocates a lot of memory continuously, you might find that having swap available can increase your performance significantly. Basically it comes down to memory pressure, and the fact that the kernel can manage memory more efficiently with swap available. All that being said, hypervisors usually don't see that kind of load in my experience, but on the other hand, the Linux kernel usually tries to avoid swapping unless it sees a significant advantage or has no other choice.

Sure, and none of that warrants zram IMO. Also, you don't need swap to enable rapid RAM loading/unloading; that would actually slow it down substantially, as the swap device itself would be a huge bottleneck relative to the performance of said RAM. It would be substantially more sensible to just buy more RAM, especially if that kind of workload is active, as requiring swap to be that active in such a process would be a huge performance hit, and would also aggressively increase wear on the swap device (which at this point is _probably_ going to be flash-based storage).
 
Unfortunately that's not how the real world always works. Linux is actually made to use swap when it's actually needed. Adding more RAM won't be the correct way to solve the problem in all cases, nor is it always an option. One of the situations where I experienced this was a database server with the maximum amount of RAM possible at the time on a two-socket Intel setup. Adding just 10G of swap to that system transformed the situation from totally useless back to blazing fast. Yes, this was a few years back, but still: not having swap can be completely wrong in a lot of cases. If the Linux kernel elects to swap out some memory, the kernel has determined that it has more use for that amount of RAM for other tasks than for keeping that data resident. Now, can swap have a negative effect? For sure, and zram might be better to use in some cases, either alone or tiered with some kind of flash. Using zram can reduce the latency significantly compared to using some form of flash.
 
Unfortunately that's not how the real world always works. Linux is actually made to use swap when it's actually needed. Adding more RAM won't be the correct way to solve the problem in all cases, nor is it always an option. One of the situations where I experienced this was a database server with the maximum amount of RAM possible at the time on a two-socket Intel setup. Adding just 10G of swap to that system transformed the situation from totally useless back to blazing fast. Yes, this was a few years back, but still: not having swap can be completely wrong in a lot of cases. If the Linux kernel elects to swap out some memory, the kernel has determined that it has more use for that amount of RAM for other tasks than for keeping that data resident. Now, can swap have a negative effect? For sure, and zram might be better to use in some cases, either alone or tiered with some kind of flash. Using zram can reduce the latency significantly compared to using some form of flash.

I've been working with Linux for 20+ years, and using swap for more than emergency situations has tangible performance and wear-level costs. I am in the real world, just like you, and it is my responsibility to deal with aspects of architecture like this. Consider for a moment what forum we're in. Do you really think I'd be here talking like this if I didn't work with these systems?

Adding more RAM actually is the right way to do this as it lowers the pressure on pushing data into swap. RAM is orders of magnitude faster than swap in both throughput and latency. Any software that is large enough to use lots of RAM will notice the performance difference of regularly using swap.

Furthermore, if you have many systems (in this case on a hypervisor, be it Proxmox or otherwise) regularly keeping data in swap (regardless of Windows, Linux, or otherwise), then this has compounding performance costs across the whole environment. I've literally gone through deep-dive performance explorations of these impacts and seen very substantial gains from tuning systems such that they avoid putting anything into swap except in an emergency.

And I know that adding more RAM isn't always an option, but it is extremely achievable in the modern sense. Any server made in the last 10+ years can address very large amounts of RAM, and the cost of adding more RAM is substantially low from an IT CapEx/TCO perspective. The performance gains for large implementations grossly outweigh the cost of adding more RAM. Or, dare I say, of tuning the environment (tools/apps) to use less RAM if it is misconfigured.

I guarantee your example database would run faster with more RAM than with swap, and it is mathematically provable if you compare the actual performance of on-disk swap vs RAM. There is no scenario where swap on disk performs anywhere near how RAM performs, for any generation of RAM or physical disk, including top-end NVMe.

I _NEVER_ said do not have swap. Don't act like that's what I said, because I have not said that. I have, however, said that day to day you do not want data sitting in swap, as there are substantial performance impacts to that, and, again, increased wear on storage devices (in ways that are completely avoidable). Swap, in my professional experience and opinion, should only ever be used as a last resort.

The claim that swap has zero performance cost is bunk, and I have encountered evidence proving this many times in my professional career.

Consider that a Dell R720, a server from the 2012/2014 era, can have 768GB to 1.5TB of RAM installed in it. And that's just one server example that's extremely affordable in the modern sense (I can often pick up a Dell R720 for about $100, before upgrading the RAM of course). Newer servers have even larger RAM ceilings.
 
I want to clarify a bit what I mean by "Swap, in my professional experience and opinion, should only ever be used as a last resort".

Swap should (almost) _NEVER_ be turned off, except maybe in special circumstances like on Kubernetes nodes, where that is part of the recommended architecture (per the Kubernetes devs & related).

What I do mean is that swap should be present, scaled down to a size that makes sense, and the overall system (be it a VM or hypervisor, or whatever) should be architected such that whatever is intended to operate within it has substantial RAM headroom for the application, plus caching of data on top of that (in this case a Linux OS, as Linux caches stuff in RAM in addition to application data).

A common misconception is that you should follow a certain formula for how much swap you should have relative to how much RAM the system has (be it a VM or bare metal or whatever). That's not the case in the modern sense; such formulae were derived for systems 15+ years ago, when RAM was way smaller and more expensive.

In my experience working with many different sizes of systems, the amount of swap you want is more in the realm of 1GB-4GB, depending on purpose. So if there's a VM with 4GB of RAM, you really shouldn't be giving it more than 1GB of swap. But if you have a yuuggee boi 64GB RAM VM, well, you maybe want 3GB or 4GB of swap allocated.

And generally you want the swap to be a file on disk, not a partition, as resizing a swap file is achievable without rebooting the system. With a swap partition it is very hard to adjust the size without rebooting, and in some circumstances/configurations you may _have_ to reboot to make such a change.
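
As a sketch of that flexibility (path and sizes are examples), growing a swap file without a reboot is just:

```
# replace a too-small swap file with a 4G one, no reboot needed
swapoff /swapfile
rm /swapfile
dd if=/dev/zero of=/swapfile bs=1M count=4096   # fallocate -l 4G also works on ext4/xfs
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```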

But this should not be the only thing for consideration. You need monitoring and alerting in-tandem with this so that you can be alerted of a system (or systems) behaving abnormally.

For example, in one of the environments I work in, I have notifications set for any monitored system to alert if its swap reaches 10% usage or higher. Typically this is a clear-cut indication that the system does not have enough RAM, or that the system is in a bad state for $unknownReason. Both are things I want to be notified about.
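
A rough sketch of how such a check can be scripted against /proc/meminfo (the 10% threshold matches what I described; the script itself is illustrative, not what I actually run):

```
#!/bin/sh
# alert if swap usage reaches 10% or more (/proc/meminfo values are in kB)
total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
if [ "$total" -gt 0 ]; then
    pct=$(( (total - free) * 100 / total ))
    if [ "$pct" -ge 10 ]; then
        echo "swap usage at ${pct}% on $(hostname)"   # hook your alerting here
    fi
fi
```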

When I get notifications for these things (and at this point it's rare because I've properly sized the systems in this example environment) I go and evaluate what's going on.

Was the system breached? Is an application going berserk or have a bug that needs a patch? Or maybe there's a growth pattern I didn't account for?

I look at the metrics/graphs/whatever, in addition to probably logging into the system, to help determine what's going on. If the metrics/graphs show abnormal sudden usage growth, I'll explore whether the system is breached or an application is unhappy, and check logs; maybe a patch fixes it. If the graphs/metrics show a growth pattern demonstrating more RAM is warranted, I'll give it more RAM such that there is still headroom. And if that RAM can be grown online (I don't enable ballooning, BTW, because there are tangible performance costs to that), then I do that and flush the swap on that system. Or, if I can't do it online, I configure it to have more RAM, reboot the system when it makes sense, and move on with my day.

The function of swap used and treated in this regard is as a buffer that keeps things running while something "wrong" is happening. This buffer buys you time to ascertain what exactly is going on and what to do about it, in such a way that operations are expected not to be interrupted. There are substantial costs to using any real amount of swap day to day, as I've outlined, and I highly recommend against using swap that way (unless you _literally_ have no other choice). If you don't treat swap as a buffer and it's full day to day, then when you actually start needing that buffer you enter an OOM state, and in Linux's case the kernel literally starts forcefully killing applications just to free up RAM and swap. That means your operations now probably stop completely for that system, or for multiple systems.

So yeah, again, I did _not_ say that swap should not be used. I did however say that you really don't want data in swap day to day as general operations. That is for Windows, Linux, and generally any operating system.
 