Proxmox, OpenVZ, memory, Java VMs and Zimbra

jens · May 25, 2009

It took me a while to find out why Java-based services were problematic inside a VZ container and how to actually fix the problems. This is an attempt to document the issues and provide some workarounds.

As motivation, take the Zimbra Collaboration Suite as an example: http://pve.proxmox.com/wiki/Zimbra says that it requires 3GB of memory and another 3GB of swap. Trying to run Zimbra with less memory indeed fails. In the case of Zimbra, there are multiple reasons; for now, just believe me that Zimbra can run well with around 750MB of memory.

(1) What do the Memory and Swap settings in Proxmox actually do?

For a test, I created a VM with 256 MB Memory and 512 MB Swap. Upon entering the machine, a look into /proc/meminfo gives me a MemTotal of 786432kB and a SwapTotal of 0kB - not exactly what one would have expected!

Let's have a look at the resources that were allocated for this VM. From within the container, root can read the file /proc/user_beancounters. In the OpenVZ docs, this is sometimes referred to as UBC (user beancounters). Here is an edited version of this file, with some lines omitted and some huge values replaced by a single * character:

Code:

root@test:~# cat /proc/user_beancounters
Version: 2.5
       uid  resource          held    maxheld    barrier     limit    failcnt
      101:  privvmpages       2017       3097     196608    209108          0
            physpages         1416       2473          0         *          0
            vmguarpages          0          0     196608         *          0
            oomguarpages      1416       2473     196608         *          0

The columns are as follows: 'held' and 'maxheld' do accounting, with 'held' being the current value and 'maxheld' the maximum during the host lifetime. 'barrier' and 'limit' are configured maximum values. 'failcnt' counts failures, e.g. the number of memory allocation failures encountered. Except for 'failcnt', all memory resources are measured in pages of 4KB.

I can't give a full description of these resources here (see the OpenVZ Wiki), but in short, the barrier on 'privvmpages' is the maximum amount of virtual memory a container can allocate, and the total sum of all containers on a host might be much larger than the host's actual memory or swap space. The reason for this overcommitment is that there is a difference between allocating memory and actually using it (i.e. by writing to the allocated memory). The held 'physpages' is what the container is actually using and there is normally no barrier on it. The barrier on 'vmguarpages' is the memory that is guaranteed to a container in normal operation. Any allocation on top of this may fail if the server is low on memory. Finally, consider the situation where the host has completely run out of memory ("OOM") and needs to kill some container processes. The barrier on 'oomguarpages' defines the guaranteed memory during OOM. If the container is using more memory than this, the host may kill container processes.

Let's interpret these numbers: The relevant barriers are all set to (196608*4KB)=786432KB. This is the MemTotal that we found previously. 'privvmpages' held are (2017*4KB)=8068KB. This is the memory allocated by all processes in the container, while 'physpages' held (1416*4KB)=5664KB is the memory actually used by the container (this was a fresh minimal Ubuntu 8.04 install).
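The same conversion can be scripted. This is only a minimal sketch: the function names pages_to_kb and ubc_to_kb are made up for illustration, the 4 kB page size is the one assumed in this post, and the very large "unlimited" values found in a real file are not treated specially.

```shell
#!/bin/sh
# Convert a UBC page count to kB, assuming 4 kB pages.
pages_to_kb() {
    echo $(( $1 * 4 ))
}

# Print resource name, held and barrier in kB for beancounter text on stdin.
# Fields are counted from the end of the line because the first resource
# line of a container carries an extra "uid:" column.
ubc_to_kb() {
    awk 'NF >= 6 && $NF ~ /^[0-9]+$/ {
        held = $(NF-4); barrier = $(NF-2)
        printf "%s held=%dkB barrier=%dkB\n", $(NF-5), held * 4, barrier * 4
    }'
}

# Sample lines taken from the listing above ("*" stands for a huge value).
ubc_to_kb <<'EOF'
Version: 2.5
       uid  resource          held    maxheld    barrier     limit    failcnt
      101:  privvmpages       2017       3097     196608    209108          0
            physpages         1416       2473          0         *          0
EOF
```

Inside a container, root would feed it the real file instead: ubc_to_kb < /proc/user_beancounters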

Still, there doesn't seem to exist a distinction between Memory and Swap here. I would have expected the VMGUARPAGES and OOMGUARPAGES barriers to be around the Memory value in the Proxmox GUI and PRIVVMPAGES to reflect (Memory+Swap). Maybe one of the Proxmox guys can fill in here?

(2) The trouble with Java applications

As we have seen above, the MemTotal visible to a container is reported as if there is no difference between the configured Memory and Swap size. Any process looking at MemTotal and assuming a "standard UNIX configuration" with X MB of memory being matched with 2-3 times the amount of swap space will be totally off our actual configuration. In the above case, with 256+512 MB configured, such a process will assume 768 MB to be available.

And so does Sun's Java VM, which by default allocates a quarter of the available "physical" RAM for its heap. If that's not enough, it also adds 64MB for the permanent generation plus approximately another 64MB for stacks, buffers, etc. For now, let's assume we are running a single application server instance which allocates 768MB/4+64MB+64MB=320MB. Let's further assume it only uses 75% of the allocated amount, so we should still be running comfortably!?
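The rule of thumb above is easy to replay for other MemTotal values. A minimal sketch, with a made-up function name and the 64MB figures taken from this post as rough assumptions rather than exact JVM behaviour:

```shell
#!/bin/sh
# Estimate in MB of what one default-configured JVM may try to reserve:
# a quarter of the reported MemTotal for the heap, plus 64 MB permanent
# generation, plus roughly 64 MB for stacks, buffers, etc.
jvm_default_mb() {
    echo $(( $1 / 4 + 64 + 64 ))
}

jvm_default_mb 768    # container reporting 256 MB + 512 MB as MemTotal
jvm_default_mb 256    # same container if MemTotal showed only the RAM setting
```

The first call reproduces the 320MB from the text. The second shows how much smaller the default would be if the container reported only its configured Memory.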

Why doesn't it work in practice? For one, there are often other processes running in parallel: think of a database with a huge demand on caches and several pre-spawned Apache instances. The other issue is scripts that fire up additional Java VMs to interact with the application server: even a simple java -version command will try to allocate the 320MB we calculated above, and some scripts fire up several VMs in parallel or in background cron jobs.

(3) Overriding Java's memory defaults

The command line parameters to control the initial and maximum heap size are -Xms and -Xmx. In 1.5, Sun added DefaultInitialRAMFraction(=64) and DefaultMaxRAMFraction(=4) to set defaults relative to the "physical memory". Depending on your application or server, you might be able to manually calculate sensible values and provide them as additional command line options. In the above case, adding -Xms128m -Xmx128m would assign a good amount of memory to the application server while still leaving some space for other apps. Please do check your server's docs and Sun's heap tuning recommendations before you change any production systems. Also note that for these command line options to be effective, they must be placed before any class or jar files!!

The other problem is helper VMs started from various scripts and cron jobs, as you might not want to edit every single java invocation. The solution here is to provide defaults through a special environment variable, e.g. export _JAVA_OPTIONS="-XX:DefaultInitialRAMFraction=128 -XX:DefaultMaxRAMFraction=16". Using these parameters instead of -Xms and -Xmx still allows an override on the command line. I'm currently adding these from a script in profile.d.

The problem with this approach is that any scripts run by init don't pick them up and you may need to edit your init scripts to source them manually. Does anyone have a better suggestion?
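For illustration, such a profile.d script could look like the sketch below. The file name /etc/profile.d/java-defaults.sh is an assumption, and the flag names are the ones given in this post for Sun's Java 1.5; other JVM versions may not accept them.

```shell
#!/bin/sh
# Hypothetical /etc/profile.d/java-defaults.sh (the file name is an assumption).
# Only set a default if the caller has not provided _JAVA_OPTIONS already.
if [ -z "${_JAVA_OPTIONS:-}" ]; then
    _JAVA_OPTIONS="-XX:DefaultInitialRAMFraction=128 -XX:DefaultMaxRAMFraction=16"
    export _JAVA_OPTIONS
fi

# An init script does not read profile.d, so it would have to source the
# file itself before starting any Java process, for example:
#   [ -r /etc/profile.d/java-defaults.sh ] && . /etc/profile.d/java-defaults.sh
```

Guarding the assignment keeps a value that was set deliberately by a calling script.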

(4) Monitoring failcnt in /proc/user_beancounters

(to be completed)
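As a starting point, a minimal sketch of such monitoring: it prints every resource whose failcnt column is non-zero. The function name ubc_failures is made up, and the sample values below are invented for illustration only.

```shell
#!/bin/sh
# Print every resource whose failcnt (last column) is non-zero.
# Reads beancounter text on stdin; the uid is remembered from lines
# that start with "<uid>:".
ubc_failures() {
    awk '
        $1 ~ /^[0-9]+:$/ { uid = $1; sub(/:$/, "", uid) }
        NF >= 6 && $NF ~ /^[0-9]+$/ && $NF > 0 {
            printf "uid=%s resource=%s failcnt=%s\n", uid, $(NF-5), $NF
        }'
}

# Invented sample: privvmpages has hit its barrier 12 times.
ubc_failures <<'EOF'
Version: 2.5
       uid  resource          held    maxheld    barrier     limit    failcnt
      101:  privvmpages     196608     196608     196608    209108         12
            physpages         1416       2473          0         *          0
EOF
```

On a real container this would be run as root with ubc_failures < /proc/user_beancounters, for example from a cron job that mails any output.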

(5) Running Zimbra in a VZ container

(to be completed on another day, there are additional issues related to /proc/meminfo reporting the wrong values here and Zimbra's base configuration which is targeted at having several 1000 mailboxes)

dietmar · May 26, 2009

We use the 'LOCKEDPAGES' to simulate something like swap.

There is a new vzctl option to modify what is shown as available swap space inside the container (--swappages). I guess we can avoid the problem if we set --meminfo and --swappages more carefully - please can you test that (man vzctl).

- Dietmar

jens · May 26, 2009

I tried the following for the example 256MB+512MB configuration I used above:

Code:

vzctl set 101 --swappages 131072 --meminfo pages:65536 --save

Running on 2.6.24-6-pve, meminfo works (at least as far as MemTotal is now reported as 256MB instead of 768MB).

swappages isn't even persisted to the configuration file. Manually adding it didn't help either. I read previously that swappages would require special kernel patches.

Is there any documentation of the differences between your PVE kernels and the standard Lenny OpenVZ kernels?

I noticed what you did with LOCKEDPAGES. Off the top of my head, I can't think of any software actually locking pages in memory, so I don't believe it makes too much sense.

vz_fake_swap

Another solution that might help some people trying to run an app that insists on configured swap space is the script below (source: OpenVZ forum). It takes the swap size in MB as the first parameter (default: 512):

Code:

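#!/bin/bash
# Fake a swap size in /proc/meminfo; first argument is the size in MB (default 512).
SWAP="${1:-512}"

# NEW is the size in kB; OLD is a "0" left-padded with spaces to the same width,
# so the substitution keeps the column alignment of /proc/meminfo.
NEW="$((SWAP * 1024))"; TEMP="${NEW//?/ }"; OLD="${TEMP:1}0"

# Drop a previous bind mount, then bind a patched copy over /proc/meminfo.
umount /proc/meminfo 2> /dev/null
sed "/^Swap\(Total\|Free\):/s,$OLD,$NEW," /proc/meminfo > /etc/fake_meminfo
mount --bind /etc/fake_meminfo /proc/meminfo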

dietmar · May 27, 2009

jens said:
swappages isn't even persisted to the configuration file.

Could you please file a bug in the OpenVZ bug tracker?

jens said:
Manually adding it didn't help either. I read previously that swappages would require special kernel patches.

I am not aware of that. Could you please ask on the OpenVZ list?

jens said:
Is there any documentation of the differences between your PVE kernels and the standard Lenny OpenVZ kernels?

We use the ubuntu kernel, because openvz people support that one.

http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=summary

jens said:
I noticed what you did with LOCKEDPAGES. Off the top of my head, I can't think of any software actually locking pages in memory, so I don't believe it makes too much sense.

I agree, openvz does not have the concept of 'swap', but it makes sense because of some other points:

1.) future versions will have the ability to control swap (cgroup system)

2.) most people do not know the OpenVZ memory concept (combined RAM+swap). So they always assign far too little RAM to a VM. Presenting RAM and SWAP as separate entries avoids that.

- Dietmar

jens · May 27, 2009

I would agree to keep the distinction between RAM and SWAP, as it allows displaying and managing the sum of allocated memory resources with respect to the host's physical and virtual memory (think of a warning before creating or migrating a machine to an overcommitted host).

Assuming a 'cooperative' sharing of resources, RAM should probably cover the continuous memory requirements of a container with 3 times the amount of SWAP allowing for temporary spikes. In OpenVZ terms, I would translate this to:

Code:

    oomguarpages=vmguarpages=meminfo pages=RAM/4KB
    swappages=SWAP/4KB
    privvmpages=(RAM+SWAP)/4KB

(For reference: good summary and explanation of the relevant resource parameters)
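The mapping above can be turned into concrete page counts. A minimal sketch, assuming 4KB pages (256 pages per MB); the function name ubc_from_mb is made up, and the output is just a list of values, not a ready-made vzctl command line.

```shell
#!/bin/sh
# Translate RAM and SWAP sizes (in MB) into page counts following the
# suggested mapping. One MB is 256 pages of 4 kB.
# The body runs in a subshell so the helper variables do not leak.
ubc_from_mb() (
    ram_pages=$(( $1 * 256 ))
    swap_pages=$(( $2 * 256 ))
    echo "oomguarpages=$ram_pages"
    echo "vmguarpages=$ram_pages"
    echo "meminfo=pages:$ram_pages"
    echo "swappages=$swap_pages"
    echo "privvmpages=$(( ram_pages + swap_pages ))"
)

# The 256MB + 512MB example container from the start of the thread.
ubc_from_mb 256 512
```

For the example container this gives 65536 pages for the RAM-based values and 131072 swap pages, the same numbers used in the vzctl call earlier, and a privvmpages value of 196608, which matches the barrier seen in /proc/user_beancounters.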

For swappages support, the OpenVZ wiki notes that you need RHEL5 028stab060.2 (and indeed, swappages related patches appear there). The bug with vzctl not persisting swappages was apparently fixed yesterday.

Lenny's OpenVZ kernel runs without any issues on my dev machine (together with software RAID10 which I would strongly recommend for development or budget usage...)

dietmar said:
most people do not know the OpenVZ memory concept (combined RAM+swap). So they always assign far too little RAM to a VM.

It would be useful to poll failcnts in running containers and display some sort of icon or a warning in the Web GUI.

dietmar · May 28, 2009

jens said:
In OpenVZ terms, I would translate this to:

Code:

    oomguarpages=vmguarpages=RAM/4KB
    privvmpages=(RAM+SWAP)/4KB

Please can you elaborate why that is better than:

Code:

    oomguarpages=vmguarpages=privvmpages=(RAM+SWAP)/4KB

jens said:
For swappages support, the OpenVZ wiki notes that you need RHEL5 028stab060.2 (and indeed, swappages related patches appear there).

But where can I find those patches?

jens said:
It would be useful to poll failcnts in running container and display some sort of icon or a warning in the Web GUI.

Yes, that's already on our TODO list.

jens · May 28, 2009

Unless memory on the host is overcommitted, there is absolutely no difference! But if SUM(privvmpages) > HOST_MEM + HOST_SWAP, this has several advantages:

The kernel can make sure that at any time, a container can allocate and use at least vmguarpages, no matter how "rogue" other containers behave.
If there is a kernel out-of-memory situation, the kernel will try to kill processes in containers that are over their oomguarpages limit first. With oomguarpages=privvmpages, you are effectively disabling this. Not sure what is supposed to happen then, but I would guess that the kernel starts killing the largest processes in either the host itself or any of its containers. In a mixed KVM/OpenVZ environment things might be even worse, as KVM instances are probably the biggest memory hogs and thus likely to be killed first.

Overcommitting memory, especially privvmpages, is very legitimate. You can also configure much more vmguarpages than the available physical memory on the host. This is useful for mostly idle containers, e.g. those running only a nightly batch, client demos or test environments.

dietmar said:
But where can I find those patches?

No idea, I'm just a stupid OpenVZ user and it's been more than 10 years since I last built a kernel from source.

What you have done with Proxmox VE really leverages both KVM and OpenVZ technology and made me decide against HyperV. It should be included with Debian once there are no more issues that require a special installation. If you agree, the most logical step would be to contact the Debian OpenVZ maintainer, present the PVE project and ask for an update to bring the patches on par with Red Hat kernels!

On vacation for the next 10 days, and back in my home country again for a change.

dietmar · May 28, 2009

jens said:
Overcommitting memory, especially privvmpages, is very legitimate. You can also configure much more vmguarpages than the available physical memory on the host. This is useful for mostly idle containers, e.g. those running only a nightly batch, client demos or test environments.

That's exactly what we wanted to avoid: unexpected, random process kills are not what one expects from a stable system.

jens said:
What you have done with Proxmox VE really leverages both KVM and OpenVZ technology and made me decide against HyperV. It should be included with Debian once there are no more issues that require a special installation. If you agree, the most logical step would be to contact the Debian OpenVZ maintainer, present the PVE project and ask for an update to bring the patches on par with Red Hat kernels!

What? We use the ubuntu 2.6.24 kernel for various reasons.

jens · May 28, 2009

dietmar said:
unexpected, random process kills are not what one expects from a stable system.

Yep, and that's exactly why I suggested "oomguarpages=vmguarpages=RAM/4KB; privvmpages=(RAM+SWAP)/4KB".

Let FREE_VM = physical memory + host swap - host requirements - some host reserve.

"Overcommitted" shall be any case where SUM(active container privvmpages) > FREE_VM. This allows containers to share burstable memory.

Ignoring the possibility of host processes going rogue, OOM panic is avoided if SUM(active container vmguarpages) < FREE_VM as the kernel won't allow a container to allocate memory that is guaranteed to other containers. Let's call this "safe overcommitting".

Your current configuration choice doesn't allow any overcommitting at all as vmguarpages=privvmpages. Or in other words, you disallow dynamic allocation/sharing of unused memory resources. It's more like a "static partitioning". Trying to start an additional container will most likely fail, even if the system has plenty of unused resources.

The configuration I suggested is identical to yours as long as there is no overcommitment. In addition, it allows safe overcommitting, or in other words: using the available resources more efficiently. At least for me, this is one of the major reasons for virtualization.
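The three situations described above can be told apart with a few comparisons. A minimal sketch for identical containers, all values in MB; the function name and the example numbers (FREE_VM of 12288MB, containers with 1024MB RAM and 2048MB SWAP) are illustrative assumptions.

```shell
#!/bin/sh
# Classify a set of identical containers against FREE_VM (all values in MB).
# Arguments: free_vm, number of containers, RAM per container, SWAP per container.
overcommit_state() (
    free_vm=$1; n=$2; ram=$3; swap=$4
    guaranteed=$(( n * ram ))             # SUM(vmguarpages)
    allocatable=$(( n * (ram + swap) ))   # SUM(privvmpages)
    if [ "$allocatable" -le "$free_vm" ]; then
        echo "not overcommitted"
    elif [ "$guaranteed" -lt "$free_vm" ]; then
        echo "safe overcommitting"
    else
        echo "guarantees exceed FREE_VM"
    fi
)

overcommit_state 12288 4 1024 2048
overcommit_state 12288 5 1024 2048
overcommit_state 12288 12 1024 2048
```

With vmguarpages=privvmpages, as in the current PVE configuration, the second state can never occur, because the guaranteed sum always equals the allocatable sum.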

PS: With the configuration I suggested, the user can still do the "static partitioning" by assigning a container only RAM and no SWAP memory (because that's the configuration you are actually converting it into).

PPS: Have to catch my flight now, but using an Ubuntu kernel with a Lenny system isn't the best long-term solution I guess!?

dietmar · May 28, 2009

jens said:
Ignoring the possibility of host processes going rogue, OOM panic is avoided if SUM(active container vmguarpages) < FREE_VM as the kernel won't allow a container to allocate memory that is guaranteed to other containers. Let's call this "safe overcommitting".

I don't see how that is 'safe'?

jens said:
Your current configuration choice doesn't allow any overcommitting at all as vmguarpages=privvmpages. Or in other words, you disallow dynamic allocation/sharing of unused memory resources. It's more like a "static partitioning". Trying to start an additional container will most likely fail, even if the system has plenty of unused resources.

Yes, that is what I call 'safe'.

NOTE: you can simply add more SWAP space on the host (that way you can share unused resources and you are safe).

jens said:
PPS: Have to catch my flight now, but using an Ubuntu kernel with a Lenny system isn't the best long-term solution I guess!?

Lenny uses an unstable, unsupported version of openvz - I guess that is also not the best solution?

dietmar · May 28, 2009

jens said:
Your current configuration choice doesn't allow any overcommitting at all as vmguarpages=privvmpages. Or in other words, you disallow dynamic allocation/sharing of unused memory resources. It's more like a "static partitioning". Trying to start an additional container will most likely fail, even if the system has plenty of unused resources.

It seems I misunderstand something. Doesn't our approach allow exactly the same amount of memory (RAM+SWAP) to be allocated to a VM?

- Dietmar

jens · May 28, 2009

dietmar said:
Lenny uses an unstable, unsupported version of openvz - I guess that is also not the best solution?

As far as I can see, OpenVZ provides "stable" patches for RHEL4/5 and "less stable" patches for "distro kernels which maintain their own security patches (such as Debian)". Anything based on 2.6.24, 2.6.26 or 2.6.27 "are development, bleeding edge branches". KVM can't really be called 'stable' yet, so I wouldn't use 'stable' for PVE either (no offence!).

What I suggested was trying to get recent OpenVZ stuff and PVE packages into Debian unstable. Mixing Debian stable and unstable was never a problem here and Ubuntu is more 'bleeding edge' anyway (not that I don't like Ubuntu!).

jens · May 28, 2009

Dietmar, let's calculate an example:

My virtualization hosts come with 4.5GB physical memory and 8GB swap. I want swap to be used only occasionally and consider continuous swap usage as a critical issue. To simplify calculations, I assume that the host itself will never need more than 512MB, resulting in 4GB of free physical memory and a total virtual memory of 12GB.

We have a couple of physical, not yet virtualized servers, each allocating no more than 750MB RAM during 99.5% of uptime, with allocated RAM spiking up to 2048MB at times. These servers are currently running stably with 1024MB physical memory and 2048MB swap, and we use the same values in the PVE GUI. RAM spikes are random and uncorrelated.

I further assume that the actual memory usage (pages that have been written to) is 30% less than the number of allocated pages (giving ~525MB / 1433MB max).

CPU usage, IO performance and pages shared between containers are ignored.

Worst Case Scenarios

W1: all containers actually use 750MB (and not just 525MB) during normal operations
W2*: all containers actually use 1024MB during normal operations
W3: all containers have a 2048MB memory spike at the same time
W4*: all containers have a 2048MB memory spike at the same time, with all of the memory being actually used
W5*: 1 "rogue" container allocating and using all possible resources (3072MB)
W6*: 2 "rogue" containers allocating and using all possible resources (2x3072MB)

Note that all cases marked with (*) would have to be regarded as critical or having severe consequences on the original non-virtualized server, too.

Current PVE Configuration

=> Allowing for a maximum of 4 containers

W1: OK
W2*: critical, any further growth will cause permanent use of swap
W3: OK
W4*: critical, peak swap usage >= physical memory
W5*: critical, rogue container using 3GB and the 3 other normal servers 525MB each, causing permanent use of 551MB swap
W6*: disastrous, permanent use of 3098MB swap

Suggested Configuration

=> running 4 containers
identical to current configuration

=> running 5 containers
W1: OK
W2*: critical, permanent usage of 1024MB swap
W3: OK
W4*: critical, peak swap usage >= physical memory
W5*: critical, rogue container using 3GB and the 4 other normal containers 525MB each, causing permanent usage of 1076MB swap
W6*: disastrous, permanent use of 3623MB swap

=> running 6 containers
W1: critical, permanent usage of 404MB swap
W2*: critical, permanent usage of 2048MB swap
W3: critical, peak swap usage >= physical memory
W4*: critical, peak swap usage >= physical memory
W5*: critical, rogue container using 3GB and the 5 other normal containers 525MB each, causing permanent usage of 1601MB swap
W6*: disastrous, permanent swap usage >= physical memory

=> running 7 containers
W1: critical, permanent usage of 1154MB swap
W2*: critical, permanent usage of 3072MB swap
W3: critical, peak swap usage >= physical memory
W4*: critical, peak swap usage >= physical memory
W5*: critical, rogue container using 3GB and the 6 other normal containers 525MB each, causing permanent usage of 2126MB swap
W6*: disastrous, permanent swap usage >= physical memory

=> running 8 containers
possible, but critical as permanent swap usage is 104MB
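The swap figures in these scenarios follow from simple arithmetic and can be recomputed with a short script. A minimal sketch using the assumptions stated above (4096MB free physical memory, 525MB actually used per normal container, 3072MB for a rogue one); the variable and function names are made up.

```shell
#!/bin/sh
PHYS_MB=4096    # free physical memory on the host
NORMAL_MB=525   # memory actually used by a container in normal operation
ROGUE_MB=3072   # RAM + SWAP fully used by one rogue container

# Swap needed once the given total usage exceeds free physical memory.
swap_needed() {
    if [ "$1" -gt "$PHYS_MB" ]; then echo $(( $1 - PHYS_MB )); else echo 0; fi
}

# W5: one rogue container, the remaining n-1 containers at normal usage.
w5_swap() {
    swap_needed $(( ROGUE_MB + ($1 - 1) * NORMAL_MB ))
}

for n in 4 5 6 7; do
    echo "$n containers, W5: $(w5_swap "$n") MB swap"
done
echo "8 containers, normal operation: $(swap_needed $(( 8 * NORMAL_MB ))) MB swap"
```

The other scenarios can be checked the same way by passing the respective total usage to swap_needed.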

Conclusion

With 4 containers, both configurations react identically to all test cases.
With the original configuration, the host is limited to 4 containers since guaranteed memory is identical to the available virtual memory.
The suggested configuration allows running up to 7 containers with identical performance during regular operations.
With 5 containers, tolerance to worst case scenarios is nearly identical to the 4 container configuration, but running 6 and 7 containers shows less tolerant behaviour.
Running 8 containers is possible with minimal performance degradation during regular operations. Worst case results are in line with the 6/7 container configurations.
Worst case vulnerability of both 4 and 5 container configurations is identical to the original non-virtualized servers situation.
None of the tested worst case scenarios should result in failcnt increments, a kernel out-of-memory panic or random killing of processes.
The system should stay manageable (e.g. allow container migration or killing a rogue process/container) under all worst case assumptions. However, under permanent high swap usage, host processes might compete for physical memory and management can become sluggish (TODO: would increasing the priority of host management processes help here??)

The suggested configuration would allow an even higher density with mostly idle containers, e.g. cron jobs, client demos, test and build environments. Such containers can be largely swapped out by the host during inactivity.

dietmar · May 29, 2009

jens said:
As far as I can see, OpenVZ provides "stable" patches for RHEL4/5 and "less stable" patches for "distro kernels which maintain their own security patches (such as Debian)".

AFAIK the only 'supported' distro is Ubuntu (at least the OpenVZ people told me that).

jens said:
Anything based on 2.6.24, 2.6.26 or 2.6.27 "are development, bleeding edge branches".

We had serious problems with those development branches. ...

dietmar · May 29, 2009

jens said:
None of the tested worst case scenarios should result in failcnt increments, a kernel out-of-memory panic or random killing of processes.

OK, I must be stupid. You run 8 containers and assign 3GB to each. This sums up to 24GB (but you have only 4GB + 8GB swap). Wouldn't that trigger the OOM killer sometimes? (How do you guarantee that the OOM killer is never triggered?)

thefool808 · Jun 12, 2009

I think the point is that with the current configuration you will never be able to run more than 4 containers. It's not that you want to always prevent process killing, it's that once you hit that process killing point (with 5 machines using all available ram), you are already in a critical situation (high sustained swap usage) with only 4 machines.

It's a philosophical point saying:

"I'm willing to take the process killing chance, because preventing the over commitment of RAM is not going to prevent a critical situation from occurring in the worst case scenario, however, it does prevent high density under normal circumstances."

Disclaimer: I'm not advocating Jens's point of view, just trying to understand it a little more clearly.

flosoft · Apr 18, 2010

I think I grasped jens' point, and I think he is right.

Let me try to explain it (the way I'd like Proxmox to behave).

I have 1 server (S1), with 1GB of RAM and 2GB of SSD for SWAP (don't think of Proxmox's memory use here, it's just an example).

VM1 has a configuration: 512MB RAM; 1024MB of SWAP.
VM2 has a configuration: 512MB RAM; 1024MB of SWAP.

Now, highest possible load: both machines use 100% of SWAP and RAM. System runs fine.

Now, let's say VM1 uses 256MB of RAM, and 0 MB of SWAP.
VM2 uses 768MB of "Memory". Proxmox notices that there's RAM available, so it uses the RAM instead of SWAP on the host machine. So it allocates 768MB on S1's RAM.

Now, VM1 starts some processes that use more memory. It also starts using 768MB of "Memory". Now, Proxmox moves 256MB of VM2's memory into SWAP, as VM2 exceeded its memory limit by 256MB, so that VM1 can use its guaranteed 512MB of RAM, and 256MB of SWAP on S1.

Basically:
Guaranteed Memory (RAM): Memory that the VM can get at any time on the host and that resides in the host's RAM.
Memory limit (SWAP): Memory that the VM can allocate on the host. If the host has spare RAM, it is allocated there, but otherwise on SWAP. Guaranteed memory has priority on the host's RAM.

This way, the system runs stable, and there's no way a single VM can abuse memory (i.e. hog all the RAM, and all other VMs get SWAP.)
