It took me a while to find out why Java-based services were problematic inside a VZ container and to actually fix the problems. This is an attempt to document the issues and provide some workarounds.
As a motivation, take the Zimbra Collaboration Suite as an example: http://pve.proxmox.com/wiki/Zimbra says that it requires 3GB of memory and another 3GB of swap. Indeed, trying to run Zimbra with less memory won't work. In the case of Zimbra there are multiple reasons for that; for now, just believe me that Zimbra runs well with around 750MB of memory.
(1) What do the Memory and Swap settings in Proxmox actually do?
For a test, I created a VM with 256 MB Memory and 512 MB Swap. Upon entering the machine, a look into /proc/meminfo gives me a MemTotal of 786432kB and a SwapTotal of 0kB - not exactly what one would have expected!
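The values can be checked directly from the shell; the output below is trimmed to the two relevant lines:
Code:
root@test:~# grep -E 'MemTotal|SwapTotal' /proc/meminfo
MemTotal:       786432 kB
SwapTotal:           0 kB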
Let's have a look at the resources that were allocated for this VM. From within the container, root can read the file /proc/user_beancounters. In the OpenVZ docs, this is sometimes referred to as UBC. Here is an edited version of this file, with some lines omitted and some huge values replaced by a single * character:
Code:
root@test:~# cat /proc/user_beancounters
Version: 2.5
       uid  resource         held  maxheld  barrier    limit  failcnt
       101: privvmpages      2017     3097   196608   209108        0
            physpages        1416     2473        0        *        0
            vmguarpages         0        0   196608        *        0
            oomguarpages     1416     2473   196608        *        0
The columns are as follows: 'held' and 'maxheld' do the accounting, with 'held' being the current value and 'maxheld' the maximum since the container was started. 'barrier' and 'limit' are the configured maximum values. 'failcnt' counts the number of failed allocations for that resource. Except for 'failcnt', all memory resources are measured in pages of 4KB.
I can't give a full description of these resources here (see the OpenVZ Wiki), but in short: the barrier on 'privvmpages' is the maximum amount of virtual memory a container can allocate, and the total sum over all containers on a host may be much larger than the host's actual memory plus swap space. The reason for this overcommitment is that there is a difference between allocating memory and actually using it (e.g. by writing to the allocated memory). The held value of 'physpages' is what the container is actually using, and there is normally no barrier on it. The barrier on 'vmguarpages' is the memory that is guaranteed to a container in normal operation; any allocation on top of this may fail if the host is low on memory. Finally, consider the situation where the host has completely run out of memory ("OOM") and needs to kill some container processes. The barrier on 'oomguarpages' defines the guaranteed memory during OOM: if the container is using more memory than this, the host may kill container processes.
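These barriers are configured on the host rather than inside the container. With plain OpenVZ tools they would be set roughly as shown below; note that Proxmox normally manages these values for you, so treat this only as an illustration:
Code:
# on the host: set barrier:limit for privvmpages of container 101
vzctl set 101 --privvmpages 196608:209108 --save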
Let's interpret these numbers: the relevant barriers are all set to (196608*4KB)=786432KB. This is the MemTotal we found previously. The 'privvmpages' held value of (2017*4KB)=8068KB is the memory allocated by all processes in the container, while the 'physpages' held value of (1416*4KB)=5664KB is the memory actually used by the container (this was a fresh minimal Ubuntu 8.04 install).
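The conversion from pages to KB is just a multiplication by 4:
Code:
root@test:~# echo "$((2017 * 4)) KB allocated, $((1416 * 4)) KB in use"
8068 KB allocated, 5664 KB in use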
Still, there doesn't seem to be any distinction between Memory and Swap here. I would have expected the VMGUARPAGES and OOMGUARPAGES barriers to be around the Memory value in the Proxmox GUI and PRIVVMPAGES to reflect (Memory+Swap). Maybe one of the Proxmox guys can fill in here?
(2) The trouble with Java applications
As we have seen above, the MemTotal visible to a container is reported as if there were no difference between the configured Memory and Swap size. Any process looking at MemTotal and assuming a "standard UNIX configuration", with X MB of memory being matched by 2-3 times that amount of swap space, will be totally off from our actual configuration. In the above case, with 256+512 MB configured, such a process will assume 768 MB of RAM to be available.
And so does Sun's Java VM, which by default allocates a quarter of the available "physical" RAM for its heap. If that's not enough, it also adds 64MB for the permanent generation plus approximately another 64MB for stacks, buffers, etc. For now, let's assume we are running a single application server instance, which allocates 768MB/4+64MB+64MB=320MB. Let's further assume it only uses 75% of the allocated amount - we should still be running comfortably, right?
Why doesn't it work in practice? For one, there are often other processes running in parallel: think of a database with a huge demand on caches, and several pre-spawned Apache instances. The other issue is scripts that fire up additional Java VMs to interact with the application server: even a simple java -version command will try to allocate the 320MB we calculated above, and some scripts fire up several VMs in parallel or in background cron jobs.
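You can see this effect directly: once the container's privvmpages barrier is nearly exhausted, even a trivial invocation fails with something like:
Code:
root@test:~# java -version
Error occurred during initialization of VM
Could not reserve enough space for object heap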
(3) Overriding Java's memory defaults
The command line parameters to control the initial and maximum heap size are -Xms and -Xmx. In 1.5, Sun added DefaultInitialRAMFraction (=64) and DefaultMaxRAMFraction (=4) to set the defaults relative to the "physical" memory. Depending on your application or server, you might be able to manually calculate sensible values and provide them as additional command line options. In the above case, adding -Xms128m -Xmx128m would assign a good amount of memory to the application server while still leaving some room for other apps. Please do check your server's docs and Sun's heap tuning recommendations before you change any production systems. Also note that for these command line options to be effective, they must be placed before any class or jar files!
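For example, with a hypothetical server jar the invocation would look like this:
Code:
# correct: JVM options come before the jar/class name
java -Xms128m -Xmx128m -jar myserver.jar
# wrong: anything after the jar name is passed to the application, not to the JVM
java -jar myserver.jar -Xms128m -Xmx128m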
The other problem is helper VMs started from various scripts and cron jobs, as you probably don't want to edit every single java invocation. The solution here is to provide defaults through a special environment variable, e.g. export _JAVA_OPTIONS="-XX:DefaultInitialRAMFraction=128 -XX:DefaultMaxRAMFraction=16". Using these fraction parameters instead of -Xms and -Xmx still allows an override on the command line. I'm currently adding these from a script in profile.d.
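For reference, the profile.d snippet could look like this (the file name is my own choice):
Code:
root@test:~# cat /etc/profile.d/java-memory.sh
# make the JVM assume a much smaller share of the reported "physical" RAM:
# initial heap = RAM/128, maximum heap = RAM/16
export _JAVA_OPTIONS="-XX:DefaultInitialRAMFraction=128 -XX:DefaultMaxRAMFraction=16"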
The problem with this approach is that scripts run by init don't pick up the variable, and you may need to edit your init scripts to source it manually. Does anyone have a better suggestion?
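If you have to go down that route, the edit amounts to a single line near the top of the affected init script (the script name below is just a placeholder):
Code:
# in /etc/init.d/my-java-service, before the daemon is started
[ -r /etc/profile.d/java-memory.sh ] && . /etc/profile.d/java-memory.sh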
(4) Monitoring failcnt in /proc/user_beancounters
(to be completed)
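Until this section is written up, a quick check is to print every resource line whose last column (failcnt) is non-zero:
Code:
root@test:~# awk 'NR > 2 && $NF > 0' /proc/user_beancounters
If this prints anything, the named resource has failed an allocation at least once since the container was started.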
(5) Running Zimbra in a VZ container
(to be completed on another day; there are additional issues related to /proc/meminfo reporting the wrong values here, and Zimbra's base configuration, which is targeted at serving several thousand mailboxes)