Best strategy to handle strange JVM errors inside VPS

pezi · Sep 10, 2011

I am playing around with Proxmox/OpenVZ to migrate a couple of VMware instances. Most applications are Java based and can easily be migrated to OpenVZ. Each customer/project gets its own VPS.

Now I am trying to migrate Alfresco, an open source document management system. The actual problem: the JVM that runs the Tomcat container dies after an hour. I tested with 3 different JVMs (two different versions of the Sun JDK, and OpenJDK) and saw the same behaviour. I searched the Alfresco forum and found no hint regarding this problem. It seems to be a JVM/OpenVZ problem. Another VPS with Open-Xchange (Java-based groupware) works fine.

Environment:
pve-manager: 1.8-23 (pve-manager/1.8/6533)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.8-42
pve-kernel-2.6.32-6-pve: 2.6.32-42
qemu-server: 1.1-31
pve-firmware: 1.0-13
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.28-1pve5
vzdump: 1.2-15
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6

VPS settings

Question: Is it allowed to use VZ templates from OpenVZ, or is there a restriction to use only the VZ templates provided by Proxmox (tested, etc.)?

Any ideas on how to handle such problems?

With best regards
Peter

JVM dump


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (synchronizer.cpp:1401), pid=672, tid=6302576
#  guarantee(mid->header()->is_neutral()) failed: invariant
#
# JRE version: 6.0_22-b22
# Java VM: OpenJDK Client VM (20.0-b11 mixed mode, sharing linux-x86 )
# Derivative: IcedTea6 1.10.2
# Distribution: Ubuntu 11.04, package 6b22-1.10.2-0ubuntu1~11.04.1
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-6/
#

---------------  T H R E A D  ---------------

Current thread (0x08873800):  VMThread [stack: 0x00582000,0x00603000] [id=675]

tom · Sep 10, 2011

any failcounts? see http://wiki.openvz.org/Proc/user_beancounters

tom · Sep 10, 2011

pezi said:
Question: Is it allowed to use VZ templates from OpenVZ, or is there a restriction to use only the VZ templates provided by Proxmox (tested, etc.)?

You can use whatever template you want, but the Debian 6 templates are strongly preferred here (and can also be created with dab).

pezi · Sep 10, 2011

Thanks for the quick response. Yes, there are failcounts:
resource held maxheld barrier limit failcnt
privvmpages 26283 562327 655360 667860 14
There are some hints about this "problem". I will try to solve it with those hints and post my results.
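
For anyone who wants to check this quickly, here is a minimal sketch that pulls the non-zero failcnt entries out of /proc/user_beancounters output. It assumes the usual column order (resource, held, maxheld, barrier, limit, failcnt). The function name `failed_counters` and the sample text are made up for illustration; only the privvmpages row comes from the post above, and the kmemsize row is a placeholder.

```python
def failed_counters(text):
    """Return {resource: failcnt} for every row whose failcnt is non-zero."""
    failures = {}
    for line in text.splitlines():
        fields = line.split()
        # The first row of each container starts with "<ctid>:"; drop it.
        if fields and fields[0].endswith(":") and fields[0][:-1].isdigit():
            fields = fields[1:]
        # Data rows are: resource held maxheld barrier limit failcnt
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        failcnt = int(fields[5])
        if failcnt > 0:
            failures[fields[0]] = failcnt
    return failures


# Hypothetical sample; only the privvmpages row is taken from the thread.
SAMPLE = """\
       uid  resource        held  maxheld  barrier   limit  failcnt
      101:  kmemsize           0        0        0       0        0
            privvmpages    26283   562327   655360  667860       14
"""

if __name__ == "__main__":
    # Inside a container: text = open("/proc/user_beancounters").read()
    print(failed_counters(SAMPLE))
```

If several containers are listed (as on the host), resources with the same name overwrite each other in this sketch, so it is best run inside a single container.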

tom · Sep 10, 2011

The Java VM has problems calculating the available memory inside an OpenVZ container. This issue is known and has been discussed several times in the context of running Zimbra on OpenVZ - just search for it.
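
To put the numbers from the beancounter line in perspective: privvmpages is counted in pages, which are 4 KiB on x86, so the barrier of 655360 pages corresponds to 2560 MiB. A small sketch of that arithmetic follows. `pages_to_mib` and `PAGE_SIZE` are names chosen here, and comparing the result against an explicit -Xmx setting is only a suggestion, not something confirmed in this thread.

```python
PAGE_SIZE = 4096  # bytes per page; assumed 4 KiB (x86)


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a beancounter value given in pages to MiB."""
    return pages * page_size / (1024 * 1024)


if __name__ == "__main__":
    # Values taken from the privvmpages line posted above.
    for name, pages in [("maxheld", 562327), ("barrier", 655360), ("limit", 667860)]:
        print(f"privvmpages {name}: {pages_to_mib(pages):.1f} MiB")
```

Knowing the barrier in MiB makes it easier to judge whether the heap and other JVM memory can fit under it at all.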

pezi · Sep 14, 2011

My latest results: the problem with the fail counts for privvmpages was fixed by increasing the memory! But this step doesn't fix the main problem: the page fault in the JVM.
I played around with the GC parameters of the JVM and created a Debian template (instead of Ubuntu) with Alfresco. No luck: the JVM dies after a while.

As a last test I moved this template to the test node of the cluster. Surprise: there is no problem with the JVM there, and Alfresco has been running for two days!

Master node:
model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz
cpu MHz : 2499.544
16 GB Ram

Second node:
vendor_id : AuthenticAMD
model name : AMD Athlon(tm)64 X2 Dual Core Processor 4200+
4 GB Ram

Both nodes have been updated to Proxmox 1.9. Very strange problem! There is another VZ container on the master node with an open source Java stack (Open-Xchange) that has been running for three weeks. No problems.

cadiolis · Sep 23, 2011

I am having a similar problem. I just updated two Proxmox installs from 1.8 to 1.9. Now various Java VEs are having strange problems similar to those you report above. One install is our custom Java webapp and it just seems to freeze after a while. No errors, no high CPU, no nothing. It just quits working. We had a Hudson build server (a Java app) that just won't run after the 1.9 upgrade. It starts fine with no errors but then the same thing happens: it just seems to freeze. Every once in a while it will segfault as well.

Do you have any additional thoughts on what is going on here?

pezi · Sep 23, 2011

I gave up trying to get Alfresco running on the master node. It was on Proxmox 1.8 at the start of this thread and is now on the latest Proxmox 1.9, including PVE test.

I tried various JVM parameters, but on the master node Alfresco dies after a while. Most JVM dump messages were related to internal memory management.

On the other hand, on the old test PC (test node) the JVM (Alfresco) runs. I think this is a timing-related problem: the just-in-time compiler (the JVM produces different results for different CPUs) in combination with new hardware (Intel(R) Core(TM) i3-2100T CPU) and OpenVZ.

cadiolis · Sep 23, 2011

Ughh... this is incredibly frustrating. Java apps seem to be running fine but then just stop responding.

I guess I'll try rolling back to 1.8

dietmar · Sep 23, 2011

The new kernel enforces the CPU limits set in the VM configuration. So maybe it helps if you assign more CPU power.

iti-asi · Sep 23, 2011

We've got exactly the same problem: after upgrading from 1.8 to 1.9, all our virtual machines with applications using the JVM (JBoss, Tomcat, Nuxeo) stopped working.

We did these upgrades (Proxmox 1.8 to 1.9):

Kernel 2.6.18 -> kernel 2.6.32-6
Kernel 2.6.24 -> kernel 2.6.32-6
Kernel 2.6.32-4 -> kernel 2.6.32-6

The virtual machines (with JVM) were working fine before the upgrade, and after upgrading to 2.6.32-6 they stopped working (the JVM crashes or stops responding).
After rebooting the hosts into their initial kernels (2.6.18, 2.6.24, 2.6.32-4), everything works fine again.
We also migrated a virtual machine that had this trouble on 2.6.32-6 to a cluster still running Proxmox 1.8 with kernel 2.6.32-4, and it works fine there.

cadiolis · Sep 23, 2011

Good to know we have a real bug here. Dietmar, I assume you mean to bump up the 'CPUs' option on the VE web config. I will try this but then I will need to downgrade to 1.8 (or boot into the old kernel) as I need these machines operational asap

ChristOff · Sep 24, 2011

+1 for me.

I use Zimbra in a Lucid OpenVZ container (configuration based on ve-vswap-1024m.conf-sample in /etc/vz/conf, so most parameters except PHYSPAGES, SWAPPAGES, KMEMSIZE and LOCKEDPAGES are set to unlimited). All failcnt values are 0, but after "some time" (5 minutes, 6 hours, 15 hours) Zimbra stops responding with no error message at all (I checked all the logs in /var/log, on the host and in the container, and also in /opt/zimbra/log). SSH connections are still possible when this occurs; JVM/Zimbra simply stops answering. The only way to get it back is to reboot the container.

Yesterday evening it got worse: the whole host stopped answering (ping OK, but no https, no ssh, and no access to any container). After rebooting the host I could not find anything in the logs either ("grep -Ri error /var/log" displays nothing interesting; cron jobs kept running past the point where all services became unavailable but were unable to communicate with the outside world).

I'll try downgrading the kernel to 2.6.32-4 and see if it helps.

Host: Core i5 i2400, 16GB RAM

lspci:
lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network Connection (rev 05)
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b5)
00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation H67 Express Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 PCI bridge: Integrated Technology Express, Inc. Device 8892 (rev 10)
03:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)

pveversion -v:
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.9-43
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.32-6-pve: 2.6.32-43
qemu-server: 1.1-32
pve-firmware: 1.0-13
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.28-1pve5
vzdump: 1.2-15
vzprocps: 2.0.11-2
vzquota: 3.0.11-1dso1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6

pezi · Sep 24, 2011

I downgraded the kernel to
pve-kernel-2.6.32-4-pve: 2.6.32-33

Alfresco has now been running for 6 hours without a crash.

I will monitor this app to verify that the old kernel fixes the Java/OpenVZ problem!

pezi · Sep 25, 2011

Hi!

Switching to the prior kernel "fixes" the JVM crash problem!

During my tests with Alfresco we discovered 3 types of JVM/application misbehaviour:
- JVM crash - problem no. 1
- JVM runs with 100% CPU, but the application can handle HTTP requests
- JVM seems to be still alive, but the application doesn't respond

You wrote
http://forum.proxmox.com/threads/7023-Proxmox-VE-1.9-released!?p=40248#post40248

not really, the 2.6.32-4 is based on Squeeze, the 2.6.32-6 is based on RHEL61.

So I believe finding the exact problem will be difficult.

Can we do anything to help you fix this problem? Testing, etc.?

With best regards
Peter

dietmar · Sep 25, 2011

pezi said:
Can we do anything to help you fix this problem? Testing, etc.?

It would be great if you could find an easy way to reproduce that bug. You can also report the bug on the OpenVZ forum - maybe someone there has an idea.

dik23 · Sep 25, 2011

Can I confirm that this is an issue solely with the new kernel? Is the rest of the 1.9 update safe for use with the JVM?

cadiolis · Sep 26, 2011

I downgraded two machines to 1.8 (had some other errors trying to boot into original kernel) and everything is working again.

I think this will be a difficult bug to track down. To reproduce it you could probably do what I did when trying to rebuild my build server. I just created a new Debian VE, installed Java, downloaded Jenkins (or Hudson) and ran it with 'java -jar jenkins.war'
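
One way to turn this into a repeatable test is to poll the web UI and record how long it takes before it stops answering. A rough sketch, with assumptions: the 60 second interval and the function names `http_alive` and `watch` are illustrative and not part of anyone's setup in this thread.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_alive(url, timeout=10.0):
    """Return True if the server sends any HTTP response within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means the application answered.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, reset, malformed response, ...
        return False


def watch(probe, interval=60.0, max_checks=None, sleep=time.sleep, clock=time.time):
    """Call probe() repeatedly, waiting `interval` seconds between calls.

    Returns the seconds elapsed until the first failed probe, or None if
    `max_checks` probes all succeeded.
    """
    start = clock()
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            return clock() - start
        checks += 1
        sleep(interval)
    return None
```

Usage would be something like watch(lambda: http_alive("http://localhost:8080/")), assuming Jenkins is listening on its default port 8080 inside the VE. Running the same script against a VE on the old and on the new kernel would give comparable time-to-freeze numbers. Note that it only detects the "stops responding" case; a JVM crash shows up the same way and has to be told apart from the logs.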

pezi · Sep 26, 2011

I will try to find a test case for an easily reproducible JVM failure.

dietmar said:
It would be great if you could find an easy way to reproduce that bug. You can also report the bug on the OpenVZ forum - maybe someone there has an idea.

For posting on the OpenVZ forum: what is the relationship between the pve kernel and the official OpenVZ kernel?
http://download.openvz.org/kernel/branches/rhel6-2.6.32/current/
pve kernel = OpenVZ kernel + some modifications, e.g. newer drivers?

tom · Sep 26, 2011

The latest 2.6.32-6 kernel is based on the stable OpenVZ branch (RHEL6) but with some small modifications and a bunch of newer drivers for NICs and RAID controllers.
