Best strategy to handle strange JVM errors inside VPS

pezi

New Member
Aug 2, 2011
I am playing around with Proxmox/OpenVZ to migrate a couple of VMware instances. Most of the applications are Java based and can easily be migrated to OpenVZ. Each customer/project gets its own VPS.

Now I am trying to migrate Alfresco, an open source document management system. The actual problem: the JVM that serves the Tomcat container dies after an hour. I tested with 3 different JVMs (two different versions of the Sun JDK, and OpenJDK) and saw the same behaviour. A search in the Alfresco forum turned up no hint about this problem. It seems to be a JVM/OpenVZ problem. Another VPS with Open-Xchange (Java based groupware) works fine.

Environment:
pve-manager: 1.8-23 (pve-manager/1.8/6533)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.8-42
pve-kernel-2.6.32-6-pve: 2.6.32-42
qemu-server: 1.1-31
pve-firmware: 1.0-13
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.28-1pve5
vzdump: 1.2-15
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6

VPS settings
105.jpg

Question: Is it allowed to use VZ templates from OpenVZ, or is there a restriction to use only VZ templates provided by Proxmox (tested, etc.)?

Any ideas on how to handle such problems?

With best regards
Peter

JVM dump
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (synchronizer.cpp:1401), pid=672, tid=6302576
#  guarantee(mid->header()->is_neutral()) failed: invariant
#
# JRE version: 6.0_22-b22
# Java VM: OpenJDK Client VM (20.0-b11 mixed mode, sharing linux-x86 )
# Derivative: IcedTea6 1.10.2
# Distribution: Ubuntu 11.04, package 6b22-1.10.2-0ubuntu1~11.04.1
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-6/
#

---------------  T H R E A D  ---------------

Current thread (0x08873800):  VMThread [stack: 0x00582000,0x00603000] [id=675]
 

Question: Is it allowed to use VZ templates from OpenVZ, or is there a restriction to use only VZ templates provided by Proxmox (tested, etc.)?

You can use whatever template you want, but the Debian 6 templates are strongly preferred here (and can also be created with dab).
 
Thanks for the quick response. Yes, there are fail counts:
privvmpages 26283 562327 655360 667860 14
There are some hints about this "problem". I will try to solve it using these hints and will post my results.
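To make the fail count line above easier to read, here is a minimal Python sketch that parses one row in the column order used by /proc/user_beancounters (resource, held, maxheld, barrier, limit, failcnt). The names `Beancounter`, `parse_beancounter` and `pages_to_mib` are made up for this illustration, and the 4 KiB page size is an assumption that holds on typical x86 hosts.

```python
from typing import NamedTuple

PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, as on typical x86 hosts


class Beancounter(NamedTuple):
    """One row of /proc/user_beancounters (hypothetical helper type)."""
    resource: str
    held: int
    maxheld: int
    barrier: int
    limit: int
    failcnt: int


def parse_beancounter(line: str) -> Beancounter:
    """Parse a row such as 'privvmpages 26283 562327 655360 667860 14'."""
    fields = line.split()
    # The first row of each container block is prefixed with '<ctid>:'; drop it.
    if fields and fields[0].endswith(":"):
        fields = fields[1:]
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    name, *numbers = fields
    return Beancounter(name, *(int(n) for n in numbers))


def pages_to_mib(pages: int) -> float:
    """Convert a page count (privvmpages is counted in pages) to MiB."""
    return pages * PAGE_SIZE_BYTES / (1024 * 1024)


row = parse_beancounter("privvmpages 26283 562327 655360 667860 14")
print(f"{row.resource}: failcnt={row.failcnt}, "
      f"peak {pages_to_mib(row.maxheld):.0f} MiB, "
      f"barrier {pages_to_mib(row.barrier):.0f} MiB")
```

A non-zero failcnt means an allocation was refused at some point, so the row above says the container hit its privvmpages limit 14 times.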
 
The Java VM has problems calculating the available memory inside an OpenVZ container. This issue is known and has been discussed several times in connection with running Zimbra on OpenVZ - just search for it.
 
My latest results: the fail counts for privvmpages were fixed by increasing the memory! But this step doesn't fix the main problem: the page fault of the JVM.
I played around with the GC parameters of the JVM and created a Debian template (instead of Ubuntu) with Alfresco. No luck - the JVM still dies after a while.

As a last test I moved this template to the test node of the cluster. Surprise: there is no problem with the JVM there, and Alfresco has been running for two days!

Master node:
model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz
cpu MHz : 2499.544
16 GB Ram

Second node:
vendor_id : AuthenticAMD
model name : AMD Athlon(tm)64 X2 Dual Core Processor 4200+
4 GB Ram

Both nodes have been updated to Proxmox 1.9. Very strange problem! There is another VZ container on the master node with an open source Java stack (Open-Xchange) that has been running for three weeks. No problems.
 
I am having a similar problem. I just updated two Proxmox installs from 1.8 to 1.9. Now various Java VEs are having strange problems similar to those you report above. One install is our custom Java webapp and it just seems to freeze after a while. No errors, no high CPU, no nothing. It just quits working. We had a Hudson build server (a Java app) that just won't run after the 1.9 upgrade. It starts fine with no errors but then does the same thing and just seems to freeze. Every once in a while it will segfault as well.

Do you have any additional thoughts on what is going on here?
 
I gave up trying to get Alfresco running on the master node. I used Proxmox 1.8 at the start of this thread, and now Proxmox 1.9 in the latest version, including PVE test.

I tried various JVM parameters, but on the master node Alfresco dies after a while. Most JVM dump messages were related to internal memory management.

On the other hand, on the old test PC (test node) the JVM (Alfresco) runs. I think this is a timing-related problem: the just-in-time compiler (the JVM produces different results for different CPUs) in combination with new hardware (Intel(R) Core(TM) i3-2100T CPU) and OpenVZ.
 
Ughh... this is incredibly frustrating. Java apps seem to be running fine but then just stop responding.

I guess I'll try rolling back to 1.8
 
The new kernel enforces the CPU limits set in the VM configuration. So maybe it helps if you assign more CPU power.
 
We've got exactly the same problem: after upgrading from 1.8 to 1.9, all our virtual machines with applications using the JVM (JBoss, Tomcat, Nuxeo) stopped working.

We did these upgrades (Proxmox 1.8 to 1.9):

Kernel 2.6.18 -> kernel 2.6.32-6
Kernel 2.6.24 -> kernel 2.6.32-6
Kernel 2.6.32-4 -> kernel 2.6.32-6

The virtual machines (with JVM) were working fine before the upgrade. After upgrading to 2.6.32-6 they stopped working (the JVM crashes or stops responding).
After rebooting the hosts into their initial kernels (2.6.18, 2.6.24, 2.6.32-4), everything works fine again.
We also migrated a virtual machine that had this trouble under 2.6.32-6 to a cluster still running Proxmox 1.8 with kernel 2.6.32-4, and it works fine there.
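Based only on the kernel versions reported in this thread, a small Python sketch like the following could flag hosts that run the kernel under suspicion. The function name `is_affected_kernel` and the prefix match are assumptions for illustration; the thread only establishes that 2.6.32-6 shows the problem and that 2.6.18, 2.6.24 and 2.6.32-4 do not.

```python
import platform

# Assumption taken from the reports in this thread: only the 2.6.32-6 kernel
# shows the JVM problem; 2.6.18, 2.6.24 and 2.6.32-4 were reported as fine.
AFFECTED_PREFIX = "2.6.32-6"


def is_affected_kernel(release: str) -> bool:
    """Return True if a kernel release string such as '2.6.32-6-pve' matches."""
    # Require an exact match or a following '-' so that a hypothetical
    # '2.6.32-60-pve' would not be flagged by accident.
    return release == AFFECTED_PREFIX or release.startswith(AFFECTED_PREFIX + "-")


running = platform.release()  # kernel release of the machine running the script
status = "affected" if is_affected_kernel(running) else "not in the affected list"
print(f"running kernel {running}: {status}")
```

Run on the host rather than inside a container if you want to check the node itself.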
 
Good to know we have a real bug here. Dietmar, I assume you mean to bump up the 'CPUs' option on the VE web config. I will try this but then I will need to downgrade to 1.8 (or boot into the old kernel) as I need these machines operational asap
 
+1 for me.

I use Zimbra in a Lucid OpenVZ container (configuration based on ve-vswap-1024m.conf-sample in /etc/vz/conf, so most parameters except PHYSPAGES, SWAPPAGES, KMEMSIZE and LOCKEDPAGES are set to unlimited). All failcnt values are 0, but after "some time" (5 minutes, 6 hours, 15 hours) Zimbra stops responding with no error message at all (I checked all the logs in /var/log, on the host and in the container, and also /opt/zimbra/log). SSH connections are still possible when this occurs; JVM/Zimbra simply stops answering. The only way to get it back is to reboot the container.

Yesterday evening it got worse: the whole host stopped responding (ping OK, but no HTTPS, no SSH, and no access to any container). After rebooting the host I could not find anything in the logs either ("grep -Ri error /var/log" shows nothing interesting; cron jobs kept running past the point where all services became unavailable, but were unable to communicate with the outside world).

I'll try downgrading the kernel to 2.6.32-4 and see if it helps.

Host: Core i5-2400, 16 GB RAM

lspci:
lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network Connection (rev 05)
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b5)
00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation H67 Express Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 PCI bridge: Integrated Technology Express, Inc. Device 8892 (rev 10)
03:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)

pveversion -v:
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.9-43
pve-kernel-2.6.32-4-pve: 2.6.32-33
pve-kernel-2.6.32-6-pve: 2.6.32-43
qemu-server: 1.1-32
pve-firmware: 1.0-13
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.28-1pve5
vzdump: 1.2-15
vzprocps: 2.0.11-2
vzquota: 3.0.11-1dso1
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.0-6
 
I downgraded the kernel to
pve-kernel-2.6.32-4-pve: 2.6.32-33

Alfresco has now been running for 6 hours without a crash. :p I will keep monitoring this app to verify that the old kernel fixes the Java/OpenVZ problem!
 
Hi!

Switching to the prior kernel "fixes" the JVM crash problem!

During my tests with Alfresco we discovered 3 types of JVM/application misbehaviour:
- JVM crash - problem no. 1
- JVM runs at 100% CPU, but the application can handle HTTP requests
- JVM seems to be still alive, but the application doesn't respond
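The three cases in the list above can be told apart from three observations: whether the JVM process still exists, whether an HTTP request gets an answer, and how much CPU the process uses. The following Python sketch shows one way to encode that. The names `JvmState` and `classify` and the 95% threshold are assumptions for illustration, and collecting the three inputs is left to whatever monitoring is at hand.

```python
from enum import Enum


class JvmState(Enum):
    """Hypothetical labels for the failure modes described in this thread."""
    CRASHED = "JVM process is gone"
    SPINNING = "JVM at ~100% CPU but still answering HTTP"
    HUNG = "JVM process alive but application not responding"
    HEALTHY = "no symptom observed"


def classify(process_alive: bool, http_ok: bool, cpu_percent: float,
             busy_threshold: float = 95.0) -> JvmState:
    """Map three observations to one of the failure modes.

    busy_threshold is an arbitrary cut-off chosen for this sketch.
    """
    if not process_alive:
        return JvmState.CRASHED
    if not http_ok:
        return JvmState.HUNG
    if cpu_percent >= busy_threshold:
        return JvmState.SPINNING
    return JvmState.HEALTHY


print(classify(process_alive=True, http_ok=False, cpu_percent=0.0).value)
```

Logging the state together with a timestamp would make it easier to compare how long each kernel runs before a failure shows up.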

You wrote:
http://forum.proxmox.com/threads/7023-Proxmox-VE-1.9-released!?p=40248#post40248
not really, the 2.6.32-4 is based on Squeeze, the 2.6.32-6 is based on RHEL 6.1.

So I believe finding the exact problem will be difficult.

Can we do anything to help you fix this problem? Testing, etc.?

With best regards
Peter
 
Can we do anything to help you fix this problem? Testing, etc.?

It would be great if you could find an easy way to reproduce that bug. You can also report the bug on the OpenVZ forum - maybe someone there has an idea.
 
Can I confirm that this is an issue solely with the new kernel? Is the rest of the 1.9 update safe for use with the JVM?
 
I downgraded two machines to 1.8 (had some other errors trying to boot into original kernel) and everything is working again.

I think this will be a difficult bug to track down. To reproduce it you could probably do what I did when trying to rebuild my build server. I just created a new Debian VE, installed Java, downloaded Jenkins (or Hudson) and ran it with 'java -jar jenkins.war'
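To turn the Jenkins setup above into a measurable test case, one could poll the web UI and record how long it takes until it stops answering. Here is a minimal Python sketch using only the standard library. The function names `probe` and `watch`, the polling interval and the URL in the usage note are assumptions for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the server sends any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still proves that the JVM answers requests.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, reset or timed-out connections count as "not responding".
        return False


def watch(url: str, interval: float = 60.0, timeout: float = 10.0,
          max_checks: Optional[int] = None) -> Optional[float]:
    """Poll url until the first failed probe.

    Returns the seconds elapsed before the failure, or None if max_checks
    probes all succeeded.
    """
    start = time.monotonic()
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(url, timeout=timeout):
            return time.monotonic() - start
        checks += 1
        time.sleep(interval)
    return None
```

For example, `watch("http://localhost:8080/")` would block until the application stops answering, assuming Jenkins listens on port 8080 inside the VE. Running the same script under both kernels would give comparable time-to-failure numbers for a bug report.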
 
I will try to find a test case that makes the JVM failure easy to reproduce.

It would be great if you could find an easy way to reproduce that bug. You can also report the bug on the OpenVZ forum - maybe someone there has an idea.
A question before posting on the OpenVZ forum: what is the relationship between the PVE kernel and the official OpenVZ kernel?
http://download.openvz.org/kernel/branches/rhel6-2.6.32/current/
PVE kernel = OpenVZ kernel + some modifications, e.g. newer drivers?
 
The latest 2.6.32-6 kernel is based on the stable OpenVZ branch (RHEL6), with some small modifications and a bunch of newer drivers for NICs and RAID controllers.
 
