proxmox server hard crash

quickstarter

New Member
Jan 4, 2014
I have the latest Proxmox server 3.1 installed and am having stability issues. The server crashes hard when doing intensive work in either an OpenVZ container or a KVM VM. I can make the server die by running a WebLogic install script in any container.

If I try to install WebLogic in an OpenVZ container, it works the first few times, but then the Proxmox server suddenly crashes, taking down all VMs and kicking all other machines off the network. If I install WebLogic in a KVM VM the server still crashes, but at least it doesn't kick other machines off the network. It is the installation process that causes the issue, and I see the CPU being pegged.

I've seen a couple of other posts by people with similar issues, so it could be hardware related. The hardware is an SGI Rackable with dual quad-core Opterons and 28G RAM. I've verified the memory is good with memtest86. All guests are CentOS 6.5. The WebLogic install is WebLogic 12.1.2 on Oracle JDK 1.7. This is the latest up-to-date Proxmox 3.1 as of yesterday, installed on Debian Wheezy.

I'm about ready to give up on Proxmox and go with something else, but I thought I'd check in case I missed something obvious. Any suggestions for something more stable?
 
Do you run Enterprise Manager (emd) on the WebLogic server? On our 10g and 11g installs we have seen from time to time that emd is capable of bringing the server down by eating all available memory and CPU. After we switched to Grid Control we have not seen this again. So if emd is running on the server, try shutting it down and see if that helps.

BTW, what do the messages log and the Oracle log show?
 
Nope, no Enterprise Manager - this is a simple install of the lightweight WebLogic "dev" server, which just has the app server. The crash happens during the distribution installation process; WebLogic hasn't started yet. During those times when Proxmox doesn't die I can start the admin server normally without errors, but I haven't had a chance to test it under load.
 
/var/log/messages from the server and the weblogic log file.

I am at the server and here was the message displayed on the screen:

Kernel panic - not syncing: Watchdog detect hard lockup on cpu 0
Pid 151368 comm: kvm veid: 0 Not tainted 2.6.32-36-pve #1

I should also note that the local network was completely unusable until I restarted the server.

Now that I am sitting in front of the console I see the following error pop up when I start a guest:
kvm: 4530: cpu0 unhandled rdmsr: 0Xc0010001
I saw this error occur multiple times before the server crashed.


/var/log/messages on the Proxmox server before the crash:
Feb 2 11:24:16 quickstart1 kernel: device tap131i0 entered promiscuous mode
Feb 2 11:24:16 quickstart1 kernel: vmbr0: port 1(tap131i0) entering forwarding state
Feb 3 06:25:04 quickstart1 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2340" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Feb 3 08:12:35 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:12:35 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:16:55 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:00022DF0:00A794F2:52EFC0F7:qmclone:129:root@pam:
Feb 3 08:18:18 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:00022DF0:00A794F2:52EFC0F7:qmclone:129:root@pam: OK
Feb 3 08:18:20 quickstart1 kernel: device tap131i0 entered promiscuous mode
Feb 3 08:18:20 quickstart1 kernel: vmbr0: port 1(tap131i0) entering forwarding state
Feb 3 08:18:51 quickstart1 kernel: EXT4-fs (sda1): Unaligned AIO/DIO on inode 16778340 by kvm; performance will be poor.
Feb 3 08:33:31 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:33:31 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:33:47 quickstart1 kernel: device tap131i0 entered promiscuous mode
Feb 3 08:33:47 quickstart1 kernel: vmbr0: port 1(tap131i0) entering forwarding state
Feb 3 08:35:24 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:35:24 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:35:36 quickstart1 kernel: device tap131i0 entered promiscuous mode
Feb 3 08:35:36 quickstart1 kernel: vmbr0: port 1(tap131i0) entering forwarding state
Feb 3 08:36:35 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:36:35 quickstart1 kernel: vmbr0: port 1(tap131i0) entering disabled state
Feb 3 08:36:49 quickstart1 kernel: device tap131i0 entered promiscuous mode
Feb 3 08:36:49 quickstart1 kernel: vmbr0: port 1(tap131i0) entering forwarding state
Feb 3 09:12:07 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:0002434A:00ACA2A1:52EFCDE7:qmclone:129:root@pam:
Feb 3 09:16:31 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:0002458D:00AD09CC:52EFCEEF:qmclone:129:root@pam:
Feb 3 09:17:48 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:0002458D:00AD09CC:52EFCEEF:qmclone:129:root@pam: OK
Feb 3 09:17:51 quickstart1 kernel: device tap132i0 entered promiscuous mode
Feb 3 09:17:51 quickstart1 kernel: vmbr0: port 3(tap132i0) entering forwarding state
Feb 3 09:40:54 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:00024EC0:00AF4560:52EFD4A6:qmclone:129:root@pam:
Feb 3 09:42:14 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:00024EC0:00AF4560:52EFD4A6:qmclone:129:root@pam: OK
Feb 3 09:42:17 quickstart1 kernel: device tap133i0 entered promiscuous mode
Feb 3 09:42:17 quickstart1 kernel: vmbr0: port 5(tap133i0) entering forwarding state
Feb 3 09:52:58 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:000253FA:00B06055:52EFD77A:qmclone:129:root@pam:
Feb 3 09:54:16 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:000253FA:00B06055:52EFD77A:qmclone:129:root@pam: OK
Feb 3 09:54:19 quickstart1 kernel: device tap134i0 entered promiscuous mode
Feb 3 09:54:19 quickstart1 kernel: vmbr0: port 6(tap134i0) entering forwarding state
Feb 3 09:56:38 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:00025584:00B0B614:52EFD856:qmclone:129:root@pam:
Feb 3 09:58:05 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:00025584:00B0B614:52EFD856:qmclone:129:root@pam: OK
Feb 3 09:58:08 quickstart1 kernel: device tap135i0 entered promiscuous mode
Feb 3 09:58:08 quickstart1 kernel: vmbr0: port 7(tap135i0) entering forwarding state
Feb 3 10:21:23 quickstart1 pvesh: <root@pam> starting task UPID:quickstart1:00025FE7:00B2FA51:52EFDE23:qmclone:129:root@pam:
Feb 3 10:22:52 quickstart1 pvesh: <root@pam> end task UPID:quickstart1:00025FE7:00B2FA51:52EFDE23:qmclone:129:root@pam: OK
Feb 3 10:22:55 quickstart1 kernel: device tap136i0 entered promiscuous mode
Feb 3 10:22:55 quickstart1 kernel: vmbr0: port 8(tap136i0) entering forwarding state
<<CRASH>>
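
For anyone reading along: one thing that stands out in the excerpt above is that the qmclone task started at 09:12:07 has no matching "end task" line. Whether that has anything to do with the crash is unclear. Below is a minimal sketch for listing such unfinished tasks in a log file. The function name unfinished_tasks is made up for illustration, and it assumes the "starting task UPID:..." / "end task UPID:..." line format seen above.

```shell
#!/bin/sh
# Sketch: print pvesh tasks that have a "starting task" line but no "end task" line.
# Assumes the syslog line format shown in the excerpt; unfinished_tasks is a
# hypothetical helper name, not a Proxmox tool.
unfinished_tasks() {
    awk '
        {
            # Locate the UPID field wherever it sits on the line.
            upid = ""
            for (i = 1; i <= NF; i++) if ($i ~ /^UPID:/) upid = $i
            if (upid == "") next
        }
        / starting task / { started[upid] = $1 " " $2 " " $3 }
        / end task /      { delete started[upid] }
        END { for (u in started) print started[u], u }
    ' "$1"
}
```

Usage would be something like `unfinished_tasks /var/log/messages`, which prints the timestamp and UPID of every task that never logged an end line.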

The WebLogic install process did not leave any logs, but I did manage to reproduce the issue while watching, and ran the installer with bash -x to see what it was doing. All it was doing at the time was unpacking jars using the $JAVA_HOME/bin/unpack200 command. This command runs in a loop many times, and the console becomes less and less responsive until the CPU hard lockup error kills the server. Here is a log of the last commands the WebLogic installer ran before the crash:

+ for packedjar in '`echo $tpack`'
++ expr 61 - 1
+ tpacknum=60
++ sed 's/\.pack//g'
++ basename /opt/weblogic/wlserver/modules/com.bea.core.diagnostics.accessor_3.0.0.0.jar.pack
+ jarname=com.bea.core.diagnostics.accessor_3.0.0.0.jar
++ printf %-80s com.bea.core.diagnostics.accessor_3.0.0.0.jar
+ formattedname='com.bea.core.diagnostics.accessor_3.0.0.0.jar '
++ printf %3d 60
+ tpackstr=' 60'
+ '[' -z true ']'
+ echo ..... com.bea.core.diagnostics.accessor_3.0.0.0.jar
++ dirname /opt/weblogic/wlserver/modules/com.bea.core.diagnostics.accessor_3.0.0.0.jar.pack
+ path2jar=/opt/weblogic/wlserver/modules
+ /usr/java/jdk1.7.0_45/bin/unpack200 -r /opt/weblogic/wlserver/modules/com.bea.core.diagnostics.accessor_3.0.0.0.jar.pack /opt/weblogic/wlserver/modules/com.bea.core.diagnostics.accessor_3.0.0.0.jar
+ for packedjar in '`echo $tpack`'
++ expr 60 - 1
+ tpacknum=59
++ sed 's/\.pack//g'
++ basename /opt/weblogic/wlserver/modules/monfox.dsnmp.agent_1.2.0.0_4-7-30.jar.pack
+ jarname=monfox.dsnmp.agent_1.2.0.0_4-7-30.jar
++ printf %-80s monfox.dsnmp.agent_1.2.0.0_4-7-30.jar
+ formattedname='monfox.dsnmp.agent_1.2.0.0_4-7-30.jar '
++ printf %3d 59
+ tpackstr=' 59'
+ '[' -z true ']'
+ echo ..... monfox.dsnmp.agent_1.2.0.0_4-7-30.jar
++ dirname /opt/weblogic/wlserver/modules/monfox.dsnmp.agent_1.2.0.0_4-7-30.jar.pack
+ path2jar=/opt/weblogic/wlserver/modules
+ /usr/java/jdk1.7.0_45/bin/unpack200 -r /opt/weblogic/wlserver/modules/monfox.dsnmp.agent_1.2.0.0_4-7-30.jar.pack /opt/weblogic/wlserver/modules/monfox.dsnmp.agent_1.2.0.0_4-7-30.jar
+ for packedjar in '`echo $tpack`'
++ expr 59 - 1
+ tpacknum=58
++ sed 's/\.pack//g'
++ basename /opt/weblogic/wlserver/modules/clients/com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar.pack
+ jarname=com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar
++ printf %-80s com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar
+ formattedname='com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar '
++ printf %3d 58
+ tpackstr=' 58'
+ '[' -z true ']'
+ echo ..... com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar
++ dirname /opt/weblogic/wlserver/modules/clients/com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar.pack
+ path2jar=/opt/weblogic/wlserver/modules/clients
+ /usr/java/jdk1.7.0_45/bin/unpack200 -r /opt/weblogic/wlserver/modules/clients/com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar.pack /opt/weblogic/wlserver/modules/clients/com.oracle.webservices.wls.jaxrpc-client_12.1.2.jar
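
The loop in the trace above boils down to roughly the following. This is only a sketch reconstructed from the bash -x output, not the actual installer script: the function names jar_name_for and unpack_all and the UNPACK200 variable are made up for illustration, and the default JDK path is the one from the trace. It may be handy for trying to trigger the lockup without running the whole installer.

```shell
#!/bin/sh
# Sketch of the installer's unpack loop, reconstructed from the trace.
# UNPACK200, jar_name_for and unpack_all are hypothetical names.
UNPACK200="${UNPACK200:-${JAVA_HOME:-/usr/java/jdk1.7.0_45}/bin/unpack200}"

jar_name_for() {
    # Strip the trailing ".pack" to get the target jar path.
    # (The installer uses sed 's/\.pack//g', which removes every ".pack".)
    printf '%s\n' "${1%.pack}"
}

unpack_all() {
    # Unpack every *.jar.pack below the given directory.
    find "$1" -name '*.jar.pack' | while IFS= read -r packed; do
        jar=$(jar_name_for "$packed")
        echo "..... $(basename "$jar")"
        # -r removes the pack file after unpacking, as in the trace.
        "$UNPACK200" -r "$packed" "$jar" || echo "failed: $packed" >&2
    done
}
```

For example, `unpack_all /opt/weblogic/wlserver/modules` would walk the same directory the installer was working in when the server locked up.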

Running <INSTALL>/configure.sh --silent consistently causes the problem.

Hope this helps.