ProxMox VE 2.1 Total VE crash -never happened before!!

Petrus4

Hello,

I just had to hard-reboot my Proxmox VE 2.1 server after a total freeze. This concerns me: I have never had this happen before, and I have been with Proxmox since version 1.4.

What I was doing right before it happened:

I was uploading a new version of the CrashPlan PROe backup software via the CrashPlan PROe web console. CrashPlan PROe 3.3 runs in a Debian Squeeze OpenVZ container.
The upload did not appear to be working, and then both the CrashPlan PROe web interface and the Proxmox VE web interface froze. CrashPlan is a Java application.

I could not reach the server via SSH, only via the direct console. Here is a screenshot of the console at the time of the freeze.

I can attach server logs later if need be.

I really hope this is a bug that can be fixed soon; 6 production servers went down at once.

Here is my PVE environment info:

root@ProxMox-DMZ-1:~# pveversion -v
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
 
Just want to ensure I understand this correctly.
You had 6 Production VM servers go down on this one Proxmox server, not 6 Proxmox servers going down at the same time.
Is that right?

The bad_area_nosemaphore in the console output is the biggest clue as to what the problem is.
Most likely it is bad RAM; I suggest running memtest86.

Are you using ECC RAM?
 
Hi e100

6 production VM servers went down, not 6 Proxmox servers :) That would be really bad. :(

Thanks for the tip on memory! I am using ECC RAM. I will run memory tests tomorrow, though with 32 GB of RAM it will take a long time. Do you have any suggestions on how I can speed up the testing?
 
Look in your logs for EDAC errors from the kernel.
If any correctable errors have ever occurred they should be logged.
The EDAC errors can tell you exactly which stick of RAM is bad :)

Good for you, bad for me: I have a server that is having some issues right now.
This is what an EDAC error looks like:
Code:
EDAC MC0: [B]CE[/B] page 0x3f1f6f, offset 0x520, grain 0, syndrome 0x11c1, [B]row 3, channel 1[/B], label "": amd64_edac
The bolded parts:
CE = Correctable Error
row 3, channel 1 = the specific RAM module with the problem

See http://www.kernel.org/doc/Documentation/edac.txt for additional help.
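
A rough sketch of how the log search described above could be automated. It assumes the message layout shown in the example line ("EDAC MC0: CE ... row 3, channel 1"); other EDAC drivers and kernel versions may word their messages differently. The function name edac_summary is made up for illustration and is not part of any package.

```shell
#!/bin/sh
# Sketch: count EDAC memory-controller messages in a log file, grouped by
# controller, error type (CE = correctable, UE = uncorrectable), row and channel.
# Assumption: messages look like the amd64_edac line quoted in this thread.

# edac_summary is a hypothetical helper name.
edac_summary() {
    logfile="$1"
    # Keep only EDAC CE/UE lines, reduce each to "MCn TYPE row=R channel=C",
    # then count how often each combination occurs.
    grep -E 'EDAC MC[0-9]+: (CE|UE) ' "$logfile" |
        sed -E 's/.*EDAC (MC[0-9]+): (CE|UE) .*row ([0-9]+), channel ([0-9]+).*/\1 \2 row=\3 channel=\4/' |
        sort | uniq -c
}

# Example call (path taken from this thread):
#   edac_summary /var/log/messages
```

Each output line starts with a count, so a row/channel pair that shows up repeatedly points at the module worth swapping first. No output means no matching messages were found in that file, which by itself does not prove the EDAC driver is loaded.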

While the error I had was correctable, all sorts of things went wrong right after it:
vzdump hung
LVM locked up while removing a snapshot
VMs not working right
I am fairly confident that these other issues are related to bad RAM.
 
Sorry to hear about your server problems. Hope replacing your RAM will fix this issue.

I looked for EDAC errors in my syslog and kernel log but found none.
I don't think EDAC was actually working on my system. I just installed edac-utils, which I assume is needed in order to log any errors.
 
I've never had to install edac-utils.
The errors are logged to /var/log/messages
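
A quick way to check whether the kernel's EDAC driver is active at all, independent of edac-utils, is to look at its sysfs counters. This is only a sketch: it assumes the sysfs layout described in the kernel EDAC documentation (mc0, mc1, ... directories containing ce_count and ue_count), and the function name edac_counts and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: print correctable/uncorrectable error counters per memory controller.
# Assumption: counters live under /sys/devices/system/edac/mc/mcN/ as described
# in the kernel EDAC documentation. The optional argument only exists so the
# function can be pointed at a different directory.

# edac_counts is a hypothetical helper name.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    found=0
    for mc in "$base"/mc[0-9]*; do
        # Skip anything that does not expose both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        found=1
        printf '%s: correctable=%s uncorrectable=%s\n' \
            "$(basename "$mc")" "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controllers found under $base (driver probably not loaded)"
        return 1
    fi
}

# Example call:
#   edac_counts
```

If no controller directories show up, the EDAC driver for the chipset is probably not loaded, and in that case the absence of EDAC lines in the logs says nothing about the health of the RAM.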

After running some tests and looking at the timing of the events, I believe the errors I had are related to the solar flare that hit us with a CME (coronal mass ejection) today.
Server has been running fine since reboot and passes every memory test I throw at it.
 
Yes, I looked in /var/log/messages as well, but found no EDAC errors.

Really? I have heard that solar flares can cause problems, but according to NASA the one that just occurred did not have much effect on us earthlings. Who knows. My issues occurred on Friday, July 13th, around 17:45 EST. Was there any flare activity at that time?
 
