Kernel Panic: qmrestore results in MCE report...

John Martin

New Member
Mar 12, 2010
Oslo
After a long stable period, the Proxmox VE environment suddenly gets a kernel panic when trying to restore a backup.

pveversion -v
1.9-26 (pve-manager/1.9/6567)
running kernel: 2.6.18-6-pve
proxmox-ve-2.6.18: 1.8-15
pve-kernel-2.6.18-4-pve: 2.6.18-10
pve-kernel-2.6.18-6-pve: 2.6.18-15
qemu-server: 1.1-32
pve-firmware: 1.0-14
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.29-3pve1
vzdump: 1.2-16
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm-2.6.18: 0.9.1-15

I thought this was caused by a hardware error, since I get messages like this:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 0 TSC 8999881b5cc
MISC 435200040081
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: Unknown Error 9f
STATUS cc0007800001009f MCGSTATUS 0
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: Unknown Error 9f
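
For reference, the STATUS value in the log can be decoded by hand. The following is a minimal sketch, assuming the architectural bit layout of IA32_MCi_STATUS as described in the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0) and the compound "memory controller" code form `000F 0000 1MMM CCCC`. The function names are made up for illustration, and the interpretation should be checked against the manual for this exact CPU before drawing conclusions from it.

```python
# Flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Request types for the memory controller code form 000F 0000 1MMM CCCC.
MEMORY_REQUESTS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


def describe_mca_code(code):
    """Describe a code of the memory controller form, else return None."""
    # Bit 12 (F) is ignored; bit 7 must be set and the other upper bits clear.
    if code & 0xEF80 != 0x0080:
        return None
    request = MEMORY_REQUESTS.get((code >> 4) & 0x7, "reserved request type")
    channel = code & 0xF
    where = "channel not specified" if channel == 0xF else "channel %d" % channel
    return "%s, %s" % (request, where)


if __name__ == "__main__":
    decoded = decode_mci_status(0xcc0007800001009f)
    for flag in decoded["flags"]:
        print(flag)
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print(describe_mca_code(decoded["mca_error_code"]))
```

Under these assumptions the set flags are VAL, OVER, MISCV and ADDRV, which agrees with the "Error overflow", "MCi_MISC register valid" and "MCi_ADDR register valid" lines above, and code 9f falls into the memory controller form (memory read error, channel not specified). The decoder that printed "Unknown Error 9f" may simply not know that form.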

Now I am not sure, since I read in another forum that a kernel panic might be caused by different processors accessing the same area of memory. Running top always indicates that the kernel panic occurs under similar circumstances.

top - 22:22:41 up 1 day, 22:47, 1 user, load average: 2.41, 1.90, 0.89
Tasks: 144 total, 1 running, 143 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.6%us, 3.6%sy, 0.0%ni, 51.3%id, 32.8%wa, 0.1%hi, 1.7%si, 0.0%st
Mem: 6083208k total, 6047960k used, 35248k free, 8188k buffers
Swap: 5242872k total, 2764k used, 5240108k free, 3806484k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31568 root 18 0 4152 568 388 D 44 0.0 1:32.21 gzip
31570 root 18 0 3648 356 284 S 7 0.0 0:22.47 sparsecp
31567 root 18 0 18432 2524 808 S 3 0.0 0:11.54 tar
327 root 15 0 0 0 0 S 1 0.0 1:31.23 pdflush
328 root 10 -5 0 0 0 S 1 0.0 1:51.40 kswapd0
8750 root 10 -5 0 0 0 S 0 0.0 0:31.78 iscsi_q_9
10817 root 15 0 1177m 170m 1448 S 0 2.9 2:32.45 kvm
10907 root 18 0 4286m 1.6g 1524 S 0 28.1 30:00.10 kvm


Everything starts OK:
INFO: restore QemuServer backup 'vzdump-qemu-203-2011_11_16-23_17_52.tgz' using ID 200
INFO: extracting 'qemu-server.conf' from archive
INFO: extracting 'vm-disk-ide0.raw' from archive
INFO: Formatting '/var/lib/vz/images/200/vm-200-disk-1.raw', fmt=raw, size=32 kB
INFO: new volume ID is 'local:200/vm-200-disk-1.raw'
INFO: restore data to '/var/lib/vz/images/200/vm-200-disk-1.raw' (137438953472 bytes)

After some time the kernel panic occurs and the system crashes. Since I have had similar episodes earlier with different kernel versions, I wonder if there could be some problem related to either the pve firmware or the kernel version we run.

The server is a Supermicro:

product: X8SIE
vendor: Supermicro
physical id: 0

Running 4 of these:
description: CPU
product: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz
vendor: Intel Corp.
physical id: 4
bus info: cpu@0
version: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz
serial: To Be Filled By O.E.M.
slot: CPU
size: 2400MHz
capacity: 2400MHz
width: 64 bits


 
OK, then I will try to upgrade.

Unfortunately, when I did that the last time, I got Kernel Panic as a result of the upgrade.
I hope that's fixed now and not gonna happen again.


JM
 
If you still have problems with the current stable version, we will dig deeper.
 
Installed new kernel: Linux pve 2.6.32-6-pve

With the new kernel I get this:

INFO: restore QemuServer backup 'vzdump-qemu-203-2011_11_16-23_17_52.tgz' using ID 200
INFO: extracting 'qemu-server.conf' from archive
INFO: extracting 'vm-disk-ide0.raw' from archive
INFO: Formatting '/var/lib/vz/images/200/vm-200-disk-1.raw', fmt=raw, size=32 kB
INFO: new volume ID is 'local:200/vm-200-disk-1.raw'
INFO: restore data to '/var/lib/vz/images/200/vm-200-disk-1.raw' (137438953472 bytes)
Message from syslogd@pve at Nov 21 13:24:14 ...
kernel:[Hardware Error]: No human readable MCE decoding support on this CPU type.
Message from syslogd@pve at Nov 21 13:24:14 ...
kernel:[Hardware Error]: Run the message through 'mcelog --ascii' to decode.
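
To collect the lines the kernel asks to have decoded, the hardware error messages can be filtered out of a saved log first. This is only a sketch: the marker string is taken from the output above, while the function name and the idea of saving the log to a file are assumptions for illustration.

```python
MARKER = "[Hardware Error]"


def hardware_error_lines(lines):
    """Return each matching line, cut so that it starts at the MCE marker."""
    result = []
    for line in lines:
        pos = line.find(MARKER)
        if pos != -1:
            result.append(line[pos:].rstrip("\n"))
    return result


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python filter_mce.py /path/to/saved/kernel.log
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as log:
            for entry in hardware_error_lines(log):
                print(entry)
```

The printed lines can then be passed to `mcelog --ascii` on standard input, as the kernel message suggests. Whether mcelog accepts the lines with the marker still in front may depend on its version.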


JM

Serverfail20111121.jpg
 
Check your hardware for defects.
 
I have done that; I have run memtest. I have also experienced this problem before, and it was definitely linked to different kernels.

I will also try to install new memory later today.

For the last few weeks, it has been connected to the qmrestore command. As far as I know, gzip might utilize more than one processor. I would like to know if I could turn off that feature in qmrestore just to test.
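
One way to test this outside qmrestore, as a rough sketch: decompress the backup archive in a single process pinned to one CPU and throw the output away, then see whether that load alone triggers the MCE. The helper names are made up for illustration, `os.sched_setaffinity` is only available on Linux with Python 3.3 or later, and no restore is performed, so the disk write path is not exercised. Note that stock gzip is itself single-threaded; the restore spreads over several CPUs because tar, gzip and sparsecp run as separate processes, as the top output shows.

```python
import gzip
import os


def pin_to_cpu(cpu):
    """Restrict this process to one CPU where the platform supports it."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})
        return True
    return False


def decompress_and_discard(path, chunk_size=1024 * 1024):
    """Stream-decompress a gzip file and return the number of bytes read."""
    total = 0
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python gunzip_test.py <backup archive>.tgz
    if len(sys.argv) == 2:
        pin_to_cpu(0)
        print(decompress_and_discard(sys.argv[1]), "bytes decompressed")
```

If the machine survives this run but still panics during a real restore, that would point away from decompression and toward the write side or the combined load.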

JM
 
I just googled this issue, and most of the time it is hardware related: BIOS, CPU, memory, power supply, ...
 
The server is stable when not trying to restore, so I do not believe this is a hardware problem. All sensors show nice values in IPMI. No high temperature, normal voltage, no problems with any fan and so on.

There are no known disk problems and no memory problems; still, I will switch memory today. That leaves me with a possible BIOS problem, which is also quite strange, since the system is stable unless I try to use qmrestore.

If I google "Kernel Panic qmrestore", I get three hits, all related to my report.

If I google "Kernel Panic proxmox", I get more hits, among them the following link, describing a similar problem on a different system:
http://pve.proxmox.com/wiki/Mainboards

  • GA-MA770 UD3 rev 2.0 - BIOS FG causes kernel panic on copy large files (> 5GB) to server (AMD Athlon 5050e, 160GB SATA, 8GB RAM)
Since there are no BIOS updates I can use, what should I do to test this theory?

There are also other interesting hits: http://forum.proxmox.com/threads/7536-Is-this-kernel-bug-fixed-in-1-9

Since the new kernel is unstable on this system, I believe that you must check it again.