Kernel crash with kernel 2.6.32-pve

iti-asi

Hi,

I'm in a difficult situation with our Proxmox deployment.

Kernel 2.6.24 is stable, but shows horrible disk I/O performance. When I say horrible, I mean a chmod -R over a few GB in a VE will make the rest of the system grind to a halt, to the point that the DNS server (in a different container) will stop responding to queries, etc.

rsyncs take ages, and backup operations are a no-go right now because they would basically kill every other service (email, dns, ldap, samba...) in the different VEs.

This is relatively new hardware, so I guess it's just a SATA driver not working correctly in this kernel version, something that I think 2.6.32 would most probably fix. However, 2.6.32 dies a few minutes after booting (4 or 5 minutes max.), and I'm not sure if it's related to the network issues others are reporting.
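For reference, this is roughly what I run on the host to check which driver is bound to the SATA controller and which I/O scheduler the disks use (just a diagnostic sketch; the /dev/sda device name is an assumption):
Code:
# which kernel module is actually driving the SATA controller (ahci vs. ata_piix matters a lot)
lspci -k | grep -A 2 -i sata
# confirm the controller came up in AHCI mode with NCQ enabled
dmesg | grep -i -e ahci -e ncq
# current I/O scheduler for the first disk (cfq/deadline/noop)
cat /sys/block/sda/queue/scheduler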

I'm attaching a photograph of what was readable on the server's screen after the crash.

This is a summary of the host's hardware, in case it gives a clue about the 2.6.24 or 2.6.32 problems. Any tip is welcome, because the situation is getting worse by the day.
Code:
00:00.0 Host bridge: Intel Corporation QuickPath Architecture I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 1 (rev 13)
00:03.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 3 (rev 13)
00:07.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 7 (rev 13)
00:09.0 PCI bridge: Intel Corporation QuickPath Architecture I/O Hub PCI Express Root Port 9 (rev 13)
00:14.0 PIC: Intel Corporation QuickPath Architecture I/O Hub System Management Registers (rev 13)
00:14.1 PIC: Intel Corporation QuickPath Architecture I/O Hub GPIO and Scratch Pad Registers (rev 13)
00:14.2 PIC: Intel Corporation QuickPath Architecture I/O Hub Control Status and RAS Registers (rev 13)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Port 5
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
02:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1) (rev 02)
07:00.0 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
07:00.1 Ethernet controller: Intel Corporation Device 10c9 (rev 01)
 

Attachments

  • panic.jpg
post your pveversion -v. are you talking about containers or KVM guests when you say "VM"?

also, try the latest kernel from our pvetest repository (this one will be moved to stable soon).
http://download.proxmox.com/debian/...4/pve-kernel-2.6.32-4-pve_2.6.32-25_amd64.deb

do you need OpenVZ support? if not, 2.6.35 is also an option.
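For reference, a minimal sketch of installing that .deb by hand (substitute the full download URL from the link above):
Code:
# download the test kernel package using the full URL from the link above
wget <full-URL-from-the-link-above>
# install it and reboot into the new kernel
dpkg -i pve-kernel-2.6.32-4-pve_2.6.32-25_amd64.deb
reboot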

Oops, sorry. I mean OpenVZ containers. There is a KVM guest on that box too, but it's not set to autoboot; the stability problems are with OpenVZ.

I'll try the new kernel ASAP. I assume it updates to the very latest OpenVZ patch?
 
post your pveversion -v.

Here's pveversion -v:
Code:
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.24-12-pve
proxmox-ve-2.6.32: 1.6-24
pve-kernel-2.6.32-4-pve: 2.6.32-24
pve-kernel-2.6.24-11-pve: 2.6.24-23
pve-kernel-2.6.24-12-pve: 2.6.24-25
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1iti1
pve-qemu-kvm: 0.12.5-2
ksm-control-daemon: 1.0-4

Note the server is running 2.6.24-12-pve right now, for obvious reasons.
 
I did try that, with some version before the Proxmox 1.6 release, and it was slow as a tortoise too. Does it make sense to you that some driver (the SATA controller's?) needs a quirk for my hardware revision that is present in 2.6.32 but not in 2.6.24 or 2.6.18? One test I have not done is the 2.6.26 Lenny kernel, but I don't know if Proxmox actually runs on that.
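For what it's worth, this is how I quantify the "slow as a tortoise" behaviour when comparing kernels (a sketch; /dev/sda and the test path are assumptions, and iostat comes from the sysstat package):
Code:
# raw sequential read speed of the disk
hdparm -tT /dev/sda
# sequential write speed, bypassing the page cache
dd if=/dev/zero of=/var/lib/vz/ddtest bs=1M count=1024 oflag=direct
rm /var/lib/vz/ddtest
# per-device utilisation while a chmod -R or rsync is running
iostat -x 2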
 
Same here. I have a Proxmox 1.5 host, kernel 2.6.24, with an old OpenVZ container on it (running Zimbra), and I had to move it to a new 1.6 server with kernel 2.6.32.
The first problem was that setting "on boot" to yes produced an error, which I worked around by enabling and disabling quota through the web interface.
(the error was:
# /usr/bin/pvectl vzset 103 --onboot yes
Use of uninitialized value in numeric ne (!=) at /usr/bin/pvectl line 255.
vzctl set 103 --onboot yes --save
Saved parameters for CT 103
proxmox:/etc/vz/conf#

line 255 is:
if ($veconf->{'quotaugidlimit'}->{value} != $quotaugidlimit) {
push @$changes, '--quotaugidlimit', "$quotaugidlimit";
}
)

Then I started the VM; it worked for a few minutes (2?), then the whole Proxmox host stopped responding. The problem is that this "migration" has to be done outside working hours, so I'm doing it from home over SSH and screen.
In the morning the screen showed an image like yours.
I can only tell you that the VM's /etc/vz/conf/103.conf is:
# PVE default config for 256MB RAM

ONBOOT="no"

# Primary parameters
NUMPROC="1024:1024"
NUMTCPSOCK="9223372036854775807:9223372036854775807"
NUMOTHERSOCK="9223372036854775807:9223372036854775807"
VMGUARPAGES="1835008:9223372036854775807"

# Secondary parameters
KMEMSIZE="9223372036854775807:9223372036854775807"
OOMGUARPAGES="1835008:9223372036854775807"
PRIVVMPAGES="1835008:1847508"
TCPSNDBUF="9223372036854775807:9223372036854775807"
TCPRCVBUF="9223372036854775807:9223372036854775807"
OTHERSOCKBUF="9223372036854775807:9223372036854775807"
DGRAMRCVBUF="9223372036854775807:9223372036854775807"

# Auxiliary parameters
NUMFILE="9223372036854775807:9223372036854775807"
NUMFLOCK="9223372036854775807:9223372036854775807"
NUMPTY="255:255"
NUMSIGINFO="1024:1024"
DCACHESIZE="9223372036854775807:9223372036854775807"
LOCKEDPAGES="917504:917504"
SHMPAGES="9223372036854775807:9223372036854775807"
NUMIPTENT="9223372036854775807:9223372036854775807"
PHYSPAGES="9223372036854775807:9223372036854775807"

# Disk quota parameters
DISKSPACE="83886080:92274688"
DISKINODES="16000000:17600000"
QUOTATIME="0"

# CPU fair sheduler parameter
CPUUNITS="1000"
CPUS="1"
VE_ROOT="/var/lib/vz/root/$VEID"
VE_PRIVATE="/var/lib/vz/private/$VEID"
OSTEMPLATE="ubuntu-8.04-minimal_8.04-1_i386"
ORIGIN_SAMPLE="pve.auto"
HOSTNAME="lsmtp"
NAMESERVER="127.0.0.1"
SEARCHDOMAIN="mydomain.it"
NETIF="ifname=eth0,bridge=vmbr1,mac=00:18:51:02:81:BD,host_ifname=veth103.0,host_mac=00:18:51:6A:C0:2D"
DESCRIPTION="Zimbra su Ubuntu 8.04 IP .12 e gw .1"
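In case it is useful, this is how I check whether the container is hitting any of those UBC limits (a sketch; CTID 103 as in the config above):
Code:
# non-zero values in the failcnt column mean a beancounter limit was hit
vzctl exec 103 cat /proc/user_beancounters
# or, from the host, the whole table for all containers
cat /proc/user_beancounters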

Hope this helps in finding a workaround.
I will try to install kernel 2.6.24 tomorrow, hoping that the Windows KVM guest will not think that the hardware has changed and ask for re-activation.
Best regards
 
# /usr/bin/pvectl vzset 103 --onboot yes
Use of uninitialized value in numeric ne (!=) at /usr/bin/pvectl line 255.

OK, I will fix that in the next release. For now, simply use:

# vzctl set 103 --onboot yes
 
Yes, that's what I did, but this is the least of my problems. The real problem is that I've installed proxmox-ve-2.6.24-12, but at boot it says "Volume 'pve' not found", probably because the LSI Logic / Symbios Logic LSI MegaSAS 9260 (rev 04) controller is not supported and this confuses Proxmox (Proxmox is installed on SATA /dev/sda, while the SAS RAID is additional storage; the Zimbra VM resides on local storage).
I've restored a Samba (Debian-based) OpenVZ VM and it works fine. I've tried to modify the Zimbra .conf, copying some parameters from an empty, just-created, similar VM, but there is no good news: after 3 minutes of uptime of the Zimbra VM, the WHOLE Proxmox host crashes!
(Of course it is still working perfectly fine on the old Proxmox server, as it has for a long, long time.)
I've also disabled Zimbra autostart, but it still crashes.
Is there some relevant config or program change between the template ubuntu-8.04-minimal_8.04-1_i386 I used and the new ubuntu-8.04-standard_8.04-3_i386.tar.gz that I could try to copy?
Any idea why an OpenVZ VM would crash the whole Proxmox host?
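One way I can think of to compare them (a sketch; CTID 103 for the restored Zimbra container and CTID 104 for a freshly created one from the new template are assumptions):
Code:
# dump and diff the package lists of the restored container and a fresh one
vzctl exec 103 dpkg -l > /tmp/ct103-packages.txt
vzctl exec 104 dpkg -l > /tmp/ct104-packages.txt
diff /tmp/ct103-packages.txt /tmp/ct104-packages.txt
# and compare their OpenVZ configs as well
diff /etc/vz/conf/103.conf /etc/vz/conf/104.conf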
Thanks a lot
Current config:
proxmox:/var/lib/vz/private# pveversion -v
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.24: 1.6-26
pve-kernel-2.6.32-4-pve: 2.6.32-25
pve-kernel-2.6.24-12-pve: 2.6.24-25
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.5-2
proxmox:/var/lib/vz/private#
 
I'm confused... I've re-created the OpenVZ VM, this time based on Debian 5. I've installed Zimbra for Debian 5 32-bit. The installation went fine, I was able to log in to the web interface... tried to compose a message and... boom! Proxmox crashed again.
So could it be the packages that Zimbra installs? But I have a similar installation that works fine:
proxmox:~# pveversion -v
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.6-24
pve-kernel-2.6.32-4-pve: 2.6.32-24
pve-kernel-2.6.18-2-pve: 2.6.18-5
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.5-2
ksm-control-daemon: 1.0-4
proxmox:~#

The only noticeable difference is ksm-control-daemon, which is missing on my new Proxmox. I'm really worried; if this problem had happened in a more critical situation (it's only tolerable because the old Proxmox server is still running), it would have been a disaster.

Here is the crash photo:
 

Attachments

  • 2010-11-03 08.58.42.jpg
can you try the latest stable kernel (from pve - 2.6.32-25)?
 
Maybe I confused you... the first pveversion I listed is MINE, the one where I have the problem, and it is pve-kernel-2.6.32-4-pve: 2.6.32-25.
The last pveversion I posted is from a client of mine, where Zimbra works in a similar setup.
So either it's a bug that only the very latest kernel has and that pve-kernel-2.6.32-4-pve: 2.6.32-24 does not, or there is a different problem.
Thanks for your interest
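If it helps narrow this down, one thing I could try is booting the previous 2.6.32-24 build for comparison (a sketch; it assumes the older .deb is still in the local apt cache or otherwise available):
Code:
# reinstall the previous kernel build if the package file is still cached locally
dpkg -i /var/cache/apt/archives/pve-kernel-2.6.32-4-pve_2.6.32-24_amd64.deb
# reboot; 2.6.32-4-pve will now be the -24 build
reboot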
 
Maybe I confused you... the first pveversion I listed is MINE, the one where I have the problem, and it is pve-kernel-2.6.32-4-pve: 2.6.32-25.
The last pveversion I posted is from a client of mine, where Zimbra works in a similar setup.
So either it's a bug that only the very latest kernel has and that pve-kernel-2.6.32-4-pve: 2.6.32-24 does not, or there is a different problem.
Thanks for your interest

yes, I am confused. which kernel has the problem and can you instruct me how to reproduce it?
 
Simply put, my workplace bought a brand new server with four 300GB SAS drives in RAID5, a 1TB SATA disk and 16GB of ECC RAM.
I've installed Proxmox on sda, which is the SATA disk, and added the SAS array as LVM storage. That way I don't waste precious SAS storage on the Proxmox root and local storage, and I have plenty of local storage on SATA for "low I/O" VMs.
I've used the 1.6 ISO and aptitude update/dist-upgrade to the most recent Proxmox packages.
I've also added sdc1, a 2TB SATA disk, for backup storage.
Our final goal is to migrate the VMs we now have split across 2 Proxmox servers with 8GB RAM and only SATA local storage.
The first VM I moved to the new server was a KVM Win2003 guest that I put on the SAS storage. It has been working fine for a week.
Encouraged by this success, I vzdumped and vzrestored our OpenVZ-based Zimbra VM. I did it over SSH during the weekend, but the Proxmox server abruptly became unavailable.
Yesterday I tested again, and I discovered that 2-3 minutes after the Zimbra VM is up, I get that message on the screen and Proxmox is dead (no magic SysRq keys from the keyboard, no connectivity, nothing; see the picture attached to my previous post).
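Since the box dies with no keyboard or network response, I wonder whether the full panic text could be captured with netconsole instead of photographing the screen (a sketch; the IP addresses, interface name and MAC address are placeholders for my setup):
Code:
# on another machine on the LAN, listen for the kernel messages
nc -u -l -p 6666
# on the Proxmox host, before starting the Zimbra VM, stream kernel messages to that machine
# format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6666@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff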
I found this thread and tried kernel 2.6.24, but our RAID controller is not recognized, so I have to stick with 2.6.32.
Our current info is the first one I posted, that is:
proxmox:/var/lib/vz/private# pveversion -v
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.24: 1.6-26
pve-kernel-2.6.32-4-pve: 2.6.32-25
pve-kernel-2.6.24-12-pve: 2.6.24-25
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.5-2
proxmox:/var/lib/vz/private#

We have to consider that the other OpenVZ VMs I have, which seem to run smoothly once vzrestored (also on local storage), need very little I/O and RAM, so right now I don't know whether Zimbra is simply the only one able to trigger a faulty kernel, faulty hardware, or something else.
For the record, since we have another installation with Proxmox and Zimbra that works fine, I've posted its pveversion in case it helps you (the last pveversion posted).
I thought I'd better run a memtest tonight; I rebooted Proxmox and chose memtest from the GRUB menu, but the screen went black and the server restarted. I booted from a SystemRescue CD and it is running the test now (this is also scary, isn't it?).
If the RAM is OK, tomorrow I will try to restore Zimbra on the SAS storage, who knows. I'm in a really bad situation :)
Thanks a lot for your attention
 
I've reinstalled the same OpenVZ VM on a test Proxmox host, with an Athlon X2 and a plain SATA drive, and it works fine (no kernel crash)! So it seems there is some sort of incompatibility between Debian and that Fujitsu server (which would be very odd, since it's very expensive!), or could it matter that on the Fujitsu installation pveversion -v does NOT show the line "ksm-control-daemon: 1.0-4"? How can that be?
Another element to consider is that the "faulty" system was installed directly from the 1.6 ISO, while the other 2 systems (where Zimbra works under the same conditions) reached 1.6 through an upgrade.
This is the pveversion of the test Proxmox I used today (Athlon X2), where the Zimbra VM, once restored, works fine:
proxmox:~# pveversion -v
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.6-25
pve-kernel-2.6.32-4-pve: 2.6.32-25
pve-kernel-2.6.24-2-pve: 2.6.24-5
pve-kernel-2.6.18-1-pve: 2.6.18-4
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.5-2
ksm-control-daemon: 1.0-4

Btw, here is what seems to me a bug that I triggered while experimenting to understand the above problem: let's say I have OpenVZ VM 101 on system A, and a different OpenVZ VM 101 on system B, both with bridged networking.
Both will have a line like this in their /etc/vz/conf/101.conf file:
NETIF="ifname=eth0,bridge=vmbr0,mac=00:18:51:02:81:BD,host_ifname=veth101.0,host_mac=00:18:51:6A:C0:2D"
(of course, with different MAC addresses).
If I vzdump 101 on A, copy it to B and then vzrestore filename.tar 102 to avoid a collision, the resulting /etc/vz/conf/102.conf will STILL contain "host_ifname=veth101.0", so the VM will not start (since veth101.0 is already in use by VM 101)!
Shouldn't vzdump take care of this? For now I have to manually change host_ifname=veth101.0 to host_ifname=veth102.0.
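For reference, the manual fix I apply after restoring (a sketch; CTID 102 as in the example above):
Code:
# rename the host-side veth interface in the restored config so it matches the new CTID
sed -i 's/veth101\.0/veth102.0/' /etc/vz/conf/102.conf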
Thanks a lot.
 
vzdumped 105 with the old version, vzrestored as 125 with the old version: bug confirmed
vzrestored with the new version: works PERFECTLY! (veth105.0 -> veth125.0)
NETIF="ifname=eth0,bridge=vmbr1,mac=00:18:51:9E:F2:81,host_ifname=veth125.0,host_mac=00:18:51:61:13:1A"

Thanks a lot.
 
Hello everyone,

We have the same problem.
Our Proxmox VE runs in a cluster on 2 Intel Modular Server Blades.
Both hosts run on 180GB partitions located on the Multiflex storage.
The KVM machines are located on the shared storage via LVM.
KVM guests work fine, no problem with that.

But if I create an OpenVZ container (in my case Debian Lenny as well as Squeeze) and run "aptitude update" within the VM, the OpenVZ container crashes the whole host.
The output of the host on the remote console resembles the screenshot of the OP.

We have other Proxmox clusters running on some Phenoms as well as on some Xeons, connected to storage via NFS or iSCSI.
We never experienced problems of that kind on those clusters (not with kernel versions 2.6.18, 2.6.24 or 2.6.32).

The pveversion -v output of the IMS:


vm1:~# pveversion -v
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.6-25
pve-kernel-2.6.32-4-pve: 2.6.32-25
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.5-2
ksm-control-daemon: 1.0-4

I already tried a fresh install of the node but the outcome is the same.
Start an OpenVZ container and shortly afterwards the host crashes.
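In case it helps the debugging, this is what I check after such a crash, although with a hard hang the trace often never makes it to disk (a sketch):
Code:
# look for any oops/panic trace that was written out before the hang
grep -i -A 20 -e 'Oops' -e 'BUG:' -e 'Call Trace' /var/log/kern.log /var/log/syslog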

Some feedback to this bug would be nice :)
Thanks in advance and please keep up your excellent work!
 
