Kernel bug in pve-kernel 2.6.24-7 (soft lockup)

andio

Member
Aug 28, 2009
I just installed Lenny with kernel 2.6.26-2 (amd64) on an IBM x3850 (4 dual-core Xeon 7110 CPUs with VT and Hyper-Threading) a few minutes ago. It's a fresh, clean install with nothing else running.

The vanilla kernel 2.6.26-2 works fine, and Lenny's included OpenVZ kernel works fine too (I used that one on this server before).

I installed pve-kernel 2.6.24-7-pve (2.6.24-11) from the proxmox.com repository and rebooted.

The server needed 30 minutes just to boot.

When I finally managed to log in remotely, I noticed that the machine was extremely slow.
A "ps -ef" needs more than 10 seconds to list the running processes, and the output appears slowly, line by line.
A simple "apt-get install less" needs 5 minutes to install 'less'. The machine seems to be completely overloaded and is unusable.



I googled around and found some hints about a kernel bug that seems to be fixed in 2.6.26 or later, but I'm not sure. The only thing I'm sure of is that I definitely won't be able to use this pve-kernel on the IBM x3850 servers.

Any ideas?
I'd be glad to help find this bug in pve-kernel. I've got several of these servers, which I just bought to install Proxmox on. I could immediately "lend" one server with remote access for testing and fixing.


--
Update 1: I tested the other pve.old and pvetest kernels from the Proxmox repository as well, all with the same result :-(

Update 2: I further found out that this seems to be a bug in kernel 2.6.24, especially in Ubuntu. Do you use Ubuntu to compile the pve-kernels?

Update 3: I tried fiddling with the clocksource=.. kernel parameter (tried all available options: acpi_pm, jiffies, tsc, notsc, ...) without any difference. I also tried acpi=off and some other tricks, without success.
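For anyone repeating the clocksource experiment, here is a minimal Python sketch for checking which clocksource the running kernel actually selected after a reboot. It assumes the kernel exposes the usual sysfs directory /sys/devices/system/clocksource/clocksource0; the function name read_clocksources is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the clocksource files on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as read from sysfs."""
    base = Path(base)
    # current_clocksource holds a single name, e.g. "acpi_pm".
    current = (base / "current_clocksource").read_text().strip()
    # available_clocksource holds a space-separated list of names.
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling read_clocksources() on the affected machine after each boot shows whether the clocksource= option on the kernel command line really took effect.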

Update 4: similar situations are described here: http://forum.openvz.org/index.php?t=msg&goto=30122. It may be a kernel and/or hardware problem; at least kernel 2.6.25 has a workaround (quote: "Kernel 2.6.25 includes a workaround which detects this issue and uses only so much RAM as possible without a slowdown."). Will there be a 2.6.25+ pve-kernel soon?

For the Proxmox staff, I put the dmesg output online from the plain vanilla Debian 5 (Lenny) kernel (2.6.26) and from the pve-kernel, which makes the server unusably slow. Both were booted with printk.time=1 for comparison; you can see the difference very clearly.

http://www.andreasotto.net/dmesg-vanillakernel.txt runs like a charm, fast and stable; boot time: 38 seconds

http://www.andreasotto.net/dmesg-pvekernel.txt terribly slow; boot time: 25 minutes!
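To compare two such logs without reading them line by line, a rough Python sketch like the following could list the longest pauses between consecutive kernel messages. It assumes each line starts with the "[seconds.microseconds]" prefix that printk.time=1 adds; the function names parse_dmesg and largest_gaps are illustrative only.

```python
import re

# Matches the "[   12.345678] " prefix that printk.time=1 puts on each line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_dmesg(text):
    """Return a list of (seconds, message) for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line)
        if match:  # lines without a timestamp are skipped
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gaps(entries, count=5):
    """Return the `count` longest pauses as (gap_seconds, message_after_gap)."""
    gaps = [
        (later[0] - earlier[0], later[1])
        for earlier, later in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:count]
```

Running largest_gaps(parse_dmesg(text)) on the contents of each file should show where the slow kernel loses its time during boot; the message after each gap is only a hint at the culprit, not proof.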

Hopefully something can be done.
 

Looks like you need a new Kernel. Please contact office at proxmox.com to get an offer for this.
 
I have the same problem, but not on an IBM server.

cpu: Intel Xeon Quad, 4x 2.83+ GHz, 12 MB L2, FSB 1333 MHz
ram: 8 GB DDR2

Debian Lenny 5.0 / Proxmox 1.3 with kernel: pve-kernel-2.6.24-7-pve (pve-kernel-2.6.24-7-pve_2.6.24-11_amd64.deb from 21 August).

I get random server lockups under network or CPU load. I can't upgrade my BIOS, since it is a hosted server that I rent. It also crashes randomly at boot.

Right now it works, but my "dmesg" is full of:

BUG: soft lockup - CPU#2 stuck for 11s! [kstopmachine:3731]
CPU 2:
Modules linked in: ata_generic pata_acpi psmouse uhci_hcd ehci_hcd pata_marvell serio_raw pcspkr usbcore e1000e evdev video output button dm_snapshot thermal processor fan sata_nv via686a ahci mptctl mptsas scsi_transport_sas mptspi mptscsih mptbase dm_crypt raid456 async_xor async_memcpy async_tx xor raid0 raid1 md_mod dm_mirror dm_mod sata_via ata_piix sata_sis pata_sis libata sym53c8xx megaraid aic7xxx scsi_transport_spi sd_mod 3w_xxxx scsi_mod atl1 sky2 skge r8169 e1000 via_rhine sis900 8139too e100 mii
Pid: 3731, comm: kstopmachine Not tainted 2.6.24-7-pve #1 ovz005
RIP: 0010:[<ffffffff8028932c>] [<ffffffff8028932c>] stopmachine+0x4c/0x110
RSP: 0018:ffff810232195f40 EFLAGS: 00000202
RAX: 0000000000000001 RBX: 0000000000000e92 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: ffffffff804c7c30 R08: ffff810232194000 R09: 000000000007a574
R10: 0000000000000009 R11: ffffffff80423c30 R12: 0000000000000004
R13: ffffffff8024d9ca R14: 0000000000000e92 R15: ffff810232195ec0
FS: 0000000000000000(0000) GS:ffff810232c02c00(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f2ba0e68000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
[<ffffffff8023ca82>] schedule_tail+0x22/0x70
[<ffffffff8020d4e8>] child_rip+0xa/0x12
[<ffffffff802892e0>] stopmachine+0x0/0x110
[<ffffffff8020d4de>] child_rip+0x0/0x12
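
To see how often this happens, and on which CPU and task, a small Python sketch could count the reports in saved dmesg output. The pattern is based only on the message format shown above, so other kernel versions may word it differently; summarize_soft_lockups is a made-up name.

```python
import re
from collections import Counter

# Pattern derived from the message shown above:
# "BUG: soft lockup - CPU#2 stuck for 11s! [kstopmachine:3731]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def summarize_soft_lockups(text):
    """Map (cpu, task) to (number_of_reports, longest_stall_in_seconds)."""
    counts = Counter()
    longest = {}
    for cpu, seconds, task, _pid in SOFT_LOCKUP.findall(text):
        key = (int(cpu), task)
        counts[key] += 1
        longest[key] = max(longest.get(key, 0), int(seconds))
    return {key: (counts[key], longest[key]) for key in counts}
```

Applied to the single report above, this returns {(2, 'kstopmachine'): (1, 11)}.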




See: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/210672
It seems to be resolved for them.