KVM machines hanging

laradji · Dec 1, 2009

dietmar said:
Is it possible to reproduce that bug - how?

No, it was not, but it seems that VMs which have work to do crash more often.

kionez · Jan 26, 2010

I have the same problem. While testing virtio disk performance with dd (reading 1 GB from /dev/zero or /dev/urandom), my KVM virtual machine gives me kernel oopses:
The first time I got [ 2673.667286] BUG: unable to handle kernel NULL pointer dereference at 00000006 and dd segfaulted. Then, after a reboot, I left a "while true" loop running, which produced many [1266927070.756591] BUG: soft lockup - CPU#0 stuck for 1179869794s! [swapper:0] messages and left the VM unusable (CPU at 100%, difficulty accessing the filesystem).
I ran the tests on a standard i386 Debian lenny with the stock linux-image-2.6.26-2-686-bigmem kernel. I have now installed a kernel backported from squeeze (2.6.32-trunk) and am repeating the dd test.

Can you confirm that this kind of problem is solved with a newer kernel in the guest system?

Thanks!

k.

TiagoRF · Jan 26, 2010

I have a similar problem in PVE 1.4.

Proxmox WEBGUI claims the machines are running, but they're totally frozen.

We saw something curious in the behavior of an Exchange server: the machine came up, but until the NIC came up (the X on top of the icon went away), you couldn't do anything on the machine.

If you clicked the IE icon, for instance, it would only open after the NIC was up.

The NIC is an E1000, and we're facing some issues.

We've changed the memory configuration, trying to stabilize the system, and we've been working on the NICs; apparently it's getting more and more stable over time.

I'll update you as soon as I have info.

And we're running 4 cores per machine, WS2008 on all of them.

kionez · Jan 27, 2010

I left the dd stress test running for about 20 hours on two guests, and it gave no problems, so I thought that removing the host's cpufreq ondemand governor (I don't know why it was enabled on Hetzner's default installation) and upgrading the lenny guest's kernel to squeeze's one had solved this problem.
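For anyone who wants to check the same thing on their own host, here is a minimal sketch that lists the cpufreq governor of each CPU. It assumes the usual Linux sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor); the function name list_governors and its optional base-directory argument are hypothetical, made up for this illustration.

```shell
#!/bin/sh
# Print "cpuN governor" for every CPU that exposes a cpufreq governor.
# $1 is an optional base directory (defaults to the real sysfs path);
# it exists only so the function can be tried against a test directory.
list_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # no cpufreq support, or the glob did not match
        cpu="${f#"$base"/}"          # strip the base directory
        cpu="${cpu%%/*}"             # keep only "cpuN"
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Calling list_governors with no argument reads the real sysfs tree. If a CPU shows "ondemand", that is the governor mentioned above; no output at all most likely means the host has no cpufreq driver loaded.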

Today I re-created everything from scratch in order to install my VM: I built a brand new guest, installed lenny, installed a few packages, and updated the kernel to squeeze's one. I re-ran a dd test ( dd if=/dev/zero of=/tmp/test bs=1M count=1000 ) and the second time I launched it, all the problems came back:

It starts with:
[ 140.177741] BUG: Bad page state in process dd pfn:dae7d
Then:
[ 140.216248] BUG: unable to handle kernel NULL pointer dereference at 00000006
And then come the "CPU stuck" messages:
[ 273.808162] BUG: soft lockup - CPU#1 stuck for 61s! [rm:1975]
(I can attach the full dmesg log if it helps.)
So, I can't trust this platform for my production environment.. :|
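To make it easier to spot these messages between dd runs, here is a small sketch that filters a kernel log for the three BUG signatures quoted above. The function name scan_kernel_bugs is hypothetical, and the pattern only covers the messages seen in this post, not every possible kernel error.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print only those matching the
# three BUG signatures from this post. The exit status is grep's:
# 0 if at least one line matched, 1 if none did.
scan_kernel_bugs() {
    grep -E 'BUG: (Bad page state|unable to handle kernel NULL pointer dereference|soft lockup)'
}
```

Inside the guest it could be run as "dmesg | scan_kernel_bugs" after each dd pass; an empty result with exit status 1 means none of the three messages has shown up yet.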

Code:

# pveversion -v
pve-manager: 1.5-5 (pve-manager/1.5/4627)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-kernel-2.6.32-1-pve: 2.6.32-4
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-8
vncterm: 0.9-2
vzctl: 3.0.23-1pve7
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.11.1-2
ksm-control-daemon: 1.0-2

ProxmoxVE is manually installed on a Debian lenny host at Hetzner.de

The guest is an i386 Debian, everything on ext3, with 4 cores assigned, virtio disk and network; the disk is stored on an LVM group over a SW RAID mirror.

Is there something that I can do to solve this issue?

Thanks in advance

k.

TiagoRF · Jan 27, 2010

Apparently sniffer and I were able to stabilize the machines in production.

We uninstalled the Intel driver for the E1000 and everything has been fine since. Before that, whenever we stressed the machines with gigabytes of data, they always crashed.

Most of the time it happened when 3+ Windows machines were shutting down at the same time.

Quite bizarre, but everything seems fine atm.

tom · Jan 27, 2010

kionez said:
I left the dd stress test running for about 20 hours on two guests, and it gave no problems, so I thought that removing the host's cpufreq ondemand governor (I don't know why it was enabled on Hetzner's default installation) and upgrading the lenny guest's kernel to squeeze's one had solved this problem.

Today I re-created everything from scratch in order to install my VM: I built a brand new guest, installed lenny, installed a few packages, and updated the kernel to squeeze's one. I re-ran a dd test ( dd if=/dev/zero of=/tmp/test bs=1M count=1000 ) and the second time I launched it, all the problems came back:

It starts with:
[ 140.177741] BUG: Bad page state in process dd pfn:dae7d
Then:
[ 140.216248] BUG: unable to handle kernel NULL pointer dereference at 00000006
And then come the "CPU stuck" messages:
[ 273.808162] BUG: soft lockup - CPU#1 stuck for 61s! [rm:1975]
(I can attach the full dmesg log if it helps.)
So, I can't trust this platform for my production environment.. :|

Code:

# pveversion -v
pve-manager: 1.5-5 (pve-manager/1.5/4627)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-kernel-2.6.32-1-pve: 2.6.32-4
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-8
vncterm: 0.9-2
vzctl: 3.0.23-1pve7
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.11.1-2
ksm-control-daemon: 1.0-2

ProxmoxVE is manually installed on a Debian lenny host at Hetzner.de

The guest is an i386 Debian, everything on ext3, with 4 cores assigned, virtio disk and network; the disk is stored on an LVM group over a SW RAID mirror.

Is there something that I can do to solve this issue?

Thanks in advance

k.

I would go for the default 1.5 version - 2.6.18 kernel. And I would never use sw raid.

kionez · Jan 28, 2010

tom said:
I would go for the default 1.5 version - 2.6.18 kernel. And I would never use sw raid.

I know, I've read many, many posts on this forum..
But I can't use the default installation, because Hetzner doesn't allow installing from ISO, and I want to use SW RAID because at the moment we can't buy a HW controller (it costs too much).

I should try to downgrade to 2.6.18 (losing KSM).. but maybe we don't have the hardware requirements to run Proxmox KVM, and I have to (quickly) find a replacement..

Anyway, I think that ProxmoxVE is great software and I'll keep recommending it to friends and co-workers (and telling them to donate to the project).

Thanks!

k.

vitor costa · Jan 28, 2010

I have already seen this type of error. Is your CPU an Intel Xeon 5520 or 5540? These CPUs have HT (virtual CPUs), so with 4 cores you have double that...

I see that when running a guest with kernel 2.6.18 (CentOS). With Ubuntu (2.6.28 or so) I don't have this...

Try disabling these virtual CPUs in the BIOS, and Turbo Boost too (I did that and the problems stopped).
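Before rebooting into the BIOS, a rough way to see whether HT is currently active on the host is to look at the CPU topology files in sysfs. This is only a sketch: it assumes the kernel exposes /sys/devices/system/cpu/cpuN/topology/thread_siblings_list (older kernels may not), and the function name ht_enabled and its optional base-directory argument are hypothetical.

```shell
#!/bin/sh
# Exit 0 if any CPU shares its core with another hardware thread
# (HT appears to be enabled), 1 otherwise.
# $1 is an optional base directory, only there to allow testing.
ht_enabled() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        case "$(cat "$f")" in
            *[,-]*) return 0 ;;   # e.g. "0,4" or "0-1": two threads on one core
        esac
    done
    return 1
}
```

For example: ht_enabled && echo "HT looks enabled" || echo "HT looks disabled". It does not say anything about Turbo Boost, which still has to be checked in the BIOS.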

TiagoRF · Jan 28, 2010

We've noticed that logging in to a server in Proxmox via RDP frequently crashes it.

Machine hangs, then dies and gets totally unresponsive.

JustaGuy · Feb 20, 2010

I have one that freezes during a snapshot backup.
Using 2.6.32-1-pve kernel.

Also a strange error in syslog:
kernel hrtimer: interrupt too slow, forcing clock min delta to 10266 ns

I can't tell if it was with regard to a snapshot that failed to complete prior to this one currently going on or not.
There's an older copy of this frozen one that froze earlier in the week & wouldn't start afterward.
Now it won't delete from the host, or let me change any of its hardware, e.g. remove the disk to recover space.
So I'm not surprised that it won't back up, either.

I'll write again after the snapshot's done & report if this one becomes responsive again afterward.

I can't read the forum-hosted thumbnail, so here's another.

EDIT:
It came back to life as soon as the vmtar was done.
