KVM machines hanging

I have the same problem. While testing virtio disk performance with dd (reading 1 GB from /dev/zero or /dev/urandom), my KVM virtual machine gives me kernel oopses.
The first time I got "[ 2673.667286] BUG: unable to handle kernel NULL pointer dereference at 00000006" and dd segfaulted. After a reboot I left a "while true" loop running, which produced many "[1266927070.756591] BUG: soft lockup - CPU#0 stuck for 1179869794s! [swapper:0]" messages and left the VM unusable (CPU at 100%, difficulty accessing the filesystem).
I ran the tests on a standard i386 Debian lenny with the stock linux-image-2.6.26-2-686-bigmem kernel. I have now installed a kernel backported from squeeze (2.6.32-trunk) and am repeating the dd tests.

Can you confirm that this kind of problem is solved by a newer kernel in the guest system?

Thanks!

k.
 
I have a similar problem in PVE 1.4.

Proxmox WEBGUI claims the machines are running, but they're totally frozen.

We saw something curious on an Exchange server: the machine came up, but until the NIC came up (the X on top of the icon went away), you couldn't do anything on the machine.

If you clicked the IE icon, for instance, it would only open after the NIC was up.

The NIC is an E1000, and we're facing some issues with it.

We've changed the memory configuration to try to stabilize the system and we've been working on the NICs; it appears to be getting more stable over time.

I'll update you as soon as I have info.

And we're running 4 cores per machine, WS2008 on all of them.
 
I left the dd stress test running for about 20 hours on two guests and it gave no problems, so I thought that removing the host's cpufreq ondemand governor (I don't know why it was enabled in Hetzner's default installation) and upgrading the lenny guest's kernel to the squeeze one had solved the problem.
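
A minimal sketch for checking the governor mentioned above on the host, assuming the standard Linux sysfs cpufreq layout under /sys/devices/system/cpu. The function name list_governors and its optional base-directory argument are made up for this illustration; it only reads the current setting and does not change anything.

```shell
#!/bin/sh
# Print "cpuN governor" for every CPU that exposes cpufreq in sysfs.
# The base directory is a parameter only so the function can be tried
# against a copy of the tree; on a real host the default is used.
list_governors() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq (also covers an unmatched glob).
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}   # strip the base directory prefix
        cpu=${cpu%%/*}      # keep only the "cpuN" path component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Called as `list_governors` on the host, it prints one line per CPU in the form `cpu0 ondemand`. A host without cpufreq support prints nothing.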

Today I re-created everything from scratch in order to install my VM: I built a brand new guest, installed lenny, installed a few packages, and updated the kernel to the squeeze one. I re-ran a dd test ( dd if=/dev/zero of=/tmp/test bs=1M count=1000 ) and the second time I launched it, all the problems came back:

It starts with:
[ 140.177741] BUG: Bad page state in process dd pfn:dae7d
Then:
[ 140.216248] BUG: unable to handle kernel NULL pointer dereference at 00000006
And then the "CPU stuck" messages start:
[ 273.808162] BUG: soft lockup - CPU#1 stuck for 61s! [rm:1975]
(I can attach the full dmesg log if it helps.) So I can't trust this platform for my production environment. :|
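
The three messages above can be counted in a saved log or in live dmesg output with a small helper. This is only a sketch: count_kernel_bugs is a hypothetical name, and the patterns are just the three message prefixes quoted in this thread, not a complete list of kernel error signatures.

```shell
#!/bin/sh
# Count kernel BUG lines of the kinds quoted in this thread.
# Reads log text on stdin, so it works on a pipe or a saved file.
count_kernel_bugs() {
    # grep -c prints the number of matching lines; "|| true" hides the
    # non-zero exit status grep returns when that number is 0.
    grep -c -E 'BUG: (Bad page state|unable to handle kernel NULL pointer dereference|soft lockup)' || true
}

# Hypothetical use inside the guest after one round of the test
# (needs about 1 GB free in /tmp):
#   dd if=/dev/zero of=/tmp/test bs=1M count=1000
#   dmesg | count_kernel_bugs
```

A count that stays at 0 after several rounds would match the 20-hour run that gave no problems; any non-zero count means at least one of the errors above was logged.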

Code:
# pveversion -v
pve-manager: 1.5-5 (pve-manager/1.5/4627)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-kernel-2.6.32-1-pve: 2.6.32-4
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-8
vncterm: 0.9-2
vzctl: 3.0.23-1pve7
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.11.1-2
ksm-control-daemon: 1.0-2

Proxmox VE is manually installed on a Debian lenny host at Hetzner.de.

The guest is an i386 Debian, everything on ext3, with 4 cores assigned and virtio disk and network; the disk is stored on an LVM volume group on top of a software RAID mirror.

Is there something that I can do to solve this issue?

Thanks in advance

k.
 
Apparently sniffer and I were able to stabilize the machines in production.

We uninstalled the Intel driver for the E1000 and everything was fine after that; previously, when we stressed the machines with gigabytes of data, they always crashed.

Most of the crashes happened when 3+ Windows machines shut down at the same time.

Quite bizarre, but everything seems fine atm.
 
I would go for the default 1.5 version - 2.6.18 kernel. And I would never use sw raid.
 

I know, I have read many posts on this forum.
But I can't use the default installation, because Hetzner doesn't allow installing from ISO, and I want to use software RAID because at the moment we can't buy a hardware controller (it costs too much).

I could try downgrading to 2.6.18 (losing KSM), but maybe our hardware doesn't meet the requirements to run Proxmox KVM, and I will have to (quickly) find a replacement.

Anyway, I think Proxmox VE is great software and I'll keep recommending it to friends and co-workers (and telling them to donate to the project).

Thanks!

k.
 
I have already seen this type of error. Is your CPU an Intel Xeon 5520 or 5540? These CPUs have HT (virtual CPUs), so with 4 cores you see double that number.

I saw it when running a guest with a 2.6.18 kernel (CentOS). With Ubuntu (2.6.28 or so) it doesn't happen.

Try disabling the virtual CPUs (HT) and Turbo Boost in the BIOS (I did that and the problems stopped).
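
One way to see whether HT is active on the host before going into the BIOS is to compare the "siblings" and "cpu cores" fields of /proc/cpuinfo; more siblings than cores means HT is on. A minimal sketch, with ht_status as a made-up name. It should be run on the host, since inside a KVM guest these fields describe the virtual topology.

```shell
#!/bin/sh
# Report whether Hyper-Threading looks enabled, based on the first
# "siblings" and "cpu cores" entries of a cpuinfo-style file.
# The file argument is optional and defaults to /proc/cpuinfo.
ht_status() {
    info=${1:-/proc/cpuinfo}
    awk -F: '
        /^siblings/  && !s { s = $2 + 0 }   # logical CPUs per package
        /^cpu cores/ && !c { c = $2 + 0 }   # physical cores per package
        END {
            if (!s || !c)   print "unknown"
            else if (s > c) print "ht-on"
            else            print "ht-off"
        }' "$info"
}
```

On x86 hosts that expose both fields, `ht_status` prints `ht-on` or `ht-off`; if either field is missing it prints `unknown` rather than guessing.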
 
We've noticed that a server in Proxmox frequently crashes when we log in to it via RDP.

The machine hangs, then dies and becomes totally unresponsive.
 
I have one that freezes during a snapshot backup.
Using 2.6.32-1-pve kernel.

Also a strange error in syslog:
kernel hrtimer: interrupt too slow, forcing clock min delta to 10266 ns

I can't tell if it was with regard to a snapshot that failed to complete prior to this one currently going on or not.
There's an older copy of this frozen one that froze earlier in the week & wouldn't start afterward.
Now it won't delete from the host, or let me change any of its hardware, e.g. remove the disk to recover space.
So I'm not surprised that it won't backup, either.

I'll write again after the snapshot's done & report if this one becomes responsive again afterward.

I can't read the forum-hosted thumbnail, so here's another.

[Attachment: tooslow.png..jpg]

EDIT:
It came back to life as soon as the vmtar was done.
 