Linux KVM VM freezing on SMP kernel

jhammer

Member
Dec 21, 2009
I have a Linux VM (KVM) that is freezing randomly. It occasionally freezes during the boot process, but it mostly happens when there is heavy disk access or several things going on at once. For example, it freezes consistently if I am compiling something and doing an rsync at the same time.

I've seen 2 clues as to what might be happening.

Clue #1: When using virtio drives I see the following on boot:

virtio-pci 0000:00:03.0: can't find IRQ for PCI INT A; probably buggy MP table
virtio-pci 0000:00:04.0: can't find IRQ for PCI INT A; probably buggy MP table
vda: vda1 vda2 vda3
virtio-pci 0000:00:05.0: can't find IRQ for PCI INT A; probably buggy MP table
vdb: vdb1
virtio-pci 0000:00:06.0: can't find IRQ for PCI INT A; probably buggy MP table

When the system freezes, I see no kernel panic messages, and nothing is written to the logs.

Clue #2: When I use IDE drives for the same system, I do see a kernel panic when the system freezes. The messages in the kernel panic are always ext3-related. It sometimes panics when doing the file system check on boot or, as mentioned above, when there is heavy disk access.

Any thoughts on things to look for to fix this?

Environment:
pve-manager: 1.5-8 (pve-manager/1.5/4674)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-qemu-kvm: 0.11.1-2
The VM is running a 2.6.32 kernel.
 
A few more pieces of information:

* When using a uniprocessor kernel, the "buggy MP table" errors go away and the system does not appear to freeze.

* Many sites I have found suggest it is BIOS-related.

* There were several suggestions to add the boot flags 'nolapic' and/or 'noapic'. Enabling 'nolapic' disables SMP, leaving me with a single processor. Enabling 'noapic' fixes the "buggy MP table" errors while keeping SMP (in this case 4 processors), and /proc/interrupts differs depending on whether 'noapic' is enabled.

So 'noapic' could be a solution here. What kind of negative effect does it have?
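
For reference, the flags just go on the kernel line of the guest's GRUB config, something like this (the kernel image name and root device below are only placeholders for my layout):

title Gentoo 2.6.32 (noapic test)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-gentoo root=/dev/vda3 noapic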
 
I was able to get it working without the 'noapic' boot option. I had to enable Power Management support and ACPI Support in the kernel.
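
For the record, I believe those correspond to the following lines in the guest kernel's .config (Power Management support and ACPI Support, respectively):

CONFIG_PM=y
CONFIG_ACPI=y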
 

Well...maybe not.

The "buggy MP table" messages are gone now...but I still get the random freezes when there is heavy disk/proc activity on the VM.
 
And what kernel does the VM use? Try a newer one if possible.

The VM uses the latest 2.6.32 kernel.

I've been doing more testing using dbench and rsync. It seems that any time I have more than 1 socket or more than 1 core selected for the VM, I can get it to crash. The only configuration that has not had a kernel panic so far is 1-core/1-socket with a non-SMP kernel.
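
For reference, what each test configuration actually exposes can be confirmed inside the guest with something like:

uname -v                           # build string includes "SMP" on SMP kernels
grep -c ^processor /proc/cpuinfo   # number of CPUs the guest sees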
 
If you give a bit more info, we can try to reproduce the problem (what distribution, exact kernel versions, how to trigger the bug).

Distribution: Gentoo
Kernel version: sys-kernel/gentoo-sources-2.6.32-r7
Sockets: 1
Cores: 4
Memory: 2048
Disk backend: Virtio, iSCSI over multipath (different "cache=xxxx" settings don't appear to make a difference)
Ethernet: Virtio
Triggering the bug - I've caused the kernel to panic in several ways:

  • Simultaneously running 'dbench -c /usr/share/dbench/client.txt 40' and a local rsync of a lot of data (same disk to same disk) - a rough recipe is sketched after this list.
  • Simultaneously running a large compile (emerge gcc) and a local rsync of a lot of data (same disk to same disk).
  • Simultaneously running a large compile (emerge gcc) and an rsync of a remote disk to a local disk.
  • Occasionally the oops happens during an fsck.ext3 on boot.
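
A rough sketch of the first trigger (the rsync source and destination paths are just placeholders for my data set):

dbench -c /usr/share/dbench/client.txt 40 &
rsync -a /srv/data/ /srv/data-copy/ &
wait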
NOTES:


  1. This VM is a physical host I am trying to migrate to Proxmox.
  2. We have a Gentoo VM running SMP that I have not been able to cause a kernel panic on. There are lots of variables though. A few include:

  • 1 socket/2 cores vs. 1 socket/4 cores
  • Disk backend - Virtio, local file, raw format vs. Virtio, iSCSI over multipath
  • The working VM is newly created in Proxmox vs. a migrated physical host
  • System packages are all current on the newly created VM.
  • Kernel version is slightly older on newly created VM - sys-kernel/gentoo-sources-2.6.31-r10
I have PDFs of kernel oopses I could send if that would be helpful. They were a little too large to attach here.

I appreciate your help.
 
I doubt we can debug that - you should try to eliminate most of those variables.

I set up test cases for each and eliminated all the variables. It doesn't seem to matter where the disk is located (iSCSI or local). Disk bus type doesn't matter. Kernel version doesn't matter. Trying many different kernel settings didn't make a difference. A system update did not resolve the problem.

This is the only VM with the issue. I have migrated other servers without incident. Newly created VMs in Proxmox don't have the issue.

I'm just going to rebuild this server.

Thanks for all the help and suggestions.
 
I started an SMP server from scratch. I'm still getting kernel oopses. Here is what I am getting:

<snip>
kernel BUG at arch/x86/mm/highmem_32.c:45!
invalid opcode: 0000 [#6] SMP
last sysfs file: /sys/devices/virtio-pci/virtio1/block/vda/uevent
Modules linked in:
<snip>

Here is the config of the VM:

ide2: NFSBlueISOS:iso/ubuntu-10.04-desktop-i386.iso,media=cdrom
vlan0: virtio=3E:E9:E7:92:F5:2A
bootdisk: virtio0
ostype: l26
memory: 2048
sockets: 1
virtio1: MAILDATA:vm-109-disk-1
onboot: 0
cores: 2
boot: cd
freeze: 0
cpuunits: 1000
acpi: 1
kvm: 1
virtio0: MAIL:vm-109-disk-1

Is there something I'm missing here in the kernel config?
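
For reference, the virtio and highmem options in the guest kernel can be checked like this (assuming the guest kernel has CONFIG_IKCONFIG_PROC enabled so /proc/config.gz exists):

zcat /proc/config.gz | grep -E 'VIRTIO|HIGHMEM'
# expecting at least CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_BLK and CONFIG_VIRTIO_NET set to y or m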

Thanks.
 
Always add your version (pveversion -v).
 
Thanks. Here it is:

plaid:/etc/qemu-server# pveversion -v
pve-manager: 1.5-8 (pve-manager/1.5/4674)
running kernel: 2.6.32-1-pve
proxmox-ve-2.6.32: 1.5-4
pve-kernel-2.6.32-1-pve: 2.6.32-4
pve-kernel-2.6.18-2-pve: 2.6.18-5
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-10
vncterm: 0.9-2
vzctl: 3.0.23-1pve8
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.11.1-2
ksm-control-daemon: 1.0-3
 
Try the latest 2.6.32 kernel (with KVM 0.12.4) - better?
 
Is this the correct process for upgrading:

apt-get update
apt-get upgrade
apt-get install proxmox-ve-2.6.32
apt-get install kvm
 
OK. I've upgraded one of my servers to proxmox-ve-2.6.32.

plaid:~# pveversion -v
pve-manager: 1.5-10 (pve-manager/1.5/4822)
running kernel: 2.6.32-2-pve
proxmox-ve-2.6.32: 1.5-7
pve-kernel-2.6.32-1-pve: 2.6.32-4
pve-kernel-2.6.32-2-pve: 2.6.32-7
pve-kernel-2.6.18-2-pve: 2.6.18-5
qemu-server: 1.1-16
pve-firmware: 1.0-5
libpve-storage-perl: 1.0-13
vncterm: 0.9-2
vzctl: 3.0.23-1pve11
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.12.4-1
ksm-control-daemon: 1.0-3

When I tried doing an online migration from another server back to this one, it didn't work. It said the transfer failed but that the migration succeeded. The VM then appears on the host I was trying to migrate back to, but it is shut off. Is that supposed to happen if the KVM versions differ on the two boxes?

Thanks.
 
