KVM machines hanging

bohansen · Oct 22, 2009

Hi,

I have installed Proxmox 1.4, and in order to test KVM I installed an Ubuntu Server 8.04 i386 and a Debian Lenny. When I stress the machines I sometimes lose the connection to one of the VMs. Neither VNC nor ssh works, but it is still listed as running in Proxmox although it seems completely stalled.
Mounting the hard drive from the VM does not show anything in the /var log files. Any ideas on how to proceed with debugging this problem?
Has anybody else experienced problems with Ubuntu or Debian hanging?
I'm using 1 socket / 4 cores, but I'll test with a single core tomorrow to see if it makes any difference.

I'll try running a memtest over the weekend to see if I have a hardware problem, but the OpenVZ machines are running fast and stable.

Any suggestions appreciated,
Bo

dietmar · Oct 23, 2009

Do you use virtio? If so, please update to the latest 'qemu-server' (uploaded today), and restart (stop/start) your VMs.

bohansen · Oct 23, 2009

No, my configuration is the following:
name: debianbuild
ide2: local:iso/debian-503-i386-netinst.iso,media=cdrom
vlan0: rtl8139=6A:8E:96:8C:74:48
bootdisk: ide0
ostype: l26
ide0: local:101/vm-101-disk-1.raw
memory: 2048
sockets: 1
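
To answer the virtio question mechanically, a config like the one above can be checked with a small script. This is only a sketch: it assumes virtio disks show up as keys starting with `virtio` and virtio network cards as `vlanN: virtio=<MAC>`, which is how I understand the format but is not confirmed in this thread. The helper names are made up for illustration.

```python
def parse_vm_config(text):
    """Parse 'key: value' lines of a VM config into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; values such as MAC addresses contain colons.
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def virtio_devices(config):
    """Return the config keys that appear to use virtio (assumed naming, see above)."""
    found = []
    for key, value in config.items():
        if key.startswith("virtio"):
            found.append(key)  # assumed form of a virtio disk entry
        elif key.startswith("vlan") and value.startswith("virtio="):
            found.append(key)  # assumed form of a virtio network card entry
    return found


SAMPLE_CONFIG = """\
name: debianbuild
ide2: local:iso/debian-503-i386-netinst.iso,media=cdrom
vlan0: rtl8139=6A:8E:96:8C:74:48
bootdisk: ide0
ostype: l26
ide0: local:101/vm-101-disk-1.raw
memory: 2048
sockets: 1
"""

if __name__ == "__main__":
    devices = virtio_devices(parse_vm_config(SAMPLE_CONFIG))
    print("virtio devices:", devices if devices else "none")
```

For the config posted here it reports no virtio devices, since the disk is on ide0 and the network card is an rtl8139.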

I tested today with one socket and it still freezes. I can see the kvm thread still running, but it uses almost no CPU time.
I had 2 KVMs and 2 OpenVZ containers running last night. One KVM froze. One VZ finished the build job, but it took approximately one hour where I expected half an hour. The other one had not finished after more than 10 hours. Maybe I have a similar problem to the "Slow server" thread?

bohansen · Oct 29, 2009

I ran a memtest during last weekend without any trouble.
After upgrading to the latest PVE I have been running tests on 2 KVMs and 2 OpenVZ containers. It all worked fine until this morning.

I had lost the connection to one KVM. As before, it is still running according to PVE and top, but I cannot reach it using ssh or VNC from PVE. When I run "qm status 900" it says running, but when I go into "qm monitor 900" and run "info status" it says it is paused. Can I connect to a KVM locally, like vzctl enter for OpenVZ? Any other advice for debugging this?

Another problem occurred with one of the VZ machines. It had more or less stopped in the middle of the task, but when I clicked with the mouse in the ssh window it suddenly continued and finished the job (although 10 hours late). I don't know what this can be related to. The other two ssh sessions are still running fine: one to a KVM and one to a VZ machine.

Thanks in advance,
Bo

dietmar · Oct 30, 2009

Please try to find a way to reproduce the problem.

bohansen · Nov 3, 2009

Yes, reproducing is the way to go. It is just a bit difficult when the error is very rare.

I might have narrowed it down to multicore KVM machines. I have been running tests on three single-core KVMs and 2 OpenVZ containers for 5 days now without a problem.
I added a 4-core KVM yesterday. It has not frozen (yet), but I found the following in the dmesg log:

Code:

[352416.798960] BUG: soft lockup - CPU#0 stuck for 4096s! [swapper:0]
[352416.798970] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352416.799131] 
[352416.799131] Pid: 0, comm: swapper Not tainted (2.6.26-2-686 #1)
[352416.799131] EIP: 0060:[<c0114d78>] EFLAGS: 00000246 CPU: 0
[352416.799131] EIP is at native_safe_halt+0x2/0x3
[352416.799131] EAX: c0378000 EBX: c0102656 ECX: 01c57000 EDX: 03fd7c73
[352416.799131] ESI: 00000000 EDI: c036c000 EBP: 00847007 ESP: c0379fe0
[352416.799131]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[352416.799131] CR0: 8005003b CR2: 4012bd20 CR3: 003bc000 CR4: 000006d0
[352416.799131] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352416.799131] DR6: ffff0ff0 DR7: 00000400
[352416.799131]  [<c0102683>] ? default_idle+0x2d/0x53
[352416.799131]  [<c01025ce>] ? cpu_idle+0xab/0xcb
[352416.799131]  =======================
[352416.799635] BUG: soft lockup - CPU#1 stuck for 4096s! [swapper:0]
[352416.799635] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352416.799635] 
[352416.799635] Pid: 0, comm: swapper Not tainted (2.6.26-2-686 #1)
[352416.799635] EIP: 0060:[<c0114d78>] EFLAGS: 00000246 CPU: 1
[352416.799635] EIP is at native_safe_halt+0x2/0x3
[352416.799635] EAX: f7474000 EBX: c0102656 ECX: 01c61000 EDX: 03fd7c73
[352416.799635] ESI: 00000001 EDI: 00000000 EBP: 00000000 ESP: f7475fa8
[352416.799635]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[352416.799635] CR0: 8005003b CR2: 400be6d0 CR3: 36e32000 CR4: 000006d0
[352416.799635] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352416.799635] DR6: ffff0ff0 DR7: 00000400
[352416.799635]  [<c0102683>] default_idle+0x2d/0x53
[352416.799635]  [<c01025ce>] cpu_idle+0xab/0xcb
[352416.799635]  =======================
[352416.798956] BUG: soft lockup - CPU#3 stuck for 4096s! [cc1:31197]
[352416.798956] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352416.798956] 
[352416.798956] Pid: 31197, comm: cc1 Not tainted (2.6.26-2-686 #1)
[352416.798956] EIP: 0060:[<c015d4c3>] EFLAGS: 00000202 CPU: 3
[352416.798956] EIP is at __pagevec_lru_add_active+0x94/0xad
[352416.798956] EAX: c035531e EBX: c17ab1e0 ECX: 00000002 EDX: c0355300
[352416.798956] ESI: c0355300 EDI: c202d560 EBP: 0000000e ESP: f6d49efc
[352416.798956]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[352416.798956] CR0: 80050033 CR2: 4024a000 CR3: 30991000 CR4: 000006d0
[352416.798956] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352416.798956] DR6: ffff0ff0 DR7: 00000400
[352416.798956]  [<c0163e2f>] ? handle_mm_fault+0x238/0x690
[352416.798956]  [<c0165677>] ? arch_get_unmapped_area+0x87/0xe7
[352416.798956]  [<c0166f3f>] ? do_mmap_pgoff+0x266/0x2b9
[352416.798956]  [<c0115b8f>] ? do_page_fault+0x2a3/0x5c0
[352416.798956]  [<c0106a2a>] ? sys_mmap2+0x62/0xa0
[352416.798956]  [<c01158ec>] ? do_page_fault+0x0/0x5c0
[352416.798956]  [<c02b9cea>] ? error_code+0x72/0x78
[352416.798956]  [<c02b0000>] ? quirk_piix4_acpi+0x51/0x13c
[352416.798956]  =======================
[352416.802953] BUG: soft lockup - CPU#2 stuck for 4096s! [swapper:0]
[352416.814232] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352416.814232] 
[352416.814232] Pid: 0, comm: swapper Not tainted (2.6.26-2-686 #1)
[352416.814232] EIP: 0060:[<c0114d78>] EFLAGS: 00000246 CPU: 2
[352416.814232] EIP is at native_safe_halt+0x2/0x3
[352416.814232] EAX: f747e000 EBX: c0102656 ECX: 01c6b000 EDX: 03fd7c74
[352416.814232] ESI: 00000002 EDI: 00000000 EBP: 00000000 ESP: f747ffa8
[352416.814232]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[352416.814232] CR0: 8005003b CR2: 40059195 CR3: 36d3a000 CR4: 000006d0
[352416.814232] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352416.814232] DR6: ffff0ff0 DR7: 00000400
[352416.814232]  [<c0102683>] default_idle+0x2d/0x53
[352416.814232]  [<c01025ce>] cpu_idle+0xab/0xcb
[352416.814232]  =======================
[352561.190502] BUG: soft lockup - CPU#0 stuck for 135s! [swapper:0]
[352561.190502] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352561.190502] 
[352561.190502] Pid: 0, comm: swapper Not tainted (2.6.26-2-686 #1)
[352561.190502] EIP: 0060:[<c0114d78>] EFLAGS: 00000246 CPU: 0
[352561.190502] EIP is at native_safe_halt+0x2/0x3
[352561.190502] EAX: c0378000 EBX: c0102656 ECX: 01c57000 EDX: 03fdf213
[352561.190502] ESI: 00000000 EDI: c036c000 EBP: 00847007 ESP: c0379fe0
[352561.190502]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[352561.190502] CR0: 8005003b CR2: 093f0000 CR3: 3082a000 CR4: 000006d0
[352561.190502] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352561.190502] DR6: ffff0ff0 DR7: 00000400
[352561.190502]  [<c0102683>] ? default_idle+0x2d/0x53
[352561.190502]  [<c01025ce>] ? cpu_idle+0xab/0xcb
[352561.190502]  =======================
[352655.630689] BUG: soft lockup - CPU#0 stuck for 88s! [sh:23727]
[352655.630689] Modules linked in: ipv6 loop snd_pcm snd_timer snd soundcore snd_page_alloc parport_pc psmouse parport pcspkr button serio_raw virtio_balloon i2c_piix4 i2c_core joydev evdev ext3 jbd mbcache ide_disk ide_cd_mod cdrom usbhid hid ff_memless 8139too piix ide_pci_generic ide_core floppy 8139cp ata_generic virtio_pci mii libata scsi_mod dock uhci_hcd usbcore thermal processor fan thermal_sys [last unloaded: scsi_wait_scan]
[352655.630689] 
[352655.630689] Pid: 23727, comm: sh Not tainted (2.6.26-2-686 #1)
[352655.630689] EIP: 0060:[<c01683a7>] EFLAGS: 00000213 CPU: 0
[352655.630689] EIP is at page_remove_rmap+0xb/0xd4
[352655.630689] EAX: c1e6de40 EBX: c1e6de40 ECX: 736f2045 EDX: f7a0ada0
[352655.630689] ESI: f7a0ada0 EDI: 09996000 EBP: f6d12040 ESP: f6df9e08
[352655.630689]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[352655.630689] CR0: 8005003b CR2: 40059195 CR3: 3793b000 CR4: 000006d0
[352655.630689] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[352655.630689] DR6: ffff0ff0 DR7: 00000400
[352655.630689]  [<c016298e>] ? unmap_vmas+0x2fe/0x5ad
[352655.630689]  [<c013199c>] ? autoremove_wake_function+0x0/0x2d
[352655.630689]  [<c016598d>] ? exit_mmap+0x67/0xd3
[352655.630689]  [<c012074e>] ? mmput+0x20/0x7e
[352655.630689]  [<c017848b>] ? flush_old_exec+0x3e5/0x677
[352655.630689]  [<c0177b7a>] ? kernel_read+0x32/0x43
[352655.630689]  [<c019b250>] ? load_elf_binary+0x310/0x108a
[352655.630689]  [<c0164527>] ? get_user_pages+0x2a0/0x334
[352655.630689]  [<c01614a0>] ? page_address+0x73/0x93
[352655.630689]  [<c0161612>] ? kmap_high+0x19/0x17a
[352655.630689]  [<c01614a0>] ? page_address+0x73/0x93
[352655.630689]  [<c017789e>] ? copy_strings+0x169/0x173
[352655.630689]  [<c0177963>] ? search_binary_handler+0x8f/0x1a4
[352655.630689]  [<c0178a9d>] ? do_execve+0x138/0x1c6
[352655.630689]  [<c010213b>] ? sys_execve+0x2a/0x4a
[352655.630689]  [<c01038ce>] ? syscall_call+0x7/0xb
[352655.630689]  [<c02b0000>] ? quirk_piix4_acpi+0x51/0x13c
[352655.630689]  =======================
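
A trace this long is easier to read once it is reduced to one line per lockup. The following is a rough sketch, not an official tool: it only matches the "BUG: soft lockup" header lines in the format shown above and groups them by CPU. The function name and the shortened sample are illustrative.

```python
import re
from collections import defaultdict

# Matches e.g. "BUG: soft lockup - CPU#3 stuck for 4096s! [cc1:31197]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {cpu: [(stuck_seconds, command, pid), ...]} in log order."""
    summary = defaultdict(list)
    for match in LOCKUP_RE.finditer(log_text):
        cpu, seconds, comm, pid = match.groups()
        summary[int(cpu)].append((int(seconds), comm, int(pid)))
    return dict(summary)


# Header lines copied from the dmesg output above (trace bodies omitted).
SAMPLE_LOG = """\
[352416.798960] BUG: soft lockup - CPU#0 stuck for 4096s! [swapper:0]
[352416.799635] BUG: soft lockup - CPU#1 stuck for 4096s! [swapper:0]
[352416.798956] BUG: soft lockup - CPU#3 stuck for 4096s! [cc1:31197]
[352416.802953] BUG: soft lockup - CPU#2 stuck for 4096s! [swapper:0]
[352561.190502] BUG: soft lockup - CPU#0 stuck for 135s! [swapper:0]
[352655.630689] BUG: soft lockup - CPU#0 stuck for 88s! [sh:23727]
"""

if __name__ == "__main__":
    for cpu, events in sorted(summarize_soft_lockups(SAMPLE_LOG).items()):
        for seconds, comm, pid in events:
            print(f"CPU#{cpu}: stuck {seconds}s in {comm} (pid {pid})")
```

Run against the full guest dmesg (for example saved to a file and read in), it shows at a glance that all four CPUs report the same 4096s stall at nearly the same timestamp, with most of them sitting in the idle task.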

On the single core KVMs (Ubuntu 8.04.3-server i386) I found this entry:

Code:

[33449.369907] Clocksource tsc unstable (delta = 71337852 ns)
[33449.396219] Time: acpi_pm clocksource has been installed.
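
To put the delta into perspective and to see which clocksource a guest ended up with, something like the following can be run inside the guest. It is a sketch under assumptions: the sysfs path is the usual location of the active clocksource on Linux, but I have not verified it on these particular guest kernels, and the function names are made up.

```python
import re
from pathlib import Path

# Matches e.g. "Clocksource tsc unstable (delta = 71337852 ns)"
UNSTABLE_RE = re.compile(r"Clocksource (\w+) unstable \(delta = (-?\d+) ns\)")


def parse_unstable_clocksource(line):
    """Return (clocksource_name, delta_in_milliseconds), or None if no match."""
    match = UNSTABLE_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)) / 1e6


def current_clocksource(sysfs_dir="/sys/devices/system/clocksource/clocksource0"):
    """Read the active clocksource from sysfs (assumed path); None if unreadable."""
    try:
        return (Path(sysfs_dir) / "current_clocksource").read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    line = "[33449.369907] Clocksource tsc unstable (delta = 71337852 ns)"
    name, delta_ms = parse_unstable_clocksource(line)
    print(f"{name} marked unstable, delta {delta_ms:.1f} ms")
    print("active clocksource:", current_clocksource())
```

For the line above, the delta works out to roughly 71 ms, after which the kernel switched to acpi_pm.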

I don't see any errors in the dmesg output of the host.
Any idea what this can be related to?

dietmar · Nov 3, 2009

Looks like a bug recently discussed on the KVM list, but I'm not sure.

bohansen · Nov 3, 2009

Do you have a link for the topic?
If it's something I should report to KVM, is the version 0.11.0-2?

dietmar · Nov 3, 2009

bohansen said:
Do you have a link for the topic?

No, sorry, just search.

bohansen said:
If it's something I should report to KVM, is the version 0.11.0-2?

user space: qemu-kvm-0.11.0
kernel: kvm-kmod-2.6.30.1

bohansen · Nov 12, 2009

Just to mention it in case anybody else encounters the same problem: I think I found the thread on the KVM bug tracker:
http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2351676&group_id=180599

Judging from some of the replies, the error seems to be multicore related and is possibly solved in 2.6.31. Finding the right commit to backport from that information might be like finding a needle in a haystack.

tdi · Nov 19, 2009

Hi,

I am using Proxmox VE 1.4 on my host. It is a 2x quad-core Xeon (Nehalem) with 12GB RAM.
I have only two guests, CentOS 5.4 in KVM, both nearly idle (load 0 most of the time). I encounter problems with a stuck CPU 2-4 times per hour:
BUG: soft lockup - CPU#0 stuck for 10s! [events/0:14] CPU 0:

The hosts are configured so each has 8GB RAM and 4 cores / 1 processor. I use kernel 2.6.18-128.7.1.el5; I also tried the newer 164 one as well as the old one. If I can provide any information, please ask. These are production machines hosting a portal, so a fix or a temporary fix would be appreciated.

The problem does not occur with Debian guests, only CentOS.

laradji · Nov 24, 2009

Same error here with a VM that sometimes freezes (Debian Lenny amd64):

[ 4796.259434] Pid: 0, comm: swapper Not tainted 2.6.26-2-amd64 #1
[ 4796.259434] RIP: 0010:[<ffffffff8021eb64>] [<ffffffff8021eb64>] native_safe_halt+0x2/0x3
[ 4796.259434] RSP: 0018:ffff81021f13df38 EFLAGS: 00000246
[ 4796.259434] RAX: ffff81021f13dfd8 RBX: 0000000000000000 RCX: 0000000000000000
[ 4796.259434] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff804fbe70
[ 4796.259434] RBP: 00000000000285ac R08: ffff8100010536a0 R09: ffff81021f3392a0
[ 4796.259434] R10: ffff81021d5fde00 R11: ffff81021f3078f0 R12: ffff81021f13ded8
[ 4796.259434] R13: 0000000000000000 R14: ffffffff8023ced6 R15: 00000041ecc67919
[ 4796.259434] FS: 0000000000000000(0000) GS:ffff81021f11edc0(0000) knlGS:0000000000000000
[ 4796.259434] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 4796.259434] CR2: 0000000001cb8610 CR3: 000000021bdc1000 CR4: 00000000000006e0
[ 4796.259434] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4796.259434] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4796.259434]
[ 4796.259434] Call Trace:
[ 4796.259434] [<ffffffff8020b0cd>] ? default_idle+0x2a/0x49
[ 4796.259434] [<ffffffff8020ac79>] ? cpu_idle+0x89/0xb3
[ 4796.259434]

Edit:
This bug looks like it:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/286285
And it seems to be solved.

bohansen · Nov 25, 2009

Have you tried the latest kernel + kmod-kvm from the test repository?

http://www.proxmox.com/forum/showthread.php?t=2591

laradji · Nov 26, 2009

bohansen said:
Have you tried the latest kernel + kmod-kvm from the test repository?

http://www.proxmox.com/forum/showthread.php?t=2591

I will give it a try tonight.

tom · Nov 26, 2009

laradji said:
I will give it a try tonight.

We released this kernel to the stable repository today.

laradji · Nov 26, 2009

tom said:
We released this kernel to the stable repository today.

yeah \o/

laradji · Nov 27, 2009

hi,

My VMs are still crashing, with a 100% CPU load shown in the Proxmox console.

I don't have any kernel error message in kern.log with the new kernel.

I am now trying to limit the VM to 4 sockets + 4 GB of RAM.
Before, it was 8 sockets and 8 GB of RAM.
HT seems to be activated.

Update: I use virtio for network and disk.

More info on my Proxmox servers (last entry from /proc/cpuinfo):
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
stepping : 5
cpu MHz : 2266.747
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 sse4_2 popcnt lahf_lm ida
bogomips : 4533.34
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

laradji · Nov 29, 2009

I get this on a KVM VM that crashes a lot:

Approx. 1 crash every 24 hours

[136449.611490] Pid: 7158, comm: rsyslogd Not tainted 2.6.26-2-amd64 #1
[136449.611490] RIP: 0010:[<ffffffff8023caa2>] [<ffffffff8023caa2>] run_timer_softirq+0x1d7/0x1e2
[136449.611490] RSP: 0018:ffffffff805e4ef0 EFLAGS: 00000207
[136449.611490] RAX: 0000000101ac8b17 RBX: ffffffff805e4ef0 RCX: ffffffff80627588
[136449.611490] RDX: 0000000101ac8b18 RSI: ffffffff805e4ec0 RDI: ffffffff80627400
[136449.611490] RBP: ffffffff805e4e70 R08: 0000000000000c01 R09: 0000000000000000
[136449.611490] R10: 000000006849a0e5 R11: ffffffff805e4cd4 R12: ffffffff8020ccf2
[136449.611490] R13: ffffffff805e4e70 R14: 0000000000000017 R15: ffffffff8023cf75
[136449.611490] FS: 000000004194a950(0063) GS:ffffffff8053c000(0000) knlGS:0000000000000000
[136449.611490] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[136449.611490] CR2: 0000000002469378 CR3: 000000011e52b000 CR4: 00000000000006e0
[136449.611490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[136449.611490] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[136449.611490]
[136449.611490] Call Trace:
[136449.611490] <IRQ> [<ffffffff80239403>] ? __do_softirq+0x5c/0xd1
[136449.611490] [<ffffffff8020d2cc>] ? call_softirq+0x1c/0x28
[136449.611490] [<ffffffff8020f3d8>] ? do_softirq+0x3c/0x81
[136449.611490] [<ffffffff80239363>] ? irq_exit+0x3f/0x83
[136449.611490] [<ffffffff8021aa7b>] ? smp_apic_timer_interrupt+0x8c/0xa4
[136449.611490] [<ffffffff8020ccf2>] ? apic_timer_interrupt+0x72/0x80
[136449.611490] <EOI> [<ffffffff8042a46d>] ? _spin_unlock_irqrestore+0x7/0xe
[136449.611490] [<ffffffff8024f883>] ? wake_futex+0x1f/0x29
[136449.611490] [<ffffffff802509ca>] ? do_futex+0x33c/0x777
[136449.611490] [<ffffffff803b14af>] ? sys_recvfrom+0xff/0x119
[136449.611490] [<ffffffff80248be6>] ? hrtimer_start+0x112/0x134
[136449.611490] [<ffffffff802290fc>] ? hrtick_start_fair+0xfb/0x144
[136449.611490] [<ffffffff80250f03>] ? sys_futex+0xfe/0x11c
[136449.611490] [<ffffffff8024ac96>] ? getnstimeofday+0x39/0x98
[136449.611490] [<ffffffff8024ad05>] ? do_gettimeofday+0x10/0x32
[136449.611490] [<ffffffff8020beca>] ? system_call_after_swapgs+0x8a/0x8f
[136449.611490]

dietmar · Nov 30, 2009

laradji said:
I get this on a KVM VM that crashes a lot:

Is it possible to reproduce that bug - how?

laradji · Dec 1, 2009

OK, it seems to be stable with the 2.6.30 kernel from Debian Lenny backports on my VM.
