KVM 100% cpu usage after online migration

Frido Roose
I'll start a new thread about online migration, because I think it is not related to http://forum.proxmox.com/threads/7213-Live-migration-reliability

After adding the sleep as a workaround for the 'stopped' VM after an online migration, there are still conditions where the VM does not recover from a migration.
In this case, the VM's CPU usage goes up to 100% on every core (for both single- and multi-core VMs).

On the host, the kvm process also uses 100% CPU:
Code:
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3683 root      20   0  654m 150m 2176 R 99.9  1.9   1:39.51 kvm
   3669 root      20   0  654m 150m 2176 S  3.0  1.9   0:03.57 kvm
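To see which of the kvm threads is actually spinning (a vCPU thread, as opposed to an I/O thread), you can list per-thread CPU usage with `ps -L`. A small sketch; the helper name is mine:

```shell
#!/bin/sh
# show_threads PID: list a process's threads with per-thread CPU usage.
# On a locked-up guest, the spinning vCPU thread should show ~100% CPU
# while the other threads stay near idle.
show_threads() {
    ps -L -p "$1" -o tid,pcpu,stat,comm
}
```

For example, `show_threads 3683` for the PID from the top output above.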

I did a strace on the process while this was happening, and only got this output:
Code:
root@yamu:~# strace -p 3683
Process 3683 attached - interrupt to quit
rt_sigtimedwait([BUS RT_6], 0x7ff4e4f87dc0, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigpending([])                       = 0
ioctl(18, KVM_RUN^C <unfinished ...>
Process 3683 detached

It's not always reproducible, but I have the impression that putting I/O load on the guest (like running fio or a MySQL sysbench) increases the chances of this happening.
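If you want to try reproducing this without installing fio or sysbench, even a crude write-and-fsync loop in the guest keeps the disk busy during a migration. A minimal sketch (file path and pass count are arbitrary; raise PASSES or loop forever for sustained load):

```shell
#!/bin/sh
# Write and fsync a 64 MB file repeatedly to generate sequential I/O
# load in the guest while triggering live migrations on the host.
PASSES=${PASSES:-3}
i=0
while [ "$i" -lt "$PASSES" ]; do
    dd if=/dev/zero of=/tmp/ioload.bin bs=1M count=64 conv=fsync 2>/dev/null
    i=$((i + 1))
done
```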
 
I have some more test info about this:

I tried to do the same test on a plain KVM-HA setup (on a CentOS host with 2.6.32-220.2.1.el6.x86_64 kernel), and noticed that on an SMP guest, the chance that the guest would become unresponsive was pretty high within 4-5 online migrations.
So here we have the same behavior.

After some troubleshooting, I found that it could be related to the clocksource in the guest. By default, the para-virtualized kvm-clock driver is used:
Code:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
kvm-clock
I have these available on my guest:
Code:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
kvm-clock tsc acpi_pm
To further test this, I switched the clocksource to tsc, since the host CPU has the constant_tsc flag:
Code:
# cat /proc/cpuinfo | grep constant_tsc
flags      : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm arat epb dts tpr_shadow vnmi flexpriority ept vpid
On the guest (SLES11 SP1):
Code:
# echo tsc >/sys/devices/system/clocksource/clocksource0/current_clocksource
Or add "clocksource=tsc" to the kernel command line in GRUB and reboot.
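To make the GRUB change persistent on a GRUB-legacy system like SLES 11, the parameter goes on the "kernel" line of /boot/grub/menu.lst. A sketch of doing that with sed (the helper name is mine; back the file up and review the result before rebooting):

```shell
#!/bin/sh
# add_tsc_clocksource FILE: append clocksource=tsc to every GRUB-legacy
# "kernel" line in FILE that does not already set a clocksource.
# A backup copy is written to FILE.bak first.
add_tsc_clocksource() {
    cp "$1" "$1.bak"
    sed -i '/^[[:space:]]*kernel /{/clocksource=/!s/$/ clocksource=tsc/}' "$1"
}
```

For example, `add_tsc_clocksource /boot/grub/menu.lst`, then check the file and reboot.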
From that moment, live migrations were stable and no longer left the VM unresponsive after a migration (or while idle). This was on my CentOS HA-KVM setup. I haven't tested it with Proxmox yet, but I expect the same behavior.
Normally the kvm-clock source is recommended, but tsc seems to work more reliably in my case.
 
When I first started building virtualized servers with Proxmox, I tried KVM. Soon enough I discovered a major showstopper, which I was unable to communicate to the KVM developers in a way that would make them understand it and be motivated to fix it (you can look up posts from Dmitry Golubev on the KVM mailing list). I think the problems you are having could be related... but no fix existed. I was hoping it was fixed in recent KVM, but your post convinced me it is not. The lockup happened for me quite regularly even without online migration - it was enough to have two KVM guests running on the same host, each using 3.5 GB of RAM on a host with 8 GB (I don't have a recipe for triggering it, however, so it is not reliably reproducible). Changing the clock source improved stability considerably, but did not fix the issue completely. Soon enough I got fed up with it and migrated to OpenVZ (which, by itself, had a lot of other issues, like kernel panics all over, but those at least are getting fixed, and recent kernels are decently stable).
 
You may want to try again. I figured out that I was running a 2.6.32.12-0.7 kernel from the beginning of 2011. After updating to the latest available kernel in the guest (SLES 11 SP1, 2.6.32.49-0.3), I can still use the kvm-clock source without problems so far.
Like you said, I also had the problem with an idle machine. I had ntpd running in the guest, and I have the feeling that clock adjustments by ntpd also caused the lockups (though less often). With the latest kernel in the guest I can now also run ntpd. Some sources say you shouldn't run ntpd in the guest, but in my experience the clock may start drifting without NTP.
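One way to see whether the guest clock actually jumps (e.g. right after a migration) is to sample the wall clock once per second and flag intervals that are far from one second. A rough sketch; the function name and threshold are mine:

```shell
#!/bin/sh
# check_clock_jumps N: sample the wall clock once per second for N
# intervals and report any interval where the clock went backwards or
# jumped by more than 2 s. Prints the number of suspicious intervals.
check_clock_jumps() {
    samples=$1
    prev=$(date +%s)
    jumps=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep 1
        now=$(date +%s)
        delta=$((now - prev))
        if [ "$delta" -lt 0 ] || [ "$delta" -gt 2 ]; then
            jumps=$((jumps + 1))
            echo "clock jump: ${delta}s between samples" >&2
        fi
        prev=$now
        i=$((i + 1))
    done
    echo "$jumps"
}
```

Run something like `check_clock_jumps 60` in the guest while migrating it; on a healthy clock it should print 0.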
 
wow, great find! i think i did have ntp there because i could not afford clock drift, and yeah, i was experimenting with this in mid-2010. it's a bit too late for me, though, as i have already sold my soul to the devi... er... that is... openvz :D and in all honesty, i think openvz is still faster than kvm (of course, if you do not require full virtualization). as a drawback, of course, there is a possibility that doing something in the container sends your kernel into a panic, but these bugs are finally getting fixed, so i don't have any crashes with the latest proxmox kernel (yes, i know, there is one issue still left, but it does not affect me right now, thanks devil).

however, thanks once again for the find! if i will be doing anything new, i will look into kvm again :)
 
