KVM/Qemu Online Migration often ends up with 100% cpu, no ping, frozen VM

sommarnatt · Jul 7, 2014

Hi,

I'm running a VE 3.2 cluster with Ceph Dumpling. Everything works very well except for a few hiccups. One of the major issues is with Online Migration of VMs between hosts with identical hardware and software (exactly the same CPU).

What is interesting is that Online Migration works perfectly if the VM was recently rebooted, but after a certain uptime (can be several hours) the online migration will fail and the VM freezes at 100% CPU. It no longer answers pings, and the console is no help either (it actually connects, but the VM is frozen there).

I managed to get an strace of the qemu process, but I'm not sure if there's anything useful in it.
If anyone has ANY idea where I should start debugging, that'd be awesome.

Possibly this part:

Code:

read(5, "\2\0\0\0\0\0\0\0", 16)         = 8
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
recvmsg(50, {msg_name(0)=NULL, msg_iov(1)=[{"}", 1}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 1
ioctl(11, KVM_CHECK_EXTENSION, 0x10)    = 1
write(50, "{\"return\": {\"actual\": 8589934592"..., 79) = 79
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8

Attaching the rest of the strace: strace.txt.zip
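For anyone working through a trace like this one, a quick tally of syscalls can show what the process is busy doing while it spins. This is only a minimal sketch, assuming plain `strace` output with one call per line (optionally prefixed by a PID); `tally_syscalls` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
import re
import sys
from collections import Counter

# Matches the syscall name at the start of a line, allowing an optional
# "[pid 1234] " or "1234 " prefix as produced by strace -f.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")


def tally_syscalls(lines):
    """Count how often each syscall name appears in strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py strace.txt
    with open(sys.argv[1], errors="replace") as trace:
        for name, count in tally_syscalls(trace).most_common(10):
            print(f"{count:8d}  {name}")
```

A trace dominated by a handful of calls repeating in a tight loop would be a reasonable place to start looking.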

I've also noticed that KSM is on, but that shouldn't affect Online Migration right?

ScOut3R · Jul 8, 2014

Is the freeze permanent, or will it disappear after a few minutes? I had the same issue, and it turned out that the hosts were using the intel_pstate driver, which altered the clock frequency. That affected timekeeping in the guests, so during live migration the guests were frozen for a few minutes. I'm running the 3.10 pve kernel and using Ceph Firefly as the backend. Disabling the intel_pstate driver solved the issue for me.
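To check which scaling driver a host is using, the value can be read from the standard cpufreq sysfs file. The sketch below assumes a Linux host that exposes cpufreq for `cpu0`; `read_scaling_driver` is a hypothetical helper name. The driver itself is normally disabled by booting the host with the `intel_pstate=disable` kernel parameter.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; cpu0 is used as a representative core.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def read_scaling_driver(cpufreq_dir=CPUFREQ_DIR):
    """Return the active cpufreq scaling driver name, or None if unavailable."""
    try:
        return (Path(cpufreq_dir) / "scaling_driver").read_text().strip()
    except OSError:
        # No cpufreq support exposed (for example inside some VMs or containers).
        return None


if __name__ == "__main__":
    driver = read_scaling_driver()
    print("scaling driver:", driver or "not available")
    if driver == "intel_pstate":
        print("intel_pstate is active on this host")
```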

sommarnatt · Jul 8, 2014

Ah you may be right there, I see that intel_pstate is set as the scaling driver (we're using backport kernel 3.13 due to some issues with LACP with 2.6.32 and 3.10).
I actually left it running for 15 minutes or more and it was still stuck, but maybe I'll try and reboot with the old scaling driver and see what happens.
Thanks!

udo · Jul 10, 2014

sommarnatt said:
Ah you may be right there, I see that intel_pstate is set as the scaling driver (we're using backport kernel 3.13 due to some issues with LACP with 2.6.32 and 3.10).
I actually left it running for 15 minutes or more and it was still stuck, but maybe I'll try and reboot with the old scaling driver and see what happens.
Thanks!

Hi,
I'm wondering why you ask for help while using a non-standard kernel (3.13), and especially why you wrote nothing about that in the first post!!
How are other people supposed to compare this issue when the base is different...

Udo

sommarnatt · Jul 10, 2014

I apologize for leaving that out. I've tried the standard kernels as well (2.6.32 and 3.10), which is why I didn't mention it. But you're right, I should have mentioned it.

sommarnatt · Jul 31, 2014

I disabled intel_pstate and rebooted the hosts, but that didn't fix it (although cpuinfo now reports a stable frequency).

However, the solution was to change the clocksource in the GUEST from kvm-clock to tsc. Now we're able to live migrate CentOS / CloudLinux guests. It seems to be a bug in the CentOS 6.5 kernel / kvm-clock, because Ubuntu 14.04 works perfectly fine with kvm-clock.
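For reference, the guest's clocksource can be inspected and switched at runtime through sysfs, where the KVM paravirtual clock is listed as `kvm-clock`. This is a minimal sketch, assuming a Linux guest and root privileges for the write; the function names are hypothetical helpers, not the exact steps used above.

```python
from pathlib import Path

# Standard Linux sysfs directory for the system clocksource.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def current_clocksource(base=CLOCKSOURCE_DIR):
    """Return the clocksource the kernel is currently using."""
    return (Path(base) / "current_clocksource").read_text().strip()


def available_clocksources(base=CLOCKSOURCE_DIR):
    """Return the list of clocksources the kernel offers."""
    return (Path(base) / "available_clocksource").read_text().split()


def set_clocksource(name, base=CLOCKSOURCE_DIR):
    """Switch the running kernel to the named clocksource (needs root)."""
    if name not in available_clocksources(base):
        raise ValueError(f"clocksource {name!r} is not available")
    (Path(base) / "current_clocksource").write_text(name)


if __name__ == "__main__":
    try:
        print("current:  ", current_clocksource())
        print("available:", ", ".join(available_clocksources()))
    except OSError as exc:
        print("could not read clocksource info:", exc)
```

Calling `set_clocksource("tsc")` inside the guest changes it only until the next reboot. To make it persistent, `clocksource=tsc` can be added to the guest's kernel command line.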

Hope this helps someone
