KVM/QEMU Online Migration often ends up with 100% CPU, no ping, frozen VM

sommarnatt

Active Member
Mar 20, 2014
Sweden
Hi,

I'm running a Proxmox VE 3.2 cluster with Ceph Dumpling. Everything works very well except for a few hiccups. One of the major issues is with Online Migration of VMs between hosts with the same hardware/software (exactly the same CPU).

What is interesting is that Online Migration works perfectly if the VM was recently rebooted, but after a certain uptime (can be several hours) the online migration fails and the VM freezes at 100% CPU. It no longer answers pings, and the console connects but only shows a frozen VM.

I managed to get an strace of the qemu process, but I'm not sure whether there's anything useful in it.
If anyone has ANY idea where I should start debugging, that'd be awesome.

Possibly this part:
Code:
read(5, "\2\0\0\0\0\0\0\0", 16)         = 8
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
recvmsg(50, {msg_name(0)=NULL, msg_iov(1)=[{"}", 1}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 1
ioctl(11, KVM_CHECK_EXTENSION, 0x10)    = 1
write(50, "{\"return\": {\"actual\": 8589934592"..., 79) = 79
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8

Attaching the rest of the strace: strace.txt.zip

I've also noticed that KSM is on, but that shouldn't affect Online Migration, right?
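
To put a number on "KSM is on", the host's KSM state can be read from sysfs. This is only a minimal sketch: the function name `ksm_status` is made up for illustration, and it assumes the usual `/sys/kernel/mm/ksm` layout with its `run` and `pages_sharing` files.

```python
from pathlib import Path


def ksm_status(ksm_dir="/sys/kernel/mm/ksm"):
    """Report whether KSM is running and how many pages it is sharing.

    ksm_status is a hypothetical helper; ksm_dir is assumed to be the
    standard sysfs location on the host.
    """
    d = Path(ksm_dir)
    run = int((d / "run").read_text())  # 1 means the ksmd thread is merging pages
    pages_sharing = int((d / "pages_sharing").read_text())
    return {"running": run == 1, "pages_sharing": pages_sharing}
```

Calling `ksm_status()` on each host before a migration would at least show whether KSM is actually merging pages at that moment.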
 
Is the freeze permanent, or does it disappear after a few minutes? I had the same issue, and it turned out that the hosts were using the intel_pstate driver, which altered the clock frequency and affected timekeeping in the guests, so during live migration the guests were frozen for a few minutes. I'm running the 3.10 pve kernel with Ceph Firefly as the backend. Disabling the intel_pstate driver solved the issue for me.
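
To check which scaling driver a host is using, something like the following should work. It is a rough sketch, not a definitive tool: `scaling_drivers` is a name made up here, and it assumes the standard cpufreq sysfs layout under `/sys/devices/system/cpu`.

```python
from pathlib import Path


def scaling_drivers(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: scaling_driver} for every CPU that exposes cpufreq.

    scaling_drivers is a hypothetical helper; cpu_root is assumed to be
    the usual sysfs location on the host.
    """
    drivers = {}
    for f in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_driver")):
        # f is .../cpuN/cpufreq/scaling_driver, so two parents up is cpuN
        drivers[f.parent.parent.name] = f.read_text().strip()
    return drivers
```

If `scaling_drivers()` reports `intel_pstate` on the hosts, the driver is active. It can usually be turned off with the `intel_pstate=disable` kernel boot parameter followed by a reboot.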
 
Ah you may be right there, I see that intel_pstate is set as the scaling driver (we're using backport kernel 3.13 due to some issues with LACP with 2.6.32 and 3.10).
I actually left it running for 15 minutes or more and it was still stuck, but maybe I'll try and reboot with the old scaling driver and see what happens.
Thanks!
 
Hi,
I'm wondering why you ask for help while using a non-standard kernel (3.13), and especially why you wrote nothing about that in the first post!!
How are other people supposed to compare this issue when the base is different...

Udo
 
I apologize for leaving that out. I've tried the standard kernels as well (2.6.32 and 3.10), which is why I didn't mention it. But you're right, I should have mentioned it.
 
I tried disabling intel_pstate and rebooting the hosts, but that didn't help (although cpuinfo now reports a stable frequency).

However, the solution was to change the clocksource in the GUEST from kvm_clock to tsc. Now we're able to live migrate CentOS / CloudLinux guests. It seems to be a bug in the CentOS 6.5 kernel / kvm_clock, because Ubuntu 14.04 works perfectly fine with kvm_clock.
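
For anyone wanting to try the same change, here is a minimal sketch of inspecting and switching the clocksource inside the guest. The helper names `clocksource_status` and `set_clocksource` are made up for illustration, and the code assumes the usual sysfs path. Note that in sysfs the paravirtual clock is normally listed as `kvm-clock` (with a hyphen).

```python
from pathlib import Path

# Assumed standard sysfs location of the clocksource controls in the guest.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by the kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def set_clocksource(name, base=CLOCKSOURCE_DIR):
    """Switch the clocksource at runtime (hypothetical helper).

    Needs root inside the guest, and the change does not survive a reboot.
    """
    _, available = clocksource_status(base)
    if name not in available:
        raise ValueError(f"{name!r} not in available clocksources: {available}")
    # Same effect as: echo NAME > current_clocksource
    (Path(base) / "current_clocksource").write_text(name + "\n")
```

Running `set_clocksource("tsc")` as root only lasts until the next reboot. To make it permanent, the usual route is adding `clocksource=tsc` to the guest's kernel command line.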

Hope this helps someone ;)
 
