VM freezing (with vCPU at 100%) when doing live migration

Daniel S

Feb 10, 2016
Hi.

We have a test Proxmox 4.1 cluster with 3 servers, all Dell R415 and R515 with AMD CPUs.
The "Promox1" server has an AMD Opteron 4334 CPU. "Promox2" and "Promox3" have AMD Opteron 4180 CPUs.

All nodes are updated to the latest version, 4.1-5/f910ef5c (pve-no-subscription repository), with the latest pve-kernel (pve-kernel-4.2.6-1-pve).

If we live-migrate a KVM-based virtual server between "Promox2" and "Promox3" (P2 to P3 or P3 to P2), it always works. If we move from "Promox2" or "Promox3" to "Promox1" (P2 to P1 or P3 to P1), it also works with no problems. But when we live-migrate from "Promox1" to "Promox2" or "Promox3" (P1 to P2 or P1 to P3), the migration completes but the virtual server hangs. It freezes (network and console don't work) and the vCPU of the server goes to 100%. We left it in that state for up to 10 minutes and it did not become responsive again.

I saw a bug that looks more or less similar (the steal time bug), but it's already patched in the kernel that we use.

It seems like a problem when the virtual machine goes from a "newer" CPU (Opteron 4334) to an older one (Opteron 4180).

To test this, we added a 4th server to the cluster (same software version): an HP ProLiant DL165 G7 with an AMD Opteron 6238 CPU. We call it "Promox4".
When we do live migrations from "Promox4" to "Promox1" or from "Promox1" to "Promox4", they work perfectly. The same goes for migrations from "Promox2" or "Promox3" to "Promox4" (P2 to P4 or P3 to P4).
But the same problem appears again if we migrate from "Promox4" to "Promox2" or "Promox3" (P4 to P2 or P4 to P3).

It seems to be something related to the CPU flags. "Old to new" is OK, but "new to old" freezes the server.

So, live migrations are OK if they are done:
AMD Opteron 4180 to AMD Opteron 4180
AMD Opteron 4180 to AMD Opteron 4334
AMD Opteron 4180 to AMD Opteron 6238
AMD Opteron 4334 to AMD Opteron 6238
AMD Opteron 6238 to AMD Opteron 4334

And the VM freezes and the vCPU goes to 100% if it's done this way:
AMD Opteron 4334 to AMD Opteron 4180
AMD Opteron 6238 to AMD Opteron 4180

These are the flags of each type of CPU:

AMD 4180 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall


AMD 4334 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi

AMD 6238 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core arat cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
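
To make the difference between the flag lists easier to see, here is a minimal Python sketch that computes which flags a source CPU has that a destination CPU lacks. The helper names `parse_flags` and `missing_on_target` are made up for illustration, and the two strings are the 4180 and 4334 flag lists copied from above. Keep in mind that a guest only sees the flags of its virtual CPU model, so this comparison describes the hosts and not necessarily what the VM actually uses.

```python
# Flag lists copied from the /proc/cpuinfo output quoted in this post.
FLAGS_4180 = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni
monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt
lbrv svm_lock nrip_save pausefilter vmmcall
"""

FLAGS_4334 = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext
perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save
tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
vmmcall bmi
"""


def parse_flags(text):
    """Turn a whitespace-separated flag list into a set of flag names."""
    return set(text.split())


def missing_on_target(source, target):
    """Return the sorted flags present on the source but absent on the target."""
    return sorted(parse_flags(source) - parse_flags(target))


if __name__ == "__main__":
    print("4334 -> 4180 loses:", " ".join(missing_on_target(FLAGS_4334, FLAGS_4180)))
    print("4180 -> 4334 loses:", " ".join(missing_on_target(FLAGS_4180, FLAGS_4334)))
```

With these two lists, the 4334 has many flags that the 4180 lacks (for example avx, aes and sse4_1), while the 4180 only has 3dnow and 3dnowext that the 4334 lacks.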


And this is the info about the Proxmox software:

proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-31
qemu-server: 4.0-49
pve-firmware: 1.1-7
libpve-common-perl: 4.0-45
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-3
pve-container: 1.0-39
pve-firewall: 2.0-15
pve-ha-manager: 1.0-19
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-6
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie


Any idea how to solve it? Could it be a kernel bug?

Thanks in advance.

Daniel.
 
Hi, sorry if I overlooked it in your post, but what CPU type have you assigned to the VM? Can you post its config?

When doing live migration between nodes with different CPUs, be sure to set it to the default: "kvm64".

If this is set and the error still gets triggered, that seems quite strange, yeah.
 
The CPU was set to kvm64 (default). We also tried changing it to qemu64, Opteron_G1 and Opteron_G2, with the same results.

We also tried changing the disk and controller type to check whether they could be the cause, and even tried with no network card. The results were always the same. We use Ceph as shared storage, but also tried with the VM disk in "qcow2" format on an NFS share.

If we move from a host that has an AMD Opteron 6238 or AMD Opteron 4334 CPU to one with an AMD Opteron 4180, the virtual machine freezes and the vCPU usage goes to 100% (CPU usage 100% of 1 CPU).

I don't know if this helps: if we set 1 CPU with 2 cores, the result is the same, but the vCPU usage shows 50% (CPU usage 50% of 2 CPUs).
If we set 1 CPU with 4 cores, it freezes and the usage (according to the VM summary) is: CPU usage 25% of 4 CPUs.
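
These percentages are consistent with exactly one vCPU spinning at 100% while the others stay idle, although that is only my reading of the summary numbers and not something we verified inside the guest. A small Python sketch of the arithmetic (the function name is hypothetical):

```python
def reported_usage(busy_vcpus: int, total_vcpus: int) -> float:
    """Overall usage in percent if `busy_vcpus` vCPUs are fully busy and the rest idle."""
    return 100.0 * busy_vcpus / total_vcpus


if __name__ == "__main__":
    # Assumption: a single vCPU is stuck at 100% in every test.
    for total in (1, 2, 4):
        print(f"{total} vCPU(s): {reported_usage(1, total):.0f}%")
```

For 1, 2 and 4 vCPUs this gives 100%, 50% and 25%, which matches what the VM summary shows.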

The base config is:

---------------------
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
keyboard: es
memory: 2048
name: TEST-35.130
net0: virtio=32:32:37:38:65:61,bridge=vmbr0
numa: 0
ostype: l26
scsi0: virtualkvm:vm-100-disk-1,cache=writeback,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=450b6e58-c967-471a-8a23-a95b4a6292a5
sockets: 1

---------------------
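
Note that the config has no "cpu:" line, which is why the VM runs with the default type (kvm64, as mentioned above). Here is a minimal Python sketch of how one could check this for a config in the "key: value" form shown above. The helper names `parse_config` and `cpu_type` are hypothetical, and this is not how Proxmox itself parses the file.

```python
# The VM config as posted above.
CONFIG = """\
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
keyboard: es
memory: 2048
name: TEST-35.130
net0: virtio=32:32:37:38:65:61,bridge=vmbr0
numa: 0
ostype: l26
scsi0: virtualkvm:vm-100-disk-1,cache=writeback,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=450b6e58-c967-471a-8a23-a95b4a6292a5
sockets: 1
"""


def parse_config(text):
    """Parse 'key: value' lines into a dict, skipping blanks and comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; values such as MAC addresses contain colons.
        key, sep, value = line.partition(":")
        if sep:
            options[key.strip()] = value.strip()
    return options


def cpu_type(options, default="kvm64"):
    """Return the configured CPU type, or the default when no cpu line is set."""
    return options.get("cpu", default)


if __name__ == "__main__":
    print("CPU type:", cpu_type(parse_config(CONFIG)))
```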

We've tried with Debian 8 and Ubuntu 12.04 as guests.
 
Hi again.

Trying to work out what is happening, I found that in 2014 there was (and maybe still is) a bug in qemu-kvm with live migration between AMD Family 10h and Family 15h CPUs.

http://lists.gnu.org/archive/html/qemu-discuss/2014-02/msg00002.html

We have

- AMD Opteron 6238 - Family 15h (1st gen)
- AMD Opteron 4334 - Family 15h (2nd gen)
- AMD Opteron 4180 - Family 10h

So the freeze we see happens exactly when doing a "Family 15h" to "Family 10h" live migration. Does anybody know if there's a way to solve it, or is it just a case of "live with it as best you can"?
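
Put as a rule, our results so far are: the VM freezes when the source host is Family 15h and the destination is Family 10h, and it works in every other combination we tested. Here is a minimal Python sketch of that rule. The table and function names are made up, and the rule only summarises our own observations; it says nothing about how QEMU behaves in general.

```python
# CPU families as listed above.
CPU_FAMILY = {
    "Opteron 4180": "10h",
    "Opteron 4334": "15h",
    "Opteron 6238": "15h",
}

# (source CPU, destination CPU, did the VM freeze?) as observed in our tests.
OBSERVED = [
    ("Opteron 4180", "Opteron 4180", False),
    ("Opteron 4180", "Opteron 4334", False),
    ("Opteron 4180", "Opteron 6238", False),
    ("Opteron 4334", "Opteron 6238", False),
    ("Opteron 6238", "Opteron 4334", False),
    ("Opteron 4334", "Opteron 4180", True),
    ("Opteron 6238", "Opteron 4180", True),
]


def freeze_expected(source_cpu, dest_cpu):
    """Hypothesis: freezes happen only when going from Family 15h to Family 10h."""
    return CPU_FAMILY[source_cpu] == "15h" and CPU_FAMILY[dest_cpu] == "10h"


if __name__ == "__main__":
    for src, dst, froze in OBSERVED:
        verdict = "matches" if freeze_expected(src, dst) == froze else "DOES NOT match"
        print(f"{src} -> {dst}: froze={froze}, rule {verdict}")
```

Every observation in the list matches the rule, which is why the Family 10h/15h bug looks like a plausible explanation.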

Thanks.
 
