Hi.
We have a test Proxmox 4.1 cluster with 3 servers. All are Dell R415 and R515 machines with AMD CPUs.
The "Promox1" server has an AMD Opteron 4334 CPU. "Promox2" and "Promox3" have AMD Opteron 4180 CPUs.
The whole Proxmox installation is updated to the latest version, 4.1-5/f910ef5c (pve-no-subscription repository), with the latest pve-kernel (pve-kernel-4.2.6-1-pve).
If we live-migrate a KVM-based virtual server between "Promox2" and "Promox3" (P2 to P3 or P3 to P2), it always works. If we migrate from "Promox2" or "Promox3" to "Promox1" (P2 to P1 or P3 to P1), it works with no problems. But when we live-migrate from "Promox1" to "Promox2" or "Promox3" (P1 to P2 or P1 to P3), the migration completes but the virtual server hangs. It freezes (network and console don't work) and the VM's vCPU usage goes to 100%. We left it in that state for up to 10 minutes and it did not become responsive again.
I saw a bug that looks more or less similar (the steal time bug), but it's already patched in the kernel that we use.
It seems to be a problem when the virtual machine goes from a "newer" CPU (Opteron 4334) to an older one (Opteron 4180).
To test this, we added a 4th server to the cluster (same software version): an HP ProLiant DL165 G7 with an AMD Opteron 6238 CPU. We call it "Promox4".
When we do live migrations from "Promox4" to "Promox1" or from "Promox1" to "Promox4", they work perfectly. The same goes for migrations from "Promox2" or "Promox3" to "Promox4" (P2 to P4 or P3 to P4).
But the same problem appears again if we migrate from "Promox4" to "Promox2" or "Promox3" (P4 to P2 or P4 to P3).
It seems to be something related to the CPU flags. "Old to new" is OK, but "new to old" freezes the server.
So, live migrations are OK in these directions:
AMD Opteron 4180 to AMD Opteron 4180
AMD Opteron 4180 to AMD Opteron 4334
AMD Opteron 4180 to AMD Opteron 6238
AMD Opteron 4334 to AMD Opteron 6238
AMD Opteron 6238 to AMD Opteron 4334
And the VM freezes and its vCPU goes to 100% in these directions:
AMD Opteron 4334 to AMD Opteron 4180
AMD Opteron 6238 to AMD Opteron 4180
These are the flags of each type of CPU:
AMD 4180 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
AMD 4334 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vmmcall bmi
AMD 6238 CPU Flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core arat cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
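To make the difference between the three flag lists easier to see, here is a rough Python sketch that computes which flags a source CPU has and a destination CPU lacks. The flag strings are copied from the lists above; the names `FLAGS` and `missing_on_target` are only made up for this illustration. It only compares host flags, so it does not prove which flag (if any) causes the freeze.

```python
# Host CPU flags as posted above (from /proc/cpuinfo on each node).
FLAGS = {
    "4180": set("""fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
        cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
        pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc
        extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic
        cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
        nodeid_msr hw_pstate npt lbrv svm_lock nrip_save pausefilter
        vmmcall""".split()),
    "4334": set("""fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
        cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
        pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid
        aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt
        aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
        misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce
        nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt
        lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists
        pausefilter pfthreshold vmmcall bmi""".split()),
    "6238": set("""fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
        cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
        pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm
        aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes
        xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
        misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr
        topoext perfctr_core arat cpb npt lbrv svm_lock nrip_save tsc_scale
        vmcb_clean flushbyasid decodeassists pausefilter
        pfthreshold""".split()),
}


def missing_on_target(source, target):
    """Return the flags present on the source CPU but absent on the target."""
    return FLAGS[source] - FLAGS[target]


if __name__ == "__main__":
    # Print the flag difference for every migration direction we tested.
    for src, dst in [("4334", "4180"), ("6238", "4180"),
                     ("4180", "4334"), ("4180", "6238"),
                     ("4334", "6238"), ("6238", "4334")]:
        diff = sorted(missing_on_target(src, dst))
        print(f"{src} -> {dst}: {len(diff)} flags missing on target")
        print("   ", " ".join(diff))
```

In both failing directions the destination (4180) lacks flags such as avx, xsave, aes, ssse3, sse4_1 and sse4_2 that the source has.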
And this is the info about the Proxmox software:
proxmox-ve: 4.1-34 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-5 (running version: 4.1-5/f910ef5c)
pve-kernel-4.2.6-1-pve: 4.2.6-34
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-31
qemu-server: 4.0-49
pve-firmware: 1.1-7
libpve-common-perl: 4.0-45
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-3
pve-container: 1.0-39
pve-firewall: 2.0-15
pve-ha-manager: 1.0-19
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-6
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
Any idea how to solve this? Could it be a kernel bug?
Thanks in advance.
Daniel.