Live migration broken for VMs doing "real" work

sseidel

We started using Proxmox in production in March this year. Everything was working fine; live migration was great and working perfectly (we had the VM CPU type set to "host" for performance).

Then, with some update (around the time the major Spectre/Meltdown fixes came in), this stopped working for all but the most basic VMs. For instance, a VM running just Postfix will still migrate OK, but if there's any Java process running on the machine, we get an almost immediate kernel dump in the VM and need to hard reset it.
Here's a screenshot of a crashed VM: [attached: Screenshot_2018-12-03_07-41-30.png]
This happens very "reliably" with Linux (Debian 9) VMs running Java, but sometimes also with others. Windows VMs are usually OK, and pfSense can also be migrated without problems.

Does anybody know what's going on? Storage is Ceph, and the connection between hosts is 10GbE, if that matters.

We tried setting the CPU type to "host" and "default" (kvm64), and it doesn't work with either. Also, if I set the CPU type to "Westmere", almost the exact same kernel dump happens already at startup (i.e. start the VM, start a Java process).
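
(For reference, the same can be done from the CLI; VMID 100 below is just a stand-in:)

Code:
# Show the currently configured CPU type of VM 100
qm config 100 | grep ^cpu
# Switch the CPU type to kvm64; takes effect on the next VM start
qm set 100 --cpu kvm64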

Thanks for any pointers,

Stefan
 
Hi,

have you installed the current Intel microcode updates?
 
Code:
ii  amd64-microcode                      3.20160316.3
ii  intel-microcode                      3.20180807a.1~deb9u1
Do you mean that? That's on the hosts.
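
For anyone wanting to check the same on their hosts: the package state comes from dpkg, and whether the CPU actually loaded an update shows up in the kernel log:

Code:
# Installed microcode packages on the host
dpkg -l | grep microcode
# Did the kernel apply a microcode update at boot?
dmesg | grep -i microcode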
 
Do you migrate from AMD to Intel?
If so, that has never worked 100%.
 
Even on two machines that are virtually identical (same CPU, same RAM) it doesn't work. Also, I wouldn't expect a machine to crash on startup just because I select the wrong CPU type (Westmere, for example). I'll try to work on a reproducible test case.
 
Here's an interesting test case: I created a VM with these parameters: [attached: Screenshot_2018-12-03_08-13-32.png]

And started a basic Debian 9 text install via netboot.xyz. It crashed halfway through the install process.
I could see the error message on console 3 (Alt+F3).
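
If anyone wants to try to reproduce this, a test VM along these lines should be close; VMID 999, the storage name, and the ISO path are placeholders for my setup:

Code:
# Minimal test VM; storage and ISO names are placeholders
qm create 999 --name migtest --memory 2048 --cores 2 --cpu Westmere \
    --net0 virtio,bridge=vmbr0 --scsi0 ceph-vm:32 \
    --cdrom local:iso/netboot.xyz.iso
qm start 999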
 
I think this was related to the physical CPU of the host, not the VM.

That's how I understood it. The problem happens regardless of whether the physical machines have the same or different CPUs. We have 5 machines: 2 Xeon, 2 Ryzen, 1 Epyc. So there are plenty of combinations to test, and none of them work.
 
Hi,
I guess the two Xeons don't have the same BIOS patch level (Spectre mitigations and so on)?! Are the CPU flags 100% identical?

Does live migration work with kvm64?

Udo
 
Why would you guess that?

The BIOS versions and CPU microcode levels, as well as /proc/cpuinfo, are absolutely identical. The machines were provisioned on the same day, so I see no reason why they would differ. Also, how would anything there explain that migration from Ryzen to Ryzen (again, two identical machines) also does not work?
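
For the record, this is roughly how I compared them (host names are stand-ins for ours); no output means the flag sets match:

Code:
# Compare the CPU flag sets of two hosts
diff <(ssh pve1 grep -m1 ^flags /proc/cpuinfo | tr ' ' '\n' | sort) \
     <(ssh pve2 grep -m1 ^flags /proc/cpuinfo | tr ' ' '\n' | sort)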

We tried setting the CPU type to "host" and "default" (kvm64), and it doesn't work with either. Some other CPU types outright fail to even let a basic Linux installer run, but I didn't have time to check all of them.
 
So, is anybody able to confirm or deny that installing Debian 9 in a VM with the parameters outlined above works? I think that could be the first step towards finding out where the problem is.
 
@sseidel Do you have CPU and RAM hotplugging enabled? If so, try disabling it and/or configure your CD-ROM device as SCSI.
We've had bad experiences with this combination, and getting rid of the IDE CD-ROM did the trick.
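
Something like this should show and change it (VMID 100 as a stand-in):

Code:
# Check hotplug settings and any IDE CD-ROM on the VM
qm config 100 | grep -E '^(hotplug|ide)'
# Drop cpu/memory from the hotplug list, keep disk/network/usb
qm set 100 --hotplug disk,network,usb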

Cheers Knuuut
 
OK, we got new hardware, and it still didn't work, even between identical machines. We even purchased a PVE subscription to eliminate that as a factor.

I did some more searching then and found this thread:
https://pve.proxmox.com/pipermail/pve-user/2018-February/169238.html

which describes a similar problem and eventually led to this bug report:
https://bugzilla.proxmox.com/show_bug.cgi?id=1660

Then I did some trial and error, and it definitely works once I remove kvm_pv_unhalt from the CPU flags.

I've now settled on these flags for Linux machines:
  • kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic,+aes,+sep,+ibpb,+movbe,+lahf_lm,+virt-ssbd,+kvm_pv_eoi
and these for Windows:
  • kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic,+aes,+sep,+ibpb,+movbe,+lahf_lm,+virt-ssbd,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer
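
In case anyone wants to reproduce this: one way to apply such flags is the VM's args line in /etc/pve/qemu-server/<vmid>.conf, which hands the string straight to QEMU's -cpu option; the same line can be set with qm set <vmid> -args "...":

Code:
# /etc/pve/qemu-server/<vmid>.conf (Linux VM example)
args: -cpu kvm64,+ssse3,+sse4.1,+sse4.2,+x2apic,+aes,+sep,+ibpb,+movbe,+lahf_lm,+virt-ssbd,+kvm_pv_eoi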

That seems to be stable for now. Windows migrations appeared to work with the "Westmere" CPU type, but after some minutes the Windows VMs would crash/reboot. This doesn't seem to be happening with the above flags. Migration is stable even between completely different CPUs (EPYC, Ryzen, Xeon E3, Xeon W).

It's still very unsatisfactory to have to do this for every single VM. I do wonder why there's no response on that bug, since it's very easily verifiable.
 
