Problem with live migration?

dswartz

I'm running a fresh install of 7.2, since upgraded to 7.3. Guests live on a ceph/rbd datastore. A couple of times now, a live migration has appeared to complete successfully, but the guest then fails to come up on the target host. Log snippet:

2022-12-05 09:40:36 migration active, transferred 1.7 GiB of 8.0 GiB VM-state, 121.7 MiB/s
2022-12-05 09:40:37 migration active, transferred 1.8 GiB of 8.0 GiB VM-state, 111.4 MiB/s
2022-12-05 09:40:39 migration active, transferred 1.9 GiB of 8.0 GiB VM-state, 113.3 MiB/s
2022-12-05 09:40:40 average migration speed: 410.6 MiB/s - downtime 311 ms
2022-12-05 09:40:40 migration status: completed
2022-12-05 09:40:40 ERROR: tunnel replied 'ERR: resume failed - VM 106 qmp command 'query-status' failed - client closed connection' to command 'resume 106'
2022-12-05 09:40:53 ERROR: migration finished with problems (duration 00:00:36)
TASK ERROR: migration problems
 
Check the syslog/journal on the target host for any errors.
Most likely the VM crashed (segfault or so) on the target node.

Do you use an EPYC CPU by any chance?
In this case try kernel 5.19 on both source and target host.
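
If the journal on the target node is large, a small filter can help narrow it down. This is only a rough sketch: the keyword list and the choice to look at `QEMU[` and `kernel:` lines are my assumptions, not anything Proxmox prescribes, and `qemu_problem_lines` is a made-up helper name.

```python
import re

# Assumption: these keywords are a reasonable first pass, not an exhaustive list.
SUSPICIOUS = re.compile(r"segfault|assertion|error|warning", re.IGNORECASE)


def qemu_problem_lines(journal_text):
    """Return journal lines from QEMU or the kernel that look like problems."""
    hits = []
    for line in journal_text.splitlines():
        from_relevant_source = "QEMU[" in line or "kernel:" in line
        if from_relevant_source and SUSPICIOUS.search(line):
            hits.append(line.strip())
    return hits
```

The text to feed in could come from something like `journalctl --since "2022-12-05 09:40" --until "2022-12-05 09:41"` run on the target host.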
 
The 3 hosts:

32 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2 Sockets)
6 x Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (1 Socket)
16 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (1 Socket)

Aha, I think I see what happened:

Dec 05 09:40:40 pve1 QEMU[1625064]: kvm: warning: TSC frequency mismatch between VM (1699998 kHz) and host (2099999 kHz), and TSC scaling unavailable
Dec 05 09:40:40 pve1 QEMU[1625064]: kvm: error: failed to set MSR 0x38f to 0x7000000ff
Dec 05 09:40:40 pve1 QEMU[1625064]: kvm: ../target/i386/kvm/kvm.c:3096: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

Migrating from the slower CPU to a faster one failed due to this assertion. It doesn't happen every time though. A bug? I'm not overly concerned, as I had repurposed the slower server to use as the 3rd node. I was thinking of upgrading the CPU anyway.
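
For anyone wanting to spot this in their own logs, here is a minimal sketch that pulls the two frequencies out of the warning line above. The regex is written against the exact wording in my log; other QEMU versions may phrase it differently, and `tsc_mismatch` is just a name I picked.

```python
import re

# Matches the warning text exactly as it appears in the journal line above.
TSC_RE = re.compile(
    r"TSC frequency mismatch between VM \((\d+) kHz\) and host \((\d+) kHz\)"
)


def tsc_mismatch(line):
    """Return (vm_khz, host_khz, host/vm ratio), or None if the line does not match."""
    m = TSC_RE.search(line)
    if m is None:
        return None
    vm_khz, host_khz = int(m.group(1)), int(m.group(2))
    return vm_khz, host_khz, host_khz / vm_khz


line = (
    "Dec 05 09:40:40 pve1 QEMU[1625064]: kvm: warning: TSC frequency mismatch "
    "between VM (1699998 kHz) and host (2099999 kHz), and TSC scaling unavailable"
)
print(tsc_mismatch(line))
```

The VM value (about 1.70 GHz) lines up with the E5-2603 v4 and the host value (about 2.10 GHz) with the E5-2620 v4, which fits a migration from the slower node to a faster one.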
 
Make sure that the BIOS of all 3 servers is up-to-date. Also make sure to install the latest microcode package on all nodes.
Maybe there is a mismatch between features caused by exactly this, different microcode versions.
 
It's curious that it doesn't always happen, though. All 3 nodes are up to date according to 'apt update' etc.
 
I think I have an idea why this just started happening - I changed the CPU type of running guests from the default kvm64 to host.
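
With the CPU type set to host, the guest sees the feature set of whichever node it was started on, so one quick sanity check is to compare the `flags` line of /proc/cpuinfo between nodes. A rough sketch, assuming you have copied the contents of /proc/cpuinfo from each node into strings; `cpu_flags` and `missing_on_target` are hypothetical helper names. Note that the flags line does not capture every difference between two CPUs, so an empty result does not prove the hosts are interchangeable.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_cpuinfo, target_cpuinfo):
    """Return flags present on the source node but absent on the target node."""
    return sorted(cpu_flags(source_cpuinfo) - cpu_flags(target_cpuinfo))
```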
 
I don't think that's necessary.
Those are all of the same generation.

Make sure the BIOS on each of those machines is the newest one.
In addition install the Intel microcode package: https://wiki.debian.org/Microcode
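
To see whether the nodes actually run the same microcode, you can compare the `microcode` field of /proc/cpuinfo across them. A minimal sketch, assuming the /proc/cpuinfo text of each node has been collected into a dict keyed by node name; the function names and the node names in any example are placeholders.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


def microcode_report(cpuinfo_by_node):
    """Map each node name to a sorted list of the microcode revisions it reports."""
    return {
        node: sorted(microcode_revisions(text))
        for node, text in cpuinfo_by_node.items()
    }
```

A node listing more than one revision, or nodes with the same CPU model listing different revisions, would be worth a closer look.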
 
But why, then, is that assertion failure happening? In any event, I had been thinking about replacing that processor anyway, since that host gets too busy with only 6 cores/threads that are significantly slower. I will check the microcode package you referenced.
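
One thing that might help narrow it down is decoding the value QEMU tried to write. The sketch below assumes MSR 0x38f is IA32_PERF_GLOBAL_CTRL as listed in the Intel SDM, with the low bits enabling general-purpose performance counters and the bits from 32 upward enabling fixed counters. `decode_perf_global_ctrl` is a name made up for this post.

```python
def decode_perf_global_ctrl(value):
    """Split a 64-bit value into enabled general-purpose and fixed counter indices.

    Assumes the IA32_PERF_GLOBAL_CTRL layout: bits 0..31 for general-purpose
    counters, bits 32..63 for fixed counters.
    """
    gp_mask = value & 0xFFFFFFFF
    fixed_mask = (value >> 32) & 0xFFFFFFFF
    return {
        "general_purpose": [i for i in range(32) if (gp_mask >> i) & 1],
        "fixed": [i for i in range(32) if (fixed_mask >> i) & 1],
    }


decoded = decode_perf_global_ctrl(0x7000000FF)
print(decoded)
```

Under that assumption, 0x7000000ff enables general-purpose counters 0 through 7 and fixed counters 0 through 2. If the target host exposes fewer counters to the guest than the source did, the write could be rejected, but that is a guess on my part and not something the log confirms.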
 