Online migration fails from 6.3 to 6.4 with CPU set to AMD EPYC Rome

We have an issue with live migration for VMs using cpu: EPYC-Rome (or cpu: EPYC) between Proxmox 6.3 and 6.4. We've tested with 6.4-4 and 6.4-6 as the target, with no success.

Offline migration does work, but that's not an option for our cluster upgrade to 6.4.

With cpu: kvm64, online migration works as well, but that's not an option for us either.
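
For reference, this is roughly how we run the test from the CLI (a sketch of the procedure, using the VM ID and target node from the log below):

Code:
# set the CPU model under test on the VM config
qm set 330 --cpu EPYC-Rome

# attempt the online/live migration to the 6.4 node
qm migrate 330 cloud-10 --online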

Thanks for any help - here's the migration log:

Code:
2021-05-26 11:59:26 starting migration of VM 330 to node 'cloud-10' (10.122.22.10)
2021-05-26 11:59:26 starting VM 330 on remote node 'cloud-10'
2021-05-26 11:59:27 start remote tunnel
2021-05-26 11:59:28 ssh tunnel ver 1
2021-05-26 11:59:28 starting online/live migration on unix:/run/qemu-server/330.migrate
2021-05-26 11:59:28 set migration_caps
2021-05-26 11:59:28 migration speed limit: 8589934592 B/s
2021-05-26 11:59:28 migration downtime limit: 100 ms
2021-05-26 11:59:28 migration cachesize: 67108864 B
2021-05-26 11:59:28 set migration parameters
2021-05-26 11:59:28 start migrate command to unix:/run/qemu-server/330.migrate
2021-05-26 11:59:29 migration speed: 512.00 MB/s - downtime 46 ms
2021-05-26 11:59:29 migration status: completed
2021-05-26 11:59:29 ERROR: tunnel replied 'ERR: resume failed - VM 330 qmp command 'query-status' failed - client closed connection' to command 'resume 330'
2021-05-26 11:59:37 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems

VM config:

Code:
#Test AMD Hosts
agent: 1
boot: cd
bootdisk: scsi0
cores: 1
cpu: EPYC-Rome
ide2: cephfs-prox:iso/ubuntu-20.04-live-server-amd64.iso,media=cdrom,size=908M
kvm: 1
memory: 512
name: sotest-03
net0: virtio=06:FF:FF:FF:01:5a,bridge=vmbr1234,firewall=1
numa: 1
onboot: 1
ostype: l26
parent: autosnap_2021_05_26_00_15
scsi0: ceph-prox-hdd-r2:vm-330-disk-0,discard=on,size=10G
scsihw: virtio-scsi-pci
smbios1: uuid=ded14219-90fd-41d0-ac10-1a724728ec43
sockets: 2
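
(The config above is the raw VM config; if you want to compare it with your own setup, it can be dumped like this, either via the CLI or straight from the cluster filesystem:)

Code:
qm config 330
# or read the config file directly:
cat /etc/pve/qemu-server/330.conf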

pveversion on source host:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.114-1-pve)
pve-manager: 6.3-7 (running version: 6.3-7/85c4930a)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.11-pve1
ceph-fuse: 15.2.11-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.3-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-5
pve-cluster: 6.4-1
pve-container: 3.3-3
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

pveversion on target host:
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.114-1-pve)
pve-manager: 6.4-6 (running version: 6.4-6/be2fa32c)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.11-pve1
ceph-fuse: 15.2.11-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-5
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Just to confirm: do both the source and the target node support the given CPU model? Also, is there anything in the journal (source and target, 'journalctl -e') from the time the failure occurs?
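
Something along these lines should be enough to compare (the timestamps are just placeholders around the failed migration from your log):

Code:
# on both source and target node: confirm the physical CPU model
lscpu | grep 'Model name'

# check the journal around the time of the failure
journalctl --since "2021-05-26 11:59" --until "2021-05-26 12:01"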
 
Yes, both hosts run identical AMD CPUs.

Thanks for the hint. This is what I found on the target:

Code:
May 27 12:25:59 cloud-10 QEMU[3065431]: kvm: Unknown savevm section or instance 'apic' 8. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices
May 27 12:25:59 cloud-10 QEMU[3065431]: kvm: load of migration failed: Invalid argument

Edit: full logs attached
 

Hm, could you try running a Linux VM with that CPU type on both the source and target node (without migrating, just start them there) and show the output of 'lscpu' in both of them? It would be interesting to see if there's a difference. On the target, also try starting the test VM with the "Machine" version (in the "Hardware" tab, check the "Advanced" box) set to the previous or current one instead of "latest", and see if that changes anything. I'm going to take a closer look to see if I can reproduce the issue here too.
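
If the CLI is easier, pinning the machine version for such a test would look roughly like this (VM 330 and the 5.1 machine type are just examples, adjust as needed):

Code:
# pin the test VM to a fixed machine version instead of 'latest'
qm set 330 --machine pc-i440fx-5.1

# then start it and collect the CPU view from inside the guest
qm start 330
# (inside the guest)
lscpu > lscpu-$(hostname).txt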
 
Setting the machine type to 5.1, 5.0 or 4.2 doesn't make any difference. With 5.2, the VM can't be started on the source host (QEMU too old).

Here's the diff (full files attached):

Code:
diff lscpu-source.txt lscpu-target.txt
16,17c16,17
< CPU MHz:                         2800.000
< BogoMIPS:                        5600.00
---
> CPU MHz:                         2799.998
> BogoMIPS:                        5599.99

In /proc/cpuinfo, there's an additional difference in the apicid, showing the same ID number as in the kvm error message above:

Code:
< cpu MHz               : 2800.000
---
> cpu MHz               : 2799.998
22c22
< bogomips      : 5600.00
---
> bogomips      : 5599.99
36c36
< cpu MHz               : 2800.000
---
> cpu MHz               : 2799.998
42,43c42,43
< apicid                : 8
< initial apicid        : 8
---
> apicid                : 1
> initial apicid        : 1
50c50
< bogomips      : 5600.00
---
> bogomips      : 5599.99
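
For anyone wanting to reproduce the comparison, the relevant fields can be pulled straight out of the guests, e.g.:

Code:
# inside each test VM: the apicid values the guest sees on this host
grep -E '^(apicid|initial apicid)' /proc/cpuinfo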
 


Hm, yes. IIRC the code that calculates the apicid changed between QEMU 5.1 and 5.2, with some special-casing for EPYC... I'm afraid I don't really see a way around that. If you can at least restart the VMs, you could set the CPU model to something else, maybe even 'host', then migrate and change it back on the target. Alternatively, install pve-qemu-kvm=5.1.0-8 on the target, upgrade everything else (this is a compatible setup), migrate over, upgrade QEMU, then restart the VMs there (e.g. 'qm reboot'). In any case, the currently running VMs most likely cannot be migrated to a newer QEMU version without issues.
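
As a rough sketch of the first option (VM ID, node name and CPU model taken from this thread; double-check before running this against production VMs):

Code:
# on the source node: switch the CPU model and restart the VM so it actually runs with it
qm set 330 --cpu host
qm reboot 330

# live-migrate the restarted VM to the upgraded node
qm migrate 330 cloud-10 --online

# on the target node: switch the CPU model back; it applies at the next reboot
qm set 330 --cpu EPYC-Rome
qm reboot 330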
 
For the record:

After upgrading to 6.4, downgrade QEMU and libproxmox-backup-qemu0 (on which pve-qemu-kvm depends):
apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

Then migrate the VMs to the upgraded host, run apt dist-upgrade and qm reboot <vmid>. That worked well, at least in our case.
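
Put together, the sequence per node looked roughly like this in our case (VM ID and node name are examples):

Code:
# on the freshly upgraded 6.4 target: pin QEMU back to the 6.3 version
apt install pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1

# on the source node: live-migrate the VMs over
qm migrate 330 cloud-10 --online

# on the target, once all VMs have moved: lift the pin again
apt dist-upgrade

# finally restart each VM so it runs on the new QEMU
qm reboot 330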
 