Migration Problem

Hi, this just happened here. Once. I tried to reproduce, but I couldn't.

Here is the relevant information:

Node A (Origin)
Code:
root@tcn-05-lon-vh22:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
root@tcn-05-lon-vh22:~#


Node B (Destination)
Code:
root@tcn-05-lon-vh23:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
root@tcn-05-lon-vh23:~#

Task Log:
Code:
2020-07-07 12:36:15 use dedicated network address for sending migration traffic (192.168.254.23)
2020-07-07 12:36:15 starting migration of VM 115 to node 'tcn-05-lon-vh23' (192.168.254.23)
2020-07-07 12:36:16 starting VM 115 on remote node 'tcn-05-lon-vh23'
2020-07-07 12:36:18 start remote tunnel
2020-07-07 12:36:19 ssh tunnel ver 1
2020-07-07 12:36:19 starting online/live migration on tcp:192.168.254.23:60000
2020-07-07 12:36:19 set migration_caps
2020-07-07 12:36:19 migration speed limit: 8589934592 B/s
2020-07-07 12:36:19 migration downtime limit: 100 ms
2020-07-07 12:36:19 migration cachesize: 1073741824 B
2020-07-07 12:36:19 set migration parameters
2020-07-07 12:36:19 start migrate command to tcp:192.168.254.23:60000
2020-07-07 12:36:20 migration status: active (transferred 860171064, remaining 7604506624), total 8601477120)
2020-07-07 12:36:20 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:21 migration status: active (transferred 1490839321, remaining 5166247936), total 8601477120)
2020-07-07 12:36:21 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:22 migration status: active (transferred 2419432698, remaining 4145524736), total 8601477120)
2020-07-07 12:36:22 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:23 migration status: active (transferred 3341693116, remaining 3168620544), total 8601477120)
2020-07-07 12:36:23 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:24 migration status: active (transferred 4250583963, remaining 2085208064), total 8601477120)
2020-07-07 12:36:24 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:25 migration status: active (transferred 5133897493, remaining 1171369984), total 8601477120)
2020-07-07 12:36:25 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 5993867428, remaining 122421248), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6087660295, remaining 28807168), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6132640532, remaining 158900224), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 6066 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6144808989, remaining 146718720), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 9031 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6152910364, remaining 138604544), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 11005 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6159620448, remaining 131891200), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 12640 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6164504584, remaining 126853120), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 13830 overflow 0
2020-07-07 12:36:26 migration status: active (transferred 6167890491, remaining 123428864), total 8601477120)
2020-07-07 12:36:26 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 14655 overflow 0
2020-07-07 12:36:27 migration status: active (transferred 6172339540, remaining 118849536), total 8601477120)
2020-07-07 12:36:27 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 15740 overflow 0
2020-07-07 12:36:27 migration status: active (transferred 6177980550, remaining 112267264), total 8601477120)
2020-07-07 12:36:27 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 17113 overflow 0
2020-07-07 12:36:27 migration status: active (transferred 6216921470, remaining 72413184), total 8601477120)
2020-07-07 12:36:27 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 26603 overflow 0
2020-07-07 12:36:27 migration speed: 1024.00 MB/s - downtime 40 ms
2020-07-07 12:36:27 migration status: completed
2020-07-07 12:36:27 ERROR: tunnel replied 'ERR: resume failed - VM 115 qmp command 'query-status' failed - client closed connection' to command 'resume 115'
2020-07-07 12:36:30 ERROR: migration finished with problems (duration 00:00:15)
TASK ERROR: migration problems

VM config:
Code:
root@tcn-05-lon-vh23:~# cat /etc/pve/qemu-server/115.conf
#start_at_boot=1
balloon: 0
bios: ovmf
boot: dcn
bootdisk: virtio0
cores: 8
efidisk0: TN01:115/vm-115-disk-2.raw,size=128K
ide2: SN4-ISOS:iso/virtio-win-0.1.164.iso,media=cdrom,size=362130K
memory: 8192
name: ASN-05-LON-VPS03
net0: e1000=16:1A:60:D0:C2:AC,bridge=vmbr1,tag=1617
net1: e1000=DA:6A:E3:F9:39:A6,bridge=vmbr1,tag=1549
numa: 0
onboot: 1
ostype: win10
sata1: TN01:115/vm-115-disk-1.raw,size=2T
scsihw: virtio-scsi-pci
smbios1: uuid=2c68ba39-f37b-45c7-ade0-2f73724b32d0
sockets: 1
startup: up=5
vga: virtio
virtio0: TN01:115/vm-115-disk-0.raw,size=120G
vmgenid: dbb06674-47de-44ea-80d0-f1329ea57c45
root@tcn-05-lon-vh23:~#
 
I'm on Proxmox 6.2-6 now and my problem with migrations has gone away.
I'm not sure why I had a bandwidth limit on migrations, but I removed it around the same time as the upgrade; I'm not sure whether that helped.
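For reference, and purely as an assumption about where such a limit would have been configured (the original setting isn't shown here), a cluster-wide migration bandwidth limit normally lives in /etc/pve/datacenter.cfg and looks roughly like this; removing the bwlimit entry (or its migration= part) lifts the cap:
Code:
# /etc/pve/datacenter.cfg (illustrative value, not taken from this cluster)
# limits are in KiB/s; "migration" caps live-migration traffic
bwlimit: migration=102400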
 
We also randomly experience similar issues:

Code:
2020-12-02 12:30:55 migration xbzrle cachesize: 1073741824 transferred 48252 pages 256 cachemiss 45394 overflow 12
2020-12-02 12:30:55 migration speed: 101.14 MB/s - downtime 59 ms
2020-12-02 12:30:55 migration status: completed
2020-12-02 12:30:56 ERROR: tunnel replied 'ERR: resume failed - VM 248 qmp command 'query-status' failed - client closed connection' to command 'resume 248'
2020-12-02 12:30:59 ERROR: migration finished with problems (duration 00:01:28)
TASK ERROR: migration problems

Not really reproducible, and migration works most of the time. No nested virtualization here. Storage is multipathed iSCSI.
 
I've experienced the same ugly issue today. The workaround of doing an offline migration first and booting the VM on the other node works, and the VM can be migrated online since then. But it's really sad. It affects all VMs on 1 of my 7 nodes. All nodes are fully upgraded.
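For anyone who wants to try the same workaround from the CLI, here is a minimal sketch (VM ID 115 is just borrowed from the first post as an example; node names are placeholders):
Code:
# shut the guest down, migrate it offline, then start it on the target
qm shutdown 115
qm migrate 115 target-node
qm start 115            # run this on target-node
# once it has booted there, live migration usually works again:
qm migrate 115 another-node --online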
 
Hello,

Please check whether you have created a "Linux Bridge". In my case, this fixed my problem.

1. When trying to migrate a KVM/QEMU guest I get this error:
Code:
2020-12-02 12:30:56 ERROR: tunnel replied 'ERR: resume failed - VM 248 qmp command 'query-status' failed - client closed connection' to command 'resume 248'
2020-12-02 12:30:59 ERROR: migration finished with problems ....

2. I can successfully migrate an LXC container, but it then fails to start. The error I get is:
Code:
run_buffer: 314 Script exited with status 2
lxc_create_network_priv: 3068 No such device - Failed to create a network device
lxc_spawn: 1786 Failed to create the network
__lxc_start: 1999 Failed to spawn container "103"
TASK ERROR: startup for container '103' failed

Obviously, it looks like a networking issue ...

The catch here, I think, is that we have successfully joined a node to the cluster without setting up a "Linux Bridge".
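If that's the case, the guest migrates fine but its tap/veth interface has no bridge to attach to on the new node, so resume/start fails. A minimal sketch of the missing bridge in /etc/network/interfaces on the new node (the physical port name is an assumption, adapt it to your NIC; vmbr1 and the VLAN tags come from the configs earlier in the thread):
Code:
auto vmbr1
iface vmbr1 inet manual
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094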
 
I have the same error with an Arch Linux KVM guest. Resuming after live migration fails. I narrowed it down: it works if you use the LTS kernel in the Arch Linux guest. Which options do I have to set to get the standard kernel working on resume?
 
I migrated a VM from kernel 6.8.8-2 to 6.8.12-4. The problem remains. Is there any information on how to solve it?
 
I have the same experience. Proxmox 8 is already installed, and the problem seems to occur when using the "host" processor type. That type was chosen in order to use more than 512 GB of RAM on a VM; otherwise the VM would not have started.
 
I have the same experience. Proxmox 8 is already installed, and the problem seems to occur when using the "host" processor type. That type was chosen in order to use more than 512 GB of RAM on a VM; otherwise the VM would not have started.
So I had this issue today as well, with two servers which, despite having the same processor model, differ in count: one has two CPUs and the other has just one (we are waiting for a shipment).
Using "host" doesn't work.
Using x86-64-v2-AES works as expected.
 
I've experienced the same scenario again. I'm also using the "host" processor type. Both nodes have identical CPUs.
I've updated the destination host before the migration, so the migration went from the old version to the new one.
I've also tried upgrading both nodes (source as well as destination) before the migration, but the migration failed too.
Source node:
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-2-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.12-6
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
pve-kernel-5.4: 6.4-18
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
Destination node:
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.4 (running version: 8.3.4/65224a0f9cd294a3)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
pve-kernel-5.4: 6.4-18
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.6
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.0
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1

Here is migration task log:
Code:
task started by HA resource agent
2025-03-11 14:08:40 use dedicated network address for sending migration traffic (##.###.###.###)
2025-03-11 14:08:41 starting migration of VM 10108 to node 'redacted' (##.###.###.###)
2025-03-11 14:08:41 starting VM 10108 on remote node 'redacted'
2025-03-11 14:08:47 start remote tunnel
2025-03-11 14:08:49 ssh tunnel ver 1
2025-03-11 14:08:49 starting online/live migration on tcp:##.###.###.###:60000
2025-03-11 14:08:49 set migration capabilities
2025-03-11 14:08:49 migration downtime limit: 100 ms
2025-03-11 14:08:49 migration cachesize: 1.0 GiB
2025-03-11 14:08:49 set migration parameters
2025-03-11 14:08:49 spice client_migrate_info
2025-03-11 14:08:49 start migrate command to tcp:##.###.###.###:60000
2025-03-11 14:08:50 migration active, transferred 620.6 MiB of 7.1 GiB VM-state, 543.3 MiB/s
2025-03-11 14:08:51 migration active, transferred 1.2 GiB of 7.1 GiB VM-state, 690.0 MiB/s
2025-03-11 14:08:52 migration active, transferred 1.9 GiB of 7.1 GiB VM-state, 685.2 MiB/s
2025-03-11 14:08:53 migration active, transferred 2.6 GiB of 7.1 GiB VM-state, 682.8 MiB/s
2025-03-11 14:08:54 migration active, transferred 3.2 GiB of 7.1 GiB VM-state, 716.7 MiB/s
2025-03-11 14:08:55 migration active, transferred 3.9 GiB of 7.1 GiB VM-state, 1.4 GiB/s
2025-03-11 14:08:56 migration active, transferred 4.4 GiB of 7.1 GiB VM-state, 528.6 MiB/s
2025-03-11 14:08:57 migration active, transferred 4.8 GiB of 7.1 GiB VM-state, 519.5 MiB/s
2025-03-11 14:08:58 migration active, transferred 5.3 GiB of 7.1 GiB VM-state, 517.2 MiB/s
2025-03-11 14:08:59 migration active, transferred 5.9 GiB of 7.1 GiB VM-state, 735.6 MiB/s
2025-03-11 14:09:01 average migration speed: 608.1 MiB/s - downtime 56 ms
2025-03-11 14:09:01 migration status: completed
2025-03-11 14:09:01 ERROR: tunnel replied 'ERR: resume failed - VM 10108 not running' to command 'resume 10108'
2025-03-11 14:09:02 Waiting for spice server migration
VM quit/powerdown failed - terminating now with SIGTERM
2025-03-11 14:09:11 ERROR: migration finished with problems (duration 00:00:31)
TASK ERROR: migration problems

I haven't figured out the root cause here.
 
I've experienced the same scenario again.
As the original thread is almost 7 years old, it's likely that the root cause differs, but the error message is the same.
I'm also using the "host" processor type. Both nodes have identical CPUs.
I've updated the destination host before the migration, so the migration went from the old version to the new one.
And doing this previously worked fine for some time?
2025-03-11 14:09:01 ERROR: tunnel replied 'ERR: resume failed - VM 10108 not running' to command 'resume 10108'
It seems like the VM process on the target node stopped or crashed shortly before the live migration could be fully finished.
It might be good to check the system log at both source and (especially) target to see if there are any additional pointers for why that happened.
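For example, something along these lines on the target node, with the time window taken from the task log above, should show whether and why the QEMU process died (a generic journalctl invocation, nothing Proxmox-specific):
Code:
journalctl --since "2025-03-11 14:08:00" --until "2025-03-11 14:10:00" | grep -Ei 'qemu|kvm|10108'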
 
I've checked the log at the destination host:
Code:
mar 11 12:50:58 [target_node] QEMU[723971]: kvm: error: failed to set MSR 0x38f to 0x7000000ff
mar 11 12:50:58 [target_node] QEMU[723971]: kvm: ../target/i386/kvm/kvm.c:3213: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
mar 11 12:50:58 [target_node] kernel: vmbr1: port 3(tap1046i1) entered disabled state
mar 11 12:50:58 [target_node] kernel: tap1046i1 (unregistering): left allmulticast mode
mar 11 12:50:58 [target_node] kernel: vmbr1: port 3(tap1046i1) entered disabled state
mar 11 12:50:58 [target_node] sshd[724222]: Received disconnect from [source_node_IP] port 38180:11: disconnected by user
mar 11 12:50:58 [target_node] sshd[724222]: Disconnected from user root [source_node_IP] port 38180
mar 11 12:50:58 [target_node] sshd[724222]: pam_unix(sshd:session): session closed for user root
mar 11 12:50:58 [target_node] systemd[1]: session-36.scope: Deactivated successfully.
mar 11 12:50:58 [target_node] systemd[1]: session-36.scope: Consumed 1.639s CPU time.
mar 11 12:50:58 [target_node] systemd-logind[1007]: Session 36 logged out. Waiting for processes to exit.
mar 11 12:50:58 [target_node] systemd-logind[1007]: Removed session 36.
mar 11 12:50:58 [target_node] kernel: fwbr1046i0: port 2(tap1046i0) entered disabled state
mar 11 12:50:58 [target_node] kernel: tap1046i0 (unregistering): left allmulticast mode
mar 11 12:50:58 [target_node] kernel: fwbr1046i0: port 2(tap1046i0) entered disabled state
mar 11 12:50:58 [target_node] systemd[1]: 1046.scope: Deactivated successfully.
mar 11 12:50:58 [target_node] systemd[1]: 1046.scope: Consumed 37.943s CPU time.
mar 11 12:51:00 [target_node] qmeventd[724737]: Starting cleanup for 1046
mar 11 12:51:00 [target_node] kernel: fwbr1046i0: port 1(fwln1046i0) entered disabled state
mar 11 12:51:00 [target_node] kernel: vmbr2: port 3(fwpr1046p0) entered disabled state
mar 11 12:51:00 [target_node] kernel: fwln1046i0 (unregistering): left allmulticast mode
mar 11 12:51:00 [target_node] kernel: fwln1046i0 (unregistering): left promiscuous mode
mar 11 12:51:00 [target_node] kernel: fwbr1046i0: port 1(fwln1046i0) entered disabled state
mar 11 12:51:00 [target_node] kernel: fwpr1046p0 (unregistering): left allmulticast mode
mar 11 12:51:00 [target_node] kernel: fwpr1046p0 (unregistering): left promiscuous mode
mar 11 12:51:00 [target_node] kernel: vmbr2: port 3(fwpr1046p0) entered disabled state
mar 11 12:51:00 [target_node] qmeventd[724737]: Finished cleanup for 1046

I don't know what "Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed." means.
Any clues?
 
mar 11 12:50:58 [target_node] QEMU[723971]: kvm: error: failed to set MSR 0x38f to 0x7000000ff
mar 11 12:50:58 [target_node] QEMU[723971]: kvm: ../target/i386/kvm/kvm.c:3213: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
That could come from using type host; it sometimes exposes a bit more CPU state than KVM can control, so a migration transfers some state, like the Model-Specific Register (MSR) values in your case, that then cannot be set on the target. It could also stem from different CPU models or different CPU µcode versions, even if both nodes use the exact same CPU model. A newer kernel and/or QEMU on either side might have exposed the newer CPU state in the first place, making this issue visible.

I'd recommend switching to a CPU model that either matches the CPU generation in use or to the more generic x86-64-vX ones, see
https://pve.proxmox.com/pve-docs/chapter-qm.html#_qemu_cpu_types

That should ensure good performance while avoiding any CPU flag/state that is too host-specific for migration to handle, even if the same CPU models are used.
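As a concrete sketch of that change (the VMID is simply taken from the task log above), the CPU type can be switched from the CLI; note that the new model only takes effect after the VM is fully stopped and started again, not on a reboot from inside the guest:
Code:
qm set 10108 --cpu x86-64-v2-AES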
 
@aasami what host CPU models do you have, i.e. lscpu from both nodes? What OS/kernel is running inside the guest?

Haven't had the time to look into this yet, but the following commit adds support for some new MSRs in QEMU, while at a glance (again, I haven't looked deeply yet, so I might well be wrong) not adapting migration handling: https://gitlab.com/qemu-project/qemu/-/commit/0418f90809aea5b375c859e744c8e8610e9be446


EDIT: sorry, this is most likely a red herring, because the feature needs to be explicitly enabled via the command-line. The information requested above would still be interesting though.
 
Thank you @fiona for your interest.

lscpu of source machine:
Code:
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel
  Model name:             Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    BIOS Model name:       Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz        CPU @ 2.6GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                62
    Thread(s) per core:   1
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             4
    CPU(s) scaling MHz:   68%
    CPU max MHz:          3400,0000
    CPU min MHz:          1200,0000
    BogoMIPS:             5187,39
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology n
                          onstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pt
                          i intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features: 
  Virtualization:         VT-x
Caches (sum of all):     
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     2 MiB (8 instances)
  L3:                     20 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:         
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                    Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

lscpu of destination machine:
Code:
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel
  Model name:             Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    BIOS Model name:       Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz        CPU @ 2.6GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                62
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             4
    CPU(s) scaling MHz:   55%
    CPU max MHz:          3400,0000
    CPU min MHz:          1200,0000
    BogoMIPS:             5187,45
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology n
                          onstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pt
                          i intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features: 
  Virtualization:         VT-x
Caches (sum of all):     
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     2 MiB (8 instances)
  L3:                     20 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15
Vulnerabilities:         
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: Split huge pages
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

The guest OS is running on Oracle's kernel 5.4.17-2102.201.3.el8uek.x86_64

Now it makes more sense, as I've looked at the CPUs more closely. I had assumed they were identical. Thank you for asking. Could it be related to hyper-threading being disabled on the source machine?
Both CPUs have the same microcode revision installed.
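A quick, generic way to double-check that on both nodes (plain shell, nothing Proxmox-specific) is to compare the microcode revision and the exposed CPU flags:
Code:
# run on each node
grep -m1 microcode /proc/cpuinfo
grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | sort > /tmp/flags.$(hostname)
# then copy one file over and diff them, e.g.:
# diff /tmp/flags.node-a /tmp/flags.node-b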