VM crash after hot-migration

Ahmet Bas

Hello,

Since upgrading to Proxmox VE 7 we see VMs hang after live migration on some hypervisors. The VM stops responding and we see CPU spikes. If we move a VM from hypervisor 1 to 2, nothing happens. But if we migrate it back from hypervisor 2 to 1, it crashes.

Hypervisor 1, pveversion details:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Hypervisor 2, pveversion details:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.4: 6.4-18
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

The VM config:
Code:
agent: 1,fstrim_cloned_disks=1
boot: cdn
bootdisk: scsi0
cores: 1
cpu: Haswell-noTSX,flags=+aes
cpuunits: 1000
ide2: none,media=cdrom
memory: 1024
name: xxxx
net0: virtio=8e:06:85:ca:54:48,bridge=,firewall=1,rate=12.5
numa: 0
onboot: 1
ostype: l26
scsi0: xxx.qcow2,cache=writeback,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=c4ce01af-18c2-42ee-90a9-ca90a6426347
sockets: 1
vmgenid: f03d920a-de6f-430d-8762-663ea43e1811

The spike we are seeing in CPU usage:
[Screenshot 2022-11-04 at 09.47.39.png]

Any ideas on what could cause this?
 
I also tried switching the kernel from 5.15 to 5.19, but that did not change the result: the VM still becomes unreachable after the hot-migration.
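For reference, the switch to 5.19 was done roughly like this (a minimal sketch; it assumes the opt-in pve-kernel-5.19 package is available from the configured Proxmox repositories):

Code:
apt update
apt install pve-kernel-5.19   # opt-in newer kernel, assumed available in the repos
reboot
uname -r                      # verify the running kernel after the reboot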
 
Hello,

I would check the syslog/journalctl during the migration process. Can you provide us with the full task log of the migration?
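For example, something like this on both the source and the target node while the migration runs (a minimal sketch; adjust the time window to your setup):

Code:
# follow the journal live during the migration
journalctl -f

# or, afterwards, look at the window around the migration
journalctl --since "10 minutes ago"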
 
Sure, here is the full task log. I forgot to mention earlier that a reset is not enough to make the VM work again; a VM stop followed by a VM start is needed.


Code:
2022-11-04 16:12:59 use dedicated network address for sending migration traffic ()
2022-11-04 16:12:59 starting migration of VM 5105 to node 'hv01' ()
2022-11-04 16:12:59 starting VM 5105 on remote node 'hv01'
2022-11-04 16:13:03 start remote tunnel
2022-11-04 16:13:04 ssh tunnel ver 1
2022-11-04 16:13:04 starting online/live migration on unix:/run/qemu-server/5105.migrate
2022-11-04 16:13:04 set migration capabilities
2022-11-04 16:13:04 migration speed limit: 625.0 MiB/s
2022-11-04 16:13:04 migration downtime limit: 100 ms
2022-11-04 16:13:04 migration cachesize: 128.0 MiB
2022-11-04 16:13:04 set migration parameters
2022-11-04 16:13:04 start migrate command to unix:/run/qemu-server/5105.migrate
2022-11-04 16:13:05 migration active, transferred 264.4 MiB of 1.0 GiB VM-state, 339.0 MiB/s
2022-11-04 16:13:06 migration active, transferred 670.2 MiB of 1.0 GiB VM-state, 483.1 MiB/s
2022-11-04 16:13:07 average migration speed: 346.9 MiB/s - downtime 94 ms
2022-11-04 16:13:07 migration status: completed
2022-11-04 16:13:11 migration finished successfully (duration 00:00:12)
TASK OK

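For reference, the same live migration can also be triggered from the CLI, which makes it easy to watch the journal on both nodes at the same time (a sketch using the VM ID and target node from the task log above):

Code:
# start an online (live) migration of VM 5105 to node hv01
qm migrate 5105 hv01 --online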
In the syslog of hv01 I also see this:

Code:
2022-11-04T20:38:18.405048+01:00 hv01.snel.com kernel: [33274.792652] device tap5105i0 entered promiscuous mode
2022-11-04T20:38:18.431061+01:00 hv01.snel.com systemd-udevd[1004646]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.459012+01:00 hv01.snel.com systemd-udevd[1004646]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.463181+01:00 hv01.snel.com systemd-udevd[1004649]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.463345+01:00 hv01.snel.com systemd-udevd[1004649]: Using default interface naming scheme 'v247'.

But I did not install ethtool myself. On hypervisor 2 I do not see these messages.
 
Thank you for the output!

To narrow down the cause, can you try to remove ethtool on the hv01 node (if installed) and test the migration again?
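Something like this should show whether it is present and remove it (a short sketch):

Code:
# check whether the ethtool package is installed on hv01
dpkg -s ethtool

# if it is, remove it and retry the migration
apt remove ethtool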
 

I tried this, but it did not help. Because of this issue, we have updated all systems to kernel version 5.19.
 
