VM crash after hot-migration

Ahmet Bas

Hello,

Since upgrading to Proxmox VE 7 we see VMs hang after live migration on some hypervisors. The VM stops responding and we see CPU spikes. If we move a VM from hypervisor 1 to 2, nothing happens. But if we migrate it back from hypervisor 2 to 1, it crashes.

Hypervisor 1, pveversion details:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Hypervisor 2, pveversion details:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.4: 6.4-18
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

The VM config:
Code:
agent: 1,fstrim_cloned_disks=1
boot: cdn
bootdisk: scsi0
cores: 1
cpu: Haswell-noTSX,flags=+aes
cpuunits: 1000
ide2: none,media=cdrom
memory: 1024
name: xxxx
net0: virtio=8e:06:85:ca:54:48,bridge=,firewall=1,rate=12.5
numa: 0
onboot: 1
ostype: l26
scsi0: xxx.qcow2,cache=writeback,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=c4ce01af-18c2-42ee-90a9-ca90a6426347
sockets: 1
vmgenid: f03d920a-de6f-430d-8762-663ea43e1811

The spike we are seeing in CPU usage:
[Screenshot 2022-11-04 at 09.47.39.png]

Any ideas on what could cause this?
 
I also tried switching the kernel from 5.15 to 5.19, but that did not change the result: the VM still becomes unreachable after the hot-migration.
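For reference, the switch to 5.19 was done roughly like this (a minimal sketch; it assumes the opt-in pve-kernel-5.19 package is available from the configured Proxmox repositories):

Code:
apt update
apt install pve-kernel-5.19   # opt-in newer kernel, assumed available in the repos
reboot
uname -r                      # verify the running kernel after the reboot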
 
Hello,

I would check the syslog/journalctl during the migration process. Can you provide us with the full task log of the migration?
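For example, something like this on both the source and the target node while the migration runs (a minimal sketch; adjust the time window to your setup):

Code:
# follow the journal live during the migration
journalctl -f

# or, afterwards, look at the window around the migration
journalctl --since "10 minutes ago"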
 
Sure, here is the full task log. I forgot to mention earlier that a reset is not enough to make the VM work again; a VM stop followed by a VM start is needed.


Code:
2022-11-04 16:12:59 use dedicated network address for sending migration traffic ()
2022-11-04 16:12:59 starting migration of VM 5105 to node 'hv01' ()
2022-11-04 16:12:59 starting VM 5105 on remote node 'hv01'
2022-11-04 16:13:03 start remote tunnel
2022-11-04 16:13:04 ssh tunnel ver 1
2022-11-04 16:13:04 starting online/live migration on unix:/run/qemu-server/5105.migrate
2022-11-04 16:13:04 set migration capabilities
2022-11-04 16:13:04 migration speed limit: 625.0 MiB/s
2022-11-04 16:13:04 migration downtime limit: 100 ms
2022-11-04 16:13:04 migration cachesize: 128.0 MiB
2022-11-04 16:13:04 set migration parameters
2022-11-04 16:13:04 start migrate command to unix:/run/qemu-server/5105.migrate
2022-11-04 16:13:05 migration active, transferred 264.4 MiB of 1.0 GiB VM-state, 339.0 MiB/s
2022-11-04 16:13:06 migration active, transferred 670.2 MiB of 1.0 GiB VM-state, 483.1 MiB/s
2022-11-04 16:13:07 average migration speed: 346.9 MiB/s - downtime 94 ms
2022-11-04 16:13:07 migration status: completed
2022-11-04 16:13:11 migration finished successfully (duration 00:00:12)
TASK OK

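For reference, the same live migration can also be triggered from the CLI, which makes it easy to watch the journal on both nodes at the same time (a sketch using the VM ID and target node from the task log above):

Code:
# start an online (live) migration of VM 5105 to node hv01
qm migrate 5105 hv01 --online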
In the syslog of hv01 I also see this:

Code:
2022-11-04T20:38:18.405048+01:00 hv01.snel.com kernel: [33274.792652] device tap5105i0 entered promiscuous mode
2022-11-04T20:38:18.431061+01:00 hv01.snel.com systemd-udevd[1004646]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.459012+01:00 hv01.snel.com systemd-udevd[1004646]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.463181+01:00 hv01.snel.com systemd-udevd[1004649]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
2022-11-04T20:38:18.463345+01:00 hv01.snel.com systemd-udevd[1004649]: Using default interface naming scheme 'v247'.

But I did not install ethtool myself. On hypervisor 2 I do not see these messages.
 
Thank you for the output!

To narrow down the cause, can you try to remove ethtool on the hv01 node (if installed) and test the migration again?
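Something like this should show whether it is present and remove it (a short sketch):

Code:
# check whether the ethtool package is installed on hv01
dpkg -s ethtool

# if it is, remove it and retry the migration
apt remove ethtool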
 

I tried this, but it did not help. Because of this issue, we have updated all systems to kernel version 5.19.
 
