VM stuck/freeze after live migration

supermicro_server

Hello everyone,
I have a PVE cluster on version 7.2-7, upgraded to the latest package versions.
I'm having trouble figuring out why some VMs (both Windows and Linux) end up stuck after a live migration.
Moreover, this doesn't happen for every host in the cluster.
For example, if I migrate a VM from pve1 to pve2 I have no problems, but if I migrate the same VM from pve1 to pve3 the problem occurs and the VM freezes after the RAM migration. After the migration process the VM is shown as running, but when I open the console I can see that it is frozen.
I have enough RAM/CPU to run the VMs.
This is very strange.

Thanks in advance
 
What are the specs for your different nodes? For instance does pve1/2 have Intel CPUs while pve3 has an AMD CPU or vice-versa?

Can you post the task logs for the migration?
 
This is the migration log:

Code:
2022-09-07 08:57:36 starting migration of VM 213 to node 'pve3' (192.168.253.58)
2022-09-07 08:57:36 found local, replicated disk 'local-zfs:vm-213-disk-0' (in current VM config)
2022-09-07 08:57:36 found local, replicated disk 'local-zfs:vm-213-disk-1' (in current VM config)
2022-09-07 08:57:36 virtio0: start tracking writes using block-dirty-bitmap 'repl_virtio0'
2022-09-07 08:57:36 virtio1: start tracking writes using block-dirty-bitmap 'repl_virtio1'
2022-09-07 08:57:36 replicating disk images
2022-09-07 08:57:36 start replication job
2022-09-07 08:57:36 guest => VM 213, running => 307821
2022-09-07 08:57:36 volumes => local-zfs:vm-213-disk-0,local-zfs:vm-213-disk-1
2022-09-07 08:57:37 freeze guest filesystem
2022-09-07 08:57:37 create snapshot '__replicate_213-0_1662533856__' on local-zfs:vm-213-disk-0
2022-09-07 08:57:37 create snapshot '__replicate_213-0_1662533856__' on local-zfs:vm-213-disk-1
2022-09-07 08:57:37 thaw guest filesystem
2022-09-07 08:57:37 using secure transmission, rate limit: none
2022-09-07 08:57:37 incremental sync 'local-zfs:vm-213-disk-0' (__replicate_213-0_1662533108__ => __replicate_213-0_1662533856__)
2022-09-07 08:57:39 send from @__replicate_213-0_1662533108__ to rpool/data/vm-213-disk-0@__replicate_213-0_1662533856__ estimated size is 6.74M
2022-09-07 08:57:39 total estimated size is 6.74M
2022-09-07 08:57:39 successfully imported 'local-zfs:vm-213-disk-0'
2022-09-07 08:57:39 incremental sync 'local-zfs:vm-213-disk-1' (__replicate_213-0_1662533108__ => __replicate_213-0_1662533856__)
2022-09-07 08:57:40 send from @__replicate_213-0_1662533108__ to rpool/data/vm-213-disk-1@__replicate_213-0_1662533856__ estimated size is 45.3M
2022-09-07 08:57:40 total estimated size is 45.3M
2022-09-07 08:57:41 successfully imported 'local-zfs:vm-213-disk-1'
2022-09-07 08:57:41 delete previous replication snapshot '__replicate_213-0_1662533108__' on local-zfs:vm-213-disk-0
2022-09-07 08:57:41 delete previous replication snapshot '__replicate_213-0_1662533108__' on local-zfs:vm-213-disk-1
2022-09-07 08:57:42 (remote_finalize_local_job) delete stale replication snapshot '__replicate_213-0_1662533108__' on local-zfs:vm-213-disk-0
2022-09-07 08:57:42 (remote_finalize_local_job) delete stale replication snapshot '__replicate_213-0_1662533108__' on local-zfs:vm-213-disk-1
2022-09-07 08:57:42 end replication job
2022-09-07 08:57:42 starting VM 213 on remote node 'pve3'
2022-09-07 08:57:44 volume 'local-zfs:vm-213-disk-0' is 'local-zfs:vm-213-disk-0' on the target
2022-09-07 08:57:44 volume 'local-zfs:vm-213-disk-1' is 'local-zfs:vm-213-disk-1' on the target
2022-09-07 08:57:44 start remote tunnel
2022-09-07 08:57:45 ssh tunnel ver 1
2022-09-07 08:57:45 starting storage migration
2022-09-07 08:57:45 virtio1: start migration to nbd:unix:/run/qemu-server/213_nbd.migrate:exportname=drive-virtio1
drive mirror re-using dirty bitmap 'repl_virtio1'
drive mirror is starting for drive-virtio1 with bandwidth limit: 102400 KB/s
drive-virtio1: transferred 704.0 KiB of 704.0 KiB (100.00%) in 0s
drive-virtio1: transferred 704.0 KiB of 704.0 KiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2022-09-07 08:57:46 virtio0: start migration to nbd:unix:/run/qemu-server/213_nbd.migrate:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0 with bandwidth limit: 102400 KB/s
drive-virtio0: transferred 704.0 KiB of 2.7 MiB (25.58%) in 0s
drive-virtio0: transferred 2.7 MiB of 2.7 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2022-09-07 08:57:47 starting online/live migration on unix:/run/qemu-server/213.migrate
2022-09-07 08:57:47 set migration capabilities
2022-09-07 08:57:47 migration speed limit: 100.0 MiB/s
2022-09-07 08:57:47 migration downtime limit: 100 ms
2022-09-07 08:57:47 migration cachesize: 512.0 MiB
2022-09-07 08:57:47 set migration parameters
2022-09-07 08:57:47 start migrate command to unix:/run/qemu-server/213.migrate
2022-09-07 08:57:48 migration active, transferred 100.5 MiB of 4.0 GiB VM-state, 139.9 MiB/s
2022-09-07 08:57:49 migration active, transferred 201.5 MiB of 4.0 GiB VM-state, 100.4 MiB/s
2022-09-07 08:57:50 migration active, transferred 302.0 MiB of 4.0 GiB VM-state, 99.1 MiB/s
2022-09-07 08:57:51 migration active, transferred 403.6 MiB of 4.0 GiB VM-state, 100.0 MiB/s
2022-09-07 08:57:52 migration active, transferred 504.1 MiB of 4.0 GiB VM-state, 101.3 MiB/s
2022-09-07 08:57:53 migration active, transferred 605.1 MiB of 4.0 GiB VM-state, 101.8 MiB/s
2022-09-07 08:57:54 migration active, transferred 706.1 MiB of 4.0 GiB VM-state, 99.9 MiB/s
2022-09-07 08:57:55 migration active, transferred 806.1 MiB of 4.0 GiB VM-state, 100.0 MiB/s
2022-09-07 08:57:56 migration active, transferred 908.6 MiB of 4.0 GiB VM-state, 102.0 MiB/s
2022-09-07 08:57:57 migration active, transferred 1010.1 MiB of 4.0 GiB VM-state, 160.0 MiB/s
2022-09-07 08:57:58 migration active, transferred 1.1 GiB of 4.0 GiB VM-state, 99.8 MiB/s
2022-09-07 08:57:59 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 119.9 MiB/s
2022-09-07 08:58:00 migration active, transferred 1.3 GiB of 4.0 GiB VM-state, 102.3 MiB/s
2022-09-07 08:58:01 migration active, transferred 1.4 GiB of 4.0 GiB VM-state, 120.7 MiB/s
2022-09-07 08:58:02 migration active, transferred 1.5 GiB of 4.0 GiB VM-state, 161.4 MiB/s
2022-09-07 08:58:03 migration active, transferred 1.6 GiB of 4.0 GiB VM-state, 100.4 MiB/s
2022-09-07 08:58:04 migration active, transferred 1.7 GiB of 4.0 GiB VM-state, 181.2 MiB/s
2022-09-07 08:58:05 migration active, transferred 1.8 GiB of 4.0 GiB VM-state, 100.3 MiB/s
2022-09-07 08:58:06 migration active, transferred 1.9 GiB of 4.0 GiB VM-state, 99.9 MiB/s
2022-09-07 08:58:07 migration active, transferred 2.0 GiB of 4.0 GiB VM-state, 100.2 MiB/s
2022-09-07 08:58:08 migration active, transferred 2.1 GiB of 4.0 GiB VM-state, 138.7 MiB/s
2022-09-07 08:58:09 migration active, transferred 2.2 GiB of 4.0 GiB VM-state, 186.5 MiB/s
2022-09-07 08:58:10 migration active, transferred 2.3 GiB of 4.0 GiB VM-state, 101.9 MiB/s
2022-09-07 08:58:11 migration active, transferred 2.4 GiB of 4.0 GiB VM-state, 98.9 MiB/s
2022-09-07 08:58:12 migration active, transferred 2.5 GiB of 4.0 GiB VM-state, 106.2 MiB/s
2022-09-07 08:58:13 migration active, transferred 2.5 GiB of 4.0 GiB VM-state, 110.0 MiB/s
2022-09-07 08:58:14 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 107.4 MiB/s
2022-09-07 08:58:15 migration active, transferred 2.7 GiB of 4.0 GiB VM-state, 100.5 MiB/s
2022-09-07 08:58:16 migration active, transferred 2.8 GiB of 4.0 GiB VM-state, 100.3 MiB/s
2022-09-07 08:58:17 migration active, transferred 2.9 GiB of 4.0 GiB VM-state, 100.4 MiB/s
2022-09-07 08:58:18 migration active, transferred 3.0 GiB of 4.0 GiB VM-state, 100.4 MiB/s
2022-09-07 08:58:19 migration active, transferred 3.1 GiB of 4.0 GiB VM-state, 100.9 MiB/s
2022-09-07 08:58:20 migration active, transferred 3.2 GiB of 4.0 GiB VM-state, 100.3 MiB/s
2022-09-07 08:58:21 migration active, transferred 3.3 GiB of 4.0 GiB VM-state, 103.1 MiB/s
2022-09-07 08:58:22 migration active, transferred 3.5 GiB of 4.0 GiB VM-state, 99.8 MiB/s
2022-09-07 08:58:23 average migration speed: 114.2 MiB/s - downtime 70 ms
2022-09-07 08:58:23 migration status: completed
all 'mirror' jobs are ready
drive-virtio0: Completing block job_id...
drive-virtio0: Completed successfully.
drive-virtio1: Completing block job_id...
drive-virtio1: Completed successfully.
drive-virtio0: mirror-job finished
drive-virtio1: mirror-job finished
2022-09-07 08:58:24 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.253.58 pvesr set-state 213 \''{"local/pve2":{"storeid_list":["local-zfs"],"last_try":1662533856,"last_node":"pve2","last_sync":1662533856,"duration":5.561552,"last_iteration":1662533856,"fail_count":0}}'\'
2022-09-07 08:58:25 stopping NBD storage migration server on target.
2022-09-07 08:58:29 migration finished successfully (duration 00:00:53)
TASK OK
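
The log above ends with a completed migration and TASK OK, so the freeze is not visible in the task log itself. As a minimal sketch (the function name migration_summary is hypothetical and not part of any PVE tool), the summary values can be pulled out of such a log like this, assuming the line formats shown above:

```python
import re

def migration_summary(log_text):
    """Extract speed, downtime, duration and final status from a migration task log."""
    summary = {}
    # Line format assumed from the log above:
    # "average migration speed: 114.2 MiB/s - downtime 70 ms"
    m = re.search(r"average migration speed: ([\d.]+) MiB/s - downtime (\d+) ms", log_text)
    if m:
        summary["avg_speed_mib_s"] = float(m.group(1))
        summary["downtime_ms"] = int(m.group(2))
    # "migration finished successfully (duration 00:00:53)"
    m = re.search(r"migration finished successfully \(duration ([\d:]+)\)", log_text)
    if m:
        summary["duration"] = m.group(1)
    summary["task_ok"] = "TASK OK" in log_text
    return summary

# Lines copied from the task log above
sample = """\
2022-09-07 08:58:23 average migration speed: 114.2 MiB/s - downtime 70 ms
2022-09-07 08:58:23 migration status: completed
2022-09-07 08:58:29 migration finished successfully (duration 00:00:53)
TASK OK
"""

print(migration_summary(sample))
```

A clean summary like this only says that the transfer finished; whether the guest is actually responsive still has to be checked on the target node.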
 
The problem occurs on the following nodes:

pve3:
  • Intel Xeon processor E5-2600 v4/ v3 family (up to 160W TDP)*
  • Dual Socket R3 (LGA 2011)
pve4:
  • Intel Xeon E5-2630 v3 2.4 GHz 8 Core Processor 20MB LGA 2011-3 BX80644E52630V3 CPU

Many thanks
 
This is my current cluster configuration:

PVE1
32 x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz (2 Sockets)
Linux 5.15.39-3-pve #2 SMP PVE 5.15.39-3

PVE2
32 x Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz (2 Sockets)
Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1

PVE3
40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1

PVE4
32 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2 Sockets)
Linux 5.15.53-1-pve #1 SMP PVE 5.15.53-1

Thank you
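
To make the differences between the nodes easier to see, here is a small sketch that groups the four nodes listed above by kernel and by CPU model. The data is copied from this post; the dictionary layout and the helper name group_by are just illustrative choices.

```python
from collections import defaultdict

# Values copied from the cluster configuration listed above
nodes = {
    "pve1": {"cpu": "Xeon Silver 4110", "kernel": "5.15.39-3-pve"},
    "pve2": {"cpu": "Xeon Silver 4110", "kernel": "5.15.53-1-pve"},
    "pve3": {"cpu": "Xeon E5-2630 v4", "kernel": "5.15.53-1-pve"},
    "pve4": {"cpu": "Xeon E5-2630 v3", "kernel": "5.15.53-1-pve"},
}

def group_by(nodes, key):
    """Return {value: [node names]} for the given attribute."""
    groups = defaultdict(list)
    for name in sorted(nodes):
        groups[nodes[name][key]].append(name)
    return dict(groups)

for key in ("kernel", "cpu"):
    print(key)
    for value, names in group_by(nodes, key).items():
        print(f"  {value}: {', '.join(names)}")
```

Grouped this way, pve1 is the only node on an older kernel, and pve3/pve4 are the only nodes with E5-2630 CPUs. Which of the two differences matters (if either) is not something the listing alone can tell.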
 
Good Evening,

Did you manage to solve the issue?
I have the exact same problem: the VM is stuck after migration.

Thank you.
 
Hi Kellion, I still have the same problem.
I haven't upgraded PVE to the latest version, 7.3, yet.
Perhaps the problem is fixed in the latest kernel version? I hope so...:mad:
 
I have the following problem: if I migrate a VM from one node to another, it works perfectly.
If I migrate from node B, the migration works but the VM is stuck with high CPU usage.
I have seen several posts here with the same issue; some say that a kernel downgrade solves the problem.

No issue with containers, probably because they are shut down before migration.
 
Thanks for your answers.
In general, it seems that the problem occurs when you migrate from a node with a Xeon Silver 4110 CPU to one with a Xeon E5-2630 CPU, and vice versa.
When migrating between the Xeon Silver nodes, I have no problems.

Can you confirm this?

Bye
 
I am using AMD CPUs; the problem occurred when moving from an AMD Ryzen 9 5900HX to an AMD Ryzen 7 3750H.

Anyway, the new pve-kernel solved the problem; migration now works perfectly.

You can check your pve-kernel version with uname -a

I believe it is some kind of bug.
5.19.17-1-pve works well.
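
For anyone who wants to script the check, here is a rough sketch that compares the running kernel release against 5.19.17-1-pve. Treat that version only as the one reported working in this post, not as an official threshold; the helper names are made up for illustration, and the "at least this version" comparison is an assumption.

```python
import platform
import re

def kernel_tuple(release):
    """Turn a release string such as '5.19.17-1-pve' into (5, 19, 17, 1)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing package revision (the part after the dash) counts as 0
    return tuple(int(part) if part else 0 for part in m.groups())

# Version reported as working in this thread (assumption: newer is fine too)
REPORTED_GOOD = kernel_tuple("5.19.17-1-pve")

def at_least_reported_good(release):
    return kernel_tuple(release) >= REPORTED_GOOD

if __name__ == "__main__":
    running = platform.release()  # same release string that `uname -r` prints
    try:
        ok = at_least_reported_good(running)
        print(f"{running}: {'at or above' if ok else 'below'} 5.19.17-1-pve")
    except ValueError as err:
        print(err)
```

Run it on each node; the 5.15.x kernels listed earlier in the thread would show up as below the reported version.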