This problem came up in the thread https://forum.proxmox.com/threads/slow-livemigration-performance-since-5-0.35522/, but that thread discusses live migration with local disks. To avoid hijacking it, I am opening a new one.
In PVE 5.0, a VM hangs for 2-4 seconds after a live migration with shared storage. This can be checked by comparing the VM's local time before and after the migration. I set up 3 nested clusters with PVE 4.4, 5.0 beta2 and 5.0.
For the test, I run the date command via parallel-ssh before and after the migration and compare the output from the PVE nodes with the output from the migrated VM. Both 5.x versions show a clock skew of 2-4 seconds. In 4.4 the clock skew is very small, much less than 1 second.
4 seconds in PVE 5.0
3 seconds in PVE 5.0 beta2
less than 1 second in PVE 4.4
IMO a 3-second delay is huge and a no-go for some production systems.
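The before/after comparison can also be reduced to a single number. The following is only a minimal sketch, not part of the original test: the helper names clock_skew and read_epoch are hypothetical, and it assumes root SSH access to both a PVE node and the VM, as in the parallel-ssh runs. It reads each clock as seconds since the epoch with date +%s, so no date parsing is needed.

```shell
#!/bin/sh
# Sketch: report how far a guest clock lags behind a reference clock.
# clock_skew and read_epoch are hypothetical helper names.

# clock_skew REF_EPOCH GUEST_EPOCH
# Prints REF - GUEST in whole seconds (positive means the guest is behind).
clock_skew() {
    echo $(( $1 - $2 ))
}

# read_epoch ADDRESS
# Prints the clock of the machine at ADDRESS as seconds since the epoch.
# Assumes key-based root SSH access; BatchMode avoids password prompts.
read_epoch() {
    ssh -o BatchMode=yes "root@$1" date +%s
}

# Example, to be run after the migration (addresses from the PVE 5.0 run):
#   ref=$(read_epoch 10.10.10.202)
#   guest=$(read_epoch 172.17.2.108)
#   echo "guest lags by $(clock_skew "$ref" "$guest") s"
```

With the PVE 5.0 readings (18:12:03 on the nodes, 18:11:59 in the VM) the difference is 4 seconds. The two SSH calls run one after the other, so the result carries an error of roughly one SSH round trip; that is small compared to a skew of several seconds, but too coarse to confirm the sub-second case on 4.4.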
Code:
root@nat01:~# parallel-ssh -H 10.10.10.202 -H 10.10.10.201 -i pveversion
[1] 18:11:39 [SUCCESS] 10.10.10.201
pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve)
[2] 18:11:40 [SUCCESS] 10.10.10.202
pve-manager/5.0-23/af4267bf (running kernel: 4.10.15-1-pve)
root@nat01:~# parallel-ssh -H 10.10.10.202 -H 172.17.2.108 -H 10.10.10.201 -i date
[1] 18:11:42 [SUCCESS] 172.17.2.108
Mon Jul 24 18:11:42 MSK 2017
[2] 18:11:42 [SUCCESS] 10.10.10.202
Mon Jul 24 18:11:42 MSK 2017
[3] 18:11:42 [SUCCESS] 10.10.10.201
Mon Jul 24 18:11:42 MSK 2017
root@nat01:~# ssh 10.10.10.201 qm migrate 300 pve02 --online
2017-07-24 18:11:46 starting migration of VM 300 to node 'pve02' (10.10.10.202)
2017-07-24 18:11:46 copying disk images
2017-07-24 18:11:46 starting VM 300 on remote node 'pve02'
2017-07-24 18:11:49 start remote tunnel
2017-07-24 18:11:49 starting online/live migration on unix:/run/qemu-server/300.migrate
2017-07-24 18:11:49 migrate_set_speed: 8589934592
2017-07-24 18:11:49 migrate_set_downtime: 0.1
2017-07-24 18:11:49 set migration_caps
2017-07-24 18:11:49 set cachesize: 53687091
2017-07-24 18:11:49 start migrate command to unix:/run/qemu-server/300.migrate
2017-07-24 18:11:51 migration speed: 256.00 MB/s - downtime 14 ms
2017-07-24 18:11:51 migration status: completed
2017-07-24 18:11:52 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pve02' root@10.10.10.202 pvesr set-state 300 \''{}'\'
2017-07-24 18:11:56 migration finished successfully (duration 00:00:10)
root@nat01:~# parallel-ssh -H 10.10.10.202 -H 172.17.2.108 -H 10.10.10.201 -i date
[1] 18:12:03 [SUCCESS] 10.10.10.202
Mon Jul 24 18:12:03 MSK 2017
[2] 18:12:03 [SUCCESS] 172.17.2.108
Mon Jul 24 18:11:59 MSK 2017
[3] 18:12:03 [SUCCESS] 10.10.10.201
Mon Jul 24 18:12:03 MSK 2017
Code:
root@nat01:~# parallel-ssh -H 10.10.10.203 -H 10.10.10.204 -i pveversion
[1] 18:12:21 [SUCCESS] 10.10.10.203
pve-manager/5.0-10/0d270679 (running kernel: 4.10.11-1-pve)
[2] 18:12:21 [SUCCESS] 10.10.10.204
pve-manager/5.0-10/0d270679 (running kernel: 4.10.11-1-pve)
root@nat01:~# parallel-ssh -H 10.10.10.203 -H 172.17.2.178 -H 10.10.10.204 -i date
[1] 18:12:26 [SUCCESS] 10.10.10.204
Mon Jul 24 18:12:26 MSK 2017
[2] 18:12:26 [SUCCESS] 10.10.10.203
Mon Jul 24 18:12:26 MSK 2017
[3] 18:12:26 [SUCCESS] 172.17.2.178
Mon Jul 24 18:12:26 MSK 2017
root@nat01:~# ssh 10.10.10.203 qm migrate 100 pve04 --online
Jul 24 18:12:33 starting migration of VM 100 to node 'pve04' (10.10.10.204)
Jul 24 18:12:33 copying disk images
Jul 24 18:12:33 starting VM 100 on remote node 'pve04'
Jul 24 18:12:36 start remote tunnel
Jul 24 18:12:37 starting online/live migration on unix:/run/qemu-server/100.migrate
Jul 24 18:12:37 migrate_set_speed: 8589934592
Jul 24 18:12:37 migrate_set_downtime: 0.1
Jul 24 18:12:37 set migration_caps
Jul 24 18:12:37 set cachesize: 53687091
Jul 24 18:12:37 start migrate command to unix:/run/qemu-server/100.migrate
Jul 24 18:12:39 migration speed: 256.00 MB/s - downtime 39 ms
Jul 24 18:12:39 migration status: completed
Jul 24 18:12:43 migration finished successfully (duration 00:00:10)
root@nat01:~# parallel-ssh -H 10.10.10.203 -H 172.17.2.178 -H 10.10.10.204 -i date
[1] 18:12:48 [SUCCESS] 172.17.2.178
Mon Jul 24 18:12:45 MSK 2017
[2] 18:12:48 [SUCCESS] 10.10.10.204
Mon Jul 24 18:12:48 MSK 2017
[3] 18:12:48 [SUCCESS] 10.10.10.203
Mon Jul 24 18:12:48 MSK 2017
Code:
root@nat01:~# parallel-ssh -H 10.10.10.205 -H 10.10.10.206 -i pveversion
[1] 18:12:58 [SUCCESS] 10.10.10.206
pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve)
[2] 18:12:58 [SUCCESS] 10.10.10.205
pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve)
root@nat01:~# parallel-ssh -H 10.10.10.206 -H 172.17.2.247 -H 10.10.10.205 -i date
[1] 18:13:01 [SUCCESS] 10.10.10.206
Mon Jul 24 18:13:01 MSK 2017
[2] 18:13:01 [SUCCESS] 10.10.10.205
Mon Jul 24 18:13:01 MSK 2017
[3] 18:13:01 [SUCCESS] 172.17.2.247
Mon Jul 24 18:13:01 MSK 2017
root@nat01:~# ssh 10.10.10.205 qm migrate 200 pve06 --online
Jul 24 18:13:05 starting migration of VM 200 to node 'pve06' (10.10.10.206)
Jul 24 18:13:05 copying disk images
Jul 24 18:13:05 starting VM 200 on remote node 'pve06'
Jul 24 18:13:08 start remote tunnel
Jul 24 18:13:09 starting online/live migration on unix:/run/qemu-server/200.migrate
Jul 24 18:13:09 migrate_set_speed: 8589934592
Jul 24 18:13:09 migrate_set_downtime: 0.1
Jul 24 18:13:09 set migration_caps
Jul 24 18:13:09 set cachesize: 53687091
Jul 24 18:13:09 start migrate command to unix:/run/qemu-server/200.migrate
Jul 24 18:13:11 migration speed: 256.00 MB/s - downtime 16 ms
Jul 24 18:13:11 migration status: completed
Jul 24 18:13:15 migration finished successfully (duration 00:00:11)
root@nat01:~# parallel-ssh -H 10.10.10.206 -H 172.17.2.247 -H 10.10.10.205 -i date
[1] 18:13:22 [SUCCESS] 10.10.10.205
Mon Jul 24 18:13:22 MSK 2017
[2] 18:13:22 [SUCCESS] 10.10.10.206
Mon Jul 24 18:13:22 MSK 2017
[3] 18:13:22 [SUCCESS] 172.17.2.247
Mon Jul 24 18:13:22 MSK 2017