2017-07-19 09:29:10 migration status: active (transferred 332452457, remaining 3585961984), total 4312604672)
2017-07-19 09:29:10 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-19 09:29:12 migration status: active (transferred 671704944, remaining 10498048), total 4312604672)
2017-07-19 09:29:12 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 128 overflow 0
2017-07-19 09:29:13 migration speed: 341.33 MB/s - downtime 15 ms
2017-07-19 09:29:13 migration status: completed
drive-scsi0: transferred: 42949672960 bytes remaining: 0 bytes total: 42949672960 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
2017-07-19 09:29:25 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=somehost' root@1.1.1.1 pvesr set-state 9006 \''{}'\'
39 packets transmitted, 24 received, 38% packet loss, time 38891ms
2017-07-19 09:42:57 migration status: active (transferred 409286050, remaining 3496914944), total 4312604672)
2017-07-19 09:42:57 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-19 09:42:59 migration speed: 1024.00 MB/s - downtime 16 ms
2017-07-19 09:42:59 migration status: completed
2017-07-19 09:43:00 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=somehost' root@1.1.1.1 pvesr set-state 9006 \''{}'\'
2017-07-19 09:43:03 migration finished successfully (duration 00:00:11)
42 packets transmitted, 38 received, 9% packet loss, time 41002ms
Yes, it is possible. That is why I mentioned that I was pinging from the same switch (L2, no routing; it is a really good, fast 10G switch with little network traffic and no congestion), and that there were no such delays in PVE 5.0 beta2.
I think you are mixing some things up here.
The bug is a problem in --with-local-storage migration and has nothing to do with pvesr.
We run into a timeout that waits 10 seconds for each drive.
Your Ceph migration does not use this code path, and your 4 lost packets may be related to the new network path to the VM.
This can happen because the ping is still sent to the old node.
2017-07-19 11:10:59 migration speed: 1024.00 MB/s - downtime 12 ms
2017-07-19 11:10:59 migration status: completed
2017-07-19 11:11:00 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=somehost' root@1.1.1.1 pvesr set-state 9006 \''{}'\'
2017-07-19 11:11:04 migration finished successfully (duration 00:00:11)
Wed Jul 19 11:10:58 MSK 2017
35: tap9006i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 02:e7:d7:63:72:c3 brd ff:ff:ff:ff:ff:ff
Wed Jul 19 11:10:59 MSK 2017
35: tap9006i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 02:e7:d7:63:72:c3 brd ff:ff:ff:ff:ff:ff
Wed Jul 19 11:11:00 MSK 2017
35: tap9006i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 02:e7:d7:63:72:c3 brd ff:ff:ff:ff:ff:ff
Wed Jul 19 11:11:01 MSK 2017
35: tap9006i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 02:e7:d7:63:72:c3 brd ff:ff:ff:ff:ff:ff
Wed Jul 19 11:11:02 MSK 2017
Device "tap9006i0" does not exist.
2017-07-19 13:30:32 start migrate command to unix:/run/qemu-server/9006.migrate
2017-07-19 13:30:34 migration status: active (transferred 362090113, remaining 3551707136), total 4312866816)
2017-07-19 13:30:34 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-19 13:30:36 migration speed: 1024.00 MB/s - downtime 22 ms
2017-07-19 13:30:36 migration status: completed
2017-07-19 13:30:37 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=somehost' root@1.1.1.102 pvesr set-state 9006 \''{}'\'
2017-07-19 13:30:41 migration finished successfully (duration 00:00:12)
13:30:33.761594 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:33.761594 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:34.762692 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:34.762709 Out ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:34.762692 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:34.762692 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:35.763549 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:35.763566 Out ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:35.763549 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:35.763549 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:36.764546 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:36.764565 Out ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:36.764546 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:36.764546 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:37.765683 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:37.765704 Out ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:37.765683 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:37.765683 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:38.766809 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:38.766827 Out ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:38.766809 B ee:85:5a:77:77:77 ethertype 802.1Q (0x8100), length 66: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:38.766809 B ee:85:5a:77:77:77 ethertype ARP (0x0806), length 62: Request who-has 1.1.1.111 tell 1.1.1.101, length 46
13:30:39.323734 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.323761 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.323769 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.323775 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.323780 B 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Request who-has 1.1.1.111 tell 1.1.1.111, length 28
13:30:39.323786 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.323801 P 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324153 B 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.111, length 28
13:30:39.324158 B 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Request who-has 1.1.1.111 tell 1.1.1.111, length 28
13:30:39.324333 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324340 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324343 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324345 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324351 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Request who-has 1.1.1.111 tell 1.1.1.111, length 28
13:30:39.324354 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.324356 Out 42:86:df:88:d3:c8 ethertype 802.1Q (0x8100), length 48: vlan 10, p 0, ethertype ARP, Reply 1.1.1.111 is-at 42:86:df:88:d3:c8, length 28
13:30:39.373192 B 42:86:df:88:d3:c8 ethertype ARP (0x0806), length 44: Request who-has 1.1.1.111 tell 1.1.1.111, length 28
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=28 time=7.221 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=29 time=6.157 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=30 time=13.081 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=31 time=12.521 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=32 time=3.438 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=33 time=10.555 msec
Timeout
Timeout
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=34 time=575.345 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=35 time=575.377 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=36 time=575.386 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=37 time=575.393 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=38 time=575.399 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=39 time=575.405 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=40 time=575.412 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=41 time=11.016 msec
60 bytes from 42:86:df:88:d3:c8 (1.1.1.111): index=42 time=6.305 msec
PING 172.20.60.128 (172.20.60.128) 56(84) bytes of data.
..
[1500469027.533452] 64 bytes from 172.20.60.128: icmp_seq=16 ttl=64 time=0.234 ms
[1500469028.557419] 64 bytes from 172.20.60.128: icmp_seq=17 ttl=64 time=0.191 ms
[1500469045.965378] 64 bytes from 172.20.60.128: icmp_seq=34 ttl=64 time=0.148 ms
[1500469046.989439] 64 bytes from 172.20.60.128: icmp_seq=35 ttl=64 time=0.211 ms
..
^C
--- 172.20.60.128 ping statistics ---
37 packets transmitted, 21 received, 43% packet loss, time 36854ms
rtt min/avg/max/mdev = 0.133/0.213/0.831/0.142 ms
..
[1500469026.333883] 64 bytes from 8.8.8.8: icmp_seq=13 ttl=46 time=13.1 ms
[1500469027.335640] 64 bytes from 8.8.8.8: icmp_seq=14 ttl=46 time=13.4 ms
[1500469028.336981] 64 bytes from 8.8.8.8: icmp_seq=15 ttl=46 time=13.0 ms
[1500469029.338191] 64 bytes from 8.8.8.8: icmp_seq=16 ttl=46 time=13.0 ms
[1500469030.339465] 64 bytes from 8.8.8.8: icmp_seq=17 ttl=46 time=13.1 ms
..
^C
--- 8.8.8.8 ping statistics ---
21 packets transmitted, 21 received, 0% packet loss, time 20027ms
rtt min/avg/max/mdev = 13.059/13.210/13.461/0.102 ms
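For anyone who wants to put a number on the outage from such a ping log: a minimal Python sketch, assuming the replies were recorded with `ping -D` (epoch timestamp in square brackets, as in the excerpt above). The function name `largest_gap` and the variable `sample` are made up for illustration; the sample lines are copied from the first ping output.

```python
import re

# Matches reply lines as printed by `ping -D`, e.g.
# [1500469028.557419] 64 bytes from 172.20.60.128: icmp_seq=17 ttl=64 time=0.191 ms
REPLY_RE = re.compile(r"^\[(\d+\.\d+)\] .*icmp_seq=(\d+)")


def largest_gap(lines):
    """Return (gap_seconds, seq_before, seq_after) for the longest pause
    between two consecutive replies, or None if there are fewer than two."""
    replies = []
    for line in lines:
        match = REPLY_RE.match(line.strip())
        if match:
            replies.append((float(match.group(1)), int(match.group(2))))

    best = None
    for (t0, s0), (t1, s1) in zip(replies, replies[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, s0, s1)
    return best


# Reply lines taken from the ping output posted above.
sample = [
    "[1500469027.533452] 64 bytes from 172.20.60.128: icmp_seq=16 ttl=64 time=0.234 ms",
    "[1500469028.557419] 64 bytes from 172.20.60.128: icmp_seq=17 ttl=64 time=0.191 ms",
    "[1500469045.965378] 64 bytes from 172.20.60.128: icmp_seq=34 ttl=64 time=0.148 ms",
    "[1500469046.989439] 64 bytes from 172.20.60.128: icmp_seq=35 ttl=64 time=0.211 ms",
]

gap, before, after = largest_gap(sample)
print(f"longest gap: {gap:.1f} s between icmp_seq={before} and icmp_seq={after}")
# prints: longest gap: 17.4 s between icmp_seq=17 and icmp_seq=34
```

The 16 missing sequence numbers (18 to 33) match the 16 lost packets in the ping summary (37 transmitted, 21 received).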
p@dev:[~]0$ pssh -l root -H 10.0.7.149 -H somehost01 -i date
[1] 11:46:09 [SUCCESS] somehost01
Thu Jul 20 11:46:09 MSK 2017
[2] 11:46:09 [SUCCESS] 10.0.7.149
Thu Jul 20 11:46:09 MSK 2017
p@dev:[~]0$ ssh root@somehost03 'qm migrate 9004 somehost01 --online'
2017-07-20 11:46:34 starting migration of VM 9004 to node 'somehost01' (10.0.0.100)
2017-07-20 11:46:34 copying disk images
2017-07-20 11:46:34 starting VM 9004 on remote node 'somehost01'
2017-07-20 11:46:37 start remote tunnel
2017-07-20 11:46:37 starting online/live migration on unix:/run/qemu-server/9004.migrate
2017-07-20 11:46:37 migrate_set_speed: 8589934592
2017-07-20 11:46:37 migrate_set_downtime: 0.1
2017-07-20 11:46:37 set migration_caps
2017-07-20 11:46:37 set cachesize: 429496729
2017-07-20 11:46:37 start migrate command to unix:/run/qemu-server/9004.migrate
2017-07-20 11:46:39 migration status: active (transferred 336960988, remaining 3563630592), total 4312604672)
2017-07-20 11:46:39 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-20 11:46:41 migration speed: 1024.00 MB/s - downtime 11 ms
2017-07-20 11:46:41 migration status: completed
2017-07-20 11:46:43 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=somehost01' root@10.0.0.100 pvesr set-state 9004 \''{}'\'
2017-07-20 11:46:46 migration finished successfully (duration 00:00:12)
p@dev:[~]0$ pssh -l root -H 10.0.7.149 -H somehost01 -i date
[1] 11:46:51 [SUCCESS] somehost01
Thu Jul 20 11:46:51 MSK 2017
[2] 11:46:51 [SUCCESS] 10.0.7.149
Thu Jul 20 11:46:48 MSK 2017
You forget that you are migrating a NIC to another port, and the switch does not know this.
As you say, you have a switch and not a hub.
So ARP (IPv4) or NDP (IPv6) needs some time to find the new location (port).
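To make the point about the switch concrete, here is a toy Python sketch of MAC learning. It is only an illustration under simplifying assumptions, not how any particular switch or Open vSwitch is implemented: the class `LearningSwitch` and the port numbers are hypothetical, and the MAC address is the one from the ARP capture above.

```python
class LearningSwitch:
    """Toy model of an L2 switch's MAC address table (illustration only)."""

    def __init__(self):
        self.mac_table = {}  # MAC address -> port where it was last seen

    def receive(self, src_mac, in_port):
        # A switch learns (or re-learns) where a MAC lives from the
        # source address of every frame that arrives on a port.
        self.mac_table[src_mac] = in_port

    def out_port(self, dst_mac):
        # Known MAC: forward to the learned port.
        # Unknown MAC: a real switch would flood; modelled here as None.
        return self.mac_table.get(dst_mac)


VM_MAC = "42:86:df:88:d3:c8"  # MAC of the VM, from the capture above
OLD_PORT, NEW_PORT = 1, 2     # hypothetical switch ports of the two nodes

switch = LearningSwitch()

# While the VM runs on the source node, its frames arrive on OLD_PORT.
switch.receive(VM_MAC, OLD_PORT)
before = switch.out_port(VM_MAC)

# The VM now runs on the target node, but has not sent a frame yet:
# the table is stale and traffic still goes to the old node.
stale = switch.out_port(VM_MAC)

# The first frame from the VM on the new port (for example a gratuitous
# ARP or an ARP reply) updates the table.
switch.receive(VM_MAC, NEW_PORT)
after = switch.out_port(VM_MAC)

print(before, stale, after)  # prints: 1 1 2
```

Until that first frame from the new location is seen, pings keep being delivered to the old node, which shows up as lost packets.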
A test package with improved live-migration downtime is available in pvetest: http://download.proxmox.com/debian/...est/binary-amd64/qemu-server_5.0-15_amd64.deb
Please report your results!
Downtime has improved for shared-disk migration. It is now in the range of 0.2 to 1 second.
For "with local disk" migration there is almost no effect: downtime is still 12.5-13 seconds.
Tested in a nested cluster.
"With local disk" migration needs to wait for the block jobs to finish, so additional delay is expected there, and resuming early is dangerous.
Thanks for the feedback!