[SOLVED] Live VM migration hangs and times out (was network issue)

dmgeurts

New Member
Nov 30, 2023
Hi,

I'm trying to migrate a VM between two nodes that share LVM storage (an external SAS array). Offline migrations work fine, but live migrations hang.

Bash:
user@pmx01:~$ sudo pvs
  PV                       VG       Fmt  Attr PSize   PFree
  /dev/mapper/md34-pmx-cl1 md34-sas lvm2 a--  <10.00t  5.82t
  /dev/sdc3                pve      lvm2 a--  222.50g 16.00g
user@pmx01:~$ sudo vgs
  VG       #PV #LV #SN Attr   VSize   VFree
  md34-sas   1   3   0 wz--n- <10.00t  5.82t
  pve        1   3   0 wz--n- 222.50g 16.00g
user@pmx01:~$ sudo lvs
  LV            VG       Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  vm-100-disk-0 md34-sas -wi-ao----   40.00g
  vm-101-disk-0 md34-sas -wi-ao----   40.00g
  vm-101-disk-1 md34-sas -wi-ao----   <4.10t
  data          pve      twi-aotz-- <130.22g             0.00   1.22
  root          pve      -wi-ao----   65.62g
  swap          pve      -wi-ao----    8.00g
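
For context, the shared VG is defined as LVM storage in /etc/pve/storage.cfg. The entry looks roughly like this (the storage ID matches the VG name above; the rest is the usual shared-LVM settings, so treat it as illustrative):

Code:
lvm: md34-sas
        vgname md34-sas
        content images
        shared 1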

Log of the live migration attempt:

Bash:
2023-12-05 11:36:38 use dedicated network address for sending migration traffic (10.0.0.21)
2023-12-05 11:36:39 starting migration of VM 100 to node 'pmx01' (10.0.0.21)
2023-12-05 11:36:39 starting VM 100 on remote node 'pmx01'
2023-12-05 11:36:43 start remote tunnel
2023-12-05 11:36:44 ssh tunnel ver 1
2023-12-05 11:36:44 starting online/live migration on unix:/run/qemu-server/100.migrate
2023-12-05 11:36:44 set migration capabilities
2023-12-05 11:36:44 migration downtime limit: 100 ms
2023-12-05 11:36:44 migration cachesize: 1.0 GiB
2023-12-05 11:36:44 set migration parameters
2023-12-05 11:36:44 start migrate command to unix:/run/qemu-server/100.migrate
2023-12-05 11:36:45 migration active, transferred 2.5 MiB of 8.0 GiB VM-state, 0.0 B/s

[Screenshot of the hung migration task attached]
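
While it hangs like this, the QEMU migration state can also be inspected from the source node's monitor, something along these lines (VM 100 as above):

Bash:
# Open the QEMU monitor for the VM on the node currently running it
qm monitor 100
# At the qm> prompt, query the live migration status
qm> info migrate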
 
What PVE version are you running?
 
Please also post the syslog from around a live-migration attempt.
 
Logs after the migration fails:

Bash:
2023-12-05 11:36:45 migration active, transferred 2.5 MiB of 8.0 GiB VM-state, 0.0 B/s
sss_ssh_knownhostsproxy: unable to proxy data: Connection timed out
client_loop: send disconnect: Broken pipe
2023-12-05 11:52:32 migration status error: failed - Unable to write to socket: Broken pipe
2023-12-05 11:52:32 ERROR: online migrate failure - aborting
2023-12-05 11:52:32 aborting phase 2 - cleanup resources
2023-12-05 11:52:32 migrate_cancel
2023-12-05 11:52:35 ERROR: writing to tunnel failed: broken pipe
2023-12-05 11:52:35 ERROR: migration finished with problems (duration 00:15:57)
TASK ERROR: migration problems

But pmx02 can SSH as root to pmx01 just fine:

Bash:
root@pmx02:~# ssh -o 'HostKeyAlias=pmx01.domain.com' root@10.0.0.21
Linux pmx01.domain.com 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Nov 30 16:10:22 2023 from 10.0.0.22
root@pmx01:~# exit
logout
Connection to 10.0.0.21 closed.
root@pmx02:~# ssh -o 'HostKeyAlias=pmx01' root@10.0.0.21
Linux pmx01.domain.com 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Dec  5 13:00:49 2023 from 10.0.0.22
 
Code:
2023-12-05 11:52:32 migration status error: failed - Unable to write to socket: Broken pipe

That looks like a network connection issue.

LACP? Some type of bond? Live migration can saturate the network, and corosync may get stuck if it doesn't have a dedicated link. A bit more detail about the nodes and hardware would help.
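
For reference, a dedicated migration network is set cluster-wide in /etc/pve/datacenter.cfg. Your log already shows a dedicated migration address in use, so the entry probably looks something like this (the CIDR is only a guess based on the 10.0.0.x addresses in your log):

Code:
migration: secure,network=10.0.0.0/24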
 
Found this on both nodes:

Bash:
Dec 05 11:57:27 pmx01.domain.com corosync[1872]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 9212 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Dec 05 11:57:27 pmx01.domain.com corosync[1872]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

I had the MTU set to 9216 everywhere except on vmbr0, which was left at the default (1500). It seems PVE expects vmbr0 to support jumbo frames when they're enabled on the underlying interfaces, or perhaps I should have set vmbr0's MTU to 1500 explicitly rather than leaving it unset. I then found that the NICs don't actually support 9216 and had to lower the vmbr0 MTU to 9152. Once that was done, the live migration worked perfectly. So this was completely unrelated to the storage.
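
For anyone hitting the same thing: this is roughly how I verified which frame size actually passes, and where the bridge MTU ends up in /etc/network/interfaces. The bridge-port name below is just a placeholder for my setup, so adjust it to yours:

Bash:
# Confirm that 9152-byte packets reach the other node without fragmentation
# (ICMP payload = 9152 - 20-byte IP header - 8-byte ICMP header = 9124)
ping -M do -s 9124 -c 3 10.0.0.21

# /etc/network/interfaces excerpt (addressing omitted; bond0 stands in for the real bridge port)
auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9152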
 
Well done! Please mark the thread as solved.
 
Yes, indeed, it turned out to be a network issue after all; I didn't expect that.

 
