Hi,
I have set up a cluster with three nodes, a dedicated corosync network (two rings) and a dedicated migration network.
When I log in to the cluster nodes, they can reach each other (ping/ssh) via their migration network IP addresses.
Each node has a single IP address inside the migration network:
10.182.40.97 lxgamora-migration
10.182.40.98 lxgroot-migration
10.182.40.99 lxrocket-migration
...and they are set up to use this network:
root@lxgroot-mgmt:~# grep migration /etc/pve/datacenter.cfg
migration: secure,network=10.182.40.97/27
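For what it's worth, this is roughly how I sanity-check on each node which address it owns inside that /27 and that the *-migration names resolve (just a quick sketch):

# show the local address(es) inside the migration /27
ip -brief address show to 10.182.40.96/27
# confirm the migration hostnames resolve to the addresses listed above
getent hosts lxgamora-migration lxgroot-migration lxrocket-migration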
This may be relevant: I am using Open vSwitch for the networking -- the part of /etc/network/interfaces for the migration network looks like this:
allow-vmbr2 bond2
iface bond2 inet manual
        ovs_bonds enp24s0f0 enp24s0f1
        ovs_type OVSBond
        ovs_bridge vmbr2
        ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
        pre-up (ip link set dev enp24s0f0 mtu 9000 && ip link set dev enp24s0f1 mtu 9000)
        mtu 9000
#Migration

iface enp24s0f0 inet manual

iface enp24s0f1 inet manual

auto vmbr2
iface vmbr2 inet static
        address 10.182.40.98
        netmask 27
        ovs_type OVSBridge
        ovs_ports bond2
#Migration Network
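Since bond2 and vmbr2 are supposed to run with MTU 9000, this is roughly how I would verify that jumbo frames really make it end-to-end across the migration network and that the LACP bond is up (a sketch run from lxgroot towards lxgamora; 8972 bytes is the largest ICMP payload that fits into a 9000-byte MTU):

# effective MTU of the bridge interface
ip -details link show vmbr2 | grep -o 'mtu [0-9]*'
# non-fragmentable jumbo ping from lxgroot to lxgamora over the migration network
ping -M do -s 8972 -c 3 10.182.40.97
# LACP / member state of the OVS bond
ovs-appctl bond/show bond2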
Unfortunately, when I try to migrate a running VM, I get this:
root@lxgroot-mgmt:~# qm migrate 107 lxgamora-mgmt --migration_network 10.182.40.97/27 --migration_type secure --online
2019-11-20 17:21:13 use dedicated network address for sending migration traffic (10.182.40.97)
2019-11-20 17:21:13 starting migration of VM 107 to node 'lxgamora-mgmt' (10.182.40.97)
2019-11-20 17:21:13 copying disk images
2019-11-20 17:21:13 starting VM 107 on remote node 'lxgamora-mgmt'
2019-11-20 17:21:14 start remote tunnel
2019-11-20 17:21:15 ssh tunnel ver 1
2019-11-20 17:21:15 starting online/live migration on unix:/run/qemu-server/107.migrate
2019-11-20 17:21:15 migrate_set_speed: 8589934592
2019-11-20 17:21:15 migrate_set_downtime: 0.1
2019-11-20 17:21:15 set migration_caps
2019-11-20 17:21:15 set cachesize: 2147483648
2019-11-20 17:21:15 start migrate command to unix:/run/qemu-server/107.migrate
2019-11-20 17:21:16 migration status: active (transferred 14440101, remaining 17194475520), total 17197506560)
2019-11-20 17:21:16 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
--- at this point the migration hangs until I kill it ---
^C2019-11-20 17:23:10 ERROR: online migrate failure - interrupted by signal
2019-11-20 17:23:10 aborting phase 2 - cleanup resources
2019-11-20 17:23:10 migrate_cancel
2019-11-20 17:23:51 ssh tunnel still running - terminating now with SIGTERM
2019-11-20 17:24:01 ssh tunnel still running - terminating now with SIGKILL
2019-11-20 17:24:02 ERROR: no reply to command 'quit': reading from tunnel failed: got timeout
2019-11-20 17:24:02 ERROR: migration finished with problems (duration 00:02:49)
migration problems
I have tried this with different running VMs and between different nodes, but the result is always the same.
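The next thing I could try is to watch the migration traffic on the target node while a test migration is running, roughly like this (only a sketch; I am assuming the SSH tunnel used for migration_type secure connects to port 22 on the target's migration address):

# on the target (lxgamora), watch for incoming traffic from the source's migration IP
tcpdump -ni vmbr2 host 10.182.40.98 and port 22
# check that the migration unix socket from the log exists and is listening
ss -xl | grep 107.migrate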
Does anyone have a hint or a solution for me?
All nodes are installed with Proxmox VE 6:
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
The cluster itself is running and quorate:
root@lxgroot-mgmt:~# pvecm status
Quorum information
------------------
Date:             Wed Nov 20 15:51:39 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/168
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.182.40.1
0x00000002          1 10.182.40.2 (local)
0x00000003          1 10.182.40.3
root@lxgroot-mgmt:~# pveversion --verbose
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1