Hi,
I have set up a cluster with three nodes, a dedicated corosync network (two rings) and a dedicated migration network.
When I log in to the cluster nodes, they can reach each other (ping/ssh) via their migration network IP addresses.
Each node has a single IP address inside the migration network:
10.182.40.97 lxgamora-migration
10.182.40.98 lxgroot-migration
10.182.40.99 lxrocket-migration
...and they are set up to use this network:
root@lxgroot-mgmt:~# grep migration /etc/pve/datacenter.cfg
migration: secure,network=10.182.40.97/27
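For what it's worth, this is roughly how I sanity-check on each node which address it owns inside that /27 and that the *-migration names resolve (just a quick sketch):

# show the local address(es) inside the migration /27
ip -brief address show to 10.182.40.96/27
# confirm the migration hostnames resolve to the addresses listed above
getent hosts lxgamora-migration lxgroot-migration lxrocket-migration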
This may be relevant: I am using Open vSwitch for the networking -- the part of /etc/network/interfaces for the migration network looks like this:
allow-vmbr2 bond2
iface bond2 inet manual
        ovs_bonds enp24s0f0 enp24s0f1
        ovs_type OVSBond
        ovs_bridge vmbr2
        ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast
        pre-up (ip link set dev enp24s0f0 mtu 9000 && ip link set dev enp24s0f1 mtu 9000)
        mtu 9000
#Migration

iface enp24s0f0 inet manual

iface enp24s0f1 inet manual

auto vmbr2
iface vmbr2 inet static
        address 10.182.40.98
        netmask 27
        ovs_type OVSBridge
        ovs_ports bond2
#Migration Network
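Since bond2 and vmbr2 are supposed to run with MTU 9000, this is roughly how I would verify that jumbo frames really make it end-to-end across the migration network and that the LACP bond is up (a sketch run from lxgroot towards lxgamora; 8972 bytes is the largest ICMP payload that fits into a 9000-byte MTU):

# effective MTU of the bridge interface
ip -details link show vmbr2 | grep -o 'mtu [0-9]*'
# non-fragmentable jumbo ping from lxgroot to lxgamora over the migration network
ping -M do -s 8972 -c 3 10.182.40.97
# LACP / member state of the OVS bond
ovs-appctl bond/show bond2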
Unfortunately, when I try to migrate a running VM, I get this:
root@lxgroot-mgmt:~# qm migrate 107 lxgamora-mgmt --migration_network 10.182.40.97/27 --migration_type secure --online
2019-11-20 17:21:13 use dedicated network address for sending migration traffic (10.182.40.97)
2019-11-20 17:21:13 starting migration of VM 107 to node 'lxgamora-mgmt' (10.182.40.97)
2019-11-20 17:21:13 copying disk images
2019-11-20 17:21:13 starting VM 107 on remote node 'lxgamora-mgmt'
2019-11-20 17:21:14 start remote tunnel
2019-11-20 17:21:15 ssh tunnel ver 1
2019-11-20 17:21:15 starting online/live migration on unix:/run/qemu-server/107.migrate
2019-11-20 17:21:15 migrate_set_speed: 8589934592
2019-11-20 17:21:15 migrate_set_downtime: 0.1
2019-11-20 17:21:15 set migration_caps
2019-11-20 17:21:15 set cachesize: 2147483648
2019-11-20 17:21:15 start migrate command to unix:/run/qemu-server/107.migrate
2019-11-20 17:21:16 migration status: active (transferred 14440101, remaining 17194475520), total 17197506560)
2019-11-20 17:21:16 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
--- at this point the migration hangs until I kill it ---
^C2019-11-20 17:23:10 ERROR: online migrate failure - interrupted by signal
2019-11-20 17:23:10 aborting phase 2 - cleanup resources
2019-11-20 17:23:10 migrate_cancel
2019-11-20 17:23:51 ssh tunnel still running - terminating now with SIGTERM
2019-11-20 17:24:01 ssh tunnel still running - terminating now with SIGKILL
2019-11-20 17:24:02 ERROR: no reply to command 'quit': reading from tunnel failed: got timeout
2019-11-20 17:24:02 ERROR: migration finished with problems (duration 00:02:49)
migration problems
I have tried this with different running VMs and between different nodes, but the result is always the same.
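The next thing I could try is to watch the migration traffic on the target node while a test migration is running, roughly like this (only a sketch; I am assuming the SSH tunnel used for migration_type secure connects to port 22 on the target's migration address):

# on the target (lxgamora), watch for incoming traffic from the source's migration IP
tcpdump -ni vmbr2 host 10.182.40.98 and port 22
# check that the migration unix socket from the log exists and is listening
ss -xl | grep 107.migrate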
Does anyone have a hint or a solution for me?
All nodes are installed with Proxmox VE 6:
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
The cluster itself is running and quorate:
root@lxgroot-mgmt:~# pvecm status
Quorum information
------------------
Date:             Wed Nov 20 15:51:39 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/168
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.182.40.1
0x00000002          1 10.182.40.2 (local)
0x00000003          1 10.182.40.3
root@lxgroot-mgmt:~# pveversion --verbose
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1