[SOLVED] Online migrate failure - unable to detect remote migration address

Le PAH

Member
Oct 17, 2018
Hello,

I've installed a Proxmox clustered environment on the 5.2-2 version:

Code:
root@srv-pve1:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-28
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

The cluster consists of three identical nodes running exactly the same versions on the same physical architecture.

The cluster is healthy:
Code:
root@srv-pve1:~# pvecm status
Quorum information
------------------
Date:             Wed Oct 17 07:57:12 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/64
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.101 (local)
0x00000002          1 10.0.0.102
0x00000003          1 10.0.0.103


The storage is a Ceph cluster, which is also healthy:

Code:
root@srv-pve1:~# ceph-brag
{
  "cluster_creation_date": "2018-10-11 17:51:26.259055",
  "uuid": "dd52bfc1-5409-4730-8f3a-72637478418a",
  "components_count": {
    "num_data_bytes": 20857718616,
    "num_mons": 3,
    "num_pgs": 768,
    "num_mdss": 0,
    "num_pools": 2,
    "num_osds": 18,
    "num_bytes_total": 72002146295808,
    "num_objects": 5911
  },
  "crush_types": [
    {
      "count": 6,
      "type": "host"
    },
    {
      "count": 2,
      "type": "root"
    },
    {
      "count": 18,
      "type": "devices"
    }
  ],
  "ownership": {},
  "pool_metadata": [
    {
      "type": 1,
      "id": 6,
      "size": 3
    },
    {
      "type": 1,
      "id": 8,
      "size": 2
    }
  ],
  "sysinfo": {
    "kernel_types": [
      {
        "count": 18,
        "type": "#1 SMP PVE 4.15.18-27 (Wed, 10 Oct 2018 10:50:11 +0200)"
      }
    ],
    "cpu_archs": [
      {
        "count": 18,
        "arch": "x86_64"
      }
    ],
    "cpus": [
      {
        "count": 18,
        "cpu": "Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz"
      }
    ],
    "kernel_versions": [
      {
        "count": 18,
        "version": "4.15.18-7-pve"
      }
    ],
    "ceph_versions": [
      {
        "count": 18,
        "version": "12.2.8(6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840)"
      }
    ],
    "os_info": [
      {
        "count": 18,
        "os": "Linux"
      }
    ],
    "distros": []
  }
}

For whatever reason, I cannot live-migrate VMs:

Code:
()
2018-10-17 07:48:19 starting migration of VM 101 to node 'srv-pve2' (192.168.1.102)
2018-10-17 07:48:19 copying disk images
2018-10-17 07:48:19 starting VM 101 on remote node 'srv-pve2'
2018-10-17 07:48:21 ERROR: online migrate failure - unable to detect remote migration address
2018-10-17 07:48:21 aborting phase 2 - cleanup resources
2018-10-17 07:48:21 migrate_cancel
2018-10-17 07:48:22 ERROR: migration finished with problems (duration 00:00:04)
TASK ERROR: migration problems

I suspected DNS issues, since SSH session initiation between nodes is rather slow (~1.5 s), so I added host entries to /etc/hosts, but to no avail:

Code:
root@srv-pve1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.101 srv-pve1.mydomain.local srv-pve1 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

10.0.0.101 srv-pve1-private
10.0.0.102 srv-pve2-private
10.0.0.103 srv-pve3-private

192.168.1.101 srv-pve1
192.168.1.102 srv-pve2
192.168.1.103 srv-pve3
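To check whether the slow SSH handshake is really DNS-related, I suppose I could time a plain connection and look at the UseDNS setting on the server side (just an idea, the exact paths may differ):

Code:
time ssh root@srv-pve2 true            # how long does session setup take now?
grep -i usedns /etc/ssh/sshd_config    # reverse DNS lookups on the sshd side can add delay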

DuckDuckGo has failed me so far, and I cannot find a reasonable explanation for this issue.

Does anyone have an idea?
 
Can you also post the /etc/hosts of the other nodes and the /etc/pve/datacenter.cfg?
 
Hello. Here are the files from nodes #2 & #3:


Code:
root@srv-pve2:~# cat /etc/hosts
127.0.0.1 lopvecm create mycluster -bindnet0_addr 10.1.0.1 -ring0_addr serveur1.vpncalhost.localdomain localhost
192.168.1.102 srv-pve2.mydomain.local srv-pve2 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

10.0.0.101 srv-pve1-private
10.0.0.102 srv-pve2-private
10.0.0.103 srv-pve3-private

192.168.1.101 srv-pve1
192.168.1.102 srv-pve2
192.168.1.103 srv-pve3

Code:
root@srv-pve3:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.103 srv-pve3.mydomain.local srv-pve3 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

10.0.0.101 srv-pve1-private
10.0.0.102 srv-pve2-private
10.0.0.103 srv-pve3-private

192.168.1.101 srv-pve1
192.168.1.102 srv-pve2
192.168.1.103 srv-pve3

I can't find any datacenter.cfg, so here is my corosync.conf.

Code:
root@srv-pve3:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv-pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: srv-pve1-private
  }
  node {
    name: srv-pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.102
  }
  node {
    name: srv-pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.103
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 3
  interface {
    bindnetaddr: 10.0.0.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
127.0.0.1 lopvecm create mycluster -bindnet0_addr 10.1.0.1 -ring0_addr serveur1.vpncalhost.localdomain localhost
this line looks weird, try to correct it
 
Mhmm, try entering the 192.168.1.0 subnet as the migration network in /etc/pve/datacenter.cfg

see
man datacenter.cfg
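For example, something along these lines (only a sketch; adjust the CIDR to a subnet on which the nodes can reach each other):

Code:
# /etc/pve/datacenter.cfg
migration: secure,network=192.168.1.0/24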
 
I've created and configured the datacenter.cfg file as requested but it still fails:

Code:
TASK ERROR: failed to get ip for node 'srv-pve2' in network '192.168.1.0/24'

It doesn't work with the 10.0.0.0/24 CIDR either.
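If I read the error correctly, PVE looks for an address inside the given CIDR on the target node, so I assume the thing to double-check is which addresses srv-pve2 actually carries (a sanity-check sketch, nothing more):

Code:
# on the target node
ip -4 addr show                           # list all IPv4 addresses
ip -4 addr show | grep '192\.168\.1\.'    # anything inside the requested migration network?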


I've added the 192.168.1.x addresses to corosync.conf for ring 1:

Code:
[...]
nodelist {
  node {
    name: srv-pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.101
    ring1_addr: 192.168.1.101

  }
  node {
    name: srv-pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.102
    ring1_addr: 192.168.1.102
  }
  node {
    name: srv-pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.103
    ring1_addr: 192.168.1.103
  }
}
[...]
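I am not sure whether the totem section also needs rrp_mode and a matching second interface block before ring 1 is actually used by corosync 2.x; this is what I think it might look like (not applied on my cluster yet, and config_version would have to be bumped):

Code:
totem {
  cluster_name: Cluster
  config_version: 4
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.0.0.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.1.0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}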

It changes nothing; with the datacenter.cfg settings in place, migration is impossible, whether live or offline.

When I comment out the migration setting in that file, live migration still fails, but offline migration works as expected and seems to use the 192.168.x interface.

I'm really puzzled by this problem. Why is the cluster not using the interface that I've explicitly assigned for clustering and Ceph traffic? How can I confirm which interface the operations are using?
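For reference, would something like this be the right way to confirm which link is actually used? This is just my guess at the relevant commands:

Code:
corosync-cfgtool -s         # ring status and the address each ring is bound to
ip route get 10.0.0.102     # which interface/source address is used to reach the peer
ip route get 192.168.1.102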
 
Last edited:
I'm still stuck :(

I cannot figure out why the routing is so unstable.

I've double checked the IPs of the hosts and everything seems to be fine.

Does anyone have an idea for further investigations?
 
does the syslog say anything during a migration?
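For example, run something like this on both the source and the target node while you trigger the migration:

Code:
tail -f /var/log/syslog
# or
journalctl -f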
 
Yes!!! Indeed there is an error:

Code:
Oct 25 14:20:42 srv-pve1 systemd[1]: Started Session 304 of user root.
Oct 25 14:20:42 srv-pve1 systemd[1]: Started Session 305 of user root.
Oct 25 14:20:43 srv-pve1 qm[373076]: start VM 101: UPID:srv-pve1:0005B154:0409021B:5BD1B51B:qmstart:101:root@pam:
Oct 25 14:20:43 srv-pve1 qm[372989]: <root@pam> starting task UPID:srv-pve1:0005B154:0409021B:5BD1B51B:qmstart:101:root@pam:
Oct 25 14:20:43 srv-pve1 systemd[1]: Started 101.scope.
Oct 25 14:20:43 srv-pve1 systemd-udevd[373089]: Could not generate persistent MAC address for tap101i0: No such file or directory
Oct 25 14:20:44 srv-pve1 kernel: [676985.338557] device tap101i0 entered promiscuous mode
Oct 25 14:20:44 srv-pve1 kernel: [676985.351648] vmbr0: port 2(tap101i0) entered blocking state
Oct 25 14:20:44 srv-pve1 kernel: [676985.351651] vmbr0: port 2(tap101i0) entered disabled state
Oct 25 14:20:44 srv-pve1 kernel: [676985.351815] vmbr0: port 2(tap101i0) entered blocking state
Oct 25 14:20:44 srv-pve1 kernel: [676985.351817] vmbr0: port 2(tap101i0) entered forwarding state
Oct 25 14:20:45 srv-pve1 qm[372989]: <root@pam> end task UPID:srv-pve1:0005B154:0409021B:5BD1B51B:qmstart:101:root@pam: OK
Oct 25 14:20:45 srv-pve1 systemd[1]: Started Session 306 of user root.
Oct 25 14:20:46 srv-pve1 qm[373171]: <root@pam> starting task UPID:srv-pve1:0005B20D:04090304:5BD1B51E:qmstop:101:root@pam:
Oct 25 14:20:46 srv-pve1 qm[373261]: stop VM 101: UPID:srv-pve1:0005B20D:04090304:5BD1B51E:qmstop:101:root@pam:
Oct 25 14:20:46 srv-pve1 qm[373171]: <root@pam> end task UPID:srv-pve1:0005B20D:04090304:5BD1B51E:qmstop:101:root@pam: OK
Oct 25 14:20:46 srv-pve1 pmxcfs[2056]: [status] notice: received log
Oct 25 14:20:46 srv-pve1 kernel: [676987.172705] vmbr0: port 2(tap101i0) entered disabled state
 
i actually meant also on the target node, since we try to start it there
 
I've already pasted the syslog of the target node (srv-pve1). The virtual machine is currently on srv-pve2.

Here is the syslog of the node the virtual machine is currently running on:

Code:
Oct 25 15:00:45 srv-pve2 pvedaemon[2658454]: <root@pam> starting task UPID:srv-pve2:002C3DE9:04392604:5BD1BE7D:qmigrate:101:root@pam:
Oct 25 15:00:47 srv-pve2 pmxcfs[2057]: [status] notice: received log
Oct 25 15:00:48 srv-pve2 pmxcfs[2057]: [status] notice: received log
Oct 25 15:00:49 srv-pve2 pmxcfs[2057]: [status] notice: received log
Oct 25 15:00:49 srv-pve2 pmxcfs[2057]: [status] notice: received log
Oct 25 15:00:49 srv-pve2 pvedaemon[2899433]: migration problems
Oct 25 15:00:49 srv-pve2 pvedaemon[2658454]: <root@pam> end task UPID:srv-pve2:002C3DE9:04392604:5BD1BE7D:qmigrate:101:root@pam: migration problems


I've also searched as suggested in this thread.
I don't know if it matters, but I cannot omping the two other nodes when connected to one of them.
The three nodes are connected through a meshed network configured with two bonded physical interfaces that broadcast to the two other nodes, so I believe multicast ping is not an issue per se.
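For the record, the multicast test I attempted looked roughly like this, started on all three nodes at the same time (I am not 100% sure I used the right flags):

Code:
omping -c 10000 -i 0.001 -F -q srv-pve1-private srv-pve2-private srv-pve3-private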
 
oh ok..

this line does raise some questions:

Oct 25 14:20:46 srv-pve1 qm[373171]: <root@pam> starting task UPID:srv-pve1:0005B20D:04090304:5BD1B51E:qmstop:101:root@pam:
It seems there is something stopping the VM on the target node immediately after starting it (14:20:45 started OK, 14:20:46 stop)?
Do you have HA activated, or something else that would explain that?
 
Also, can you post your VM config?
 
It seems there is something stopping the VM on the target node immediately after starting it (14:20:45 started OK, 14:20:46 stop)?
Do you have HA activated, or something else that would explain that?
Disregard this, we stop the VM if the migration fails, but the VM config would still be interesting, as well as the output of 'pvesh get /cluster/resources' (anonymised of course).
 
There is no HA currently active on the VM or on the datacenter.

I first thought that this line would be more concerning:
Code:
Could not generate persistent MAC address for tap101i0: No such file or directory

Here is the VM Config:

Code:
root@srv-pve1:~# less /etc/pve/nodes/srv-pve2/qemu-server/101.conf
agent: 1
bootdisk: virtio0
cores: 1
memory: 2048
name: SRV-ROCKET
net0: virtio=62:64:85:B9:15:E0,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
parent: Avant_ouverture
scsihw: virtio-scsi-pci
smbios1: uuid=31550042-bf1c-4062-afaa-d9a96c97aaa0
sockets: 2
virtio0: STD_POOL_vm:vm-101-disk-0,size=32G
vmgenid: 8c5174dd-6b7b-47e5-9bca-145af31786ac

[Avant_ouverture]
agent: 1
bootdisk: virtio0
cores: 1
memory: 2048
name: SRV-ROCKET
net0: virtio=62:64:85:B9:15:E0,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
runningmachine: pc-i440fx-2.11
scsihw: virtio-scsi-pci
smbios1: uuid=31550042-bf1c-4062-afaa-d9a96c97aaa0
snaptime: 1539538774
sockets: 2
virtio0: STD_POOL_vm:vm-101-disk-0,size=32G
vmgenid: 1963211a-dbdd-4f37-9eb8-10ac6e4da1b6
vmstate: STD_POOL_vm:vm-101-state-Avant_ouverture


Live migration also fails with another VM, configured as follows:
Code:
root@srv-pve1:~# less /etc/pve/nodes/srv-pve3/qemu-server/102.conf
agent: 1
bootdisk: scsi0
cores: 2
memory: 4096
name: SRV-ELK-CISCO
net0: virtio=C2:D9:97:64:7F:A6,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: STD_POOL_vm:vm-102-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=80f8c84b-e53b-4608-ac95-2b82bc03b9e2
sockets: 2
vmgenid: 99d3e121-9b05-4539-a9ed-0b481ac8783b
 
I've tried to update all the nodes to the latest versions from the repos but still can't live-migrate. I'm starting to wonder if I should reinstall the cluster from scratch…
 
Try posting your corosync.conf and the output of 'pvesh get /cluster/resources'.
 
Hello,

here is my corosync.conf

Code:
root@srv-pve1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv-pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.101
    ring1_addr: 192.168.1.101

  }
  node {
    name: srv-pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.102
    ring1_addr: 192.168.1.102
  }
  node {
    name: srv-pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.103
    ring1_addr: 192.168.1.103
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 3
  interface {
    bindnetaddr: 10.0.0.101
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

and here is the output of 'pvesh get /cluster/resources':

Code:
root@srv-pve1:~# pvesh get /cluster/resources
┌──────────────────────────────┬─────────┬───────┬───────────┬─────────┬───────┬
│ id                           │ type    │   cpu │ disk      │ hastate │ level │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ /pool/IT_Internal            │ pool    │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ /pool/Monitoring             │ pool    │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ /pool/Prototypes             │ pool    │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ lxc/100                      │ lxc     │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ node/srv-pve1                │ node    │ 0.25% │ 3.90 GiB  │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ qemu/101                     │ qemu    │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ qemu/102                     │ qemu    │       │           │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ storage/srv-pve1/LOW_POOL_ct │ storage │       │ 7.48 GiB  │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ storage/srv-pve1/LOW_POOL_vm │ storage │       │ 7.48 GiB  │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ storage/srv-pve1/STD_POOL_ct │ storage │       │ 11.16 GiB │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ storage/srv-pve1/STD_POOL_vm │ storage │       │ 11.16 GiB │         │       │
├──────────────────────────────┼─────────┼───────┼───────────┼─────────┼───────┼
│ storage/srv-pve1/local       │ storage │       │ 3.90 GiB  │         │       │
└──────────────────────────────┴─────────┴───────┴───────────┴─────────┴───────┴

I've tried to upgrade the whole cluster to see whether that would solve the problem, and now the whole clustering system is broken:

Code:
root@srv-pve1:~# tail -f /var/log/syslog
Nov 28 09:39:54 srv-pve1 pmxcfs[2081]: [quorum] crit: quorum_initialize failed: 2
Nov 28 09:39:54 srv-pve1 pmxcfs[2081]: [confdb] crit: cmap_initialize failed: 2
Nov 28 09:39:54 srv-pve1 pmxcfs[2081]: [dcdb] crit: cpg_initialize failed: 2
Nov 28 09:39:54 srv-pve1 pmxcfs[2081]: [status] crit: cpg_initialize failed: 2

It looks like the routing on the private bond0 interface (meshed networking between the nodes) is not working properly for the clustering.

On the other hand, the Ceph cluster works perfectly.

I've tried to stop the bond, but the cluster didn't seem to switch over to the ring 1 interfaces, and Ceph went into a degraded state.
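For what it's worth, these are the checks I intend to run next on each node (assuming the usual systemd units are in place):

Code:
systemctl status corosync pve-cluster               # are the cluster services actually running?
journalctl -u corosync -b --no-pager | tail -n 50   # recent corosync messages since boot
corosync-cfgtool -s                                 # ring status / bound addresses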

Here is the hosts file if it can help:

Code:
root@srv-pve1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.101 srv-pve1.lechesnay.local srv-pve1 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

10.0.0.101 srv-pve1
10.0.0.102 srv-pve2
10.0.0.103 srv-pve3
 
