ZFS replication via migration network - sporadic problems finding target IP address

rholighaus

Well-Known Member
We have a cluster of 2 production nodes and 2 PVE nodes that act only as ZFS replication destinations.
As recommended, we have recently created a separate migration network and entered it in /etc/pve/datacenter.cfg:

Code:
# use dedicated migration network
migration: secure,network=10.1.2.0/24

This seems to work in most cases.
For one target host (carrier-3), every few hours we get emails reporting migration errors:

Code:
failed to get ip for node 'carrier-3' in network '10.1.2.0/24'

The host has a network interface with the address 10.1.2.4/24 and is pingable all the time.
The next replication normally works well again.

Any idea why this may happen and what we could do to fix it?
 
the call to retrieve an IP in a specific network does the following:
- get default IP of other node via pmxcfs' nodelist (this gets recorded on startup of pmxcfs on the other node, by resolving the node's hostname)
- connect via SSH to that IP and call "pvecm mtunnel -migration_network $NETWORK -get_migration_ip"
- this calls "ip address show to $NETWORK up" and selects the IP addresses from the relevant output lines
- if a single IP was found, it gets printed, otherwise an error gets printed (which might not be handed back up the stack properly?)
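
to narrow down where in that chain it breaks, you can also run the last two steps by hand from one of the source nodes (fill in carrier-3's default/management IP yourself):

Code:
# what the source node runs over SSH against the target's default IP
ssh root@<carrier-3 default IP> pvecm mtunnel -migration_network '10.1.2.0/24' -get_migration_ip
# what that in turn runs locally on carrier-3 to select the migration IP
ip address show to 10.1.2.0/24 up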

are all the nodes resolvable via /etc/hosts on all nodes? or do you rely on DNS for this?
maybe the migration interface is up, but SSH via the regular interface sporadically fails?
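
to rule out name resolution and plain SSH reachability, something along these lines on a source node should do (BatchMode makes a hanging key/password prompt fail instead of blocking):

Code:
# how does this node resolve carrier-3? (checks /etc/hosts first, then DNS, per nsswitch.conf)
getent hosts carrier-3
# does SSH over the regular interface answer promptly?
ssh -o BatchMode=yes root@carrier-3 true && echo "ssh ok"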
 
Thank you Fabian! Is there any way to make PVE log these actions so I can start debugging? I had 5 incidents since midnight. Is there a timeout that this resolving could be running into? The only affected target node is a (slow) HP Microserver with 4 spinning WD Red disks...
 
not directly. you can try running "pvecm mtunnel -migration_network '10.1.2.0/24' -get_migration_ip" repeatedly on the affected node and log the output? e.g., in a loop with a 30s delay between calls, until it fails once:
Code:
while :; do date; pvecm mtunnel -migration_network '10.1.2.0/24' -get_migration_ip || break; sleep 30; done
 
Hi Fabian, thanks for that tip.

I think I found the culprit:

Code:
Thu Sep 26 11:22:08 CEST 2019

ip: '10.1.2.4'

Thu Sep 26 11:22:39 CEST 2019

no quorum

Thu Sep 26 11:23:10 CEST 2019

ip: '10.1.2.4'

Here's my config:

Code:
root@carrier-3:/var/log# pveversion --verbose
proxmox-ve: 5.4-2 (running kernel: 4.15.18-20-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-8
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-17-pve: 4.15.18-43
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-55
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-6
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-40
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

The only traffic to this node is replication, which has now been moved to the separate network (hence this issue).
I wonder why we are temporarily losing quorum (it never lasts for long).
 
check the logs of pve-cluster and corosync:

Code:
journalctl -b -u corosync -u pve-cluster
 
Hi Fabian,

I seem to get regular temporary failures on that one node:

Code:
Sep 27 05:51:39 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:51:53 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:52:45 carrier-3 corosync[3887]: error   [TOTEM ] FAILED TO RECEIVE
Sep 27 05:52:45 carrier-3 corosync[3887]:  [TOTEM ] FAILED TO RECEIVE
Sep 27 05:52:48 carrier-3 corosync[3887]: notice  [TOTEM ] A new membership (192.168.1.18:485132) was formed. Members left: 4 2 1
Sep 27 05:52:48 carrier-3 corosync[3887]: notice  [TOTEM ] Failed to receive the leave message. failed: 4 2 1
Sep 27 05:52:48 carrier-3 corosync[3887]:  [TOTEM ] A new membership (192.168.1.18:485132) was formed. Members left: 4 2 1
Sep 27 05:52:48 carrier-3 corosync[3887]: warning [CPG   ] downlist left_list: 3 received
Sep 27 05:52:48 carrier-3 corosync[3887]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 27 05:52:48 carrier-3 corosync[3887]: notice  [QUORUM] Members[1]: 3
Sep 27 05:52:48 carrier-3 corosync[3887]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 27 05:52:48 carrier-3 corosync[3887]:  [TOTEM ] Failed to receive the leave message. failed: 4 2 1
Sep 27 05:52:48 carrier-3 corosync[3887]:  [CPG   ] downlist left_list: 3 received
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] notice: members: 3/3697
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [status] notice: members: 3/3697
Sep 27 05:52:48 carrier-3 corosync[3887]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 27 05:52:48 carrier-3 corosync[3887]:  [QUORUM] Members[1]: 3
Sep 27 05:52:48 carrier-3 corosync[3887]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [status] notice: node lost quorum
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] notice: cpg_send_message retried 1 times
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] crit: received write while not quorate - trigger resync
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] crit: leaving CPG group
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] notice: start cluster connection
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] notice: members: 3/3697
Sep 27 05:52:48 carrier-3 pmxcfs[3697]: [dcdb] notice: all data is up to date
Sep 27 05:52:56 carrier-3 corosync[3887]: notice  [TOTEM ] A new membership (192.168.1.4:485148) was formed. Members joined: 4 2 1
Sep 27 05:52:56 carrier-3 corosync[3887]:  [TOTEM ] A new membership (192.168.1.4:485148) was formed. Members joined: 4 2 1
Sep 27 05:52:56 carrier-3 corosync[3887]: warning [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]:  [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]:  [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]: warning [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]:  [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]: warning [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 corosync[3887]:  [CPG   ] downlist left_list: 0 received
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: members: 2/5026, 3/3697
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: starting data syncronisation
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: members: 2/5026, 3/3697, 4/47153
Sep 27 05:52:56 carrier-3 corosync[3887]: notice  [QUORUM] This node is within the primary component and will provide service.
Sep 27 05:52:56 carrier-3 corosync[3887]: notice  [QUORUM] Members[4]: 4 2 3 1
Sep 27 05:52:56 carrier-3 corosync[3887]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: members: 2/5026, 3/3697
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: starting data syncronisation
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: members: 1/11601, 2/5026, 3/3697, 4/47153
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: members: 2/5026, 3/3697, 4/47153
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: members: 1/11601, 2/5026, 3/3697, 4/47153
Sep 27 05:52:56 carrier-3 corosync[3887]:  [QUORUM] This node is within the primary component and will provide service.
Sep 27 05:52:56 carrier-3 corosync[3887]:  [QUORUM] Members[4]: 4 2 3 1
Sep 27 05:52:56 carrier-3 corosync[3887]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: node has quorum
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: received sync request (epoch 1/11601/000034B9)
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: received sync request (epoch 1/11601/00003133)
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: received all states
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: leader is 1/11601
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: synced members: 1/11601, 2/5026, 4/47153
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: waiting for updates from leader
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: dfsm_deliver_queue: queue length 1
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: update complete - trying to commit (got 5 inode updates)
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: all data is up to date
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 1
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: received all states
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: all data is up to date
Sep 27 05:52:56 carrier-3 pmxcfs[3697]: [status] notice: dfsm_deliver_queue: queue length 13
Sep 27 05:53:39 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:53:39 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:53:53 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:54:39 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:54:39 carrier-3 pmxcfs[3697]: [status] notice: received log
Sep 27 05:54:52 carrier-3 corosync[3887]: error   [TOTEM ] FAILED TO RECEIVE
Sep 27 05:54:52 carrier-3 corosync[3887]:  [TOTEM ] FAILED TO RECEIVE
Sep 27 05:54:54 carrier-3 corosync[3887]: notice  [TOTEM ] A new membership (192.168.1.18:485152) was formed. Members left: 4 2 1
Sep 27 05:54:54 carrier-3 corosync[3887]: notice  [TOTEM ] Failed to receive the leave message. failed: 4 2 1
Sep 27 05:54:54 carrier-3 corosync[3887]: warning [CPG   ] downlist left_list: 3 received
Sep 27 05:54:54 carrier-3 corosync[3887]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 27 05:54:54 carrier-3 corosync[3887]: notice  [QUORUM] Members[1]: 3

Here is my cluster configuration:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: carrier
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.21
  }
  node {
    name: carrier-1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.4
  }
  node {
    name: carrier-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.11
  }
  node {
    name: carrier-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.18
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: shk-proxmox
  config_version: 7
  interface {
    bindnetaddr: 192.168.1.21
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

The node with the failures is the one with nodeid 3.
 
sounds like multicast is not stable between that node and the others. check with omping (see the Admin Guide that comes with your installation, accessible via the "Documentation" link in the GUI)
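
for reference, the commands from the Admin Guide look roughly like this (run them on all four nodes in parallel; node names taken from your corosync.conf): a short burst test plus a longer ~10 minute run:

Code:
omping -c 10000 -i 0.001 -F -q carrier carrier-1 carrier-2 carrier-3
omping -c 600 -i 1 -q carrier carrier-1 carrier-2 carrier-3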
 
