Node(s) exit cluster after a while on a two-node cluster

scotow

New Member
Oct 22, 2019
Hi,

I have set up a two-node cluster following the wiki guidance, but I keep running into the same problem.

I'm able to see both nodes in the web GUI and spawn VMs; everything works fine right after my initial configuration.

But after a while (a few hours), one or both of the nodes seem to leave the cluster. Browsing the GUI from node1, node2 is marked as 'offline' (red), but node2's stats are correctly displayed, except the graphs, which are empty. I can even connect to node2's VM consoles from node1's GUI.

Connecting to node2's GUI exposes the exact same problem, but reversed.

My corosync config is the following:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: rd-srv-front-pmx01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.26.52.31
  }
  node {
    name: rd-srv-front-pmx02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.26.52.32
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
}

totem {
  cluster_name: rd-front
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Please note that I added the two_node and wait_for_all flags.

Here are the versions of pve-related packages installed on both nodes:

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

pvecm status returns the following output on node1:

Code:
Quorum information
------------------
Date:             Tue Oct 22 11:23:55 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/654004
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.26.52.31 (local)

and a different Membership information on node2:

Code:
Quorum information
------------------
Date:             Tue Oct 22 11:25:42 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/828872
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.26.52.32 (local)

As we can see, the two nodes no longer belong to the same cluster: each one only sees itself as a member.

While setting up the cluster (I re-installed it a few times), everything worked fine and pvecm status showed the same membership on both nodes, until the problem occurred.

(More information below, but waiting for moderation approval)
 
The corosync logs on node1 are the following (this was during the night):

Code:
Oct 21 17:14:16 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:653992) was formed. Members
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:653996) was formed. Members
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:654000) was formed. Members
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:23:15 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Retransmit List: 8a
Oct 21 18:16:05 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Retransmit List: 1a20
Oct 21 18:19:03 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 18:19:03 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:654004) was formed. Members left: 2
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Failed to receive the leave message. failed: 2
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 1 received
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[1]: 1
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 21 23:44:16 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 21 23:44:16 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 01:49:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:49:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 02:11:07 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 02:11:07 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 04:13:30 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 04:13:30 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
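The knet link to node2 flapped like this several times during the night. A quick way to count the flaps (the sample lines below are copied from the log above just for illustration; on the node itself I'd pipe `journalctl -u corosync` instead of the here-document):

```shell
# Count how many times corosync reported the knet link to host 2 going down.
# In practice: journalctl -u corosync | grep -c 'link: host: 2 link: 0 is down'
flaps=$(grep -c 'link: host: 2 link: 0 is down' <<'EOF'
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
EOF
)
echo "$flaps"   # prints 2 for this sample
```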
 
While on node2, corosync seems to be more active:

Code:
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830544) was formed. Members
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830548) was formed. Members
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830552) was formed. Members
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830556) was formed. Members
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830560) was formed. Members
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830564) was formed. Members
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830568) was formed. Members
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830572) was formed. Members
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830576) was formed. Members
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830580) was formed. Members
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830584) was formed. Members
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830588) was formed. Members
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830592) was formed. Members
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.

As we can see, corosync on node2 produces far more log entries; the logs from before the incident have already rotated out, so I can't see them.
(I don't even know whether corosync is the problem here.)

/etc/hosts is correctly configured on both nodes:

Code:
127.0.0.1 localhost.localdomain localhost
10.26.52.31 rd-srv-front-pmx01.priv.x.fr rd-srv-front-pmx01
10.26.52.32 rd-srv-front-pmx02.priv.x.fr rd-srv-front-pmx02

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
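For completeness, here is how one can verify that each hostname resolves to the ring0_addr from corosync.conf (parsing a copy of the entries above; `getent hosts rd-srv-front-pmx01` on the node gives the same answer):

```shell
# Extract the address for node1's FQDN from the hosts entries and compare
# it to the ring0_addr configured in corosync.conf (10.26.52.31).
addr=$(awk '$2 == "rd-srv-front-pmx01.priv.x.fr" {print $1}' <<'EOF'
10.26.52.31 rd-srv-front-pmx01.priv.x.fr rd-srv-front-pmx01
10.26.52.32 rd-srv-front-pmx02.priv.x.fr rd-srv-front-pmx02
EOF
)
echo "$addr"   # prints 10.26.52.31
```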

I tried the following (possible) solutions that I found on the forum, but without success:

https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/
https://forum.proxmox.com/threads/cluster-nodes-offline-but-working.42907/
https://forum.proxmox.com/threads/a-lot-of-cluster-fails-after-upgrade-5-4-to-6-0-4.56425/
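One mitigation sometimes suggested for the 'Token has not been received' messages is raising the totem token timeout. I'm not sure it applies here, but the change would look roughly like this in /etc/pve/corosync.conf (the value is illustrative, and config_version has to be bumped for the change to propagate):

```
totem {
  cluster_name: rd-front
  config_version: 3
  token: 10000
  ...
}
```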
 
Hi,

How can I update my repositories to be able to see those versions?

I upgraded all packages on both nodes before my first post, but the versions you mentioned are not available for update/upgrade.

Do I have to use something like an unstable repo?
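For reference, I currently only have the default repositories configured. If the newer packages sit in the no-subscription repository, I assume the entry would be something like this (guessing the exact file path):

```
# /etc/apt/sources.list.d/pve-no-subscription.list (assumed path)
deb http://download.proxmox.com/debian/pve buster pve-no-subscription
```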
 