Node(s) exit cluster after a while on a two-node cluster

scotow

New Member
Oct 22, 2019
Hi,

I have set up a two-node cluster following the wiki guidance, but I keep running into the same problem.

I'm able to see both nodes in the web GUI and spawn VMs; everything works fine right after my initial configuration.

But after a while (a few hours), one or both of the nodes seem to leave the cluster. Browsing the GUI from node1, node2 is marked as 'offline' (red), but node2's stats are correctly displayed, except the graphs, which are empty. I can even connect to node2's VM consoles from node1's GUI.

Connecting to node2's GUI exposes the exact same problem, but reversed.

My corosync config is the following:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: rd-srv-front-pmx01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.26.52.31
  }
  node {
    name: rd-srv-front-pmx02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.26.52.32
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
}

totem {
  cluster_name: rd-front
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Please note that I added the two_node and wait_for_all flags.

Here are the versions of pve-related packages installed on both nodes:

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

pvecm status returns the following output on node1:

Code:
Quorum information
------------------
Date:             Tue Oct 22 11:23:55 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/654004
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.26.52.31 (local)

and a different Membership information on node2:

Code:
Quorum information
------------------
Date:             Tue Oct 22 11:25:42 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/828872
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.26.52.32 (local)

As we can see, the two nodes no longer belong to the same cluster: each one only sees itself as a member.

While setting up the cluster (I re-installed it a few times), everything worked fine and pvecm status showed the same membership on both nodes, until the problem occurred.

(More information below, but waiting for moderation approval)
 
The corosync logs on node1 are the following (this was during the night):

Code:
Oct 21 17:14:16 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:653992) was formed. Members
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:14:17 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:653996) was formed. Members
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:20:27 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:654000) was formed. Members
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 0 received
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[2]: 1 2
Oct 21 17:22:10 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 17:23:15 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Retransmit List: 8a
Oct 21 18:16:05 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Retransmit List: 1a20
Oct 21 18:19:03 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Token has not been received in 750 ms
Oct 21 18:19:03 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] A new membership (1:654004) was formed. Members left: 2
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [TOTEM ] Failed to receive the leave message. failed: 2
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [CPG   ] downlist left_list: 1 received
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [QUORUM] Members[1]: 1
Oct 21 18:19:04 rd-srv-front-pmx01 corosync[1262]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 21 23:44:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 21 23:44:16 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 21 23:44:16 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 01:49:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:49:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 02:11:06 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 02:11:07 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 02:11:07 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 22 04:13:29 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 has no active links
Oct 22 04:13:30 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 04:13:30 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
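The knet link to node2 flapped like this several times during the night. A quick way to count the flaps (the sample lines below are copied from the log above just for illustration; on the node itself I'd pipe `journalctl -u corosync` instead of the here-document):

```shell
# Count how many times corosync reported the knet link to host 2 going down.
# In practice: journalctl -u corosync | grep -c 'link: host: 2 link: 0 is down'
flaps=$(grep -c 'link: host: 2 link: 0 is down' <<'EOF'
Oct 22 01:15:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
Oct 22 01:15:15 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] rx: host: 2 link: 0 is up
Oct 22 01:49:14 rd-srv-front-pmx01 corosync[1262]:   [KNET  ] link: host: 2 link: 0 is down
EOF
)
echo "$flaps"   # prints 2 for this sample
```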
 
While on node2, corosync seems to be more active:

Code:
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830544) was formed. Members
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:35 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830548) was formed. Members
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:36 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830552) was formed. Members
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:37 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830556) was formed. Members
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:39 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830560) was formed. Members
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:40 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830564) was formed. Members
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:42 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830568) was formed. Members
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:43 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830572) was formed. Members
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:45 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830576) was formed. Members
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:46 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830580) was formed. Members
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:47 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830584) was formed. Members
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:49 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830588) was formed. Members
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:50 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [TOTEM ] A new membership (2:830592) was formed. Members
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [CPG   ] downlist left_list: 0 received
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [QUORUM] Members[1]: 2
Oct 22 11:35:52 rd-srv-front-pmx02 corosync[1218]:   [MAIN  ] Completed service synchronization, ready to provide service.

As we can see, corosync on node2 produces far more log entries; the logs from before the incident have already rotated out, so I can't see them.
(I don't even know whether corosync is the problem here.)

/etc/hosts is correctly configured on both nodes:

Code:
127.0.0.1 localhost.localdomain localhost
10.26.52.31 rd-srv-front-pmx01.priv.x.fr rd-srv-front-pmx01
10.26.52.32 rd-srv-front-pmx02.priv.x.fr rd-srv-front-pmx02

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
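For completeness, here is how one can verify that each hostname resolves to the ring0_addr from corosync.conf (parsing a copy of the entries above; `getent hosts rd-srv-front-pmx01` on the node gives the same answer):

```shell
# Extract the address for node1's FQDN from the hosts entries and compare
# it to the ring0_addr configured in corosync.conf (10.26.52.31).
addr=$(awk '$2 == "rd-srv-front-pmx01.priv.x.fr" {print $1}' <<'EOF'
10.26.52.31 rd-srv-front-pmx01.priv.x.fr rd-srv-front-pmx01
10.26.52.32 rd-srv-front-pmx02.priv.x.fr rd-srv-front-pmx02
EOF
)
echo "$addr"   # prints 10.26.52.31
```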

I tried the following (possible) solutions that I found on the forum, but without success:

https://forum.proxmox.com/threads/pve-5-4-11-corosync-3-x-major-issues.56124/
https://forum.proxmox.com/threads/cluster-nodes-offline-but-working.42907/
https://forum.proxmox.com/threads/a-lot-of-cluster-fails-after-upgrade-5-4-to-6-0-4.56425/
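One mitigation sometimes suggested for the 'Token has not been received' messages is raising the totem token timeout. I'm not sure it applies here, but the change would look roughly like this in /etc/pve/corosync.conf (the value is illustrative, and config_version has to be bumped for the change to propagate):

```
totem {
  cluster_name: rd-front
  config_version: 3
  token: 10000
  ...
}
```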
 
Hi,

How can I update my repositories to be able to see those versions?

I upgraded all packages on both nodes before my first post, but the versions you mentioned are not available for update/upgrade.

Do I have to use something like an unstable repo?
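For reference, I currently only have the default repositories configured. If the newer packages sit in the no-subscription repository, I assume the entry would be something like this (guessing the exact file path):

```
# /etc/apt/sources.list.d/pve-no-subscription.list (assumed path)
deb http://download.proxmox.com/debian/pve buster pve-no-subscription
```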
 