[SOLVED] cluster issues after adding node

XelNaha

hi all,

I've been scratching my head over a strange issue with a cluster.

I originally had two nodes:

node1
node2

These two nodes are not in the same subnet. They are in a cluster and both work fine. Last night I added a new cluster member, node3, and since then I've observed some really strange behaviour. node1 and node3 are in the same subnet.

The behaviour I see is:

when node1, node2 and node3 are all on, I lose connectivity to node3 and the graphs don't update
when node1 and node2 are on, but node3 isn't, there are no issues
when node1 and node3 are on, but node2 isn't, I don't lose connectivity and the graphs update on all sides


There seems to be a conflict between node2 and node3. They do not share an IP address (I checked), and as far as I can see they have end-to-end IP connectivity. The known keys are seemingly correct; I also updated them with pvecm updatecerts.

Even with node2 up and running, shutting down its cluster service makes node3 accessible to me again.

I had a look at the documentation, but it doesn't really point to where to start digging. Any help would be much appreciated.
 
1. Power on all 3 nodes and share the output of the pvecm status and pveversion -v commands from every node.
2. Also make sure UDP ports 5401-5404 are allowed between the nodes; these ports are used by corosync.
3. If you are not on corosync 3, i.e. you are on an older version of Proxmox, make sure multicast is allowed.
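If it helps, step 1 can be scripted. This is only a sketch: the node names are placeholders, it assumes root SSH between the cluster nodes, and `run=echo` makes it print the commands rather than execute them (clear it on a real cluster):

```shell
#!/bin/sh
# Collect `pvecm status` and `pveversion -v` from every node in one go.
# Placeholder node names; assumes root SSH access between cluster nodes.
run=echo   # set run="" to actually execute the SSH commands

for n in node1 node2 node3; do
    echo "===== $n ====="
    $run ssh "root@$n" 'pvecm status; pveversion -v'
done
```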
 
1. See below.

2. Yes, UDP is allowed between the nodes, and things work well when one of the two nodes isn't up. There are currently no port or network restrictions in place between any of the nodes.

3. I checked this, but no: it's the newest version and doesn't seem to use multicast.




**********************************************
node 1
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:37:27 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254 (local)
0x00000002          1 10.20.10.254
0x00000003          1 10.0.10.253


proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1





**********************************************
Node2
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:36:39 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254
0x00000002          1 10.20.10.254 (local)
0x00000003          1 10.0.10.253



proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-network-perl: 0.5-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1



**********************************************
node 3
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:39:01 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254
0x00000002          1 10.20.10.254
0x00000003          1 10.0.10.253 (local)

proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Right now all nodes are showing the cluster as OK, so it seems there's no issue with corosync. Now just quickly check two things on all nodes:

cat /etc/hostname
cat /etc/hosts
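What this check is really after: each node's hostname must resolve to the address corosync should use. With the addresses from the pvecm output above, a consistent /etc/hosts would look roughly like this (the hostnames here are made-up examples):

```
127.0.0.1     localhost
10.0.10.254   node1.mydomain node1
10.20.10.254  node2.mydomain node2
10.0.10.253   node3.mydomain node3
```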
 
They all check out, with the correct hosts.

And whilst they are OK right now, the issue seems to be intermittent as well, and only occurs when the new node is on.

Just now both nodes disappeared:

Code:
pvecm status
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:52:56 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.b62
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:            

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254 (local)

I can make either node re-appear by shutting down the clustering on either node.

I'm going through the logs at the moment to see if I can find anything useful.
 

Attachments

  • node3.PNG
Can you check the corosync logs, and can you ping the other nodes?
Also, did you notice there is a change in the ring ID? It seems the configuration in /etc/corosync/corosync.conf is not the same as /etc/pve/corosync.conf.
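On a real node the comparison is just `diff /etc/pve/corosync.conf /etc/corosync/corosync.conf`. As a sketch of what a mismatch looks like, here two temporary files stand in for the cluster-wide and local copies (a stale local copy typically shows an older config_version):

```shell
#!/bin/sh
# Stand-ins for /etc/pve/corosync.conf (cluster-wide, managed by pmxcfs)
# and /etc/corosync/corosync.conf (the local copy corosync actually reads).
pve_cfg=$(mktemp); local_cfg=$(mktemp)
printf 'totem {\n  config_version: 3\n}\n' > "$pve_cfg"
printf 'totem {\n  config_version: 2\n}\n' > "$local_cfg"

# diff exits non-zero when the files differ; that's the interesting case here
cfg_diff=$(diff "$pve_cfg" "$local_cfg" || true)
echo "$cfg_diff"

rm -f "$pve_cfg" "$local_cfg"
```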
 
Whilst all this is happening I can ping the other hosts, both before and after I power up the node.

Interestingly, I do get this message:

Code:
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] link: host: 2 link: 0 is down
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 has no active links
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [KNET ] rx: host: 2 link: 0 is up
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [TOTEM ] Token has not been received in 2737 ms
May 12 23:15:48 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
May 12 23:15:48 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 34 35 36
May 12 23:15:49 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 37
May 12 23:15:50 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:51 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:52 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 cb cc
May 12 23:15:55 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:56 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:57 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:57 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:58 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:59 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:00 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:01 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 39
May 12 23:16:04 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 39
May 12 23:16:05 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:06 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 39

But even so the node is reachable via the network just fine, and it starts almost exactly at the time one of the other nodes comes online.

see here
 

Attachments

  • overview.PNG
  • The warning message may indicate that the token timeout in /etc/corosync/corosync.conf needs to be increased, if the message occurs frequently.
  • The token_warning attribute can be set in the totem section of the /etc/corosync/corosync.conf file.
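For reference, those settings live in the totem section of the config. A rough example follows; the values are illustrative, not recommendations. On Proxmox, edit /etc/pve/corosync.conf (so pmxcfs distributes it) and bump config_version with every change:

```
totem {
  cluster_name: xxx
  config_version: 4      # must be incremented on every edit
  token: 10000           # token timeout in milliseconds
  token_warning: 75      # warn when this % of the token timeout has elapsed
  # ...other existing settings unchanged...
}
```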
 
Well, I tried setting the token timeout to 10000 ms, but that didn't do much. I'm thinking of removing and rejoining the node to see if that fixes the issue.

Reading the documentation, it says you can't really rejoin the cluster with a node that was previously there, but it doesn't say why. Is it actually possible to remove a node and then rejoin it, by purging certs and other things?
 
Yes, follow the removal procedure:
stop the cluster service and corosync, then delete the node with the pvecm command. If HA is configured, also stop the CRM and LRM. The services need to be stopped on the node being deleted, and the delnode must be run from the remaining active nodes after those services are stopped.

After that, just update the certs: pvecm updatecerts -f
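As a sketch of that procedure in command form (the node name is a placeholder, `run=echo` only prints the commands so clear it to execute for real, and note that the delnode has to be run from a remaining quorate node, not from the node being removed):

```shell
#!/bin/sh
# Sketch of removing a node so it can later be rejoined.
node=node3      # placeholder: the node being removed
run=echo        # set run="" to actually execute

# On $node itself: stop HA services (if configured), then the cluster stack
$run systemctl stop pve-ha-lrm pve-ha-crm
$run systemctl stop pve-cluster corosync

# On one of the REMAINING quorate nodes: remove it from the cluster
$run pvecm delnode "$node"

# After the node is rejoined: refresh certificates and known hosts
$run pvecm updatecerts -f
```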
 
So I removed the node and re-added it, and the issue seems to be gone now. Something must have gone wrong when the node first joined the cluster; unfortunately I'm not sure exactly where the issue was.
 
