[SOLVED] cluster issues after adding node

XelNaha

hi all,

I've been scratching my head over a strange issue with a cluster.

I originally had two nodes:

node1
node2

These two nodes are not in the same subnet. They are in a cluster and both work fine. Last night I added a new cluster member, node3, and since then I've observed some really strange behaviour. node1 and node3 are in the same subnet.

The behaviour I see is:

when node1, node2 and node3 are all on, I lose connectivity to node3 and the graphs don't update
when node1 and node2 are on, but node3 isn't, there are no issues
when node1 and node3 are on, but node2 isn't, I don't lose connectivity and the graphs update on all sides


There seems to be a conflict between node2 and node3. They do not share an IP address (I checked), and as far as I can see they have end-to-end IP connectivity. The known keys are seemingly correct; I also updated them with pvecm updatecerts.

Even with node2 up and running, shutting down its cluster service makes node3 accessible to me again.

I had a look at the documentation, but it doesn't really point to where to start digging. Any help would be much appreciated.
 
1. Power on all 3 nodes and share the output of the pvecm status and pveversion -v commands from every node.
2. Also make sure UDP ports 5401-5404 are allowed between the nodes; these ports are used by corosync.
3. If you are not on corosync 3, i.e. you are on an older version of Proxmox, make sure multicast is allowed.
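If it helps, step 1 can be scripted. This is only a sketch: the node names are placeholders, it assumes root SSH between the cluster nodes, and `run=echo` makes it print the commands rather than execute them (clear it on a real cluster):

```shell
#!/bin/sh
# Collect `pvecm status` and `pveversion -v` from every node in one go.
# Placeholder node names; assumes root SSH access between cluster nodes.
run=echo   # set run="" to actually execute the SSH commands

for n in node1 node2 node3; do
    echo "===== $n ====="
    $run ssh "root@$n" 'pvecm status; pveversion -v'
done
```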
 
1. See below.

2. Yes, UDP is allowed between the nodes, and things work well when one of the two nodes isn't up. There are currently no port or network restrictions in place between any of the nodes.

3. I checked this, but no: it's the newest version and doesn't seem to use multicast.




**********************************************
node 1
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:37:27 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254 (local)
0x00000002          1 10.20.10.254
0x00000003          1 10.0.10.253


proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1





**********************************************
Node2
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:36:39 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254
0x00000002          1 10.20.10.254 (local)
0x00000003          1 10.0.10.253



proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-network-perl: 0.5-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1



**********************************************
node 3
**********************************************
Code:
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:39:01 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.b12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254
0x00000002          1 10.20.10.254
0x00000003          1 10.0.10.253 (local)

proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Right now all nodes are showing the cluster as OK, so it seems there's no issue with corosync. Now just quickly check two things on all nodes:

cat /etc/hostname
cat /etc/hosts
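What this check is really after: each node's hostname must resolve to the address corosync should use. With the addresses from the pvecm output above, a consistent /etc/hosts would look roughly like this (the hostnames here are made-up examples):

```
127.0.0.1     localhost
10.0.10.254   node1.mydomain node1
10.20.10.254  node2.mydomain node2
10.0.10.253   node3.mydomain node3
```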
 
They all check out, with the correct hosts.

And whilst they are OK right now, the issue seems to be intermittent as well, and only occurs when the new node is on.

Just now both nodes disappeared:

Code:
pvecm status
Cluster information
-------------------
Name:             xxx
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May 12 16:52:56 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.b62
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:            

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.254 (local)

I can make either node re-appear by shutting down the clustering on either node.

I'm going through the logs at the moment to see if I can find anything useful.
 

Attachments

  • node3.PNG
Can you check the corosync logs, and can you ping the other nodes?
Also, did you notice there is a change in the ring ID? It seems the configuration in /etc/corosync/corosync.conf is not the same as /etc/pve/corosync.conf.
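On a real node the comparison is just `diff /etc/pve/corosync.conf /etc/corosync/corosync.conf`. As a sketch of what a mismatch looks like, here two temporary files stand in for the cluster-wide and local copies (a stale local copy typically shows an older config_version):

```shell
#!/bin/sh
# Stand-ins for /etc/pve/corosync.conf (cluster-wide, managed by pmxcfs)
# and /etc/corosync/corosync.conf (the local copy corosync actually reads).
pve_cfg=$(mktemp); local_cfg=$(mktemp)
printf 'totem {\n  config_version: 3\n}\n' > "$pve_cfg"
printf 'totem {\n  config_version: 2\n}\n' > "$local_cfg"

# diff exits non-zero when the files differ; that's the interesting case here
cfg_diff=$(diff "$pve_cfg" "$local_cfg" || true)
echo "$cfg_diff"

rm -f "$pve_cfg" "$local_cfg"
```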
 
Whilst all this is happening I can ping the other hosts, both before and after I power up the node.

Interestingly, I do get this message:

Code:
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] link: host: 2 link: 0 is down
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 12 23:15:40 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 has no active links
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [KNET ] rx: host: 2 link: 0 is up
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 12 23:15:42 uk-wak-hv01 corosync[3471]: [TOTEM ] Token has not been received in 2737 ms
May 12 23:15:48 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
May 12 23:15:48 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 34 35 36
May 12 23:15:49 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 37
May 12 23:15:50 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:51 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:52 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 cb cc
May 12 23:15:55 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:56 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:57 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:57 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:58 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:15:59 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:00 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:01 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 39
May 12 23:16:04 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2b 2c 2d 2e 2f 30 31 32 34 36 37 38 39
May 12 23:16:05 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 2a 2b 2c 2d 2e 2f 30 31 32 34 36 37 38
May 12 23:16:06 uk-wak-hv01 corosync[3471]: [TOTEM ] Retransmit List: 29 2a 2b 2c 2d 2e 2f 30 31 32 34 36 39

But even so the node is reachable via the network just fine, and it starts almost exactly at the time one of the other nodes comes online.

see here
 

Attachments

  • overview.PNG
  • The warning message may indicate that the token timeout in /etc/corosync/corosync.conf needs to be increased, if the message occurs frequently.
  • The token_warning attribute can be set in the totem section of the /etc/corosync/corosync.conf file.
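For reference, those settings live in the totem section of the config. A rough example follows; the values are illustrative, not recommendations. On Proxmox, edit /etc/pve/corosync.conf (so pmxcfs distributes it) and bump config_version with every change:

```
totem {
  cluster_name: xxx
  config_version: 4      # must be incremented on every edit
  token: 10000           # token timeout in milliseconds
  token_warning: 75      # warn when this % of the token timeout has elapsed
  # ...other existing settings unchanged...
}
```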
 
Well, I tried setting the token timeout to 10000 ms, but that didn't do much. I'm thinking of removing and rejoining the node to see if that fixes the issue.

Reading the documentation, it says you can't really rejoin the cluster with a node that was previously there, but it doesn't say why. Is it actually possible to remove a node and then rejoin it, by purging certs and other things?
 
Yes, follow the removal procedure:
stop the cluster service and corosync, then delete the node with the pvecm command. If HA is configured, also stop the CRM and LRM. The services need to be stopped on the node being deleted, and the delnode must be run from the remaining active nodes after those services are stopped.

After that, just update the certs: pvecm updatecerts -f
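As a sketch of that procedure in command form (the node name is a placeholder, `run=echo` only prints the commands so clear it to execute for real, and note that the delnode has to be run from a remaining quorate node, not from the node being removed):

```shell
#!/bin/sh
# Sketch of removing a node so it can later be rejoined.
node=node3      # placeholder: the node being removed
run=echo        # set run="" to actually execute

# On $node itself: stop HA services (if configured), then the cluster stack
$run systemctl stop pve-ha-lrm pve-ha-crm
$run systemctl stop pve-cluster corosync

# On one of the REMAINING quorate nodes: remove it from the cluster
$run pvecm delnode "$node"

# After the node is rejoined: refresh certificates and known hosts
$run pvecm updatecerts -f
```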
 
So I removed the node and re-added it, and the issue seems to be gone now. Something must have gone wrong when the node first joined the cluster; unfortunately I'm not sure exactly where the issue was.
 
