Rogue Cluster Node

Carlos Gomes

Active Member
Jan 23, 2017
Chur, Switzerland
Hello Everyone!

I have a 4-server cluster that had been running fine for 700+ days, but we eventually need to update it, mainly because we want to use PBS, and since we are on version 5.4.x we want to roll the servers up to version 6.

I set up an offline lab with the same configuration as the nodes to evaluate whether I could update them one by one, following the upgrade steps from the Proxmox wiki (both 5-to-6 and 6-to-7 were tested this way), updating only a single node at a time.

Anyhow, the docs include a step to update corosync to version 3 on all nodes, even the ones that won't go from 5 to 6 yet, so that the cluster stays healthy. The update was done this way:
  • all nodes were updated to the latest packages available for version 5.4 itself, ending on 5.4-15
  • corosync was updated first on node4, where I had all VMs and configs backed up
  • corosync was then updated on the other 3 nodes in parallel, as suggested (sketch of the corosync step below)
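
For reference, the corosync 3 step on the 5.4 nodes followed the wiki's dedicated stretch repository; roughly this, from memory, so please double-check against the current 5-to-6 wiki page:

Code:
# add the dedicated corosync 3 repository for Proxmox VE 5.x (stretch)
echo "deb http://download.proxmox.com/debian/corosync-3/ stretch main" > /etc/apt/sources.list.d/corosync3.list
apt update
# download first so the actual upgrade can be done on all nodes close together
apt dist-upgrade --download-only
apt dist-upgrade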

After this the cluster went wild: lots of errors, machines and nodes showing question marks, and each node only reachable by logging into its GUI separately. The other 3 nodes all have crucial machines up and running, so I could not take them down or reboot the whole system, apart from node4, which is now updated to version 6.

After some tests and manipulating the services (stop/restart of pve-cluster, corosync, pve-ha-lrm, pve-ha-crm, pveproxy, pvedaemon, pvestatd and pve-firewall), some nodes came up while realigning the cluster, but in the process one would eventually end up with different ring IDs, or think it was in a new cluster... After a lot of back and forth on this we narrowed it down to node2 in particular being the issue, after some changes:

The node updated from 5 to 6 ended up with corosync version 3.1.1, so we made sure all nodes had consistent versions of all the related packages:
Code:
apt install libcpg4=3.0.4-pve1
apt install libcmap4=3.0.4-pve1
apt install libquorum5=3.0.4-pve1
apt install libvotequorum8=3.0.4-pve1
apt install libcorosync-common4=3.0.4-pve1
apt install corosync=3.0.4-pve1
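
For completeness, the versions that actually ended up installed can be double-checked on each node with something like:

Code:
dpkg -l | grep -E 'corosync|libknet|libcpg|libquorum'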

After this, if we take node2 out of the picture (by stopping corosync and pve-cluster on it), we can manage the other 3 nodes, even though they are on different versions (node1 = 5.4-15, node3 = 5.4-15, node4 = 6.4-13).

One process used to replicate the state of the "good" nodes to the problematic node2 was:
Code:
systemctl stop corosync pve-cluster
scp node1:/var/lib/pve-cluster/* /var/lib/pve-cluster
systemctl restart corosync.service
systemctl restart pve-cluster.service

After this process they almost got synchronized (checking with watch pvecm status), but at some point the syslog got flooded with a whole bunch of [TOTEM] replication messages and the nodes all lost communication.

Is there any way to force node2 to rejoin the cluster with the machines still running, without breaking the entire cluster?

Where, on whichever node, could I check what the issue is that prevents them from replicating the cluster / corosync / pmxcfs state properly?
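
(I assume the usual places are pvecm status, the corosync tools and the journal, e.g. the following, but maybe there is more to look at:)

Code:
pvecm status
corosync-cfgtool -s                            # ring/link status as corosync sees it
corosync-quorumtool -s                         # quorum and membership view
journalctl -u corosync -u pve-cluster --since "1 hour ago"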

Please let me know if I can provide any other information / logs, thanks in advance :D
 
Please provide the output of pveversion -v from node 2 and node 4.
In addition please provide the output of cat /etc/network/interfaces, cat /etc/pve/corosync.conf and cat /etc/corosync/corosync.conf.
The syslogs (/var/log/syslog.*) would also help.
 
pveversion of all nodes:
Code:
[pmx-office1]: pveversion
pve-manager/7.0-13/7aa7e488 (running kernel: 5.11.22-3-pve)

[pmx-office2]: pveversion
pve-manager/7.0-13/7aa7e488 (running kernel: 5.11.22-3-pve)

[pmx-office3]: pveversion
pve-manager/7.0-13/7aa7e488 (running kernel: 5.11.22-3-pve)

[pmx-office4]: pveversion
pve-manager/7.0-13/7aa7e488 (running kernel: 5.11.22-3-pve)


--
Sorry for the delay on new information, but I took some time to update all the nodes and get the services back online, even with one of the nodes detached from management.

Whether it was something related to the version or the older Proxmox kernel would show up after getting to version 7, but the results were similar: as soon as node2 enters the group, everything goes awry :(

Now they are all on the latest version, 7.0-13, but I still cannot rejoin node2 without breaking corosync. I moved all machines off this node for this process and cleaned all storage besides the local disks, but even then, when joining (both with pvecm add <ip> and with the --force flag), the node appears green for a brief moment and then the logs suddenly get flooded with totem retransmits.
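
For reference, what I ran on node2 was roughly the following (10.10.11.13 is node1, used here just as an example of an existing member):

Code:
# on node2, pointing at an existing cluster member
pvecm add 10.10.11.13
# and, after that failed, the forced variant
pvecm add 10.10.11.13 --force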

I didn't follow the installation of this cluster when it was set up, and checking the switch now I noticed that jumbo frames and IGMP groups are disabled. Considering the cluster worked fine on version 5, I don't know whether there is any relation between newer versions of corosync and these network settings.

The tests I made after searching the forum were to manipulate the MTU in the totem section of corosync.conf; I tried 900, 1480, 1500 and 9000 with the same results (always bumping the config version and restarting the corosync services first).

For example:

Code:
totem {
  cluster_name: office
  config_version: 7
  netmtu: 1480
  interface {
    bindnetaddr: 10.10.11.13
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
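
In case it matters, the edit-and-restart cycle each time was roughly the following (my own summary; whether /etc/pve/corosync.conf is writable depends on pmxcfs having quorum, otherwise the local /etc/corosync/corosync.conf has to be edited directly):

Code:
# edit the config and bump config_version
nano /etc/pve/corosync.conf
# then apply it on each node
systemctl restart corosync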


I'm also posting syslogs from one node that is in the cluster and gets the totem errors, and from the "rogue" node that I want to rejoin.

Sorry again for the delay; if any other info / logs / tests are needed to validate this, please let me know and I'll update here.

The files you asked for from /etc/corosync and /etc/pve are uploaded with dots instead of slashes in the names, for reference (they have similar content anyhow).

Testing unicast/multicast communication between them:
Code:
omping 10.10.11.13 10.10.11.4 10.10.11.2 10.10.11.8

-- pmx-office1 10.10.11.13
10.10.11.4 :   unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.272/0.398/0.613/0.080
10.10.11.4 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.323/0.442/0.614/0.077
10.10.11.2 :   unicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.228/0.464/0.894/0.123
10.10.11.2 : multicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.284/0.492/0.895/0.113
10.10.11.8 :   unicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.393/0.494/0.665/0.089
10.10.11.8 : multicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.393/0.520/0.665/0.080

-- pmx-office2 10.10.11.4
10.10.11.13 :   unicast, xmt/rcv/%loss = 22/22/0%, min/avg/max/std-dev = 0.208/0.410/0.600/0.089
10.10.11.13 : multicast, xmt/rcv/%loss = 22/22/0%, min/avg/max/std-dev = 0.236/0.447/0.629/0.083
10.10.11.2  :   unicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.267/0.410/0.575/0.074
10.10.11.2  : multicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.340/0.498/0.616/0.042
10.10.11.8  :   unicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.358/0.484/0.594/0.051
10.10.11.8  : multicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.312/0.452/0.622/0.063

-- pmx-office3 10.10.11.2
10.10.11.13 :   unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.242/0.542/0.859/0.161
10.10.11.13 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.242/0.592/0.865/0.168
10.10.11.4  :   unicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.240/0.474/0.743/0.137
10.10.11.4  : multicast, xmt/rcv/%loss = 24/24/0%, min/avg/max/std-dev = 0.282/0.407/0.593/0.087
10.10.11.8  :   unicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.185/0.526/0.830/0.170
10.10.11.8  : multicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.203/0.551/0.849/0.157

-- pmx-office4 10.10.11.8
10.10.11.13 :   unicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.303/0.533/0.708/0.118
10.10.11.13 : multicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.347/0.570/0.735/0.105
10.10.11.4  :   unicast, xmt/rcv/%loss = 22/22/0%, min/avg/max/std-dev = 0.229/0.398/0.649/0.120
10.10.11.4  : multicast, xmt/rcv/%loss = 22/22/0%, min/avg/max/std-dev = 0.229/0.419/0.649/0.108
10.10.11.2  :   unicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.220/0.422/0.672/0.138
10.10.11.2  : multicast, xmt/rcv/%loss = 21/21/0%, min/avg/max/std-dev = 0.253/0.472/0.672/0.111
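
(Regarding the IGMP question above, the Proxmox cluster network notes also suggest a longer-running omping variant that catches multicast memberships timing out after a few minutes; from memory, something like:)

Code:
omping -c 600 -i 1 -q 10.10.11.13 10.10.11.4 10.10.11.2 10.10.11.8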
 

Attachments

  • etc.corosync.corosync.conf.txt
  • etc.pve.corosync.conf.txt
  • interfaces.pmx1.txt
  • interfaces.pmx2.txt
  • interfaces.pmx3.txt
  • interfaces.pmx4.txt
  • pmx1.syslog.rejoin.txt.zip
  • pmx2.syslog.rejoin.txt.zip
Thank you for the files.

First I'd suggest separating corosync from all other traffic. You seem to have more than enough interfaces available.
1G is more than enough for a cluster this size. It should be physically separated, which means different NIC and switch.

If there are still issues after that, check the cable of node 2 as well as the NIC firmware, and update your BIOS.
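
A minimal sketch of what a dedicated corosync interface could look like in /etc/network/interfaces on each node, assuming a spare NIC (eno2 here) and a separate subnet 10.10.99.0/24; both names are examples, not taken from the attached configs:

Code:
auto eno2
iface eno2 inet static
    address 10.10.99.13/24
    # no gateway: this network carries only corosync traffic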
 
If you simply add a second link over a different interface, make sure that the priority is higher (lower number) for the new one.
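
For illustration, adding a second link usually means giving every node a ring1_addr in the nodelist and a second interface entry in the totem section, roughly as below; addresses are examples, and the knet_link_priority value should be set per the advice above (check corosync.conf(5) on your version for which value wins):

Code:
nodelist {
  node {
    name: pmx-office1
    nodeid: 1
    ring0_addr: 10.10.11.13    # existing shared network
    ring1_addr: 10.10.99.13    # new dedicated corosync network
  }
  # ... same pattern for the other three nodes ...
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
    knet_link_priority: 10     # see corosync.conf(5) for the priority semantics
  }
}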
 
