Some oddities after a cluster broke into single pieces and healed again (a bit lengthy, sorry)

rainer042

Hello,

I run an 8-node PVE cluster, version "pve-manager/7.4-3/9002ab8a". Last Friday this cluster suddenly broke down. At first the web interface showed only two hosts marked red; after a while all nodes were red. The cause might have been a network loop someone created around that time, but we are not sure. A browser connection to, say, host A showed that host as green but all other hosts as red. If I connected to node C instead, that one was marked green and all others red. This was true for all nodes. We tried to get the hosts to act as a cluster again, but failed until we tried a different totem configuration in /etc/corosync/corosync.conf.

The original corosync.conf looked like this:

Code:
...
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr:  <first_ip_of_node1>      # in first network
    ring1_addr:  <second_ip_of_node1> # in second network
  }
..... # more nodes
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: clustername
  config_version: 8
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

We changed the totem section to use a different transport (sctp) and copied the file to /etc/corosync/corosync.conf on all hosts:
Code:
totem {
  cluster_name: clustername
  config_version: 8
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  interface {
    knet_transport: sctp
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

This helped immediately and the cluster worked as usual again. On Sunday I looked at the cluster again and saw that only one node was still using sctp; running corosync-cfgtool -s | grep sctp on all 8 nodes revealed this. /etc/pve/corosync.conf showed the original udp configuration, not the new sctp one. So on the one host still showing sctp (node 5) I opened this file in vi and wrote it back to disk without changes. Afterwards all nodes were using udp again.
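
For comparing what the two config files say, a small helper like the following might be handy. It is only a sketch: the function name totem_summary is made up for illustration, and it assumes that an interface block without knet_transport means the default "udp". It shows what is written in a file, not what the running corosync is actually using.

```shell
# Hypothetical helper: print config_version and the knet transport of each
# totem link found in a corosync.conf-style file given as $1.
# Assumption: a link without an explicit knet_transport uses the default "udp".
totem_summary() {
    awk '
        /^[ \t]*config_version:/       { print "config_version=" $2 }
        /^[ \t]*interface[ \t]*[{]/    { in_if = 1; link = "?"; tr = "udp"; next }
        in_if && /^[ \t]*linknumber:/     { link = $2 }
        in_if && /^[ \t]*knet_transport:/ { tr = $2 }
        in_if && /[}]/                 { print "link" link "=" tr; in_if = 0 }
    ' "$1"
}

# Example use on a node (paths as mentioned in this thread):
#   totem_summary /etc/pve/corosync.conf
#   totem_summary /etc/corosync/corosync.conf
```

If the two summaries differ, or if they agree but corosync-cfgtool -s still shows another transport, the running daemon is probably not using what is on disk.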

What I found in the syslog at this time was this:
Code:
Jun 11 10:59:27 host2 corosync[1509822]:   [CFG   ] Config reload requested by node 5
Jun 11 10:59:27 host2 corosync[1509822]:   [TOTEM ] New config has different knet transport for link 0. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]:   [TOTEM ] New config has different knet transport for link 1. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]:   [CFG   ] Cannot configure new interface definitions: To reconfigure an interface it must be deleted a>

On one node I have another problem: in the output of corosync-cmapctl -m stats only one link seems to exist for this host, whereas other nodes have two working network links (there is a stats.knet.node1.link0, but no stats.knet.node1.link1 in the output). The IPs of this node in corosync.conf are correct and pingable.
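
To see quickly which links show up for a given node id, the stats output can be filtered as sketched below. The function name knet_links_for_node is hypothetical, and the only thing assumed about the output format is that keys start with stats.knet.node<ID>.link<N>, as in the key names quoted above.

```shell
# Hypothetical helper: read `corosync-cmapctl -m stats` output on stdin and
# print the knet link numbers that have stats keys for node id $1.
knet_links_for_node() {
    grep -o "stats\.knet\.node$1\.link[0-9][0-9]*" | sed 's/.*\.link//' | sort -un
}

# Example use on a node:
#   corosync-cmapctl -m stats | knet_links_for_node 1
```

A node with both links configured and known to the running corosync would be expected to list 0 and 1 here.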

Another problem is a log message I see on all hosts:
Code:
Jun 11 10:59:27 host2 corosync[1509822]:   [TOTEM ] New config has different knet transport for link 0. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]:   [TOTEM ] New config has different knet transport for link 1. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]:   [CFG   ] Cannot configure new interface definitions: To reconfigure an interface it must be deleted and recreated. A working interface needs to be available to corosync at all times
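
To find out which hosts and links logged this message, the log can be filtered roughly like this. It is a sketch that assumes the syslog line layout shown above (timestamp, host, corosync[PID]:, message); the function name stale_transport_links is made up.

```shell
# Hypothetical helper: read syslog/journal lines on stdin and print
# "<host> link<N>" once for every link that logged
# "New config has different knet transport ... Internal value was NOT changed."
stale_transport_links() {
    sed -n 's/^.* \([^ ]*\) corosync\[[0-9]*]: .*New config has different knet transport for link \([0-9][0-9]*\)\. Internal value was NOT changed\..*$/\1 link\2/p' | sort -u
}

# Example use on a node:
#   journalctl -u corosync | stale_transport_links
```

Any link listed here is one where the running corosync kept its old transport even though the config on disk asks for a different one.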

The cluster is currently up and running, but I have some questions:
  • Why was the new sctp config, which got the cluster working again, lost on every host except one?
  • What can I do about the syslog message "Internal value was NOT changed"? Restart corosync on all hosts?
  • Could this also help the first node, which seems to have only one cluster link?
  • Is it generally OK to manually restart corosync on a single host or on all hosts?

Thanks for your help
Rainer
 
I was able to find the problem.

I started with the host that I thought had only one cluster link. I rebooted it, and after the reboot it was an isolated node. The logs of the other cluster members that were still part of the cluster (all green) showed that they had a problem reaching the fenced-out node on all cluster links, although ping was possible.
Then I remembered the error message on many nodes that said "New config has different knet transport for link 0. Internal value was NOT changed." (see above), and this pointed me to the solution.
All nodes except the rebooted one were actually still using KNET:sctp for corosync, although the corosync configuration in /etc/corosync/ as well as /etc/pve/corosync.conf specified "udp". These hosts had NOT been rebooted (nor had corosync been restarted), so they were still running "KNET:sctp" rather than "udp" and were therefore unable to communicate with the one rebooted host, which was using "udp".
The easiest solution was to change the configuration back to KNET:sctp in the cluster-wide corosync.conf (/etc/pve/corosync.conf), which triggered an update of /etc/corosync/corosync.conf on all nodes to match the /etc/pve/corosync.conf version.
Afterwards the cluster was healthy again. I rebooted all nodes to check that it works, and it does.
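
One detail that is not mentioned in this thread: as far as I know, the Proxmox documentation asks you to increment config_version whenever /etc/pve/corosync.conf is edited, and to work on a copy first. The following sketch prints a copy of a config file with the version bumped by one. The function name bump_config_version and the .new file name are only illustrative.

```shell
# Hypothetical helper: print the corosync.conf given as $1 to stdout with
# config_version incremented by one; all other lines are passed through as-is.
bump_config_version() {
    awk '
        /^[ \t]*config_version:[ \t]*[0-9]+[ \t]*$/ {
            match($0, /[0-9]+/)
            print substr($0, 1, RSTART - 1) (substr($0, RSTART, RLENGTH) + 1)
            next
        }
        { print }
    ' "$1"
}

# Example use (review the result before moving it into place):
#   bump_config_version /etc/pve/corosync.conf > /root/corosync.conf.new
```

The transport change itself would still have to be edited into the copy by hand, and judging from the log message above, a changed knet transport only takes effect once corosync is restarted.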

Well, now all nodes are using "KNET:sctp" instead of the default "udp". What I cannot figure out is which advantages or disadvantages "KNET:sctp" has over the installation default "udp".

Thanks
Rainer
 
If it works for you, it shouldn't matter much.
It doesn't seem to be used quite as much, but some users had to switch to `sctp` for stability. In their network `udp` just couldn't be used without Corosync losing the connection from time to time.
 
