Hello,
I run an 8-node PVE cluster, version "pve-manager/7.4-3/9002ab8a". Last Friday this cluster suddenly broke down. At first the web interface showed only two hosts marked red; after a while all nodes were red. The reason might have been a network loop someone created around that time, but it is not certain that this really was the cause. A connection from a browser to, say, host A showed a green sign for that host but red for all other hosts. If I connected via browser to, say, node C, that one was marked green but all other hosts were marked red. This was true for all nodes. We tried to get the hosts to act as a cluster again, but failed until we tried a different totem configuration in /etc/corosync/corosync.conf.
The original corosync.conf looked like this:
Code:
...
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: <first_ip_of_node1>   # in first network
    ring1_addr: <second_ip_of_node1>  # in second network
  }
  ..... # more nodes
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: clustername
  config_version: 8
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
We changed the totem section to use a different knet transport (sctp) and copied this file to /etc/corosync/corosync.conf on all hosts:
Code:
totem {
  cluster_name: clustername
  config_version: 8
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  interface {
    knet_transport: sctp
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
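For completeness, this is roughly how we copied it around (a sketch only; the node names are placeholders for our real hostnames):
Code:
# distribute the edited config to every node (sketch, hostnames are placeholders)
for h in node1 node2 node3 node4 node5 node6 node7 node8; do
    scp corosync.conf root@${h}:/etc/corosync/corosync.conf
done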
This helped immediately and the cluster acted as usual again. On Sunday I looked at the cluster again and saw that only one node was still using sctp; running

corosync-cfgtool -s | grep sctp

on all 8 nodes revealed this. /etc/pve/corosync.conf showed the original udp configuration, not the new sctp one. So on the one host (node 5) still showing sctp I opened this file in vi and wrote it back to disk without changes. Afterwards all nodes were using udp again.

What I found in the syslog at this time was this:
Code:
Jun 11 10:59:27 host2 corosync[1509822]: [CFG ] Config reload requested by node 5
Jun 11 10:59:27 host2 corosync[1509822]: [TOTEM ] New config has different knet transport for link 0. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]: [TOTEM ] New config has different knet transport for link 1. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]: [CFG ] Cannot configure new interface definitions: To reconfigure an interface it must be deleted and recreated. A working interface needs to be available to corosync at all times
On one node I have another problem: in the output of

corosync-cmapctl -m stats

only one link seems to exist for this host, whereas the other nodes have two working network links (there is a stats.knet.node1.link0 but no stats.knet.node1.link1 in the output).
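To compare the links per node I filter the stats map like this (the grep pattern is just my ad-hoc filter; the key names are as they appear in the stats map on my hosts):
Code:
# show the connected flag of every knet link of every node
corosync-cmapctl -m stats | grep -E 'stats\.knet\.node[0-9]+\.link[0-9]+\.connected'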
The IPs of this node in corosync.conf are correct and pingable.

Another problem is a log message I see on all hosts:
Code:
Jun 11 10:59:27 host2 corosync[1509822]: [TOTEM ] New config has different knet transport for link 0. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]: [TOTEM ] New config has different knet transport for link 1. Internal value was NOT changed.
Jun 11 10:59:27 host2 corosync[1509822]: [CFG ] Cannot configure new interface definitions: To reconfigure an interface it must be deleted and recreated. A working interface needs to be available to corosync at all times
The cluster is currently up and running, but I have some questions:
- Why was the new sctp config, which got the cluster working again, somehow lost on all hosts except one?
- What can I do about the "value was NOT changed" warning from the syslog? Restart corosync on all hosts?
- Could that also help the node that seems to have only one cluster link?
- Is it generally OK to manually restart corosync on a single host, or on all hosts? (See the sketch below for what I mean.)
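To be explicit, by restarting corosync I mean something like this, one node at a time:
Code:
# restart the corosync service on one node, then verify its links before moving on
systemctl restart corosync
corosync-cfgtool -s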
Thanks for your help
Rainer