During recovery, while trying to understand the issue, we performed these steps:
0. Checked for errors on the switch ports and for options such as storm control: everything there is correct;
1. Disabled encryption so that we could analyze the traffic:
crypto_cipher: none
crypto_hash: none
and found this:
Code:
Sep 21 10:20:32 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.53 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring
Sep 21 10:20:33 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.51 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring
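To see which nodes send these messages most often, the lines can be counted per sender address. This is only a minimal sketch: the function name `count_bad_magic` and the log path in the usage comment are our own assumptions, and the pattern is taken from the two lines quoted above.

```python
import re
import sys
from collections import Counter

# Pattern built from the corosync TOTEM lines quoted above;
# it captures the sender address of each "bad magic number" message.
BAD_MAGIC = re.compile(
    r"\[TOTEM\s*\] Message received from (\S+) has bad magic number"
)


def count_bad_magic(lines):
    """Return a Counter mapping sender address -> number of matching lines."""
    counts = Counter()
    for line in lines:
        match = BAD_MAGIC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (the path is an assumption): python3 count_bad_magic.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as log:
        for sender, n in count_bad_magic(log).most_common():
            print(f"{sender}\t{n}")
```

Running it on the log of every node and comparing the results should show whether the messages come from a few senders or from all of them.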
We found the related code:
http://lira.epac.to:8080/doc/corosync/api/html/totemsrp_8c_source.html
(line 4912), which relates to these strange errors about the magic number;
After this we restored the most important VMs on other hardware, forcibly removed 6 nodes and built a new cluster of 7 nodes, switching the new cluster to the `sctp` transport and increasing the timeouts:
Code:
corosync.conf :
totem {
cluster_name: TEST-20200920
config_version: 27
interface {
knet_transport: sctp <---------------
linknumber: 0
token_retransmits_before_loss_const: 10 <---------------
join: 150 <---------------
token: 5000 <---------------
After this the new small cluster gained quorum and is still working normally. We are now watching the logs and trying to find correlations and other potentially useful information. In the new sctp cluster we see no retransmits and quorum is OK, but I don't know whether it will survive growing to 22 nodes.
Also, we can't fully exclude network issues, even though we have dedicated 10G networks and nominally ideal pings and bandwidth.
For now we are watching both the old cluster with the udp transport and the new cluster with the sctp transport.
In any case, even under network packet loss corosync should not start such a massive and disruptive flood; this looks like a bug.
We hope to help solve this problem in the future.
We will try to enable pve-cluster debug and analyze the data as you recommended.
We will answer some time later, after completing the next diagnostic steps.
Thank you very much in any case.
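Regarding growth to 22 nodes: as far as we understand corosync.conf(5), when a nodelist with at least 3 nodes is specified, the real token timeout is computed as token + (number_of_nodes - 2) * token_coefficient. A small sketch of that arithmetic follows. It assumes the default token_coefficient of 650 ms (we did not set it ourselves), and the function name `effective_token_ms` is just ours for illustration.

```python
def effective_token_ms(token_ms, nodes, token_coefficient_ms=650):
    """Real token timeout in ms, following the formula in corosync.conf(5).

    Assumption: a nodelist is configured and token_coefficient is left
    at its documented default of 650 ms.
    """
    if nodes < 3:
        # The coefficient only applies with at least 3 nodes in the nodelist.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# token: 5000 as in our corosync.conf, for the current and the planned size.
for nodes in (7, 22):
    print(nodes, "nodes ->", effective_token_ms(5000, nodes), "ms")
```

With token: 5000 this gives 8250 ms for 7 nodes and 18000 ms for 22 nodes, so the timeout scales with the cluster size even if we do not touch the config again.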