Corosync spam

masterdaweb

Active Member
Apr 17, 2017
Hello,

I noticed that corosync sometimes "spams" the network, after which the cluster goes down and the nodes turn "gray".

Here is the output when corosync goes crazy:

Code:
Apr 12 20:32:25 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:25 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:27 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:27 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:27 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:27 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:33 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:33 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
Apr 12 20:32:33 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
Apr 12 20:32:33 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d

When this happens, it isn't a network issue, because all nodes are still accessible.
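
A minimal sketch for reading the log above, assuming each entry in a "Retransmit List" line is a hexadecimal message sequence number that the node is still waiting for. The helper name parse_retransmit_list is hypothetical; the two sample lines are copied from the log. It checks whether the two alternating lists actually differ or are the same set of missing messages in a different order.

```python
# Two lines copied from the log above (20:32:25 and 20:32:26).
LOG_LINES = [
    "Apr 12 20:32:25 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: "
    "8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 "
    "100 101 102 103 104 105 109 10a 19d",
    "Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: "
    "94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f "
    "90 91 92 93 98 99 ff 106 107 108",
]

MARKER = "Retransmit List:"


def parse_retransmit_list(line):
    """Return the sequence numbers on one 'Retransmit List' line as ints.

    Assumption: every token after the marker is a hexadecimal number.
    Lines without the marker yield an empty list.
    """
    _, found, rest = line.partition(MARKER)
    if not found:
        return []
    return [int(token, 16) for token in rest.split()]


first = parse_retransmit_list(LOG_LINES[0])
second = parse_retransmit_list(LOG_LINES[1])
same_set = set(first) == set(second)

print(len(first), len(second), same_set)  # 30 30 True
```

Both lines list the same 30 sequence numbers, only in a different order. The node keeps asking for the same missing messages for the whole excerpt, which would be consistent with those packets never arriving rather than with new traffic being generated.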
 
log messages like that usually indicate a (temporary) multicast issue. just because the nodes are reachable otherwise does not mean multicast works correctly.
 
What is the kernel version?
Hi,

Kernel version: 4.13.13-6-pve

Hello Fabian,

I'm using Unicast (UDPU).

Nodes are connected over WAN.

Cluster size: 15 nodes

Corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: br01
    nodeid: 12
    quorum_votes: 1
    ring0_addr: br01
  }
  node {
    name: br02
    nodeid: 15
    quorum_votes: 1
    ring0_addr: br02
  }
  node {
    name: ns3002319
    nodeid: 6
    quorum_votes: 1
    ring0_addr: ns3002319
  }
  node {
    name: ns3026955
    nodeid: 14
    quorum_votes: 1
    ring0_addr: ns3026955
  }
  node {
    name: ns3044087
    nodeid: 4
    quorum_votes: 1
    ring0_addr: ns3044087
  }
  node {
    name: ns328159
    nodeid: 3
    quorum_votes: 1
    ring0_addr: ns328159
  }
  node {
    name: ns343743
    nodeid: 7
    quorum_votes: 1
    ring0_addr: ns343743
  }
  node {
    name: ns370473
    nodeid: 13
    quorum_votes: 1
    ring0_addr: ns370473
  }
  node {
    name: ns383063
    nodeid: 5
    quorum_votes: 1
    ring0_addr: ns383063
  }
  node {
    name: ns387137
    nodeid: 11
    quorum_votes: 1
    ring0_addr: ns387137
  }
  node {
    name: ns394769
    nodeid: 10
    quorum_votes: 1
    ring0_addr: ns394769
  }
  node {
    name: ns500043
    nodeid: 9
    quorum_votes: 1
    ring0_addr: ns500043
  }
  node {
    name: ns524364
    nodeid: 1
    quorum_votes: 1
    ring0_addr: ns524364
  }
  node {
    name: ns535923
    nodeid: 8
    quorum_votes: 1
    ring0_addr: ns535923
  }
  node {
    name: ns538414
    nodeid: 2
    quorum_votes: 1
    ring0_addr: ns538414
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mamae
  config_version: 116
  interface {
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: off
  transport: udpu
  version: 2
}
 
15 nodes over WAN with unicast sounds like a disaster waiting to happen, unless you mean "low latency link reserved for corosync" when you say WAN.. based on your corosync config, I guess this is not the case?
 
Yes, it's over WAN.

Are there any tweaks in corosync.conf to fix this?

It's strange because it works, but sometimes corosync spams the nodes and then breaks the cluster.
 

although repeating myself is kind of boring, I'll say it again:
it's not corosync that is "spamming the nodes" and "breaking the cluster", it's your network that is too unreliable to run a cluster on top. just because it works most of the time, does not mean it's stable and fast enough.

see our docs which also explicitly state this:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network said:
... needs a reliable network with latencies under 2 milliseconds (LAN performance) to work properly. While corosync can also use unicast for communication between nodes its highly recommended to have a multicast capable network. The network should not be used heavily by other members, ideally corosync runs on its own network. never share it with network where storage communicates too.

there is a reason we always recommend against running clusters in non-LAN settings unless you have your own dedicated fiber/low-latency link between locations..

while you can tweak certain parameters to increase timeouts, it is not advisable.
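
for context on what "increasing timeouts" would mean: corosync.conf(5) documents that, when a nodelist with at least three nodes is present, the effective token timeout is token + (number_of_nodes - 2) * token_coefficient. The sketch below assumes the corosync 2.x defaults of 1000 ms for token and 650 ms for token_coefficient (the config posted above sets neither) and assumes the installed corosync supports token_coefficient. The function name and the 5000 ms value are hypothetical, chosen only for illustration.

```python
def effective_token_timeout_ms(node_count, token_ms=1000, token_coefficient_ms=650):
    """Effective totem token timeout in milliseconds.

    Follows the formula documented in corosync.conf(5):
        token + (number_of_nodes - 2) * token_coefficient
    The coefficient only applies when the nodelist has at least 3 nodes.
    Default values are assumed corosync 2.x defaults.
    """
    if node_count < 3:
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


# The nodelist in this thread has 15 entries.
print(effective_token_timeout_ms(15))                 # 9450
# Hypothetical tweak: raising token to 5000 ms.
print(effective_token_timeout_ms(15, token_ms=5000))  # 13450
```

under these assumptions a 15-node cluster already waits roughly 9.5 seconds before declaring the token lost. Raising the values further only delays failure detection; it does not make a lossy or high-latency link reliable.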

you should also seriously re-enable secauth (I have no idea why you would have that disabled? it's always been on by default in PVE..) if you are transmitting over insecure links outside of your immediate control.. right now your cluster communication is not authenticated and transmitted over the public Internet as clear text - any hop along the way can modify any file in /etc/pve as long as it understands the traffic it sees.
 
