Corosync spam

Discussion in 'Proxmox VE: Installation and configuration' started by masterdaweb, Apr 13, 2018.

  1. masterdaweb

    masterdaweb Member

    Hello,

    I've noticed that corosync sometimes "spams" the network, after which the cluster goes down and the nodes turn "gray".

    Here is the log output when corosync goes crazy:

    Code:
    Apr 12 20:32:25 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:25 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:26 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:27 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:27 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:27 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:27 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:28 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:28 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:29 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:29 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:30 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:30 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:31 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:31 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:32 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:32 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:33 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:33 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108
    Apr 12 20:32:33 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    Apr 12 20:32:33 ns524364 corosync[2602]:  [TOTEM ] Retransmit List: 8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 100 101 102 103 104 105 109 10a 19d
    
    When this happens, it isn't a network issue, because all nodes are still accessible.
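A small sketch for reading the log above, not part of the original post: it parses two of the `Retransmit List` lines and compares them. The helper name `parse_retransmit_list` is hypothetical, and treating the entries as hexadecimal message sequence numbers is an assumption based on values such as `8d` and `10a`.

```python
import re

# Matches the TOTEM "Retransmit List" lines shown in the log above.
RETRANSMIT_RE = re.compile(r"\[TOTEM \] Retransmit List: ([0-9a-f ]+)$")


def parse_retransmit_list(line):
    """Return the set of sequence numbers on a Retransmit List line, else None."""
    match = RETRANSMIT_RE.search(line.strip())
    if match is None:
        return None
    # Assumption: entries are hexadecimal sequence numbers.
    return {int(token, 16) for token in match.group(1).split()}


# One line of each alternating pattern, copied from the log above.
LINE_A = (
    "Apr 12 20:32:25 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: "
    "8d 8e 8f 90 91 92 93 98 99 ff 106 107 108 89 8a 8b 8c 94 95 96 97 "
    "100 101 102 103 104 105 109 10a 19d"
)
LINE_B = (
    "Apr 12 20:32:26 ns524364 corosync[2602]: notice  [TOTEM ] Retransmit List: "
    "94 95 96 97 100 101 102 103 104 105 109 10a 19d 89 8a 8b 8c 8d 8e 8f "
    "90 91 92 93 98 99 ff 106 107 108"
)

seqs_a = parse_retransmit_list(LINE_A)
seqs_b = parse_retransmit_list(LINE_B)

print(len(seqs_a), len(seqs_b))  # 30 30
print(seqs_a == seqs_b)          # True
```

Both alternating patterns list the same 30 entries in a different order, so the same messages keep being requested again for the whole excerpt rather than new ones going missing.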
     
  2. Vasu Sreekumar

    Vasu Sreekumar Active Member

    What is the kernel version?
     
  3. fabian

    fabian Proxmox Staff Member

    Log messages like that usually indicate a (temporary) multicast issue. Just because the nodes are reachable otherwise does not mean multicast works correctly.
     
  4. masterdaweb

    masterdaweb Member

    Hi,

    Kernel version: 4.13.13-6-pve

    Hello Fabian,

    I'm using Unicast (UDPU).

    Nodes are connected over WAN.

    Cluster size: 15 nodes

    Corosync.conf:

    Code:
    logging {
      debug: off
      to_syslog: yes
    }
    
    nodelist {
      node {
        name: br01
        nodeid: 12
        quorum_votes: 1
        ring0_addr: br01
      }
      node {
        name: br02
        nodeid: 15
        quorum_votes: 1
        ring0_addr: br02
      }
      node {
        name: ns3002319
        nodeid: 6
        quorum_votes: 1
        ring0_addr: ns3002319
      }
      node {
        name: ns3026955
        nodeid: 14
        quorum_votes: 1
        ring0_addr: ns3026955
      }
      node {
        name: ns3044087
        nodeid: 4
        quorum_votes: 1
        ring0_addr: ns3044087
      }
      node {
        name: ns328159
        nodeid: 3
        quorum_votes: 1
        ring0_addr: ns328159
      }
      node {
        name: ns343743
        nodeid: 7
        quorum_votes: 1
        ring0_addr: ns343743
      }
      node {
        name: ns370473
        nodeid: 13
        quorum_votes: 1
        ring0_addr: ns370473
      }
      node {
        name: ns383063
        nodeid: 5
        quorum_votes: 1
        ring0_addr: ns383063
      }
      node {
        name: ns387137
        nodeid: 11
        quorum_votes: 1
        ring0_addr: ns387137
      }
      node {
        name: ns394769
        nodeid: 10
        quorum_votes: 1
        ring0_addr: ns394769
      }
      node {
        name: ns500043
        nodeid: 9
        quorum_votes: 1
        ring0_addr: ns500043
      }
      node {
        name: ns524364
        nodeid: 1
        quorum_votes: 1
        ring0_addr: ns524364
      }
      node {
        name: ns535923
        nodeid: 8
        quorum_votes: 1
        ring0_addr: ns535923
      }
      node {
        name: ns538414
        nodeid: 2
        quorum_votes: 1
        ring0_addr: ns538414
      }
    }
    
    quorum {
      provider: corosync_votequorum
    }
    
    totem {
      cluster_name: mamae
      config_version: 116
      interface {
        ringnumber: 0
      }
      ip_version: ipv4
      secauth: off
      transport: udpu
      version: 2
    }
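As a side note on the `quorum` and `nodelist` sections above, here is a minimal sketch of the majority arithmetic for this cluster. It assumes the usual strict-majority rule for `corosync_votequorum` with no special quorum options set; the function name `quorum_threshold` is made up for illustration.

```python
def quorum_threshold(total_votes):
    """Votes needed for a strict majority of total_votes."""
    return total_votes // 2 + 1


# 15 nodes with quorum_votes: 1 each, as in the nodelist above.
votes = [1] * 15
total = sum(votes)
needed = quorum_threshold(total)

print(total, needed, total - needed)  # 15 8 7
```

Under that assumption, a node stays quorate only while it can reach at least 7 other nodes, so membership trouble on the WAN links affects the whole cluster and not just one pair of hosts.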
    
     
  5. fabian

    fabian Proxmox Staff Member

    15 nodes over WAN with unicast sounds like a disaster waiting to happen, unless you mean "low latency link reserved for corosync" when you say WAN. Based on your corosync config, I guess this is not the case?
     
  6. masterdaweb

    masterdaweb Member

    Yes, it's over WAN.

    Are there any tweaks in corosync.conf that could fix it?

    It's strange, because it works, but sometimes corosync spams the nodes and then breaks the cluster.
     
  7. fabian

    fabian Proxmox Staff Member

    Although repeating myself is kind of boring, I'll say it again:
    It's not corosync that is "spamming the nodes" and "breaking the cluster", it's your network that is too unreliable to run a cluster on top of. Just because it works most of the time does not mean it's stable and fast enough.

    See our docs, which also explicitly state this:
    There is a reason we always recommend against running clusters in non-LAN settings unless you have your own dedicated fiber/low-latency link between locations.

    While you can tweak certain parameters to increase timeouts, it is not advisable.
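To make "timeouts" a bit more concrete, here is a rough sketch of how the effective totem token timeout scales with cluster size. It is an illustration, not a tuning recommendation. The formula and the default values (`token` 1000 ms, `token_coefficient` 650 ms) are taken from the corosync.conf(5) man page for corosync 2.x and are assumptions here; check the man page for the installed version. The function name `effective_token_ms` is hypothetical.

```python
def effective_token_ms(nodes, token_ms=1000, token_coefficient_ms=650):
    """Effective token timeout when a nodelist with at least 3 nodes is used.

    Assumed rule (corosync.conf(5), corosync 2.x):
        token + (number_of_nodes - 2) * token_coefficient
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


print(effective_token_ms(15))  # 9450
```

With those assumed defaults, a 15-node cluster already waits several seconds before declaring the token lost, so raising the values further mostly delays failure detection without making the links any more reliable.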

    You really should also re-enable secauth (I have no idea why you would have that disabled; it has always been on by default in PVE) if you are transmitting over insecure links outside of your immediate control. Right now your cluster communication is not authenticated and is transmitted over the public Internet as clear text: any hop along the way can modify any file in /etc/pve as long as it understands the traffic it sees.
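A minimal sketch of how one might check a corosync.conf for this setting before running a cluster over untrusted links. The parser below is a simplified, hypothetical helper (not a corosync tool) that only understands `section { ... }` blocks and `key: value` lines; the sample text is the `totem` section posted earlier in the thread.

```python
TOTEM_SECTION = """\
totem {
  cluster_name: mamae
  config_version: 116
  interface {
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: off
  transport: udpu
  version: 2
}
"""


def totem_settings(conf_text):
    """Collect the direct 'key: value' pairs of the totem { } section."""
    settings = {}
    stack = []  # names of the sections we are currently inside
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            stack.append(line[:-1].strip())
        elif line == "}":
            stack.pop()
        elif stack == ["totem"] and ":" in line:
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    return settings


def security_warnings(settings):
    """Return human-readable warnings for settings explicitly switched off."""
    warnings = []
    if settings.get("secauth") == "off":
        warnings.append("secauth is off: cluster traffic is not authenticated")
    return warnings


settings = totem_settings(TOTEM_SECTION)
for warning in security_warnings(settings):
    print(warning)
```

The check only flags an explicit `secauth: off`; it makes no claim about what corosync does when the key is absent.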
     