Corosync Redundancy not working

Aug 10, 2021
Hello!
I have a problem with corosync redundancy on a 3-node test cluster running PVE 7.1-4. I set everything up and it looks fine, but as soon as I pull the cable of ring0, the connection to the cluster is lost. The strange thing is that if I take down the port via
Code:
# ip link set dev eno1 down
everything is working as expected.

Code:
# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
    addr    = xxx.xxx.xxx.10
    status:
        nodeid:          1:    localhost
        nodeid:          2:    connected
        nodeid:          3:    connected
LINK ID 1 udp
    addr    = xxx.xxx.xxx.11
    status:
        nodeid:          1:    localhost
        nodeid:          2:    connected
        nodeid:          3:    connected

Where is my mistake?

Code:
# cat /etc/pve/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmox-2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xxx.xxx.xxx.22
    ring1_addr: xxx.xxx.xxx.23
  }
  node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xxx.xxx.xxx.10
    ring1_addr: xxx.xxx.xxx.11
  }
  node {
    name: pve-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: xxx.xxx.xxx.12
    ring1_addr: xxx.xxx.xxx.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Test
  config_version: 3
  interface {
    linknumber: 0
    knet_link_priority: 1
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


Thanks a lot!
 
What do you mean by 'connection to the cluster is lost'? Please post any relevant errors/logs. Also, the /etc/network/interfaces from each node would be good to include.
 
Hello Fabian,

Thank you for your answer. If I pull the ring0 cable from node-2, then that node is isolated from nodes 1 and 3.



Nov 17 15:58:34 proxmox-2 corosync[2951]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 17 15:58:35 proxmox-2 corosync[2951]: [KNET ] rx: host: 1 link: 0 is up
Nov 17 15:58:35 proxmox-2 corosync[2951]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 10)
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] link: host: 2 link: 0 is down
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] link: host: 2 link: 1 is down
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 10)
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] host: host: 2 has no active links
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 10)
Nov 17 16:00:11 proxmox-2 corosync[2951]: [KNET ] host: host: 2 has no active links
Nov 17 16:00:12 proxmox-2 corosync[2951]: [TOTEM ] Token has not been received in 2737 ms
Nov 17 16:00:13 proxmox-2 corosync[2951]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 17 16:00:17 proxmox-2 corosync[2951]: [QUORUM] Sync members[2]: 1 3
Nov 17 16:00:17 proxmox-2 corosync[2951]: [QUORUM] Sync left[1]: 2
Nov 17 16:00:17 proxmox-2 corosync[2951]: [TOTEM ] A new membership (1.281) was formed. Members left: 2
Nov 17 16:00:17 proxmox-2 corosync[2951]: [TOTEM ] Failed to receive the leave message. failed: 2
Nov 17 16:00:17 proxmox-2 corosync[2951]: [QUORUM] Members[2]: 1 3
Nov 17 16:00:17 proxmox-2 corosync[2951]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 17 16:01:58 proxmox-2 corosync[2951]: [KNET ] rx: host: 2 link: 0 is up
Nov 17 16:01:58 proxmox-2 corosync[2951]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 17 16:01:58 proxmox-2 corosync[2951]: [KNET ] rx: host: 2 link: 1 is up
Nov 17 16:01:58 proxmox-2 corosync[2951]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 10)
Nov 17 16:01:58 proxmox-2 corosync[2951]: [QUORUM] Sync members[3]: 1 2 3
Nov 17 16:01:58 proxmox-2 corosync[2951]: [QUORUM] Sync joined[1]: 2
Nov 17 16:01:58 proxmox-2 corosync[2951]: [TOTEM ] A new membership (1.285) was formed. Members joined: 2
Nov 17 16:01:58 proxmox-2 corosync[2951]: [QUORUM] Members[3]: 1 2 3
Nov 17 16:01:58 proxmox-2 corosync[2951]: [MAIN ] Completed service synchronization, ready to provide service.
 
It seems pulling that cable makes both links go down (the log shows link 0 and link 1 to host 2 going down in the same second). I suspect a network config mistake ;)
 
The link was still up according to:

# ip link | grep eno

But no ping was going through. On the switch side everything looks fine. I also swapped out our data center switch for ring1 as a test, but the result was the same.


The network conf looks straightforward to me!?
/etc/network/interfaces

....
auto eno1
iface eno1 inet static
address xxx.xxx.xxx.22/24
#ProxSync

auto eno2
iface eno2 inet static
address xxx.xxx.xxx.23/24
#ProxSync
....
 
Is xxx.xxx.xxx the same for both interfaces?
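
If it helps, here is a minimal sketch of that check as a small POSIX shell helper. The function names (`ip_to_int`, `network_of`, `same_subnet`) and the 192.0.2.x example addresses are made up for illustration, since the real addresses are redacted in this thread. It only compares the network part of two ADDRESS/PREFIX pairs.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address (e.g. 192.0.2.22) to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print the network part of an ADDRESS/PREFIX pair as an integer.
network_of() {
    addr=${1%/*}
    prefix=${1#*/}
    mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
    n=$(ip_to_int "$addr")
    echo $(( $n & $mask ))
}

# Exit status 0 when both ADDRESS/PREFIX arguments are in the same network.
same_subnet() {
    [ "${1#*/}" = "${2#*/}" ] && [ "$(network_of "$1")" = "$(network_of "$2")" ]
}

# Hypothetical example addresses; substitute the real eno1/eno2 addresses.
if same_subnet "192.0.2.22/24" "192.0.2.23/24"; then
    echo "same subnet"
else
    echo "different subnets"
fi
```

On the node itself, `ip -4 route show` lists the connected routes. Seeing the same prefix once per interface points to the same situation.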
 
Well, yeah - you only have a single subnet, and the link over which all of that subnet is routed goes down.
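
A likely explanation for the difference between the two tests: taking eno1 down with `ip link set` removes its connected route, so the kernel falls back to the route via eno2. A pulled cable only drops the carrier, and the route usually stays in place. The usual way out is to put each corosync link on its own subnet. A rough sketch for one node follows; the 10.10.10.0/24 and 10.10.20.0/24 subnets are placeholders I picked for illustration, not addresses from this cluster.

```
# Placeholder subnets, for illustration only:
# link 0 on 10.10.10.0/24, link 1 on 10.10.20.0/24
auto eno1
iface eno1 inet static
address 10.10.10.22/24
#corosync link 0

auto eno2
iface eno2 inet static
address 10.10.20.22/24
#corosync link 1
```

The ring0_addr/ring1_addr entries in /etc/pve/corosync.conf would need to match the new addresses on every node, and config_version has to be incremented when that file is edited.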