After upgrade to 5.4 redundant corosync ring does not work as expected

Whatever · Apr 23, 2019

After an upgrade to PVE 5.4 I'm facing a problem with the corosync second ring functionality.

corosync.conf

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.101
    ring1_addr: 10.71.200.101
  }
  node {
    name: pve-node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.102
    ring1_addr: 10.71.200.102
  }
  node {
    name: pve-node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.103
    ring1_addr: 10.71.200.103
  }
  node {
    name: pve-node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.104
    ring1_addr: 10.71.200.104
  }
  node {
    name: pve
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.10.100
    ring1_addr: 10.71.200.100
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-cluster
  config_version: 9
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.10.101
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.71.200.101
    ringnumber: 1
  }
}

After some investigation I've found that no multicast addresses are defined for either ring:

root@pve-node2:~# corosync-cmapctl totem

Code:

totem.cluster_name (str) = pve-cluster
totem.config_version (u64) = 9
totem.interface.0.bindnetaddr (str) = 10.10.10.101
totem.interface.1.bindnetaddr (str) = 10.71.200.101
totem.ip_version (str) = ipv4
totem.rrp_mode (str) = passive
totem.secauth (str) = on
totem.version (u32) = 2

root@pve-node2:~# cat /etc/hosts

Code:

127.0.0.1 localhost.localdomain localhost

10.10.10.100 pve.vlp.marcellis.local pve
10.10.10.101 pve-node1.imq.marcellis.local pve-node1
10.10.10.102 pve-node2.imq.marcellis.local pve-node2 pvelocalhost
10.10.10.103 pve-node3.imq.marcellis.local pve-node3
10.10.10.104 pve-node4.imq.marcellis.local pve-node4

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

And in syslog I see lots of messages saying that ring 1 is marked as FAULTY:

Code:

Apr 23 12:11:18 pve-node4 corosync[432947]: error   [TOTEM ] Marking ringid 1 interface 10.71.200.104 FAULTY
Apr 23 12:11:18 pve-node4 corosync[432947]:  [TOTEM ] Marking ringid 1 interface 10.71.200.104 FAULTY
Apr 23 12:11:19 pve-node4 corosync[432947]: notice  [TOTEM ] Automatically recovered ring 1
Apr 23 12:11:19 pve-node4 corosync[432947]:  [TOTEM ] Automatically recovered ring 1
Apr 23 12:11:40 pve-node4 corosync[432947]: error   [TOTEM ] Marking ringid 1 interface 10.71.200.104 FAULTY
Apr 23 12:11:40 pve-node4 corosync[432947]:  [TOTEM ] Marking ringid 1 interface 10.71.200.104 FAULTY
Apr 23 12:11:41 pve-node4 corosync[432947]: notice  [TOTEM ] Automatically recovered ring 1
Apr 23 12:11:41 pve-node4 corosync[432947]:  [TOTEM ] Automatically recovered ring 1

What could be wrong with my setup, and why are the multicast addresses not being set automatically?
Any advice is much appreciated.
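
To put a number on how often the ring flaps, the syslog lines above can be counted with a small filter. This is only a sketch: the function name `count_ring_flaps` is made up for illustration, and it assumes the message format shown in the log excerpt above (a severity tag followed by `[TOTEM ]`).

```shell
#!/bin/sh
# Sketch: count FAULTY / recovered events for corosync rings in syslog text.
# Only lines carrying a severity tag ("error", "notice") are counted, so the
# untagged duplicate that follows each message in the excerpt is skipped.
# count_ring_flaps is a hypothetical helper name, not part of corosync.
count_ring_flaps() {
    awk '
        /error +\[TOTEM \] Marking ringid [0-9]+ interface .* FAULTY/ { faulty++ }
        /notice +\[TOTEM \] Automatically recovered ring [0-9]+/ { recovered++ }
        END { printf "faulty=%d recovered=%d\n", faulty, recovered }
    '
}

# Example on a live node (assumes journald holds the corosync unit log):
#   journalctl -u corosync --since today | count_ring_flaps
```

Comparing the counts between nodes may help show whether every node sees the flapping or only some of them.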

Whatever · Apr 23, 2019

After restarting corosync once more on every node, I now see results similar to those before the upgrade:

Code:

root@pve-node2:~# corosync-cmapctl totem
totem.cluster_name (str) = pve-cluster
totem.config_version (u64) = 11
totem.interface.0.bindnetaddr (str) = 10.10.10.102
totem.interface.0.mcastaddr (str) = 239.192.91.16
totem.interface.0.mcastport (u16) = 5405
totem.interface.1.bindnetaddr (str) = 10.71.200.102
totem.interface.1.mcastaddr (str) = 239.192.91.17
totem.interface.1.mcastport (u16) = 5405
totem.ip_version (str) = ipv4
totem.rrp_mode (str) = passive
totem.secauth (str) = on
totem.version (u32) = 2

omping succeeded for each subnet:

Code:

root@pve-node1:~# omping -c 120 -i 1 -q 10.10.10.101 10.10.10.102
10.10.10.102 : waiting for response msg
10.10.10.102 : waiting for response msg
10.10.10.102 : joined (S,G) = (*, 232.43.211.234), pinging
10.10.10.102 : given amount of query messages was sent

10.10.10.102 :   unicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.074/0.139/0.254/0.031
10.10.10.102 : multicast, xmt/rcv/%loss = 120/120+120/0%, min/avg/max/std-dev = 0.079/0.161/0.280/0.034
root@pve-node1:~# omping -c 120 -i 1 -q 10.71.200.101 10.71.200.102
10.71.200.102 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.102 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.102 : given amount of query messages was sent

10.71.200.102 :   unicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.108/0.214/0.375/0.034
10.71.200.102 : multicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.144/0.263/4.958/0.433

Code:

root@pve-node2:~# omping -c 120 -i 1 -q 10.10.10.101 10.10.10.102
10.10.10.101 : waiting for response msg
10.10.10.101 : joined (S,G) = (*, 232.43.211.234), pinging
10.10.10.101 : given amount of query messages was sent

10.10.10.101 :   unicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.071/0.128/0.273/0.028
10.10.10.101 : multicast, xmt/rcv/%loss = 120/120+120/0%, min/avg/max/std-dev = 0.089/0.155/0.269/0.029
root@pve-node2:~# omping -c 120 -i 1 -q 10.71.200.101 10.71.200.102
10.71.200.101 : waiting for response msg
10.71.200.101 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.101 : given amount of query messages was sent

10.71.200.101 :   unicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.136/0.224/0.391/0.040
10.71.200.101 : multicast, xmt/rcv/%loss = 120/119/0% (seq>=2 0%), min/avg/max/std-dev = 0.153/0.240/0.411/0.040

However, ring 1 is still marked as FAULTY:

Code:

Apr 23 12:30:54 pve-node2 corosync[2332134]: error   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
Apr 23 12:30:54 pve-node2 corosync[2332134]:  [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
Apr 23 12:30:55 pve-node2 corosync[2332134]: notice  [TOTEM ] Automatically recovered ring 1
Apr 23 12:30:55 pve-node2 corosync[2332134]:  [TOTEM ] Automatically recovered ring 1
Apr 23 12:31:00 pve-node2 systemd[1]: Starting Proxmox VE replication runner...
Apr 23 12:31:00 pve-node2 systemd[1]: Started Proxmox VE replication runner.
Apr 23 12:31:14 pve-node2 corosync[2332134]: error   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
Apr 23 12:31:14 pve-node2 corosync[2332134]:  [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
Apr 23 12:31:15 pve-node2 corosync[2332134]: notice  [TOTEM ] Automatically recovered ring 1
Apr 23 12:31:15 pve-node2 corosync[2332134]:  [TOTEM ] Automatically recovered ring 1

Whatever · Apr 23, 2019

Enabling corosync debug does not reveal much:

Code:

[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 2
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 1
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 1
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 2
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 3
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 2
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 3
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 2
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 3
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1
[2332134] pve-node2 corosyncerror   [TOTEM ] Marking ringid 1 interface 10.71.200.102 FAULTY
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] received message requesting test of ring now active
[2332134] pve-node2 corosyncdebug   [TOTEM ] Received ring test activate message for ring 1 sent by node 1
[2332134] pve-node2 corosyncnotice  [TOTEM ] Automatically recovered ring 1

Code:

root@pve-node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.10.10.102
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.102
        status  = ring 1 active with no faults

Stoiko Ivanov · Apr 23, 2019

Please run both tests documented in https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network for both rings (i.e. 2 runs on both nodes, each with the 2 IPs as arguments).

* The second test in particular would probably show potential issues with a missing multicast querier on your network.
* Do you have any firewall rules declared on the nodes (`iptables-save` output of both nodes), or on some device in between?

* You could also try running the omping test with the mcast address of the respective ring provided (see `man omping`, the '-m' option).

However, it mostly sounds like a multicast querier or firewall issue.

Stoiko Ivanov · Apr 23, 2019

* This is also dealt with in https://bugzilla.proxmox.com/show_bug.cgi?id=2186

* Do you see anything in your nodes' `dmesg` or `journalctl -b` output? It might be related to an issue with a NIC.
* Did anything else change in your setup, and from which version did you upgrade to 5.4 (/var/log/dpkg.log)?

Whatever · Apr 23, 2019

As I already said, all the tests passed successfully:

Code:

root@pve-node1:~# omping -c 10000 -i 0.001 -F -q 10.71.200.101 10.71.200.102
10.71.200.102 : waiting for response msg
10.71.200.102 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.102 : given amount of query messages was sent

10.71.200.102 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.034/0.117/0.642/0.030
10.71.200.102 : multicast, xmt/rcv/%loss = 10000/9982/0% (seq>=19 0%), min/avg/max/std-dev = 0.039/0.114/0.643/0.032

root@pve-node2:~# omping -c 10000 -i 0.001 -F -q 10.71.200.101 10.71.200.102
10.71.200.101 : waiting for response msg
10.71.200.101 : waiting for response msg
10.71.200.101 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.101 : waiting for response msg
10.71.200.101 : server told us to stop

10.71.200.101 :   unicast, xmt/rcv/%loss = 9391/9391/0%, min/avg/max/std-dev = 0.036/0.116/0.262/0.030
10.71.200.101 : multicast, xmt/rcv/%loss = 9391/9391/0%, min/avg/max/std-dev = 0.042/0.122/0.290/0.031

Test#2

Code:

root@pve-node1:~# omping -c 120 -i 1 -q 10.71.200.101 10.71.200.102
10.71.200.102 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.102 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.102 : given amount of query messages was sent

root@pve-node2:~# omping -c 120 -i 1 -q 10.71.200.101 10.71.200.102
10.71.200.101 : waiting for response msg
10.71.200.101 : joined (S,G) = (*, 232.43.211.234), pinging
10.71.200.101 : given amount of query messages was sent

10.71.200.101 :   unicast, xmt/rcv/%loss = 120/120/0%, min/avg/max/std-dev = 0.136/0.224/0.391/0.040
10.71.200.101 : multicast, xmt/rcv/%loss = 120/119/0% (seq>=2 0%), min/avg/max/std-dev = 0.153/0.240/0.411/0.040

Stoiko Ivanov · Apr 23, 2019

Whatever said:
Test#2

Test 2 should run for 10 minutes instead of 2 in order to catch a non-configured multicast listener - please test with:
`omping -c 600 -i 1 -q <nodeip> <node2ip>`

Also try setting the multicast address to the one used by your corosync rings.
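
Combining the two suggestions, the 10-minute run with an explicit multicast address could be assembled as below. This is a dry-run sketch: the helper name `omping_cmd` is hypothetical, it only prints the command, and the address 239.192.91.17 is the ring 1 value from the earlier `corosync-cmapctl totem` output in this thread.

```shell
#!/bin/sh
# Sketch: print (do not run) the long omping test for one ring.
# $1 = multicast address of the ring, remaining args = node IPs on that ring.
# omping_cmd is a hypothetical helper; -c, -i, -q and -m are the omping
# options already used in this thread.
omping_cmd() {
    mcast=$1
    shift
    echo "omping -c 600 -i 1 -q -m $mcast $*"
}

# Ring 1 of the cluster discussed here (values taken from the posts above):
omping_cmd 239.192.91.17 10.71.200.101 10.71.200.102
```

The printed command has to be started on all listed nodes at roughly the same time, since each omping instance waits for its peers.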

Whatever · Apr 23, 2019

Done. Results are listed below

#1

Code:

root@pve-node1:~# omping -c 600 -m 239.192.91.17 -i 1 -q 10.71.200.100 10.71.200.101 10.71.200.102 10.71.200.103 10.71.200.104
10.71.200.100 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.103 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.102 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.104 : waiting for response msg
10.71.200.104 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : given amount of query messages was sent
10.71.200.102 : given amount of query messages was sent
10.71.200.103 : given amount of query messages was sent
10.71.200.104 : given amount of query messages was sent

10.71.200.100 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.060/0.195/0.309/0.039
10.71.200.100 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.065/0.202/0.341/0.040
10.71.200.102 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.062/0.146/0.291/0.034
10.71.200.102 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.155/0.294/0.036
10.71.200.103 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.077/0.163/0.272/0.038
10.71.200.103 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.080/0.176/0.297/0.039
10.71.200.104 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.057/0.121/0.300/0.033
10.71.200.104 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.065/0.129/0.310/0.035

#2

Code:

root@pve-node2:~# omping -c 600 -m 239.192.91.17 -i 1 -q 10.71.200.100 10.71.200.101 10.71.200.102 10.71.200.103 10.71.200.104
10.71.200.100 : waiting for response msg
10.71.200.101 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.100 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.103 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.104 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : given amount of query messages was sent
10.71.200.103 : given amount of query messages was sent
10.71.200.101 : given amount of query messages was sent
10.71.200.104 : given amount of query messages was sent

10.71.200.100 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.055/0.205/0.380/0.041
10.71.200.100 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.072/0.212/0.413/0.043
10.71.200.101 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.155/0.358/0.039
10.71.200.101 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.067/0.160/0.367/0.039
10.71.200.103 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.179/0.354/0.041
10.71.200.103 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.185/0.344/0.041
10.71.200.104 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.121/0.325/0.035
10.71.200.104 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.048/0.130/0.338/0.038

#3

Code:

root@pve-node3:~# omping -c 600 -m 239.192.91.17 -i 1 -q 10.71.200.100 10.71.200.101 10.71.200.102 10.71.200.103 10.71.200.104
10.71.200.100 : waiting for response msg
10.71.200.101 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.102 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.104 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : given amount of query messages was sent
10.71.200.102 : given amount of query messages was sent
10.71.200.101 : given amount of query messages was sent
10.71.200.104 : given amount of query messages was sent

10.71.200.100 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.051/0.165/0.380/0.052
10.71.200.100 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.172/0.386/0.053
10.71.200.101 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.042/0.154/0.295/0.047
10.71.200.101 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.166/0.299/0.050
10.71.200.102 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.043/0.152/0.333/0.047
10.71.200.102 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.160/0.351/0.048
10.71.200.104 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.112/0.387/0.031
10.71.200.104 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.120/0.394/0.033

#4

Code:

root@pve-node4:~# omping -c 600 -m 239.192.91.17 -i 1 -q 10.71.200.100 10.71.200.101 10.71.200.102 10.71.200.103 10.71.200.104
10.71.200.100 : waiting for response msg
10.71.200.101 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.103 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.102 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.100 : given amount of query messages was sent
10.71.200.101 : given amount of query messages was sent
10.71.200.102 : given amount of query messages was sent
10.71.200.103 : given amount of query messages was sent

10.71.200.100 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.063/0.152/0.454/0.052
10.71.200.100 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.160/0.375/0.045
10.71.200.101 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.060/0.163/0.297/0.043
10.71.200.101 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.171/0.699/0.053
10.71.200.102 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.051/0.160/0.421/0.052
10.71.200.102 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.169/0.345/0.048
10.71.200.103 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.143/0.299/0.042
10.71.200.103 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.059/0.150/0.435/0.049

#5

Code:

root@pve:~# omping -c 600 -m 239.192.91.17 -i 1 -q 10.71.200.100 10.71.200.101 10.71.200.102 10.71.200.103 10.71.200.104
10.71.200.101 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.101 : waiting for response msg
10.71.200.102 : waiting for response msg
10.71.200.103 : waiting for response msg
10.71.200.104 : waiting for response msg
10.71.200.101 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.103 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.104 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.102 : joined (S,G) = (*, 239.192.91.17), pinging
10.71.200.101 : given amount of query messages was sent
10.71.200.102 : given amount of query messages was sent
10.71.200.103 : given amount of query messages was sent
10.71.200.104 : given amount of query messages was sent

10.71.200.101 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.163/0.369/0.058
10.71.200.101 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.048/0.166/0.363/0.058
10.71.200.102 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.148/0.620/0.056
10.71.200.102 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.042/0.155/0.620/0.059
10.71.200.103 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.130/0.343/0.046
10.71.200.103 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.133/0.300/0.045
10.71.200.104 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.112/0.398/0.042
10.71.200.104 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.118/0.397/0.047

What is much more interesting is that the multicast address disappears from the corosync totem configuration.

After restarting corosync (or the node) I got:
root@pve-node2:~# corosync-cmapctl totem

totem.cluster_name (str) = pve-cluster
totem.config_version (u64) = 11
totem.interface.0.bindnetaddr (str) = 10.10.10.102
totem.interface.0.mcastaddr (str) = 239.192.91.16
totem.interface.0.mcastport (u16) = 5405
totem.interface.1.bindnetaddr (str) = 10.71.200.102
totem.interface.1.mcastaddr (str) = 239.192.91.17
totem.interface.1.mcastport (u16) = 5405
totem.ip_version (str) = ipv4
totem.rrp_mode (str) = passive
totem.secauth (str) = on
totem.version (u32) = 2

But now on the same node I see:
root@pve-node2:~# corosync-cmapctl totem

totem.cluster_name (str) = pve-cluster
totem.config_version (u64) = 13
totem.interface.0.bindnetaddr (str) = 10.10.10.101
totem.interface.1.bindnetaddr (str) = 10.71.200.101
totem.ip_version (str) = ipv4
totem.rrp_mode (str) = passive
totem.secauth (str) = on

10.10.10.x - 10Gbe net (ring#0 - OK)
10.71.200.x - 1Gbe net (ring#1 - FAULTY)

239.192.91.16 - multicast address on ring#0 (10Gbe switch)
239.192.91.17 - multicast address on ring#1 (1Gbe switch)

Whatever · Apr 23, 2019

If I'm not mistaken, the multicast address is determined from the hostname:

Code:

root@pve-node2:~# hostname -f
pve-node2.imq.marcellis.local

root@pve-node2:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost

10.10.10.100 pve.vlp.marcellis.local pve
10.10.10.101 pve-node1.imq.marcellis.local pve-node1
10.10.10.102 pve-node2.imq.marcellis.local pve-node2 pvelocalhost
10.10.10.103 pve-node3.imq.marcellis.local pve-node3
10.10.10.104 pve-node4.imq.marcellis.local pve-node4


# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Whatever · Apr 23, 2019

Setting the multicast addresses explicitly in the totem section didn't help:

Code:

  interface {
    bindnetaddr: 10.10.10.101
    ringnumber: 0
    mcastaddr: 239.192.91.16
    mcastport: 5405
  }
  interface {
    bindnetaddr: 10.71.200.101
    ringnumber: 1
    mcastaddr: 239.192.91.17
    mcastport: 5405
  }

Stoiko Ivanov · Apr 23, 2019

I tried reproducing the setup here locally (albeit with 3 virtual nodes on the same host), and here it works.
The mcastaddr is also not printed, although both rings are healthy:

Code:

# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
        id      = 10.16.201.103
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.201.103
        status  = ring 1 active with no faults

Code:

# corosync-cmapctl totem
totem.cluster_name (str) = pve-ceph
totem.config_version (u64) = 16
totem.interface.0.bindnetaddr (str) = 10.16.201.101
totem.interface.1.bindnetaddr (str) = 172.16.201.101
totem.ip_version (str) = ipv4
totem.rrp_mode (str) = passive
totem.secauth (str) = on
totem.version (u32) = 2

* Do you see the same output (ring marked faulty) on all 5 nodes?

* Since your 2 rings are active and not faulty directly after a fresh start of corosync: is there any chance that the ring 1 switch has some kind of storm control, anti-DDoS or similar feature enabled which limits the number of multicast/broadcast packets?
* Do you have any iptables/nftables rules configured on any of the nodes? (`iptables-save`, `nft list ruleset`)

Whatever · Apr 23, 2019

I've checked:

#1

Code:

root@pve-node1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 10.10.10.101
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.101
        status  = ring 1 active with no faults

#2

Code:

root@pve-node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.10.10.102
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.102
        status  = ring 1 active with no faults

#3

Code:

root@pve-node3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
        id      = 10.10.10.103
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.103
        status  = ring 1 active with no faults

#4

Code:

root@pve-node4:~# corosync-cfgtool -s
Printing ring status.
Local node ID 4
RING ID 0
        id      = 10.10.10.104
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.104
        status  = ring 1 active with no faults

#5

Code:

root@pve:~# corosync-cfgtool -s
Printing ring status.
Local node ID 5
RING ID 0
        id      = 10.10.10.100
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.100
        status  = ring 1 active with no faults

But then I called this command several times and got:

Code:

root@pve-node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.10.10.102
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.102
        status  = ring 1 active with no faults
root@pve-node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.10.10.102
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.102
        status  = Marking ringid 1 interface 10.71.200.102 FAULTY
root@pve-node2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 10.10.10.102
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.71.200.102
        status  = ring 1 active with no faults

So the ring is marked as FAULTY and then the state is cleared again.
That is in line with the syslog messages:

Code:

Apr 23 19:17:36 pve-node3 corosync[3695]:  [TOTEM ] Received ring test activate message for ring 1 sent by node 4
Apr 23 19:17:36 pve-node3 corosync[3695]:  [TOTEM ] Automatically recovered ring 1
Apr 23 19:17:57 pve-node3 corosync[3695]: debug   [TOTEM ] received message requesting test of ring now active
Apr 23 19:17:57 pve-node3 corosync[3695]:  [TOTEM ] received message requesting test of ring now active
Apr 23 19:17:57 pve-node3 corosync[3695]:  [TOTEM ] received message requesting test of ring now active
Apr 23 19:17:57 pve-node3 corosync[3695]: debug   [TOTEM ] Received ring test activate message for ring 1 sent by node 1
Apr 23 19:17:57 pve-node3 corosync[3695]: debug   [TOTEM ] Received ring test activate message for ring 1 sent by node 4
Apr 23 19:17:57 pve-node3 corosync[3695]:  [TOTEM ] Received ring test activate message for ring 1 sent by node 1
Apr 23 19:17:57 pve-node3 corosync[3695]:  [TOTEM ] Received ring test activate message for ring 1 sent by node 4

I believe that "Automatically recovered ring 1" clears the FAULTY state of ring 1.
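
Because the FAULTY state only lasts about a second, calling `corosync-cfgtool -s` by hand can easily miss it. A rough way to catch it is to reduce the output to one line per ring and poll in a loop. This is a sketch that assumes the output layout shown above; `ring_states` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: reduce `corosync-cfgtool -s` output (read on stdin) to one line
# per ring. ring_states is a hypothetical helper, not a corosync tool.
ring_states() {
    awk '
        /^RING ID/ { ring = $3 }
        /^[ \t]*status[ \t]*=/ {
            state = ($0 ~ /FAULTY/) ? "FAULTY" : "ok"
            print "ring " ring ": " state
        }
    '
}

# Polling example on a cluster node (stop with Ctrl-C):
#   while sleep 1; do date; corosync-cfgtool -s | ring_states; done
```

Running the loop on several nodes at once should show whether they report ring 1 as faulty at the same moments.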

Whatever · Apr 23, 2019

Could it be related to network bond usage?

On all my nodes the network configuration looks like this:

Code:

root@pve-node3:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp3s0f1 inet manual

iface ens1f0 inet manual

iface ens1f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 enp3s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000

auto bond1
iface bond1 inet static
        address  10.10.10.103
        netmask  255.255.255.0
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address  10.71.200.103
        netmask  255.255.255.0
        gateway  10.71.200.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

Code:

root@pve-node3:~# dmesg | grep bond
[    7.530977] bond0: Enslaving eno1 as a backup interface with a down link
[    7.591318] bond0: Enslaving enp3s0f1 as a backup interface with a down link
[    7.591969] vmbr0: port 1(bond0) entered blocking state
[    7.591971] vmbr0: port 1(bond0) entered disabled state
[    7.592398] device bond0 entered promiscuous mode
[    7.800716] bonding: bond1 is being created...
[    8.049613] bond1: Enslaving ens1f0 as a backup interface with a down link
[    8.293574] bond1: Enslaving ens1f1 as a backup interface with a down link
[    8.295693] device bond0 left promiscuous mode
[    8.952181] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[    9.056080] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
[    9.576089] bond1: link status definitely up for interface ens1f0, 10000 Mbps full duplex
[    9.576093] bond1: first active interface up!
[   10.828060] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   10.829053] bond0: link status definitely up for interface enp3s0f1, 1000 Mbps full duplex
[   10.829056] bond0: first active interface up!
[   10.829068] vmbr0: port 1(bond0) entered blocking state
[   10.829069] vmbr0: port 1(bond0) entered forwarding state
[   11.361844] bond0: link status definitely up for interface eno1, 1000 Mbps full duplex
[   12.816212] bond1: link status definitely up for interface ens1f1, 10000 Mbps full duplex
[   99.898168] device bond0 entered promiscuous mode

Stoiko Ivanov · Apr 23, 2019

Whatever said:
[ 8.952181] bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond

Hm - these messages look a bit suspicious (although it could only be a timing issue)
* please verify that the bond-mode is indeed lacp/802.3ad - (`ip -details addr` gives the fitting hints on host side) - as for the switch-side it depends on your switch config (AFAIR for cisco you need to have a portchannel configured for both interfaces where you want to have a LACP bond)

* and AFAIR multicast and LACP could very well end up mixing packets up - although I'd suspect that corosync should be able to handle this to some extent

Also, I am still not sure whether all nodes have the faulty ring or whether it happens only on pve-node2 (most output with this error is from that node - but that could be just coincidence)

If possible try to take the bond out of the equation (by reconfiguring vmbr0 on top of one of the interfaces only, or with mode active-backup)

Hope this helps!
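A minimal sketch of the host-side bond-mode check mentioned above. The helper name `bond_mode_from_ip` is made up for illustration; it only pulls the word following `bond mode` out of `ip -details` output, and `bond0` is the interface name used in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of iproute2): reads `ip -details` output
# on stdin and prints the bonding mode, i.e. the word after "bond mode".
bond_mode_from_ip() {
    awk '{
        for (i = 1; i < NF - 1; i++)
            if ($i == "bond" && $(i + 1) == "mode") { print $(i + 2); exit }
    }'
}

# Usage on a node:
#   ip -details link show bond0 | bond_mode_from_ip
```

For an LACP bond this should print `802.3ad`. The kernel also exposes the LACP partner details in `/proc/net/bonding/bond0`, which can help when comparing against the switch side.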

XoCluTch · Apr 23, 2019

I'm having a very similar issue, maybe even the same one, after updating. However, I only had one ring set up, and the problem was preventing my cluster nodes from communicating with each other. I eventually added a secondary ring, which resolved the communication issue. However, the primary ring is still not working correctly. This also happened after I upgraded to the latest version. I will post logs and more details as soon as I am able.

I have checked everything and, as far as I can tell, there are no actual communication issues; omping and other tests show the devices are able to communicate.

However, I have noticed that the mcastaddr values on my server appear to be the same. I am going to set them manually to make sure there isn't some sort of multicast IP conflict happening.

XoCluTch · Apr 23, 2019

# omping -c 10000 -i 0.001 -F -q -m 239.192.239.81 172.16.4.32 172.16.4.33
172.16.4.33 : waiting for response msg
172.16.4.33 : joined (S,G) = (*, 239.192.239.81), pinging
172.16.4.33 : given amount of query messages was sent

172.16.4.33 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.041/0.104/0.492/0.030
172.16.4.33 : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 0.049/0.145/0.492/0.040
root@xoserver02:/etc/pve# omping -c 10000 -i 0.001 -F -q -m 239.192.239.81 192.168.4.33 192.168.4.32
192.168.4.33 : waiting for response msg
192.168.4.33 : joined (S,G) = (*, 239.192.239.81), pinging
192.168.4.33 : given amount of query messages was sent

192.168.4.33 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.052/0.177/0.325/0.032
192.168.4.33 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

Stoiko Ivanov · Apr 24, 2019

XoCluTch said:
I'm having a very similar issue, maybe even the same one, after updating. However, I only had one ring set up, and the problem was preventing my cluster nodes from communicating with each other. I eventually added a secondary ring, which resolved the communication issue. However, the primary ring is still not working correctly. This also happened after I upgraded to the latest version. I will post logs and more details as soon as I am able.

I don't think that the issue is directly related - however is your original first ring on a bond? - if so which bond mode? (check with `ip -details addr`)

XoCluTch said:
However, I have noticed that the mcastaddr values on my server appear to be the same. I am going to set them manually to make sure there isn't some sort of multicast IP conflict happening.

Not sure I completely understand this - but the multicast address has to be the same for one ring of one cluster (all nodes participating in the corosync ring subscribe to the multicast group with that address) - do you have multiple clusters?

XoCluTch said:
192.168.4.33 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

This is quite a good indication that there is a multicast problem somewhere!
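To make that check repeatable, here is a small sketch that pulls the multicast loss figure out of omping's summary lines. The function name `omping_mcast_loss` is hypothetical, and the parsing assumes the summary format shown in the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads omping output on stdin and prints
# "<host> <loss%>" for every multicast summary line.
omping_mcast_loss() {
    awk '$3 == "multicast," {
        split($6, f, "/")        # f[1]=xmt, f[2]=rcv, f[3]=loss with "%" suffix
        sub(/%,?$/, "", f[3])    # strip the trailing "%" and optional comma
        print $1, f[3]
    }'
}

# Usage (addresses and group taken from the post above):
#   omping -c 10000 -i 0.001 -F -q -m 239.192.239.81 192.168.4.33 192.168.4.32 | omping_mcast_loss
```

Any value other than 0 on a ring's network is worth investigating. 100 with clean unicast, as in the second run above, suggests multicast is being dropped somewhere between the nodes.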

Whatever · Apr 24, 2019

Stoiko Ivanov said:
Hm - these messages look a bit suspicious (although it could only be a timing issue)
* please verify that the bond-mode is indeed lacp/802.3ad - (`ip -details addr` gives the fitting hints on host side) - as for the switch-side it depends on your switch config (AFAIR for cisco you need to have a portchannel configured for both interfaces where you want to have a LACP bond)

Checked. LACP is properly configured and working (it was working before the upgrade to 5.4 as well).

Stoiko Ivanov said:
Also, I am still not sure whether all nodes have the faulty ring or whether it happens only on pve-node2 (most output with this error is from that node - but that could be just coincidence)

If possible try to take the bond out of the equation (by reconfiguring vmbr0 on top of one of the interfaces only, or with mode active-backup)

Hope this helps!

Thanks for the suggestion. It's quite hard to test on a production cluster. However, I will give it a try, and I will also try setting the MTU to 1500 instead of 9000.

P.S. MTU 9000 is the only difference from an almost identical cluster (still on 5.3), which works perfectly well with both rings.

Stoiko Ivanov · Apr 24, 2019

Whatever said:
Thanks for the suggestion. It's quite hard to test on a production cluster. However, I will give it a try, and I will also try setting the MTU to 1500 instead of 9000.

P.S. MTU 9000 is the only difference from an almost identical cluster (still on 5.3), which works perfectly well with both rings.

* if possible please post your (redacted if needed) `ip -details addr` output - maybe there's some discrepancy between config ('/etc/network/interfaces') and reality.
* also are the LACP/MTU settings active and working on the switch side?
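One way to check whether jumbo frames really pass end to end is a ping with fragmentation prohibited. The sketch below only computes the payload size to use; `icmp_payload_for_mtu` is a made-up helper name, and the usage line assumes the iputils `ping` shipped with Linux (`-M do` prohibits fragmentation).

```shell
#!/bin/sh
# Hypothetical helper: largest ICMP echo payload that fits in one
# unfragmented IPv4 packet for a given MTU.
icmp_payload_for_mtu() {
    # 20 bytes IPv4 header + 8 bytes ICMP header are not counted by ping -s
    echo $(( $1 - 28 ))
}

# Usage (10.71.200.102 is pve-node2's ring1 address from this thread):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" 10.71.200.102
```

If this ping fails at MTU 9000 but works with the payload for 1500, something on the path (switch port, LAG, or a NIC) is likely not passing jumbo frames.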

Whatever · Apr 24, 2019

Stoiko Ivanov said:
* if possible please post your (redacted if needed) `ip -details addr` output - maybe there's some discrepancy between config ('/etc/network/interfaces') and reality.

Checked.

#1

Code:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a7:30 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:a7:30 queue_id 0 ad_aggregator_id 2 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
3: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 0c:c4:7a:1d:8c:c6 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:1d:8c:c6 queue_id 0 ad_aggregator_id 2 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
4: enp3s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a7:30 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:a7:31 queue_id 0 ad_aggregator_id 2 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
5: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 0c:c4:7a:1d:8c:c6 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:1d:8c:c7 queue_id 0 ad_aggregator_id 2 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a7:30 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode 802.3ad miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2+3 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable ad_aggregator 2 ad_num_ports 2 ad_actor_key 9 ad_partner_key 1003 ad_partner_mac 02:04:96:8b:a0:dd ad_actor_sys_prio 65535 ad_user_port_key 0 ad_actor_system 00:00:00:00:00:00:00:00 tlb_dynamic_lb 1
    bridge_slave state forwarding priority 32 cost 4 hairpin off guard off root_block off fastleave off learning on flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.c:c4:7a:2b:a7:30 designated_root 8000.c:c4:7a:2b:a7:30 hold_timer    0.00 message_age_timer    0.00 forward_delay_timer    0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on neigh_suppress off group_fwd_mask 0x0 group_fwd_mask_str 0x0 vlan_tunnel off numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
7: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a7:30 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bridge forward_delay 0 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 1 vlan_protocol 802.1Q bridge_id 8000.c:c4:7a:2b:a7:30 designated_root 8000.c:c4:7a:2b:a7:30 root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer  137.46 vlan_default_pvid 1 vlan_stats_enabled 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 4 mcast_hash_max 512 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.71.200.101/24 brd 10.71.200.255 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe2b:a730/64 scope link
       valid_lft forever preferred_lft forever

#2

Code:

root@pve-node2:~# ip -details addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:ab:5c brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:ab:5c queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
3: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 0c:c4:7a:1d:87:98 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:1d:87:98 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
4: enp3s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:ab:5c brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:ab:5d queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
5: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 0c:c4:7a:1d:87:98 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:1d:87:99 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:ab:5c brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode 802.3ad miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2+3 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable ad_aggregator 1 ad_num_ports 2 ad_actor_key 9 ad_partner_key 1004 ad_partner_mac 02:04:96:8b:a0:dd ad_actor_sys_prio 65535 ad_user_port_key 0 ad_actor_system 00:00:00:00:00:00:00:00 tlb_dynamic_lb 1
    bridge_slave state forwarding priority 32 cost 4 hairpin off guard off root_block off fastleave off learning on flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.c:c4:7a:2b:ab:5c designated_root 8000.c:c4:7a:2b:ab:5c hold_timer    0.00 message_age_timer    0.00 forward_delay_timer    0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on neigh_suppress off group_fwd_mask 0x0 group_fwd_mask_str 0x0 vlan_tunnel off numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
7: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:ab:5c brd ff:ff:ff:ff:ff:ff promiscuity 0
    bridge forward_delay 0 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 1 vlan_protocol 802.1Q bridge_id 8000.c:c4:7a:2b:ab:5c designated_root 8000.c:c4:7a:2b:ab:5c root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer   61.22 vlan_default_pvid 1 vlan_stats_enabled 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 4 mcast_hash_max 512 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.71.200.102/24 brd 10.71.200.255 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe2b:ab5c/64 scope link
       valid_lft forever preferred_lft forever

#3

Code:

root@pve-node3:~# ip -details addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a2:da brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:a2:da queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
3: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 00:25:90:64:3b:8e brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 00:25:90:64:3b:8e queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
4: enp3s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a2:da brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 0c:c4:7a:2b:a2:db queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
5: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 00:25:90:64:3b:8e brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 00:25:90:64:3b:8f queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 61 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a2:da brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode 802.3ad miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2+3 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable ad_aggregator 1 ad_num_ports 2 ad_actor_key 9 ad_partner_key 1005 ad_partner_mac 02:04:96:8b:a0:dd ad_actor_sys_prio 65535 ad_user_port_key 0 ad_actor_system 00:00:00:00:00:00:00:00 tlb_dynamic_lb 1
    bridge_slave state forwarding priority 32 cost 4 hairpin off guard off root_block off fastleave off learning on flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.c:c4:7a:2b:a2:da designated_root 8000.c:c4:7a:2b:a2:da hold_timer    0.00 message_age_timer    0.00 forward_delay_timer    0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on neigh_suppress off group_fwd_mask 0x0 group_fwd_mask_str 0x0 vlan_tunnel off numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
7: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:c4:7a:2b:a2:da brd ff:ff:ff:ff:ff:ff promiscuity 0
    bridge forward_delay 0 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 1 vlan_protocol 802.1Q bridge_id 8000.c:c4:7a:2b:a2:da designated_root 8000.c:c4:7a:2b:a2:da root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer    0.00 tcn_timer    0.00 topology_change_timer    0.00 gc_timer  136.56 vlan_default_pvid 1 vlan_stats_enabled 0 group_fwd_mask 0 group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0 mcast_hash_elasticity 4 mcast_hash_max 512 mcast_last_member_count 2 mcast_startup_query_count 2 mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500 mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3124 mcast_stats_enabled 0 mcast_igmp_version 2 mcast_mld_version 1 nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 10.71.200.103/24 brd 10.71.200.255 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe2b:a2da/64 scope link
       valid_lft forever preferred_lft forever
