Corosync redundant rings: strange fault detection when a ring is down

czechsys

Renowned Member
Nov 18, 2015
419
43
93
Hi,

I'm seeing strange corosync behavior: it doesn't mark a specific interface as failed when I set it down.

Package versions:
proxmox-ve: 4.4-86 (running kernel: 4.4.49-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
openvswitch-switch: 2.6.0-2

auto eth3
iface eth3 inet static
    address 10.0.40.41
    netmask 255.255.255.0
#pve-02 node coro0

allow-vmbr1 bond1
iface bond1 inet manual
    ovs_bridge vmbr1
    ovs_type OVSBond
    ovs_bonds eth4 eth5
    ovs_options lacp=active bond-mode=balance-tcp
    pre-up ( /sbin/ip link set eth4 mtu 9000 && /sbin/ip link set eth5 mtu 9000 )
    up /sbin/ip link set mtu 9000 bond1

auto vmbr1
iface vmbr1 inet manual
    ovs_type OVSBridge
    ovs_ports bond1 pve_01_nfs pve_01_coro1
    mtu 9000

allow-vmbr1 pve_01_coro1
iface pve_01_coro1 inet static
    address 10.0.50.41
    netmask 255.255.255.0
    ovs_type OVSIntPort
    ovs_bridge vmbr1
    ovs_options tag=143
    mtu 9000
#pve-01 node coro1

totem {
    version: 2
    secauth: on
    cluster_name: pve-0
    config_version: 2
    ip_version: ipv4
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.40.41
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.50.41
    }
}

nodelist {
    node {
        ring0_addr: pve-01-coro0
        ring1_addr: pve-01-coro1
        name: pve-01
        nodeid: 1
        quorum_votes: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_syslog: yes
    debug: off
}

10.0.40.42 : joined (S,G) = (*, 232.43.211.234), pinging
10.0.40.42 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.122/0.154/0.177/0.023
10.0.40.42 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.129/0.163/0.195/0.025

10.0.50.42 : joined (S,G) = (*, 232.43.211.234), pinging
10.0.50.42 : unicast, xmt/rcv/%loss = 68/68/0%, min/avg/max/std-dev = 0.187/0.217/0.278/0.025
10.0.50.42 : multicast, xmt/rcv/%loss = 68/68/0%, min/avg/max/std-dev = 0.167/0.234/0.466/0.035

Setup:
1] eth3 - physical, primary corosync ring0, connected to Cisco 3560G
2] pve_01_coro1 - openvswitch, secondary corosync ring1, connected to Huawei S6720

Both 1] & 2] up:
Code:
Local node ID 1
RING ID 0
    id    = 10.0.40.41
    status    = ring 0 active with no faults
RING ID 1
    id    = 10.0.50.41
    status    = ring 1 active with no faults

Case A: ip link set eth3 down:
Code:
RING ID 0
    id    = 127.0.0.1
    status    = ring 0 active with no faults
RING ID 1
    id    = 10.0.50.41
    status    = ring 1 active with no faults

Case B: ip link set pve_01_coro1 down:
Code:
RING ID 0
    id    = 10.0.40.41
    status    = ring 0 active with no faults
RING ID 1
    id    = 10.0.50.41
    status    = Marking ringid 1 interface 127.0.0.1 FAULTY

Case C: ip link set down eth3 && ip link set down pve_01_coro1:
Code:
RING ID 0
    id    = 127.0.0.1
    status    = Marking ringid 0 interface 127.0.0.1 FAULTY
RING ID 1
    id    = 10.0.50.41
    status    = ring 1 active with no faults
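
The ring states above (presumably from `corosync-cfgtool -s`) can be watched with a small helper. This is a hypothetical sketch that just parses the output format shown above and flags any ring whose status line contains FAULTY:

```shell
# check_rings: hypothetical helper; reads `corosync-cfgtool -s` output on
# stdin and prints every ring whose status line is marked FAULTY.
check_rings() {
    awk '
        /^RING ID/           { ring = $3 }   # remember the current ring number
        /status/ && /FAULTY/ { print "ring " ring " FAULTY" }
    '
}

# Example: poll the rings once per second
#   while sleep 1; do corosync-cfgtool -s | check_rings; done
```

With `rrp_mode: passive`, a ring that stays marked faulty can also be re-enabled manually with `corosync-cfgtool -r`.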

Case C shows even more varied statuses, depending on which interface goes down first, etc. Anyway, something seems wrong with corosync. Do I have a config error or a flaw in the setup? Can it be a problem that both rings use the same address on different switches? Do I need to define an mcast address in corosync.conf for such a setup? There is nothing about this in the Proxmox wiki.
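
For reference, corosync does allow an explicit multicast address and port per interface. A sketch of what that would look like (the addresses below are made-up examples, not taken from this setup):

```text
totem {
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.40.0
        mcastaddr: 239.192.40.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.50.0
        mcastaddr: 239.192.50.1
        mcastport: 5407
    }
}
```

Note that `bindnetaddr` is normally the network address (e.g. 10.0.40.0) rather than a host address, and the two rings should use distinct `mcastaddr`/`mcastport` pairs, with the ports at least two apart, since corosync also uses port - 1.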

Thanks.
 
I moved both rings to openvswitch. Same problem. Then I moved both rings from openvswitch to physical interfaces, so both ended up on the same Cisco, each ring in a different non-routable VLAN. Ring0 via vlanY is directly connected on the Cisco; ring1 via vlanX (Po3, Po4) goes over the interconnect between the Cisco switches.

Code:
vlanX      239.192.12.219    igmp        v2          Po3, Po4
vlanY      239.192.12.218    igmp        v2          Gi0/1, Gi0/2

This is with both interfaces down:
Code:
RING ID 0
    id    = 127.0.0.1
    status    = ring 0 active with no faults
RING ID 1
    id    = 127.0.0.1
    status    = Marking ringid 1 interface 127.0.0.1 FAULTY

This is after bringing one interface up (from both down):
Code:
RING ID 0
   id   = 10.0.40.41
   status   = ring 0 active with no faults
RING ID 1
   id   = 127.0.0.1
   status   = ring 1 active with no faults

This is after bringing the second interface up (from one up, one down):
Code:
RING ID 0
   id   = 10.0.40.41
   status   = ring 0 active with no faults
RING ID 1
   id   = 127.0.0.1
   status   = Marking ringid 1 interface 10.0.50.41 FAULTY

Apr 18 16:36:15 pve-02 corosync[1902]:  [TOTEM ] Marking ringid 0 interface 10.0.40.42 FAULTY
Apr 18 16:36:16 pve-02 corosync[1902]:  [TOTEM ] Automatically recovered ring 0
Apr 18 16:41:35 pve-02 corosync[1902]:  [TOTEM ] Marking ringid 1 interface 10.0.50.42 FAULTY
Apr 18 16:41:35 pve-02 corosync[1902]:  [TOTEM ] Retransmit List: 445 447 449
Apr 18 16:42:47 pve-02 corosync[1902]:  [TOTEM ] Automatically recovered ring 1
Apr 18 16:43:06 pve-02 corosync[1902]:  [TOTEM ] Retransmit List: 531 533 535
Apr 18 16:43:07 pve-02 corosync[1902]:  [TOTEM ] Retransmit List: 531 535
Apr 18 16:43:07 pve-02 corosync[1902]:  [TOTEM ] Retransmit List: 535
Apr 18 16:43:07 pve-02 corosync[1902]:  [TOTEM ] Marking ringid 0 interface 10.0.40.42 FAULTY
Apr 18 16:43:23 pve-02 corosync[1902]:  [TOTEM ] A processor failed, forming new configuration.
Apr 18 16:43:24 pve-02 corosync[1902]:  [TOTEM ] A new membership (10.0.40.42:2100) was formed. Members left: 1
Apr 18 16:43:24 pve-02 corosync[1902]:  [TOTEM ] Failed to receive the leave message. failed: 1
Apr 18 16:43:24 pve-02 corosync[1902]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 18 16:43:24 pve-02 corosync[1902]:  [QUORUM] Members[1]: 2
Apr 18 16:43:24 pve-02 corosync[1902]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr 18 16:43:25 pve-02 corosync[1902]:  [TOTEM ] Automatically recovered ring 0
Apr 18 16:44:30 pve-02 corosync[1902]:  [TOTEM ] A new membership (10.0.40.41:2120) was formed. Members joined: 1
Apr 18 16:44:31 pve-02 corosync[1902]:  [TOTEM ] Marking ringid 1 interface 10.0.50.42 FAULTY
Apr 18 16:44:31 pve-02 corosync[1902]:  [TOTEM ] Retransmit List: 1
Apr 18 16:44:31 pve-02 corosync[1902]:  [QUORUM] This node is within the primary component and will provide service.
Apr 18 16:44:31 pve-02 corosync[1902]:  [QUORUM] Members[2]: 1 2
Apr 18 16:44:31 pve-02 corosync[1902]:  [MAIN  ] Completed service synchronization, ready to provide service.
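
When comparing logs from both nodes, it can help to filter out just the ring fault/recovery transitions. A trivial sketch, assuming the syslog format shown above:

```shell
# ring_transitions: hypothetical filter; keeps only the corosync TOTEM lines
# that mark a ring FAULTY or report its automatic recovery.
ring_transitions() {
    grep -E 'Marking ringid|Automatically recovered'
}

# Example: ring_transitions < /var/log/syslog
```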

Hm, any idea?
 
