Redundant ring for Corosync: faulty

tobby · Jan 17, 2018

I set up a test cluster with three nodes and wanted to try out the redundant ring for Corosync. On the first node I ran:

Code:

pvecm create testcluster -bindnet0_addr 172.26.1.1 -ring0_addr 172.26.1.1 -bindnet1_addr 172.28.1.1 -ring1_addr 172.28.1.1

On the second node I did this:

Code:

pvecm add 172.26.1.1 -ring0_addr 172.26.1.2 -ring1_addr 172.28.1.2

And on the third:

Code:

pvecm add 172.26.1.1 -ring0_addr 172.26.1.3 -ring1_addr 172.28.1.3

Looking into the logs (journalctl -b -u corosync) gives me this:

Jan 17 10:48:56 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e30 23e32
Jan 17 10:48:56 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e32
Jan 17 10:48:56 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e32
Jan 17 10:48:57 t-pve2 corosync[4304]: error [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:48:57 t-pve2 corosync[4304]: [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:48:58 t-pve2 corosync[4304]: notice [TOTEM ] Automatically recovered ring 1
Jan 17 10:48:58 t-pve2 corosync[4304]: [TOTEM ] Automatically recovered ring 1
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e44
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e44
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e47
[...]
Jan 17 10:49:22 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e87
Jan 17 10:49:25 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e89 23e8b 23e8d 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e89 23e8b 23e8d 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e8b 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e8b 23e8f
Jan 17 10:49:27 t-pve2 corosync[4304]: error [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:49:27 t-pve2 corosync[4304]: [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:49:28 t-pve2 corosync[4304]: notice [TOTEM ] Automatically recovered ring 1
Jan 17 10:49:28 t-pve2 corosync[4304]: [TOTEM ] Automatically recovered ring 1

Every 30s it's marking my ring1 as "faulty" and 1s later it recovers the ring.
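To confirm the interval without reading the log by eye, the FAULTY timestamps can be pulled out with a small filter. This is only a sketch: `faulty_times` is a hypothetical helper name, and it assumes the default journalctl output format shown above, where the time is the third whitespace-separated field and each message is logged twice in a row.

```shell
# Hypothetical helper: print the time of each "FAULTY" marking once.
# Assumes journalctl's default short format (month, day, time, host, ...).
faulty_times() {
  # Field 3 is the HH:MM:SS timestamp; uniq collapses the adjacent
  # duplicate that corosync logs for every message.
  awk '/Marking ringid .* FAULTY/ { print $3 }' | uniq
}
```

Usage would be `journalctl -b -u corosync | faulty_times`. For the excerpt above it should list 10:48:57 and 10:49:27, which matches the 30 s pattern.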

Did I do anything wrong?

The servers are very old ones (~10 years) but this is just equipment for testing purposes. Both network cards are Gigabit ones, but the one for ring0 is an onboard Intel card and connected to a Gigabit switch while the second network card (for ring1) is a cheap Realtek card connected to a 100 Mbit switch.

wolfgang · Jan 22, 2018

Hi,

No, you did nothing wrong, but I think your network is too slow (latency).
Retransmits mean that packets are being lost or are not arriving in time.
You can use omping to test the networks.
See https://pve.proxmox.com/wiki/Multicast_notes
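As a rough sketch of such a test for the second ring: the addresses below are the ring1 addresses from the pvecm commands earlier in the thread, while the function name `ring1_long_test` and the variable `RING1_NODES` are made up for illustration. The count and interval (600 packets, one per second, about ten minutes) are one reasonable choice for catching problems that only show up after a while, such as IGMP snooping timeouts; adjust as needed. omping has to be installed and started on all nodes at roughly the same time.

```shell
# ring1 addresses of the three nodes (from the pvecm commands in this thread)
RING1_NODES="172.28.1.1 172.28.1.2 172.28.1.3"

# Hypothetical wrapper: run this on every node at about the same time.
# -c 600  send 600 packets
# -i 1    one packet per second (roughly ten minutes in total)
# -q      quiet mode, print only the summary
ring1_long_test() {
  # Intentionally unquoted so each address becomes its own argument.
  omping -c 600 -i 1 -q $RING1_NODES
}
```

After defining it, call `ring1_long_test` on each node. Multicast loss in the summary that is much higher than unicast loss would point at the switch rather than at the network cards.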

tobby · Jul 26, 2018

On one of my three switches (the old 100 Mbit Netgear one) I set "Block Unknown Multicast Address" under "IGMP Snooping Setting" to "Disable", and now it seems to work.
