Redundant ring for Corosync: faulty

tobby

I installed a test-cluster with three nodes and wanted to test the redundant ring for corosync. On the first node I executed
Code:
pvecm create testcluster -bindnet0_addr 172.26.1.1 -ring0_addr 172.26.1.1 -bindnet1_addr 172.28.1.1 -ring1_addr 172.28.1.1

On the second node I did this:
Code:
pvecm add 172.26.1.1 -ring0_addr 172.26.1.2 -ring1_addr 172.28.1.2

And on the third:
Code:
pvecm add 172.26.1.1 -ring0_addr 172.26.1.3 -ring1_addr 172.28.1.3

Looking into the logs (journalctl -b -u corosync) gives me this:
Jan 17 10:48:56 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e30 23e32
Jan 17 10:48:56 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e32
Jan 17 10:48:56 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e32
Jan 17 10:48:57 t-pve2 corosync[4304]: error [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:48:57 t-pve2 corosync[4304]: [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:48:58 t-pve2 corosync[4304]: notice [TOTEM ] Automatically recovered ring 1
Jan 17 10:48:58 t-pve2 corosync[4304]: [TOTEM ] Automatically recovered ring 1
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e44
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e44
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e46
Jan 17 10:49:02 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e47
[...]
Jan 17 10:49:22 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e87
Jan 17 10:49:25 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e89 23e8b 23e8d 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e89 23e8b 23e8d 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: [TOTEM ] Retransmit List: 23e8b 23e8f
Jan 17 10:49:25 t-pve2 corosync[4304]: notice [TOTEM ] Retransmit List: 23e8b 23e8f
Jan 17 10:49:27 t-pve2 corosync[4304]: error [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:49:27 t-pve2 corosync[4304]: [TOTEM ] Marking ringid 1 interface 172.28.1.2 FAULTY
Jan 17 10:49:28 t-pve2 corosync[4304]: notice [TOTEM ] Automatically recovered ring 1
Jan 17 10:49:28 t-pve2 corosync[4304]: [TOTEM ] Automatically recovered ring 1

Every 30s it's marking my ring1 as "faulty" and 1s later it recovers the ring.

Did I do anything wrong?

The servers are very old ones (~10 years) but this is just equipment for testing purposes. Both network cards are Gigabit ones, but the one for ring0 is an onboard Intel card and connected to a Gigabit switch while the second network card (for ring1) is a cheap Realtek card connected to a 100 Mbit switch.
 
Hi,

No, you did nothing wrong, but I think your network is too slow (latency).
"Retransmit" means that packets are being lost or are not arriving in time.
You can use omping to test the networks.
See https://pve.proxmox.com/wiki/Multicast_notes
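
A rough sketch of such an omping test, in the spirit of the wiki page linked above. The addresses are the ring1 addresses from the first post, and the variable name RING1_NODES is only for illustration. Start the same command on all three nodes at roughly the same time, since each omping instance waits for the others to answer.

```shell
# Assumption: ring1 addresses taken from the first post; adjust to your nodes.
RING1_NODES="172.28.1.1 172.28.1.2 172.28.1.3"

if command -v omping >/dev/null 2>&1; then
    # Burst test: 10000 packets at a 1 ms interval, rate limiting off (-F),
    # quiet output (-q). Loss should stay close to 0% on a healthy network.
    omping -c 10000 -i 0.001 -F -q $RING1_NODES

    # Longer test: one packet per second for about 10 minutes. This can
    # reveal multicast that only breaks after a while, for example when
    # a switch's IGMP snooping stops forwarding the group.
    omping -c 600 -i 1 -q $RING1_NODES
else
    echo "omping not found; on Debian/Proxmox it is in the 'omping' package"
fi
```

If the unicast numbers look fine but the multicast lines show heavy loss, the switch configuration is a more likely culprit than the network cards.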
 
On one of my three switches (the old 100 Mbit Netgear one) I set "Block Unknown Multicast Address" under "IGMP Snooping Setting" to "Disable", and now it seems to work :)
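
To check whether a change like this really holds, one option is to count how often a ring gets marked FAULTY in the corosync journal over some period. The helper below is only a sketch: the name count_faulty is made up, and it assumes the log format shown in the first post, where every event appears twice (once with the "error" prefix and once without), so only the prefixed lines are counted.

```shell
# Hypothetical helper: reads corosync journal text on stdin and prints
# the number of "Marking ringid ... FAULTY" events.
count_faulty() {
    # Match only the line carrying the "error" prefix so that the
    # duplicated journal entry is not counted a second time.
    grep -c 'error \[TOTEM ] Marking ringid .* FAULTY'
}

# Example usage on a node (not run here):
#   journalctl -b -u corosync --since "10 minutes ago" | count_faulty
#   corosync-cfgtool -s    # prints the current status of each ring
```

With the ring flapping every 30 seconds as in the log above, ten minutes should give a count of around 20; after the switch change it should stay at 0.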
 
