[SOLVED] corosync no active links

Hi,

Our new cluster is being prepared for use, and while slowly starting to use it I see some messages in syslog:

Code:
Dec 25 06:27:52 prxa06 pmxcfs[4868]: [dcdb] notice: data verification successful
Dec 25 06:34:26 prxa06 corosync[5041]:   [TOTEM ] Retransmit List: b819f
Dec 25 06:35:35 prxa06 corosync[5041]:   [KNET  ] link: host: 5 link: 0 is down
Dec 25 06:35:35 prxa06 corosync[5041]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 25 06:35:35 prxa06 corosync[5041]:   [KNET  ] host: host: 5 has no active links
Dec 25 06:35:38 prxa06 corosync[5041]:   [KNET  ] rx: host: 5 link: 0 is up
Dec 25 06:35:38 prxa06 corosync[5041]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 25 06:38:36 prxa06 pmxcfs[4868]: [status] notice: received log
Dec 25 06:39:42 prxa06 corosync[5041]:   [TOTEM ] Retransmit List: b93e1
Dec 25 06:46:07 prxa06 corosync[5041]:   [TOTEM ] Retransmit List: baa94 baa95

It happens every 1 or 2 hours, and I've read that it could be caused by latency or a network misconfiguration.

My nodes have multiple 10Gbit bonds to the switches, which have BAGGs configured in 'dynamic' mode.
The nodes each have the following config:

Code:
auto lo
iface lo inet loopback

auto enp129s0f0np0
iface enp129s0f0np0 inet manual
#XGE2/0/3

iface enp68s0f0 inet manual

iface enp68s0f1 inet manual

iface enp68s0f2 inet manual

iface enp68s0f3 inet manual

auto enp129s0f1np1
iface enp129s0f1np1 inet manual
#XGE1/0/3

auto enp161s0f0np0
iface enp161s0f0np0 inet manual
#XGE2/0/4

auto enp161s0f1np1
iface enp161s0f1np1 inet manual
#XGE1/0/4

auto bond0
iface bond0 inet manual
        bond-slaves enp161s0f0np0 enp161s0f1np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
#LAN

auto bond1
iface bond1 inet manual
        bond-slaves enp129s0f0np0 enp129s0f1np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000
#CEPH

auto vmbr0
iface vmbr0 inet static
        address 10.100.11.228/24
        gateway 10.100.11.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#LAN

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        mtu 9000
#CEPH

auto vlan6
iface vlan6 inet static
        address 10.100.6.228/24
        vlan-raw-device vmbr1
#CEPH-PUBLIC

auto vlan104
iface vlan104 inet static
        address 10.100.104.228/24
        mtu 9000
        vlan-raw-device vmbr1
#CEPH-CLUSTER

The installation is pretty much default and wizard-based, mostly following the wiki pages. No specific settings have been made.

Should I worry about this?
 
That's not a good start for a new cluster.
Corosync should always be separated from the rest of the traffic physically. This means it should get its own NIC(s) and switch(es).
A 1G network is more than enough for most clusters, as Corosync requires low latency but not much bandwidth. [0]
On a shared network, latency may no longer be low once the network is under load.
Corosync offers its own redundancy support via links, which should be preferred over using a bond [1].

If you can't add new links over a separate network, you could try `active-backup` bond mode to see if it is stable then. `active-backup` is the simplest mode and doesn't require any support from the switch.
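
A minimal sketch of what that could look like for `bond0` from the config posted above, assuming Corosync currently runs over the address on `vmbr0` (the thread does not say which network it uses). `bond-primary` is optional and the choice of preferred NIC here is arbitrary. The switch ports would probably also have to be taken out of the dynamic aggregation group, since `active-backup` does not speak LACP.

```
auto bond0
iface bond0 inet manual
        bond-slaves enp161s0f0np0 enp161s0f1np1
        bond-miimon 100
# active-backup: only one NIC carries traffic at a time
        bond-mode active-backup
# optional; which NIC is preferred is an assumption
        bond-primary enp161s0f0np0
#LAN
```

The `bond-xmit-hash-policy` line is left out because it only matters for modes that spread traffic over several NICs.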


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
 
So Corosync actually misbehaves on an 802.3ad bond with dynamic channeling :eek:
Why is it so sensitive, and what could eventually go wrong?
Does it do a better job on a single-NIC setup? Or with, say, two 1 Gbit NICs with separate IPs on different switches?

I'll give active/backup a try.

Thanks,
Martijn

I see: adding a new ringX alongside ring0 in the config file could make Corosync choose the other network. I guess a service restart is required. After that, the old one can be removed, I guess.
 
You can keep the old one as backup still, in case the other one goes down.
You can define up to 8 links for Corosync.
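
A rough sketch of how a second link might look in `/etc/pve/corosync.conf`, following the redundancy section linked as [1] earlier in the thread. The `10.100.200.0/24` subnet, the `nodeid`, the `config_version` value and the priorities are placeholders, not values from this cluster, and `ring0_addr` is assumed to be the `vmbr0` address. Only the relevant parts are shown; every node entry would need its own `ring1_addr`.

```
nodelist {
  node {
    name: prxa06
    # placeholder node ID
    nodeid: 6
    quorum_votes: 1
    # existing link on the shared LAN (assumed)
    ring0_addr: 10.100.11.228
    # new link on a dedicated network (hypothetical subnet)
    ring1_addr: 10.100.200.228
  }
  # ... one node block per cluster member, each with ring1_addr
}

totem {
  # placeholder; must be higher than the value currently in the file
  config_version: 5
  interface {
    linknumber: 0
    knet_link_priority: 10
  }
  interface {
    linknumber: 1
    # in passive mode the link with the higher priority is preferred
    knet_link_priority: 20
  }
  # ... remaining totem settings unchanged
}
```

With priorities set like this, link 1 should carry the traffic while it is up, and link 0 stays available as the backup mentioned above.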