Corosync KNET Flapping

ikogan

Active Member
Apr 8, 2017
Every so often (like right now), I'll start seeing a lot of KNET logs about a link going down and coming back up. Sometimes rebooting one node will fix it, sometimes it won't. It seems to happen randomly after node reboots or some other event. How can I determine which node is causing this or what part of the infrastructure is causing it?

I have two 10GbE interfaces and two GbE interfaces LAGged together on each host, except one host, which also has two 10GbE but four GbE all LAGged together. Each logical interface trunks several VLANs.

One of the 10GbE VLANs contains the main corosync network as well as the Ceph front side network. The other 10GbE VLAN contains the secondary corosync ring and the Ceph back side network. Each 10GbE link is connected via DAC to a different switch. Here are the logs I see on each node:

Code:
Jan 24 20:23:31 pve2 corosync[6080]:   [KNET  ] link: host: 5 link: 1 is down
Jan 24 20:23:31 pve2 corosync[6080]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jan 24 20:23:34 pve2 corosync[6080]:   [KNET  ] rx: host: 5 link: 1 is up
Jan 24 20:23:34 pve2 corosync[6080]:   [KNET  ] link: Resetting MTU for link 1 because host 5 joined
Jan 24 20:23:34 pve2 corosync[6080]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jan 24 20:23:34 pve2 corosync[6080]:   [KNET  ] pmtud: Global data MTU changed to: 1397
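One way I can think of to narrow this down (just a sketch, not a Proxmox tool; `count_flaps` is an ad-hoc name) is to count the KNET down events per peer host/link from the journal on every node. If all nodes report flaps for the same host ID, that host's NIC or switch port is suspect; if a single node reports flaps for all of its peers, that node itself is suspect.

```shell
#!/bin/sh
# Count KNET "link down" events per peer host and link.
# Real usage would be something like:
#   journalctl -u corosync --since -1d | count-flaps.sh
count_flaps() {
    sed -n 's/.*\[KNET.*link: host: \([0-9]*\) link: \([0-9]*\) is down.*/host \1 link \2/p' \
        | sort | uniq -c | sort -rn
}

# Demo on stand-in log lines shaped like the ones above:
count_flaps <<'EOF'
Jan 24 20:23:31 pve2 corosync[6080]:   [KNET  ] link: host: 5 link: 1 is down
Jan 24 20:25:12 pve2 corosync[6080]:   [KNET  ] link: host: 5 link: 1 is down
Jan 24 20:27:03 pve2 corosync[6080]:   [KNET  ] link: host: 3 link: 1 is down
EOF
```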

Here's my corosync.conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.0.32
    ring1_addr: 10.13.1.4
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.0.1
    ring1_addr: 10.13.1.1
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.0.2
    ring1_addr: 10.13.1.2
  }
  node {
    name: pve4
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.0.3
    ring1_addr: 10.13.1.3
  }
  node {
    name: pve5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.0.20
    ring1_addr: 10.13.1.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PVE
  config_version: 6
  interface {
    bindnetaddr: 10.10.0.1
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.13.1.1
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
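(Side note: I realize the `interface { bindnetaddr / ringnumber }` blocks and `rrp_mode` above are corosync 2 leftovers; with corosync 3 the transport is kronosnet, which ignores them and builds the two links from the `ringX_addr` entries in the nodelist. If I wanted to steer which link passive mode prefers, I believe the knet-era config would look roughly like this sketch, where a higher `knet_link_priority` wins:)

```
totem {
  interface {
    # link 0: higher priority, preferred in passive mode
    linknumber: 0
    knet_link_priority: 2
  }
  interface {
    linknumber: 1
    knet_link_priority: 1
  }
}
```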

Here's a graph of traffic from the switches over the last 15 minutes (ignore the vertical lines):
Screenshot from 2023-01-24 20-44-28.png

Ring 0 is the "Private" network and Ring 1 is the "Secondary" network, which shares a VLAN with the "Storage" network. It looks like it's Ring 1 that's flapping...but why?
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
1 of the 10GbE VLANs contains the main corosync network as well as the Ceph front side network. The other 10GbE VLAN contains the secondary corosync ring

So you have 2 rings on the same bond of 2x10GbE interfaces? If yes, that doesn't make much sense: you already have redundancy at the bond level.

(BTW, which bond mode do you use ?)
 

ikogan

Active Member
Apr 8, 2017
The 10GbE interfaces are not bonded; only the 1GbE interfaces are, and those are used for "public" VM traffic, not for Proxmox clustering or Ceph. They're using LACP and the switch reports that the LAG is fine.

Here's an example of one of the host's `/etc/network/interfaces`:

Code:
❯ cat interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp5s0 inet manual

iface enp3s0f0 inet manual

iface enp3s0f1 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 enp5s0
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
#Public Trunk

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#Public Trunk

auto vmbr0.20
iface vmbr0.20 inet static
    address 10.1.42.32/24
    gateway 10.1.42.254
#Public

auto vmbr1
iface vmbr1 inet manual
    bridge-ports enp3s0f0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#Private Trunk

auto vmbr1.1
iface vmbr1.1 inet static
    address 10.10.0.32/16
#Private

auto vmbr2
iface vmbr2 inet manual
    bridge-ports enp3s0f1
    bridge-stp off
    bridge-fd 0
#Cluster Trunk

auto vmbr2.11
iface vmbr2.11 inet static
    address 10.11.1.4/16
#Cluster Storage

auto vmbr2.13
iface vmbr2.13 inet static
    address 10.13.1.4/16
#Cluster Secondary Ring
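Since the flapping ring rides on these NICs, one thing worth checking on each host is whether the kernel error counters for enp3s0f0/enp3s0f1 keep climbing; a counter that rises on exactly one host would point at that host's NIC, DAC, or switch port. A rough sketch (`dump_counters` is just a helper name; `ethtool -S <nic>` would give finer-grained CRC/symbol error counts):

```shell
#!/bin/sh
# Print error/drop counters for the given NICs from sysfs.
dump_counters() {
    for nic in "$@"; do
        stats=/sys/class/net/$nic/statistics
        [ -d "$stats" ] || continue    # NIC not present on this host
        printf '%s rx_errors=%s tx_errors=%s rx_dropped=%s\n' \
            "$nic" \
            "$(cat "$stats/rx_errors")" \
            "$(cat "$stats/tx_errors")" \
            "$(cat "$stats/rx_dropped")"
    done
}

dump_counters enp3s0f0 enp3s0f1
```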
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
Ah OK!

If you don't have any VMs running on vmbr1 && vmbr2, have you tried tagging enp3s0f0 && enp3s0f1 directly, without any vmbr1/2?

Another idea: what is the MAC address ageing timeout on your physical switch? Some have a 5 min timeout, which could be too low; 30 min to 2 h should be safer.
 

ikogan

Active Member
Apr 8, 2017
Ah OK!

If you don't have any VMs running on vmbr1 && vmbr2, have you tried tagging enp3s0f0 && enp3s0f1 directly, without any vmbr1/2?

Another idea: what is the MAC address ageing timeout on your physical switch? Some have a 5 min timeout, which could be too low; 30 min to 2 h should be safer.

Thanks for all of your help! I do have VMs running on vmbr1; those generate Ceph client traffic. Not ideal, but as the graph shows, they're not saturating the link in this situation, and the issue is happening on vmbr2.13. vmbr2 is shared between the Ceph back side network and the secondary ring. That _could_ create problems during heavy replication, but that's not happening right now.

vmbr1 and vmbr2 are on two separate Unifi Aggregation switches (https://store.ui.com/collections/unifi-network-switching/products/unifi-switch-aggregation). I can't seem to find any docs on MAC address aging or anything similar for these switches, nor do I see any way to determine the defaults.

All of the switches have DHCP Snooping, Jumbo Frames, and Spanning Tree (RSTP) enabled, but Flow Control is off. The hosts do not have Jumbo Frames on at the moment, only the switches. vmbr1 and vmbr2 are generally configured identically, so I'm not sure why this is _only_ happening on vmbr2. It feels like one of the 5 NICs might be having trouble. Is there a way to determine if one of the hosts or ports is causing the problem without shutting down one host at a time?
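One way to compare hosts without shutting anything down (a sketch; `flap_durations` is an ad-hoc name): pair each KNET "is down" line with the following "is up" line in every node's journal and total the outage time per peer. Comparing which host ID accumulates downtime on all five nodes should single out the bad NIC or port. (Timestamps are reduced to seconds since midnight, so this is only valid within a single day.)

```shell
#!/bin/sh
# Sum KNET link downtime per peer host/link by pairing "is down"/"is up" lines.
# Real usage: journalctl -u corosync --since today | flap_durations
flap_durations() {
    awk '
    /host: [0-9]+ link: [0-9]+ is (down|up)$/ {
        split($3, t, ":"); sec = t[1]*3600 + t[2]*60 + t[3]
        for (i = 1; i <= NF; i++) {
            if ($i == "host:") host = $(i+1)
            if ($i == "link:") link = $(i+1)   # last "link:" field wins
        }
        key = "host " host " link " link
        if ($NF == "down") down[key] = sec
        else if (key in down) { total[key] += sec - down[key]; delete down[key] }
    }
    END { for (k in total) printf "%s: %d s total downtime\n", k, total[k] }'
}

# Demo on the matching down/up pair from the logs above:
flap_durations <<'EOF'
Jan 24 20:23:31 pve2 corosync[6080]:   [KNET  ] link: host: 5 link: 1 is down
Jan 24 20:23:34 pve2 corosync[6080]:   [KNET  ] rx: host: 5 link: 1 is up
EOF
```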
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
Have you disabled RSTP on the ports where the Proxmox hosts are plugged in? You really don't want RSTP reconvergence delays on your corosync network.
(If a NIC is flapping on one node, or a node reboots, it can block all ports where RSTP is enabled until convergence is done.)


About the bridge ageing timeout: you can change it from the switch command line; it seems to be 300 s by default, which could be too low.

https://dl.ubnt.com/guides/edgemax/EdgeSwitch_CLI_Command_Reference_UG.pdf (page 313)
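The Linux bridges on the hosts have the same knob with the same 300 s default. A sketch for inspecting it (sysfs reports bridge ageing in centiseconds, so 30000 means 300 s):

```shell
#!/bin/sh
# Show the ageing time of every local Linux bridge, converted to seconds.
csec_to_sec() { echo $(( $1 / 100 )); }

for f in /sys/class/net/*/bridge/ageing_time; do
    [ -e "$f" ] || continue              # no bridges on this host
    br=${f#/sys/class/net/}; br=${br%%/*}
    echo "$br ageing_time: $(csec_to_sec "$(cat "$f")") s"
done

# Raising it to e.g. 30 min on a node would be:
#   ip link set dev vmbr2 type bridge ageing_time 180000
```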
 
