Corosync link0 bouncing

Sep 12, 2020
Hello,

We built a Proxmox cluster from 3 servers. We set up a 10G network for internal communication, which the cluster is supposed to use. However, the Corosync link on it keeps bouncing up and down.

Code:
Jul  7 16:12:25 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:12:46 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:12:46 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:12:48 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:12:48 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:14:02 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:14:02 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:14:04 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:14:04 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:14:32 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:14:32 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:14:34 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:14:34 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Anyone have any idea what the problem might be?

Theoretically, there is no problem with the network.

Thanks for the help!
 
Are there other services using that physical network? They might use up a lot of bandwidth which causes the latency for the corosync packets to go up, maybe to the point where the link is deemed unusable.

You can configure multiple links for Corosync, and it will switch automatically if another one is better. It is also best practice to have at least one physical network dedicated to Corosync alone, so that other services cannot interfere. Two dedicated links would be even better. These can be 1 Gbit links, as Corosync does not need much bandwidth, but it definitely needs low latency.
 
Dear aaron!

The 10G network was physically set up just for this. It is a private network and nothing else runs on it. Corosync and Ceph would run on it, but Ceph is not installed yet.

For external access, a 1G network is set up per machine. A Corosync link is configured there as well, but it should only be used in an emergency.

corosync config:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1tela
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.122.1
    ring1_addr: EXT IP
  }
  node {
    name: node2tela
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.122.2
    ring1_addr: EXT IP
  }
  node {
    name: node3tela
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.122.3
    ring1_addr: EXT IP
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: TDCluster
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Thanks!
 
Corosync and ceph would run on it
That is exactly what you should not do, as Ceph could saturate a 10Gbit network (see the 2018 Ceph Benchmark paper).

For external access, a 1G network is set up per machine. Here, too, a link to corosync is created, but this should only be used in case of an emergency.
Okay, that should help. Once you have it in production, keep an eye on the logs to see how often Corosync switches to that second link.
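
If you want a number rather than eyeballing the journal, something like the following could work. It is only a sketch: the function name is made up, and it assumes the KNET log lines look like the ones you posted above.

```shell
# Sketch: count how often knet picked a given link as "best link", per peer host.
# Reads corosync log lines on stdin (journalctl output or a saved log file).
# count_best_link is a hypothetical helper name, not part of corosync.
count_best_link() {
    link="$1"
    awk -v link="$link" '
        /\[KNET/ && /best link:/ {
            host = ""; best = ""
            for (i = 1; i < NF; i++) {
                # "host: host: 2" -> take the field that is followed by a number
                if ($i == "host:" && $(i+1) ~ /^[0-9]+$/) host = $(i+1)
                # "best link: 1" -> the field after "link:"
                if ($i == "link:") best = $(i+1)
            }
            if (host != "" && best == link) n[host]++
        }
        END { for (h in n) print "host " h ": " n[h] " switch(es) to link " link }
    ' | sort
}
```

For example, `journalctl -u corosync --since today | count_best_link 1` would print one line per peer that fell back to link 1 today.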

ceph is not installed yet.
Well, in that case, it seems like there might be some problem with the network. There really are no other services running on that network?

Could there be an MTU mismatch between the nodes and / or the switches?
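
One way to test this is to ping with the don't-fragment bit set and a payload that exactly fills the MTU. A rough sketch, assuming Linux iputils `ping` and IPv4; the function names are made up for illustration:

```shell
# Sketch: probe whether full-size packets pass between two nodes unfragmented.
# Both function names are hypothetical helpers for this example.

# ICMP payload that exactly fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send 3 pings with the don't-fragment bit set (-M do).
# A mismatch typically shows up as "message too long" or 100% packet loss.
probe_mtu() {
    target="$1"
    mtu="$2"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$target"
}
```

For example, `probe_mtu 192.168.122.2 1500` from node 1, or with 9000 if you configured jumbo frames on the 10G network (that value is an assumption, check `ip link show` for what is actually set). Run it in both directions between every pair of nodes.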

Is it always host 2 that shows up in the logs? What do the logs on host 2 look like compared to the other nodes? Do the other nodes always see host 2 as down, and does host 2 see all the other nodes as down?
If so, you can narrow it down to host 2. Maybe a partially broken cable?
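
To compare the nodes quickly, you could tally the "is down" events per peer and link on each node. Again only a sketch with a made-up function name, assuming the log format from your first post:

```shell
# Sketch: tally KNET "link ... is down" events per peer host and link.
# Reads corosync log lines on stdin; count_link_down is a hypothetical name.
count_link_down() {
    awk '
        /\[KNET/ && /link: [0-9]+ is down/ {
            for (i = 1; i < NF; i++) {
                # "link: host: 2 link: 0 is down"
                if ($i == "host:") host = $(i+1)
                if ($i == "link:" && $(i+1) ~ /^[0-9]+$/) link = $(i+1)
            }
            n["host " host " link " link]++
        }
        END { for (k in n) print k ": " n[k] " down event(s)" }
    ' | sort
}
```

Running `journalctl -u corosync | count_link_down` on all three nodes should show whether only host 2 is affected, or whether every pair of nodes flaps on link 0.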

You can check the number of dropped packets with ip -s link
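
If you only want the drop counters, a small filter over that output could look like this. It is a sketch: the function name and the interface name `eno1` are placeholders, and it assumes the usual `ip -s link` layout where "dropped" is the fourth column of the RX and TX statistics lines.

```shell
# Sketch: extract RX/TX dropped counters from `ip -s link show dev IFACE`.
# Reads the command output on stdin; dropped_counters is a hypothetical name.
dropped_counters() {
    awk '
        /^ *RX:/ { dir = "rx"; next }   # header line, values follow on the next line
        /^ *TX:/ { dir = "tx"; next }
        dir != "" { print dir "_dropped=" $4; dir = "" }
    '
}
```

Usage would be `ip -s link show dev eno1 | dropped_counters`, run a few minutes apart on each node. Counters that keep growing on the 10G interface of one node point at that node's NIC, cable, or switch port.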