Corosync link0 bouncing

szucs10 · Jul 7, 2021

Hello,

We built a Proxmox cluster from 3 servers. We set up a 10G network for internal communication, which the cluster is supposed to use. However, link 0 on it keeps bouncing up and down.

Code:

Jul  7 16:12:25 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:12:46 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:12:46 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:12:48 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:12:48 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:14:02 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:14:02 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:14:04 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:14:04 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  7 16:14:32 node1tela corosync[4219]:   [KNET  ] link: host: 2 link: 0 is down
Jul  7 16:14:32 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jul  7 16:14:34 node1tela corosync[4219]:   [KNET  ] rx: host: 2 link: 0 is up
Jul  7 16:14:34 node1tela corosync[4219]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Does anyone have an idea what the problem might be?

Theoretically, there is no problem with the network.

Thanks for the help!

aaron · Jul 7, 2021

Are there other services using that physical network? They might use up a lot of bandwidth which causes the latency for the corosync packets to go up, maybe to the point where the link is deemed unusable.

You can configure multiple links for Corosync, and it will switch automatically if another one is better. It is also best practice to have at least one physical network dedicated to Corosync alone, so that other services cannot interfere. Two dedicated links would be even better. These can be 1 Gbit links, as Corosync does not need much bandwidth, but it definitely needs low latency.

szucs10 · Jul 7, 2021

Dear aaron!

The 10G network was physically set up just for this; nothing else runs on it, and it is a private network. Corosync and ceph would run on it; ceph is not installed yet.

For external access, a 1G network is set up per machine. Here, too, a link to corosync is created, but this should only be used in case of an emergency.

corosync config:

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1tela
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.122.1
    ring1_addr: EXT IP
  }
  node {
    name: node2tela
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.122.2
    ring1_addr: EXT IP
  }
  node {
    name: node3tela
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.122.3
    ring1_addr: EXT IP
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: TDCluster
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Thanks!

aaron · Jul 8, 2021

szucs10 said:
Corosync and ceph would run on it

That is exactly what you should not do, as Ceph can saturate a 10 Gbit network (see the 2018 Ceph Benchmark paper).

szucs10 said:
For external access, a 1G network is set up per machine. Here, too, a link to corosync is created, but this should only be used in case of an emergency.

Okay, that should help. Once you have it in production, keep an eye on the logs to see how often Corosync switches to that second link.

szucs10 said:
ceph is not installed yet.

Well, in that case, it seems like there might be some problem with the network. Are there really no other services running on that network?

Could there be an MTU mismatch between the nodes and / or the switches?
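One way to test for an MTU mismatch is to send pings with the don't-fragment bit set, sized to exactly fill the MTU you expect. The following is only a rough sketch: it assumes IPv4, the Linux iputils ping (for the -M do option), and that the candidate MTUs are 1500 or 9000. The function names are made up for illustration.

```python
import subprocess

IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def payload_for_mtu(mtu):
    """Largest ICMP payload that fits into one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def ping_cmd(host, mtu, count=3):
    """Build a ping command that forbids fragmentation.

    -M do sets the don't-fragment bit (Linux iputils ping, an assumption),
    -s sets the ICMP payload size, -c the number of probes.
    """
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(payload_for_mtu(mtu)), host]


def path_supports_mtu(host, mtu):
    """True if unfragmented packets of this size get a reply from host."""
    result = subprocess.run(ping_cmd(host, mtu), capture_output=True)
    return result.returncode == 0
```

Calling path_supports_mtu("192.168.122.2", 9000) from node 1 (the address is the ring0_addr from the posted config) and repeating it between every pair of nodes should show whether one node or a switch port only passes the smaller size.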

Is it always host 2 that shows up in the logs? What do the logs on host 2 look like compared to the other nodes? Do the other nodes always see host 2 as down, and does host 2 see all the other nodes as down?
If so, you can narrow it down to host 2. Maybe a partially broken cable?
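To compare the nodes, it can help to count the link down events per host and link in each node's log. This is a minimal sketch that assumes the log lines look like the excerpt in the first post; the function name and the regular expression are my own and may need adjusting for other Corosync versions.

```python
import re
from collections import Counter

# Matches KNET link state lines in the format shown in the log excerpt, e.g.
# "[KNET  ] link: host: 2 link: 0 is down" or "[KNET  ] rx: host: 2 link: 0 is up".
LINK_EVENT = re.compile(
    r"\[KNET\s*\]\s+(?:link|rx): host: (\d+) link: (\d+) is (up|down)"
)


def count_link_downs(lines):
    """Count 'is down' events per (host id, link number) pair."""
    downs = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match and match.group(3) == "down":
            downs[(int(match.group(1)), int(match.group(2)))] += 1
    return downs
```

Fed with the excerpt from the first post, this counts three down events for host 2 on link 0. Running it over the Corosync log lines of every node shows whether only host 2 is affected or whether every peer flaps.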

You can check the number of dropped packets with ip -s link
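If you want to watch those counters over time, the kernel also exposes per-interface statistics under /sys/class/net. The sketch below reads them and compares two snapshots. It assumes a Linux host; the function names are hypothetical, and the interface name in the usage note is only a placeholder.

```python
from pathlib import Path

COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def read_link_counters(iface, sysfs_root="/sys/class/net"):
    """Return the drop and error counters of one interface as a dict."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}


def counter_delta(before, after):
    """Difference between two snapshots; non-zero values mean new drops or errors."""
    return {name: after[name] - before[name] for name in COUNTERS}
```

For example, take one snapshot with read_link_counters("enp1s0f0") (replace the placeholder with the real 10G interface), take another after the next link bounce, and pass both to counter_delta. Counters that grow on one node only would point at that node's NIC, cable, or switch port.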
