Dear Members,
I have a Ceph cluster with the following details:
The cluster runs on a separate NIC with active-backup bonding, a separate DELL 10G switch, and a separate IP range on 10 Gbit.
My problem:
All nodes log occasional KNET "link down" entries whenever one of the nodes is under heavy load.
Should I be concerned about this, or is it normal?
There are 4 nodes; the 5th node arrives next week.
PVE1 pve-manager/7.0-13/7aa7e488
PVE2 pve-manager/7.0-13/7aa7e488
PVE3 pve-manager/7.0-13/7aa7e488
PVE6 pve-manager/7.1-5/6fe299a0
syslog entries:
PVE1:
Nov 22 12:11:52 pve1 corosync[2766]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:24 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:30 pve1 corosync[2688]: [KNET ] link: host: 4 link: 0 is down
Nov 22 14:39:35 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 16:00:04 pve1 corosync[2688]: [KNET ] link: host: 3 link: 0 is down
========
Nov 22 12:11:54 pve1 corosync[2766]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:04 pve1 corosync[2766]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
PVE2:
Nov 22 12:09:59 pve2 corosync[2772]: [KNET ] link: host: 4 link: 0 is down
Nov 22 12:11:47 pve2 corosync[2772]: [KNET ] link: host: 3 link: 0 is down
Nov 22 12:11:47 pve2 corosync[2772]: [KNET ] link: host: 4 link: 0 is down
Nov 22 14:32:32 pve2 corosync[2739]: [KNET ] link: host: 4 link: 0 is down
Nov 22 20:10:22 pve2 corosync[2739]: [KNET ] link: host: 4 link: 0 is down
========
Nov 22 12:11:54 pve2 corosync[2772]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:05 pve2 corosync[2772]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3226 ms
Nov 22 18:09:21 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3225 ms
PVE3:
Nov 22 12:17:53 pve3 corosync[2826]: [KNET ] link: host: 2 link: 0 is down
Nov 22 14:12:39 pve3 corosync[2826]: [KNET ] link: host: 3 link: 0 is down
Nov 22 14:12:39 pve3 corosync[2826]: [KNET ] link: host: 1 link: 0 is down
Nov 22 18:09:19 pve3 corosync[2826]: [KNET ] link: host: 3 link: 0 is down
Nov 22 18:09:19 pve3 corosync[2826]: [KNET ] link: host: 2 link: 0 is down
========
Nov 22 12:11:54 pve3 corosync[2890]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:04 pve3 corosync[2890]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3226 ms
PVE6:
Nov 22 12:11:07 pve6 corosync[3352]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:11:11 pve6 corosync[3352]: [KNET ] link: host: 4 link: 0 is down
Nov 22 12:17:41 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
Nov 22 13:54:19 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
Nov 22 14:47:52 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
========
Nov 22 12:11:54 pve6 corosync[3352]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:05 pve6 corosync[3352]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
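As an aside, here is how I tallied these messages per peer (a minimal sketch; the heredoc just replays a few of the sample lines above, and on a live node one would pipe in `journalctl -u corosync` output instead):

```shell
# Count "link down" events per peer host from corosync syslog lines.
# The heredoc holds sample lines copied from the logs above; on a real
# node, feed in `journalctl -u corosync` instead of the heredoc.
out=$(awk '/\[KNET/ && /is down/ {
         for (i = 1; i <= NF; i++)
             if ($i == "host:") hosts[$(i+1)]++
     }
     END {
         for (h in hosts)
             printf "host %s: %d down events\n", h, hosts[h]
     }' <<'EOF'
Nov 22 12:11:52 pve1 corosync[2766]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:24 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:30 pve1 corosync[2688]: [KNET ] link: host: 4 link: 0 is down
EOF
)
printf '%s\n' "$out"
```

This makes it easy to see whether the drops cluster around one particular peer or are spread evenly.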
=======================================================
root@pve1:~# ceph -s
  cluster:
    id:     58b7c533-09d5-4c82-8aa8-9ee4a5af696d
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum pve1,pve2,pve6,pve3 (age 8h)
    mgr: pve2(active, since 8h), standbys: pve1, pve6, pve3
    osd: 26 osds: 26 up (since 8h), 26 in (since 2d)

  data:
    pools:   3 pools, 768 pgs
    objects: 1.92M objects, 7.2 TiB
    usage:   21 TiB used, 42 TiB / 63 TiB avail
    pgs:     768 active+clean

  io:
    client: 234 KiB/s rd, 4.9 MiB/s wr, 17 op/s rd, 563 op/s wr
=======================================================
/etc/network/interfaces
auto lo
iface lo inet loopback

iface eno3 inet manual

iface eno4 inet manual

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto enp129s0f0
iface enp129s0f0 inet manual

auto enp129s0f1
iface enp129s0f1 inet manual

auto bond0
iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves eno1 enp129s0f0
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1
        network 10.10.10.0
        metric 20
#Cluster Network

auto bond1
iface bond1 inet manual
        bond-slaves eno2 enp129s0f1
        bond-miimon 100
        bond-mode balance-alb
        metric 30

auto vmbr0
iface vmbr0 inet static
        address 172.26.73.21/24
        gateway 172.26.73.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
#Proxmox Management

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
#VM Network

auto vmbr2
iface vmbr2 inet manual
        bridge-ports eno4
        bridge-stp off
        bridge-fd 0
#VM Internet
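In case it is relevant for the discussion: I know corosync/knet can use a second link as a fallback, which could in principle run over bond1. A hypothetical corosync.conf nodelist fragment (the 10.10.20.x addresses are invented for illustration only; they are not part of this setup):

```
# /etc/corosync/corosync.conf (fragment) -- hypothetical second knet link
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: 10.10.10.1   # existing cluster network (bond0)
    ring1_addr: 10.10.20.1   # hypothetical fallback network, e.g. over bond1
  }
  # ... remaining nodes analogous ...
}
```

I have not configured this yet; I would first like to understand whether the link flaps are a real problem.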
=======================================================
root@pve1:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = 10.10.10.1
status:
nodeid: 1: localhost
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: connected
=======================================================
root@pve1:~# pvecm status
Cluster information
-------------------
Name: corexcluster
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon Nov 22 20:32:29 2021
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1.264
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.1 (local)
0x00000002 1 10.10.10.2
0x00000003 1 10.10.10.6
0x00000004 1 10.10.10.3
=======================================================
Thank you for your advice.
Gabor