Dear Members,
I have a Ceph cluster with the following details:
The cluster runs on a separate NIC with active-backup bonding, a separate DELL 10G switch, and a separate IP range on 10 Gbit.
My problem:
All nodes log occasional KNET "link down" entries whenever one of the nodes is under heavy load.
Should I be concerned about this, or is it normal?
There are 4 nodes; the 5th node arrives next week.
PVE1 pve-manager/7.0-13/7aa7e488
PVE2 pve-manager/7.0-13/7aa7e488
PVE3 pve-manager/7.0-13/7aa7e488
PVE6 pve-manager/7.1-5/6fe299a0
syslog entries:
PVE1:
Nov 22 12:11:52 pve1 corosync[2766]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:24 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:30 pve1 corosync[2688]: [KNET ] link: host: 4 link: 0 is down
Nov 22 14:39:35 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 16:00:04 pve1 corosync[2688]: [KNET ] link: host: 3 link: 0 is down
========
Nov 22 12:11:54 pve1 corosync[2766]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:04 pve1 corosync[2766]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve1 corosync[2688]: [TOTEM ] Token has not been received in 3225 ms
PVE2:
Nov 22 12:09:59 pve2 corosync[2772]: [KNET ] link: host: 4 link: 0 is down
Nov 22 12:11:47 pve2 corosync[2772]: [KNET ] link: host: 3 link: 0 is down
Nov 22 12:11:47 pve2 corosync[2772]: [KNET ] link: host: 4 link: 0 is down
Nov 22 14:32:32 pve2 corosync[2739]: [KNET ] link: host: 4 link: 0 is down
Nov 22 20:10:22 pve2 corosync[2739]: [KNET ] link: host: 4 link: 0 is down
========
Nov 22 12:11:54 pve2 corosync[2772]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:05 pve2 corosync[2772]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3226 ms
Nov 22 18:09:21 pve2 corosync[2739]: [TOTEM ] Token has not been received in 3225 ms
PVE3:
Nov 22 12:17:53 pve3 corosync[2826]: [KNET ] link: host: 2 link: 0 is down
Nov 22 14:12:39 pve3 corosync[2826]: [KNET ] link: host: 3 link: 0 is down
Nov 22 14:12:39 pve3 corosync[2826]: [KNET ] link: host: 1 link: 0 is down
Nov 22 18:09:19 pve3 corosync[2826]: [KNET ] link: host: 3 link: 0 is down
Nov 22 18:09:19 pve3 corosync[2826]: [KNET ] link: host: 2 link: 0 is down
========
Nov 22 12:11:54 pve3 corosync[2890]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:04 pve3 corosync[2890]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve3 corosync[2826]: [TOTEM ] Token has not been received in 3226 ms
PVE6:
Nov 22 12:11:07 pve6 corosync[3352]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:11:11 pve6 corosync[3352]: [KNET ] link: host: 4 link: 0 is down
Nov 22 12:17:41 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
Nov 22 13:54:19 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
Nov 22 14:47:52 pve6 corosync[3222]: [KNET ] link: host: 1 link: 0 is down
========
Nov 22 12:11:54 pve6 corosync[3352]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:12:05 pve6 corosync[3352]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 12:17:27 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 14:39:37 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
Nov 22 18:09:21 pve6 corosync[3222]: [TOTEM ] Token has not been received in 3225 ms
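As an aside, here is how I tallied these messages per peer (a minimal sketch; the heredoc just replays a few of the sample lines above, and on a live node one would pipe in `journalctl -u corosync` output instead):

```shell
# Count "link down" events per peer host from corosync syslog lines.
# The heredoc holds sample lines copied from the logs above; on a real
# node, feed in `journalctl -u corosync` instead of the heredoc.
out=$(awk '/\[KNET/ && /is down/ {
         for (i = 1; i <= NF; i++)
             if ($i == "host:") hosts[$(i+1)]++
     }
     END {
         for (h in hosts)
             printf "host %s: %d down events\n", h, hosts[h]
     }' <<'EOF'
Nov 22 12:11:52 pve1 corosync[2766]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:24 pve1 corosync[2688]: [KNET ] link: host: 2 link: 0 is down
Nov 22 12:17:30 pve1 corosync[2688]: [KNET ] link: host: 4 link: 0 is down
EOF
)
printf '%s\n' "$out"
```

This makes it easy to see whether the drops cluster around one particular peer or are spread evenly.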
=======================================================
root@pve1:~# ceph -s
  cluster:
    id:     58b7c533-09d5-4c82-8aa8-9ee4a5af696d
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum pve1,pve2,pve6,pve3 (age 8h)
    mgr: pve2(active, since 8h), standbys: pve1, pve6, pve3
    osd: 26 osds: 26 up (since 8h), 26 in (since 2d)

  data:
    pools:   3 pools, 768 pgs
    objects: 1.92M objects, 7.2 TiB
    usage:   21 TiB used, 42 TiB / 63 TiB avail
    pgs:     768 active+clean

  io:
    client: 234 KiB/s rd, 4.9 MiB/s wr, 17 op/s rd, 563 op/s wr
=======================================================
/etc/network/interfaces
auto lo
iface lo inet loopback

iface eno3 inet manual

iface eno4 inet manual

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto enp129s0f0
iface enp129s0f0 inet manual

auto enp129s0f1
iface enp129s0f1 inet manual

auto bond0
iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves eno1 enp129s0f0
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1
        network 10.10.10.0
        metric 20
#Cluster Network

auto bond1
iface bond1 inet manual
        bond-slaves eno2 enp129s0f1
        bond-miimon 100
        bond-mode balance-alb
        metric 30

auto vmbr0
iface vmbr0 inet static
        address 172.26.73.21/24
        gateway 172.26.73.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
#Proxmox Management

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
#VM Network

auto vmbr2
iface vmbr2 inet manual
        bridge-ports eno4
        bridge-stp off
        bridge-fd 0
#VM Internet
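In case it is relevant for the discussion: I know corosync/knet can use a second link as a fallback, which could in principle run over bond1. A hypothetical corosync.conf nodelist fragment (the 10.10.20.x addresses are invented for illustration only; they are not part of this setup):

```
# /etc/corosync/corosync.conf (fragment) -- hypothetical second knet link
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: 10.10.10.1   # existing cluster network (bond0)
    ring1_addr: 10.10.20.1   # hypothetical fallback network, e.g. over bond1
  }
  # ... remaining nodes analogous ...
}
```

I have not configured this yet; I would first like to understand whether the link flaps are a real problem.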
=======================================================
root@pve1:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = 10.10.10.1
status:
nodeid: 1: localhost
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: connected
=======================================================
root@pve1:~# pvecm status
Cluster information
-------------------
Name: corexcluster
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon Nov 22 20:32:29 2021
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000001
Ring ID: 1.264
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.1 (local)
0x00000002 1 10.10.10.2
0x00000003 1 10.10.10.6
0x00000004 1 10.10.10.3
=======================================================
Thank you for your advice.
Gabor