okay, so that was not enough to get corosync unstuck. I would proceed with stopping corosync everywhere, and then starting it node by node, waiting for the nodes to settle.
it would still be interesting to know more about your network setup
OK, so I stopped corosync and pve-cluster on all nodes, then node by node started pve-cluster, started corosync, and waited a minute or so. I watched pvecm status in another terminal. The number of votes increased a few seconds after each start, and once quorum was reached I logged in to the web frontend and saw it working. I tried logging in on several nodes with success. On the other nodes, each started node is green; the not-yet-started ones are red. With every start, one red becomes green. Lovely!
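The stop-everything, start-one-by-one sequence described above can be sketched roughly as the shell functions below. This is only an illustration, not the exact commands that were run: the helper names (run_on, node_names, stop_all, start_one_by_one), the labhen197-1NN host name pattern for the nodes between 101 and 113, root SSH access, and the 60 second settle time are all assumptions. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them unless DRY_RUN=0.
# Host names are ASSUMED to follow the labhen197-1NN pattern seen in this thread.
DRY_RUN="${DRY_RUN:-1}"

run_on() {
    # run_on HOST COMMAND...  (hypothetical helper)
    host="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$host $*"
    else
        ssh "root@$host" "$@"
    fi
}

node_names() {
    # node_names FIRST LAST -> labhen197-FIRST ... labhen197-LAST
    i="$1"
    while [ "$i" -le "$2" ]; do
        echo "labhen197-$i"
        i=$((i + 1))
    done
}

stop_all() {
    # Stop both services on every node in the range.
    for n in $(node_names "$1" "$2"); do
        run_on "$n" systemctl stop corosync pve-cluster
    done
}

start_one_by_one() {
    # Start the services node by node.
    for n in $(node_names "$1" "$2"); do
        run_on "$n" systemctl start pve-cluster corosync
        # Assumed settle time before the next node joins (skipped in dry-run).
        [ "$DRY_RUN" = "1" ] || sleep 60
    done
}
```

Calling `stop_all 101 113` followed by `start_one_by_one 101 112` would print the plan for leaving 113 out; running `watch -n 1 pvecm status` in a second terminal shows the vote count changing as nodes join.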
Then I also started 113. The other nodes all show the same colors (113 grey, the rest green). I cannot log in to the web GUI on this node; I get a timeout, even some minutes later. I looked again at the other web frontends, and now only the local node is green and all the others are grey. pvecm status looks good, I think:
Code:
Cluster information
-------------------
Name: tisc-pve
Config Version: 13
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu May 11 16:09:31 2023
Quorum provider: corosync_votequorum
Nodes: 13
Node ID: 0x00000001
Ring ID: 1.780a
Quorate: Yes
Votequorum information
----------------------
Expected votes: 13
Highest expected: 13
Total votes: 13
Quorum: 7
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.241.197.101 (local)
0x00000066 1 10.241.197.102
0x00000067 1 10.241.197.103
0x00000068 1 10.241.197.104
0x00000069 1 10.241.197.105
0x0000006a 1 10.241.197.106
0x0000006b 1 10.241.197.107
0x0000006c 1 10.241.197.108
0x0000006d 1 10.241.197.109
0x0000006e 1 10.241.197.110
0x0000006f 1 10.241.197.111
0x00000070 1 10.241.197.112
0x00000071 1 10.241.197.113
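As a side note on the numbers above: the "Quorum: 7" line is consistent with the usual majority rule, where quorum is the integer half of the expected votes plus one. A minimal sketch of that arithmetic (the function name quorum is made up for illustration; it is not a pvecm or corosync command):

```shell
# quorum N: smallest number of votes that is a strict majority of N expected votes.
# Hypothetical helper for illustration, not part of pvecm or corosync.
quorum() {
    echo $(( $1 / 2 + 1 ))
}
```

With 13 expected votes, `quorum 13` prints 7, so the cluster should stay quorate with up to 6 nodes missing.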
I tried logging out and back in on the other web GUIs, but now I cannot log in anymore.
I tried again: stopped everything and restarted one by one. The same happens: everything looks good, more and more green, until I add 113. It then turns from red to grey, and some seconds later (or a minute?) every node shows all the others as grey too, with only its own node green. I checked on 101-107 (all show the same) and 113 (login never works).
I did it a third time. This time I stopped adding at 112. I waited, logged out and in on several nodes, all fine. Apparently there is now some bad state on 113 which brings down the whole cluster.
What should I do next? Can I provide more information (logfiles or such)?
Should I skip 113 and try to add 114?
Should I remove (or reinstall?) 113 and try adding again?
About the network, what is of interest?
lspci reports "Intel Corporation Ethernet Connection (11) I219-LM", a simple GbE onboard NIC.
Each has maybe a 2..3 m cable to a Cisco Catalyst 9300 48-port switch (C9300-48T-E V04 with 1 x C9300-NM-4M). All these ports are on the same VLAN and the VLAN is invisible to clients (i.e. "normal" access mode).
iperf reports 934..936 Mbit/s and jitter 0.079..0.120ms, e.g.
Code:
# TCP (iperf -c labhen197-113)
[ 3] 0.0000-10.0036 sec 1.09 GBytes 936 Mbits/sec
# UDP (iperf -u -c labhen197-113)
[ 3] 0.0000-10.0155 sec 1.25 MBytes 1.05 Mbits/sec 0.082 ms 0/ 895 (0%)
# PING
root@labhen197-113:~# ping -f -c 10000 -s 1500 labhen197-101
PING labhen197-101.bt.bombardier.net (10.241.197.101) 1500(1528) bytes of data.
--- labhen197-101.bt.bombardier.net ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 1751ms
rtt min/avg/max/mdev = 0.140/0.168/0.945/0.018 ms, ipg/ewma 0.175/0.159 ms
root@labhen197-101:~# ping -f -c 10000 -s 1500 labhen197-113
PING labhen197-113.bt.bombardier.net (10.241.197.113) 1500(1528) bytes of data.
--- labhen197-113.bt.bombardier.net ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 1743ms
rtt min/avg/max/mdev = 0.135/0.167/0.970/0.016 ms, ipg/ewma 0.174/0.168 ms
root@labhen197-101:~# ip link show eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 74:78:27:3f:31:aa brd ff:ff:ff:ff:ff:ff
altname enp0s31f6
root@labhen197-113:~# ip l sh eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 74:78:27:71:f0:1e brd ff:ff:ff:ff:ff:ff
altname enp0s31f6
root@labhen197-101:~# ip -s -h l show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 74:78:27:3f:31:aa brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
20.9G 68.7M 0 0 0 50.1M
TX: bytes packets errors dropped carrier collsns
8.19G 24.2M 0 0 0 0
altname enp0s31f6
root@labhen197-113:~# ip -s -h l show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 74:78:27:71:f0:1e brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
36.4G 79.7M 0 0 0 50.1M
TX: bytes packets errors dropped carrier collsns
6.05G 8.88M 0 0 0 0
altname enp0s31f6
root@labhen197-113:~#
root@labhen197-101:~# ethtool eno2
Settings for eno2:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
root@labhen197-101:~#
root@labhen197-113:~# ethtool eno2
Settings for eno2:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
root@labhen197-113:~#
Is there anything else I could provide?