Adding node #10 makes puts whole cluster out of order

sdettmer · May 11, 2023

fabian said:
it probably is enough to stop corosync and pve-cluster on node 13, check the status on the other nodes, and if that looks good, start the services again on node 13.

there is one peculiarity in the logs that i am not sure about, I'll have to take a closer look - when node13 joined corosync logs that its ready after pmxcfs logged that it started synchronization, maybe that caused some inconsistent state or caused something to be lost, but it might also not matter at all or just be a log ordering issue with no consequences.

OK, so I did.

Afterwards, I could login in web GUI 101, bot not on 102, 103, 104, 105, 106 (all hang), but 107 I can. On 108, 109, 110, 111, 112 I cannot login (all hang), but on 113. On the three were I can login, I see 101, 107 and 113 in green icon, all other in red. SSH works on each even with key now.

The pvecm status is accordingly; on 101, 107 and 113, only 3 members are listed, but on the others, all (101-113) are listed. Apparently the command is fast on the others, but often takes multiple seconds on the 3. I added details in attached script-4.txt.

Thank you so much that you help!

fabian · May 11, 2023

okay, so that was not enough to get corosync unstuck. I would proceed with stopping corosync everywhere, and then starting it node by node, waiting for the nodes to settle.

it would still be interesting to know more about your network setup

sdettmer · May 11, 2023

fabian said:
okay, so that was not enough to get corosync unstuck. I would proceed with stopping corosync everywhere, and then starting it node by node, waiting for the nodes to settle.

it would still be interesting to know more about your network setup

ok, so I did corosync + pve-cluster stop at all, then started pve-cluster, started corosync and waited a minute or so. I watch pvecm status on other terminal. I saw each increased number of votes after few seconds and when quorom was reached, I logged in into web frontend and saw it working. I tried logging in in several nodes with success. On the other nodes, each started node is green. The not-yet-started ones are red. With every start, one red becomes green. Lovely!

I also started on 113. On the other nodes, same colors (113 grey, others green). I cannot login Web GUI on this node. I get a timeout. Even some minutes later I cannot. I looked again to other web front ends, and now only the local node is green, but all others are grey. pvevm status looks good I think:

Code:

Cluster information
-------------------
Name:             tisc-pve
Config Version:   13
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu May 11 16:09:31 2023
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000001
Ring ID:          1.780a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      13
Quorum:           7
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.241.197.101 (local)
0x00000066          1 10.241.197.102
0x00000067          1 10.241.197.103
0x00000068          1 10.241.197.104
0x00000069          1 10.241.197.105
0x0000006a          1 10.241.197.106
0x0000006b          1 10.241.197.107
0x0000006c          1 10.241.197.108
0x0000006d          1 10.241.197.109
0x0000006e          1 10.241.197.110
0x0000006f          1 10.241.197.111
0x00000070          1 10.241.197.112
0x00000071          1 10.241.197.113

I tried logout / login other web GUIs, but now I cannot login anymore.

I tried again: stop all, and restart one-by-one. Same happens: everything looks good, more and more green, until I add 113. Then this again turns from red to grey and some seconds later (or a minute?) from each node all others are grey too, only own node is green. I logged into 101-107 (all show the same) and 113 (never works).

I did a third time. Now I stop adding at 112. I wait, logout, login on several nodes, all fine. Apparently there is now some bad state in 113 which brings down the whole cluster.

What should I do next? Can I provide more information (logfiles or such)?
Should I skip 113 and try to add 114?
Should I remove (or reinstall?) 113 and try adding again?

About the network, what is of interest?
lspci reports "Intel Corporation Ethernet Connection (11) I219-LM", a simple GbE onboard NIC.
Each has maybe 2..3m cable to a EDIT: Cisco Catalyst 9300 48 port switch (C9300-48T-E V04 mit 1 x C9300-NM-4M). All these ports on same VLAN and the VLAN is invisible to clients (i.e. "normal" access mode).

iperf reports 934..936 Mbit/s and jitter 0.079..0.120ms, e.g.

Code:

# TCP (iperf -c labhen197-113)
[  3] 0.0000-10.0036 sec  1.09 GBytes   936 Mbits/sec
# UDP (iperf -u -c labhen197-113)
[  3] 0.0000-10.0155 sec  1.25 MBytes  1.05 Mbits/sec   0.082 ms    0/  895 (0%)

# PING
root@labhen197-113:~# ping -f -c 10000 -s 1500 labhen197-101
PING labhen197-101.bt.bombardier.net (10.241.197.101) 1500(1528) bytes of data.
 
--- labhen197-101.bt.bombardier.net ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 1751ms
rtt min/avg/max/mdev = 0.140/0.168/0.945/0.018 ms, ipg/ewma 0.175/0.159 ms

root@labhen197-101:~# ping -f -c 10000 -s 1500 labhen197-113
PING labhen197-113.bt.bombardier.net (10.241.197.113) 1500(1528) bytes of data.
 
--- labhen197-113.bt.bombardier.net ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 1743ms
rtt min/avg/max/mdev = 0.135/0.167/0.970/0.016 ms, ipg/ewma 0.174/0.168 ms

root@labhen197-101:~# ip link show eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 74:78:27:3f:31:aa brd ff:ff:ff:ff:ff:ff
    altname enp0s31f6

root@labhen197-113:~# ip l sh eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 74:78:27:71:f0:1e brd ff:ff:ff:ff:ff:ff
    altname enp0s31f6

root@labhen197-101:~# ip -s -h l show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 74:78:27:3f:31:aa brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped missed  mcast 
    20.9G      68.7M    0       0       0       50.1M 
    TX: bytes  packets  errors  dropped carrier collsns
    8.19G      24.2M    0       0       0       0     
    altname enp0s31f6

root@labhen197-113:~# ip -s -h l show dev eno2
2: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 74:78:27:71:f0:1e brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped missed  mcast 
    36.4G      79.7M    0       0       0       50.1M 
    TX: bytes  packets  errors  dropped carrier collsns
    6.05G      8.88M    0       0       0       0     
    altname enp0s31f6
root@labhen197-113:~#


root@labhen197-101:~# ethtool eno2
Settings for eno2:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
root@labhen197-101:~#

root@labhen197-113:~# ethtool eno2
Settings for eno2:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
root@labhen197-113:~#

anything else I could provide?

fabian · May 12, 2023

that does sound really strange. if you are willing to do some more debugging, enabling debug logs in /etc/pve/corosync.conf - please follow the steps at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration - and copying the same config to node 113 to /etc/corosync/corosync.conf as well, and then repeating the experiment and collecting logs might shed some additional light - but I have to warn you, the logs are really verbose!

otherwise, verifying one more time that there is nothing strange w.r.t. switch port settings (storm control or something like that, for example) and removing the node (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) followed by a reinstall and rejoin would be interesting. if the problem persists, than it must be somehow hardware or network setup related..

sdettmer · May 16, 2023

fabian said:
that does sound really strange. if you are willing to do some more debugging, enabling debug logs in /etc/pve/corosync.conf - please follow the steps at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration - and copying the same config to node 113 to /etc/corosync/corosync.conf as well, and then repeating the experiment and collecting logs might shed some additional light - but I have to warn you, the logs are really verbose!

otherwise, verifying one more time that there is nothing strange w.r.t. switch port settings (storm control or something like that, for example) and removing the node (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) followed by a reinstall and rejoin would be interesting. if the problem persists, than it must be somehow hardware or network setup related..

I'm sorry for my late reply, I was on sick leave.

I noticed that the pvecm status has changed, just in case it helps:
(This just as side information. I'll answer your questions in a next post to ease skipping this if it is of no interest.)

Code:

root@labhen197-101:~# pvecm  status
Cluster information
-------------------
Name:             tisc-pve
Config Version:   13
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue May 16 18:24:37 2023
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000001
Ring ID:          1.15d6c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      8
Quorum:           7 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.241.197.101 (local)
0x00000066          1 10.241.197.102
0x00000067          1 10.241.197.103
0x00000068          1 10.241.197.104
0x00000069          1 10.241.197.105
0x0000006f          1 10.241.197.111
0x00000070          1 10.241.197.112
0x00000071          1 10.241.197.113
root@labhen197-101:~#
logout
Connection to labhen197-101 closed.
sdettmer@RefVm5:~/work/tisc-src (master_tisc $ u-6) $ ssh root@labhen197-106
Linux labhen197-106 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue May 16 18:23:43 2023 from 10.169.9.34
root@labhen197-106:~# pvecm  status
Cluster information
-------------------
Name:             tisc-pve
Config Version:   13
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
root@labhen197-106:~#
root@labhen197-113:~# uptime
 18:32:01 up 13 days,  3:58,  2 users,  load average: 1.06, 1.04, 1.00
root@labhen197-113:~# pvecm  status
Cluster information
-------------------
Name:             tisc-pve
Config Version:   13
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue May 16 18:32:09 2023
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000071
Ring ID:          1.15d6c
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      8
Quorum:           7 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.241.197.101
0x00000066          1 10.241.197.102
0x00000067          1 10.241.197.103
0x00000068          1 10.241.197.104
0x00000069          1 10.241.197.105
0x0000006f          1 10.241.197.111
0x00000070          1 10.241.197.112
0x00000071          1 10.241.197.113 (local)
root@labhen197-113:~#
root@labhen197-113:~# last
root     pts/1        10.169.133.132   Tue May 16 18:31   still logged in
root     pts/1        10.241.197.96    Thu May 11 16:59 - 17:42  (00:43)
root     pts/1        10.241.197.96    Thu May 11 16:50 - 16:59  (00:09)
root     pts/1        10.241.197.96    Thu May 11 16:40 - 16:40  (00:00)
root@labhen197-113:~# systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-05-12 04:58:10 CEST; 4 days ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1636848 (corosync)
      Tasks: 9 (limit: 37994)
     Memory: 28.1G
        CPU: 56min 45.856s
     CGroup: /system.slice/corosync.service
             └─1636848 /usr/sbin/corosync -f

May 13 17:36:04 labhen197-113 corosync[1636848]:   [TOTEM ] Retransmit List: 6 c d e f 10 11 18 19 1a 1c 1d 1f 20 21 22 29 2a 2d>
May 13 17:36:09 labhen197-113 corosync[1636848]:   [TOTEM ] Retransmit List: 1f 10 30 32 3d 3e 45 46 47 48 49 4f 50
May 13 17:36:16 labhen197-113 corosync[1636848]:   [TOTEM ] Retransmit List: 60 61 6b 6c 6d 6e 6f 70 76 77 78 79

root@labhen197-113:~# less /var/log/syslog.1
[...]
May 12 04:58:09 labhen197-113 systemd[1]: Starting The Proxmox VE cluster filesystem...
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [quorum] crit: quorum_initialize failed: 2
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [quorum] crit: can't initialize service
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [confdb] crit: cmap_initialize failed: 2
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [confdb] crit: can't initialize service
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [dcdb] crit: cpg_initialize failed: 2
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [dcdb] crit: can't initialize service
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [status] crit: cpg_initialize failed: 2
May 12 04:58:09 labhen197-113 pmxcfs[1636843]: [status] crit: can't initialize service
May 12 04:58:10 labhen197-113 systemd[1]: Started The Proxmox VE cluster filesystem.
May 12 04:58:10 labhen197-113 systemd[1]: Starting Corosync Cluster Engine...
May 12 04:58:10 labhen197-113 systemd[1]: Starting Daily PVE download activities...
May 12 04:58:10 labhen197-113 corosync[1636848]:   [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
May 12 04:58:10 labhen197-113 corosync[1636848]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf 
vqsim nozzle snmp pie relro bindnow
May 12 04:58:10 labhen197-113 corosync[1636848]:   [TOTEM ] Initializing transport (Kronosnet).
May 12 04:58:10 labhen197-113 corosync[1636848]:   [TOTEM ] totemknet initialized
May 12 04:58:10 labhen197-113 corosync[1636848]:   [KNET  ] pmtud: MTU manually set to: 0

so there was no login, but Friday night 04:58:10 corosync was started, presumably with "The Proxmox VE cluster filesystem", and the cluster went down.

This just as side information. I'll answer your questions in a next post to ease skipping this.

sdettmer · May 16, 2023

fabian said:
that does sound really strange. if you are willing to do some more debugging, enabling debug logs in /etc/pve/corosync.conf - please follow the steps at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration - and copying the same config to node 113 to /etc/corosync/corosync.conf as well, and then repeating the experiment and collecting logs might shed some additional light - but I have to warn you, the logs are really verbose!

otherwise, verifying one more time that there is nothing strange w.r.t. switch port settings (storm control or something like that, for example) and removing the node (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node) followed by a reinstall and rejoin would be interesting. if the problem persists, than it must be somehow hardware or network setup related..

Yes, I'm willing to do more debugging, because I don't feel well with the current situation. Either there is some bug in setup, like bad or misconfigured network, or a bug in some software, but in any case I don't have a reliable environment. I would love to find the reason.

I read the documentation and I see no problem editing the file as described (I understand it is needed on any node and will be replicated) and then copy to a different path on node 113. However, I don't know how to share the expected huge files, the forum has a small limit (even before I had to reduce the journalctl output significantly to fit the limit, after zipping of course). My company has some proprierary file sharing platform ("OpenTrust MFT"), but I would need an email address to share anything to (and maybe it must be of some big ISP like gmail.com or such?). Access to file share services, one-click-hosters and alike are forbidden, unfortunately. So how can I share the logs when I created them?

I hope I can re-install the node (but not within this next week) to see what happens. Can I do tests on the network? So far all I tested was stable and fast, but maybe I can utilize something special?
I have another system with some more cores on one host, should I try "nested PVE" (i.e. install PVE, then create 16 VMs with PVE each, and define a private bridge device for a private networking, so in that software networking, no switch or network defect could harm)? Or do I just replicate a standard testcase that is known to work?

fabian · May 17, 2023

sdettmer said:
Yes, I'm willing to do more debugging, because I don't feel well with the current situation. Either there is some bug in setup, like bad or misconfigured network, or a bug in some software, but in any case I don't have a reliable environment. I would love to find the reason.

I read the documentation and I see no problem editing the file as described (I understand it is needed on any node and will be replicated) and then copy to a different path on node 113. However, I don't know how to share the expected huge files, the forum has a small limit (even before I had to reduce the journalctl output significantly to fit the limit, after zipping of course). My company has some proprierary file sharing platform ("OpenTrust MFT"), but I would need an email address to share anything to (and maybe it must be of some big ISP like gmail.com or such?). Access to file share services, one-click-hosters and alike are forbidden, unfortunately. So how can I share the logs when I created them?

you can send me the link/access details via email: f.gruenbichler@proxmox.com (if possible, include a reference to this thread and ping the thread so I know to also check in case it got flagged as spam somewhere

)

sdettmer said:
I hope I can re-install the node (but not within this next week) to see what happens. Can I do tests on the network? So far all I tested was stable and fast, but maybe I can utilize something special?

you could retry stopping both corosync and pve-cluster and then just starting corosync on each node one by one (checking the logs and corosync-quorumtool -s before proceeding to the next one). if nothing else, this should keep the load to a minimum since you don't have the sync up traffic from pmxcfs.

sdettmer said:
I have another system with some more cores on one host, should I try "nested PVE" (i.e. install PVE, then create 16 VMs with PVE each, and define a private bridge device for a private networking, so in that software networking, no switch or network defect could harm)? Or do I just replicate a standard testcase that is known to work?

you can do that, but you need a pretty beefy host (you probably want at least 2 vcpus per VM, and corosync gets really unhappy if the VMs are not scheduled frequently enough). if you have a spare capacity of ~40 logical cores/hyper threads (or more

) I'd give it a shot.

n.borisenkov · May 17, 2023

The logs say "Token has not been received in". This traffic may not actually reach the server. Or be discarded by the operating system and not reach the process.

Maybe with a larger number of nodes, the receive buffers overflow? You can check the counters with the following command:

Code:

nstat --zero --ignore | grep -E "(UdpInErrors|UdpRcvbufErrors)"

If the buffers never overflowed after loading, then this should be 0

And you also need to write a traffic dump on UDP port 5405. It is better to compare two dumps when you add a node and everything is OK and when you add the last one.

You can also compare the output of the "ethtool -S eno2" command before adding the node and after

fabian · May 17, 2023

n.borisenkov said:
The logs say "Token has not been received in". This traffic may not actually reach the server. Or be discarded by the operating system and not reach the process.

the token is part of the consensus protocol, it not being received in time does not necessarily mean that any traffic was dropped altogether, but it usually indicates some kind of network issue or a corosync bug. it's also logged sometimes (although only once or twice in a row!) when the cluster membership changes, but those are benign.

n.borisenkov said:
Maybe with a larger number of nodes, the receive buffers overflow? You can check the counters with the following command:

Code:

nstat --zero --ignore | grep -E "(UdpInErrors|UdpRcvbufErrors)"

If the buffers never overflowed after loading, then this should be 0

And you also need to write a traffic dump on UDP port 5405. It is better to compare two dumps when you add a node and everything is OK and when you add the last one.

You can also compare the output of the "ethtool -S eno2" command before adding the node and after

checking for errors/retransmits/.. might give a clue, but dumping the on-wire traffic is pretty meaningless (it's encrypted). corosync debug logs contain much more relevant information. note that there also is a sort of chicken egg problem - if there is an issue, both corosync and pmxcfs will retransmit, potentially causing lots of traffic, potentially causing packets to be dropped (so even if you see dropped packets/retransmits on the network layer, those might be a symptom, not the cause).

n.borisenkov · May 17, 2023

Since it is not clear in which direction to dig, it is necessary to exclude theories one by one.

If the logs say that the token did not arrive within 7 seconds, then you need to check two things: that the token was sent during this period of time and that it reached its destination. Both of these things can be verified by recording a traffic dump. Accordingly, if the traffic dump shows that these packets are coming (packet capture occurs at a very early stage), and corosync says that it does not see them, then you need to look further. Buffers may overflow here (one of the options).

It is important for us to know that the packet actually arrived at the interface. The content doesn't matter (for now). In a traffic dump, I expect to see incoming packets from all other nodes. These packets should arrive at a constant rate (assuming the cluster is well established) and fairly frequently. In my cluster of 7 nodes, I see 2-3 packets every second from each host.

I capture traffic with the command "tcpdump -w corosync-pve07.pcap -ni enp5s0f1 port 5405 and dst host 172.20.70.231" where 172.20.70.231 is the target host I capture traffic on (but it's better to log all traffic on port 5405 on host that is added to the cluster).

Please attach pcap file if possible.

sdettmer · May 22, 2023

fabian said:
you can send me the link/access details via email: f.gruenbichler@proxmox.com (if possible, include a reference to this thread and ping the thread so I know to also check in case it got flagged as spam somewhere )

you could retry stopping both corosync and pve-cluster and then just starting corosync on each node one by one (checking the logs and corosync-quorumtool -s before proceeding to the next one). if nothing else, this should keep the load to a minimum since you don't have the sync up traffic from pmxcfs.

I documented every step using "script"
I was able to start corosync with debug enabled on each and I think this looked good. I took much time (30 mins in total) to be slow.

Then I started pve-cluster one by one. After adding 10 or so, load started to increase, but no CPU usage. According to the internet that can be processes waiting for blocking I/O and I found such processes.

The README contains more details.

In the web GUI, I saw node 101-111 in grey and 112-113 in red, but it seemed not to update. I try to log in again, but I cannot login anymore, I get ERR_TIMED_OUT.
nstat --zero --ignore | grep -E "(UdpInErrors|UdpRcvbufErrors)" before and after was and remains 0.0 on each and ping was 100% stable:

Code:

--- labhen197-113.bt.bombardier.net ping statistics ---
24933 packets transmitted, 24933 received, 0% packet loss, time 5008533ms  
rtt min/avg/max/mdev = 0.116/0.977/20.025/0.220 ms

The ethtool says for each host something like:

Code:

NIC statistics:
     rx_packets: 213159380
     tx_packets: 72175965
     rx_bytes: 78818799989
     tx_bytes: 20151689919
     rx_broadcast: 11463047
     tx_broadcast: 10
     rx_multicast: 97137424
     tx_multicast: 7
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 97137424
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 644
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 23145
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_csum_offload_good: 189565207
     rx_csum_offload_errors: 6
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
     rx_hwtstamp_cleared: 0
     uncorr_ecc_errors: 0
     corr_ecc_errors: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0

the values slightly differ of course, but during the process no field with "error" increased on any host (each host is in the file):

Code:

sdettmer@tux6:~/work/ansible/take-5 (master * u+1) $ grep error ethtool-S-eno2-before.txt|sort > 1
sdettmer@tux6:~/work/ansible/take-5 (master * u+1) $ grep error ethtool-S-eno2-after.txt|sort > 2
sdettmer@tux6:~/work/ansible/take-5 (master * u+1) $ diff 1 2
sdettmer@tux6:~/work/ansible/take-5 (master * u+1) $

Could it be not a corosync but a pmxcfs issue maybe?

sdettmer · May 22, 2023

n.borisenkov said:
Since it is not clear in which direction to dig, it is necessary to exclude theories one by one.

If the logs say that the token did not arrive within 7 seconds, then you need to check two things: that the token was sent during this period of time and that it reached its destination. Both of these things can be verified by recording a traffic dump. Accordingly, if the traffic dump shows that these packets are coming (packet capture occurs at a very early stage), and corosync says that it does not see them, then you need to look further. Buffers may overflow here (one of the options).

It is important for us to know that the packet actually arrived at the interface. The content doesn't matter (for now). In a traffic dump, I expect to see incoming packets from all other nodes. These packets should arrive at a constant rate (assuming the cluster is well established) and fairly frequently. In my cluster of 7 nodes, I see 2-3 packets every second from each host.

I capture traffic with the command "tcpdump -w corosync-pve07.pcap -ni enp5s0f1 port 5405 and dst host 172.20.70.231" where 172.20.70.231 is the target host I capture traffic on (but it's better to log all traffic on port 5405 on host that is added to the cluster).

Please attach pcap file if possible.

Hi,

If you can send me an email (for example via steffen.dettmer@alstomgroup.com), I can share you the pcap files (611 MB ZIPs for two hosts).

fabian · May 23, 2023

just in case you weren't aware - those pcap files might contain very sensitive data (such as the root ssh key, the corosync auth key, the private key of the TLS certificates, TFA secrets, passwords, ..). the data should all be encrypted using the corosync auth key, but you still might want to reconsider sharing them in full.

sdettmer · May 23, 2023

fabian said:
just in case you weren't aware - those pcap files might contain very sensitive data (such as the root ssh key, the corosync auth key, the private key of the TLS certificates, TFA secrets, passwords, ..). the data should all be encrypted using the corosync auth key, but you still might want to reconsider sharing them in full.

Thank you for pointing out. This is a test installation without real passwords, unreachable from other networks and not in production. Since I will have to re-install them anyway (after hopefully finding the cause for my problem), and I think all secrets etc. will be freshly generated during installation.
However, encryption must protect well nevertheless, otherwise an attacker would have got access to another device on the same switch, could possibly find ways to see the traffic - and must not able to benefit.
In other words, if sharing a pcap would be a problem, the network security would already be lost, I think

fabian · May 24, 2023

just as a heads up - I will probably only have time to look at your logs in detail end of next week (long weekend coming up

)

sdettmer · May 24, 2023

fabian said:
just as a heads up - I will probably only have time to look at your logs in detail end of next week (long weekend coming up )

Thank you very much!
Could you possibly take a brief look if the needed / interesting data is included at all? Just to avoid I made some silly mistakes or so and the data is not useful at all (in this case I could get more information or repeat the test)

sdettmer · Jun 2, 2023

fabian said:
just as a heads up - I will probably only have time to look at your logs in detail end of next week (long weekend coming up )

I hope you did not take a look and instead had a great long weekend!

Could you please advice what I should try next? Should I kill the coro key from node 13 and see if the other 12 work at all? Should I try add node 14?

ps: Of course I don't believe that numbers have real-life properties (except 42), but a part of me finds it somehow nice that it is actually node 13 :D

fabian · Jun 2, 2023

yes indeed

thanks for the log! I will have to take some time to digest them in detail, but they look sensible to me.

AFAICT:
- corosync only starting and establishing quorum worked using all nodes
- starting pmxcfs afterwards triggered the issue once node 109 was reached
- there were retransmits earlier already that might indicate issues with the networking or scheduling keeping up

you could try with only 12 nodes and see if those are stable on their own, and if they are, add node 14 only to see if it's really somehow node13 being "special" in some fashion (not number wise, but hardware/..

)

brosky · Jun 14, 2023

Hi,

I'm having the same problem: https://forum.proxmox.com/threads/one-node-in-cluster-brings-everything-down.128862/

I suspect the source of the problem the inability to write on the /etc/pve folder after the node boots and joins the cluster.
Is there a way to write there files when the pve-cluster service is stopped ?

dignus · Jun 14, 2023

Question on the sidelines of this: Is there a maximum defined number of hosts in a cluster?

Adding node #10 makes puts whole cluster out of order

Active Member

Attachments

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Active Member

Proxmox Staff Member

New Member

Proxmox Staff Member

New Member

Active Member

Attachments

Active Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Active Member

Active Member

Proxmox Staff Member

Well-Known Member

Renowned Member

We value your privacy