Hi.
We have a cluster of 25 nodes - 5 in one datacenter, 20 in another. The two datacenters are directly connected with 4x100G fiber (it's the same building, just different floors).
We performed some maintenance (I asked some questions about it in another thread) - we needed to move from some temporary switches (without redundancy) to an HA pair of new switches.
First we prepared the HA pair, connected a testing server, installed Proxmox on it and joined it to our cluster... everything worked as expected.
Then we took one server, disconnected its single cable from the old switches and connected 2 cables from the new switches. The server looked fine, but about 4 minutes later everything was down... some servers restarted themselves, some did not. When we disconnected that server, everything stabilized.
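In hindsight, one thing we could have checked right after re-cabling is whether the LACP bond actually came up against the new switch pair, something along these lines (assuming the bond name bond0 from the config below):

Code:
cat /proc/net/bonding/bond0    # 802.3ad state, aggregator IDs, partner MAC per slave
ip -br link show               # quick link-state overview for eno1np0 / eno2np1

If the two slaves don't end up in the same aggregator with the same partner, the two cables are not behaving as one bond towards the HA pair.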
Then we checked the configuration of the new switches again and again, fixed a few details and prepared better for the downtime (see the link to the other thread above).
Today we finally migrated 19 of the 20 servers to the new switches. Both cables are connected and everything is working as expected, EXCEPT for the one server we tried earlier.
The old switches are now completely disconnected. Then we tried to connect this problematic server again. As before - for the first few minutes everything looked good, then in the PVE web UI one server grayed out, then another... and then all of them. I think that if we hadn't stopped LRM + CRM, everything would have crashed again.
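To be clear, by stopping LRM + CRM I mean we stopped the HA services on the nodes so that nothing would fence itself, roughly like this (standard PVE service names):

Code:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm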
I have absolutely no idea what to look for. The logs on the switches only show that ports went down, and the logs on the PVE nodes show disconnected nodes followed by a reboot - which is expected when nodes randomly go down.
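For reference, this is roughly how we went through the node logs (assuming the standard PVE/corosync unit names):

Code:
journalctl -u corosync -u pve-cluster --since "2 hours ago"
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2 hours ago"

They only show the other nodes dropping out of the cluster and then the reboot, as described above.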
How is it possible that connecting a single server takes down the whole cluster?
All nodes have the "same" configuration - we use Ansible for configuration management. Here is the relevant part of our `/etc/network/interfaces` in case it helps to understand our network architecture (the switch ports are configured as trunk ports with the required VLANs allowed):
Code:
# physical NICs
auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

# LACP bond over both NICs
auto bond0
iface bond0 inet manual
        bond-slaves eno1np0 eno2np1
        bond-miimon 100
        bond-mode 802.3ad

# VLAN 701 and its bridge
auto bond0.701
iface bond0.701 inet manual

auto vmbr701
iface vmbr701 inet manual
        bridge-ports bond0.701
        bridge-stp off
        bridge-fd 0

# VLAN 703 and its bridge (carries the node's IP)
auto bond0.703
iface bond0.703 inet manual

auto vmbr703
iface vmbr703 inet manual
        address 10.2.102.251/26
        gateway 10.2.102.193
        bridge-ports bond0.703
        bridge-stp off
        bridge-fd 0