Hi.
We have a cluster of 25 nodes - 5 in one datacenter, 20 in another. The two datacenters are directly connected with 4x100G fiber (it's the same building, just different floors).
We performed some maintenance (I asked some questions about it in another thread) - we needed to move from some temporary switches (without redundancy) to an HA pair of new switches.
First we prepared the HA pair, connected a testing server, installed Proxmox on it and joined it to our cluster... everything worked as expected.
Then we took one server, disconnected its single cable from the old switches and connected 2 cables from the new switches. The server looked fine, but about 4 minutes later everything was down... some servers restarted themselves, some did not. When we disconnected that server, everything stabilized.
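In hindsight, one thing we could have checked right after re-cabling is whether the LACP bond actually came up against the new switch pair, something along these lines (assuming the bond name bond0 from the config below):

Code:
cat /proc/net/bonding/bond0    # 802.3ad state, aggregator IDs, partner MAC per slave
ip -br link show               # quick link-state overview for eno1np0 / eno2np1

If the two slaves don't end up in the same aggregator with the same partner, the two cables are not behaving as one bond towards the HA pair.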
Then we checked the configuration of the new switches again and again, fixed a few details and prepared better for the downtime (see the link to the other thread above).
Today we finally migrated 19 of the 20 servers to the new switches. Both cables are connected and everything is working as expected, EXCEPT for the one server we tried earlier.
The old switches are now completely disconnected. Then we tried to connect this problematic server again. As before - for the first few minutes everything looked good, then in the PVE web UI one server grayed out, then another... and then all of them. I think that if we hadn't stopped LRM + CRM, everything would have crashed again.
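To be clear, by stopping LRM + CRM I mean we stopped the HA services on the nodes so that nothing would fence itself, roughly like this (standard PVE service names):

Code:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm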
I have absolutely no idea what to look for. The logs on the switches only show that ports went down, and the logs on the PVE nodes show disconnected nodes followed by a reboot - which is expected when nodes randomly go down.
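For reference, this is roughly how we went through the node logs (assuming the standard PVE/corosync unit names):

Code:
journalctl -u corosync -u pve-cluster --since "2 hours ago"
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2 hours ago"

They only show the other nodes dropping out of the cluster and then the reboot, as described above.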
How is it possible that connecting a single server takes down the whole cluster?
All nodes have the "same" configuration - we use Ansible for configuration management. Here is the relevant part of our `/etc/network/interfaces` in case it helps to understand our network architecture (the switch ports are configured as trunk ports with the required VLANs allowed):
Code:
# physical NICs
auto eno1np0
iface eno1np0 inet manual

auto eno2np1
iface eno2np1 inet manual

# LACP bond over both NICs
auto bond0
iface bond0 inet manual
        bond-slaves eno1np0 eno2np1
        bond-miimon 100
        bond-mode 802.3ad

# VLAN 701 and its bridge
auto bond0.701
iface bond0.701 inet manual

auto vmbr701
iface vmbr701 inet manual
        bridge-ports bond0.701
        bridge-stp off
        bridge-fd 0

# VLAN 703 and its bridge (carries the node's IP)
auto bond0.703
iface bond0.703 inet manual

auto vmbr703
iface vmbr703 inet manual
        address 10.2.102.251/26
        gateway 10.2.102.193
        bridge-ports bond0.703
        bridge-stp off
        bridge-fd 0