Maintenance on a large part of cluster

kluvi

Nov 3, 2022
Hi.
We are currently in a somewhat unusual situation. Our cluster has 25 nodes: 5 nodes are in one datacenter and 20 nodes are in another. The two datacenters are directly connected by fiber.
We want to replace the switch in the second datacenter (20 nodes) because it is somewhat faulty. It is not possible to move the servers to the new switch one by one, because the old switch does not delete entries from its MAC address table (and really bad things happened when we tried to reconnect one server to the new switch :D).
So we need to reconnect all servers at once. Downtime on the VMs is not a problem, but I am worried about the PVE cluster itself.

What is the recommended way to temporarily disconnect 20 of the 25 PVE nodes (just for a few minutes)?

By the way, we don't use Ceph on the nodes, but most of the VMs use shared NVMe/TCP storage.
 
As long as you have enough resources in the "small" section, just move the workload over, then shut down all nodes in the "large" section. Perform your required maintenance and boot them back up.
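
One thing to keep in mind with this approach is quorum: corosync needs a strict majority of votes, so while the 20 nodes are down the remaining 5 will not be quorate on their own. Running guests keep running, but /etc/pve becomes read-only on the non-quorate nodes, and if HA is enabled a loss of quorum can trigger fencing. Here is a minimal sketch of the arithmetic, assuming the default of one vote per node; the function names are only for illustration and are not part of any Proxmox tool.

```python
# Quorum arithmetic for a cluster where every node has exactly one vote
# (assumption: default quorum_votes of 1 on all 25 nodes).

def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(votes_online: int, total_votes: int) -> bool:
    """True if the online votes reach the majority threshold."""
    return votes_online >= quorum_threshold(total_votes)


TOTAL_NODES = 25   # whole cluster
SMALL_SITE = 5     # nodes that stay up during the switch replacement
LARGE_SITE = 20    # nodes behind the switch being replaced

print("votes needed:", quorum_threshold(TOTAL_NODES))
print("small site quorate alone:", is_quorate(SMALL_SITE, TOTAL_NODES))
print("large site quorate alone:", is_quorate(LARGE_SITE, TOTAL_NODES))
```

So with 25 votes you need 13 online, which means the small site cannot make cluster-wide changes until enough of the large site is back.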

Edit: there is another possibility. Keep both the old and the new switch connected simultaneously, but put the "new" network on a separate VLAN and a separate corosync interface. Make sure that both sets of switch ports provide access to both VLANs, and that both the old and the new switch have connectivity to the remote node set. Once you have established connectivity on the new switch, you can turn off and disconnect the old one without any downtime.
 