Maintenance on a large part of cluster

kluvi

Member
Nov 3, 2022
Hi.
We are currently in a somewhat unusual situation. Our cluster has 25 nodes: 5 nodes are in one datacenter and 20 nodes are in another. The two datacenters are directly connected by fiber.
We want to replace the switch in the second datacenter (20 nodes) because it is a little bit faulty. It is not possible to move the servers to the new switch one by one, because the old switch does not delete entries from its MAC address table (and really bad things happened when we tried to reconnect one server to the new switch :D ).
So we need to reconnect all servers at once. Downtime on the VMs is not a problem, but I am worried about the PVE cluster itself.

What is the recommended way to temporarily disconnect 20 of 25 PVE nodes? (Just for a few minutes.)

BTW, we don't use Ceph on the nodes, but most of the VMs use shared NVMe/TCP storage.
 
As long as you have enough resources in the "small" section, just move the workload over, then shut down all nodes in the "large" section. Perform your required maintenance and boot them back up.

Edit: there is another possibility. Keep both the old and the new switch connected simultaneously, but put the "new" network on a separate VLAN and a separate corosync interface. Make sure that both sets of switch ports provide access to both VLANs, and that both the old and the new switch have connectivity to the remote node set. Once you have established connectivity on the new switch, you can turn off and disconnect the old one without any downtime.
 
Thanks, both of you. Unfortunately, we don't have enough resources to migrate the workload to the smaller section. The second way looks promising; I will think about it.

Is there a simple way to temporarily disable HA migrations?
I know I can use the API or a bash script to save the list of VMs with HA configured (not all VMs are HA), bulk-disable HA on all of them, and then re-enable it from the stored "backup". But that looks too complicated.
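
For what it's worth, the save / disable / re-enable idea could be sketched roughly like this. It assumes the usual section layout of /etc/pve/ha/resources.cfg (headers such as "vm: 100" followed by indented options) and the `ha-manager set <sid> --state ...` command; the function names and the backup path in the usage comments are made up for illustration. The sketch only prints the commands so they can be reviewed before anything is actually run.

```shell
#!/bin/sh
# Sketch only: list HA resources and PRINT (not run) the ha-manager commands.
# Assumption: resources.cfg has section headers like "vm: 100" / "ct: 200",
# with an indented "state <value>" line below when a state is set explicitly.

# Print "<sid> <state>" for every resource in the given config file.
# A resource without an explicit state line is reported as "started".
list_ha_states() {
    awk '
        function emit() { if (sid != "") print sid, state }
        /^(vm|ct):[ \t]+[0-9]+/ {
            emit()
            type = $1; sub(":", "", type)
            sid = type ":" $2; state = "started"
            next
        }
        /^[ \t]+state[ \t]/ { state = $2 }
        END { emit() }
    ' "$1"
}

# Read "<sid> <state>" lines on stdin, print commands that disable each one.
print_disable_cmds() {
    while read -r sid _state; do
        echo "ha-manager set $sid --state disabled"
    done
}

# Read "<sid> <state>" lines on stdin, print commands that restore each one.
print_restore_cmds() {
    while read -r sid state; do
        echo "ha-manager set $sid --state $state"
    done
}

# Possible usage (hypothetical backup path):
#   list_ha_states /etc/pve/ha/resources.cfg > /root/ha-states.txt
#   print_disable_cmds < /root/ha-states.txt    # review, then pipe to sh
#   print_restore_cmds < /root/ha-states.txt    # after the maintenance
```

Keeping the saved list outside /etc/pve means it stays readable even while the cluster filesystem is not quorate.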

I also know about Datacenter > Options > HA settings: freeze, but that doesn't help if something goes wrong during our planned procedure and everything goes down at once.
 
Is there a simple way to temporarily disable HA migrations?
mv /etc/pve/ha/resources.cfg /tmp/

;)

Then move the file back again.
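
A slightly more defensive version of the same move, as a minimal sketch: the paths default to the ones above, while the variable and function names are invented here. It refuses to overwrite an existing file in either direction.

```shell
#!/bin/sh
# Sketch only: park the HA resource config and put it back later.
# HA_CFG and PARK_DIR default to the paths from the post; both are overridable.
HA_CFG="${HA_CFG:-/etc/pve/ha/resources.cfg}"
PARK_DIR="${PARK_DIR:-/tmp}"

# Move the config out of the way, unless a parked copy already exists.
park_ha_config() {
    parked="$PARK_DIR/$(basename "$HA_CFG")"
    [ -f "$HA_CFG" ] || { echo "no config at $HA_CFG" >&2; return 1; }
    [ ! -e "$parked" ] || { echo "already parked: $parked" >&2; return 1; }
    mv "$HA_CFG" "$parked"
}

# Move the parked config back, unless a config is already in place.
restore_ha_config() {
    parked="$PARK_DIR/$(basename "$HA_CFG")"
    [ -f "$parked" ] || { echo "nothing parked at $parked" >&2; return 1; }
    [ ! -e "$HA_CFG" ] || { echo "refusing to overwrite $HA_CFG" >&2; return 1; }
    mv "$parked" "$HA_CFG"
}
```

On many systems /tmp does not survive a reboot, so pointing PARK_DIR at something persistent such as /root may be the safer choice.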


Note that you also need to close the watchdog to avoid fencing. Currently the only way to do that is to:

1) stop pve-ha-lrm on all nodes, node by node
2) stop pve-ha-crm on all nodes, node by node


Then do the reverse when you have finished your upgrade.
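
The two steps, and their reverse, might be scripted along these lines. This is only a sketch: the node names are placeholders, root SSH access between nodes is assumed, and "the reverse" is read here as starting pve-ha-crm first and pve-ha-lrm second. With DRY_RUN=1 (the default in this sketch) the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only. NODES is a made-up list; replace it with the real node names.
# With DRY_RUN=1 the ssh commands are printed, not executed.
NODES="${NODES:-pve01 pve02 pve03}"
DRY_RUN="${DRY_RUN:-1}"

# Run (or just print) a command on one node over ssh.
run_on() {
    target=$1; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$target $*"
    else
        ssh "root@$target" "$@"
    fi
}

# Apply one systemctl action to one service on every node, one node at a time.
# Stops at the first node where the command fails.
for_all_nodes() {
    action=$1; service=$2
    for node in $NODES; do
        run_on "$node" systemctl "$action" "$service" || return 1
    done
}

# Before the maintenance: LRM everywhere first, then CRM everywhere.
stop_ha()  { for_all_nodes stop pve-ha-lrm && for_all_nodes stop pve-ha-crm; }

# After the maintenance: the reverse order.
start_ha() { for_all_nodes start pve-ha-crm && for_all_nodes start pve-ha-lrm; }

# Possible usage:
#   stop_ha                # dry run, prints the ssh commands
#   DRY_RUN=0; stop_ha     # actually run them
```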



(BTW, HA with only 2 datacenters is really not recommended. If you have a split brain or a fiber cut between the 2 DCs, or if your main datacenter is down, the 5 nodes on your second site will be fenced and reboot, and HA will not be able to automatically fail over the VMs to the second site.)
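
The arithmetic behind that warning, as a small sketch. It assumes the default of one vote per node and no external quorum device; the function names are made up.

```shell
#!/bin/sh
# Votes needed for quorum when every node has one vote: floor(n / 2) + 1.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

# Succeeds if a partition of $2 nodes out of $1 total still has quorum.
has_quorum() { [ "$2" -ge "$(quorum_needed "$1")" ]; }

quorum_needed 25                                  # prints 13
has_quorum 25 20 && echo "20 nodes: quorate"
has_quorum 25 5  || echo "5 nodes: not quorate"
```

So the 5-node site can never reach the 13 votes on its own, and the 20-node site keeps quorum only while its nodes can still reach each other, which is not the case while their only switch is being swapped.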
 
Thanks again. Today we performed the "operation" and it went well. At one step the cluster started behaving weirdly (I will post a new thread about this), but disabling HA on the VMs and stopping LRM + CRM saved our cluster from disaster.
 
Yes, we had only one switch. It was just a temporary solution before we move to the final HA pair of 100G switches.

BTW, for reference, here is the thread about another problem we found during the migration