Maintenance on a large part of cluster

kluvi · May 23, 2025

Hi.
We are currently in a somewhat specific situation... Our cluster has 25 nodes: 5 nodes are in one datacenter, 20 nodes in another. Both datacenters are directly connected with fiber.
We want to replace the switch in the second datacenter (20 nodes), because it's a little bit faulty. It is not possible to reconnect server-by-server to the new switch, because the old one does not delete entries from its MAC address table (and really bad things happened when we tried to reconnect one server to the new switch).
So we need to reconnect all servers at once. Downtime on VMs is not the problem, but I am worried about PVE cluster itself.

What is the recommended way to temporarily disconnect 20 of 25 PVE nodes? (just a few minutes)

btw we don't use Ceph on the nodes, but most of the VMs use shared NVMe/TCP storage.

alexskysilk · May 23, 2025

as long as you have enough resources in the "small" section, just move the workload over, then shut down all nodes in the "large" section. perform your required maintenance and boot them back up.

Edit: there is another possibility; keep both the old and new switches connected simultaneously, but keep the "new" network on a separate VLAN and a separate corosync interface. Make sure that both sets of switchports provide access to both VLANs, and make sure that both the old and new switch have connectivity to the remote nodeset. Once you establish connectivity on the new switch you can turn off and disconnect the old one without any downtime.

spirit · May 24, 2025

shouldn't be a problem, but if you use HA, you really need to disable it beforehand.

kluvi · May 26, 2025

Thanks both of you... Unfortunately we don't have enough resources to migrate the workload to the smaller section. The second way looks promising, I will think about it.

Is there some simple way to temporarily disable HA migrations,... ?
I know I can use the API / a bash script to save the list of VMs with HA configured (not all VMs are HA), then bulk disable HA on all of them and then re-enable it from the stored "backup". But it looks too complicated.

I also know about Datacenter > Options > HA settings: freeze, but it doesn't help when something goes wrong during our planned procedure and everything goes off at once.
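
For reference, the save / bulk-disable / re-enable idea described above could be sketched roughly like this. It is only a sketch under assumptions: the backup path and the function names are made up for illustration, the parser assumes the usual section layout of resources.cfg (a "vm: 100" or "ct: 101" header followed by indented properties, with a missing state line meaning "started"), and it parks resources in the "ignored" state rather than "disabled", since "disabled" would also stop the guests. DRY_RUN=1 (the default here) only prints the ha-manager commands.

```shell
#!/usr/bin/env bash
# Sketch: remember each HA resource's requested state, park it, restore it later.
HA_CFG="${HA_CFG:-/etc/pve/ha/resources.cfg}"
BACKUP="${BACKUP:-/root/ha-states.backup}"   # hypothetical backup location
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

list_ha_states() {
    # Print "<sid> <state>" per resource, e.g. "vm:100 started".
    # A resource without an explicit state line is treated as "started".
    awk '
        /^(vm|ct):/ {
            if (sid != "") print sid, state
            sid = $0; gsub(/[ \t]/, "", sid)
            state = "started"
            next
        }
        /^[ \t]+state[ \t]/ { state = $2 }
        END { if (sid != "") print sid, state }
    ' "$HA_CFG"
}

ha_set() {
    # ha_set <sid> <state>
    if [ "$DRY_RUN" = "1" ]; then
        echo "ha-manager set $1 --state $2"
    else
        ha-manager set "$1" --state "$2"
    fi
}

ha_park() {
    # Save the current states, then tell the HA stack to leave every resource alone.
    local sid state
    list_ha_states > "$BACKUP"
    while read -r sid state; do ha_set "$sid" ignored; done < "$BACKUP"
}

ha_restore() {
    # Put every resource back into the state recorded by ha_park.
    local sid state
    while read -r sid state; do ha_set "$sid" "$state"; done < "$BACKUP"
}
```

Source the file, run ha_park before the maintenance and ha_restore afterwards; check the dry-run output first and only then set DRY_RUN=0.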

spirit · May 27, 2025

kluvi said:
Is there some simple way to temporarily disable HA migrations,... ?

mv /etc/pve/ha/resources.cfg /tmp/

then move the file back again afterwards

Note that you also need to close the watchdog to avoid fencing; currently the only way to do that is to

1) stop pve-ha-lrm on all nodes, node by node
2) stop pve-ha-crm on all nodes, node by node

then do the reverse once you have finished your upgrade
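
The two steps above can be sketched as a small helper. This is a sketch, not a tested procedure: the node names, root SSH access and the function names are assumptions, and "the reverse" is interpreted here as starting pve-ha-crm everywhere first and pve-ha-lrm afterwards. With DRY_RUN=1 (the default here) it only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch: stop the HA services cluster-wide before maintenance, start them after.
NODES="${NODES:-pve01 pve02 pve03}"   # hypothetical node names, adjust to your cluster
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands

run_on() {
    # run_on <node> <command...>
    local node="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@${node} $*"
    else
        ssh "root@${node}" "$@"
    fi
}

ha_stop() {
    # Step 1: LRM on every node, one node at a time; step 2: CRM on every node.
    local node
    for node in $NODES; do run_on "$node" systemctl stop pve-ha-lrm; done
    for node in $NODES; do run_on "$node" systemctl stop pve-ha-crm; done
}

ha_start() {
    # Reverse order after the maintenance: CRM first, then LRM.
    local node
    for node in $NODES; do run_on "$node" systemctl start pve-ha-crm; done
    for node in $NODES; do run_on "$node" systemctl start pve-ha-lrm; done
}

# Run as "script.sh stop" or "script.sh start"; with no argument nothing happens.
if [ "$#" -gt 0 ]; then
    case "$1" in
        stop)  ha_stop ;;
        start) ha_start ;;
        *)     echo "usage: $0 stop|start" >&2; exit 1 ;;
    esac
fi
```

Do this while all nodes are still reachable, i.e. before anything is unplugged, and review the dry-run output before setting DRY_RUN=0.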

(BTW, HA with only 2 datacenters is really not recommended: if you have a split brain / fiber cut between the 2 DCs, or if your main datacenter is down, your 5 nodes on the second site will be fenced and reboot, and HA is not able to automatically fail over VMs to the second site.)
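
To illustrate why the 5-node side gets fenced: a partition is quorate only if it holds a strict majority of the votes. A minimal sketch of that arithmetic, assuming the default of one vote per node (the function names are made up for illustration):

```shell
# Sketch: votes needed for quorum, assuming one vote per node.
quorum_needed() {
    # quorum_needed <total_nodes>: strict majority of the expected votes
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # has_quorum <total_nodes> <reachable_nodes>
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

total=25
for side in 5 20; do
    if has_quorum "$total" "$side"; then
        echo "$side of $total nodes: quorate"
    else
        echo "$side of $total nodes: NOT quorate"
    fi
done
```

With 25 nodes the majority is 13, so the 5-node site can never be quorate on its own. And if the 20 nodes all hang off a single switch, then while that switch is being swapped they presumably cannot see each other either, so no partition reaches 13 votes during the maintenance.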

kluvi · Jun 4, 2025

Thanks again... today we performed the "operation" and it worked well. At one step the cluster started behaving weirdly (I will post a new thread about this), but disabling HA on the VMs and stopping LRM+CRM saved our cluster from disaster.

LnxBil · Jun 4, 2025

Great that it worked.

Do I understand correctly, that you just had one switch instead of the recommended two?

kluvi · Jun 6, 2025

Yes, we had only one switch... it was just a temporary solution before we move to the final HA pair of 100G switches.

btw for reference, here is the thread with another problem we found during migration
