So... this has escalated quickly from "I have a cluster in production that I need to move to a new switch" to "I have a cluster with Ceph, with separate networks for quorum, Ceph Public and Ceph Cluster, of which only the Ceph Cluster network of all servers in the PVE cluster will move to another switch, and the Ceph OSDs are only on a subset of the hosts in the PVE cluster" (if I understand your posts correctly).
Sorry, but it's very hard and time-consuming to provide accurate answers in this forum if I don't have all the information, especially for a critical change like this.
Anyway, my first recommendation still applies:
Network-wise, this is as simple as interconnecting both switch stacks so that devices on the old MLAG switches can see those already moved to the new MLAG switches, using the same VLANs.
So when you move a server to the new switch, it will still see the others on the old switch.
Which one is used to keep the Ceph cluster running?
Ceph Public is used for monitor quorum and client access to OSDs, CephFS, etc. Ceph Cluster is used only for OSD replication, recovery and backfill. Depending on your pool configuration, a write must reach at least n OSDs before the I/O completes. So you need both networks working for Ceph to work.
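To illustrate the "at least n OSDs" point, here is a minimal Python sketch. It assumes that n corresponds to the pool's min_size setting (that is my reading, so check your own pools with `ceph osd pool get <pool> min_size`). The function name and the size=3 / min_size=2 values are made up for the example.

```python
def pg_accepts_io(reachable_replicas: int, min_size: int) -> bool:
    """Return True if a PG still has enough reachable replicas to serve I/O.

    Assumption: the 'n' mentioned above is the pool's min_size.
    """
    return reachable_replicas >= min_size


# Hypothetical replicated pool with size=3 and min_size=2.
for reachable in (3, 2, 1):
    state = "I/O continues" if pg_accepts_io(reachable, min_size=2) else "I/O blocks"
    print(f"{reachable} replica(s) reachable: {state}")
```

With those example values, one host can be offline at a time, but a second simultaneous failure would block I/O on the affected PGs.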
How much time do I have when moving the nodes (one at a time) to the new switch before a node realizes that it has been separated from the other PVE nodes?
Basically zero. Once a node with OSDs loses its connection to the cluster network, it can no longer create replicas for the primary PGs it holds, so I/O for those PGs will block. Make sure to mark those OSDs down before moving that host. If you interconnect the old and new switches, the operation will be as fast as moving the cable and marking the OSDs up again. Set the noout flag so Ceph won't rebalance if the OSDs stay down for more than the default 10 minutes.
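The per-host sequence could be sketched like this in Python. It is only a dry-run helper that prints the commands in order. It assumes that "marking an OSD down/up" is done by stopping and starting its `ceph-osd@<id>` systemd unit on the host being moved (an OSD whose daemon keeps running would just report itself up again), and the OSD ids 3 and 4 are hypothetical.

```python
import shlex
import subprocess


def plan_host_move(osd_ids):
    """Build the ordered command lists for moving one OSD host.

    Assumption: OSDs are taken down/up via their ceph-osd@<id> systemd
    units, run on the host that is being moved.
    """
    before = [["ceph", "osd", "set", "noout"]]
    before += [["systemctl", "stop", f"ceph-osd@{i}"] for i in osd_ids]
    after = [["systemctl", "start", f"ceph-osd@{i}"] for i in osd_ids]
    after += [["ceph", "-s"]]  # check health manually before cleaning up
    cleanup = [["ceph", "osd", "unset", "noout"]]
    return before, after, cleanup


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    before, after, cleanup = plan_host_move([3, 4])  # hypothetical OSD ids
    run(before)
    print("# ... move the Ceph Cluster network cables to the new switch ...")
    run(after)
    print("# ... wait until all PGs are active+clean, then:")
    run(cleanup)
```

I kept the noout cleanup as a separate step on purpose: it should only be unset once `ceph -s` shows the cluster healthy again, ideally after the last host has been moved.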
Also, I should disable HA for the VMs running on the PVE node being moved to minimize the chance of them being started on another node. Is there something like a maintenance mode that would disable that?
You must leave that node empty, just in case. Going by your posts, I don't expect HA to fence the node, given that the PVE network (corosync quorum) will not be moved, but the safest thing to do is not to have any VMs on the server you are working on.
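As a rough sketch of how emptying the node could look, the Python helper below only prints the commands. It assumes a PVE version whose `ha-manager` supports node maintenance mode (recent releases do, as far as I know, but please check `man ha-manager` on your version), and the node name `pve3` is hypothetical.

```python
import shlex


def maintenance_commands(node: str, enable: bool):
    """Commands to toggle HA maintenance mode for a node and verify it is empty.

    Assumption: ha-manager on this PVE version provides
    'crm-command node-maintenance'.
    """
    action = "enable" if enable else "disable"
    cmds = [["ha-manager", "crm-command", "node-maintenance", action, node]]
    if enable:
        # Run these on the node itself to confirm no guests are left on it.
        cmds += [["ha-manager", "status"], ["qm", "list"], ["pct", "list"]]
    return cmds


for cmd in maintenance_commands("pve3", enable=True):  # hypothetical node name
    print(shlex.join(cmd))
```

Guests that are not managed by HA would still have to be migrated or shut down by hand before you touch the cables.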