How to replace the network hardware in a running cluster?

proxwolfe

Well-Known Member
Jun 20, 2020
Hi,

I have a small home lab cluster with 3 PVE nodes, each of which also serves as a Ceph node. Cluster sync and Ceph each have their own dedicated 10GbE network (and there is another dedicated 10GbE network to a PBS, which sits outside the cluster).

Now I want to upgrade the network to InfiniBand. For this, I need to replace the network cards in the nodes (and the switch and cables, of course). There are not enough slots to have both cards installed at the same time.

My question is: How can I replace the network without disrupting the cluster? Even if everything works on the first try (which I don't expect), there will be a period during which the nodes cannot see each other and could "panic"...

Is there a way to suspend cluster operations for the duration of the hardware replacement?

Or what is the best practice in my case?

Thanks!
 
So, there is no way to do this?

Do I need to tear down everything and create a new cluster?
 
"My question is: How can I replace the network without disrupting the cluster?"
I have never tried this and I have no InfiniBand hardware, so I have zero experience with what you describe. That's why I hesitated to write anything.

With a three-node cluster I would try to
  1. tell Ceph not to mark OSDs out or re-balance (set the "noout" flag, and optionally "norebalance"); Ceph will be "degraded" during this procedure
  2. shut down Node A
  3. swap the network card
  4. start that node and get it working with the cluster over the "old" established copper network. This might require adjustments to the network configuration and/or VLANs. In any case, this is a temporary setup.
  5. clear the Ceph flags again and let Ceph recover/re-balance
  6. only if this works: repeat for Nodes B and C
As I picture it, you now have a "down-configured" but working cluster on copper, plus an InfiniBand setup that is installed but not yet in use, all without any interruption so far.
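
For what it's worth, the six steps above could be written down as a small Python sketch. It only builds and prints a per-node checklist and does not run anything. `ceph osd set noout` / `ceph osd set norebalance`, `ceph health` and `pvecm status` are existing commands, but the helper name `maintenance_plan`, the node names, and the choice to set both flags are my assumptions, and I have not tested this on a real cluster.

```python
# Ceph OSD flags: "noout" keeps down OSDs from being marked out,
# "norebalance" suppresses rebalancing while a node is offline.
# Setting both is an assumption; "noout" alone may be enough.
MAINTENANCE_FLAGS = ["noout", "norebalance"]


def maintenance_plan(node):
    """Return the ordered checklist for swapping the card in one node.

    Lines starting with '#' are manual steps; the rest are commands
    to type on any remaining cluster node.
    """
    steps = [f"ceph osd set {flag}" for flag in MAINTENANCE_FLAGS]
    steps.append(
        f"# manual: shut down {node}, swap the card, "
        "boot it on the temporary copper network"
    )
    # Clear the flags in reverse order once the node has rejoined.
    steps += [f"ceph osd unset {flag}" for flag in reversed(MAINTENANCE_FLAGS)]
    # Check cluster state before touching the next node.
    steps.append("ceph health")
    steps.append("pvecm status")
    return steps


if __name__ == "__main__":
    # Hypothetical node names; one node at a time, never in parallel.
    for node in ["nodeA", "nodeB", "nodeC"]:
        print(f"--- {node} ---")
        for step in maintenance_plan(node):
            print(step)
```

The point of keeping it as a printed checklist is that the hardware swap in the middle is manual anyway, and you only want to move on to the next node once `ceph health` reports the cluster as healthy again.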

Now you can activate InfiniBand, test it, and switch connectivity "in one go" with as short a downtime as possible. I have no idea how to achieve this, sorry.

Good luck!
 
