Proper patching procedure for PVE??

Stargazer8402

New Member
Dec 6, 2023
Hello all, I've been running PVE for a few years now, and I feel like I have yet to grasp the PROPER way to patch and reboot nodes.

I have a 3 (or 5) node PVE9 cluster of HP DL380 servers running Ceph. 99% of the guests are VMs. I have redundant corosync links over different switches. I consider the setup to be pretty proper.
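(For reference, link status can be checked per node with corosync-cfgtool, a read-only query:)

Code:
corosync-cfgtool -s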

Way back when I first started with Proxmox, I would just migrate my VMs over to another node, patch the node I had just cleared, and repeat the process. It took forever, though: even on 10G networking, it takes time for everything to migrate several times over. But I rarely had any of these cluster-wide spontaneous reboots. I would patch, reboot, let Ceph go back to HEALTH_OK, and then do the next node. It was a 3 hour affair.
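The cycle was roughly this per node (a sketch with placeholder IDs; I believe setting noout is the usual recommendation to stop Ceph rebalancing during a short reboot):

Code:
ceph osd set noout                  # keep Ceph from rebalancing while the node is down
qm migrate 100 othernode --online   # repeat for each VM on the node
apt update && apt full-upgrade
reboot
# after the node is back up:
ceph -s                             # wait for HEALTH_OK
ceph osd unset noout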

Recently I've just been shutting down the VMs, then migrating them off the host being patched, and then doing my maintenance.

Most recently (aka last night) I just shut down the VMs, left them where they were, and then patched. Boom, I got a cluster-wide reboot on the first node being patched. There was nothing else going on: no switch maintenance that would affect corosync, quite literally nothing else.

So, OK. I forgot that PVE has a maintenance mode (which I know I should be using more, but there is no GUI for it!), so I let the first host come back up, fired up a few critical VMs, and moved to the second node. I put it in maintenance, patched it, rebooted, and BAM, another cluster-wide reboot. This makes no sense, and while I feel like I have a good grasp of PVE, patching has always been problematic for me.
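(Since there's no GUI for it: maintenance mode is toggled from the CLI, assuming PVE 7.3 or later with the HA stack in use; the node name is a placeholder.)

Code:
ha-manager crm-command node-maintenance enable nodename
# ... patch and reboot ...
ha-manager crm-command node-maintenance disable nodename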

So, what's the RIGHT way to maintain the PVE cluster?
 
Hello,

maintenance mode is for HA-managed resources; it has nothing to do with overall cluster health or the reboot behaviour you're describing.
Did you have a look at the corosync.service logs on the non-patched servers for the time right before the reboot occurred?
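For example (adjust the time window to just before the reboot; the timestamps here are placeholders):

Code:
journalctl -u corosync.service --since "2023-12-05 22:00" --until "2023-12-05 23:00"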
 
What's your main time-consuming part? We have around 80 VMs on 10 Gb NFS across 5 nodes; with maintenance mode and VM/LXC HA definitions, everything live-migrates between nodes automatically, and the whole run is done in 30 minutes.
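Note that guests only follow maintenance mode automatically if they are defined as HA resources; a minimal example (the VM ID is a placeholder):

Code:
ha-manager add vm:100 --state started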
 