Hello all, I've been running PVE for a few years now, and I feel like I have yet to grasp the PROPER way to patch and reboot nodes.
I have a 3 (or 5) node PVE9 cluster of HP DL380 servers running CEPH. The workload is 99% VMs. I have redundant corosync links over separate switches. I consider the setup to be pretty proper.
Way back when I first started with Proxmox, I would just migrate my VMs over to another node, patch the node I'd just cleared, and repeat the process. It took forever, though: even on 10G networking, migrating everything several times took a while. But I rarely had any of these cluster-wide spontaneous reboots. I would patch, reboot, let CEPH go back to OK, and then do the next node. It was a 3-hour affair.
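For reference, that old per-node routine was roughly this (from memory, with placeholder VM IDs and node names):

```bash
# live-migrate each VM off the node I'm about to patch (101 / pve2 are placeholders)
qm migrate 101 pve2 --online

# patch the now-empty node and reboot it
apt update && apt dist-upgrade
reboot

# once it's back, wait for Ceph to return to HEALTH_OK before touching the next node
ceph -s
```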
Recently I've just been shutting down the VMs, then migrating them off the host being patched, and then doing my maintenance.
Most recently (aka last night) I just shut down the VMs, left them where they were, and then patched. Boom, I get a cluster-wide reboot on the first node being patched. There was nothing else going on... no switch maintenance that would affect corosync, quite literally nothing else. So, ok. I'd forgotten that PVE has a maintenance mode (which I know I should be using more, but there's no GUI for it!), so I let the first host come back up, fired up a few critical VMs, and moved to the second node. I put it in maintenance mode, patched it, rebooted, and BAM, another cluster-wide reboot. This makes no sense, and while I feel like I have a good grasp of PVE, I just feel like patching has always been problematic.
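For what it's worth, the maintenance mode I'm talking about is the CLI-only one; as far as I understand it, it's used like this (node name is a placeholder):

```bash
# put the node into HA maintenance mode before patching (pve2 is a placeholder)
ha-manager crm-command node-maintenance enable pve2

# ... patch and reboot ...

# take it back out of maintenance afterwards
ha-manager crm-command node-maintenance disable pve2
```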
So, what's the RIGHT way to maintain the PVE cluster?