Optimize maintenance process?

nsc

Renowned Member
Jul 21, 2010
Hi all,

I have a Proxmox 8.4 cluster with 9 servers and I'm wondering if any of you have optimised the update/restart process.

I come from a VMware environment and was a big fan of Maintenance Mode + DRS + Update Manager.

Now I've had to rewrite bash scripts to do the same thing, but it's still a bit too manual, and normal mode was far too aggressive on my cluster.

How do you do it?

Thanks

 
We have a Proxmox 8.4 cluster with 5 nodes, about 80 VMs/LXCs, and 1 NFS file server.
Start the updates on all nodes in parallel, and on the file server run "dnf upgrade -y"; if anything came in and there's a new kernel, do "sync; exportfs -uav; sync; reboot" ... wait until it's back (~3 min). Once it's back, put node 1 into maintenance mode; when everything has auto-migrated, "sync; reboot" ... wait until it's back, turn maintenance mode off, enable it on node 2, and so on until all 5 nodes are done. Maintenance done in 30 min.
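A minimal sketch of that rolling loop, using the HA maintenance-mode commands that come up later in this thread. The node names are examples, and DRY_RUN=1 (the default here) only prints each command instead of running it:

```shell
# Rolling node maintenance, as described above. Node names are examples.
# DRY_RUN=1 (the default here) only prints each command instead of running it.
set -u
DRY_RUN="${DRY_RUN:-1}"
NODES=(pve1 pve2 pve3 pve4 pve5)

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

for node in "${NODES[@]}"; do
    # Maintenance mode: HA migrates the node's guests away.
    run ha-manager crm-command node-maintenance enable "$node"
    # ...wait until the node is empty, then reboot it...
    run ssh "root@$node" "sync; reboot"
    # ...wait until it is back in the cluster, then release it...
    run ha-manager crm-command node-maintenance disable "$node"
done
```

The missing piece in this sketch is the "wait until" logic between the steps, which is where most of the scripting effort goes.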
 
I use Ansible. The process runs automatically at 3 am every morning. You could also use the Debian "unattended-upgrades" package.
 
So it's safe to "dist-upgrade" and then reboot? Currently, I completely free up the node with migrations or VM shutdowns, and only then launch the dist-upgrade.
 
First of all, automatically upgrading your Proxmox hosts is probably not the best idea for several reasons. However, if you decide to do it anyway, there's a package called unattended-upgrades in Debian that can do it for you. There are a few important things you should be aware of, though:

1. By default, unattended-upgrades only installs security updates (similar to apt upgrade), not new features (which you'd get with apt full-upgrade or apt dist-upgrade). On Proxmox, however, you should always use full-upgrade / dist-upgrade. So, if you still want to use unattended-upgrades, make sure to configure it accordingly: https://wiki.debian.org/PeriodicUpdates#Configure_unattended-upgrades
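For reference, an excerpt of what that configuration could look like. The "origin=Proxmox" string is an assumption about how the Proxmox repository labels itself; verify the real values with `apt-cache policy` on your host before relying on this:

```
// /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Origins-Pattern {
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
        "origin=Proxmox";    // assumed origin label, check `apt-cache policy`
};
// clean up packages that a dist-upgrade would autoremove
Unattended-Upgrade::Remove-Unused-Dependencies "true";
```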

2. While services on Debian/Ubuntu systems generally should restart automatically when they themselves are upgraded, they don’t automatically restart when only one of their dependencies is upgraded. This means a service may continue running with outdated (and potentially vulnerable) code until you manually restart it or reboot the host. This can be avoided by installing needrestart: (https://manpages.ubuntu.com/manpages/focal/man1/needrestart.1.html), which detects which services require a restart after updates and can be configured to restart them automatically.
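For example, per the needrestart man page (guarded so it is harmless on a machine where the tool is absent):

```shell
# Sketch: check (and optionally restart) services after an upgrade.
# -b = machine-readable batch output, -r l = list only, -r a = auto-restart.
if command -v needrestart >/dev/null 2>&1; then
    needrestart -b -r l || true   # report services running outdated libraries
    # To restart them non-interactively (needs root), you would run:
    #   needrestart -r a
else
    echo "needrestart not installed; skipping"
fi
RESULT=checked
```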

So it's safe to "dist-upgrade" and then reboot? Currently, I completely free up the node with migrations or VM shutdowns, and only then launch the dist-upgrade.

If the qemu-guest-agent is installed on all your VMs, then yes, rebooting while the VMs are running is generally safe. In that case, all VMs will receive a graceful shutdown command from the host before the reboot. If some VMs don’t have the guest agent installed, it’s better to shut them down manually before rebooting the host.
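A quick way to see which running guests would actually get that graceful shutdown is to ping the guest agent on each of them. A sketch; the awk parsing assumes the usual `qm list` column layout (VMID in column 1, status in column 3):

```shell
# Print the VMIDs of running guests from "qm list" output
# (assumes VMID in column 1 and STATUS in column 3).
running_vmids() { awk 'NR > 1 && $3 == "running" { print $1 }'; }

if command -v qm >/dev/null 2>&1; then
    for vmid in $(qm list | running_vmids); do
        if qm agent "$vmid" ping >/dev/null 2>&1; then
            echo "VM $vmid: guest agent answers, graceful shutdown will work"
        else
            echo "VM $vmid: no guest agent, shut it down manually first"
        fi
    done
fi
```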

Also, when using needrestart (see link above), a reboot is only required if a kernel update has been installed.

My personal approach:

I prefer to update my Proxmox hosts manually and interactively. Here’s how I usually do it:

Code:
apt update && apt dist-upgrade

After the upgrade completes, needrestart tells me which services need to be restarted, and I let it restart them. It also tells me if a reboot is required. If so, I shut down my TrueNAS Core VM (yeah, I still use that) since it doesn't have the guest agent installed.

Then I issue the reboot command, and Proxmox gracefully shuts down all other running VMs and performs the reboot.
 
As I wrote, I made two scripts to do this:

- maintenance start: live-migrate the VMs to other nodes; local-only VMs are shut down.
- then I connect and run "apt-get update && apt-get upgrade -y && apt-get dist-upgrade -y && apt-get autoremove -y"
- then I reboot and wait with a "ping".
- when the node is back, I check that it is available in the cluster.
- finally, I run another script to bring the VMs back and start the local ones.

I'm going to automate things a bit more so that everything runs on its own.
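The "reboot and wait with a ping" step from those scripts could look like this sketch. probe() is factored out so the reachability check can be swapped; as written it does one ICMP ping with a 2-second timeout:

```shell
# Sketch of the "reboot and wait" step. probe() is factored out so the
# reachability check can be swapped (here: one ICMP ping, 2 s timeout).
probe() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }

# Wait until $1 answers, polling every $3 seconds, giving up after $2 seconds.
wait_for_host() {
    local host=$1 timeout=${2:-600} interval=${3:-5} waited=0
    until probe "$host"; do
        sleep "$interval"
        waited=$((waited + interval))
        [ "$waited" -ge "$timeout" ] && return 1
    done
}

# After the host answers, confirm it rejoined the cluster before migrating
# guests back, e.g.:  wait_for_host pve2 && pvecm nodes | grep -q pve2
```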
 
As I wrote, I made two scripts to do this:

- maintenance start: live-migrate the VMs to other nodes; local-only VMs are shut down.
To be honest, in that case, I don’t really understand your question. Assuming that all the VMs have been migrated away and none are running on the host, why wouldn’t it be safe to reboot the host after a dist-upgrade?

However, I can see other potential issues with your approach, namely if something goes wrong in your chain of scripted events. In that case, you’d first need to figure out where it got stuck and what state your cluster is in the next morning. If you perform the process interactively, on the other hand, you can immediately see when something goes wrong and take corrective action right away.

So yes, it can be done. Whether you should do it mainly depends on how sophisticated your scripts are: how well they handle potential errors, and whether they properly inform you where and how something went wrong, if anything does.
 
I'm having trouble explaining myself. In my case the script isn't automatic; it's launched manually by a human ;-)

It's just a script that we launch, which does the job and stops at the first problem, allowing the person in charge of the update to intervene.

However, it saves time, because migrating everything one way and then back again is quite lengthy, not to mention the reboot.
 
Manually started updates are good: you can stop if something goes off the rails.
Migration is only lengthy if you're not using shared storage; otherwise it's just a couple of minutes, even for lots of machines.
 
As I wrote, I made two scripts to do this:

- maintenance start: live-migrate the VMs to other nodes; local-only VMs are shut down.
- then I connect and run "apt-get update && apt-get upgrade -y && apt-get dist-upgrade -y && apt-get autoremove -y"
- then I reboot and wait with a "ping".
- when the node is back, I check that it is available in the cluster.
- finally, I run another script to bring the VMs back and start the local ones.

I'm going to automate things a bit more so that everything runs on its own.
If you configured high availability for your VMs, you can save some steps. First: if you enable maintenance mode, the VMs under HA will migrate to other nodes, provided you configured HA accordingly. So you don't need to do a manual migration before activating maintenance.
But for a planned reboot, not even that is needed: you can configure a shutdown policy:
Shutdown Policy
Below you will find a description of the different HA policies for a node shutdown. Currently Conditional is the default due to backward compatibility. Some users may find that Migrate behaves more as expected.

The shutdown policy can be configured in the Web UI (Datacenter → Options → HA Settings), or directly in datacenter.cfg:

ha: shutdown_policy=<value>
Migrate
Once the Local Resource manager (LRM) gets a shutdown request and this policy is enabled, it will mark itself as unavailable for the current HA manager. This triggers a migration of all HA Services currently located on this node. The LRM will try to delay the shutdown process, until all running services get moved away. But, this expects that the running services can be migrated to another node. In other words, the service must not be locally bound, for example by using hardware passthrough. For example, strict node affinity rules tell the HA Manager that the service cannot run outside of the chosen set of nodes. If all of these nodes are unavailable, the shutdown will hang until you manually intervene. Once the shut down node comes back online again, the previously displaced services will be moved back, if they were not already manually migrated in-between.

https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_node_maintenance

There are some caveats regarding fencing in case of quorum loss and the affinity rules though, you find the details in the remainder of the HA chapter.

IMHO the shutdown policy, together with suitably configured affinity rules and maintenance mode, should mostly remove the need for such a script.
 
> If you enable maintenance mode, the VMs under HA will migrate to other nodes, provided you configured HA accordingly

For clarity, the VMs automatically move to other nodes. :)

Code:
ha-manager crm-command node-maintenance enable nodename
(wait a bit, update)
ha-manager crm-command node-maintenance disable nodename
 
> If you enable maintenance mode, the VMs under HA will migrate to other nodes, provided you configured HA accordingly

For clarity, the VMs automatically move to other nodes. :)

Thanks, I reworded my ramblings. I wanted to point out that even maintenance mode is not needed if you configure the shutdown policy correctly. Then, in case of a reboot, the ha-manager will take care of everything for you ;)
 
By default we tried the "automatic move on reboot", but it was too hard on the cluster: we had some packet loss and some VMs hung.

We didn't take the time to investigate whether there are settings that would make this live migration less aggressive.

It looks like under "Datacenter → Options" we could try to tweak the Bandwidth Limits?
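Worth a try: those bandwidth limits end up in /etc/pve/datacenter.cfg (the same file as the shutdown policy quoted above). The values are in KiB/s, so the number below is just an example (~100 MiB/s):

```
bwlimit: migration=102400
```

If I remember the option correctly, there is also a per-job override on the CLI, e.g. qm migrate <vmid> <target> --bwlimit <KiB/s>, if you only want to throttle a one-off bulk migration.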