Optimize maintenance process ?

nsc · Nov 11, 2025

Hi all,

I have a Proxmox 8.4 cluster with 9 servers and I'm wondering if any of you have optimised the update/restart process.

I come from a VMware environment and was a big fan of Maintenance Mode + DRS + Update Manager.

Now I've had to rewrite bash scripts to do the same thing, but it's still a bit too manual, and normal mode was far too aggressive on my cluster.

How do you do it?

Thanks

nsc

waltar · Nov 11, 2025

We have a Proxmox 8.4 cluster with 5 nodes with about 80 vm/lxc and 1 nfs-fileserver.
Start updates on all nodes in parallel and on server "dnf upgrade -y" and if there's some and new kernel do "sync;exportfs -uav;sync;reboot" ... wait until back (3min). When it's back set 1. node in maintenance mode, when all is auto-migrated "sync;reboot" ... wait until back turn maintenance mode off and on to 2. node and so on until all 5 nodes are done. Maintenance done in 30min.

louie1961 · Nov 11, 2025

I use Ansible. The process runs automatically at 3am every morning. You could also use the Debian "unattended upgrades" package

bbgeek17 · Nov 11, 2025

louie1961 said:
The process runs automatically at 3am every morning. You could also use the Debian "unattended upgrades" package

This is not something I'd recommend implementing in production/business environment.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

nsc · Nov 12, 2025

So it's safe to "dist upgrade" then reboot ? Currently, I completely free up the node with migrations or VM shutdowns, and then I launch the dist-upgrade.

proxuser77 · Nov 12, 2025

First of all, automatically upgrading your Proxmox hosts is probably not the best idea for several reasons. However, if you decide to do it anyway, there's a package called unattended-upgrades in Debian that can do it for you. There are a few important things you should be aware of, though:

1. By default, unattended-upgrades only installs security updates (similiar to apt upgrade). not new features (which you’d get with apt full-upgrade or apt dist-upgrade). On Proxmox, however, you should always use full-upgrade / dist-upgrade. So, if you still want to use unattended-upgrades, make sure to configure it accordingly: https://wiki.debian.org/PeriodicUpdates#Configure_unattended-upgrades

2. While services on Debian/Ubuntu systems generally should restart automatically when they themselves are upgraded, they don’t automatically restart when only one of their dependencies is upgraded. This means a service may continue running with outdated (and potentially vulnerable) code until you manually restart it or reboot the host. This can be avoided by installing needrestart: (https://manpages.ubuntu.com/manpages/focal/man1/needrestart.1.html), which detects which services require a restart after updates and can be configured to restart them automatically.

nsc said:
So it's safe to "dist upgrade" then reboot ? Currently, I completely free up the node with migrations or VM shutdowns, and then I launch the dist-upgrade.

If the qemu-guest-agent is installed on all your VMs, then yes, rebooting while the VMs are running is generally safe. In that case, all VMs will receive a graceful shutdown command from the host before the reboot. If some VMs don’t have the guest agent installed, it’s better to shut them down manually before rebooting the host.

Also, when using needrestart (see link above), a reboot is only required if a kernel update has been installed.

My personal approach:

I prefer to update my Proxmox hosts manually and interactively. Here’s how I usually do it:

apt update && apt dist-upgrade

After the upgrade completes, needrestart tells me which services need to be restarted, and I let it restart them. It also tells me if a reboot is required. If so, I shut down my TrueNAS Core VM (yeah, I still use that ) since it doesn’t have the guest agent installed.

Then I issue the reboot command, and Proxmox gracefully shuts down all other running VMs and performs the reboot.

nsc · Nov 12, 2025

As i wrote i make two scripts to do this :

- maintenance start and live migrate VM to other nodes, local VM are shutdown.
- then i connect and i "apt-get update -y && apt-get upgrade -y &&apt-get dist-upgrade -y && apt-get autoremove -y"
- then i reboot and wait with a "ping".
- when node is back i check than it available in cluster
- finaly a run another script to bring back VM and start local VM.

I'm going to automate things a bit more so that everything is automatic.

proxuser77 · Nov 12, 2025

nsc said:
As i wrote i make two scripts to do this :

- maintenance start and live migrate VM to other nodes, local VM are shutdown.

To be honest, in that case, I don’t really understand your question. Assuming that all the VMs have been migrated away and none are running on the host, why wouldn’t it be safe to reboot the host after a dist-upgrade?

However, I can see other potential issues with your approach, namely if something goes wrong in your chain of scripted events. In that case, you’d first need to figure out where it got stuck and what state your cluster is in the next morning. If you perform the process interactively, on the other hand, you can immediately see when something goes wrong and take corrective action right away.

So yes, it can be done. Whether you should do it, mainly depends on how sophisticated your scripts are. Meaning how well they handle potential errors and whether they properly inform you where and how something went wrong, if anything does.

nsc · Nov 12, 2025

I'm having trouble explaining myself. The script isn't automatic in my case, but rather launched manually by a human ;-)

It's just a script that we launch, which does the job and stops at the first problem, allowing the person in charge of the update to intervene.

However, it saves time because the process of migrating in one direction and then back again is quite lengthy. Not to mention the reboot.

waltar · Nov 12, 2025

Manually started updates are good to be able to stop if somethink goes off roads.
Migration is just lengthy if not using shared storage for otherwise it's just couple of few minutes for even lots of machines.

Johannes S · Nov 12, 2025

nsc said:
As i wrote i make two scripts to do this :

- maintenance start and live migrate VM to other nodes, local VM are shutdown.
- then i connect and i "apt-get update -y && apt-get upgrade -y &&apt-get dist-upgrade -y && apt-get autoremove -y"
- then i reboot and wait with a "ping".
- when node is back i check than it available in cluster
- finaly a run another script to bring back VM and start local VM.

I'm going to automate things a bit more so that everything is automatic.

If you configured high-avilability for your VMs you can save some steps. First: If you enable maintenance-mode the VMs in HA will migrate to other nodes, if you configured HA accordingly. So with that you don't need to to a manual migration before activating maintenance.
But for a planned reboot not even that is needed: You can configure a shutdown policy:

Shutdown Policy
Below you will find a description of the different HA policies for a node shutdown. Currently Conditional is the default due to backward compatibility. Some users may find that Migrate behaves more as expected.

The shutdown policy can be configured in the Web UI (Datacenter → Options → HA Settings), or directly in datacenter.cfg:

ha: shutdown_policy=<value>
Migrate
Once the Local Resource manager (LRM) gets a shutdown request and this policy is enabled, it will mark itself as unavailable for the current HA manager. This triggers a migration of all HA Services currently located on this node. The LRM will try to delay the shutdown process, until all running services get moved away. But, this expects that the running services can be migrated to another node. In other words, the service must not be locally bound, for example by using hardware passthrough. For example, strict node affinity rules tell the HA Manager that the service cannot run outside of the chosen set of nodes. If all of these nodes are unavailable, the shutdown will hang until you manually intervene. Once the shut down node comes back online again, the previously displaced services will be moved back, if they were not already manually migrated in-between.

https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_node_maintenance

There are some caveats regarding fencing in case of quorum loss and the affinity rules though, you find the details in the remainder of the HA chapter.

IMHO the shutdown-policy together with fitting configured affinity rules and maintenance mode should reduce the need for such a script mostly.

SteveITS · Nov 12, 2025

> If you enable maintenance-mode the VMs in HA to other nodes if you configured HA

For clarity, the VMs automatically move to other nodes.

Code:

ha-manager crm-command node-maintenance enable nodename
(wait a bit, update)
ha-manager crm-command node-maintenance disable nodename

Johannes S · Nov 12, 2025

SteveITS said:
> If you enable maintenance-mode the VMs in HA to other nodes if you configured HA

For clarity, the VMs automatically move to other nodes.

Thanks, I reworded my ramblings. I wanted to point out, that even the maintenance-mode is not needed if you configure the shutdown policy correctly. Then in case of an reboot the ha-manager will take care to do everything for you

SteveITS · Nov 12, 2025

Eh, not sure that's working actually: https://forum.proxmox.com/threads/i...-fast-before-ha-migration-is-finished.170268/. But yes, in theory that should work.

I'd prefer to move VMs before installing updates but maybe that's just me being conservative.

nsc · Nov 12, 2025

By default we tried the "automatic move on reboot" but it was too hard for the cluster, we had some packet lost and some hang on VM.

We did not take the time to investigate whether it was possible to configure settings to make this ‘live migration’ less aggressive.

Looks like in "Datacenter \ Options" we could try to tweak Bandwith Limits ?

waltar · Nov 12, 2025

Set in ~/.bashrc:
alias mmon="ha-manager crm-command node-maintenance enable"
alias mmoff="ha-manager crm-command node-maintenance disable"
and then do like
mmon <pvenode>

Westikane · Nov 20, 2025

I’ve set up notifications tied to my project management tasks so I can plan reboots and updates ahead of time without surprises. Also added a short bash script that runs checks and dumps key info into our project management board—it’s saved a lot of back-and-forth. Keeping things simple helps, especially when others on the team jump in mid-cycle.

Search

Search

Optimize maintenance process ?

nsc

Renowned Member

waltar

Famous Member

louie1961

Well-Known Member

bbgeek17

Distinguished Member

nsc

Renowned Member

proxuser77

Member

nsc

Renowned Member

proxuser77

Member

nsc

Renowned Member

waltar

Famous Member

Johannes S

Distinguished Member

SteveITS

Active Member

Johannes S

Distinguished Member

SteveITS

Active Member

nsc

Renowned Member

waltar

Famous Member

Westikane

New Member

We value your privacy