Node maintenance mode does not persist on every reboot

Oct 8, 2021
I just updated one of our clusters to 7.4 and was playing around with node maintenance mode. The documentation states that maintenance mode persists across reboots and is only deactivated through the ha-manager command. However, during my testing it occurred several times that after rebooting the machine, the LRM was started again and the HA manager started migrating machines. I can confirm that I always waited for the transition into maintenance mode to complete before rebooting, by looking at the logs for the corresponding message `pve-ha-lrm[PID]: watchdog closed (disabled)`.
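For reference, this is roughly how I follow the transition on the node before rebooting (just my own check, not an official procedure):

Code:
# follow the LRM log and wait for the "watchdog closed (disabled)" message
journalctl -u pve-ha-lrm -f
# cross-check how the HA stack currently sees the node
ha-manager status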

Am I doing something wrong here or is this a bug?
 
Can you provide the `pve-ha-lrm` and `pve-ha-crm` logs from before the reboot and from after it, when the LRM started again?
Code:
journalctl -u pve-ha-lrm -u pve-ha-crm
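If the full log is very long, grabbing the two boots separately should also be fine, something like:

Code:
# previous boot, i.e. before the reboot that lost the maintenance state
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm
# current boot, i.e. after the LRM started again
journalctl -b 0 -u pve-ha-lrm -u pve-ha-crm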
 
Thank you for the logs!

In my quick tests I couldn't reproduce it here.
Would you mind sharing the exact steps which lead to this issue?
 
Well, there's nothing special about it. I set one of the cluster nodes to maintenance mode by executing `ha-manager crm-command node-maintenance enable <node>`, wait for the VM migration and the transition to finish, and then reboot. I haven't been able to find a way to force this behaviour yet; it just happens now and then without any prior indication.
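Spelled out, the whole sequence is essentially just this (placeholder node name):

Code:
# put the node into maintenance mode
ha-manager crm-command node-maintenance enable <node>
# wait until all HA guests are migrated away and the LRM logs
# "watchdog closed (disabled)", then reboot the node
reboot
# maintenance mode is supposed to stay active after the reboot until
# it is disabled explicitly
ha-manager crm-command node-maintenance disable <node>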

Edit: Since I first noticed this issue after updating one of our production clusters, I checked our DEV environment as well and was able to reproduce it on the first try. If I can provide any further information or try out certain things, please let me know.
 
Same issue on our side. Maybe this is related to a combination of cluster options?

"HA Settings" is set to Default (conditional) our datacenter.cfg looks like this:
crs: ha-rebalance-on-start=1
keyboard: de
migration: secure,network=192.168.XXX.XXX/24

The commands `ha-manager crm-command node-maintenance enable <node>` (before we reboot, we wait until all guests are migrated to the other hosts) and `reboot` always lead to the following migration errors:

Code:
task started by HA resource agent
System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).

2023-04-12 08:08:41 use dedicated network address for sending migration traffic (192.168.XXX.XXX)
2023-04-12 08:08:41 starting migration of VM XXX to node 'node' (192.168.XXX.XXX)
2023-04-12 08:08:41 starting VM XXX on remote node 'node'
2023-04-12 08:08:42 [node] System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).
2023-04-12 08:08:42 [node]
2023-04-12 08:08:42 [node] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2023-04-12 08:08:42 ERROR: online migrate failure - remote command failed with exit code 255
2023-04-12 08:08:42 aborting phase 2 - cleanup resources
2023-04-12 08:08:42 migrate_cancel
2023-04-12 08:08:43 ERROR: migration finished with problems (duration 00:00:02)
TASK ERROR: migration problems

TASK ERROR: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

When the host comes up after the reboot, all guests that should run on that host by priority start migrating back to it. I hope this information helps to solve this.

EDIT:
I tried this with all three hosts in the cluster. On the last host I saw that the HA manager migrates one guest (which is restricted to that host by HA group/priority) to it, then maintenance mode kicks in and the guest is migrated back to another host, and I have to disable maintenance on that host manually, whereas the other hosts disable maintenance after the reboot automatically.
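For context, the group that pins this guest looks roughly like this in /etc/pve/ha/groups.cfg (names changed):

Code:
group: pinned-to-host3
        nodes host3:2,host1:1,host2:1
        restricted 1
        nofailback 0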
 
Thanks for the additional information!
I'll try to reproduce it here with CRS and a dedicated migration network.

@lodex how does your datacenter.cfg look?
 
Like this:
Code:
console: html5
crs: ha=static
ha: shutdown_policy=migrate
keyboard: de
max_workers: 10
next-id: lower=3000,upper=3500
notify: package-updates=never
u2f:
 
We were able to reproduce this issue. In your cases, were the nodes in maintenance mode the `Master`?
In that case, the maintenance state is reset after a reboot.

I did see the `DBus` error as well now; I'll see if and how I can reproduce it reliably.
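To check that on your side: `ha-manager status` shows, among other things, which node currently holds the master role, so you can tell whether the node you put into maintenance was the master at the time:

Code:
# look for the "master <node>" line in the output
ha-manager status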
 
A new version of `pve-ha-manager` (3.6.1) is available in the pve-no-subscription repository.
It would be great if you could test it and report back if the issues are fixed.
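To double-check which version is actually installed after the upgrade, something like this should do:

Code:
pveversion -v | grep pve-ha-manager
# or
dpkg -l pve-ha-manager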
 
Hi mira,

We have no cluster without a subscription at hand. Is there an easy way to install this package from the pve-no-subscription repository on a cluster with subscriptions, without risking installing any other packages?
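The only way I can think of would be to add the no-subscription repository temporarily and only pull this one package, roughly like this (repo line is for our PVE 7.x / Debian bullseye nodes, treat it as a sketch):

Code:
# temporarily add the pve-no-subscription repository
echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-no-subscription-temp.list
apt update
# install only the updated HA manager package
apt install pve-ha-manager
# remove the temporary repository again
rm /etc/apt/sources.list.d/pve-no-subscription-temp.list
apt update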
 
After installing pve-ha-manager (3.6.1) and doing some rudimentary testing, I can confirm that every node remained in maintenance mode after reboot. The issue seems to be solved.
Is there anything specific to look for in the logs to be certain?
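For now I'm just checking the LRM log on each node after the reboot for maintenance-related state changes, roughly like this:

Code:
journalctl -b 0 -u pve-ha-lrm | grep -i maintenance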
 
