Node maintenance mode does not persist on every reboot

Oct 8, 2021
I just updated one of our clusters to 7.4 and was playing around with node maintenance mode. The documentation states that maintenance mode persists across reboots and is only deactivated through the ha-manager command. However, during my testing it occurred several times that after rebooting the machine, the LRM was started again and the HA manager started migrating machines. I can confirm that I always waited for the transition into maintenance mode to complete before rebooting, by looking at the logs for the corresponding message `pve-ha-lrm[PID]: watchdog closed (disabled)`.
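For reference, this is roughly how I follow the transition on the node before rebooting (just my own check, not an official procedure):

Code:
# follow the LRM log and wait for the "watchdog closed (disabled)" message
journalctl -u pve-ha-lrm -f
# cross-check how the HA stack currently sees the node
ha-manager status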

Am I doing something wrong here or is this a bug?
 
Can you provide the `pve-ha-lrm` and `pve-ha-crm` logs from before the reboot and from after it, when the LRM started again?
Code:
journalctl -u pve-ha-lrm -u pve-ha-crm
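If the full log is very long, grabbing the two boots separately should also be fine, something like:

Code:
# previous boot, i.e. before the reboot that lost the maintenance state
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm
# current boot, i.e. after the LRM started again
journalctl -b 0 -u pve-ha-lrm -u pve-ha-crm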
 
Thank you for the logs!

In my quick tests I couldn't reproduce it here.
Would you mind sharing the exact steps which lead to this issue?
 
Well, there's nothing special about it. I set one of the cluster nodes to maintenance mode by executing `ha-manager crm-command node-maintenance enable <node>`, wait for the VM migration and the transition to finish, and then reboot. I haven't been able to find a way to force this behaviour yet; it just happens now and then without any prior indication.
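Spelled out, the whole sequence is essentially just this (placeholder node name):

Code:
# put the node into maintenance mode
ha-manager crm-command node-maintenance enable <node>
# wait until all HA guests are migrated away and the LRM logs
# "watchdog closed (disabled)", then reboot the node
reboot
# maintenance mode is supposed to stay active after the reboot until
# it is disabled explicitly
ha-manager crm-command node-maintenance disable <node>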

Edit: Since I first noticed this issue after updating one of our production clusters, I checked our DEV environment as well and was able to reproduce it on the first try. If I can provide any further information or try out certain things, please let me know.
 
Same issue on our side. Maybe this is related to a combination of cluster options?

"HA Settings" is set to Default (conditional) our datacenter.cfg looks like this:
crs: ha-rebalance-on-start=1
keyboard: de
migration: secure,network=192.168.XXX.XXX/24

The commands `ha-manager crm-command node-maintenance enable <node>` (before we reboot, we wait until all guests are migrated to the other hosts) and `reboot` always lead to the following migration errors:

Code:
task started by HA resource agent
System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).

2023-04-12 08:08:41 use dedicated network address for sending migration traffic (192.168.XXX.XXX)
2023-04-12 08:08:41 starting migration of VM XXX to node 'node' (192.168.XXX.XXX)
2023-04-12 08:08:41 starting VM XXX on remote node 'node'
2023-04-12 08:08:42 [node] System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8).
2023-04-12 08:08:42 [node]
2023-04-12 08:08:42 [node] org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2023-04-12 08:08:42 ERROR: online migrate failure - remote command failed with exit code 255
2023-04-12 08:08:42 aborting phase 2 - cleanup resources
2023-04-12 08:08:42 migrate_cancel
2023-04-12 08:08:43 ERROR: migration finished with problems (duration 00:00:02)
TASK ERROR: migration problems

TASK ERROR: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

When the host comes up after the reboot, all guests that should run on that host by priority start migrating back to it. I hope this information helps to solve this.

EDIT:
I tried this with all three hosts in the cluster. On the last host I saw that the HA manager migrates one guest (which is restricted to that host by HA group/priority) to it, then maintenance mode kicks in and the guest is migrated back to another host, and I have to disable maintenance on that host manually, whereas the other hosts disable maintenance after the reboot automatically.
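For context, the group that pins this guest looks roughly like this in /etc/pve/ha/groups.cfg (names changed):

Code:
group: pinned-to-host3
        nodes host3:2,host1:1,host2:1
        restricted 1
        nofailback 0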
 
Thanks for the additional information!
I'll try to reproduce it here with CRS and a dedicated migration network.

@lodex how does your datacenter.cfg look?
 
Like this:
Code:
console: html5
crs: ha=static
ha: shutdown_policy=migrate
keyboard: de
max_workers: 10
next-id: lower=3000,upper=3500
notify: package-updates=never
u2f:
 
We were able to reproduce this issue. In your cases, were the nodes in maintenance mode the `Master`?
In that case, the maintenance state is reset after a reboot.

I did see the `DBus` error as well now; I'll see if and how I can reproduce it reliably.
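To check that on your side: `ha-manager status` shows, among other things, which node currently holds the master role, so you can tell whether the node you put into maintenance was the master at the time:

Code:
# look for the "master <node>" line in the output
ha-manager status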
 
A new version of `pve-ha-manager` (3.6.1) is available in the pve-no-subscription repository.
It would be great if you could test it and report back if the issues are fixed.
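To double-check which version is actually installed after the upgrade, something like this should do:

Code:
pveversion -v | grep pve-ha-manager
# or
dpkg -l pve-ha-manager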
 
Hi mira,

We have no cluster without a subscription at hand. Is there an easy way to install this package from the pve-no-subscription repository on a cluster with subscriptions, without risking installing any other packages?
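The only way I can think of would be to add the no-subscription repository temporarily and only pull this one package, roughly like this (repo line is for our PVE 7.x / Debian bullseye nodes, treat it as a sketch):

Code:
# temporarily add the pve-no-subscription repository
echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-no-subscription-temp.list
apt update
# install only the updated HA manager package
apt install pve-ha-manager
# remove the temporary repository again
rm /etc/apt/sources.list.d/pve-no-subscription-temp.list
apt update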
 
After installing pve-ha-manager (3.6.1) and doing some rudimentary testing, I can confirm that every node remained in maintenance mode after reboot. The issue seems to be solved.
Is there anything specific to look for in the logs to be certain?
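For now I'm just checking the LRM log on each node after the reboot for maintenance-related state changes, roughly like this:

Code:
journalctl -b 0 -u pve-ha-lrm | grep -i maintenance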
 
