cluster issues after upgrading to pve9

Liviu Sas

Hello,

I just upgraded my cluster to Proxmox 9. Most nodes went fine, except two of them that ended up in a very weird state.
Those two nodes hung at "Setting up pve-cluster", and I noticed that /etc/pve was locked, causing any process that tried to access it to hang in a "D" (uninterruptible sleep) state.
The only way to finish the upgrade was to reboot in recovery mode.

After the upgrade finished, all looked good until I rebooted either of those nodes. After the reboot they would come up with /etc/pve stuck again.
This would then cause /etc/pve to become stuck on the other nodes in the cluster as well, sending them into a reboot loop.

The only way to recover these nodes is to boot into recovery mode, run "apt install --reinstall pve-cluster" and press CTRL+D to continue the boot; after that they come up and work as expected.
But if either of these two nodes reboots again, the situation repeats (/etc/pve becomes stuck on all nodes and they enter the reboot loop).
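
In case it helps anyone debugging the same hang: the stuck processes can be spotted with something along these lines (just a sketch, not the exact commands I used):

Code:
# list processes in uninterruptible sleep ("D" state) and what they are blocked on
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
# show which processes are using the /etc/pve mount
# (note: this may itself hang if the mount is completely dead)
fuser -vm /etc/pve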

It looks like something did not get upgraded correctly on those 2 nodes, but I can't figure out what.
I should also mention that pvecm status shows a healthy cluster the whole time.

Cheers,
Liviu
 
Hi!

Could you post some syslogs from these nodes? Is HA set up on them?
 
Hello,

I rebooted one of the nodes yesterday to reproduce the issue.
See the attached syslog:
12:25:52 - first uninterrupted boot, cluster fail
12:28:31 - second uninterrupted boot, cluster fail
12:32:55 - boot in recovery mode, ran "apt install --reinstall pve-cluster pve-firewall pve-manager pve-ha-manager" and continued boot, cluster healthy

Currently I have some HA services but they are all ignored:

Code:
root@pvegiant:~# ha-manager status
quorum OK
master pvenew (active, Mon Oct  6 10:36:23 2025)
lrm proxmoxW540 (idle, Mon Oct  6 10:36:27 2025)
lrm pvebig (idle, Mon Oct  6 10:36:30 2025)
lrm pveduo (idle, Mon Oct  6 10:36:28 2025)
lrm pvegiant (idle, Mon Oct  6 10:36:31 2025)
lrm pvenew (idle, Mon Oct  6 10:36:31 2025)
lrm pveslow (idle, Mon Oct  6 10:36:29 2025)
service ct:100 (pvenew, ignored)
service ct:102 (---, ignored)
service ct:106 (---, ignored)
service ct:107 (pveduo, ignored)
service ct:112 (pveslow, ignored)
service ct:114 (pvebig, ignored)
service ct:116 (---, ignored)
service ct:117 (---, ignored)
service ct:200 (pvenew, ignored)
service vm:120 (---, ignored)

For both failed boots I can see the following concerning logs:
Code:
2025-10-05T12:28:44.807917+13:00 proxmoxW540 pmxcfs[1298]: [quorum] crit: quorum_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810089+13:00 proxmoxW540 pmxcfs[1298]: [quorum] crit: can't initialize service
2025-10-05T12:28:44.810376+13:00 proxmoxW540 pmxcfs[1298]: [confdb] crit: cmap_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810582+13:00 proxmoxW540 pmxcfs[1298]: [confdb] crit: can't initialize service
2025-10-05T12:28:44.810838+13:00 proxmoxW540 pmxcfs[1298]: [dcdb] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810947+13:00 proxmoxW540 pmxcfs[1298]: [dcdb] crit: can't initialize service
2025-10-05T12:28:44.811034+13:00 proxmoxW540 pmxcfs[1298]: [status] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.811101+13:00 proxmoxW540 pmxcfs[1298]: [status] crit: can't initialize service

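(For completeness, the state of corosync and pmxcfs on such a failed boot can be compared with something like the following; only a sketch.)

Code:
# what systemd thinks of the two services after a failed boot
systemctl status corosync pve-cluster
# their logs for the current boot, side by side
journalctl -b -u corosync -u pve-cluster --no-pager | head -n 100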

And it looks like pve-ha-crm gets stuck in status change startup => wait_for_quorum.

But I do not understand why, as I can see corosync starting up and acquiring quorum and looking happy.
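
Quorum can be double-checked directly on the node with something like the following (a sketch); it has looked healthy every time I checked:

Code:
# corosync's own view of quorum on this node
corosync-quorumtool -s
# Proxmox's view of the cluster
pvecm status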
 

Hello,

I did a bit more testing, and I found that the easiest way to start one of those two nodes is to follow these steps:
1. boot in recovery mode
2. systemctl start pve-cluster
3. CTRL+D to continue the boot process
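
From the recovery shell it is also possible to see which units the boot is actually waiting on, before and after starting pve-cluster by hand (a sketch):

Code:
# queued/running systemd jobs, i.e. what the boot is still waiting for
systemctl list-jobs
# start the cluster filesystem manually, then check that the job queue drains
systemctl start pve-cluster
systemctl list-jobs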


So it looks like a race condition at node boot: the cluster service or corosync can take a little longer to start, which blocks the processes that are supposed to start immediately after.

Also worth noting: the nodes that have this issue are both a bit on the slower side (one runs as a VM inside VirtualBox, the other is a NUC with an Intel(R) Celeron(R) CPU N3050).
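
If it really is a startup race on slow hardware, the boot timing should show it; something like this after an uninterrupted boot (again just a sketch):

Code:
# what pve-cluster and corosync waited on, and how long each step took
systemd-analyze critical-chain pve-cluster.service corosync.service
# the slowest units of the current boot
systemd-analyze blame | head -n 20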
 
lrm proxmoxW540 (idle, Mon Oct 6 10:36:27 2025)
lrm pvebig (idle, Mon Oct 6 10:36:30 2025)
lrm pveduo (idle, Mon Oct 6 10:36:28 2025)
lrm pvegiant (idle, Mon Oct 6 10:36:31 2025)
lrm pvenew (idle, Mon Oct 6 10:36:31 2025)
lrm pveslow (idle, Mon Oct 6 10:36:29 2025)
Are these all the nodes in the cluster? What does pvecm status output? Keep in mind that it is not recommended to have an even number of nodes in a cluster, see e.g. [0]. This seems like a clustering issue first of all.

service ct:100 (pvenew, ignored)
service ct:102 (---, ignored)
service ct:106 (---, ignored)
service ct:107 (pveduo, ignored)
service ct:112 (pveslow, ignored)
service ct:114 (pvebig, ignored)
service ct:116 (---, ignored)
service ct:117 (---, ignored)
service ct:200 (pvenew, ignored)
service vm:120 (---, ignored)
Are the HA resources with node "---" running somewhere unnoticed? Or were these removed already?

[0] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_supported_setups
 
From the provided log in log-pve540-1.zip it also seems like there are quite a few connection issues, e.g. to the InfluxDB right at the start. I assume that proxmoxW540 is one of the nodes that has problems starting up? Is the network stable for that node/those nodes?
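
One way to sanity-check the corosync links on that node would be something along these lines (just a sketch):

Code:
# per-link status as corosync sees it on this node
corosync-cfgtool -s
# corosync messages about links/membership for the current boot
journalctl -b -u corosync --no-pager | grep -iE 'link|members|retransmit' | tail -n 50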
 
Hello,

The network is stable. No issues with the cluster or the nodes during normal usage.
I only see issues at startup, and only on these two nodes.

As for the HA resources, I removed them all just for testing, and I can still reproduce the issue.

Code:
root@proxmoxW540:~# ha-manager status
quorum OK
master pveslow (idle, Fri Nov 14 11:05:15 2025)
lrm proxmoxW540 (idle, Fri Nov 14 22:25:42 2025)
lrm pvebig (idle, Fri Nov 14 22:25:46 2025)
lrm pveduo (idle, Fri Nov 14 22:25:44 2025)
lrm pvegiant (idle, Fri Nov 14 22:25:46 2025)
lrm pvenew (idle, Fri Nov 14 22:25:43 2025)
lrm pveslow (idle, Fri Nov 14 22:25:42 2025)