Cluster issues after upgrading to PVE 9

Liviu Sas

Active Member
Jun 11, 2018
Hello,

I just upgraded my cluster to Proxmox VE 9, and most nodes went fine, except for two that ended up in a very weird state.
Those two nodes hung at "Setting up pve-cluster", and I noticed that /etc/pve was locked (causing any process that tried to access it to get stuck in a "D" state).
The only way to finish the upgrade was to reboot in recovery mode.
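
For anyone hitting the same hang, the blocked processes can usually be spotted with something like this (a generic sketch, nothing PVE-specific):

Code:
# list processes stuck in uninterruptible sleep ("D" state) and what they are waiting on
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# show what has /etc/pve open (this can itself hang if the fuse mount is dead)
fuser -vm /etc/pve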

After the upgrade finished, everything looked good until I rebooted either of those nodes. After the reboot, they would come up with /etc/pve stuck again.
That in turn caused /etc/pve to become stuck on the other nodes in the cluster as well, sending them into a reboot loop.
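
If the other nodes are being fenced by the HA watchdog, that should show up in the previous boot's journal; something like this should tell (assuming a persistent journal):

Code:
# HA and watchdog messages from the boot before the current one
journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --no-pager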

The only way to recover these nodes is to boot into recovery mode, run "apt install --reinstall pve-cluster" and press CTRL+D to continue the boot; they then come up and work as expected.
But if either of these two nodes reboots again, the situation repeats (/etc/pve becomes stuck on all nodes and they enter the reboot loop).
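
In short, the per-node workaround looks roughly like this:

Code:
# from the recovery shell on the affected node
mount -o remount,rw /             # only if / happens to be read-only
apt install --reinstall pve-cluster
# then press CTRL+D to continue the normal boot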

It looks like something did not get upgraded correctly on those two nodes, but I can't figure out what.
I should mention that pvecm status shows a healthy cluster the entire time.

Cheers,
Liviu
 
Hi!

Could you post some syslogs from these nodes? Is HA set up on them?
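
Something along these lines from one of the affected nodes would already help (assuming the failed boot is still in the journal):

Code:
# adjust -b (e.g. -b -1) if the failed boot is not the current one
journalctl -b -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm > node-boot.log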
 
Hello,

I rebooted one of the nodes yesterday to reproduce the issue.
See the syslog attached:
12:25:52 - first uninterrupted boot, cluster fail
12:28:31 - second uninterrupted boot, cluster fail
12:32:55 - boot in recovery mode, performed "apt install --reinstall pve-cluster pve-firewall pve-manager pve-ha-manager" and continued boot, cluster healthy

Currently I have some HA services but they are all ignored:

Code:
root@pvegiant:~# ha-manager status
quorum OK
master pvenew (active, Mon Oct  6 10:36:23 2025)
lrm proxmoxW540 (idle, Mon Oct  6 10:36:27 2025)
lrm pvebig (idle, Mon Oct  6 10:36:30 2025)
lrm pveduo (idle, Mon Oct  6 10:36:28 2025)
lrm pvegiant (idle, Mon Oct  6 10:36:31 2025)
lrm pvenew (idle, Mon Oct  6 10:36:31 2025)
lrm pveslow (idle, Mon Oct  6 10:36:29 2025)
service ct:100 (pvenew, ignored)
service ct:102 (---, ignored)
service ct:106 (---, ignored)
service ct:107 (pveduo, ignored)
service ct:112 (pveslow, ignored)
service ct:114 (pvebig, ignored)
service ct:116 (---, ignored)
service ct:117 (---, ignored)
service ct:200 (pvenew, ignored)
service vm:120 (---, ignored)

For both failed boots I can see the following concerning logs:
Code:
2025-10-05T12:28:44.807917+13:00 proxmoxW540 pmxcfs[1298]: [quorum] crit: quorum_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810089+13:00 proxmoxW540 pmxcfs[1298]: [quorum] crit: can't initialize service
2025-10-05T12:28:44.810376+13:00 proxmoxW540 pmxcfs[1298]: [confdb] crit: cmap_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810582+13:00 proxmoxW540 pmxcfs[1298]: [confdb] crit: can't initialize service
2025-10-05T12:28:44.810838+13:00 proxmoxW540 pmxcfs[1298]: [dcdb] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.810947+13:00 proxmoxW540 pmxcfs[1298]: [dcdb] crit: can't initialize service
2025-10-05T12:28:44.811034+13:00 proxmoxW540 pmxcfs[1298]: [status] crit: cpg_initialize failed: CS_ERR_LIBRARY (failed to connect to corosync)
2025-10-05T12:28:44.811101+13:00 proxmoxW540 pmxcfs[1298]: [status] crit: can't initialize service
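
CS_ERR_LIBRARY here just means pmxcfs failed to connect to the corosync daemon at that moment, so it's worth cross-checking whether corosync was actually up yet when pmxcfs started. The startup order can be compared with something like this (adjust -b to the failed boot):

Code:
journalctl -b -2 -u corosync -u pve-cluster -o short-precise --no-pager | head -n 60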


And it looks like pve-ha-crm gets stuck in status change startup => wait_for_quorum.

But I do not understand why, as I can see corosync starting up and acquiring quorum and looking happy.
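
For reference, corosync itself can be checked at runtime with something like:

Code:
corosync-quorumtool -s      # membership / quorum state as corosync sees it
corosync-cfgtool -s         # link status of the corosync rings
systemctl status corosync pve-cluster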
 
