/etc/pve frozen and gui inoperable

OMFG_HELP_ME_PLEASE

New Member
Mar 31, 2025
Have a 5 node cluster. Today I wanted to add nodes 6 and 7.

I installed node 6. Added it to the cluster. Performed an update of all packages. Once complete, performed a reboot.

On reboot, my cluster gui became inoperable.

/etc/pve appears frozen on all my nodes. I cannot access any subdirectories on it, on any of my nodes. Any attempt to do, say, an ls /etc/pve/nodes just hangs and times out.

The gui is inoperable. I can log in, I can get a list of nodes, but all my VMs/CTs appear as little grey boxes.

pvecm status shows my cluster running with a quorum.

All my VMs and CTs appear to still be running.

How do I recover /etc/pve so I do not lose my cluster or VM/CT configurations?
 
What's the scope of the problem?

Did the GUI freeze on all nodes, or just node 6?

It's normal for the node that is joining the cluster to go haywire for a minute or two as it adopts cluster settings, TLS certs are regenerated, and a number of other things get reset.

I don't believe it's normal for /etc/pve to freeze though.

Mounting /etc/pve locally

You can run killall pmxcfs; pmxcfs -l (or, more cleanly, stop the pve-cluster service first) to mount /etc/pve as local-only from the latest synced copy in /var/lib/pve-cluster/config.db.
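As a sketch, on one node (this assumes the standard pve-cluster service name and default database path; back up the database before touching anything):

```shell
# Stop the clustered filesystem daemon cleanly.
systemctl stop pve-cluster

# Back up the underlying database first.
cp /var/lib/pve-cluster/config.db /root/config.db.backup

# Restart pmxcfs in local-only mode (no quorum required).
pmxcfs -l

# Verify /etc/pve is readable again.
ls /etc/pve/nodes
```

Once the underlying problem is fixed, kill the local-only pmxcfs instance and start the pve-cluster service again to rejoin the cluster.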

Corosync Config

I would check and make sure that all corosync config files on all nodes match:
/etc/corosync/corosync.conf (local, used by corosync)
/etc/pve/corosync.conf (shared, only used for sync)

I had a similar problem once when I mistakenly edited /etc/corosync/corosync.conf (the local version). That caused the cluster to update to a new config version (via network communication) which the other nodes did not have a local copy of (because I hadn't updated the shared copy), and a few minutes later everything fenced itself and stopped.
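A quick way to spot a mismatch (a sketch; the node hostnames and working SSH access between nodes are assumptions):

```shell
# Compare the local corosync config across all nodes by checksum.
# Replace the node list with your actual hostnames or IPs.
for node in node1 node2 node3 node4 node5 node6; do
  echo -n "$node: "
  ssh "$node" sha256sum /etc/corosync/corosync.conf
done
# All checksums should be identical. Also check that the
# 'config_version' field inside each file matches.
```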

Other thoughts

Had you updated the other nodes prior to adding the new node to the cluster?
If not, how far apart were the versions?

I believe the recommended strategy is to update one node in the existing cluster to the latest, migrate everything off and reboot it, wait several minutes for any sync processes to complete, and repeat. Likewise, a new node would be fully updated and rebooted before adding it to the cluster.

Also, are any of the drives over 80% full on any of the nodes?
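To check quickly on each node (paths are the usual defaults; pmxcfs can misbehave when its backing filesystem fills up):

```shell
# Check usage of the root filesystem and the pmxcfs database location.
df -h / /var/lib/pve-cluster
```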

See also

- https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)#_recovery
 
> How do I recover /etc/pve so I do not lose my cluster or VM/CT configurations?
Hello,
First of all, a copy of your configuration file /etc/pve/corosync.conf lives in /etc/corosync/.
Secondly, /etc/pve is in fact a filesystem view of an SQLite database. You'll find it in /var/lib/pve-cluster/config.db. You can back that file up and query it with sqlite3; it contains your corosync.conf file as well as every VM/CT configuration file in your cluster.
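For example, to back up the database and list the files it holds (a sketch; the `tree` table layout is the one pmxcfs uses, but verify against your copy with `.schema` first):

```shell
# Work on a copy, never the live database.
cp /var/lib/pve-cluster/config.db /root/config.db.backup

# List every file tracked by pmxcfs (VM/CT configs, corosync.conf, ...).
sqlite3 /root/config.db.backup "SELECT name FROM tree ORDER BY name;"

# Dump a single file, e.g. the shared corosync config.
sqlite3 /root/config.db.backup \
  "SELECT data FROM tree WHERE name = 'corosync.conf';"
```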

Thirdly, if /etc/pve hangs, it may be in read-only mode to prevent a split-brain situation, and you may be out of your cluster despite pvecm status telling you otherwise.
Try 'corosync-cfgtool -s' on each node to see in real time whether you are connected to your cluster or not.

Best regards,

GD
 
> Thirdly, if /etc/pve hangs, it may be in read-only mode to prevent a split-brain situation, and you may be out of your cluster despite pvecm status telling you otherwise.
> try 'corosync-cfgtool -s' on each node to see in real time if you are connected to your cluster or not.
All the nodes show a variation of the same output (localhost being different on each, obviously):

root@node2:~# corosync-cfgtool -s
Local node ID 2, transport knet
LINK ID 0 udp
        addr    = 192.168.228.21
        status:
                nodeid:  1:     connected
                nodeid:  2:     localhost
                nodeid:  3:     connected
                nodeid:  4:     connected
                nodeid:  5:     connected
                nodeid:  6:     connected

LINK ID 1 udp
        addr    = 172.16.228.21
        status:
                nodeid:  1:     connected
                nodeid:  2:     localhost
                nodeid:  3:     connected
                nodeid:  4:     connected
                nodeid:  5:     connected
                nodeid:  6:     connected
 
> localhost being different on each obviously

Yes, of course. Corosync detects which node you are querying from and flags it as localhost.
The main point here is that all hosts show "connected", so pvecm reporting quorum is consistent with that.
Most cluster problems come from either corosync configuration or network configuration.
As your corosync config seems okay, I suggest you verify the network configuration on your nodes, especially on the new one(s) you added recently.
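A quick sketch for checking reachability of every node over both corosync links (the subnets come from the corosync-cfgtool output above; the host octets 20-25 are an assumption based on node2 being .21, so adjust them to your actual addresses):

```shell
# Ping every node on both corosync rings from the current node.
for net in 192.168.228 172.16.228; do
  for host in 20 21 22 23 24 25; do
    ping -c1 -W1 "${net}.${host}" >/dev/null \
      && echo "OK   ${net}.${host}" \
      || echo "FAIL ${net}.${host}"
  done
done
```

A node reachable on one ring but not the other would point at exactly the kind of per-link network problem that can wedge pmxcfs.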

On a 6-node cluster you still have quorum with 4 votes, so I would suggest shutting the new node(s) down one at a time, then together, and seeing if your problem persists or not. This may help you identify the faulty node while the original five keep quorum.

Best regards,

GD