Something is resetting my cluster

mosespray · May 24, 2025

Hello,
I recently installed a 6 node Proxmox VE cluster (Version 8.4.1).
The cluster is made up of 2 SuperMicro and 4 HP EliteDesk 800 G2 servers.
After installation, I executed the
Proxmox VE Post Install script.
I was able to install LXC and VMs on all nodes with no issues.
When I came back later and logged in, I got the "nag" message.
When I try to go to the shell in any of the nodes, I get:

Undefined (Code: 1006)

I also got the following error in syslog and the task window:

F4C2:vncshell::root@pam: command '/usr/bin/termproxy 5900 --path /nodes/proxmox00 --perm Sys.Console -- /bin/login -f root' failed: exit code 1

It seems like some process is going through and changing or updating something. I can't find anything in the logs to prove that.
Oddly, re-running the post install script from the first node fixes it, at least temporarily.

Also, I noticed these errors:

2025-05-23T15:11:20.997483-04:00 proxmox00 pmxcfs[8154]: [quorum] crit: quorum_initialize failed: 2
2025-05-23T15:11:20.997560-04:00 proxmox00 pmxcfs[8154]: [quorum] crit: can't initialize service
2025-05-23T15:11:20.997609-04:00 proxmox00 pmxcfs[8154]: [confdb] crit: cmap_initialize failed: 2
2025-05-23T15:11:20.997659-04:00 proxmox00 pmxcfs[8154]: [confdb] crit: can't initialize service
2025-05-23T15:11:20.997709-04:00 proxmox00 pmxcfs[8154]: [dcdb] crit: cpg_initialize failed: 2
2025-05-23T15:11:20.997748-04:00 proxmox00 pmxcfs[8154]: [dcdb] crit: can't initialize service
2025-05-23T15:11:20.997799-04:00 proxmox00 pmxcfs[8154]: [status] crit: cpg_initialize failed: 2
2025-05-23T15:11:20.997848-04:00 proxmox00 pmxcfs[8154]: [status] crit: can't initialize service

Not sure if they are related.

# pvecm status
Cluster information
-------------------
Name: pve-cluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Sat May 24 13:08:36 2025
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000001
Ring ID: 1.147
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.99.40 (local)
0x00000002 1 10.10.99.41
0x00000003 1 10.10.99.42
0x00000004 1 10.10.99.43
0x00000005 1 10.10.99.44
0x00000006 1 10.10.99.45

Any help would be greatly appreciated.

beisser · May 24, 2025

whats a Proxmox VE Post Install script?
i am not aware of any such script delivered by proxmox.
sounds like you executed 3rd party scripts which screwed with your host.
you should ideally ask the creators of those scripts for help.

mosespray · May 24, 2025

I thought that it was a commonly used source for installing apps on Proxmox.
This is the main page: https://community-scripts.github.io/ProxmoxVE/
And this it the specific script: https://community-scripts.github.io/ProxmoxVE/scripts?id=post-pve-install
I have installed and reinstalled Proxmox about 50 times trying to work out issues. The error that I listed above was happening long before I discovered
the scripts website.

beisser · May 25, 2025

so the issue is also happening if you do a fresh install of pve on ALL nodes and not run the 3rd-party script at all?

that would be extremely weird, because then a multitude of people would be seeing your problem, which they are not.
could this be an issue with your hardware?
do you see this on all nodes of the cluster at the same time or is this only happening on one or a few nodes?
pmxcfs is the cluster filesystem responsible for storing and syncing the clusters importan configuration files (see https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs) ).
not quite sure how to troubleshoot this though as i have never experienced any issues relating to pmxcfs.

also having 6 nodes is bad in itself because its an even number and you can get into a split brain situation if a 3/3 network partition happens. this will take the entire cluster offline in an instant because it lacks quorum. it is recommended to have an uneven number of hosts in the cluster to avoid this.

mosespray · May 27, 2025

Hello,
Thank you very much for your response.
Actually, I took the following steps and got the same result:
1. I saw a comment about needing an odd number of nodes, so, I dropped it to 5.
2. Nodes 1 and 2 are older SuperMicro servers. The installer will not allow me to install 8.4.1 directly. I get an error about the video display is not supported. So, I have to install 7.4.15 first and then upgrade them to 8.4.1.
3. I created the cluster and did not use any scripts or outside procedures.
4. This morning I checked and logging into the first node I get the undefined (Code: 1006) and the
failed waiting for client: timed out
TASK ERROR: command '/usr/bin/termproxy 5900 --path /nodes/proxmox02 --perm Sys.Console -- /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox02' -o 'UserKnownHostsFile=/etc/pve/nodes/proxmox02/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' -t root@10.10.99.42 -- /bin/login -f root' failed: exit code 1 error.
5. This is only happening on the first node. If I log into the others, accessing the shell works fine.

I will look into pmxcfs.

Thank you.

Actually, I think I will just switch it around and have one of the HP EliteDesk servers be the first node...

Search

Search

Something is resetting my cluster

mosespray

New Member

beisser

Well-Known Member

mosespray

New Member

beisser

Well-Known Member

mosespray

New Member

We value your privacy