Very strange PVE crash/unavailability

andreisrr

New Member
Feb 2, 2024
I am running a small cluster of 3 nodes, currently on PVE 8.3.3.
The cluster is in a bad state, with only one functional node right now. It is so strange that I can't even begin to fathom where or how to start investigating the problem.

Node1: the web interface works up to the login page. When I try to log in, after a while it shows "login failed".
Node2: the web interface won't load; the browser shows PR_END_OF_FILE_ERROR (apparently this happens when Firefox has exhausted all cipher combinations while trying to establish an SSL connection).
Node3: this one is working fine.

Node3 reports the following about the cluster: node1 and node3 are up, node2 is down.
Node1 is shown with a grey question mark (and status: unknown when hovering) and I can't perform any operation on it. It then shows "Error: connection error 401: permission denied, invalid ticket" and throws me a login dialog.
Node2 is shown with a red X and status offline.

I am contacting tech support at the site where the physical machines are. All I know for now is that there were some recent network outages due to a faulty switch. I have yet to physically restart the machines or gain access to their physical consoles.

I also see something like this in the system log on node3:
Code:
[TOTEM ] Token has not been received in 2737 ms
[TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
[QUORUM] Sync members[1]: 3
[QUORUM] Sync left[1]: 1
[TOTEM ] A new membership (3.2663) was formed. Members left: 1
[TOTEM ] Failed to receive the leave message. failed: 1
[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 3
[MAIN  ] Completed service synchronization, ready to provide service.
[...]
pve-ha-lrm[1159]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve03/lrm_status.tmp.1159' - Permission denied

pve03 = node3
There is no such file as /etc/pve/nodes/pve03/lrm_status.tmp.1159,
but there is an lrm_status file.

What can I do, before/besides a physical reboot?
 
What can I do, before/besides a physical reboot?
Investigate via SSH. It's almost stock Debian below the GUI, so any Debian admin should be able to go through the log files, analyse them, run pings etc. in order to find out what's wrong. Also check disk usage on all nodes: if the PVE cluster filesystem cannot be written, you will get a lot of strange errors.
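
As a rough starting point, the checks above could be run like this on each node over SSH. This is only a sketch, not an official procedure: the 90% threshold and the helper names usage_pct and check_fs are my own choices, and the PVE-specific part is skipped if pvecm is not installed. Everything here is read-only. Note that the "non-primary component" line in the log means node3 currently has no quorum, and /etc/pve is normally read-only in that state, which would be consistent with the "Permission denied" from pve-ha-lrm.

```shell
#!/bin/sh
# Read-only health checks; run on every node.

# Print the use% of the filesystem holding $1 as a bare number.
usage_pct() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Warn when a filesystem is at or above 90% (arbitrary threshold).
check_fs() {
    pct=$(usage_pct "$1")
    case "$pct" in
        ''|*[!0-9]*) echo "unknown: could not read usage for $1"; return 0 ;;
    esac
    if [ "$pct" -ge 90 ]; then
        echo "WARNING: $1 is ${pct}% full"
    else
        echo "ok: $1 is ${pct}% full"
    fi
}

check_fs /
# The database behind /etc/pve lives here on a PVE node.
if [ -d /var/lib/pve-cluster ]; then
    check_fs /var/lib/pve-cluster
fi

# Cluster view; only attempted where the PVE tools exist.
if command -v pvecm >/dev/null 2>&1; then
    pvecm status
    systemctl --no-pager status corosync pve-cluster
    journalctl --no-pager -b -u corosync -u pve-cluster | tail -n 50
fi
```

Compare the output between the three nodes. A full root filesystem, or corosync links that keep flapping after the switch problem, would each explain a good part of the symptoms described.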