I am running a small 3-node cluster, currently on PVE 8.3.3.
The cluster is in a bad state, with only one functional node right now. It is so strange that I can't even begin to fathom where or how to start investigating the problem.
Node1: the web interface works up to the login page. When I try to log in, after a while it shows "login failed".
Node2: the web interface won't load; the browser shows PR_END_OF_FILE_ERROR (apparently this happens when Firefox has exhausted all cipher combinations while trying to establish an SSL connection).
Node3: this one is working fine.
Node3 reports the following about the cluster: node1 and node3 are up, node2 is down.
Node1 is shown with a grey question mark (and status: unknown when hovering) and I can't perform any operation on it. It then shows "Error: connection error 401: permission denied, invalid ticket" and throws me a login dialog.
Node2 is shown with a red X and status offline.
I am contacting tech support at the site where the physical machines are. All I know for now is that there were recently some network outages due to a faulty switch. I have yet to physically restart the machines or gain access to the physical consoles.
I also see something like this in the system log on node3:
Code:
[TOTEM ] Token has not been received in 2737 ms
[TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
[QUORUM] Sync members[1]: 3
[QUORUM] Sync left[1]: 1
[TOTEM ] A new membership (3.2663) was formed. Members left: 1
[TOTEM ] Failed to receive the leave message. failed: 1
[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 3
[MAIN ] Completed service synchronization, ready to provide service.
[...]
pve-ha-lrm[1159]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve03/lrm_status.tmp.1159' - Permission denied
pve03 = node3
There is no such file as /etc/pve/nodes/pve03/lrm_status.tmp.1159
But there is an lrm_status file in that directory.
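For what it's worth, here is a rough way to read the corosync excerpt above. This is only a minimal sketch: the function name `summarize_quorum` is made up for illustration, and the log text is a subset of the lines quoted in this post. It picks out the last `[QUORUM] Members[...]` line and checks whether the node reports itself in the primary (quorate) or non-primary component. If I read it correctly, node3 is the only member left and has no quorum. As far as I understand, /etc/pve becomes read-only on a node without quorum, which may be why pve-ha-lrm gets "Permission denied" there, but that is an assumption on my part.

```python
import re

# Subset of the corosync lines quoted above (node3's system log).
LOG = """\
[TOTEM ] Token has not been received in 2737 ms
[QUORUM] Sync members[1]: 3
[QUORUM] Sync left[1]: 1
[TOTEM ] A new membership (3.2663) was formed. Members left: 1
[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 3
"""


def summarize_quorum(log_text):
    """Hypothetical helper: return (member_ids, quorate) from corosync log text.

    member_ids comes from the last "[QUORUM] Members[n]: ..." line seen.
    quorate is False/True depending on the last primary/non-primary
    component message, or None if no such message appears.
    """
    members = []
    quorate = None
    for line in log_text.splitlines():
        match = re.search(r"\[QUORUM\]\s+Members\[\d+\]:\s*(.*)", line)
        if match:
            members = [int(node_id) for node_id in match.group(1).split()]
        if "non-primary component" in line:
            quorate = False
        elif "within the primary component" in line:
            quorate = True
    return members, quorate


members, quorate = summarize_quorum(LOG)
print("members:", members, "quorate:", quorate)
```

On the live node, `pvecm status` should report the same membership and quorum information directly, so the script is only useful for reading saved logs.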
What can I do, before/besides a physical reboot?