Hello Community,
I am having trouble getting my 3-node cluster up and running again after an update.
I was away for 5 weeks and shut down most of the IT equipment in my home lab, including the Proxmox cluster. After I returned, I started everything up and noticed that the Proxmox cluster was having issues: two nodes appeared operational and could be selected in the GUI, while the third node just showed as disconnected. I shut the cluster down again and started only the node that had the problem. Everything seemed normal: I could SSH into the node and it showed up normally in the GUI. As soon as I started the other two nodes, the problem node went offline again.
After some checking I realised that the system clock was off by about 4 minutes on the problem node. That resolved itself once the node synced its time to the other nodes via NTP. However, the problem with that one node still existed.
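For reference, this is roughly how I check time sync on each node (assuming chrony, which is the default NTP client on current Proxmox installs):
Code:
# confirm the clock is NTP-synchronised and the offset is small
timedatectl status | grep -E 'System clock|NTP service'
chronyc tracking | grep -E 'Stratum|System time|Leap status'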
As my next step I decided to update all VMs, LXCs, nodes and everything else; this is automated with Ansible and has worked flawlessly for over a year. After the update the problem still existed, but it seemed slightly different now. I noticed some errors about SSH keys in the syslog. After googling the problem, I followed some recommendations, which basically came down to running
pvecm updatecerts -F
and restarting the cluster. But that didn't help either. I cannot SSH into the problem node while the other two nodes are up and running; when only the problem node is online, I can SSH into it. It is similar with the GUI: when the whole cluster is running, the GUI intermittently shows the problem node as disconnected.
I have read and followed so many online articles, but nothing seems to help. I hope to get some guidance from the community to help me reinstate the cluster. I don't really want to reinstall the problem node, but I am prepared to do so if there is no other way to fix this issue. Below is an excerpt from the syslog of the problem node. It shows that the node connects to the cluster and after a moment disconnects again.
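In case it matters, by "restarting the cluster" I mean restarting the cluster services on the node, roughly like this (standard Proxmox service names):
Code:
# restart the clustered config filesystem and corosync, then the API services
systemctl restart pve-cluster corosync
systemctl restart pvedaemon pveproxy pvestatd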
Code:
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 12:59:49 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [status] notice: node lost quorum
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] crit: received write while not quorate - trigger resync
Mar 19 12:59:49 pve-1 pmxcfs[1067]: [dcdb] crit: leaving CPG group
Mar 19 12:59:49 pve-1 pve-ha-lrm[1562]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve-1/lrm_status.tmp.1562' - Permission denied
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] notice: start cluster connection
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] crit: cpg_join failed: 14
Mar 19 12:59:50 pve-1 pmxcfs[1067]: [dcdb] crit: can't initialize service
Mar 19 12:59:54 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 12:59:56 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 12:59:56 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 12:59:59 pve-1 pvestatd[1255]: got timeout
Mar 19 13:00:04 pve-1 pvestatd[1255]: got timeout
Mar 19 13:00:04 pve-1 pvestatd[1255]: status update time (18.124 seconds)
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] rx: host: 2 link: 0 is up
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] rx: host: 3 link: 0 is up
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:00:08 pve-1 corosync[1186]: [KNET ] pmtud: Global data MTU changed to: 1397
Mar 19 13:00:08 pve-1 corosync[1186]: [QUORUM] Sync members[3]: 1 2 3
Mar 19 13:00:08 pve-1 corosync[1186]: [QUORUM] Sync joined[2]: 2 3
Mar 19 13:00:08 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62b0) was formed. Members joined: 2 3
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: starting data syncronisation
Mar 19 13:00:08 pve-1 corosync[1186]: [QUORUM] This node is within the primary component and will provide service.
Mar 19 13:00:08 pve-1 corosync[1186]: [QUORUM] Members[3]: 1 2 3
Mar 19 13:00:08 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: cpg_send_message retried 1 times
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: node has quorum
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: starting data syncronisation
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: received sync request (epoch 1/1067/00001242)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received sync request (epoch 1/1067/00000928)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: received all states
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: leader is 2/897
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: synced members: 2/897, 3/893
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: waiting for updates from leader
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received all states
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: all data is up to date
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:00:08 pve-1 pmxcfs[1067]: [status] notice: received log
Mar 19 13:00:09 pve-1 kernel: libceph: mon1 (1)192.168.1.21:6789 session lost, hunting for new mon
Mar 19 13:00:09 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 session established
Mar 19 13:00:12 pve-1 kernel: libceph: osd2 (1)192.168.1.22:6801 socket closed (con state OPEN)
Mar 19 13:00:12 pve-1 kernel: rbd: rbd1: encountered watch error: -107
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] link: host: 3 link: 0 is down
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] link: host: 2 link: 0 is down
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] host: host: 3 has no active links
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:00:44 pve-1 corosync[1186]: [KNET ] host: host: 2 has no active links
Mar 19 13:00:45 pve-1 corosync[1186]: [TOTEM ] Token has not been received in 2737 ms
Mar 19 13:00:46 pve-1 corosync[1186]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Mar 19 13:00:49 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 13:00:50 pve-1 corosync[1186]: [QUORUM] Sync members[1]: 1
Mar 19 13:00:50 pve-1 corosync[1186]: [QUORUM] Sync left[2]: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62b4) was formed. Members left: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Mar 19 13:00:50 pve-1 corosync[1186]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 19 13:00:50 pve-1 corosync[1186]: [QUORUM] Members[1]: 1
Mar 19 13:00:50 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [status] notice: node lost quorum
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] crit: received write while not quorate - trigger resync
Mar 19 13:00:50 pve-1 pmxcfs[1067]: [dcdb] crit: leaving CPG group
Mar 19 13:00:50 pve-1 pve-ha-lrm[1562]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve-1/lrm_status.tmp.1562' - Permission denied
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] notice: start cluster connection
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] crit: cpg_join failed: 14
Mar 19 13:00:51 pve-1 pmxcfs[1067]: [dcdb] crit: can't initialize service
Mar 19 13:00:52 pve-1 pvestatd[1255]: status update time (8.144 seconds)
Mar 19 13:00:57 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067
Mar 19 13:00:57 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:00:59 pve-1 pvestatd[1255]: got timeout
Mar 19 13:01:04 pve-1 pvestatd[1255]: storage 'NAS-TS-453A' is not online
Mar 19 13:01:04 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 socket closed (con state OPEN)
Mar 19 13:01:04 pve-1 kernel: libceph: mon0 (1)192.168.1.20:6789 session lost, hunting for new mon
Mar 19 13:01:04 pve-1 kernel: libceph: mon1 (1)192.168.1.21:6789 session established
Mar 19 13:01:05 pve-1 pvestatd[1255]: status update time (11.151 seconds)
Mar 19 13:01:07 pve-1 corosync[1186]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Mar 19 13:01:07 pve-1 corosync[1186]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 13:01:07 pve-1 corosync[1186]: [KNET ] pmtud: Global data MTU changed to: 1397
Mar 19 13:01:11 pve-1 pvescheduler[486298]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 19 13:01:11 pve-1 pvescheduler[486297]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 19 13:01:11 pve-1 corosync[1186]: [QUORUM] Sync members[1]: 1
Mar 19 13:01:11 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62b8) was formed. Members
Mar 19 13:01:11 pve-1 corosync[1186]: [QUORUM] Members[1]: 1
Mar 19 13:01:11 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:01:13 pve-1 pvestatd[1255]: got timeout
Mar 19 13:01:13 pve-1 pvestatd[1255]: status update time (8.460 seconds)
Mar 19 13:01:15 pve-1 corosync[1186]: [QUORUM] Sync members[1]: 1
Mar 19 13:01:15 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62bc) was formed. Members
Mar 19 13:01:15 pve-1 corosync[1186]: [QUORUM] Members[1]: 1
Mar 19 13:01:15 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:01:20 pve-1 corosync[1186]: [QUORUM] Sync members[1]: 1
Mar 19 13:01:20 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62c0) was formed. Members
Mar 19 13:01:20 pve-1 corosync[1186]: [QUORUM] Members[1]: 1
Mar 19 13:01:20 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:01:22 pve-1 corosync[1186]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Mar 19 13:01:22 pve-1 corosync[1186]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Mar 19 13:01:22 pve-1 corosync[1186]: [QUORUM] Sync members[3]: 1 2 3
Mar 19 13:01:22 pve-1 corosync[1186]: [QUORUM] Sync joined[2]: 2 3
Mar 19 13:01:22 pve-1 corosync[1186]: [TOTEM ] A new membership (1.62c4) was formed. Members joined: 2 3
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: starting data syncronisation
Mar 19 13:01:22 pve-1 corosync[1186]: [KNET ] pmtud: Global data MTU changed to: 1397
Mar 19 13:01:22 pve-1 corosync[1186]: [QUORUM] This node is within the primary component and will provide service.
Mar 19 13:01:22 pve-1 corosync[1186]: [QUORUM] Members[3]: 1 2 3
Mar 19 13:01:22 pve-1 corosync[1186]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: cpg_send_message retried 1 times
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: node has quorum
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: members: 1/1067, 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: starting data syncronisation
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: received sync request (epoch 1/1067/00001246)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: received sync request (epoch 1/1067/0000092A)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: received all states
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: leader is 2/897
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: synced members: 2/897, 3/893
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: waiting for updates from leader
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: received all states
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: all data is up to date
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [status] notice: dfsm_deliver_queue: queue length 29
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: all data is up to date
Mar 19 13:01:22 pve-1 pmxcfs[1067]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 2
The status of the problem node, queried while it is online, doesn't show any issues as far as I can tell:
Code:
root@pve-1:~# pvecm status
Cluster information
-------------------
Name:             Cluster
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 19 13:25:22 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.63b0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.20 (local)
0x00000002          1 192.168.1.21
0x00000003          1 192.168.1.22
This is the status of node 3 when node 1 is offline:
Code:
root@pve-3:~# pvecm status
Cluster information
-------------------
Name:             Cluster
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 19 13:29:52 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          2.63d8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.1.21
0x00000003          1 192.168.1.22 (local)
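If it helps, I can also capture the corosync link state and basic reachability from the problem node while all three nodes are up, for example:
Code:
# local corosync ring/link status on pve-1
corosync-cfgtool -s
# reachability of the other two nodes on the cluster network
ping -c 3 192.168.1.21
ping -c 3 192.168.1.22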
I appreciate your help, thank you.