I've been having a really weird issue where one of my Proxmox nodes fails exactly one week after the cluster is fixed, and I've really been struggling to troubleshoot it, especially since it takes a week to see whether any attempted fix actually worked.
My cluster has been in a loop where every week one of the nodes' pvesr service fails with the following details in its logs:
Code:
Oct 13 20:57:00 tethealla systemd[1]: Starting Proxmox VE replication runner...
Oct 13 20:57:00 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:01 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:02 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:03 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:04 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:05 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:06 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:07 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:08 tethealla pvesr[2019]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 13 20:57:09 tethealla pvesr[2019]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 13 20:57:09 tethealla systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
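For reference, these are roughly the checks I run on the affected node when pvesr starts failing like this. I'm not pasting the full output here, so treat it as a sketch of the diagnostics rather than a complete log (the journalctl timestamps are just illustrative):
Code:
# Proxmox's view of cluster membership and quorum
pvecm status

# corosync's own quorum view on the affected node
corosync-quorumtool -s

# recent corosync / pmxcfs messages around the failure
journalctl -u corosync -u pve-cluster --since "2019-10-13 20:50" --until "2019-10-13 21:10"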
At first I didn't know what it could be, but looking for patterns in how the servers failed, I observed the following details:
- Node A first failed on Saturday September 28th.
- I couldn't figure out how to get it working again, so I shut down the node, removed it from the cluster and reinstalled/rejoined it.
- Exactly one week later, on Saturday October 5th, Node B failed instead.
- This time I wasn't on-site on Saturday, so I reinstalled it on Sunday October 6th instead.
- Exactly one week after that fix, on Sunday October 13th, Node B failed again (a quick check of the interval is shown below).
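Just to sanity-check the "exactly one week" pattern, I did the rough arithmetic with date. The timestamps below are my best guesses (I only wrote down the dates), so treat the exact times as placeholders:
Code:
# assumed fix time on Sunday October 6th and failure time on Sunday October 13th
FIXED=$(date -d "2019-10-06 20:57:00" +%s)
FAILED=$(date -d "2019-10-13 20:57:00" +%s)

# difference in whole hours: prints 168
echo $(( (FAILED - FIXED) / 3600 ))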
The only thing I can imagine causing this is that I upgraded to Proxmox VE 6 in August, but that still leaves a one-month gap before these issues started happening. Corosync seems to establish a link for a few seconds, but then the server promptly gets "kicked out":
Code:
Oct 15 01:59:03 tethealla systemd[1]: Started Corosync Cluster Engine.
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Oct 15 01:59:28 tethealla corosync[1045]: [KNET ] pmtud: Global data MTU changed to: 1397
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] link: host: 1 link: 0 is down
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:41 tethealla corosync[1045]: [KNET ] host: host: 1 has no active links
Oct 15 12:36:51 tethealla corosync[1045]: [QUORUM] This node is within the primary component and will provide service.
Oct 15 12:36:51 tethealla corosync[1045]: [QUORUM] Members[1]: 2
Oct 15 12:36:52 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 12:36:52 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:36:53 tethealla corosync[1045]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 15 12:36:53 tethealla corosync[1045]: [QUORUM] Members[1]: 2
Oct 15 12:36:58 tethealla corosync[1045]: [KNET ] link: host: 1 link: 0 is down
Oct 15 12:36:58 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 12:37:00 tethealla corosync[1045]: [KNET ] rx: host: 1 link: 0 is up
Oct 15 12:37:00 tethealla corosync[1045]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
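While the link is flapping like this I also check the knet link state and basic connectivity directly; again just a sketch of the checks I run, with a placeholder address for the other node:
Code:
# corosync's view of each link to the other nodes
corosync-cfgtool -s

# follow corosync messages live while the link flaps
journalctl -fu corosync

# basic reachability of the other node over the cluster network (placeholder IP)
ping -c 5 192.168.1.10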
This is the original source of my confusion: if I shut down the stray node, remove it from the cluster using the remaining node, reinstall it, rejoin it to the cluster, remove the quorum device and then re-add the quorum device, the cluster will work normally for exactly one week. I didn't note down all the timestamps, but it seems to be almost to the minute, as in 168 hours, 0 minutes and 0 seconds after the cluster was last fixed. The recovery sequence I go through each time is sketched below.
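For completeness, this is roughly that recovery sequence, written out as the pvecm commands I believe correspond to each step (the node name and IP addresses are placeholders, and the reinstall itself is done from the ISO in between):
Code:
# on the remaining healthy node, after shutting down the broken one
pvecm delnode nodeb

# on the freshly reinstalled node, join it back to the cluster
# (192.168.1.10 is a placeholder for the healthy node's IP)
pvecm add 192.168.1.10

# remove and re-add the QDevice from a cluster node
# (192.168.1.20 is a placeholder for the qdevice host)
pvecm qdevice remove
pvecm qdevice setup 192.168.1.20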