Quorum gets lost between servers and all of them reboot constantly

richinbg

Member
Oct 2, 2017
Hello,

I have an issue with my Proxmox cluster. I have set up a cluster with three servers; they form a quorum, and my corosync config looks like this:


Code:
nodelist {
  node {
    name: prox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox01
  }

  node {
    name: prox03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox03
  }

  node {
    name: prox02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: prox02
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.7.4.11
    ringnumber: 0
  }

}

Normally everything works quite well. I even tested it once: when one machine died, the VMs were migrated. HOWEVER, this does not seem to be stable...

For example, today I had a small power outage that took the switch down (the switch has not been put on the UPS yet, but that will happen very soon). That caused the servers to lose quorum, and all of them just rebooted...

Well, I understand why they reboot, because they "think" it will fix the problem, though the time until they force the reboot seems quite short...

Anyway, after everything was connected to power again, the machines booted and formed a quorum. That took maybe 5-10 minutes, including boot time.
So that is OK-ish; HOWEVER, they then tend to reboot at least once more before they are "stable" again.

If I look in the log file, I can find for example
Code:
Dec 27 11:21:00 prox01 systemd[1]: Starting Proxmox VE replication runner...
Dec 27 11:21:00 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:01 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:02 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:03 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:03 prox01 ntpd[4328]: error resolving pool 1.debian.pool.ntp.org: Temporary failure in name resolution (-3)
Dec 27 11:21:04 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:05 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:06 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:07 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:08 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:09 prox01 pvesr[6906]: error with cfs lock 'file-replication_cfg': no quorum!
Dec 27 11:21:09 prox01 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Dec 27 11:21:09 prox01 systemd[1]: Failed to start Proxmox VE replication runner.
Dec 27 11:21:09 prox01 systemd[1]: pvesr.service: Unit entered failed state.

Well, OK - it cannot find its peers yet.

Looking further in the logs it says:

Code:
corosync[4578]: notice  [TOTEM ] A new membership (10.7.4.11:46100) was formed. Members joined: 2 3
Dec 27 11:24:40 prox01 corosync[4578]:  [TOTEM ] A new membership (10.7.4.11:46100) was formed. Members joined: 2 3
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]:  [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]:  [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]:  [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG   ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409, 2/4436
Dec 27 11:24:40 prox01 pmxcfs[4409]: [dcdb] notice: starting data syncronisation
Dec 27 11:24:40 prox01 corosync[4578]:  [QUORUM] This node is within the primary component and will provide service.
Dec 27 11:24:40 prox01 corosync[4578]: notice  [QUORUM] This node is within the primary component and will provide service.
Dec 27 11:24:40 prox01 corosync[4578]: notice  [QUORUM] Members[3]: 1 2 3
Dec 27 11:24:40 prox01 corosync[4578]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 11:24:40 prox01 corosync[4578]:  [QUORUM] Members[3]: 1 2 3
Dec 27 11:24:40 prox01 corosync[4578]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 11:24:41 prox01 pmxcfs[4409]: [dcdb] notice: cpg_send_message retried 1 times
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: node has quorum
Dec 27 11:24:41 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409, 2/4436, 3/12033
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: members: 1/4409, 2/4436
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: starting data syncronisation
 corosync[4578]: notice  [TOTEM ] A new membership (10.7.4.11:46108) was formed. Members left: 3
Dec 27 11:24:54 prox01 corosync[4578]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
Dec 27 11:24:54 prox01 corosync[4578]: warning [CPG   ] downlist left_list: 1 received
Dec 27 11:24:54 prox01 corosync[4578]:  [TOTEM ] A new membership (10.7.4.11:46108) was formed. Members left: 3
Dec 27 11:24:54 prox01 corosync[4578]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 27 11:24:54 prox01 corosync[4578]: notice  [QUORUM] Members[1]: 1
Dec 27 11:24:54 prox01 corosync[4578]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 11:24:54 prox01 corosync[4578]:  [TOTEM ] Failed to receive the leave message. failed: 3
Dec 27 11:24:54 prox01 corosync[4578]:  [CPG   ] downlist left_list: 1 received
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409
Dec 27 11:24:54 prox01 pmxcfs[4409]: [status] notice: members: 1/4409
Dec 27 11:24:54 prox01 corosync[4578]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 27 11:24:54 prox01 corosync[4578]:  [QUORUM] Members[1]: 1
Dec 27 11:24:54 prox01 corosync[4578]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 27 11:24:54 prox01 pmxcfs[4409]: [status] notice: node lost quorum
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] crit: received write while not quorate - trigger resync
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] crit: leaving CPG group
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: start cluster connection
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: all data is up to date
Dec 27 11:24:55 prox01 pve-ha-crm[5024]: status change slave => wait_for_quorum
Dec 27 11:24:55 prox01 pve-ha-lrm[6275]: unable to write lrm status file - unable to open file '/etc/pve/nodes/prox01/lrm_status.tmp.6275' - Permission denied
Dec 27 11:24:58 prox01 pve-guests[6359]: cluster not ready - no quorum?
Dec 27 11:24:58 prox01 pvesh[6336]: cluster not ready - no quorum?

So why would it lose quorum just seconds after forming it? Yes, it could not write the status file, but that doesn't explain it to me. Also, that time the server then stayed stable - until now.

My questions now are:

* Is there something wrong with my config? I thought three nodes would be enough for a start. I plan on adding more, but once the servers do finally stay online, they run stably for days and weeks, so this is not my primary focus (I also have issues with disk read/write speed).
* What other logs should I check to find the cause of these issues?
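For reference, the quorum state and the relevant logs can be inspected with the standard Proxmox/corosync/systemd tools. A sketch of what is worth capturing on each node the next time quorum drops (run as root; guarded so it is also harmless on a machine where a tool is missing):

```shell
# Cluster-state and log commands worth capturing on each node when quorum drops.
for cmd in "pvecm status" \
           "corosync-quorumtool -s" \
           "journalctl -b -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm"; do
    tool=${cmd%% *}                      # first word is the binary name
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $cmd =="
        $cmd || true                     # keep going even if one command fails
    else
        echo "== $cmd (skipped: $tool not installed) =="
    fi
done
```

`journalctl -b` restricts the output to the current boot, which makes it easier to line up the corosync membership changes with the pve-ha-lrm/crm state changes.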
 
Do you have HA enabled? (Servers should only reboot if they lose quorum and HA is enabled.)

For HA you need a redundant network (dual NICs, dual switches).

Another possibility is that multicast is not stable on your network.

But servers shouldn't reboot if HA is not enabled.
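For the redundant network, corosync 2.x supports a second ring (RRP). A sketch of the additional config, assuming a hypothetical second subnet 10.7.5.0/24 on the second NIC/switch (remember to bump config_version and restart corosync on all nodes):

Code:
totem {
  ...
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.7.4.11
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.7.5.0
  }
}

nodelist {
  node {
    name: prox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox01
    ring1_addr: 10.7.5.11
  }
  ... # likewise ring1_addr for prox02 and prox03
}

With rrp_mode: passive, corosync fails over to ring 1 if ring 0 stops delivering packets, so a single switch outage no longer costs you quorum.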
 
Thanks for your answer.

Indeed, I have an HA group configured. I was hoping that if one host goes down, the VMs migrate to the hosts that are still alive.
Should HA be turned off then?

Regarding multicast: whenever I do some iperf testing while everything is working fine, well, it works fine :D Any ideas on how I could prove that multicast is (not) working would be welcome, thank you.
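For what it's worth, the Proxmox wiki suggests omping rather than iperf here, since iperf measures unicast throughput while corosync depends on multicast delivery. A sketch using this cluster's hostnames (needs the omping package on all nodes, and the same command must run on every node at the same time; guarded so it is a no-op without omping):

```shell
# Multicast test as suggested on the Proxmox wiki; run simultaneously on all nodes.
NODES="prox01 prox02 prox03"

if command -v omping >/dev/null 2>&1; then
    # ~10 s burst test: any packet loss points to a multicast problem
    omping -c 10000 -i 0.001 -F -q $NODES
    # ~10 min test: catches IGMP snooping querier timeouts (often ~5 min)
    omping -c 600 -i 1 -q $NODES
else
    echo "omping not installed (apt install omping)"
fi
```

The long second run matters: IGMP snooping on many switches drops multicast group membership after a few minutes, which would match a cluster that forms quorum and then loses it again shortly afterwards.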
 
