Hello,
I have an issue with my Proxmox cluster. I have set up a cluster with three servers - they form a quorum and my config looks like this:
Code:
nodelist {
  node {
    name: prox01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox01
  }
  node {
    name: prox03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: prox03
  }
  node {
    name: prox02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: prox02
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster1
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.7.4.11
    ringnumber: 0
  }
}
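For reference, these are the commands I have been running to check whether the nodes actually see each other and have quorum (as far as I know these are the standard Proxmox/corosync tools for this, so please tell me if there is something better to use):
Code:
# cluster and quorum state as Proxmox sees it
pvecm status

# quorum details straight from corosync
corosync-quorumtool -s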
Normally everything works quite well. I even tested it once: when one machine died, the VMs were migrated. However, this does not seem to be stable...
For example, today I had a small power outage that took the switch down (the switch is not on the UPS yet, but it will be added very soon). That caused the servers to lose quorum, and all of them just rebooted...
Well, I understand why they reboot, because they "think" it will fix the problem, but the time until they "force" a reboot seems quite short...
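If I understand the HA stack correctly, that reboot is the watchdog fencing the node once it has been without quorum for roughly a minute. I am not sure this is the right place to look, but this is what I have been checking around those reboots:
Code:
# the watchdog multiplexer the HA stack arms
systemctl status watchdog-mux

# HA view of nodes and services
ha-manager status

# messages from the HA services during the current boot
journalctl -b -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux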
Anyway, after everything was powered on again, the machines booted and formed a quorum. I would say it only took them 5-10 minutes, including boot time.
So that is OK-ish. However, they then tend to reboot at least once more before they are "stable" again.
If I look in the log file, I find, for example:
Code:
Dec 27 11:21:00 prox01 systemd[1]: Starting Proxmox VE replication runner...
Dec 27 11:21:00 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:01 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:02 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:03 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:03 prox01 ntpd[4328]: error resolving pool 1.debian.pool.ntp.org: Temporary failure in name resolution (-3)
Dec 27 11:21:04 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:05 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:06 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:07 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:08 prox01 pvesr[6906]: trying to acquire cfs lock 'file-replication_cfg' ...
Dec 27 11:21:09 prox01 pvesr[6906]: error with cfs lock 'file-replication_cfg': no quorum!
Dec 27 11:21:09 prox01 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Dec 27 11:21:09 prox01 systemd[1]: Failed to start Proxmox VE replication runner.
Dec 27 11:21:09 prox01 systemd[1]: pvesr.service: Unit entered failed state.
Well, OK - it cannot find its peers yet.
Looking further in the logs, it says:
Code:
corosync[4578]: notice [TOTEM ] A new membership (10.7.4.11:46100) was formed. Members joined: 2 3
Dec 27 11:24:40 prox01 corosync[4578]: [TOTEM ] A new membership (10.7.4.11:46100) was formed. Members joined: 2 3
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 corosync[4578]: warning [CPG ] downlist left_list: 0 received
Dec 27 11:24:40 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409, 2/4436
Dec 27 11:24:40 prox01 pmxcfs[4409]: [dcdb] notice: starting data syncronisation
Dec 27 11:24:40 prox01 corosync[4578]: [QUORUM] This node is within the primary component and will provide service.
Dec 27 11:24:40 prox01 corosync[4578]: notice [QUORUM] This node is within the primary component and will provide service.
Dec 27 11:24:40 prox01 corosync[4578]: notice [QUORUM] Members[3]: 1 2 3
Dec 27 11:24:40 prox01 corosync[4578]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 11:24:40 prox01 corosync[4578]: [QUORUM] Members[3]: 1 2 3
Dec 27 11:24:40 prox01 corosync[4578]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 11:24:41 prox01 pmxcfs[4409]: [dcdb] notice: cpg_send_message retried 1 times
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: node has quorum
Dec 27 11:24:41 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409, 2/4436, 3/12033
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: members: 1/4409, 2/4436
Dec 27 11:24:41 prox01 pmxcfs[4409]: [status] notice: starting data syncronisation
corosync[4578]: notice [TOTEM ] A new membership (10.7.4.11:46108) was formed. Members left: 3
Dec 27 11:24:54 prox01 corosync[4578]: notice [TOTEM ] Failed to receive the leave message. failed: 3
Dec 27 11:24:54 prox01 corosync[4578]: warning [CPG ] downlist left_list: 1 received
Dec 27 11:24:54 prox01 corosync[4578]: [TOTEM ] A new membership (10.7.4.11:46108) was formed. Members left: 3
Dec 27 11:24:54 prox01 corosync[4578]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 27 11:24:54 prox01 corosync[4578]: notice [QUORUM] Members[1]: 1
Dec 27 11:24:54 prox01 corosync[4578]: notice [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 11:24:54 prox01 corosync[4578]: [TOTEM ] Failed to receive the leave message. failed: 3
Dec 27 11:24:54 prox01 corosync[4578]: [CPG ] downlist left_list: 1 received
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409
Dec 27 11:24:54 prox01 pmxcfs[4409]: [status] notice: members: 1/4409
Dec 27 11:24:54 prox01 corosync[4578]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 27 11:24:54 prox01 corosync[4578]: [QUORUM] Members[1]: 1
Dec 27 11:24:54 prox01 corosync[4578]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 27 11:24:54 prox01 pmxcfs[4409]: [status] notice: node lost quorum
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] crit: received write while not quorate - trigger resync
Dec 27 11:24:54 prox01 pmxcfs[4409]: [dcdb] crit: leaving CPG group
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: start cluster connection
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: members: 1/4409
Dec 27 11:24:55 prox01 pmxcfs[4409]: [dcdb] notice: all data is up to date
Dec 27 11:24:55 prox01 pve-ha-crm[5024]: status change slave => wait_for_quorum
Dec 27 11:24:55 prox01 pve-ha-lrm[6275]: unable to write lrm status file - unable to open file '/etc/pve/nodes/prox01/lrm_status.tmp.6275' - Permission denied
Dec 27 11:24:58 prox01 pve-guests[6359]: cluster not ready - no quorum?
Dec 27 11:24:58 prox01 pvesh[6336]: cluster not ready - no quorum?
So why would it lose quorum only seconds after forming it? Yes, apparently because it could not write the file, but that doesn't make sense to me. At least after that, the server stayed stable - until now.
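Since node 3 drops out again only seconds after joining, I suspect the network side rather than the corosync config itself. My plan is to test the corosync links with the commands below; as far as I know omping is the recommended way to test multicast between the nodes (the hostnames are just my three nodes), but I would be happy to be corrected:
Code:
# ring/link status of corosync on each node
corosync-cfgtool -s

# multicast test - run this on all three nodes at the same time
omping -c 600 -i 1 -q prox01 prox02 prox03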
My questions now are:
* Is there something wrong with my config? I thought three nodes would be enough for a start (see the quorum math sketch after this list). I plan to add more nodes, but that is not my primary focus, since the servers run stable for days and weeks once they finally stay online, and I currently have bigger issues with disk read/write speed.
* What other logs should I check to find out why these issues are happening?
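Just to show my reasoning on the node count, this is the quorum math as I understand it (please correct me if I got it wrong):
Code:
# 3 nodes with 1 vote each
expected_votes = 3
quorum         = floor(3 / 2) + 1 = 2
# -> losing any single node is fine (2 >= 2),
#    but when the switch dies every node is alone (1 < 2) and gets fenced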