Chris Rivera
Guest
This is more of a recurring issue than anything.
While Proxmox is for the most part a great solution, we have to teach our staff to regularly restart the cman and pve-cluster services, because Proxmox loses sync (nodes turn red) in the web interface at least 2 times a day.
I noticed that the cman service is very sensitive to network traffic on the switch that all the nodes are connected to. We currently have Proxmox plugged directly into our colocation switch instead of a private switch dedicated to the cluster, so moving to a private switch may be a way for us to fix the issue on the hardware side.
While the Proxmox web interface shows the nodes as offline (red), these servers are always online and accessible via SSH, with VMs running and no connectivity problems.
In this thread I am asking 2 things:
- How to solve my issue and get my cloud back online (ASAP)
- What type of better solutions can we configure to get this to just work? (THOUGHTS/IDEAS)
#1. How to solve my issue. January 1, Happy New Year: I got woken up because the cloud is completely out of sync and we cannot provision any new orders. Normally when this happens we tell our staff to run:
- service cman restart
- service pve-cluster restart
Restarting these 2 services fixes the issue 99% of the time, but that is not the case today.
If I run clustat, I get output that shows the cluster is quorate and all nodes are online:
#############################
Cluster Status for FL-Cluster @ Tue Jan 1 10:07:30 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online
proxmox3 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online
proxmox9 9 Online
Proxmox10 10 Online
proxmox1a 11 Online, Local
#############################
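A minimal sketch of how that clustat check could be scripted, in Python. It assumes the text layout pasted above (a "Member Status:" line followed by whitespace-separated member rows). The names parse_clustat and offline_members are made up for illustration, and the sample is a shortened copy of the output above.

```python
import re

# Shortened copy of the clustat output above (three member rows for brevity).
SAMPLE = """\
Cluster Status for FL-Cluster @ Tue Jan 1 10:07:30 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
Proxmox10 10 Online
proxmox1a 11 Online, Local
"""

# A member row: node name, numeric node ID, then a comma-separated status list.
MEMBER_ROW = re.compile(r"^(\S+)\s+(\d+)\s+(.+)$")


def parse_clustat(text):
    """Return (quorate, {node_name: [status flags]}) from clustat text output."""
    quorate = False
    members = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Member Status:"):
            quorate = line.split(":", 1)[1].strip() == "Quorate"
            continue
        match = MEMBER_ROW.match(line)
        if match:
            name, _node_id, status = match.groups()
            members[name] = [flag.strip() for flag in status.split(",")]
    return quorate, members


def offline_members(members):
    """Names of members whose status list does not include 'Online'."""
    return sorted(name for name, flags in members.items() if "Online" not in flags)


quorate, members = parse_clustat(SAMPLE)
print("quorate:", quorate)
print("offline:", offline_members(members))
```

In practice the text would come from running clustat on a node rather than from a pasted sample.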
Yet the Proxmox web interface disagrees. It shows all nodes offline except for the local node you are connected to. Taking a look at the syslogs, I have 14,000 lines of:
proxmox1a pmxcfs[212951]: [status] crit: cpg_send_message failed: 9
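With that many repeats it helps to group them instead of scrolling. A rough sketch, assuming every failure line has the same shape as the one quoted above; count_cpg_failures is a made-up helper name, and the sample is that single line repeated to stand in for the real log.

```python
import re
from collections import Counter

# The one line quoted above; a real run would read /var/log/syslog instead.
SAMPLE = [
    "proxmox1a pmxcfs[212951]: [status] crit: cpg_send_message failed: 9",
]

FAILURE = re.compile(
    r"pmxcfs\[(\d+)\]: \[(\w+)\] crit: cpg_send_message failed: (\d+)"
)


def count_cpg_failures(lines):
    """Count failures grouped by (pmxcfs PID, subsystem, error code)."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            pid, subsystem, code = match.groups()
            counts[(int(pid), subsystem, int(code))] += 1
    return counts


# Repeat the sample line three times to stand in for the flood in the log.
for key, total in count_cpg_failures(SAMPLE * 3).items():
    print(key, total)
```

The grouping shows at a glance whether the errors all come from one pmxcfs process and one error code, or from several.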
#############################
Proxmox syslogs: grep corosync /var/log/syslog
Jan 1 09:55:54 Proxmox10 corosync[97803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais cluster membership service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[8]: 2 3 4 5 6 7 9 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[8]: 2 3 4 5 6 7 9 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais event service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais distributed locking service B.03.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais message service B.03.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais timer service A.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Corosync built-in features: nss
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully parsed cman config
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully configured openais services to load
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] The network interface [*.*.*.175] is now up.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [QUORUM] Using quorum provider quorum_cman
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 1 09:55:55 Proxmox10 corosync[577803]: [CMAN ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais event service B.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais message service B.03.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais timer service A.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync configuration service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync profile loading service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Using quorum provider quorum_cman
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[1]: 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[1]: 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CPG ] chosen downlist: sender r(0) ip(*.*.*.175) ; members(old:0 left:0)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.142)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.153)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.154)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.155)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.156)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.157)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.158)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.159)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.160)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.161)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.142)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.153)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.154)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.155)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.156)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.157)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.158)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.159)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.160)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.161)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[2]: 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[2]: 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[3]: 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[3]: 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[4]: 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[4]: 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[5]: 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[5]: 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CMAN ] quorum regained, resuming activity
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] This node is within the primary component and will provide service.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[6]: 5 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[6]: 5 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[7]: 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[7]: 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[8]: 2 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[8]: 2 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[9]: 2 3 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[9]: 2 3 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CPG ] chosen downlist: sender r(0) ip(*.*.*.142) ; members(old:10 left:0)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Completed service synchronization, ready to provide service.
#############################
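The [QUORUM] Members lines are the quickest way to see which node IDs corosync could see at each step. For example, the last membership before the restart lists 9 members, with IDs 1 and 8 absent. A small sketch of pulling that out of the log, assuming the line format shown above; membership_history and missing_nodes are made-up names, and the sample is three lines copied from the log.

```python
import re

# Three lines copied from the corosync log above.
SAMPLE = [
    "Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11",
    "Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[1]: 10",
    "Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11",
]

MEMBERS = re.compile(
    r"corosync\[(\d+)\]: \[QUORUM\s*\] Members\[(\d+)\]: ([\d ]+)$"
)


def membership_history(lines):
    """Yield (corosync PID, set of node IDs) for each [QUORUM] Members line."""
    for line in lines:
        match = MEMBERS.search(line.rstrip())
        if match:
            pid, _count, ids = match.groups()
            yield int(pid), {int(node_id) for node_id in ids.split()}


def missing_nodes(expected, members):
    """Node IDs in the expected set that are absent from a membership."""
    return sorted(expected - members)


ALL_NODES = set(range(1, 12))  # node IDs 1..11, as listed by clustat
for pid, members in membership_history(SAMPLE):
    print(pid, "missing:", missing_nodes(ALL_NODES, members))
```

Running the same parse over the corosync lines from every node would show whether they all agree on the final 11-member view.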
All nodes are running the same kernel (the 2.6.32-16-pve update).
pveversion -v output:
pve-manager: 2.2-31 (pve-manager/2.2/e94e95e9)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-82
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-82
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-33
qemu-server: 2.0-69
pve-firmware: 1.0-21
libpve-common-perl: 1.0-39
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-36
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
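To double-check that every node really is on the same versions, the pveversion -v output from two nodes can be compared. A sketch under assumptions: parse_pveversion and version_diff are made-up names, the first sample is copied from the output above, and the second sample is invented purely to show what a mismatch would look like.

```python
def parse_pveversion(text):
    """Map package name -> version string from `pveversion -v` output."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.partition(":")
        if sep:
            versions[name.strip()] = version.strip()
    return versions


def version_diff(reference, other):
    """Return {package: (reference version, other version)} where they differ."""
    packages = set(reference) | set(other)
    return {
        pkg: (reference.get(pkg), other.get(pkg))
        for pkg in sorted(packages)
        if reference.get(pkg) != other.get(pkg)
    }


# Lines copied from the pveversion -v output above.
NODE_A = """\
running kernel: 2.6.32-16-pve
corosync-pve: 1.4.4-1
pve-cluster: 1.0-33
"""
# HYPOTHETICAL second node, still booted into the older kernel.
NODE_B = """\
running kernel: 2.6.32-11-pve
corosync-pve: 1.4.4-1
pve-cluster: 1.0-33
"""

print(version_diff(parse_pveversion(NODE_A), parse_pveversion(NODE_B)))
```

An empty result means the two nodes report identical versions for every package listed.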
#2. What type of better solutions can we configure to get this to just work?
These are just my thoughts and ideas, thinking out loud to see whether any of this is workable or whether some kind of solution can come of it:
- If nodes are online and accessible via SSH, and clustat shows them online, but Proxmox considers them offline, shouldn't we have a self-healing script for the cloud that automatically runs the cman / pve-cluster restart? If it fails too many times, it could send an email to the cluster administrator.
- This wouldn't work in my scenario today, since restarting cman and pve-cluster is not working. But 99% of the time, restarting both of those services on offline, no-quorum nodes gets them to rejoin the cluster with no problem (staff do this at least 2 times a day).
- cman is the specific service you're using to keep the cluster in sync and together. Are there other ways of going about this, or a way to add a second mechanism, so that the cluster remains in sync in case of cman failure?
- Nodes are online and accessible via SSH, so we can access them for whatever is needed (read / update / modify / self-repair).
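To make the self-healing idea concrete, here is a rough sketch of the retry-then-notify loop. It is not an existing Proxmox feature. The function name heal and the is_healthy callback are hypothetical: how to detect the "red" state from a script is exactly the open question. The notify hook stands in for sending the email, and only the two restart commands come from this post.

```python
import subprocess
import time

# The two restart commands from earlier in this post, in the same order.
RESTART_COMMANDS = [
    ["service", "cman", "restart"],
    ["service", "pve-cluster", "restart"],
]


def heal(is_healthy, run=subprocess.call, notify=print,
         max_attempts=3, wait_seconds=30, sleep=time.sleep):
    """Restart the cluster services until is_healthy() returns True.

    is_healthy is a caller-supplied check (hypothetical here). Returns True
    once healthy; after max_attempts failed tries it calls notify() with a
    message and returns False.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        for command in RESTART_COMMANDS:
            run(command)
        sleep(wait_seconds)  # give the node time to rejoin before re-checking
    if is_healthy():
        return True
    notify("cluster still unhealthy after %d restart attempts" % max_attempts)
    return False
```

Passing run, notify and sleep in as parameters keeps the loop testable without touching real services. A cron job could call heal() with a real health check and a notify function that emails the administrator.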