Cloud completely offline / cman not working?


Chris Rivera

Guest
This is more of a recurring issue than anything.

While Proxmox is for the most part a great solution, I can say that we teach our staff to regularly restart the cman and pve-cluster services, because Proxmox loses sync (turns red) in the web interface at least twice a day.

I've noticed that the cman service is very sensitive to network traffic on the switch that all the nodes are connected to. We currently have the Proxmox nodes plugged directly into our colocation switch instead of a private switch dedicated to the cluster, which may be a way for us to fix the issue on the hardware side.
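One way to test that multicast suspicion before investing in a dedicated switch: corosync/cman rely on UDP multicast, and omping (available in the Debian repos, and commonly recommended for Proxmox clusters) can measure multicast packet loss between nodes. This is only an illustrative invocation; run it simultaneously on each node, listing the other nodes' hostnames:

```shell
# Run on every node at the same time, listing the peer nodes.
# Sustained multicast loss here points at IGMP snooping / multicast
# problems on the shared colocation switch.
omping -c 600 -i 1 -q proxmox2 proxmox3 proxmox4 proxmox11
```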

While the Proxmox web interface shows the nodes as offline (red), these servers are always up and accessible via SSH, with VMs running and no connectivity problems.

In this thread I am asking two things:
  1. How to solve my issue and get my cloud back online (ASAP)
  2. What better solutions can we configure to make this just work? (thoughts/ideas)



#1. How to solve my issue. January 1: Happy New Year, and I get woken up because the cloud is completely out of sync and we cannot provision any new orders. Normally when this happens we tell our staff to run:
  1. service cman restart
  2. service pve-cluster restart

Restarting these two services fixes the issue 99% of the time, but not today.

If I run the clustat command, the output shows that the cluster has quorum and all nodes are online:


#############################

Cluster Status for FL-Cluster @ Tue Jan 1 10:07:30 2013
Member Status: Quorate


Member Name        ID   Status
------ ----        ---- ------
proxmox11          1    Online
proxmox2           2    Online
proxmox3           3    Online
proxmox4           4    Online
poxmox5            5    Online
proxmox6           6    Online
proxmox7           7    Online
proxmox8           8    Online
proxmox9           9    Online
Proxmox10          10   Online
proxmox1a          11   Online, Local



#############################

Yet the Proxmox web interface disagrees: it shows all nodes offline except the local node you are connected to. Taking a look at the syslogs, I have 14,000 lines of:

proxmox1a pmxcfs[212951]: [status] crit: cpg_send_message failed: 9

#############################

Proxmox syslogs: grep corosync /var/log/syslog
Jan 1 09:55:54 Proxmox10 corosync[97803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais cluster membership service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[8]: 2 3 4 5 6 7 9 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[8]: 2 3 4 5 6 7 9 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais event service B.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais distributed locking service B.03.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais message service B.03.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [QUORUM] Members[9]: 2 3 4 5 6 7 9 10 11
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 1 09:55:54 Proxmox10 corosync[97803]: [SERV ] Service engine unloaded: openais timer service A.01.01
Jan 1 09:55:54 Proxmox10 corosync[97803]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1856.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Corosync built-in features: nss
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully parsed cman config
Jan 1 09:55:55 Proxmox10 corosync[577803]: [MAIN ] Successfully configured openais services to load
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 1 09:55:55 Proxmox10 corosync[577803]: [TOTEM ] The network interface [*.*.*.175] is now up.
Jan 1 09:55:55 Proxmox10 corosync[577803]: [QUORUM] Using quorum provider quorum_cman
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 1 09:55:55 Proxmox10 corosync[577803]: [CMAN ] CMAN 1352871249 (built Nov 14 2012 06:34:12) started
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Jan 1 09:55:55 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais event service B.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais message service B.03.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: openais timer service A.01.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync configuration service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync profile loading service
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Using quorum provider quorum_cman
Jan 1 09:55:56 Proxmox10 corosync[577803]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[1]: 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[1]: 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CPG ] chosen downlist: sender r(0) ip(*.*.*.175) ; members(old:0 left:0)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] CLM CONFIGURATION CHANGE
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] New Configuration:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.142)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.153)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.154)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.155)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.156)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.157)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.158)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.159)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.160)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.161)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.175)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Left:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] Members Joined:
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.142)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.153)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.154)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.155)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.156)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.157)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.158)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.159)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.160)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CLM ] #011r(0) ip(*.*.*.161)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[2]: 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[2]: 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[3]: 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[3]: 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[4]: 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[4]: 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[5]: 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[5]: 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CMAN ] quorum regained, resuming activity
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] This node is within the primary component and will provide service.
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[6]: 5 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[6]: 5 6 7 8 9 10
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[7]: 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[7]: 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[8]: 2 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[8]: 2 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[9]: 2 3 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[9]: 2 3 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[10]: 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Jan 1 09:55:56 Proxmox10 corosync[577803]: [CPG ] chosen downlist: sender r(0) ip(*.*.*.142) ; members(old:10 left:0)
Jan 1 09:55:56 Proxmox10 corosync[577803]: [MAIN ] Completed service synchronization, ready to provide service.


#############################

All nodes are running the same kernel (the .16 update).

pveversion -v output:

pve-manager: 2.2-31 (pve-manager/2.2/e94e95e9)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-82
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-82
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-33
qemu-server: 2.0-69
pve-firmware: 1.0-21
libpve-common-perl: 1.0-39
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-36
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1



#2. What better solutions can we configure to make this just work?

These are just my thoughts and ideas; I'm thinking out loud to see whether something workable can come of this.



  • If nodes are online and accessible via SSH, and clustat shows them online, but Proxmox considers them offline, shouldn't we have a cloud self-healing script that automatically restarts cman / pve-cluster, and emails the cluster administrator if it fails too many times?
    • This wouldn't work in my current scenario, since restarting cman and pve-cluster is not helping... but 99% of the time restarting both of those services on offline / no-quorum nodes gets them to rejoin the cluster with no problem (our staff does this at least twice a day).
  • cman is the specific service you're using to keep the cluster in sync and together... are there other ways of going about this, or could a second mechanism be added so the cluster stays in sync in case of cman failure?
    • The nodes are online and accessible via SSH, so we can reach them for whatever is needed (read / update / modify / self-repair).
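A self-healing watchdog along those lines is easy to sketch. To be clear, this is a hypothetical sketch, not an official Proxmox tool: heal(), MAX_TRIES, RETRY_DELAY and ADMIN_MAIL are names I made up for illustration, and the alert path assumes a working local `mail` command:

```shell
#!/bin/sh
# Hypothetical self-healing sketch (not part of Proxmox).
# heal() retries a restart command up to MAX_TRIES times, then mails
# the cluster administrator instead of looping forever.
MAX_TRIES=${MAX_TRIES:-3}
RETRY_DELAY=${RETRY_DELAY:-10}
ADMIN_MAIL=${ADMIN_MAIL:-root}

heal() {
    # $@ is the command to retry, e.g.: heal service cman restart
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if "$@"; then
            return 0            # command succeeded, nothing more to do
        fi
        tries=$((tries + 1))
        sleep "$RETRY_DELAY"
    done
    # all retries failed: alert a human (assumes a local MTA is set up)
    if command -v mail >/dev/null 2>&1; then
        echo "self-heal gave up on: $*" | mail -s "cluster alert" "$ADMIN_MAIL"
    fi
    return 1
}
```

Cron could then call something like `heal service cman restart` followed by `heal service pve-cluster restart` whenever pmxcfs reports out of sync; as noted above, in this particular outage the restarts themselves are failing, so the alert path matters as much as the retry path.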
 
More reports from syslog:


Jan 1 11:48:09 Proxmox10 kernel: __ratelimit: 1102 callbacks suppressed
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.
Jan 1 11:48:09 Proxmox10 kernel: nf_conntrack: table full, dropping packet.


I was reading some things about nf_conntrack, but only found a bug report for CentOS. I am using the original ISO operating system (Debian), updated to the pve .16 kernel.

I read up on increasing the nf_conntrack limit and came across:

http://antmeetspenguin.blogspot.com/2011/01/high-performance-linux-router.html

which suggests running:

/sbin/sysctl -w net.netfilter.nf_conntrack_max=196608               # double the original value
echo "net.netfilter.nf_conntrack_max = 196608" >> /etc/sysctl.conf  # persist across reboots (note: append with >>; a single > would overwrite the whole file)
echo 24576 > /sys/module/nf_conntrack/parameters/hashsize           # increase the hash table size
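Before doubling the limit, it may be worth confirming the table really is the bottleneck. Assuming the nf_conntrack module is loaded, the current usage and ceiling can be read back with sysctl:

```shell
# compare the number of currently tracked connections against the ceiling;
# "table full, dropping packet" means count has hit max
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
```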
 
While the Proxmox web interface shows the nodes as offline (red), these servers are always up and accessible via SSH, with VMs running and no connectivity problems.

That looks like a bug we already solved in qemu-server_2.0-64 - are you sure this still happens?

If so, does it help if you restart pvestatd?

# service pvestatd restart
 


I am sure; it's never been this bad before (on kernel .14). I never had all nodes go offline unless there was a network disturbance or DDoS attack.

Restarting the pvestatd service alone did not bring the nodes back, but after restarting cman and pve-cluster following the pvestatd restart, all nodes became available again in the web interface.

Thanks
 
Everything is offline again.

Node 1, when logged in, normally shows itself green, but now it shows itself as red, as do the rest of the nodes.

Restarting all of the services found in the node > Services tab does not help.

service pvestatd restart doesn't help.

Restarting all services after the pvestatd restart doesn't help either.

####

service cman restart fails

root@proxmox1a:~# service cman restart
Stopping cluster:
Stopping dlm_controld...
[FAILED]


####

At this point I can't even log into the Proxmox interface, because the realm drop-down is not getting prefilled from any of my nodes.

I think this should work a little differently, and faster... I don't see why we need to load the full JavaScript interface before the user has logged in; it just slows down the process. Also, why is the realm loaded via an async call after the form is rendered, instead of together with the form? That also slows down logging in.

current process to log in:
1. receive HTML
2. process CSS
3. process JavaScript
4. process DOM
5. load the empty Proxmox interface
6. load the Proxmox login box
7. make a sync call for the realm drop-down

This is kind of ridiculous just to log in. I think the login page should be static HTML: fastest to load, with no JavaScript dependencies or CPU-intensive rendering... then, once authenticated, load the JavaScript-heavy Proxmox interface.

This might not be an issue with smaller clusters or fewer VMs, but at our size the Proxmox web interface is super slow, with a delay on every click and scroll.

11 nodes, 600-700 VMs, with some nodes holding over 100 VMs... it looks beautiful, but it definitely needs major performance tuning to handle large clusters / datacenters.

###
 
Jan 4 15:56:30 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23190
Jan 4 15:56:31 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 100
Jan 4 15:56:31 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retried 100 times
Jan 4 15:56:31 proxmox1a pmxcfs[55713]: [status] crit: cpg_send_message failed: 6
Jan 4 15:56:31 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23200
Jan 4 15:56:32 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 10
Jan 4 15:56:32 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23210
Jan 4 15:56:33 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 20
Jan 4 15:56:33 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23220
Jan 4 15:56:34 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 30
Jan 4 15:56:34 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23230
Jan 4 15:56:35 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 40
Jan 4 15:56:35 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23240
Jan 4 15:56:36 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 50
Jan 4 15:56:36 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23250
Jan 4 15:56:37 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 60
Jan 4 15:56:37 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23260
Jan 4 15:56:38 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 70
Jan 4 15:56:38 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23270
Jan 4 15:56:39 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 80
Jan 4 15:56:39 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23280
Jan 4 15:56:40 proxmox1a dlm_controld[38904]: daemon cpg_leave error retrying
Jan 4 15:56:40 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 90
Jan 4 15:56:40 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 23290
Jan 4 15:56:41 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 100
Jan 4 15:56:41 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retried 100 times
Jan 4 15:56:41 proxmox1a pmxcfs[55713]: [status] crit: cpg_send_message failed: 6
Jan 4 15:56:41 proxmox1a pvedaemon[55193]: <root@pam> successful auth for user 'root@pam'
Jan 4 15:56:41 proxmox1a pvestatd[54593]: status update time (320.617 seconds)
[... the same cycle repeats every ~10 seconds for pages of log: pmxcfs counts cpg_send_message retry 10 through 100, logs "cpg_send_message retried 100 times" and "crit: cpg_send_message failed: 6", the cpg_join retry counter keeps climbing past 24000, and dlm_controld logs "daemon cpg_leave error retrying" on every cycle ...]
Jan 4 15:57:56 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24050
Jan 4 15:57:57 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 60
Jan 4 15:57:57 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24060
Jan 4 15:57:58 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 70
Jan 4 15:57:58 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24070
Jan 4 15:57:59 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 80
Jan 4 15:57:59 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24080
Jan 4 15:58:00 proxmox1a dlm_controld[38904]: daemon cpg_leave error retrying
Jan 4 15:58:00 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 90
Jan 4 15:58:00 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24090
Jan 4 15:58:01 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 100
Jan 4 15:58:01 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retried 100 times
Jan 4 15:58:01 proxmox1a pmxcfs[55713]: [status] crit: cpg_send_message failed: 6
Jan 4 15:58:01 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24100
Jan 4 15:58:02 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 10
Jan 4 15:58:02 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24110
Jan 4 15:58:03 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 20
Jan 4 15:58:03 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24120
Jan 4 15:58:04 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 30
Jan 4 15:58:04 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24130
Jan 4 15:58:05 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 40
Jan 4 15:58:05 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24140
Jan 4 15:58:06 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 50
Jan 4 15:58:06 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24150
Jan 4 15:58:07 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 60
Jan 4 15:58:07 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24160
Jan 4 15:58:08 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 70
Jan 4 15:58:08 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24170
Jan 4 15:58:09 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 80
Jan 4 15:58:09 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24180
Jan 4 15:58:10 proxmox1a dlm_controld[38904]: daemon cpg_leave error retrying
Jan 4 15:58:10 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 90
Jan 4 15:58:10 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24190
Jan 4 15:58:11 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 100
Jan 4 15:58:11 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retried 100 times
Jan 4 15:58:11 proxmox1a pmxcfs[55713]: [status] crit: cpg_send_message failed: 6
Jan 4 15:58:11 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24200
Jan 4 15:58:12 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 10
Jan 4 15:58:12 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24210
Jan 4 15:58:13 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 20
Jan 4 15:58:13 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24220
Jan 4 15:58:14 proxmox1a pmxcfs[55713]: [status] notice: cpg_send_message retry 30
Jan 4 15:58:14 proxmox1a pmxcfs[55713]: [status] notice: cpg_join retry 24230
 
...

11 nodes, 600-700 VMs, with some nodes running over 100 VMs each... it looks beautiful, but it definitely needs serious performance tuning to work with large clusters / datacenters.
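For what it's worth, one tuning that is often suggested for larger clusters on this stack (PVE 2.x, cman/corosync 1.x) is raising the totem token timeout in /etc/pve/cluster.conf so that brief network congestion does not break totem membership. This is only a sketch; the value and config_version below are illustrative, not a drop-in config:

```xml
<!-- /etc/pve/cluster.conf (fragment) - illustrative, not a drop-in config -->
<cluster name="FL-Cluster" config_version="NN">
  <!-- Raise the totem token timeout (ms) so short bursts of switch
       congestion do not immediately cost totem membership. -->
  <totem token="54000"/>
  <!-- ... rest of the existing configuration unchanged ... -->
</cluster>
```

The usual procedure is to edit a copy (cluster.conf.new), increment config_version, check it with `ccs_config_validate`, and then activate it; whether it helps depends on why totem is losing membership in the first place.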

###

For such big deployments I highly recommend a commercial support contract. Our support team can log in remotely and work with you on the issue.
 
If this were up to me, I would already have an active subscription.

I kind of put myself in the position of managing this cloud when I brought your product into the company I work for. While it works excellently, we have daily issues with cman and pve-cluster that I would like to solve, or at least to understand why these services need to be restarted so frequently.

All I can do for now is look for solutions on this forum, which has been a great help so far.


####

As an update to my original post: I am able to get into the web interface again, since the realm drop-down value is present now. None of the nodes show as online; even the node you are connected to shows red.

Restarting the services no longer fails, but it does not improve anything.
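For reference, the restart order we use when a node goes red is: membership layer first (cman), then pmxcfs (pve-cluster), then the daemons that read /etc/pve. A sketch of that sequence as a small helper; the service names are the PVE 2.x ones, and by default it only prints the commands (set APPLY=1 to actually run them on a node), since blindly restarting services on a production host deserves a review step:

```shell
#!/bin/sh
# Sketch of the restart sequence that usually recovers a red node on the
# PVE 2.x stack. With APPLY unset it only prints what it would do.
restart_stack() {
    # cman first (corosync/totem membership), then pmxcfs (pve-cluster),
    # then the daemons that talk to /etc/pve and need a healthy pmxcfs.
    for svc in cman pve-cluster pvedaemon pvestatd; do
        if [ "${APPLY:-0}" = "1" ]; then
            service "$svc" restart
        else
            echo "would run: service $svc restart"
        fi
    done
}

restart_stack
```

As the logs above show, though, restarting in any order cannot help while corosync itself cannot form a cluster.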

####


Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:15 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48290
Jan 4 17:35:16 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48300
Jan 4 17:35:17 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48310
Jan 4 17:35:18 proxmox1a pvedaemon[131005]: re-starting service corosync: UPID:proxmox1a:0001FFBD:0011C7CB:50E75926:srvrestart:corosync:root@pam:
Jan 4 17:35:18 proxmox1a pvedaemon[55191]: <root@pam> starting task UPID:proxmox1a:0001FFBD:0011C7CB:50E75926:srvrestart:corosync:root@pam:
Jan 4 17:35:18 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:18 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:18 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:18 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:18 proxmox1a dlm_controld[38904]: daemon cpg_leave error retrying
Jan 4 17:35:18 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48320
Jan 4 17:35:19 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48330
Jan 4 17:35:20 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48340
Jan 4 17:35:21 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48350
Jan 4 17:35:22 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48360
Jan 4 17:35:23 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48370
Jan 4 17:35:24 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48380
Jan 4 17:35:25 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
[the line above repeats ~60 more times within the same second]
Jan 4 17:35:25 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48390
Jan 4 17:35:26 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48400
Jan 4 17:35:27 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48410
Jan 4 17:35:28 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:28 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:28 proxmox1a pvedaemon[55191]: <root@pam> end task UPID:proxmox1a:0001FFBD:0011C7CB:50E75926:srvrestart:corosync:root@pam: OK
Jan 4 17:35:28 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:28 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:28 proxmox1a dlm_controld[38904]: daemon cpg_leave error retrying
Jan 4 17:35:28 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48420
Jan 4 17:35:29 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48430
Jan 4 17:35:30 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48440
Jan 4 17:35:31 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48450
Jan 4 17:35:32 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48460
Jan 4 17:35:33 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48470
Jan 4 17:35:34 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48480
Jan 4 17:35:35 proxmox1a pmxcfs[66904]: [dcdb] notice: cpg_join retry 48490
Jan 4 17:35:35 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
Jan 4 17:35:36 proxmox1a pmxcfs[66904]: [status] crit: cpg_send_message failed: 9
[the two lines above repeat ~20 times in total]
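If I read the corosync headers correctly, the numeric codes in `cpg_send_message failed: N` are corosync `cs_error_t` values: 6 is CS_ERR_TRY_AGAIN (totem currently has no usable membership, so the message cannot be delivered yet), and 9 is CS_ERR_BAD_HANDLE (pmxcfs's connection to corosync is dead, which only a pve-cluster restart after corosync is healthy will fix). A small grep sketch to summarize which codes dominate a log excerpt:

```shell
#!/bin/sh
# Summarize pmxcfs cpg errors from a syslog file or stdin. The two codes
# seen in this thread map to corosync cs_error_t values:
#   6 = CS_ERR_TRY_AGAIN  (totem has no usable membership yet)
#   9 = CS_ERR_BAD_HANDLE (the daemon's corosync connection is dead)
decode() {
    case "$1" in
        6) echo "CS_ERR_TRY_AGAIN" ;;
        9) echo "CS_ERR_BAD_HANDLE" ;;
        *) echo "cs_error_t $1" ;;
    esac
}

summarize() {
    # Count occurrences of each failure code.
    grep -o 'cpg_send_message failed: [0-9]*' "$@" \
      | awk '{print $NF}' | sort | uniq -c \
      | while read -r n code; do
            echo "$n x code $code ($(decode "$code"))"
        done
}

# Usage example: summarize /var/log/syslog
```

The shift from code 6 in the earlier excerpt to code 9 here matches the corosync restart at 17:35:18: pmxcfs is now holding a stale handle to the restarted daemon.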

####
 
Here is my /var/log/messages:

Jan 4 14:51:07 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: <root@pam> starting task UPID:proxmox2:00000DC3:00031E0E:50E732AB:qmstart:168:root@pam:
Jan 4 14:51:07 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: start VM 168: UPID:proxmox2:00000DC3:00031E0E:50E732AB:qmstart:168:root@pam:
Jan 4 14:51:08 proxmox2 kernel: device tap168i0 entered promiscuous mode
Jan 4 14:51:08 proxmox2 kernel: HTB: quantum of class 10001 is big. Consider r2q change.
Jan 4 14:51:08 proxmox2 kernel: vmbr0: port 12(tap168i0) entering forwarding state
Jan 4 14:55:08 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: <root@pam> starting task UPID:proxmox2:00000E9B:00037C39:50E7339C:qmstart:284:root@pam:
Jan 4 14:55:08 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: start VM 284: UPID:proxmox2:00000E9B:00037C39:50E7339C:qmstart:284:root@pam:
Jan 4 14:55:09 proxmox2 kernel: device tap284i0 entered promiscuous mode
Jan 4 14:55:09 proxmox2 kernel: HTB: quantum of class 10001 is big. Consider r2q change.
Jan 4 14:55:09 proxmox2 kernel: vmbr0: port 13(tap284i0) entering forwarding state
Jan 4 14:55:13 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: <root@pam> starting task UPID:proxmox2:00000EBD:00037E2E:50E733A1:qmstart:559:root@pam:
Jan 4 14:55:13 proxmox2 task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam:: start VM 559: UPID:proxmox2:00000EBD:00037E2E:50E733A1:qmstart:559:root@pam:
Jan 4 14:56:10 proxmox2 kernel: device tap559i0 entered promiscuous mode
Jan 4 14:56:10 proxmox2 kernel: HTB: quantum of class 10001 is big. Consider r2q change.
Jan 4 14:56:10 proxmox2 kernel: vmbr0: port 14(tap559i0) entering forwarding state
Jan 4 14:56:11 proxmox2 kernel: vmbr0: port 14(tap559i0) entering disabled state
Jan 4 14:56:11 proxmox2 kernel: vmbr0: port 14(tap559i0) entering disabled state
Jan 4 14:56:40 proxmox2 pvesh: <root@pam> end task UPID:proxmox2:000006EC:000021C7:50E72B06:startall::root@pam: OK
Jan 4 15:15:56 proxmox2 kernel: hrtimer: interrupt took 6276 ns
Jan 4 17:15:35 proxmox2 corosync[1400]: [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

####

root@proxmox2:~# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination


Chain FORWARD (policy ACCEPT)
target prot opt source destination


Chain OUTPUT (policy ACCEPT)
target prot opt source destination
root@proxmox2:~#
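Given that the iptables ruleset above is completely empty, the "local firewall" hint in the TOTEM message is probably misleading; on a shared colocation switch the usual culprit is multicast being dropped (e.g. IGMP snooping enabled with no querier). A quick local sanity check is whether the node is even subscribed to corosync's multicast group (cman derives a 239.192.x.x address from the cluster name unless cluster.conf pins one). A sketch, with a hypothetical helper that filters `ip maddr show` output:

```shell
#!/bin/sh
# Filter multicast group memberships for corosync-style addresses.
# Feed it the output of `ip maddr show` (file or stdin).
list_mcast_groups() {
    grep -Eo '(239|224)\.[0-9]+\.[0-9]+\.[0-9]+' "$@" | sort -u
}

# Usage on a node:
#   ip maddr show | list_mcast_groups
#
# To exercise multicast end to end, run omping simultaneously on every
# node, e.g.:
#   omping proxmox1a proxmox2 proxmox3 ...
# Sustained multicast loss there would confirm a switch-side problem.
```

If the switch is the problem, moving the cluster onto a dedicated private switch (or enabling an IGMP querier) is the hardware-side fix you already suspected.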

####

root@proxmox2:~# clustat
Cluster Status for FL-Cluster @ Fri Jan 4 17:51:16 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online, Local
proxmox3 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online
proxmox9 9 Online
Proxmox10 10 Online
proxmox1a 11 Online


####

root@proxmox2:~# pveversion -v
pve-manager: 2.2-31 (pve-manager/2.2/e94e95e9)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-82
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-33
qemu-server: 2.0-69
pve-firmware: 1.0-21
libpve-common-perl: 1.0-39
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-36
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1
 
####
Please update to latest available packages.

Do you use a separate network for the cluster communication?
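If the answer is no, a dedicated NIC and subnet for cluster traffic is the standard fix. A hedged sketch of what that looks like in /etc/network/interfaces on Debian; the interface name and addresses are illustrative (eth1 is assumed to be a spare NIC cabled to a private cluster switch), and the cluster hostnames would then need to resolve to the new subnet so cman/corosync binds to it:

```
# /etc/network/interfaces (fragment) - illustrative names and addresses.
# eth1: spare NIC on a dedicated cluster switch, one address per node.
auto eth1
iface eth1 inet static
    address 10.10.10.2
    netmask 255.255.255.0
```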
 
