Node 2 cman crashes ..... will not come back without reboot

Chris Rivera

Guest
This started within the last week.

root@proxmox2:~# service cman stop
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... Timed-out waiting for cluster
[FAILED]

The only way to get cman working again is to reboot the server.


#########

root@proxmox2:~# clustat
Cluster Status for FL-Cluster @ Mon Feb 18 14:57:28 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online, Local
proxmox3a 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Offline
proxmox7 7 Online
proxmox8 8 Online
proxmox9 9 Online
Proxmox10 10 Online
proxmox1a 11 Online


clustat shows the node as quorate and part of the cluster, but the web interface does not show the same.


#########


Restarting the following does not solve the issue:

  • pvestatd
  • pvedaemon
  • cman (fails; it will not stop)
  • pve-cluster


#########


root@proxmox2:~# pveversion -v
pve-manager: 2.2-31 (pve-manager/2.2/e94e95e9)
running kernel: 2.6.32-16-pve
proxmox-ve-2.6.32: 2.2-82
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-14-pve: 2.6.32-74
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-33
qemu-server: 2.0-69
pve-firmware: 1.0-21
libpve-common-perl: 1.0-39
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-36
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.2-7
ksm-control-daemon: 1.1-1


#########

I do not believe this service is dying on its own. We get hit with enough DDoS attacks that this may be caused by the node being hit by one. What I can say is that when this happens, there is no way to stop the cman service and restart it.

I have tried using top to find the corosync process and killing it with signal 15, but this does not terminate corosync.
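
For readers in the same spot, here is a minimal sketch of escalating from signal 15 (SIGTERM) to signal 9 (SIGKILL) after a grace period. `terminate_pid` is a hypothetical helper name, not a Proxmox or cman tool, and the 5-second default is an arbitrary assumption. Killing corosync on a live cluster member may get the node fenced, so treat this as an illustration rather than a recommended procedure.

```shell
#!/bin/sh
# Sketch: ask a process to exit with SIGTERM, then fall back to SIGKILL.
# terminate_pid is a hypothetical helper, not part of any Proxmox package.
terminate_pid() {
    pid=$1
    grace=${2:-5}                                   # seconds to wait after SIGTERM (assumed default)
    kill -TERM "$pid" 2>/dev/null || return 0       # nothing to signal: already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0      # process exited after SIGTERM
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null || return 0       # vanished just before the SIGKILL
    sleep 1
    # Still present after SIGKILL: report failure to the caller.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

If the function returns non-zero, the process is most likely stuck in uninterruptible sleep (state D in ps or top), where no signal is delivered until the blocking kernel operation completes. In that situation a reboot is often the only way out, which would match what is described above.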

#########

If anyone has any helpful knowledge, please let me know.

Thanks
 
Do you want to stop or start cman? If it has crashed you can use

# service cman start

to restart it.
 
I figured the service was not running, or else I would expect an error when starting a service that is already running.

I can rerun this command over and over with no problems:

root@proxmox2:~# service cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]

root@proxmox2:~# service cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]

If it had started, I should not be able to start it again; I would have to issue service cman restart.


I have run that command and still cannot get this node to join the cluster.

Meanwhile, clustat still shows this node as part of the cluster and quorate:

root@proxmox2:~# service pve-cluster restart
Restarting pve cluster filesystem: pve-cluster.
root@proxmox2:~# clustat
Cluster Status for FL-Cluster @ Tue Feb 19 08:39:15 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online, Local
proxmox3a 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Offline
proxmox7 7 Online
proxmox8 8 Online
proxmox9 9 Online
Proxmox10 10 Online
proxmox1a 11 Online
 
So cman is running and only the web interface displays something wrong? You wrote 'cman' is crashing?
 
Because it's not working.

If it were working as it should, I should be able to issue service cman stop, but that doesn't work, leading me to believe that even though it may be working for clustat, something is not working correctly.

On all the other nodes, not only can I start cman, I can also restart and stop it.

But on node 2 I can only run service cman start.


If I try to stop or restart the service, it fails:

root@proxmox2:~# service cman stop
Stopping cluster:
Stopping dlm_controld...
[FAILED]

root@proxmox2:~# service cman restart
Stopping cluster:
Stopping dlm_controld...
[FAILED]


If this is not cman crashing or working incorrectly, let me know so I can refer to the problem by the appropriate name.

Thanks
 
Any help with this?

I'm having issues getting cman to work. It's running and cannot be stopped or restarted.

Rebooting the nodes is a fix but not a solution, and it has worked for all nodes except node 2 and node 8.

About 85-90% of the nodes will NOT accept the password to log into the web interface, leaving management to be done via the CLI.


At this point the management of Proxmox has gone to shit...
 
Node 2 and node 8 will not join the cluster because of the cluster config file. I tracked this down in the syslog when trying to restart cman on both nodes:

######

Feb 25 11:41:43 proxmox2 corosync[745665]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Feb 25 11:41:43 proxmox2 corosync[745665]: [CMAN ] Can't get updated config version 49: New configuration version has to be newer than current running configuration#012.
Feb 25 11:41:43 proxmox2 corosync[745665]: [CMAN ] Activity suspended on this node
Feb 25 11:41:43 proxmox2 corosync[745665]: [CMAN ] Error reloading the configuration, will retry every second
Feb 25 11:41:44 proxmox2 corosync[745665]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Feb 25 11:41:44 proxmox2 corosync[745665]: [CMAN ] Can't get updated config version 49: New configuration version has to be newer than current running configuration#012.
Feb 25 11:41:44 proxmox2 corosync[745665]: [CMAN ] Activity suspended on this node
Feb 25 11:41:44 proxmox2 corosync[745665]: [CMAN ] Error reloading the configuration, will retry every second
Feb 25 11:41:45 proxmox2 corosync[745665]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Feb 25 11:41:45 proxmox2 corosync[745665]: [CMAN ] Can't get updated config version 49: New configuration version has to be newer than current running configuration#012.
Feb 25 11:41:45 proxmox2 corosync[745665]: [CMAN ] Activity suspended on this node
Feb 25 11:41:45 proxmox2 corosync[745665]: [CMAN ] Error reloading the configuration, will retry every second
Feb 25 11:41:46 proxmox2 corosync[745665]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Feb 25 11:41:46 proxmox2 corosync[745665]: [CMAN ] Can't get updated config version 49: New configuration version has to be newer than current running configuration#012.
Feb 25 11:41:46 proxmox2 corosync[745665]: [CMAN ] Activity suspended on this node
Feb 25 11:41:46 proxmox2 corosync[745665]: [CMAN ] Error reloading the configuration, will retry every second
Feb 25 11:41:47 proxmox2 corosync[745665]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Feb 25 11:41:47 proxmox2 corosync[745665]: [CMAN ] Can't get updated config version 49: New configuration version has to be newer than current running configuration#012.
Feb 25 11:41:47 proxmox2 corosync[745665]: [CMAN ] Activity suspended on this node
Feb 25 11:41:47 proxmox2 corosync[745665]: [CMAN ] Error reloading the configuration, will retry every second

######
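
The log above says the configuration on disk must carry a higher version than the one cman is already running (49 here). As a rough sketch, the version can be read out of a cluster.conf-style file and compared against the running one. `config_version` and `is_newer` are hypothetical helper names, and the file locations in the usage comment (/etc/pve/cluster.conf and /etc/cluster/cluster.conf) are assumptions about a PVE 2.x layout that should be verified on the node.

```shell
#!/bin/sh
# Sketch: read the config_version attribute from the <cluster> element.
# config_version and is_newer are hypothetical helpers, not Proxmox tools.
config_version() {
    sed -n 's/.*<cluster[[:space:]][^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# is_newer FILE RUNNING_VERSION
# Succeeds only if FILE carries a strictly higher version than RUNNING_VERSION,
# which is the condition the CMAN log message complains about.
is_newer() {
    file_version=$(config_version "$1")
    [ -n "$file_version" ] && [ "$file_version" -gt "$2" ]
}

# Usage (paths are assumptions, not run here):
#   config_version /etc/pve/cluster.conf
#   config_version /etc/cluster/cluster.conf
#   is_newer /etc/pve/cluster.conf 49 || echo "version on disk is not newer than the running one"
```

The running version should also be visible in the output of cman_tool version, which is worth comparing against both files before editing anything by hand.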

I already tried pvecm expected 1 but still cannot update the cluster.conf file.

What do I need to do to forcefully edit this file and bring the nodes back online?
 
I spent a lot of time reading syslogs and was able to work through node 2 and node 8 to bring them online.

Node 2: on reboot, /etc/pve/ already had files in it before pve-cluster could mount, which caused an error. I deleted those files and rebooted the node, and this fixed the issue with node 2.
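
The node 2 situation can be checked for with a small sketch like the one below. On Proxmox VE, /etc/pve is a FUSE filesystem mounted by pve-cluster, so anything visible there while it is not mounted lives on the underlying disk. `has_stale_files` is a hypothetical helper name; the sketch only reports and does not delete anything.

```shell
#!/bin/sh
# Sketch: detect leftover files in a directory that is supposed to be a mount point.
# has_stale_files is a hypothetical helper, not part of Proxmox.
has_stale_files() {
    dir=$1
    # If something is mounted there, the contents belong to the live filesystem.
    mountpoint -q "$dir" && return 1
    # Not mounted: any entry is a leftover on the underlying disk.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Usage (not run here; stop pve-cluster first so /etc/pve is unmounted):
#   if has_stale_files /etc/pve; then
#       echo "/etc/pve is not mounted but contains files:"
#       ls -A /etc/pve
#   fi
```

Listing the files before removing them is the cautious choice, since a copy of cluster.conf among them may be worth keeping for reference.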

Node 8: I ran service pve-cluster stop, then service cman stop, then started pve-cluster again to update the cluster.conf file, then ran service cman start. This had to be done 3-4 times before the node actually stayed online and accessible.
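
The node 8 sequence, including the repeated attempts, could be sketched as follows. `retry` and `rejoin_cluster` are hypothetical helper names, the attempt count and delay in the usage comment are arbitrary assumptions, and the service names are the ones used in this thread on PVE 2.x.

```shell
#!/bin/sh
# Sketch: retry a command a limited number of times.
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used up.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0                  # success: stop retrying
        [ "$n" -ge "$max" ] && return 1   # out of attempts
        n=$((n + 1))
        sleep "$delay"
    done
}

# The order described for node 8: stop both services, then bring
# pve-cluster up first so cluster.conf is refreshed before cman starts.
rejoin_cluster() {
    service pve-cluster stop
    service cman stop
    service pve-cluster start && service cman start
}

# Usage (not run here; 4 attempts and 10 seconds are assumptions):
#   retry 4 10 rejoin_cluster
```

A successful exit from service cman start does not prove the node stays in the cluster, so checking clustat and the syslog after each attempt is still needed.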
 
