Cannot Initiate CMAP Service (all nodes)

navigator

Hello guys, how are you?

My name is Marcos, and I provide support for a small family business that is running Proxmox. We have 4 nodes and Ceph to provide high availability.

The other day we had a power failure, and all the nodes went down after the UPS units had held the load for more than 10 hours.

I wasn't at the company and no one notified me, so the nodes probably shut down abruptly.

I get the following error:
Code:
pvecm status
Cannot initialize CMAP service

After that I ran: systemctl status corosync.service
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Fri 2018-11-30 20:35:41 -02; 2 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 11428 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=TERM)
 Main PID: 11428 (code=killed, signal=TERM)

Nov 30 20:34:11 kimenz1 systemd[1]: Starting Corosync Cluster Engine...
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:34:11 kimenz1 corosync[11428]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Nov 30 20:35:41 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.

OK, so we have a Corosync failure, right? I tried to start corosync manually, without success:
systemctl start corosync.service
Code:
Job for corosync.service failed because a timeout was exceeded.
See "systemctl status corosync.service" and "journalctl -xe" for details.

OK, that wasn't enough. As we say here in Brazil, we are Brazilians and we never quit, so I tried:
journalctl -xe
Code:
Dec 03 09:01:18 kimenz1 corosync[892]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Dec 03 09:01:19 kimenz1 pvestatd[2221]: status update time (22.161 seconds)
Dec 03 09:02:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:02:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:02:04 kimenz1 pvestatd[2221]: status update time (15.132 seconds)
Dec 03 09:02:32 kimenz1 pvestatd[2221]: status update time (28.169 seconds)
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Dec 03 09:02:48 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support:
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.
Dec 03 09:03:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:03:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:03:14 kimenz1 pvestatd[2221]: status update time (12.180 seconds)
Dec 03 09:03:29 kimenz1 pvestatd[2221]: status update time (15.220 seconds)
Dec 03 09:03:57 kimenz1 pvestatd[2221]: status update time (18.118 seconds)
Dec 03 09:04:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:04:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
lines 3553-3612/3612 (END)

Now I don't know where else to go other than here.

So please, if anyone can help me, this is the /etc/pve/corosync.conf file:
Code:
logging {                                
  debug: off                              
  to_syslog: yes                          
}                                        
                                          
nodelist {                                
  node {                                  
    name: kimenz4                        
    nodeid: 4                            
    quorum_votes: 1                      
    ring0_addr: kimenz4                  
  }                                      
                                          
  node {                                  
    name: kimenz1                        
    nodeid: 1                            
    quorum_votes: 1                      
    ring0_addr: kimenz1                  
  }                                      
                                          
  node {                                  
    name: kimenz3                        
    nodeid: 2                            
    quorum_votes: 1                      
    ring0_addr: kimenz3                  
  }                                      
                                          
  node {                                  
    name: kimenz5                        
    nodeid: 3                            
    quorum_votes: 1                      
    ring0_addr: kimenz5                  
  }                                      
                                          
}                                        
                                          
quorum {                                  
  provider: corosync_votequorum          
}                                        
                                          
totem {                                  
  cluster_name: kimenz                    
  config_version: 6                      
  ip_version: ipv4                        
  secauth: on                            
  version: 2                              
  interface {                            
    bindnetaddr: 192.168.1.161            
    ringnumber: 0                        
  }                                      
                                          
}
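
As a sanity check (assuming the default Proxmox paths), the cluster-wide config can be compared with the local copy that corosync actually reads at boot, and the config_version should be identical on every node:
Code:
# the two files should not differ
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

# run on each node; all nodes should report the same config_version
grep config_version /etc/corosync/corosync.conf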
 
Sorry, I don't know how to post command-line output; if someone can teach me, I would appreciate it.
Use '[ code ]' to begin the block and '[/ code ]' to end it (without the spaces).
 
Is your /etc/hosts correct?
 
Is your /etc/hosts correct?
I don't know how it should look, or what is correct. Just remember that I have four nodes, and /etc/hosts only shows one:

Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Should it be like this?

Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost
192.168.1.163 kimenz3.com kimenz3 
192.168.1.164 kimenz4.com kimenz4
192.168.1.165 kimenz5.com kimenz5

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
 
More important is that the mapping of name <-> IP address is correct.
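For example (using the node names from the corosync.conf above), each node should resolve every other node to the address corosync expects:
Code:
# run on each node; the answers must match the ring0_addr names and the real IPs
getent hosts kimenz1 kimenz3 kimenz4 kimenz5
ping -c 1 kimenz3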
 
Can you verify that multicast still works on your network with 'omping'?
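
The usual invocation (roughly what the Proxmox documentation suggests, if I remember correctly) is started on all nodes at the same time, listing every cluster member:
Code:
# run simultaneously on every node
omping -c 10000 -i 0.001 -F -q kimenz1 kimenz3 kimenz4 kimenz5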

Also, how could it be that the /etc/hosts files were modified and I lost the configuration of the entire cluster? Is that possible?
That was only a guess...
 
I've never used omping, but I'm going to try.

The command should be:
$omping -m <local node IP> <remote node IP> ?
If this is correct, I didn't get any response:

Code:
root@kimenz1:/# omping 192.168.1.161 192.168.1.165
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
^C
192.168.1.165 : response message never received
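
(One thing to keep in mind: omping only receives replies while it is also running on the peer at the same time, so "waiting for response msg" on its own can simply mean the other side was not started; it does not necessarily prove that multicast is broken.)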