Cannot Initiate CMAP Service (all nodes)

navigator

Hello guys, how are you?

My name is Marcos, and I provide support for a small family business that is running Proxmox. We have 4 nodes and Ceph to provide high availability.

The other day we had a power failure, and all the nodes went down after the UPS units had held the load for more than 10 hours.

I wasn't at the company and no one notified me, so the nodes probably shut down abruptly.

I get the following error:
Code:
pvecm status
Cannot initialize CMAP service

After that I ran: systemctl status corosync.service
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Fri 2018-11-30 20:35:41 -02; 2 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 11428 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=TERM)
 Main PID: 11428 (code=killed, signal=TERM)

Nov 30 20:34:11 kimenz1 systemd[1]: Starting Corosync Cluster Engine...
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:34:11 kimenz1 corosync[11428]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Nov 30 20:35:41 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.

OK, so we have a Corosync failure, right? I tried to start corosync manually, without success:
systemctl start corosync.service
Code:
Job for corosync.service failed because a timeout was exceeded.
See "systemctl status corosync.service" and "journalctl -xe" for details.

OK, that wasn't enough. As we say here in Brazil, we are Brazilians and we never quit, so I tried:
journalctl -xe
Code:
Dec 03 09:01:18 kimenz1 corosync[892]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Dec 03 09:01:19 kimenz1 pvestatd[2221]: status update time (22.161 seconds)
Dec 03 09:02:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:02:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:02:04 kimenz1 pvestatd[2221]: status update time (15.132 seconds)
Dec 03 09:02:32 kimenz1 pvestatd[2221]: status update time (28.169 seconds)
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Dec 03 09:02:48 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support:
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.
Dec 03 09:03:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:03:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:03:14 kimenz1 pvestatd[2221]: status update time (12.180 seconds)
Dec 03 09:03:29 kimenz1 pvestatd[2221]: status update time (15.220 seconds)
Dec 03 09:03:57 kimenz1 pvestatd[2221]: status update time (18.118 seconds)
Dec 03 09:04:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:04:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
lines 3553-3612/3612 (END)

Now I don't know where else to go other than here.

So please, if anyone can help me, this is the /etc/pve/corosync.conf file:
Code:
logging {                                
  debug: off                              
  to_syslog: yes                          
}                                        
                                          
nodelist {                                
  node {                                  
    name: kimenz4                        
    nodeid: 4                            
    quorum_votes: 1                      
    ring0_addr: kimenz4                  
  }                                      
                                          
  node {                                  
    name: kimenz1                        
    nodeid: 1                            
    quorum_votes: 1                      
    ring0_addr: kimenz1                  
  }                                      
                                          
  node {                                  
    name: kimenz3                        
    nodeid: 2                            
    quorum_votes: 1                      
    ring0_addr: kimenz3                  
  }                                      
                                          
  node {                                  
    name: kimenz5                        
    nodeid: 3                            
    quorum_votes: 1                      
    ring0_addr: kimenz5                  
  }                                      
                                          
}                                        
                                          
quorum {                                  
  provider: corosync_votequorum          
}                                        
                                          
totem {                                  
  cluster_name: kimenz                    
  config_version: 6                      
  ip_version: ipv4                        
  secauth: on                            
  version: 2                              
  interface {                            
    bindnetaddr: 192.168.1.161            
    ringnumber: 0                        
  }                                      
                                          
}
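
As a sanity check (assuming the default Proxmox paths), the cluster-wide config can be compared with the local copy that corosync actually reads at boot, and the config_version should be identical on every node:
Code:
# the two files should not differ
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf

# run on each node; all nodes should report the same config_version
grep config_version /etc/corosync/corosync.conf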
 
Sorry, I don't know how to post command-line output; if someone can teach me, I would appreciate it.
Use '[ code ]' to begin the block and '[/ code ]' to end it (without the spaces).
 
Is your /etc/hosts correct?
 
Is your /etc/hosts correct?
I don't know how it should look, or what is correct. Just remember that I have four nodes, and /etc/hosts only shows one:

Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Should it be like this?

Code:
127.0.0.1 localhost.localdomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost
192.168.1.163 kimenz3.com kimenz3 
192.168.1.164 kimenz4.com kimenz4
192.168.1.165 kimenz5.com kimenz5

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
 
More important is that the mapping of name <-> IP address is correct.
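For example (using the node names from the corosync.conf above), each node should resolve every other node to the address corosync expects:
Code:
# run on each node; the answers must match the ring0_addr names and the real IPs
getent hosts kimenz1 kimenz3 kimenz4 kimenz5
ping -c 1 kimenz3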
 
Can you verify that multicast still works on your network with 'omping'?
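
The usual invocation (roughly what the Proxmox documentation suggests, if I remember correctly) is started on all nodes at the same time, listing every cluster member:
Code:
# run simultaneously on every node
omping -c 10000 -i 0.001 -F -q kimenz1 kimenz3 kimenz4 kimenz5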

Also, how could it be that the /etc/hosts files were modified and I lost the configuration of the entire cluster? Is that possible?
That was only a guess...
 
I've never used omping, but I'm going to try.

The command should be:
$omping -m <local node IP> <remote node IP> ?
If this is correct, I didn't get any response:

Code:
root@kimenz1:/# omping 192.168.1.161 192.168.1.165
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
^C
192.168.1.165 : response message never received
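
(One thing to keep in mind: omping only receives replies while it is also running on the peer at the same time, so "waiting for response msg" on its own can simply mean the other side was not started; it does not necessarily prove that multicast is broken.)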