Cannot Initiate CMAP Service (all nodes)

Hello guys, how are you?

My name is Marcos and I provide support for a small family business that is running Proxmox. We have 4 nodes and Ceph to provide high availability.

The other day we had a power failure, and all the nodes went down after the UPS units (no-breaks) had held the load for more than 10 hours.

I wasn't at the company and no one notified me, so the nodes probably shut down suddenly.

I have the following error:
Code:
pvecm status
Cannot initialize CMAP service

After that I ran systemctl status corosync.service:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Fri 2018-11-30 20:35:41 -02; 2 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 11428 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=TERM)
 Main PID: 11428 (code=killed, signal=TERM)

Nov 30 20:34:11 kimenz1 systemd[1]: Starting Corosync Cluster Engine...
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:34:11 kimenz1 corosync[11428]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Nov 30 20:34:11 kimenz1 corosync[11428]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Nov 30 20:35:41 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Nov 30 20:35:41 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.

OK, so we have a Corosync failure, right? I tried to start Corosync manually, without success:
systemctl start corosync.service
Code:
Job for corosync.service failed because a timeout was exceeded.
See "systemctl status corosync.service" and "journalctl -xe" for details.

OK, that wasn't enough, but as we say here in Brazil, we are Brazilians and we never quit, so I tried:
journalctl -xe
Code:
Dec 03 09:01:18 kimenz1 corosync[892]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Dec 03 09:01:19 kimenz1 pvestatd[2221]: status update time (22.161 seconds)
Dec 03 09:02:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:02:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:02:04 kimenz1 pvestatd[2221]: status update time (15.132 seconds)
Dec 03 09:02:32 kimenz1 pvestatd[2221]: status update time (28.169 seconds)
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Dec 03 09:02:48 kimenz1 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support:
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Unit entered failed state.
Dec 03 09:02:48 kimenz1 systemd[1]: corosync.service: Failed with result 'timeout'.
Dec 03 09:03:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:03:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
Dec 03 09:03:14 kimenz1 pvestatd[2221]: status update time (12.180 seconds)
Dec 03 09:03:29 kimenz1 pvestatd[2221]: status update time (15.220 seconds)
Dec 03 09:03:57 kimenz1 pvestatd[2221]: status update time (18.118 seconds)
Dec 03 09:04:00 kimenz1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: Unit pvesr.service has begun start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has begun starting up.
Dec 03 09:04:01 kimenz1 systemd[1]: Started Proxmox VE replication runner.
-- Subject: Unit pvesr.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit pvesr.service has finished starting up.
--
-- The start-up result is done.
lines 3553-3612/3612 (END)

Now I don't know where else to go other than here.

So please, if anyone can help me, this is the /etc/pve/corosync.conf file:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kimenz4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: kimenz4
  }

  node {
    name: kimenz1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: kimenz1
  }

  node {
    name: kimenz3
    nodeid: 2
    quorum_votes: 1
    ring0_addr: kimenz3
  }

  node {
    name: kimenz5
    nodeid: 3
    quorum_votes: 1
    ring0_addr: kimenz5
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: kimenz
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.1.161
    ringnumber: 0
  }

}
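One thing that stands out is that every ring0_addr above is a hostname rather than an IP address, so Corosync depends on each of those names resolving correctly on the node where it starts. A quick sanity check could look like this (just a sketch; the hostnames are taken from the corosync.conf above):
Code:
# each name should resolve to the expected node address
for n in kimenz1 kimenz3 kimenz4 kimenz5; do
  getent hosts "$n"
done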
 
Sorry, I don't know how to post the command-line output; if someone can teach me, I would appreciate it.
Use '[ code ]' to begin the block and '[/ code ]' to end it (without the spaces).
 
your /etc/hosts is correct ?
 
your /etc/hosts is correct ?
I don't know how it should be, or what is correct. Just remembering that I have four nodes and /etc/hosts only shows one:

Code:
127.0.0.1 localhost.loclldomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Should it be like this?

Code:
127.0.0.1 localhost.loclldomain localhost
192.168.1.161 kimenz1.com kimenz1 pvelocalhost
192.168.1.163 kimenz3.com kimenz3 
192.168.1.164 kimenz4.com kimenz4
192.168.1.165 kimenz5.com kimenz5

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
 
More important is that the mapping of name <-> IP address is correct.
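In other words, every node needs to resolve every cluster hostname to the same address, and that has to be true on all four machines, not just on kimenz1. A rough way to compare them is sketched below; it assumes root SSH between the nodes still works and that 192.168.1.161/163/164/165 really are the node IPs, as in the /etc/hosts proposed above:
Code:
# print how each node currently resolves the four cluster names
for ip in 192.168.1.161 192.168.1.163 192.168.1.164 192.168.1.165; do
  echo "== $ip =="
  ssh root@"$ip" 'getent hosts kimenz1 kimenz3 kimenz4 kimenz5'
done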
 
can you verify that multicast still works on your network with 'omping'?

Also, how could it be that the /etc/hosts files were modified and I lost the configuration of the entire cluster? Is that possible?
this was only a guess...
 
I've never used omping but I'm going to try.

Should the command be:
$ omping -m <local node IP> <remote node IP> ?
If that is correct, I didn't get any response:

Code:
root@kimenz1:/# omping 192.168.1.161 192.168.1.165
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
192.168.1.165 : waiting for response msg
^C
192.168.1.165 : response message never received
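One detail worth mentioning about omping: it only answers probes if it is running on the remote host as well, so the same command has to be started on every node you list, more or less at the same time. The "waiting for response msg" output above is therefore expected if omping was not also running on 192.168.1.165. A sketch of a sustained multicast test, run in parallel on all four nodes (assuming omping's ping-style -c/-i/-q options and that the kimenz names resolve):
Code:
# run this simultaneously on kimenz1, kimenz3, kimenz4 and kimenz5
# (~10 minutes: 600 probes, 1 second apart, quiet summary at the end)
omping -c 600 -i 1 -q kimenz1 kimenz3 kimenz4 kimenz5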
 
