Corosync not working properly

Matus

Active Member
Hello,

after restarting one node (node 4) in my 4-node cluster, synchronisation stopped working on that node.
corosync.conf is the same on all nodes:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: veno
    nodeid: 4
    quorum_votes: 1
    ring0_addr: veno
  }

  node {
    name: alto
    nodeid: 1
    quorum_votes: 1
    ring0_addr: alto
  }

  node {
    name: spare
    nodeid: 2
    quorum_votes: 1
    ring0_addr: spare
  }

  node {
    name: sumo
    nodeid: 3
    quorum_votes: 1
    ring0_addr: sumo
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: forma
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: X.X.X.X
    ringnumber: 0
  }
}

Output from corosync on the nodes:
notice [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
warning [MAIN ] Please migrate config file to nodelist.
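
(Regarding the warning above: because every node already has a ring0_addr in the nodelist, the interface/bindnetaddr block is redundant and corosync ignores it in favour of the nodelist. A minimal sketch of how the totem section could look with that block dropped, keeping the values from the config above; note that config_version has to be bumped whenever corosync.conf is edited:

totem {
  cluster_name: forma
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
}
)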

Node 1 was restarted. Here is the output of systemctl status pve-cluster.service:
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 20:11:52 CEST; 14min ago
Process: 17423 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 17407 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 17409 (pmxcfs)
Tasks: 5 (limit: 4915)
Memory: 31.5M
CPU: 924ms
CGroup: /system.slice/pve-cluster.service
└─17409 /usr/bin/pmxcfs

Jun 17 20:11:51 alto pmxcfs[17409]: [dcdb] crit: cpg_initialize failed: 2
Jun 17 20:11:51 alto pmxcfs[17409]: [dcdb] crit: can't initialize service
Jun 17 20:11:51 alto pmxcfs[17409]: [status] crit: cpg_initialize failed: 2
Jun 17 20:11:51 alto pmxcfs[17409]: [status] crit: can't initialize service
Jun 17 20:11:52 alto systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 17 20:11:57 alto pmxcfs[17409]: [status] notice: update cluster info (cluster name forma, version = 8)
Jun 17 20:11:58 alto pmxcfs[17409]: [dcdb] notice: members: 1/17409
Jun 17 20:11:58 alto pmxcfs[17409]: [dcdb] notice: all data is up to date
Jun 17 20:11:58 alto pmxcfs[17409]: [status] notice: members: 1/17409
Jun 17 20:11:58 alto pmxcfs[17409]: [status] notice: all data is up to date
Output of the same command on node 3 (sumo):

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 20:24:30 CEST; 2min 7s ago
Process: 25546 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 25444 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 25452 (pmxcfs)
Tasks: 5 (limit: 4915)
Memory: 34.8M
CPU: 572ms
CGroup: /system.slice/pve-cluster.service
└─25452 /usr/bin/pmxcfs

Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: received sync request (epoch 3/25452/00000001)
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: received all states
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: leader is 3/25452
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: synced members: 3/25452, 4/2413
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: start sending inode updates
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: sent all (0) updates
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: all data is up to date
Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: received all states
Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: all data is up to date
Jun 17 20:24:30 sumo systemd[1]: Started The Proxmox VE cluster filesystem.
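
(A common first step when pmxcfs reports "cpg_initialize failed" is to restart corosync and then pve-cluster on the affected node and watch whether it rejoins the membership. A minimal sketch, to be run on the out-of-sync node:

systemctl restart corosync                  # rejoin the totem membership
systemctl restart pve-cluster               # restart pmxcfs so cpg_initialize can succeed
journalctl -u corosync -u pve-cluster -f    # follow both logs while the node rejoins
)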

Node 2 is stopped.

Could you help me painlessly rebuild the cluster?

Thanks
 
PVE 4.15.18-40

Here is the output from journalctl -u corosync:
.....
Jun 17 20:39:15 alto corosync[17429]: [TOTEM ] A new membership (x.x.x.x:330524) was formed. Members
Jun 17 20:39:15 alto corosync[17429]: [CPG ] downlist left_list: 0 received
Jun 17 20:39:15 alto corosync[17429]: [QUORUM] Members[1]: 1
Jun 17 20:39:15 alto corosync[17429]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 17 20:39:16 alto corosync[17429]: notice [TOTEM ] A new membership (x.x.x.x:330528) was formed. Members
Jun 17 20:39:16 alto corosync[17429]: warning [CPG ] downlist left_list: 0 received
Jun 17 20:39:16 alto corosync[17429]: notice [QUORUM] Members[1]: 1
Jun 17 20:39:16 alto corosync[17429]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 11 12:48:53 veno corosync[2443]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 11 12:48:53 veno corosync[2443]: warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 11 12:48:53 veno corosync[2443]: warning [MAIN ] Please migrate config file to nodelist.
May 11 12:48:53 veno corosync[2443]: [MAIN ] Please migrate config file to nodelist.
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 11 12:48:53 veno corosync[2443]: [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 12:48:53 veno corosync[2443]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] The network interface [x.x.x.x] is now up.
May 11 12:48:53 veno corosync[2443]: [TOTEM ] The network interface [x.x.x.x] is now up.
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync configuration map access [0]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cmap
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync configuration service [1]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cfg
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cpg
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync profile loading service [4]
May 11 12:48:53 veno corosync[2443]: [SERV ] Service engine loaded: corosync configuration map access [0]
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync resource monitoring service [6]
May 11 12:48:53 veno corosync[2443]: warning [WD ] Watchdog not enabled by configuration
May 11 12:48:53 veno corosync[2443]: warning [WD ] resource load_15min missing a recovery key.
May 11 12:48:53 veno corosync[2443]: warning [WD ] resource memory_used missing a recovery key.
May 11 12:48:53 veno corosync[2443]: info [WD ] no resources configured.
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
May 11 12:48:53 veno corosync[2443]: notice [QUORUM] Using quorum provider corosync_votequorum
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: votequorum
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: quorum
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] A new membership (x.x.x.x:298984) was formed. Members joined: 4
May 11 12:48:53 veno corosync[2443]: [QB ] server name: cmap
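
(Since the paste above mixes entries from alto (Jun 17) and veno (May 11), it may help to limit the journal to one boot or time range per node when comparing, for example:

journalctl -u corosync -b                      # current boot only
journalctl -u corosync --since "2019-06-17"    # everything since the restart
)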

systemctl status corosync:
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 21:19:43 CEST; 18h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 24800 (corosync)
Tasks: 2 (limit: 4915)
Memory: 38.9M
CPU: 23min 35.799s
CGroup: /system.slice/corosync.service
└─24800 /usr/sbin/corosync -f

Jun 18 15:51:34 alto corosync[24800]: [QUORUM] Members[1]: 1
Jun 18 15:51:34 alto corosync[24800]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 18 15:51:36 alto corosync[24800]: notice [TOTEM ] A new membership (x.x.x.x:518856) was formed. Members
Jun 18 15:51:36 alto corosync[24800]: warning [CPG ] downlist left_list: 0 received
Jun 18 15:51:36 alto corosync[24800]: notice [QUORUM] Members[1]: 1
Jun 18 15:51:36 alto corosync[24800]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jun 18 15:51:36 alto corosync[24800]: [TOTEM ] A new membership (x.x.x.x:518856) was formed. Members
Jun 18 15:51:36 alto corosync[24800]: [CPG ] downlist left_list: 0 received
Jun 18 15:51:36 alto corosync[24800]: [QUORUM] Members[1]: 1
Jun 18 15:51:36 alto corosync[24800]: [MAIN ] Completed service synchronization, ready to provide service.
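
(For comparing the view of each node, the corosync tooling can also be queried directly; a quick sketch, run on every node:

corosync-cfgtool -s               # ring status and local node ID
corosync-quorumtool -s            # quorum / vote summary (same data pvecm status shows)
corosync-cmapctl | grep members   # runtime membership as corosync sees it
)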

pvecm status on node 1 (alto):
Quorum information
------------------
Date: Tue Jun 18 15:53:33 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/519172
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 x.x.x.x (local)

And pvecm status on node 4 (veno):

Quorum information
------------------
Date: Tue Jun 18 15:53:59 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000004
Ring ID: 3/299396
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000003 1 x.x.x.x
0x00000004 1 x.x.x.x (local)
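
(If changes have to be made on a partition that has lost quorum, e.g. to fix the config, the expected votes can be lowered temporarily. This overrides the quorum protection, so it should only ever be done on one partition and reverted once the cluster is healthy. A sketch:

pvecm expected 1    # make the current single-node partition quorate so /etc/pve becomes writable
)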

Here are views from Node 1:

Node1.jpg

and from Nodes 3-4:

Node3-4.jpg

Node 2 (spare) is turned off.
 
Seems like you are running corosync over a public IP?
You should know that, since corosync requires low latency, it is suggested to run it on its own separate network: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
Have you tested the network with omping?
Also, the journal from node 4 (veno) seems incomplete, as the timestamps say 'May 11'.
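
(For reference, the multicast/latency tests from the admin guide could be run on all reachable nodes at the same time, roughly like this; the hostnames are taken from the config above, and since spare is off only the nodes that are up should be listed:

omping -c 10000 -i 0.001 -F -q alto sumo veno    # short latency/loss test
omping -c 600 -i 1 -q alto sumo veno             # ~10 minute test, catches IGMP snooping timeouts
)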
 
I know that running corosync on a public IP is not a good idea, but it worked for a long time.
There is no problem with omping.
Now I have separated Node 1 from the cluster. When I try to join it back to the cluster, I get this message:
* this host already contains virtual guests
Check if node may join a cluster failed!
Is there any way to add a node with VMs and CTs?
Or is it possible to join two clusters?
 
Is there any way to add a node with VMs and CTs?
As we cannot guarantee consistency, adding a node which already contains VMs/CTs is not possible. The easiest way to join that node is to back up the VMs/CTs, restore them on the cluster, and then add the (now empty) node to the cluster.
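
A rough sketch of that workflow, with VM/CT IDs, storage names and backup paths as placeholders only:

# on the node that should join: back up every guest
vzdump 100 --mode snapshot --storage <backup-storage>

# restore the backups on a node that is already part of the cluster
qmrestore <path-to-vzdump-qemu-100>.vma.lzo 100      # QEMU VMs
pct restore 101 <path-to-vzdump-lxc-101>.tar.lzo     # containers

# once the node holds no guests any more, join it
pvecm add <IP-of-a-cluster-node>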