Corosync not working properly

Matus

Active Member
Hello,

after restarting one node (node 4) in my 4-node cluster, synchronisation stopped working on that node.
corosync.conf is the same on all nodes:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: veno
    nodeid: 4
    quorum_votes: 1
    ring0_addr: veno
  }

  node {
    name: alto
    nodeid: 1
    quorum_votes: 1
    ring0_addr: alto
  }

  node {
    name: spare
    nodeid: 2
    quorum_votes: 1
    ring0_addr: spare
  }

  node {
    name: sumo
    nodeid: 3
    quorum_votes: 1
    ring0_addr: sumo
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: forma
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: X.X.X.X
    ringnumber: 0
  }
}

Output from corosync on the nodes:
notice [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
warning [MAIN ] Please migrate config file to nodelist.
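
(Regarding the warning above: because every node already has a ring0_addr in the nodelist, the interface/bindnetaddr block is redundant and corosync ignores it in favour of the nodelist. A minimal sketch of how the totem section could look with that block dropped, keeping the values from the config above; note that config_version has to be bumped whenever corosync.conf is edited:

totem {
  cluster_name: forma
  config_version: 9
  ip_version: ipv4
  secauth: on
  version: 2
}
)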

Node 1 was restarted. Here is the output of systemctl status pve-cluster.service:
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 20:11:52 CEST; 14min ago
Process: 17423 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 17407 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 17409 (pmxcfs)
Tasks: 5 (limit: 4915)
Memory: 31.5M
CPU: 924ms
CGroup: /system.slice/pve-cluster.service
└─17409 /usr/bin/pmxcfs

Jun 17 20:11:51 alto pmxcfs[17409]: [dcdb] crit: cpg_initialize failed: 2
Jun 17 20:11:51 alto pmxcfs[17409]: [dcdb] crit: can't initialize service
Jun 17 20:11:51 alto pmxcfs[17409]: [status] crit: cpg_initialize failed: 2
Jun 17 20:11:51 alto pmxcfs[17409]: [status] crit: can't initialize service
Jun 17 20:11:52 alto systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 17 20:11:57 alto pmxcfs[17409]: [status] notice: update cluster info (cluster name forma, version = 8)
Jun 17 20:11:58 alto pmxcfs[17409]: [dcdb] notice: members: 1/17409
Jun 17 20:11:58 alto pmxcfs[17409]: [dcdb] notice: all data is up to date
Jun 17 20:11:58 alto pmxcfs[17409]: [status] notice: members: 1/17409
Jun 17 20:11:58 alto pmxcfs[17409]: [status] notice: all data is up to date
Output of the same command on node 3 (sumo):

● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 20:24:30 CEST; 2min 7s ago
Process: 25546 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Process: 25444 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 25452 (pmxcfs)
Tasks: 5 (limit: 4915)
Memory: 34.8M
CPU: 572ms
CGroup: /system.slice/pve-cluster.service
└─25452 /usr/bin/pmxcfs

Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: received sync request (epoch 3/25452/00000001)
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: received all states
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: leader is 3/25452
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: synced members: 3/25452, 4/2413
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: start sending inode updates
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: sent all (0) updates
Jun 17 20:24:29 sumo pmxcfs[25452]: [dcdb] notice: all data is up to date
Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: received all states
Jun 17 20:24:29 sumo pmxcfs[25452]: [status] notice: all data is up to date
Jun 17 20:24:30 sumo systemd[1]: Started The Proxmox VE cluster filesystem.
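
(A common first step when pmxcfs reports "cpg_initialize failed" is to restart corosync and then pve-cluster on the affected node and watch whether it rejoins the membership. A minimal sketch, to be run on the out-of-sync node:

systemctl restart corosync                  # rejoin the totem membership
systemctl restart pve-cluster               # restart pmxcfs so cpg_initialize can succeed
journalctl -u corosync -u pve-cluster -f    # follow both logs while the node rejoins
)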

Node 2 is stopped.

Could you help me painlessly rebuild the cluster?

Thanks
 
PVE 4.15.18-40

Here is the output from journalctl -u corosync:
.....
Jun 17 20:39:15 alto corosync[17429]: [TOTEM ] A new membership (x.x.x.x:330524) was formed. Members
Jun 17 20:39:15 alto corosync[17429]: [CPG ] downlist left_list: 0 received
Jun 17 20:39:15 alto corosync[17429]: [QUORUM] Members[1]: 1
Jun 17 20:39:15 alto corosync[17429]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 17 20:39:16 alto corosync[17429]: notice [TOTEM ] A new membership (x.x.x.x:330528) was formed. Members
Jun 17 20:39:16 alto corosync[17429]: warning [CPG ] downlist left_list: 0 received
Jun 17 20:39:16 alto corosync[17429]: notice [QUORUM] Members[1]: 1
Jun 17 20:39:16 alto corosync[17429]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 11 12:48:53 veno corosync[2443]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 11 12:48:53 veno corosync[2443]: warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 11 12:48:53 veno corosync[2443]: warning [MAIN ] Please migrate config file to nodelist.
May 11 12:48:53 veno corosync[2443]: [MAIN ] Please migrate config file to nodelist.
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 11 12:48:53 veno corosync[2443]: [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 12:48:53 veno corosync[2443]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] The network interface [x.x.x.x] is now up.
May 11 12:48:53 veno corosync[2443]: [TOTEM ] The network interface [x.x.x.x] is now up.
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync configuration map access [0]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cmap
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync configuration service [1]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cfg
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: cpg
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync profile loading service [4]
May 11 12:48:53 veno corosync[2443]: [SERV ] Service engine loaded: corosync configuration map access [0]
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync resource monitoring service [6]
May 11 12:48:53 veno corosync[2443]: warning [WD ] Watchdog not enabled by configuration
May 11 12:48:53 veno corosync[2443]: warning [WD ] resource load_15min missing a recovery key.
May 11 12:48:53 veno corosync[2443]: warning [WD ] resource memory_used missing a recovery key.
May 11 12:48:53 veno corosync[2443]: info [WD ] no resources configured.
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
May 11 12:48:53 veno corosync[2443]: notice [QUORUM] Using quorum provider corosync_votequorum
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: votequorum
May 11 12:48:53 veno corosync[2443]: notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 11 12:48:53 veno corosync[2443]: info [QB ] server name: quorum
May 11 12:48:53 veno corosync[2443]: notice [TOTEM ] A new membership (x.x.x.x:298984) was formed. Members joined: 4
May 11 12:48:53 veno corosync[2443]: [QB ] server name: cmap
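
(Since the paste above mixes entries from alto (Jun 17) and veno (May 11), it may help to limit the journal to one boot or time range per node when comparing, for example:

journalctl -u corosync -b                      # current boot only
journalctl -u corosync --since "2019-06-17"    # everything since the restart
)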

systemctl status corosync:
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-06-17 21:19:43 CEST; 18h ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 24800 (corosync)
Tasks: 2 (limit: 4915)
Memory: 38.9M
CPU: 23min 35.799s
CGroup: /system.slice/corosync.service
└─24800 /usr/sbin/corosync -f

Jun 18 15:51:34 alto corosync[24800]: [QUORUM] Members[1]: 1
Jun 18 15:51:34 alto corosync[24800]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 18 15:51:36 alto corosync[24800]: notice [TOTEM ] A new membership (x.x.x.x:518856) was formed. Members
Jun 18 15:51:36 alto corosync[24800]: warning [CPG ] downlist left_list: 0 received
Jun 18 15:51:36 alto corosync[24800]: notice [QUORUM] Members[1]: 1
Jun 18 15:51:36 alto corosync[24800]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jun 18 15:51:36 alto corosync[24800]: [TOTEM ] A new membership (x.x.x.x:518856) was formed. Members
Jun 18 15:51:36 alto corosync[24800]: [CPG ] downlist left_list: 0 received
Jun 18 15:51:36 alto corosync[24800]: [QUORUM] Members[1]: 1
Jun 18 15:51:36 alto corosync[24800]: [MAIN ] Completed service synchronization, ready to provide service.
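
(For comparing the view of each node, the corosync tooling can also be queried directly; a quick sketch, run on every node:

corosync-cfgtool -s               # ring status and local node ID
corosync-quorumtool -s            # quorum / vote summary (same data pvecm status shows)
corosync-cmapctl | grep members   # runtime membership as corosync sees it
)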

pvecm status on node 1 (alto):
Quorum information
------------------
Date: Tue Jun 18 15:53:33 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/519172
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 x.x.x.x (local)

And pvecm status on node 4 (veno):

Quorum information
------------------
Date: Tue Jun 18 15:53:59 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000004
Ring ID: 3/299396
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000003 1 x.x.x.x
0x00000004 1 x.x.x.x (local)
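
(If changes have to be made on a partition that has lost quorum, e.g. to fix the config, the expected votes can be lowered temporarily. This overrides the quorum protection, so it should only ever be done on one partition and reverted once the cluster is healthy. A sketch:

pvecm expected 1    # make the current single-node partition quorate so /etc/pve becomes writable
)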

Here are views from Node 1:

Node1.jpg

and from Nodes 3-4:

Node3-4.jpg

Node 2 (spare) is turned off.
 
Seems like you are running corosync over a public IP?
You should know that, since corosync requires low latency, it is suggested to run it on its own separate network: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
Have you tested the network with omping?
Also, the journal from node 4 (veno) seems incomplete, as the timestamps say 'May 11'.
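
(For reference, the multicast/latency tests from the admin guide could be run on all reachable nodes at the same time, roughly like this; the hostnames are taken from the config above, and since spare is off only the nodes that are up should be listed:

omping -c 10000 -i 0.001 -F -q alto sumo veno    # short latency/loss test
omping -c 600 -i 1 -q alto sumo veno             # ~10 minute test, catches IGMP snooping timeouts
)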
 
I know that running corosync on a public IP is not a good idea, but it worked for a long time.
There is no problem with omping.
Now I have separated Node 1 from the cluster. When I try to join it back to the cluster, I get this message:
* this host already contains virtual guests
Check if node may join a cluster failed!
Is there any way to add a node with VMs and CTs?
Or is it possible to join two clusters?
 
Is there any way to add a node with VMs and CTs?
As we cannot guarantee consistency, adding a node which already contains VMs/CTs is not possible. The easiest way to join that node is to back up the VMs/CTs, restore them on the cluster, and then add the (now empty) node to the cluster.
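
A rough sketch of that workflow, with VM/CT IDs, storage names and backup paths as placeholders only:

# on the node that should join: back up every guest
vzdump 100 --mode snapshot --storage <backup-storage>

# restore the backups on a node that is already part of the cluster
qmrestore <path-to-vzdump-qemu-100>.vma.lzo 100      # QEMU VMs
pct restore 101 <path-to-vzdump-lxc-101>.tar.lzo     # containers

# once the node holds no guests any more, join it
pvecm add <IP-of-a-cluster-node>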