Cluster is down

debi@n

Hello guys! We rebooted the switch and the cluster crashed. We have 4 nodes, and 3 of them show "Cannot initialize CMAP service".
Output of systemctl status corosync.service:

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: timeout) since Mon 2015-12-21 08:52:19 CET; 40min ago
Process: 1640 ExecStart=/usr/share/corosync/corosync start (code=killed, signal=TERM)

Dec 21 08:50:49 test corosync[1645]: [MAIN ] Corosync Cluster Engine ('2...e.
Dec 21 08:50:49 test corosync[1645]: [MAIN ] Corosync built-in features:...ow
Dec 21 08:52:19 test systemd[1]: corosync.service start operation timed o...g.
Dec 21 08:52:19 test corosync[1640]: Starting Corosync Cluster Engine (co...):
Dec 21 08:52:19 test systemd[1]: Failed to start Corosync Cluster Engine.
Dec 21 08:52:19 test systemd[1]: Unit corosync.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.

How can I repair the cluster? =S Thanks!
 
Hi,
as you can see yourself, the problem is that corosync, the basis of all cluster communication, cannot start. The provided logs are not enough to tell what happened.

Did you change the corosync config before rebooting? Or did you do anything else that could be related to it?

Run:
Code:
systemctl restart corosync
systemctl restart pve-cluster
and please post full logs (attach them here or post them in [code][/code] tags); the interesting parts are the network, corosync and pmxcfs outputs.

Also test with omping if the nodes can see each other over multicast.
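For reference, a minimal sketch of such a test; the node names are placeholders for your four hosts, and count/interval are just reasonable values (the omping command has to run on all nodes at the same time):
Code:
# multicast/unicast connectivity test between all cluster nodes
omping -c 30 -i 1 -q node1 node2 node3 node4

# collect the relevant corosync and pmxcfs logs for posting
# (assuming pmxcfs runs under the pve-cluster unit)
journalctl -u corosync -u pve-cluster --since "2015-12-21 08:00"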
 
Hello, thanks for the reply. The last change to corosync.conf was on December 9.
When I try
Code:
systemctl restart corosync
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
I added my syslog from /var/log/ too.
Thanks.
 
Try:
Code:
systemctl stop pve-cluster
systemctl restart corosync

# now wait until corosync has successfully formed a totem ring with all other active members; look for something like:
#  [TOTEM ] A new membership (10.200.200.101:18876) was formed. Members joined: 4
#  [QUORUM] Members[2]: 4 3
#  [MAIN  ] Completed service synchronization, ready to provide service.

# then
systemctl start pve-cluster

Also, what is your system setup? Which storage, ZFS, and if yes, do you have swap on it?
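If you would rather check the membership directly instead of watching the log, a rough sketch (pvecm is the Proxmox cluster command, corosync-quorumtool ships with corosync):
Code:
# corosync's own view of quorum and membership
corosync-quorumtool -s

# Proxmox view of the cluster: quorum state and node list
pvecm status
pvecm nodes

# follow the corosync log while it starts up
journalctl -f -u corosync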
 
Hi! I tried
Code:
systemctl stop pve-cluster
systemctl restart corosync
and "systemctl restart corosync" timed out again, no luck =S. My setup is hardware RAID1 without ZFS, with NFS for shared storage.
 
Still the same log messages regarding corosync? So what did you do before the reboot, and can all nodes still omping each other?

Please post your corosync config found at /etc/pve/corosync.conf
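For comparison, roughly what a default PVE 4.x corosync.conf looks like; the cluster name, node entries and bind address below are made-up placeholders, your file will differ:
Code:
totem {
  version: 2
  secauth: on
  cluster_name: mycluster
  config_version: 4
  interface {
    ringnumber: 0
    bindnetaddr: 10.200.200.0
  }
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  # ... one node { } entry per cluster member
}

quorum {
  provider: corosync_votequorum
}

logging {
  to_syslog: yes
  debug: off
}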
 
No, please wait with this; it won't help, and this should be solvable in another way.

Does corosync run on its own network interface?

Am I right in guessing that the node "inteli5" is the one that is working?

Edit: okay forget this post.
 
Oh, yes hmm great, but a bit strange ^^

Out of interest, can you still answer the questions whether it runs on its own interface and whether the node I mentioned above was the one that was working?
 
The i5 was working fine before the problem (switch reboot, >1 min off), but I don't understand the question about an "own interface"?
Thanks for your interest :P
 
With "own interface" I meant whether it runs on a separate network, i.e. on its own network interface, sorry for the confusion. ^^

OK, we changed something in the corosync configuration a few months ago, but the default behaviour should be the same, so this error is a bit strange; you fixed it nonetheless.

Just make sure all nodes are running the same package versions, ideally the newest :)
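A quick way to compare that is to run the following on every node and diff the output (just a sketch; the grep pattern only picks out the cluster-relevant packages):
Code:
# full version list of all PVE-related packages
pveversion -v

# or only the packages that matter for cluster communication
pveversion -v | grep -E 'pve-manager|pve-cluster|corosync'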
 
Yes, the problem is solved: corosync runs on its own network on all nodes, and all nodes are running the same versions. Thanks for your help! :P
Code:
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie