Nodes frequently leaving the cluster

tirpitz

Member
Jun 5, 2014
Hello,

We have a cluster of 4 nodes. For the last 3-4 weeks, nodes have been leaving the cluster frequently, and we have to restart the cman and pve-cluster services to bring them back into the cluster. We thought high network traffic could be the cause, so we assigned a dedicated network interface for internal communication and put it on a separate switch, but we still face the same issue. When we run backup jobs, the frequency of nodes leaving the cluster increases. Is there a way to debug or fix this, or what is the best way to rebuild the cluster without removing the running VMs?
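So far the only state checks we know of are roughly these (a sketch; the command names are from the stock PVE 3.x / cman stack, and the corosync log path is an assumption that may differ on other setups):

pvecm status          # quorum and membership as seen from this node
clustat               # cman's view of the member nodes
tail -f /var/log/syslog                              # pmxcfs and cman messages land here
grep -i retransmit /var/log/cluster/corosync.log     # corosync token/retransmit trouble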
 
Hi,

I see the following message:

pmxcfs[865097]: [status] crit: cpg_send_message failed: 9

When this message appears, cman is not running:

service cman status
Found stale pid file

The PVE version is:

proxmox-ve-2.6.32: 3.4-156 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-156
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Switch-level multicast is enabled and has been working without issues.
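
For anyone wanting to verify multicast end to end (not just that snooping is enabled on the switch), running omping on all nodes at the same time is a rough test; this sketch assumes omping is installed (apt-get install omping, if available in your repos) and that node1..node4 resolve to the cluster network addresses:

# run simultaneously on every node; any packet loss points to a multicast problem
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4
# a longer run catches drops that only appear after IGMP membership timeouts
omping -c 600 -i 1 -q node1 node2 node3 node4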
 
Quick ideas:
Make sure that your backup traffic and inter-VM traffic are not going through this dedicated network.
Also check the cabling (I had this once): one faulty cable was flapping. It is best to use a bond, as per Recommended_system_requirements (wiki); see the sketch below.
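
For reference, a minimal active-backup bond for the dedicated cluster network on Debian/PVE 3.x could look roughly like this in /etc/network/interfaces (eth1/eth2 and the address are placeholders for your own setup):

auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    slaves eth1 eth2
    bond_miimon 100
    bond_mode active-backup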
 
I've seen this happen when Corosync has issues with L2 switch IGMP snooping. It caused nodes to drop from the cluster every five minutes with the exact same errors you are seeing. Multicast packets simply weren't getting through.

Unfortunately, the lower-end HP ProCurve did not offer configuration options for multicast/IGMP, so I was forced to use the defaults HP had chosen.

The quick solution was configuring the PVE cluster to use Corosync in unicast mode instead of multicast. Not optimal, but it works reliably for a small number of nodes; a rough example of the change follows the link below.

https://pve.proxmox.com/wiki/Multic..._instead_of_multicast_.28if_all_else_fails.29
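
For reference, on PVE 3.x this is done in cluster.conf: copy /etc/pve/cluster.conf to cluster.conf.new, bump config_version, add transport="udpu" to the cman element, and activate it via the GUI as described in the wiki. Roughly like this (the cluster name, keyfile path and version number are from a stock setup and may differ on yours):

<cluster name="mycluster" config_version="4">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
  <!-- rest of cluster.conf unchanged -->
</cluster>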
 
Hi,

Thanks, but we got IGMP snooping to work well with our switches. We also found that one of the nodes was still flapping, so we isolated it; since then we have not faced any issues. This whole cluster issue started after the upgrade to Proxmox 3.4, so I think some core changes happened at the OS level which forced us to enable IGMP snooping.
 
