Nodes frequently leaving the cluster

tirpitz

Member
Jun 5, 2014
Hello,

We have a cluster of 4 nodes. For the last 3-4 weeks, nodes have been leaving the cluster frequently, and we have to restart the cman and pve-cluster services to bring them back into the cluster. We thought high network traffic could be the cause, so we assigned a dedicated network interface for internal communication and put it on a separate switch, but we still face the same issue. When we run backup jobs, the frequency of nodes leaving the cluster increases. Is there a way to debug or fix this, or what is the best way to rebuild the cluster without removing the running VMs?
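So far the only state checks we know of are roughly these (a sketch; the command names are from the stock PVE 3.x / cman stack, and the corosync log path is an assumption that may differ on other setups):

pvecm status          # quorum and membership as seen from this node
clustat               # cman's view of the member nodes
tail -f /var/log/syslog                              # pmxcfs and cman messages land here
grep -i retransmit /var/log/cluster/corosync.log     # corosync token/retransmit trouble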
 
Hi,

I see the following message:

pmxcfs[865097]: [status] crit: cpg_send_message failed: 9

When this message appears, cman is not running:

service cman status
Found stale pid file

The PVE version is:

proxmox-ve-2.6.32: 3.4-156 (running kernel: 2.6.32-39-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-156
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Switch-level multicast is enabled and has been working without issues.
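
For anyone wanting to verify multicast end to end (not just that snooping is enabled on the switch), running omping on all nodes at the same time is a rough test; this sketch assumes omping is installed (apt-get install omping, if available in your repos) and that node1..node4 resolve to the cluster network addresses:

# run simultaneously on every node; any packet loss points to a multicast problem
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4
# a longer run catches drops that only appear after IGMP membership timeouts
omping -c 600 -i 1 -q node1 node2 node3 node4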
 
Quick ideas:
Make sure that your backup traffic and inter-VM traffic are not going through this dedicated network.
Also check the cabling (I had this once): one faulty cable was flapping. It is best to use a bond, as per Recommended_system_requirements (wiki); see the sketch below.
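
For reference, a minimal active-backup bond for the dedicated cluster network on Debian/PVE 3.x could look roughly like this in /etc/network/interfaces (eth1/eth2 and the address are placeholders for your own setup):

auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    slaves eth1 eth2
    bond_miimon 100
    bond_mode active-backup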
 
I've seen this happen when Corosync has issues with L2 switch IGMP snooping. It caused nodes to drop from the cluster every five minutes with the exact same errors you are seeing. Multicast packets simply weren't getting through.

Unfortunately, the lower-end HP ProCurve did not offer configuration options for multicast/IGMP, so I was forced to use the defaults HP had chosen.

The quick solution was configuring the PVE cluster to use Corosync in unicast mode instead of multicast. Not optimal, but it works reliably for a small number of nodes; a rough example of the change follows the link below.

https://pve.proxmox.com/wiki/Multic..._instead_of_multicast_.28if_all_else_fails.29
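
For reference, on PVE 3.x this is done in cluster.conf: copy /etc/pve/cluster.conf to cluster.conf.new, bump config_version, add transport="udpu" to the cman element, and activate it via the GUI as described in the wiki. Roughly like this (the cluster name, keyfile path and version number are from a stock setup and may differ on yours):

<cluster name="mycluster" config_version="4">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
  <!-- rest of cluster.conf unchanged -->
</cluster>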
 
Hi,

Thanks, but we got IGMP snooping to work well with our switches. We also found that one of the nodes was still flapping, so we isolated it; since then we have not faced any issues. This whole cluster issue started after the upgrade to Proxmox 3.4, so I think some core changes happened at the OS level which forced us to enable IGMP snooping.
 
