My cluster is no longer a cluster :(

a2j

New Member
Aug 3, 2015
Started yesterday. All 3 nodes appear to have lost each other somehow. I can get the cluster back together with service corosync restart, but that only works for 2-3 minutes before it falls apart again. I can log in to the individual nodes, and all VMs are running, but I can't do anything because there is no quorum.

error with cfs lock 'file-replication_cfg': no quorum!

proxmox-ve: 5.2-2 (running kernel: 4.15.18-4-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1



Quorum information
------------------
Date: Wed Oct 24 22:29:04 2018
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/3360
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.11 (local)





Absolutely no changes were made prior to this issue for at least 2 weeks.


Anything I can check for to figure out why this is happening?
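
A few things worth checking when this happens (a sketch of standard diagnostics; run these on each node and compare -- the commands are stock Proxmox/corosync tools, not something specific to this thread):

```shell
# Current quorum/membership view from this node's perspective
pvecm status

# Which members corosync itself currently sees
corosync-cmapctl | grep runtime.totem.pg.mrp.srp.members

# Recent corosync and pmxcfs logs, to catch membership flapping
journalctl -u corosync -u pve-cluster --since "1 hour ago"
```

If the logs show nodes repeatedly joining and leaving the totem ring, that usually points at the network (multicast) rather than the nodes themselves.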
 
I can get the cluster back together with service corosync restart, but that only works for 2-3 minutes before it falls apart again.
It seems multicast does not work properly in your environment. Has anything changed on the switch, maybe?
Try to test with omping (for longer than a few minutes) to see if that is the problem.
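
For reference, a typical omping test looks like this (start it on all nodes at roughly the same time; node1/node2/node3 stand in for your actual hostnames or IPs):

```shell
# Quick burst: ~10000 packets at 1 ms intervals
omping -c 10000 -i 0.001 -F -q node1 node2 node3

# Longer run (~10 minutes at 1 packet/s) -- important, because IGMP
# snooping issues often only show up after ~5 minutes
omping -c 600 -i 1 -q node1 node2 node3
```

The longer run matters here: switches with IGMP snooping enabled but no active querier tend to drop multicast group membership after a few minutes, which matches a cluster that holds together briefly after a corosync restart and then falls apart.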
 
No changes were made on the switches. Last night the cluster started working again.
Here is the omping test; it was running for a few hours.

node1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.089/0.238/0.571/0.071
node1 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.122/0.270/0.553/0.071
node2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.198/0.676/0.055
node2 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.118/0.240/1.973/0.065
 
I had the same problem and I think it is a bug in Proxmox 5.
For me, the following commands worked:

killall -9 corosync
systemctl restart pve-cluster

and, on a friend's recommendation:

apt-get update && apt-get dist-upgrade
 
node1 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.122/0.270/0.553/0.071
You have 56% multicast loss; I am not surprised your cluster does not work reliably.
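
If the multicast loss cannot be fixed on the switch side (e.g. by enabling an IGMP querier or disabling IGMP snooping on the cluster VLAN), corosync 2.x can be switched to unicast UDP as a workaround. A minimal sketch of the relevant change in /etc/pve/corosync.conf (remember to increment config_version in the totem section when editing, so the change propagates):

```
totem {
  version: 2
  config_version: 4        # must be higher than the current value
  transport: udpu          # unicast UDP instead of multicast
  ...
}
```

Unicast (udpu) avoids the multicast problem entirely, at the cost of somewhat higher network overhead; it is generally only recommended for small clusters (roughly up to 4 nodes).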
 
