My cluster is no longer a cluster :(

a2j

New Member
Aug 3, 2015
Started yesterday. All 3 nodes appear to have lost each other somehow. I can get the cluster back together with service corosync restart, but that only works for 2-3 minutes before it falls apart again. I can log in to the individual nodes, and all VMs are running, but I can't do anything because there is no quorum.

error with cfs lock 'file-replication_cfg': no quorum!

proxmox-ve: 5.2-2 (running kernel: 4.15.18-4-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1



Quorum information
------------------
Date: Wed Oct 24 22:29:04 2018
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/3360
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.11 (local)





Absolutely no changes were made prior to this issue for at least 2 weeks.


Anything I can check for to figure out why this is happening?
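
A few things worth checking when this happens (a sketch of standard diagnostics; run these on each node and compare -- the commands are stock Proxmox/corosync tools, not something specific to this thread):

```shell
# Current quorum/membership view from this node's perspective
pvecm status

# Which members corosync itself currently sees
corosync-cmapctl | grep runtime.totem.pg.mrp.srp.members

# Recent corosync and pmxcfs logs, to catch membership flapping
journalctl -u corosync -u pve-cluster --since "1 hour ago"
```

If the logs show nodes repeatedly joining and leaving the totem ring, that usually points at the network (multicast) rather than the nodes themselves.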
 
I can get the cluster back together with service corosync restart, but that only works for 2-3 minutes before it falls apart again.
It seems multicast does not work properly in your environment. Has anything changed on the switch, maybe?
Try to test with omping (for longer than a few minutes) to see if that is the problem.
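
For reference, a typical omping test looks like this (start it on all nodes at roughly the same time; node1/node2/node3 stand in for your actual hostnames or IPs):

```shell
# Quick burst: ~10000 packets at 1 ms intervals
omping -c 10000 -i 0.001 -F -q node1 node2 node3

# Longer run (~10 minutes at 1 packet/s) -- important, because IGMP
# snooping issues often only show up after ~5 minutes
omping -c 600 -i 1 -q node1 node2 node3
```

The longer run matters here: switches with IGMP snooping enabled but no active querier tend to drop multicast group membership after a few minutes, which matches a cluster that holds together briefly after a corosync restart and then falls apart.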
 
No changes were made on the switches. Last night the cluster started working again.
Here is the omping test; it was running for a few hours.

node1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.089/0.238/0.571/0.071
node1 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.122/0.270/0.553/0.071
node2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.198/0.676/0.055
node2 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.118/0.240/1.973/0.065
 
I had the same problem and I think it is a bug in Proxmox 5.
For me, the following commands worked:

killall -9 corosync
systemctl restart pve-cluster

and, on a friend's recommendation:

apt-get update && apt-get dist-upgrade
 
node1 : multicast, xmt/rcv/%loss = 10000/4367/56%, min/avg/max/std-dev = 0.122/0.270/0.553/0.071
You have 56% multicast loss; I am not surprised your cluster does not work reliably.
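
If the multicast loss cannot be fixed on the switch side (e.g. by enabling an IGMP querier or disabling IGMP snooping on the cluster VLAN), corosync 2.x can be switched to unicast UDP as a workaround. A minimal sketch of the relevant change in /etc/pve/corosync.conf (remember to increment config_version in the totem section when editing, so the change propagates):

```
totem {
  version: 2
  config_version: 4        # must be higher than the current value
  transport: udpu          # unicast UDP instead of multicast
  ...
}
```

Unicast (udpu) avoids the multicast problem entirely, at the cost of somewhat higher network overhead; it is generally only recommended for small clusters (roughly up to 4 nodes).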
 
