Broken cluster of 8 nodes

xtavras · Sep 20, 2016

Hello,

We have a problem with our Proxmox cluster: "/etc/pve" is read-only, although the cluster looks fine in the "pvecm status" output:

Code:

Version: 6.2.0
Config Version: 26
Cluster Name: BLN
Cluster Id: 494
Cluster Member: Yes
Cluster Generation: 40632
Membership state: Cluster-Member
Nodes: 8
Expected votes: 8
Total votes: 8
Node votes: 1
Quorum: 5
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox03
Node ID: 6
Multicast addresses: 255.255.255.255
Node addresses: 10.1.1.203

We have already stopped and restarted the cman and pve-cluster services, with varying success (sometimes one node did not see the others and formed its own cluster). Even now, with every node in the cluster, Proxmox says "no quorum".

The systems were at different software versions when the failure occurred, but have since been upgraded to the same latest version of the 3.4 branch (without a reboot, so they are running different kernel versions):

Code:

proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-15 (running version: 3.4-15/e1daa307)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-5
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-25
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

These are the only error messages that we see on all nodes:

Code:

Sep 20 06:26:43 proxmox03 pmxcfs[653226]: [dcdb] notice: cpg_join retry 390700
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [dcdb] notice: cpg_join retry 390710
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [status] crit: cpg_send_message failed: 9
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [status] crit: cpg_send_message failed: 9

Does anyone have a clue how to fix this (ideally without a reboot)?

P.S. We use unicast (no multicast) and have added the hostnames to /etc/hosts to rule out DNS issues.

xtavras · Sep 21, 2016

Could one of the developers comment on how pmxcfs checks that the cluster is in a healthy state? The cluster looks fine, but pmxcfs is still read-only:

Code:

proxmox03:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  41224   2016-09-21 17:05:47  proxmox05
   2   M  41216   2016-09-21 17:05:23  proxmox06
   5   M  41228   2016-09-21 17:05:59  proxmox04
   6   M  41160   2016-09-21 17:03:21  proxmox03
   7   M  41216   2016-09-21 17:05:23  proxmox08
   8   M  41220   2016-09-21 17:05:38  proxmox07
   9   M  41216   2016-09-21 17:05:23  proxmox09
  10   M  41216   2016-09-21 17:05:23  proxmox10
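The vote figures in the "pvecm status" output can be cross-checked with a small script. With 8 expected votes, the majority threshold is 8/2 + 1 = 5, which matches the "Quorum: 5" line above. The following is a minimal sketch, assuming the line labels shown in that output; quorum_check is a hypothetical helper name, not part of Proxmox. It only reflects cman's view of the votes and says nothing about whether pmxcfs has joined its process group, which is what the cpg_join retries in the log are about.

```shell
# Hypothetical helper: reads `pvecm status` text on stdin and reports
# whether the total votes reach the quorum threshold.
quorum_check() {
    awk -F': *' '
        /^Total votes:/ { total = $2 + 0 }
        /^Quorum:/      { quorum = $2 + 0 }
        END {
            # No usable "Quorum:" line was found in the input.
            if (quorum == 0) { print "unknown"; exit 2 }
            if (total >= quorum) print "quorate: " total "/" quorum
            else                 print "not quorate: " total "/" quorum
        }'
}
```

Usage: pvecm status | quorum_check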

spirit · Sep 21, 2016

You can try the following.

On each node:

killall -9 corosync

then on each node:

systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd

It should restart the whole stack

xtavras · Sep 21, 2016

Hi spirit, this is how we restarted the whole cluster, with the only difference being that we started cman on each node after pve-cluster was restarted; otherwise corosync would not be running, I think. But somehow, although "pvecm nodes" shows all nodes and "pvecm status" shows quorum, /etc/pve stays read-only, which makes me wonder why. By the way, clustat shows that everything is fine too.
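The sequence described above can be written down as a small script. This is only a sketch: the order and service names are taken from the posts in this thread, and it assumes the SysV "service" wrapper of a Proxmox VE 3.x node rather than systemctl. restart_pve3_stack is a hypothetical name. By default it only prints the commands; set RUN=1 to execute them.

```shell
# Sketch only: service names and order follow the posts above; assumes
# the SysV `service` wrapper of a Proxmox VE 3.x node.
# Prints the commands unless RUN=1 is set.
restart_pve3_stack() {
    run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }
    run killall -9 corosync
    run service pve-cluster restart
    # On 3.x corosync is started through cman, so cman comes after pve-cluster.
    run service cman start
    for svc in pvedaemon pveproxy pvestatd; do
        run service "$svc" restart
    done
}
```

Running restart_pve3_stack without RUN=1 gives a dry run, so the list can be reviewed on each node before anything is killed.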

spirit · Sep 22, 2016

xtavras said:
hi spirit, this is how we restarted the whole cluster, with only difference that we started cman on each node after pve-cluster was restarted, otherwise corosync will not be running, I think. But somehow, although "pvecm nodes" show all nodes and "pvecm status" shows quorum, /etc/pve stays read-only, which make me wonder why. btw, clustat shows everything is fine too.

Oh, sorry, I didn't see that you are on Proxmox 3.

That's very strange: if you have quorum, /etc/pve should be writable.

ronsrussell · Jan 16, 2017

We just experienced this problem after upgrading a four-node cluster from version 3.1 to 3.4.
After hours of trying everything, I finally went back to basics and discovered that the hosts file had been overwritten. I can only assume this was done by a script during the upgrade. The new hosts file contained only one IPv4 host. I modified the hosts file on all four nodes to include all four nodes, then stopped and started the services as suggested above by spirit, along with cman as suggested by xtavras. The cluster immediately synced and all is well.
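A quick way to spot an overwritten hosts file is to check that every cluster node name still has an entry in it. The following is a minimal sketch; missing_hosts is a hypothetical helper, and the node names passed to it are whatever "pvecm nodes" lists for your cluster.

```shell
# Hypothetical helper: prints every node name that has no entry in the
# given hosts file. Usage: missing_hosts FILE NODE...
missing_hosts() {
    file=$1; shift
    for node in "$@"; do
        # Strip comments, then look for the name in any column after the address.
        awk -v n="$node" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file" || echo "$node"
    done
}
```

Usage: missing_hosts /etc/hosts proxmox03 proxmox04 proxmox05
No output means every name was found. Run it on each node, since the file may differ between them.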
