Broken cluster of 8 nodes

xtavras

Hello,

we have a problem with our Proxmox cluster: "/etc/pve" is read-only, although the cluster looks OK in the "pvecm status" output:
Code:
Version: 6.2.0
Config Version: 26
Cluster Name: BLN
Cluster Id: 494
Cluster Member: Yes
Cluster Generation: 40632
Membership state: Cluster-Member
Nodes: 8
Expected votes: 8
Total votes: 8
Node votes: 1
Quorum: 5
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox03
Node ID: 6
Multicast addresses: 255.255.255.255
Node addresses: 10.1.1.203
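
Just to illustrate the symptom: any write attempt under /etc/pve fails, roughly like this (sketch only, the file name is arbitrary):
Code:
# a simple write test; it fails as long as pmxcfs considers itself quorum-less
touch /etc/pve/test-write
# clean up again once the cluster is writable
rm -f /etc/pve/test-write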

We have already stopped/restarted the cman/pve-cluster services with varying success (sometimes a node didn't see the others and ended up in its own cluster), but even now, with every node in the cluster, Proxmox still says "no quorum".
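
The restart sequence per node was roughly the following (PVE 3.x still uses sysvinit scripts; the rgmanager lines only apply if HA is configured):
Code:
# stop the stack top to bottom, then start it again in reverse order
service rgmanager stop
service pve-cluster stop
service cman stop

service cman start
service pve-cluster start
service rgmanager start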

The systems were in different software states at the moment of failure, but have since been upgraded to the latest version of the 3.4 branch (without a reboot, so they are running different kernel versions):

Code:
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-15 (running version: 3.4-15/e1daa307)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-46-pve: 2.6.32-177
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-20
qemu-server: 3.4-9
pve-firmware: 1.1-5
libpve-common-perl: 3.0-27
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-35
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-25
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1


These are the only error messages that we see on all nodes:

Code:
Sep 20 06:26:43 proxmox03 pmxcfs[653226]: [dcdb] notice: cpg_join retry 390700
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [dcdb] notice: cpg_join retry 390710
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [status] crit: cpg_send_message failed: 9
Sep 20 06:26:44 proxmox03 pmxcfs[653226]: [status] crit: cpg_send_message failed: 9
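
The return code in "cpg_send_message failed: 9" is a corosync cs_error_t value; one way to look up what 9 maps to, assuming the corosync development headers are installed (the header path may differ on your system):
Code:
# list the cs_error_t enum values to see which one corresponds to 9
grep -n 'CS_OK\|CS_ERR' /usr/include/corosync/corotypes.h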

Does someone have a clue how this can be fixed (ideally without a reboot)?


P.S. We use unicast (no multicast) and added the hostnames to /etc/hosts to rule out DNS issues.
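
To double-check both of those, something like this should work (transport="udpu" on the <cman> element is the usual way unicast is enabled on PVE 3.x; the hostnames are just examples from our cluster):
Code:
# is the unicast transport really set in the cluster config?
grep -i 'transport' /etc/pve/cluster.conf

# does each node name resolve locally to its ring address?
getent hosts proxmox03 proxmox05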
 
Could someone from the developers maybe comment on how pmxcfs checks that the cluster is in a healthy state? The cluster looks fine, but pmxcfs is still read-only:

Code:
proxmox03:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  41224   2016-09-21 17:05:47  proxmox05
   2   M  41216   2016-09-21 17:05:23  proxmox06
   5   M  41228   2016-09-21 17:05:59  proxmox04
   6   M  41160   2016-09-21 17:03:21  proxmox03
   7   M  41216   2016-09-21 17:05:23  proxmox08
   8   M  41220   2016-09-21 17:05:38  proxmox07
   9   M  41216   2016-09-21 17:05:23  proxmox09
  10   M  41216   2016-09-21 17:05:23  proxmox10
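
Regarding the pmxcfs side, the only things I found to look at so far are its own status files and its syslog messages (sketch below; the .members file and the exact log wording may depend on the pve-cluster version):
Code:
# pmxcfs' own view of the membership, if this pve-cluster version provides the virtual file
cat /etc/pve/.members

# pmxcfs logs quorum changes to syslog, e.g. "node has quorum" / "node lost quorum"
grep -i pmxcfs /var/log/syslog | grep -i quorum | tail -n 20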
 
you can try:

on each node:

killall -9 corosync


then on each node:

systemctl restart pve-cluster
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart pvestatd


That should restart the whole stack.
 
hi spirit, this is how we restarted the whole cluster, with the only difference that we started cman on each node after pve-cluster was restarted (otherwise corosync would not be running, I think). But somehow, although "pvecm nodes" shows all nodes and "pvecm status" shows quorum, /etc/pve stays read-only, which makes me wonder why. Btw, clustat shows that everything is fine, too.
 
oh, sorry, I didn't see that you are on Proxmox 3.

that's very strange; if you have quorum, /etc/pve should be writable.
 
We just experienced this problem after upgrading a four-node cluster from version 3.1 to 3.4.
After hours of trying everything, I finally went back to basics and discovered that the hosts file had been overwritten. I can only assume this was done by a script during the upgrade. The new hosts file contained only one IPv4 host. I modified the hosts file on all four nodes to include all four nodes, then stopped and started the services as suggested above by spirit, along with cman as suggested by xtavras. The cluster immediately synced and all is well.
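
For anyone else hitting this, the kind of hosts file entries needed look roughly like the example below (addresses and names are placeholders; every node needs an entry for every cluster member, including itself):
Code:
# illustrative /etc/hosts entries - substitute your own addresses and hostnames
10.0.0.1  node1.example.com  node1
10.0.0.2  node2.example.com  node2
10.0.0.3  node3.example.com  node3
10.0.0.4  node4.example.com  node4

# quick check that the names now resolve locally
getent hosts node2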
 
