After upgrading to 3.4: no quorum and pve-cluster error

Whatever

Hello, Everybody


I've faced a problem while upgrading to 3.4.

In a 6-node cluster I tried to upgrade 2 nodes to the latest 3.4. After the upgrade and reboot, neither of those 2 nodes joined the cluster again.

While pve-cluster is starting up I receive the following error:

Code:
Wed Mar 11 11:45:07 2015: Starting pve cluster filesystem : pve-cluster[dcdb] crit: local cluster.conf is newer

And cman startup failed with:
Code:
Wed Mar 11 11:45:56 2015: Starting Cluster Service Manager: [  OK  ]
Wed Mar 11 11:45:57 2015: Starting Proxmox VE firewall: pve-firewall.
Wed Mar 11 11:45:57 2015: Starting PVE Daemon: pvedaemon.
Wed Mar 11 11:45:57 2015: Starting PVE Status Daemon: pvestatd.
Wed Mar 11 11:45:57 2015: Starting PVE API Proxy Server: pveproxy.
Wed Mar 11 11:45:58 2015: Starting PVE SPICE Proxy Server: spiceproxy.
Wed Mar 11 11:45:58 2015: Starting VMs and Containers
Wed Mar 11 11:46:08 2015: cluster not ready - no quorum?

The cluster was built on top of an Infiniband network and worked like a charm until this upgrade.
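
For completeness, this is roughly how I have been checking the state on the broken node (commands from memory, so please treat them only as a sketch):

Code:
# quorum / membership as seen by this node
pvecm status
pvecm nodes

# compare the cluster config version between a broken and a working node;
# the "local cluster.conf is newer" message points at a version mismatch
grep config_version /etc/pve/cluster.conf
# if /etc/pve is not mounted, cman's copy is /etc/cluster/cluster.conf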


On the broken node:
Code:
root@pve02A:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

On the alive node (not yet upgraded):
Code:
root@pve01r:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-34-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-24-pve: 2.6.32-111
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-31-pve: 2.6.32-132
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1



Any help is very much appreciated!
 
It seems the problem exists only with kernel 2.6.32-37-pve; when I choose 2.6.32-34-pve, both nodes join the cluster successfully.

What has changed in kernel 2.6.32-37-pve with regard to cluster communication / Infiniband multicast?
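
Until it is clear what changed, I simply pin the older kernel in GRUB. A rough sketch (the menu entry title has to match what your grub.cfg actually contains, mine is only an example):

Code:
# list the available menu entries
grep menuentry /boot/grub/grub.cfg

# in /etc/default/grub point the default at the working kernel, e.g.
#   GRUB_DEFAULT="Proxmox Virtual Environment GNU/Linux, with Linux 2.6.32-34-pve"
# (with submenus you may need the "submenu title>entry title" form)
update-grub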
 
I suspect it is the reboot that resolved the issue...not the kernel...but I am not 100% sure.

I am having similar issues as well on some of our nodes...and so far a reboot on each node that would not join the cluster has helped.

The problem is that it is a bit of a pain to reboot some of the nodes currently in production.
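
What I want to try next is restarting just the cluster stack instead of a full reboot; something like the following (untested here, so no promises it actually brings the node back into quorum):

Code:
# on the node that refuses to join
service pve-cluster restart
service cman restart
# restart the PVE daemons afterwards so they pick up /etc/pve again
service pvedaemon restart
service pvestatd restart
service pveproxy restart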

Please let me know if you try the other kernel again and it works.

Thanks,

Shain
 
I have a similar problem with Proxmox 3.4 and Proxmox 3.2.
When I try to add a new node running 2.6.32-37-pve (Feb. 11) to a 2.6.32-28-pve cluster, the new node waits endlessly for quorum (unicast synchronisation).

I will set up two nodes with Proxmox 3.4 and check if they can at least form a new quorum themselves :)
Anyway, it's very strange that 2.6.32-37-pve (Feb. 11) cannot sync the quorum correctly via corosync with the old Proxmox nodes.
 
What's going on here? I'm really worried about upgrading our cluster! We have a Proxmox subscription (and also one for all our customers). But when major bugs like this make it into the Enterprise repository, it doesn't really make sense to buy one!
 
I suspect it is the reboot that resolved the issue...not the kernel...but I am not 100% sure.

I am having similar issues as well on some of our nodes...and so far a reboot on each node that would not join the cluster has helped.

Unfortunately, the only workaround I've found so far is to downgrade the kernel.
The problem seems to be somehow related to the updated Mellanox Infiniband driver (I'm using an IP-over-Infiniband network for inter-cluster communication).
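
If somebody wants to double-check whether multicast still works over the IPoIB interface with the new kernel, omping should show it; run the same command on all nodes at the same time (the hostnames below are just two of my nodes, list all of yours):

Code:
apt-get install omping
# run simultaneously on every cluster node, using the IPoIB addresses
omping -c 600 -i 1 -q pve01r pve02A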
 
I have a similar problem with Proxmox 3.4 and Proxmox 3.2.
When I try to add a new node running 2.6.32-37-pve (Feb. 11) to a 2.6.32-28-pve cluster, the new node waits endlessly for quorum (unicast synchronisation).

I will set up two nodes with Proxmox 3.4 and check if they can at least form a new quorum themselves :)
Anyway, it's very strange that 2.6.32-37-pve (Feb. 11) cannot sync the quorum correctly via corosync with the old Proxmox nodes.

OK, I found my problem, and it was not the kernel's fault.
I just discovered that it's currently impossible (or very difficult) to add a new node with a UNICAST corosync configuration.
I had to switch back to MULTICAST for the quorum sync before I could add a new node successfully.
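
For anyone else hitting this: the change is just removing the transport attribute from the cman tag again and bumping config_version. Roughly like this (my cman line is only an example; edit the .new copy and activate it from the GUI as usual):

Code:
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
# in cluster.conf.new:
#   - increment config_version in the <cluster ...> tag
#   - change   <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
#     back to  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
# then activate the new file (Datacenter -> HA -> Activate in the web GUI)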
 
