After upgrading to 3.4: no quorum and pve-cluster error

Whatever

Hello, Everybody


I've faced a problem while upgrading to 3.4.

In a 6-node cluster I tried to upgrade 2 nodes to the latest 3.4. After the upgrade and reboot, neither of those 2 nodes joined the cluster again.

While pve-cluster is starting up I receive the following error:

Code:
Wed Mar 11 11:45:07 2015: Starting pve cluster filesystem : pve-cluster[dcdb] crit: local cluster.conf is newer

And cman startup failed with:
Code:
Wed Mar 11 11:45:56 2015: Starting Cluster Service Manager: [  OK  ]
Wed Mar 11 11:45:57 2015: Starting Proxmox VE firewall: pve-firewall.
Wed Mar 11 11:45:57 2015: Starting PVE Daemon: pvedaemon.
Wed Mar 11 11:45:57 2015: Starting PVE Status Daemon: pvestatd.
Wed Mar 11 11:45:57 2015: Starting PVE API Proxy Server: pveproxy.
Wed Mar 11 11:45:58 2015: Starting PVE SPICE Proxy Server: spiceproxy.
Wed Mar 11 11:45:58 2015: Starting VMs and Containers
Wed Mar 11 11:46:08 2015: cluster not ready - no quorum?

The cluster was built on top of an Infiniband network and worked like a charm until this upgrade.
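
For completeness, this is roughly how I have been checking the state on the broken node (commands from memory, so please treat them only as a sketch):

Code:
# quorum / membership as seen by this node
pvecm status
pvecm nodes

# compare the cluster config version between a broken and a working node;
# the "local cluster.conf is newer" message points at a version mismatch
grep config_version /etc/pve/cluster.conf
# if /etc/pve is not mounted, cman's copy is /etc/cluster/cluster.conf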


On the broken node:
Code:
root@pve02A:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

On the alive node (not yet upgraded):
Code:
root@pve01r:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-34-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-24-pve: 2.6.32-111
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-37-pve: 2.6.32-147
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-31-pve: 2.6.32-132
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1



Any help is very much appreciated!
 
It seems the problem exists only with kernel 2.6.32-37-pve; when I choose 2.6.32-34-pve, both nodes join the cluster successfully.

What has changed in kernel 2.6.32-37-pve with regard to cluster communication / Infiniband multicast?
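
Until it is clear what changed, I simply pin the older kernel in GRUB. A rough sketch (the menu entry title has to match what your grub.cfg actually contains, mine is only an example):

Code:
# list the available menu entries
grep menuentry /boot/grub/grub.cfg

# in /etc/default/grub point the default at the working kernel, e.g.
#   GRUB_DEFAULT="Proxmox Virtual Environment GNU/Linux, with Linux 2.6.32-34-pve"
# (with submenus you may need the "submenu title>entry title" form)
update-grub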
 
I suspect it is the reboot that resolved the issue...not the kernel...but I am not 100% sure.

I am having similar issues as well on some of our nodes...and so far a reboot on each node that would not join the cluster has helped.

The problem is that it is a bit of a pain to reboot some of the nodes currently in production.
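
What I want to try next is restarting just the cluster stack instead of a full reboot; something like the following (untested here, so no promises it actually brings the node back into quorum):

Code:
# on the node that refuses to join
service pve-cluster restart
service cman restart
# restart the PVE daemons afterwards so they pick up /etc/pve again
service pvedaemon restart
service pvestatd restart
service pveproxy restart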

Please let me know if you try the other kernel again and it works.

Thanks,

Shain
 
I have a similar problem with Proxmox 3.4 and Proxmox 3.2.
When I try to add a new node running 2.6.32-37-pve (Feb. 11) to a 2.6.32-28-pve cluster, the new node waits endlessly for quorum (unicast synchronisation).

I will set up two nodes with Proxmox 3.4 and check if they can at least form a new quorum themselves :)
Anyway, it's very strange that 2.6.32-37-pve (Feb. 11) cannot sync the quorum correctly via corosync with the old Proxmox nodes.
 
What's going on here? I'm really worried about upgrading our cluster! We have a Proxmox subscription (and also one for all our customers). But when major bugs like this make it into the Enterprise repository, it doesn't really make sense to buy one!
 
I suspect it is the reboot that resolved the issue...not the kernel...but I am not 100% sure.

I am having similar issues as well on some of our nodes...and so far a reboot on each node that would not join the cluster has helped.

Unfortunately, the only workaround I've found so far is to downgrade the kernel.
The problem seems to be somehow related to the updated Mellanox Infiniband driver (I'm using an IP-over-Infiniband network for inter-cluster communication).
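
If somebody wants to double-check whether multicast still works over the IPoIB interface with the new kernel, omping should show it; run the same command on all nodes at the same time (the hostnames below are just two of my nodes, list all of yours):

Code:
apt-get install omping
# run simultaneously on every cluster node, using the IPoIB addresses
omping -c 600 -i 1 -q pve01r pve02A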
 
I have a similar problem with Proxmox 3.4 and Proxmox 3.2.
When I try to add a new node running 2.6.32-37-pve (Feb. 11) to a 2.6.32-28-pve cluster, the new node waits endlessly for quorum (unicast synchronisation).

I will set up two nodes with Proxmox 3.4 and check if they can at least form a new quorum themselves :)
Anyway, it's very strange that 2.6.32-37-pve (Feb. 11) cannot sync the quorum correctly via corosync with the old Proxmox nodes.

OK, I found my problem, and it was not the kernel's fault.
I just discovered that it's currently impossible (or very difficult) to add a new node with a UNICAST corosync configuration.
I had to switch back to MULTICAST for the quorum sync before I could add a new node successfully.
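
For anyone else hitting this: the change is just removing the transport attribute from the cman tag again and bumping config_version. Roughly like this (my cman line is only an example; edit the .new copy and activate it from the GUI as usual):

Code:
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
# in cluster.conf.new:
#   - increment config_version in the <cluster ...> tag
#   - change   <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
#     back to  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
# then activate the new file (Datacenter -> HA -> Activate in the web GUI)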
 
