Updated 5-node cluster to 3.2 and lost one of the nodes

rahman

Hi,

I just upgraded our 5-node cluster to 3.2 and now one of the nodes can't join the cluster. It says "waiting for quorum" and times out every time. The nodes are kvm44, kvm45, kvm46, kvm47 and kvm48. I shut down all the VMs, then updated and rebooted the nodes one at a time. I started with kvm48, then kvm47, with success. But after I updated and rebooted kvm46 it could not join the cluster. I continued to update the others successfully too.

How can I rejoin kvm46 to the cluster?
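In case it helps to reproduce, this is roughly what I look at on the stuck node (a sketch of standard PVE 3.x commands, not my exact session):

Code:
# membership / quorum view from the Proxmox side
pvecm status
pvecm nodes
# same information from the redhat-cluster side
clustat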

Here are the pveversion -v outputs:

kvm44
Code:
root@kvm44:~# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-25-pve: 2.6.32-113
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

kvm45
Code:
root@kvm45:~# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-25-pve: 2.6.32-113
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

kvm46:
Code:
root@kvm46:~# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-25-pve: 2.6.32-113
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

kvm47:
Code:
root@kvm47:~# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-25-pve: 2.6.32-113
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

kvm48:
Code:
root@kvm48:~# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-24-pve: 2.6.32-111
pve-kernel-2.6.32-25-pve: 2.6.32-113
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
 
Fixed it. The issue was /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping being set to 1. Setting it to 0 and restarting cman fixed the problem (the steps are sketched below the config). But in /etc/network/interfaces of kvm46 there is already a post-up command that sets it to 0, which is weird.

auto vmbr0
iface vmbr0 inet static
address 10.255.254.46
netmask 255.255.255.0
bridge_ports eth1
bridge_stp off
bridge_fd 0
post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
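Roughly the manual steps that brought kvm46 back (a sketch; paths and init scripts as on a stock PVE 3.x install):

Code:
# disable multicast snooping on the cluster bridge (takes effect immediately)
echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
# restart the cluster manager so the node rejoins
/etc/init.d/cman restart
# verify quorum / membership afterwards
pvecm status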
 
I don't know why, but with multicast snooping enabled on the pve cluster bridge, cluster communication doesn't work. Google says it is a known Linux bridge bug/limitation.
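If someone wants to check whether snooping is the culprit on their own cluster, a quick sketch (assumes root ssh between the nodes; omping is a separate multicast test package, only relevant if installed):

Code:
# 0 = snooping disabled, 1 = enabled
for n in kvm44 kvm45 kvm46 kvm47 kvm48; do
    echo -n "$n: "
    ssh root@$n cat /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
done
# optional: test multicast between nodes (run on every node at the same time)
# omping kvm44 kvm45 kvm46 kvm47 kvm48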
 
I don't know why, but with multicast snooping enabled on the pve cluster bridge, cluster communication doesn't work. Google says it is a known Linux bridge bug/limitation.

My cluster communication is clearly working, so there could be another reason... or not?

Marco
 
I really don't know, and I don't fully understand it either. Like others, I had serious issues with recent pve kernels while backing up to Nexenta CIFS. Backups of large VMs were not completing. When a backup failed, the VMs would lock up, and Windows Server VMs with virtio drivers would go into BSOD reboot loops. Not something amusing.

Then I started to use the Debian backported 3.10 kernel, which solved the CIFS issues. But then multicast clustering broke, which I solved by adding "post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping" to the cluster vmbr.

With 3.2 I wanted to try the pve kernel again, but without the multicast workaround it did not work either.

Now I have reverted to the 3.10 Debian kernel, as CIFS is broken with this kernel too. And the latest pve kernel broke my vlan/ipv6 setup, which is working as expected with the Debian kernel now.
 
