Update cluster to 3.2 and kernel 3.10: quorum lost

alain

Hi all,

I recently got a new server, a Dell PE R620, and decided to install it with PVE 3.2. As I only use KVM, I also installed the new 3.10 kernel from the Enterprise repository. I had a small problem with this kernel: the server did not boot properly the first time. It turned out that it was trying to mount the LVM volumes before the RAID controller was loaded. This was resolved by adding the option 'scsi_mod.scan=sync' to the 'GRUB_CMDLINE_LINUX_DEFAULT' line in /etc/default/grub, as stated in another thread.
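For reference, a minimal sketch of what the change looks like (the other options on that line, e.g. 'quiet', depend on your existing configuration):

Code:
# /etc/default/grub (excerpt) -- scsi_mod.scan=sync makes the SCSI scan
# synchronous, so the RAID controller's disks are present before LVM activation
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync"

followed by running 'update-grub' and rebooting.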

After that, I joined the new server to my PVE cluster, and everything seemed to be fine.

I then also updated the other 3 nodes in the cluster and installed the 3.10 kernel on them as well. All went fine until I rebooted the last node. Then one node appeared in red in the web management interface. I checked the cluster status and found that I had lost quorum, with a line stating 'Quorum: 3 Activity blocked'. Shortly after, none of the nodes could see the others anymore, and the cluster had failed.
'pvecm nodes' showed only one node as alive (the node I was logged in on).
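For reference, this is how the cluster state can be checked on PVE 3.x (output omitted, it varies per cluster):

Code:
# check quorum and node membership on a PVE 3.x cluster
pvecm status   # quorum information; here it showed 'Quorum: 3 Activity blocked'
pvecm nodes    # node membership; here only the local node was listed as alive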

I tried rebooting some nodes. After that, quorum came back for a short time, but it was lost again shortly after.

I then went back to the 2.6.32 kernel on the last node I had installed (in fact I removed the 3.10 kernel and ran update-grub). After rebooting, the whole cluster and the quorum came back instantly.
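Roughly, the rollback steps were as follows (the exact 3.10 package name below is illustrative; check the installed name with dpkg first):

Code:
# list the installed 3.10 kernel package(s), then remove and regenerate grub
dpkg -l 'pve-kernel-3.10*'
apt-get remove pve-kernel-3.10.0-1-pve   # illustrative package name
update-grub
reboot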

So, for now I will stick with the 2.6.32 kernel and reinstall it on each node (I have already done it on one node). Are there known problems with the 3.10 kernel?

This is my pveversion (on the new server, now with the 2.6.32 kernel):
Code:
# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
 
Just a quick note: the release notes say the 3.10 kernel is for "testing only". I thought that, even though it has no OpenVZ patches, it was good enough to be in the Enterprise repository. From my experience, I can say that is not the case.

It seems that at least one node running a 2.6.32 kernel is required for the cluster to work. I previously had a lot of ARP table overflow messages, which I did not see with the 3.10 kernel.
Could this problem be related to multicast?

So, from my testing, the 3.10 kernel is not reliable for now (perhaps because of the integration of a RHEL kernel into a Debian distribution, RHEL 7 now using systemd, as stated in a previous thread by mir?).
 
Does it help if you enable multicast querier?

# echo 1 >/sys/class/net/vmbr0/bridge/multicast_querier
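Note that this does not survive a reboot; one way to make it persistent (a sketch, assuming the usual ifupdown bridge configuration) is a post-up line in /etc/network/interfaces:

Code:
# /etc/network/interfaces (excerpt, sketch -- address/port settings are placeholders)
auto vmbr0
iface vmbr0 inet static
        # ... existing address/netmask/bridge_ports settings ...
        post-up echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier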
 
Hi Dietmar, spirit,

I tried it on the last node I installed, which still has no VMs. The first:
# echo 1 >/sys/class/net/vmbr0/bridge/multicast_querier

did not seem to change the situation.
The second:
echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

seems to have reduced the number of messages, but not completely; I still see this in syslog:
Code:
Mar 17 09:26:21 srv-virt4 kernel: __ratelimit: 278 callbacks suppressed
Mar 17 09:26:21 srv-virt4 kernel: Neighbour table overflow.
Mar 17 09:26:21 srv-virt4 kernel: Neighbour table overflow.

Do I have to do this on every node? Are you sure it has no impact on cluster communication? I don't want to lose quorum again.
This is with kernel 2.6.32-27-pve on every node.
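For reference, the current values can be checked on each node by reading the same sysfs files:

Code:
# check the current bridge multicast settings on a node
cat /sys/class/net/vmbr0/bridge/multicast_querier
cat /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping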
 
We have around 400 hosts on the local network, but we are on a private 10.x.y.z network, divided into several VLANs.
I have already increased the ARP table size in /etc/sysctl.conf:
Code:
# Force gc to clean-up quickly
net.ipv4.neigh.default.gc_interval = 3600

# Set ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

# We increase the thresholds for ARP tables
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

# idem for IPv6
# Force gc to clean-up quickly
net.ipv6.neigh.default.gc_interval = 3600

# Set ARP cache entry timeout
net.ipv6.neigh.default.gc_stale_time = 3600

net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh3 = 4096
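To load these values without rebooting and to see how full the neighbour table actually gets, something like this can be used (standard sysctl/iproute2 commands, shown as a sketch):

Code:
# apply the new values from /etc/sysctl.conf
sysctl -p /etc/sysctl.conf

# count the current IPv4 neighbour (ARP) entries and compare with gc_thresh3
ip -4 neigh show | wc -l
sysctl net.ipv4.neigh.default.gc_thresh3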
 
