Update cluster to 3.2 and kernel 3.10 : quorum lost

alain

Active Member
May 17, 2009
France/Paris
Hi all,

I recently got a new server, a Dell PE R620, and decided to install it with PVE 3.2. As I only use KVM, I also installed the new 3.10 kernel from the Enterprise repository. I had a small problem with this kernel: the server did not reboot properly the first time. It turned out it was trying to mount the LVM volumes before the RAID controller driver was loaded. This was resolved by adding the option 'scsi_mod.scan=sync' to the 'GRUB_CMDLINE_LINUX_DEFAULT' line in /etc/default/grub, as stated in another thread.
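For reference, the change looks roughly like this in /etc/default/grub (a sketch; the "quiet" option shown is just an example of options you may already have there):

```shell
# /etc/default/grub -- make the SCSI subsystem scan synchronously,
# so LVM volumes are not activated before the RAID controller is ready.
# Keep your existing options and only append scsi_mod.scan=sync.
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync"
```

Run `update-grub` afterwards so the change is written to /boot/grub/grub.cfg, then reboot.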

After that, I joined the new server to my PVE cluster, and everything seemed fine.

So I also updated the other 3 nodes in the cluster and installed the 3.10 kernel on them as well. All went fine until I rebooted the last node. Then one node appeared in red in the web management interface. I checked the cluster status and found that quorum was lost, with a line stating 'Quorum: 3 Activity blocked'. Shortly after, no node could see the others any more, and the cluster had failed.
'pvecm nodes' showed only one node as alive (the node where I was logged in).

I tried rebooting some nodes. After that I recovered quorum for a short time, but it was lost again shortly after.

I then reinstalled the 2.6.32 kernel on the last node I had installed (in fact I removed the 3.10 kernel and ran update-grub). After rebooting, the entire cluster and the quorum were instantly recovered.

So for now I will stick with the 2.6.32 kernel and reinstall it on each node (I already did it on one node). Are there known problems with the 3.10 kernel?

My pveversion is this (on the new server, now running the 2.6.32 kernel):
# pveversion -v
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
 

alain
Just a quick note. The release notes say the 3.10 kernel is for "testing only". I thought that, even though it has no OpenVZ patches, it was good enough to be in the enterprise repository. From my experience, that is not the case.

It seems that at least one node with a 2.6.32 kernel is required for the cluster to work. I previously had a lot of ARP (neighbour) table overflows, which I did not have with the 3.10 kernel.
So, could this problem be related to multicast?

So, I have tested it, and the 3.10 kernel is not reliable for now (perhaps due to the integration of a RHEL kernel into a Debian distribution, RHEL 7 now using systemd, as stated in a previous thread by mir?)
 

dietmar

Proxmox Staff Member
Apr 28, 2005
Austria
www.proxmox.com
Does it help if you enable multicast querier?

# echo 1 >/sys/class/net/vmbr0/bridge/multicast_querier
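A note on persistence: the echo above only lasts until the next reboot. One way to re-apply it at boot is a post-up line on the bridge in /etc/network/interfaces (a sketch; the address, netmask, and bridge port below are placeholders for your own configuration):

```shell
# /etc/network/interfaces (fragment, sketch) -- re-apply the querier
# setting each time vmbr0 comes up. Assumes vmbr0 is the cluster-facing
# bridge; address/netmask/bridge_ports are example values.
auto vmbr0
iface vmbr0 inet static
        address 10.0.0.1
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        post-up echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier
```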
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
You can also try disabling multicast snooping, if multicast_querier doesn't solve the problem:

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
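To check whether multicast actually works between the nodes (independently of corosync), the omping tool can be used. A sketch, with node1 through node4 standing in for your cluster hostnames; start it on all nodes at roughly the same time:

```shell
# Run simultaneously on every node; each should report close to 0%
# multicast packet loss for the other nodes. node1..node4 are
# placeholders for the real hostnames.
omping -c 600 -i 1 -q node1 node2 node3 node4
```

If multicast loss is high while unicast loss is low, the problem is likely in the switch or bridge (snooping/querier) configuration rather than in corosync itself.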
 

alain
Hi Dietmar, spirit,

I tried it on the last node I installed, which still has no VMs. The first:
# echo 1 >/sys/class/net/vmbr0/bridge/multicast_querier

did not seem to change the situation.
The second:
# echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

seems to have reduced the number of messages, but not completely; I still have in syslog:
Code:
Mar 17 09:26:21 srv-virt4 kernel: __ratelimit: 278 callbacks suppressed
Mar 17 09:26:21 srv-virt4 kernel: Neighbour table overflow.
Mar 17 09:26:21 srv-virt4 kernel: Neighbour table overflow.

Do I have to do this on every node? Are you sure it has no impact on cluster communication? I don't want to lose quorum again.
This is with kernel 2.6.32-27-pve on every node.
 

alain
We have around 400 hosts on the local network, but we are on a private 10.x.y.z network, divided into several VLANs.
I already increased the ARP table size in /etc/sysctl.conf:
Code:
# Force gc to clean-up quickly
net.ipv4.neigh.default.gc_interval = 3600

# Set ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

# We increase the thresholds for ARP tables
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

# idem for IPv6
# Force gc to clean-up quickly
net.ipv6.neigh.default.gc_interval = 3600

# Set ARP cache entry timeout
net.ipv6.neigh.default.gc_stale_time = 3600

net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh3 = 4096
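To see whether these limits are actually sufficient, the settings can be reloaded and the current neighbour table size compared against gc_thresh3 (the hard limit at which "Neighbour table overflow" messages appear). A sketch of the commands involved:

```shell
# Reload /etc/sysctl.conf without rebooting:
sysctl -p

# Count current IPv4 neighbour entries; if this approaches gc_thresh3
# (4096 above), the overflow messages will return:
ip -4 neigh show | wc -l
```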
 
