Cluster quorum lost after starting VM

dib · New Member · Jul 26, 2017

Hello all,

We had a 3-node production cluster (v4.4) running with basically no issues for more than half a year.
Without getting into too much detail (I'll attach pvereport output, anyway), here's some relevant information about the setup:

- proxmox 4.4
- configured with HA, software watchdog
- ceph-based storage; there's a dedicated network for ceph
- a couple dozen VMs running, with relatively low resource consumption
- each node has 4 NICs. They are all configured the same. Here's the interfaces file from one of the nodes:

Code:
auto lo
iface lo inet loopback
iface eth0 inet manual
iface rename3 inet manual
iface eth2 inet manual
iface eth3 inet manual

auto bond0
iface bond0 inet manual
        slaves eth0 rename3
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes

auto vmbr0.33:256
iface vmbr0.33:256 inet static
        address  192.168.2.131
        netmask  255.255.252.0
        gateway  192.168.1.1

auto bond1
iface bond1 inet static
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3
        address  10.0.10.1
        netmask  255.255.255.0

- all nodes connected to a Cisco SG500-28 switch

Recently we decided to add a 4th node to the cluster, which we did: installed Proxmox 4.4 (albeit with slightly newer packages) and added the node to the cluster. Everything looked alright (cluster had quorum, 4th node visible in the pvecm nodes output). We continued with configuring Ceph on the 4th node (monitor and OSDs) and waited until the Ceph rebalancing was done. At this point we concluded that the setup was stable.
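The join and Ceph setup were done with the usual CLI tools; the lines below are only a sketch from memory, with the IP and the disk device as placeholders rather than our exact values:
Code:
# on the new node (donalgin), after installing Proxmox 4.4:
pvecm add 192.168.2.131      # IP of an existing cluster node (placeholder)
pvecm status                 # check quorum
pvecm nodes                  # all 4 nodes listed

# then Ceph on the new node (monitor + OSDs):
pveceph install
pveceph createmon
pveceph createosd /dev/sdb   # repeated per OSD disk; device name is a placeholder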
Then we found out how wrong this conclusion was when we started a VM on the newly added node (no VM had been running on it up until that moment). After the VM started, corosync quorum was lost and the nodes started (presumably) fencing each other, resulting in all 3 initial nodes rebooting. I'm not sure why the 4th node did not reboot at this point. Everything stopped the moment we managed to log in to the 4th node and stop the VM. At that point quorum was re-established and everything started working again (connectivity, VMs started, ceph storage available and recovering, etc.).
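For the record, once the VM was stopped and the cluster had settled, checking the state and pulling the log excerpts below went roughly like this (a sketch, not the exact command history):
Code:
pvecm status                       # quorum, expected/total votes, membership
corosync-quorumtool -s             # same information from corosync's point of view

# relevant services around the time of the incident
journalctl -u corosync -u pve-cluster -u watchdog-mux --since "2017-07-16 17:00"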

And here's what we found in the logs of the 1st node (arthrex), basically just the watchdog expiring followed by a reboot:
Code:
Jul 16 17:14:34 arthrex pvedaemon[9804]: command '/bin/nc6 -l -p 5900 -w 10 -e '/usr/bin/ssh -T -o BatchMode=yes 192.168.2.134 /usr/sbin/qm vncproxy 113 2>/dev/null'' failed: exit code 1
Jul 16 17:14:34 arthrex pvedaemon[13655]: <root@pam> end task UPID:arthrex:0000264C:02B39AAA:596B74A4:vncproxy:113:root@pam: command '/bin/nc6 -l -p 5900 -w 10 -e '/usr/bin/ssh -T -o BatchMode=yes 192.168.2.134 /usr/sbin/qm vncproxy 113 2>/dev/null'' failed: exit code 1
Jul 16 17:14:42 arthrex snmpd[4366]: error on subcontainer 'ia_addr' insert (-1)
Jul 16 17:14:58 arthrex watchdog-mux[3815]: client watchdog expired - disable watchdog updates
Jul 16 17:15:01 arthrex CRON[10446]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 16 17:18:37 arthrex systemd-modules-load[2529]: Module 'fuse' is builtin
Jul 16 17:18:37 arthrex systemd-modules-load[2529]: Inserted module 'vhost_net'
Jul 16 17:18:37 arthrex systemd[1]: Starting Create Static Device Nodes in /dev...
Jul 16 17:18:37 arthrex systemd[1]: Mounted FUSE Control File System.
Jul 16 17:18:37 arthrex systemd[1]: Started Apply Kernel Variables.
Jul 16 17:18:37 arthrex systemd[1]: Started Create Static Device Nodes in /dev.
Jul 16 17:18:37 arthrex kernel: [    0.000000] Initializing cgroup subsys cpuset
Jul 16 17:18:37 arthrex kernel: [    0.000000] Initializing cgroup subsys cpu

The logs from this point on just show services starting and quorum being re-established.

I'm also attaching logs from the 4th node (filename donalgin.syslog.txt) and pvereport output from the 1st (arthrex) and the 4th (donalgin) nodes.

What we have done so far is double-check the network configuration on all 4 nodes and also on the switch. My bet is that the switch config is correct, since it has worked just fine for quite a while and, except for the changes required by the additional cluster node (adding 4 ports to the appropriate LAGs), nothing was touched.
So I would point the finger at the 4th node, but I'm out of ideas as to where to look exactly and what to do.
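One more check we have in mind, assuming corosync is still on its default multicast transport: verifying multicast between all four nodes with omping, run on every node at the same time. We haven't done this yet, and the last two hostnames below are placeholders, since only arthrex and donalgin are named in the attachments:
Code:
# run simultaneously on all nodes; takes ~10 minutes and also exercises
# IGMP snooping on the switch
omping -c 600 -i 1 -q arthrex donalgin node3 node4
As far as I understand, packet loss that only starts after a few minutes would point at IGMP snooping on the switch without an active querier.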

Does anyone have any suggestions?
BTW, if there's some relevant piece of information missing, I'll be more than happy to provide it.

Many thanks.
 

dib said:
- configured with HA, software watchdog
- ceph-based storage; there's a dedicated network for ceph

Clustering should also be on a separate, dedicated physical network.

dib said:
What we have done so far is double-check the network configuration on all 4 nodes and also on the switch. My bet is that the switch config is correct, since it has worked just fine for quite a while and, except for the changes required by the additional cluster node (adding 4 ports to the appropriate LAGs), nothing was touched.
So I would point the finger at the 4th node, but I'm out of ideas as to where to look exactly and what to do.

In order to narrow down the cause, I suggest (temporarily):

- remove the firewall

- change the bonded connections to simple ones (a rough sketch follows below)
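As an illustration of the second point (not your exact config): vmbr0 attached to a single port instead of bond0 could look like this, with the vmbr0.33 stanza carrying the address left as it is. Which physical port you use depends on your cabling, and the LAG on the switch side has to be adjusted accordingly.
Code:
auto vmbr0
iface vmbr0 inet manual
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes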
 
Thanks for the reply, Richard.

I'm gonna try disabling the firewall for starters, as it is the simplest step and the symptoms look very much like the firewall blocking traffic.
I'll come back with results after I manage to schedule a maintenance window, as this is a production env.
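In case it's useful to anyone following the thread, the firewall test would roughly be the following; as far as I know these are the standard command and config path, but corrections are welcome:
Code:
# per node: flush the PVE firewall rules and verify it is inactive
pve-firewall stop
pve-firewall status

# or disable it cluster-wide in /etc/pve/firewall/cluster.fw, [OPTIONS] section:
# enable: 0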

BR,