Cluster stopped working

sandman

Member · Oct 10, 2017
We have a cluster with 3 servers on Proxmox 6: 2 Dell servers and a small whitebox PC for quorum and backups. On Saturday the cluster stopped working for no apparent reason until we rebooted both Dell servers and set "pvecm expected 1" on one of them. I tried to dig through the logs but really couldn't find the reason. Quorum sometimes seemed to form between 2 or even all 3 of the servers, but would constantly break again. We have had no problems before or since, but would like to prevent this from happening again.
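
For anyone hitting the same symptoms, these are the standard commands for checking quorum state from a node (the "expected 1" override should only be a last resort on an isolated node):

Code:
# Show cluster membership and vote/quorum information
pvecm status

# Show corosync link status for each configured ring/link
corosync-cfgtool -s

# Last resort on a node that has lost quorum: lower the expected
# votes so the local node becomes quorate again (use with care,
# this disables the split-brain protection quorum provides)
pvecm expected 1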

Also, HA is enabled for a single VM, and replication is set up for 5 or 6 VMs between the Dell servers.

Here is my corosync conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.200.220
    ring1_addr: 10.1.200.220
  }
  node {
    name: pve01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.200.221
    ring1_addr: 10.1.200.221
  }
  node {
    name: pve02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.200.222
    ring1_addr: 10.1.200.222
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmox-cluster
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

And I have attached the logs from the 3 servers. The problem started at 1 AM and was fixed around 11 AM when someone rebooted both Dell servers and ran the "pvecm expected 1" command.

EDIT: Had to remove a few thousand lines from the pve00 logs since the file was too large, so I cut a chunk from the middle.
 

Attachments

  • syslog.pve02.tar.gz (565.8 KB)
  • syslog.pve01.tar.gz (340.5 KB)
  • syslog.pve00.tar.gz (843.8 KB)
What is your network configuration, on the servers as well as on the switches?
Are the two corosync links connected to different switches?
Do you have other services using the same networks you use for corosync?

A quick look in the logs showed that the servers fenced themselves. That happens when a node loses connection to the cluster while HA is enabled: after two minutes it triggers a reset. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
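
To confirm this in your own logs: the HA services log to the journal, so something along these lines (run on each node) should show the fencing events and the watchdog activity right before the reboot gap:

Code:
# Fencing decisions come from the HA CRM/LRM services
journalctl -u pve-ha-crm -u pve-ha-lrm | grep -i fence

# The actual reset is performed by the watchdog; its messages
# usually appear right before the gap in the log
journalctl -u watchdog-mux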
 
Two bonded interfaces on each server, all on the same switch. So all networks are over that one bond, just using a different VLAN for the second corosync link.

There are some other services running over the ring0 network, but nothing that generates much traffic; only storage replication generates significant traffic.

Network configuration, it's the same on all 3 servers:

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface enp4s0f0 inet manual

iface enp4s0f1 inet manual

iface enp4s0f2 inet manual

iface enp4s0f3 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 enp4s0f3
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
#BOND LACP

auto vmbr200
iface vmbr200 inet static
    address  192.168.200.220
    netmask  255.255.255.0
    gateway  192.168.200.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
#MAIN

auto vmbr1200
iface vmbr1200 inet static
    address  10.1.200.220
    netmask  255.255.255.248
    bridge-ports bond0.1200
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
#COROSYNC
 
We will use a different NIC for the second corosync link, and will use that network for migrations too. Just trying to make sure that will mitigate this kind of problem.
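
For reference, moving the second link to a dedicated NIC boils down to editing /etc/pve/corosync.conf, pointing ring1_addr of each node at the new network, and bumping config_version so all nodes pick up the change. A sketch (the 10.2.200.x addresses are placeholders for the new dedicated network, not our actual config):

Code:
nodelist {
  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.200.220
    ring1_addr: 10.2.200.220    # placeholder: address on the dedicated NIC
  }
  # ... same ring1_addr change for pve01 and pve02 ...
}

totem {
  # ... rest unchanged ...
  config_version: 4    # must be incremented, or nodes ignore the edit
}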
 
