Cluster stopped working

sandman

Member · Oct 10, 2017
We have a cluster with 3 servers on Proxmox 6: 2 Dell servers and a small whitebox PC for quorum and backups. On Saturday the cluster stopped working for no apparent reason until we rebooted both Dell servers and set "pvecm expected 1" on one of them. I tried to dig through the logs but really couldn't find the reason. Quorum sometimes seemed to form between 2 or even all 3 of the servers, but would constantly break again. We have had no problems before or since, but would like to prevent this from happening again.
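
For anyone hitting the same symptoms, these are the standard commands for checking quorum state from a node (the "expected 1" override should only be a last resort on an isolated node):

Code:
# Show cluster membership and vote/quorum information
pvecm status

# Show corosync link status for each configured ring/link
corosync-cfgtool -s

# Last resort on a node that has lost quorum: lower the expected
# votes so the local node becomes quorate again (use with care,
# this disables the split-brain protection quorum provides)
pvecm expected 1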

Also, HA is enabled for a single VM, and replication is set up for 5 or 6 VMs between the Dell servers.

Here is my corosync conf:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.200.220
    ring1_addr: 10.1.200.220
  }
  node {
    name: pve01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.200.221
    ring1_addr: 10.1.200.221
  }
  node {
    name: pve02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.200.222
    ring1_addr: 10.1.200.222
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmox-cluster
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

And I have attached the logs from the 3 servers. The problem started at 1 AM and was fixed around 11 AM when someone rebooted both Dell servers and ran the "pvecm expected 1" command.

EDIT: Had to remove a few thousand lines from the pve00 logs since the file was too large, so I cut a chunk from the middle.
 

Attachments

  • syslog.pve02.tar.gz (565.8 KB)
  • syslog.pve01.tar.gz (340.5 KB)
  • syslog.pve00.tar.gz (843.8 KB)
What is your network configuration, on the servers as well as on the switches?
Are the two corosync links connected to different switches?
Do you have other services using the same networks you use for corosync?

A quick look in the logs showed that the servers fenced themselves. That happens when a node loses connection to the cluster while HA is enabled: after two minutes it triggers a reset. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
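
To confirm this in your own logs: the HA services log to the journal, so something along these lines (run on each node) should show the fencing events and the watchdog activity right before the reboot gap:

Code:
# Fencing decisions come from the HA CRM/LRM services
journalctl -u pve-ha-crm -u pve-ha-lrm | grep -i fence

# The actual reset is performed by the watchdog; its messages
# usually appear right before the gap in the log
journalctl -u watchdog-mux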
 
Two bonded interfaces on each server, all on the same switch. So all networks are over that one bond, just using a different VLAN for the second corosync link.

There are some other services running over the ring0 network, but nothing that generates much traffic; only storage replication generates significant traffic.

Network configuration, it's the same on all 3 servers:

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface enp4s0f0 inet manual

iface enp4s0f1 inet manual

iface enp4s0f2 inet manual

iface enp4s0f3 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves eno1 enp4s0f3
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2+3
#BOND LACP

auto vmbr200
iface vmbr200 inet static
    address  192.168.200.220
    netmask  255.255.255.0
    gateway  192.168.200.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
#MAIN

auto vmbr1200
iface vmbr1200 inet static
    address  10.1.200.220
    netmask  255.255.255.248
    bridge-ports bond0.1200
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
#COROSYNC
 
We will use a different NIC for the second corosync link, and will use that network for migrations too. Just trying to make sure that will mitigate this kind of problem.
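
For reference, moving the second link to a dedicated NIC boils down to editing /etc/pve/corosync.conf, pointing ring1_addr of each node at the new network, and bumping config_version so all nodes pick up the change. A sketch (the 10.2.200.x addresses are placeholders for the new dedicated network, not our actual config):

Code:
nodelist {
  node {
    name: pve00
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.200.220
    ring1_addr: 10.2.200.220    # placeholder: address on the dedicated NIC
  }
  # ... same ring1_addr change for pve01 and pve02 ...
}

totem {
  # ... rest unchanged ...
  config_version: 4    # must be incremented, or nodes ignore the edit
}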
 
