corosync split, some nodes not rejoining

danstephans

Apr 21, 2020
First time posting; I'm wrapped around the axle on this one.

I have a 15-node PVE cluster with Ceph. It has been running peachy since November. Today I went to add another node from the command line, and the join hung waiting for quorum. Eventually I had to kill the join. At this point all 15 original nodes were still members of the cluster and visible in the ring (pvecm status looked good). Note that all nodes have the same corosync.conf in /etc and /etc/pve/corosync.conf and are running the same version.

I decided to restart corosync on the primary node (pve01) to see if that would break the deadlock. Instead, pve01 became isolated. Eventually every node was a singleton. After a significant amount of surgery, I was able to get 10 nodes back in the cluster. I still have 5 that do not join. Two of them will join a cluster together but not the main cluster. Here is my corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cassatpve01
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.4.11.35
  }
  node {
    name: mathcspve01
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.4.11.142
  }
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.4.11.130
  }
  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.4.11.131
  }
  node {
    name: pve03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.4.11.132
  }
  node {
    name: pve04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.4.11.133
  }
  node {
    name: pve05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.4.11.134
  }
  node {
    name: pve06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.4.11.135
  }
  node {
    name: pve07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.4.11.136
  }
  node {
    name: pve08
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.4.11.137
  }
  node {
    name: pve09
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.4.11.138
  }
  node {
    name: pve10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.4.11.139
  }
  node {
    name: pve11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.4.11.129
  }
  node {
    name: pve12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.4.11.6
  }
  node {
    name: pve13
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.4.11.7
  }
  node {
    name: pve14
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.4.11.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Carleton
  config_version: 23
  interface {
    linknumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
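
The nodeids in this nodelist do not track the hostnames (pve13 is nodeid 14, pve14 is nodeid 15, mathcspve01 is nodeid 13), which matters when reading tools that report only nodeids. A minimal Python sketch for pulling the mapping out of a corosync.conf and checking for duplicate nodeids or addresses follows. The helper names parse_nodes and duplicates are made up for illustration, and the sample holds only three of the sixteen entries above.

```python
import re
from collections import Counter

# Abbreviated copy of the nodelist above (three of the sixteen entries).
SAMPLE = """
nodelist {
  node {
    name: cassatpve01
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.4.11.35
  }
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.4.11.130
  }
  node {
    name: pve13
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.4.11.7
  }
}
"""


def parse_nodes(conf_text):
    """Return one dict per 'node { ... }' block, with string values."""
    nodes = []
    # \bnode\s*\{ does not match 'nodelist {' because 'list' follows 'node'.
    for body in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        entry = {}
        for line in body.splitlines():
            if ":" not in line:
                continue  # skip blank lines
            key, _, value = line.partition(":")
            entry[key.strip()] = value.strip()
        nodes.append(entry)
    return nodes


def duplicates(nodes, key):
    """Return the values of 'key' that appear in more than one node entry."""
    counts = Counter(node[key] for node in nodes)
    return sorted(value for value, count in counts.items() if count > 1)


def total_votes(nodes):
    """Sum quorum_votes over all node entries."""
    return sum(int(node["quorum_votes"]) for node in nodes)


nodes = parse_nodes(SAMPLE)
print({node["nodeid"]: node["name"] for node in nodes})
print("duplicate nodeids:", duplicates(nodes, "nodeid"))
print("duplicate addresses:", duplicates(nodes, "ring0_addr"))
print("total votes:", total_votes(nodes))
```

To run it against a real file, read the file's text and pass it to parse_nodes instead of SAMPLE.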

The cluster nodes keep reconfiguring and fencing for no reason. As I was typing this message, pve01 dropped out of the cluster (but it still thinks it is in it).

Each node is connected by 10G fiber to a dedicated switch. Some nodes have a bond interface with two 10G NICs bonded to the same switch. This all worked perfectly until the failed node add, and now things won't stay in sync. Any help is appreciated.
 
I've now got 10 nodes online and stable. If I bring corosync up on any of the remaining 6, it starts splitting off other nodes. Five of these nodes have Myricom 10G network cards in them; they are the only nodes with them (but they had been working fine in the cluster since November). pve06 has an Intel 10G card, and it is the only node with an Intel ixgbe in it.

Nodes not able to reliably rejoin the cluster at this point: pve04, pve05, pve06, pve07, pve08 and pve14.

If I start corosync on pve06 it knocks mathcspve01 out of the cluster. Every single time. When I stop corosync on pve06, mathcspve01 rejoins the main cluster. Similar splitting happens when I bring corosync up on any of the other nodes.
 
What is the network setup? Does corosync have its own physical NIC or is it shared with other services (which one)?
 
Each host has two 1G network ports connected to physical switches that are "old school" segmented. One is SERVER, one is BASTION. The cluster IP lives in SERVER (for management via web and joining). There is a top-of-rack 10G Cisco Nexus that is trunked to our core for the rest of our VLANs. The 10G interface on each host is bonded and there are multiple VLANs riding on it. Here is a sample config:

Code:
iface lo inet loopback

iface eno1 inet manual
iface eno2 inet manual
iface eno3 inet manual
iface eno4 inet manual

auto bond0
iface bond0 inet manual
        bond-mode 6
        bond-primary eno1
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-downdelay 400
        bond-updelay 800

auto vmbr0
iface vmbr0 inet static
        address 137.22.194.130
        netmask 255.255.254.0
        gateway 137.22.195.254
        bridge_ports eno3
        bridge_stp off
        bridge_fd 0
#Server Network

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#10G Trunk

auto vmbr1.2141
iface vmbr1.2141 inet static
        address  10.10.10.130
        netmask  16
#Storage VLAN

auto vmbr1.411
iface vmbr1.411 inet static
        address 10.4.11.130
        netmask 16
#Cluster VLAN

auto vmbr2
iface vmbr2 inet manual
        bridge-ports eno4
        bridge-stp off
        bridge-fd 0
#Bastion Network
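
Both VLAN interfaces in this sample are given as address plus "netmask 16". Assuming that is read as a /16 prefix length, the cluster VLAN covers 10.4.0.0/16 and the storage VLAN covers 10.10.0.0/16. A small Python sketch using the standard ipaddress module can confirm that the two networks do not overlap and that the ring0_addr values from corosync.conf fall inside the cluster VLAN. Only a few of the addresses are listed here, and the helper name outside is made up for illustration.

```python
import ipaddress

# Addresses from the sample config; /16 is an assumed reading of "netmask 16".
cluster_if = ipaddress.ip_interface("10.4.11.130/16")   # vmbr1.411
storage_if = ipaddress.ip_interface("10.10.10.130/16")  # vmbr1.2141

# A few ring0_addr values taken from the corosync.conf nodelist.
ring0_addrs = ["10.4.11.35", "10.4.11.142", "10.4.11.130", "10.4.11.6"]


def outside(network, addrs):
    """Return the addresses that are not inside the given network."""
    return [a for a in addrs if ipaddress.ip_address(a) not in network]


print("cluster network:", cluster_if.network)
print("storage network:", storage_if.network)
print("networks overlap:", cluster_if.network.overlaps(storage_if.network))
print("ring0 addresses outside cluster VLAN:", outside(cluster_if.network, ring0_addrs))
```

This only checks addressing. It says nothing about whether the two VLANs share a physical link, which they do in this setup since both ride on vmbr1.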

There is currently not much traffic on any of the 10G links, as we're in the vetting stage for this technology right now. Everything had been working fine for months, without any issue, until yesterday. I'm still playing whack-a-mole today.
 
My Ceph is also working perfectly fine through this. I suspect this is a corosync3 issue, but I cannot find a way to keep my nodes in sync anymore.
 
I've stopped corosync on the problematic nodes mentioned above. It is running on all the other nodes that aren't misbehaving. Those other nodes are still not rejoining the cluster even though they are connected:

Code:
root@pve01:~# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
        addr    = 10.4.11.130
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:1
                nodeid  3:      link enabled:1  link connected:1
                nodeid  4:      link enabled:1  link connected:0
                nodeid  5:      link enabled:1  link connected:0
                nodeid  6:      link enabled:1  link connected:0
                nodeid  7:      link enabled:1  link connected:0
                nodeid  8:      link enabled:1  link connected:0
                nodeid  9:      link enabled:1  link connected:1
                nodeid 10:      link enabled:1  link connected:1
                nodeid 11:      link enabled:1  link connected:0
                nodeid 12:      link enabled:1  link connected:1
                nodeid 13:      link enabled:1  link connected:1
                nodeid 14:      link enabled:1  link connected:1
                nodeid 15:      link enabled:1  link connected:0
                nodeid 16:      link enabled:1  link connected:1

root@pve01:~# pvecm status
Cluster information
-------------------
Name:             Carleton
Config Version:   23
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Apr 21 12:42:19 2020
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.4419
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.4.11.130 (local)
0x00000002          1 10.4.11.131
0x00000003          1 10.4.11.132
0x00000009          1 10.4.11.138
0x00000010          1 10.4.11.35
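
With sixteen entries it is easy to misread which links are down, and the output lists nodeids rather than hostnames. The following Python sketch extracts the nodeids whose link is enabled but not connected and maps them to names. It assumes the line format printed by this corosync version (3.0.3); other versions may print the status differently. The names table is copied from the nodelist in corosync.conf, and the function name disconnected is made up for illustration.

```python
import re

# Link status lines as printed on pve01 above.
CFGTOOL_OUTPUT = """\
LINK ID 0
        addr    = 10.4.11.130
        status:
                nodeid  1:      link enabled:1  link connected:1
                nodeid  2:      link enabled:1  link connected:1
                nodeid  3:      link enabled:1  link connected:1
                nodeid  4:      link enabled:1  link connected:0
                nodeid  5:      link enabled:1  link connected:0
                nodeid  6:      link enabled:1  link connected:0
                nodeid  7:      link enabled:1  link connected:0
                nodeid  8:      link enabled:1  link connected:0
                nodeid  9:      link enabled:1  link connected:1
                nodeid 10:      link enabled:1  link connected:1
                nodeid 11:      link enabled:1  link connected:0
                nodeid 12:      link enabled:1  link connected:1
                nodeid 13:      link enabled:1  link connected:1
                nodeid 14:      link enabled:1  link connected:1
                nodeid 15:      link enabled:1  link connected:0
                nodeid 16:      link enabled:1  link connected:1
"""

# nodeid -> name, copied from the nodelist in corosync.conf.
NAMES = {
    1: "pve01", 2: "pve02", 3: "pve03", 4: "pve04", 5: "pve05", 6: "pve06",
    7: "pve07", 8: "pve08", 9: "pve09", 10: "pve10", 11: "pve11", 12: "pve12",
    13: "mathcspve01", 14: "pve13", 15: "pve14", 16: "cassatpve01",
}

LINK_LINE = re.compile(
    r"nodeid\s+(\d+):\s+link enabled:(\d)\s+link connected:(\d)"
)


def disconnected(output):
    """Return nodeids whose link is enabled but reported as not connected."""
    return [
        int(m.group(1))
        for m in LINK_LINE.finditer(output)
        if m.group(2) == "1" and m.group(3) == "0"
    ]


down = disconnected(CFGTOOL_OUTPUT)
print(down)                      # [4, 5, 6, 7, 8, 11, 15]
print([NAMES[n] for n in down])  # pve04..pve08, pve11, pve14
```

Note that nodeid 15 is pve14, not pve15, so the disconnected set here is the six problem nodes plus pve11.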

Yes, I have run pvecm expected 1 to get some VMs to come online, which is why quorum looks odd.
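
For reference, the Quorum value in the status output is consistent with a plain majority rule: more than half of the expected votes. A short Python sketch of that arithmetic follows, assuming one vote per node as in the nodelist and no special votequorum options such as two_node or last_man_standing. The function names are made up for illustration.

```python
def quorum_threshold(expected_votes):
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True if the partition holds at least a majority of the expected votes."""
    return total_votes >= quorum_threshold(expected_votes)


# Matches the status output above: 5 expected votes give a quorum of 3.
print(quorum_threshold(5))    # 3

# With all 16 nodes from the nodelist expected, 9 votes are needed.
print(quorum_threshold(16))   # 9
print(is_quorate(10, 16))     # True: a 10-node partition keeps quorum
print(is_quorate(5, 16))      # False: a 5-node partition would not
```

With expected votes at 16, the 5-node partition shown above would not be quorate, which is why lowering the expected votes was needed to start VMs there.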
 
Also, versions.

Code:
root@pve01:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
I've removed all the nodes with the Myricom interfaces from the cluster, although they are still part of Ceph. I also had a node in a data closet, connected over a 10G long-haul link, that I removed from the cluster, and now everything seems good.

What I don't like: a single node shouldn't be able to break quorum. I could bring up corosync on any of the nodes with Myricom NICs and the cluster would lose its mind, split-braining like crazy. Bring corosync down on one of those nodes and everything returns to normal. I'm going to continue to play with these to see if I can get to the bottom of it, but it's been a long 24 hours. corosync3 and I need a break from each other.
 
Thanks for the configs. From what I can gather, you run the corosync network in a VLAN on the same 10G NIC that also carries Ceph.

This is most likely the cause of your problem.
Corosync really wants low latency [0] and sharing the same physical NIC with Ceph will sooner or later lead to problems. If Ceph is saturating the network and thus congesting it, the latency for the corosync packets will go up and can lead to the behavior you experience. Saturating a 10G link with 15 Ceph nodes is not hard!

Ideal solution: move Corosync to its own physical link
Okayish solution: add a second Corosync link on a different physical link, so it can switch over if latency on the first link gets too high [1]
Meh solution: configure reliable QoS in your network so the Corosync packets are preferred over the Ceph ones

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
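
One way to see whether latency on the cluster VLAN degrades while Ceph is busy is to ping a peer's ring0 address during a period of heavy Ceph traffic and look at the max and mdev figures, not just the average. The Python sketch below wraps the Linux ping command and parses its summary line. It assumes the iputils summary format ("rtt min/avg/max/mdev = ... ms"), and the function names are made up for illustration. It sets no pass/fail threshold; compare the numbers against the requirement described in [0].

```python
import re
import subprocess

# iputils ping summary, e.g. "rtt min/avg/max/mdev = a/b/c/d ms"
RTT_LINE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_rtt(ping_output):
    """Extract min/avg/max/mdev (in ms) from ping's summary line."""
    match = RTT_LINE.search(ping_output)
    if match is None:
        raise ValueError("no rtt summary line found in ping output")
    rtt_min, rtt_avg, rtt_max, rtt_mdev = (float(x) for x in match.groups())
    return {"min": rtt_min, "avg": rtt_avg, "max": rtt_max, "mdev": rtt_mdev}


def measure(host, count=20):
    """Ping 'host' 'count' times and return the parsed round-trip statistics."""
    result = subprocess.run(
        ["ping", "-q", "-c", str(count), host],
        capture_output=True, text=True, check=True,
    )
    return parse_rtt(result.stdout)


# Example (run on a cluster node, against another node's ring0 address):
#   stats = measure("10.4.11.131")
#   print(stats["max"], stats["mdev"])
```

A ping test is only a rough indicator. Corosync traffic can still be delayed in ways that an ICMP round trip does not show.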
 
