issues getting corosync to work for the entire cluster

mocanub

Active Member
Dec 12, 2018
Hello everyone,

I currently manage a 19-node Proxmox cluster, and I believe one of the nodes failed over the weekend during the weekly VM backup job. Since that event the corosync service has lost sync at the cluster level, the nodes can't talk to each other (no quorum), and they have been working individually ever since. I also believe the cluster is now generating a lot of traffic on the network (see the screenshot below), and because of that I'm seeing performance issues across my entire infrastructure:

[screenshot: 1693795863988.png]

From the cluster owner (pve-node-08) I can reach all the other member nodes, but the ICMP response times vary a lot (not always low latency):

Code:
root@pve-node-08:/home/bogdan# ping 10.100.100.16
PING 10.100.100.16 (10.100.100.16) 56(84) bytes of data.
64 bytes from 10.100.100.16: icmp_seq=1 ttl=64 time=0.075 ms
64 bytes from 10.100.100.16: icmp_seq=2 ttl=64 time=0.048 ms
64 bytes from 10.100.100.16: icmp_seq=3 ttl=64 time=20.2 ms
64 bytes from 10.100.100.16: icmp_seq=4 ttl=64 time=0.146 ms
64 bytes from 10.100.100.16: icmp_seq=5 ttl=64 time=59.2 ms
64 bytes from 10.100.100.16: icmp_seq=6 ttl=64 time=42.2 ms
64 bytes from 10.100.100.16: icmp_seq=7 ttl=64 time=0.085 ms
64 bytes from 10.100.100.16: icmp_seq=8 ttl=64 time=8.15 ms
64 bytes from 10.100.100.16: icmp_seq=9 ttl=64 time=42.6 ms
64 bytes from 10.100.100.16: icmp_seq=10 ttl=64 time=0.051 ms
64 bytes from 10.100.100.16: icmp_seq=11 ttl=64 time=74.2 ms
^C
--- 10.100.100.16 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10056ms
rtt min/avg/max/mdev = 0.048/22.453/74.166/26.171 ms

Until recently the storage traffic was not separated from the corosync traffic, and I assumed these issues were caused by that. We have since switched from 1 GbE to 10 GbE NICs, and corosync now runs on its own dedicated NIC, but I still run into the same corosync issues whenever the cluster loses quorum.
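
In case it is relevant, this is the kind of check I can run on the node to see how corosync/knet views its links over the dedicated NIC (corosync-cfgtool ships with corosync):

Code:
# show the local node's corosync/knet link status
corosync-cfgtool -s
# confirm which local address/port the corosync process is actually bound to
ss -uapn | grep corosync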

At this point I'm thinking I'm doing something wrong. From what I've described above, does my configuration approach look okay?

Also, until now the only way I could restore the cluster state was to bring down the entire cluster, start the cluster owner first, and then bring up the remaining nodes one by one shortly after. If possible I would like to avoid that in the future. Is there a better approach to restoring corosync quorum at the cluster level without restarting the entire infrastructure?
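
The only alternative I'm aware of (but have not tried on this cluster) would be to temporarily force quorum on the node that holds the current config, roughly like this, and then bring the other nodes back one by one:

Code:
# sketch only, not something I have run here:
# temporarily lower the expected votes so this single node becomes quorate again
pvecm expected 1
# verify quorum / membership afterwards
pvecm status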

What I tried was to SSH into each node and restart the pve-cluster and corosync services, but that didn't fix it.
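
For reference, roughly what I ran on each node was:

Code:
# restart the cluster filesystem and corosync on the node
systemctl restart pve-cluster
systemctl restart corosync
# then check membership/quorum again
pvecm status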

Any advice to sort this one out would be highly appreciated.

Here is some info about my environment:

pveversion -v output:
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

pvecm status (from cluster owner):
Code:
Cluster information
-------------------
Name:             pmx-cluster-is
Config Version:   36
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Sep  4 06:06:31 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.b5f
Quorate:          No

Votequorum information
----------------------
Expected votes:   18
Highest expected: 18
Total votes:      1
Quorum:           10 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.200.18 (local)

output for systemctl status corosync:
Code:
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Mon 2023-09-04 05:38:51 EEST; 32min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 3659 (corosync)
      Tasks: 9 (limit: 153744)
     Memory: 202.6M
        CPU: 1h 20min 4.028s
     CGroup: /system.slice/corosync.service
             └─3659 /usr/sbin/corosync -f

Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:50 pve-node-08 corosync[3659]:   [KNET  ] link: host: 11 link: 0 is down
Sep 04 06:10:50 pve-node-08 corosync[3659]:   [KNET  ] host: host: 11 (passive) best link: 0 (pri: 1)
Sep 04 06:10:50 pve-node-08 corosync[3659]:   [KNET  ] host: host: 11 has no active links

contents of /etc/pve/corosync.conf (on cluster owner):
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-node-02
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.100.200.12
  }
  node {
    name: pve-node-03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.200.13
  }
  node {
    name: pve-node-04
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.200.14
  }
  node {
    name: pve-node-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.100.200.15
  }
  node {
    name: pve-node-06
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.100.200.16
  }
  node {
    name: pve-node-07
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.100.200.17
  }
  node {
    name: pve-node-08
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.200.18
  }
  node {
    name: pve-node-10
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.100.200.20
  }
  node {
    name: pve-node-11
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.100.200.21
  }
  node {
    name: pve-node-12
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.100.200.22
  }
  node {
    name: pve-node-13
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.100.200.23
  }
  node {
    name: pve-node-14
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.100.200.24
  }
  node {
    name: pve-node-15
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.100.200.25
  }
  node {
    name: pve-node-16
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.100.200.26
  }
  node {
    name: pve-node-17
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.100.200.27
  }
  node {
    name: pve-node-18
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.100.200.28
  }
  node {
    name: pve-node-19
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.100.200.29
  }
  node {
    name: pve-node-20
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.100.200.30
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmx-cluster-is
  config_version: 36
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Thanks in advance,
Bogdan M.
 
Seems your network is down:

> Sep 04 06:10:50 pve-node-08 corosync[3659]: [KNET ] link: host: 11 link: 0 is down

Maybe there are more hints in syslog?
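
Something along these lines (adjust the date range to the weekend when the node failed) usually gives a good picture:

Code:
# corosync and pmxcfs logs around the time of the failed backup job
journalctl -u corosync -u pve-cluster --since "2023-09-02" --until "2023-09-04"
# or just everything from the current boot
journalctl -b -u corosync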

I can also imagine a problem with your network switch. Does it help if you restart the switch?
 
> From the cluster owner (pve-node-08) I can reach all the other member nodes, but the ICMP response times vary a lot (not always low latency):
Yeah, those spikes look rather bad, especially the huge standard deviation: you go from perfect network conditions (<1 ms) to something well beyond the rough maximum corosync can handle (~10 ms).

> We have since switched from 1 GbE to 10 GbE NICs, and corosync now runs on its own dedicated NIC, but I still run into the same corosync issues whenever the cluster loses quorum.
Are they still running through the same switch? Maybe that one is saturated?
I'd also check cables and everything else in the network path. Those latency spikes are definitely not normal, and when the latency does spike it is far too high for corosync cluster traffic.
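
To narrow it down, something like this on the corosync network could help (10.100.200.16 is taken from your nodelist; eno1 is just a placeholder for the NIC that carries the corosync link):

Code:
# sample latency on the corosync subnet for about a minute and watch mdev
ping -c 300 -i 0.2 10.100.200.16
# check for RX/TX errors or drops on the corosync NIC
ip -s link show eno1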
 
