Hello everyone,
I currently manage a 19-node Proxmox cluster. Over the weekend, I believe one of the nodes failed during the weekly VM backup job. Since then, corosync has lost sync at the cluster level, the nodes can no longer talk to each other (no quorum), and they have been operating individually. I also believe the cluster is now generating a lot of traffic on the network (see the screenshot below), and because of that I'm seeing performance issues across my entire infrastructure:
From the cluster owner (pve-node-08) I can reach all the other member nodes, but the ICMP response times vary quite a bit (not consistently low latency):
Code:
root@pve-node-08:/home/bogdan# ping 10.100.100.16
PING 10.100.100.16 (10.100.100.16) 56(84) bytes of data.
64 bytes from 10.100.100.16: icmp_seq=1 ttl=64 time=0.075 ms
64 bytes from 10.100.100.16: icmp_seq=2 ttl=64 time=0.048 ms
64 bytes from 10.100.100.16: icmp_seq=3 ttl=64 time=20.2 ms
64 bytes from 10.100.100.16: icmp_seq=4 ttl=64 time=0.146 ms
64 bytes from 10.100.100.16: icmp_seq=5 ttl=64 time=59.2 ms
64 bytes from 10.100.100.16: icmp_seq=6 ttl=64 time=42.2 ms
64 bytes from 10.100.100.16: icmp_seq=7 ttl=64 time=0.085 ms
64 bytes from 10.100.100.16: icmp_seq=8 ttl=64 time=8.15 ms
64 bytes from 10.100.100.16: icmp_seq=9 ttl=64 time=42.6 ms
64 bytes from 10.100.100.16: icmp_seq=10 ttl=64 time=0.051 ms
64 bytes from 10.100.100.16: icmp_seq=11 ttl=64 time=74.2 ms
^C
--- 10.100.100.16 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10056ms
rtt min/avg/max/mdev = 0.048/22.453/74.166/26.171 ms
Until recently I did not have storage traffic separated from the corosync network traffic, and I assumed that was the cause of these issues. We have since upgraded from 1 GbE to 10 GbE NICs and corosync now runs on its own dedicated NIC, but I still run into the same corosync problems whenever the cluster loses quorum.
At this point I suspect I'm not doing something right. Based on what I've described above, does my configuration approach look okay?
Also, until now the only way I've managed to restore the cluster state has been to bring down the entire cluster, start the cluster owner first, and shortly afterwards bring the remaining nodes back up one by one (rough sequence below). I'd like to avoid that in the future if possible. Is there a better approach for restoring corosync quorum at the cluster level without restarting the entire infrastructure?
What I have tried so far is SSHing into each node and restarting the pve-cluster and corosync services (roughly the commands below), but that didn't resolve it.
Any advice on how to sort this out would be highly appreciated.
Here is some info about my environment:
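For context, the dedicated corosync link on pve-node-08 is set up roughly like this (the NIC name here is illustrative, not the real one; the address matches ring0_addr in the corosync.conf further down):
Code:
# /etc/network/interfaces (excerpt) - dedicated 10GbE corosync link
auto ens2f0
iface ens2f0 inet static
        address 10.100.200.18/24
        # corosync-only network, no storage or VM traffic on this NIC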
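For reference, this is roughly the recovery sequence I've been following (simplified and from memory; the quorum check is just how I keep an eye on things while nodes come back):
Code:
# 1. Shut down all guests, then power off every node in the cluster
# 2. Power the cluster owner (pve-node-08) back on and wait until it is fully up
# 3. Power the remaining nodes back on one at a time; after each one I check:
root@pve-node-08:~# pvecm status | grep -E 'Quorate|Total votes'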
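This is roughly what I ran on each node (pve-node-XX is just a placeholder for the individual node names):
Code:
root@pve-node-XX:~# systemctl restart pve-cluster
root@pve-node-XX:~# systemctl restart corosync
root@pve-node-XX:~# systemctl status pve-cluster corosync --no-pager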
pveversion -v output:
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
pvecm status (from cluster owner):
Code:
Cluster information
-------------------
Name: pmx-cluster-is
Config Version: 36
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Mon Sep 4 06:06:31 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.b5f
Quorate: No
Votequorum information
----------------------
Expected votes: 18
Highest expected: 18
Total votes: 1
Quorum: 10 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.100.200.18 (local)
output for systemctl status corosync:
Code:
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Mon 2023-09-04 05:38:51 EEST; 32min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3659 (corosync)
Tasks: 9 (limit: 153744)
Memory: 202.6M
CPU: 1h 20min 4.028s
CGroup: /system.slice/corosync.service
└─3659 /usr/sbin/corosync -f
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:49 pve-node-08 corosync[3659]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Sep 04 06:10:50 pve-node-08 corosync[3659]: [KNET ] link: host: 11 link: 0 is down
Sep 04 06:10:50 pve-node-08 corosync[3659]: [KNET ] host: host: 11 (passive) best link: 0 (pri: 1)
Sep 04 06:10:50 pve-node-08 corosync[3659]: [KNET ] host: host: 11 has no active links
contents of /etc/pve/corosync.conf (on cluster owner):
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-node-02
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.100.200.12
  }
  node {
    name: pve-node-03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.200.13
  }
  node {
    name: pve-node-04
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.200.14
  }
  node {
    name: pve-node-05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.100.200.15
  }
  node {
    name: pve-node-06
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.100.200.16
  }
  node {
    name: pve-node-07
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.100.200.17
  }
  node {
    name: pve-node-08
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.200.18
  }
  node {
    name: pve-node-10
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.100.200.20
  }
  node {
    name: pve-node-11
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.100.200.21
  }
  node {
    name: pve-node-12
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.100.200.22
  }
  node {
    name: pve-node-13
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.100.200.23
  }
  node {
    name: pve-node-14
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.100.200.24
  }
  node {
    name: pve-node-15
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.100.200.25
  }
  node {
    name: pve-node-16
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.100.200.26
  }
  node {
    name: pve-node-17
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.100.200.27
  }
  node {
    name: pve-node-18
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.100.200.28
  }
  node {
    name: pve-node-19
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.100.200.29
  }
  node {
    name: pve-node-20
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.100.200.30
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmx-cluster-is
  config_version: 36
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
Thanks in advance,
Bogdan M.