So, I have a 14-node cluster.
We had a switch failure, so we had to move all the frontend networking (public and PVE cluster) endpoints to a backup switch; after a couple of days we moved them back.
Now one node in the cluster misbehaves: I can't write to the /etc/pve folder.
I noticed that the authkey.pub file on that node is older than on the others, so, taking notes from other threads, I removed it and rebooted the machine (the comparison I did is sketched after the log below). Same behaviour:
1. it works for a couple of minutes, then corosync throws:
Jun 12 10:01:59 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 40
Jun 12 10:02:00 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 50
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] rx: host: 5 link: 0 is up
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 12 10:02:01 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 60
Jun 12 10:02:02 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 70
Jun 12 10:02:03 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 80
Jun 12 10:02:04 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 90
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 100
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retried 100 times
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] crit: cpg_send_message failed: 6
Jun 12 10:02:06 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 10
Jun 12 10:02:07 pve299 corosync[35762]: [TOTEM ] Token has not been received in 8132 ms
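For reference, the key comparison was something like this, run from a healthy node (BADNODE stands in for the misbehaving node; assumes root ssh between nodes):

# compare age and content of the shared auth key between this node and the bad one
stat -c '%y  %n' /etc/pve/authkey.pub
md5sum /etc/pve/authkey.pub
ssh root@BADNODE "stat -c '%y  %n' /etc/pve/authkey.pub; md5sum /etc/pve/authkey.pub"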
If I power down that node, everything works again. When I power it back up and it joins the cluster, the cluster becomes unstable: corosync errors, the UI is unresponsive ("invalid ticket" on that node), and the GUI throws me out.
I've fixed clocks/timezones, and changed interfaces, cables and SFP adapters.
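For the clock part, the check on each node was along these lines (assuming chrony, the default timekeeper on PVE 7):

timedatectl        # timezone and "System clock synchronized: yes/no"
chronyc tracking   # offset and stratum reported by the local chrony daemon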
With that host powered down, pvecm status shows:
Cluster information
-------------------
Name: xxxx
Config Version: 14
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Jun 13 18:16:05 2023
Quorum provider: corosync_votequorum
Nodes: 13
Node ID: 0x00000008
Ring ID: 1.104c0
Quorate: Yes
Votequorum information
----------------------
Expected votes: 14
Highest expected: 14
Total votes: 13
Quorum: 8
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.2.2.98
0x00000002 1 10.2.2.97
0x00000003 1 10.2.2.95
0x00000004 1 10.2.2.93
0x00000006 1 10.2.2.37
0x00000007 1 10.2.2.38
0x00000008 1 10.2.2.99 (local)
0x00000009 1 10.2.2.39
0x0000000a 1 10.2.2.96
0x0000000b 1 10.2.2.91
0x0000000c 1 10.2.2.90
0x0000000d 1 10.2.2.92
0x0000000e 1 10.2.2.94
I've tried to write the authkey.pub file from another host, but I get "permission denied" after 1-2 minutes of no response.
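Roughly what that attempt looked like, from a quorate node (BADNODE again stands in for the misbehaving node; the path is the stock PVE one):

# stage the good key outside the cluster filesystem first
scp /etc/pve/authkey.pub root@BADNODE:/root/authkey.pub.good
# then on BADNODE: pmxcfs rejects writes while that node has no quorum,
# which would match the hang followed by "permission denied"
cp /root/authkey.pub.good /etc/pve/authkey.pub
# quick writability check of /etc/pve on BADNODE
touch /etc/pve/.writetest && echo writable || echo read-only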
corosync.conf:

logging {
debug: off
to_syslog: yes
}
nodelist {
node {
name: pve236
nodeid: 5
quorum_votes: 1
ring0_addr: 10.2.2.36
}
node {
name: pve237
nodeid: 6
quorum_votes: 1
ring0_addr: 10.2.2.37
}
node {
name: pve238
nodeid: 7
quorum_votes: 1
ring0_addr: 10.2.2.38
}
node {
name: pve239
nodeid: 9
quorum_votes: 1
ring0_addr: 10.2.2.39
}
node {
name: pve290
nodeid: 12
quorum_votes: 1
ring0_addr: 10.2.2.90
}
node {
name: pve291
nodeid: 11
quorum_votes: 1
ring0_addr: 10.2.2.91
}
node {
name: pve292
nodeid: 13
quorum_votes: 1
ring0_addr: 10.2.2.92
}
node {
name: pve293
nodeid: 4
quorum_votes: 1
ring0_addr: 10.2.2.93
}
node {
name: pve294
nodeid: 14
quorum_votes: 1
ring0_addr: 10.2.2.94
}
node {
name: pve295
nodeid: 3
quorum_votes: 1
ring0_addr: 10.2.2.95
}
node {
name: pve296
nodeid: 10
quorum_votes: 1
ring0_addr: 10.2.2.96
}
node {
name: pve297
nodeid: 2
quorum_votes: 1
ring0_addr: 10.2.2.97
}
node {
name: pve298
nodeid: 1
quorum_votes: 1
ring0_addr: 10.2.2.98
}
node {
name: pve299
nodeid: 8
quorum_votes: 1
ring0_addr: 10.2.2.99
}
}
quorum {
provider: corosync_votequorum
}
totem {
cluster_name: xxxx
config_version: 14
interface {
linknumber: 0
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}
pveversion -v:

proxmox-ve: 7.4-1 (running kernel: 5.11.22-7-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-5.15: 7.4-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-2-pve: 5.15.39-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph: 16.2.13-pve1
ceph-fuse: 16.2.13-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.7.2
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
pve-cluster is on a VLAN; ping between nodes is a constant 0.124-0.140 ms.
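The numbers are from plain ping on the cluster VLAN, something like this (corosync-cfgtool ships with corosync and shows the knet link state):

# knet link status as corosync sees it, per configured node
corosync-cfgtool -s
# sustained latency check against a peer's ring0 address from corosync.conf
ping -c 100 -i 0.2 10.2.2.99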
Any ideas where I should look?