[SOLVED] One node in cluster brings everything down.

brosky

Well-Known Member
Oct 13, 2015
So, I have a 14 node cluster.

We had a switch failure, so we had to move all the frontend networking (public and PVE cluster) endpoints to a backup switch; after a couple of days, we moved them back.

Now, one node in the cluster misbehaves: I can't write to the /etc/pve folder.
I noticed that the authkey.pub file on that node was older, so, following advice from other threads, I removed it and rebooted the machine. Same behaviour:
1. It works for a couple of minutes, then corosync throws:
Jun 12 10:01:59 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 40
Jun 12 10:02:00 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 50
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] rx: host: 5 link: 0 is up
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jun 12 10:02:00 pve299 corosync[35762]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 12 10:02:01 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 60
Jun 12 10:02:02 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 70
Jun 12 10:02:03 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 80
Jun 12 10:02:04 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 90
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 100
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retried 100 times
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] crit: cpg_send_message failed: 6
Jun 12 10:02:06 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 10
Jun 12 10:02:07 pve299 corosync[35762]: [TOTEM ] Token has not been received in 8132 ms
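(For anyone hitting the same cpg_send_message retries, this is roughly how I check the knet link state from the affected node; the commands are standard corosync/PVE tools, the hostnames and timestamps above are just my own.)

# show this node's status for every configured knet link
corosync-cfgtool -s
# show per-neighbour connectivity and link state
corosync-cfgtool -n
# follow corosync and pmxcfs while the node rejoins
journalctl -f -u corosync -u pve-cluster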

If I power down that node, everything works again. When I power it up and it rejoins the cluster, the cluster becomes unstable: corosync errors, the UI is unresponsive ("invalid ticket" on that node), and the GUI throws me out.

I've fixed clocks/timezones and changed interfaces, cables, and SFP adapters.

With that host powered down:

Cluster information
-------------------
Name:             xxxx
Config Version:   14
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jun 13 18:16:05 2023
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000008
Ring ID:          1.104c0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      13
Quorum:           8
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.2.2.98
0x00000002          1 10.2.2.97
0x00000003          1 10.2.2.95
0x00000004          1 10.2.2.93
0x00000006          1 10.2.2.37
0x00000007          1 10.2.2.38
0x00000008          1 10.2.2.99 (local)
0x00000009          1 10.2.2.39
0x0000000a          1 10.2.2.96
0x0000000b          1 10.2.2.91
0x0000000c          1 10.2.2.90
0x0000000d          1 10.2.2.92
0x0000000e          1 10.2.2.94
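(With 14 expected votes, quorum is floor(14/2) + 1 = 8, so the remaining 13 nodes keep the cluster quorate while that host stays powered off.)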

I've tried to write the authkey.pub file from another host, but I get "permission denied" after 1-2 minutes of no response.
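(That part is expected as far as I understand it: /etc/pve is a FUSE filesystem provided by pmxcfs, and it turns read-only on any node that is not part of a quorate partition, so a failing write there is usually a symptom of the cluster communication problem rather than a separate one. A rough way to confirm from the affected node:)

# /etc/pve is the pmxcfs FUSE mount
findmnt /etc/pve
# check the cluster filesystem and corosync services
systemctl status pve-cluster corosync
# see whether this node currently considers itself quorate
pvecm status | grep -i quorate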

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve236
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.2.2.36
  }
  node {
    name: pve237
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.2.2.37
  }
  node {
    name: pve238
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.2.2.38
  }
  node {
    name: pve239
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.2.2.39
  }
  node {
    name: pve290
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.2.2.90
  }
  node {
    name: pve291
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.2.2.91
  }
  node {
    name: pve292
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.2.2.92
  }
  node {
    name: pve293
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.2.93
  }
  node {
    name: pve294
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.2.2.94
  }
  node {
    name: pve295
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.2.95
  }
  node {
    name: pve296
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.2.2.96
  }
  node {
    name: pve297
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.2.97
  }
  node {
    name: pve298
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.2.98
  }
  node {
    name: pve299
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.2.2.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xxxx
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
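(Side note: the config only defines link 0, so corosync has a single path. Corosync over knet supports a redundant second link, which would let the cluster communication survive a single switch or cable failure. A sketch only, assuming a spare NIC/subnet such as 10.3.3.0/24, which is hypothetical here: add a ring1_addr to every node entry in /etc/pve/corosync.conf and bump config_version, e.g.:)

  node {
    name: pve299
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.2.2.99
    # hypothetical second corosync network
    ring1_addr: 10.3.3.99
  }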
proxmox-ve: 7.4-1 (running kernel: 5.11.22-7-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-5.15: 7.4-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-2-pve: 5.15.39-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph: 16.2.13-pve1
ceph-fuse: 16.2.13-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.7.2
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

The PVE cluster network is on a VLAN, with a constant 0.124-0.140 ms ping.

Any ideas where I should look?
 
please post (from all nodes)
- pveversion -v
- contents of /etc/corosync/corosync.conf

how did you change the switches? did that also entail updating corosync.conf, or was it transparent from the PVE nodes' point of view?
 
Issue resolved.

It seems that even though latency on the corosync network was only 0.140 ms, 2-5% packet loss on the link was the source of the issue.
I replaced the link and, magically, everything healed and worked again.
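(For anyone else chasing this: latency alone won't reveal it, so it's worth measuring loss directly. A rough sketch from the affected node; 10.2.2.98 is just one of the peers on the corosync VLAN and the interface name is only an example:)

# 1000 pings at 50 ms intervals; the summary line reports packet loss
ping -q -c 1000 -i 0.05 10.2.2.98
# RX/TX error and drop counters on the corosync-facing interface
ip -s link show dev vmbr0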
 
