[SOLVED] One node in cluster brings everything down.

brosky · Jun 13, 2023

So, I have a 14 node cluster.

We had a switch failure and we had to move all the frontend networking (public and pve cluster) endpoints to a backup switch and after a couple of days, we moved them back.

Now, one node from the cluster misbehaves, I can't write to /etc/pve folder.
I've noticed that the authkey.pub file on that node is older so taking notes from other threads, I've removed it and rebooted the machine. Same behaviour:
1. it works for a couple of minutes then corosync throws:


Jun 12 10:01:59 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 40
Jun 12 10:02:00 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 50
Jun 12 10:02:00 pve299 corosync[35762]:   [KNET  ] rx: host: 5 link: 0 is up
Jun 12 10:02:00 pve299 corosync[35762]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Jun 12 10:02:00 pve299 corosync[35762]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Jun 12 10:02:00 pve299 corosync[35762]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 12 10:02:01 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 60
Jun 12 10:02:02 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 70
Jun 12 10:02:03 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 80
Jun 12 10:02:04 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 90
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 100
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retried 100 times
Jun 12 10:02:05 pve299 pmxcfs[17870]: [status] crit: cpg_send_message failed: 6
Jun 12 10:02:06 pve299 pmxcfs[17870]: [status] notice: cpg_send_message retry 10
Jun 12 10:02:07 pve299 corosync[35762]:   [TOTEM ] Token has not been received in 8132 ms

If I power down that node, everything works again. When I power it up again and join the cluster, cluster becomes unstable, corosync errors, UI is unresponsive (invalid ticket on that node), GUI throws me out.

I've fixed clock/timezones, changed interfaces, cables, sfp adapters.

With that host powered down:


Cluster information
-------------------
Name:             xxxx
Config Version:   14
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Jun 13 18:16:05 2023
Quorum provider:  corosync_votequorum
Nodes:            13
Node ID:          0x00000008
Ring ID:          1.104c0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      13
Quorum:           8
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.2.2.98
0x00000002          1 10.2.2.97
0x00000003          1 10.2.2.95
0x00000004          1 10.2.2.93
0x00000006          1 10.2.2.37
0x00000007          1 10.2.2.38
0x00000008          1 10.2.2.99 (local)
0x00000009          1 10.2.2.39
0x0000000a          1 10.2.2.96
0x0000000b          1 10.2.2.91
0x0000000c          1 10.2.2.90
0x0000000d          1 10.2.2.92
0x0000000e          1 10.2.2.94

I've tried to write the authkey.pub file from another host but I get permission denied after 1-2 minutes of no response


logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve236
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.2.2.36
  }
  node {
    name: pve237
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.2.2.37
  }
  node {
    name: pve238
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.2.2.38
  }
  node {
    name: pve239
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.2.2.39
  }
  node {
    name: pve290
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.2.2.90
  }
  node {
    name: pve291
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.2.2.91
  }
  node {
    name: pve292
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.2.2.92
  }
  node {
    name: pve293
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.2.2.93
  }
  node {
    name: pve294
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.2.2.94
  }
  node {
    name: pve295
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.2.2.95
  }
  node {
    name: pve296
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.2.2.96
  }
  node {
    name: pve297
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.2.2.97
  }
  node {
    name: pve298
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.2.2.98
  }
  node {
    name: pve299
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.2.2.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: xxxx
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


proxmox-ve: 7.4-1 (running kernel: 5.11.22-7-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-5.15: 7.4-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.39-2-pve: 5.15.39-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph: 16.2.13-pve1
ceph-fuse: 16.2.13-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.7.2
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

pve-cluster is on a vlan, 0.124 -> 0.140ms constant ping

Any ideas where should I look ?

bbgeek17 · Jun 13, 2023

This reminds me of https://forum.proxmox.com/threads/adding-node-10-makes-puts-whole-cluster-out-of-order.115074/page-2
Take a look at the conversation there, even though its not solved, perhaps it can give you some troubleshooting ideas

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

fabian · Jun 14, 2023

please post (from all nodes)
- pveversion -v
- contents of /etc/corosync/corosync.conf

how did you change the switches? did that also entail updating corosync.conf, or was it transparent from the PVE nodes' point of view?

brosky · Jun 16, 2023

Issue resolved.

Is seems that even if I had 0.140ms on the corosync network, having a 2-3-5% packet loss was the source of the issue.
I've replaced the link and magically everything healed and worked again.

Search

Search

[SOLVED] One node in cluster brings everything down.

brosky

Well-Known Member

bbgeek17

Distinguished Member

fabian

Proxmox Staff Member

brosky

Well-Known Member