total loss of quorum all cluster

Gastondc · Nov 14, 2024

The Cluster lost quorum from one moment to the next. I was checking the network and there is not a very firm problem around the network.

Quorum all nodes:

root@pve23:~# pvecm status
Cluster information
-------------------
Name: DC01
Config Version: 14
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Nov 14 14:18:10 2024
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000006
Ring ID: 6.10fdc
Quorate: No

Votequorum information
----------------------
Expected votes: 11
Highest expected: 11
Total votes: 1
Quorum: 6 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000006 1 192.168.150.248 (local)

Pveversion

pveversion
pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-2-pve)

corosync.conf:

Code:

root@pve23:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve04
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.150.243
  }
  node {
    name: pve05
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.150.244
  }
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 2
    ring0_addr: 192.168.150.251
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 192.168.150.252
  }
  node {
    name: pve21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.150.249
  }
  node {
    name: pve22
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.150.250
  }
  node {
    name: pve23
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.150.248
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 2
    ring0_addr: 192.168.150.253
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: DC01
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@pve1:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
addr = 192.168.150.251
status:
nodeid: 1: connected
nodeid: 2: connected
nodeid: 3: localhost
nodeid: 4: connected
nodeid: 5: connected
nodeid: 6: connected
nodeid: 7: connected
nodeid: 8: connected

root@pve1:~# journalctl -u corosync -f
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] rx: host: 5 link: 0 is up
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Nov 14 14:22:23 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Nov 14 14:22:24 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Nov 14 14:22:25 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: Global data MTU changed to: 1397
^C
root@pve1:~# journalctl -xe -u corosync
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Sync members[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Sync joined[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [TOTEM ] A new membership (3.fd0f) was formed. Members joined: 3
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Members[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 14 14:22:20 pve1 systemd[1]: Started corosync.service - Corosync Cluster Engine.
░░ Subject: A start job for unit corosync.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync.service has finished successfully.
░░
░░ The job identifier is 5117.
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 8 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 7 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 4 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 2 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 6 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] rx: host: 5 link: 0 is up
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Nov 14 14:22:23 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Nov 14 14:22:24 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Nov 14 14:22:25 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] link: host: 5 link: 0 is down
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:42 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:42 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)

root@pve04:~# systemctl status pve-cluster.service corosync.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Mon 2024-07-15 12:08:09 -03; 4 months 0 days ago
Main PID: 2421 (pmxcfs)
Tasks: 9 (limit: 309246)
Memory: 66.6M
CPU: 10h 27min 4.434s
CGroup: /system.slice/pve-cluster.service
└─2421 /usr/bin/pmxcfs

Nov 14 14:32:14 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 60
Nov 14 14:32:15 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 70
Nov 14 14:32:16 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 80
Nov 14 14:32:17 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 90
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 100
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retried 100 times
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] crit: cpg_send_message failed: 6
Nov 14 14:32:19 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 10
Nov 14 14:32:20 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 20
Nov 14 14:32:21 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 30

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Mon 2024-07-15 12:08:10 -03; 4 months 0 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2546 (corosync)
Tasks: 9 (limit: 309246)
Memory: 608.5M
CPU: 4d 15h 54min 18.993s
CGroup: /system.slice/corosync.service
└─2546 /usr/sbin/corosync -f

Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable

aaron · Nov 14, 2024

Is that network used by other "services" that might take up all the bandwidth? Prime examples are anything storage related. If the backup target is on that network, that could also be a reason.
Can the nodes still ping each other on that network?

Best practice is to have multiple Corosync links, this way Corosync has options, should one link become unusable. Ideally one physical network is there only for Corosync to avoid any issues of other services congesting the link.

Gastondc · Nov 14, 2024

Thanks for de fast reply

aaron said:
Is that network used by other "services" that might take up all the bandwidth? Prime examples are anything storage related. If the backup target is on that network, that could also be a reason.
Can the nodes still ping each other on that network?

Best practice is to have multiple Corosync links, this way Corosync has options, should one link become unusable. Ideally one physical network is there only for Corosync to avoid any issues of other services congesting the link.

There is a separate VLAN for Corosync. But it is the same physical cable.

It doesn't seem to be a bandwidth problem. I have 3 nodes that are “development” on the same cluster. I was able to reboot them, they have no load and also do not raise quorum between them 3.

It is a 10GB link. I have visibility by ssh, ping, and udp between all nodes.

aaron · Nov 15, 2024

Gastondc said:
There is a separate VLAN for Corosync. But it is the same physical cable.

Can they still ping each other? Is the VLAN still working? It isn't unheard of to forget to save a switch config in the startup/boot config and then a sudden switch reboot will make the switch lose the running config.

Gastondc · Sunday at 13:29

aaron said:
Can they still ping each other? Is the VLAN still working? It isn't unheard of to forget to save a switch config in the startup/boot config and then a sudden switch reboot will make the switch lose the running config.

Yes.

Ping is working between all nodes.
tcp/ip connections work
udp on corosync ports works.

It is very rare.

This dedicated vlan for corosync, suddenly it stopped working.

I added a second ring in corosync, but it didn't raise the quorum either.

I removed the first ring (the settings in the first post), leaving only the second ring, the quorum recovered. Very strange.

What could you check to see why it fails in the dedicated network for corosync?

I'm going to check the switches. But it was running over that vlan years ago, there were no changes to the switch. the switch had not been rebooted

Search

Search

total loss of quorum all cluster

Gastondc

Well-Known Member

aaron

Proxmox Staff Member

Gastondc

Well-Known Member

aaron

Proxmox Staff Member

Gastondc

Well-Known Member