Total loss of quorum across the whole cluster

Gastondc

Well-Known Member
Aug 3, 2017
The cluster lost quorum from one moment to the next. I have been checking the network and cannot find any obvious problem with it.

Quorum status (the same on every node):

root@pve23:~# pvecm status
Cluster information
-------------------
Name: DC01
Config Version: 14
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Nov 14 14:18:10 2024
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000006
Ring ID: 6.10fdc
Quorate: No

Votequorum information
----------------------
Expected votes: 11
Highest expected: 11
Total votes: 1
Quorum: 6 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000006 1 192.168.150.248 (local)

pveversion:
Code:
pveversion
pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-2-pve)


corosync.conf:
Code:
root@pve23:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve04
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.150.243
  }
  node {
    name: pve05
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.150.244
  }
  node {
    name: pve1
    nodeid: 3
    quorum_votes: 2
    ring0_addr: 192.168.150.251
  }
  node {
    name: pve2
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 192.168.150.252
  }
  node {
    name: pve21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.150.249
  }
  node {
    name: pve22
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.150.250
  }
  node {
    name: pve23
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.150.248
  }
  node {
    name: pve3
    nodeid: 2
    quorum_votes: 2
    ring0_addr: 192.168.150.253
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: DC01
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
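(Counting the votes in the config: 5 nodes × 1 vote + 3 nodes × 2 votes = 11 expected votes, so at least 6 votes are needed for quorum; that matches the "Quorum: 6 Activity blocked" above with only 1 vote present.)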




root@pve1:~# corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
addr = 192.168.150.251
status:
nodeid: 1: connected
nodeid: 2: connected
nodeid: 3: localhost
nodeid: 4: connected
nodeid: 5: connected
nodeid: 6: connected
nodeid: 7: connected
nodeid: 8: connected



root@pve1:~# journalctl -u corosync -f
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] rx: host: 5 link: 0 is up
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Nov 14 14:22:23 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Nov 14 14:22:24 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Nov 14 14:22:25 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: Global data MTU changed to: 1397
^C
root@pve1:~# journalctl -xe -u corosync
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 4 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Sync members[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Sync joined[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [TOTEM ] A new membership (3.fd0f) was formed. Members joined: 3
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 6 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 2 has no active links
Nov 14 14:22:20 pve1 corosync[550128]: [QUORUM] Members[1]: 3
Nov 14 14:22:20 pve1 corosync[550128]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 14 14:22:20 pve1 systemd[1]: Started corosync.service - Corosync Cluster Engine.
░░ Subject: A start job for unit corosync.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync.service has finished successfully.
░░
░░ The job identifier is 5117.
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 8 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Nov 14 14:22:20 pve1 corosync[550128]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 7 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 4 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 2 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] rx: host: 6 link: 0 is up
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 14 14:22:21 pve1 corosync[550128]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] rx: host: 5 link: 0 is up
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Nov 14 14:22:22 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Nov 14 14:22:23 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Nov 14 14:22:24 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Nov 14 14:22:25 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Nov 14 14:22:27 pve1 corosync[550128]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] link: host: 5 link: 0 is down
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 14 14:22:41 pve1 corosync[550128]: [KNET ] host: host: 5 has no active links
Nov 14 14:22:42 pve1 corosync[550128]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 14 14:22:42 pve1 corosync[550128]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)



root@pve04:~# systemctl status pve-cluster.service corosync.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Mon 2024-07-15 12:08:09 -03; 4 months 0 days ago
Main PID: 2421 (pmxcfs)
Tasks: 9 (limit: 309246)
Memory: 66.6M
CPU: 10h 27min 4.434s
CGroup: /system.slice/pve-cluster.service
└─2421 /usr/bin/pmxcfs

Nov 14 14:32:14 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 60
Nov 14 14:32:15 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 70
Nov 14 14:32:16 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 80
Nov 14 14:32:17 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 90
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 100
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retried 100 times
Nov 14 14:32:18 pve04 pmxcfs[2421]: [status] crit: cpg_send_message failed: 6
Nov 14 14:32:19 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 10
Nov 14 14:32:20 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 20
Nov 14 14:32:21 pve04 pmxcfs[2421]: [status] notice: cpg_send_message retry 30

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Mon 2024-07-15 12:08:10 -03; 4 months 0 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2546 (corosync)
Tasks: 9 (limit: 309246)
Memory: 608.5M
CPU: 4d 15h 54min 18.993s
CGroup: /system.slice/corosync.service
└─2546 /usr/sbin/corosync -f

Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
Nov 14 14:31:54 pve04 corosync[2546]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
 
Is that network used by other "services" that might take up all the bandwidth? Prime examples are anything storage related. If the backup target is on that network, that could also be a reason.
Can the nodes still ping each other on that network?

Best practice is to have multiple Corosync links; that way Corosync has alternatives should one link become unusable. Ideally, one physical network is dedicated solely to Corosync, so that other services cannot congest the link.
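Roughly, a second link means giving every node a ring1_addr on a different network and adding a second interface to the totem section. A sketch with placeholder addresses (the 10.10.10.0/24 network is just an example; also bump config_version and follow the documented procedure for editing /etc/pve/corosync.conf):
Code:
nodelist {
  node {
    name: pve23
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.150.248
    ring1_addr: 10.10.10.248    # placeholder address on the second network
  }
  # ... every other node gets a ring1_addr on the second network as well ...
}

totem {
  cluster_name: DC01
  config_version: 15
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

With link_mode: passive, knet keeps one link active and fails over to the other one if it goes down.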
 
Thanks for the fast reply.
There is a separate VLAN for Corosync, but it runs over the same physical cable.

It doesn't seem to be a bandwidth problem. I have 3 "development" nodes in the same cluster; I was able to reboot them, they have no load, and even those 3 do not reach quorum among themselves.

It is a 10 Gb link. I have connectivity via SSH, ping, and UDP between all nodes.
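For completeness, this is roughly how I cross-checked it (as far as I understand these tools):
Code:
# Corosync's own per-neighbour link state (connected / MTU per link)
corosync-cfgtool -n
# Confirm corosync is bound to the expected address and UDP port (5405 by default)
ss -uapn | grep corosync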
 
Can they still ping each other? Is the VLAN still working? It isn't unheard of to forget to save the running config to the switch's startup/boot config; a sudden switch reboot then makes the switch lose that running config.
 
Yes.

Ping is working between all nodes.
TCP/IP connections work.
UDP on the Corosync ports works.

It is very strange.

This dedicated VLAN for Corosync suddenly stopped working.

I added a second ring in Corosync, but quorum did not come back either.

I removed the first ring (the settings in the first post), leaving only the second ring, and quorum recovered. Very strange.

What could I check to find out why it fails on the dedicated Corosync network?


I'm going to check the switches. But it had been running over that VLAN for years, there were no changes to the switch, and the switch had not been rebooted.
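Given the PMTUD messages in the corosync journal, one more thing I plan to check is whether the Corosync VLAN silently drops larger frames, roughly along these lines (sizes and the target address are just examples):
Code:
# Non-fragmentable ping close to the data MTU corosync negotiated (1397)
ping -M do -s 1400 -c 5 192.168.150.251
# And near the full 1500-byte MTU (1472 payload + 28 bytes ICMP/IP overhead)
ping -M do -s 1472 -c 5 192.168.150.251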
 
