[SOLVED] Cluster lost quorum

Stik · Apr 2, 2022

Hello, after unknown problems in our system network our proxmox cluster lost quorum and not working.

I've checked several threads but don't find solution.

We have 9 nodes in cluster and all of them show different results in pvecm - most slow only thmeselves:

Code:

root@vu203adm:~# pvecm status                   
Cluster information                             
-------------------                             
Name:             lightcluster                 
Config Version:   19                           
Transport:        knet                         
Secure auth:      on                           
                                                
Quorum information                             
------------------                             
Date:             Sat Apr  2 11:15:02 2022     
Quorum provider:  corosync_votequorum           
Nodes:            1                             
Node ID:          0x00000004                   
Ring ID:          4.81c8                       
Quorate:          No                           
                                                
Votequorum information                         
----------------------                         
Expected votes:   9                             
Highest expected: 9                             
Total votes:      1                             
Quorum:           5 Activity blocked           
Flags:                                         
                                                
Membership information                         
----------------------                         
    Nodeid      Votes Name                     
0x00000004          1 10.100.141.203 (local)    


root@vu204adm:~# pvecm status                 
Cluster information                           
-------------------                           
Name:             lightcluster                
Config Version:   19                          
Transport:        knet                        
Secure auth:      on                          
                                              
Quorum information                            
------------------                            
Date:             Sat Apr  2 11:45:01 2022    
Quorum provider:  corosync_votequorum         
Nodes:            1                           
Node ID:          0x00000007                  
Ring ID:          7.823c                      
Quorate:          No                          
                                              
Votequorum information                        
----------------------                        
Expected votes:   9                           
Highest expected: 9                           
Total votes:      1                           
Quorum:           5 Activity blocked          
Flags:                                        
                                              
Membership information                        
----------------------                        
    Nodeid      Votes Name                    
0x00000007          1 10.100.141.204 (local)

Some shows other nodes but not each other:

Code:

root@vu175adm:~# pvecm status                         
Cluster information                                   
-------------------                                   
Name:             lightcluster                       
Config Version:   19                                 
Transport:        knet                               
Secure auth:      on                                 
                                                      
Quorum information                                   
------------------                                   
Date:             Sat Apr  2 11:34:07 2022           
Quorum provider:  corosync_votequorum                 
Nodes:            4                                   
Node ID:          0x00000006                         
Ring ID:          1.794c                             
Quorate:          No                                 
                                                      
Votequorum information                               
----------------------                               
Expected votes:   9                                   
Highest expected: 9                                   
Total votes:      1                                   
Quorum:           5 Activity blocked                 
Flags:                                               
                                                      
Membership information                               
----------------------                               
    Nodeid      Votes Name                           
0x00000001          1 10.100.140.176                 
0x00000004          1 10.100.141.203                 
0x00000005          1 10.100.140.174                 
0x00000006          1 10.100.140.175 (local)

On one node I've stopped pve-cluster but now it cannot start:

Code:

root@vu205adm:~# pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

root@vu205adm:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: activating (start) since Sat 2022-04-02 11:42:34 MSK; 56s ago
Cntrl PID: 92740 (pmxcfs)
    Tasks: 3 (limit: 17203)
   Memory: 3.2M
   CGroup: /system.slice/pve-cluster.service
           ├─92740 /usr/bin/pmxcfs
           └─92745 /usr/bin/pmxcfs

апр 02 11:43:22 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 470
апр 02 11:43:23 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 480
апр 02 11:43:24 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 490
апр 02 11:43:25 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 500
апр 02 11:43:26 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 510
апр 02 11:43:27 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 520
апр 02 11:43:28 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 530
апр 02 11:43:29 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 540
апр 02 11:43:30 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 550
апр 02 11:43:31 vu205adm pmxcfs[92745]: [dcdb] notice: cpg_join retry 560

I've totally lost and need your help.
My only idea is to turn off all nodes and start them one by one, because they see each other - checked by ping:

Code:

root@vu204adm:~# ping 10.100.141.203                               
PING 10.100.141.203 (10.100.141.203) 56(84) bytes of data.         
64 bytes from 10.100.141.203: icmp_seq=1 ttl=64 time=0.052 ms     
64 bytes from 10.100.141.203: icmp_seq=2 ttl=64 time=0.071 ms     
64 bytes from 10.100.141.203: icmp_seq=3 ttl=64 time=0.066 ms     
64 bytes from 10.100.141.203: icmp_seq=4 ttl=64 time=0.108 ms     
^C                                                                 
--- 10.100.141.203 ping statistics ---                             
4 packets transmitted, 4 received, 0% packet loss, time 66ms       
rtt min/avg/max/mdev = 0.052/0.074/0.108/0.021 ms

Moayad · Apr 4, 2022

Hello,

Can you please post your network configuration cat /etc/network/interfaces and the Corosync.conf cat /etc/pve/corosync.conf?

Stik · Apr 4, 2022

Code:

root@vu174adm:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback


auto bond0
iface bond0 inet manual
        slaves eno1 eno2
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 10.100.140.174
        netmask 255.255.252.0
        gateway 10.100.143.254
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0



root@vu174adm:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vu174adm
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.100.140.174
  }
  node {
    name: vu175adm
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.100.140.175
  }
  node {
    name: vu176adm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.140.176
  }
  node {
    name: vu177adm
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.140.177
  }
  node {
    name: vu202adm
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.141.202
  }
  node {
    name: vu203adm
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.100.141.203
  }
  node {
    name: vu204adm
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.100.141.204
  }
  node {
    name: vu205adm
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.100.141.205
  }
  node {
    name: vu206adm
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.100.141.206
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: lightcluster
  config_version: 19
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

After nothing helped we decided to shutdown cluster (with 490 days uptime) and start one by one and now everything working.

If you have any hints why corosync restart doesn't help - I'll listen.

Gh0st · Apr 4, 2022

I recently had this problem too. Check all of your nodes. With me, one of my nodes had used all of the space on its / drive. I couldn't recover the node for some reason, I gave up with it after a few days but after putting the server in rescue mode I got quorum back on my other nodes. So if you can identify a node with a problem try shutting it down. The error messages you have are exactly the same as the ones I saw.

Pierre-Yves · Apr 4, 2022

Hello
2 ideas :
- you've got 10.100.140.x and 10.100.141.x nodes, are you sure your netmask is the same on all your nodes (a 255.255.255.0 on several nodes can explain all this mess)
- host firewall activated

Stik · Apr 4, 2022

I've checked netmask - it is same.
And I didn't change anything in firewall settings.

I'll mark thread solved, but feel free to post debug solutions.

Moayad · Apr 4, 2022

Hi,

Thank you for posting the network and Corosync config!

In addition to the above, we recommend setting two ring_x at least to the Corosync network in order to avoid losing the quorum [0]. Corosync requires a low latency network of less than 2ms between the nodes.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

holckj · Apr 24, 2022

Moayad said:
Hi,

Thank you for posting the network and Corosync config!

In addition to the above, we recommend setting two ring_x at least to the Corosync network in order to avoid losing the quorum [0]. Corosync requires a low latency network of less than 2ms between the nodes.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

I have a cluster setup with two "real" nodes, and one Raspberry Pi used as quorum device, see my corosync.conf below. All three nodes are connected to two networks, 10.10.10.0/24 and 192.168.1.0/24, with the 10.10.10.0/24 network being used only for cluster communication.
I can add ring1_addr specifications for the two "real" nodes, kira and sonja, but what about the quorum device?

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kira
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.4
  }
  node {
    name: sonja
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 10.10.10.3
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: 68B
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Search

Search

[SOLVED] Cluster lost quorum

Stik

Member

Moayad

Proxmox Staff Member

Stik

Member

Gh0st

Member

Pierre-Yves

Active Member

Stik

Member

Moayad

Proxmox Staff Member

holckj

Active Member

We value your privacy