Hi there,
We have a Proxmox cluster with all nodes in the same subnet. The cluster has consisted of 3 hosts and has worked as expected for the last few years. Recently we wanted to add a new host to the cluster. The new host is in the same subnet as well; nothing changed in the network configuration. We added the host following the tutorial, i.e. on the new node we ran:
Bash:
pvecm add <IP_Address_of_host_in_cluster>
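Membership is then inspected on the new node with the standard pvecm commands; the status output below was taken this way:
Bash:
# run on the newly added node to check cluster membership
pvecm status
pvecm nodes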
But after about 5 minutes we had a split brain in the cluster: the new node wasn't visible to the others and vice versa. The status output shows that quorum activity is blocked:
Code:
root@node4:~# pvecm status
Quorum information
------------------
Date:             Mon Jan 6 15:44:43 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4/2067668
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 <ip_of_node4> (local)
The other nodes can still see each other:
Code:
root@node1:~# pvecm status
Quorum information
------------------
Date:             Mon Jan 6 15:46:25 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          3/2096608
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 <ip_node3>
0x00000002          1 <ip_node2>
0x00000001          1 <ip_node1> (local)
/etc/pve/corosync.conf seems to be the same on the nodes:
Code:
root@node1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
and
Code:
root@node4:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: ndoe4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
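To rule out a stale local copy, it is probably also worth comparing the cluster-wide config with the copy corosync actually reads on each node; a quick checksum check along these lines should be enough:
Bash:
# run on every node; both files should have the same checksum on all four hosts
md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf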
I have discovered that the version of /etc/pve/.members differs between node1 and node4:
Code:
root@node4:~# cat /etc/pve/.members
{
"nodename": "node4",
"version": 7,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 0 },
"nodelist": {
"node1": { "id": 1, "online": 0, "ip": "<ip_node1>"},
"node2": { "id": 2, "online": 0, "ip": "<ip_node2>"},
"node3": { "id": 3, "online": 0, "ip": "<ip_node3"},
"node4": { "id": 4, "online": 1, "ip": "<ip_node4>"}
}
}
Code:
root@node1:~# cat /etc/pve/.members
{
"nodename": "chzhc1px010",
"version": 42,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1, "ip": "<ip_node1>"},
"node2": { "id": 2, "online": 1, "ip": "<ip_node2>"},
"node3": { "id": 3, "online": 1, "ip": "<ip_node3>"},
"node4": { "id": 4, "online": 0, "ip": "<ip_node4>"}
}
}
Is that a problem?
After we reboot node4, the nodes can see each other again, but about 5 minutes later we end up with a split brain once more.
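To catch the exact moment node4 drops out next time, I plan to keep the corosync ring status visible on it (plain corosync tooling, nothing custom):
Bash:
# refresh the local ring status every second on node4
watch -n 1 corosync-cfgtool -s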
The syslog of node4:
Code:
root@node4:~# tail /var/log/syslog -n 50
Jan 6 09:50:31 node4 corosync[1472]: warning [CPG ] downlist left_list: 3 received
Jan 6 09:50:31 node4 corosync[1472]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 6 09:50:31 node4 corosync[1472]: notice [QUORUM] Members[1]: 4
Jan 6 09:50:31 node4 corosync[1472]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 6 09:50:31 node4 corosync[1472]: [TOTEM ] A new membership (<ip_node4>:2067668) was formed. Members left: 3 2 1
Jan 6 09:50:31 node4 corosync[1472]: [TOTEM ] Failed to receive the leave message. failed: 3 2 1
Jan 6 09:50:31 node4 corosync[1472]: [CPG ] downlist left_list: 3 received
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan 6 09:50:31 node4 pmxcfs[1307]: [status] notice: members: 4/1307
Jan 6 09:50:31 node4 corosync[1472]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 6 09:50:31 node4 corosync[1472]: [QUORUM] Members[1]: 4
Jan 6 09:50:31 node4 corosync[1472]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 6 09:50:31 node4 pmxcfs[1307]: [status] notice: node lost quorum
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: cpg_send_message retried 1 times
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: received write while not quorate - trigger resync
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: leaving CPG group
Jan 6 09:50:31 node4 pve-ha-lrm[1678]: unable to write lrm status file - closing file '/etc/pve/nodes/node4/lrm_status.tmp.1678' failed - Operation not permitted
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: start cluster connection
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: all data is up to date
Jan 6 09:51:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 09:51:01 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:02 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:03 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:04 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:05 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:06 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:07 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:08 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:09 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:10 node4 pvesr[2202]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 6 09:51:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 6 09:52:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 09:52:01 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:02 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:03 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:04 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:05 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:06 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:07 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:08 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:09 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:10 node4 pvesr[2307]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 6 09:52:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
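If it helps, I can also pull the matching time window from corosync and pmxcfs on the other three nodes, for example with:
Bash:
# adjust the window to the time of the membership change
journalctl -u corosync -u pve-cluster --since "2020-01-06 09:45" --until "2020-01-06 09:55"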
I've already read about multicast issues in the forums, but I believe this is something else, since the infrastructure hasn't changed and the rest of the cluster is still working fine. I'd appreciate any help with this. Thank you.
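P.S. In case multicast does turn out to matter after all, I could still run the usual omping test across all four nodes and post the results (omping would have to be installed on every node first; the hostnames below are placeholders):
Bash:
# run at the same time on all four nodes; noticeable packet loss would point at multicast/IGMP problems
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4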