Jan 6, 2020
Hi there,

We have a Proxmox cluster, with each node in the same subnet. The cluster has contained 3 hosts and everything has worked as it should for the last few years. Recently we wanted to add a new host to the cluster. This host is also in the same subnet, and nothing changed in the network configuration. We added the host following the tutorial:

Bash:
 pvecm add <IP_Address_of_host_in_cluster>

But after about 5 minutes we had a split brain in the cluster, where the new node wasn't visible to the others and vice versa. The status says that quorum activity was blocked:
Code:
root@node4:~#  pvecm status
Quorum information
------------------
Date:             Mon Jan  6 15:44:43 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4/2067668
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 <ip_of_node4> (local)

The other nodes can still see each other:

Code:
root@node1:~# pvecm status
Quorum information
------------------
Date:             Mon Jan  6 15:46:25 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          3/2096608
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 <ip_node3>
0x00000002          1 <ip_node2>
0x00000001          1 <ip_node1> (local)

/etc/pve/corosync.conf seems to be the same on the nodes:

Code:
root@node1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

and

Code:
root@node4:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

I have discovered that the version of /etc/pve/.members differs between node1 and node4:

Code:
root@node4:~# cat /etc/pve/.members
{
"nodename": "node4",
"version": 7,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 0 },
"nodelist": {
  "node1": { "id": 1, "online": 0, "ip": "<ip_node1>"},
  "node2": { "id": 2, "online": 0, "ip": "<ip_node2>"},
  "node3": { "id": 3, "online": 0, "ip": "<ip_node3"},
  "node4": { "id": 4, "online": 1, "ip": "<ip_node4>"}
  }
}

Code:
root@node1:~# cat /etc/pve/.members
{
"nodename": "chzhc1px010",
"version": 42,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 1 },
"nodelist": {
  "node1": { "id": 1, "online": 1, "ip": "<ip_node1>"},
  "node2": { "id": 2, "online": 1, "ip": "<ip_node2>"},
  "node3": { "id": 3, "online": 1, "ip": "<ip_node3>"},
  "node4": { "id": 4, "online": 0, "ip": "<ip_node4>"}
  }
}

Is that a problem?

The nodes can see each other again after we reboot node4, but after about 5 minutes we have a split brain again.

The syslog of node4:

Code:
root@node4:~# tail /var/log/syslog -n 50
Jan  6 09:50:31 node4 corosync[1472]: warning [CPG   ] downlist left_list: 3 received
Jan  6 09:50:31 node4 corosync[1472]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  6 09:50:31 node4 corosync[1472]: notice  [QUORUM] Members[1]: 4
Jan  6 09:50:31 node4 corosync[1472]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  6 09:50:31 node4 corosync[1472]:  [TOTEM ] A new membership (<ip_node4>:2067668) was formed. Members left: 3 2 1
Jan  6 09:50:31 node4 corosync[1472]:  [TOTEM ] Failed to receive the leave message. failed: 3 2 1
Jan  6 09:50:31 node4 corosync[1472]:  [CPG   ] downlist left_list: 3 received
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan  6 09:50:31 node4 pmxcfs[1307]: [status] notice: members: 4/1307
Jan  6 09:50:31 node4 corosync[1472]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  6 09:50:31 node4 corosync[1472]:  [QUORUM] Members[1]: 4
Jan  6 09:50:31 node4 corosync[1472]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  6 09:50:31 node4 pmxcfs[1307]: [status] notice: node lost quorum
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: cpg_send_message retried 1 times
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: received write while not quorate - trigger resync
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: leaving CPG group
Jan  6 09:50:31 node4 pve-ha-lrm[1678]: unable to write lrm status file - closing file '/etc/pve/nodes/node4/lrm_status.tmp.1678' failed - Operation not permitted
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: start cluster connection
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan  6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: all data is up to date
Jan  6 09:51:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan  6 09:51:01 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:02 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:03 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:04 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:05 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:06 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:07 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:08 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:09 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:51:10 node4 pvesr[2202]: error with cfs lock 'file-replication_cfg': no quorum!
Jan  6 09:51:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan  6 09:51:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan  6 09:51:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan  6 09:51:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan  6 09:52:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan  6 09:52:01 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:02 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:03 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:04 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:05 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:06 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:07 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:08 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:09 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan  6 09:52:10 node4 pvesr[2307]: error with cfs lock 'file-replication_cfg': no quorum!
Jan  6 09:52:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan  6 09:52:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan  6 09:52:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan  6 09:52:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.

I've read about issues with multicast in the forums already, but I believe this is something else, as the infrastructure hasn't changed and the rest of the cluster is still working fine. I appreciate any help on this. Thank you.
 
Hi narrateourale

Thanks for your response. We are using the following versions:

Node1:

Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1

Node2:
Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-16-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-4
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1

Node3:
Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-16-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-4
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1

Node4:
Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-23-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-11
pve-kernel-4.15.18-23-pve: 4.15.18-51
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1
 
If you only have node4 isolated, and the 3 other nodes can still see each other, it sounds like a multicast problem on node4.

Do you have IGMP snooping enabled on your physical network?
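
If you want to verify multicast between the nodes, one common tool is omping. A rough sketch (omping has to be installed on all nodes and started on all of them at roughly the same time; node1 ... node4 stand in for your real hostnames or IPs):

Code:
# start this on every node at roughly the same time;
# every node should report close to 0% loss for every peer
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4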



With only 4 nodes, you could try to test with unicast instead of multicast:

Edit /etc/pve/corosync.conf (while you have quorum) and add

Code:
totem {
  ....
  transport: udpu
}

(and increment the config_version)
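
Based on the corosync.conf shown above, the edited totem section would then look roughly like this (a sketch; config_version bumped from 7 to 8):

Code:
totem {
  cluster_name: VPSG4
  config_version: 8
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}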


Then, restart corosync on each node.
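
For example (a standard systemd command; run it on one node at a time):

Code:
systemctl restart corosync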

(If you have HA enabled, you need to disable it before changing the corosync config.)
 
Hello,
My case is much simpler. I have 2 PCs. The first one is a recently installed node that worked OK; I created a new cluster on it, and the cluster was created OK. But this new cluster has never worked with more than 1 node.
Then I decided to add a second node to this cluster. I only have 1 home gigabit LAN with 0.6 ms latency. The PCs have only 1 network adapter each, configured for exactly this 1 LAN.
I cannot build the cluster! Neither via the GUI (Connection error 401: permission denied - invalid PVE ticket) nor via the command line:
root@proxmox2:/etc# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': *******************
trying to acquire lock...
can't lock file '/var/lock/pvecm.lock' - got timeout

I have deleted that file, no luck, only errors:
root@proxmox2:/var/lock# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': *******************
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* corosync is already running, is this node already in a cluster?!
Check if node may join a cluster failed!
=============================================================================
NODE1:
Cluster information
-------------------
Name: dellcluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Jul 8 18:48:37 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.a
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
-----------------------------------------------------------------------------------------------------
NODE2:
Cluster information
-------------------
Name: dellcluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Jul 8 18:48:14 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.a
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.1.30 (local)
-----------------------------------------------------------------------------------------------------
corosync.conf:
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: proxmox1
nodeid: 1
quorum_votes: 1
ring0_addr: 192.168.1.40
}
node {
name: proxmox2
nodeid: 2
quorum_votes: 1
ring0_addr: 192.168.1.30
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: dellcluster
config_version: 6
interface {
linknumber: 0
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}
-----------------------------------------------------------
Firewalls were disabled (on both PCs) on the datacenter and node levels before trying to join the cluster. I can ssh from one to the other. These 2 machines have been installed within the last month, so they run the latest Proxmox version.
A cluster that cannot be built from scratch and containers that ALL went down - this is enough to kill the trust in Proxmox solutions.
Do you have any ideas?
 
Hi,
Hello,
My case is much simpler. I have 2 PCs. The first one is a recently installed node that worked OK; I created a new cluster on it, and the cluster was created OK. But this new cluster has never worked with more than 1 node.
You need quorum, and if you only have two nodes, then both need to be online for that. Otherwise you'll need a QDevice for vote support.
Then I decided to add a second node to this cluster. I only have 1 home gigabit LAN with 0.6 ms latency. The PCs have only 1 network adapter each, configured for exactly this 1 LAN.
I cannot build the cluster! Neither via the GUI (Connection error 401: permission denied - invalid PVE ticket) nor via the command line:
root@proxmox2:/etc# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': *******************
trying to acquire lock...
can't lock file '/var/lock/pvecm.lock' - got timeout

I have deleted that file, no luck, only errors:
EDIT: deleting such files is not always a good idea...
root@proxmox2:/var/lock# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': *******************
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* corosync is already running, is this node already in a cluster?!
Check if node may join a cluster failed!
EDIT: It seems like this node is already part of a cluster? Please check journalctl -b0 -u corosync.service on both nodes.
=============================================================================
NODE1:
Cluster information
-------------------
Name: dellcluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Jul 8 18:48:37 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.a
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
-----------------------------------------------------------------------------------------------------
NODE2:
Cluster information
-------------------
Name: dellcluster
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Jul 8 18:48:14 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.a
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.1.30 (local)
-----------------------------------------------------------------------------------------------------
corosync.conf:
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: proxmox1
nodeid: 1
quorum_votes: 1
ring0_addr: 192.168.1.40
}
node {
name: proxmox2
nodeid: 2
quorum_votes: 1
ring0_addr: 192.168.1.30
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: dellcluster
config_version: 6
interface {
linknumber: 0
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}
-----------------------------------------------------------
Firewalls were disabled (on both PCs) on the datacenter and node levels before trying to join the cluster. I can ssh from one to the other. These 2 machines have been installed within the last month, so they run the latest Proxmox version.
A cluster that cannot be built from scratch and containers that ALL went down - this is enough to kill the trust in Proxmox solutions.
Do you have any ideas?
From your outputs it seems that only one node is online (EDIT: or at least there's an issue with the nodes finding each other), so the cluster is not quorate and no new node can join. Please make sure both nodes are online and the cluster network works reliably between them before attempting to join another node.
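
One quick check on each node is the link status reported by corosync itself (a sketch; corosync-cfgtool ships with the corosync package):

Code:
# shows the local node's links and whether they are connected to the other node
corosync-cfgtool -s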
 
Hello,
I have looked into 2 articles about quorum and came to the conclusion that the minimum supported configuration for a normally working quorum is at least 3 nodes. I planned to have only 2 nodes. So, unless I have 3 nodes, the whole cluster will not work properly?
Yes, it looks like node1 sees the join attempt and adds node2 to its database, but does not see that node2 is online. This is strange because I am sure both nodes are online; they are here with me on the same home LAN, and they can ping and ssh each other.
Since this incident I have returned to the previous configuration (node1 with its cluster and node2 as standalone). Then I ran apt update on both nodes, which have only the no-subscription repository active. And then, when I tried to repeat the cluster join, I got a different error:
root@proxmox2:~# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': **************************
Establishing API connection with host '192.168.1.40'
The authenticity of host '192.168.1.40' can't be established.
X509 SHA256 key fingerprint is 83:33:4B:EB:E9:88:FB:90:6F:E2:84:BA:F3:FF:F6:CF:DF:17:65:CC:73:D6:12:19:14:E4:00:48:18:FD:5D:93.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.1.30'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1657378669.sql.gz'
waiting for quorum...
It waits for hours. But the difference now is that the 2 containers continue to run on node1. Both nodes see each other as "offline" in the web GUI.
So, because this is something different, I have already created a new post with some logs attached to it:
https://forum.proxmox.com/threads/c...-waiting-for-quorum-pve-manager-7-2-7.112009/
I will try to run that service check command when I get to the machines.

Another thing, probably not related: when I first tried to create a new cluster on node1 via the web GUI, I could not, because it told me something like "not enough privileges". I was connected as another admin user, properly configured in the Proxmox GUI with access to the root / of all commands. I could only create a cluster when I had logged in as root. So, in later experiments I log in as root just to be sure that this is not a user-related problem.
 
Hello,
I have looked into 2 articles about quorum and came to the conclusion that the minimum supported configuration for a normally working quorum is at least 3 nodes. I planned to have only 2 nodes. So, unless I have 3 nodes, the whole cluster will not work properly?
If you want HA, you need 3 nodes. The cluster should work when both nodes are up. If you really want to go ahead with only two nodes you can use a QDevice for vote support.
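
A rough sketch of the QDevice setup (assuming a third machine, e.g. any small Debian box outside the cluster, is available to provide the external vote; its IP below is a placeholder):

Code:
# on the external machine that provides the extra vote
apt install corosync-qnetd
# on both cluster nodes
apt install corosync-qdevice
# then, from one cluster node, register the external machine
pvecm qdevice setup <IP-of-external-machine>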

Yes, it looks like node1 sees the join attempt and adds node2 to its database, but does not see that node2 is online. This is strange because I am sure both nodes are online; they are here with me on the same home LAN, and they can ping and ssh each other.

Taken from the log in the other thread:
Code:
Jul  9 16:57:49 proxmox1 corosync[949]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul  9 16:57:49 proxmox1 corosync[949]:   [KNET  ] host: host: 2 has no active links
Jul 9 16:57:49 proxmox1 corosync[949]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 9 16:57:49 proxmox1 corosync[949]: [KNET ] host: host: 2 has no active links
Jul 9 16:57:49 proxmox1 corosync[949]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 9 16:57:49 proxmox1 corosync[949]: [KNET ] host: host: 2 has no active links

Code:
Jul  9 16:57:50 proxmox2 corosync[12190]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul  9 16:57:50 proxmox2 corosync[12190]:   [KNET  ] host: host: 1 has no active links
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 1 has no active links
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 1 has no active links
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
Jul 9 16:57:50 proxmox2 corosync[12190]: [KNET ] host: host: 2 has no active links

Corosync doesn't report that the links came up. On a working cluster, there should be messages like
Code:
Jul 11 15:39:44 pve702 corosync[1398]:   [KNET  ] rx: host: 1 link: 1 is up

What's the output of ip a on both nodes (for the relevant subnet)? Do you have any firewall? Corosync uses UDP ports 5405-5412 for its links 0-8 (not entirely sure if 5404 is also still required, it was in the past).
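
A quick way to check from each node that corosync is actually reachable (a sketch; corosync/knet uses UDP, so a plain TCP connect test won't tell you much, and 192.168.1.40 is just the other node's address from your config):

Code:
# confirm corosync is listening on its UDP link port(s) locally
ss -ulpn | grep corosync
# probe the default link port on the other node
# (UDP scan: "open|filtered" is the usual best case, "closed" means an ICMP reject came back)
nmap -sU -p 5405 192.168.1.40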

Since this incident I have returned to the previous configuration (node1 with its cluster and node2 as standalone). Then I ran apt update on both nodes, which have only the no-subscription repository active. And then, when I tried to repeat the cluster join, I got a different error:
How did you revert? After separating a node you need to re-install it from scratch before it can join again.

root@proxmox2:~# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': **************************
Establishing API connection with host '192.168.1.40'
The authenticity of host '192.168.1.40' can't be established.
X509 SHA256 key fingerprint is 83:33:4B:EB:E9:88:FB:90:6F:E2:84:BA:F3:FF:F6:CF:DF:17:65:CC:73:D6:12:19:14:E4:00:48:18:FD:5D:93.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.1.30'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1657378669.sql.gz'
waiting for quorum...
It waits for hours. But the difference now is that the 2 containers continue to run on node1. Both nodes see each other as "offline" in the web GUI.
So, because this is something different, I have already created a new post with some logs attached to it:
https://forum.proxmox.com/threads/c...-waiting-for-quorum-pve-manager-7-2-7.112009/
I will try to run that service check command when I get to the machines.

Another thing, probably not related: when I first tried to create a new cluster on node1 via the web GUI, I could not, because it told me something like "not enough privileges". I was connected as another admin user, properly configured in the Proxmox GUI with access to the root / of all commands. I could only create a cluster when I had logged in as root. So, in later experiments I log in as root just to be sure that this is not a user-related problem.
 
Hello,
Today I tried again to join the cluster, without success and with exactly the same symptoms:
I have no background firewalls (see logs); on proxmox1, both the datacenter-level and node-level firewalls are off (options: Firewall: No).
I have only one LAN; each node has only one network adapter and only one IPv4 address.
root@proxmox2:~# ping 192.168.1.40
PING 192.168.1.40 (192.168.1.40) 56(84) bytes of data.
64 bytes from 192.168.1.40: icmp_seq=1 ttl=64 time=0.164 ms
64 bytes from 192.168.1.40: icmp_seq=2 ttl=64 time=0.604 ms
64 bytes from 192.168.1.40: icmp_seq=3 ttl=64 time=0.265 ms
64 bytes from 192.168.1.40: icmp_seq=4 ttl=64 time=0.302 ms

root@proxmox1:/etc# ping proxmox2
PING proxmox2 (192.168.1.30) 56(84) bytes of data.
64 bytes from proxmox2 (192.168.1.30): icmp_seq=1 ttl=64 time=0.646 ms
64 bytes from proxmox2 (192.168.1.30): icmp_seq=2 ttl=64 time=0.564 ms
64 bytes from proxmox2 (192.168.1.30): icmp_seq=3 ttl=64 time=0.553 ms
64 bytes from proxmox2 (192.168.1.30): icmp_seq=4 ttl=64 time=0.689 ms

I have completely reinstalled proxmox2 from the ISO.
Both nodes are online and can ping and ssh each other. They are on the same home LAN, 3 switches apart, with no firewall in between.

root@proxmox2:~# pvecm add 192.168.1.40
Please enter superuser (root) password for '192.168.1.40': *******************
Establishing API connection with host '192.168.1.40'
The authenticity of host '192.168.1.40' can't be established.
X509 SHA256 key fingerprint is 83:3C:4B:EB:E9:88:F3:90:6F:E2:84:BA:F6:FF:F6:CF:DF:17:65:CC:7C:D6:12:19:14:E4:00:48:13:FD:5D:93.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '192.168.1.30'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1657808802.sql.gz'
waiting for quorum...

And it sits there indefinitely.
You said that corosync uses some ports; are they TCP or UDP?
When I run your command journalctl -b0 -u corosync.service on the first node, I get
Jul 14 16:26:41 proxmox1 corosync[950]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
What does that mean?
The same thing in syslog:
Jul 14 16:26:41 proxmox1 pvedaemon[999]: <root@pam> successful auth for user 'root@pam'
Jul 14 16:26:41 proxmox1 pvedaemon[1000]: <root@pam> adding node proxmox2 to cluster
Jul 14 16:26:41 proxmox1 pmxcfs[945]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 10)
Jul 14 16:26:41 proxmox1 corosync[950]: [CFG ] Config reload requested by node 1
Jul 14 16:26:41 proxmox1 corosync[950]: [TOTEM ] Configuring link 0
Jul 14 16:26:41 proxmox1 corosync[950]: [TOTEM ] Configured link number 0: local addr: 192.168.1.40, port=5405
Jul 14 16:26:41 proxmox1 corosync[950]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 14 16:26:41 proxmox1 corosync[950]: [QUORUM] Members[1]: 1
Jul 14 16:26:41 proxmox1 pmxcfs[945]: [status] notice: node lost quorum
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 has no active links
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 has no active links
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 14 16:26:41 proxmox1 corosync[950]: [KNET ] host: host: 2 has no active links
Jul 14 16:26:41 proxmox1 pmxcfs[945]: [status] notice: update cluster info (cluster name dellcluster, version = 10)
Jul 14 16:26:53 proxmox1 systemd[1]: Started Checkmk agent (PID 694/UID 997).
Jul 14 16:26:55 proxmox1 systemd[1]: check-mk-agent@1381-694-997.service: Succeeded.
Jul 14 16:26:55 proxmox1 systemd[1]: check-mk-agent@1381-694-997.service: Consumed 1.239s CPU time.
Jul 14 16:27:09 proxmox1 pvescheduler[621362]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 14 16:27:09 proxmox1 pvescheduler[621361]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 14 16:27:53 proxmox1 systemd[1]: Started Checkmk agent (PID 694/UID 997).
Jul 14 16:27:55 proxmox1 systemd[1]: check-mk-agent@1382-694-997.service: Succeeded.
Jul 14 16:27:55 proxmox1 systemd[1]: check-mk-agent@1382-694-997.service: Consumed 1.244s CPU time.
Jul 14 16:28:09 proxmox1 pvescheduler[621796]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 14 16:28:09 proxmox1 pvescheduler[621797]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
--------------------------------------------------------------------

I attach the logs in the file.
Thanks
 

Attachments

  • error20220714.txt (19.3 KB)
Today I upgraded both nodes to
Linux proxmox1 5.15.39-4-pve #1 SMP PVE 5.15.39-4 (Mon, 08 Aug 2022 15:11:15 +0200) x86_64
and tried to join the cluster again. It is NOT working. The second node sits at "waiting for quorum..." indefinitely.


root@proxmox1:/etc/pve/nodes# pvecm status
Cluster information
-------------------
Name: dellcluster
Config Version: 12
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Aug 29 16:55:10 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.3c
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
root@proxmox1:/etc/pve/nodes#

This "activity blocked" message should be much more elaborate and specific.
What is blocked? Where? How? What exactly this proxmox does not like?
Why it refuses second node to join the cluster?
What second node is waiting for exactly?
Is it even possible to join second node to one-node cluster?

With time it looks like this proxmox cluster thing is very unreliable and obscure.
How do you manage to build a cluster? It simply does not work out of the box in very simple configuration...
Also, I have found on
https://en.wikipedia.org/wiki/Corosync_Cluster_Engine
that the latest version, 3.0.2, was released 3 years ago and there has been no development since. Has it been maintained since then?
Very frustrating.
--------------------
Update:
I have moved the second node to the same switch, and voilà!
The cluster finally came up.

root@proxmox1:/etc/pve# pvecm status
Cluster information
-------------------
Name: dellcluster
Config Version: 12
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Aug 29 17:54:09 2022
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.40
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
0x00000002 1 192.168.1.30
root@proxmox1:/etc/pve#

So, the "3 switches apart" was the problem, i think.
 
Today I upgraded both nodes to
Linux proxmox1 5.15.39-4-pve #1 SMP PVE 5.15.39-4 (Mon, 08 Aug 2022 15:11:15 +0200) x86_64
and tried to join the cluster again. It is NOT working. The second node sits at "waiting for quorum..." indefinitely.


root@proxmox1:/etc/pve/nodes# pvecm status
Cluster information
-------------------
Name: dellcluster
Config Version: 12
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Aug 29 16:55:10 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.3c
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
root@proxmox1:/etc/pve/nodes#

This "activity blocked" message should be much more elaborate and specific.
What is blocked? Where? How? What exactly this proxmox does not like?
Why it refuses second node to join the cluster?
What second node is waiting for exactly?
Is it even possible to join second node to one-node cluster?

With time it looks like this proxmox cluster thing is very unreliable and obscure.
How do you manage to build a cluster? It simply does not work out of the box in very simple configuration...
Also, I have found on
https://en.wikipedia.org/wiki/Corosync_Cluster_Engine
that the latest version, 3.0.2, was released 3 years ago and there has been no development since. Has it been maintained since then?
The Wikipedia article is just outdated. See the GitHub page or the repository in Proxmox VE. Corosync is rather stable, so there are not that many updates.

Very frustrating.
--------------------
Update:
I have moved the second node to the same switch, and voilà!
The cluster finally came up.

root@proxmox1:/etc/pve# pvecm status
Cluster information
-------------------
Name: dellcluster
Config Version: 12
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Aug 29 17:54:09 2022
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.40
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.40 (local)
0x00000002 1 192.168.1.30
root@proxmox1:/etc/pve#

So, the "3 switches apart" was the problem, i think.
Glad you were able to solve it :). So it was some kind of network issue after all.
 
I have the same problem while adding a node, but it also broke my pve2 configuration. Why don't you create a backup of the /etc/pve folder on the joining node during the add-node process? What am I supposed to do now that adding the node hung on the "waiting for quorum..." message and deleted almost all files from that directory? What now? I have a production environment in the middle of a migration from VMware, with no extra servers for doing these jobs.
If I remember correctly, older versions of PVE were not so destructive during the cluster creation process; I could always roll back to a normal state. Now, in the newest version, I should reinstall everything? Where is that user-friendly progress?
 
@patefoniq first of all, please open a new thread. In it, describe everything and attach relevant excerpts that show the current status.

Otherwise it's like it always is: you should know what you're doing. Proxmox is very user-friendly and there is sufficient documentation. You can also install PVE anywhere at any time and play through your scenario. If you blindly replace a productive setup with an unfamiliar product whose features you are unfamiliar with, please do not blame the product.
 
I have the same problem while adding a node, but it also broke my pve2 configuration. Why don't you create a backup of the /etc/pve folder on the joining node during the add-node process? What am I supposed to do now that adding the node hung on the "waiting for quorum..." message and deleted almost all files from that directory? What now? I have a production environment in the middle of a migration from VMware, with no extra servers for doing these jobs.
If I remember correctly, older versions of PVE were not so destructive during the cluster creation process; I could always roll back to a normal state. Now, in the newest version, I should reinstall everything? Where is that user-friendly progress?

I agree with you, it could be less destructive.

@patefoniq first of all, please open a new thread. In it, describe everything and attach relevant excerpts that show the current status.

Otherwise it's like it always is: you should know what you're doing. Proxmox is very user-friendly and there is sufficient documentation. You can also install PVE anywhere at any time and play through your scenario. If you blindly replace a productive setup with an unfamiliar product whose features you are unfamiliar with, please do not blame the product.

I agree it should be a new thread with details. :)
 
Note: lots of new people with fresh accounts unfortunately get their first posts blocked (albeit temporarily ... for hours, though) by a spam filter, which is why they then post in existing ancient threads.
 
