Joining a cluster gets stuck after "Waiting for quorum...OK"

sheshman

Member
Jan 16, 2023
Hi,

Both nodes run 8.0.3 and are fully updated. I'm trying to create a cluster through the CLI, but every time it gets stuck at "Waiting for quorum...OK".

I waited over 2 hours to see whether it just needed time to complete :) but that wasn't the case.

When I tear the cluster down with:
Code:
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
pvecm delnode oldnode
rm /var/lib/corosync/*
the stuck process exits (obviously) :)

--All nodes can ping each other (by both IP and FQDN)
--All nodes can SSH to each other (by both IP and FQDN)
--The nodes are not behind NAT
--Both nodes are defined in /etc/hosts
--I tried to create the cluster with both IP and hostname; the result was the same
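
Working ping and SSH do not show which address a node name maps to in /etc/hosts, so that one item from the list can be checked separately. This is only a minimal sketch: hosts_addr and hosts_entry_ok are hypothetical helper names, not Proxmox tools, and the expected address has to be supplied by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the address(es) that a hosts file maps NAME to.
# Usage: hosts_addr NAME [HOSTS_FILE]   (HOSTS_FILE defaults to /etc/hosts)
hosts_addr() {
    name=$1
    file=${2:-/etc/hosts}
    # Strip comments, then print field 1 (the address) of every line that
    # lists NAME among its host names (fields 2 and later).
    sed 's/#.*//' "$file" |
        awk -v n="$name" '{ for (i = 2; i <= NF; i++) if ($i == n) print $1 }'
}

# Hypothetical helper: succeed only if NAME maps to exactly EXPECTED_IP.
# Usage: hosts_entry_ok NAME EXPECTED_IP [HOSTS_FILE]
hosts_entry_ok() {
    [ "$(hosts_addr "$1" "$3")" = "$2" ]
}
```

For example, `hosts_entry_ok helsinki 65.21.27.202 || echo "helsinki is not mapped as expected"` could be run on each node, and likewise for merkez with its own address.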

corosync.conf is as below:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: helsinki
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 65.21.27.202
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: egeclst012
  config_version: 1
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
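
Note that the nodelist above contains only helsinki. To compare what each node currently has in its nodelist, something like the following could be used. It is a rough sketch: corosync_nodes is a hypothetical helper name, and the awk script assumes the simple layout shown above (one "node {" per line, one key per line).

```shell
#!/bin/sh
# Hypothetical helper: print "name address" for every node { } block
# found in a corosync.conf (defaults to /etc/pve/corosync.conf).
corosync_nodes() {
    awk '
        $1 == "node" && $2 == "{"      { in_node = 1; name = ""; addr = ""; next }
        in_node && $1 == "name:"       { name = $2 }
        in_node && $1 == "ring0_addr:" { addr = $2 }
        in_node && $1 == "}"           { print name, addr; in_node = 0 }
    ' "${1:-/etc/pve/corosync.conf}"
}
```

Run against the config posted above, it prints the single line `helsinki 65.21.27.202`.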

I created Syslog.txt with the command below:
Code:
journalctl --since "2023-08-25 05:30" --until "2023-08-26 09:00" > /tmp/Syslog.txt
and attached it to the post.

pveversion -v output is as below:
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

I've tried every solution I could find, but no luck so far. Any advice will be appreciated.
 

Did you try adding the nodes via the web UI?
Just to mention it: if you have a cluster of two servers, each time one server goes offline the other one goes into a no-quorum / protective state, because it cannot tell whether it or the other node is the one that is offline. There are tons of posts on this topic.

You can try to override quorum with the command pvecm expected 1 or pvecm expected 2 on the node that's giving you the error; however, this can cause problems, so use it with caution.
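
The two-node behaviour follows from simple majority arithmetic, which can be sketched as below. This assumes one vote per node (as in quorum_votes: 1) and no special two-node or external-vote settings; quorum_needed and has_quorum are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Votes needed for quorum under a simple majority: floor(total / 2) + 1.
# Usage: quorum_needed TOTAL_VOTES
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if ONLINE_VOTES reaches the majority of TOTAL_VOTES.
# Usage: has_quorum TOTAL_VOTES ONLINE_VOTES
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}
```

With 2 total votes the majority is 2, so a single surviving node (1 vote) is below quorum; with 3 total votes the majority is still 2, so one node can drop out.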
 
Yes, I've already tried adding it through the web UI, and it also gets stuck (I forgot the last message shown before it got stuck). The node's web access goes down (I mean it no longer accepts your password). On the main cluster node you can see that it's online, but when you click on it, it says
Code:
'/etc/pve/nodes/merkez/pve-ssl.pem' does not exist!

When I check /etc/pve/nodes/merkez, there is no .pem file.

Honestly, I don't think it's a quorum issue, because it says OK and does not get stuck on waiting for quorum. But I'm a rookie after all, so maybe that is the problem.
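
To see at a glance which node directories lack the certificate named in the error, a small check like this could be run on either node. It is a minimal sketch: nodes_missing_cert is a hypothetical helper name, and it only tests for the presence of pve-ssl.pem, not whether the certificate is valid.

```shell
#!/bin/sh
# Hypothetical helper: list node directories that have no pve-ssl.pem.
# Usage: nodes_missing_cert [NODES_DIR]   (defaults to /etc/pve/nodes)
nodes_missing_cert() {
    base=${1:-/etc/pve/nodes}
    for dir in "$base"/*/; do
        # Skip the unexpanded pattern when NODES_DIR has no subdirectories.
        [ -d "$dir" ] || continue
        # Print the node name only when its certificate file is absent.
        [ -e "${dir}pve-ssl.pem" ] || basename "$dir"
    done
}
```

In the situation described above, the expectation would be that it lists merkez on helsinki.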
 
You can try to generate new certs with
pvecm updatecerts --force
I've tried this method on both hosts, but it didn't help:
Code:
rm /etc/pve/priv/pve-root*
pvecm updatecerts --force
systemctl restart pveproxy.service

Is that OK, or should I just run the command you provided?
 
OK, wait: on which node did you get the
'/etc/pve/nodes/merkez/pve-ssl.pem' does not exist! error,
on the node named merkez or on the other one?
 
1st (main system) - helsinki
2nd system - merkez

Just now I tried to add merkez to helsinki's cluster, and it's stuck on "Request addition of this node".

On helsinki, merkez is shown as online. When I go to Datacenter -> Cluster it says "'/etc/pve/nodes/merkez/pve-ssl.pem' does not exist! (500)", and merkez's web GUI is now returning ERR_TIMED_OUT. Screenshots are attached.
 

Did you ever get this fixed? I've been trying to troubleshoot this for over a week, and I've reinstalled Proxmox on both servers multiple times at this point.
 
For anyone who might find it helpful - this is what helped me when a freshly installed node refused to join the cluster (or was refused by the cluster).

This is only a short version, which might work. I executed more commands than shown here, so it is possible that this will not be enough. Please take it only as a hint.

On the new node:
Bash:
systemctl stop pve-cluster corosync
pmxcfs -l
pvecm updatecerts --force
scp -r /etc/pve/local/* <oldnode>:/etc/pve/nodes/<newnode>/
scp -r <oldnode>:/etc/pve/auth* /etc/pve/
scp -r <oldnode>:/etc/pve/pve-* /etc/pve/
systemctl restart pve-cluster corosync pveproxy 'pve-ha*'
(replace <oldnode> and <newnode> with your values, of course)

This stops corosync and forces local mode, regenerates the certs on the new node, transfers the new node's certificates to the cluster, transfers the cluster root certificates to the new node, and (re)starts the affected services.
It is important to restart pveproxy, as it serves the HTML frontend. I'm not sure whether restarting the pve-ha-crm and pve-ha-lrm services is needed (there were some "ha" errors in the log), but it doesn't hurt and is faster than rebooting the whole node.

Then, on the old node:
Bash:
systemctl restart pveproxy


One more thing that could help you: if you are not able to start the corosync service, try stopping corosync and pve-cluster and running pmxcfs in the foreground:
Bash:
systemctl stop pve-cluster corosync
pmxcfs -f
If it runs without error that way, stop it (^C), edit /lib/systemd/system/corosync.service and change
Type=notify
to
Type=simple
then
Bash:
systemctl daemon-reload
systemctl start pve-cluster corosync
It helped me in a situation where nothing else worked. After corosync started, I was able to copy the certs, and in the end I restored the original value Type=notify and it still worked. Maybe someone will figure out why it worked.
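
Instead of editing the packaged unit file under /lib, which a package update can overwrite, the same temporary Type= change could be made as a systemd drop-in. This is an alternative I did not use myself in the steps above, so take it as a sketch only; write_type_override is a hypothetical helper name, and the drop-in file name override.conf is just a convention.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that overrides Type= for a unit.
# Usage: write_type_override UNIT TYPE [DROPIN_ROOT]
# DROPIN_ROOT defaults to /etc/systemd/system, where systemd reads
# <unit>.d/*.conf override files.
write_type_override() {
    unit=$1
    type=$2
    root=${3:-/etc/systemd/system}
    mkdir -p "$root/$unit.d" || return 1
    printf '[Service]\nType=%s\n' "$type" > "$root/$unit.d/override.conf"
}
```

Usage would be `write_type_override corosync.service simple` followed by `systemctl daemon-reload`. To revert to the packaged Type=notify, delete the override.conf file and run `systemctl daemon-reload` again.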

This helped me a lot to understand where the problem could be: Proxmox Cluster file system (pmxcfs)
 