How to retry joining a cluster?

floh79

Hello,

I tried to join a new node to a cluster. Unfortunately it got stuck at "waiting for quorum...". Then I found out that I had forgotten to create the A records for the new node; that is done now.

Because the join process was stuck at "waiting for quorum...", I had to abort it with Ctrl+C. If I re-run the command pvecm add xx.xx.xx.xx, I get this:
Code:
root@Node03:~# pvecm add xx.xx.xx.xx
Please enter superuser (root) password for 'xx.xx.xx.xx': ********************
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* corosync is already running, is this node already in a cluster?!
Check if node may join a cluster failed!
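
(For context: those leftovers are the partial cluster state copied during the aborted join. If the node holds no guests yet, they can be cleared by hand; a minimal sketch based on the "Separate a Node Without Reinstalling" steps in the Proxmox docs, run only on the node that failed to join:)
Code:
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l                      # start pmxcfs in local mode so /etc/pve stays writable
rm /etc/pve/corosync.conf      # remove the copied cluster config
rm -r /etc/corosync/*          # remove the copied authkey and corosync config
killall pmxcfs                 # stop the local-mode instance again
systemctl start pve-cluster    # the node is standalone again
After that, pvecm add should no longer complain about an existing authkey, an existing corosync.conf, or a running corosync.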

If I list the nodes on the new node:
Code:
root@Node03:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 Node03 (local)
So the first three nodes (Master, Node01 and Node02) are missing from that output. :/

What should I do now?

Best regards
Floh
 
I think I'll reinstall Node03 and retry, but I have to remove Node03 from the cluster before joining the new one.

Node03 appears in the WebGUI (with a red cross symbol, because it's not working), but it doesn't appear in the CLI:
Code:
root@Node01:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 master
         2          1 node01 (local)
         3          1 node02

So how can I remove Node03 if it doesn't appear in the node list in the CLI?

Best regards
Floh
 
I know that documentation, but what should I use as the parameter for the delete command? Node03 doesn't appear in pvecm nodes, so I don't know which name it has.
pvecm delnode ??????

Should I just try pvecm delnode node03?
 
If it is not mentioned in the corosync config (/etc/pve/corosync.conf), you don't have to run that command.
If it is listed there, use that name.
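
(As a quick sketch of that check, on one of the existing cluster nodes:)
Code:
grep -A4 'node {' /etc/pve/corosync.conf   # lists each nodelist entry with its name
pvecm delnode node03                       # only if an entry with that name shows up there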
 
Hello Mira, that helped. Thank you very much; now I know about the corosync.conf file.

I tried again, but it's still stuck at "waiting for quorum...". :eek:

In the WebGUI, if I click on the new node in the cluster, I get:
tls_process_server_certificate: certificate verify failed (596) :rolleyes:

Maybe it just won't work because node03 is on version 7.0-8 while all the cluster nodes are on 6.3-3. I wanted to add node03 and move containers there before updating the existing nodes to 7.0-8; that's why I haven't updated them yet.

Shall I try with 6.4 first?

Best regards
Floh
 
Mixing versions is not recommended. It might work, but we don't guarantee it, and it should only be done for the duration of the upgrade.

Is there anything in the syslog/journal of the cluster node you join through and of the node you want to join?
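
(For example, something along these lines on both the joining node and the cluster node used for the join; the time window below is just a placeholder to be adjusted to the join attempt:)
Code:
journalctl -u corosync -u pve-cluster --since "10 minutes ago"
journalctl --since "10 minutes ago" | grep -iE 'corosync|pmxcfs|pvecm'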
 
Sure, I shouldn't keep it in a state with mixed versions. I'll keep all nodes on the same version.

Now I tried again with the new node on version 6.4, with the same result.

Syslog of the cluster node (version 6.3-3):
Code:
Jul 15 16:16:04 node01 pvedaemon[1090]: <root@pam> successful auth for user 'root@pam'
Jul 15 16:16:04 node01 pvedaemon[1092]: <root@pam> adding node node03 to cluster
Jul 15 16:16:04 node01 pmxcfs[902]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 10)
Jul 15 16:16:05 node01 corosync[1026]:   [CFG   ] Config reload requested by node 2
Jul 15 16:16:05 node01 corosync[1026]:   [TOTEM ] Configuring link 0
Jul 15 16:16:05 node01 corosync[1026]:   [TOTEM ] Configured link number 0: local addr: xx.xx.xx.xx, port=5405
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 has no active links
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 has no active links
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 15 16:16:05 node01 corosync[1026]:   [KNET  ] host: host: 4 has no active links
Jul 15 16:16:05 node01 pmxcfs[902]: [status] notice: update cluster info (cluster name  pvcluster, version = 10)
Nothing suspicious...

But the syslog of the new node (version 6.4-4):
Code:
Jul 15 16:16:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 16:16:00 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 16:16:00 node03 systemd[1]: Started Proxmox VE replication runner.
Jul 15 16:16:01 node03 pmxcfs[746]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: -1
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: illegal attempt to update using time 1626358561 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358561 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:05 node03 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jul 15 16:16:05 node03 pmxcfs[746]: [main] notice: teardown filesystem
Jul 15 16:16:05 node03 systemd[1219]: etc-pve.mount: Succeeded.
Jul 15 16:16:05 node03 systemd[1]: etc-pve.mount: Succeeded.
Jul 15 16:16:06 node03 pmxcfs[746]: [main] notice: exit proxmox configuration filesystem (0)
Jul 15 16:16:06 node03 systemd[1]: pve-cluster.service: Succeeded.
Jul 15 16:16:06 node03 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jul 15 16:16:06 node03 systemd[1]: Starting Corosync Cluster Engine...
Jul 15 16:16:06 node03 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jul 15 16:16:06 node03 corosync[1745]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Jul 15 16:16:06 node03 corosync[1745]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 15 16:16:06 node03 corosync[1745]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 15 16:16:06 node03 kernel: [  189.738922] sctp: Hash tables configured (bind 256/256)
Jul 15 16:16:06 node03 pmxcfs[1750]: [quorum] crit: quorum_initialize failed: 2
Jul 15 16:16:06 node03 pmxcfs[1750]: [quorum] crit: can't initialize service
Jul 15 16:16:06 node03 pmxcfs[1750]: [confdb] crit: cmap_initialize failed: 2
Jul 15 16:16:06 node03 pmxcfs[1750]: [confdb] crit: can't initialize service
Jul 15 16:16:06 node03 pmxcfs[1750]: [dcdb] crit: cpg_initialize failed: 2
Jul 15 16:16:06 node03 pmxcfs[1750]: [dcdb] crit: can't initialize service
Jul 15 16:16:06 node03 pmxcfs[1750]: [status] crit: cpg_initialize failed: 2
Jul 15 16:16:06 node03 pmxcfs[1750]: [status] crit: can't initialize service
Jul 15 16:16:06 node03 corosync[1745]:   [TOTEM ] totemknet initialized
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 15 16:16:06 node03 corosync[1745]:   [QB    ] server name: cmap
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 15 16:16:06 node03 corosync[1745]:   [QB    ] server name: cfg
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 15 16:16:06 node03 corosync[1745]:   [QB    ] server name: cpg
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 15 16:16:06 node03 corosync[1745]:   [WD    ] Watchdog not enabled by configuration
Jul 15 16:16:06 node03 corosync[1745]:   [WD    ] resource load_15min missing a recovery key.
Jul 15 16:16:06 node03 corosync[1745]:   [WD    ] resource memory_used missing a recovery key.
Jul 15 16:16:06 node03 corosync[1745]:   [WD    ] no resources configured.
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 15 16:16:06 node03 corosync[1745]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 15 16:16:06 node03 corosync[1745]:   [QB    ] server name: votequorum
Jul 15 16:16:06 node03 corosync[1745]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 15 16:16:06 node03 corosync[1745]:   [QB    ] server name: quorum
Jul 15 16:16:06 node03 corosync[1745]:   [TOTEM ] Configuring link 0
Jul 15 16:16:06 node03 corosync[1745]:   [TOTEM ] Configured link number 0: local addr: 10.10.100.3, port=5405
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 1 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 2 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [QUORUM] Sync members[1]: 4
Jul 15 16:16:06 node03 corosync[1745]:   [QUORUM] Sync joined[1]: 4
Jul 15 16:16:06 node03 corosync[1745]:   [TOTEM ] A new membership (4.5) was formed. Members joined: 4
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 16:16:06 node03 corosync[1745]:   [KNET  ] host: host: 3 has no active links
Jul 15 16:16:06 node03 corosync[1745]:   [QUORUM] Members[1]: 4
Jul 15 16:16:06 node03 corosync[1745]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 16:16:06 node03 systemd[1]: Started Corosync Cluster Engine.
Jul 15 16:16:07 node03 systemd[1]: Started The Proxmox VE cluster filesystem.
Jul 15 16:16:11 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:11 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:11 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358571 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:12 node03 pmxcfs[1750]: [status] notice: update cluster info (cluster name  pvcluster, version = 10)
Jul 15 16:16:12 node03 pmxcfs[1750]: [dcdb] notice: members: 4/1750
Jul 15 16:16:12 node03 pmxcfs[1750]: [dcdb] notice: all data is up to date
Jul 15 16:16:12 node03 pmxcfs[1750]: [status] notice: members: 4/1750
Jul 15 16:16:12 node03 pmxcfs[1750]: [status] notice: all data is up to date
Jul 15 16:16:21 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:21 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:21 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358581 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:31 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:31 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:31 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358591 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:41 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:41 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:41 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358601 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:51 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:16:51 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:51 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358611 when last update time is 1626365581 (minimum one second step)
Jul 15 16:17:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 16:17:00 node03 pvesr[1875]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 16:17:00 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 16:17:00 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 16:17:00 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 16:17:01 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/node03: -1
Jul 15 16:17:01 node03 pmxcfs[1750]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:17:01 node03 pmxcfs[1750]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358621 when last update time is 1626365581 (minimum one second step)
Jul 15 16:17:01 node03 cron[852]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

It seems pvesr has a problem. (There are also a lot of "RRD update error" notices that I should look into further.)
 
The `rrd` messages hint at an issue with time.
Code:
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: -1
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: /var/lib/rrdcached/db/pve2-storage/node03/local-lvm: illegal attempt to update using time 1626358561 when last update time is 1626365581 (minimum one second step)
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/node03/local: -1
Jul 15 16:16:02 node03 pmxcfs[746]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/node03/local: /var/lib/rrdcached/db/pve2-storage/node03/local: illegal attempt to update using time 1626358561 when last update time is 1626365581 (minimum one second step)

Looks like your node can't connect to the other nodes. Can you manually connect via SSH?
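
(A sketch of those two checks on the new node, with xx.xx.xx.xx again standing in for a cluster node's address:)
Code:
timedatectl status | grep -iE 'local time|synchronized'   # is the clock set and NTP-synced?
ssh root@xx.xx.xx.xx 'hostname; date'                      # does SSH work, and do the clocks roughly match?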
 
Time is OK; I double-checked with the date command.
The RRDC update errors already happen before I try to join, even though the node is freshly installed.

Manual SSH is possible (using pubkey).
 
OK, I fixed the RRDC update errors. It seems that reinstalling with the Proxmox installer didn't really clear the disk. I wiped the disk and reinstalled, and these errors are gone now. But I still cannot join the cluster:
Code:
Jul 15 20:55:19 node03 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jul 15 20:55:19 node03 pmxcfs[721]: [main] notice: teardown filesystem
Jul 15 20:55:19 node03 systemd[2559]: etc-pve.mount: Succeeded.
Jul 15 20:55:19 node03 systemd[1028]: etc-pve.mount: Succeeded.
Jul 15 20:55:19 node03 systemd[1]: etc-pve.mount: Succeeded.
Jul 15 20:55:20 node03 pmxcfs[721]: [main] notice: exit proxmox configuration filesystem (0)
Jul 15 20:55:20 node03 systemd[1]: pve-cluster.service: Succeeded.
Jul 15 20:55:20 node03 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jul 15 20:55:20 node03 systemd[1]: Starting Corosync Cluster Engine...
Jul 15 20:55:20 node03 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jul 15 20:55:20 node03 corosync[2899]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Jul 15 20:55:20 node03 corosync[2899]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jul 15 20:55:20 node03 corosync[2899]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 15 20:55:20 node03 kernel: [  801.443064] sctp: Hash tables configured (bind 256/256)
Jul 15 20:55:20 node03 pmxcfs[2911]: [quorum] crit: quorum_initialize failed: 2
Jul 15 20:55:20 node03 pmxcfs[2911]: [quorum] crit: can't initialize service
Jul 15 20:55:20 node03 pmxcfs[2911]: [confdb] crit: cmap_initialize failed: 2
Jul 15 20:55:20 node03 pmxcfs[2911]: [confdb] crit: can't initialize service
Jul 15 20:55:20 node03 pmxcfs[2911]: [dcdb] crit: cpg_initialize failed: 2
Jul 15 20:55:20 node03 pmxcfs[2911]: [dcdb] crit: can't initialize service
Jul 15 20:55:20 node03 pmxcfs[2911]: [status] crit: cpg_initialize failed: 2
Jul 15 20:55:20 node03 pmxcfs[2911]: [status] crit: can't initialize service
Jul 15 20:55:20 node03 corosync[2899]:   [TOTEM ] totemknet initialized
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 15 20:55:20 node03 corosync[2899]:   [QB    ] server name: cmap
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 15 20:55:20 node03 corosync[2899]:   [QB    ] server name: cfg
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 15 20:55:20 node03 corosync[2899]:   [QB    ] server name: cpg
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 15 20:55:20 node03 corosync[2899]:   [WD    ] Watchdog not enabled by configuration
Jul 15 20:55:20 node03 corosync[2899]:   [WD    ] resource load_15min missing a recovery key.
Jul 15 20:55:20 node03 corosync[2899]:   [WD    ] resource memory_used missing a recovery key.
Jul 15 20:55:20 node03 corosync[2899]:   [WD    ] no resources configured.
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 15 20:55:20 node03 corosync[2899]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 15 20:55:20 node03 corosync[2899]:   [QB    ] server name: votequorum
Jul 15 20:55:20 node03 corosync[2899]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 15 20:55:20 node03 corosync[2899]:   [QB    ] server name: quorum
Jul 15 20:55:20 node03 corosync[2899]:   [TOTEM ] Configuring link 0
Jul 15 20:55:20 node03 corosync[2899]:   [TOTEM ] Configured link number 0: local addr: 10.10.100.3, port=5405
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 1 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 2 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 15 20:55:20 node03 corosync[2899]:   [KNET  ] host: host: 3 has no active links
Jul 15 20:55:20 node03 corosync[2899]:   [QUORUM] Sync members[1]: 4
Jul 15 20:55:20 node03 corosync[2899]:   [QUORUM] Sync joined[1]: 4
Jul 15 20:55:20 node03 corosync[2899]:   [TOTEM ] A new membership (4.5) was formed. Members joined: 4
Jul 15 20:55:20 node03 corosync[2899]:   [QUORUM] Members[1]: 4
Jul 15 20:55:20 node03 corosync[2899]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 20:55:20 node03 systemd[1]: Started Corosync Cluster Engine.
Jul 15 20:55:21 node03 systemd[1]: Started The Proxmox VE cluster filesystem.
Jul 15 20:55:26 node03 pmxcfs[2911]: [status] notice: update cluster info (cluster name  pvcluster, version = 14)
Jul 15 20:55:26 node03 pmxcfs[2911]: [dcdb] notice: members: 4/2911
Jul 15 20:55:26 node03 pmxcfs[2911]: [dcdb] notice: all data is up to date
Jul 15 20:55:26 node03 pmxcfs[2911]: [status] notice: members: 4/2911
Jul 15 20:55:26 node03 pmxcfs[2911]: [status] notice: all data is up to date
Jul 15 20:56:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:56:01 node03 pvesr[2995]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 20:56:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 20:56:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 20:56:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 20:56:01 node03 cron[829]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

Any hints on what is causing the trouble?

I think it's because of: Jul 15 20:55:20 node03 pmxcfs[2911]: [quorum] crit: quorum_initialize failed: 2

So I double-checked:
Code:
root@node03:~# journalctl -u pvesr.service
-- Journal begins at Thu 2021-07-15 20:42:02 CEST, ends at Thu 2021-07-15 21:02:01 CEST. --
Jul 15 20:43:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:43:00 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:43:00 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:44:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:44:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:44:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:45:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:45:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:45:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:46:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:46:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:46:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:47:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:47:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:47:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:48:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:48:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:48:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:49:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:49:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:49:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:50:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:50:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:50:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:51:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:51:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:51:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:52:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:52:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:52:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:53:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:53:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:53:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:54:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:54:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:54:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:55:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:55:01 node03 systemd[1]: pvesr.service: Succeeded.
Jul 15 20:55:01 node03 systemd[1]: Finished Proxmox VE replication runner.
Jul 15 20:56:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:56:01 node03 pvesr[2995]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 20:56:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 20:56:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 20:56:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 20:57:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:57:01 node03 pvesr[3115]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 20:57:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 20:57:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 20:57:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 20:58:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:58:01 node03 pvesr[3240]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 20:58:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 20:58:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 20:58:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 20:59:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 20:59:01 node03 pvesr[3360]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 20:59:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 20:59:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 20:59:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 21:00:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 21:00:01 node03 pvesr[3480]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 21:00:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 21:00:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 21:00:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 21:01:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 21:01:01 node03 pvesr[3600]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 21:01:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 21:01:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 21:01:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.
Jul 15 21:02:00 node03 systemd[1]: Starting Proxmox VE replication runner...
Jul 15 21:02:01 node03 pvesr[3720]: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 15 21:02:01 node03 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 15 21:02:01 node03 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jul 15 21:02:01 node03 systemd[1]: Failed to start Proxmox VE replication runner.

So... why cfs-lock 'file-replication_cfg' error: no quorum!?

Best regards
Floh
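
(The pvesr failures are only a symptom of the missing quorum. To see whether corosync on node03 ever gets a working link to the other nodes, a quick sketch:)
Code:
pvecm status            # expected/total votes and whether this node is quorate
corosync-cfgtool -s     # per-peer link status as corosync sees it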
 
Please provide the network config (/etc/network/interfaces) of the node you want to join and one of the cluster nodes.
In addition please provide the corosync config of one of the cluster nodes (/etc/pve/corosync.conf).
 
Sure.

Code:
floh@node03:~$ cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface ens3 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.10.100.3/24
    gateway 10.10.100.1
    bridge-ports ens3
    bridge-stp off
    bridge-fd 0

Code:
root@katana03:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: master
    nodeid: 1
    quorum_votes: 1
    ring0_addr: aa.aa.aa.aa
  }
  node {
    name: node01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: bb.bb.bb.bb
  }
  node {
    name: node02
    nodeid: 3
    quorum_votes: 1
    ring0_addr: cc.cc.cc.cc
  }
  node {
    name: node03
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.100.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvcluster
  config_version: 16
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Maybe I see the problem now... should the line ring0_addr: 10.10.100.3 contain the external IP address? Right now it is the internal IP address between Proxmox and the firewall. I'm looking in the documentation for pvecm add IPADDRESS OPTIONS to see whether I can enter the external IP address there manually.
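
(The pvecm add command does accept an explicit link address; a sketch, with yy.yy.yy.yy as a placeholder for this node's external address, in case that is the address corosync should use:)
Code:
# run on the node being joined; yy.yy.yy.yy is a placeholder, not taken from this thread
pvecm add xx.xx.xx.xx --link0 yy.yy.yy.yy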

All nodes are accessible by an external IP address (without NAT), but the third one is in the office and is NATed. This could explain why I never had this issue before.

Best regards
Floh
 
You want to use a private network, not a public one for corosync ;)
So your cluster runs in a different subnet than the single node?
 
All nodes are in different locations (different cities). The firewalls are set up so that only the IP addresses of the other nodes are allowed to access ports 22 and 8006; all requests from other IP addresses are blocked by the firewall.
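
(Note that corosync/kronosnet traffic is UDP on port 5405, as the "port=5405" lines in the logs above show, so firewall rules that only allow TCP ports 22 and 8006 do not cover the cluster links; that would match the "has no active links" messages. A quick sketch to check what corosync expects:)
Code:
ss -ulpn | grep corosync    # corosync should be listening on UDP 5405 on each node
# the firewalls would then also need to allow UDP 5405 between all node addresses,
# in addition to TCP 22 and 8006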