28 Node HA Cluster (can't join nodes into cluster)

sebstr

Member
Aug 1, 2018
Hi,

We're finally about to set up our 28-host environment, but we've bumped into some concerning problems.

The hosts aren't able to reach quorum when they are joined into a cluster, no matter which way and no matter the number of hosts. It seems like the SSL certificates aren't getting synced between the hosts, as we get the error "tls_process_server_certificate: certificate verify failed (596)".

I've been searching around the forum and found a few posts with the same error, but the actual solutions aren't applicable, as I'm denied access when trying to run said commands.

We're running the latest stable Proxmox VE (5.2-5) Enterprise Edition on all nodes, installed from the original ISO. All nodes have both IP addresses (mgmt/vmbr0 + HA/ring0) in their hosts file.

I get the same error when trying to join the nodes with a dedicated ring0 network and without one. What essentially happens is that the join itself is OK and the nodes successfully populate the host list, but they aren't able to communicate. After a while the GUI stops working too, due to the missing SSL certificate on the joining node.

Any help is greatly appreciated!

Best regards,
Seb

EDIT: These are all fresh installs; no VMs or any other data have been migrated, it's a brand-new environment. I've reinstalled several hosts to verify that the same problem reoccurs.

EDIT2: running pvecm updatecerts -f gives 'no quorum - unable to update files'
 
The hosts aren't able to reach quorum when they are joined into a cluster, no matter which way and no matter the number of hosts. It seems like the SSL certificates aren't getting synced between the hosts, as we get the error "tls_process_server_certificate: certificate verify failed (596)".
Do all nodes fail to get quorum? Is multicast working? You can test with omping; you can find the command in the docs.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
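For reference, the test from that chapter looks roughly like this (run it on all nodes at the same time; node1/node2/node3 are placeholders for your hostnames or ring0 IPs):
Code:
# short multicast test (~1 minute), started simultaneously on every node
omping -c 10000 -i 0.001 -F -q node1 node2 node3

# longer test (~10 minutes) to catch IGMP snooping timeouts on the switch
omping -c 600 -i 1 -q node1 node2 node3

The multicast lines in the output should show (close to) 0% loss.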

EDIT2: running pvecm updatecerts -f gives 'no quorum - unable to update files'
For that, the pmxcfs (/etc/pve) needs to be writable, i.e. the node needs quorum.
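A quick way to check that (as I understand it, pmxcfs only becomes writable once the node is in a quorate partition):
Code:
# does this node currently see quorum?
pvecm status | grep -i quorate

# last resort on a single remaining node: temporarily lower the expected votes
# so /etc/pve becomes writable again (use with care)
pvecm expected 1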
 

Hi Alwin,

Thanks for your quick reply.

I've tried joining several nodes to each other, and all get the same kind of error. Multicast is working when tested with omping, with no packet loss whatsoever. I tested this before starting the clustering procedure and followed the wiki accordingly.

I've tried changing permissions on the different folders as well, but it only throws back 'operation not permitted'.
 
Seems like a problem with corosync/your cluster network. Are there any relevant log lines in the journal (`journalctl -r`, look for the corosync and pmxcfs services)? Also check the output of `pvecm status`.
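Concretely, something along these lines should do (pmxcfs logs under the pve-cluster service):
Code:
# reverse-chronological journal, limited to the cluster-related services
journalctl -r -u corosync -u pve-cluster

# current quorum and membership view of this node
pvecm status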
 
These are the only relevant lines on the node where I created the cluster:
Code:
Aug 01 16:44:59 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:59 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:58 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:58 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:58 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:53 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory
Aug 01 16:44:52 adv-bc01-b6 pveproxy[18541]: unable to read '/etc/pve/nodes/adv-bc01-b7/pve-ssl.pem' - No such file or directory

and pvecm status:
Code:
Quorum information
------------------
Date:             Wed Aug  1 16:57:48 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/4
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.99.105 (local)

This is on the joining node:
Code:
Aug 01 16:58:30 adv-bc01-b7 pveproxy[19997]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1643.
Aug 01 16:58:30 adv-bc01-b7 pveproxy[1970]: worker 19997 started
Aug 01 16:58:30 adv-bc01-b7 pveproxy[1970]: starting 1 worker(s)
Aug 01 16:58:30 adv-bc01-b7 pveproxy[1970]: worker 19987 finished
Aug 01 16:58:30 adv-bc01-b7 pveproxy[19987]: worker exit
Aug 01 16:58:29 adv-bc01-b7 pveproxy[19996]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1643.
Aug 01 16:58:29 adv-bc01-b7 pveproxy[1970]: worker 19996 started
Aug 01 16:58:29 adv-bc01-b7 pveproxy[1970]: starting 1 worker(s)
Aug 01 16:58:29 adv-bc01-b7 pveproxy[1970]: worker 19986 finished
Aug 01 16:58:29 adv-bc01-b7 pveproxy[19986]: worker exit
--
Aug 01 16:20:01 adv-bc01-b7 cron[1857]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Aug 01 16:20:01 adv-bc01-b7 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 01 16:20:01 adv-bc01-b7 systemd[1]: pvesr.service: Unit entered failed state.
Aug 01 16:20:01 adv-bc01-b7 systemd[1]: Failed to start Proxmox VE replication runner.
Aug 01 16:20:01 adv-bc01-b7 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 01 16:20:01 adv-bc01-b7 pvesr[16714]: error with cfs lock 'file-replication_cfg': no quorum!
Aug 01 16:20:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:19:17 adv-bc01-b7 pmxcfs[16647]: [status] notice: all data is up to date
Aug 01 16:19:17 adv-bc01-b7 pmxcfs[16647]: [status] notice: members: 2/16647
Aug 01 16:19:17 adv-bc01-b7 pmxcfs[16647]: [dcdb] notice: all data is up to date
Aug 01 16:19:17 adv-bc01-b7 pmxcfs[16647]: [dcdb] notice: members: 2/16647
Aug 01 16:19:17 adv-bc01-b7 pmxcfs[16647]: [status] notice: update cluster info (cluster name  bc01-ha, version = 2)
Aug 01 16:19:13 adv-bc01-b7 systemd[1]: Started The Proxmox VE cluster filesystem.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QUORUM] Members[1]: 2
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [CPG   ] downlist left_list: 0 received
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [TOTEM ] A new membership (10.99.99.106:4) was formed. Members joined: 2
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QB    ] server name: quorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QB    ] server name: votequorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QUORUM] Using quorum provider corosync_votequorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [WD    ] no resources configured.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [WD    ] resource memory_used missing a recovery key.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [WD    ] resource load_15min missing a recovery key.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QB    ] server name: cpg
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QB    ] server name: cfg
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [QUORUM] Members[1]: 2
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: warning [CPG   ] downlist left_list: 0 received
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [TOTEM ] A new membership (10.99.99.106:4) was formed. Members joined: 2
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [QB    ] server name: quorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [QB    ] server name: votequorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 01 16:19:12 adv-bc01-b7 systemd[1]: Started Corosync Cluster Engine.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [QB    ] server name: cmap
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [WD    ] no resources configured.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: warning [WD    ] resource memory_used missing a recovery key.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: warning [WD    ] resource load_15min missing a recovery key.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [QB    ] server name: cpg
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [QB    ] server name: cfg
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: info    [QB    ] server name: cmap
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]:  [TOTEM ] The network interface [10.99.99.106] is now up.
Aug 01 16:19:12 adv-bc01-b7 corosync[16632]: notice  [TOTEM ] The network interface [10.99.99.106] is now up.
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bind
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug 01 16:19:11 adv-bc01-b7 corosync[16632]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug 01 16:19:11 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:11 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:11 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [status] crit: can't initialize service
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [status] crit: cpg_initialize failed: 2
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [dcdb] crit: can't initialize service
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [dcdb] crit: cpg_initialize failed: 2
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [confdb] crit: can't initialize service
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [confdb] crit: cmap_initialize failed: 2
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [quorum] crit: can't initialize service
Aug 01 16:19:11 adv-bc01-b7 pmxcfs[16647]: [quorum] crit: quorum_initialize failed: 2
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: Starting Corosync Cluster Engine...
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: pve-cluster.service: Unit entered failed state.
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: pve-cluster.service: Killing process 1791 (pmxcfs) with signal SIGKILL.
Aug 01 16:19:11 adv-bc01-b7 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Aug 01 16:19:11 adv-bc01-b7 pve-firewall[1898]: firewall update time (5.036 seconds)


Aug 01 16:19:10 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pve-ha-lrm[2002]: updating service status from manager failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: status update time (5.014 seconds)
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: status update error: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: ipcc_send_rec[4] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:10 adv-bc01-b7 pvestatd[1891]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:09 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:09 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:09 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:07 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:07 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:07 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:06 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:04 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:03 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[3] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[2] failed: Connection refused
Aug 01 16:19:02 adv-bc01-b7 pveproxy[26039]: ipcc_send_rec[1] failed: Connection refused
Aug 01 16:19:01 adv-bc01-b7 pmxcfs[1791]: [main] notice: teardown filesystem
Aug 01 16:19:01 adv-bc01-b7 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Aug 01 16:19:01 adv-bc01-b7 pvedaemon[1925]: <root@pam> starting task UPID:adv-bc01-b7:000040E2:005B4EFC:5B61C155:clusterjoin:10.99.99.105:root@pam:
Aug 01 16:19:01 adv-bc01-b7 systemd[1]: Started Proxmox VE replication runner.
Aug 01 16:19:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:18:01 adv-bc01-b7 systemd[1]: Started Proxmox VE replication runner.
Aug 01 16:18:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:17:01 adv-bc01-b7 CRON[16452]: pam_unix(cron:session): session closed for user root
Aug 01 16:17:01 adv-bc01-b7 CRON[16453]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 01 16:17:01 adv-bc01-b7 CRON[16452]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 01 16:17:01 adv-bc01-b7 systemd[1]: Started Proxmox VE replication runner.
Aug 01 16:17:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:16:01 adv-bc01-b7 systemd[1]: Started Proxmox VE replication runner.
Aug 01 16:16:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:15:01 adv-bc01-b7 systemd[1]: Started Proxmox VE replication runner.
Aug 01 16:15:00 adv-bc01-b7 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 16:14:40 adv-bc01-b7 pvedaemon[1925]: <root@pam> end task UPID:adv-bc01-b7:00003E81:005AE7A0:5B61C04C:aptupdate::root@pam: OK
Aug 01 16:14:38 adv-bc01-b7 pvedaemon[16001]: update new package list: /var/lib/pve-manager/pkgupdates

and pvecm status on the joining node:
Code:
Quorum information
------------------
Date:             Wed Aug  1 17:03:21 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/4
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.99.99.106 (local)
 
Oh, and just to add: the above-mentioned lines are pulled from the two hosts that don't run on the ring0 network, but on the same network as management.

This is the Cisco switch they're running through on the ring0 network, and its IGMP settings:
Code:
ip igmp snooping querier address 10.99.250.0
ip igmp snooping querier
ip igmp snooping vlan 250 immediate-leave
ip igmp snooping vlan 250 robustness-variable 1
ip igmp snooping vlan 250 last-member-query-count 1
ip igmp snooping vlan 250 last-member-query-interval 1000

This is a node newly joined to a ring0 cluster:
Code:
Aug 01 17:18:19 adv-bc01-b9 pveproxy[13776]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/s
Aug 01 17:18:19 adv-bc01-b9 pveproxy[1930]: worker 13776 started
Aug 01 17:18:19 adv-bc01-b9 pveproxy[1930]: starting 1 worker(s)
Aug 01 17:18:19 adv-bc01-b9 pveproxy[1930]: worker 13775 finished
Aug 01 17:18:19 adv-bc01-b9 pveproxy[13775]: worker exit
Aug 01 17:18:14 adv-bc01-b9 pveproxy[13775]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/s
Aug 01 17:18:14 adv-bc01-b9 pveproxy[1930]: worker 13775 started
Aug 01 17:18:14 adv-bc01-b9 pveproxy[1930]: starting 1 worker(s)
Aug 01 17:18:14 adv-bc01-b9 pveproxy[1930]: worker 13767 finished
Aug 01 17:18:14 adv-bc01-b9 pveproxy[13767]: worker exit
Aug 01 17:18:10 adv-bc01-b9 pveproxy[11587]: unable to read '/etc/pve/nodes/adv-bc01-b8/pve-ssl.pem' - No such file or directory
Aug 01 17:18:09 adv-bc01-b9 pveproxy[11587]: unable to read '/etc/pve/nodes/adv-bc01-b8/pve-ssl.pem' - No such file or directory
Aug 01 17:18:09 adv-bc01-b9 pveproxy[11587]: Clearing outdated entries from certificate cache
Aug 01 17:18:09 adv-bc01-b9 pveproxy[11587]: unable to read '/etc/pve/nodes/adv-bc01-b8/pve-ssl.pem' - No such file or directory
Aug 01 17:18:09 adv-bc01-b9 pveproxy[13767]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/s
Aug 01 17:18:09 adv-bc01-b9 pveproxy[1930]: worker 13767 started
Aug 01 17:18:09 adv-bc01-b9 pveproxy[1930]: starting 1 worker(s)
Aug 01 17:18:09 adv-bc01-b9 pveproxy[1930]: worker 13766 finished
Aug 01 17:18:09 adv-bc01-b9 pveproxy[13766]: worker exit
Aug 01 17:18:04 adv-bc01-b9 pveproxy[13766]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/s
Aug 01 17:18:04 adv-bc01-b9 pveproxy[1930]: worker 13766 started
Aug 01 17:18:04 adv-bc01-b9 pveproxy[1930]: starting 1 worker(s)
Aug 01 17:18:04 adv-bc01-b9 pveproxy[1930]: worker 13734 finished
Aug 01 17:18:04 adv-bc01-b9 pveproxy[13734]: worker exit
Aug 01 17:18:01 adv-bc01-b9 cron[1822]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Aug 01 17:18:00 adv-bc01-b9 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 01 17:18:00 adv-bc01-b9 systemd[1]: pvesr.service: Unit entered failed state.
Aug 01 17:18:00 adv-bc01-b9 systemd[1]: Failed to start Proxmox VE replication runner.
Aug 01 17:18:00 adv-bc01-b9 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 01 17:18:00 adv-bc01-b9 pvesr[13735]: error with cfs lock 'file-replication_cfg': no quorum!
Aug 01 17:18:00 adv-bc01-b9 systemd[1]: Starting Proxmox VE replication runner...
Aug 01 17:17:59 adv-bc01-b9 pveproxy[13734]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/s
Aug 01 17:17:59 adv-bc01-b9 pveproxy[1930]: worker 13734 started
Aug 01 17:17:59 adv-bc01-b9 pveproxy[1930]: starting 1 worker(s)
Aug 01 17:17:59 adv-bc01-b9 pveproxy[1930]: worker 17618 finished
Aug 01 17:17:59 adv-bc01-b9 pveproxy[17618]: worker exit
Aug 01 17:17:43 adv-bc01-b9 pmxcfs[13696]: [status] notice: all data is up to date
Aug 01 17:17:43 adv-bc01-b9 pmxcfs[13696]: [status] notice: members: 2/13696
Aug 01 17:17:43 adv-bc01-b9 pmxcfs[13696]: [dcdb] notice: all data is up to date
Aug 01 17:17:43 adv-bc01-b9 pmxcfs[13696]: [dcdb] notice: members: 2/13696
Aug 01 17:17:43 adv-bc01-b9 pmxcfs[13696]: [status] notice: update cluster info (cluster name  test, version = 2)
Aug 01 17:17:38 adv-bc01-b9 systemd[1]: Started The Proxmox VE cluster filesystem.
Aug 01 17:17:38 adv-bc01-b9 pve-ha-lrm[1962]: updating service status from manager failed: Connection refused
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QUORUM] Members[1]: 2
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [CPG   ] downlist left_list: 0 received
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [TOTEM ] A new membership (10.99.250.108:4) was formed. Members joined: 2
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QB    ] server name: quorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QB    ] server name: votequorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QUORUM] Using quorum provider corosync_votequorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [WD    ] no resources configured.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [WD    ] resource memory_used missing a recovery key.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [WD    ] resource load_15min missing a recovery key.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QB    ] server name: cpg
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QB    ] server name: cfg
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [QUORUM] Members[1]: 2
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: warning [CPG   ] downlist left_list: 0 received
Aug 01 17:17:37 adv-bc01-b9 systemd[1]: Started Corosync Cluster Engine.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [QB    ] server name: cmap
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [TOTEM ] A new membership (10.99.250.108:4) was formed. Members joined: 2
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: info    [QB    ] server name: quorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: info    [QB    ] server name: votequorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: info    [WD    ] no resources configured.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: warning [WD    ] resource memory_used missing a recovery key.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: warning [WD    ] resource load_15min missing a recovery key.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 01 17:17:37 adv-bc01-b9 corosync[13677]:  [SERV  ] Service engine loaded: corosync configuration map access [0]

and its pvecm status:
Code:
Quorum information
------------------
Date:             Wed Aug  1 17:19:53 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/4
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.99.250.108 (local)
 
Could you post the output of omping between the nodes (no packet loss does not imply that the latency is within limits)?
Also make sure that there is nothing blocking corosync traffic between the nodes (omping runs over a different UDP port than corosync).

You could also compare /etc/corosync/corosync.conf on the nodes.
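If it helps, a rough way to check those points (NODE_B is a placeholder for the other node; corosync uses UDP ports 5404/5405 by default):
Code:
# which UDP ports corosync actually binds on this node
ss -ulpn | grep corosync

# any firewall rules that could drop/reject that traffic?
iptables -L -n -v | grep -iE 'drop|reject'

# compare the cluster config with the other node
diff /etc/corosync/corosync.conf <(ssh root@NODE_B cat /etc/corosync/corosync.conf)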
 

Here you go:

host 107: where I created the cluster
host 108: the joining node

omping host 107 -> 108:
Code:
10.99.250.108 :   unicast, xmt/rcv/%loss = 9624/9624/0%, min/avg/max/std-dev = 0.044/0.065/1.117/0.016
10.99.250.108 : multicast, xmt/rcv/%loss = 9624/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
omping host 108 -> 107:
Code:
10.99.250.107 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.043/0.065/1.150/0.018
10.99.250.107 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

corosync.conf comparison:

host 107:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: adv-bc01-b8
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.250.107
  }
  node {
    name: adv-bc01-b9
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.250.108
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: test
  config_version: 2
  interface {
    bindnetaddr: 10.99.250.107
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

host 108:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: adv-bc01-b8
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.250.107
  }
  node {
    name: adv-bc01-b9
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.250.108
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: test
  config_version: 2
  interface {
    bindnetaddr: 10.99.250.107
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 
Oh damn. Those multicast numbers from omping are no good. Could that be the actual issue I've been overlooking this whole time?
 
omping indicates that multicast is not working (the 100% in the loss column).
 
I would guess so. Unless you configure your cluster to use unicast (which is neither recommended, nor do I think it would work very well in a cluster as large as the one you're planning), multicast traffic passing through is needed for corosync (and pmxcfs, and thus the PVE cluster) to work.
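For completeness, switching corosync 2.x to unicast would just be a transport line in the totem section of /etc/corosync/corosync.conf, roughly like this (sketch based on the config posted above; as said, not recommended here):
Code:
totem {
  cluster_name: test
  config_version: 3        # bump the version whenever the file is edited
  interface {
    bindnetaddr: 10.99.250.107
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu          # unicast UDP instead of the default multicast
  version: 2
}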
 

It seems like that did the trick. So all it took was registering here for help and taking a second look at my own stats, which I had already double-checked.

Thanks for your help though, really appreciated.

#dontbestressedwhilesettinguplotsofstuff
 