[SOLVED] Create cluster problem - possibly SSL related

tipex

Hi All

I’ve spent all week trying to get clustering working but I’m having problems. I’m now at the point where I need to reach out to the community.

Steps:
I install Proxmox 8.0.3 on two separate machines.
I edit the network config, delete the bridge and leave a single network interface configured with a static IP. This interface will be dedicated to cluster comms. eno2 is currently unplugged and unused; my plan is to create a bridge from eno2 for VMs to use, but for now I've left it out to simplify the network config. Here's a screenshot from one of my servers. The other server is set up the same, only with a different IP address.
Note I have also tried it without changing the network settings, just going with the default bridge configured. It makes no difference, i.e. I still get the same problem.

[Screenshot: network configuration on node 1]

On node 1 I create a cluster, copy the join cluster info and paste it into node 2. On node 2 I select which network interface to use and enter the root password. The joining then begins and I lose connection to node 2. I think losing connection is possibly expected at this point, but I don’t know enough about the process to be certain.
[Screenshot: cluster join in progress on node 2]
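For the record, I believe the same create/join flow can be driven from each node's shell with pvecm; a minimal sketch using this cluster's name and IPs (the CLI output can be more informative than the GUI):

Bash:
# On node 1 (172.20.0.140): create the cluster
pvecm create HH-Cluster

# On node 2 (172.20.0.143): join it, pointing at node 1 and pinning
# the corosync link to node 2's own cluster IP
pvecm add 172.20.0.140 --link0 172.20.0.143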

On node 1 I can see that node 2 has been added - so something worked. But node 2 has a red cross on it :(

[Screenshot: node 2 showing a red cross in node 1's GUI]


When I try to access node 2 from the GUI of node 1 I get some messages about SSL.
[Screenshot: SSL error message when accessing node 2 from node 1]
I thought maybe it’s because the default installs use self-signed certs, and when trying to access them via a browser you get the untrusted warning. As a human you can simply accept the warning and proceed to the login page, but a piece of code can’t do such a thing, so maybe that’s why the cluster joining is failing.
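I believe you can check which certificate the API port actually serves without going through a browser; a sketch using node 2's IP and hostname from my setup:

Bash:
# Print the subject/issuer/validity of the certificate served on port 8006
openssl s_client -connect 172.20.0.143:8006 -servername hypervisor-b-1 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates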


With this theory in mind, I installed Proxmox again from scratch on both servers and, before creating a cluster, uploaded my Let's Encrypt SSL certs to both machines. I verified that the browser recognised that official SSL certs were being used. I restarted both servers and then tried creating a cluster again. Still got the same issue.

Another thing I have tried: on node 1 I copied the SSL certs from “/etc/pve/nodes/hypervisor-a-1/” to “/etc/pve/nodes/hypervisor-b-1/” as I saw they weren’t present. I restarted both machines but still no luck.
If I SSH into node 2, it does not have the same layout in its "/etc/pve/nodes" directory. Maybe this is normal?

If it helps:
I can SSH from the GUI of node 1 to node 2.

I can't reach node 2 via its web interface, which possibly explains why node 1 can't do the same. It's like as soon as I add node 2 to the cluster it dies.

I’m not really sure what to do at this point. My next task was to add an RPi as a 3rd voting node, but there is no point moving on to this if I can't get 2 nodes doing anything.
 
Hi,

Do you have physical access to node 2? If yes, can you check whether the network is connected?

By the way, you do not need to edit the certificates on the nodes.
 
Yes I have physical access to node 2.

I can SSH into node 2 directly from my laptop.
I can also SSH into node 2 from the GUI of node 1.
Node 2 can ping 8.8.8.8 successfully.
 
Can you please try to renew the certificates, using the following command:

Bash:
pvecm updatecerts --force

If that doesn't help, I would check the syslog for error messages.
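For example, something along these lines on both nodes while reproducing the join (a sketch; these are the standard PVE unit names):

Bash:
# Follow the cluster-related services live
journalctl -f -u corosync -u pve-cluster -u pvedaemon -u pveproxy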
 
pvecm updatecerts --force did not work unfortunately.

I reinstalled Proxmox from scratch on both servers and ran journalctl -f on each server over a dedicated SSH connection, i.e. not via the Proxmox GUI.

Here is the output from node 1:

Code:
Creating a cluster via the GUI:

Jul 11 20:55:09 hypervisor-a-1 pvedaemon[2080]: <root@pam> starting task UPID:hypervisor-a-1:00004974:00023E5C:64ADB39D:clustercreate:HH-Cluster:root@pam:
Jul 11 20:55:09 hypervisor-a-1 systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).
Jul 11 20:55:09 hypervisor-a-1 systemd[1]: Stopping pve-cluster.service - The Proxmox VE cluster filesystem...
Jul 11 20:55:09 hypervisor-a-1 pmxcfs[1918]: [main] notice: teardown filesystem
Jul 11 20:55:09 hypervisor-a-1 systemd[1]: etc-pve.mount: Deactivated successfully.
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[1918]: [main] notice: exit proxmox configuration filesystem (0)
Jul 11 20:55:10 hypervisor-a-1 systemd[1]: pve-cluster.service: Deactivated successfully.
Jul 11 20:55:10 hypervisor-a-1 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 11 20:55:10 hypervisor-a-1 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18812]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18812]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 1)
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [quorum] crit: quorum_initialize failed: 2
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [quorum] crit: can't initialize service
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [confdb] crit: cmap_initialize failed: 2
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [confdb] crit: can't initialize service
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [dcdb] crit: cpg_initialize failed: 2
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [dcdb] crit: can't initialize service
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [status] crit: cpg_initialize failed: 2
Jul 11 20:55:10 hypervisor-a-1 pmxcfs[18813]: [status] crit: can't initialize service
Jul 11 20:55:11 hypervisor-a-1 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 11 20:55:11 hypervisor-a-1 pvedaemon[2080]: <root@pam> end task UPID:hypervisor-a-1:00004974:00023E5C:64ADB39D:clustercreate:HH-Cluster:root@pam: OK
Jul 11 20:55:11 hypervisor-a-1 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jul 11 20:55:11 hypervisor-a-1 corosync[18819]:   [MAIN  ] Corosync Cluster Engine  starting up
Jul 11 20:55:11 hypervisor-a-1 corosync[18819]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 11 20:55:12 hypervisor-a-1 kernel: sctp: Hash tables configured (bind 1024/1024)
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [TOTEM ] totemknet initialized
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [KNET  ] pmtud: MTU manually set to: 0
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QB    ] server name: cmap
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QB    ] server name: cfg
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QB    ] server name: cpg
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [WD    ] Watchdog not enabled by configuration
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [WD    ] resource load_15min missing a recovery key.
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [WD    ] resource memory_used missing a recovery key.
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [WD    ] no resources configured.
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] This node is within the primary component and will provide service.
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] Members[0]:
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QB    ] server name: votequorum
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QB    ] server name: quorum
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [TOTEM ] Configuring link 0
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [TOTEM ] Configured link number 0: local addr: 172.20.0.140, port=5405
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] Sync members[1]: 1
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] Sync joined[1]: 1
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [TOTEM ] A new membership (1.5) was formed. Members joined: 1
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [QUORUM] Members[1]: 1
Jul 11 20:55:12 hypervisor-a-1 corosync[18819]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 11 20:55:12 hypervisor-a-1 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [status] notice: update cluster info (cluster name  HH-Cluster, version = 1)
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [status] notice: node has quorum
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [dcdb] notice: members: 1/18813
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [dcdb] notice: all data is up to date
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [status] notice: members: 1/18813
Jul 11 20:55:16 hypervisor-a-1 pmxcfs[18813]: [status] notice: all data is up to date


Then I joined node 2 to the cluster:

Jul 11 20:58:01 hypervisor-a-1 pvedaemon[2078]: <root@pam> successful auth for user 'root@pam'
Jul 11 20:58:01 hypervisor-a-1 pvedaemon[2078]: <root@pam> adding node hypervisor-b-1 to cluster
Jul 11 20:58:01 hypervisor-a-1 pmxcfs[18813]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 2)
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [CFG   ] Config reload requested by node 1
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [TOTEM ] Configuring link 0
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [TOTEM ] Configured link number 0: local addr: 172.20.0.140, port=5405
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 has no active links
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 has no active links
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [QUORUM] Members[1]: 1
Jul 11 20:58:02 hypervisor-a-1 pmxcfs[18813]: [status] notice: node lost quorum
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] host: host: 2 has no active links
Jul 11 20:58:02 hypervisor-a-1 corosync[18819]:   [KNET  ] pmtud: MTU manually set to: 0
Jul 11 20:58:02 hypervisor-a-1 pmxcfs[18813]: [status] notice: update cluster info (cluster name  HH-Cluster, version = 2)



After a minute or so:
Jul 11 20:59:09 hypervisor-a-1 pvescheduler[22335]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 11 20:59:09 hypervisor-a-1 pvescheduler[22334]: replication: cfs-lock 'file-replication_cfg' error: no quorum!



Here is the output from node2:

Code:
Joining node 2 to cluster:


Jul 11 20:58:01 hypervisor-b-1 pvedaemon[2006]: <root@pam> starting task UPID:hypervisor-b-1:000024E9:00010014:64ADB449:clusterjoin::root@pam:
Jul 11 20:58:02 hypervisor-b-1 systemd[1]: Stopping pve-cluster.service - The Proxmox VE cluster filesystem...
Jul 11 20:58:02 hypervisor-b-1 pmxcfs[1846]: [main] notice: teardown filesystem
Jul 11 20:58:03 hypervisor-b-1 systemd[1]: etc-pve.mount: Deactivated successfully.
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[1846]: [main] notice: exit proxmox configuration filesystem (0)
Jul 11 20:58:04 hypervisor-b-1 systemd[1]: pve-cluster.service: Deactivated successfully.
Jul 11 20:58:04 hypervisor-b-1 systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 11 20:58:04 hypervisor-b-1 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jul 11 20:58:04 hypervisor-b-1 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [MAIN  ] Corosync Cluster Engine  starting up
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [quorum] crit: quorum_initialize failed: 2
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [quorum] crit: can't initialize service
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [confdb] crit: cmap_initialize failed: 2
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [confdb] crit: can't initialize service
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [dcdb] crit: cpg_initialize failed: 2
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [dcdb] crit: can't initialize service
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [status] crit: cpg_initialize failed: 2
Jul 11 20:58:04 hypervisor-b-1 pmxcfs[9546]: [status] crit: can't initialize service
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [TOTEM ] Initializing transport (Kronosnet).
Jul 11 20:58:04 hypervisor-b-1 kernel: sctp: Hash tables configured (bind 1024/1024)
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [TOTEM ] totemknet initialized
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] pmtud: MTU manually set to: 0
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QB    ] server name: cmap
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QB    ] server name: cfg
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QB    ] server name: cpg
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [WD    ] Watchdog not enabled by configuration
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [WD    ] resource load_15min missing a recovery key.
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [WD    ] resource memory_used missing a recovery key.
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [WD    ] no resources configured.
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QUORUM] Using quorum provider corosync_votequorum
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QB    ] server name: votequorum
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QB    ] server name: quorum
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [TOTEM ] Configuring link 0
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [TOTEM ] Configured link number 0: local addr: 172.20.0.143, port=5405
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 has no active links
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 has no active links
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] host: host: 1 has no active links
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QUORUM] Sync members[1]: 2
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QUORUM] Sync joined[1]: 2
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [TOTEM ] A new membership (2.5) was formed. Members joined: 2
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [QUORUM] Members[1]: 2
Jul 11 20:58:04 hypervisor-b-1 corosync[9543]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 11 20:58:04 hypervisor-b-1 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Jul 11 20:58:05 hypervisor-b-1 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Jul 11 20:58:10 hypervisor-b-1 pmxcfs[9546]: [status] notice: update cluster info (cluster name  HH-Cluster, version = 2)
Jul 11 20:58:10 hypervisor-b-1 pmxcfs[9546]: [dcdb] notice: members: 2/9546
Jul 11 20:58:10 hypervisor-b-1 pmxcfs[9546]: [dcdb] notice: all data is up to date
Jul 11 20:58:10 hypervisor-b-1 pmxcfs[9546]: [status] notice: members: 2/9546
Jul 11 20:58:10 hypervisor-b-1 pmxcfs[9546]: [status] notice: all data is up to date




After a minute or so:

Jul 11 20:59:00 hypervisor-b-1 pvescheduler[10156]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 11 20:59:00 hypervisor-b-1 pvescheduler[10155]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 11 20:59:01 hypervisor-b-1 cron[1947]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Jul 11 21:00:00 hypervisor-b-1 pvescheduler[10716]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 11 21:00:00 hypervisor-b-1 pvescheduler[10715]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 11 21:00:01 hypervisor-b-1 cron[1947]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)

There is nothing obvious that jumps out at me, but then I don't really know what I'm looking for as I'm new to Proxmox and clustering. There are errors regarding quorum, but that's to be expected as there are only 2 nodes at this point.
 
From a GUI point of view the errors are...

Node 1 has '/etc/pve/nodes/hypervisor-b-1/pve-ssl.pem' does not exist! (500) on the screen.

Node 2 has permission denied - invalid PVE ticket (401) on the screen.

On the GUI of node 2 I simply cannot do anything once I have joined it to the cluster. It takes me to the login page and won't accept my credentials anymore. So my only access to node 2 is via the command line.

A potentially silly question, but is this expected behaviour? Do I need to add a 3rd node and then it will magically spring to life? Having never done any clustering I don’t really know what to expect. From what I've read though, it's fine to have 2 nodes, you just can't do the high availability (HA) stuff. I don’t intend to do HA stuff but have an RPi I can use as a 3rd node, just for fun really.
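In case it's relevant: from reading around, I believe a node that has lost quorum can temporarily be told to expect fewer votes so that pmxcfs becomes writable again; a sketch I haven't tried, and presumably only sensible while debugging:

Bash:
# On the stuck node: temporarily lower the number of expected votes to 1
pvecm expected 1

# Check quorum state afterwards
pvecm status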
 
Hi, @tipex
I ran into this problem yesterday. I SSHed to the master node and ran
Code:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "ip_connecting_node"
and this worked for me.
 
The logs clearly show that the two nodes are not able to talk to each other. Is there a firewall in place that might prevent them? Are they in the same local network?
 
@PeLbmaN I tried your suggestion but it did not work:

[Screenshot: output after running the suggested command]

@fabian Both machines are in the same VLAN so the firewall on the router should not be involved. Any firewalls on the machines themselves would simply be whatever Proxmox ships with, which from memory is nothing, i.e. the firewall is disabled.
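To double-check that from the shell, I think something like this on each node should show whether the PVE firewall is active and which netfilter rules are actually loaded (a sketch):

Bash:
# Is the Proxmox VE firewall service doing anything on this node?
pve-firewall status

# Dump whatever netfilter rules are currently loaded
iptables-save | head -n 40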

One thing for me to try would be to move the machines into the default LAN so they are no longer in a VLAN. Not sure why this would work, but it's something to try.

I'm low on energy with this subject though as I've spent so long messing and spent a lot of money building two new servers which have been sat doing nothing for months. The idea was to get this all working in a home lab to then give me confidence to use it in a work environment.
 
Yes I have physical access to node 2.

I can SSH into node 2 directly from my laptop.
I can also SSH into node 2 from the GUI of node 1.
Node 2 can ping 8.8.8.8 successfully.
I am confused.

Your IPs are 172.20.0.143 and 172.20.0.140. Can you cross-ping and cross-SSH between the two nodes on *those* IPs?
The answer needs to be yes or Proxmox won't work.

Forget 8.8.8.8 and DNS at this point and use only IPs.
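Concretely, something like this from each node:

Bash:
# From node 1 (172.20.0.140):
ping -c 3 172.20.0.143
ssh root@172.20.0.143 hostname

# From node 2 (172.20.0.143):
ping -c 3 172.20.0.140
ssh root@172.20.0.140 hostname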
 
The answer is yes.

From my laptop I can SSH into both machines.

From each machine I can then SSH to the other machine. Equally from each machine I can ping the other machine.

Machine 1 (172.20.0.140) SSH to machine 2 (172.20.0.143):
[Screenshot: successful SSH session]

Machine 2 (172.20.0.143) SSH to machine 1 (172.20.0.140):
[Screenshot: successful SSH session]

Machine 1 (172.20.0.140) ping machine 2 (172.20.0.143):
[Screenshot: successful ping]

Machine 2 (172.20.0.143) ping machine 1 (172.20.0.140):
[Screenshot: successful ping]
 
Note that the above tests were done by me SSHing from my laptop into the two Proxmox machines.

I did not use the Proxmox web interface for any of it, because once I attempt the cluster joining process the web interface for machine 2 becomes unresponsive, so my only way to interact with machine 2 is via SSH.
 
Thanks for the info. I had to check.
A couple of things to test (I did the same when I had weird issues... in that case it was a bug in the network driver...):

1. Disable the firewall at the node level and the datacenter level (IIRC if you set it at datacenter level it will override each node, but as the cluster is not working...)

2. Set the default input and output policy to accept at those levels

3. Could you post your corosync config file? (and check that it is accessible on all nodes at /etc/pve; this is where pmxcfs stores its files, I believe)

4. Could you post the output of pvecm status

And to answer your question, two machines should be enough to form a quorum, IIRC from when I set mine up 6 weeks ago.

And are you sure your network physical layer is not doing anything to affect these ports: https://pve.proxmox.com/wiki/Ports#Proxmox_VE_6.x_and_later_port_list ? (A quick way to check is sketched below.)

(i.e. put your two nodes on an unmanaged switch, no VLAN, if you are not sure...)
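A quick way to check from the shell that corosync is bound where you expect (a sketch; kronosnet uses UDP 5405 by default):

Bash:
# Corosync should be listening on the cluster IP, UDP 5405
ss -ulpn | grep corosync

# Each node should resolve its own name to the cluster IP, not 127.0.0.1
getent hosts hypervisor-a-1 hypervisor-b-1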
 
Appreciate your help @scyto

1. The firewall at the datacentre level is showing as "No", i.e. it's off:
[Screenshot: datacentre firewall options]

The firewalls at the node level are on, but as you say these get overridden by the datacentre setting and so should effectively be off. Either way it won't let me change them because of no quorum.

2. It won't let me change the datacentre Input Policy because of no quorum:
[Screenshot: error changing the Input Policy due to no quorum]

3. Here are both corosync config files side by side, taken from /etc/pve/corosync.conf:
[Screenshot: corosync.conf from both nodes side by side]


4. pvecm status for both machines:
[Screenshot: pvecm status output from both machines]



I'm not aware of my switch or router doing anything weird, but I do think the next logical thing to try is moving the machines to a simpler network, i.e. take them out of the VLAN and drop them into the default LAN. I could even take it a step further and connect them to a totally different, very basic network. This requires some effort though, as I will need to dig out an old router and switch etc.
 
Your ring ID should be identical in both. I have no idea what determines the ring ID.
Maybe try a qdevice to get two votes on one cluster node?
 
The root cause is still that kronosnet doesn't mark the links as up, which means there is some sort of network issue. You could try turning on debug logging, or dumping packets with tcpdump to see where things go south. Further changing the setup on the corosync side is not a good idea and won't help at this point. Once the two nodes see each other, adding a qdevice or more nodes is a good idea.
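For example, something like this on both nodes while they are in this state (a sketch; replace eno1 with whatever interface carries the 172.20.0.x addresses):

Bash:
# Watch corosync/kronosnet traffic on the cluster interface (UDP 5405 by default)
tcpdump -ni eno1 udp port 5405

If you only ever see packets in one direction between .140 and .143, the problem is on the network or driver side rather than in Proxmox.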
 
See what corosync-cfgtool -s and corosync-cfgtool -n show

Mine for reference:

Code:
root@pve2:~# corosync-cfgtool -s
Local node ID 2, transport knet
LINK ID 0 udp
        addr    = 192.168.1.82
        status:
                nodeid:  1:     connected
                nodeid:  2:     localhost
                nodeid:  3:     connected
root@pve2:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.1.82->192.168.1.81) enabled connected mtu: 1397
nodeid: 3 reachable
   LINK: 0 udp (192.168.1.82->192.168.1.83) enabled connected mtu: 1397
 
corosync-cfgtool -s:
[Screenshot: corosync-cfgtool -s output]

corosync-cfgtool -n:
[Screenshot: corosync-cfgtool -n output]



Depending on how burnt out I feel after today's work, I will try and get the machines hooked up to a new network tonight, as it's pointing at a network issue, isn't it.
 
as it's pointing at a network issue, isn't it.
Yes. Good luck, it’s still weird though. In my recent escapades I eventually ran tcpdump on both nodes and realised certain TX packets were getting dropped by a driver before they hit the wire… it showed up as errors in the tcpdump stats… I had to chase down the driver owner to fix it.
 
Oh man that sounds horrible. I hope mine is not that. I'm not sure my networking skills are good enough to find such an issue o_O
 
