Can't add additional node

aaproxy

New Member
Jun 13, 2022
I am completely puzzled. I started with two nodes (prox1 and prox2) with no issues; they work great. I found some spare hardware and created a temporary third node (prox3), again with no issues. I then grabbed a small mini PC for a fourth node, just to run Mikrotik The Dude separate from the other servers. Whenever I added that fourth node (prox4), it hung the web interfaces on nodes 1, 2, and 3. So I manually deleted node 4, and life was fine.

I now have a proper server of equal power to prox1 and prox2. This node will be called prox5.

Every time I go to add prox5 (via web and command line) I get the following errors:
Web:
Establishing API connection with host '10.35.35.71'
TASK ERROR: 500 Can't connect to 10.35.35.71:8006
Command line:
root@prox5:~# pvecm add 10.35.35.71 --use_ssh
unable to copy ssh ID: exit code 1
root@prox5:~# pvecm add 10.35.35.71
Please enter superuser (root) password for '10.35.35.71': *********
Establishing API connection with host '10.35.35.71'
500 Can't connect to 10.35.35.71:8006
root@prox5:~# telnet 10.35.35.71 8006
Trying 10.35.35.71...
Connected to 10.35.35.71.
Escape character is '^]'.
^]
telnet> Connection closed.
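Before attempting the join, a quick sanity check is to hit the target's unauthenticated API version endpoint from the joining node. A minimal sketch (the small helper just builds the URL consistently; 8006 is the standard pveproxy port):

```shell
# Build the PVE API version URL for a target node; the /version
# endpoint answers without authentication.
api_url() {
  echo "https://$1:8006/api2/json/version"
}

# --insecure is needed because the joining node does not yet trust
# the cluster CA:
# curl --insecure "$(api_url 10.35.35.71)"
```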

I have updated all servers to ensure they are all on the same versions. I have restarted all services (I have not rebooted any of the nodes). I am using 10 Gb/s links to connect all nodes (with MTU = 9000).

I can ssh to all servers. I can telnet to 8006.

Any suggestions on how to add this fourth node (prox5)? I am completely stumped.

Thank you!


root@prox1:~# pveversion -v

proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-helper: 7.2-4
pve-kernel-5.15: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

root@prox1:~# cat /etc/pve/corosync.conf

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.35.35.71
  }
  node {
    name: prox2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.35.35.72
  }
  node {
    name: prox3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.35.35.61
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Thinkers
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}



root@prox1:~# systemctl status pve-cluster corosync.service

pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-06-13 15:59:15 EDT; 1h 26min ago
Process: 3203187 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 3203188 (pmxcfs)
Tasks: 6 (limit: 77159)
Memory: 45.4M
CPU: 7.097s
CGroup: /system.slice/pve-cluster.service
        └─3203188 /usr/bin/pmxcfs

Jun 13 15:59:20 prox1 pmxcfs[3203188]: [dcdb] notice: waiting for updates from leader
Jun 13 15:59:20 prox1 pmxcfs[3203188]: [status] notice: received all states
Jun 13 15:59:20 prox1 pmxcfs[3203188]: [status] notice: all data is up to date
Jun 13 15:59:20 prox1 pmxcfs[3203188]: [dcdb] notice: update complete - trying to commit (got 3 inode updates)
Jun 13 15:59:20 prox1 pmxcfs[3203188]: [dcdb] notice: all data is up to date
Jun 13 16:01:31 prox1 pmxcfs[3203188]: [status] notice: received log
Jun 13 16:24:17 prox1 pmxcfs[3203188]: [status] notice: received log
Jun 13 16:39:17 prox1 pmxcfs[3203188]: [status] notice: received log
Jun 13 16:59:14 prox1 pmxcfs[3203188]: [dcdb] notice: data verification successful
Jun 13 17:11:50 prox1 pmxcfs[3203188]: [status] notice: received log

corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-06-13 15:59:16 EDT; 1h 26min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3203329 (corosync)
Tasks: 9 (limit: 77159)
Memory: 132.5M
CPU: 1min 6.668s

CGroup: /system.slice/corosync.service
        └─3203329 /usr/sbin/corosync -f

Jun 13 15:59:18 prox1 corosync[3203329]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 13 15:59:18 prox1 corosync[3203329]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 8885
Jun 13 15:59:18 prox1 corosync[3203329]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 8885
Jun 13 15:59:18 prox1 corosync[3203329]: [KNET ] pmtud: Global data MTU changed to: 8885
Jun 13 15:59:19 prox1 corosync[3203329]: [QUORUM] Sync members[3]: 1 2 3
Jun 13 15:59:19 prox1 corosync[3203329]: [QUORUM] Sync joined[2]: 2 3
Jun 13 15:59:19 prox1 corosync[3203329]: [TOTEM ] A new membership (1.10b) was formed. Members joined: 2 3
Jun 13 15:59:19 prox1 corosync[3203329]: [QUORUM] This node is within the primary component and will provide service.
Jun 13 15:59:19 prox1 corosync[3203329]: [QUORUM] Members[3]: 1 2 3
Jun 13 15:59:19 prox1 corosync[3203329]: [MAIN ] Completed service synchronization, ready to provide service.
 
Hi,

did you try adding the new node via a node other than prox1 in the cluster?

When you try to add the node, do you see any errors in the syslog of the node it's trying to connect to?

Can you also show the output of pvecm status?
 
Yes, I have tried adding prox5 via each of the other nodes (prox1, prox2, and prox3): same issue, same response on prox5, and nothing noted on prox1, prox2, or prox3.

I've been tailing /var/log/messages and nothing shows up, so I switched to journalctl -f.

prox1 shows nothing but this error, even though Ceph is not installed:
Jun 14 08:01:38 prox1 ceph-crash[3934]: ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist; please create
Jun 14 08:02:08 prox1 ceph-crash[3934]: ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist; please create
Jun 14 08:02:38 prox1 ceph-crash[3934]: ERROR:ceph-crash:directory /var/lib/ceph/crash/posted does not exist; please create

prox5 shows:
Jun 14 08:00:52 prox5 pvedaemon[2199701]: <root@pam> successful auth for user 'root@pam'
Jun 14 08:01:38 prox5 pvedaemon[2199701]: <root@pam> starting task UPID:prox5:002C08B7:0184D60A:62A878A2:clusterjoin::root@pam:
Jun 14 08:02:38 prox5 pvedaemon[2885815]: 500 Can't connect to 10.35.35.71:8006
Jun 14 08:02:38 prox5 pvedaemon[2199701]: <root@pam> end task UPID:prox5:002C08B7:0184D60A:62A878A2:clusterjoin::root@pam: 500 Can't connect to 10.35.35.71:8006



root@prox1:~# pvecm status

Cluster information
-------------------
Name: Thinkers
Config Version: 7
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Jun 14 07:44:05 2022
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.10b
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.35.35.71 (local)
0x00000002 1 10.35.35.72
0x00000003 1 10.35.35.61


root@prox5:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
root@prox5:~#
 
From prox5 using curl and wget, I can definitely connect to prox1:

root@prox5:~# curl https://10.35.35.75:8006/api2/json/
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

root@prox5:~# wget https://10.35.35.75:8006/api2/json/
--2022-06-14 08:52:13-- https://10.35.35.75:8006/api2/json/
Connecting to 10.35.35.75:8006... connected.
ERROR: The certificate of ‘10.35.35.75’ is not trusted.
ERROR: The certificate of ‘10.35.35.75’ doesn't have a known issuer.

root@prox5:~# curl --insecure -I https://10.35.35.75:8006/api2/json/
HTTP/1.1 501 method 'HEAD' not available
Cache-Control: max-age=0
Connection: close
Date: Tue, 14 Jun 2022 12:53:14 GMT
Pragma: no-cache
Server: pve-api-daemon/3.0
Expires: Tue, 14 Jun 2022 12:53:14 GMT
 
10.35.35.71 vs 10.35.35.75 ?
 
Good eye. Thanks for catching that.

prox1 can connect to prox2 and prox3, but not prox5.
prox2 can connect to prox1 and prox3, but not prox5.
prox3 can connect to prox1 and prox2, but not prox5.
prox5 cannot connect to prox1, prox2, or prox3, but can connect to itself.

Over the VPN I can connect to every node at https://<IP>:8006 without any issues, and I can ssh to all of them. I can ping each node from every other node and from my laptop. But when I retested port 8006 from prox5, I realized I hadn't tested properly before: prox5 cannot ssh or curl to any of the other Proxmox servers. So, a firewall issue.
I tried disabling the firewall on both prox5 and prox1 (systemctl stop pve-firewall.service). I still cannot ssh from prox5 to prox1, nor reach the API URL.
On prox1, under Datacenter/Firewall/Options, Firewall is (and was) set to "No".
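To separate a plain firewall block from anything Proxmox-specific, a raw TCP connect test from prox5 to each node is useful (note that in this thread the handshake itself succeeded, which already pointed away from a simple port block). A sketch using bash's /dev/tcp, with the node addresses taken from the corosync.conf above:

```shell
# Attempt a raw TCP connection; prints "open" if the three-way
# handshake completes within 3 seconds, "closed" otherwise.
check_port() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

# Sweep the existing cluster nodes on the API port:
for ip in 10.35.35.71 10.35.35.72 10.35.35.61; do
  check_port "$ip" 8006
done
```

If the ports report open but higher-level protocols (ssh, TLS) still stall, the problem usually sits above the TCP handshake, e.g. in packet size or filtering of established flows.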
 
Any device on any of my networks can connect to any node via :8006 or ssh. prox5 can pull data from other devices using curl, including on non-standard ports. With the firewall disabled on all servers, prox5 still cannot connect to any of the existing nodes.

Could this be an SSL issue (see curl command below)?

root@prox5:~# curl --insecure -I https://10.35.35.72:8006/
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to 10.35.35.72:8006
 
I did a tcpdump and I can see the requests from prox5 hitting prox1, but nothing is returned. So I inspected the SSL certificate via curl, both from my laptop to prox1 and from prox5 to prox1; the output is below. This might be an SSL certificate issue. If anyone has suggestions, either for repairing the SSL certs (which baffles me, since my laptop connects without issue) or for adding prox5 to the cluster while ignoring the certificates, I'm all ears.

From my laptop I tried looking at the SSL certificate using curl:

$ curl https://10.35.35.71:8006 -vI
* Trying 10.35.35.71:8006...
* Connected to 10.35.35.71 (10.35.35.71) port 8006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
* CApath: none
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* SSL certificate problem: unable to get local issuer certificate
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.


From prox5:
root@prox5:~# curl https://10.35.35.71:8006 -vI
* Trying 10.35.35.71:8006...
* Connected to 10.35.35.71 (10.35.35.71) port 8006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: Connection reset by peer in connection to 10.35.35.71:8006
* Closing connection 0
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to 10.35.35.71:8006
root@prox5:~#
 
Somehow I don't think that this is the problem, but you can try calling

Code:
pvecm updatecerts

If you add the --force parameter it will generate a new cert.


For the connection test, did you do a curl/wget against port 8006, or was it a ping?
 
@shrdlicka pings always work. The problem is making an ssh connection, or connecting via the API (8006), from prox5 to any of the nodes in the cluster. Either the cluster is blocking return packets to prox5, or prox5 is blocking the packets it receives; I just can't figure out how. prox5 is brand new, with no configuration except drives and IP addresses.
 
Well, I found the problem. What a dumb issue (I'm the issue): an MTU mismatch. The switch port was set to 1500 while the interface on prox5 was set to 9000.
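For anyone hitting the same symptoms: an MTU mismatch like this lets small packets (pings, TCP handshakes) through while the switch silently drops jumbo frames, which would explain why telnet could connect yet the larger ssh and TLS exchanges stalled. A sketch for verifying jumbo frames end-to-end (the interface name vmbr0 is an assumption):

```shell
# Largest ICMP payload that fits in a given MTU: subtract the
# 20-byte IPv4 header and the 8-byte ICMP header.
max_payload() {
  echo $(( $1 - 28 ))
}

# Show the configured MTU on the bridge:
# ip -o link show vmbr0

# Send a full-size, non-fragmentable ping (-M do sets DF) across the
# network; with MTU 9000 everywhere this must succeed at a payload
# of 8972 bytes, and fail if any hop is limited to 1500:
# ping -M do -s "$(max_payload 9000)" -c 3 10.35.35.71
```

Running the same ping with a 1472-byte payload (the limit for MTU 1500) from both sides quickly shows whether only jumbo frames are being dropped.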
 