New 3-node build, nodes won't join cluster via dedicated direct links

Feb 28, 2025
Hello Forum,

I've been trying to get my nodes clustered via their direct links for two days now - without success, and I think it's time to ask for help.

Each node has two 10G interfaces that connect to the company network, and two 100G interfaces that connect directly to its respective neighbours via DAC.
The 100G network is configured with frr as described here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(with_Fallback)

Edit: I gave up on frr and used https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(Simple) instead.
Cluster and Ceph are up and running now.
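For anyone landing here later, the simple routed setup comes down to a stanza like this in /etc/network/interfaces (sketch for node 2 using the addresses from this thread; which port faces which neighbour is a guess, so match it to your own cabling):
Code:
# 100G direct links on lfflngpmx002sv (simple routed setup, no frr)
# NOTE: interface-to-neighbour mapping is assumed - adjust to your cabling
auto ens1f0np0
iface ens1f0np0 inet static
        address 172.16.254.242/29
        up   ip route add 172.16.254.241/32 dev ens1f0np0
        down ip route del 172.16.254.241/32

auto ens1f1np1
iface ens1f1np1 inet static
        address 172.16.254.242/29
        up   ip route add 172.16.254.243/32 dev ens1f1np1
        down ip route del 172.16.254.243/32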


I was able to create a cluster with the 10G NICs, but whatever I try with the 100G interfaces fails.
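For context, the cluster on node 1 was created with the 100G address as link 0, roughly like this (reconstructed from the corosync.conf further down, not copied from my shell history):
Bash:
root@lfflngpmx001sv:~# pvecm create RS500-CL1 --link0 172.16.254.241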

vtysh looks good (on all 3 nodes):
Bash:
lfflngpmx001sv# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

f   172.16.254.240/29 [115/20] via 172.16.254.242, ens1f1np1 onlink, weight 1, 00:08:04
                               via 172.16.254.243, ens1f0np0 onlink, weight 1, 00:08:04
C>* 172.16.254.240/29 is directly connected, lo, 00:08:42
C>* 192.168.29.0/24 is directly connected, vmbr0, 00:08:42
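
To cross-check against the kernel table (vtysh only shows frr's view), I'd run something like this - output omitted here:
Bash:
root@lfflngpmx001sv:~# ip route
root@lfflngpmx001sv:~# ip route get 172.16.254.242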

Ping comes back (from all neighbours on all nodes):
Bash:
root@lfflngpmx002sv:~# ping -c 4  lfflngpmx001sv
PING lfflngpmx001sv (172.16.254.241) 56(84) bytes of data.
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=2 ttl=64 time=0.019 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=3 ttl=64 time=0.018 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=4 ttl=64 time=0.011 ms

--- lfflngpmx001sv ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3100ms
rtt min/avg/max/mdev = 0.011/0.017/0.020/0.003 ms
root@lfflngpmx002sv:~#

SSH works (to all neighbours from all nodes):
Bash:
root@lfflngpmx002sv:~# ssh lfflngpmx001sv
Linux lfflngpmx002sv 6.8.12-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Feb 28 12:28:40 2025
root@lfflngpmx002sv:~# exit
logout
Connection to lfflngpmx001sv closed.
root@lfflngpmx002sv:~#

Join attempt via API:
Bash:
root@lfflngpmx002sv:~# pvecm add lfflngpmx001sv --link0 172.16.254.242 --fingerprint 35:2A:64:64:8F:6B:13:91:57:9E:83:3D:99:96:CA:C6:8E:C3:A0:66:93:9D:9C:A1:4E:20:31:91:FE:52:94:DE --use_ssh 0
Please enter superuser (root) password for 'lfflngpmx001sv': *****************************************
Establishing API connection with host 'lfflngpmx001sv'
Login succeeded.
check cluster join API version
Request addition of this node
An error occurred on the cluster node: can't lock file '/var/lock/pvecm.lock' - got timeout
Cluster join aborted!
root@lfflngpmx002sv:~#
I deleted /var/lock/pvecm.lock and tried again; the file was recreated, but the error stayed the same.
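If service state or logs from node 1 would help, this is what I'd collect (sketch only, no output pasted yet):
Bash:
root@lfflngpmx001sv:~# systemctl status pve-cluster corosync pvedaemon
root@lfflngpmx001sv:~# journalctl -b -u pve-cluster -u corosync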

Join attempt via SSH:
Bash:
root@lfflngpmx002sv:~# pvecm add lfflngpmx001sv --link0 172.16.254.242 --fingerprint 35:2A:64:64:8F:6B:13:91:57:9E:83:3D:99:96:CA:C6:8E:C3:A0:66:93:9D:9C:A1:4E:20:31:91:FE:52:94:DE --use_ssh 1
invalid corosync.conf
error0: no nodes found
unable to add node: command failed (ssh lfflngpmx001sv -o BatchMode=yes pvecm addnode lfflngpmx002sv --force 1 --link0 172.16.254.242)
root@lfflngpmx002sv:~#

I can't spot what's wrong with my corosync.conf:
Bash:
root@lfflngpmx001sv:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: lfflngpmx001sv
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.254.241
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RS500-CL1
  config_version: 1
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@lfflngpmx001sv:~#
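For comparison, this is the nodelist entry I would expect pvecm addnode to append for node 2 if the join went through (my reading of the format above, with config_version bumped accordingly - not actual output):
Code:
  node {
    name: lfflngpmx002sv
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.254.242
  }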
 