Hello Forum,
I've been trying to get my nodes clustered via their direct links for two days now, without success, so I think it's time to ask for help.
Each node has two 10G interfaces that connect to the company network, and two 100G interfaces that connect directly to their respective neighbours via DAC.
The 100G network is configured with frr as described here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(with_Fallback)
Edit: I gave up on frr and used https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(Simple) instead.
Cluster and Ceph are up and running now.
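For reference, the Simple routed setup boils down to an /etc/network/interfaces fragment along these lines. This is only a sketch for lfflngpmx001sv, using the interface names and the .241/.242/.243 addresses that appear in the outputs below; the address and the two /32 routes have to be adapted per node:

```
auto ens1f0np0
iface ens1f0np0 inet static
        address 172.16.254.241/29
        up ip route add 172.16.254.243/32 dev ens1f0np0
        down ip route del 172.16.254.243/32

auto ens1f1np1
iface ens1f1np1 inet static
        address 172.16.254.241/29
        up ip route add 172.16.254.242/32 dev ens1f1np1
        down ip route del 172.16.254.242/32
```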
I was able to create a cluster with the 10G NICs, but whatever I try with the 100G interfaces fails.
vtysh looks good (on all 3 nodes):
Bash:
lfflngpmx001sv# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

f   172.16.254.240/29 [115/20] via 172.16.254.242, ens1f1np1 onlink, weight 1, 00:08:04
                               via 172.16.254.243, ens1f0np0 onlink, weight 1, 00:08:04
C>* 172.16.254.240/29 is directly connected, lo, 00:08:42
C>* 192.168.29.0/24 is directly connected, vmbr0, 00:08:42
Ping comes back (from all neighbours on all nodes):
Bash:
root@lfflngpmx002sv:~# ping -c 4 lfflngpmx001sv
PING lfflngpmx001sv (172.16.254.241) 56(84) bytes of data.
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=2 ttl=64 time=0.019 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=3 ttl=64 time=0.018 ms
64 bytes from lfflngpmx001sv (172.16.254.241): icmp_seq=4 ttl=64 time=0.011 ms
--- lfflngpmx001sv ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3100ms
rtt min/avg/max/mdev = 0.011/0.017/0.020/0.003 ms
root@lfflngpmx002sv:~#
SSH works (to all neighbours from all nodes):
Bash:
root@lfflngpmx002sv:~# ssh lfflngpmx001sv
Linux lfflngpmx002sv 6.8.12-4-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-4 (2024-11-06T15:04Z) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Feb 28 12:28:40 2025
root@lfflngpmx002sv:~# exit
logout
Connection to lfflngpmx001sv closed.
root@lfflngpmx002sv:~#
Join attempt via API:
Bash:
root@lfflngpmx002sv:~# pvecm add lfflngpmx001sv --link0 172.16.254.242 --fingerprint 35:2A:64:64:8F:6B:13:91:57:9E:83:3D:99:96:CA:C6:8E:C3:A0:66:93:9D:9C:A1:4E:20:31:91:FE:52:94:DE --use_ssh 0
Please enter superuser (root) password for 'lfflngpmx001sv': *****************************************
Establishing API connection with host 'lfflngpmx001sv'
Login succeeded.
check cluster join API version
Request addition of this node
An error occurred on the cluster node: can't lock file '/var/lock/pvecm.lock' - got timeout
Cluster join aborted!
root@lfflngpmx002sv:~#

I deleted /var/lock/pvecm.lock and tried again; the file got recreated - same error.
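For what it's worth, the "got timeout" message just means the target node could not take an exclusive lock on /var/lock/pvecm.lock within the wait period. The same symptom can be reproduced locally with flock(1); this is only a sketch (it assumes util-linux flock is available, and /tmp/pvecm-demo.lock is a stand-in path, not the real lock file):

```shell
#!/bin/sh
# Stand-in lock file; the real one is /var/lock/pvecm.lock on the cluster node.
LOCK=/tmp/pvecm-demo.lock

# Hold an exclusive flock on fd 9 for 3 seconds in the background,
# simulating another task that is still holding the lock on the target node.
( flock 9; sleep 3 ) 9>"$LOCK" &
sleep 1

# A second locker that only waits 1 second fails, mirroring
# "can't lock file '/var/lock/pvecm.lock' - got timeout".
if flock -w 1 "$LOCK" true; then
    echo "lock acquired"
else
    echo "got timeout"
fi
wait
```

The point being: deleting the lock file doesn't help if whatever process took the lock is still running, so it may be worth checking on lfflngpmx001sv for a stuck pvecm/pvedaemon task.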
Join attempt via SSH:
Bash:
root@lfflngpmx002sv:~# pvecm add lfflngpmx001sv --link0 172.16.254.242 --fingerprint 35:2A:64:64:8F:6B:13:91:57:9E:83:3D:99:96:CA:C6:8E:C3:A0:66:93:9D:9C:A1:4E:20:31:91:FE:52:94:DE --use_ssh 1
invalid corosync.conf
error0: no nodes found
unable to add node: command failed (ssh lfflngpmx001sv -o BatchMode=yes pvecm addnode lfflngpmx002sv --force 1 --link0 172.16.254.242)
root@lfflngpmx002sv:~#
I can't spot what's wrong with my corosync.conf:
Bash:
root@lfflngpmx001sv:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: lfflngpmx001sv
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.254.241
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: RS500-CL1
  config_version: 1
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
root@lfflngpmx001sv:~#
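For comparison, once the joins succeed I would expect the nodelist section to end up with all three nodes on their 100G addresses, roughly like this. This is only a sketch: lfflngpmx002sv/.242 comes from the pvecm add command above, while the third node's name and .243 address are assumptions based on the naming scheme:

```
nodelist {
  node {
    name: lfflngpmx001sv
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.254.241
  }
  node {
    name: lfflngpmx002sv
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.254.242
  }
  node {
    name: lfflngpmx003sv
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.254.243
  }
}
```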