3 node mesh Proxmox and Ceph

rdfeij

We are working on a small Proxmox environment with 3 nodes and a mesh network.
All servers have:
2x 10GB copper in mesh for the Ceph cluster network
2x 1GB copper in mesh (corosync)
2x 1GB free for the existing network to the customer
2x 10GB SFP+ not used yet (might be used for VM traffic and the Ceph public subnet, but we will test with the 2 free 1GB interfaces first)

I already followed https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_.28simple.29 (routed with fallback).
I can ping all nodes through the mesh, but Proxmox seems to ignore this mesh completely since it does not have any 'interfaces'. So building a cluster fails, since Proxmox and corosync are not aware of the mesh.
Of course we can fall back to the RSTP way, but then one node has double the traffic passing through it to the 3rd node.

Yes, we are aware that running Proxmox and Ceph on the same servers is not ideal, but it is only for 2 Windows environments with minimal throughput.
All 3 servers have 2x 500GB SSD in ZFS RAID1, 3x 1TB SSD for Ceph OSDs, and plenty of CPU and memory.

Is there anyone who can help with the correct methods for configuring the meshes?
 
I got my 3-node Ceph cluster running on a full mesh broadcast topology (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup) using 169.254.x.x/24 (IPv4 link-local addresses).

Obviously it does not use any switches, and it can't be expanded without adding one.

I run Ceph public, Ceph private, and Corosync networking all on this full mesh broadcast topology without issue.

I used the topology in https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Example

The nodes have 4 NIC ports each.
 
You need to use the CLI both to create the cluster and to initialize the Ceph configuration:

Create cluster:
Code:
pvecm create <ClusterName> --link0 <link0.NIC.IP>,priority=10 --link1 <link1.NIC.IP>,priority=20

Add node to cluster:
Code:
pvecm add <hostname.of.already.cluster_member> --link0 <link0.NIC.IP>,priority=10 --link1 <link1.NIC.IP>,priority=20

link0.NIC.IP should be the IP of each node in your 1GB copper mesh (corosync).
link1.NIC.IP should be the IP of each node in your 10GB copper mesh for the Ceph cluster network.
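
As a concrete illustration, here is a small sketch with the placeholders filled in. The addresses (10.14.14.x and 10.15.15.x) and the cluster name DsmCluster are taken from later posts in this thread; which subnet is the 1GB mesh and which is the 10GB mesh is an assumption. build_pvecm_cmd is a hypothetical helper that only prints the command, so it can be reviewed before it is run by hand.

```shell
#!/bin/sh
# Hypothetical helper: prints a pvecm command instead of running it.
# $1 = "create" or "add"
# $2 = cluster name (for create) or address of an existing member (for add)
# $3 = this node's link0 IP, $4 = this node's link1 IP
build_pvecm_cmd() {
    printf 'pvecm %s %s --link0 %s,priority=10 --link1 %s,priority=20\n' \
        "$1" "$2" "$3" "$4"
}

# On the first node (assumed addresses):
build_pvecm_cmd create DsmCluster 10.14.14.1 10.15.15.1
# On the second node, pointing at the first node:
build_pvecm_cmd add 10.14.14.1 10.14.14.2 10.15.15.2
```

Each node passes its own link addresses; only the second argument of the add command refers to a node that is already in the cluster.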

Init Ceph (this is explained in the docs):
Code:
pveceph init --network a.b.c.d/m --cluster-network e.f.g.h/m
pveceph mon create
 
Hi @VictorSTS. We have FRR working over both the gigabit and the 10-gigabit mesh.

We have created the cluster, but adding a node results in a hostname verification error. This is new to me; the other cluster I set up a long time ago connected over SSH.

The error we get:

root@ProxMoxHost2:~# pvecm add ProxMoxHost1.local --link0 10.14.14.2,priority=10 --link1 10.15.15.2,priority=20
Please enter superuser (root) password for 'ProxMoxHost1.local': **************
Establishing API connection with host 'ProxMoxHost1.local'
The authenticity of host 'ProxMoxHost1.local' can't be established.
X509 SHA256 key fingerprint is 9B:F0:29:5D:78:5F:D5:44:69:F3:80:40:BA:62:06:20:79:56:F7:D0:70:FB:DE:30:6B:A2:C3:09:09:99:C5:97.
Are you sure you want to continue connecting (yes/no)? yes
500 Can't connect to ProxMoxHost1.local:8006 (hostname verification failed)

Curl also rejects the connection (the same happens with the hostname instead of the IP address):

root@ProxMoxHost2:~# curl https://10.14.14.1:8006 -vi
* Trying 10.14.14.1:8006...
* Connected to 10.14.14.1 (10.14.14.1) port 8006 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: unable to get local issuer certificate
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.



We tried a lot to get the certificate trusted but are stuck. Can you or anyone else help us out?
 
Were any of these nodes used in a previous cluster? Are you using the same hostnames or IPs? There are issues when reusing the same IP or hostnames in the same cluster.

I'm sorry, but in my instructions I told you to use the hostname (pvecm add <hostname.of.already.cluster_member>) when I really meant the IP of a cluster member. Can you try with that?

If it still doesn't work, post /etc/hosts and /etc/hostname of these two machines.

Curl fails because the certs created when installing Proxmox are self-signed, and curl does not trust them by default. curl https://10.14.14.1:8006 -Ik should work.
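
To see which names a node's certificate actually covers, something like the following might help. It is only a sketch: cert_matches_host is a hypothetical helper, and /etc/pve/local/pve-ssl.pem is assumed to be the certificate served on port 8006 (the default on a standard install without a custom certificate).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the PEM certificate in $1 is valid for host name $2.
# `openssl x509 -checkhost` prints either
#   "Hostname <name> does match certificate" or
#   "Hostname <name> does NOT match certificate".
cert_matches_host() {
    openssl x509 -in "$1" -noout -checkhost "$2" 2>/dev/null \
        | grep -q 'does match certificate'
}

# Example usage on a node (assumed path and host name):
# cert_matches_host /etc/pve/local/pve-ssl.pem ProxMoxHost1.local \
#     && echo "certificate covers this name"
```

If the name used with pvecm add is not covered by the certificate of the node being contacted, that would be one possible explanation for a verification failure, though not necessarily the one at play here.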
 
Hi,
Yesterday, after a lot of testing with FRR, I reinstalled all 3 nodes from scratch and set up the FRR rings again.
Then I got stuck on this problem.

/etc/hosts (with pvelocalhost on the line of the local host on each node, of course):

Code:
127.0.0.1 localhost.localdomain localhost
10.14.14.1 ProxMoxHost1.local ProxMoxHost1
10.14.14.2 ProxMoxHost2.local ProxMoxHost2 pvelocalhost
10.14.14.3 ProxMoxHost3.local ProxMoxHost3

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

The /etc/hostname values match those in /etc/hosts.

Connecting with the hostname or with the IP address makes no difference; the error is the same.
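
For anyone wanting to double-check the name mapping on each node, a minimal sketch follows. hosts_lookup is a hypothetical helper that reads a hosts-format file directly; on a live system, getent hosts <name> answers the same question through the real resolver.

```shell
#!/bin/sh
# Hypothetical helper: print the first address that the hosts-format file $1
# assigns to the name $2 (compared case-insensitively).
hosts_lookup() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) { print $1; exit }
        }
    ' "$1"
}

# Example usage on a node:
# hosts_lookup /etc/hosts "$(hostname -f)"
```

On each node, the node's own name should come back as its mesh address rather than a loopback address.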
 
I am starting to suspect it has something to do with my first node (already in the cluster), which has no quorum yet. Could it be that nothing in the /etc/pve folder can be changed because pmxcfs keeps the files locked?
 
If /etc/pve is locked, then you have no quorum. In a cluster that consists of just one host, that shouldn't happen.

What's the output of pvecm status?
 
Since I've been testing a lot, quorum was lost. I restored it by forcing the votes of node 1 to 7, but that does not help with the hostname verification error either. I'm getting more and more puzzled. I have installed more systems (but without FRR) and never had this problem.

Code:
root@node1:~# pvecm status
Cluster information
-------------------
Name:             DsmCluster
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov 15 12:06:35 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.28
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          7 10.14.14.1 (local)
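
The Quorum value in this output follows from the expected votes: votequorum requires a strict majority, that is floor(expected / 2) + 1. A tiny sketch of that arithmetic (quorum_for is a hypothetical helper name, not part of any Proxmox tool):

```shell
#!/bin/sh
# Hypothetical helper: votes needed for quorum, given the expected votes in $1.
# Shell integer division rounds down, which gives floor(expected / 2).
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 7   # expected votes from the output above; prints 4
quorum_for 3   # three nodes with one vote each; prints 2
```

With 7 expected votes this gives 4, matching the Quorum line above; with three nodes at one vote each, any two nodes are enough.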
 
I've done a few mesh clusters and hundreds of non-mesh clusters and never ran into that "hostname verification failed" problem myself. Somehow you are doing something that may cause the hosts issue. Probably unrelated, but with one host in the cluster you shouldn't need to change any votes to have quorum. If you do, something else is wrong.
 
I know. Like I said, I tried almost everything to get it to work, so somewhere along the way I messed up quorum (by trying to add the 2nd node almost manually).
But that doesn't take away the main problem...
This is my first native version 8 installation (not upgraded to it), so maybe there is an issue there, but I'm not sure.

Luckily I'm attending the official Proxmox training next week, so I have my questions ready ;)
 