10GbE cluster without a switch

I have bought 3 identical servers; each has a dual-port 10GbE network adapter.
I want the cluster to communicate over the 10GbE adapters, but each individual server should reach the router via a 1GbE port.

If one of the nodes fails, I want its VMs to fail over to the next server.

So far, I have connected all 3 nodes via RJ45 to the router (through a 1GbE switch).

The 3 servers are connected to each other via 10GbE (SFP+ adapters).

So:
server1 (port 2) -> server2 (port 1)
server2 (port 2) -> server3 (port 1)
server3 (port 2) -> server1 (port 1)

Is this correct, and how do I need to set this up?
I searched here (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server) but I don't know exactly where and how to do what on which node.
So far I have installed Proxmox on the 3 servers; they are reachable via the 1GbE switch on IPs 192.168.3.21 through 192.168.3.23.

I want to use Ceph as shared storage, but I don't know if that changes anything for this.
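
From the wiki, the "Routed Setup (Simple)" variant seems to boil down to something like this per node, if I read it correctly (the interface names and the 10.15.15.x addresses are just wiki-style placeholders, not my actual config):

Code:
# /etc/network/interfaces snippet on server1 (placeholder names/addresses)
auto ens1f0
iface ens1f0 inet static
        address 10.15.15.50/24
        # point-to-point link to server2, reached via a host route
        up ip route add 10.15.15.51/32 dev ens1f0
        down ip route del 10.15.15.51/32

auto ens1f1
iface ens1f1 inet static
        address 10.15.15.50/24
        # point-to-point link to server3
        up ip route add 10.15.15.52/32 dev ens1f1
        down ip route del 10.15.15.52/32

Each node would then get its own address (.50/.51/.52) plus the two host routes pointing at its direct neighbours.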
 
In order to fail over you need external storage or replication (with ZFS drives). I had a 3-node cluster with replication on ZFS and it worked great.

The cluster can really run on the 1 Gbps interfaces; there is not much traffic there, depending on how much traffic your VMs generate internally and over the internet.

The 10 Gbps interfaces I would use for syncing (replicating) the storage across the nodes.

Two separate networks: a) the PVE cluster network that carries cluster messages and may also carry your virtual machines' "public" traffic (the vmbrX interface that you give to the virtual machines), and b) the replication network that synchronizes your storage.

If one node goes offline, the VMs will be started on another node and keep working. It works very well; I tested and implemented it several years ago but have not touched it since, so perhaps there are also some new features.
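
As a rough sketch of the commands involved (the node names, 1 GbE addresses and VM ID 100 below are placeholders):

Code:
# on the first node: create the cluster, bound to the 1 GbE network
pvecm create mycluster --link0 192.168.3.21

# on each additional node: join, again over the 1 GbE addresses
pvecm add 192.168.3.21 --link0 192.168.3.22

# replicate VM 100's ZFS disks to another node every 15 minutes
pvesr create-local-job 100-0 server2 --schedule "*/15"

# let the HA stack restart VM 100 on a surviving node if its host dies
ha-manager add vm:100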

Thx
 
Okay, sounds good to me, but how do I set it up? Do I need to create the cluster on the 1GbE port?
And I read that Ceph is preferred over ZFS because the storage is shared?

Proxmox is installed on a ZFS RAID1 of 2x 240GB SSDs (per server).
I will install an additional 4x 1.92TB SSDs (per server) for the storage (virtual machines), I think in RAID5?
 
Sorry, yes, you can also use Ceph. Ceph is harder to configure, whereas replication is easy. I am not sure what the minimum number of storage drives (OSDs) is for Ceph, but I had replication working on just 4 drives with 2 nodes + a 3rd node to keep quorum; it is easy to set up.

I would move your cluster to the 1 Gbps interfaces for simplicity. You could also VLAN it on the 10 Gbps interfaces, but the overall consensus is that you should not mix the cluster subnet with the storage subnet.

If you cannot move it, and you have just installed PVE and it is not in production, you can simply reinstall and use the 1 Gbps interfaces for the PVE cluster.
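
If you want to double-check which addresses the existing cluster is currently using before deciding, something like this shows it (standard Proxmox/corosync tools, nothing specific to your nodes):

Code:
# link status and the local addresses corosync is bound to
corosync-cfgtool -s

# the per-node ring0_addr entries of the cluster
cat /etc/pve/corosync.conf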
 
Okay, I can still use the 1 Gbps interface; so far everything is configured on the 1 Gbps port. I connected the 3 servers to each other via the 10GbE adapters but didn't do anything with that yet.

If I understand you correctly, I need to create the cluster on the default 1 Gbps interface and run Ceph on the 10GbE network?
Can you give me some information on how to do that?
 
As for Ceph, it is more challenging to install. You should at least go over this wiki document:

https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster

Because you are installing Ceph under Proxmox, you can do that from the web admin portal and add the storage during the deployment. You need to enable the Ceph repositories and install it via the Ceph section in Proxmox, and during this process you can deploy it as well.
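
For what it's worth, the CLI equivalent of that GUI flow is roughly the following (the 10GbE subnet and the disk names are placeholders; adjust them to your setup):

Code:
# on every node: install the Ceph packages
pveceph install

# once, on the first node: point Ceph at the 10GbE network
pveceph init --network 10.15.15.0/24

# on every node: a monitor and a manager
pveceph mon create
pveceph mgr create

# on every node: one OSD per data SSD
pveceph osd create /dev/sdb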
 
Thank you, I followed some videos but for some reason it doesn't install (on all 3 servers).
Here is where it gets stuck:
 

Attachments

  • Schermafbeelding 2024-10-30 om 16.14.26.png (screenshot of where the installation stops)
My bad, I had a wrong router configuration that prevented the servers from connecting to the internet. I can now install Ceph, but first I need to figure out how to make the 10GbE work.
 
I have followed this tutorial and am now able to get 10GbE connections between all 3 servers (with fallback to 1GbE if the 10GbE links are down).
This works so far (the tests behave like at the end of the video).

https://www.apalrd.net/posts/2023/cluster_routes/


Now I want to configure Ceph, but I can only select my vmbr0 as Public Network IP/CIDR and Cluster Network IP/CIDR.
I think I need to configure it via ensf0 and ensf1?

How can I achieve this? Do I need to make a bond? And if so, how do I need to set it up?
 
We have it set up with bridges, each on its own respective VLAN, for every network involved in an HA setup. We have at minimum the following bridges for:

Management
Ceph Private
Ceph Public
Corosync
Migration (for VM migration)

Each of the above gets a separate private CIDR/subnet (e.g. 10.0.10.0/24 for management, 10.0.11.0/24 for Ceph private, etc.), or whatever IP scheme works for you.

We bond the relevant physical adapters and then attach bridges to those bonds. We then use SDN to define bridges/VLANs for our guests.
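
As a sketch of what one such bonded, VLAN-backed bridge looks like in /etc/network/interfaces (interface names, VLAN IDs and addresses below are examples only, not our exact config):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

# e.g. the Ceph Public bridge on VLAN 12, subnet 10.0.12.0/24
auto vmbr12
iface vmbr12 inet static
        address 10.0.12.21/24
        bridge-ports bond0.12
        bridge-stp off
        bridge-fd 0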
 
How can I achieve this? Do I need to make a bond? And if so, how do I need to set it up?
The loopback (lo) address is what you would use for both Ceph and corosync; there is no bonding or any other option to configure.

A word of caution about this configuration: it is possible to create a denial-of-service condition under heavy load, since corosync and Ceph will be sharing the available bandwidth.
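
To illustrate, with the routed setup you simply hand Ceph the routed prefix instead of the vmbr0 subnet; a minimal sketch, assuming your prefix is fd69:beef:cafe::/64 (use whatever you actually configured):

Code:
# when initialising Ceph, give it the routed prefix rather than the vmbr0 subnet
pveceph init --network fd69:beef:cafe::/64

# or, if Ceph is already initialised, adjust /etc/pve/ceph.conf:
#   [global]
#   public_network  = fd69:beef:cafe::/64
#   cluster_network = fd69:beef:cafe::/64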
 
I think I got it working, but now I don't know if I did it correctly. I created a bond and used the IPv6 address I generated earlier. I created the Ceph storage and added all the SSDs from all servers. It seems to work: I can disconnect or turn off a server and after a few minutes the VM switches to another node.

Can I check if I set it up correctly?
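
For reference, these are the read-only commands I was planning to use to sanity-check it (as far as I can tell they don't change anything):

Code:
# overall Ceph health, OSD layout and pool replication settings
ceph -s
ceph osd tree
ceph osd pool ls detail

# confirm which networks Ceph and corosync are actually using
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
pvecm status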
 
Things changed because my switch didn't have enough ports.

I could buy 2 HPE 2920-48G switches, both with 2x J9731A dual 10GbE modules and a stacking module.

My thought is to use the switches instead of point-to-point.

Every server would connect with 1x 10GbE to each switch, which makes it redundant, and also connect all 4x 1GbE ports to both switches (2x to switch 1, 2x to switch 2).

Are my thoughts correct? And how do I need to configure it for the best speed and most reliability?

Both switches have stacking modules as well; I just ordered the stacking cable, which will arrive in 1 week.
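
My rough idea for the 10GbE side is a bond across both switches, something like this (placeholder interface names; 802.3ad assumes the stack presents a single LACP trunk, otherwise active-backup would be the safe choice):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens1f1   # one 10GbE port to each switch
        bond-mode 802.3ad           # or active-backup if LACP across the stack is not possible
        bond-xmit-hash-policy layer3+4
        bond-miimon 100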
 
I rebuilt my server rack and needed to disconnect my LAN cables (1GbE and 10GbE). I restarted my cluster to test a few things without the cables attached, or with only 1 cable (1GbE), and have now reconnected the 4 LAN cables (1GbE) and the 2x 10GbE cables in the point-to-point configuration.

When testing with iperf3, I only get 1 Gbit/s:

Code:
root@proxmox-2:~# iperf3 -c fd69:beef:cafe::521
Connecting to host fd69:beef:cafe::521, port 5201
[  5] local fd69:beef:cafe::522 port 37678 connected to fd69:beef:cafe::521 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   946 Mbits/sec    0    441 KBytes       
[  5]   1.00-2.00   sec   110 MBytes   925 Mbits/sec    0    441 KBytes       
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec    0    488 KBytes       
[  5]   3.00-4.00   sec   110 MBytes   922 Mbits/sec    0    488 KBytes       
[  5]   4.00-5.00   sec   111 MBytes   932 Mbits/sec    0    488 KBytes       
[  5]   5.00-6.00   sec   111 MBytes   927 Mbits/sec    0    488 KBytes       
[  5]   6.00-7.00   sec   110 MBytes   924 Mbits/sec    0    488 KBytes       
[  5]   7.00-8.00   sec   111 MBytes   931 Mbits/sec    0    512 KBytes       
[  5]   8.00-9.00   sec   111 MBytes   930 Mbits/sec    0    536 KBytes       
[  5]   9.00-10.00  sec   110 MBytes   925 Mbits/sec    0    536 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.08 GBytes   930 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec                  receiver


iperf Done.

I have restarted the cluster multiple times, but it stays at 1 Gbit/s instead of 10 (which worked before).
What can cause this issue?
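
In case it helps with troubleshooting, this is what I plan to check next (assuming the routed FRR/OpenFabric setup from the tutorial; the interface name is a placeholder):

Code:
# does traffic to the other node still leave via the 10GbE link?
ip -6 route get fd69:beef:cafe::521

# are the 10GbE ports up and negotiated at 10000Mb/s?
ip -br link
ethtool ens1f0 | grep -i speed

# if FRR/OpenFabric handles the mesh routing, check that the adjacencies are up
vtysh -c "show openfabric topology"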
 
