Help wanted, please! I would like to create a Ceph cluster in full mesh but am struggling to find the correct setup

JimmyV

Member
Apr 14, 2022
Hi Everyone

I would like to create a Ceph cluster, something I have done before, but whatever I do in Proxmox 9 I seem to be hitting a roadblock.
The goal is a full-mesh Ceph cluster with 3 nodes, i.e. a cluster without a switch in between.

Hardware used for the three 2U servers (per node):

Supermicro X12
Xeon E-2388G
32 GB of RAM
Additionally, we have Mellanox fiber cards (one card with two 25 Gbps interfaces per node, for the cluster network)
256 GB NVMe SSD (for the Proxmox OS)
2 x 240 GB enterprise SSD 2.5" (for the Ceph pool to be created, called VM-Storage)
3 x 3.84 TB enterprise SSD 2.5" (for the Ceph pool to be created, called Live)
4 x 20 TB enterprise HDD 3.5" (for the Ceph pool to be created, called Archive)

It will be used to run two Windows VMs.
The first will be a management VMS system; only the VM-Storage pool will be attached to it, providing 120 GB of storage.
The second will be a recording VMS system, which also gets 120 GB from VM-Storage, along with the full capacity of the Live pool and the full capacity of the Archive pool.

The idea is to set this up using the following network:

eno1 would carry vmbr0 for management, using:
10.75.92.11/24 for node 1
10.75.92.12/24 for node 2
10.75.92.13/24 for node 3

We would like to use LACP for a VM bond on eno3 and eno4 on all of these nodes.
For the fiber cards we would like to use a bond in broadcast mode, bonding each node's fiber-card interfaces together with the following IP scheme:

Node 1 192.168.100.11/24
Node 2 192.168.100.12/24
Node 3 192.168.100.13/24

However, in this setup we would like everything to keep running as it should, at high speed across the cluster, when a NIC interface is pulled out.
The bond was created and is pingable, and when we pull out the top interface of the NIC in node 1 everything still works.
Doing the same with the bottom one breaks it.

I'm assuming this has something to do with ARP and/or ip route.
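For reference, these are roughly the commands I'm using to inspect the bond and ARP state after pulling a cable (interface names as on my nodes):

# bonding mode, MII status and which slaves are up/down
cat /proc/net/bonding/bond0
# per-interface link state and counters
ip -s link show enp1s0f0np0
ip -s link show enp1s0f1np1
# neighbour (ARP) table entries on the cluster bond
ip neigh show dev bond0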
Am I doing something wrong here, or is this not the best way to proceed?
What would be the best and fastest setup?
I would really appreciate it if someone could assist!
I have been working on this for weeks now and would like to see it finished.
Thank you for any assistance.
 
Have you seen the wiki article [0] on full mesh setup methods?

EDIT: As a side note, how stringent are your data-consistency needs? Given the amount of data, you could also use ZFS with storage replication [1].

[0] https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
[1] https://pve.proxmox.com/pve-docs/chapter-pvesr.html
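As a rough, untested example, a storage replication job per guest could be created like this (VMID, target node, schedule and rate limit are placeholders):

# replicate guest 100 to node pve2 every 15 minutes, limited to 10 MB/s
pvesr create-local-job 100-0 pve2 --schedule "*/15" --rate 10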
Hi alwin,
Thank you for your feedback!
Previously the setup was 2 nodes plus a QDevice, on ZFS pools with replication and HA.
It was not migrating fast enough, so back to the drawing board.

Ergo, Ceph came into the picture.
Not entirely sure what the best approach is at this point without losing speed in the cluster.
I have rebuilt this setup 4 times already ...
Speeds were only 100 to 150 MB/s when migrating on a 25 Gbps cluster with two interfaces on each NIC.
I must be doing something wrong.

It needs to be done by Friday.
 
It was not migrating fast enough, so back to the drawing board.
By default the migration_network [0] is the one used during installation.
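If needed, it can be set cluster-wide in /etc/pve/datacenter.cfg, e.g. (the subnet here is just a placeholder for your mesh network):

# /etc/pve/datacenter.cfg
migration: secure,network=192.168.100.0/24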

Not entirely sure what the best approach is at this point without losing speed in the cluster.
If you're unsure which full-mesh setup to choose, then go with routed simple [1] (IMO). It is the more straightforward and better-known setup path.
A key difference is that the traffic won't be routed through another node should a link die. But I believe there is a low chance of that happening in a 3-node cluster whose nodes sit right on top of each other.
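Roughly, the routed simple setup looks like the following in /etc/network/interfaces on each node (interface names and addresses below are placeholders, see the wiki [1] for the full example): each mesh port carries the node's mesh IP plus a /32 route to one peer.

auto enp1s0f0np0
iface enp1s0f0np0 inet static
        address 192.168.100.11/24
        # route to node 2 via the first mesh port
        up ip route add 192.168.100.12/32 dev enp1s0f0np0
        down ip route del 192.168.100.12/32

auto enp1s0f1np1
iface enp1s0f1np1 inet static
        address 192.168.100.11/24
        # route to node 3 via the second mesh port
        up ip route add 192.168.100.13/32 dev enp1s0f1np1
        down ip route del 192.168.100.13/32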

Speeds were only 100 to 150 MB/s when migrating on a 25 Gbps cluster with two interfaces on each NIC.
I assume this applies also to the migration_network [0].

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_migration_network
[1] https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(Simple)
 
UPDATE:

I went ahead for now with the bonded broadcast on the full-mesh cluster network and added another Corosync link over the 1 Gbps interfaces with a switch, as this was already in place.

I then created a simple Windows VM and tried migrating it over.
It was slow again.
This time I played with the PGs and made them a bit bigger.
I have only created one of the 3 pools so far, the VM-Storage pool, and set its PGs to 256 on 6 OSDs (two 240 GB SSDs per node).
The live migration went more or less the same.
An offline migration literally took a second.
Doing it live was a lot slower.
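For reference, I bumped the PG count with something along these lines (pool name as in my setup):

# raise the placement group count on the VM-Storage pool
ceph osd pool set VM-Storage pg_num 256
ceph osd pool set VM-Storage pgp_num 256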

This is the output I got from the live migration:

2025-11-05 11:56:16 starting migration of VM 101 to node 'Node1' (10.75.92.12)
2025-11-05 11:56:16 starting VM 101 on remote node 'Node1'
2025-11-05 11:56:18 start remote tunnel
2025-11-05 11:56:18 ssh tunnel ver 1
2025-11-05 11:56:18 starting online/live migration on unix:/run/qemu-server/101.migrate
2025-11-05 11:56:18 set migration capabilities
2025-11-05 11:56:18 migration downtime limit: 100 ms
2025-11-05 11:56:18 migration cachesize: 2.0 GiB
2025-11-05 11:56:18 set migration parameters
2025-11-05 11:56:18 start migrate command to unix:/run/qemu-server/101.migrate
2025-11-05 11:56:19 migration active, transferred 105.1 MiB of 16.0 GiB VM-state, 4.3 GiB/s
2025-11-05 11:56:20 migration active, transferred 198.3 MiB of 16.0 GiB VM-state, 112.7 MiB/s
2025-11-05 11:56:21 migration active, transferred 308.1 MiB of 16.0 GiB VM-state, 111.7 MiB/s
2025-11-05 11:56:22 migration active, transferred 417.5 MiB of 16.0 GiB VM-state, 110.2 MiB/s
2025-11-05 11:56:23 migration active, transferred 526.2 MiB of 16.0 GiB VM-state, 115.1 MiB/s
2025-11-05 11:56:24 migration active, transferred 635.9 MiB of 16.0 GiB VM-state, 121.6 MiB/s
2025-11-05 11:56:25 migration active, transferred 745.4 MiB of 16.0 GiB VM-state, 110.2 MiB/s
2025-11-05 11:56:26 migration active, transferred 854.8 MiB of 16.0 GiB VM-state, 121.6 MiB/s
2025-11-05 11:56:27 migration active, transferred 966.6 MiB of 16.0 GiB VM-state, 120.5 MiB/s
2025-11-05 11:56:28 migration active, transferred 1.0 GiB of 16.0 GiB VM-state, 234.2 MiB/s
2025-11-05 11:56:29 migration active, transferred 1.1 GiB of 16.0 GiB VM-state, 257.0 MiB/s
2025-11-05 11:56:30 migration active, transferred 1.2 GiB of 16.0 GiB VM-state, 135.2 MiB/s
2025-11-05 11:56:31 migration active, transferred 1.3 GiB of 16.0 GiB VM-state, 133.2 MiB/s
2025-11-05 11:56:32 migration active, transferred 1.4 GiB of 16.0 GiB VM-state, 304.6 MiB/s
2025-11-05 11:56:33 migration active, transferred 1.5 GiB of 16.0 GiB VM-state, 112.5 MiB/s
2025-11-05 11:56:34 migration active, transferred 1.6 GiB of 16.0 GiB VM-state, 113.7 MiB/s
2025-11-05 11:56:35 migration active, transferred 1.8 GiB of 16.0 GiB VM-state, 115.6 MiB/s
2025-11-05 11:56:36 migration active, transferred 1.9 GiB of 16.0 GiB VM-state, 116.0 MiB/s
2025-11-05 11:56:37 migration active, transferred 2.0 GiB of 16.0 GiB VM-state, 119.4 MiB/s
2025-11-05 11:56:38 migration active, transferred 2.1 GiB of 16.0 GiB VM-state, 114.8 MiB/s
2025-11-05 11:56:39 migration active, transferred 2.2 GiB of 16.0 GiB VM-state, 112.2 MiB/s
2025-11-05 11:56:41 migration active, transferred 2.4 GiB of 16.0 GiB VM-state, 109.6 MiB/s
2025-11-05 11:56:42 migration active, transferred 2.5 GiB of 16.0 GiB VM-state, 114.4 MiB/s
2025-11-05 11:56:43 migration active, transferred 2.6 GiB of 16.0 GiB VM-state, 114.2 MiB/s
2025-11-05 11:56:44 migration active, transferred 2.7 GiB of 16.0 GiB VM-state, 125.4 MiB/s
2025-11-05 11:56:44 xbzrle: send updates to 9121 pages in 2.0 MiB encoded memory, cache-miss 98.01%, overflow 201
2025-11-05 11:56:45 migration active, transferred 2.8 GiB of 16.0 GiB VM-state, 112.4 MiB/s
2025-11-05 11:56:45 xbzrle: send updates to 9310 pages in 2.0 MiB encoded memory, cache-miss 98.01%, overflow 201
2025-11-05 11:56:46 average migration speed: 585.6 MiB/s - downtime 109 ms
2025-11-05 11:56:46 migration completed, transferred 2.9 GiB VM-state
2025-11-05 11:56:46 migration status: completed
2025-11-05 11:56:46 stopping migration dbus-vmstate helpers
2025-11-05 11:56:46 migrated 0 conntrack state entries
2025-11-05 11:56:47 flushing conntrack state for guest on source node
2025-11-05 11:56:50 migration finished successfully (duration 00:00:34)
TASK OK

Not trusting ChatGPT too much, but this was its suggestion:

QEMU needs: RAM diffs for encoding (xbzrle), CPU compression work, sending via the SSH tunnel, then flushing to the OSDs → CPU/queue bottleneck → lower speeds
It says I have to change the compression to lz4:
qm set 101 --migration-compression lz4

It also suggests using direct migration instead of going over SSH:
qm set 101 --migration-type direct

And resizing the migration cache:
qm set 101 --migration-cachesize 4G

Is this going to make a huge difference here?

Thanks for all the input here!

In any case the speed of the interfaces does not seem to be the issue here.
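To rule out the links themselves, something like this can be used to measure raw throughput between two nodes over the mesh (addresses as in my scheme):

# on the target node
iperf3 -s
# on the source node, towards the target's mesh address
iperf3 -c 192.168.100.12 -t 30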
 
Extra update:

The suggestions from my previous post did not work in any way.

Output of /etc/network/interfaces on the first node:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto eno2
iface eno2 inet static
        address 192.168.101.11/24
#Corosync

iface eno3 inet manual

iface eno4 inet manual

iface enxbe3af2b6059f inet manual

auto enp1s0f0np0
iface enp1s0f0np0 inet manual

auto enp1s0f1np1
iface enp1s0f1np1 inet manual

auto bond0
iface bond0 inet static
        address 192.168.100.11/24
        bond-slaves enp1s0f0np0 enp1s0f1np1
        bond-miimon 100
        bond-mode broadcast
        mtu 9000
#Cluster Bond

auto vmbr0
iface vmbr0 inet static
        address 10.75.92.11/24
        gateway 10.75.92.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*

The last octet goes up by one per node for all of the IP addresses.

The Corosync setup (/etc/pve/corosync.conf):


logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.100.11
    ring1_addr: 192.168.101.11
  }
  node {
    name: Node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.100.12
    ring1_addr: 192.168.101.12
  }
  node {
    name: Node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.100.13
    ring1_addr: 192.168.101.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
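As far as I know, the state of both Corosync links can be checked with:

# show local link status for all configured links
corosync-cfgtool -s
# show cluster membership and quorum info
pvecm status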


The Ceph OSD tree for now:

root@MY729-CCTVSRV01:~# ceph osd tree
ID  CLASS       WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1              1.30975  root default
-3              0.43658      host Node1
 0  VM-Storage  0.21829          osd.0       up   1.00000  1.00000
 3  VM-Storage  0.21829          osd.3       up   1.00000  1.00000
-5              0.43658      host Node2
 1  VM-Storage  0.21829          osd.1       up   1.00000  1.00000
 4  VM-Storage  0.21829          osd.4       up   1.00000  1.00000
-7              0.43658      host Node3
 2  VM-Storage  0.21829          osd.2       up   1.00000  1.00000
 5  VM-Storage  0.21829          osd.5       up   1.00000  1.00000
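To see what the pool itself can do, independent of live migration, a quick benchmark on the VM-Storage pool could look like this (pool name as in my setup; it writes test objects, so only run it on a pool that can take the load):

# 30 second write benchmark on the VM-Storage pool
rados bench -p VM-Storage 30 write --no-cleanup
# sequential read benchmark using the objects written above
rados bench -p VM-Storage 30 seq
# remove the benchmark objects afterwards
rados -p VM-Storage cleanup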