Help configuring CEPH - Slow Performance

Mes449

New Member
Jun 19, 2025
Hello,

I'm new to both Proxmox and Ceph... I'm trying to set up a cluster for temporary use (roughly 1-2 years) for a small organization that has most of its servers in AWS, but still has a couple of legacy VMs hosted in a third-party data center running VMware ESXi. We also plan to host a few other things on these servers that may go beyond that timeline. The data center currently providing the hosting is being phased out at the end of the month, and I'm trying to migrate those few VMs to Proxmox until the systems they run can be retired.

We purchased some relatively high-end (though previous-generation) servers reasonably cheap, and they're actually a fair bit better than the ones the VMs are currently hosted on. Because of budget, reports I saw online that Proxmox and SAS-connected SANs don't work well together, and the desire to meet the 3-server minimum for a cluster/HA, I decided to go with Ceph for storage.

The drives are 1.6TB Dell NVMe U.2 drives. I have a mesh network using 25G links between the 3 servers for Ceph, and a 10G connection to the switch for regular networking. One network port is currently unused, but I had planned to use it as a secondary connection to the switch for redundancy. So far I've only added one of these drives from each server to Ceph; I have more I want to add once it's performing correctly. I was trying to get as much redundancy/HA as possible with the hardware we could get hold of on the short timeline. However, just getting the hardware took longer than I'd hoped, and although I did some testing, I didn't have hardware close enough to the real thing to test some of this with beforehand.

As far as I can tell, I followed the instructions I could find for setting up Ceph with a mesh network using the routed setup with fallback. However, it's running really slowly. If I run something like CrystalDiskMark in a VM, I'm seeing around 76 MB/s for sequential reads and 38 MB/s for sequential writes. The random reads/writes are around 1.5-3.5 MB/s.
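
(In case the exact workload matters: I've mostly been using CrystalDiskMark's defaults, but I assume roughly equivalent fio runs on a Linux guest would look something like the below; the test file path is arbitrary.)

Code:
# rough SEQ1M equivalents
fio --name=seqread  --filename=/root/fiotest --size=4G --bs=1M --rw=read  --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based
fio --name=seqwrite --filename=/root/fiotest --size=4G --bs=1M --rw=write --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based
# rough RND4K equivalent
fio --name=randrw   --filename=/root/fiotest --size=4G --bs=4k --rw=randrw --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based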

Meanwhile, on the rigged-up test environment I set up before having the servers on hand (just three old Dell workstations from 2016 with old SSDs in them and a shared 1G network connection), I'm seeing 80-110 MB/s for sequential reads and 40-60 MB/s for writes, and on some of the random reads I'm seeing 77 MB/s compared to 3.5 MB/s on the new servers.

I've run iperf3 tests over the 25G connections between the 3 servers, and they're all hitting just about full 25Gbit line rate.
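
(For reference, the tests were basically just the stock iperf3 server/client run between pairs of nodes over the mesh addresses; 192.168.0.2 below is just standing in for the second node's loopback.)

Code:
# server on one node (here assumed to be the one with loopback 192.168.0.2):
iperf3 -s
# client from another node:
iperf3 -c 192.168.0.2 -t 30
# and again with -R to test the reverse direction:
iperf3 -c 192.168.0.2 -t 30 -R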

Here is my /etc/network/interfaces file. It's possible I've overcomplicated some of this. My intention was to have separate interfaces for management, VM traffic, cluster traffic, Ceph public traffic, and Ceph OSD/replication traffic. Some of these are set up as VLAN sub-interfaces, since each server has two network cards with two ports each, which isn't enough to give everything its own physical interface; I'm hoping virtual interfaces on separate VLANs are more than adequate for the traffic that doesn't need high performance.

Bash:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1np0
iface eno1np0 inet manual
#Daughter Card - NIC1 10G to network switch

auto eno2np1
iface eno2np1 inet manual
#Daughter Card - NIC2 10G to network switch

iface ens6f0np0 inet manual
        mtu 9000
#PCIx - NIC1 25G Storage direct attached

iface ens6f1np1 inet manual
        mtu 9000
#PCIx - NIC2 25G Storage direct attached

auto bond0
iface bond0 inet manual
        bond-slaves eno1np0 eno2np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 1500
#Network bond of both 10GB interfaces (Currently 1 is not plugged in)

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        post-up /usr/bin/systemctl restart frr.service
#Bridge to network switch

auto vmbr0.6
iface vmbr0.6 inet static
        address 10.6.247.1/24
#VM network

auto vmbr0.1247
iface vmbr0.1247 inet static
        address 172.30.247.1/24
#Regular Non-CEPH Cluster Communication

auto vmbr0.254
iface vmbr0.254 inet static
        address 10.254.247.1/24
        gateway 10.254.254.1
#Mgmt-Interface

source /etc/network/interfaces.d/*
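
In case it's useful, these are the kinds of checks I can run to show the live state of the bond, bridge, and VLAN sub-interfaces rather than just the config (happy to post the output):

Code:
ip -br addr                    # addresses on every interface, including the VLAN sub-interfaces
ip -br link                    # link state of every interface
cat /proc/net/bonding/bond0    # which slaves are actually active in the 802.3ad bond
bridge vlan show               # VLANs carried on vmbr0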

Ceph Config File:
Bash:
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 192.168.0.1/24
    fsid = 68593e29-22c7-418b-8748-852711ef7361
    mon_allow_pool_delete = true
    mon_host = 10.6.247.1 10.6.247.2 10.6.247.3
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 3
    public_network = 10.6.247.1/24

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
    keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.PM01]
    public_addr = 10.6.247.1

[mon.PM02]
    public_addr = 10.6.247.2

[mon.PM03]
    public_addr = 10.6.247.3
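
(I'm honestly not 100% sure which addresses each Ceph daemon is actually binding to; my understanding is that the following would show that, and I can post the output if it helps.)

Code:
ceph mon dump                      # addresses the monitors are listening on
ceph osd dump | grep "^osd"        # each OSD's public and cluster address
ceph osd metadata 0 | grep addr    # front_addr (public) vs back_addr (cluster) for osd.0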


My /etc/frr/frr.conf file:

Bash:
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.

frr defaults traditional
hostname PM01
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 192.168.0.1/32
 ip router openfabric 1
 openfabric passive
!
interface ens6f0np0
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface ens6f1np1
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
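
(For the FRR/OpenFabric side, my understanding from the wiki article is that the mesh can be checked with something like the following; again, I can post output if it helps.)

Code:
vtysh -c "show openfabric topology"    # adjacencies / reachable nodes over the mesh links
vtysh -c "show openfabric route"       # routes fabricd has learned
ip route | grep 192.168.0              # the /32 loopbacks of the other two nodes should show up here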

If I do the same disk benchmarking on another of the same NVMe U.2 drives set up as plain LVM storage, I get 600-900 MB/s on sequential reads and writes.
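
I haven't yet benchmarked the Ceph pool directly from one of the hosts; my understanding is that something like rados bench would take the VM and virtio layer out of the picture, so I plan to try that next (the pool name below is just a placeholder for whatever my pool is actually called):

Code:
# 60 seconds of 4MB writes, keeping the objects for the read test afterwards
rados bench -p <poolname> 60 write -b 4M -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p <poolname> 60 seq -t 16
# remove the benchmark objects when done
rados -p <poolname> cleanup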

Any help is greatly appreciated. Like I said, setting up Ceph and some of this networking is a bit outside my comfort zone, and I need to be off the old setup by July 1. I could just put the VMs onto local storage/LVM for now, but I'd rather do it correctly the first time.

Also, if anyone has a link to a video or guide you think might help, I'd be open to that too. A lot of the videos and articles I find are just "install Ceph" and that's it, without much on actually configuring it.

Thanks
 
Yeah you're getting slow speeds because you're probably not actually using those 25G NICs, for some reason.

I would suggest trying a simpler network setup at first—see "Routed Setup (Simple)" in our Full Mesh Network for Ceph Server wiki article, for example.

For reference, on one of my virtual PVE + Ceph clusters I use for testing, I have the following network configuration (non-mesh):
Code:
# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface ens18 inet manual

auto ens19
iface ens19 inet static
        address 172.16.65.231/24
#Ceph

auto vmbr0
iface vmbr0 inet static
        address 172.16.64.231/24
        gateway 172.16.64.1
        bridge-ports ens18
        bridge-stp off
        bridge-fd 0


source /etc/network/interfaces.d/*

The above is for the first node; the config is identical for the second and third node, except that the IP addresses differ, of course.

Code:
# cat /etc/pve/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 172.16.65.0/24
        fsid = 512a4321-de2a-4058-bce6-16e97b01b6cf
        mon_allow_pool_delete = true
        mon_host = 172.16.64.231 172.16.64.232 172.16.64.233
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 172.16.64.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.ceph-reef-01]
        public_addr = 172.16.64.231

[mon.ceph-reef-02]
        public_addr = 172.16.64.232

[mon.ceph-reef-03]
        public_addr = 172.16.64.233

This uses the 172.16.64.0/24 network for general traffic plus the public part of Ceph, and the 172.16.65.0/24 network for private Ceph stuff.

Note that in my case the traffic of the cluster network goes over the bridge of my workstation, which means there's a default route present. Therefore the up and down parts of the simple routed setup in the page linked above aren't necessary here; the NICs aren't connected to each other directly. (In other words, I don't have a mesh network here.)

I want to emphasize that the public network is used by the MONs, MGRs, and MDSs, as well as by anything communicating with Ceph, which includes PVE itself. In turn, this means that your cluster's performance will probably (eventually) be bottlenecked by those 10G NICs; it's really only the replication traffic between the OSDs that gets offloaded to the cluster network. Ideally, then, you'd want beefy NICs everywhere, not just on the cluster network.
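
Just as an illustration (not a recommendation), if you did want both Ceph networks on the 25G mesh in a three-node setup, the relevant lines would look something like this, with the subnet borrowed from your FRR loopbacks:

Code:
[global]
        # example only: both Ceph networks on the 25G mesh subnet
        public_network  = 192.168.0.0/24
        cluster_network = 192.168.0.0/24
        # note: the MONs also have to listen on the public network (mon_host / public_addr),
        # so this is far easier to set up before the MONs are created than to change afterwards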

Finally, just to add a little disclaimer here: I don't intend to make any recommendations for some production-grade network topology here or anything; I just want to suggest that you should try a much simpler setup first before trying to debug whatever you've got set up currently.
 
The reason is pretty clear in his network file: vmbr0 uses bond0, which in turn has the NICs labelled "#Daughter Card - NIC1 10G to network switch".

If you want to use your 25Gb NICs, use them.
As far as I can tell, I was. It's very possible I'm misunderstanding, as I'm still a bit confused about which network Ceph traffic actually goes over; at first I thought it was just the private one, but I'm not so sure now. Rereading, it sounds like certain VM traffic might go over the public connection, but it's not clear. I've since changed the config so that both the public and private Ceph networks use that 192.168.0.x network (the 25G mesh), and I even unplugged the 3 servers from the 10G LAN connection while a VM was running a disk benchmark; it still finished with the same horrible speeds. I've also been monitoring the 10G ports on our switch, and Ceph is definitely not using them; they stay near zero while the test is running.
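
On the hosts themselves I can also watch the counters on the 25G ports directly while a benchmark runs, with something like:

Code:
ip -s link show ens6f0np0       # byte/packet counters on the first 25G mesh port
ip -s link show ens6f1np1       # and the second one
watch -n1 cat /proc/net/dev     # rough live view of traffic on every interface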

Even if it were using the 10G link, 30-70 MB/s is crazy slow. I'm getting better than that on a test cluster of three 2016-era Dell workstations connected through an unmanaged 1G switch. I also added the second 10G link to bond0 above, and it made no difference.

I'm trying to reach some of the Proxmox support partners to see if someone can assist, as that would be easier than randomly trying things, but so far none of them are responding, or their phone systems just hang up or don't ring...

If I'm not able to get in touch with someone soon, I'll try configuring Ceph to use only the 10G network that's connected to the switch, but I'm doubtful that will help; as I said, it seems to be using the 25G connections, just not at the speed it should. At the moment I don't have a 25G switch and am limited on 10G switch ports: they're all in use between what we already had connected and the two bonded 10G ports on each of the 3 Proxmox servers. I suppose I could split those and use one port per server for the regular network and one for Ceph.
 
EDIT: Ugh, they need to fix their moderator approval policies; it's randomly not approving some posts while approving others. I've edited this post since it was allowed through, and pasted in the info from my post above that is still marked as awaiting approval...

I called around to a few Proxmox Gold partners today hoping for some quick assistance, since I'm in a time crunch. A couple of them gave me some free advice. They recommended at least 5 hosts (which isn't an option in my scenario), and although Proxmox's wiki suggests switchless mesh options for Ceph with small numbers of hosts, they said that is very bad advice, that you shouldn't do it, and that they've had to migrate several customers off such setups due to performance or stability issues, among other things.

We also discussed Ceph vs. ZFS for my scenario, and it sounds like ZFS may make more sense for my use case. I may try a couple of their suggestions for the Ceph implementation first, including completely removing the mesh, connecting everything to our 10G switch, and just separating Ceph traffic from Proxmox traffic. If that doesn't make an obvious improvement (or even if it does), I may just go to ZFS, since we aren't planning to scale the servers or the storage up much, if at all, and this is meant as a stop-gap for a relatively short period while we migrate off the old systems running in those VMs. None of the VMs are particularly resource hungry, and nothing is important enough to care about the minute or two of staleness that might come with a ZFS setup; if a drive failed, having a copy that is a few minutes old is no issue.

They gave me more advice, and I may not have all of the above exactly as they said it since I don't have my notes with me at the moment, but I'm hopeful that one of those options will work. I really don't want to look into a 25G or 40G switch or spend time troubleshooting something finicky right now, especially when my knowledge of Ceph and Proxmox is still pretty low and the implementation needs to happen so quickly. They also thought the hardware I'd chosen was all great and that I shouldn't be having performance issues as bad as I am.
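
(If we do end up on ZFS, my understanding is that the couple-of-minutes staleness I mentioned comes from PVE's built-in storage replication, which I believe is configured per guest roughly like this; the VM ID is just an example.)

Code:
# hypothetical example: replicate VM 100 from this node to PM02 every 5 minutes
# (requires ZFS storage with the same name on both nodes)
pvesr create-local-job 100-0 PM02 --schedule "*/5"
# check replication state / last sync
pvesr status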

I believe I've been able to verify that it is indeed using the 25G connection. It may not have been originally, though I'm fairly sure it was. I set both the public and private Ceph networks to the same 25G network, started a disk benchmark, and unplugged all the LAN connections so that ONLY the 25G direct connections between the 3 hosts were still connected. I plugged them back in about 10 minutes later; the test had completed, with the same speeds. I also monitored/graphed the 10G interfaces on the switch the hosts are connected to, and none of them rose much above minimal/near-zero traffic during the benchmark. See the post I just made above for my next planned steps; hopefully one of those options works. If not, I'll keep troubleshooting, reach back out to the gold partner, and/or come back to this thread. I'll update once I make any progress either way.

Thanks
 