New Proxmox Production System - slow Ceph Performance???

Jun 2, 2024
I've had a few posts in the last month about trialing Proxmox and I've decided to give it a go. I have my hands on three 2U 4-node chassis; they were new-old-stock Cohesity cluster systems based on the Inventec K900G4. (https://youtu.be/CfRZZbggzLk?si=fcyG4-w2Ge7vr2aH)

I've set up one of the 4-node chassis for PVE/PBS. Here are the specs for each node:

2 x Intel Xeon Silver 4214
256GB RAM
2 x 240GB NVMe boot disks
6 x 1.92TB NVMe storage disks
1 x quad-port 10G SFP+ network card

Cisco Nexus 3172 SFP+ switch.
Each node has port 1 of the quad SFP+ card connected to the switch with a 10G DAC cable.
Each node has port 4 of the quad SFP+ card connected to the switch with a 10G DAC cable.

Each node's OS was installed on the 2 x 240GB NVMe disks as a ZFS RAID mirror.
Nodes 1, 2 and 3 have PVE 8.2.2 installed.
Node 4 has PBS 3.2-2 installed.
For nodes 1, 2 and 3 the PVE install created a vmbr0 bridge on port 1; IPs 10.63.14.1, .2 and .3 /24 were added for the main network. I manually created vmbr1 on the other DAC-connected port (port 4) on each of the 3 PVE nodes and assigned 10.10.10.1, .2 and .3 /24, and that network was selected when creating the 3-node cluster.

Each of nodes 1-3 has access to 6 disks. They are Intel D7-P5500 data center drives, rated for sequential R/W of 7000/4300 MB/s and random 4K R/W of up to 1M/130K IOPS.
There are 18 OSDs in total. All OSDs were created on the NVMe disks. The pool which all the VMs sit on was created with 128 PGs and a standard 3/2 (size/min_size) setup.

To test the network I installed iperf3 on all nodes. Between the PVE nodes 1-3 and the PBS node 4, all speeds come back at approx. 9.4-9.5 Gbps. I have some Linux VMs on each of the nodes with iperf3 as well, and between them I still average 9.3+ Gbps. I have some Windows VMs (Server 2022 and Win 11) installed, and running iperf3 between them as the client to a Linux VM or one of the host nodes directly I get 9.3+ Gbps. When a Windows VM runs as the iperf3 server, anything connecting to it slows to about 8.2 Gbps, whether the client is a Linux VM, a Windows VM or one of the node shells directly. Not critical, but a small slowdown is noted when a Windows VM is on the receiving end compared to the Linux VMs or the nodes themselves.
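For reference, the tests above were along these lines (a minimal sketch; the target address and the parallel/reverse flags are just examples, not the exact invocations used):

Bash:
# on the receiving side (node or VM)
iperf3 -s

# on the sending side: a 30-second run, then again with 4 parallel streams,
# then reversed (-R) so the server side pushes instead of receives
iperf3 -c 10.63.14.1 -t 30
iperf3 -c 10.63.14.1 -t 30 -P 4
iperf3 -c 10.63.14.1 -t 30 -R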

When copying data between Windows VMs I am only getting around ~400 MB/s. I would have thought that with multiple drives spread out to absorb the writes, the speed would be near the max of the 10 Gbps network (approx. ~1200 MB/s).

For PBS on node 4 there is a single NVMe disk for backups for now, set up as a single-drive ZFS pool. When I back up any of the VMs to the PBS system, the maximum seems to be around ~200 MiB/s for the read and maybe a smidge less for the write while it is "actually" reading and writing data. What I mean by that is this: for example, I created a 500GB VM with only 50GB used on it. When the backup starts it shows the % complete, posting a line per 1% increment, but once the part containing real data is done (50.0 GiB of 500.0 GiB), the read speed jumps to ~9 GiB/s, the writes drop to 0 B/s, and it moves multiple % per line. I'm assuming this is normal, as the last 450GB of data was blank. But why is the actual data at the start read so slowly?
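One way to narrow down whether that ~200 MiB/s is a Ceph read limit or a PBS-side limit would be to benchmark the pool directly from a node shell. A rough sketch, assuming the VM pool is the one named 'VMs' mentioned further down the thread:

Bash:
# write benchmark objects into the pool and keep them for the read test
rados bench -p VMs 30 write --no-cleanup

# sequential read of the objects written above
rados bench -p VMs 30 seq

# remove the benchmark objects when done
rados -p VMs cleanup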

Any advice on tuning this before I start converting VMware production VMs would be appreciated.

thanks, Paul
 
When copying data between Windows VMs I am only getting around ~400 MB/s.
This could be completely normal/expected. Post your interfaces file for a more nuanced discussion; but generally speaking, if you have both your public and private Ceph traffic on the same interface you can only expect half the throughput, and that's if you're not commingling other traffic on top of it.

see https://forum.proxmox.com/threads/my-first-proxmox-and-ceph-design-and-setup.141446/post-633649
 
128 PGs for 24 OSDs is really too low. It should be around 1024 (for a replication size=3).

Note that the PG autoscaler should increase it, but if your cluster is empty (and you only run benchmarks), the autoscaler can even reduce it to a minimum number of PGs (something like 32).

Try setting the PG target ratio to 1.0 (100%) to see the number of PGs increase.

Too low a PG count means locking and slow performance, ...



Another tuning: enable writeback cache on your VM disks.
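For reference, a minimal sketch of both suggestions from a node shell; the pool name 'VMs' and the VM/disk identifiers below are examples taken from this thread, not exact commands:

Bash:
# tell the autoscaler this pool is expected to hold (up to) all of the cluster's data,
# so it scales pg_num up accordingly
ceph osd pool set VMs target_size_ratio 1.0

# enable writeback cache on an existing VM disk by re-specifying the volume with the
# cache option (example: VM 100, first SCSI disk on the 'VMs' storage)
qm set 100 --scsi0 VMs:vm-100-disk-0,cache=writeback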
 
This could be completely normal/expected. Post your interfaces file for a more nuanced discussion; but generally speaking, if you have both your public and private Ceph traffic on the same interface you can only expect half the throughput, and that's if you're not commingling other traffic on top of it.

see https://forum.proxmox.com/threads/my-first-proxmox-and-ceph-design-and-setup.141446/post-633649
@alexskysilk - here is the interfaces file.
There is a public network and a cluster network. I would have thought the cluster network would also be used for Ceph? If not, how can I add one more network to be used only by Ceph?

root@pve2:~# cd /etc/network
root@pve2:/etc/network# cat interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface ens1f0np0 inet manual

iface enxba5da9d1ca7f inet manual

iface ens1f1np1 inet manual

iface ens1f2np2 inet manual

iface ens1f3np3 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.63.14.2/24
    gateway 10.63.14.66
    bridge-ports ens1f0np0
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.2/24
    bridge-ports ens1f3np3
    bridge-stp off
    bridge-fd 0

source /etc/network/interfaces.d/*
root@pve2:/etc/network#
 
One other thing: what's the source of the data copied? Is it using the same network interface?
The source was just copying data from file shares on Windows VMs (e.g. 5GB ISO images), but across different nodes, as in one VM was on node 1 and it would copy to a VM on node 3. I suppose it was using vmbr0 on each of the servers for the copy?
 
128 PGs for 24 OSDs is really too low. It should be around 1024 (for a replication size=3).

Note that the PG autoscaler should increase it, but if your cluster is empty (and you only run benchmarks), the autoscaler can even reduce it to a minimum number of PGs (something like 32).

Try setting the PG target ratio to 1.0 (100%) to see the number of PGs increase.

Too low a PG count means locking and slow performance, ...

Another tuning: enable writeback cache on your VM disks.
@spirit - I actually went to double-check this. I set 128 PGs and it was for only 18 disks (6 each across 3 nodes). When I checked, it had been reduced to 32! So I changed it back to 128, thinking that with all the iterations of building this environment maybe I goofed on the last one. I just checked now and it's been reduced back to 32 again! It lists that as the 'optimal' value and autoscale is on.

I do have writeback cache enabled on all the VM disks.
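For reference, a way to see why the autoscaler keeps settling on 32, and what one could do about it; the pool name 'VMs' is assumed, and the pg_num value below is just the usual OSDs*100/replicas rule of thumb, not something prescribed in this thread:

Bash:
# show per-pool sizing info and the autoscaler's reasoning
ceph osd pool autoscale-status

# if it keeps shrinking the pool back to 32, either set the target ratio as suggested
# above, or take over manually:
ceph osd pool set VMs pg_autoscale_mode off
ceph osd pool set VMs pg_num 512   # 18 OSDs * 100 / 3 replicas = 600 -> nearest powers of two are 512/1024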
 
So you have two bridges set up, but which networks are carrying your various traffic types? Of specific relevance: ceph.conf.
@alexskysilk - thanks for the replies!

Here is the /etc/ceph/ceph.conf file below. I am going to guess that I should set up a 3rd bridge just for Ceph-only traffic? Also note I found an /etc/pve/ceph.conf, and at a quick glance it looked identical.

root@pve1:~# cat /etc/ceph/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.1/24
fsid = aa810f1f-9038-4840-af2d-a1c20bd15861
mon_allow_pool_delete = true
mon_host = 10.63.14.1 10.63.14.3 10.63.14.2
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.63.14.1/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.pve1]
host = pve1
mds_standby_for_name = pve
[mds.pve2]
host = pve2
mds_standby_for_name = pve
[mds.pve3]
host = pve3
mds_standby_for_name = pve
[mon.pve1]
public_addr = 10.63.14.1
[mon.pve2]
public_addr = 10.63.14.2
[mon.pve3]
public_addr = 10.63.14.3
root@pve1:~#
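As an aside, a quick way to verify that the OSDs really are using the 10.10.10.x cluster network for replication traffic (OSD id 0 is just an example):

Bash:
# each OSD reports the addresses it bound to; back_addr is the cluster/replication network
ceph osd metadata 0 | grep -E '"(front|back)_addr"'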
 
That's not what I meant. What was the backing store? Were you reading from the same filesystem you were writing to?
Ah yes. I created only one pool, called VMs, with the defaults of 3/2 etc. All my VMs' disks reside in that pool. I see the point of your question now and I'm starting to think Ceph works a little differently than I thought. By that I mean this: I assumed that if you created a VM with a disk on node 1 (OSDs 1-6), it would spread across those OSDs for performance and the redundancy from the extra copies would be written on node 2 (OSDs 7-12) and node 3 (OSDs 13-18). If that were the case, a VM on node 1 doing a large copy to it (writing to OSDs 1-6) from a VM on node 2 (reading OSDs 7-12) would see good performance, and later on the secondary writes would "catch up" to keep in sync.
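For what it's worth, Ceph doesn't place data node-locally like that: CRUSH spreads each disk image across placement groups whose three replicas land on three different hosts, and a write is only acknowledged once all replicas have it, so there is no asynchronous "catch up". A sketch of how to see the actual placement (pool name from this thread; the object name is made up):

Bash:
# list some PGs in the pool with their acting set of OSDs; the three OSDs in each
# set sit on three different hosts, and the primary is not tied to the VM's node
ceph pg ls-by-pool VMs | head -20

# show which PG and OSDs a given (hypothetical) object name would map to
ceph osd map VMs test-object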

One thing I never mentioned: all the OSDs sit at 0/0 in the apply/commit latency list. Even with a standard single large copy running, I can only get one or two of them to show a "1" for latency. If I really push it and stack up a bunch of large reads and writes across the VMs AND get multiple CrystalDiskMark runs going, I can then get the majority of OSDs to show 1 or 2 ms of latency.
 
Additional thought... my bottleneck is most likely the 10 Gbps network, looking at how Ceph works and replicates. If you convert the max iperf3 result of approx. 9.41 Gbps to MB/s, it's approx. 1176 MB/s. If a VM is writing to a Ceph disk at 400 MB/s, I should probably shut up and feel pretty good about this, as it was reading the data from a VM on a different node (so that's incoming network traffic) and then the VM on that node was writing it, not once but 3 times!! The 400 MB/s is starting to look pretty good.

Not sure why, but I was focused on the max performance of the NVMe drives and trying to make this config push to those limits. Keeping my eye on the ball, I should just shut up again... the systems I am immediately replacing are 3 VMware hosts, HPE ProLiants (Gen9 and Gen10), all with direct-attached hardware RAID6 storage of 6-8 disks each, 12Gbps SAS 10k spinners. I'd never really done this before, but I just ran similar Windows VM file copies between VMs on the same host and also between VMs on different hosts. Same-host VM copies averaged 15-25 MB/s; across different hosts it was 10-12 MB/s, as that network is only 1 Gbps.
 
So another update... I thought I had writeback cache on all the disks; it turns out only a few had it. I've since enabled it on all of them and re-ran similar tests. I'm getting 500-600 MB/s now. I even started a copy from a VM on node 1 to a VM on node 2; it was about 50GB, so once it was running I did a migration of the node 1 VM to node 3 while I was also connected to that VM over RDP. Everything went just fine, no hiccups or lags. I'm impressed.
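For anyone wanting to double-check the same thing, a rough way to spot disks that are still missing the cache option (this just greps every VM config on the node; CD-ROM drives will show up too and can be ignored):

Bash:
# list disk lines from every VM config that do not yet have cache=writeback set
for id in $(qm list | awk 'NR>1 {print $1}'); do
    echo "== VM $id =="
    qm config "$id" | grep -E '^(scsi|virtio|sata|ide)[0-9]+:' | grep -v 'cache=writeback'
done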
 
You could easily add one or two of the free ports of your quad 10G card into a layer3+4 LACP bond, move vmbr1 onto the bond, and get increased network capacity, which should yield improved Ceph speeds, as the drives seem to be capable of it.
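A minimal sketch of what that could look like on node 2, assuming the two unused ports are ens1f1np1 and ens1f2np2 and the switch side is configured as a matching LACP port-channel:

Bash:
auto bond1
iface bond1 inet manual
    bond-slaves ens1f1np1 ens1f2np2
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.2/24
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0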


For nodes 1, 2 and 3 the PVE install created a vmbr0 bridge on port 1; IPs 10.63.14.1, .2 and .3 /24 were added for the main network. I manually created vmbr1 on the other DAC-connected port (port 4) on each of the 3 PVE nodes and assigned 10.10.10.1, .2 and .3 /24, and that network was selected when creating the 3-node cluster.
As a side note, you should definitely use redundant corosync links [1] to avoid cluster hiccups if the network gets saturated (easy, given that you put corosync on the storage network), especially if you plan on using HA.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
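On an existing cluster that usually means editing /etc/pve/corosync.conf to add a second link. A sketch of the relevant excerpt, with addresses taken from this thread and everything else left as-is (remember to bump config_version when editing):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.63.14.1
  }
  # ... same ring1_addr addition for pve2 and pve3 ...
}

totem {
  # ... existing settings ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}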
 
Not sure why, but I was focused on the max performance of the NVMe drives and trying to make this config push to those limits
You're not alone. I see this a LOT with first-time system design. "Big numbers" are just sexy, and it's only human nature to aim at the visible "numbers." It's not until you've operated a system "in anger" that you see what true pains a system can inflict on you.

To that end, consider that your system currently has no fault tolerance at any network layer; an interruption on either of your NICs will cause a node to be fenced at best, and stuff to just freeze and stop working at worst. This can be remediated SOMEWHAT even without adding more network links. Here is how I would approach your network with only two links:

Bash:
auto lo
iface lo inet loopback

iface ens1f0np0 inet manual

iface enxba5da9d1ca7f inet manual

iface ens1f1np1 inet manual

iface ens1f2np2 inet manual

iface ens1f3np3 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens1f0np0 ens1f3np3
    bond-primary ens1f0np0
    bond-mode active-backup
    bond-miimon 100
    
auto bond1
iface bond1 inet manual
    bond-slaves ens1f0np0 ens1f3np3
    bond-primary ens1f3np3
    bond-mode active-backup
    bond-miimon 100
    
# Network functions and vlan ids
# v1 (untagged): guest network
# v10: corosync
# v11: ceph public
# v12: ceph private
    
auto vmbr0
iface vmbr0 inet static
    address 10.63.14.2/24
    gateway 10.63.14.66 # odd placement for gateway- best practices are to put this at the beginning or end of the range
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

auto bond0.10
iface bond0.10 inet static
    address 10.10.10.2/24

auto bond0.11
iface bond0.11 inet static
    address 10.10.11.2/24
    
auto bond1.12
iface bond1.12 inet static
    address 10.10.12.2/24

source /etc/network/interfaces.d/*

For this to work, you'd need to trunk your ports (or define the VLAN tags) on your switch; I'm assuming your current network is flat and the v1 network is 10.63.14.0/24.
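The switch-side counterpart might look something like this on the Nexus 3172 (NX-OS); Ethernet1/1 is a placeholder for each node-facing port, and the VLAN IDs follow the comments in the config above:

Code:
vlan 10,11,12

interface Ethernet1/1
  switchport mode trunk
  switchport trunk native vlan 1
  switchport trunk allowed vlan 1,10,11,12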

What do you gain by doing this? You can now drop EITHER of your NICs and remain fully operational. You are still commingling WAY too many traffic types on a single interface (ens1f0np0), but that can't really be helped without more network links. I would urge you to AT LEAST add two more interfaces, even if just 1 Gbit.
 
