Ceph very slow, what am I doing wrong?

Gumby

Hello everyone. I am admittedly new to Ceph, so please be gentle. I have a 3-node cluster (3x Lenovo x3550 M5), each node with the following spec:

2x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
128GB ECC DDR4 2133
4x 1.8TB Kingston DC600M SATA SSD (Ceph)
1x 900GB Toshiba SAS HDD (OS)
1x dual-port 10GbE NIC, one port connected to a Ubiquiti 10G switch
1x quad-port 1GbE NIC, one port connected to a Ubiquiti 1G switch

The Ceph cluster has been configured to use the 10GbE network. If I run a VM on ext4 storage backed by a single Kingston DC600M, I get 8-10x the performance compared to the same VM running on Ceph storage. I understand that 3 nodes isn't ideal as far as node count goes, and that the hardware is a bit older, but I expected a fair bit more performance than what I am seeing, since I can confirm a standalone SSD gets 8-10x the performance of Ceph. Can anyone suggest what I have done wrong? Based on other articles I have browsed on the same topic, here is what I think is the relevant info.

/etc/network/interfaces
Code:
auto ens1f0
iface ens1f0 inet static
        address 10.10.10.44/24

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.44/24
        gateway 192.168.0.253
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

/etc/pve/ceph.conf
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.10.10.44/24
         fsid = ef58c176-efb2-4f8b-9505-2c8d0fc3a6d0
         mon_allow_pool_delete = true
         mon_host = 192.168.0.44 192.168.0.46 192.168.0.48
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 192.168.0.44/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.oklp-ha-node4]
         public_addr = 192.168.0.44

[mon.oklp-ha-node5]
         public_addr = 192.168.0.46

[mon.oklp-ha-node6]
         public_addr = 192.168.0.48

ceph osd df tree
Code:
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME             
-1         20.95917         -   21 TiB  1.1 TiB  1.0 TiB  223 KiB   17 GiB   20 TiB  5.03  1.00    -          root default         
-3          6.98639         -  7.0 TiB  359 GiB  354 GiB   64 KiB  5.5 GiB  6.6 TiB  5.03  1.00    -              host oklp-ha-node4
 0    ssd   1.74660   1.00000  1.7 TiB  112 GiB  110 GiB   13 KiB  1.4 GiB  1.6 TiB  6.24  1.24   41      up          osd.0         
 3    ssd   1.74660   1.00000  1.7 TiB   98 GiB   97 GiB   24 KiB  1.2 GiB  1.7 TiB  5.49  1.09   41      up          osd.3         
 6    ssd   1.74660   1.00000  1.7 TiB   60 GiB   59 GiB   16 KiB  1.1 GiB  1.7 TiB  3.37  0.67   37      up          osd.6         
 9    ssd   1.74660   1.00000  1.7 TiB   89 GiB   88 GiB   11 KiB  1.8 GiB  1.7 TiB  4.99  0.99   42      up          osd.9         
-5          6.98639         -  7.0 TiB  360 GiB  354 GiB   79 KiB  6.1 GiB  6.6 TiB  5.03  1.00    -              host oklp-ha-node5
 1    ssd   1.74660   1.00000  1.7 TiB   85 GiB   83 GiB   17 KiB  1.6 GiB  1.7 TiB  4.75  0.94   41      up          osd.1         
 4    ssd   1.74660   1.00000  1.7 TiB  126 GiB  125 GiB   21 KiB  1.3 GiB  1.6 TiB  7.04  1.40   44      up          osd.4         
 7    ssd   1.74660   1.00000  1.7 TiB   87 GiB   86 GiB   21 KiB  1.5 GiB  1.7 TiB  4.88  0.97   43      up          osd.7         
10    ssd   1.74660   1.00000  1.7 TiB   62 GiB   60 GiB   20 KiB  1.7 GiB  1.7 TiB  3.47  0.69   33      up          osd.10       
-7          6.98639         -  7.0 TiB  360 GiB  354 GiB   80 KiB  5.7 GiB  6.6 TiB  5.03  1.00    -              host oklp-ha-node6
 2    ssd   1.74660   1.00000  1.7 TiB   86 GiB   85 GiB   20 KiB  1.1 GiB  1.7 TiB  4.80  0.95   38      up          osd.2         
 5    ssd   1.74660   1.00000  1.7 TiB   82 GiB   80 GiB   22 KiB  1.7 GiB  1.7 TiB  4.59  0.91   46      up          osd.5         
 8    ssd   1.74660   1.00000  1.7 TiB   91 GiB   90 GiB   15 KiB  1.0 GiB  1.7 TiB  5.11  1.02   37      up          osd.8         
11    ssd   1.74660   1.00000  1.7 TiB  100 GiB   99 GiB   23 KiB  1.8 GiB  1.6 TiB  5.61  1.12   40      up          osd.11       
                        TOTAL   21 TiB  1.1 TiB  1.0 TiB  229 KiB   17 GiB   20 TiB  5.03                                           
MIN/MAX VAR: 0.67/1.40  STDDEV: 0.98

/etc/pve/qemu-server/402.conf
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0;net0;ide0
cores: 4
efidisk0: cephpool-02:vm-402-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-6.2
memory: 8192
meta: creation-qemu=6.2.0,ctime=1704924451
name: vit-services-01
net0: virtio=9E:95:C4:68:EC:97,bridge=vmbr0,firewall=1,tag=50
numa: 0
onboot: 1
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=ef78ad61-19b5-499d-a24b-4185574abe8e
sockets: 1
tpmstate0: cephpool-02:vm-402-disk-1,size=4M,version=v2.0
virtio0: cephpool-02:vm-402-disk-2,size=100G
virtio1: cephpool-02:vm-402-disk-3,size=100G
vmgenid: 9fab2500-1b59-4846-b811-9acd89295291
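
For anyone wanting to reproduce the local-SSD vs. Ceph comparison, a simple fio run inside the guest against each backing disk is one option; the sketch below is only an illustration, with a placeholder file path and an arbitrary 4k random-write pattern, not necessarily the exact test used here.

Code:
# Hedged sketch: 4k random-write test against a scratch file on the disk under test.
# /mnt/testdisk is a placeholder mount point; run once on the local SSD and once on
# the Ceph-backed disk, then compare IOPS and latency.
fio --name=ceph-vs-local --filename=/mnt/testdisk/fio.testfile \
    --size=4G --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
    --numjobs=1 --runtime=60 --time_based --group_reporting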

Any help is appreciated.
 
1x dual-port 10GbE NIC, one port connected to a Ubiquiti 10G switch
1x quad-port 1GbE NIC, one port connected to a Ubiquiti 1G switch

Is my assumption correct that eno1 is a 1 Gb port?
 
You're not bonding the 10G, and it looks like you are using the 1G for the front-end Ceph traffic? While the DC600Ms are PLP-enabled, you need to check the write cache setting on your drives. You can use "cat /sys/block/sd*/queue/write_cache" to see how your SATA SSDs are set up; if it's write through, you might want to push them to write back. I would also suggest mq-deadline for SSDs if you are doing mixed and small I/O access patterns; to see which scheduler is in use now, run "cat /sys/block/sd*/queue/scheduler". It is most likely set to none.
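
A minimal sketch of checking and changing those settings, assuming sdc is one of the Ceph SSDs; note that echoing into sysfs is not persistent across reboots, so a udev rule or similar would be needed to make it stick.

Code:
# Inspect current settings for all block devices
cat /sys/block/sd*/queue/write_cache
cat /sys/block/sd*/queue/scheduler

# Example: switch one SSD (sdc is a placeholder) to write back and mq-deadline.
# These writes are not persistent; reapply them at boot (udev rule, systemd unit, etc.).
echo "write back" > /sys/block/sdc/queue/write_cache
echo "mq-deadline" > /sys/block/sdc/queue/scheduler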

There is a lot more device-based tuning you can do, but understand that while Ceph on paper can do amazing things, it is still a network-based object storage solution and it is very hard to scale it out with fewer than 5 nodes. With a full-NVMe 3-node cluster in my labs I am able to max out at 800MB/s writes and 1.3GB/s reads over bonded 25G; if I set up a single node with Ceph I can push it to 1.3GB/s writes and 2.1GB/s reads, because all of the objects are on the same host. Yet on some of our 17+ node cluster builds we are not seeing any logical limits from storage and/or network and are very much limited by the OSDs themselves. You will want to scale out, since you are comparing Ceph to a single SATA SSD in performance.
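
As a rough yardstick for what the cluster itself can deliver, independent of any VM, rados bench against the pool is easy to run. The sketch below assumes the pool name cephpool-02 taken from the VM config above; --no-cleanup keeps the written objects around so the read tests have data to read.

Code:
# 60-second 4M sequential write test against the pool (objects kept for the read tests)
rados bench -p cephpool-02 60 write -b 4M -t 16 --no-cleanup

# Sequential and random read tests against the objects written above
rados bench -p cephpool-02 60 seq -t 16
rados bench -p cephpool-02 60 rand -t 16

# Remove the benchmark objects afterwards
rados -p cephpool-02 cleanup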
 
Correct, I am not bonding the 10G as I currently only have a 5-port 10G switch. The 1G is for the frontend. The DC600M drives were already set to write back automatically; the OS drive (900GB SAS) and another unused 900GB SAS drive are write through. The scheduler output is below (the active scheduler is the one shown in brackets).

cat /sys/block/sd*/queue/write_cache
Code:
write through
write through
write back
write back
write back
write back

cat /sys/block/sd*/queue/scheduler
Code:
none [mq-deadline]
none [mq-deadline]
none [mq-deadline]
none [mq-deadline]
none [mq-deadline]
none [mq-deadline]

I'm not expecting blazing-fast speeds, but I was expecting a fair bit more out of 3 nodes / 12 disks / 10GbE, which is why I assumed I had done something incorrectly. Perhaps that was an unrealistic expectation?
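
To narrow down whether the limit is the cluster or the VM I/O stack, the RBD layer can also be benchmarked directly from one of the nodes; a minimal sketch, assuming the cephpool-02 pool from the config above and a throwaway test image:

Code:
# Create a throwaway 10 GiB image, benchmark random 4K writes against it, then remove it
rbd create cephpool-02/bench-test --size 10G
rbd bench --io-type write --io-size 4K --io-threads 16 \
    --io-total 1G --io-pattern rand cephpool-02/bench-test
rbd rm cephpool-02/bench-test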
 
Yes, this is correct

You defined that 1 Gb network (192.168.0.44 = vmbr0 = eno1) as your Ceph public network and your, I assume, 10 Gb network (10.10.10.44 = ens1f0) as your Ceph cluster network.

The public network is mandatory.
Separating the cluster network is optional. But if you do, both networks should definitely have the same speed; otherwise you are basically limiting your whole Ceph cluster to the slowest network.

[0] https://docs.ceph.com/en/reef/rados/configuration/network-config-ref
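
A rough sketch of one possible end state, with both Ceph networks on the 10 GbE subnet. The 10.10.10.46/.48 monitor addresses are assumptions based on the existing numbering, and moving the monitors is not just a config edit: they typically have to be destroyed and recreated one at a time so they bind to the new addresses, and the per-monitor public_addr entries change accordingly.

Code:
[global]
         cluster_network = 10.10.10.0/24
         public_network = 10.10.10.0/24
         # 10.10.10.46/.48 are assumed from the existing numbering; adjust to the real 10G addresses
         mon_host = 10.10.10.44 10.10.10.46 10.10.10.48
         # remaining options unchanged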
 
Thank you for this info, @Neobin. I will resolve that issue tomorrow and re-run some tests. Fingers crossed this is the misconfiguration causing the slowness.
 
@Neobin I couldn't wait until morning and had to get out of bed to try it. I am seeing a HUGE increase in read speeds. Write speeds have also increased, though not as dramatically; I assume that is to be expected. Thank you very much for your help!
 