Ceph Storage Performance

felixheilig

Member
Jul 6, 2021
Hello everyone,

I don't have a problem per se, but I wanted to gather some more opinions.
Here's the setup:
- 3 Proxmox nodes in a cluster.
- Hosting provider: Hetzner
- AX41-NVMe (https://www.hetzner.com/de/dedicated-rootserver/ax41-nvme/konfigurator#/)
- Specifications: AMD Ryzen™ 5 3600, 64 GB DDR4 non-ECC.

The nodes are connected to each other via 1 Gbit/s for the Proxmox cluster network and, additionally, via a dedicated 10 Gbit/s LAN for Ceph storage.
(I have confirmed through a test with iperf that I am indeed getting 10 Gbps through the connection.)

Each of the three hosts also has a 2 TB NVMe SSD for my Ceph Storage Pool.
I have obtained the following values in a VM, both on ZFS storage and Ceph storage:

ZFS:
Read: ~1298 MB/s
Write: ~1191 MB/s

Ceph:
Read: ~338 MB/s
Write: ~20 MB/s


Since this is my first experience with Ceph storage, I have the following question: Is this significant difference in read and write performance normal, or is there potential for improvement that I am overlooking?

Would adding more nodes improve the performance, or would it be irrelevant?


Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.55.0/24
     fsid = bcff2142-9dbd-4ed3-9ea9-89b0b4b97ba0
     mon_allow_pool_delete = true
     mon_host = 192.168.55.14 192.168.55.15 192.168.55.16
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.55.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.Kamino04]
     public_addr = 192.168.55.14

[mon.Kamino05]
     public_addr = 192.168.55.15

[mon.Kamino06]
     public_addr = 192.168.55.16


Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host Kamino04 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 1.86299
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 1.86299
}
host Kamino05 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 1.86299
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 1.86299
}
host Kamino06 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 1.86299
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 1.86299
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 5.58897
    alg straw2
    hash 0    # rjenkins1
    item Kamino04 weight 1.86299
    item Kamino05 weight 1.86299
    item Kamino06 weight 1.86299
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

I am very grateful for any input.

PS: Yes, I am aware that 3 nodes are not enough to operate Ceph with full fault tolerance.
 

Attachments: ceph.png (90.9 KB), ZFS.png (88.5 KB)
What is the latency between the nodes? That will have an impact on Ceph as for each IO request there will be a few roundtrips over the network.

Can you post the exact models of the SSDs? For example, ls -l /dev/disk/by-id should list them.
 
"2 TB NVMe SSD for my Ceph Storage Pool."

What is your NVMe model?
(Consumer SSDs will perform badly on small 4k writes with both Ceph and ZFS, but for big 1M writes they should be OK.)


How many PGs do you have in your Ceph pool? You can also enable writeback on your VM; it really helps with Ceph.
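
The small-write penalty mentioned above can be observed directly. The following is a minimal sketch, not a replacement for a proper benchmark such as fio: it times 4 KiB writes that are each followed by fsync, which is the access pattern that tends to hurt drives without power-loss protection. The function name time_sync_writes and the use of the system temp directory are assumptions for illustration; point it at a directory on the storage you want to test.

```python
import os
import tempfile
import time


def time_sync_writes(directory: str, count: int = 200, block_size: int = 4096) -> float:
    """Return the mean seconds per write+fsync for `count` blocks in `directory`."""
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this block to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file again
    return elapsed / count


if __name__ == "__main__":
    # Assumption: the temp directory lives on the storage under test.
    mean_s = time_sync_writes(tempfile.gettempdir())
    print(f"mean 4 KiB write+fsync latency: {mean_s * 1000:.3f} ms "
          f"(~{1 / mean_s:,.0f} IOPS at queue depth 1)")
```

Running it once inside a VM on the ZFS storage and once on the Ceph storage gives a like-for-like comparison of sync write latency, independent of large sequential throughput.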
 
Thanks, aaron, for your reply.

A ping between the nodes takes between 0.073 ms and 0.207 ms at most.
Or would you suggest measuring the latency another way?

The SSD models seem to be 2 TB Samsung PM9A1:
https://www.mindfactory.de/product_...3D-NAND-TLC--MZVL22T0HBLB-00B00-_1390754.html
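
As a rough sanity check on those ping times: Ceph acknowledges a replicated write only after the primary OSD has heard back from the replicas, so a writer with a single outstanding request pays at least two network round trips per write. The sketch below turns that into an upper bound. It is a simplified model that assumes exactly two sequential round trips and ignores disk, CPU, and software latency, so real numbers will be lower. The function name is hypothetical, and the block size and queue depth of the benchmark in the screenshots are not stated in the thread.

```python
def qd1_write_ceiling(rtt_ms: float, block_bytes: int, round_trips: int = 2):
    """Return (IOPS, MB/s) if only network round trips limited a queue-depth-1 writer.

    Assumption: one round trip client -> primary OSD plus one round trip
    primary -> replicas (done in parallel), and no other latency at all.
    """
    latency_s = round_trips * rtt_ms / 1000.0
    iops = 1.0 / latency_s
    mb_per_s = iops * block_bytes / 1_000_000
    return iops, mb_per_s


for rtt_ms in (0.073, 0.207):  # ping range reported in the thread
    for block in (4096, 1024 * 1024):  # 4 KiB and 1 MiB writes
        iops, mbps = qd1_write_ceiling(rtt_ms, block)
        print(f"RTT {rtt_ms} ms, block {block} B: "
              f"<= {iops:,.0f} IOPS, <= {mbps:,.0f} MB/s")
```

With the slower ping value this gives a ceiling of about 2,400 write IOPS at queue depth 1, roughly 10 MB/s for 4 KiB blocks. For 1 MiB blocks the bound exceeds what a 10 Gbit/s link can carry, so large writes are limited by bandwidth rather than by latency.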
 
The number of PGs for my pool is currently 128. You can find my pool config in the attachment, in case you have any suggestions for me.
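
Whether 128 is a sensible value can be checked against the common rule of thumb of roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two. The sketch below applies it to this cluster (3 OSDs, osd_pool_default_size = 3 from the config above). The function name and the target of 100 per OSD are assumptions for illustration; the PG autoscaler may suggest something different.

```python
def suggested_pg_count(num_osds: int, pool_size: int, target_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count: (OSDs * target) / replicas, rounded up to a power of two."""
    raw = num_osds * target_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


# 3 OSDs (one NVMe per node) and 3 replicas, as in this thread
print(suggested_pg_count(num_osds=3, pool_size=3))
```

For this setup the rule of thumb lands on 128, so the PG count is unlikely to be what holds the write speed back.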



Changing the virtual drive to writeback helped tremendously.

I'm at ~528 MB/s now.


Is that the range I should expect?
 

Attachments: Cephpool.png (13.6 KB), cephwriteback.png (155.7 KB)
As Samsung PM9A1 are consumer drives, they have no power-loss protection (capacitors) for their DRAM cache.
Writeback lies about sync writes from the point of view of the OS inside the VM, so data in flight is lost if the host or VM crashes or there is a power outage. Sometimes the data isn't important and the system can simply reboot, but sometimes, if the wrong file is updated at the wrong time, application data is lost and you need to restore from backup to ensure data integrity.
 

Enabling KRBD under Datacenter -> Storage also helps a lot, especially for Windows VMs.

Edit: But please don't use consumer SSDs with writeback mode in the VM, as gabriel said.
 