Ceph poor write speed

coolwap

New Member
Sep 12, 2024
Hi,

This is my first post here. I am new to Proxmox and just set up a cluster of 3 nodes with the following configuration:

Network:
NIC#1:
1G - Corosync
1G - not used
NIC#2:
10G - VM traffic
10G - Ceph

iperf between the Ceph interfaces shows bandwidth in the range of 8.16 - 9.41 Gbits/sec.
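
For reference, those numbers come from a plain iperf run between the Ceph-network IPs of two nodes, roughly like this (iperf3 syntax shown, the address is a placeholder):
Code:
# on the first node
iperf3 -s
# on the second node, targeting the first node's Ceph-network IP
iperf3 -c 10.10.10.1 -t 30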

Storage:
4x3.84TB SSD SAMSUNG MZWLR3T8HBLS-00007 - Ceph OSDs for RBD storage (Disk Images)
1x1.92TB Kingston DC500M - Proxmox, Backups, ISO

Advertised speeds are: Read - 7000 MB/s, Write - 3800 MB/s
I used dd to test the write speed directly on the NVMe drive mounted as ext4, and it looks good:
Code:
root@PROXMOX-SERVER-01:~# dd if=/dev/zero of=/media/nvme/test1.img bs=2G count=10 oflag=dsync
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+10 records in
0+10 records out
21474795520 bytes (21 GB, 20 GiB) copied, 19.0494 s, 1.1 GB/s
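
Since Ceph OSDs do lots of small synchronous writes rather than streaming zeros, a 4k sync fio run on the same mount is probably a fairer baseline for the drive. A minimal sketch (untested, test file on the same mount as above):
Code:
fio --name=nvme-sync-4k --filename=/media/nvme/fio_test --size=4G --ioengine=libaio --direct=1 --fsync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based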

However, Rados benchmark shows really poor write performance (25MB/sec):
Code:
root@PROXMOX-SERVER-01:~# rados bench -p ceph 15 write -t 16 --object-size=4K
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 15 seconds or 0 objects
...
Total time run:         15.0123
Total writes made:      93145
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     24.2366
Stddev Bandwidth:       1.54596
Max bandwidth (MB/sec): 25.3398
Min bandwidth (MB/sec): 20.168
Average IOPS:           6204
Stddev IOPS:            395.765
Max IOPS:               6487
Min IOPS:               5163
Average Latency(s):     0.00257616
Stddev Latency(s):      0.00281054
Max latency(s):         0.221893
Min latency(s):         0.00132077

fio inside a VM is a bit better but still very poor (42.5MB/s):
Code:
root@ubuntu-vm:~$ fio --ioengine=psync --filename=/tmp/test_fio --size=1G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=16
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=16
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=40.0MiB/s][w=10 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=63638: Fri Sep 13 08:55:58 2024
  write: IOPS=10, BW=40.5MiB/s (42.5MB/s)(2436MiB/60110msec); 0 zone resets
    clat (msec): min=48, max=287, avg=98.30, stdev=24.50
     lat (msec): min=48, max=287, avg=98.69, stdev=24.51
    clat percentiles (msec):
     |  1.00th=[   54],  5.00th=[   67], 10.00th=[   74], 20.00th=[   82],
     | 30.00th=[   86], 40.00th=[   89], 50.00th=[   95], 60.00th=[  106],
     | 70.00th=[  111], 80.00th=[  115], 90.00th=[  121], 95.00th=[  130],
     | 99.00th=[  161], 99.50th=[  207], 99.90th=[  288], 99.95th=[  288],
     | 99.99th=[  288]
   bw (  KiB/s): min=24576, max=49250, per=100.00%, avg=41519.04, stdev=5490.84, samples=120
   iops        : min=    6, max=   12, avg=10.00, stdev= 1.36, samples=120
  lat (msec)   : 50=0.33%, 100=53.53%, 250=45.65%, 500=0.49%
  cpu          : usr=0.48%, sys=0.35%, ctx=1218, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,609,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=40.5MiB/s (42.5MB/s), 40.5MiB/s-40.5MiB/s (42.5MB/s-42.5MB/s), io=2436MiB (2554MB), run=60110-60110msec

Disk stats (read/write):
    dm-0: ios=0/3063, merge=0/0, ticks=0/61452, in_queue=61452, util=98.95%, aggrios=0/4277, aggrmerge=0/617, aggrticks=0/191397, aggrin_queue=191674, aggrutil=99.31%
  sda: ios=0/4277, merge=0/617, ticks=0/191397, in_queue=191674, util=99.31%
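
One thing I noticed afterwards: with ioengine=psync the iodepth=16 setting has no effect (synchronous engines keep only one I/O in flight), so this job is really a single sync write stream, and at ~98 ms average latency per 4 MiB write that works out to roughly 40 MB/s. An asynchronous variant that actually keeps 16 writes in flight would look something like this (same test file, --sync=1 dropped):
Code:
fio --ioengine=libaio --filename=/tmp/test_fio --size=1G --time_based --name=fio-aio --group_reporting --runtime=60 --direct=1 --rw=write --bs=4M --numjobs=1 --iodepth=16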

Read speeds are a bit better but still very far from SSD specs and network bandwidth:
rados - 137MB/sec
fio - 236MB/sec
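
(For anyone reproducing the read numbers: rados bench needs objects left over from a write run with --no-cleanup before the sequential read test works, roughly like this:)
Code:
rados bench -p ceph 15 write -t 16 --no-cleanup
rados bench -p ceph 15 seq -t 16
rados -p ceph cleanup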

Why is Ceph performance so much slower than both the network and the SSDs? What can I do to improve it? (I used the default settings when creating the cluster.)
 
What CPUs do the nodes have? Are the BIOS settings set to max performance / low latency? Maybe the vendor has some guides on that too.

Did you set a target_ratio for the pool? And what is the current pg_num of the pool?
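
Both are quick to check (replace the pool name), e.g.:
Code:
ceph osd pool autoscale-status
ceph osd pool get <poolname> pg_num
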
10 Gbit might be one bottleneck as well, especially with SSDs that fast.

However, Rados benchmark shows really poor write performance (25MB/sec):
What if you test it with larger object sizes? rados bench defaults to 4 MiB objects, for example.
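
For example, simply dropping --object-size falls back to that 4 MiB default:
Code:
rados bench -p ceph 30 write -t 16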

fio inside a VM is a bit better but still very poor (42.5MB/s):

What is the config of the VM? qm config {vmid}