Reducing Ceph Write Latency

maomaocake

Member
Feb 13, 2022
I have a 16 x 1 TB Ceph cluster over 4 nodes and my write latency is rather slow. I already have the DB and WAL on a separate SSD and separate networks for the front end and back end. How do I get the write latency below 16 ms, other than going to a full SSD cluster?
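Before touching any settings it is worth checking whether a few specific OSDs are dragging the average up. A minimal sketch of the commands (run on a node with the admin keyring; output omitted):

Code:
# per-OSD commit/apply latency as the cluster currently sees it
ceph osd perf

# simple write benchmark of a single OSD (here osd.0); repeat for a few OSDs
ceph tell osd.0 bench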


Ping results
Backend:
Code:
1 -> 2
37 packets transmitted, 37 received, 0% packet loss, time 36841ms
rtt min/avg/max/mdev = 0.037/0.064/0.098/0.015 ms

1 -> 3
29 packets transmitted, 29 received, 0% packet loss, time 28668ms
rtt min/avg/max/mdev = 0.040/0.055/0.080/0.012 ms

1-> 4
29 packets transmitted, 29 received, 0% packet loss, time 28677ms
rtt min/avg/max/mdev = 0.028/0.108/0.221/0.071 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29682ms
rtt min/avg/max/mdev = 0.040/0.055/0.116/0.015 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.032/0.114/0.255/0.076 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29681ms
rtt min/avg/max/mdev = 0.034/0.105/0.227/0.066 ms

Frontend network:
Code:
1 -> 2
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.345/0.432/0.031 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29688ms
rtt min/avg/max/mdev = 0.315/0.350/0.473/0.029 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29697ms
rtt min/avg/max/mdev = 0.236/0.307/0.400/0.043 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.365/0.468/0.041 ms



Code:
fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --rw=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][0.3%][w=4KiB/s][w=1 IOPS][eta 05h:30m:28s]
write_latency: (groupid=0, jobs=1): err= 0: pid=1492283: Thu Jul  4 03:00:21 2024
  write: IOPS=13, BW=53.0KiB/s (54.3kB/s)(3220KiB/60702msec); 0 zone resets
    slat (usec): min=36, max=532432, avg=709.84, stdev=18764.09
    clat (msec): min=6, max=3886, avg=74.69, stdev=214.05
     lat (msec): min=6, max=3886, avg=75.40, stdev=220.43
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   26], 80.00th=[   86], 90.00th=[  205], 95.00th=[  326],
     | 99.00th=[  936], 99.50th=[ 1045], 99.90th=[ 3876], 99.95th=[ 3876],
     | 99.99th=[ 3876]
   bw (  KiB/s): min=    8, max=  288, per=100.00%, avg=67.00, stdev=61.38, samples=96
   iops        : min=    2, max=   72, avg=16.75, stdev=15.35, samples=96
  lat (msec)   : 10=57.27%, 20=9.81%, 50=9.32%, 100=5.09%, 250=10.43%
  lat (msec)   : 500=5.59%, 750=0.87%, 1000=0.87%, 2000=0.62%, >=2000=0.12%
  cpu          : usr=0.01%, sys=0.08%, ctx=807, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,805,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=53.0KiB/s (54.3kB/s), 53.0KiB/s-53.0KiB/s (54.3kB/s-54.3kB/s), io=3220KiB (3297kB), run=60702-60702msec

Disk stats (read/write):
  vda: ios=591/3853, merge=26/3558, ticks=24103/178271, in_queue=226908, util=90.29%
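The run above goes through the guest filesystem and virtio layer as well as RBD, so part of the ~75 ms average may not be Ceph at all. The same 4 KiB, queue-depth-1 write pattern can be sent straight to the pool with rados bench to isolate the cluster; a sketch, assuming the pool is called 'rbd' (substitute your own pool name):

Code:
# 60 seconds of 4 KiB writes, one outstanding op, directly against the pool
rados bench -p rbd 60 write -b 4096 -t 1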
 
I use SAS HDDs in production. Since they are meant to be used with HW RAID controllers with a BBU, their write cache is turned off by default, so you may want to check whether your HDDs have their cache enabled. My VMs range from databases to DHCP/PXE servers, and I'm not hurting for IOPS.

I use the following optimizations, learned through trial and error (YMMV); a quick way to verify a few of them is sketched after the list.
Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type to 'Host'
    Set VM CPU NUMA on servers with 2 or more physical CPU sockets
    Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
    Set VM Qemu-Guest-Agent software installed
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option
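A few of the items above can be verified from the shell before relying on them; a minimal sketch, with /dev/sdX and the guest block device as placeholders:

Code:
# confirm the SAS drive's write cache state (WCE: 1 means enabled)
sdparm --get=WCE /dev/sdX

# inside the VM: check which IO scheduler is active ([none] should be selected)
cat /sys/block/sdX/queue/scheduler
# switch it at runtime if needed
echo none > /sys/block/sdX/queue/scheduler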
 
Here is my result: 8 x 8 TB SATA NAS OSDs with 2 SSDs as WAL disks per node, 64 OSDs in total.
Front-end network is 2x10G LACP and back-end is 2x10G LACP.

Code:
fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --rw=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=368KiB/s][w=92 IOPS][eta 00m:00s]
write_latency: (groupid=0, jobs=1): err= 0: pid=21291: Fri Jul  5 06:24:29 2024
  write: IOPS=63, BW=254KiB/s (260kB/s)(14.9MiB/60083msec); 0 zone resets
    slat (usec): min=36, max=461, avg=52.05, stdev=11.24
    clat (msec): min=4, max=258, avg=15.67, stdev=22.43
     lat (msec): min=4, max=258, avg=15.73, stdev=22.43
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    6], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
     | 70.00th=[    9], 80.00th=[   10], 90.00th=[   47], 95.00th=[   66],
     | 99.00th=[  108], 99.50th=[  125], 99.90th=[  184], 99.95th=[  192],
     | 99.99th=[  259]
   bw (  KiB/s): min=  112, max=  432, per=99.90%, avg=254.53, stdev=61.65, samples=120
   iops        : min=   28, max=  108, avg=63.63, stdev=15.41, samples=120
  lat (msec)   : 10=81.41%, 20=3.64%, 50=5.79%, 100=7.91%, 250=1.23%
  lat (msec)   : 500=0.03%
  cpu          : usr=0.05%, sys=0.39%, ctx=3821, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3819,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=254KiB/s (260kB/s), 254KiB/s-254KiB/s (260kB/s-260kB/s), io=14.9MiB (15.6MB), run=60083-60083msec

Disk stats (read/write):
  dm-0: ios=0/23769, merge=0/0, ticks=0/120072, in_queue=120072, util=99.84%, aggrios=0/15743, aggrmerge=0/8090, aggrticks=0/70092, aggrin_queue=72015, aggrutil=99.75%
  sda: ios=0/15743, merge=0/8090, ticks=0/70092, in_queue=72015, util=99.75%
 
SSD model? (You need a DC-grade SSD with a supercapacitor/PLP.)
Nothing particularly good, it's a KINGSTON SNVS/250GCN. Isn't the supercapacitor just for power cuts? I have UPSes on my servers, so that's not really a concern.
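PLP matters for more than power cuts: a drive with a supercapacitor can safely acknowledge sync/flush writes from its onboard cache, while a consumer NVMe such as the SNVS has to commit every 4 KiB sync write to flash before acknowledging it, and that is roughly the pattern BlueStore's WAL/DB generates, so a UPS on the host does not change the drive's behaviour. To see what the WAL SSD manages on its own, the same 4 KiB sync-write fio job can be pointed at the device; a sketch only, and note that it writes to the raw device, so use a spare disk or partition with no data on it:

Code:
# 4 KiB sync writes at queue depth 1 -- WARNING: destructive to the target device
fio --name=wal_ssd_test --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based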
 
