Reducing Ceph Write Latency

maomaocake

I have a Ceph cluster with 16 x 1 TB OSDs across 4 nodes, and the write latency is quite slow. I already have the DB and WAL on a separate SSD, and separate networks for the front end and back end. How do I get the write latency below 16 ms, other than going to a full SSD cluster of course?
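
Per-OSD and raw pool write latency can be checked directly from one of the nodes, roughly like this ('testpool' is a placeholder for the actual pool name):
Code:
# per-OSD commit/apply latency in ms; a consistently slow OSD points at one disk
ceph osd perf

# raw 4 KiB write latency against a pool, one op in flight for 30 seconds
rados bench -p testpool 30 write -b 4096 -t 1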


Ping results
Backend network:
Code:
1 -> 2
37 packets transmitted, 37 received, 0% packet loss, time 36841ms
rtt min/avg/max/mdev = 0.037/0.064/0.098/0.015 ms

1 -> 3
29 packets transmitted, 29 received, 0% packet loss, time 28668ms
rtt min/avg/max/mdev = 0.040/0.055/0.080/0.012 ms

1 -> 4
29 packets transmitted, 29 received, 0% packet loss, time 28677ms
rtt min/avg/max/mdev = 0.028/0.108/0.221/0.071 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29682ms
rtt min/avg/max/mdev = 0.040/0.055/0.116/0.015 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.032/0.114/0.255/0.076 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29681ms
rtt min/avg/max/mdev = 0.034/0.105/0.227/0.066 ms

Frontend network:
Code:
1 -> 2
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.345/0.432/0.031 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29688ms
rtt min/avg/max/mdev = 0.315/0.350/0.473/0.029 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29697ms
rtt min/avg/max/mdev = 0.236/0.307/0.400/0.043 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.365/0.468/0.041 ms



Code:
fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --rw=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][0.3%][w=4KiB/s][w=1 IOPS][eta 05h:30m:28s]
write_latency: (groupid=0, jobs=1): err= 0: pid=1492283: Thu Jul  4 03:00:21 2024
  write: IOPS=13, BW=53.0KiB/s (54.3kB/s)(3220KiB/60702msec); 0 zone resets
    slat (usec): min=36, max=532432, avg=709.84, stdev=18764.09
    clat (msec): min=6, max=3886, avg=74.69, stdev=214.05
     lat (msec): min=6, max=3886, avg=75.40, stdev=220.43
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   26], 80.00th=[   86], 90.00th=[  205], 95.00th=[  326],
     | 99.00th=[  936], 99.50th=[ 1045], 99.90th=[ 3876], 99.95th=[ 3876],
     | 99.99th=[ 3876]
   bw (  KiB/s): min=    8, max=  288, per=100.00%, avg=67.00, stdev=61.38, samples=96
   iops        : min=    2, max=   72, avg=16.75, stdev=15.35, samples=96
  lat (msec)   : 10=57.27%, 20=9.81%, 50=9.32%, 100=5.09%, 250=10.43%
  lat (msec)   : 500=5.59%, 750=0.87%, 1000=0.87%, 2000=0.62%, >=2000=0.12%
  cpu          : usr=0.01%, sys=0.08%, ctx=807, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,805,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=53.0KiB/s (54.3kB/s), 53.0KiB/s-53.0KiB/s (54.3kB/s-54.3kB/s), io=3220KiB (3297kB), run=60702-60702msec

Disk stats (read/write):
  vda: ios=591/3853, merge=26/3558, ticks=24103/178271, in_queue=226908, util=90.29%
 
I use SAS HDDs in production. Since they are meant to be used with HW RAID controllers that have a BBU, their write cache is turned off by default. You may want to check whether your HDDs have their cache enabled. My VMs range from databases to DHCP/PXE servers, and I'm not hurting for IOPS.
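
A quick way to check the current cache setting before changing anything (the device name is a placeholder; sdparm covers SAS, hdparm covers SATA):
Code:
# SAS: show the Write Cache Enable (WCE) bit, 0 = off, 1 = on
sdparm --get=WCE /dev/sdX

# SATA: show whether the drive's write cache is enabled
hdparm -W /dev/sdX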

I use the following optimizations, learned through trial and error. YMMV. Most of the VM-side settings can also be applied from the CLI, as in the sketch after the list.
Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO SCSI single and enable the IO Thread & Discard options
    Set VM CPU Type to 'Host'
    Set VM CPU NUMA on servers with 2 or more physical CPU sockets
    Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
    Install the Qemu-Guest-Agent software in the VM
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option
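
The CLI sketch mentioned above, assuming Proxmox VE; VM ID 100, the disk volume name, the storage ID 'ceph-pool' and the bridge 'vmbr0' are all placeholders to adapt:
Code:
# CPU type, NUMA and guest agent
qm set 100 --cpu host --numa 1 --agent enabled=1

# VirtIO SCSI single controller; IO thread, discard and no host cache on the disk
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=none,iothread=1,discard=on

# VirtIO NIC with multiqueue matched to the vCPU count (4 here)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4

# use the kernel RBD client for the Ceph RBD storage
pvesm set ceph-pool --krbd 1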
 
The following is my result: 8 x 8 TB SATA NAS OSDs with 2 x SSDs as WAL disks per node, 64 OSDs in total.
Front end 2x 10G and back end 2x 10G LACP.

Code:
fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --rw=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=368KiB/s][w=92 IOPS][eta 00m:00s]
write_latency: (groupid=0, jobs=1): err= 0: pid=21291: Fri Jul 5 06:24:29 2024
  write: IOPS=63, BW=254KiB/s (260kB/s)(14.9MiB/60083msec); 0 zone resets
    slat (usec): min=36, max=461, avg=52.05, stdev=11.24
    clat (msec): min=4, max=258, avg=15.67, stdev=22.43
     lat (msec): min=4, max=258, avg=15.73, stdev=22.43
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    6], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
     | 70.00th=[    9], 80.00th=[   10], 90.00th=[   47], 95.00th=[   66],
     | 99.00th=[  108], 99.50th=[  125], 99.90th=[  184], 99.95th=[  192],
     | 99.99th=[  259]
   bw (  KiB/s): min=  112, max=  432, per=99.90%, avg=254.53, stdev=61.65, samples=120
   iops        : min=   28, max=  108, avg=63.63, stdev=15.41, samples=120
  lat (msec)   : 10=81.41%, 20=3.64%, 50=5.79%, 100=7.91%, 250=1.23%
  lat (msec)   : 500=0.03%
  cpu          : usr=0.05%, sys=0.39%, ctx=3821, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3819,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=254KiB/s (260kB/s), 254KiB/s-254KiB/s (260kB/s-260kB/s), io=14.9MiB (15.6MB), run=60083-60083msec

Disk stats (read/write):
  dm-0: ios=0/23769, merge=0/0, ticks=0/120072, in_queue=120072, util=99.84%, aggrios=0/15743, aggrmerge=0/8090, aggrticks=0/70092, aggrin_queue=72015, aggrutil=99.75%
  sda: ios=0/15743, merge=0/8090, ticks=0/70092, in_queue=72015, util=99.75%
 
SSD model? (You need a DC-grade SSD with supercapacitor/PLP.)
Nothing particularly good, it's a KINGSTON SNVS/250GCN. Isn't the supercapacitor just for power cuts? I have UPSes on my servers, so that's not really a concern.
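
(On the supercapacitor question: it is not only about power cuts. Without PLP the drive has to push every sync write down to flash before acknowledging it, UPS or not; with PLP it can safely acknowledge from its DRAM cache, which is what keeps WAL/DB write latency low. A rough way to see the difference is a sync-write fio run against the WAL/DB SSD itself; the file path below is a placeholder.)
Code:
# 4k sync writes against the WAL/DB SSD, one op in flight
fio --name=wal_ssd_latency --filename=/mnt/wal-ssd/testfile --size=256M \
    --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4k \
    --iodepth=1 --numjobs=1 --runtime=30 --time_based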