Reducing Ceph Write Latency

maomaocake · Jul 4, 2024

I have a 16 X 1 TB ceph cluster over 4 nodes and my write latency is kinda slow. I already have the db and wal on a separate SSD, Separate networks for the front and backend. How do I get the write latency to less than 16 ms? Other than going to full SSD cluster ofc.

Ping results
Backend:

Code:

1 -> 2
37 packets transmitted, 37 received, 0% packet loss, time 36841ms
rtt min/avg/max/mdev = 0.037/0.064/0.098/0.015 ms

1 -> 3
29 packets transmitted, 29 received, 0% packet loss, time 28668ms
rtt min/avg/max/mdev = 0.040/0.055/0.080/0.012 ms

1-> 4
29 packets transmitted, 29 received, 0% packet loss, time 28677ms
rtt min/avg/max/mdev = 0.028/0.108/0.221/0.071 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29682ms
rtt min/avg/max/mdev = 0.040/0.055/0.116/0.015 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.032/0.114/0.255/0.076 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29681ms
rtt min/avg/max/mdev = 0.034/0.105/0.227/0.066 ms

Frontend network:

Code:

1 -> 2
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29704ms
rtt min/avg/max/mdev = 0.240/0.338/0.523/0.057 ms

1 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.345/0.432/0.031 ms

2 -> 3
30 packets transmitted, 30 received, 0% packet loss, time 29688ms
rtt min/avg/max/mdev = 0.315/0.350/0.473/0.029 ms

2 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29697ms
rtt min/avg/max/mdev = 0.236/0.307/0.400/0.043 ms

3 -> 4
30 packets transmitted, 30 received, 0% packet loss, time 29698ms
rtt min/avg/max/mdev = 0.312/0.365/0.468/0.041 ms

Code:

fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --rw=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][0.3%][w=4KiB/s][w=1 IOPS][eta 05h:30m:28s]
write_latency: (groupid=0, jobs=1): err= 0: pid=1492283: Thu Jul  4 03:00:21 2024
  write: IOPS=13, BW=53.0KiB/s (54.3kB/s)(3220KiB/60702msec); 0 zone resets
    slat (usec): min=36, max=532432, avg=709.84, stdev=18764.09
    clat (msec): min=6, max=3886, avg=74.69, stdev=214.05
     lat (msec): min=6, max=3886, avg=75.40, stdev=220.43
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   26], 80.00th=[   86], 90.00th=[  205], 95.00th=[  326],
     | 99.00th=[  936], 99.50th=[ 1045], 99.90th=[ 3876], 99.95th=[ 3876],
     | 99.99th=[ 3876]
   bw (  KiB/s): min=    8, max=  288, per=100.00%, avg=67.00, stdev=61.38, samples=96
   iops        : min=    2, max=   72, avg=16.75, stdev=15.35, samples=96
  lat (msec)   : 10=57.27%, 20=9.81%, 50=9.32%, 100=5.09%, 250=10.43%
  lat (msec)   : 500=5.59%, 750=0.87%, 1000=0.87%, 2000=0.62%, >=2000=0.12%
  cpu          : usr=0.01%, sys=0.08%, ctx=807, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,805,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=53.0KiB/s (54.3kB/s), 53.0KiB/s-53.0KiB/s (54.3kB/s-54.3kB/s), io=3220KiB (3297kB), run=60702-60702msec

Disk stats (read/write):
  vda: ios=591/3853, merge=26/3558, ticks=24103/178271, in_queue=226908, util=90.29%

jdancer · Jul 4, 2024

I use SAS HDDs in production. Since they are meant to be used on HW RAID controllers with a BBU, their write cache is turned off. May want to check if the HDDs have their cache enabled. VMs range from databases to DHCP/PXE servers. Not hurting for IOPS.

I use the following optimizations learned through trial-and-error. YMMV.

Code:

    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type to 'Host'
    Set VM CPU NUMA on servers with 2 more physical CPU sockets
    Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
    Set VM Qemu-Guest-Agent software installed
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option

alexskysilk · Jul 4, 2024

jdancer said:
Set VM CPU NUMA on servers with 2 more physical CPU sockets

probably safe/desirable to do even with one socket, especially on modern AMD cpus.

spirit · Jul 5, 2024

I already have the db and wal on a separate SSD

ssd model ? (you need DC grade ssd with supercapacitor/PLP)

kellogs · Jul 5, 2024

The following is my result. 8x8TB SATA NAS OSD with 2x SSD as Wal disk per node and total 64 OSD
front end 2x10 and backend 2x10 lacp

fio --name=write_latency --ioengine=libaio --direct=1 --sync=1 --bs=4k --size=1G --numjobs=1 --runtime=60 --time_based --r w=write --filename=/tmp/testfile
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4 096B, ioengine=libaio, iodepth=1
write_latency: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4 096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
write_latency: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=368KiB/s][w=92 IOPS][eta 00m:00s]
write_latency: (groupid=0, jobs=1): err= 0: pid=21291: Fri Jul 5 06:24:29 2024
write: IOPS=63, BW=254KiB/s (260kB/s)(14.9MiB/60083msec); 0 zone resets
slat (usec): min=36, max=461, avg=52.05, stdev=11.24
clat (msec): min=4, max=258, avg=15.67, stdev=22.43
lat (msec): min=4, max=258, avg=15.73, stdev=22.43
clat percentiles (msec):
| 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 7], 20.00th=[ 7],
| 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 8], 60.00th=[ 9],
| 70.00th=[ 9], 80.00th=[ 10], 90.00th=[ 47], 95.00th=[ 66],
| 99.00th=[ 108], 99.50th=[ 125], 99.90th=[ 184], 99.95th=[ 192],
| 99.99th=[ 259]
bw ( KiB/s): min= 112, max= 432, per=99.90%, avg=254.53, stdev=61.65, samples=120
iops : min= 28, max= 108, avg=63.63, stdev=15.41, samples=120
lat (msec) : 10=81.41%, 20=3.64%, 50=5.79%, 100=7.91%, 250=1.23%
lat (msec) : 500=0.03%
cpu : usr=0.05%, sys=0.39%, ctx=3821, majf=0, minf=12
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,3819,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=254KiB/s (260kB/s), 254KiB/s-254KiB/s (260kB/s-260kB/s), io=14.9MiB (15.6MB), run=60083-60083msec

Disk stats (read/write):
dm-0: ios=0/23769, merge=0/0, ticks=0/120072, in_queue=120072, util=99.84%, aggrios=0/15743, aggrmerge=0/8090, aggrticks=0/70092, aggrin_queue=72015, aggrutil=99.75%
sda: ios=0/15743, merge=0/8090, ticks=0/70092, in_queue=72015, util=99.75%
bash: syntax error near unexpected token `g=0'

maomaocake · Jul 7, 2024

spirit said:
ssd model ? (you need DC grade ssd with supercapacitor/PLP)

nothing particularly good its a KINGSTON SNVS/250GCN. Isn't the supercapacitor just for power cuts? I have UPSes on my server so thats not really a concern

Search

Search

Reducing Ceph Write Latency

maomaocake

Member

jdancer

Renowned Member

alexskysilk

Distinguished Member

spirit

Distinguished Member

kellogs

Member

maomaocake

Member