Ceph random high latency

mart.v

Well-Known Member
Mar 21, 2018
32
0
46
44
Hi,

I'm running proxmox cluster with 5 nodes and pure SSD Ceph storage (currently about 20 OSDs, all enterprise grade INTEL S3710/S4500, bluestore). Nodes are connected through 10Gbit network. Storage is about 50% full. Everything (system, proxmox, ceph) is updated to latest versions. On top of that there are dozens of KVM machines with Linux. During normal hours there is very small load (read/write about 20 MB/s, below 1000 iops, ethernet traffic below 100 MB/s). Ceph has two pools, 576 (512+64) PG total (so about 30 PG per OSD).

I have discovered that sometimes there is a noticeable lag across the whole cluster. I can measure it even during "normal" load (there is no heavy I/O task), for example with ioping ("ioping -DW -c 60" ran 10 times, so it covers 10 minutes window) - as you can see, there are sometimes 5 seconds latencies:

min/avg/max/mdev = 1.8 ms / 70.0 ms / 906.7 ms / 191.6 ms
min/avg/max/mdev = 1.9 ms / 198.7 ms / 4.4 s / 682.1 ms
min/avg/max/mdev = 1.9 ms / 41.6 ms / 1.9 s / 251.3 ms
min/avg/max/mdev = 1.8 ms / 11.7 ms / 258.3 ms / 47.1 ms
min/avg/max/mdev = 1.5 ms / 251.8 ms / 5.0 s / 956.0 ms
min/avg/max/mdev = 1.9 ms / 11.4 ms / 426.0 ms / 55.7 ms
min/avg/max/mdev = 1.5 ms / 37.1 ms / 1.5 s / 195.4 ms
min/avg/max/mdev = 1.8 ms / 79.4 ms / 1.7 s / 321.2 ms
min/avg/max/mdev = 1.8 ms / 2.3 ms / 4.5 ms / 474 us
min/avg/max/mdev = 1.7 ms / 4.9 ms / 49.1 ms / 8.2 ms

It is even worse during heavy I/O operations like rsync of milions small files. This operation affects not only current KVM machine (it becames almost unusable) but the whole cluster. During operations like this I can see high OSD latencies (both apply and commit) via proxmox interface; sometimes way above 100ms. During normal hours there is usually visible zero latency (or small numbers like 2 ms).

I tried to measure packetloss with ping, omping (everything smooth), network throughput with iperf (which went like 9,8Gbps).

Do you have any tips where to look?
Thanks!
 
Hi,
this is really strange.

is your cluster of mix of 3710 and 4500 ?

I neved had problem with 3710 in past. I'm not sure about 4500, but maybe a firmware bug (maybe latency spike, or something like that ?)
 
Ceph has two pools, 576 (512+64) PG total (so about 30 PG per OSD).

Maybe that's a low PGs/OSD rate. Afaik there's a minimum of 75 PGs/OSD, rule of thumb 100PGs/OSD.
What kind of HBA do you use?
We're running 4 Nodes with Intel 4500 SSDs @ 3000-6000 iops without any noticeable lags so far.
 
Maybe that's a low PGs/OSD rate. Afaik there's a minimum of 75 PGs/OSD, rule of thumb 100PGs/OSD.
What kind of HBA do you use?
We're running 4 Nodes with Intel 4500 SSDs @ 3000-6000 iops without any noticeable lags so far.

512 pg for 20 osd (with replication x3) seem to be ok

https://ceph.com/pgcalc/
 
also, do you have tried to do a bench with "fio" inside your vm, instead ioping ? (I'm not sure how ioping is working, but maybe it's always write/read same block, so , on the same pg / osd )

fio is IMHO very bad with small block size. It gets better with increasing BS,

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=512M --numjobs=8 --runtime=60 --group_reporting
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.1
Starting 8 processes
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=1632KiB/s][r=0,w=408 IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=8): err= 0: pid=23531: Tue Oct 2 23:24:25 2018
write: IOPS=2468, BW=9873KiB/s (10.1MB/s)(579MiB/60012msec)
slat (usec): min=5, max=4627.4k, avg=3199.44, stdev=41260.80
clat (usec): min=2, max=56086, avg= 9.86, stdev=515.92
lat (usec): min=9, max=4627.4k, avg=3215.47, stdev=41265.90
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3],
| 30.00th=[ 3], 40.00th=[ 3], 50.00th=[ 3], 60.00th=[ 3],
| 70.00th=[ 3], 80.00th=[ 3], 90.00th=[ 4], 95.00th=[ 4],
| 99.00th=[ 7], 99.50th=[ 9], 99.90th=[ 20], 99.95th=[ 28],
| 99.99th=[35390]
bw ( KiB/s): min= 8, max=58027, per=13.33%, avg=1315.66, stdev=6437.99, samples=896
iops : min= 2, max=14506, avg=328.90, stdev=1609.43, samples=896
lat (usec) : 4=95.71%, 10=3.86%, 20=0.34%, 50=0.06%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
cpu : usr=0.17%, sys=0.56%, ctx=2654, majf=0, minf=219
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,148125,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=9873KiB/s (10.1MB/s), 9873KiB/s-9873KiB/s (10.1MB/s-10.1MB/s), io=579MiB (607MB), run=60012-60012msec

Disk stats (read/write):
dm-0: ios=355/46457, merge=0/0, ticks=25469/262504, in_queue=287976, util=97.64%, aggrios=336/45923, aggrmerge=5/705, aggrticks=25343/221010, aggrin_queue=246347, aggrutil=97.57%
sda: ios=336/45923, merge=5/705, ticks=25343/221010, in_queue=246347, util=97.57%

I have experienced one bigger hang for a couple of seconds on the VM (write rate dropped to zero) and at I saw big latency on some of the OSDs (>1000ms). The lagging OSDs are not always the same, so I didn't find anything in common (for example bad node / bad disk).
 
>> w=408 IOPS

something is really wrong is your setup. (I'm able to reach 1500iops from 1 vm with iodepth=1 and 4k block for write and 3000-3500 for 4k randread, with 18 osd (replica x3)).
what is the result with a bigger queue depth (64 for exemple) ?

can you send more infos:

was is your cpu model ? ceph.conf ?
distro / kernel version on ceph nodes.
and a sample vm config too.

maybe try to disable spectre/meltdown (
GRUB_CMDLINE_LINUX=" pti=off spectre_v2=off l1tf=off spec_store_bypass_disable=off")


Edit: can you test with fio and --direct=1, with only 1 job ?

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=1 --size=512M --runtime=60
 
Last edited:
Hi mart.v
I had same scenario with you. I had multiple VM running windows server 2012/2016 and linux. VM had database application. I have discovered that some VM had unresponsive running process cause ceph osd high latency. Subsequently other VM that use same underlying storage got the lag impact. Maybe VM operating system had misconfiguration or need update. Shutdown and start again VM solve my problem.
 
First of all, thank you for your time.

what is the result with a bigger queue depth (64 for exemple) ?

fio --name=randwrite --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4k --direct=1 --size=512M --runtime=60

Way better: Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=27.6MiB/s][r=0,w=7073 IOPS][eta 00m:00s]

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=1 --size=512M --runtime=60

Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1348KiB/s][r=0,w=337 IOPS][eta 00m:00s]

was is your cpu model ? ceph.conf ?
distro / kernel version on ceph nodes.
and a sample vm config too.

Ceph nodes are running on standard proxmox distro with latest kernel 4.15.18-5-pve.

Most of the nodes are Intel Xeon Silver 4114, but I have one smaller node (Intel Xeon E-1230Lv3) with two OSDs. Do you think that it can influence rest of the cluster? OSD latencies are on all nodes/all osds, not only on the slowest one.

Ceph config is default with no customizations:
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 172.16.254.0/24
         fsid = *** (masked)
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 172.16.254.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.node1]
         host = node1
         mon addr = 172.16.254.101:6789

[mon.node4]
         host = node4
         mon addr = 172.16.254.104:6789

[mon.node2]
         host = node2
         mon addr = 172.16.254.102:6789

Sample VM config:

Code:
agent: 1
boot: cd
bootdisk: scsi0
cores: 4
hotplug: disk,network,usb,memory,cpu
memory: 12288
name: *** (masked)
net0: virtio=*** (masked),bridge=vmbr0
net1: virtio=*** (masked),bridge=vmbr1
numa: 1
onboot: 1
ostype: l26
scsi0: proxmox:vm-132-disk-0,discard=on,size=600G
scsihw: virtio-scsi-pci
smbios1: uuid=*** (masked)
sockets: 1

maybe try to disable spectre/meltdown (
GRUB_CMDLINE_LINUX=" pti=off spectre_v2=off l1tf=off spec_store_bypass_disable=off")

Well, I can try that, but it's not generally a good idea to disable this, right? :-)
 
Hi mart.v
I had same scenario with you. I had multiple VM running windows server 2012/2016 and linux. VM had database application. I have discovered that some VM had unresponsive running process cause ceph osd high latency. Subsequently other VM that use same underlying storage got the lag impact. Maybe VM operating system had misconfiguration or need update. Shutdown and start again VM solve my problem.

Thank you for this tip. How did you identified the problematic VM? None of my VMs is using 100% CPU or making excessive disk read/write (according to graph available in proxmox).
 
Most of the nodes are Intel Xeon Silver 4114, but I have one smaller node (Intel Xeon E-1230Lv3) with two OSDs. Do you think that it can influence rest of the cluster? OSD latencies are on all nodes/all osds, not only on the slowest one.

I tried to set noout flag and stopped all OSDs on the slowest node, fio results were the same.
 
I can't explain the latency spike, (when 1vm can impact other vms)

but with only 1vm, and low iodepth bench and small block,
the latency is impacted by : cpu frequency (client rbd + ceph cluster) + network latency.

I'm using 3ghz cpu for client and ceph cluster, that's why I have more iops.

you can reduce cpu usage (so reduce latency), by disabling debug logs in ceph.conf (then stop/start vm), and disable cephx. (but that need a full ceph cluster restart)

Code:
[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 rbd_skip_partial_discard = true
 perf = true
 mutex_perf_counter = false
 throttler_perf_counter = false
 
the latency is impacted by : cpu frequency (client rbd + ceph cluster) + network latency.

I'm using 3ghz cpu for client and ceph cluster, that's why I have more iops.

I understand. But none of CPUs (threads) are utilized on 100 %. Even during the fio tests all threads on all CPUs have load like this:
%Cpu0 : 27.0 us, 2.8 sy, 0.0 ni, 69.9 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
 
So you think that I should raise number of PGs for the main pool from 512 to 1024?

I don't think so, but 30PGS/OSD is a mismatch to your overall PGs of 576 @ 3 replicas.
Make a check with
Code:
ceph osd df tree