Hi,
I'm running a Proxmox cluster with 5 nodes and pure-SSD Ceph storage (currently about 20 OSDs, all enterprise-grade Intel S3710/S4500, BlueStore). The nodes are connected through a 10Gbit network. Storage is about 50% full. Everything (system, Proxmox, Ceph) is updated to the latest versions. On top of that there are dozens of KVM machines running Linux. During normal hours the load is very light (read/write about 20 MB/s, below 1000 IOPS, Ethernet traffic below 100 MB/s). Ceph has two pools, 576 (512+64) PGs total (so about 30 PGs per OSD).
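For reference, this is roughly how I check the pool layout and the per-OSD PG counts; it is just the standard Ceph CLI, exact output omitted:

# pools, replica size, pg_num
ceph osd pool ls detail

# per-OSD utilization and PG count (PGS column)
ceph osd df tree

# overall health, including any slow request warnings
ceph -s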
I have discovered that there is sometimes a noticeable lag across the whole cluster. I can measure it even during "normal" load (no heavy I/O task running), for example with ioping ("ioping -DW -c 60", run 10 times, so it covers a 10-minute window; the exact loop is shown after the numbers below). As you can see, there are sometimes 5-second latencies:
min/avg/max/mdev = 1.8 ms / 70.0 ms / 906.7 ms / 191.6 ms
min/avg/max/mdev = 1.9 ms / 198.7 ms / 4.4 s / 682.1 ms
min/avg/max/mdev = 1.9 ms / 41.6 ms / 1.9 s / 251.3 ms
min/avg/max/mdev = 1.8 ms / 11.7 ms / 258.3 ms / 47.1 ms
min/avg/max/mdev = 1.5 ms / 251.8 ms / 5.0 s / 956.0 ms
min/avg/max/mdev = 1.9 ms / 11.4 ms / 426.0 ms / 55.7 ms
min/avg/max/mdev = 1.5 ms / 37.1 ms / 1.5 s / 195.4 ms
min/avg/max/mdev = 1.8 ms / 79.4 ms / 1.7 s / 321.2 ms
min/avg/max/mdev = 1.8 ms / 2.3 ms / 4.5 ms / 474 us
min/avg/max/mdev = 1.7 ms / 4.9 ms / 49.1 ms / 8.2 ms
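For completeness, this is more or less the loop that produced the numbers above: 10 runs of 60 direct-I/O write pings, one request per second. The test directory is just a placeholder for a path inside one of the VMs:

# ~10 minute window: 10 x 60 one-second direct write pings, keep only the summary line
for i in $(seq 1 10); do
    ioping -DW -c 60 /some/test/dir | grep mdev
done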
It is even worse during heavy I/O operations like an rsync of millions of small files. Such an operation affects not only the KVM machine it runs in (that one becomes almost unusable) but the whole cluster. While it is running I can see high OSD latencies (both apply and commit) in the Proxmox interface, sometimes way above 100 ms. During normal hours the displayed latency is usually zero (or small numbers like 2 ms).
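To see the same thing outside the GUI, I can watch per-OSD commit/apply latency from the CLI while the rsync runs (again nothing special, just standard Ceph tooling; osd.0 is only an example):

# per-OSD commit/apply latency in ms, refreshed every 2 seconds
watch -n 2 'ceph osd perf'

# recent slow/in-flight operations on a specific OSD (run on the node hosting it)
ceph daemon osd.0 dump_historic_ops | less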
I tried to measure packet loss with ping and omping (everything smooth) and network throughput with iperf (which reached about 9.8 Gbps).
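Roughly what I ran for the network checks (node names are placeholders):

# packet loss / latency between nodes
ping -c 1000 -i 0.2 node2

# multicast check across the cluster nodes (run on each node)
omping -c 600 -i 1 node1 node2 node3 node4 node5

# raw TCP throughput: server on one node, client on another
iperf -s                 # on node1
iperf -c node1 -t 30     # on node2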
Do you have any tips on where to look?
Thanks!