Ceph latency remediation?

jaxx

Hi all,

I'm having what seems to be a network bottleneck.

Context: one of my clients wants to revamp his infrastructure and was already happy with PVE servers despite having only local ZFS-backed images, missing out on the broad possibilities offered by Ceph... I wanted to push him to go for Ceph with a layer of CephFS, with mountpoints in the containers to share data across horizontally scaled backends.
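To be clear on the intended setup, it's a plain bind mount of a CephFS subfolder into each container, roughly along these lines; the VMID and paths below are just placeholders:

Code:

# Hypothetical example: bind-mount a CephFS subfolder (already mounted on the
# host, e.g. via the PVE cephfs storage) into container 101
pct set 101 -mp0 /mnt/pve/cephfs/web-shared,mp=/shares/web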

I already run such setups without a hitch in our own racks, with a 10G bonded loop between machines that use SSD-journaled magnetic drives; we don't have a huge payload to manage, but we've never suffered from it.

Due to the limited availability of the desired servers at OVH, and a bit scared by the DC fire a while back, the client ordered two (of the future three) nodes in distant locations (northern and eastern France)... and then it hit me: "Oh cr*p, we're going to have latency issues with Ceph"; the average ping is 11ms (the private network provides 12Gbps of bandwidth).

Both are beefy AMD Epyc-based servers with small system SSDs and 4 enterprise-grade NVMe drives (Samsung PM983 1.92TB) fully dedicated to being OSDs.

Well, I went ahead and benched the setup a bit, and it didn't turn out to be that ugly.

Code:
ONE CEPH NODE (crush map tweaked to chooseleaf type osd to stop it complaining; see the sketch after the benchmark output):


local ceph rbd:


root@web-back-01:~/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

write: IOPS=48.0k, BW=188MiB/s (197MB/s)(12.0GiB/65501msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=188MiB/s (197MB/s), 188MiB/s-188MiB/s (197MB/s-197MB/s), io=12.0GiB (12.9GB), run=65501-65501msec


root@web-back-01:~/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

16x : write: IOPS=2630, BW=164MiB/s (172MB/s)(9984MiB/60734msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=2628MiB/s (2755MB/s), 160MiB/s-170MiB/s (168MB/s-178MB/s), io=156GiB (168GB), run=60192-60824msec



cephfs mount point:

root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

write: IOPS=71.5k, BW=279MiB/s (293MB/s)(16.0GiB/62300msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=279MiB/s (293MB/s), 279MiB/s-279MiB/s (293MB/s-293MB/s), io=16.0GiB (18.2GB), run=62300-62300msec


root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

write: IOPS=8653, BW=541MiB/s (567MB/s)(32.3GiB/61060msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=8727MiB/s (9151MB/s), 507MiB/s-594MiB/s (532MB/s-623MB/s), io=520GiB (559GB), run=60755-61062msec



TWO CEPH NODES (once the second was ordered, crushmap set back to original) cephfs:

root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

  write: IOPS=60.8k, BW=238MiB/s (249MB/s)(14.4GiB/61919msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=238MiB/s (249MB/s), 238MiB/s-238MiB/s (249MB/s-249MB/s), io=14.4GiB (15.4GB), run=61919-61919msec



root@web-back-01:/shares/web/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

16x  write: IOPS=10.2k, BW=639MiB/s (670MB/s)(39.5GiB/63339msec); 0 zone resets

Run status group 0 (all jobs):

  WRITE: bw=10.8GiB/s (11.6GB/s), 566MiB/s-808MiB/s (593MB/s-847MB/s), io=686GiB (736GB), run=62542-63350msec
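For reference, the "chooseleaf type osd" tweak mentioned above boils down to a replicated CRUSH rule whose failure domain is the OSD rather than the host, so a size-2 pool can keep both copies on the single node; a rough sketch, with the rule and pool names being placeholders:

Code:

# Replicated rule with "osd" as the failure domain instead of "host"
ceph osd crush rule create-replicated replicated_osd default osd
# Point the pools at it while only one node exists (pool names are examples)
ceph osd pool set ceph-vm crush_rule replicated_osd
ceph osd pool set cephfs_data crush_rule replicated_osd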



Issue is:
Once the real payload was deployed, performance turned out to be excruciatingly painful...
npm / composer installs take ages, whether on CephFS or directly within the container's RBD disk, and overall the web workload is just plain slow; even MariaDB is abnormally slow.

Pools are all size/min_size 2/1 (there's a third, smaller dev node in the cluster acting as a third mon for safety), with 64 PGs for images and 64 PGs for cephfs.
(It will be 3/2 once the third node is ordered.)
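For completeness, checking and later bumping those values would look roughly like this; the pool name is just an example:

Code:

# Inspect the current replication settings of a pool
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_data min_size
# Once the third node is in place, move to 3/2
ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_data min_size 2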

Is there a way to localize Ceph client<>RBD interactions to the local host (and its OSDs) and let Ceph sync data asynchronously to the remote node(s), without having it wait for each receiving OSD to ack the writes?

Willing to try different solutions... The goal being: good-enough replication with a shared CephFS, with subfolders mounted as mountpoints in various backend containers (logs are centralized elsewhere).

I also set the global rbd_cache to true, but it seems to be ignored, since rbd_cache is enabled anyway.
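For what it's worth, rbd_cache defaults to true on recent releases, which would explain why setting it changes nothing; one way to double-check what the client actually sees, with the pool/image names as placeholders:

Code:

# Value stored in the cluster configuration database for clients
ceph config get client rbd_cache
# Effective per-image settings (pool/image are placeholders)
rbd config image list ceph-vm/vm-101-disk-0 | grep rbd_cache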

Thanks
 
You are out of luck. Ceph is very latency aware and does not like geographically stretched clusters.
I basically know that, I just never had to worry about it until recently... I was merely asking if there were any CRUSH rules or settings that would let it ack writes once they arrive at the primary OSD (letting it replicate to the remaining OSDs in the background), and whether it's possible to have it elect a nearby OSD as the primary (or at least at the rack/DC level).
 
if there were any CRUSH rules or settings that would let it ack writes once they arrive at the primary OSD (letting it replicate to the remaining OSDs in the background), and whether it's possible to have it elect a nearby OSD as the primary (or at least at the rack/DC level)
Unfortunately, this would go against the consistency principles of Ceph. The client only gets the write ACK after all copies have been created.
 
In that situation, I would opt for local ZFS + ZFS replication between the nodes.
And with ping times of 11ms you are also about to get into interesting territory regarding the PVE cluster itself, as Corosync (used by PVE for cluster communication) also wants low latency.
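For what it's worth, PVE's built-in storage replication (zfs send/receive under the hood) would look something like this; the guest ID, target node and schedule are placeholders:

Code:

# Replicate guest 101 to node pve-rbx every 15 minutes
pvesr create-local-job 101-0 pve-rbx --schedule "*/15"
# Check the state of all replication jobs
pvesr status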

Ceph really really wants low latency and does not know anything about locality.

In addition to the previous answer by @gurubert, check out the Ceph docs. At the end of this section there is a diagram that shows how many times the network will be involved until the client (VM) gets the ACK for a write operation

You might want to consider having two clusters, each much more local, and using something like RBD mirroring to keep a current copy in the remote failover cluster. You would also have to copy the VM configs to the failover cluster manually on a regular basis.
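A rough sketch of what snapshot-based mirroring between two such clusters could look like; the pool, image and site names are assumptions, and an rbd-mirror daemon has to run on the receiving side:

Code:

# On the primary cluster: enable image-mode mirroring on the pool
rbd mirror pool enable ceph-vm image
# Create a bootstrap token and import it on the secondary cluster
rbd mirror pool peer bootstrap create --site-name site-a ceph-vm > peer-token
rbd mirror pool peer bootstrap import --site-name site-b ceph-vm peer-token   # run on the secondary
# Enable snapshot-based mirroring for a single image
rbd mirror image enable ceph-vm/vm-101-disk-0 snapshot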
 
In addition to the previous answer by @gurubert, check out the Ceph docs. At the end of this section there is a diagram that shows how many times the network will be involved until the client (VM) gets the ACK for a write operation

I've actually had that page open for a while :)

We're going to order closer nodes... two in the same DC (<0.2ms); the third and worst one would be ~2ms away... though that one might end up being a mirror of some sort if it remains suboptimal. It's just not what we were hoping for (CephFS mountpoints for shared data across 'triple' containers serving high-rate web content).
 
I've actually had that page open for a while :)
hehe okay. We are also working on that guide to include a different approach using RBD snapshots instead of full-on journaling. In some situations we have seen users run into the journal on the source growing forever because the remote site was not able to keep up. A snapshot-based approach should not run into that situation as easily.
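Presumably the snapshot-based mode would then be driven by a mirror snapshot schedule, something along these lines, with the pool name and interval as placeholders:

Code:

# Take a mirror snapshot of all mirrored images in the pool every 30 minutes
rbd mirror snapshot schedule add --pool ceph-vm 30m
rbd mirror snapshot schedule ls --pool ceph-vm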
 
I have one customer at OVH; between the Roubaix and Gravelines locations, latency is around 2-3ms.
(I don't know the latency to Strasbourg, but my customer only has a third monitor at Strasbourg for quorum, with the OSDs between Roubaix and Gravelines.)


Another way is indeed to have two Ceph clusters, one per location, and use the rbd-mirror feature (either via the mirror journal or via snapshot export/import).
 
I have one customer at OVH; between the Roubaix and Gravelines locations, latency is around 2-3ms.
Well, the initial plan was one node each in GRA, RBX and SBG; we hit the hurdle of SBG being too far away (11ms; OVH's smokeping ( http://sbg1-sbg.smokeping.ovh.net/smokeping?target=OVH.DCs.RBX ) is on par with the latencies we see on the vRack network), so we'll be regrouping to GRA+RBX, with two nodes in one location (possibly RBX, which has slightly less latency to the outside world)... If performance with one node 2ms away is OK, it'll be a normal triple-node Ceph setup; if not, a dual-node Ceph (+ extra mon) plus replication to the distant node.
 
