Caching for ceph writes ?

Nov 23, 2023
Howdy folks,

I'm running an all-NVMe (PCIe 4.0) Proxmox cluster with hyperconverged Ceph on 8 nodes,
connected over an SFP+ network.

I have around 16ms of write lag
as shown by 'rbd perf image iotop'
and this slows down mysql
as it does a sync for each operation.

I've 'fixed' the issue by adding
innodb_flush_log_at_trx_commit = 0
and
innodb_flush_log_at_timeout = 3
to mysql's configuration
.. which would result in max 3 seconds of data loss in the event of a crash.
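
The cost of one sync per operation can be estimated with simple arithmetic. This is a minimal sketch, assuming every commit waits for exactly one synchronous write and that a single connection issues its commits strictly one after another. The function name is illustrative and not part of MySQL or Ceph.

```python
def max_serial_commits_per_second(sync_latency_ms: float) -> float:
    """Upper bound on commits/s for one connection that must wait for a
    synchronous write of `sync_latency_ms` before starting the next commit."""
    if sync_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / sync_latency_ms


if __name__ == "__main__":
    # 16 ms is the write latency reported by 'rbd perf image iotop' above.
    print(max_serial_commits_per_second(16.0))  # 62.5
```

Several connections committing in parallel can exceed this figure in aggregate, so it only bounds a single serial writer.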


I wonder which tricks you are using to speed up the writes !

Is there some magic option for ceph ?
Do I need to add some hardware ?


Thank you,
Mike
 
Writeback on the VM helps a bit, but not for fsync or transaction commits (so indeed innodb_flush_log_at_trx_commit should help).

It's also possible to add a local cache disk on the hypervisor where the VM is running, for persistent writeback:
https://docs.ceph.com/en/pacific/rbd/rbd-persistent-write-back-cache/

but I have never tested it. (I think 1 or 2 forum users have reported that it works; I'm not sure how live migration behaves with it.)
 
Thank you all for your feedback.

@jasonsansone no i didn't, the documentation seems a bit complicated, do I have to add another 1 or 2 ssd/nvme to each server ?

@sb-jw yes it's a 'plain' hyperconverged setup (didn't do any tuning), each server has 2 ssd for the os and 2 nvme for ceph. 8 servers in total, dedicated sfp+ network with lacp. around 60tb of storage in total.
 
What kind of NVMEs? Do they have power loss protection (PLP)?
 
Yes, and I would suggest Optane or nothing. Ceph write latency is always going to be high, since a write is not acknowledged until the last replica has committed it, which over a network means several hops that each add latency. With the RBD persistent cache the write is considered committed as soon as it hits the local, on-node cache device. It is then replicated across the cluster. I found it extremely effective when I used Ceph.
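
The difference between the two acknowledgement paths can be sketched with a toy model. It is a simplification and rests on assumptions: the primary OSD writes locally and forwards to the replicas in parallel, round-trip times are fixed, and all numbers in the example are hypothetical rather than measured on this cluster. The function names are made up for illustration.

```python
def replicated_ack_latency_ms(client_rtt_ms, replica_rtt_ms, commit_ms_per_osd):
    """Ack latency when the primary OSD waits for every replica.

    commit_ms_per_osd[0] is the primary's device commit time, the remaining
    entries are the replicas'. Replica writes are assumed to run in parallel.
    """
    primary, *replicas = commit_ms_per_osd
    slowest_replica = max((replica_rtt_ms + c for c in replicas), default=0.0)
    return client_rtt_ms + max(primary, slowest_replica)


def local_cache_ack_latency_ms(cache_commit_ms):
    """Ack latency when a local cache device acknowledges the write."""
    return cache_commit_ms


if __name__ == "__main__":
    # Hypothetical values: 0.25 ms round trips, one replica drive stalling at 10 ms.
    print(replicated_ack_latency_ms(0.25, 0.25, [1.0, 1.0, 10.0]))  # 10.5
    print(local_cache_ack_latency_ms(0.1))
```

In this model the slowest drive among the three replicas sets the latency of every write, which is why one stalling consumer drive hurts the whole pool.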
 
Which network components do you use?
Which hashing algorithm do you use for LACP?
Which NVMe model do you use?
Are you using Replica 3?
How many Mons/Mgr do you currently have?
Did you use CephFS?
How many Pools do you have?
How many PGs do your pools have?
Do you have metrics for your CEPH in terms of latency and IOPS etc?
 
Thank you again for your feedback.

@aaron no i don't think so, they are 'gaming' nvme

@jasonsansone that's interesting, can I ask you why "optane or nothing" ?
I mean wouldn't a m.2 over a pci card be similar ? (or two in raid-1)
Or dare I say it, wouldn't a local sata ssd be faster than my current 16ms write latency over ceph ?
Also I guess a 256gb drive would be enough ?


@sb-jw follows the list:
network cards: Intel X520-DA2
lacp algorithm: don't know but the switches are Juniper EX4550
nvme: some western digital sn850x and some firecuda 530
yes replica 3
3 mons and 3 mgr
no CephFS, I'm using rbd for the vms
1 pool
129 pgs (between 19 and 28 pg per drive, if 'ceph pg dump osds' serves me right)
around 16ms write latency and around 2.5k write op/s

Thank you.
 
@aaron no i don't think so, they are 'gaming' nvme
Well, consumer SSDs (SATA or NVMe) are optimized for short bursts of large writes, but under sustained load their performance is rather meh. Additionally, without PLP they should only ACK a sync write once the data is in the non-volatile flash memory, so that it survives a power loss. SSDs with PLP can ACK the write once it is in their RAM, as the capacitors provide enough energy to write everything to the non-volatile flash if power is lost. Ceph is mostly doing sync writes...

The other comments might also have an influence. In particular, a pg_num that is too low for the pool can drastically reduce performance. A large MTU can also increase performance, as the ratio of data to protocol overhead for each network transfer gets a lot better. You have a lot of network transfer for anything in Ceph.
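
The data-to-overhead ratio mentioned above can be worked out per frame. This is a rough sketch, assuming TCP over IPv4 without options and untagged Ethernet frames; the 9000-byte MTU is only an example of a common jumbo-frame setting and is not taken from this thread.

```python
ETHERNET_OVERHEAD = 38   # preamble + SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40      # IPv4 header 20 + TCP header 20, no options (assumption)


def payload_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that carry TCP payload for a full-size frame."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(payload_efficiency(mtu), 4))
    # 1500 0.9493
    # 9000 0.9914
```

The byte-level gain is a few percent. A larger MTU also means fewer packets for the same amount of data, which this sketch does not model.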
 
lacp algorithm: don't know but the switches are Juniper EX4550
Can you share your network configuration?

I would recommend Arista and switch to MLAG instead of VC.
nvme: some western digital sn850x and some firecuda 530
That's your problem. Never use consumer hardware for Ceph. You should invest in something like Samsung SM/PM 863a/883/893 etc. These SSDs typically have next to no latency.
 
Thanks everyone for your help and inputs

@aaron yes I agree, those 'gaming' drives have decent cache and can sustain ~30 minutes saturating the network (for example when adding some osd to ceph), but after that they degrade, slowly reaching 50MB/s (!) after 45 minutes ...

About the PGs, for now I have 129 pgs and i'm using 24tb of 58tb.
I didn't touch it from the default configuration.
Is that ok ?

@jasonsansone thank you that makes perfect sense

@sb-jw i've checked the latency and you are right, they are the problem.
Thanks for the feedback on the network.
I can't provide the configuration of the switches right now because I didn't configure them myself.

@spirit that's some very nice low latency, thanks for the info

It's great that we can share and compare our numbers and results among different hardware !


I may or may not confess that I went down a rabbit hole
during the last few hours ..

I removed an nvme drive from ceph and did some write latency tests, with 'ioping'
and compared with other hardware i have available:

server with hperc: around 800 us (cheating because of cache)
spinning rust old Blue WD: between 7 and 22 ms
lame desktop sata ssd: consistent 5.1 ms
firecuda 530: between 1 and 10 ms

the nvme firecuda 530 result (which i'm using with ceph) was strange
inconsistent with other benchmarks on windows: https://www.storagereview.com/review/seagate-firecuda-530-ssd-review
which report around 0.25 ms.

I tried to change the filesystem from xfs to ext4 and got a sub-ms result.
Tried again after a couple minutes and the ioping went back to between 1 and 10 ms.
Changed again the filesystem from ext4 to xfs and got a sub-ms result.
After a couple of minutes it went back to between 1 and 10 ms.
...
I discovered that (consumer ? m.2 ?) NVMe drives have something called Autonomous Power State Transitions (APST) enabled,
which changes the power state of the drive depending on the load
(see 'nvme id-ctrl /dev/nvme0', at the bottom of the output)
and which can be disabled from grub.

I've disabled it on all of my servers,
rebooted them,
checked that it was disabled with 'cat /sys/module/nvme_core/parameters/default_ps_max_latency_us'.
turned on my VMs
but got the same ceph write latency as before (around 16ms).

Well .. it was worth a try :)
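
The check described above can be scripted so it is easy to repeat on every node. This is a minimal sketch, assuming that a value of 0 in the nvme_core parameter file means APST is disabled; the helper names are made up for illustration.

```python
from pathlib import Path

# Parameter file quoted in the post above.
APST_PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def apst_disabled_from_value(raw: str) -> bool:
    """Interpret the parameter file content: 0 is taken to mean APST is off."""
    return int(raw.strip()) == 0


def apst_disabled(param_path: Path = APST_PARAM) -> bool:
    """Read the parameter from sysfs and report whether APST is disabled."""
    return apst_disabled_from_value(param_path.read_text())


if __name__ == "__main__":
    if APST_PARAM.exists():
        print("APST disabled:", apst_disabled())
    else:
        print("nvme_core parameter not found on this host")
```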
 
Thanks for the feedback on the network.
I can't provide the configuration of the switches right now because I didn't configure them myself.
The configuration of your nodes is enough for me for now. In the switch configuration, Layer3+4 is usually not set explicitly anyway; either the switch can do it or it can't, and I think the EX4550 can.

What would be even more interesting would indeed be the configured MTU, for which you would have to have the switch config.

Depending on how busy the VC link is, "local-bias" could be a way to optimize a bit. With LACP, packets are distributed across both links, and the switch basically does not consider whether it would make more sense for the traffic to stay on one member. Traffic may therefore have to cross between the two members, which adds latency and, if you are unlucky, leads to a busy VC link and effects you would rather not get to know.
See Juniper KB-Article: https://www.juniper.net/documentati...-interfaces-aex-aggregated-ether-options.html

I removed an nvme drive from ceph and did some write latency tests, with 'ioping'
This is one way to evaluate the whole thing, but it shouldn't be your only test. For example, you should check the write latency without caches, and ideally use a sustained sequential stream to see when performance drops. With NVMe in particular, temperature is decisive, so always make sure the drives are sufficiently cooled.
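
One way to look at write latency without the page cache in the way is to force every write to stable storage with fsync. This is a rough sketch, not a replacement for ioping or fio: it measures whatever filesystem holds `directory`, the block size and count are arbitrary defaults, and the function name is invented for illustration.

```python
import os
import statistics
import tempfile
import time


def fsync_write_latencies_ms(directory: str, count: int = 100, block_size: int = 4096):
    """Write `count` blocks, fsync after each one, return per-write latencies in ms."""
    block = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # wait until the data is reported as stable
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    samples = fsync_write_latencies_ms(".")
    print(f"median {statistics.median(samples):.3f} ms, max {max(samples):.3f} ms")
```

Run it from a directory on the drive under test. A result on tmpfs or another RAM-backed filesystem says nothing about the device.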
1 pool
129 pgs (between 19 and 28 pg per drive, if 'ceph pg dump osds' serves me right)
I would also recommend that you activate the autoscaler. I'm guessing your PG count is a bit too low. An increase could give you additional performance gains.

See Docs for Autoscaler: https://docs.ceph.com/en/latest/rados/operations/placement-groups/
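
The PG numbers in this thread can be sanity-checked with the common rule of thumb of roughly 100 PG replicas per OSD, rounded to a power of two. This is a sketch under assumptions: 16 OSDs is inferred from 8 servers with 2 NVMe drives each, the target of 100 is a guideline rather than a hard rule, and the function names are illustrative.

```python
def pgs_per_osd(pg_num: int, replica_size: int, osd_count: int) -> float:
    """Average number of PG replicas each OSD holds."""
    return pg_num * replica_size / osd_count


def suggested_pg_num(osd_count: int, replica_size: int, target_per_osd: int = 100) -> int:
    """Rule-of-thumb pg_num for a single pool, rounded to the nearest power of two."""
    raw = max(1, round(osd_count * target_per_osd / replica_size))
    lower = 1 << (raw.bit_length() - 1)  # largest power of two <= raw
    upper = lower * 2
    return lower if raw - lower <= upper - raw else upper


if __name__ == "__main__":
    # Assumed cluster: 16 OSDs, replica 3, one pool with 129 PGs.
    print(pgs_per_osd(129, 3, 16))   # 24.1875
    print(suggested_pg_num(16, 3))   # 512
```

The average of about 24 PG replicas per OSD agrees with the 19 to 28 per drive reported earlier in the thread.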
which changes the power profile of the drives, depending on the load
Power profile itself is a good keyword. What have you set on your nodes? You should disable every power-saving setting for Ceph and let the nodes run at full power, which can significantly reduce latency. I recently experienced this again and was able to almost halve the latency (about 5 ms to 2 ms). The additional electricity consumption is often not that much, and it is well worth it.
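
One power-saving setting that can be inspected from Linux is the cpufreq governor. This is a minimal sketch, assuming the nodes expose the usual cpufreq files in sysfs; it does not see BIOS-level power profiles, and the function names are invented for illustration.

```python
from pathlib import Path


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU directory name (e.g. 'cpu0') to its cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        governors[gov_file.parent.parent.name] = gov_file.read_text().strip()
    return governors


def cpus_not_in_performance(governors: dict) -> list:
    """Return the CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    found = read_governors()
    print(f"{len(found)} CPUs with a cpufreq governor")
    print("not in performance mode:", cpus_not_in_performance(found))
```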
 
@sb-jw hello good day and thanks again for your feedback, much appreciated

my MTU is 1500, checked with 'ifconfig' and 'ip a', I think it's the default

thanks for the Juniper link, I'll try asap different bond modes

with the usual workload my NVMe are between 35C and 56C

it seems that the autoscaler is enabled by default, on my proxmox installation
if I read this correctly, the "ceph osd pool autoscale-status" command gives me AUTOSCALE = on

I have three 1U chassis with 7713 epyc CPU
and five 2U chassis with two 7713 epyc CPUs and three 7T83 epyc CPUs (basically same as the 7713)

On the 1U chassis I've crippled the clock of the CPU by disabling boosting, from the bios
so that they run from 3.1 GHz to 2 GHz
(I wanted to use the liquid cooled Dynatron L29 but wasn't able to fit it into the chassis, so I'm using a common fan with heatsink)
otherwise the CPU temp goes up to 78C, the poor fans spin at 18K and the power consumption to 320W.
With the crippled CPU the temp stays at 58C, fans spin at 12K and the power consumption is 220W.

On those crippled CPUs I'm also running the monitors and managers;
I've read around that CEPH's monitors are sensitive to the CPU clock.
Is that true ?
Would you suggest that I 'move' the monitors to the higher-clocked servers ?

Thank you.