[SOLVED] Ceph snaptrim causing performance impact on whole cluster since update

Lephisto · Aug 22, 2023

Hi,

I upgraded a cluster all the way from Proxmox 6.2/Ceph 14.x to Proxmox 8.0/Ceph 17.x (latest). The hardware is EPYC servers, all-flash NVMe. I can rule out hardware issues, and I can reproduce the problem.

Everything has been running fine so far, except that the whole system gets slowed down when I delete snapshots. All OSD processes shoot up to 100% CPU utilisation. I have read here and there that deleting all snapshots and then restarting all OSDs fixes it.

Can someone confirm this or give more fine-grained advice on how to solve it?

thanks,
meph

aaron · Aug 22, 2023

Probably due to the switch of the OSD scheduler. See https://pve.proxmox.com/wiki/Ceph_mclock_tuning for more information and for what you can configure to limit the performance impact of the snaptrims.

Lephisto · Aug 22, 2023

An interesting fact:

- This issue does not occur on clusters that were "born" as 7.x / Pacific or Quincy.
- The I/O impact is so severe that the workload can barely run; I/O is very laggy.

I will look into the mclock tuning thing.

Lephisto · Aug 22, 2023

Hm, even with these parameters a snapshot delete brings client I/O down to nearly 0:

Code:

ceph tell osd.* injectargs "--osd_mclock_profile=custom"
ceph tell osd.* injectargs "--osd_mclock_scheduler_client_wgt=4"
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_lim=100"
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_res=100"

I guess I don't need to restart all OSDs?
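For reference, a minimal sketch of how one might check whether values set with injectargs actually reached a running OSD. The function name show_mclock_settings and the default target osd.0 are made up for illustration. Keep in mind that injectargs only changes the running daemons, so the values are gone after the next OSD restart unless they are also stored with ceph config set.

```shell
# Print the scheduler-related settings a running OSD is currently using.
# show_mclock_settings and the osd.0 default are illustrative names only.
show_mclock_settings() {
    osd="${1:-osd.0}"
    for opt in osd_op_queue \
               osd_mclock_profile \
               osd_mclock_scheduler_client_wgt \
               osd_mclock_scheduler_background_recovery_lim \
               osd_mclock_scheduler_background_recovery_res; do
        printf '%s = ' "$opt"
        # "ceph config show" reports the value in effect on the running daemon
        ceph config show "$osd" "$opt"
    done
}
```

Called as show_mclock_settings osd.3, it prints one "option = value" line per setting for that OSD.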

Lephisto · Aug 22, 2023

Okay, a report on my progress:

After trying to fine-tune the OPS weights and limits, it just does not work for me. Going back to the old scheduler solved the problem and everything works like a charm again. I know it will be deprecated in the future. I will keep trying to figure out the right values, but it would be nice if things worked out of the box.

Still, I am curious why this effect only seems to appear on clusters that were born in 6.x/Nautilus.
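The post does not show the exact commands used, so the following is only a sketch of how switching back to the old WPQ scheduler could look. The function names are hypothetical. osd_op_queue is only read when an OSD starts, so the OSDs have to be restarted afterwards, preferably one node at a time while watching the cluster health.

```shell
# Store WPQ as the op queue for all OSDs in the cluster configuration.
# use_wpq_scheduler is an illustrative name, not an existing tool.
use_wpq_scheduler() {
    ceph config set osd osd_op_queue wpq
}

# Restart all OSDs on the local node so they pick up the new op queue.
# Run this on one node at a time and wait for HEALTH_OK before moving on.
restart_local_osds() {
    systemctl restart ceph-osd.target
    ceph health
}
```

To return to mclock later, the same option can be set back to mclock_scheduler, again followed by OSD restarts.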

aaron · Aug 24, 2023

Looking at the Ceph docs that are linked in the wiki page, snap trims are part of the background_best_effort type: https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#mclock-client-types

So limiting background_recovery will probably not have any effect if snaptrims are what is causing the issues.
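Following that reasoning, the knobs to try would be the background_best_effort ones rather than background_recovery. A rough sketch, assuming the custom profile is in use; the function name and the example value are hypothetical, and whether this helps was not tested in this thread. The unit of the limit depends on the Ceph release (IOPS in older builds, a fraction of the OSD capacity in newer ones, as far as I know), so check the mclock config reference for the installed version before picking a number.

```shell
# Cap the mclock class that snap trimming belongs to (background_best_effort).
# limit_best_effort is an illustrative name; $1 is the limit value to apply.
limit_best_effort() {
    lim="$1"
    # The per-class settings are only honoured with the custom profile
    ceph config set osd osd_mclock_profile custom
    ceph config set osd osd_mclock_scheduler_background_best_effort_lim "$lim"
}
```

For example, limit_best_effort 0.1 would request a limit of 0.1 in whatever unit the installed release uses.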

Lephisto · Aug 28, 2023

I think I am narrowing the problem down.

On a production cluster I see a huge performance impact even with the old WPQ scheduler (not as bad, but still not funny) on a few OSDs in the cluster:

osd  commit_latency(ms)  apply_latency(ms)
 13                   0                  0
 14                   0                  0
 12                   1                  1
 15                   0                  0
 19                   0                  0
 18                   0                  0
 17                   0                  0
  0                   1                  1
  1                   0                  0
  2                   0                  0
  3                  37                 37
 16                   0                  0
  4                   0                  0
  5                   0                  0
  6                   0                  0
  7                   0                  0
  8                   0                  0
  9                   0                  0
 10                   8                  8
 11                   0                  0

This is under normal load (no snaptrim or anything else IOPS-heavy running). Only osd.3 and osd.10 look suspect. These are also the OSDs whose CPU usage spikes.

I dug into the telemetry and it confirms this:

The two blue lines on top are osd.3 and osd.10. (The scale is logarithmic.)
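Spotting such outliers by eye gets tedious on larger clusters. A small sketch that filters ceph osd perf style output for OSDs above a latency threshold; the function name slow_osds and the 5 ms default are assumptions chosen for illustration.

```shell
# Read "ceph osd perf" style output on stdin and print every OSD whose
# commit or apply latency (ms) is above the threshold given as $1.
# slow_osds and the default of 5 ms are illustrative choices.
slow_osds() {
    threshold_ms="${1:-5}"
    awk -v t="$threshold_ms" '
        # Skip the header: only rows whose first field is a numeric OSD id
        $1 ~ /^[0-9]+$/ && ($2 + 0 > t + 0 || $3 + 0 > t + 0) {
            print "osd." $1, $2, $3
        }
    '
}
```

Usage would be ceph osd perf | slow_osds 5. Fed with the table above, only osd.3 and osd.10 pass that threshold.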

This happened while snapshotting + exporting diff + removing snapshot.

I have the feeling this might be related to other comparable reports:

https://forum.proxmox.com/threads/s...m-6-to-7-and-from-nautilus-to-pacific.116327/

I also noticed an overall degraded read performance even if no snaptrim is running.

Ideas?

aaron · Aug 29, 2023

What you could try is to destroy and recreate these OSDs, one at a time, and check how they behave afterward. If they are in the same node, you could recreate them at the same time, since they won't be sharing any replicas.
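On a Proxmox VE node, that procedure could be sketched roughly as below. The function name and the device path in the usage note are placeholders, the commands have to run on the node that hosts the OSD, and destroying with --cleanup wipes the disk, so double-check the ID and device first.

```shell
# Recreate a single OSD: drain it, destroy it, create it again on the same disk.
# recreate_osd is an illustrative name; $1 = OSD id, $2 = block device.
recreate_osd() {
    id="$1"
    dev="$2"
    # Mark the OSD out so its data is moved to the remaining OSDs
    ceph osd out "$id"
    # Wait until Ceph reports that removing the OSD no longer risks data
    until ceph osd safe-to-destroy "osd.$id"; do
        sleep 60
    done
    systemctl stop "ceph-osd@$id.service"
    pveceph osd destroy "$id" --cleanup
    pveceph osd create "$dev"
}
```

A call such as recreate_osd 3 /dev/nvme0n1 (device name made up) handles one OSD; wait for the cluster to return to HEALTH_OK before doing the next one.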

Lephisto · Aug 29, 2023

I suspected this might be required and indeed:

Recreating osd.3 and osd.10 solved my whole issue. As it turns out, some OSDs apparently end up corrupted in some way when their internal structures are converted during the upgrade to Pacific. Still, I couldn't find anything wrong in the log files; only the suspicious read latency and CPU usage pointed to it.

I would advise everyone to make use of the Ceph Telemetry module. It was very helpful tracking this down.

Thanks for your help.
