OSD performance problem after upgrading Ceph 14.2 => 15.2.15

mohnewald

Hello List,

we upgraded from Ceph 14.2 to 15.2.15, still on Debian 10.

The upgrade went quite smoothly and ended with HEALTH_OK.

Two hours later, slow I/O problems started and VMs got unresponsive, right around the snapshot cron jobs...

It looks like the OSDs (SSD) are really busy now:

Code:
root@cluster5-node01:~# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  4                 341                341
 11                   0                  0
 15                 314                314
 10                   0                  0
 16                1293               1293
  9                1373               1373
 18                 568                568
  1                 986                986
 20                   0                  0
  3                 377                377
 17                2034               2034
  6                 233                233
  5                1960               1960
 14                1116               1116
 13                1336               1336
 12                 657                657
  8                1364               1364
  2                  96                 96


These high latencies cause OSDs to be marked down.
I don't think the SSDs all broke all of a sudden.
There is also some snaptrim going on.

"iostat -dx 3" looks good. 0-1 %util.
Not much reading or writing is happening, but top shows 100% CPU wait (iowait) for the OSD processes.
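
To see what the slow OSDs are actually stuck on, I also poked at the admin socket; something like this (osd.17 is just an example, run it on the node hosting that OSD):

Code:
# ops currently in flight on one OSD
ceph daemon osd.17 dump_ops_in_flight

# recently completed slow ops, with per-step timestamps
ceph daemon osd.17 dump_historic_slow_ops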

Any idea what's going on here and how I can fix it?
Here is my ceph status:
Code:
root@cluster5-node01:~# ceph -s
  cluster:
    id:     e1153ea5-bb07-4548-83a9-edd8bae3eeec
    health: HEALTH_WARN
            noout flag(s) set
            1 osds down
            4 nearfull osd(s)
            Degraded data redundancy: 833161/14948934 objects degraded (5.573%), 296 pgs degraded, 193 pgs undersized
            1 pool(s) do not have an application enabled
            3 pool(s) nearfull
            9 daemons have recently crashed
            36 slow ops, oldest one blocked for 225 sec, daemons [osd.1,osd.12,osd.13,osd.14,osd.15,osd.17,osd.9] have slow ops.

  services:
    mon: 3 daemons, quorum cluster5-node01,cluster5-node02,cluster5-node03 (age 112m)
    mgr: cluster5-node03(active, since 66m), standbys: cluster5-node02, cluster5-node01
    osd: 18 osds: 17 up (since 2m), 18 in (since 9M); 39 remapped pgs
         flags noout

  data:
    pools:   3 pools, 1143 pgs
    objects: 4.98M objects, 17 TiB
    usage:   53 TiB used, 10 TiB / 63 TiB avail
    pgs:     833161/14948934 objects degraded (5.573%)
             172492/14948934 objects misplaced (1.154%)
             400 active+clean+snaptrim_wait
             231 active+clean
             177 active+undersized+degraded
             164 active+clean+snaptrim
             103 active+recovery_wait+degraded
             32  active+remapped+backfill_wait
             6   active+undersized+degraded+remapped+backfill_wait
             5   active+recovery_wait+undersized+degraded
             5   active+clean+snaptrim+laggy
             4   active+recovery_wait
             4   active
             3   active+recovering+degraded
             2   active+undersized
             2   active+recovering+undersized+degraded
             2   active+clean+laggy
             1   active+recovering
             1   active+undersized+remapped+backfill_wait
             1   active+clean+snaptrim_wait+laggy

  io:
    client:   7.9 KiB/s rd, 5.0 KiB/s wr, 1 op/s rd, 1 op/s wr
    recovery: 0 B/s, 0 objects/s
 
Hello,

I've just had an issue pretty similar to the one you had: unresponsive VMs, a lot of pending snaptrim operations, and almost no I/O on the disks (Samsung PM1733) or on the network (25G). At the pace snaptrim was working, it would have taken nearly 20 hours to finish (!!!!!). I disabled snaptrim operations to let the VMs recover and keep working. Currently about half of the PGs still have pending snaptrim operations.
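
In case it helps anyone: pausing snap trimming cluster-wide is just an OSD flag, which is what I used (remember to unset it again later):

Code:
ceph osd set nosnaptrim      # pause all snap trimming cluster-wide
# ... let the VMs recover ...
ceph osd unset nosnaptrim    # resume trimming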

This cluster is on Proxmox 7.3 and Ceph was upgraded from 14.2.x to 15.2.x a couple of months ago. I'm 100% sure that I have created and removed big snapshots in this cluster after the upgrade without any issue.

I've read another post of yours about some settings regarding snaptrim sleep and so on. Did you ever find the root cause of this issue? Maybe some setting needs to be adjusted when upgrading from 14.2 to 15.2? Or maybe it's worth upgrading to 16.2?
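
For anyone else landing here, the settings I mean are the snaptrim throttles, something along these lines (the value in the last line is purely illustrative, not a recommendation):

Code:
# show the current throttle values
ceph config get osd osd_snap_trim_sleep
ceph config get osd osd_pg_max_concurrent_snap_trims

# e.g. sleep 2s between trim operations on every OSD
ceph config set osd osd_snap_trim_sleep 2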

Thanks in advance.
 
We never found the root cause of the problem.
Our new upgrade policy is to migrate the VMs to an empty/healthy cluster and never run big release upgrades on a live system.
We have 7 clusters (3 nodes each), and one cluster is always kept empty as a spare.

We use an rbd export/import/diff script to move the VMs between clusters; the core of it is sketched below.
That's a bit more work, but it's safe.
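
Roughly, the script boils down to rbd piped over ssh (pool, image, snapshot, and host names below are placeholders):

Code:
# initial full copy of an image to the spare cluster
rbd export rbd/vm-100-disk-0 - | ssh spare-node01 rbd import - rbd/vm-100-disk-0

# later: incremental sync between two snapshots
# (the target image must already have the sync1 snapshot)
rbd snap create rbd/vm-100-disk-0@sync2
rbd export-diff --from-snap sync1 rbd/vm-100-disk-0@sync2 - \
  | ssh spare-node01 rbd import-diff - rbd/vm-100-disk-0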
 
