[SOLVED] 3 node ceph - performance degraded due to bad disk? affecting other pool? crushmap?

HanS.1997

New Member
Mar 8, 2022
Hi, I am Hans. I have been using Proxmox for quite some time now and have often found valuable help reading in this community. Thanks a lot for so much valuable information!
Today I have a problem I could not solve on my own, so I am writing my first post :-)

I recently inherited a 3-node hyperconverged PVE 7.0 / Ceph cluster which ran, I'd say, flawlessly for quite some time. A few weeks ago disk I/O degraded massively and most of the guests became nearly unusable. It is most noticeable in Windows guests, which now take ages to perform disk operations, sit at around 100% disk utilization, and show response times of several hundred milliseconds.

Ceph shows "HEALTH_OK"; there is no scrub job, no backup job, and no known heavy I/O job running.

The cluster consists of 3 identical nodes, each providing 4 enterprise NVMe drives and 3 consumer-grade SATA SSDs (for bulk storage / very light workloads). Storage traffic runs on a dedicated 25G network.

These OSDs are used by the VMs and containers via two pools, named "ceph-nvme" and "ceph-ssd". A CRUSH map is supposed to keep the two device classes separate.

What I have noticed so far: one OSD (osd.14, one of the consumer-grade SATA SSDs) shows significantly higher "Apply/Commit Latency", around 500 (sometimes even more), while all other OSDs are usually at 0 or below 30. This SSD also shows more wearout (S.M.A.R.T.) than the others (on the "Node" -> "Disks" page). Please have a look at the attached screenshots.

[Attachments: Screenshot 2022-03-08 104723.png, Screenshot 2022-03-08 104919.png]
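
For reference, the same figures can also be cross-checked from the CLI; a rough sketch (the device path /dev/sdX is just a placeholder for the disk backing osd.14):

Code:
# per-OSD commit/apply latency as reported by the OSDs
ceph osd perf
# S.M.A.R.T. data (including wearout) for the suspect SSD
smartctl -a /dev/sdX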

1.) As far as I understand, such a high "Apply/Commit Latency" on this one device can dramatically slow down the whole Ceph I/O (of this pool?), right? So should I swap osd.14 for a new SSD?

2.) But what seems totally weird to me is that the VMs suffering from I/O problems almost exclusively use the "ceph-nvme" storage (for their system drives)... How can that be? According to the CRUSH configuration they should not touch the SSDs at all.

Maybe there is something wrong with my CRUSH setup? Here are some configuration details, I hope this is helpful:

Code:
root@pve-node-02:~# ceph osd crush class ls
[
    "nvme",
    "ssd"
]

Code:
root@pve-node-02:~# ceph osd crush class ls-osd ssd
12
13
14
15
16
17
18
19
20

Code:
root@pve-node-02:~# ceph osd crush class ls-osd nvme
0
1
2
3
4
5
6
7
8
9
10
11

Code:
root@pve-node-02:~# ceph osd crush rule ls
replicated_rule
repl-ssd

Code:
root@pve-node-02:~# ceph osd crush tree --show-shadow
ID   CLASS  WEIGHT    TYPE NAME             
-12    ssd  16.37457  root default~ssd       
 -9    ssd   5.45819      host pve-node-01~ssd
 12    ssd   1.81940          osd.12         
 13    ssd   1.81940          osd.13         
 14    ssd   1.81940          osd.14         
-10    ssd   5.45819      host pve-node-02~ssd
 15    ssd   1.81940          osd.15         
 16    ssd   1.81940          osd.16         
 17    ssd   1.81940          osd.17         
-11    ssd   5.45819      host pve-node-03~ssd
 18    ssd   1.81940          osd.18         
 19    ssd   1.81940          osd.19         
 20    ssd   1.81940          osd.20         
 -2   nvme  20.95917  root default~nvme     
 -4   nvme   6.98639      host pve-node-01~nvme
  0   nvme   1.74660          osd.0         
  1   nvme   1.74660          osd.1         
  2   nvme   1.74660          osd.2         
  3   nvme   1.74660          osd.3         
 -6   nvme   6.98639      host pve-node-02~nvme
  4   nvme   1.74660          osd.4         
  5   nvme   1.74660          osd.5         
  6   nvme   1.74660          osd.6         
  7   nvme   1.74660          osd.7         
 -8   nvme   6.98639      host pve-node-03~nvme
  8   nvme   1.74660          osd.8         
  9   nvme   1.74660          osd.9         
 10   nvme   1.74660          osd.10         
 11   nvme   1.74660          osd.11         
 -1         37.33374  root default           
 -3         12.44458      host pve-node-01     
  0   nvme   1.74660          osd.0         
  1   nvme   1.74660          osd.1         
  2   nvme   1.74660          osd.2         
  3   nvme   1.74660          osd.3         
 12    ssd   1.81940          osd.12         
 13    ssd   1.81940          osd.13         
 14    ssd   1.81940          osd.14         
 -5         12.44458      host pve-node-02     
  4   nvme   1.74660          osd.4         
  5   nvme   1.74660          osd.5         
  6   nvme   1.74660          osd.6         
  7   nvme   1.74660          osd.7         
 15    ssd   1.81940          osd.15         
 16    ssd   1.81940          osd.16         
 17    ssd   1.81940          osd.17         
 -7         12.44458      host pve-node-03     
  8   nvme   1.74660          osd.8         
  9   nvme   1.74660          osd.9         
 10   nvme   1.74660          osd.10         
 11   nvme   1.74660          osd.11         
 18    ssd   1.81940          osd.18         
 19    ssd   1.81940          osd.19         
 20    ssd   1.81940          osd.20


TL;DR: assuming this one consumer SSD is dragging down performance, the impact should still be limited to the SSDs and the VMs using the SSD-based pool. What is the connection to the slow I/O in the VMs that are supposed to use only NVMe storage?

Thanks for any help with this!
Best regards, Hans.
 
What is the output of the following:
ceph osd pool ls detail
ceph osd crush rule dump
ceph osd df tree

Hi @RokaKen, here it is:

Code:
root@pve-node-01:~# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 3612 lfor 0/0/73 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'ceph-nvme' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 3671 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 3488 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 180 flags hashpspool stripe_width 0 application cephfs
pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 181 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs

root@pve-node-01:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "repl-ssd",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -12,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]



root@pve-node-01:~# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME         
-1         37.33374         -   37 TiB  7.4 TiB  7.4 TiB  155 MiB   32 GiB   30 TiB  19.80  1.00    -          root default     
-3         12.44458         -   12 TiB  2.5 TiB  2.5 TiB   52 MiB   11 GiB   10 TiB  19.80  1.00    -              host pve-node-01
 0   nvme   1.74660   1.00000  1.7 TiB  329 GiB  327 GiB  2.3 MiB  1.4 GiB  1.4 TiB  18.37  0.93  124      up          osd.0     
 1   nvme   1.74660   1.00000  1.7 TiB  329 GiB  327 GiB  5.5 MiB  1.5 GiB  1.4 TiB  18.38  0.93  114      up          osd.1     
 2   nvme   1.74660   1.00000  1.7 TiB  215 GiB  214 GiB  5.5 MiB  1.3 GiB  1.5 TiB  12.05  0.61  102      up          osd.2     
 3   nvme   1.74660   1.00000  1.7 TiB  331 GiB  329 GiB   20 MiB  1.5 GiB  1.4 TiB  18.50  0.93  120      up          osd.3     
12    ssd   1.81940   1.00000  1.8 TiB  482 GiB  480 GiB  1.6 MiB  1.8 GiB  1.3 TiB  25.87  1.31  165      up          osd.12   
13    ssd   1.81940   1.00000  1.8 TiB  353 GiB  351 GiB   15 MiB  1.6 GiB  1.5 TiB  18.94  0.96  147      up          osd.13   
14    ssd   1.81940   1.00000  1.8 TiB  485 GiB  483 GiB  1.6 MiB  1.7 GiB  1.3 TiB  26.01  1.31  156      up          osd.14   
-5         12.44458         -   12 TiB  2.5 TiB  2.5 TiB   52 MiB   11 GiB   10 TiB  19.80  1.00    -              host pve-node-02
 4   nvme   1.74660   1.00000  1.7 TiB  329 GiB  327 GiB  5.5 MiB  1.6 GiB  1.4 TiB  18.40  0.93  118      up          osd.4     
 5   nvme   1.74660   1.00000  1.7 TiB  312 GiB  310 GiB  7.0 MiB  1.4 GiB  1.4 TiB  17.43  0.88  117      up          osd.5     
 6   nvme   1.74660   1.00000  1.7 TiB  264 GiB  263 GiB  5.5 MiB  1.5 GiB  1.5 TiB  14.76  0.75  107      up          osd.6     
 7   nvme   1.74660   1.00000  1.7 TiB  332 GiB  330 GiB   16 MiB  1.6 GiB  1.4 TiB  18.57  0.94  115      up          osd.7     
15    ssd   1.81940   1.00000  1.8 TiB  416 GiB  414 GiB  1.5 MiB  1.4 GiB  1.4 TiB  22.31  1.13  157      up          osd.15   
16    ssd   1.81940   1.00000  1.8 TiB  390 GiB  388 GiB  6.2 MiB  1.7 GiB  1.4 TiB  20.92  1.06  154      up          osd.16   
17    ssd   1.81940   1.00000  1.8 TiB  481 GiB  479 GiB   10 MiB  1.9 GiB  1.3 TiB  25.83  1.30  160      up          osd.17   
-7         12.44458         -   12 TiB  2.5 TiB  2.5 TiB   52 MiB   10 GiB   10 TiB  19.80  1.00    -              host pve-node-03
 8   nvme   1.74660   1.00000  1.7 TiB  298 GiB  296 GiB   13 MiB  1.6 GiB  1.5 TiB  16.66  0.84  115      up          osd.8     
 9   nvme   1.74660   1.00000  1.7 TiB  248 GiB  247 GiB  5.5 MiB  1.0 GiB  1.5 TiB  13.87  0.70  117      up          osd.9     
10   nvme   1.74660   1.00000  1.7 TiB  247 GiB  246 GiB  5.5 MiB  1.5 GiB  1.5 TiB  13.82  0.70  105      up          osd.10   
11   nvme   1.74660   1.00000  1.7 TiB  280 GiB  278 GiB  834 KiB  1.2 GiB  1.5 TiB  15.63  0.79  109      up          osd.11   
18    ssd   1.81940   1.00000  1.8 TiB  497 GiB  496 GiB   11 MiB  1.6 GiB  1.3 TiB  26.70  1.35  165      up          osd.18   
19    ssd   1.81940   1.00000  1.8 TiB  478 GiB  476 GiB   15 MiB  1.7 GiB  1.4 TiB  25.66  1.30  166      up          osd.19   
20    ssd   1.81940   1.00000  1.8 TiB  474 GiB  472 GiB  861 KiB  1.8 GiB  1.4 TiB  25.46  1.29  151      up          osd.20   
                        TOTAL   37 TiB  7.4 TiB  7.4 TiB  155 MiB   32 GiB   30 TiB  19.80                                       
MIN/MAX VAR: 0.61/1.35  STDDEV: 4.54
 
So, yes, the poor performance of one SSD is affecting all storage. As you can see from your ceph osd crush tree --show-shadow, the root default includes _both_ NVMe and SSD OSDs. The CRUSH "replicated_rule" then uses that root for the 'device_health_metrics', 'ceph-nvme', 'cephfs_data' and 'cephfs_metadata' pools.

If you really want those pools to _only_ use NVMe storage, you'll need to create an appropriate rule and set those pools to use it. Otherwise, there is certainly no harm (and possibly an improvement) in replacing the high-latency SSD; but consumer-grade SSDs are not a good fit for production Ceph clusters.
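
To double-check which rule each pool currently uses, something along these lines should do (pool names taken from your output above):

Code:
# rule 0 = replicated_rule (root default, both classes), rule 1 = repl-ssd
ceph osd pool get ceph-nvme crush_rule
ceph osd pool get cephfs_data crush_rule
ceph osd pool get ceph-ssd crush_rule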
 
So, yes, the poor performance of one SSD is affecting all storage. As you can see from your ceph osd crush tree --show-shadow, the root default includes _both_ NVMe and SSD OSDs. The CRUSH "replicated_rule" then uses that root for the 'device_health_metrics', 'ceph-nvme', 'cephfs_data' and 'cephfs_metadata' pools.
So thanks a ton, @RokaKen - this helps me understand the issue and link the clearly badly performing SSD to the problems in the NVMe pool (which apparently is not NVMe-only)! I will start by replacing the suspected bad/faulty SSD.
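
For the swap itself I would roughly follow the usual out/destroy/recreate flow; a sketch of my plan (the device path /dev/sdX is a placeholder, please correct me if this is off):

Code:
# take the slow OSD out and let the cluster rebalance (wait for all PGs active+clean)
ceph osd out 14
# then stop and destroy the OSD
systemctl stop ceph-osd@14
pveceph osd destroy 14
# after swapping the physical disk, create a new OSD on it
pveceph osd create /dev/sdX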

If you really want those pools to _only_ use NVMe storage, you'll need to create an appropriate rule and set those pools to use it. Otherwise, there is certainly no harm (and possibly an improvement) in replacing the high-latency SSD; but consumer-grade SSDs are not a good fit for production Ceph clusters.

Yes, the plan was to have strictly separated pools for the different storage tiers: NVMe OSDs only in one pool ("ceph-nvme"), and SSDs only in another pool ("ceph-ssd") for workloads with lower performance demands.

So, the root cause is that the default root includes both SSD and NVMe OSDs, and that root is used by the "replicated_rule" (which is a Proxmox default, isn't it?)? As far as I understand, the repl-ssd rule is actually SSD-only? Or am I totally barking up the wrong tree here?

Any hints on sorting this out? Is it even possible to relocate with production data in place?
Thanks again!
 
Any hints on sorting this out? Is it even possible to relocate with production data in place?
Thanks again!
Yes, you just need to create a CRUSH rule for the NVME class (similar to what was done for the SSD class) and then set the existing pools to use that instead of the default replicated_rule. There are details here [0] and in the Ceph documentation [1]. The resulting data migration will cause significant I/O, so I would do it during a maintenance window. Migrate one pool at a time and wait for all PGs to be "active+clean". Also, keep an eye on ceph osd df tree to ensure you do not approach the near-full ratio of the OSDs.
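
For keeping an eye on things during the migration, the usual status commands should be enough; roughly:

Code:
# overall cluster / PG state -- wait for all PGs to be active+clean between pools
ceph -s
# per-OSD utilization, so nothing creeps toward the near-full ratio
ceph osd df tree
# the currently configured full/backfillfull/nearfull ratios
ceph osd dump | grep -i ratio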

When creating future pools be sure to specify which rule to use. Leave the existing replicated_rule alone -- you don't need to remove it.

Just a nit, but I notice you have 512 PGs for 'device_health_metrics' -- maybe the autoscaler or someone else set that, but it is a waste. That is an internal pool used by CEPH MGR that only needs 1 PG (an exception to the general pool guidance).

[0] https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster#pve_ceph_device_classes
[1] https://docs.ceph.com/en/octopus/rados/operations/crush-map/
 
Yes, you just need to create a CRUSH rule for the NVME class (similar to what was done for the SSD class)

As per $HISTFILE, the manual CRUSH rule for the SSD class was presumably created with ceph osd crush rule create-replicated repl-ssd default host ssd. So this basically means I create another CRUSH rule like this: ceph osd crush rule create-replicated repl-nvme default host nvme? And then edit the Ceph pool to use the new CRUSH rule (repl-nvme) under "Node" -> "Ceph" -> "Pools" -> "ceph-nvme"?
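
Or, done entirely on the CLI instead of the pool edit dialog, presumably something like this (names taken from above):

Code:
# create the nvme-only rule, mirroring the existing repl-ssd rule
ceph osd crush rule create-replicated repl-nvme default host nvme
# point the pool at the new rule -- this starts the data movement
ceph osd pool set ceph-nvme crush_rule repl-nvme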

Again, thanks a lot for your help with this! I am going to plan a maintenance window and sort this out.

Just a nit, but I notice you have 512 PGs for 'device_health_metrics' -- maybe the autoscaler or someone else set that, but it is a waste. That is an internal pool used by CEPH MGR that only needs 1 PG (an exception to the general pool guidance).
Can I just set "# of PGs" to "1" in the PVE UI in: "Node" -> "Ceph" -> "Pools" -> "device_health_metrics"?
 
As per $HISTFILE, the manual CRUSH rule for the SSD class was presumably created with ceph osd crush rule create-replicated repl-ssd default host ssd. So this basically means I create another CRUSH rule like this: ceph osd crush rule create-replicated repl-nvme default host nvme? And then edit the Ceph pool to use the new CRUSH rule (repl-nvme) under "Node" -> "Ceph" -> "Pools" -> "ceph-nvme"?

Yes, that should work.

Again, thanks a lot for your help with this! I am going to plan a maintenance window and sort this out.


Can I just set "# of PGs" to "1" in the PVE UI in: "Node" -> "Ceph" -> "Pools" -> "device_health_metrics"?

I've never changed PGs (pg_num and pgp_num) via the GUI -- I've always used the Ceph CLI. Perhaps @aaron or other staff can confirm that the PVE tooling does the same thing.
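
On the CLI it would presumably be something like this for that internal pool (the autoscaler may adjust it again depending on its settings):

Code:
# shrink the placement groups of the health-metrics pool
ceph osd pool set device_health_metrics pg_num 1
ceph osd pool set device_health_metrics pgp_num 1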
 
Can you please post the output of ceph osd pool autoscale-status?
Since the device_health_metrics has such a high number of PGs, I want to confirm which profile the autoscaler is using. It might be necessary to change that.
 
Can you please post the output of ceph osd pool autoscale-status?
Since the device_health_metrics has such a high number of PGs, I want to confirm which profile the autoscaler is using. It might be necessary to change that.
Hi @aaron ,

ceph osd pool autoscale-status runs without any output.
 
Which Ceph version do you run? ceph versions
 
Which Ceph version do you run? ceph versions

Code:
{
    "mon": {
        "ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)": 21
    },
    "mds": {
        "ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)": 29
    }
}
 
A big thank you to @RokaKen! This issue is solved now:

  1. Swapping in new SSDs made the SSD pool responsive again
  2. Correcting the CRUSH rules and assigning them to the respective pools unlocked the full potential of the NVMe drives
 
