Ceph design (Number of OSDs per NVMe Disk), rebalance problems

Feb 15, 2023
Hi,

We have a 5-node cluster running Ceph 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable), with pools at size 3 / min_size 2.

Same hardware on all nodes, with seven 7 TB NVMe disks per node. The initial install was performed by a company on an older Ceph release (Octopus), and for performance reasons each NVMe was split into 4 OSDs. I've tried to find out whether this is still considered a good idea today but haven't really found a clear answer, and I guess it depends on a number of factors. Does anyone with experience have advice?

We recently had an OSD crash where one OSD filled up completely, died, and would not start again, so having 4 OSDs per disk made things a bit more complicated.
Maybe we should convert to 1 OSD per disk, but if so, what would be the best approach in a live cluster: one disk at a time across all 5 nodes, or just plow through all the disks in one node after the other?

--Mats
 
Until Ceph OSDs become much more multithreaded (project Crimson), having multiple OSDs per physical NVMe can still give you performance benefits. The downside is a more complicated setup.

What you prefer is up to you.

If you want to change things regarding the OSDs in the cluster, I recommend that you do it one at a time. Set the OSDs that share one NVMe to OUT. Wait until Ceph has moved the data away and they are empty. Only then stop and destroy them.

If you stop too many OSDs at the same time, chances are good that you lose access to more than one replica -> fewer copies than min_size -> I/O blocked. If you lose access to all 3 replicas, well, unless you can start the OSDs again, the data is gone ;)
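A minimal sketch of that per-OSD procedure could look like this (osd.5 is only a placeholder ID, use your actual IDs and verify each step before moving on):

Code:
# mark one of the OSDs on the shared NVMe as out so Ceph drains it
ceph osd out osd.5

# watch the data move away; the OSD should end up with 0 PGs
ceph -s
ceph osd df tree

# ask Ceph whether it is safe to remove the OSD
ceph osd safe-to-destroy osd.5

# only then stop the daemon and destroy the OSD
systemctl stop ceph-osd@5
ceph osd destroy osd.5 --yes-i-really-mean-it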
 
The only real issue we see today, besides the more complicated setup, is that the OSDs are pretty badly balanced, with differences of around 30%.
I've tried running ceph osd reweight-by-utilization more frequently to even things out, but it doesn't feel like a long-term solution.
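For reference, the kind of invocation I mean is roughly the following; the threshold and limits are only example values, and a dry run with test-reweight-by-utilization first is a good idea:

Code:
# dry run: show what would be reweighted (overload threshold 110%, max change 0.05, max 10 OSDs)
ceph osd test-reweight-by-utilization 110 0.05 10

# apply the same reweighting for real
ceph osd reweight-by-utilization 110 0.05 10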

We also get a few of these errors when the load on Ceph is high (like when I out a single OSD):

Code:
kernel:[2599707.759497] watchdog: BUG: soft lockup - CPU#59 stuck for 21s! [swapper/59:0]
kernel:[2599891.822969] watchdog: BUG: soft lockup - CPU#56 stuck for 22s! [CPU 4/KVM:836752]
kernel:[2601275.671687] watchdog: BUG: soft lockup - CPU#52 stuck for 21s! [swapper/52:0]

Could this get even worse with a single OSD per disk, and what kind of performance problems should we look out for? Converting all 5 nodes to single OSDs just to test will be kind of a pain, and I guess converting only one node for testing would be of limited value, or what do you think?

--Mats
 
That sounds like a more general problem that should not be caused by having multiple OSDs per NVMe.
What version of Ceph are you running? Can you post the output of the following commands inside [CODE][/CODE] blocks?

Code:
ceph osd pool ls detail
ceph osd df tree
 
This evening turned out really interesting. After the crash of the one OSD that we couldn't start again, we destroyed it. Since it was on a shared disk, I planned to handle the remaining 3 OSDs on that disk a bit more cleanly today and re-create them.
I started by marking one OSD at a time as out, waited for it to drain to zero PGs and for the rebalance to finish, then stopped and destroyed it. This is where everything went south. Even though the OSDs were out and completely empty before being destroyed, the cluster went crazy. One node filled up with loads of CPU lockups, one OSD after the other crashed on that node, and although we stopped all rebalancing it was too late: the node died and we had to hard-reboot it.

Once restarted, it slowly came back to normal, but the minute we stopped and destroyed another OSD that was out with zero PGs, it started to show signs of problems again.

Why does Ceph start rebalancing when an OSD that is out and holds 0% of the data gets destroyed? I'm really confused.
 
One node filled up with loads of CPU lockups, one OSD after the other crashed on that node
Which node was that?

And please edit your posts and put the OSD list in [CODE][/CODE] tags! It is almost impossible to read otherwise.
 
I would tend towards one disk, one OSD.
In case of a disk failure it's much easier to restore (and the risk of doing something wrong is reduced).
Smaller disks and more of them are better than one large one, but I guess your config is fixed already.
 
Sorry, I'll edit the post.

The most recently added 5th node was the one that died. After lots of reading in this and other forums we found that this is not that uncommon: with default settings there is a high risk that backfill/recovery will consume too much bandwidth, locking up one OSD after the other. Losing even more OSDs makes the cluster go totally crazy. The only way to stop it was to enable "nobackfill", but this only works if it is caught in time.
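For anyone who runs into the same thing, the flag in question is set and cleared like this (there is also a related norebalance flag; whether you need it depends on your situation):

Code:
# pause all backfill while you sort things out
ceph osd set nobackfill

# remove the flag again once the cluster has calmed down
ceph osd unset nobackfill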

We ended up tuning the mclock settings according to the example here:
https://pve.proxmox.com/wiki/Ceph_mclock_tuning

We adjusted the values below to a number that our cluster could handle at its current load, which was about 200-300:

Code:
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_lim=100"
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_res=100"

Without the above settings the cluster could not rebalance/recover.
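As far as I understand, injectargs only changes the running daemons; to make the values persist across OSD restarts they can also be put into the config database, something like this (the value here is just an example in the range that worked for us):

Code:
ceph config set osd osd_mclock_scheduler_background_recovery_lim 300
ceph config set osd osd_mclock_scheduler_background_recovery_res 300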

This made it possible to re-add our 4 OSDs at least, but balancing is still a mess, with some OSDs at 87% and others at 40%.
 
Aaron,

Re-added the information as requested, here is ceph osd pool ls detail:

Code:
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 96374 lfor 0/51345/51343 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
pool 2 'rbd_pool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 96374 lfor 0/0/63491 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480000 application rbd
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 96374 lfor 0/50922/50920 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 96374 lfor 0/50966/50964 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 96374 lfor 0/50944/50942 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 96374 lfor 0/52442/52440 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 96374 lfor 0/52485/52483 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 8 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 96374 lfor 0/50962/50960 flags hashpspool stripe_width 0 application rgw
 
and ceph osd df tree. We're still facing poorly balanced OSDs that get full. Every node looks about the same, with % used ranging from roughly 40% to over 80%.


Code:
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE   DATA      OMAP     META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME               
 -1         244.52362         -  245 TiB   151 TiB   151 TiB  1.5 MiB  400 GiB    93 TiB  61.77  1.00    -          root default             
 -7          48.90472         -   49 TiB    31 TiB    31 TiB  104 KiB   81 GiB    18 TiB  62.94  1.02    -              host ix-sto1-cl-pve01
  0    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.6 GiB   733 GiB  59.04  0.96   43      up          osd.0           
  1    ssd    1.74660   1.00000  1.7 TiB   856 GiB   854 GiB    4 KiB  2.0 GiB   933 GiB  47.84  0.77   40      up          osd.1           
  2    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.6 GiB   677 GiB  62.13  1.01   46      up          osd.2           
  3    ssd    1.74660   1.00000  1.7 TiB   955 GiB   953 GiB      0 B  2.2 GiB   833 GiB  53.42  0.86   43      up          osd.3           
  4    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.5 GiB   708 GiB  60.43  0.98   49      up          osd.4           
  5    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  3.9 GiB   524 GiB  70.70  1.14   52      up          osd.5           
  6    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  4.4 GiB   426 GiB  76.16  1.23   57      up          osd.6           
  7    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.7 GiB   555 GiB  68.95  1.12   50      up          osd.7           
  8    ssd    1.74660   1.00000  1.7 TiB   954 GiB   951 GiB      0 B  2.5 GiB   835 GiB  53.33  0.86   41      up          osd.8           
  9    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.7 GiB   635 GiB  64.52  1.04   53      up          osd.9           
 10    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.7 GiB   654 GiB  63.45  1.03   46      up          osd.10           
 11    ssd    1.74660   1.00000  1.7 TiB   931 GiB   929 GiB      0 B  2.3 GiB   858 GiB  52.04  0.84   41      up          osd.11           
 12    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.6 GiB   706 GiB  60.50  0.98   47      up          osd.12           
 13    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.5 GiB   654 GiB  63.45  1.03   51      up          osd.13           
 14    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB    4 KiB  2.4 GiB   731 GiB  59.11  0.96   44      up          osd.14           
 15    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.7 GiB   555 GiB  68.97  1.12   52      up          osd.15           
 16    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  3.7 GiB   581 GiB  67.51  1.09   52      up          osd.16           
 17    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB    4 KiB  2.9 GiB   505 GiB  71.74  1.16   53      up          osd.17           
 18    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB    4 KiB  3.3 GiB   402 GiB  77.52  1.25   57      up          osd.18           
 19    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.5 GiB   708 GiB  60.40  0.98   47      up          osd.19           
 20    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.6 GiB   734 GiB  58.96  0.95   44      up          osd.20           
 21    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.6 GiB   607 GiB  66.07  1.07   48      up          osd.21           
 22    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  4.3 GiB   427 GiB  76.12  1.23   55      up          osd.22           
 23    ssd    1.74660   1.00000  1.7 TiB   929 GiB   927 GiB    4 KiB  2.4 GiB   859 GiB  51.97  0.84   40      up          osd.23           
 96    ssd    1.74660   1.00000  1.7 TiB   807 GiB   804 GiB   16 KiB  2.8 GiB   982 GiB  45.10  0.73   36      up          osd.96           
 97    ssd    1.74660   0.89999  1.7 TiB   1.3 TiB   1.3 TiB   19 KiB  3.3 GiB   453 GiB  74.70  1.21   57      up          osd.97           
 98    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB   24 KiB  3.6 GiB   580 GiB  67.56  1.09   50      up          osd.98           
 99    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   13 KiB  3.5 GiB   703 GiB  60.72  0.98   45      up          osd.99           
 -2          48.90472         -   49 TiB    30 TiB    30 TiB  167 KiB   74 GiB    19 TiB  61.63  1.00    -              host ix-sto1-cl-pve02
 24    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  3.0 GiB   699 GiB  60.93  0.99   46      up          osd.24           
 25    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.3 GiB   655 GiB  63.38  1.03   49      up          osd.25           
 26    ssd    1.74660   1.00000  1.7 TiB   982 GiB   980 GiB      0 B  2.2 GiB   807 GiB  54.90  0.89   41      up          osd.26           
 27    ssd    1.74660   1.00000  1.7 TiB  1003 GiB  1001 GiB      0 B  2.3 GiB   785 GiB  56.10  0.91   46      up          osd.27           
 28    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.9 GiB   678 GiB  62.07  1.00   45      up          osd.28           
 29    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.5 GiB   557 GiB  68.84  1.11   50      up          osd.29           
 30    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  4.1 GiB   457 GiB  74.43  1.20   53      up          osd.30           
 31    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    8 KiB  2.6 GiB   581 GiB  67.51  1.09   49      up          osd.31           
 32    ssd    1.74660   1.00000  1.7 TiB  1007 GiB  1005 GiB      0 B  2.2 GiB   782 GiB  56.29  0.91   42      up          osd.32           
 33    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.7 GiB   606 GiB  66.11  1.07   51      up          osd.33           
 34    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.3 GiB   680 GiB  61.97  1.00   46      up          osd.34           
 35    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.5 GiB   630 GiB  64.76  1.05   46      up          osd.35           
 36    ssd    1.74660   1.00000  1.7 TiB   680 GiB   678 GiB      0 B  1.7 GiB   1.1 TiB  38.02  0.62   30      up          osd.36           
 37    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.4 GiB   607 GiB  66.05  1.07   50      up          osd.37           
 38    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.3 GiB   729 GiB  59.25  0.96   45      up          osd.38           
 39    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.3 GiB   758 GiB  57.62  0.93   43      up          osd.39           
 40    ssd    1.74660   1.00000  1.7 TiB   933 GiB   931 GiB    4 KiB  2.2 GiB   855 GiB  52.19  0.84   41      up          osd.40           
 41    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.1 TiB    4 KiB  2.3 GiB   609 GiB  65.94  1.07   50      up          osd.41           
 42    ssd    1.74660   1.00000  1.7 TiB   955 GiB   952 GiB      0 B  2.1 GiB   834 GiB  53.37  0.86   44      up          osd.42           
 43    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.7 GiB   531 GiB  70.31  1.14   55      up          osd.43           
 44    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB      0 B  3.9 GiB   402 GiB  77.52  1.25   58      up          osd.44           
 45    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.5 GiB   605 GiB  66.20  1.07   52      up          osd.45           
 46    ssd    1.74660   1.00000  1.7 TiB   881 GiB   879 GiB    4 KiB  2.0 GiB   907 GiB  49.26  0.80   37      up          osd.46           
 47    ssd    1.74660   1.00000  1.7 TiB   933 GiB   931 GiB    4 KiB  2.1 GiB   855 GiB  52.18  0.84   40      up          osd.47           
100    ssd    1.74660   1.00000  1.7 TiB   908 GiB   905 GiB   18 KiB  3.2 GiB   880 GiB  50.79  0.82   37      up          osd.100         
101    ssd    1.74660   1.00000  1.7 TiB   906 GiB   903 GiB   17 KiB  3.0 GiB   883 GiB  50.65  0.82   41      up          osd.101         
102    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB   73 KiB  4.1 GiB   330 GiB  81.57  1.32   61      up          osd.102         
103    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.3 TiB   23 KiB  3.9 GiB   402 GiB  77.50  1.25   56      up          osd.103         
 -8          48.90472         -   49 TiB    30 TiB    30 TiB  136 KiB   74 GiB    19 TiB  61.99  1.00    -              host ix-sto1-cl-pve03
 48    ssd    1.74660   1.00000  1.7 TiB   930 GiB   928 GiB      0 B  2.0 GiB   859 GiB  51.99  0.84   39      up          osd.48           
 49    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.6 GiB   605 GiB  66.18  1.07   53      up          osd.49           
 50    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.5 GiB   510 GiB  71.48  1.16   55      up          osd.50           
 51    ssd    1.74660   1.00000  1.7 TiB   985 GiB   983 GiB      0 B  2.2 GiB   803 GiB  55.10  0.89   43      up          osd.51           
 52    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.3 GiB   758 GiB  57.62  0.93   45      up          osd.52           
 53    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB    4 KiB  2.7 GiB   454 GiB  74.64  1.21   56      up          osd.53           
 54    ssd    1.74660   1.00000  1.7 TiB   986 GiB   984 GiB      0 B  2.2 GiB   803 GiB  55.13  0.89   42      up          osd.54           
 55    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.2 GiB   708 GiB  60.43  0.98   47      up          osd.55           
 56    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB    4 KiB  3.9 GiB   504 GiB  71.80  1.16   52      up          osd.56           
 57    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.4 GiB   685 GiB  61.70  1.00   45      up          osd.57           
 58    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.4 GiB   658 GiB  63.19  1.02   47      up          osd.58           
 59    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  3.3 GiB   583 GiB  67.40  1.09   51      up          osd.59           
 60    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.3 GiB   660 GiB  63.11  1.02   48      up          osd.60           
 61    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.3 GiB   681 GiB  61.90  1.00   50      up          osd.61           
 62    ssd    1.74660   1.00000  1.7 TiB   782 GiB   780 GiB    4 KiB  1.6 GiB  1006 GiB  43.73  0.71   34      up          osd.62           
 63    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  3.8 GiB   528 GiB  70.46  1.14   53      up          osd.63           
 64    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  2.7 GiB   479 GiB  73.24  1.19   54      up          osd.64           
 65    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.6 GiB   604 GiB  66.24  1.07   51      up          osd.65           
 66    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB    4 KiB  2.7 GiB   483 GiB  72.98  1.18   55      up          osd.66           
 67    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  3.0 GiB   503 GiB  71.90  1.16   56      up          osd.67           
 68    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB      0 B  2.3 GiB   758 GiB  57.62  0.93   48      up          osd.68           
 69    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB    4 KiB  2.3 GiB   758 GiB  57.63  0.93   42      up          osd.69           
 70    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.5 GiB   579 GiB  67.61  1.09   50      up          osd.70           
 71    ssd    1.74660   1.00000  1.7 TiB   858 GiB   856 GiB      0 B  1.7 GiB   931 GiB  47.96  0.78   40      up          osd.71           
104    ssd    1.74660   1.00000  1.7 TiB   880 GiB   877 GiB   24 KiB  3.0 GiB   908 GiB  49.22  0.80   37      up          osd.104         
105    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   23 KiB  3.9 GiB   654 GiB  63.45  1.03   46      up          osd.105         
106    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB   18 KiB  3.2 GiB   736 GiB  58.87  0.95   46      up          osd.106         
107    ssd    1.74660   1.00000  1.7 TiB   951 GiB   948 GiB   19 KiB  3.1 GiB   837 GiB  53.20  0.86   41      up          osd.107         
 
Code:
-3          48.90472         -   49 TiB    30 TiB    30 TiB  214 KiB   77 GiB    19 TiB  60.91  0.99    -              host ix-sto1-cl-pve04
 72    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB    4 KiB  4.0 GiB   455 GiB  74.58  1.21   56      up          osd.72           
 73    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  3.1 GiB   607 GiB  66.07  1.07   55      up          osd.73           
 74    ssd    1.74660   1.00000  1.7 TiB   806 GiB   804 GiB      0 B  1.7 GiB   983 GiB  45.05  0.73   33      up          osd.74           
 75    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.2 GiB   707 GiB  60.47  0.98   46      up          osd.75           
 76    ssd    1.74660   1.00000  1.7 TiB   929 GiB   927 GiB    4 KiB  2.0 GiB   859 GiB  51.96  0.84   37      up          osd.76           
 77    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  3.9 GiB   627 GiB  64.95  1.05   50      up          osd.77           
 78    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.6 GiB   605 GiB  66.16  1.07   50      up          osd.78           
 79    ssd    1.74660   1.00000  1.7 TiB   829 GiB   827 GiB    4 KiB  1.8 GiB   959 GiB  46.35  0.75   35      up          osd.79           
 80    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.6 GiB   581 GiB  67.53  1.09   51      up          osd.80           
 81    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  3.5 GiB   604 GiB  66.25  1.07   48      up          osd.81           
 82    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB      0 B  2.5 GiB   606 GiB  66.14  1.07   49      up          osd.82           
 83    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.4 GiB   657 GiB  63.29  1.02   45      up          osd.83           
 84    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB    4 KiB  2.4 GiB   658 GiB  63.21  1.02   46      up          osd.84           
 85    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB    4 KiB  2.3 GiB   606 GiB  66.14  1.07   51      up          osd.85           
 86    ssd    1.74660   1.00000  1.7 TiB   979 GiB   977 GiB      0 B  2.0 GiB   809 GiB  54.76  0.89   41      up          osd.86           
 87    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB      0 B  2.3 GiB   682 GiB  61.88  1.00   47      up          osd.87           
 88    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   29 KiB  3.2 GiB   656 GiB  63.32  1.03   50      up          osd.88           
 89    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   23 KiB  3.1 GiB   684 GiB  61.77  1.00   46      up          osd.89           
 90    ssd    1.74660   0.89999  1.7 TiB   1.4 TiB   1.4 TiB   18 KiB  3.4 GiB   402 GiB  77.50  1.25   57      up          osd.90           
 91    ssd    1.74660   1.00000  1.7 TiB   779 GiB   776 GiB   11 KiB  2.5 GiB  1010 GiB  43.55  0.70   34      up          osd.91           
 92    ssd    1.74660   1.00000  1.7 TiB   731 GiB   729 GiB      0 B  1.6 GiB   1.0 TiB  40.87  0.66   32      up          osd.92           
 93    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB      0 B  3.7 GiB   476 GiB  73.41  1.19   57      up          osd.93           
 94    ssd    1.74660   1.00000  1.7 TiB   956 GiB   955 GiB    4 KiB  1.9 GiB   832 GiB  53.48  0.87   44      up          osd.94           
 95    ssd    1.74660   1.00000  1.7 TiB   983 GiB   981 GiB      0 B  2.2 GiB   806 GiB  54.95  0.89   42      up          osd.95           
108    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.3 TiB   23 KiB  3.8 GiB   404 GiB  77.43  1.25   58      up          osd.108         
109    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   22 KiB  3.4 GiB   679 GiB  62.03  1.00   48      up          osd.109         
110    ssd    1.74660   1.00000  1.7 TiB   906 GiB   903 GiB   31 KiB  3.0 GiB   882 GiB  50.67  0.82   41      up          osd.110         
111    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   17 KiB  3.4 GiB   683 GiB  61.82  1.00   46      up          osd.111         
-11          48.90472         -   49 TiB    30 TiB    30 TiB  921 KiB   95 GiB    19 TiB  61.38  0.99    -              host ix-sto1-cl-pve05
112    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB   22 KiB  4.0 GiB   454 GiB  74.62  1.21   55      up          osd.112         
113    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.2 TiB   16 KiB  3.9 GiB   505 GiB  71.77  1.16   56      up          osd.113         
114    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB   38 KiB  3.4 GiB   729 GiB  59.26  0.96   46      up          osd.114         
115    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   18 KiB  3.3 GiB   680 GiB  61.97  1.00   47      up          osd.115         
116    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB   23 KiB  4.2 GiB   451 GiB  74.77  1.21   55      up          osd.116         
117    ssd    1.74660   1.00000  1.7 TiB   959 GiB   955 GiB   14 KiB  3.4 GiB   830 GiB  53.60  0.87   43      up          osd.117         
118    ssd    1.74660   1.00000  1.7 TiB   954 GiB   951 GiB   19 KiB  3.0 GiB   834 GiB  53.34  0.86   39      up          osd.118         
119    ssd    1.74660   1.00000  1.7 TiB   1.2 TiB   1.2 TiB   21 KiB  3.7 GiB   554 GiB  69.04  1.12   52      up          osd.119         
120    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   57 KiB  3.6 GiB   680 GiB  62.00  1.00   46      up          osd.120         
121    ssd    1.74660   1.00000  1.7 TiB   806 GiB   803 GiB   16 KiB  2.9 GiB   983 GiB  45.04  0.73   38      up          osd.121         
122    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB   16 KiB  3.9 GiB   352 GiB  80.34  1.30   61      up          osd.122         
123    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB   18 KiB  3.9 GiB   374 GiB  79.08  1.28   59      up          osd.123         
124    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   21 KiB  3.7 GiB   677 GiB  62.12  1.01   49      up          osd.124         
125    ssd    1.74660   1.00000  1.7 TiB   956 GiB   954 GiB   56 KiB  2.7 GiB   832 GiB  53.47  0.87   39      up          osd.125         
126    ssd    1.74660   1.00000  1.7 TiB  1007 GiB  1003 GiB   17 KiB  3.2 GiB   782 GiB  56.28  0.91   42      up          osd.126         
127    ssd    1.74660   1.00000  1.7 TiB   957 GiB   955 GiB   26 KiB  2.7 GiB   831 GiB  53.53  0.87   40      up          osd.127         
128    ssd    1.74660   1.00000  1.7 TiB   781 GiB   779 GiB   20 KiB  2.5 GiB  1007 GiB  43.69  0.71   35      up          osd.128         
129    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB   24 KiB  4.2 GiB   428 GiB  76.05  1.23   57      up          osd.129         
130    ssd    1.74660   1.00000  1.7 TiB   960 GiB   957 GiB   37 KiB  3.4 GiB   828 GiB  53.68  0.87   45      up          osd.130         
131    ssd    1.74660   1.00000  1.7 TiB   859 GiB   856 GiB   20 KiB  2.7 GiB   929 GiB  48.04  0.78   40      up          osd.131         
132    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB   16 KiB  3.2 GiB   756 GiB  57.75  0.93   42      up          osd.132         
133    ssd    1.74660   1.00000  1.7 TiB   908 GiB   904 GiB   16 KiB  3.2 GiB   881 GiB  50.74  0.82   40      up          osd.133         
134    ssd    1.74660   1.00000  1.7 TiB  1009 GiB  1006 GiB   26 KiB  3.1 GiB   779 GiB  56.42  0.91   42      up          osd.134         
135    ssd    1.74660   1.00000  1.7 TiB   1.0 TiB   1.0 TiB   35 KiB  3.2 GiB   759 GiB  57.56  0.93   42      up          osd.135         
136    ssd    1.74660   1.00000  1.7 TiB   1.1 TiB   1.1 TiB   20 KiB  3.2 GiB   683 GiB  61.80  1.00   47      up          osd.136         
137    ssd    1.74660   1.00000  1.7 TiB   1.3 TiB   1.3 TiB   30 KiB  3.9 GiB   453 GiB  74.67  1.21   58      up          osd.137         
138    ssd    1.74660   1.00000  1.7 TiB   878 GiB   875 GiB  251 KiB  3.0 GiB   911 GiB  49.09  0.79   41      up          osd.138         
139    ssd    1.74660   1.00000  1.7 TiB   1.4 TiB   1.4 TiB   28 KiB  4.0 GiB   380 GiB  78.78  1.28   59      up          osd.139         
                          TOTAL  245 TiB   151 TiB   151 TiB  1.5 MiB  400 GiB    93 TiB  61.77                                             
MIN/MAX VAR: 0.62/1.32  STDDEV: 9.44
 
What bandwidth do you have between the nodes for the Ceph traffic?
27 OSDs per node, not bad.
 
Maybe you can try to add more PGs? (You should target around 100 PGs per OSD.)
 
Well, I've thought about this. We could, but there is no way back, and we are already using the recommended value of 2048; as I understand it the next step would be 4096 if it has to be the nearest power of 2, which is quite a big leap. The optimal suggested number is also 2048 according to the GUI.

According to the PG calculation formula it looks like we should aim for 4096 (140 * 100 / 3 ≈ 4667), but I'm a bit concerned about what this might do to the overall performance.
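For reference, my arithmetic and what the manual bump would look like (on recent releases pgp_num should follow pg_num automatically, so a single command ought to be enough):

Code:
# (140 OSDs * 100 PGs per OSD) / 3 replicas ≈ 4667 -> nearest power of two: 4096
ceph osd pool set rbd_pool pg_num 4096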
 
There is quite the imbalance in the usage across the OSDs. Some are ~80% full, while others are less than 50% full.

What does the balancer say? ceph balancer status

AFAICT, the rbd pool is the only one really in use? The RGW pools don't contain any noteworthy amount of data?

Doubling the number of PGs for the RBD pool would give you close to 100 PGs per OSD.

If you set the target_ratio only for the RBD pool, the autoscaler will calculate as if that pool is expected to use all the available space. It should then arrive at a recommendation of 4096 PGs with that many OSDs (140 if I am not mistaken), which matches https://old.ceph.com/pgcalc/.
That should also even out the usage across the OSDs.
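Setting that would look something like the following (pool name taken from your output above; clearing the absolute size target is optional, but then only the ratio drives the autoscaler):

Code:
# tell the autoscaler this pool is expected to consume all available space
ceph osd pool set rbd_pool target_size_ratio 1

# optionally remove the absolute target_size_bytes that is currently set
ceph osd pool set rbd_pool target_size_bytes 0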
 
Yep, I agree with the calculation. I also read that I should consider the performance impact the increase would have, so I understand that looking at the formula alone might not be the absolute best approach. Only a target size is set on the pool (20 TB). Do you know how I should calculate the potential performance impact of the increase?

I looked at the balancer and it seems active.

Code:
{
    "active": true,
    "last_optimize_duration": "0:00:00.022424",
    "last_optimize_started": "Wed Jun 14 19:24:24 2023",
    "mode": "upmap",
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}

We also found a misconfiguration on one node (node 5) that I guess could be the cause of all the latency and lockups: we had forgotten to adjust the qlen on the Ceph interfaces, so it was still at the default of 1000; it is now set to the same value as the other nodes. That change did not help the rebalancing though, we still have an increasing difference of almost 45%.
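For the record, checking and adjusting the qlen is the usual iproute2 routine; interface name and value below are placeholders, use whatever your Ceph network actually runs on:

Code:
# check the current transmit queue length on the Ceph interface
ip link show bond0 | grep qlen

# raise it to match the other nodes
ip link set dev bond0 txqueuelen 10000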
 
Having more PGs will lead to smaller PGs that can be distributed more evenly. Right now (or as of the last info), many OSDs are quite a bit fuller than others. The fuller ones are more likely to be involved in reads and writes and therefore have a higher chance of becoming a bottleneck.

Regarding target_size/target_ratio: the size is useful if you can estimate the size of the pool in absolute terms. The autoscaler uses the target_* values to estimate how many PGs it should assign to each pool. The target_ratio values are weighted against each other.

Therefore, if you only have one pool that will be using pretty much all the space, setting a target_ratio of 1 (or any other value) will tell the autoscaler that this pool is expected to use up the available space.

If other pools have a target_size configured, it will be subtracted from the space considered for the target_ratio settings.

do you know how I should calculate the potential performance impact of the increase?
Hard to calculate or estimate. But experience shows that too small a number of PGs can have a negative impact on performance.


Is the balancer still in the same status? Ideally the usage of the OSDs should be much more even and the balancer should report the following:
"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",

If it is still in the same state and the OSD usage in the cluster isn't evening out any further (i.e. the balancer isn't actually doing anything), can you check what the following command outputs?

Code:
ceph osd get-require-min-compat-client
It should report "luminous". If it doesn't, you can set it with
Code:
ceph osd set-require-min-compat-client luminous
 
