PGs unequally distributed across the nodes

FelixJ

Hi everybody,
this story might sound similar to one I posted recently, but I think the circumstances are different:
I have a 3-node cluster whose PGs are unequally distributed across the available nodes.
My suspicion arose while updating the cluster from 15.x to 16.x: when I rebooted the nodes one after the other, the cluster sometimes became unresponsive.
Currently the status is OK and all nodes are up.
However, I investigated a little further and found that my PGs are not evenly distributed throughout the cluster nodes.
To confirm my suspicion, I replaced the OSD numbers with fictional node names (01-03) to get a better overview of what is happening:

Code:
ceph pg dump all|cut -d [ -f 2| cut -d ] -f 1| sed -e"s/\b[0-7]\b/01/g;s/\b[89]\b\|\b10\b\|\b11\b\|\b20\b\|\b21\b\|\b23\b/02/g;s/\b1[2-9]\b/03/g" | grep -v -e "01,02,03\|02,01,03\|03,01,02"

The result then looks like this:
As one can see, the majority of these PGs are stored on a single node instead of being distributed across all 3 nodes equally.

Code:
01,01,01
02,01,01
01,01,03
01,03,03
03,01,01
01,01,03
03,01,01
02,03,02
01,03,02
01,03,01
01,02,02
02,01,01
01,03,01
01,03,02
03,02,03
03,02,01
03,03,03
01,01,03
02,01,01
02,02,01
03,02,01
03,02,02
03,02,01
01,02,02
03,02,01
03,02,01
01,03,01
03,03,02
03,03,02
01,03,03
03,01,01
03,03,02
03,03,02
01,03,03
02,03,02
01,03,02
01,01,03
01,01,03
....
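
For reference, a rough and untested variant that derives the OSD-to-host mapping from ceph osd tree instead of hard-coding the OSD ranges could look like this (the awk field positions assume the default text output of both commands):

Code:
# build an "OSD-id host" map from the crush tree; reset on every root line so
# OSDs listed directly under the custom roots are not attributed to a host
ceph osd tree | awk '
    $3 == "root"                  { host = "" }
    $3 == "host"                  { host = $4 }
    $1 ~ /^[0-9]+$/ && host != "" { map[$1] = host }
    END { for (o in map) print o, map[o] }
' > osd2host.txt

# rewrite the acting sets from the pg dump with host names
ceph pg dump all | cut -d [ -f 2 | cut -d ] -f 1 | awk '
    NR == FNR { m[$1] = $2; next }   # first file: load the OSD->host map
    {
        n = split($0, a, ",")
        out = ""
        for (i = 1; i <= n; i++) out = out (i > 1 ? "," : "") m[a[i]]
        print out
    }
' osd2host.txt -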

Here is my crushmap:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 23 osd.23 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host srv-virt-01 {
        id -3           # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        # weight 8.736
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.092
        item osd.1 weight 1.092
        item osd.2 weight 1.092
        item osd.3 weight 1.092
        item osd.4 weight 1.092
        item osd.5 weight 1.092
        item osd.6 weight 1.092
        item osd.7 weight 1.092
}
host srv-virt-02 {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        # weight 7.644
        alg straw2
        hash 0  # rjenkins1
        item osd.8 weight 1.092
        item osd.9 weight 1.092
        item osd.10 weight 1.092
        item osd.11 weight 1.092
        item osd.20 weight 1.092
        item osd.21 weight 1.092
        item osd.23 weight 1.092
}
host srv-virt-03 {
        id -7           # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        # weight 8.736
        alg straw2
        hash 0  # rjenkins1
        item osd.12 weight 1.092
        item osd.13 weight 1.092
        item osd.14 weight 1.092
        item osd.17 weight 1.092
        item osd.18 weight 1.092
        item osd.19 weight 1.092
        item osd.15 weight 1.092
        item osd.16 weight 1.092
}
root default {
        id -1           # do not change unnecessarily
        id -9 class hdd         # do not change unnecessarily
        # weight 25.112
        alg straw2
        hash 0  # rjenkins1
        item srv-virt-01 weight 8.733
        item srv-virt-02 weight 7.644
        item srv-virt-03 weight 8.735
}
root production {
        id -20          # do not change unnecessarily
        id -2 class hdd         # do not change unnecessarily
        # weight 9.828
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.092
        item osd.1 weight 1.092
        item osd.3 weight 1.092
        item osd.9 weight 1.092
        item osd.10 weight 1.092
        item osd.11 weight 1.092
        item osd.12 weight 1.092
        item osd.14 weight 1.092
        item osd.17 weight 1.092
}
root backup {
        id -30          # do not change unnecessarily
        id -10 class hdd                # do not change unnecessarily
        # weight 15.288
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 1.092
        item osd.4 weight 1.092
        item osd.5 weight 1.092
        item osd.6 weight 1.092
        item osd.7 weight 1.092
        item osd.8 weight 1.092
        item osd.13 weight 1.092
        item osd.15 weight 1.092
        item osd.16 weight 1.092
        item osd.18 weight 1.092
        item osd.19 weight 1.092
        item osd.20 weight 1.092
        item osd.21 weight 1.092
        item osd.23 weight 1.092
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule production_pool {
        id 1
        type replicated
        min_size 2
        max_size 6
        step take production
        step chooseleaf firstn 0 type osd
        step emit
}
rule backup_pool {
        id 2
        type replicated
        min_size 2
        max_size 3
        step take backup
        step chooseleaf firstn 0 type osd
        step emit
}

# end crush map
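
In case it is useful, the crush map can also be exported and the rules simulated offline with crushtool, for example (rule id 1 being the production_pool rule above):

Code:
# export and decompile the current crush map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# simulate which OSDs rule id 1 would map PGs to with 3 replicas
crushtool -i crush.bin --test --rule 1 --num-rep 3 --show-mappings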

Here are my OSDs:
Code:
ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-30         15.28793  root backup                                  
  2    hdd   1.09200      osd.2                 up   1.00000  1.00000
  4    hdd   1.09200      osd.4                 up   1.00000  1.00000
  5    hdd   1.09200      osd.5                 up   1.00000  1.00000
  6    hdd   1.09200      osd.6                 up   1.00000  1.00000
  7    hdd   1.09200      osd.7                 up   1.00000  1.00000
  8    hdd   1.09200      osd.8                 up   1.00000  1.00000
 13    hdd   1.09200      osd.13                up   1.00000  1.00000
 15    hdd   1.09200      osd.15                up   1.00000  1.00000
 16    hdd   1.09200      osd.16                up   1.00000  1.00000
 18    hdd   1.09200      osd.18                up   1.00000  1.00000
 19    hdd   1.09200      osd.19                up   1.00000  1.00000
 20    hdd   1.09200      osd.20                up   1.00000  1.00000
 21    hdd   1.09200      osd.21                up   1.00000  1.00000
 23    hdd   1.09200      osd.23                up   1.00000  1.00000
-20          9.82796  root production                              
  0    hdd   1.09200      osd.0                 up   1.00000  1.00000
  1    hdd   1.09200      osd.1                 up   1.00000  1.00000
  3    hdd   1.09200      osd.3                 up   1.00000  1.00000
  9    hdd   1.09200      osd.9                 up   1.00000  1.00000
 10    hdd   1.09200      osd.10                up   1.00000  1.00000
 11    hdd   1.09200      osd.11                up   1.00000  1.00000
 12    hdd   1.09200      osd.12                up   1.00000  1.00000
 14    hdd   1.09200      osd.14                up   1.00000  1.00000
 17    hdd   1.09200      osd.17                up   1.00000  1.00000
 -1         25.11197  root default                                 
 -3          8.73299      host srv-virt-01                         
  0    hdd   1.09200          osd.0             up   1.00000  1.00000
  1    hdd   1.09200          osd.1             up   1.00000  1.00000
  2    hdd   1.09200          osd.2             up   1.00000  1.00000
  3    hdd   1.09200          osd.3             up   1.00000  1.00000
  4    hdd   1.09200          osd.4             up   1.00000  1.00000
  5    hdd   1.09200          osd.5             up   1.00000  1.00000
  6    hdd   1.09200          osd.6             up   1.00000  1.00000
  7    hdd   1.09200          osd.7             up   1.00000  1.00000
 -5          7.64400      host srv-virt-02                         
  8    hdd   1.09200          osd.8             up   1.00000  1.00000
  9    hdd   1.09200          osd.9             up   1.00000  1.00000
 10    hdd   1.09200          osd.10            up   1.00000  1.00000
 11    hdd   1.09200          osd.11            up   1.00000  1.00000
 20    hdd   1.09200          osd.20            up   1.00000  1.00000
 21    hdd   1.09200          osd.21            up   1.00000  1.00000
 23    hdd   1.09200          osd.23            up   1.00000  1.00000
 -7          8.73499      host srv-virt-03                         
 12    hdd   1.09200          osd.12            up   1.00000  1.00000
 13    hdd   1.09200          osd.13            up   1.00000  1.00000
 14    hdd   1.09200          osd.14            up   1.00000  1.00000
 15    hdd   1.09200          osd.15            up   1.00000  1.00000
 16    hdd   1.09200          osd.16            up   1.00000  1.00000
 17    hdd   1.09200          osd.17            up   1.00000  1.00000
 18    hdd   1.09200          osd.18            up   1.00000  1.00000
 19    hdd   1.09200          osd.19            up   1.00000  1.00000

Code:
ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 2    hdd  1.09200   1.00000  1.1 TiB  589 GiB  587 GiB   96 MiB  2.0 GiB  528 GiB  52.73  1.10   62      up
 4    hdd  1.09200   1.00000  1.1 TiB  587 GiB  585 GiB  191 MiB  2.0 GiB  531 GiB  52.53  1.10   68      up
 5    hdd  1.09200   1.00000  1.1 TiB  507 GiB  505 GiB  139 MiB  1.5 GiB  611 GiB  45.35  0.95   57      up
 6    hdd  1.09200   1.00000  1.1 TiB  579 GiB  577 GiB  180 MiB  2.0 GiB  538 GiB  51.83  1.08   66      up
 7    hdd  1.09200   1.00000  1.1 TiB  508 GiB  506 GiB  188 MiB  1.8 GiB  610 GiB  45.47  0.95   58      up
 8    hdd  1.09200   1.00000  1.1 TiB  569 GiB  566 GiB  197 MiB  2.0 GiB  549 GiB  50.87  1.06   65      up
13    hdd  1.09200   1.00000  1.1 TiB  569 GiB  567 GiB  175 MiB  1.8 GiB  549 GiB  50.89  1.06   64      up
15    hdd  1.09200   1.00000  1.1 TiB  499 GiB  496 GiB  222 MiB  2.1 GiB  619 GiB  44.62  0.93   59      up
16    hdd  1.09200   1.00000  1.1 TiB  568 GiB  566 GiB  144 MiB  2.3 GiB  550 GiB  50.81  1.06   65      up
18    hdd  1.09200   1.00000  1.1 TiB  510 GiB  508 GiB  257 MiB  1.4 GiB  608 GiB  45.60  0.95   60      up
19    hdd  1.09200   1.00000  1.1 TiB  558 GiB  557 GiB  109 MiB  1.8 GiB  559 GiB  49.96  1.04   63      up
20    hdd  1.09200   1.00000  1.1 TiB  498 GiB  496 GiB  146 MiB  1.7 GiB  620 GiB  44.54  0.93   55      up
21    hdd  1.09200   1.00000  1.1 TiB  528 GiB  526 GiB  247 MiB  1.9 GiB  590 GiB  47.26  0.99   63      up
23    hdd  1.09200   1.00000  1.1 TiB  588 GiB  586 GiB   74 MiB  2.0 GiB  530 GiB  52.57  1.10   61      up
 0    hdd  1.09200   1.00000  1.1 TiB  498 GiB  496 GiB   11 KiB  1.9 GiB  620 GiB  44.53  0.93   82      up
 1    hdd  1.09200   1.00000  1.1 TiB  492 GiB  490 GiB    7 KiB  2.1 GiB  626 GiB  44.03  0.92   81      up
 3    hdd  1.09200   1.00000  1.1 TiB  491 GiB  489 GiB  3.2 MiB  2.1 GiB  627 GiB  43.89  0.92   82      up
 9    hdd  1.09200   1.00000  1.1 TiB  545 GiB  543 GiB   10 KiB  1.9 GiB  573 GiB  48.74  1.02   90      up
10    hdd  1.09200   1.00000  1.1 TiB  545 GiB  543 GiB    9 KiB  1.7 GiB  573 GiB  48.73  1.02   90      up
11    hdd  1.09200   1.00000  1.1 TiB  527 GiB  525 GiB   12 KiB  2.0 GiB  590 GiB  47.18  0.98   87      up
12    hdd  1.09200   1.00000  1.1 TiB  523 GiB  521 GiB   16 KiB  2.0 GiB  595 GiB  46.78  0.98   86      up
14    hdd  1.09200   1.00000  1.1 TiB  504 GiB  502 GiB   13 KiB  1.8 GiB  614 GiB  45.05  0.94   83      up
17    hdd  1.09200   1.00000  1.1 TiB  536 GiB  534 GiB    8 KiB  1.6 GiB  582 GiB  47.91  1.00   88      up
                       TOTAL   25 TiB   12 TiB   12 TiB  2.3 GiB   43 GiB   13 TiB  47.91
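
The same numbers can also be grouped by the crush tree, which should make the per-root imbalance easier to spot:

Code:
ceph osd df tree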

I'd be very grateful if anyone could help me with this one!

Regards and have a nice Sunday,
Felix
 
So far, I'll answer myself: it seems my CRUSH replication rules do not do what I expected:

This one, I hoped, would ensure 2 copies per host, i.e. 6 in total across the 3 hosts.

Code:
rule production_pool {
        id 1
        type replicated
        min_size 2
        max_size 6
        step take production
        step chooseleaf firstn 0 type osd
        step emit
}

This one, I hoped, would place the copies on at least 2 hosts and at most on all 3 hosts.

Code:
rule backup_pool {
        id 2
        type replicated
        min_size 2
        max_size 3
        step take backup
        step chooseleaf firstn 0 type osd
        step emit
}

I have now set all pools to the default rule, which has min_size 1 and max_size 10.
The cluster is now reshuffling like crazy, but when I examine the PGs as before with my super-duper sed one-liner, I can see that more and more PGs are being spread throughout the cluster as desired.
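
For reference, the rule of a pool can be checked and changed per pool roughly like this (<poolname> is a placeholder):

Code:
# show which crush rule a pool currently uses
ceph osd pool get <poolname> crush_rule

# assign the default rule to the pool
ceph osd pool set <poolname> crush_rule replicated_rule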

So I guess the problem is solved. However, I'd still be grateful for advice on how I can ensure that one PG keeps multiple copies on different OSDs of each host (node).
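
As far as I understand it, the reason the production_pool and backup_pool rules could not do this is that the production and backup roots contain the OSDs directly, without host buckets, so a "chooseleaf ... type osd" step has no host information to spread replicas over. If that is right, the fix would be to give those roots per-host sub-buckets and then choose hosts first and OSDs second. A sketch of what that might look like for the production root (bucket names, ids and weights are only examples, and I have not tested this):

Code:
# hypothetical per-host buckets inside the production root
host srv-virt-01-prod {
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.092
        item osd.1 weight 1.092
        item osd.3 weight 1.092
}
# ... analogous buckets for srv-virt-02-prod and srv-virt-03-prod ...

root production {
        alg straw2
        hash 0  # rjenkins1
        item srv-virt-01-prod weight 3.276
        item srv-virt-02-prod weight 3.276
        item srv-virt-03-prod weight 3.276
}

# with pool size 6: pick 3 hosts, then 2 OSDs on each host
rule production_pool {
        id 1
        type replicated
        min_size 2
        max_size 6
        step take production
        step choose firstn 3 type host
        step chooseleaf firstn 2 type osd
        step emit
}

If I understand correctly, the replica count itself comes from the pool's size setting; the min_size/max_size fields in the rule only limit which pool sizes the rule accepts.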

thank you!
Felix