ceph misplaced objects

Gabor A. Tar

New Member
Mar 6, 2019
Hi all,

I'm new to Ceph and ran into an interesting warning message which I don't think I can interpret correctly. I would appreciate any suggestion or comment on where to start unrolling the case, or on whether I'm missing something.

First of all, the configuration is structured as follows:
3 nodes, each one identical:
32GB RAM
i7-8700 CPU
1GbE for public network
1GbE for cluster communication
dual-port 10GbE bonding
2x120GB SSD in RAID1 -> Proxmox VE boot
1x120GB -> NVMe cache
2x6TB HDD -> Ceph OSD
2x4TB HDD -> Ceph OSD

1x10GbE MikroTik switch -> for Ceph storage
1x1GbE Ethernet switch -> for cluster and public network communication.

The warning messages are the following:

Code:
103740/793608 objects misplaced (13.072%)

Degraded data redundancy: 28528/793614 objects degraded (3.595%), 55 pgs degraded, 201 pgs undersized

pg 1.45 is stuck undersized for 89295.255902, current state active+undersized+remapped, last acting [8,7,6,5]
pg 1.46 is stuck undersized for 89343.453644, current state active+undersized+remapped, last acting [7,5,0,10,11]
pg 1.47 is stuck undersized for 89344.213165, current state active+undersized+degraded, last acting [0,2,1]
pg 1.48 is stuck undersized for 89284.596868, current state active+undersized+remapped, last acting [4,9,5,8,1]
pg 1.49 is stuck undersized for 89343.148491, current state active+undersized+remapped, last acting [0,10,5,8,1]
pg 1.4a is stuck undersized for 89295.261484, current state active+undersized+remapped, last acting [6,10,11,5]
pg 1.4d is stuck undersized for 89348.559683, current state active+undersized+remapped, last acting [12,5,7,8]
pg 1.4e is stuck undersized for 89284.596518, current state active+undersized+remapped, last acting [4,5,9,8,10]
pg 1.4f is stuck undersized for 89295.240541, current state active+undersized+remapped, last acting [8,6,10,4,0]
pg 1.52 is stuck undersized for 89344.550538, current state active+undersized+degraded, last acting [5,0,10]
pg 1.53 is stuck undersized for 89284.573238, current state active+undersized+remapped, last acting [8,4,9,5,0]
pg 1.54 is stuck undersized for 89295.253566, current state active+undersized+remapped, last acting [5,6,10,2,7]
pg 1.55 is stuck undersized for 89284.574872, current state active+undersized+remapped, last acting [8,1,9,5]
pg 1.56 is stuck undersized for 89348.905412, current state active+undersized+remapped, last acting [4,5,12,10]
pg 1.57 is stuck undersized for 89343.150310, current state active+undersized+remapped, last acting [0,1,8,10]
pg 1.58 is stuck undersized for 89295.254172, current state active+undersized+remapped, last acting [5,6,10,11]
pg 1.59 is stuck undersized for 89348.906011, current state active+undersized+remapped, last acting [7,11,12,8]
pg 1.5a is stuck undersized for 89343.494564, current state active+undersized+remapped, last acting [1,0,8,11]
pg 1.5b is stuck undersized for 89295.254653, current state active+undersized+remapped, last acting [5,6,7,8,0]
pg 1.5c is stuck undersized for 89284.577999, current state active+undersized+remapped, last acting [9,8,7,10]
pg 1.5d is stuck undersized for 89348.902290, current state active+undersized+remapped, last acting [8,12,4,7]
pg 1.5e is stuck undersized for 89284.581713, current state active+undersized+remapped, last acting [10,11,9,7,6]
pg 1.5f is stuck undersized for 89295.255562, current state active+undersized+remapped, last acting [8,6,7,0]
pg 1.c8 is stuck undersized for 89295.254029, current state active+undersized+remapped, last acting [5,4,6,1,8]
pg 1.ca is stuck undersized for 89348.906285, current state active+undersized+remapped, last acting [7,2,12,5,10]
pg 1.cb is stuck undersized for 89343.148769, current state active+undersized+remapped, last acting [0,10,8,7,11]
pg 1.cd is stuck undersized for 89348.560341, current state active+undersized+remapped, last acting [12,10,11,5,7]
pg 1.ce is stuck undersized for 89295.255590, current state active+undersized+remapped, last acting [1,8,6,4,5]
pg 1.d2 is stuck undersized for 89295.261832, current state active+undersized+remapped, last acting [6,5,1,10,0]
pg 1.d3 is stuck undersized for 89343.492630, current state active+undersized+remapped, last acting [8,10,0,4,5]
pg 1.d4 is stuck undersized for 89295.255590, current state active+undersized+remapped, last acting [2,1,6,5,7]
pg 1.d5 is stuck undersized for 89284.593449, current state active+undersized+remapped, last acting [1,11,9,5,8]
pg 1.d6 is stuck undersized for 89348.902759, current state active+undersized+remapped, last acting [8,12,10,4,5]
pg 1.d7 is stuck undersized for 89343.453711, current state active+undersized+remapped, last acting [7,5,0,1,8]
pg 1.d8 is stuck undersized for 89343.493305, current state active+undersized+remapped, last acting [1,11,0,5,10]
pg 1.d9 is stuck undersized for 89348.904691, current state active+undersized+remapped, last acting [7,5,12,8]
pg 1.db is stuck undersized for 89343.149529, current state active+undersized+remapped, last acting [0,5,7,8]
pg 1.dc is stuck undersized for 89295.263541, current state active+undersized+remapped, last acting [6,4,5,7,8]
pg 1.e3 is stuck undersized for 89295.250517, current state active+undersized+remapped, last acting [7,6,8,1,5]
pg 1.e5 is stuck undersized for 89295.254107, current state active+undersized+remapped, last acting [8,1,6,7,10]
pg 1.e7 is stuck undersized for 89348.559345, current state active+undersized+remapped, last acting [12,7,5,1]
pg 1.eb is stuck undersized for 89348.884054, current state active+undersized+remapped, last acting [11,12,4,2,8]
pg 1.ec is stuck undersized for 89348.560150, current state active+undersized+remapped, last acting [12,10,11,2,4]
pg 1.ed is stuck undersized for 89343.497267, current state active+undersized+remapped, last acting [11,10,0,1,5]
pg 1.ee is stuck undersized for 89295.261962, current state active+undersized+remapped, last acting [6,10,8,1,7]
pg 1.f1 is stuck undersized for 89343.149813, current state active+undersized+remapped, last acting [0,7,5,2,4]
pg 1.f2 is stuck undersized for 89348.904245, current state active+undersized+remapped, last acting [4,12,11,1]
pg 1.f3 is stuck undersized for 89343.493227, current state active+undersized+remapped, last acting [8,0,10,2,5]
pg 1.f5 is stuck undersized for 89343.148341, current state active+undersized+remapped, last acting [0,2,10,1,8]
pg 1.fa is stuck undersized for 89284.573764, current state active+undersized+remapped, last acting [11,9,4,5,7]
pg 1.ff is stuck undersized for 89343.452833, current state active+undersized+remapped, last acting [7,0,8,4,5]


I checked what the states mean:
"active": Ceph will process requests to the placement group.
"undersized": the placement group has fewer copies than the configured pool replication level.
"remapped": the placement group is temporarily mapped to a different set of OSDs than what CRUSH specified.

We had to replace one dual-port 10GbE interface on node2 because it was overheating and causing latency errors (this is solved), and we had a bad OSD in node1 that we also replaced (also solved). All the OSDs and the communication are working fine now; there are no errors in any OSD's log or anywhere else in the cluster, apart from the warning messages above.

ceph osd tree
Code:
ID CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       54.57889 root default
-3       18.19296     host node1
 0   hdd  3.63860         osd.0       up  1.00000 1.00000
 6   hdd  5.45789         osd.6       up  1.00000 1.00000
 9   hdd  5.45789         osd.9       up  1.00000 1.00000
12   hdd  3.63860         osd.12      up  1.00000 1.00000
-5       18.19296     host node2
 1   hdd  3.63860         osd.1       up  1.00000 1.00000
 4   hdd  3.63860         osd.4       up  1.00000 1.00000
 7   hdd  5.45789         osd.7       up  1.00000 1.00000
10   hdd  5.45789         osd.10      up  1.00000 1.00000
-7       18.19296     host node3
 2   hdd  3.63860         osd.2       up  1.00000 1.00000
 5   hdd  3.63860         osd.5       up  1.00000 1.00000
 8   hdd  5.45789         osd.8       up  1.00000 1.00000
11   hdd  5.45789         osd.11      up  1.00000 1.00000


crush map
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node1 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 18.193
alg straw2
hash 0 # rjenkins1
item osd.0 weight 3.639
item osd.6 weight 5.458
item osd.9 weight 5.458
item osd.12 weight 3.639
}
host node2 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 18.193
alg straw2
hash 0 # rjenkins1
item osd.1 weight 3.639
item osd.4 weight 3.639
item osd.7 weight 5.458
item osd.10 weight 5.458
}
host node3 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 18.193
alg straw2
hash 0 # rjenkins1
item osd.2 weight 3.639
item osd.5 weight 3.639
item osd.8 weight 5.458
item osd.11 weight 5.458
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 54.579
alg straw2
hash 0 # rjenkins1
item node1 weight 18.193
item node2 weight 18.193
item node3 weight 18.193
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
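
For reference, the map above can be exported, decompiled, and (if ever needed) re-injected with the usual getcrushmap/crushtool workflow; the file names below are just examples:

Code:
# export the compiled CRUSH map and decompile it to the text form shown above
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# after editing, recompile and inject it back
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new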

I've checked the pool's "size" and "min_size", which are 6/2. As I understand it, this means Ceph wants to replicate the PGs to 6 nodes, and if fewer than 2 of those copies are available, the OSDs become write-protected.
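
These values can be queried per pool; the pool name below is just a placeholder:

Code:
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size

# or all pools at once
ceph osd pool ls detail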

Since we have only 3 active nodes, could that cause the warning message, because Ceph cannot distribute the PGs to 6 nodes?
Is it possible to reduce the pool size from 6 to 3 on the fly, without data loss?
In this thread:
https://forum.proxmox.com/threads/urgent-proxmox-ceph-support-needed.24302

Q-wulf wrote that:
You should be able to reduce the number of replicas on a replicated pool as long as you keep your new "size" >= your "min_size", else you cause your cluster to not be able to do any I/o on the pool in question.
Does this reduction really work as long as the pool "size" is greater than the pool "min_size"?
The related documentation here is not entirely clear to me:
http://docs.ceph.com/docs/master/rados/operations/pools/#set-the-number-of-object-replicas
Set the Number of Object Replicas
To set the number of object replicas on a replicated pool, execute the following:

ceph osd pool set {poolname} size {num-replicas}

Important

The {num-replicas} includes the object itself. If you want the object and two copies of the object for a total of three instances of the object, specify 3.

For example:

ceph osd pool set data size 3

You may execute this command for each pool. Note: An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, you should use the min_size setting. For example:

ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.
If yes, how will it affect the cluster performance?


Thanks,
Gabor A. Tar
 
I've checked the pool's "size" and "min_size", which are 6/2. As I understand it, this means Ceph wants to replicate the PGs to 6 nodes, and if fewer than 2 of those copies are available, the OSDs become write-protected.

Since we have only 3 active nodes, could that cause the warning message, because Ceph cannot distribute the PGs to 6 nodes?
The PGs are replicated at the host level. As you already found out, with only 3 nodes you can hold only three copies with the default rule. The 'size' is how many copies Ceph should distribute, and 'min_size' tells Ceph down to how many copies the pool stays in read/write mode.

Is it possible to reduce the pool size from 6 to 3 on the fly, without data loss?
A decrease of PGs on a pool is not possible in Luminous, only an increase. Re-calculate [1] the PGs you need and create a new pool with that count, then use the 'Move disk' function of Proxmox VE to move the disks from one pool to the other. This can be done with live VMs and reduces the impact of the data movement.

[1] https://ceph.com/pgcalc/
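
A rough sketch of that workflow; the pool name, PG count, VM ID, and disk name are only examples, not values from this thread:

Code:
# create the new pool with the re-calculated PG count and the desired replication
ceph osd pool create rbd_new 256 256 replicated
ceph osd pool application enable rbd_new rbd
ceph osd pool set rbd_new size 3
ceph osd pool set rbd_new min_size 2

# add the pool as an RBD storage in Proxmox VE, then move each disk, e.g.
qm move_disk 100 scsi0 rbd_new --delete 1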
 
Thanks Alwin, I appreciate your help! I think I understand the basics now... :)

I have one more question which isn't strictly connected to my problem. I have 4TB WD Gold OSDs for production and 6TB WD Purple OSDs for backup. I want to separate them by creating different CRUSH rules for the different pools to rely on. Since all of my devices have the device class "hdd", is it possible to create CRUSH rules that are based on OSDs rather than on the device class?
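
What I have in mind is roughly the sketch below; custom device classes are only my guess at how this could be done and I have not tested it. The class name is made up, osd.0 and osd.12 are just the two 4TB disks from node1 as an example, and the pool name is a placeholder:

Code:
# replace the automatic "hdd" class on the WD Gold OSDs with a custom one
ceph osd crush rm-device-class osd.0 osd.12
ceph osd crush set-device-class wdgold osd.0 osd.12

# create a rule restricted to that class and point the production pool at it
ceph osd crush rule create-replicated gold_rule default host wdgold
ceph osd pool set <poolname> crush_rule gold_rule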

Thanks in advance!
Gabor A. Tar
 
ceph osd pool set ceph size 3

would in theory drop 3 of the replicas of each data block, but since you don't have them anyway (due to the lack of nodes), you don't lose any data.

If you really want 6 copies (?!) you could chooseleaf by OSD, but it is a bit late for that now unless you create another pool and then migrate your VMs' disks to it.
 
If you wanted to keep 6 copies of your data you could set a CRUSH rule to pick 2 OSDs per host, which would then place 2 copies on each of your 3 hosts.

However, unless you are expecting to lose disks quite frequently, anything above 3 replicas is close to useless when you only have 3 hosts; but I fully understand that you may not need much usable capacity and just want a large replica count.

If so, use a CRUSH rule like:

step choose indep 0 type host
step chooseleaf indep 2 type osd
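
Embedded in a full rule, that would look roughly like the sketch below; the rule name and id are placeholders, and I have written it with firstn, the mode the replicated_rule in the map above already uses (indep is normally seen in erasure-coded rules):

Code:
rule replicated_2_per_host {
id 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type host
step chooseleaf firstn 2 type osd
step emit
}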
 
