Misplaced PGs won't recover

Jan 9, 2018
After performing an upgrade to the latest version of Proxmox, one of the host servers that had live VMs on it rebooted. All VMs recovered except for 2. It seems these 2 VMs use PGs on our cache tier, and Proxmox (at least in this new version) does not seem to recover misplaced PGs on the cache tier. Is there supposed to be a way to recover these so the VMs can boot?
 
hi,
can you provide the output of pveceph status? How many nodes are in your cluster?
 
There are 3 nodes in our cluster. Below is the requested output from each node:

root@pm6:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     380489/22497462 objects misplaced (1.691%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   5.2 MiB/s rd, 149 MiB/s wr, 378 op/s rd, 683 op/s wr
    recovery: 264 MiB/s, 68 objects/s

root@pm6:~#

root@pm5:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     374283/22502638 objects misplaced (1.663%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   74 MiB/s rd, 90 MiB/s wr, 2.64k op/s rd, 393 op/s wr
    recovery: 359 MiB/s, 93 objects/s

root@pm5:~#

root@pm4:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     377526/22500514 objects misplaced (1.678%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   54 MiB/s rd, 154 MiB/s wr, 1.35k op/s rd, 394 op/s wr
    recovery: 302 MiB/s, 79 objects/s
    cache:    0 op/s promote

root@pm4:~#
 
We moved the disks off the cache tier onto the regular tier, and that seems to have allowed us to boot the 2 VMs that were not working. Please advise whether this is expected behavior. Thanks.
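
For reference, this kind of per-disk move can also be done with qm from the CLI. A minimal sketch, using a made-up VM ID and storage names (neither is given in this thread; depending on the PVE version the command is spelled qm move_disk or qm disk move):

# show which storage backs each disk of VM 101 (example VM ID)
qm config 101 | grep -E '^(scsi|virtio|sata|ide)[0-9]'
# copy the disk from the cache-tier-backed storage to the regular storage;
# "ceph-cold" is a placeholder storage ID, not one from this cluster
qm move_disk 101 scsi0 ceph-cold
# add --delete only once the moved disk has been verified on the new storage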
 
Our server admin told me that the move of the VMs off the cache tier failed, so they are still there. Yet after the attempted move we were able to start the VMs. Any suggestions for how we can find out the root cause?
 
It would appear that Proxmox/Ceph are not able to fix these PGs. This count has stayed constant for hours. Any suggestions for how to fix these misplaced PGs?

pool hot-storage id 4
  123452/810584 objects misplaced (15.230%)
  client io 135 MiB/s rd, 311 KiB/s wr, 138 op/s rd, 6 op/s wr
  cache tier io 0 op/s promote

But then:

pool hot-storage id 4
  123525/811278 objects misplaced (15.226%)

pool hot-storage id 4
  124698/819462 objects misplaced (15.217%)

pool hot-storage id 4
  132364/868594 objects misplaced (15.239%)

pool hot-storage id 4
  132364/868594 objects misplaced (15.239%)

The number of PGs in active+clean+remapped is stuck at 25.
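
For what it's worth, here are some read-only commands that can show which PGs are stuck and where they currently sit (pool name taken from the output above; this is only a sketch, and none of it changes cluster state):

# show any PG-related warnings in detail
ceph health detail
# list the remapped PGs of the cache pool together with their up/acting OSD sets
ceph pg ls-by-pool hot-storage remapped
# show the pool settings, including the tier/overlay relationship
ceph osd pool ls detail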
 
Hi,

Just as a side question: why did you decide to use cache tiering? There are some words of warning about using it in the Ceph documentation [1]. Did your admin check the guide there on how to disable cache tiering?
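
For a writeback cache the removal procedure from that guide roughly looks like the sketch below. This is only a summary of [1], so please follow the guide itself; "cold-storage" is a placeholder for the backing pool, which isn't named in this thread:

# stop new writes from landing in the cache pool
ceph osd tier cache-mode hot-storage proxy
# flush and evict the remaining objects to the backing pool
# (may need to be repeated while clients are still writing)
rados -p hot-storage cache-flush-evict-all
# stop directing client traffic through the cache pool
ceph osd tier remove-overlay cold-storage
# detach the cache pool from the backing pool
ceph osd tier remove cold-storage hot-storage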


Has your cluster already finished the backfill tasks ("18 active+remapped+backfill_wait", "9 active+remapped+backfilling")? I think those have a higher priority than fixing misplaced PGs. Misplaced in this context just means that the PGs aren't in the spot where they are expected to be. Backfill is a special case of a recovery operation and might be deemed more important [2]. A rough sketch of how you could watch the backfill (and give it a bit more headroom if needed) is below the links.

[1] https://docs.ceph.com/en/latest/rados/operations/cache-tiering
[2] https://docs.ceph.com/en/quincy/rados/operations/pg-states/
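
A rough sketch of watching the backfill and, if needed, giving it more headroom. The option names are standard Ceph settings, but the values are only examples and the defaults differ between releases, so treat this as a starting point rather than a recommendation:

# follow recovery/backfill progress live
ceph -w
# optionally allow more concurrent backfill and recovery work per OSD
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 5
# revert to the defaults once the backfill has finished
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active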
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!