Misplaced PGs won't recover

Jan 9, 2018
After performing an upgrade to the latest version of Proxmox, one of the host servers that had live VMs on it rebooted. All VMs recovered except for 2. It seems these 2 VMs use PGs on our cache tier, and Proxmox (at least in this new version) does not seem to recover misplaced PGs on the cache tier. Is there supposed to be a way to recover these so the VMs can boot?
 
hi,
can you provide the output of pveceph status? How many nodes are in your cluster?
 
There are 3 nodes in our cluster. Below is the requested output from each node:

root@pm6:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     380489/22497462 objects misplaced (1.691%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   5.2 MiB/s rd, 149 MiB/s wr, 378 op/s rd, 683 op/s wr
    recovery: 264 MiB/s, 68 objects/s

root@pm6:~#

root@pm5:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     374283/22502638 objects misplaced (1.663%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   74 MiB/s rd, 90 MiB/s wr, 2.64k op/s rd, 393 op/s wr
    recovery: 359 MiB/s, 93 objects/s

root@pm5:~#

root@pm4:~# pveceph status
  cluster:
    id:     cd69845c-44dc-49b1-afe0-6f6e83b07b57
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pm4,pm5,pm6 (age 8h)
    mgr: pm5(active, since 12h), standbys: pm6, pm4
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 24 up (since 5h), 24 in (since 3y); 51 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 976 pgs
    objects: 11.25M objects, 43 TiB
    usage:   83 TiB used, 52 TiB / 134 TiB avail
    pgs:     377526/22500514 objects misplaced (1.678%)
             924 active+clean
             24  active+clean+remapped
             18  active+remapped+backfill_wait
             9   active+remapped+backfilling
             1   active+clean+scrubbing+deep

  io:
    client:   54 MiB/s rd, 154 MiB/s wr, 1.35k op/s rd, 394 op/s wr
    recovery: 302 MiB/s, 79 objects/s
    cache:    0 op/s promote

root@pm4:~#
 
We moved the disks off the cache tier onto the regular tier, and that seems to have allowed us to boot the 2 VMs that were not working. Please advise whether this is expected behavior. Thanks.
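
For reference, this kind of per-disk move can also be done with qm from the CLI. A minimal sketch, using a made-up VM ID and storage names (neither is given in this thread; depending on the PVE version the command is spelled qm move_disk or qm disk move):

# show which storage backs each disk of VM 101 (example VM ID)
qm config 101 | grep -E '^(scsi|virtio|sata|ide)[0-9]'
# copy the disk from the cache-tier-backed storage to the regular storage;
# "ceph-cold" is a placeholder storage ID, not one from this cluster
qm move_disk 101 scsi0 ceph-cold
# add --delete only once the moved disk has been verified on the new storage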
 
Our server admin told me that the move of the VMs off the cache tier failed, so they are still there. Yet after the attempted move we were able to start the VMs. Any suggestions for how we can find out the root cause?
 
It would appear that Proxmox/Ceph are not able to fix these PGs. This count has stayed constant for hours. Any suggestions for how to fix these misplaced PGs?

pool hot-storage id 4
  123452/810584 objects misplaced (15.230%)
  client io 135 MiB/s rd, 311 KiB/s wr, 138 op/s rd, 6 op/s wr
  cache tier io 0 op/s promote

But then:

pool hot-storage id 4
  123525/811278 objects misplaced (15.226%)

pool hot-storage id 4
  124698/819462 objects misplaced (15.217%)

pool hot-storage id 4
  132364/868594 objects misplaced (15.239%)

pool hot-storage id 4
  132364/868594 objects misplaced (15.239%)

The number of PGs in active+clean+remapped is stuck at 25.
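
For what it's worth, here are some read-only commands that can show which PGs are stuck and where they currently sit (pool name taken from the output above; this is only a sketch, and none of it changes cluster state):

# show any PG-related warnings in detail
ceph health detail
# list the remapped PGs of the cache pool together with their up/acting OSD sets
ceph pg ls-by-pool hot-storage remapped
# show the pool settings, including the tier/overlay relationship
ceph osd pool ls detail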
 
Hi,

Just as a side question: why did you decide to use cache tiering? There are some words of warning about using it in the Ceph documentation [1]. Did your admin check the guide there on how to disable cache tiering?
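
For a writeback cache the removal procedure from that guide roughly looks like the sketch below. This is only a summary of [1], so please follow the guide itself; "cold-storage" is a placeholder for the backing pool, which isn't named in this thread:

# stop new writes from landing in the cache pool
ceph osd tier cache-mode hot-storage proxy
# flush and evict the remaining objects to the backing pool
# (may need to be repeated while clients are still writing)
rados -p hot-storage cache-flush-evict-all
# stop directing client traffic through the cache pool
ceph osd tier remove-overlay cold-storage
# detach the cache pool from the backing pool
ceph osd tier remove cold-storage hot-storage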


Has your cluster already finished the backfill tasks ("18 active+remapped+backfill_wait", "9 active+remapped+backfilling")? I think those have a higher priority than fixing misplaced PGs. Misplaced in this context just means that the PGs aren't in the spot where they are expected to be. Backfill is a special case of a recovery operation and might be deemed more important [2]. A rough sketch of how you could watch the backfill (and give it a bit more headroom if needed) is below the links.

[1] https://docs.ceph.com/en/latest/rados/operations/cache-tiering
[2] https://docs.ceph.com/en/quincy/rados/operations/pg-states/
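
A rough sketch of watching the backfill and, if needed, giving it more headroom. The option names are standard Ceph settings, but the values are only examples and the defaults differ between releases, so treat this as a starting point rather than a recommendation:

# follow recovery/backfill progress live
ceph -w
# optionally allow more concurrent backfill and recovery work per OSD
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active 5
# revert to the defaults once the backfill has finished
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active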
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!