Ceph PG error

Jackmynet

Member
Oct 5, 2024
Hi all,

I am a beginner Proxmox user and have just set up our first small data centre for our business, running 3 VMs on a 3-node cluster.

I have set up Ceph shared storage and enabled HA for my VMs, which seems to be working fine. However, I have one error which I cannot figure out.

I have four 1 TB HDDs in each server, along with two 480 GB SSDs. My Ceph OSDs have been configured as below on each node (a rough example of the creation command follows the list).

OSD 1: a 500 GB partition of HDD 1, with 80 GB of SSD 1 as the DB device and 80 GB of SSD 2 as the WAL device.

OSD 2: a 500 GB partition of HDD 2, with 80 GB of SSD 1 as the DB device and 80 GB of SSD 2 as the WAL device.

OSD 3: a 500 GB partition of HDD 3, with 80 GB of SSD 1 as the DB device and 80 GB of SSD 2 as the WAL device.

OSD 4: a 500 GB partition of HDD 4, with 80 GB of SSD 1 as the DB device and 80 GB of SSD 2 as the WAL device.
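
In command form, each OSD corresponds to something roughly like this (I am sketching the equivalent ceph-volume call; the device paths are placeholders only and will differ per node):

    # OSD 1: data on the 500 GB partition of HDD 1, DB on an 80 GB partition of SSD 1,
    # WAL on an 80 GB partition of SSD 2 (example paths, adjust to the real layout)
    ceph-volume lvm create --data /dev/sdb1 --block.db /dev/sdf1 --block.wal /dev/sdg1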

The problem is that Ceph shows a health warning: the status circle is mostly green with one grey segment, and it says the status of 1 PG is "unknown".

How do I troubleshoot such an error? I think this is the only issue across my configuration at the moment. The Ceph storage pool itself is working well, with all my VMs now moved onto it.

Any help would be much appreciated.
 
Thanks for the reply!

See the attached screenshots of the requested outputs.

They were too long to paste as text in the comment. The pg dump is too long even for a single screenshot, but I have captured what I think is the relevant part.
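
If the screenshots are hard to read, I believe the same information can be pulled on the CLI with something like this (standard Ceph commands; output omitted here since it is in the attachments):

    ceph health detail
    ceph pg dump pgs_brief | grep unknown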
 

Attachments

  • Screenshot 2024-10-05 122818.png
  • Screenshot 2024-10-05 122757.png
  • Screenshot 2024-10-05 122717.png
  • Screenshot 2024-10-05 122702.png
What happened to the pool with ID 1 (usually the pool called ".mgr")?

Its only placement group is in the "unknown" state, which is not good.

Please try to restart OSDs 2, 3 and 7 with "systemctl restart ceph-osd@2.service" etc. on their respective nodes.

If that does not fix the issue, please post the output of "ceph health detail" and "ceph osd pool ls detail".
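
Put together, the steps would look roughly like this (run each restart on the node that actually hosts that OSD):

    # on the node hosting OSD 2
    systemctl restart ceph-osd@2.service
    # on the node hosting OSD 3
    systemctl restart ceph-osd@3.service
    # on the node hosting OSD 7
    systemctl restart ceph-osd@7.service

    # afterwards, check the state again and post these outputs if the PG is still unknown
    ceph health detail
    ceph osd pool ls detail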
 
The .mgr pool with this PG is actually still there; it's just not in use?

I see now that this PG is still assigned to that pool, which I did not notice before.

We did mess around with Ceph a bit during the original setup, having to wipe, delete and redo the OSDs once we realised it was best to have the DB and WAL on SSD.

I have restarted each OSD and it did not help; Ceph still settles back into the same state.

See the requested output attached.
 

Attachments

You seem to have removed the OSDs too quickly, so the pool was not able to migrate its PG to other OSDs, and now it cannot find the PG any more.

You can try to remove the pool and recreate it. It is called ".mgr" with a dot in front.
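
On the command line that would be roughly the following. It assumes pool deletion is currently disallowed (the default) and that the active manager will recreate the pool on its own, so double-check that the pool name is exactly ".mgr" before running it:

    # temporarily allow pool deletion
    ceph config set mon mon_allow_pool_delete true
    # remove the broken .mgr pool (the active mgr should recreate it automatically)
    ceph osd pool rm .mgr .mgr --yes-i-really-really-mean-it
    # turn the safety switch back off
    ceph config set mon mon_allow_pool_delete false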
 
OK, I have VMs using this storage now and working perfectly. Can I do this without disturbing them and having to move them?

Thanks for the help. This diagnosis sounds correct, as it was all done quite quickly during setup.
 
OK, I have removed it now and it automatically recreated itself?

No more health warning on Ceph! It is recovering now. Thank you so much for your help and advice.
 
Thanks again. Is the way I have set up my OSDs acceptable, with the HDD as the main data disk and the SSDs as DB and WAL devices?

There is a lot of conflicting information out there on OSD creation and SSD/HDD differences.
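
In case it is useful for an answer, this is roughly how I check where the DB and WAL actually landed for each OSD; the OSD ID is just an example, and the metadata field names may differ slightly between Ceph versions:

    # per-node view of the data/db/wal devices behind each OSD
    ceph-volume lvm list
    # cluster-wide metadata for a single OSD, e.g. OSD 2
    ceph osd metadata 2 | grep -E 'bluefs|devices'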