Odd Ceph Issues

emilhozan

Member
Aug 27, 2019
Hey all,

My cluster consists of:
- 5x physical servers
- storage is ceph
- each server has 4x HDDs and 1x SSD (all OSDs)
- each node has PVE installed on a 6th SAS drive
- I have an HDD pool and a separate pool for SSD

During testing, I had no issues with creating, migrating, or anything else. In fact, I was quite impressed with how smoothly everything was going. But things changed once we put production-level VMs on the cluster. You may have read my previous posts and I greatly appreciate all your help, but I am now facing another very odd issue.

My issues now are:
- node 1 was completely removed from the crush map
- the OSD view doesn't show any OSD on that node
- I cannot create a new VM on it, nor on any node now
- however, I had a VM on that node that worked fine
- I was able to migrate the VM onto another node just fine

I was watching the syslog during all of this, and all that's logged when creating a VM is that the task is starting; there is no other information. The migration logs appeared when doing this and the VM is functional on another node. Even now as I write this, the VM still hasn't been created, and I've been typing this for at least 20 minutes.

My question is, where else can I look for logging information here?
I am baffled that the OSDs don't appear and that node1 isn't in the Ceph CRUSH map, but even more so that the VM on it was working just fine.


I am tempted to try a complete reinstall of PVE, zapping all disks and starting afresh, given the issues we've been experiencing, but that's not really a solution, more of a stab in the dark so to speak. I tried looking things up but found nothing recent, and I figured the pre-2018 posts are too old.

I would really appreciate any help here - please!
 
- node 1 was completely removed from the crush map
Was the node removed manually?

- the OSD view doesn't show any OSD on that node
Well, if it was removed, then it's not supposed to show up. Please post the output of ceph -s, ceph osd df tree, and pveversion -v.
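For reference, these can be run on any node of the cluster and the output pasted here:

# overall cluster health, monitor quorum and OSD/PG state
ceph -s
# per-OSD usage and where each OSD sits in the CRUSH tree
ceph osd df tree
# installed Proxmox VE package versions
pveversion -v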

- I cannot create a new VM on it, nor on any node now
- however, I had a VM on that node that worked fine
- I was able to migrate the VM onto another node just fine
Whether the already-running process can continue depends on what state the node is in.

My question is, where else can I look for logging information here?
Journal/Syslog and Ceph's own logs (/var/log/ceph).
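For example (the OSD id 3 below is just a placeholder):

# follow the node's journal while reproducing the problem
journalctl -f
# journal of a single OSD service
journalctl -u ceph-osd@3
# Ceph's own per-daemon log files
ls /var/log/ceph/
tail -f /var/log/ceph/ceph-osd.3.log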
 

Attachments

Interestingly I am unable to create backups of my VMs...
It shows you have one OSD down, which is causing some PGs to be down; this means you'll be getting I/O errors.

You need to work out which OSD is down and get it back online; that should hopefully bring your PGs back online so the cluster starts working again.

What settings did you use when creating the pool (size / min_size)? One OSD down should not cause any PGs to go offline.
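They can be checked with something like this (hdd_pool is just an example name):

# replication settings of a single pool
ceph osd pool get hdd_pool size
ceph osd pool get hdd_pool min_size
# or all pools at once
ceph osd pool ls detail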
 
@sg90 + @Alwin
Sorry for the lapse here.

I removed the OSDs, replaced disks, and tried to troubleshoot quite a bit, all to no avail.

To recap, the issues started when we moved our production VMs from our other 10x node cluster to this 5x node cluster, for reasons related to our colo. We started getting SMART errors of "FailedOpenDrive", and it was suggested that maybe the disks or the bays were bad. Reseating the drives helped temporarily, but when the errors came back I rearranged the drives. Out of the 20 HDDs, I believe about 15 or so "failed", so to speak. Even after rearranging the drives to rule out bay issues, the errors came back.
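For reference, a quick way to check a drive that throws these SMART errors (/dev/sdb is just an example device):

# full SMART report (behind a RAID controller you may need -d megaraid,N or similar)
smartctl -a /dev/sdb
# quick pass/fail health summary
smartctl -H /dev/sdb
# kernel messages often show controller/backplane resets and timeouts
dmesg | grep -iE 'sd|ata|scsi' | tail -n 50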

This led us to replace the disks that reported issues, remove the old OSDs, create new ones on the new disks, and reboot the cluster. The SMART logs stopped generating, which is a good thing I suppose, but then we started facing some serious issues with the HDD pool.

I attached some notes I made when investigating this matter in great deal earlier this week.


The good news is that the cluster currently has no production VMs on it due to these issues. The bad news is that our second cluster is now showing the same kind of preliminary SMART errors that got this first cluster into the situation it is in now, so we're hoping to avoid a repeat and figure out what the issue is.


That said, I'm contemplating a complete HDD pool tear-down and rebuild, but wanted to see if any of you could help out first.


Now to answer the past questions:

What settings did you use when creating the pool (size / min_size)?
Size 2, minimum 2.

The reason is that we wanted a bit more usable space than size 3 offered.

I did see mention of lowering the min size but was concerned about not having the "2" sets of data for the logs.
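For reference, with size 2 / min_size 2 a single OSD outage is enough to block I/O on the PGs it carries. The settings can be changed on a live pool, roughly like this (hdd_pool is just an example name):

# allow I/O with only one copy available - risky, no redundancy for writes made in that window
ceph osd pool set hdd_pool min_size 1
# the usual recommendation: three copies, two required for I/O
ceph osd pool set hdd_pool size 3
ceph osd pool set hdd_pool min_size 2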

As for one OSD going down not being enough to cause these issues: there are multiple OSDs having issues. If I restart or stop them, a different OSD gets reported as down as well. The high-level summary at DC > Summary > Ceph, with its warning and error logs, is what I'm referring to.
 

Attachments

Some questions:
1) Are your servers new? On used servers, an overloaded SCSI controller can drop drives and show them as not connected.
2) Ceph is not magical: if you lose too many HDDs and it finds it no longer has enough copies of the data, it immediately stops accepting reads and writes so as not to increase the damage. So I can understand why your backups don't work: they cannot read data from the Ceph OSDs.
3) Do you know how to replace a broken Ceph disk? You must be sure it is empty, or that Ceph has already replicated its data elsewhere, otherwise you will corrupt Ceph yourself.
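For point 3, a rough sketch of the usual sequence, assuming the failed disk backs osd.12 (a made-up id) and the cluster has room to re-replicate:

# take the OSD out and let Ceph move its data elsewhere
ceph osd out 12
# wait for recovery to finish and confirm it's safe to remove
ceph -s
ceph osd safe-to-destroy osd.12
# stop the daemon and remove the OSD from the cluster
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it
# then create a new OSD on the replacement disk, via the PVE GUI or pveceph
# (on older PVE releases the command was "pveceph createosd")
pveceph osd create /dev/sdX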
 
Sorry to be blunt, but reading again: you have 20 OSDs but only 13 up, so 7 are down. And 27 inactive PGs means that Ceph no longer has any copy of that data, so it has lost it. It is then obvious the backup will hang: Ceph has stopped working.
 
Hey @mgiammarco ,

1) No, but why is it that the SSDs on that SAME controller are just fine?
The issue is with the HDDs and whatever process provides the Ceph services on them, there's no doubt about that. This was further exemplified when we moved the VMs back to the original cluster they were on (Cluster2).
2) I beg to differ. Despite the issues I've had with the "failed" disks, I managed to pull out ALL the data with no data loss at all. The process was neither easy nor fun, but I need to figure out where the issue is stemming from that's causing the OSDs to "fail". As for the backups, I was eventually able to make some.
3) When replacing the OSDs initially, I looked up several forum posts to confirm the process. Plus, if an OSD fails, then by that logic its data should have been moved onto another OSD.
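Whether the data really did move can be checked per OSD; for example (osd id 12 is hypothetical):

# list the PGs that still have a copy on a given OSD
ceph pg ls-by-osd 12
# PGs whose copies did not make it elsewhere show up here
ceph health detail | grep -E 'incomplete|undersized|degraded'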
 
After working on the warnings from Datacenter > Summary > Ceph Health, I managed to get the errors resolved (including lowering the min_size parameter). The most recent logs I am seeing (the last ones, in fact) are:

pg 20.0 is incomplete, acting [13,7]
pg 20.3 is incomplete, acting [7,0]
pg 20.8 is incomplete, acting [16,14]


It's just a repeat of the above. I couldn't find much online, so if there's anyone that can help I'd really appreciate it.
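For anyone hitting the same thing: an incomplete PG can be queried to see which OSDs it is waiting for and why it is stuck; a sketch, using pg 20.0 from the output above:

# peering/recovery state and the OSDs the PG has probed or is still waiting for
ceph pg 20.0 query
# summary of every problem PG in the cluster
ceph health detail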

I have the option to just destroy and recreate the HDD pool, but I don't want to resort to that just yet if I don't have to. Seeing as we're experiencing similar errors on our other cluster, I can't keep moving VMs between clusters and recreating the HDD pools.
 
I did quite a bit of research and now see things a bit more clearly, but still don't have this figured out.

The PGs are incomplete because there aren't enough instances of them left. I tried to search for how to delete the PGs, but there was no real clear answer. I take it this is more of a Ceph issue, but I at least wanted to post here in case anyone knows. I am still digging through the Ceph mailing lists in the meantime.
 
I tried to search for how to delete the PGs but there was no real clear answer.

Do you need the data on that pool? If you delete the PGs, you'll be deleting small chunks of a good percentage of your data, leaving most of the data on that pool corrupt.
 
@sg90

Do you need the data on that pool?

I don't NEED the data, no. I just want to figure out a way to delete the PGs that are giving me issues without having to delete the pool itself. Do you know if that's the only way to resolve this?

My concern is that this issue may rise up again in my other cluster and I can't just keep migrating the VMs between clusters when this stuff happens.

One additional piece of information I was apprised of about this issue: for VMs that have their OS installed on the SSD pool and a "/data" store mounted from the HDD pool, the HDD pool still causes problems even when that data isn't being accessed.
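For context, such a VM simply has one disk on each pool in its config; roughly like this (the storage names ssd_pool/hdd_pool and VM id 101 are made up):

# excerpt of a hypothetical /etc/pve/qemu-server/101.conf
# OS disk on the SSD pool
scsi0: ssd_pool:vm-101-disk-0,size=32G
# "/data" disk on the HDD pool
scsi1: hdd_pool:vm-101-disk-1,size=500G

If the HDD pool's PGs are inactive, any I/O the guest issues to that second disk will hang, which can stall the whole VM even though the OS disk is fine.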
 
Some arcane command-line spells I gathered from
https://medium.com/opsops/recoverin...-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
and used successfully to remove 4 "PG is incomplete" errors from my test cluster.

WARNING THIS WILL DESTROY DATA

I don't care because
1. it's a test cluster
2. I have backups of the VMs I don't want to lose
3. I've lost enough time trying to do it "properly" already, and I'd rather switch to XCP-ng (an open-source Xen distro) than have to redo the whole pool because of these nonsense errors.

find the OSD where there are incomplete PGs with
ceph pg dump | grep incomplete

or also with a
ceph pg repair [PG ID FROM ERROR]

That has a very low chance of fixing anything, but it will print which OSD has the error in a very clean way.

Connect via SSH or console to the node where the OSD physically resides. The following commands will work ONLY ON THAT NODE, because we are using an offline Ceph recovery tool to manipulate the OSD manually.

Stop the affected OSD from the GUI or with systemd:
systemctl stop ceph-osd@[OSD NUMBER]

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-[OSD NUMBER] --op mark-complete --pgid [PG ID FROM ERROR]


Start the OSD again from the GUI or with systemd:
systemctl start ceph-osd@[OSD NUMBER]

It seems that if I don't also do the following, the OSD crashes within 10 minutes and can't be re-added again (the only way is to mark it as "out", delete it, then add it again from the GUI):
ceph osd force-create-pg [PG ID FROM ERROR] --yes-i-really-mean-it

This should fix the "incomplete" PG errors of the pool.

If the same PG now comes up as "unfound", we can nuke it again with a different command:

ceph pg [PG ID FROM ERROR] mark_unfound_lost delete
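After all of that, it's worth confirming the pool is actually healthy again before putting VMs back on it; for example:

# all PGs should eventually report active+clean
ceph -s
ceph health detail
# nothing should be left stuck inactive
ceph pg dump_stuck inactive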
 
