need help with CEPH and unfound errors after rebooting a node. Please.

damon1 · Feb 14, 2020

Hi guys,
I know how to fix it, but I am wondering why it happens.

When I reboot a node (or several nodes) for an update, everything works perfectly.
However, once the reboot is done I always end up with a few of these:

Possible data damage: 7 pgs recovery_unfound
pg 1.12 is active+recovery_unfound+degraded, acting [9,0], 1 unfound
pg 1.22 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 1 unfound
...

If I leave the system alone, the errors are not fixed on their own and I have to fix them manually,
e.g. ceph pg 1.12 mark_unfound_lost revert

I am wondering if there is something wrong with my configuration.
I use 1 SSD in each node as a cache and 2 HDDs (on each node) for storage.

There are 4 nodes and 3 monitors (nodes 2, 3 and 4).

...
...

Lastly,
all the OSDs are green and running the current version (14 Feb 2020).

Any thoughts are greatly appreciated. Or is this expected behavior?

Thanks
Damon

Alwin · Feb 14, 2020

Code:

step take default class ssd
step choose firstn 0 type osd

Why did you set this?

damon1 · Feb 16, 2020

Hi Alwin,

step take default class ssd
You are correct - this is redundant as the system will identify that this only contains SSD (or HDD) drives.

step choose firstn 0 type osd
Should this be "step choose firstn 0 type host" ?

Thanks (again)
damon

Alwin · Feb 16, 2020

damon1 said:
step take default class ssd
You are correct - this is redundant as the system will identify that this only contains SSD (or HDD) drives.

I would keep this. If you later add different media types, the PGs won't need to be redistributed onto the SSDs again.

damon1 said:
step choose firstn 0 type osd
Should this be "step choose firstn 0 type host" ?

With "type osd", the PGs are distributed with OSD as the failure domain, not host. So when a host dies, the PGs in question might have all their copies on that host.
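
To illustrate the difference between the two failure domains, here is a minimal Python sketch. It is not the real CRUSH algorithm: it simply enumerates every possible replica placement on a made-up topology (4 hosts with 2 OSDs each, hypothetical OSD ids, size 2) and lists the placements where all copies end up on one host.

```python
from itertools import combinations

# Hypothetical topology: 4 hosts with 2 OSDs each (OSD ids are made up).
HOSTS = {
    "node1": [0, 1],
    "node2": [2, 3],
    "node3": [4, 5],
    "node4": [6, 7],
}
OSD_TO_HOST = {osd: host for host, osds in HOSTS.items() for osd in osds}


def placements(failure_domain, size=2):
    """Yield every OSD set a PG could be mapped to.

    Exhaustive enumeration for illustration only, not the CRUSH algorithm.
    """
    for osds in combinations(sorted(OSD_TO_HOST), size):
        hosts = {OSD_TO_HOST[o] for o in osds}
        if failure_domain == "host" and len(hosts) < size:
            continue  # with a host failure domain, replicas sit on distinct hosts
        yield osds


def at_risk(failure_domain, size=2):
    """Return the placements whose replicas all live on a single host."""
    return [
        p for p in placements(failure_domain, size)
        if len({OSD_TO_HOST[o] for o in p}) == 1
    ]


print(at_risk("osd"))   # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(at_risk("host"))  # []
```

With "osd" as the failure domain, some PGs can have both copies on the same host, so rebooting that host takes the whole PG offline. With "host" no such placement exists.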

The above, combined with a size/min_size of 2/1, will result in missing PGs (data loss). min_size should always be at least 2 (see the link for more info).
https://books.google.at/books?id=yn... risk books&hl=de&pg=PA28#v=onepage&q&f=false
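
The effect of min_size can be sketched in a few lines of Python. This is a simplified model, assuming only that a PG accepts I/O while at least min_size replicas are up; the function names are made up for illustration and are not part of any Ceph API.

```python
def pg_accepts_io(replicas_up, min_size):
    """A PG serves I/O only while at least min_size replicas are up."""
    return replicas_up >= min_size


def copies_of_new_write(size, min_size, replicas_up):
    """Return how many copies a write made right now would get.

    Returns None if the PG blocks I/O because too few replicas are up.
    """
    if not pg_accepts_io(replicas_up, min_size):
        return None
    return min(replicas_up, size)


# size/min_size 2/1 with one host rebooting: writes continue with one copy.
print(copies_of_new_write(size=2, min_size=1, replicas_up=1))  # 1
# size/min_size 2/2: the PG pauses I/O instead of writing a single copy.
print(copies_of_new_write(size=2, min_size=2, replicas_up=1))  # None
# size/min_size 3/2: one host down still leaves two copies of every write.
print(copies_of_new_write(size=3, min_size=2, replicas_up=2))  # 2
```

In the 2/1 case, an object written while one replica is down exists only once. If that remaining OSD goes away before recovery finishes, the object cannot be located on any surviving OSD, which matches the "unfound" state described in the first post.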
