Need help with Ceph and unfound errors after rebooting a node, please.

damon1

Hi guys,
I know how to fix it, but I am wondering why it happens.

When I reboot a node (or nodes) for an update, everything works perfectly.
However, once the reboot is done I always end up with a few of these:

Possible data damage: 7 pgs recovery_unfound
pg 1.12 is active+recovery_unfound+degraded, acting [9,0], 1 unfound
pg 1.22 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 1 unfound
...

If I leave the system alone, the errors do not clear on their own and I have to fix them manually,
i.e. ceph pg 1.12 mark_unfound_lost revert
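
For anyone who wants to script around this, here is a minimal Python sketch that pulls the affected PG IDs, acting sets and unfound counts out of health output like the lines pasted above. It only parses text and does not talk to the cluster. The function name parse_unfound is hypothetical, and the line format is assumed to match exactly what is pasted in this post.

```python
import re

# Matches lines shaped like the ones pasted above, e.g.
# "pg 1.12 is active+recovery_unfound+degraded, acting [9,0], 1 unfound"
PG_LINE = re.compile(
    r"^pg (?P<pgid>\S+) is (?P<state>[a-z_+]+), "
    r"acting \[(?P<acting>[\d,]*)\], (?P<unfound>\d+) unfound$"
)

def parse_unfound(text):
    """Return a list of (pgid, states, acting_osds, unfound_count) tuples."""
    results = []
    for line in text.splitlines():
        m = PG_LINE.match(line.strip())
        if m is None:
            continue  # summary lines and "..." are skipped
        acting = [int(x) for x in m.group("acting").split(",") if x]
        states = m.group("state").split("+")
        results.append((m.group("pgid"), states, acting, int(m.group("unfound"))))
    return results

sample = """\
Possible data damage: 7 pgs recovery_unfound
pg 1.12 is active+recovery_unfound+degraded, acting [9,0], 1 unfound
pg 1.22 is active+recovery_unfound+undersized+degraded+remapped, acting [0], 1 unfound
"""

parsed = parse_unfound(sample)
for pgid, states, acting, unfound in parsed:
    print(pgid, acting, unfound)
```

Note that pg 1.22 has only one OSD left in its acting set, which is worth keeping in mind before reverting anything.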

I am wondering if there is something wrong with my configuration?
I use 1 SSD in each node as a cache and 2 HDDs (on each node) for storage.

There are 4 nodes and 3 monitors (nodes 2, 3 and 4).

1581643760337.png

1581643821701.png
...
...

1581643861168.png



1581643900520.png


1581643961172.png

Lastly,
all the OSDs are green and running the current version (14 Feb 2020).

1581644007673.png

Any thoughts are greatly appreciated. Or is this expected behavior?

Thanks
Damon
 
Code:
step take default class ssd
step choose firstn 0 type osd
Why did you set this?
 
Hi Alwin,

step take default class ssd
You are correct - this is redundant as the system will identify that this only contains SSD (or HDD) drives.



step choose firstn 0 type osd
Should this be "step choose firstn 0 type host" ?



Thanks (again)
damon
 
step take default class ssd
You are correct - this is redundant as the system will identify that this only contains SSD (or HDD) drives.
I would keep this. If you add different media types later, the PGs don't need to be redistributed onto the SSDs again.

step choose firstn 0 type osd
Should this be "step choose firstn 0 type host" ?
This results in PGs being distributed at the OSD failure-domain level, not the host level. So, when a host dies, the PGs in question might have all their copies on said host.
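
To put a rough number on that, here is a small Python sketch that enumerates every possible replica set for a layout like the one in the first post (4 hosts, 2 HDD OSDs each, 2 copies). It is only a counting exercise under the assumption that every OSD is equally likely to be picked; real CRUSH placement is a weighted pseudo-random hash, so treat the result as an illustration, not a prediction for this cluster. All names in the code are made up for the example.

```python
from itertools import combinations, product

HOSTS = 4          # from the first post: 4 nodes
OSDS_PER_HOST = 2  # from the first post: 2 HDDs per node
SIZE = 2           # replica count, per the size/min_size of 2/1 in this thread

# Each OSD is modelled as a (host, slot) pair.
osds = [(h, s) for h in range(HOSTS) for s in range(OSDS_PER_HOST)]

def same_host_count(replica_sets):
    """Return (sets with all copies on one host, total number of sets)."""
    sets = list(replica_sets)
    bad = sum(1 for replicas in sets if len({host for host, _ in replicas}) == 1)
    return bad, len(sets)

# "type osd": any SIZE distinct OSDs form a valid replica set.
osd_result = same_host_count(combinations(osds, SIZE))

# "type host": pick SIZE distinct hosts, then one OSD on each of them.
host_sets = [
    tuple(zip(hosts, slots))
    for hosts in combinations(range(HOSTS), SIZE)
    for slots in product(range(OSDS_PER_HOST), repeat=SIZE)
]
host_result = same_host_count(host_sets)

print("type osd :", osd_result)   # (same-host sets, total sets)
print("type host:", host_result)
```

Under these assumptions, 4 of the 28 possible OSD pairs sit entirely on one host with "type osd", while "type host" rules that case out by construction.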

The above, in conjunction with a size/min_size of 2/1, will result in missing PGs (data loss). min_size should always be at least 2 (see the link for more info).
https://books.google.at/books?id=yn... risk books&hl=de&pg=PA28#v=onepage&q&f=false
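
As a simplified illustration of why min_size matters here, a PG keeps serving client I/O only while at least min_size of its replicas are up. The Python sketch below models just that one check; pg_accepts_io is a hypothetical helper, not a Ceph API, and it ignores everything else that influences PG state.

```python
def pg_accepts_io(replicas_up, min_size):
    """A PG serves client I/O only while at least min_size replicas are up."""
    return replicas_up >= min_size

# With size=2, rebooting one node can leave a PG with a single replica up.
for min_size in (1, 2):
    allowed = pg_accepts_io(1, min_size)
    print(f"min_size={min_size}, 1 replica up -> accepts I/O: {allowed}")
```

With min_size=1 the PG keeps taking writes on a single copy while the other one is down. If the OSD holding that only up-to-date copy then goes away before recovery has finished, the newest object versions exist nowhere reachable, which is what shows up as unfound objects. With min_size=2 the PG would pause I/O instead.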
 
