Ceph error

Hi everybody, how are you?

Today, after updating Proxmox and restarting the OSDs, Ceph started reporting errors involving several of the SSDs we use as cache in our hybrid storage (SSD cache + HDD storage).

Restarting the affected OSDs brings no improvement.

Can anyone help?

Thank you in advance,

Greetings.
 

Attachments

  • ceph_error.PNG (53.7 KB)
Hello,

This is a complicated issue that I can't find a solution to.

After restarting all the OSD daemons, the inconsistency has increased a bit.

I would not want to perform any procedure without consulting an expert on this.

Regards.
 
You have outdated OSD versions: your Ceph version is 15.2.13, but the OSDs are still running 15.2.10.
Please check that all Ceph nodes run the same version, set the noout flag in the global OSD flags, and then restart the OSD daemons or the servers.
After the reboot, wait for the Ceph cluster to finish rebuilding, then disable the noout flag.
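
For reference, a rough sketch of that procedure on a single node could look like this (assuming the OSDs are managed by systemd, as is the default on Proxmox VE):

Code:
ceph osd set noout                 # keep stopped OSDs from being marked out and triggering rebalancing
systemctl restart ceph-osd.target  # restart all OSD daemons on this node so they run the 15.2.13 binaries
ceph versions                      # verify that every daemon now reports the same version
ceph osd unset noout               # clear the flag once the cluster has settled again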
 
Good morning,

I am in the process of restarting the OSD daemons so that everything is up to date. This is a delicate process, since the last time I rebooted is exactly when these inconsistency issues appeared.

I will update with the cluster status so that it can be tracked.

Regards and thanks.
 
Good morning,

After restarting the OSD daemons, Ceph still shows the same inconsistency error.

Could you please guide me on how to fix this error?

Regards and thanks.
 

Attachments

  • ceph_status.png (72.4 KB)
Hello,

Can someone continue helping with the cluster status?

It has been showing this error for a while and I haven't been able to fix it; I'm attaching the pveversion output in case it helps.

Code:
root@zeus:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Regards
 
Hello,

Thank you for taking the trouble to reply.

Attached is the output of ceph health detail
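
For reference, the affected placement groups can be pulled out of that output with something like this (the pg-id in the second command is just one of those reported):

Code:
ceph health detail | grep -E 'inconsistent|unfound'   # list only the problematic PGs
ceph pg map 4.fd                                      # show which OSDs hold one of the reported PGs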

The disks are still the same ones from the start; this behaviour appeared after updating Proxmox and restarting the OSD daemons to apply the new Ceph version.

Regards.
 

Attachments

  • ceph health detail.txt (19.1 KB)
Hello gurubert,

Thank you for taking the time to reply.

What the article in the documentation suggests is:

To scrub a placement group, execute the following:

ceph pg scrub {pg-id}
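
If I understand correctly, for one of the affected PGs from my ceph health detail output that would be something along these lines (the extra query is just my assumption of how to check when it last ran):

Code:
ceph pg scrub 4.fd                        # trigger a light scrub of a single placement group
ceph pg deep-scrub 4.fd                   # or a deep scrub, which also verifies object contents
ceph pg 4.fd query | grep -i scrub_stamp  # check when the PG was last (deep-)scrubbed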

Can you confirm that this process does not cause a service interruption?

There won't be any OSD crashes, or anything related, that could cause downtime on my system?

Thank you very much for the help with this issue,


Regards.

 
Scrubbing is a normal maintenance task that the cluster runs regularly, at least once a week. A healthy Ceph cluster will have no service interruption from it.

Please have a look at "ceph config dump". Maybe there is a hint as to why scrubbing has not run since April.
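
Besides the dump itself, it is worth checking whether scrubbing is blocked by a cluster flag or by the scrub scheduling options, for example:

Code:
ceph osd dump | grep flags           # "noscrub" or "nodeep-scrub" here would explain it
ceph config show osd.0 | grep scrub  # scrub settings in effect on one daemon (osd.0 as an example)
ceph config dump | grep scrub        # any scrub overrides stored in the monitor config database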
 
Hello,

Running ceph config dump returns no results.

Is that a normal thing?


Otherwise, I will run the suggested scrub during off-hours and report the results.

Regards and Thanks
 
Hello,

After running ceph pg scrub 4.fd, which is the first pg-id to be scrubbed, nothing happens.

I left a window of a few days and the scrub still doesn't seem to run.

Attached is the output of ceph pg 4.fd query and ceph -w | grep 4.fd

Thank you for your help.

Regards.


Code:
root@afrodita:~# ceph -w | grep 4.fd
2021-07-27T13:19:38.212332+0200 osd.6 [ERR] 4.fd has 1 objects unfound and apparently lost
2021-07-27T13:20:00.000308+0200 mon.zeus [ERR]     pg 4.fd has 1 unfound objects
2021-07-27T13:20:00.001076+0200 mon.zeus [ERR]     pg 4.fd is active+recovery_unfound+degraded+remapped, acting [6,1,0], 1 unfound
2021-07-27T13:20:00.001513+0200 mon.zeus [ERR]     pg 4.fd is active+recovery_unfound+degraded+remapped, acting [6,1,0], 1 unfound
2021-07-27T13:20:00.001523+0200 mon.zeus [ERR]     pg 4.fd not deep-scrubbed since 2021-04-20T01:24:51.227020+0200
2021-07-27T13:20:00.001950+0200 mon.zeus [ERR]     pg 4.fd not scrubbed since 2021-04-25T04:26:09.606693+0200
2021-07-27T13:20:27.638602+0200 osd.6 [ERR] 4.fd has 1 objects unfound and apparently lost
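
From what I can see in that output, the unfound object is probably what keeps the PG from recovering and scrubbing. In case it is useful, these are the commands I am considering for inspecting it; as far as I understand, the commented-out mark_unfound_lost step is destructive and only a last resort once all OSDs that might hold the object are back up:

Code:
ceph pg 4.fd list_unfound                         # show which objects are unfound and which OSDs were probed
ceph pg 4.fd query | grep -A5 might_have_unfound  # check whether recovery is still waiting on any OSD
# last resort only, reverts or loses the unfound object:
# ceph pg 4.fd mark_unfound_lost revert
# ceph pg 4.fd mark_unfound_lost delete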
 

Attachments

  • ceph_pg_query.txt (36.9 KB)
