Ceph error

Hi everybody, how are you?

Today, after updating Proxmox and restarting the OSDs, Ceph started reporting errors involving several of the SSDs we use as cache in our hybrid storage (SSD cache + HDD storage).

Restarting the affected OSDs brings no improvement.

Can anyone help?

Thank you in advance,

Greetings.
 

Attachments

  • ceph_error.PNG (53.7 KB)
Hello,

This is a complicated issue that I can't find a solution to.

After restarting all the OSD daemons, the inconsistency has increased a bit.

I would not want to perform any procedure without consulting an expert on this.

Regards.
 
You have outdated OSD versions: your Ceph version is 15.2.13, but the OSDs are still running 15.2.10.
Please check that all Ceph nodes run the same version, set the noout flag in the global OSD flags, and then restart the OSD daemons or the servers.
After the reboot, wait for the Ceph cluster to finish rebuilding, then disable the noout flag.
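
For reference, a rough sketch of that procedure on a single node could look like this (assuming the OSDs are managed by systemd, as is the default on Proxmox VE):

Code:
ceph osd set noout                 # keep stopped OSDs from being marked out and triggering rebalancing
systemctl restart ceph-osd.target  # restart all OSD daemons on this node so they run the 15.2.13 binaries
ceph versions                      # verify that every daemon now reports the same version
ceph osd unset noout               # clear the flag once the cluster has settled again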
 
Good morning,

I am in the process of restarting the OSD daemons so that everything is up to date. This is a delicate process, since the last time I rebooted is exactly when these inconsistency issues appeared.

I will update with the cluster status so that it can be tracked.

Regards and thanks.
 
Good morning,

After restarting the OSD daemons, Ceph still shows the same inconsistency error.

Could you please guide me on how to fix this error?

Regards and thanks.
 

Attachments

  • ceph_status.png (72.4 KB)
Hello,

Can someone continue helping with the cluster status?

It has been showing this error for a while and I haven't been able to fix it; I'm attaching the pveversion output in case it helps.

Code:
root@zeus:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.13-pve1~bpo10
ceph-fuse: 15.2.13-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Regards
 
Hello,

Thank you for taking the trouble to reply.

Attached is the output of ceph health detail
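
For reference, the affected placement groups can be pulled out of that output with something like this (the pg-id in the second command is just one of those reported):

Code:
ceph health detail | grep -E 'inconsistent|unfound'   # list only the problematic PGs
ceph pg map 4.fd                                      # show which OSDs hold one of the reported PGs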

The disks are still the same ones from the start; this behaviour appeared after updating Proxmox and restarting the OSD daemons to apply the new Ceph version.

Regards.
 

Attachments

  • ceph health detail.txt (19.1 KB)
Hello gurubert,

Thank you for taking the time to reply.

What the article in the documentation suggests is:

To scrub a placement group, execute the following:

ceph pg scrub {pg-id}
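
If I understand correctly, for one of the affected PGs from my ceph health detail output that would be something along these lines (the extra query is just my assumption of how to check when it last ran):

Code:
ceph pg scrub 4.fd                        # trigger a light scrub of a single placement group
ceph pg deep-scrub 4.fd                   # or a deep scrub, which also verifies object contents
ceph pg 4.fd query | grep -i scrub_stamp  # check when the PG was last (deep-)scrubbed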

Can you confirm that this process does not cause a service interruption?

There won't be any OSD crashes, or anything related, that could cause downtime on my system?

Thank you very much for the help with this issue,


Regards.

 
Scrubbing is a normal maintenance task that the cluster runs regularly, at least once a week. A healthy Ceph cluster will have no service interruption from it.

Please have a look at "ceph config dump". Maybe there is a hint as to why scrubbing has not run since April.
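
Besides the dump itself, it is worth checking whether scrubbing is blocked by a cluster flag or by the scrub scheduling options, for example:

Code:
ceph osd dump | grep flags           # "noscrub" or "nodeep-scrub" here would explain it
ceph config show osd.0 | grep scrub  # scrub settings in effect on one daemon (osd.0 as an example)
ceph config dump | grep scrub        # any scrub overrides stored in the monitor config database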
 
Hello,

Running ceph config dump returns no results.

Is that a normal thing?


Otherwise, I will run the suggested scrub during off-hours and report the results.

Regards and Thanks
 
Hello,

After running ceph pg scrub 4.fd, which is the first pg-id to be scrubbed, nothing happens.

I left a window of a few days and the scrub still doesn't seem to run.

Attached is the output of ceph pg 4.fd query and ceph -w | grep 4.fd

Thank you for your help.

Regards.


Code:
root@afrodita:~# ceph -w | grep 4.fd
2021-07-27T13:19:38.212332+0200 osd.6 [ERR] 4.fd has 1 objects unfound and apparently lost
2021-07-27T13:20:00.000308+0200 mon.zeus [ERR]     pg 4.fd has 1 unfound objects
2021-07-27T13:20:00.001076+0200 mon.zeus [ERR]     pg 4.fd is active+recovery_unfound+degraded+remapped, acting [6,1,0], 1 unfound
2021-07-27T13:20:00.001513+0200 mon.zeus [ERR]     pg 4.fd is active+recovery_unfound+degraded+remapped, acting [6,1,0], 1 unfound
2021-07-27T13:20:00.001523+0200 mon.zeus [ERR]     pg 4.fd not deep-scrubbed since 2021-04-20T01:24:51.227020+0200
2021-07-27T13:20:00.001950+0200 mon.zeus [ERR]     pg 4.fd not scrubbed since 2021-04-25T04:26:09.606693+0200
2021-07-27T13:20:27.638602+0200 osd.6 [ERR] 4.fd has 1 objects unfound and apparently lost
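
From what I can see in that output, the unfound object is probably what keeps the PG from recovering and scrubbing. In case it is useful, these are the commands I am considering for inspecting it; as far as I understand, the commented-out mark_unfound_lost step is destructive and only a last resort once all OSDs that might hold the object are back up:

Code:
ceph pg 4.fd list_unfound                         # show which objects are unfound and which OSDs were probed
ceph pg 4.fd query | grep -A5 might_have_unfound  # check whether recovery is still waiting on any OSD
# last resort only, reverts or loses the unfound object:
# ceph pg 4.fd mark_unfound_lost revert
# ceph pg 4.fd mark_unfound_lost delete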
 

Attachments

  • ceph_pg_query.txt (36.9 KB)
