Regular errors on Ceph PGs!

gosha

Hi!

I very much regret that I upgraded my cluster from 4.x to 5.2 (via a new install, restoring all VMs and CTs from backup)! The new Bluestore storage is driving me crazy!

I now do not sleep at night, trying to understand why the same disks worked fine with the old Ceph version, while with the new version I regularly get errors. Here is an example:

2018-06-18 07:00:00.000178 mon.cn1 mon.0 192.168.110.1:6789/0 17405 : cluster [INF] overall HEALTH_OK
2018-06-18 08:00:00.000182 mon.cn1 mon.0 192.168.110.1:6789/0 18233 : cluster [INF] overall HEALTH_OK
2018-06-18 08:14:15.029761 osd.21 osd.21 192.168.110.4:6808/2776 120 : cluster [ERR] 4.115 shard 4: soid 4:a895bc88:::rbd_data.1ab922ae8944a.0000000000001177:head candidate had a read error
2018-06-18 08:14:15.029767 osd.21 osd.21 192.168.110.4:6808/2776 121 : cluster [ERR] 4.115 shard 4: soid 4:a8960fca:::rbd_data.b824874b0dc51.000000000000e0aa:head candidate had a read error
2018-06-18 08:15:01.879394 osd.21 osd.21 192.168.110.4:6808/2776 122 : cluster [ERR] 4.115 deep-scrub 0 missing, 2 inconsistent objects
2018-06-18 08:15:01.879400 osd.21 osd.21 192.168.110.4:6808/2776 123 : cluster [ERR] 4.115 deep-scrub 2 errors
2018-06-18 08:15:02.191305 mon.cn1 mon.0 192.168.110.1:6789/0 18451 : cluster [ERR] Health check failed: 2 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 08:15:02.191380 mon.cn1 mon.0 192.168.110.1:6789/0 18452 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 08:31:33.555440 mon.cn1 mon.0 192.168.110.1:6789/0 18652 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 08:32:43.780308 mon.cn1 mon.0 192.168.110.1:6789/0 18667 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 2 scrub errors)
2018-06-18 08:32:43.780438 mon.cn1 mon.0 192.168.110.1:6789/0 18668 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 08:32:43.780509 mon.cn1 mon.0 192.168.110.1:6789/0 18669 : cluster [INF] Cluster is now healthy
2018-06-18 09:00:00.000213 mon.cn1 mon.0 192.168.110.1:6789/0 18978 : cluster [INF] overall HEALTH_OK
2018-06-18 10:00:00.000120 mon.cn1 mon.0 192.168.110.1:6789/0 19695 : cluster [INF] overall HEALTH_OK
2018-06-18 11:00:00.009025 mon.cn1 mon.0 192.168.110.1:6789/0 20380 : cluster [INF] overall HEALTH_OK
2018-06-18 11:19:02.932367 osd.5 osd.5 192.168.110.1:6800/2817 54 : cluster [ERR] 4.269 shard 17: soid 4:96648683:::rbd_data.266182ae8944a.00000000000198cf:head candidate had a read error
2018-06-18 11:19:37.483829 mon.cn1 mon.0 192.168.110.1:6789/0 20621 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 11:19:37.483938 mon.cn1 mon.0 192.168.110.1:6789/0 20622 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 11:19:31.366600 osd.5 osd.5 192.168.110.1:6800/2817 55 : cluster [ERR] 4.269 deep-scrub 0 missing, 1 inconsistent objects
2018-06-18 11:19:31.366606 osd.5 osd.5 192.168.110.1:6800/2817 56 : cluster [ERR] 4.269 deep-scrub 1 errors
2018-06-18 12:00:00.000148 mon.cn1 mon.0 192.168.110.1:6789/0 21061 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 13:00:00.000171 mon.cn1 mon.0 192.168.110.1:6789/0 21741 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 14:00:00.000175 mon.cn1 mon.0 192.168.110.1:6789/0 22452 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 14:04:11.813396 mon.cn1 mon.0 192.168.110.1:6789/0 22503 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 14:05:32.086196 mon.cn1 mon.0 192.168.110.1:6789/0 22520 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2018-06-18 14:05:32.086320 mon.cn1 mon.0 192.168.110.1:6789/0 22521 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 14:05:32.086400 mon.cn1 mon.0 192.168.110.1:6789/0 22522 : cluster [INF] Cluster is now healthy
2018-06-18 15:00:00.000262 mon.cn1 mon.0 192.168.110.1:6789/0 23210 : cluster [INF] overall HEALTH_OK
2018-06-18 16:00:00.000170 mon.cn1 mon.0 192.168.110.1:6789/0 23941 : cluster [INF] overall HEALTH_OK
2018-06-18 16:01:02.707696 osd.4 osd.4 192.168.110.1:6804/3022 44 : cluster [ERR] 4.31c shard 4: soid 4:38c35abc:::rbd_data.1c5032ae8944a.0000000000000583:head candidate had a read error
2018-06-18 16:02:31.311803 mon.cn1 mon.0 192.168.110.1:6789/0 23970 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 16:02:31.311901 mon.cn1 mon.0 192.168.110.1:6789/0 23971 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 16:02:28.067005 osd.4 osd.4 192.168.110.1:6804/3022 45 : cluster [ERR] 4.31c deep-scrub 0 missing, 1 inconsistent objects
2018-06-18 16:02:28.067022 osd.4 osd.4 192.168.110.1:6804/3022 46 : cluster [ERR] 4.31c deep-scrub 1 errors
2018-06-18 17:00:00.000141 mon.cn1 mon.0 192.168.110.1:6789/0 24709 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 17:22:45.323505 mon.cn1 mon.0 192.168.110.1:6789/0 24995 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 17:24:07.061763 mon.cn1 mon.0 192.168.110.1:6789/0 25016 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2018-06-18 17:24:07.061885 mon.cn1 mon.0 192.168.110.1:6789/0 25017 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 17:24:07.061952 mon.cn1 mon.0 192.168.110.1:6789/0 25018 : cluster [INF] Cluster is now healthy
2018-06-18 18:00:00.000181 mon.cn1 mon.0 192.168.110.1:6789/0 25511 : cluster [INF] overall HEALTH_OK

Every time the errors appear, I run a manual repair via "ceph pg repair <num>" on the command line. Sometimes no errors appear for many hours, but then they come back again and again.

Each time the errors occur on different PGs. I am very tired of this struggle with Bluestore...
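As a side note, a rough sketch of how the affected object and OSD shard can be identified before running the repair (PG 4.115 is taken from the log above; the exact output format may vary between Ceph versions):

Code:
# show which PGs are currently flagged as inconsistent
ceph health detail | grep inconsistent

# list the damaged objects and which OSD/shard reported the read error
rados list-inconsistent-obj 4.115 --format=json-pretty

# only then trigger the repair
ceph pg repair 4.115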

proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-2 (running version: 5.2-2/b1d1c7f4)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-12
pve-kernel-4.15.17-2-pve: 4.15.17-10
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-32
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-11
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-28
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Somebody help me please!

Gosha
 
What disks are you using as OSDs? How old are those and are they healthy? Do you have snapshots?

EDIT: Which upgrade procedure did you follow?
 
What disks are you using as OSDs? How old are those and are they healthy? Do you have snapshots?

EDIT: Which upgrade procedure did you follow?

This is a new install via the Debian install method, with all VMs and CTs restored from backups.
All disks are SATA HDDs like this:

pic04.png

I did not make snapshots.

Gosha
 
So far, I periodically run this one-liner to repair them:

ceph pg dump | grep -i incons | cut -f1 -d" " | while read i; do ceph pg repair ${i} ; done
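A variant of the same loop, assuming jq is installed and the pool name is known (<pool> is a placeholder); rados list-inconsistent-pg only returns PGs with recorded scrub inconsistencies:

Code:
rados list-inconsistent-pg <pool> | jq -r '.[]' | while read pg; do ceph pg repair "$pg"; done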
 
You're running the disks on a RAID controller (hence LOGICAL_VOLUME), check your controller as the SMART data might not be accurate. Also you can check if only the same OSDs are involved.
 
You're running the disks on a RAID controller (hence LOGICAL_VOLUME), check your controller as the SMART data might not be accurate. Also you can check if only the same OSDs are involved.

Yes, my servers are equipped with RAID controllers, but they cannot be switched to HBA mode.
I use a single-disk RAID-0 volume for each OSD.

And no, each time it happens on different OSDs, installed in different cluster nodes.
Could this be related to the controller's 1 GB RAID cache? But why did everything work without problems with the old Ceph version (PVE 4.x)?
Do I need to go back to version 4.x? I do not want to, because with the new Ceph version the storage works much faster... :(

Gosha
 
You're running the disks on a RAID controller (hence LOGICAL_VOLUME), check your controller as the SMART data might not be accurate. Also you can check if only the same OSDs are involved.

If I disable the cache on the RAID controller, will that destroy the data on the disks? Will the performance of the Ceph storage drop?
I have never tried to do this.

Added later:
Oops! On all servers the RAID write-cache option is already disabled! Question withdrawn.
 
You're running the disks on a RAID controller (hence LOGICAL_VOLUME), check your controller as the SMART data might not be accurate. Also you can check if only the same OSDs are involved.

I just checked the SMART information on all the RAID controllers. No errors were found. All drives report as healthy. :(
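For what it's worth, a sketch of how per-disk SMART data and the controller state can be queried behind an HP Smart Array controller (assuming smartmontools with cciss passthrough and HP's ssacli utility are available; device paths and slot numbers are examples):

Code:
# SMART data of the first physical drive behind the controller (cciss passthrough)
smartctl -a -d cciss,0 /dev/sg0

# controller status and physical drive details via HP's CLI tool
ssacli ctrl all show status
ssacli ctrl slot=0 physicaldrive all show detail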
 
Check your controller logs, SMART values may not be accurate.

What disks are you using as OSDs?
How old are those and are they healthy?
Do you have snapshots?
You didn't answer the rest of my questions. :)
 
Hi!
Check your controller logs, SMART values may not be accurate.
What disks are you using as OSDs?
How old are those and are they healthy?
Do you have snapshots?
You didn't answer the rest of my questions. :)

1. All servers use HP 1 TB HDDs (on nodes cn1, cn2, cn3) and 2 TB HDDs (on node cn4 only).

OSD.png

2. iLO4 on all servers shows the status of all disks as OK. See the picture for an example:

hdd.png

The 1 TB disks are about 3 years old, the 2 TB disks about 1 year.
The latest errors pointed to OSDs on these 2 TB disks on node cn4.
But previous errors pointed to other 1 TB disks (on other nodes) in the same way.

3. I do not have snapshots.

All the disks worked without problems with the previous Ceph version just a few days ago.

Gosha
 
In general, avoid RAID, use an HBA. Ceph needs to be in control of the disks to perform stably.

All servers use HP 1 TB HDDs (on nodes cn1, cn2, cn3) and 2 TB HDDs (on node cn4 only).
My question was more targeted towards what model they are to get some specifications.

iLO4 on all servers shows the status of all disks as OK. See the picture for an example:
But is that the truth? The controller might also just check SMART values, but what if the disks have a defect not visible with SMART?

The 1 TB disks are about 3 years old, the 2 TB disks about 1 year.
The latest errors pointed to OSDs on these 2 TB disks on node cn4.
But previous errors pointed to other 1 TB disks (on other nodes) in the same way.
If the errors are spread throughout the cluster, then they are less likely to be caused by a particular drive model or age.

I do not have snapshots.
One point less to check. ;)

All the disks worked without problems with the previous Ceph version just a few days ago.
As you said above, new Bluestore; I assume you used Filestore before. Bluestore is able to do full checksumming on each OSD, contrary to Filestore. So it might also be an issue caused by the RAID controller.
BlueStore calculates, stores, and verifies checksums for all data and metadata it stores. Any time data is read off of disk, a checksum is used to verify the data is correct before it is exposed to any other part of the system (or the user).
https://ceph.com/community/new-luminous-bluestore/

To get more information into the logs, you can set all the subsystems to a higher debug level (e.g. 1/5).
http://docs.ceph.com/docs/luminous/...g-and-debug/#subsystem-log-and-debug-settings
This can be done also on the fly.
http://docs.ceph.com/docs/jewel/rados/configuration/ceph-conf/#runtime-changes
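For illustration, a sketch of a persistent variant in ceph.conf (subsystems and levels are only examples; for scrub problems, the osd and bluestore subsystems are probably the most interesting ones; OSDs need a restart to pick up config-file changes):

Code:
# /etc/pve/ceph.conf on Proxmox VE (also reachable as /etc/ceph/ceph.conf) -- example values only
[osd]
debug osd = 1/5
debug bluestore = 1/5
debug ms = 1/5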
 
Ok
In general, avoid RAID, use an HBA. Ceph needs to be in control of the disks to perform stably.
All my servers (three ProLiant DL380 Gen8 and one DL160 Gen8) are equipped with Smart Array P420 controllers.
These controllers do not have the ability to switch to HBA mode. :(

And I just did this:

Code:
ceph tell osd.* injectargs '--debug_ms 1/5';
osd.0: debug_ms=1/5
osd.1: debug_ms=1/5
osd.2: debug_ms=1/5
osd.3: debug_ms=1/5
osd.4: debug_ms=1/5
osd.5: debug_ms=1/5
osd.6: debug_ms=1/5
osd.7: debug_ms=1/5
osd.8: debug_ms=1/5
osd.9: debug_ms=1/5
osd.10: debug_ms=1/5
osd.11: debug_ms=1/5
osd.12: debug_ms=1/5
osd.13: debug_ms=1/5
osd.14: debug_ms=1/5
osd.15: debug_ms=1/5
osd.16: debug_ms=1/5
osd.17: debug_ms=1/5
osd.18: debug_ms=1/5
osd.19: debug_ms=1/5
osd.20: debug_ms=1/5
osd.21: debug_ms=1/5
osd.22: debug_ms=1/5
osd.23: debug_ms=1/5

Gosha
 
...
My question was more targeted towards what model they are to get some specifications.
...

Here are the drive models:

1TB - model MB1000GCWCV Firmware Version HPGH
2TB - model MB2000GFDSH Firmware Version HPG2

Gosha
 
1TB - model MB1000GCWCV Firmware Version HPGH
2TB - model MB2000GFDSH Firmware Version HPG2
There is not much information on these models, but that doesn't really matter as there is a RAID controller in between. HBAs are not expensive.
 
Check this bug in the Ceph tracker:
http://tracker.ceph.com/issues/22464
Could this be your (and my) issue?
Common factors in this bug are that many of the affected systems either have a RAID controller or are low on memory. Though it seems to me that low memory has more of an effect on this than the controller.

One thing to try, as mentioned in the bug report, is to reduce the cache size of the Bluestore OSDs.
http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size
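For illustration, a sketch of what a reduced cache size could look like in ceph.conf (the 512 MiB value is only an example; the Luminous default for HDD OSDs is around 1 GiB):

Code:
[osd]
# shrink the per-OSD Bluestore cache for HDD-backed OSDs -- example value, restart OSDs afterwards
bluestore_cache_size_hdd = 536870912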
 
The biggest issue with a RAID controller masquerading as JBOD is the inconsistent queue depth. Your disks are designed for 128-256, but your "RAID volume" is (probably) set to 512 or more. While the controller does its best to queue and distribute the work, with only one target there really isn't anywhere for the I/O to go, leading to inconsistent write completions. Ceph is very sensitive to this.
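A quick way to see what the logical volume actually advertises (the device name is an example):

Code:
# queue depth advertised by the block device backing the OSD
cat /sys/block/sda/device/queue_depth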

If you really insist on using this hardware, you REALLY REALLY want an SSD for your DB and WAL traffic.
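For reference, a rough sketch of how a Bluestore OSD with its DB (and WAL) on a separate SSD partition could be created with ceph-volume on Luminous (device names are placeholders, not a recommendation for specific devices):

Code:
# example only: OSD data on the HDD, RocksDB/WAL on an SSD partition
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/sdY1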
 
