HEALTH_ERR with BlueStore

hidalgo

For a few days now I have had an issue with a damaged PG.
Code:
# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 1.25 is active+clean+inconsistent, acting [2,3,5]

Unfortunately, this didn't help:
Code:
# ceph pg repair 1.25

What to do?
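
To narrow things down, something like the following should show which object is inconsistent and where the scrub error was logged (a sketch; the PG id 1.25 and the acting OSDs [2,3,5] are taken from the health output above, and the log path assumes the default location):
Code:
# Show which object(s) in the PG failed the deep scrub and on which shard
rados list-inconsistent-obj 1.25 --format=json-pretty

# Look for the scrub error details in the logs of the acting OSDs
grep -H 'ERR' /var/log/ceph/ceph-osd.2.log /var/log/ceph/ceph-osd.3.log /var/log/ceph/ceph-osd.5.log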

BTW:
Code:
proxmox-ve: 5.0-24 (running kernel: 4.10.17-4-pve)
pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-14
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-18
libpve-guest-common-perl: 2.0-12
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-15
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-16
pve-firewall: 3.0-3
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve18~bpo90
ceph: 12.2.1-pve1
 
Greetings!
I am having the same problem after the migration.

Code:
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.24-1-pve: 4.4.24-72
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.59-1-pve: 4.4.59-87
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.1-pve3
 
Hello,

I have the same issue...

- 3-node PVE cluster with identical hardware
- Ceph with 3 OSDs (2 HDD and 1 SSD)
- System installed on a separate SSD (ZFS)
- These servers only have 32 GB of RAM; they use around 70% of it and swap a little, even with swappiness at 0 (the issue seems to be related to RAM and swap)
- Ceph seems to use a lot of memory: the cluster is not heavily loaded (relatively light VMs), yet it uses 70% of the RAM while the total RAM assigned to the VMs (and they don't even use it all!) is less than 30%; see the config sketch after this list
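
If the memory pressure really comes from the BlueStore caches (just an assumption on my part), capping them in /etc/ceph/ceph.conf and restarting the OSDs might help; the sizes below are only illustrative, the Luminous defaults are roughly 1 GiB per HDD OSD and 3 GiB per SSD OSD:
Code:
[osd]
# BlueStore cache per OSD, in bytes
bluestore_cache_size_hdd = 536870912    # 512 MiB for HDD-backed OSDs
bluestore_cache_size_ssd = 1073741824   # 1 GiB for SSD-backed OSDs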

Software:
pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.16-1-pve)
ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) luminous (stable)

The error occurs almost daily and seems completely random: a different OSD each time, and I haven't seen any relation to the hardware...
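
A quick way to check whether a disk itself is reporting problems (a sketch; /dev/sdb is just a placeholder for an OSD device):
Code:
# SMART health summary and the counters that usually matter for bad sectors
smartctl -a /dev/sdb | egrep -i 'overall-health|reallocated|pending|uncorrect'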

Instead of a repair, you should try issuing a new deep scrub on the faulty PG (ceph pg deep-scrub X.XX); for me that solves the problem...
You can also try downgrading the kernel to 4.10 (which does not seem to be affected).
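
For reference, the deep-scrub workflow I mean (using PG 1.25 from the first post as an example):
Code:
# Trigger a fresh deep scrub on the inconsistent PG
ceph pg deep-scrub 1.25

# Watch the cluster log until the scrub finishes, then re-check the health
ceph -w
ceph health detail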

More information here: http://tracker.ceph.com/issues/22464