Hi!
I very much regret upgrading my cluster from 4.x to 5.2 (via a fresh install and restoring all VMs and CTs from backup)! The new BlueStore storage backend is driving me crazy!
I cannot sleep at night trying to understand why the same disks worked fine with the old Ceph version,
while on the new version I keep getting errors again and again. Here is an example:
2018-06-18 07:00:00.000178 mon.cn1 mon.0 192.168.110.1:6789/0 17405 : cluster [INF] overall HEALTH_OK
2018-06-18 08:00:00.000182 mon.cn1 mon.0 192.168.110.1:6789/0 18233 : cluster [INF] overall HEALTH_OK
2018-06-18 08:14:15.029761 osd.21 osd.21 192.168.110.4:6808/2776 120 : cluster [ERR] 4.115 shard 4: soid 4:a895bc88:::rbd_data.1ab922ae8944a.0000000000001177:head candidate had a read error
2018-06-18 08:14:15.029767 osd.21 osd.21 192.168.110.4:6808/2776 121 : cluster [ERR] 4.115 shard 4: soid 4:a8960fca:::rbd_data.b824874b0dc51.000000000000e0aa:head candidate had a read error
2018-06-18 08:15:01.879394 osd.21 osd.21 192.168.110.4:6808/2776 122 : cluster [ERR] 4.115 deep-scrub 0 missing, 2 inconsistent objects
2018-06-18 08:15:01.879400 osd.21 osd.21 192.168.110.4:6808/2776 123 : cluster [ERR] 4.115 deep-scrub 2 errors
2018-06-18 08:15:02.191305 mon.cn1 mon.0 192.168.110.1:6789/0 18451 : cluster [ERR] Health check failed: 2 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 08:15:02.191380 mon.cn1 mon.0 192.168.110.1:6789/0 18452 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 08:31:33.555440 mon.cn1 mon.0 192.168.110.1:6789/0 18652 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 08:32:43.780308 mon.cn1 mon.0 192.168.110.1:6789/0 18667 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 2 scrub errors)
2018-06-18 08:32:43.780438 mon.cn1 mon.0 192.168.110.1:6789/0 18668 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 08:32:43.780509 mon.cn1 mon.0 192.168.110.1:6789/0 18669 : cluster [INF] Cluster is now healthy
2018-06-18 09:00:00.000213 mon.cn1 mon.0 192.168.110.1:6789/0 18978 : cluster [INF] overall HEALTH_OK
2018-06-18 10:00:00.000120 mon.cn1 mon.0 192.168.110.1:6789/0 19695 : cluster [INF] overall HEALTH_OK
2018-06-18 11:00:00.009025 mon.cn1 mon.0 192.168.110.1:6789/0 20380 : cluster [INF] overall HEALTH_OK
2018-06-18 11:19:02.932367 osd.5 osd.5 192.168.110.1:6800/2817 54 : cluster [ERR] 4.269 shard 17: soid 4:96648683:::rbd_data.266182ae8944a.00000000000198cf:head candidate had a read error
2018-06-18 11:19:37.483829 mon.cn1 mon.0 192.168.110.1:6789/0 20621 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 11:19:37.483938 mon.cn1 mon.0 192.168.110.1:6789/0 20622 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 11:19:31.366600 osd.5 osd.5 192.168.110.1:6800/2817 55 : cluster [ERR] 4.269 deep-scrub 0 missing, 1 inconsistent objects
2018-06-18 11:19:31.366606 osd.5 osd.5 192.168.110.1:6800/2817 56 : cluster [ERR] 4.269 deep-scrub 1 errors
2018-06-18 12:00:00.000148 mon.cn1 mon.0 192.168.110.1:6789/0 21061 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 13:00:00.000171 mon.cn1 mon.0 192.168.110.1:6789/0 21741 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 14:00:00.000175 mon.cn1 mon.0 192.168.110.1:6789/0 22452 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 14:04:11.813396 mon.cn1 mon.0 192.168.110.1:6789/0 22503 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 14:05:32.086196 mon.cn1 mon.0 192.168.110.1:6789/0 22520 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2018-06-18 14:05:32.086320 mon.cn1 mon.0 192.168.110.1:6789/0 22521 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 14:05:32.086400 mon.cn1 mon.0 192.168.110.1:6789/0 22522 : cluster [INF] Cluster is now healthy
2018-06-18 15:00:00.000262 mon.cn1 mon.0 192.168.110.1:6789/0 23210 : cluster [INF] overall HEALTH_OK
2018-06-18 16:00:00.000170 mon.cn1 mon.0 192.168.110.1:6789/0 23941 : cluster [INF] overall HEALTH_OK
2018-06-18 16:01:02.707696 osd.4 osd.4 192.168.110.1:6804/3022 44 : cluster [ERR] 4.31c shard 4: soid 4:38c35abc:::rbd_data.1c5032ae8944a.0000000000000583:head candidate had a read error
2018-06-18 16:02:31.311803 mon.cn1 mon.0 192.168.110.1:6789/0 23970 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2018-06-18 16:02:31.311901 mon.cn1 mon.0 192.168.110.1:6789/0 23971 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2018-06-18 16:02:28.067005 osd.4 osd.4 192.168.110.1:6804/3022 45 : cluster [ERR] 4.31c deep-scrub 0 missing, 1 inconsistent objects
2018-06-18 16:02:28.067022 osd.4 osd.4 192.168.110.1:6804/3022 46 : cluster [ERR] 4.31c deep-scrub 1 errors
2018-06-18 17:00:00.000141 mon.cn1 mon.0 192.168.110.1:6789/0 24709 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
2018-06-18 17:22:45.323505 mon.cn1 mon.0 192.168.110.1:6789/0 24995 : cluster [ERR] Health check update: Possible data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
2018-06-18 17:24:07.061763 mon.cn1 mon.0 192.168.110.1:6789/0 25016 : cluster [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 1 scrub errors)
2018-06-18 17:24:07.061885 mon.cn1 mon.0 192.168.110.1:6789/0 25017 : cluster [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent, 1 pg repair)
2018-06-18 17:24:07.061952 mon.cn1 mon.0 192.168.110.1:6789/0 25018 : cluster [INF] Cluster is now healthy
2018-06-18 18:00:00.000181 mon.cn1 mon.0 192.168.110.1:6789/0 25511 : cluster [INF] overall HEALTH_OK
Every time these errors appear, I run a manual repair with "ceph pg repair <num>" on the command line.
Sometimes no errors show up for many hours, but then they come back again and again,
and each time they hit different PGs. I am very tired of this struggle with BlueStore...
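To be concrete, one repair cycle looks roughly like this (just a sketch; <pool> and <pgid> are placeholders, and the two rados "list-inconsistent" commands are only there to locate the damaged PG/object before the actual repair in the last line):

# see which PG the scrub errors belong to
ceph health detail

# optionally, list the inconsistent PGs of a pool and inspect the bad object
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid> --format=json-pretty

# trigger the repair of that PG
ceph pg repair <pgid>

Here are my package versions (pveversion -v):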
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-2 (running version: 5.2-2/b1d1c7f4)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-12
pve-kernel-4.15.17-2-pve: 4.15.17-10
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-32
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-11
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-28
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
Somebody please help me!
Gosha