Ceph down if one node is down

bizzarrone

Good evening,
after some tests I discovered that if 1 of my 4 nodes goes down, all disk I/O gets stuck.
The VMs and CTs stay up, but none of their disks are available for I/O.

I have 3 Ceph monitors.
When I reboot the node, the Ceph log shows:

Code:
2019-01-24 10:28:08.240463 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175770 : cluster [INF] osd.2 marked itself down
2019-01-24 10:28:08.276487 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175771 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2019-01-24 10:28:08.286445 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175773 : cluster [INF] Standby daemon mds.bluehub-prox05 assigned to filesystem cephfs as rank 0
2019-01-24 10:28:09.238925 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175776 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-01-24 10:28:09.238987 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175777 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-01-24 10:28:10.427732 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175784 : cluster [INF] daemon mds.bluehub-prox05 is now active in filesystem cephfs as rank 0
2019-01-24 10:28:11.401683 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175785 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2019-01-24 10:28:11.402667 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175786 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-01-24 10:28:11.402698 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 175787 : cluster [WRN] Health check failed: Degraded data redundancy: 3660/1823339 objects degraded (0.201%), 14 pgs degraded (PG_DEGRADED)
2019-01-24 10:28:24.577475 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8302 : cluster [INF] mon.bluehub-prox03 calling monitor election
2019-01-24 10:28:24.595828 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 1879958 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-01-24 10:28:34.598958 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8303 : cluster [INF] mon.bluehub-prox03 is new leader, mons bluehub-prox03,bluehub-prox05 in quorum (ranks 1,2)
2019-01-24 10:28:34.625022 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8308 : cluster [WRN] Health check failed: 1/3 mons down, quorum bluehub-prox03,bluehub-prox05 (MON_DOWN)
2019-01-24 10:28:34.642025 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8310 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; 22/1823339 objects misplaced (0.001%); Reduced data availability: 1 pg peering; Degraded data redundancy: 67494/1823339 objects degraded (3.702%), 188 pgs degraded; 1/3 mons down, quorum bluehub-prox03,bluehub-prox05
2019-01-24 10:28:34.676528 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8311 : cluster [WRN] Health check update: 22/1823513 objects misplaced (0.001%) (OBJECT_MISPLACED)
2019-01-24 10:28:34.676588 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8312 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823513 objects degraded (3.755%), 199 pgs degraded (PG_DEGRADED)
2019-01-24 10:28:34.676648 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8313 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2019-01-24 10:29:09.215151 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8318 : cluster [WRN] Health check update: 22/1823515 objects misplaced (0.001%) (OBJECT_MISPLACED)
2019-01-24 10:29:09.215200 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8319 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded (PG_DEGRADED)
2019-01-24 10:29:11.285465 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8320 : cluster [WRN] Health check failed: Reduced data availability: 76 pgs inactive (PG_AVAILABILITY)
2019-01-24 10:29:15.441692 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8321 : cluster [WRN] Health check update: Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded, 203 pgs undersized (PG_DEGRADED)
2019-01-24 10:32:24.933885 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 1 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-01-24 10:32:24.938282 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 2 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-01-24 10:32:24.979618 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 3 : cluster [INF] mon.bluehub-prox02 is new leader, mons bluehub-prox02,bluehub-prox03,bluehub-prox05 in quorum (ranks 0,1,2)
2019-01-24 10:32:25.002587 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 4 : cluster [WRN] mon.2 10.9.9.5:6789/0 clock skew 0.491436s > max 0.05s
2019-01-24 10:32:25.002706 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 5 : cluster [WRN] mon.1 10.9.9.3:6789/0 clock skew 0.491068s > max 0.05s
2019-01-24 10:32:25.009771 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 10 : cluster [WRN] Health check failed: clock skew detected on mon.bluehub-prox03, mon.bluehub-prox05 (MON_CLOCK_SKEW)
2019-01-24 10:32:25.009805 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 11 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum bluehub-prox03,bluehub-prox05)
2019-01-24 10:32:25.010647 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 12 : cluster [WRN] message from mon.2 was stamped 0.491892s in the future, clocks not synchronized
2019-01-24 10:32:25.022385 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 13 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; 22/1823515 objects misplaced (0.001%); Reduced data availability: 76 pgs inactive; Degraded data redundancy: 68479/1823515 objects degraded (3.755%), 199 pgs degraded, 203 pgs undersized; clock skew detected on mon.bluehub-prox03, mon.bluehub-prox05
2019-01-24 10:32:25.428905 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 1880004 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-01-24 10:32:25.429032 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 8348 : cluster [INF] mon.bluehub-prox03 calling monitor election
2019-01-24 10:32:29.988287 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 16 : cluster [INF] Manager daemon bluehub-prox05 is unresponsive. No standby daemons available.
2019-01-24 10:32:29.988376 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 17 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

Here is my network configuration:

Code:
cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual
#Production

auto vmbr0
iface vmbr0 inet static
    address 10.169.136.75
    netmask 255.255.255.128
    gateway 10.169.136.1
    bridge_ports eno1
    bridge_stp off
    bridge_fd 0

iface eno2 inet manual

iface enp0s29f0u2 inet manual

iface ens6f0 inet manual

iface ens6f1 inet manual

iface ens2f0 inet manual

iface ens2f1 inet manual

auto vlan1050
iface vlan1050 inet static
        vlan_raw_device ens2f0
        address  10.9.9.1
        netmask  255.255.255.0
        network  10.9.9.0
#Ceph

auto vlan1048
iface vlan1048 inet static
    vlan_raw_device ens2f0
        address  10.1.1.1
        netmask  255.255.255.0
    network  10.1.1.0
#Cluster

Here is the cluster status:

Code:
Quorum information
------------------
Date:             Wed Jan 23 17:09:11 2019
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/4580
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.1.1 (local)
0x00000002          1 10.1.1.2
0x00000003          1 10.1.1.3
0x00000004          1 10.1.1.5

My package versions:

Code:
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

Is it possible that the problem could be the clock skew?

Code:
cluster [WRN] mon.2 10.9.9.5:6789/0 clock skew 0.491436s > max 0.05s
 
after some tests I discovered that if 1 of my 4 nodes goes down, all disk I/O gets stuck.
The VMs and CTs stay up, but none of their disks are available for I/O.
The general ceph.log doesn't show this; check your OSD logs to see more.

Is it possible that the problem could be the clock skew?
One possibility. All MONs need to provide the same updated maps to clients, OSDs and MDS. Use one local time server (in hardware) to sync the time from. This way you can make sure that all nodes in the cluster have the same time.
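As a quick check of whether this is an issue, the monitors themselves can report their mutual clock skew (standard Ceph/systemd commands, nothing specific to this setup):

Code:
# ask the monitors how far apart their clocks are
ceph time-sync-status

# on each node, check which service keeps the clock and whether it is synchronized
timedatectl status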
 
The general ceph.log doesn't show this; check your OSD logs to see more.


One possibility. All MONs need to provide the same updated maps to clients, OSDs and MDS. Use one local time server (in hardware) to sync the time from. This way you can make sure that all nodes in the cluster have the same time.

Thank you Alwin,
I am using timesyncd instead of ntpd, with an internal NTP server for the datacentre.
Could switching to ntpd be a solution?
 
Could switching to ntpd be a solution?
If you don't mind me asking, how do you conclude that?

A node reboot usually causes some clock skew, which should subside shortly after the boot has finished (time synced).
 
It greatly depends on how the time is synced: timesyncd uses a single source to get its time, while ntpd (if not configured otherwise) uses three time servers and calculates a median time from those three. Relying on a single source can lead to time drifts and cause all sorts of unwanted behavior in a cluster.

The time source should be local to the network, so that any jitter or sudden time drifts can be compensated. Further, the time server should have a constant clock cycle available; with virtual machines the clock cycle may change (sudden drifts), depending on how much time the VM gets on the physical CPU. But in general it doesn't matter whether the time is absolutely correct, as long as every service in the cluster has the same time.
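A minimal sketch of that setup, assuming ntpd on Debian/Proxmox and a hypothetical local time server at 10.9.9.254 (replace with your own):

Code:
# install ntpd and stop systemd-timesyncd so only one service adjusts the clock
apt install ntp
systemctl disable --now systemd-timesyncd

# /etc/ntp.conf - point every node at the same local time server (address is an example)
server 10.9.9.254 iburst prefer

# verify that ntpd sees the server and that the offset is small
ntpq -p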
 
Good morning,
today I performed a new test.

Code:
2019-02-06 06:26:18.797253 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33445 : cluster [WRN] Health check update: 35/1833259 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:27:20.872436 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33449 : cluster [WRN] Health check update: 35/1833261 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:28:17.427051 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33453 : cluster [WRN] Health check update: 35/1833263 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:30:18.621915 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33457 : cluster [WRN] Health check update: 35/1833265 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:31:21.253005 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33458 : cluster [WRN] Health check update: 35/1833267 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:32:21.845412 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33460 : cluster [WRN] Health check update: 35/1833269 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:34:21.322243 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33463 : cluster [WRN] Health check update: 35/1833271 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:35:23.620718 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33466 : cluster [WRN] Health check update: 35/1833273 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:37:20.785147 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33470 : cluster [WRN] Health check update: 35/1833275 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:38:21.415102 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33471 : cluster [WRN] Health check update: 35/1833277 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:39:37.036114 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33473 : cluster [WRN] Health check update: 35/1833279 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:40:25.321784 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33474 : cluster [WRN] Health check update: 35/1833281 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:41:23.161330 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33478 : cluster [WRN] Health check update: 35/1833283 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:43:27.257976 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33482 : cluster [WRN] Health check update: 35/1833285 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:44:27.198086 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33483 : cluster [WRN] Health check update: 35/1833287 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:46:26.073083 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33487 : cluster [WRN] Health check update: 35/1833289 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:47:24.752603 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33488 : cluster [WRN] Health check update: 35/1833293 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:49:30.373143 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33495 : cluster [WRN] Health check update: 35/1833295 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:50:28.433518 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33498 : cluster [WRN] Health check update: 35/1833297 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:52:28.460197 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33503 : cluster [WRN] Health check update: 35/1833299 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:54:30.882084 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33507 : cluster [WRN] Health check update: 35/1833301 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:55:29.398258 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33510 : cluster [WRN] Health check update: 35/1833303 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:57:30.592710 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33513 : cluster [WRN] Health check update: 35/1833305 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 06:58:31.217289 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33516 : cluster [WRN] Health check update: 35/1833307 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:00:00.000187 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33518 : cluster [WRN] overall HEALTH_WARN 35/1833307 objects misplaced (0.002%)
2019-02-06 07:00:34.739248 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33519 : cluster [WRN] Health check update: 35/1833309 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:02:35.913121 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33521 : cluster [WRN] Health check update: 35/1833311 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:04:36.900236 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33523 : cluster [WRN] Health check update: 35/1833313 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:05:37.378355 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33524 : cluster [WRN] Health check update: 35/1833315 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:07:37.194719 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33528 : cluster [WRN] Health check update: 35/1833317 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:09:39.764845 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33532 : cluster [WRN] Health check update: 35/1833319 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:10:40.361387 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33536 : cluster [WRN] Health check update: 35/1833321 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:12:39.741212 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33539 : cluster [WRN] Health check update: 35/1833323 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:14:43.336322 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33542 : cluster [WRN] Health check update: 35/1833325 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:16:40.317086 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33545 : cluster [WRN] Health check update: 35/1833327 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:18:43.048724 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33547 : cluster [WRN] Health check update: 35/1833329 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:20:42.209475 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33553 : cluster [WRN] Health check update: 35/1833331 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:22:43.397293 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33557 : cluster [WRN] Health check update: 35/1833333 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:24:44.582609 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33559 : cluster [WRN] Health check update: 35/1833335 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:26:49.866160 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33562 : cluster [WRN] Health check update: 35/1833337 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:28:47.078988 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33565 : cluster [WRN] Health check update: 35/1833339 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:29:07.156983 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33569 : cluster [WRN] Health check update: 35/1833341 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:29:13.226464 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33570 : cluster [WRN] Health check update: 35/1833345 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:29:18.888455 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33571 : cluster [WRN] Health check update: 35/1833353 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:30:50.749371 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33578 : cluster [WRN] Health check update: 35/1833355 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:31:50.794024 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33581 : cluster [WRN] Health check update: 35/1833357 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:34:52.500845 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33589 : cluster [WRN] Health check update: 35/1833359 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:36:50.361971 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33593 : cluster [WRN] Health check update: 35/1833361 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:38:52.828329 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33599 : cluster [WRN] Health check update: 35/1833363 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:39:17.085303 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33601 : cluster [WRN] Health check update: 35/1833365 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:40:54.865029 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33607 : cluster [WRN] Health check update: 35/1833367 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:42:57.630020 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33611 : cluster [WRN] Health check update: 35/1833369 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:45:54.971033 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33619 : cluster [WRN] Health check update: 35/1833371 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:47:56.060799 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33625 : cluster [WRN] Health check update: 35/1833373 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:49:59.289441 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33629 : cluster [WRN] Health check update: 35/1833375 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:52:59.049132 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33634 : cluster [WRN] Health check update: 35/1833377 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:55:00.243054 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33640 : cluster [WRN] Health check update: 35/1833379 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 07:57:01.377008 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33644 : cluster [WRN] Health check update: 35/1833381 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:00:00.000204 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33650 : cluster [WRN] overall HEALTH_WARN 35/1833381 objects misplaced (0.002%)
2019-02-06 08:00:01.356317 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33651 : cluster [WRN] Health check update: 35/1833383 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:02:06.369235 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33655 : cluster [WRN] Health check update: 35/1833385 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:05:06.139465 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33658 : cluster [WRN] Health check update: 35/1833387 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:07:07.448718 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33661 : cluster [WRN] Health check update: 35/1833389 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:10:09.124160 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33667 : cluster [WRN] Health check update: 35/1833391 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:12:10.261279 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33668 : cluster [WRN] Health check update: 35/1833393 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:13:53.294134 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33671 : cluster [WRN] Health check update: 35/1833395 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:15:10.238631 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33672 : cluster [WRN] Health check update: 35/1833397 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:17:15.293252 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33675 : cluster [WRN] Health check update: 35/1833399 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:20:15.014979 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33679 : cluster [WRN] Health check update: 35/1833401 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:23:14.811227 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33684 : cluster [WRN] Health check update: 35/1833403 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:25:13.965800 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33687 : cluster [WRN] Health check update: 35/1833405 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:28:19.830326 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33691 : cluster [WRN] Health check update: 35/1833407 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:30:18.973142 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33692 : cluster [WRN] Health check update: 35/1833409 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:32:16.775147 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33695 : cluster [WRN] Health check update: 35/1833411 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:35:19.984680 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33698 : cluster [WRN] Health check update: 35/1833413 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:38:21.660892 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33700 : cluster [WRN] Health check update: 35/1833415 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:40:27.156919 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33701 : cluster [WRN] Health check update: 35/1833417 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:43:25.057301 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33704 : cluster [WRN] Health check update: 35/1833419 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:45:28.826842 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33706 : cluster [WRN] Health check update: 35/1833421 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:48:29.932346 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33709 : cluster [WRN] Health check update: 35/1833423 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:51:27.369428 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33713 : cluster [WRN] Health check update: 35/1833425 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:53:32.569131 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33715 : cluster [WRN] Health check update: 35/1833427 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:56:34.831187 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33719 : cluster [WRN] Health check update: 35/1833429 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 08:59:32.060853 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33723 : cluster [WRN] Health check update: 35/1833431 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:00:00.000214 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33724 : cluster [WRN] overall HEALTH_WARN 35/1833431 objects misplaced (0.002%)
2019-02-06 09:01:40.054505 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33727 : cluster [WRN] Health check update: 35/1833433 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:02:36.209687 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33728 : cluster [WRN] Health check update: 35/1833435 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:04:37.404897 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33730 : cluster [WRN] Health check update: 35/1833437 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:06:40.211382 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33732 : cluster [WRN] Health check update: 35/1833439 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:07:34.761057 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33734 : cluster [WRN] Health check update: 35/1833441 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:10:38.480632 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33735 : cluster [WRN] Health check update: 35/1833443 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:13:38.228424 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33738 : cluster [WRN] Health check update: 35/1833445 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:16:42.029118 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33743 : cluster [WRN] Health check update: 35/1833447 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:19:41.784732 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33748 : cluster [WRN] Health check update: 35/1833449 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:22:46.536950 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33753 : cluster [WRN] Health check update: 35/1833451 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:25:45.325660 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33755 : cluster [WRN] Health check update: 35/1833453 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:28:45.121326 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33761 : cluster [WRN] Health check update: 35/1833455 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:31:50.877648 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33763 : cluster [WRN] Health check update: 35/1833457 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:34:48.704717 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33764 : cluster [WRN] Health check update: 35/1833459 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:37:50.317191 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33771 : cluster [WRN] Health check update: 35/1833461 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:40:52.065172 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33772 : cluster [WRN] Health check update: 35/1833463 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:41:32.420961 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33774 : cluster [WRN] Health check update: 35/1833465 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:44:57.178459 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33777 : cluster [WRN] Health check update: 35/1833467 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:47:54.228856 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33778 : cluster [WRN] Health check update: 35/1833469 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:48:50.805670 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33779 : cluster [WRN] Health check update: 35/1833471 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:50:56.082862 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33781 : cluster [WRN] Health check update: 35/1833473 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:53:59.897970 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33786 : cluster [WRN] Health check update: 35/1833475 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:55:02.449021 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33788 : cluster [WRN] Health check update: 35/1833477 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 09:57:03.668592 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33789 : cluster [WRN] Health check update: 35/1833479 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:00:00.000197 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33791 : cluster [WRN] overall HEALTH_WARN 35/1833479 objects misplaced (0.002%)
2019-02-06 10:01:02.052907 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33792 : cluster [WRN] Health check update: 35/1833481 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:04:01.821176 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33796 : cluster [WRN] Health check update: 35/1833483 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:07:03.624856 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33798 : cluster [WRN] Health check update: 35/1833485 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:10:08.021735 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33802 : cluster [WRN] Health check update: 35/1833487 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:14:08.713815 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33804 : cluster [WRN] Health check update: 35/1833489 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:17:11.681387 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33806 : cluster [WRN] Health check update: 35/1833491 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:20:10.316956 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33808 : cluster [WRN] Health check update: 35/1833493 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:23:11.213585 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33811 : cluster [WRN] Health check update: 35/1833495 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:26:15.076549 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33813 : cluster [WRN] Health check update: 35/1833497 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:30:18.186679 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33850 : cluster [WRN] Health check update: 35/1833499 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:33:17.594437 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33852 : cluster [WRN] Health check update: 35/1833501 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:36:19.735120 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33857 : cluster [WRN] Health check update: 35/1833503 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:39:20.829395 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33861 : cluster [WRN] Health check update: 35/1833505 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:42:20.599428 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33864 : cluster [WRN] Health check update: 35/1833507 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:46:25.017395 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33869 : cluster [WRN] Health check update: 35/1833509 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:49:22.805194 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33873 : cluster [WRN] Health check update: 35/1833511 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:52:26.627440 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33878 : cluster [WRN] Health check update: 35/1833513 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:55:19.641528 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33889 : cluster [WRN] Health check update: 35/1833515 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:55:29.626747 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33891 : cluster [WRN] Health check update: 35/1833517 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:55:52.633074 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33895 : cluster [WRN] Health check update: 35/1833519 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:56:00.012227 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33896 : cluster [WRN] Health check update: 35/1833525 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 10:58:32.241286 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33899 : cluster [WRN] Health check update: 35/1833527 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:00:00.000208 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33900 : cluster [WRN] overall HEALTH_WARN 35/1833527 objects misplaced (0.002%)
2019-02-06 11:01:30.036257 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33902 : cluster [WRN] Health check update: 35/1833529 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:04:36.020320 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33906 : cluster [WRN] Health check update: 35/1833531 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:07:36.016958 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33911 : cluster [WRN] Health check update: 35/1833533 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:09:31.552936 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33918 : cluster [WRN] Health check failed: noout flag(s) set (OSDMAP_FLAGS)
2019-02-06 11:09:46.930486 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33922 : cluster [INF] osd.10 marked itself down
2019-02-06 11:09:46.930590 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33923 : cluster [INF] osd.7 marked itself down
2019-02-06 11:09:46.930742 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33924 : cluster [INF] osd.12 marked itself down
2019-02-06 11:09:46.930826 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33925 : cluster [INF] osd.15 marked itself down
2019-02-06 11:09:46.931027 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33926 : cluster [INF] osd.14 marked itself down
2019-02-06 11:09:46.931110 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33927 : cluster [INF] osd.4 marked itself down
2019-02-06 11:09:46.931245 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33928 : cluster [INF] osd.13 marked itself down
2019-02-06 11:09:46.995264 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33929 : cluster [WRN] Health check failed: 7 osds down (OSD_DOWN)
2019-02-06 11:09:46.995311 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33930 : cluster [WRN] Health check failed: 1 host (7 osds) down (OSD_HOST_DOWN)
2019-02-06 11:09:47.473587 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33932 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2019-02-06 11:09:47.480729 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33934 : cluster [INF] Standby daemon mds.bluehub-prox01 assigned to filesystem cephfs as rank 0
2019-02-06 11:09:49.601441 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33940 : cluster [WRN] Health check failed: Reduced data availability: 15 pgs peering (PG_AVAILABILITY)
2019-02-06 11:09:49.601482 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33941 : cluster [WRN] Health check failed: Degraded data redundancy: 40846/1833533 objects degraded (2.228%), 110 pgs degraded (PG_DEGRADED)
2019-02-06 11:09:50.061219 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33944 : cluster [INF] daemon mds.bluehub-prox01 is now active in filesystem cephfs as rank 0
2019-02-06 11:09:51.035263 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33945 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2019-02-06 11:09:52.989286 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33948 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 15 pgs peering)
2019-02-06 11:09:55.084944 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33949 : cluster [WRN] Health check update: Degraded data redundancy: 297679/1833533 objects degraded (16.235%), 746 pgs degraded (PG_DEGRADED)
2019-02-06 11:10:50.312374 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 280937 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-02-06 11:10:50.345058 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33962 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-02-06 11:10:55.348050 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33963 : cluster [INF] mon.bluehub-prox02 is new leader, mons bluehub-prox02,bluehub-prox05 in quorum (ranks 0,2)
2019-02-06 11:10:55.392042 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33968 : cluster [WRN] Health check failed: 1/3 mons down, quorum bluehub-prox02,bluehub-prox05 (MON_DOWN)
2019-02-06 11:10:55.415578 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33970 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 7 osds down; 1 host (7 osds) down; 35/1833533 objects misplaced (0.002%); Degraded data redundancy: 297679/1833533 objects degraded (16.235%), 746 pgs degraded; 1/3 mons down, quorum bluehub-prox02,bluehub-prox05
2019-02-06 11:10:56.387126 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33971 : cluster [WRN] Health check failed: Reduced data availability: 329 pgs inactive (PG_AVAILABILITY)
2019-02-06 11:10:56.387177 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33972 : cluster [WRN] Health check update: Degraded data redundancy: 297679/1833533 objects degraded (16.235%), 746 pgs degraded, 760 pgs undersized (PG_DEGRADED)
2019-02-06 11:14:49.845004 mon.bluehub-prox05 mon.2 10.9.9.5:6789/0 281004 : cluster [INF] mon.bluehub-prox05 calling monitor election
2019-02-06 11:14:49.845775 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34023 : cluster [INF] mon.bluehub-prox02 calling monitor election
2019-02-06 11:14:49.880371 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34024 : cluster [INF] mon.bluehub-prox02 is new leader, mons bluehub-prox02,bluehub-prox03,bluehub-prox05 in quorum (ranks 0,1,2)
2019-02-06 11:14:49.893051 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34025 : cluster [WRN] mon.1 10.9.9.3:6789/0 clock skew 0.102049s > max 0.05s
2019-02-06 11:14:49.901279 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34030 : cluster [WRN] Health check failed: clock skew detected on mon.bluehub-prox03 (MON_CLOCK_SKEW)
2019-02-06 11:14:49.901311 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34031 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum bluehub-prox02,bluehub-prox05)
2019-02-06 11:14:49.902103 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34032 : cluster [WRN] message from mon.1 was stamped 0.112893s in the future, clocks not synchronized
2019-02-06 11:14:49.924378 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34033 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 7 osds down; 1 host (7 osds) down; 35/1833533 objects misplaced (0.002%); Reduced data availability: 329 pgs inactive; Degraded data redundancy: 297679/1833533 objects degraded (16.235%), 746 pgs degraded, 760 pgs undersized; clock skew detected on mon.bluehub-prox03
2019-02-06 11:14:49.958161 mon.bluehub-prox03 mon.1 10.9.9.3:6789/0 1 : cluster [INF] mon.bluehub-prox03 calling monitor election
2019-02-06 11:15:10.023422 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34043 : cluster [WRN] Health check update: 5 osds down (OSD_DOWN)
2019-02-06 11:15:10.023463 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34044 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (7 osds) down)
2019-02-06 11:15:10.071324 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34045 : cluster [INF] osd.14 10.9.9.3:6817/3334 boot
2019-02-06 11:15:10.071391 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34046 : cluster [INF] osd.7 10.9.9.3:6805/2894 boot
2019-02-06 11:15:11.069813 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34048 : cluster [WRN] Health check update: Reduced data availability: 329 pgs inactive, 37 pgs peering (PG_AVAILABILITY)
2019-02-06 11:15:11.069881 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34049 : cluster [WRN] Health check update: Degraded data redundancy: 284005/1833533 objects degraded (15.489%), 711 pgs degraded, 723 pgs undersized (PG_DEGRADED)
2019-02-06 11:15:14.081810 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34055 : cluster [INF] osd.12 10.9.9.3:6809/3013 boot
2019-02-06 11:15:14.081880 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34056 : cluster [INF] osd.15 10.9.9.3:6813/3139 boot
2019-02-06 11:15:15.085936 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34060 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
2019-02-06 11:15:15.101614 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34061 : cluster [INF] osd.4 10.9.9.3:6801/2754 boot
2019-02-06 11:15:17.149725 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34064 : cluster [WRN] Health check update: Reduced data availability: 236 pgs inactive (PG_AVAILABILITY)
2019-02-06 11:15:17.149782 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34065 : cluster [WRN] Health check update: Degraded data redundancy: 194512/1833533 objects degraded (10.609%), 453 pgs degraded, 456 pgs undersized (PG_DEGRADED)
2019-02-06 11:15:18.206357 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34068 : cluster [INF] osd.10 10.9.9.3:6825/3925 boot
2019-02-06 11:15:19.210586 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34072 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-06 11:15:19.237948 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34073 : cluster [INF] osd.13 10.9.9.3:6821/3524 boot
2019-02-06 11:15:20.124716 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34075 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.bluehub-prox03)
2019-02-06 11:15:22.653532 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34077 : cluster [WRN] Health check update: Reduced data availability: 50 pgs inactive (PG_AVAILABILITY)
2019-02-06 11:15:22.653573 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34078 : cluster [WRN] Health check update: Degraded data redundancy: 31162/1833533 objects degraded (1.700%), 66 pgs degraded, 68 pgs undersized (PG_DEGRADED)
2019-02-06 11:15:24.730382 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34079 : cluster [WRN] Health check update: 35/1833535 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:15:24.730425 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34080 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 31162/1833533 objects degraded (1.700%), 66 pgs degraded, 68 pgs undersized)
2019-02-06 11:15:26.754555 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34081 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 6 pgs inactive)
2019-02-06 11:15:30.126285 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34082 : cluster [WRN] Health check update: 35/1833537 objects misplaced (0.002%) (OBJECT_MISPLACED)
2019-02-06 11:16:25.276829 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34084 : cluster [WRN] Health check update: 35/1833539 objects misplaced (0.002%) (OBJECT_MISPLACED)


Nothing important in the OSD logs:

Code:
root@xxxx-prox02:~# tail /var/log/ceph/ceph-osd.*.log -f
==> /var/log/ceph/ceph-osd.2.log <==
2019-02-06 11:15:18.234277 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[1.1f9( v 5388'153738 (5387'152162,5388'153738] local-lis/les=5245/5246 n=882 ec=58/58 lis/c 5245/5245 les/c/f 5246/5246/3020 5396/5396/5396) [10,2] r=1 lpr=5396 pi=[5245,5396)/1 luod=0'0 crt=5388'153738 lcod 5388'153737 peered mbc={}] start_peering_interval up [2] -> [10,2], acting [2] -> [10,2], acting_primary 2 -> 10, up_primary 2 -> 10, role 0 -> 1, features acting 4611087853745930235 upacting 4611087853745930235
2019-02-06 11:15:18.234366 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[1.1f9( v 5388'153738 (5387'152162,5388'153738] local-lis/les=5245/5246 n=882 ec=58/58 lis/c 5245/5245 les/c/f 5246/5246/3020 5396/5396/5396) [10,2] r=1 lpr=5396 pi=[5245,5396)/1 crt=5388'153738 lcod 5388'153737 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2019-02-06 11:15:18.234919 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[1.25f( v 5387'150896 (5387'149373,5387'150896] local-lis/les=5245/5246 n=914 ec=58/58 lis/c 5245/5245 les/c/f 5246/5246/3020 5396/5396/5396) [10,2] r=1 lpr=5396 pi=[5245,5396)/1 luod=0'0 crt=5387'150896 lcod 5387'150895 peered mbc={}] start_peering_interval up [2] -> [10,2], acting [2] -> [10,2], acting_primary 2 -> 10, up_primary 2 -> 10, role 0 -> 1, features acting 4611087853745930235 upacting 4611087853745930235
2019-02-06 11:15:18.235101 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[1.25f( v 5387'150896 (5387'149373,5387'150896] local-lis/les=5245/5246 n=914 ec=58/58 lis/c 5245/5245 les/c/f 5246/5246/3020 5396/5396/5396) [10,2] r=1 lpr=5396 pi=[5245,5396)/1 crt=5387'150896 lcod 5387'150895 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2019-02-06 11:15:18.235379 7fdba51ca700  1 osd.2 pg_epoch: 5396 pg[5.62( v 5324'4299 (4450'2700,5324'4299] local-lis/les=5389/5390 n=13 ec=2493/2493 lis/c 5389/5258 les/c/f 5390/5259/3020 5396/5396/5396) [10,17,2] r=2 lpr=5396 pi=[5258,5396)/1 luod=0'0 crt=5324'4299 lcod 5324'4298 active mbc={}] start_peering_interval up [17,2] -> [10,17,2], acting [17,2] -> [10,17,2], acting_primary 17 -> 10, up_primary 17 -> 10, role 1 -> 2, features acting 4611087853745930235 upacting 4611087853745930235
2019-02-06 11:15:18.235452 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[5.193( v 5344'4326 (4603'2800,5344'4326] local-lis/les=5389/5390 n=15 ec=4632/2493 lis/c 5389/5265 les/c/f 5390/5267/3020 5396/5396/5396) [10,19,2] r=2 lpr=5396 pi=[5265,5396)/1 luod=0'0 crt=5344'4326 lcod 5344'4325 active mbc={}] start_peering_interval up [19,2] -> [10,19,2], acting [19,2] -> [10,19,2], acting_primary 19 -> 10, up_primary 19 -> 10, role 1 -> 2, features acting 4611087853745930235 upacting 4611087853745930235
2019-02-06 11:15:18.235459 7fdba51ca700  1 osd.2 pg_epoch: 5396 pg[5.62( v 5324'4299 (4450'2700,5324'4299] local-lis/les=5389/5390 n=13 ec=2493/2493 lis/c 5389/5258 les/c/f 5390/5259/3020 5396/5396/5396) [10,17,2] r=2 lpr=5396 pi=[5258,5396)/1 crt=5324'4299 lcod 5324'4298 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2019-02-06 11:15:18.235524 7fdba49c9700  1 osd.2 pg_epoch: 5396 pg[5.193( v 5344'4326 (4603'2800,5344'4326] local-lis/les=5389/5390 n=15 ec=4632/2493 lis/c 5389/5265 les/c/f 5390/5267/3020 5396/5396/5396) [10,19,2] r=2 lpr=5396 pi=[5265,5396)/1 crt=5344'4326 lcod 5344'4325 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2019-02-06 11:15:19.243898 7fdba51ca700  1 osd.2 pg_epoch: 5397 pg[5.1b6( v 5324'4202 (4427'2700,5324'4202] local-lis/les=5389/5390 n=14 ec=4632/2493 lis/c 5389/5265 les/c/f 5390/5266/3020 5397/5397/5265) [19,13,2] r=2 lpr=5397 pi=[5265,5397)/1 luod=0'0 crt=5324'4202 lcod 5324'4201 active mbc={}] start_peering_interval up [19,2] -> [19,13,2], acting [19,2] -> [19,13,2], acting_primary 19 -> 19, up_primary 19 -> 19, role 1 -> 2, features acting 4611087853745930235 upacting 4611087853745930235
2019-02-06 11:15:19.244098 7fdba51ca700  1 osd.2 pg_epoch: 5397 pg[5.1b6( v 5324'4202 (4427'2700,5324'4202] local-lis/les=5389/5390 n=14 ec=4632/2493 lis/c 5389/5265 les/c/f 5390/5266/3020 5397/5397/5265) [19,13,2] r=2 lpr=5397 pi=[5265,5397)/1 crt=5324'4202 lcod 5324'4201 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray

==> /var/log/ceph/ceph-osd.6.log <==

==> /var/log/ceph/ceph-osd.8.log <==

But again, the disk I/O was blocked.
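When ceph.log and the OSD logs look clean but client I/O still hangs, it can help to ask the OSDs directly what they are working on (a sketch; osd.2 is just taken from the log above):

Code:
# lists slow/blocked requests and the OSDs involved
ceph health detail

# query one OSD over its admin socket for the operations currently in flight
ceph daemon osd.2 dump_ops_in_flight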
 
2019-02-06 11:15:20.124716 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 34075 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.bluehub-prox03)
Clock skew was cleared. Was this your question?
 
I masked systemd-timesyncd and installed ntpd.
No skew detected, but it is the same issue anyway:
no I/O, everything is blocked.
Nothing useful in the OSD logs either. I think I will roll back to Proxmox 4, or switch from Ceph to another shared storage.
 
I/O is blocked because of:
Code:
2019-02-06 11:10:56.387126 mon.bluehub-prox02 mon.0 10.9.9.2:6789/0 33971 : cluster [WRN] Health check failed: Reduced data availability: 329 pgs inactive (PG_AVAILABILITY)
Please show your crush map.
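For reference, the crush map can be exported in readable form with the standard tools, and the inactive PGs listed alongside it:

Code:
# dump the compiled crush map and decompile it to text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# list the PGs that are currently stuck inactive
ceph pg dump_stuck inactive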
 
Good morning Jarek,
thank you for your advice.
Here it is:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 7 osd.7 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host bluehub-prox01 {
   id -3       # do not change unnecessarily
   id -4 class hdd       # do not change unnecessarily
   # weight 4.905
   alg straw2
   hash 0   # rjenkins1
   item osd.0 weight 0.817
   item osd.1 weight 0.817
   item osd.3 weight 0.817
   item osd.5 weight 0.817
   item osd.9 weight 0.817
   item osd.11 weight 0.817
}
host bluehub-prox02 {
   id -5       # do not change unnecessarily
   id -6 class hdd       # do not change unnecessarily
   # weight 0.455
   alg straw2
   hash 0   # rjenkins1
   item osd.2 weight 0.455
}
host bluehub-prox03 {
   id -7       # do not change unnecessarily
   id -8 class hdd       # do not change unnecessarily
   # weight 1.909
   alg straw2
   hash 0   # rjenkins1
   item osd.4 weight 0.273
   item osd.7 weight 0.273
   item osd.10 weight 0.273
   item osd.12 weight 0.273
   item osd.13 weight 0.273
   item osd.14 weight 0.273
   item osd.15 weight 0.273
}
host bluehub-prox05 {
   id -9       # do not change unnecessarily
   id -10 class hdd       # do not change unnecessarily
   # weight 10.915
   alg straw2
   hash 0   # rjenkins1
   item osd.16 weight 1.091
   item osd.17 weight 1.091
   item osd.18 weight 1.091
   item osd.19 weight 1.091
   item osd.20 weight 1.091
   item osd.21 weight 1.091
   item osd.22 weight 1.091
   item osd.23 weight 1.091
   item osd.24 weight 1.091
   item osd.25 weight 1.091
}
root default {
   id -1       # do not change unnecessarily
   id -2 class hdd       # do not change unnecessarily
   # weight 18.183
   alg straw2
   hash 0   # rjenkins1
   item bluehub-prox01 weight 4.905
   item bluehub-prox02 weight 0.455
   item bluehub-prox03 weight 1.909
   item bluehub-prox05 weight 10.915
}

# rules
rule replicated_rule {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type host
   step emit
}

# end crush map
 
Good morning Jarek,
...
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 18.183
alg straw2
hash 0 # rjenkins1
item bluehub-prox01 weight 4.905
item bluehub-prox02 weight 0.455
item bluehub-prox03 weight 1.909
item bluehub-prox05 weight 10.915
}

You realize that you have no more than 1.9 (TB, presumably) of data that can be written according to your rules... You have 4 nodes, but they don't have any space for PGs; your disks are effectively wasted on much of your cluster.
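To see how those weights translate into usable space per host, the standard capacity views help (the output will reflect the crush map posted above):

Code:
# capacity and utilisation per host and per OSD, following the crush tree
ceph osd df tree

# pool-level view of stored data versus available space
ceph df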
 
You realize that you have no more than 1.9 (TB, presumably) of data that can be written according to your rules... You have 4 nodes, but they don't have any space for PGs; your disks are effectively wasted on much of your cluster.

Thank you Alex,
How could I fix the situation?
 

Attachments: upload_2019-2-8_10-28-18.png (59.3 KB)
As a general rule, you want your OSD nodes to be as similar as possible, and you want them to end up having the same weight as each other. Since I don't know what drives / how many you have per node, I can't give you more specific advice.

Also, you really want to change your rules. Having a minimum of 1 leaves you with the potential to have PGs with no redundancy; this is dangerous and may result in data loss. Also, there is no reason to have a maximum of 10 in a replicated pool; it should be 3.
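The current replication settings can be verified per pool like this (&lt;pool&gt; is a placeholder for your pool name):

Code:
# show size and min_size of a single pool
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size

# or list all pools with their full settings and inspect the crush rule
ceph osd pool ls detail
ceph osd crush rule dump replicated_rule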
 
Good morning,
I added a new powerful node full of disks and removed the old one that had only a few OSDs.
Nothing changed.
As soon as I stop 1 OSD, the Ceph pool freezes.

The log:
Code:
2019-02-20 09:00:00.000189 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82642 : cluster [INF] overall HEALTH_OK
2019-02-20 09:12:05.253331 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82682 : cluster [INF] osd.15 marked itself down
2019-02-20 09:12:05.304373 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82683 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-02-20 09:12:08.535372 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82687 : cluster [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2019-02-20 09:12:08.535408 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82688 : cluster [WRN] Health check failed: Degraded data redundancy: 9077/1587282 objects degraded (0.572%), 17 pgs degraded (PG_DEGRADED)
2019-02-20 09:12:12.102308 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82690 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2019-02-20 09:12:14.525307 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82691 : cluster [WRN] Health check update: Degraded data redundancy: 22741/1587282 objects degraded (1.433%), 46 pgs degraded (PG_DEGRADED)
2019-02-20 09:12:48.420280 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82693 : cluster [WRN] Health check update: Degraded data redundancy: 22741/1587284 objects degraded (1.433%), 46 pgs degraded (PG_DEGRADED)
2019-02-20 09:13:06.848503 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82695 : cluster [WRN] Health check failed: Reduced data availability: 46 pgs inactive (PG_AVAILABILITY)
2019-02-20 09:13:06.848553 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82696 : cluster [WRN] Health check update: Degraded data redundancy: 22741/1587284 objects degraded (1.433%), 46 pgs degraded, 44 pgs undersized (PG_DEGRADED)
2019-02-20 09:13:14.565445 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82697 : cluster [WRN] Health check update: Degraded data redundancy: 22741/1587284 objects degraded (1.433%), 46 pgs degraded, 46 pgs undersized (PG_DEGRADED)
2019-02-20 09:19:07.306127 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82731 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-20 09:19:07.394314 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82732 : cluster [INF] osd.15 10.9.9.3:6800/1487797 boot
2019-02-20 09:19:09.445225 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82736 : cluster [WRN] Health check update: Reduced data availability: 46 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2019-02-20 09:19:09.445269 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82737 : cluster [WRN] Health check update: Degraded data redundancy: 19958/1587284 objects degraded (1.257%), 41 pgs degraded, 41 pgs undersized (PG_DEGRADED)
2019-02-20 09:19:13.692237 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82738 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 15420/1587284 objects degraded (0.971%), 32 pgs degraded, 32 pgs undersized)
2019-02-20 09:19:14.657433 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82739 : cluster [WRN] Health check update: Reduced data availability: 3 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2019-02-20 09:19:15.772723 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82740 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 3 pgs peering)
2019-02-20 09:19:15.772785 mon.bluehub-prox01 mon.0 10.9.9.1:6789/0 82741 : cluster [INF] Cluster is now healthy

I have a 5-node cluster.

ceph.conf, crush map and cluster info:

Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.9.9.0/24
     fsid = 0e7f096e-de29-41c9-b862-1a9c2ec15978
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = 10.9.9.0/24

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.bluehub-prox05]
     host = bluehub-prox05
     mon addr = 10.9.9.5:6789

[mon.bluehub-prox04]
     host = bluehub-prox04
     mon addr = 10.9.9.4:6789

[mon.bluehub-prox01]
     host = bluehub-prox01
     mon addr = 10.9.9.1:6789

============================================================
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host bluehub-prox01 {
   id -3       # do not change unnecessarily
   id -4 class hdd       # do not change unnecessarily
   # weight 4.905
   alg straw2
   hash 0   # rjenkins1
   item osd.0 weight 0.817
   item osd.1 weight 0.817
   item osd.3 weight 0.817
   item osd.5 weight 0.817
   item osd.9 weight 0.817
   item osd.11 weight 0.817
}
host bluehub-prox02 {
   id -5       # do not change unnecessarily
   id -6 class hdd       # do not change unnecessarily
   # weight 0.000
   alg straw2
   hash 0   # rjenkins1
}
host bluehub-prox03 {
   id -7       # do not change unnecessarily
   id -8 class hdd       # do not change unnecessarily
   # weight 1.909
   alg straw2
   hash 0   # rjenkins1
   item osd.4 weight 0.273
   item osd.7 weight 0.273
   item osd.10 weight 0.273
   item osd.12 weight 0.273
   item osd.13 weight 0.273
   item osd.14 weight 0.273
   item osd.15 weight 0.273
}
host bluehub-prox05 {
   id -9       # do not change unnecessarily
   id -10 class hdd       # do not change unnecessarily
   # weight 10.915
   alg straw2
   hash 0   # rjenkins1
   item osd.16 weight 1.091
   item osd.17 weight 1.091
   item osd.18 weight 1.091
   item osd.19 weight 1.091
   item osd.20 weight 1.091
   item osd.21 weight 1.091
   item osd.22 weight 1.091
   item osd.23 weight 1.091
   item osd.24 weight 1.091
   item osd.25 weight 1.091
}
host bluehub-prox04 {
   id -11       # do not change unnecessarily
   id -12 class hdd       # do not change unnecessarily
   # weight 4.905
   alg straw2
   hash 0   # rjenkins1
   item osd.2 weight 0.817
   item osd.6 weight 0.817
   item osd.8 weight 0.817
   item osd.26 weight 0.817
   item osd.27 weight 0.817
   item osd.28 weight 0.817
}
root default {
   id -1       # do not change unnecessarily
   id -2 class hdd       # do not change unnecessarily
   # weight 22.634
   alg straw2
   hash 0   # rjenkins1
   item bluehub-prox01 weight 4.905
   item bluehub-prox02 weight 0.000
   item bluehub-prox03 weight 1.909
   item bluehub-prox05 weight 10.915
   item bluehub-prox04 weight 4.905
}

# rules
rule replicated_rule {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type host
   step emit
}

# end crush map


============================================================

pveceph pool ls

Name                       size   min_size     pg_num     %-used                 used
data                          2          2        512      29.27        2778343453069
data2                         2          2        512       5.31         376847421567
data3                         2          2        512       2.46         169212454177

============================================================
Quorum information
------------------
Date:             Wed Feb 20 09:24:58 2019
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1/4656
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.1.1.1 (local)
0x00000002          1 10.1.1.2
0x00000003          1 10.1.1.3
0x00000005          1 10.1.1.4
0x00000004          1 10.1.1.5

============================================================

pveversion  -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 12.2.11-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
The problem is that you have size = min_size, so any OSD going down drops some PGs below min_size and I/O on the pool freezes.
Change size to 3 (this will cause mass data movement, so be advised).
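
For reference, a minimal sketch of that change with the standard Ceph CLI, assuming the pool names from the pveceph pool ls output above (adjust to your own pools); size is how many copies are kept, min_size is how many must be available before the pool serves I/O:

Code:
# keep 3 copies, but keep serving I/O as long as 2 of them are available
ceph osd pool set data size 3
ceph osd pool set data min_size 2
# repeat for data2 and data3, then watch the recovery progress
ceph -s

With size=3 and min_size=2 the pool stays writable through any single host failure, at the cost of one extra replica of raw space.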
 

Thank you Jarek for your time and your reply. I really need to read a manual on Ceph.
I changed the setting; I think the rebalancing will be done in about 2 hours, then I will run a new test.
I was desperate; I had even thought about rolling back to version 4.
Thanks again.
Luca
 
Hi,
I have 3 Proxmox clusters, 2 on Hetzner and one at home.
Every cluster has 3 nodes, and every node has 1 OSD.
When my pool size is 3 and min_size is 2, the Ceph pool freezes when one node is down.
I changed min_size to 1 and the Ceph pool works like a charm.
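For completeness, a minimal sketch of that change, using a placeholder pool name (rbd) that you would replace with your own; note that min_size=1 lets the pool serve I/O from a single surviving copy, but writes are then acknowledged with no redundancy, so it trades safety for availability:

Code:
# placeholder pool name; substitute your own pool
ceph osd pool set rbd min_size 1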
This is my crush map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host returner {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 0.455
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.455
}
host rebbel {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 0.455
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 0.455
}
host raiser {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 0.466
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.466
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 1.375
    alg straw2
    hash 0    # rjenkins1
    item returner weight 0.455
    item rebbel weight 0.455
    item raiser weight 0.466
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
 
