OSDs out after creating OSDs on the Ceph cluster built into Proxmox

itvietnam

Hi,

I have 2 nodes, each with 6 OSDs, and I followed this guide: https://pve.proxmox.com/wiki/Ceph_Server to create the Ceph cluster. However, after creating the OSDs from the web GUI, it shows 4/8 OSDs as out.
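For reference, the per-node steps from that guide boil down to roughly the following (Luminous-era pveceph commands; the network and device names are placeholders, not exactly my setup):

Code:
pveceph install --version luminous     # on every node
pveceph init --network 10.40.10.0/24   # once, on the first node
pveceph createmon                      # on each monitor node
pveceph createosd /dev/sdb             # repeat for every data disk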

I tried creating a pool on Ceph and copying data to it. The storage cluster reported that it was degraded.

Code:
2018-05-28 01:53:43.980574 mon.hl101 mon.0 10.40.10.1:6789/0 715 : cluster [WRN] Health check failed: Degraded data redundancy: 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:53:58.111651 mon.hl101 mon.0 10.40.10.1:6789/0 726 : cluster [WRN] Health check update: Degraded data redundancy: 1/3 objects degraded (33.333%), 1 pg degraded, 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:54:04.184970 mon.hl101 mon.0 10.40.10.1:6789/0 729 : cluster [WRN] Health check update: Degraded data redundancy: 143/429 objects degraded (33.333%), 96 pgs degraded, 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:54:09.340770 mon.hl101 mon.0 10.40.10.1:6789/0 730 : cluster [WRN] Health check update: Degraded data redundancy: 246/738 objects degraded (33.333%), 146 pgs degraded, 256 pgs undersized (PG_DEGRADED)

I have deleted this pool and the storage health is OK again.
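(For anyone searching: deleting a pool from the CLI needs the safety flag, and on Luminous the MONs must allow it via mon_allow_pool_delete = true; the pool name below is just an example.)

Code:
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it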

I navigated to Disks on each node and saw that the Usage column shows "partitions". After searching, I found this thread and followed its instructions: zap the disks and create the OSDs again. Now it shows 12 in and 4 out, 16 disks in total (we actually have only 12 disks).
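For reference, the zap-and-recreate sequence is roughly the following (Luminous still ships ceph-disk; /dev/sdX stands for each affected disk):

Code:
# WARNING: zap is destructive, it wipes the partition table and Ceph metadata
ceph-disk zap /dev/sdX
# then recreate the OSD, either from the GUI or via:
pveceph createosd /dev/sdX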

(screenshot attached: 2018-05-28_18-42-29.png)


May I know how to fix this error? Is it because there are only 2 MONs on 2 nodes (no stable cluster quorum)?

Thanks,
 
ceph osd df shows the following:

Code:
root@hl102:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS
 1   hdd 0.45419  1.00000  465G  1275M  463G 0.27 1.21   0
10   hdd 0.45419  1.00000  465G  1039M  464G 0.22 0.98   0
11   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
12   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
13   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
14   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 3   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 4   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 5   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 6   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 7   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
15   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 0             0        0     0      0     0    0    0   0
 2             0        0     0      0     0    0    0   0
 8             0        0     0      0     0    0    0   0
 9             0        0     0      0     0    0    0   0
                    TOTAL 5581G 12667M 5569G 0.22         
MIN/MAX VAR: 0.98/1.21  STDDEV: 0.01
 
I have taken the following steps and my storage looks OK now:

Code:
165  /etc/init.d/ceph stop osd.0
  166  /etc/init.d/ceph stop osd.2
  167  /etc/init.d/ceph stop osd.8
  168  /etc/init.d/ceph stop osd.9
  169  ceph osd tree
  170  ceph auth del osd.0
  171  ceph auth del osd.2
  172  ceph auth del osd.8
  173  ceph auth del osd.9
  174  ceph osd tree
  175  ceph osd rm 0
  176  ceph osd rm 2
  177  ceph osd rm 8
  178  ceph osd rm 9
  179  ceph osd tree
  180  history
root@hl102:~# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       5.45032 root default                           
-3       2.72516     host hl101                         
 1   hdd 0.45419         osd.1      up  1.00000 1.00000
10   hdd 0.45419         osd.10     up  1.00000 1.00000
11   hdd 0.45419         osd.11     up  1.00000 1.00000
12   hdd 0.45419         osd.12     up  1.00000 1.00000
13   hdd 0.45419         osd.13     up  1.00000 1.00000
14   hdd 0.45419         osd.14     up  1.00000 1.00000
-5       2.72516     host hl102                         
 3   hdd 0.45419         osd.3      up  1.00000 1.00000
 4   hdd 0.45419         osd.4      up  1.00000 1.00000
 5   hdd 0.45419         osd.5      up  1.00000 1.00000
 6   hdd 0.45419         osd.6      up  1.00000 1.00000
 7   hdd 0.45419         osd.7      up  1.00000 1.00000
15   hdd 0.45419         osd.15     up  1.00000 1.00000
root@hl102:~#
 
For the ceph installation, see our included docs (also locally available).
https://pve.proxmox.com/pve-docs/chapter-pveceph.html

To get rid of the 4 OSDs, try this from the following post.
https://forum.proxmox.com/threads/phantom-destroyed-osd.43794/#post-209814
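The gist of it is removing the leftover entries from cephx auth, the CRUSH map, and the OSD map, roughly like this (osd.0 is just an example ID; repeat for each phantom OSD):

Code:
ceph osd out osd.0            # harmless if the OSD is already out
ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # remove its cephx key
ceph osd rm 0                 # remove it from the OSD map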

EDIT: your ceph osd tree looks fine now; the GUI should reflect that too. A two-node (and therefore two-MON) cluster is not a good idea, as you mentioned, because of quorum.
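A quick way to check both quorums from the CLI (standard commands, nothing specific to your setup):

Code:
pvecm status                              # Proxmox cluster quorum
ceph mon stat                             # which MONs are in quorum
ceph quorum_status --format json-pretty   # detailed monitor quorum info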
Thanks, I have managed it already.

However, I tried creating an SSD pool again and then migrated a disk to this SSD pool. The storage became degraded again:

(screenshot attached: 2018-05-28_19-00-39.png)


Code:
Degraded data redundancy: 261/783 objects degraded (33.333%), 167 pgs degraded, 256 pgs undersized
pg 2.cd is active+undersized+degraded, acting [3,14]
pg 2.ce is stuck undersized for 241.921800, current state active+undersized, last acting [12,15]
pg 2.cf is stuck undersized for 241.922896, current state active+undersized, last acting [14,5]
pg 2.d0 is stuck undersized for 241.923909, current state active+undersized+degraded, last acting [1,5]
pg 2.d1 is stuck undersized for 241.924987, current state active+undersized, last acting [7,10]
pg 2.d2 is stuck undersized for 241.920518, current state active+undersized, last acting [15,14]
pg 2.d3 is stuck undersized for 241.922246, current state active+undersized, last acting [12,5]
pg 2.d4 is stuck undersized for 241.924584, current state active+undersized, last acting [3,11]
pg 2.d5 is stuck undersized for 241.924762, current state active+undersized+degraded, last acting [5,11]
pg 2.d6 is stuck undersized for 241.925697, current state active+undersized, last acting [3,12]
pg 2.d7 is stuck undersized for 241.920272, current state active+undersized, last acting [4,10]
pg 2.d8 is stuck undersized for 241.925261, current state active+undersized+degraded, last acting [4,1]
pg 2.d9 is stuck undersized for 241.925018, current state active+undersized, last acting [3,14]
pg 2.da is stuck undersized for 241.919320, current state active+undersized, last acting [14,3]
pg 2.db is stuck undersized for 241.922853, current state active+undersized+degraded, last acting [13,7]
pg 2.dc is stuck undersized for 241.922715, current state active+undersized+degraded, last acting [12,6]
pg 2.dd is stuck undersized for 241.924406, current state active+undersized+degraded, last acting [1,5]
pg 2.de is stuck undersized for 241.921082, current state active+undersized+degraded, last acting [10,4]
pg 2.df is stuck undersized for 241.923196, current state active+undersized+degraded, last acting [12,7]
pg 2.e0 is stuck undersized for 241.925744, current state active+undersized+degraded, last acting [3,12]
pg 2.e1 is stuck undersized for 241.923364, current state active+undersized+degraded, last acting [14,6]
pg 2.e2 is stuck undersized for 241.921661, current state active+undersized, last acting [10,3]
pg 2.e3 is stuck undersized for 241.923669, current state active+undersized, last acting [12,5]
pg 2.e4 is stuck undersized for 241.922249, current state active+undersized+degraded, last acting [10,7]
pg 2.e5 is stuck undersized for 241.922070, current state active+undersized, last acting [13,6]
pg 2.e6 is stuck undersized for 241.925335, current state active+undersized+degraded, last acting [4,10]
pg 2.e7 is stuck undersized for 241.925305, current state active+undersized+degraded, last acting [5,12]
pg 2.e8 is stuck undersized for 241.925347, current state active+undersized+degraded, last acting [11,7]
pg 2.e9 is stuck undersized for 241.924774, current state active+undersized+degraded, last acting [5,11]
pg 2.ea is stuck undersized for 241.924149, current state active+undersized+degraded, last acting [14,4]
pg 2.eb is stuck undersized for 241.925337, current state active+undersized+degraded, last acting [5,1]
pg 2.ec is stuck undersized for 241.924123, current state active+undersized+degraded, last acting [13,3]
pg 2.ed is stuck undersized for 241.924225, current state active+undersized+degraded, last acting [12,7]
pg 2.ee is stuck undersized for 241.923116, current state active+undersized+degraded, last acting [12,5]
pg 2.ef is stuck undersized for 241.922817, current state active+undersized+degraded, last acting [10,5]
pg 2.f0 is stuck undersized for 241.925746, current state active+undersized, last acting [7,11]
pg 2.f1 is stuck undersized for 241.924664, current state active+undersized+degraded, last acting [6,14]
pg 2.f2 is stuck undersized for 241.923422, current state active+undersized+degraded, last acting [6,13]
pg 2.f3 is stuck undersized for 241.924862, current state active+undersized, last acting [11,3]
pg 2.f4 is stuck undersized for 241.923919, current state active+undersized+degraded, last acting [6,13]
pg 2.f5 is stuck undersized for 241.923621, current state active+undersized, last acting [12,3]
pg 2.f6 is stuck undersized for 241.921545, current state active+undersized, last acting [4,14]
pg 2.f7 is stuck undersized for 241.924256, current state active+undersized+degraded, last acting [12,15]
pg 2.f8 is stuck undersized for 241.918920, current state active+undersized, last acting [12,7]
pg 2.f9 is stuck undersized for 241.918781, current state active+undersized, last acting [1,5]
pg 2.fa is stuck undersized for 241.925368, current state active+undersized+degraded, last acting [11,15]
pg 2.fb is stuck undersized for 241.924740, current state active+undersized, last acting [6,10]
pg 2.fc is stuck undersized for 241.917970, current state active+undersized+degraded, last acting [1,5]
pg 2.fd is stuck undersized for 241.922721, current state active+undersized+degraded, last acting [5,11]
pg 2.fe is stuck undersized for 241.918092, current state active+undersized+degraded, last acting [12,5]
pg 2.ff is stuck undersized for 241.923612, current state active+undersized
 
You only have two nodes, and a pool has a replica count of 3 by default. In small setups a smaller replica count is also advised against, as the chance of a failure rendering the data inaccessible is quite high.

To get rid of this, either create a pool with size=2 and min_size=1, or better, add another node. As written above, a third node is also advisable for quorum.
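From the CLI that would look roughly like this (the pool name and PG count are only examples; the GUI can do the same):

Code:
ceph osd pool create ssd 256 256 replicated
ceph osd pool set ssd size 2
ceph osd pool set ssd min_size 1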
 
Thanks, this is a temporary setup, and we will add more nodes once we are done migrating the VMs off the old system (the old servers will be reused as new Proxmox nodes).

size=2; min_size=1

Does this mean we can lose half the system (6 OSDs simultaneously, in my current setup)?
 
Replication is on the host level by default, so one host (or all 6 OSDs of that host) can be down while the cluster is still able to write to the Ceph pool (min_size=1).
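You can see that failure domain in the CRUSH rule, e.g. (assuming the pool uses the default replicated_rule):

Code:
ceph osd crush rule dump replicated_rule
# look for: "op": "chooseleaf_firstn", "type": "host"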
 
There is something strange. I deleted the old pool and created a new pool with size=2 and min_size=1. Health was OK and 12 OSDs were in.

This morning, 1 OSD is out again:

(screenshot attached: 2018-05-29_11-13-53.png)


Code:
Degraded data redundancy: 2480/235442 objects degraded (1.053%), 7 pgs degraded, 7 pgs undersized
pg 3.46 is stuck undersized for 859.746038, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.7e is stuck undersized for 859.762467, current state active+undersized+degraded+remapped+backfilling, last acting [14]
pg 3.c0 is stuck undersized for 859.748271, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.c4 is stuck undersized for 859.751243, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.e2 is stuck undersized for 859.749336, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.e9 is stuck undersized for 859.752773, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.f9 is stuck undersized for 859.747215, current state active+undersized+degraded+remapped+backfilling, last acting [10]
 
What are the logs saying? If one disk out of 12 failed, there might be a hardware error.
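Things worth checking (the OSD ID and device name below are placeholders; smartctl needs the smartmontools package):

Code:
ceph osd tree | grep -w down           # which OSD dropped out
journalctl -u ceph-osd@<id>            # that OSD's daemon log
less /var/log/ceph/ceph-osd.<id>.log   # same information from the log file
smartctl -a /dev/sdX                   # SMART health of the underlying disk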
 
