OSDs out after creating OSDs on the Ceph cluster built into Proxmox

itvietnam

Hi,

I have 2 nodes, each with 6 OSDs, and I followed this guide: https://pve.proxmox.com/wiki/Ceph_Server to create the Ceph cluster. However, after creating the OSDs from the web GUI, it shows 4/8 OSDs as out.
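For reference, the per-node steps from that guide boil down to roughly the following (Luminous-era pveceph commands; the network and device names are placeholders, not exactly my setup):

Code:
pveceph install --version luminous     # on every node
pveceph init --network 10.40.10.0/24   # once, on the first node
pveceph createmon                      # on each monitor node
pveceph createosd /dev/sdb             # repeat for every data disk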

I tried creating a pool on Ceph and copying data to it. The storage cluster reported that it was degraded.

Code:
2018-05-28 01:53:43.980574 mon.hl101 mon.0 10.40.10.1:6789/0 715 : cluster [WRN] Health check failed: Degraded data redundancy: 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:53:58.111651 mon.hl101 mon.0 10.40.10.1:6789/0 726 : cluster [WRN] Health check update: Degraded data redundancy: 1/3 objects degraded (33.333%), 1 pg degraded, 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:54:04.184970 mon.hl101 mon.0 10.40.10.1:6789/0 729 : cluster [WRN] Health check update: Degraded data redundancy: 143/429 objects degraded (33.333%), 96 pgs degraded, 256 pgs undersized (PG_DEGRADED)
2018-05-28 01:54:09.340770 mon.hl101 mon.0 10.40.10.1:6789/0 730 : cluster [WRN] Health check update: Degraded data redundancy: 246/738 objects degraded (33.333%), 146 pgs degraded, 256 pgs undersized (PG_DEGRADED)

I have deleted this pool and the storage health is OK again.
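(For anyone searching: deleting a pool from the CLI needs the safety flag, and on Luminous the MONs must allow it via mon_allow_pool_delete = true; the pool name below is just an example.)

Code:
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it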

I navigated to Disks on each node and saw that the Usage column shows "partitions". After searching, I found this thread and followed its instructions: zap the disks and create the OSDs again. Now it shows 12 in and 4 out, 16 disks in total (we actually have only 12 disks).
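For reference, the zap-and-recreate sequence is roughly the following (Luminous still ships ceph-disk; /dev/sdX stands for each affected disk):

Code:
# WARNING: zap is destructive, it wipes the partition table and Ceph metadata
ceph-disk zap /dev/sdX
# then recreate the OSD, either from the GUI or via:
pveceph createosd /dev/sdX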

(screenshot attached: 2018-05-28_18-42-29.png)


May I know how to fix this error? Is it because there are only 2 MONs on 2 nodes (no stable cluster quorum)?

Thanks,
 
ceph osd df shows the following:

Code:
root@hl102:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE VAR  PGS
 1   hdd 0.45419  1.00000  465G  1275M  463G 0.27 1.21   0
10   hdd 0.45419  1.00000  465G  1039M  464G 0.22 0.98   0
11   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
12   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
13   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
14   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 3   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 4   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 5   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 6   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 7   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
15   hdd 0.45419  1.00000  465G  1035M  464G 0.22 0.98   0
 0             0        0     0      0     0    0    0   0
 2             0        0     0      0     0    0    0   0
 8             0        0     0      0     0    0    0   0
 9             0        0     0      0     0    0    0   0
                    TOTAL 5581G 12667M 5569G 0.22         
MIN/MAX VAR: 0.98/1.21  STDDEV: 0.01
 
I have taken the following steps and my storage looks OK now:

Code:
165  /etc/init.d/ceph stop osd.0
  166  /etc/init.d/ceph stop osd.2
  167  /etc/init.d/ceph stop osd.8
  168  /etc/init.d/ceph stop osd.9
  169  ceph osd tree
  170  ceph auth del osd.0
  171  ceph auth del osd.2
  172  ceph auth del osd.8
  173  ceph auth del osd.9
  174  ceph osd tree
  175  ceph osd rm 0
  176  ceph osd rm 2
  177  ceph osd rm 8
  178  ceph osd rm 9
  179  ceph osd tree
  180  history
root@hl102:~# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       5.45032 root default                           
-3       2.72516     host hl101                         
 1   hdd 0.45419         osd.1      up  1.00000 1.00000
10   hdd 0.45419         osd.10     up  1.00000 1.00000
11   hdd 0.45419         osd.11     up  1.00000 1.00000
12   hdd 0.45419         osd.12     up  1.00000 1.00000
13   hdd 0.45419         osd.13     up  1.00000 1.00000
14   hdd 0.45419         osd.14     up  1.00000 1.00000
-5       2.72516     host hl102                         
 3   hdd 0.45419         osd.3      up  1.00000 1.00000
 4   hdd 0.45419         osd.4      up  1.00000 1.00000
 5   hdd 0.45419         osd.5      up  1.00000 1.00000
 6   hdd 0.45419         osd.6      up  1.00000 1.00000
 7   hdd 0.45419         osd.7      up  1.00000 1.00000
15   hdd 0.45419         osd.15     up  1.00000 1.00000
root@hl102:~#
 
For the ceph installation, see our included docs (also locally available).
https://pve.proxmox.com/pve-docs/chapter-pveceph.html

To get rid of the 4 OSDs, try this from the following post.
https://forum.proxmox.com/threads/phantom-destroyed-osd.43794/#post-209814
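The gist of it is removing the leftover entries from cephx auth, the CRUSH map, and the OSD map, roughly like this (osd.0 is just an example ID; repeat for each phantom OSD):

Code:
ceph osd out osd.0            # harmless if the OSD is already out
ceph osd crush remove osd.0   # remove it from the CRUSH map
ceph auth del osd.0           # remove its cephx key
ceph osd rm 0                 # remove it from the OSD map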

EDIT: your ceph osd tree looks fine now; the GUI should reflect that too. A two-node (and therefore two-MON) cluster is not a good idea, as you mentioned, because of quorum.
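A quick way to check both quorums from the CLI (standard commands, nothing specific to your setup):

Code:
pvecm status                              # Proxmox cluster quorum
ceph mon stat                             # which MONs are in quorum
ceph quorum_status --format json-pretty   # detailed monitor quorum info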
Thanks, I have managed it already.

However, I tried creating an SSD pool again and then migrated a disk to this SSD pool. The storage became degraded again:

(screenshot attached: 2018-05-28_19-00-39.png)


Code:
Degraded data redundancy: 261/783 objects degraded (33.333%), 167 pgs degraded, 256 pgs undersized
pg 2.cd is active+undersized+degraded, acting [3,14]
pg 2.ce is stuck undersized for 241.921800, current state active+undersized, last acting [12,15]
pg 2.cf is stuck undersized for 241.922896, current state active+undersized, last acting [14,5]
pg 2.d0 is stuck undersized for 241.923909, current state active+undersized+degraded, last acting [1,5]
pg 2.d1 is stuck undersized for 241.924987, current state active+undersized, last acting [7,10]
pg 2.d2 is stuck undersized for 241.920518, current state active+undersized, last acting [15,14]
pg 2.d3 is stuck undersized for 241.922246, current state active+undersized, last acting [12,5]
pg 2.d4 is stuck undersized for 241.924584, current state active+undersized, last acting [3,11]
pg 2.d5 is stuck undersized for 241.924762, current state active+undersized+degraded, last acting [5,11]
pg 2.d6 is stuck undersized for 241.925697, current state active+undersized, last acting [3,12]
pg 2.d7 is stuck undersized for 241.920272, current state active+undersized, last acting [4,10]
pg 2.d8 is stuck undersized for 241.925261, current state active+undersized+degraded, last acting [4,1]
pg 2.d9 is stuck undersized for 241.925018, current state active+undersized, last acting [3,14]
pg 2.da is stuck undersized for 241.919320, current state active+undersized, last acting [14,3]
pg 2.db is stuck undersized for 241.922853, current state active+undersized+degraded, last acting [13,7]
pg 2.dc is stuck undersized for 241.922715, current state active+undersized+degraded, last acting [12,6]
pg 2.dd is stuck undersized for 241.924406, current state active+undersized+degraded, last acting [1,5]
pg 2.de is stuck undersized for 241.921082, current state active+undersized+degraded, last acting [10,4]
pg 2.df is stuck undersized for 241.923196, current state active+undersized+degraded, last acting [12,7]
pg 2.e0 is stuck undersized for 241.925744, current state active+undersized+degraded, last acting [3,12]
pg 2.e1 is stuck undersized for 241.923364, current state active+undersized+degraded, last acting [14,6]
pg 2.e2 is stuck undersized for 241.921661, current state active+undersized, last acting [10,3]
pg 2.e3 is stuck undersized for 241.923669, current state active+undersized, last acting [12,5]
pg 2.e4 is stuck undersized for 241.922249, current state active+undersized+degraded, last acting [10,7]
pg 2.e5 is stuck undersized for 241.922070, current state active+undersized, last acting [13,6]
pg 2.e6 is stuck undersized for 241.925335, current state active+undersized+degraded, last acting [4,10]
pg 2.e7 is stuck undersized for 241.925305, current state active+undersized+degraded, last acting [5,12]
pg 2.e8 is stuck undersized for 241.925347, current state active+undersized+degraded, last acting [11,7]
pg 2.e9 is stuck undersized for 241.924774, current state active+undersized+degraded, last acting [5,11]
pg 2.ea is stuck undersized for 241.924149, current state active+undersized+degraded, last acting [14,4]
pg 2.eb is stuck undersized for 241.925337, current state active+undersized+degraded, last acting [5,1]
pg 2.ec is stuck undersized for 241.924123, current state active+undersized+degraded, last acting [13,3]
pg 2.ed is stuck undersized for 241.924225, current state active+undersized+degraded, last acting [12,7]
pg 2.ee is stuck undersized for 241.923116, current state active+undersized+degraded, last acting [12,5]
pg 2.ef is stuck undersized for 241.922817, current state active+undersized+degraded, last acting [10,5]
pg 2.f0 is stuck undersized for 241.925746, current state active+undersized, last acting [7,11]
pg 2.f1 is stuck undersized for 241.924664, current state active+undersized+degraded, last acting [6,14]
pg 2.f2 is stuck undersized for 241.923422, current state active+undersized+degraded, last acting [6,13]
pg 2.f3 is stuck undersized for 241.924862, current state active+undersized, last acting [11,3]
pg 2.f4 is stuck undersized for 241.923919, current state active+undersized+degraded, last acting [6,13]
pg 2.f5 is stuck undersized for 241.923621, current state active+undersized, last acting [12,3]
pg 2.f6 is stuck undersized for 241.921545, current state active+undersized, last acting [4,14]
pg 2.f7 is stuck undersized for 241.924256, current state active+undersized+degraded, last acting [12,15]
pg 2.f8 is stuck undersized for 241.918920, current state active+undersized, last acting [12,7]
pg 2.f9 is stuck undersized for 241.918781, current state active+undersized, last acting [1,5]
pg 2.fa is stuck undersized for 241.925368, current state active+undersized+degraded, last acting [11,15]
pg 2.fb is stuck undersized for 241.924740, current state active+undersized, last acting [6,10]
pg 2.fc is stuck undersized for 241.917970, current state active+undersized+degraded, last acting [1,5]
pg 2.fd is stuck undersized for 241.922721, current state active+undersized+degraded, last acting [5,11]
pg 2.fe is stuck undersized for 241.918092, current state active+undersized+degraded, last acting [12,5]
pg 2.ff is stuck undersized for 241.923612, current state active+undersized
 
You only have two nodes, and a pool has a replica count of 3 by default. In small setups a smaller replica count is also advised against, as the chance of a failure rendering the data inaccessible is quite high.

To get rid of this, either create a pool with size=2 and min_size=1, or better, add another node. As written above, a third node is also advisable for quorum.
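From the CLI that would look roughly like this (the pool name and PG count are only examples; the GUI can do the same):

Code:
ceph osd pool create ssd 256 256 replicated
ceph osd pool set ssd size 2
ceph osd pool set ssd min_size 1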
 
Thanks, this is a temporary setup, and we will add more nodes once we are done migrating the VMs off the old system (the old servers will be reused as new Proxmox nodes).

size=2; min_size=1

Does this mean we can lose half the system (6 OSDs simultaneously, in my current setup)?
 
Replication is on the host level by default, so one host (or all 6 OSDs of that host) can be down while the cluster is still able to write to the Ceph pool (min_size=1).
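You can see that failure domain in the CRUSH rule, e.g. (assuming the pool uses the default replicated_rule):

Code:
ceph osd crush rule dump replicated_rule
# look for: "op": "chooseleaf_firstn", "type": "host"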
 
There is something strange. I deleted the old pool and created a new pool with size=2 and min_size=1. Health was OK and 12 OSDs were in.

This morning, 1 OSD is out again:

(screenshot attached: 2018-05-29_11-13-53.png)


Code:
Degraded data redundancy: 2480/235442 objects degraded (1.053%), 7 pgs degraded, 7 pgs undersized
pg 3.46 is stuck undersized for 859.746038, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.7e is stuck undersized for 859.762467, current state active+undersized+degraded+remapped+backfilling, last acting [14]
pg 3.c0 is stuck undersized for 859.748271, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.c4 is stuck undersized for 859.751243, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.e2 is stuck undersized for 859.749336, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.e9 is stuck undersized for 859.752773, current state active+undersized+degraded+remapped+backfill_wait, last acting [14]
pg 3.f9 is stuck undersized for 859.747215, current state active+undersized+degraded+remapped+backfilling, last acting [10]
 
What are the logs saying? If one disk out of 12 failed, there might be a hardware error.
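Things worth checking (the OSD ID and device name below are placeholders; smartctl needs the smartmontools package):

Code:
ceph osd tree | grep -w down           # which OSD dropped out
journalctl -u ceph-osd@<id>            # that OSD's daemon log
less /var/log/ceph/ceph-osd.<id>.log   # same information from the log file
smartctl -a /dev/sdX                   # SMART health of the underlying disk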
 
