Ceph cluster health real bad

CrowJenkins

New Member
Jul 12, 2018
Hi all. I'm (VERY) new to PVE and Ceph. Let's put it this way: my knowledge of Ceph is "I've heard of it once". We've been having some issues for the past few months, and I'm not entirely sure where to start.

Code:
ceph osd tree
# id    weight  type name       up/down reweight
-1      113.9   root default
-3      2.7             host ceph02
1       0.54                    osd.1   down    0
2       0.54                    osd.2   down    0
3       0.54                    osd.3   down    0
4       0.54                    osd.4   down    0
5       0.54                    osd.5   down    0
-4      2.16            host ceph03
6       0.54                    osd.6   up      1
7       0.54                    osd.7   down    0
8       0.54                    osd.8   up      1
9       0.54                    osd.9   up      1
-2      27.25           host ceph01
10      5.45                    osd.10  down    0
11      5.45                    osd.11  down    0
14      5.45                    osd.14  down    0
15      5.45                    osd.15  down    0
0       5.45                    osd.0   down    0
-6      27.25           host ceph05
19      5.45                    osd.19  up      1
20      5.45                    osd.20  up      1
21      5.45                    osd.21  up      1
22      5.45                    osd.22  up      1
23      5.45                    osd.23  up      1
-7      27.25           host ceph06
24      5.45                    osd.24  down    0
26      5.45                    osd.26  up      1
27      5.45                    osd.27  up      1
29      5.45                    osd.29  up      1
25      5.45                    osd.25  down    0
-5      27.25           host ceph4
12      5.45                    osd.12  down    0
16      5.45                    osd.16  down    0
17      5.45                    osd.17  down    0
18      5.45                    osd.18  down    0
13      5.45                    osd.13  down    0

Code:
ceph health
HEALTH_WARN 1992 pgs backfill; 14 pgs backfilling; 2760 pgs degraded; 2283 pgs down; 2506 pgs peering; 687 pgs stale; 2611 pgs stuck inactive; 687 pgs stuck stale; 5612 pgs stuck unclean; recovery 2498299/10101735 objects degraded (24.731%); clock skew detected on mon.1, mon.2

Code:
ceph status
    cluster a62c1605-2026-44e8-8496-696a8d070b2f
     health HEALTH_WARN 1992 pgs backfill; 14 pgs backfilling; 2760 pgs degraded; 2283 pgs down; 2506 pgs peering; 687 pgs stale; 2611 pgs stuck inactive; 687 pgs stuck stale; 5612 pgs stuck unclean; recovery 2497013/10101735 objects degraded (24.719%); clock skew detected on mon.1, mon.2
     monmap e21: 5 mons at {0=10.254.253.100:6789/0,1=10.254.253.101:6789/0,2=10.254.253.102:6789/0,4=10.254.253.104:6789/0,5=10.254.253.105:6789/0}, election epoch 568, quorum 0,1,2,3,4 0,1,2,4,5
     mdsmap e54: 0/0/1 up
     osdmap e119016: 29 osds: 11 up, 11 in
      pgmap v115919348: 6522 pgs, 5 pools, 13004 GB data, 3264 kobjects
            25469 GB used, 20841 GB / 46311 GB avail
            2497013/10101735 objects degraded (24.719%)
                   3 down+remapped+peering
                 695 active+clean
                 139 stale+down+peering
                  13 stale+remapped+peering
                  14 active+degraded+remapped+backfilling
                2138 down+peering
                1990 active+degraded+remapped+wait_backfill
                 239 active+remapped
                   4 stale+active+degraded
                   2 active+clean+scrubbing+deep
                   2 active+remapped+wait_backfill
                 105 stale
                   3 stale+down+remapped+peering
                 752 active+degraded
                 213 stale+active+clean
                 210 stale+peering

Long and short: I can't start some VMs because this is FUBAR'd, and I can't back anything up (this Ceph cluster was also housing our backups). I didn't set it up and was really just thrown into this. The person who did set it up isn't available (they no longer work here), so I'm floundering.
 
You need to get the OSDs back into the cluster; follow the link @alexskysilk posted. If there are still issues, post more details, including logs.
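As a rough starting point (hedged, since I don't know your exact Ceph release or whether the down OSDs have failed disks), you can check whether each down OSD daemon is simply stopped and try to bring it back, and clear the clock skew on the monitors by making sure time sync is running on every node. The OSD id (1 here) is just a placeholder for whichever OSD is down on that host:

Code:
# on the host that owns the down OSD, see whether the daemon is running
ceph osd tree                        # confirm which OSDs are down and on which host
systemctl status ceph-osd@1          # systemd setups; older sysvinit clusters: service ceph status osd.1

# if the daemon is only stopped, try starting it and watch its log
systemctl start ceph-osd@1           # older clusters: service ceph start osd.1
tail -f /var/log/ceph/ceph-osd.1.log

# clock skew on mon.1 / mon.2: make sure time sync is active on every node
timedatectl status
systemctl restart ntp                # or ntpd / chrony, depending on what the nodes use

If an OSD refuses to start because its underlying disk has failed, that is a different recovery path; the OSD log will tell you which case you're in, so post those lines if you get stuck.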
 
