CEPH problem after upgrade to 5.1 / slow requests + stuck requests

black4

Hi,

I've got a problem with my CEPH cluster.

Cluster specification:
4x node
4x mon
4x mgr
37x osd

I started from CEPH Hammer, so I followed these tutorials:
https://pve.proxmox.com/wiki/Ceph_Hammer_to_Jewel - without any problems
https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous - without any problems

And finally:
https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0

After the upgrade I saw that one OSD had not started, and there were a lot of slow requests + stuck requests; some pgs were also shown as inactive. That OSD was down and out, so it shouldn't have had any impact, but to be sure I destroyed it anyway to eliminate a potential issue, and all remaining OSDs are up and running. Unfortunately that didn't help.
I thought I had found the issue: after the upgrade to Luminous on PVE 4.4 the ceph packages were at version 12.2.2, so when I upgraded to 5.1 the ceph packages got installed from the Debian repository instead of the Proxmox one. To fix it I changed the repository branch from main to test, ran dist-upgrade and restarted the binaries, but it didn't help.
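For anyone following along, the fix boils down to pointing the Ceph apt source at the Proxmox mirror and restarting the daemons in order. A rough sketch (the exact repository line for your release may differ, so treat it as an example):
Code:
# /etc/apt/sources.list.d/ceph.list - Proxmox Ceph Luminous repo, test branch
deb http://download.proxmox.com/debian/ceph-luminous stretch test

apt update && apt dist-upgrade
# restart the daemons on each node: mons first, then mgrs, then OSDs
systemctl restart ceph-mon.target
systemctl restart ceph-mgr.target
systemctl restart ceph-osd.target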
Dmesg isn't showing anything suspicious. The disks are working fine, and so is the network.

Kernel:
Code:
Linux biurowiecD 4.13.8-3-pve #1 SMP PVE 4.13.8-30 (Tue, 5 Dec 2017 13:06:48 +0100) x86_64 GNU/Linux

PVE:
Code:
pve-manager/5.1-38/1e9bc777 (running kernel: 4.13.8-3-pve)

Packages:
Code:
ii  ceph                                 12.2.2-pve1                    amd64        distributed storage and file system
ii  ceph-base                            12.2.2-pve1                    amd64        common ceph daemon libraries and management tools
ii  ceph-common                          12.2.2-pve1                    amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse                            12.2.2-pve1                    amd64        FUSE-based client for the Ceph distributed file system
ii  ceph-mds                             12.2.2-pve1                    amd64        metadata server for the ceph distributed file system
ii  ceph-mgr                             12.2.2-pve1                    amd64        manager for the ceph distributed storage system
ii  ceph-mon                             12.2.2-pve1                    amd64        monitor server for the ceph storage system
ii  ceph-osd                             12.2.2-pve1                    amd64        OSD server for the ceph storage system
ri  libcephfs1                           10.2.10-1~bpo80+1              amd64        Ceph distributed file system client library
ii  libcephfs2                           12.2.2-pve1                    amd64        Ceph distributed file system client library
ii  python-ceph                          12.2.2-pve1                    amd64        Meta-package for python libraries for the Ceph libraries
ii  python-cephfs                        12.2.2-pve1                    amd64        Python 2 libraries for the Ceph libcephfs library

CEPH versions:
Code:
{
    "mon": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 4
    },
    "mgr": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 4
    },
    "osd": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 37
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 45
    }
}

CEPH status:
Code:
root@biurowiecD:~# ceph -s
  cluster:
    id:     dcd25aa1-1618-45a9-b902-c081d3fa3479
    health: HEALTH_ERR
            noout flag(s) set
            81114/15283887 objects misplaced (0.531%)
            Reduced data availability: 142 pgs inactive
            Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized
            42 slow requests are blocked > 32 sec
            664 stuck requests are blocked > 4096 sec
            too many PGs per OSD (332 > max 200)

  services:
    mon: 4 daemons, quorum 1,0,2,3
    mgr: biurowiecH(active), standbys: biurowiecE, biurowiecD, biurowiecG
    osd: 37 osds: 37 up, 37 in; 209 remapped pgs
         flags noout

  data:
    pools:   4 pools, 4096 pgs
    objects: 4975k objects, 19753 GB
    usage:   59023 GB used, 47097 GB / 103 TB avail
    pgs:     3.467% pgs not active
             81114/15283887 objects misplaced (0.531%)
             3887 active+clean
             66   activating+remapped
             58   activating+undersized+degraded+remapped
             49   active+remapped+backfill_wait
             18   activating+degraded+remapped
             18   active+remapped+backfilling

  io:
    recovery: 87335 kB/s, 21 objects/s

CEPH health detail:
Code:
HEALTH_ERR noout flag(s) set; 80344/15283887 objects misplaced (0.526%); Reduced data availability: 142 pgs inactive; Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized; 36 slow requests are blocked > 32 sec; 670 stuck requests are blocked > 4096 sec; too many PGs
per OSD (332 > max 200)
OSDMAP_FLAGS noout flag(s) set
OBJECT_MISPLACED 80344/15283887 objects misplaced (0.526%)
PG_AVAILABILITY Reduced data availability: 142 pgs inactive
    pg 2.2f is stuck inactive for 5423.931309, current state activating+remapped, last acting [9,12,32]
    pg 2.43 is stuck inactive for 5017.370391, current state activating+undersized+degraded+remapped, last acting [34,0]
    pg 2.f0 is stuck inactive for 5022.377163, current state activating+undersized+degraded+remapped, last acting [7,33]
    pg 2.f3 is stuck inactive for 5175.904664, current state activating+degraded+remapped, last acting [15,2,26]
    pg 2.fd is stuck inactive for 5018.363393, current state activating+undersized+degraded+remapped, last acting [5,36]
    pg 2.11c is stuck inactive for 5008.767943, current state activating+degraded+remapped, last acting [8,26,14]
    pg 2.12a is stuck inactive for 5016.824719, current state activating+degraded+remapped, last acting [25,9,14]
    pg 2.12b is stuck inactive for 4992.665969, current state activating+remapped, last acting [6,29,37]
    pg 2.18b is stuck inactive for 5016.828135, current state activating+undersized+degraded+remapped, last acting [6,27]
    pg 2.197 is stuck inactive for 5412.752571, current state activating+undersized+degraded+remapped, last acting [7,32]
    pg 2.337 is stuck inactive for 5175.885682, current state activating+remapped, last acting [11,2,35]
    pg 2.33b is stuck inactive for 5008.773479, current state activating+remapped, last acting [6,33,13]
    pg 2.37e is stuck inactive for 5016.830410, current state activating+degraded+remapped, last acting [3,29,13]
    pg 2.38b is stuck inactive for 5022.367215, current state activating+undersized+degraded+remapped, last acting [28,6]
    pg 2.3ad is stuck inactive for 5017.352524, current state activating+undersized+degraded+remapped, last acting [32,0]
    pg 2.3af is stuck inactive for 4992.667587, current state activating+degraded+remapped, last acting [6,28,19]
    pg 2.3db is stuck inactive for 5016.821716, current state activating+degraded+remapped, last acting [1,32,15]
    pg 2.3f7 is stuck inactive for 5008.760318, current state activating+undersized+degraded+remapped, last acting [3,25]
    pg 2.3fc is stuck inactive for 5016.814684, current state activating+degraded+remapped, last acting [2,28,14]
    pg 3.f6 is stuck inactive for 5022.391672, current state activating+undersized+degraded+remapped, last acting [21,3]
    pg 3.104 is stuck inactive for 5436.621520, current state activating+remapped, last acting [30,11,2]
    pg 3.121 is stuck inactive for 5016.838087, current state activating+degraded+remapped, last acting [6,27,15]
    pg 3.184 is stuck inactive for 5016.730824, current state activating+remapped, last acting [14,34,7]
    pg 3.38d is stuck inactive for 5022.381109, current state activating+undersized+degraded+remapped, last acting [5,33]
    pg 3.3c4 is stuck inactive for 5016.773763, current state activating+remapped, last acting [4,24,5]
    pg 4.12f is stuck inactive for 5423.875927, current state activating+remapped, last acting [5,27,13]
    pg 4.138 is stuck inactive for 5458.774607, current state activating+remapped, last acting [6,26,19]
    pg 4.360 is stuck inactive for 5022.389524, current state activating+undersized+degraded+remapped, last acting [7,23]
    pg 4.36a is stuck inactive for 5417.817899, current state activating+undersized+degraded+remapped, last acting [6,33]
    pg 4.3a8 is stuck inactive for 5022.390175, current state activating+undersized+degraded+remapped, last acting [6,29]
    pg 4.3bd is stuck inactive for 5016.716020, current state activating+remapped, last acting [21,0,37]
    pg 4.3d9 is stuck inactive for 5022.391832, current state activating+undersized+degraded+remapped, last acting [3,24]
    pg 4.3db is stuck inactive for 33841.137520, current state activating+undersized+degraded+remapped, last acting [21,6]
    pg 4.3de is stuck inactive for 5018.350910, current state activating+undersized+degraded+remapped, last acting [30,25]
    pg 4.3fb is stuck inactive for 5018.362756, current state activating+undersized+degraded+remapped, last acting [2,31]
    pg 5.a is stuck inactive for 18082.614488, current state activating+remapped, last acting [14,3,31]
    pg 5.43 is stuck inactive for 5022.381059, current state activating+undersized+degraded+remapped, last acting [0,28]
    pg 5.4e is stuck inactive for 5017.337833, current state activating+undersized+degraded+remapped, last acting [8,28]
    pg 5.e6 is stuck inactive for 4992.641555, current state activating+remapped, last acting [34,27,3]
    pg 5.ec is stuck inactive for 5018.345309, current state activating+degraded+remapped, last acting [37,29,18]
    pg 5.f6 is stuck inactive for 5443.685769, current state activating+remapped, last acting [1,30,14]
    pg 5.128 is stuck inactive for 4992.664069, current state activating+remapped, last acting [5,21,12]
    pg 5.12f is stuck inactive for 4992.648318, current state activating+remapped, last acting [25,7,8]
    pg 5.14f is stuck inactive for 5008.752298, current state activating+degraded+remapped, last acting [21,9,15]
    pg 5.154 is stuck inactive for 5018.340345, current state activating+undersized+degraded+remapped, last acting [29,0]
    pg 5.171 is stuck inactive for 5022.383306, current state activating+undersized+degraded+remapped, last acting [5,20]
    pg 5.370 is stuck inactive for 5016.801039, current state activating+remapped, last acting [5,34,24]
    pg 5.37a is stuck inactive for 5022.395633, current state activating+undersized+degraded+remapped, last acting [21,1]
    pg 5.3e3 is stuck inactive for 4992.670207, current state activating+undersized+degraded+remapped, last acting [2,36]
    pg 5.3ee is stuck inactive for 5016.796162, current state activating+remapped, last acting [33,24,7]
    pg 5.3f1 is stuck inactive for 5016.785052, current state activating+undersized+degraded+remapped, last acting [25,2]
PG_DEGRADED Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized
    pg 2.16c is stuck unclean for 14855.765771, current state active+remapped+backfill_wait, last acting [2,30,5]
    pg 2.18b is stuck undersized for 5014.830100, current state activating+undersized+degraded+remapped, last acting [6,27]
    pg 2.197 is stuck undersized for 5175.677858, current state activating+undersized+degraded+remapped, last acting [7,32]
    pg 2.337 is stuck unclean for 5456.790524, current state activating+remapped, last acting [11,2,35]
    pg 2.33b is stuck unclean for 28192.927700, current state activating+remapped, last acting [6,33,13]
    pg 2.343 is stuck unclean for 14932.780074, current state active+remapped+backfill_wait, last acting [2,29,0]
    pg 2.350 is stuck unclean for 32138.816506, current state active+remapped+backfill_wait, last acting [8,30,15]
    pg 2.36c is stuck unclean for 5175.588667, current state active+remapped+backfill_wait, last acting [26,3,25]
    pg 2.37e is activating+degraded+remapped, acting [3,29,13]
    pg 2.38b is stuck undersized for 5020.847092, current state activating+undersized+degraded+remapped, last acting [28,6]
    pg 2.3ad is stuck undersized for 5015.822259, current state activating+undersized+degraded+remapped, last acting [32,0]
    pg 2.3af is activating+degraded+remapped, acting [6,28,19]
    pg 2.3db is activating+degraded+remapped, acting [1,32,15]
    pg 2.3e5 is stuck unclean for 5018.360316, current state active+remapped+backfill_wait, last acting [5,25,34]
    pg 2.3f7 is stuck undersized for 5006.781721, current state activating+undersized+degraded+remapped, last acting [3,25]
    pg 2.3fc is activating+degraded+remapped, acting [2,28,14]
    pg 3.149 is stuck unclean for 33750.424602, current state active+remapped+backfilling, last acting [0,34,19]
    pg 3.14d is stuck unclean for 33747.112725, current state active+remapped+backfill_wait, last acting [5,34,8]
    pg 3.170 is stuck unclean for 41911.933806, current state active+remapped+backfill_wait, last acting [2,30,13]
    pg 3.184 is stuck unclean for 50863.584785, current state activating+remapped, last acting [14,34,7]
    pg 3.347 is stuck unclean for 5014.829012, current state active+remapped+backfill_wait, last acting [25,1,37]
    pg 3.379 is stuck unclean for 5173.681939, current state active+remapped+backfilling, last acting [8,35,33]
    pg 3.38d is stuck undersized for 5020.844204, current state activating+undersized+degraded+remapped, last acting [5,33]
    pg 3.3b5 is stuck unclean for 35111.545294, current state active+remapped+backfill_wait, last acting [37,21,36]
    pg 3.3c4 is stuck unclean for 33915.855454, current state activating+remapped, last acting [4,24,5]
    pg 3.3f6 is stuck unclean for 5020.847138, current state active+remapped+backfill_wait, last acting [5,30,3]
    pg 4.138 is stuck unclean for 79356.915196, current state activating+remapped, last acting [6,26,19]
    pg 4.17a is stuck unclean for 70653.281022, current state active+remapped+backfill_wait, last acting [4,29,15]
    pg 4.360 is stuck undersized for 5020.844126, current state activating+undersized+degraded+remapped, last acting [7,23]
    pg 4.36a is stuck undersized for 5175.682594, current state activating+undersized+degraded+remapped, last acting [6,33]
    pg 4.370 is stuck unclean for 5017.348570, current state active+remapped+backfilling, last acting [3,25,37]
    pg 4.393 is stuck unclean for 75394.078516, current state active+remapped+backfill_wait, last acting [3,36,13]
    pg 4.3a8 is stuck undersized for 5020.837838, current state activating+undersized+degraded+remapped, last acting [6,29]
    pg 4.3bd is stuck unclean for 5017.331324, current state activating+remapped, last acting [21,0,37]
    pg 4.3d9 is stuck undersized for 5020.844302, current state activating+undersized+degraded+remapped, last acting [3,24]
    pg 4.3db is stuck undersized for 5021.817987, current state activating+undersized+degraded+remapped, last acting [21,6]
    pg 4.3dd is stuck unclean for 69841.408876, current state active+remapped+backfill_wait, last acting [8,37,15]
    pg 4.3de is stuck undersized for 5016.787614, current state activating+undersized+degraded+remapped, last acting [30,25]
    pg 4.3e8 is stuck unclean for 5017.370736, current state active+remapped+backfilling, last acting [0,35,24]
    pg 4.3fb is stuck undersized for 5016.819053, current state activating+undersized+degraded+remapped, last acting [2,31]
    pg 5.14f is activating+degraded+remapped, acting [21,9,15]
    pg 5.154 is stuck undersized for 5016.758706, current state activating+undersized+degraded+remapped, last acting [29,0]
    pg 5.171 is stuck undersized for 5020.845419, current state activating+undersized+degraded+remapped, last acting [5,20]
    pg 5.343 is stuck unclean for 5018.363926, current state active+remapped+backfill_wait, last acting [2,33,37]
    pg 5.346 is stuck unclean for 74802.413625, current state active+remapped+backfill_wait, last acting [2,31,8]
    pg 5.370 is stuck unclean for 5017.349368, current state activating+remapped, last acting [5,34,24]
    pg 5.37a is stuck undersized for 5020.844900, current state activating+undersized+degraded+remapped, last acting [21,1]
    pg 5.3e3 is stuck undersized for 4990.832211, current state activating+undersized+degraded+remapped, last acting [2,36]
    pg 5.3e8 is stuck unclean for 66318.787639, current state active+remapped+backfill_wait, last acting [5,22,17]
    pg 5.3ee is stuck unclean for 5017.336586, current state activating+remapped, last acting [33,24,7]
    pg 5.3f1 is stuck undersized for 5014.832273, current state activating+undersized+degraded+remapped, last acting [25,2]
REQUEST_SLOW 36 slow requests are blocked > 32 sec
    36 ops are blocked > 2097.15 sec
REQUEST_STUCK 670 stuck requests are blocked > 4096 sec
    243 ops are blocked > 8388.61 sec
    427 ops are blocked > 4194.3 sec
    osds 6,26 have stuck requests > 4194.3 sec
    osds 0,2,4,7,8,9,11,14,15,18,20,29,30,33,34,35,36 have stuck requests > 8388.61 sec
TOO_MANY_PGS too many PGs per OSD (332 > max 200)

I've checked, and they hang on peering. Some OSDs always report these stats when I run ceph pg <pgid> query (the exact commands are noted after the output below).
Code:
            "stats": {
                "version": "0'0",
                "reported_seq": "0",
                "reported_epoch": "0",
                "state": "unknown",
                "last_fresh": "0.000000",
                "last_change": "0.000000",
                "last_active": "0.000000",
                "last_peered": "0.000000",
                "last_clean": "0.000000",
                "last_became_active": "0.000000",
                "last_became_peered": "0.000000",
                "last_unstale": "0.000000",
                "last_undegraded": "0.000000",
                "last_fullsized": "0.000000",
                "mapping_epoch": 0,
                "log_start": "0'0",
                "ondisk_log_start": "0'0",
                "created": 0,
                "last_epoch_clean": 0,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "0'0",
                "last_scrub_stamp": "0.000000",
                "last_deep_scrub": "0'0",
                "last_deep_scrub_stamp": "0.000000",
                "last_clean_scrub_stamp": "0.000000",
                "log_size": 0,
                "ondisk_log_size": 0,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "stat_sum": {
                    "num_bytes": 0,
                    "num_objects": 0,
                    "num_object_clones": 0,
                    "num_object_copies": 0,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 0,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 0,
                    "num_whiteouts": 0,
                    "num_read": 0,
                    "num_read_kb": 0,
                    "num_write": 0,
                    "num_write_kb": 0,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 0,
                    "num_bytes_recovered": 0,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0
                },
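For reference, this is roughly how I pull those up (2.37e is just one of the stuck PGs from the health output above; any of them works):
Code:
# list all PGs stuck inactive, then query one of them
ceph pg dump_stuck inactive
ceph pg 2.37e query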

Does anyone have an idea? Maybe reinstall the nodes back to version 4.4?
 
I have 4 mons because after the upgrade I was planning to add another node; that's also why I have the "too many PGs per OSD" warning (the PG count was sized for the additional OSDs).
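For reference, the warning itself is plain arithmetic: 4096 PGs across the 4 pools at size 3 gives 4096 × 3 = 12288 PG copies, and 12288 / 37 OSDs ≈ 332 PGs per OSD, exactly what the health check reports (with a 38th OSD it drops to ≈ 323). Since Luminous cannot reduce pg_num on an existing pool, only adding more OSDs will bring that number down.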

ceph.conf:
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.6.6.0/24
         filestore xattr use omap = true
         fsid = dcd25aa1-1618-45a9-b902-c081d3fa3479
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 5120
         osd pool default min size = 1
         public network = 192.168.99.0/24
         mon allow pool delete = true

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring
         osd max backfills = 1
         osd recovery max active = 1

[mon.3]
         host = biurowiecH
         mon addr = 192.168.99.118:6789
         mon osd min down reporters = 6

[mon.2]
         host = biurowiecG
         mon addr = 192.168.99.117:6789
         mon osd min down reporters = 6

[mon.1]
         host = biurowiecD
         mon addr = 192.168.99.114:6789
         mon osd min down reporters = 7

[mon.0]
         host = biurowiecE
         mon addr = 192.168.99.115:6789
         mon osd min down reporters = 7
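To double-check that restarted daemons actually picked up these values, the admin socket can be queried on each node (osd.0 is just an example id):
Code:
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'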
 
Is your ceph network 1Gbps?
I've got a 10Gbps network (2 switches - one for the private cluster network, one for the public network).

You don't need to have more than 3 mons. If you need more storage space, you should add a node with OSDs only (without mon).
Our target was 5 mons, thanks for the tip.

It is no good for the recovery/backfill process.
With the 5th mon we are planning to add additional OSDs.

The cluster had been working with this configuration for a few months without problems, until today's upgrade.
 
The cluster is in a normal recovery state.
So where is the bottleneck? Try to find it with the atop utility.
Are the disks SSDs or spinners?
mon osd min down reporters = 7
Did you have a problem with flapping OSDs?
 
Are the disks SSDs or spinners?
The disks are spinners with a dedicated journal partition on an SSD.

Did you have a problem with flapping OSDs?
Yes, once, a long time ago. This setting helped then, so we left it in the configuration.

The cluster is in a normal recovery state.
Maybe I didn't describe my issue properly. The stuck + slow requests hang every VM using rados, and their number only keeps increasing; it looks like they will never finish.
When I query osd.3 (the one with slow requests) via ceph daemon, I get the output below. It is always stuck at "waiting for peered" for every slow/stuck request. Does this mean the request is waiting until the pgs become active+clean? Why is it not using the other replicas? Every pool has size 3. (The admin-socket dumps are noted after the output.)
Code:
"ops": [
        {
            "description": "osd_op(client.298842333.0:261 2.37e 2.8df15f7e (undecoded) ondisk+r
ead+known_if_redirected e62601)",
            "initiated_at": "2017-12-09 21:17:49.168413",
            "age": 795.232279,
            "duration": 795.232293,
            "type_data": {
                "flag_point": "delayed",
                "client_info": {
                    "client": "client.298842333",
                    "client_addr": "192.168.99.118:0/95629112",
                    "tid": 261
                },
                "events": [
                    {
                        "time": "2017-12-09 21:17:49.168413",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-12-09 21:17:49.168456",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2017-12-09 21:17:49.168500",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-12-09 21:17:49.168503",
                        "event": "waiting for peered"
                    }
                ]
            }
        },
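For completeness, output like the above comes from the OSD's admin socket; the dumps I mean are along these lines (osd.3 as the example):
Code:
# currently blocked / in-flight ops on this OSD
ceph daemon osd.3 dump_ops_in_flight
# recently completed ops with their event timelines
ceph daemon osd.3 dump_historic_ops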

I powered off all the VMs, restarted the OSDs and waited for the cluster to rebalance so that no new requests are generated, but I can't even list the pool's rbd images through pveproxy (it creates new slow requests and I get a timeout in the web UI).
The cluster status hasn't changed over the last few hours:
Code:
  cluster:
    id:     dcd25aa1-1618-45a9-b902-c081d3fa3479
    health: HEALTH_ERR
            Reduced data availability: 83 pgs inactive, 3 pgs stale
            Degraded data redundancy: 83 pgs unclean, 46 pgs degraded, 35 pgs undersized
            2 stuck requests are blocked > 4096 sec
            too many PGs per OSD (332 > max 200)

  services:
    mon: 4 daemons, quorum 1,0,2,3
    mgr: biurowiecH(active), standbys: biurowiecE, biurowiecD, biurowiecG
    osd: 37 osds: 37 up, 37 in; 80 remapped pgs

  data:
    pools:   4 pools, 4096 pgs
    objects: 4975k objects, 19753 GB
    usage:   58952 GB used, 47168 GB / 103 TB avail
    pgs:     2.026% pgs not active
             4010 active+clean
             36   activating+remapped
             33   activating+undersized+degraded+remapped
             11   activating+degraded+remapped
             2    active+clean+scrubbing+deep
             2    stale+activating+undersized+degraded+remapped
             1    stale+activating+remapped
             1    active+clean+scrubbing
I'm also wondering about the remapped pgs - their number is not decreasing. It looks to me like ceph is waiting for the destroyed OSD, which would mean it will never repair itself.

Maybe it's a problem with the crush map? I destroyed osd.16 (the disk failed, as I mentioned in the first post), but the dump still displays this (the map can be dumped and decompiled as sketched after the excerpt):
Code:
        {
            "id": 15,
            "name": "osd.15",
            "class": "hdd"
        },
        {
            "id": 16,
            "name": "device16"
        },
        {
            "id": 17,
            "name": "osd.17",
            "class": "hdd"
        },
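For anyone who wants to inspect the full map, it can be dumped and decompiled with the standard tools (file names are arbitrary):
Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt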

I have no idea what's wrong.
The worst thing is that I can't start any VM using ceph, because it hangs on accessing the data.
 
Yes, but it didn't help.

After a night the situation is the same:
Code:
root@biurowiecH:~# ceph -s
  cluster:
    id:     dcd25aa1-1618-45a9-b902-c081d3fa3479
    health: HEALTH_ERR
            Reduced data availability: 83 pgs inactive, 3 pgs stale
            Degraded data redundancy: 83 pgs unclean, 46 pgs degraded, 35 pgs undersized
            2 stuck requests are blocked > 4096 sec
            too many PGs per OSD (332 > max 200)

  services:
    mon: 4 daemons, quorum 1,0,2,3
    mgr: biurowiecH(active), standbys: biurowiecE, biurowiecD, biurowiecG
    osd: 37 osds: 37 up, 37 in; 80 remapped pgs

  data:
    pools:   4 pools, 4096 pgs
    objects: 4975k objects, 19753 GB
    usage:   58952 GB used, 47168 GB / 103 TB avail
    pgs:     2.026% pgs not active
             4013 active+clean
             36   activating+remapped
             33   activating+undersized+degraded+remapped
             11   activating+degraded+remapped
             2    stale+activating+undersized+degraded+remapped
             1    stale+activating+remapped
80 pgs are stuck because the CRUSH map is polluted by a deviceN entry. Yesterday I found a few posts on other forums describing my problem, but without any reply. It looks like a bug in ceph. I found a post on the Red Hat customer portal that is marked as resolved (https://access.redhat.com/solutions/3106081), but I don't have an account there :(
 
80 pgs are stuck because the CRUSH map is polluted by a deviceN entry.
This happens when you use the wrong command to remove an osd -
ceph osd rm osd.16
instead of
ceph osd rm 16
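For reference, the usual manual removal sequence looks roughly like this (16 used as the example id; the Proxmox GUI does more or less the same under the hood):
Code:
ceph osd out 16
systemctl stop ceph-osd@16        # on the node hosting that OSD
ceph osd crush remove osd.16
ceph auth del osd.16
ceph osd rm 16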

Have you tried:
- rebooting the node containing osd.16 (with the noout flag set)?
- marking osd.16 as lost?
 
This happens when you use the wrong command to remove an osd -
ceph osd rm osd.16
instead of
ceph osd rm 16
The OSD was down and out, so I destroyed it from Proxmox VE.

I've bought Red Hat support, and here is the info I got:
  • The Ceph CRUSH map doesn't like holes, so after an OSD is deleted, these deviceN entries get added to the devices section of the CRUSH map.
  • When a new OSD is added back to the cluster, it takes the position of the removed OSD and the deviceN entry gets replaced.
  • The deviceN entry in the CRUSH map is completely safe to ignore, as it doesn't impact the bucket structure of the CRUSH map. This can be verified with ceph osd tree (see below).
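A quick check for that (nothing cluster-specific about it):
Code:
ceph osd tree
The deviceN placeholder only shows up in the flat devices list of the decompiled map, not inside any host/root bucket.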

So it's not the issue. I've replaced the OSD and added it back to the cluster.

Ceph status:
Code:
  cluster:
    id:     dcd25aa1-1618-45a9-b902-c081d3fa3479
    health: HEALTH_WARN
            545783/15283887 objects misplaced (3.571%)
            Reduced data availability: 101 pgs inactive
            Degraded data redundancy: 11280/15283887 objects degraded (0.074%), 514 pgs unclean, 10 pgs degraded, 9 pgs undersized
            2 slow requests are blocked > 32 sec
            too many PGs per OSD (323 > max 200)

  services:
    mon: 4 daemons, quorum 1,0,2,3
    mgr: biurowiecE(active), standbys: biurowiecG, biurowiecH, biurowiecD
    osd: 38 osds: 38 up, 38 in; 514 remapped pgs

  data:
    pools:   4 pools, 4096 pgs
    objects: 4975k objects, 19753 GB
    usage:   59094 GB used, 50749 GB / 107 TB avail
    pgs:     2.490% pgs not active
             11280/15283887 objects degraded (0.074%)
             545783/15283887 objects misplaced (3.571%)
             3581 active+clean
             374  active+remapped+backfill_wait
             101  activating+remapped
             29   active+remapped+backfilling
             6    active+undersized+degraded+remapped+backfill_wait
             3    active+undersized+degraded+remapped+backfilling
             1    active+recovery_wait+degraded+remapped
             1    peering

  io:
    recovery: 230 MB/s, 57 objects/s

Ceph health detail:
Code:
HEALTH_WARN 544198/15283887 objects misplaced (3.561%); Reduced data availability: 101 pgs inactive; Degraded data redundancy: 11137/15283887 objects degraded (0.073%), 513 pgs unclean, 10 pgs degraded, 9 pgs undersized; 2 slow requests are blocked > 32 sec; too many PGs per OSD (323 > max 200)
OBJECT_MISPLACED 544198/15283887 objects misplaced (3.561%)
PG_AVAILABILITY Reduced data availability: 101 pgs inactive
    pg 2.11 is stuck inactive for 743.019561, current state activating+remapped, last acting [0,31,20]
    pg 2.16 is stuck inactive for 743.025317, current state activating+remapped, last acting [15,7,32]
    pg 2.20 is stuck inactive for 743.013151, current state activating+remapped, last acting [36,6,23]
    pg 2.5c is stuck inactive for 743.032515, current state activating+remapped, last acting [14,34,24]
    pg 2.81 is stuck inactive for 6721.339708, current state activating+remapped, last acting [7,32,14]
    pg 2.d2 is stuck inactive for 743.026202, current state activating+remapped, last acting [5,23,33]
    pg 2.ea is stuck inactive for 743.028798, current state activating+remapped, last acting [14,4,22]
    pg 2.14b is stuck inactive for 743.014407, current state activating+remapped, last acting [21,36,3]
    pg 2.195 is stuck inactive for 742.992949, current state activating+remapped, last acting [34,5,27]
    pg 2.33b is stuck inactive for 743.029396, current state activating+remapped, last acting [6,33,12]
    pg 2.350 is stuck inactive for 743.026363, current state activating+remapped, last acting [8,30,13]
    pg 2.375 is stuck inactive for 743.023269, current state activating+remapped, last acting [2,28,34]
    pg 3.16 is stuck inactive for 743.036589, current state activating+remapped, last acting [2,13,32]
    pg 3.50 is stuck inactive for 64578.566562, current state activating+remapped, last acting [8,11,35]
    pg 3.136 is stuck inactive for 741.527275, current state activating+remapped, last acting [12,7,31]
    pg 3.169 is stuck inactive for 743.025289, current state activating+remapped, last acting [2,14,32]
    pg 3.192 is stuck inactive for 743.000100, current state activating+remapped, last acting [20,8,32]
    pg 3.349 is stuck inactive for 742.008179, current state activating+remapped, last acting [26,8,17]
    pg 3.368 is stuck inactive for 743.025858, current state activating+remapped, last acting [8,37,22]
    pg 3.379 is stuck inactive for 743.019745, current state activating+remapped, last acting [8,19,35]
    pg 3.3b9 is stuck inactive for 743.000190, current state activating+remapped, last acting [33,6,11]
    pg 3.3c4 is stuck inactive for 74339.582515, current state activating+remapped, last acting [5,24,4]
    pg 3.3d1 is stuck inactive for 743.029304, current state activating+remapped, last acting [31,29,7]
    pg 4.b is stuck inactive for 743.018079, current state activating+remapped, last acting [8,37,24]
    pg 4.48 is stuck inactive for 743.030841, current state activating+remapped, last acting [4,27,35]
    pg 4.107 is stuck inactive for 743.008786, current state activating+remapped, last acting [9,25,37]
    pg 4.144 is stuck inactive for 743.008622, current state activating+remapped, last acting [18,37,23]
    pg 4.149 is stuck inactive for 742.983703, current state activating+remapped, last acting [26,3,13]
    pg 4.182 is stuck inactive for 742.952315, current state activating+remapped, last acting [29,7,37]
    pg 4.186 is stuck inactive for 738.272744, current state activating+remapped, last acting [3,28,30]
    pg 4.365 is stuck inactive for 743.007380, current state activating+remapped, last acting [37,23,0]
    pg 4.373 is stuck inactive for 743.026568, current state activating+remapped, last acting [5,36,23]
    pg 4.37d is stuck inactive for 743.002822, current state activating+remapped, last acting [28,31,5]
    pg 4.3ac is stuck inactive for 743.021301, current state activating+remapped, last acting [8,35,27]
    pg 4.3af is stuck inactive for 743.042383, current state activating+remapped, last acting [17,23,3]
    pg 4.3df is stuck inactive for 743.029207, current state activating+remapped, last acting [4,21,11]
    pg 4.3e2 is stuck inactive for 743.004657, current state activating+remapped, last acting [36,28,6]
    pg 4.3e4 is stuck inactive for 743.021351, current state activating+remapped, last acting [9,17,21]
    pg 5.4e is stuck inactive for 743.015548, current state activating+remapped, last acting [8,28,14]
    pg 5.64 is stuck inactive for 742.960098, current state activating+remapped, last acting [21,1,15]
    pg 5.6d is stuck inactive for 743.031729, current state activating+remapped, last acting [1,23,30]
    pg 5.101 is stuck inactive for 743.039528, current state activating+remapped, last acting [5,25,30]
    pg 5.10c is stuck inactive for 743.028655, current state activating+remapped, last acting [4,37,27]
    pg 5.125 is stuck inactive for 743.041643, current state activating+remapped, last acting [5,11,20]
    pg 5.128 is stuck inactive for 743.037429, current state activating+remapped, last acting [5,21,10]
    pg 5.129 is stuck inactive for 743.009173, current state activating+remapped, last acting [4,17,20]
    pg 5.12f is stuck inactive for 742.979467, current state activating+remapped, last acting [25,7,10]
    pg 5.13e is stuck inactive for 742.022666, current state activating+remapped, last acting [26,5,17]
    pg 5.15f is stuck inactive for 742.026604, current state activating+remapped, last acting [2,13,26]
    pg 5.373 is stuck inactive for 742.979799, current state activating+remapped, last acting [28,8,35]
    pg 5.3c8 is stuck inactive for 742.977969, current state activating+remapped, last acting [29,30,6]
PG_DEGRADED Degraded data redundancy: 11137/15283887 objects degraded (0.073%), 513 pgs unclean, 10 pgs degraded, 9 pgs undersized
    pg 2.3ad is stuck unclean for 744.057530, current state active+remapped+backfill_wait, last acting [32,0,14]
    pg 2.3af is stuck unclean for 746.101394, current state active+remapped+backfill_wait, last acting [6,28,10]
    pg 2.3b9 is stuck unclean for 746.016895, current state active+remapped+backfill_wait, last acting [37,27,17]
    pg 2.3be is stuck unclean for 64573.584677, current state active+remapped+backfill_wait, last acting [2,21,11]
    pg 2.3c5 is stuck unclean for 97183.257435, current state active+remapped+backfill_wait, last acting [30,1,18]
    pg 2.3c8 is stuck unclean for 741.032041, current state active+remapped+backfill_wait, last acting [2,35,18]
    pg 2.3dd is stuck unclean for 743.034276, current state active+remapped+backfill_wait, last acting [12,2,36]
    pg 2.3e5 is stuck unclean for 744.043365, current state active+remapped+backfill_wait, last acting [5,25,14]
    pg 2.3f6 is stuck unclean for 99488.956729, current state active+remapped+backfilling, last acting [4,18,25]
    pg 2.3f7 is stuck unclean for 746.118837, current state active+remapped+backfill_wait, last acting [3,25,12]
    pg 2.3f9 is stuck unclean for 746.014037, current state active+remapped+backfill_wait, last acting [35,21,2]
    pg 2.3fa is stuck unclean for 103184.658537, current state active+remapped+backfill_wait, last acting [5,37,10]
    pg 2.3fc is active+recovery_wait+degraded+remapped, acting [2,28,14]
    pg 3.3a4 is stuck unclean for 100292.379049, current state active+remapped+backfill_wait, last acting [4,29,11]
    pg 3.3b9 is stuck unclean for 746.015824, current state activating+remapped, last acting [33,6,11]
    pg 3.3bb is stuck unclean for 113952.940946, current state active+remapped+backfill_wait, last acting [21,4,15]
    pg 3.3c4 is stuck unclean for 114118.375053, current state activating+remapped, last acting [5,24,4]
    pg 3.3d1 is stuck unclean for 744.052660, current state activating+remapped, last acting [31,29,7]
    pg 3.3d8 is stuck unclean for 147991.427362, current state active+remapped+backfill_wait, last acting [7,34,20]
    pg 3.3da is stuck unclean for 100107.110897, current state active+remapped+backfill_wait, last acting [8,29,18]
    pg 3.3f6 is stuck unclean for 744.044854, current state active+remapped+backfilling, last acting [5,30,13]
    pg 4.3a8 is stuck undersized for 729.950527, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,29]
    pg 4.3a9 is stuck unclean for 150855.809832, current state active+remapped+backfill_wait, last acting [28,9,19]
    pg 4.3ac is stuck unclean for 743.185942, current state activating+remapped, last acting [8,35,27]
    pg 4.3af is stuck unclean for 745.921204, current state activating+remapped, last acting [17,23,3]
    pg 4.3b7 is stuck unclean for 102926.743355, current state active+remapped+backfill_wait, last acting [8,24,12]
    pg 4.3bd is stuck unclean for 739.492849, current state active+remapped+backfill_wait, last acting [21,0,37]
    pg 4.3c7 is stuck unclean for 745.918280, current state active+remapped+backfill_wait, last acting [24,34,15]
    pg 4.3d9 is stuck undersized for 729.939804, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,24]
    pg 4.3db is stuck unclean for 62197.527364, current state active+remapped+backfill_wait, last acting [21,19,1]
    pg 4.3dd is stuck unclean for 743.186121, current state active+remapped+backfill_wait, last acting [8,37,14]
    pg 4.3de is stuck unclean for 744.054056, current state active+remapped+backfill_wait, last acting [30,25,17]
    pg 4.3df is stuck unclean for 745.921136, current state activating+remapped, last acting [4,21,11]
    pg 4.3e2 is stuck unclean for 746.016794, current state activating+remapped, last acting [36,28,6]
    pg 4.3e4 is stuck unclean for 745.920046, current state activating+remapped, last acting [9,17,21]
    pg 4.3e8 is stuck unclean for 744.059866, current state active+remapped+backfill_wait, last acting [0,35,18]
    pg 4.3fb is stuck unclean for 744.045176, current state active+remapped+backfilling, last acting [2,31,17]
    pg 4.3fe is stuck unclean for 743.186015, current state active+remapped+backfill_wait, last acting [8,31,19]
    pg 5.3af is stuck unclean for 104587.976268, current state active+remapped+backfill_wait, last acting [7,21,12]
    pg 5.3b9 is stuck unclean for 100227.904673, current state active+remapped+backfill_wait, last acting [4,24,18]
    pg 5.3bb is stuck unclean for 64573.577927, current state active+remapped+backfill_wait, last acting [5,25,14]
    pg 5.3c0 is stuck unclean for 743.329226, current state active+remapped+backfill_wait, last acting [21,19,35]
    pg 5.3c8 is stuck unclean for 743.445505, current state activating+remapped, last acting [29,30,6]
    pg 5.3cb is stuck unclean for 744.000279, current state active+remapped+backfill_wait, last acting [1,23,11]
    pg 5.3d0 is stuck unclean for 745.920096, current state active+remapped+backfill_wait, last acting [9,34,15]
    pg 5.3e0 is stuck unclean for 744.044648, current state active+remapped+backfill_wait, last acting [2,23,14]
    pg 5.3e3 is stuck undersized for 728.758659, current state active+undersized+degraded+remapped+backfill_wait, last acting [2,36]
    pg 5.3e8 is stuck unclean for 744.044376, current state active+remapped+backfill_wait, last acting [5,22,15]
    pg 5.3ee is stuck unclean for 740.991086, current state active+remapped+backfill_wait, last acting [33,24,7]
    pg 5.3f1 is stuck unclean for 740.799190, current state active+remapped+backfill_wait, last acting [25,2,11]
    pg 5.3f5 is stuck unclean for 741.021931, current state active+remapped+backfill_wait, last acting [9,30,15]
REQUEST_SLOW 2 slow requests are blocked > 32 sec
    2 ops are blocked > 1048.58 sec
    osds 1,12 have blocked requests > 1048.58 sec
TOO_MANY_PGS too many PGs per OSD (323 > max 200)

I'm attaching /var/log/ceph/ceph-osd.1.log and an example ceph pg 3.349 query output.

I don't know how to get rid of the inactive pgs. They are not peering.
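For reference, the mapping and primary of a stuck PG can be checked like this (3.349 as the example; restarting its primary OSD is a common nudge, not a guaranteed fix):
Code:
ceph pg map 3.349
ceph pg 3.349 query | grep -A8 recovery_state
systemctl restart ceph-osd@26     # 26 = acting primary of 3.349 in the health detail; adjust per PG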
 

Attachments

  • ceph_pg_3.349_query.log (27.3 KB)
  • ceph-osd.1.log (706.1 KB)
io: recovery: 230 MB/s, 57 objects/s
Recovery is in progress. Wait until it completes.
I wonder what the primary cause of this failure is.
Maybe you didn't wait for HEALTH_OK between every step of the upgrade? Or did you upgrade with noout set? When did the status become HEALTH_ERR - after the reboot of the last node?
 
Recovery is in progress. Wait until it completes.
The cluster reaches 3994 active+clean and then gets stuck.

I wonder what the primary cause of this failure is.
The first slow requests appeared when I upgraded PVE on the last node. After the reboot I saw osd.16 dead and all VMs froze.

Maybe you didn't wait for HEALTH_OK between every step of the upgrade? Or did you upgrade with noout set? When did the status become HEALTH_ERR - after the reboot of the last node?
I upgraded from Hammer to Jewel about a month ago, and from Jewel to Luminous (on PVE 4.4) on Wednesday; it was working fine for a few days. The issues started after the upgrade from PVE 4.4 to 5.1. I must admit that when I was upgrading PVE to 5.1 the CEPH health status was WARN - only "too many PGs per OSD (323 > max 200)".
 
I had a similar problem; downgrading the kernel to 4.13.4-1-pve fixed it for me. I believe it was something related to the network stack.
 
I had a similar problem; downgrading the kernel to 4.13.4-1-pve fixed it for me. I believe it was something related to the network stack.
We've tried that, but it doesn't help. The number of inactive pgs has decreased, but some still remain.

Code:
             ...
             Reduced data availability: 16 pgs inactive
             ..
             4001 active+clean
             42   active+undersized+degraded+remapped+backfill_wait
             31   active+remapped+backfill_wait
             13   activating+remapped
             4    active+undersized+degraded+remapped+backfilling
             3    activating+undersized+degraded+remapped
             2    active+remapped+backfilling

When an object sits on an activating+remapped pg, it generates a slow request.
I think those pgs are corrupted. I've checked all the OSDs where these pgs are acting, but everything looks fine.
Maybe there is some way to reduce the pool size to 1 and rebuild the pgs again?
 
Mine was a fresh cluster and I was unable to create any pools. I noticed the Send-Q under netstat would get huge until I killed the OSD process.
 
I've got a problem with two OSDs - 25 and 29.
Code:
PG_STAT STATE               UP         UP_PRIMARY ACTING     ACTING_PRIMARY
2.2ac   activating+remapped  [7,14,29]          7  [7,14,28]              7
2.2a7   activating+remapped  [13,2,25]         13  [13,2,34]             13
3.136   activating+remapped  [12,7,25]         12  [12,7,31]             12
2.ea    activating+remapped  [14,4,29]         14  [14,4,23]             14
2.5c    activating+remapped [14,29,34]         14 [14,34,24]             14
5.15f   activating+remapped  [14,2,25]         14  [14,2,26]             14
4.248   activating+remapped  [17,3,29]         17  [17,3,23]             17
4.a1    activating+remapped  [1,10,29]          1  [1,10,17]              1
4.1b8   activating+remapped  [1,13,25]          1  [1,13,23]              1
4.20a   activating+remapped  [19,6,29]         19  [19,6,34]             19
3.298   activating+remapped  [7,18,25]          7  [7,18,36]              7
2.2a5   activating+remapped  [10,29,2]         10  [10,2,14]             10

Before there were more of them, but restarting OSDs 25 and 29 helped reduce them to this number.
Can I force CEPH to change the backfilling target for those pgs?
 
You can try playing with osd_max_backfills to backfill more at once, but I don't know a way to force backfill per pg. The more you do at once, the more it will slow down the cluster.
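For reference, that can be changed at runtime without touching ceph.conf (the numbers are only examples - revert them once backfill has caught up):
Code:
ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 2'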
 
You can try playing with osd_max_backfills to backfill more at once, but I don't know a way to force backfill per pg. The more you do at once, the more it will slow down the cluster.
osd_max_backfills only increases the recovery speed, and my cluster is already recovering as fast as it can. If I can get rid of those activating+remapped pgs, the cluster will be fine.

I found two OSDs that appear as the backfilling target in every inactive pg. For some reason they don't want to start the process, so I'm looking for a way to force CEPH to override the target (a possible approach is sketched below). Restarting the OSDs helped reduce some of them at the beginning, but it's not helping any more. I've tried pg repair and osd repair without success.
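One idea worth checking (a sketch only - it assumes all clients are Luminous-capable, otherwise the first command refuses, and whether it is appropriate depends on why the PGs are stuck): the Luminous upmap feature can pin a single PG's mapping explicitly, e.g. moving 2.2ac from osd.29 to osd.28, which is where the data currently sits according to the table above.
Code:
ceph osd set-require-min-compat-client luminous
# ceph osd pg-upmap-items <pgid> <from-osd> <to-osd> [...]
ceph osd pg-upmap-items 2.2ac 29 28
The mapping can be dropped again later with ceph osd rm-pg-upmap-items 2.2ac.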

Code:
  cluster:
    id:     dcd25aa1-1618-45a9-b902-c081d3fa3479
    health: HEALTH_WARN
            Reduced data availability: 12 pgs inactive
            Degraded data redundancy: 12 pgs unclean
            too many PGs per OSD (323 > max 200)

  services:
    mon: 4 daemons, quorum 1,0,2,3
    mgr: biurowiecG(active), standbys: biurowiecE, biurowiecH, biurowiecD
    osd: 38 osds: 38 up, 38 in; 12 remapped pgs

  data:
    pools:   4 pools, 4096 pgs
    objects: 4975k objects, 19753 GB
    usage:   58971 GB used, 50872 GB / 107 TB avail
    pgs:     0.293% pgs not active
             4083 active+clean
             12   activating+remapped
             1    active+clean+scrubbing+deep
 
Unfortunately I have the very same issue. It was more or less fine until backup time; then the backup failed and 2 rbd devices got stuck in iowait. In a quest to fix this, my cluster now has a lot of activating+remapped PGs. Basically every OSD in my cluster now has some PGs in this state.
 
