Hi,
I've got a problem with my Ceph cluster.
Cluster specification:
4x node
4x mon
4x mgr
37x osd
I started from Ceph Hammer, so I followed these upgrade guides:
https://pve.proxmox.com/wiki/Ceph_Hammer_to_Jewel - without any problems
https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous - without any problems
And finally:
https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0
After the upgrade I saw that one OSD had not started, and the cluster was reporting a lot of slow requests and stuck requests, as well as inactive PGs. That OSD was down and out, so it shouldn't have had any impact, but to be sure I also destroyed it to eliminate a potential issue and have all OSDs up and running. Unfortunately that didn't help.
I thought I had found the issue: after the upgrade to Luminous on PVE 4.4 the ceph packages were installed at version 12.2.2, so when I upgraded to 5.1 the ceph packages were installed from the Debian repository instead of the Proxmox one. To fix it I changed the repository branch from main to test, ran dist-upgrade and restarted the daemons, but that didn't help either.
dmesg shows nothing suspicious. The disks are working fine, and so is the network.
Kernel:
Code:
Linux biurowiecD 4.13.8-3-pve #1 SMP PVE 4.13.8-30 (Tue, 5 Dec 2017 13:06:48 +0100) x86_64 GNU/Linux
PVE:
Code:
pve-manager/5.1-38/1e9bc777 (running kernel: 4.13.8-3-pve)
Packages:
Code:
ii ceph 12.2.2-pve1 amd64 distributed storage and file system
ii ceph-base 12.2.2-pve1 amd64 common ceph daemon libraries and management tools
ii ceph-common 12.2.2-pve1 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 12.2.2-pve1 amd64 FUSE-based client for the Ceph distributed file system
ii ceph-mds 12.2.2-pve1 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 12.2.2-pve1 amd64 manager for the ceph distributed storage system
ii ceph-mon 12.2.2-pve1 amd64 monitor server for the ceph storage system
ii ceph-osd 12.2.2-pve1 amd64 OSD server for the ceph storage system
ri libcephfs1 10.2.10-1~bpo80+1 amd64 Ceph distributed file system client library
ii libcephfs2 12.2.2-pve1 amd64 Ceph distributed file system client library
ii python-ceph 12.2.2-pve1 amd64 Meta-package for python libraries for the Ceph libraries
ii python-cephfs 12.2.2-pve1 amd64 Python 2 libraries for the Ceph libcephfs library
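Since the suspicion above was about packages coming from the wrong repository, a quick way to double-check a listing like this is to flag every package whose version does not carry the Proxmox suffix. The following is only a minimal sketch: the function name `find_non_pve_packages` and the `-pve1` suffix check are assumptions made for illustration, and the sample lines are copied from the listing above.

```python
def find_non_pve_packages(dpkg_lines, expected_suffix="-pve1"):
    """Return (status, name, version) for packages not built with the expected suffix."""
    odd = []
    for line in dpkg_lines.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, name, version = fields[0], fields[1], fields[2]
        if not version.endswith(expected_suffix):
            odd.append((status, name, version))
    return odd


# Three lines taken from the package listing in this post.
sample = """\
ii ceph 12.2.2-pve1 amd64 distributed storage and file system
ri libcephfs1 10.2.10-1~bpo80+1 amd64 Ceph distributed file system client library
ii libcephfs2 12.2.2-pve1 amd64 Ceph distributed file system client library
"""

for status, name, version in find_non_pve_packages(sample):
    print(status, name, version)
```

On the listing above this flags only libcephfs1, a leftover Jewel-era library whose `ri` status means it is marked for removal but still installed. All other packages are at 12.2.2-pve1.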
CEPH versions:
Code:
{
"mon": {
"ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 4
},
"mgr": {
"ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 4
},
"osd": {
"ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 37
},
"mds": {},
"overall": {
"ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 45
}
}
CEPH status:
Code:
root@biurowiecD:~# ceph -s
cluster:
id: dcd25aa1-1618-45a9-b902-c081d3fa3479
health: HEALTH_ERR
noout flag(s) set
81114/15283887 objects misplaced (0.531%)
Reduced data availability: 142 pgs inactive
Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized
42 slow requests are blocked > 32 sec
664 stuck requests are blocked > 4096 sec
too many PGs per OSD (332 > max 200)
services:
mon: 4 daemons, quorum 1,0,2,3
mgr: biurowiecH(active), standbys: biurowiecE, biurowiecD, biurowiecG
osd: 37 osds: 37 up, 37 in; 209 remapped pgs
flags noout
data:
pools: 4 pools, 4096 pgs
objects: 4975k objects, 19753 GB
usage: 59023 GB used, 47097 GB / 103 TB avail
pgs: 3.467% pgs not active
81114/15283887 objects misplaced (0.531%)
3887 active+clean
66 activating+remapped
58 activating+undersized+degraded+remapped
49 active+remapped+backfill_wait
18 activating+degraded+remapped
18 active+remapped+backfilling
io:
recovery: 87335 kB/s, 21 objects/s
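As a side note on the "too many PGs per OSD (332 > max 200)" warning: the figure is consistent with a simple average of PG replicas over OSDs. The sketch below assumes a replica size of 3 on all four pools. That is an assumption, since the pool settings are not shown here, although the three-OSD acting sets in the health detail point the same way. The function name `pgs_per_osd` is made up for this example.

```python
def pgs_per_osd(total_pgs, replica_size, osd_count):
    """Average number of PG replicas each OSD has to host."""
    return total_pgs * replica_size / osd_count


# 4096 PGs and 37 OSDs come from the ceph -s output; size 3 is an assumption.
average = pgs_per_osd(total_pgs=4096, replica_size=3, osd_count=37)
print(int(average))  # 332
```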
CEPH health detail:
Code:
HEALTH_ERR noout flag(s) set; 80344/15283887 objects misplaced (0.526%); Reduced data availability: 142 pgs inactive; Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized; 36 slow requests are blocked > 32 sec; 670 stuck requests are blocked > 4096 sec; too many PGs per OSD (332 > max 200)
OSDMAP_FLAGS noout flag(s) set
OBJECT_MISPLACED 80344/15283887 objects misplaced (0.526%)
PG_AVAILABILITY Reduced data availability: 142 pgs inactive
pg 2.2f is stuck inactive for 5423.931309, current state activating+remapped, last acting [9,12,32]
pg 2.43 is stuck inactive for 5017.370391, current state activating+undersized+degraded+remapped, last acting [34,0]
pg 2.f0 is stuck inactive for 5022.377163, current state activating+undersized+degraded+remapped, last acting [7,33]
pg 2.f3 is stuck inactive for 5175.904664, current state activating+degraded+remapped, last acting [15,2,26]
pg 2.fd is stuck inactive for 5018.363393, current state activating+undersized+degraded+remapped, last acting [5,36]
pg 2.11c is stuck inactive for 5008.767943, current state activating+degraded+remapped, last acting [8,26,14]
pg 2.12a is stuck inactive for 5016.824719, current state activating+degraded+remapped, last acting [25,9,14]
pg 2.12b is stuck inactive for 4992.665969, current state activating+remapped, last acting [6,29,37]
pg 2.18b is stuck inactive for 5016.828135, current state activating+undersized+degraded+remapped, last acting [6,27]
pg 2.197 is stuck inactive for 5412.752571, current state activating+undersized+degraded+remapped, last acting [7,32]
pg 2.337 is stuck inactive for 5175.885682, current state activating+remapped, last acting [11,2,35]
pg 2.33b is stuck inactive for 5008.773479, current state activating+remapped, last acting [6,33,13]
pg 2.37e is stuck inactive for 5016.830410, current state activating+degraded+remapped, last acting [3,29,13]
pg 2.38b is stuck inactive for 5022.367215, current state activating+undersized+degraded+remapped, last acting [28,6]
pg 2.3ad is stuck inactive for 5017.352524, current state activating+undersized+degraded+remapped, last acting [32,0]
pg 2.3af is stuck inactive for 4992.667587, current state activating+degraded+remapped, last acting [6,28,19]
pg 2.3db is stuck inactive for 5016.821716, current state activating+degraded+remapped, last acting [1,32,15]
pg 2.3f7 is stuck inactive for 5008.760318, current state activating+undersized+degraded+remapped, last acting [3,25]
pg 2.3fc is stuck inactive for 5016.814684, current state activating+degraded+remapped, last acting [2,28,14]
pg 3.f6 is stuck inactive for 5022.391672, current state activating+undersized+degraded+remapped, last acting [21,3]
pg 3.104 is stuck inactive for 5436.621520, current state activating+remapped, last acting [30,11,2]
pg 3.121 is stuck inactive for 5016.838087, current state activating+degraded+remapped, last acting [6,27,15]
pg 3.184 is stuck inactive for 5016.730824, current state activating+remapped, last acting [14,34,7]
pg 3.38d is stuck inactive for 5022.381109, current state activating+undersized+degraded+remapped, last acting [5,33]
pg 3.3c4 is stuck inactive for 5016.773763, current state activating+remapped, last acting [4,24,5]
pg 4.12f is stuck inactive for 5423.875927, current state activating+remapped, last acting [5,27,13]
pg 4.138 is stuck inactive for 5458.774607, current state activating+remapped, last acting [6,26,19]
pg 4.360 is stuck inactive for 5022.389524, current state activating+undersized+degraded+remapped, last acting [7,23]
pg 4.36a is stuck inactive for 5417.817899, current state activating+undersized+degraded+remapped, last acting [6,33]
pg 4.3a8 is stuck inactive for 5022.390175, current state activating+undersized+degraded+remapped, last acting [6,29]
pg 4.3bd is stuck inactive for 5016.716020, current state activating+remapped, last acting [21,0,37]
pg 4.3d9 is stuck inactive for 5022.391832, current state activating+undersized+degraded+remapped, last acting [3,24]
pg 4.3db is stuck inactive for 33841.137520, current state activating+undersized+degraded+remapped, last acting [21,6]
pg 4.3de is stuck inactive for 5018.350910, current state activating+undersized+degraded+remapped, last acting [30,25]
pg 4.3fb is stuck inactive for 5018.362756, current state activating+undersized+degraded+remapped, last acting [2,31]
pg 5.a is stuck inactive for 18082.614488, current state activating+remapped, last acting [14,3,31]
pg 5.43 is stuck inactive for 5022.381059, current state activating+undersized+degraded+remapped, last acting [0,28]
pg 5.4e is stuck inactive for 5017.337833, current state activating+undersized+degraded+remapped, last acting [8,28]
pg 5.e6 is stuck inactive for 4992.641555, current state activating+remapped, last acting [34,27,3]
pg 5.ec is stuck inactive for 5018.345309, current state activating+degraded+remapped, last acting [37,29,18]
pg 5.f6 is stuck inactive for 5443.685769, current state activating+remapped, last acting [1,30,14]
pg 5.128 is stuck inactive for 4992.664069, current state activating+remapped, last acting [5,21,12]
pg 5.12f is stuck inactive for 4992.648318, current state activating+remapped, last acting [25,7,8]
pg 5.14f is stuck inactive for 5008.752298, current state activating+degraded+remapped, last acting [21,9,15]
pg 5.154 is stuck inactive for 5018.340345, current state activating+undersized+degraded+remapped, last acting [29,0]
pg 5.171 is stuck inactive for 5022.383306, current state activating+undersized+degraded+remapped, last acting [5,20]
pg 5.370 is stuck inactive for 5016.801039, current state activating+remapped, last acting [5,34,24]
pg 5.37a is stuck inactive for 5022.395633, current state activating+undersized+degraded+remapped, last acting [21,1]
pg 5.3e3 is stuck inactive for 4992.670207, current state activating+undersized+degraded+remapped, last acting [2,36]
pg 5.3ee is stuck inactive for 5016.796162, current state activating+remapped, last acting [33,24,7]
pg 5.3f1 is stuck inactive for 5016.785052, current state activating+undersized+degraded+remapped, last acting [25,2]
PG_DEGRADED Degraded data redundancy: 209 pgs unclean, 76 pgs degraded, 58 pgs undersized
pg 2.16c is stuck unclean for 14855.765771, current state active+remapped+backfill_wait, last acting [2,30,5]
pg 2.18b is stuck undersized for 5014.830100, current state activating+undersized+degraded+remapped, last acting [6,27]
pg 2.197 is stuck undersized for 5175.677858, current state activating+undersized+degraded+remapped, last acting [7,32]
pg 2.337 is stuck unclean for 5456.790524, current state activating+remapped, last acting [11,2,35]
pg 2.33b is stuck unclean for 28192.927700, current state activating+remapped, last acting [6,33,13]
pg 2.343 is stuck unclean for 14932.780074, current state active+remapped+backfill_wait, last acting [2,29,0]
pg 2.350 is stuck unclean for 32138.816506, current state active+remapped+backfill_wait, last acting [8,30,15]
pg 2.36c is stuck unclean for 5175.588667, current state active+remapped+backfill_wait, last acting [26,3,25]
pg 2.37e is activating+degraded+remapped, acting [3,29,13]
pg 2.38b is stuck undersized for 5020.847092, current state activating+undersized+degraded+remapped, last acting [28,6]
pg 2.3ad is stuck undersized for 5015.822259, current state activating+undersized+degraded+remapped, last acting [32,0]
pg 2.3af is activating+degraded+remapped, acting [6,28,19]
pg 2.3db is activating+degraded+remapped, acting [1,32,15]
pg 2.3e5 is stuck unclean for 5018.360316, current state active+remapped+backfill_wait, last acting [5,25,34]
pg 2.3f7 is stuck undersized for 5006.781721, current state activating+undersized+degraded+remapped, last acting [3,25]
pg 2.3fc is activating+degraded+remapped, acting [2,28,14]
pg 3.149 is stuck unclean for 33750.424602, current state active+remapped+backfilling, last acting [0,34,19]
pg 3.14d is stuck unclean for 33747.112725, current state active+remapped+backfill_wait, last acting [5,34,8]
pg 3.170 is stuck unclean for 41911.933806, current state active+remapped+backfill_wait, last acting [2,30,13]
pg 3.184 is stuck unclean for 50863.584785, current state activating+remapped, last acting [14,34,7]
pg 3.347 is stuck unclean for 5014.829012, current state active+remapped+backfill_wait, last acting [25,1,37]
pg 3.379 is stuck unclean for 5173.681939, current state active+remapped+backfilling, last acting [8,35,33]
pg 3.38d is stuck undersized for 5020.844204, current state activating+undersized+degraded+remapped, last acting [5,33]
pg 3.3b5 is stuck unclean for 35111.545294, current state active+remapped+backfill_wait, last acting [37,21,36]
pg 3.3c4 is stuck unclean for 33915.855454, current state activating+remapped, last acting [4,24,5]
pg 3.3f6 is stuck unclean for 5020.847138, current state active+remapped+backfill_wait, last acting [5,30,3]
pg 4.138 is stuck unclean for 79356.915196, current state activating+remapped, last acting [6,26,19]
pg 4.17a is stuck unclean for 70653.281022, current state active+remapped+backfill_wait, last acting [4,29,15]
pg 4.360 is stuck undersized for 5020.844126, current state activating+undersized+degraded+remapped, last acting [7,23]
pg 4.36a is stuck undersized for 5175.682594, current state activating+undersized+degraded+remapped, last acting [6,33]
pg 4.370 is stuck unclean for 5017.348570, current state active+remapped+backfilling, last acting [3,25,37]
pg 4.393 is stuck unclean for 75394.078516, current state active+remapped+backfill_wait, last acting [3,36,13]
pg 4.3a8 is stuck undersized for 5020.837838, current state activating+undersized+degraded+remapped, last acting [6,29]
pg 4.3bd is stuck unclean for 5017.331324, current state activating+remapped, last acting [21,0,37]
pg 4.3d9 is stuck undersized for 5020.844302, current state activating+undersized+degraded+remapped, last acting [3,24]
pg 4.3db is stuck undersized for 5021.817987, current state activating+undersized+degraded+remapped, last acting [21,6]
pg 4.3dd is stuck unclean for 69841.408876, current state active+remapped+backfill_wait, last acting [8,37,15]
pg 4.3de is stuck undersized for 5016.787614, current state activating+undersized+degraded+remapped, last acting [30,25]
pg 4.3e8 is stuck unclean for 5017.370736, current state active+remapped+backfilling, last acting [0,35,24]
pg 4.3fb is stuck undersized for 5016.819053, current state activating+undersized+degraded+remapped, last acting [2,31]
pg 5.14f is activating+degraded+remapped, acting [21,9,15]
pg 5.154 is stuck undersized for 5016.758706, current state activating+undersized+degraded+remapped, last acting [29,0]
pg 5.171 is stuck undersized for 5020.845419, current state activating+undersized+degraded+remapped, last acting [5,20]
pg 5.343 is stuck unclean for 5018.363926, current state active+remapped+backfill_wait, last acting [2,33,37]
pg 5.346 is stuck unclean for 74802.413625, current state active+remapped+backfill_wait, last acting [2,31,8]
pg 5.370 is stuck unclean for 5017.349368, current state activating+remapped, last acting [5,34,24]
pg 5.37a is stuck undersized for 5020.844900, current state activating+undersized+degraded+remapped, last acting [21,1]
pg 5.3e3 is stuck undersized for 4990.832211, current state activating+undersized+degraded+remapped, last acting [2,36]
pg 5.3e8 is stuck unclean for 66318.787639, current state active+remapped+backfill_wait, last acting [5,22,17]
pg 5.3ee is stuck unclean for 5017.336586, current state activating+remapped, last acting [33,24,7]
pg 5.3f1 is stuck undersized for 5014.832273, current state activating+undersized+degraded+remapped, last acting [25,2]
REQUEST_SLOW 36 slow requests are blocked > 32 sec
36 ops are blocked > 2097.15 sec
REQUEST_STUCK 670 stuck requests are blocked > 4096 sec
243 ops are blocked > 8388.61 sec
427 ops are blocked > 4194.3 sec
osds 6,26 have stuck requests > 4194.3 sec
osds 0,2,4,7,8,9,11,14,15,18,20,29,30,33,34,35,36 have stuck requests > 8388.61 sec
TOO_MANY_PGS too many PGs per OSD (332 > max 200)
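To see whether the inactive PGs cluster around particular OSDs, the "last acting" sets in the output above can be tallied. This is a minimal sketch, not an official tool: the regular expression and the function name `tally_acting_osds` are assumptions based only on the line format shown in this post, and the sample holds just the first three "stuck inactive" lines.

```python
import re
from collections import Counter

# Matches lines such as:
# pg 2.2f is stuck inactive for 5423.9, current state ..., last acting [9,12,32]
ACTING_RE = re.compile(r"pg (\S+) is stuck inactive .* last acting \[([\d,]+)\]")


def tally_acting_osds(health_detail_text):
    """Count how often each OSD id appears in the acting set of a stuck inactive PG."""
    counts = Counter()
    for line in health_detail_text.splitlines():
        match = ACTING_RE.search(line)
        if match:
            counts.update(int(osd) for osd in match.group(2).split(","))
    return counts


sample = """\
pg 2.2f is stuck inactive for 5423.931309, current state activating+remapped, last acting [9,12,32]
pg 2.43 is stuck inactive for 5017.370391, current state activating+undersized+degraded+remapped, last acting [34,0]
pg 2.f0 is stuck inactive for 5022.377163, current state activating+undersized+degraded+remapped, last acting [7,33]
"""

for osd, count in tally_acting_osds(sample).most_common():
    print(f"osd.{osd}: {count}")
```

Feeding it the complete `ceph health detail` output instead of the sample would show which OSDs take part in the most inactive PGs.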
I've checked, and they are hanging on peering. There are always some OSDs showing these stats when I run ceph pg <pgid> query.
Code:
"stats": {
"version": "0'0",
"reported_seq": "0",
"reported_epoch": "0",
"state": "unknown",
"last_fresh": "0.000000",
"last_change": "0.000000",
"last_active": "0.000000",
"last_peered": "0.000000",
"last_clean": "0.000000",
"last_became_active": "0.000000",
"last_became_peered": "0.000000",
"last_unstale": "0.000000",
"last_undegraded": "0.000000",
"last_fullsized": "0.000000",
"mapping_epoch": 0,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 0,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.000000",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.000000",
"last_clean_scrub_stamp": "0.000000",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
Does anyone have an idea? Should I maybe reinstall the node with version 4.4?