Hi,
I have noticed in the Ceph log (ceph -w) an increasing number of "slow requests are blocked" warnings when I create a CephFS, e.g.:
2019-10-14 16:41:32.083294 mon.ld5505 [INF] daemon mds.ld4465 assigned to filesystem cephfs as rank 0
2019-10-14 16:41:32.121895 mon.ld5505 [INF] daemon mds.ld4465 is now active in filesystem cephfs as rank 0
2019-10-14 16:41:33.090050 mon.ld5505 [WRN] Health check failed: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-14 16:41:39.208907 mon.ld5505 [WRN] Health check update: 13 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-14 16:41:45.242038 mon.ld5505 [WRN] Health check update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-14 16:41:50.892561 mon.ld5505 [WRN] Health check update: 16 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-14 16:41:56.978641 mon.ld5505 [WRN] Health check update: 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
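For reference, this is roughly how I have been inspecting the blocked requests so far; the OSD id below is only a placeholder taken from ceph health detail, and the admin-socket commands have to be run on the node hosting that OSD:
# identify which OSDs report blocked/slow requests
ceph health detail
# on the node hosting the affected OSD, e.g. osd.0 (placeholder id):
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops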
Then I noticed the warning "1 pools have pg_num > pgp_num" and corrected it (the commands I used are sketched after the log below). Immediately afterwards, PG_AVAILABILITY reported inactive PGs together with more "slow requests are blocked" warnings:
2019-10-15 07:00:25.679400 mon.ld5505 [WRN] overall HEALTH_WARN 1 pools have too many placement groups; 1 pools have pg_num > pgp_num; mons ld5505,ld5506,ld5508 are low on available space
2019-10-15 07:48:43.053437 mon.ld5505 [INF] Health check cleared: SMALLER_PGP_NUM (was: 1 pools have pg_num > pgp_num)
2019-10-15 07:48:45.062434 mon.ld5505 [WRN] Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
2019-10-15 07:48:52.233902 mon.ld5505 [WRN] Health check update: Reduced data availability: 3 pgs inactive, 2 pgs peering (PG_AVAILABILITY)
2019-10-15 07:49:00.020152 mon.ld5505 [WRN] Health check failed: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:06.104454 mon.ld5505 [WRN] Health check update: 11 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:12.240239 mon.ld5505 [WRN] Health check update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:17.241812 mon.ld5505 [WRN] Health check update: 16 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:22.243300 mon.ld5505 [WRN] Health check update: 18 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:27.244978 mon.ld5505 [WRN] Health check update: 16 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:30.318849 mon.ld5505 [WRN] Health check update: Reduced data availability: 2 pgs inactive, 1 pg peering (PG_AVAILABILITY)
2019-10-15 07:49:32.246457 mon.ld5505 [WRN] Health check update: 13 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:49:32.379147 mon.ld5505 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive, 1 pg peering)
2019-10-15 07:50:06.195773 mon.ld5505 [WRN] Health check update: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:50:12.233853 mon.ld5505 [WRN] Health check update: 8 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:50:17.266094 mon.ld5505 [WRN] Health check update: 7 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:50:22.267746 mon.ld5505 [WRN] Health check update: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:50:35.340271 mon.ld5505 [WRN] Health check update: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-10-15 07:51:40.276440 mon.ld5505 [INF] Health check cleared: REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec)
2019-10-15 08:00:00.000128 mon.ld5505 [WRN] overall HEALTH_WARN 1 pools have too many placement groups; mons ld5505,ld5506,ld5508 are low on available space
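For completeness, the pg_num > pgp_num warning was corrected roughly as follows; <pool> and the target value are placeholders, the value simply has to match the pool's pg_num:
# check the current values for the affected pool
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
# raise pgp_num to match pg_num (this triggers the remapping)
ceph osd pool set <pool> pgp_num <pg_num>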
I don't believe this is a coincidence: I have run into this situation many times in the past, and the drastically increasing number of slow requests caused the VMs whose images live in the same pool to become non-operable.
Here are the current pool stats:
root@ld3955:~# rados df
POOL_NAME       USED    OBJECTS  CLONES COPIES    MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD      WR_OPS    WR      USED COMPR UNDER COMPR
backup          0 B     0        0      0         0                  0       0        0        0 B     0         0 B     0 B        0 B
cephfs_data     0 B     0        0      0         0                  0       0        0        0 B     0         0 B     0 B        0 B
cephfs_metadata 2.2 KiB 22       0      66        0                  0       0        0        0 B     45        13 KiB  0 B        0 B
hdb_backup      195 TiB 51077703 0      153233109 0                  0       0        12576632 4.3 TiB 280999680 523 TiB 0 B        0 B
hdd             2.2 TiB 585051   0      1755153   0                  0       0        4424081  69 GiB  8219453   1.2 TiB 0 B        0 B
total_objects   51662776
total_used      594 TiB
total_avail     887 TiB
total_space     1.4 PiB
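In case it helps, I can also attach the PG and utilisation details; this is roughly what I would run for that (standard commands, nothing cluster-specific):
ceph osd pool ls detail
ceph osd df tree
ceph pg dump_stuck inactive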
Can you please share some insight into what is causing this behaviour?
THX