Hi,
We recently had a minor issue with Ceph's PG autoscaling feature and want to share it with everybody who might have a similar configuration.
We had set the autoscaler to "on" after upgrading to the Proxmox 6.0 release, but it did not cause any problems until the latest update, so this should not affect all users.
What we forgot/overlooked was to set a target ratio or target size for the Ceph pool.
After upgrading from Ceph 14.2.4 to Ceph 14.2.6 this led to a "HEALTH_WARN too few PGs per OSD (29 < min 30)" warning because, apparently, Ceph chose to scale the pool down to 20 PGs - which it reached later that day.
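For reference, the warning details and the autoscaler's planned PG count can be checked with the standard commands (nothing specific to our setup assumed here):
# ceph health detail
# ceph osd pool autoscale-status
In the autoscale-status output the PG_NUM and NEW PG_NUM columns should show what the autoscaler intends to do.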
To resolve the warning, we first tried to manually set pg_num to 256 with
# ceph osd pool set cephstor pg_num 256
# ceph osd pool set cephstor pgp_num 256
and after some hours of data movement the warning was gone. Until the next day, when the autoscaler had scaled the pool back down to 20 PGs instead of 256 and we got the warning again.
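While the cluster is shuffling PGs around, the progress and the current PG count can be followed with something like this (cephstor is our pool name, adjust to yours):
# watch ceph -s
# ceph osd pool get cephstor pg_num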
After some searching we found that the PG autoscaler wants a target size per pool, which can be specified either as an absolute expected size (target_size_bytes) or as a fraction of the total cluster capacity (target_size_ratio).
After we set it to 0.9 (90%) with
# ceph osd pool set MY_POOL_NAME target_size_ratio .9
the autoscaler chose 256 PGs and now everything is running fine again.
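To verify that the setting was picked up, the autoscale status can be checked again; reading the value back via pool get should also work (we have only tried this on Nautilus, so take the exact column/option names with a grain of salt):
# ceph osd pool autoscale-status
# ceph osd pool get MY_POOL_NAME target_size_ratio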
The good news is that there was never an outage or data corruption - everything was resilient as expected. Only the massive data movement between the nodes impacted performance a little bit.
I hope this helps other people.
Useful commands:
query pg autoscaler status
# ceph osd pool autoscale-status
set fixed expected pool size
# ceph osd pool set MY_POOL_NAME target_size_bytes 100T
or relative pool size (as a fraction of the total capacity)
# ceph osd pool set MY_POOL_NAME target_size_ratio .9
https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-pool-target-size
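If you would rather have the autoscaler only warn about a suboptimal PG count instead of changing it (and moving data) on its own, the mode can also be set per pool to "warn" instead of "on" (possible modes are off, warn and on):
# ceph osd pool set MY_POOL_NAME pg_autoscale_mode warn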