Hi! Our little hyperconverged cluster was born 2 years ago with 3 nodes and 12 OSDs in total. The node count has more than tripled since then, and the cluster now has 10 nodes and 40 OSDs in total. 512 PGs are no longer enough (we have to rebalance often, for example), and we'd like to increase the PG count from 512 to 2048, as recommended by the PG calc tool.
The cluster is in HA and we can't afford any downtime, so we need to plan the operation carefully. Everywhere online I read that a PG increase is the most impactful event in a Ceph cluster and "should be avoided for production clusters if possible". Then I read this guide, which says that by increasing in slices of 128 PGs they were able to raise the PG count on a production cluster without any limitation on client traffic: https://www.netways.de/blog/2017/10/25/ceph-increasing-placement-groups-in-production/
Does anyone have similar experience and want to share some tips or opinions?
To give you the full picture: all 40 OSDs are NVMe SSDs, and the 10 nodes are connected via a 10 Gb/s LAN.
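For reference, here is a rough sketch of the stepwise plan I have in mind, following the "slices of 128 PGs" idea from the guide. It only prints the commands and does not touch the cluster. The pool name `mypool` is a placeholder, and the helper names `pg_steps` and `plan_commands` are just made up for this post. I'm assuming each step is applied by hand and that we wait for the cluster to return to HEALTH_OK before moving to the next one.

```python
def pg_steps(current, target, step):
    """Return the pg_num values to walk through, from current (exclusive) to target (inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    if target < current:
        raise ValueError("target must not be smaller than current")
    values = []
    n = current
    while n < target:
        # Never overshoot the target on the last step.
        n = min(n + step, target)
        values.append(n)
    return values


def plan_commands(pool, current, target, step):
    """Build the list of ceph CLI commands for each step (printed only, never executed here)."""
    commands = []
    for n in pg_steps(current, target, step):
        commands.append(f"ceph osd pool set {pool} pg_num {n}")
        # Assumption: pgp_num is raised manually after pg_num at each step.
        # Newer Ceph releases may adjust pgp_num on their own, making this line unnecessary.
        commands.append(f"ceph osd pool set {pool} pgp_num {n}")
    return commands


if __name__ == "__main__":
    # "mypool" is a hypothetical pool name; 512 -> 2048 in slices of 128 as discussed above.
    for cmd in plan_commands("mypool", 512, 2048, 128):
        print(cmd)
```

With these numbers the plan has 12 steps (640, 768, ..., 2048), so the total duration depends mostly on how long each rebalance takes on our hardware.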