[PVE+Ceph] Increasing PG count on a production cluster

lucaferr

Hi! Our little hyperconverged cluster started out two years ago with 3 nodes and 12 OSDs in total. The node count has more than tripled since then, and the cluster now has 10 nodes and 40 OSDs: 512 PGs are no longer enough (we have to rebalance often, for example) and we'd like to increase the PG count from 512 to 2048, as recommended by the PG calc tool.
The cluster is in HA and we can't afford any downtime, so we need to plan the operation carefully. Everywhere online I read that a PG increase is the most impactful event in a Ceph cluster and "should be avoided for production clusters if possible". Then I found this guide, which says that by increasing in slices of 128 PGs they were able to upgrade a production cluster without any limitation on client traffic: https://www.netways.de/blog/2017/10/25/ceph-increasing-placement-groups-in-production/
Does anyone have similar experience and want to share some tips or opinions?
To give you all the elements: all 40 OSDs are NVMe SSDs and the 10 nodes are connected over a 10 Gb/s LAN.
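For reference, 2048 is exactly what the usual rule of thumb behind the PG calculator gives (roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two). A quick sanity check from the CLI could look like this -- the pool name is just a placeholder, and I'm assuming 3x replication:
Code:
# current values for the pool ("ceph-vm" is a placeholder name)
ceph osd pool get ceph-vm pg_num
ceph osd pool get ceph-vm size

# rule of thumb behind the PG calculator:
#   target PGs = (OSDs * 100) / replica size, rounded up to a power of two
#   (40 * 100) / 3 ~= 1333  ->  2048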
 
Sure. I assume you're running Ceph Luminous (with PVE 5.x) -- Ceph Nautilus (with PVE 6.x) has the autoscaler feature which may or may not make the change "easier".
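If you do end up on Nautilus first, you can at least put the autoscaler in "warn" mode to see what it would recommend before touching anything -- a minimal sketch, with "ceph-critical" as a placeholder pool name:
Code:
# show per-pool PG counts and what the autoscaler would do
ceph osd pool autoscale-status

# warn only, or let it act on its own
ceph osd pool set ceph-critical pg_autoscale_mode warn
# ceph osd pool set ceph-critical pg_autoscale_mode on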

In any case, this was my procedure with Luminous on PVE 5.x using increments of 32 -- you can substitute 128 if you like:
Code:
# Increase PGs in production

# stop and wait for scrub and deep-scrub operations
ceph osd set noscrub
ceph osd set nodeep-scrub

# set the cluster in "maintenance mode"
ceph osd set noout

# change pg_num AND pgp_num in small increments (512, 544, 576, 608, ...)
#   ceph osd pool set {pool-name} pg_num {pg_num}
#   ceph osd pool set {pool-name} pgp_num {pgp_num}
ceph osd pool set ceph-critical pg_num 512
ceph osd pool set ceph-critical pgp_num 512

# wait for the rebalance to finish, then repeat the two "pool set" commands
# with the next increment until the desired PG count is reached

# restore
ceph osd unset noout

# when all PGs are active+clean, re-enable scrub and deep-scrub
ceph osd unset noscrub
ceph osd unset nodeep-scrub

# done

I didn't have any problems with servers/services staying available during the process. However, in your case, the 40 NVMe-backed OSDs will completely saturate the 10G network -- that may cause you issues. If so, try smaller increments. Good luck!
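If the rebalance does start hurting client traffic, another knob -- not part of the procedure above, just an option, and the defaults quoted are the Luminous ones as far as I remember -- is to throttle recovery at runtime:
Code:
# slow down recovery on all OSDs at runtime (example values)
ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-recovery-sleep-ssd 0.1'

# put the defaults back once the cluster is HEALTH_OK again (3 and 0 on Luminous)
ceph tell osd.* injectargs '--osd-recovery-max-active 3 --osd-recovery-sleep-ssd 0'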
 
Thank you very much for sharing your procedure: the main steps are the same ones I had identified, but I hadn't thought about disabling scrubs. I will start with a first step of +32 PGs and, based on I/O and network saturation (luckily I have a well-tuned Zabbix setup monitoring them in real time), decide whether to continue this way or switch to +64 PGs and so on.
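Alongside Zabbix, I'll probably also watch each increment settle from the CLI, something like:
Code:
# overall health plus recovery/backfill progress, refreshed every 5 seconds
watch -n 5 ceph -s

# per-pool recovery and client I/O rates
ceph osd pool stats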
 
I have another question: in addition to the PG increase, I have to add 10 new OSDs to the cluster (each one is a 2 TB NVMe drive, so 20 TB of storage in total).
Should I add the new OSDs first, wait for the rebalancing and then add the new PGs, or the other way around?
At the moment raw space usage is at 75% and the pool is at 82%, if that helps with the considerations.
Thanks!
 
In your case, I recommend doing the PG increase before adding your new OSDs -- you are already below 30 PGs per OSD (the default mon_pg_warn_min_per_osd). I also hope you intend to increase the network bandwidth for 50 NVMe drives in the cluster.
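A quick way to check where you stand on PGs per OSD before and after each step is the PGS column of:
Code:
# per-OSD usage and PG count (PGS column on the far right)
ceph osd df tree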
 
Thanks for the advice regarding the PG increments. Regarding the 10 GbE: at the moment I only have about 200 Mb/s going through each Ethernet port, with peaks no higher than 1.5 Gb/s when I add OSDs or rebalance, so 10 Gb/s seems adequate to me (my workload is mainly web-hosting VMs). Since I'm keeping 4 OSDs per node (I'm adding new nodes to the cluster, not adding OSDs to the existing nodes), I don't expect the traffic per port to increase; only the total throughput across the switch will.
 
Hi @RokaKen,
Thanks for your write-up of the Ceph steps.

I would also like some instructions:
  • I'm running Proxmox 8.1 with Ceph 17.2.7 (Quincy) on a 3-node cluster.
  • Each node has 6 x 1 TB Samsung 870 EVOs.
  • Each Proxmox node has a 1 TB Samsung 980 Pro NVMe with 6 x 40 GB WAL/DB LVM partitions.
I ran through a simple setup guide that walked me through creating this setup, but I don't remember it mentioning how to configure placement groups, and as a result I ended up with 128 PGs.

I'm filling up this server and it's now giving me warnings that 128 PGs is too low.


Since I've got 18 x 1 TB SSDs and I plan on adding 4 more 1 TB SSDs to each node in the future, bringing the total number of OSDs to 30, I was thinking that 512 PGs would be appropriate.
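As a rough sanity check with the same rule of thumb used earlier in the thread (and assuming 3x replication, which I haven't changed):
Code:
#   today:  (18 * 100) / 3 =  600  -> nearest power of two: 512
#   at 30:  (30 * 100) / 3 = 1000  -> next power of two: 1024
# i.e. 512 PGs would be ~85 PGs per OSD now and ~51 once all 30 OSDs are in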

Would these instructions work for my exact use-case?
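From what I've read, on Quincy the mechanics should be simpler than in the Luminous-era steps above, since pgp_num is adjusted gradually in the background once pg_num is raised -- so I'm guessing it boils down to something like this ({pool-name} standing in for my pool):
Code:
# on Quincy, setting pg_num should be enough; pgp_num follows gradually
ceph osd pool set {pool-name} pg_num 512

# or alternatively let the autoscaler decide
ceph osd pool set {pool-name} pg_autoscale_mode on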

Your help and guidance would be greatly appreciated!
 
