PVE7 - Ceph 16.2.5 - Pools and number of PG

zeuxprox

Hi,

I have a cluster of 5 PVE7 nodes with Ceph 16.2.5. The hardware configuration of 4 of the 5 nodes is:
  • CPU: 2 x AMD EPYC Rome 7402
  • RAM: 1 TB ECC
  • 2 x SSD 960 GB ZFS Raid 1 for Proxmox
  • 4 x Micron 9300 MAX 3.2 TB NVMe for Pool 1 named Pool-NVMe
  • 2 x Micron 5300 PRO 3.8 TB SSD for Pool 2 named Pool-SSD
  • NICs: 6 x 100Gb Mellanox ConnectX-5
The configuration of the fifth node is:
  • CPU: 1 x AMD EPYC Rome 7302
  • RAM: 256 GB
  • 2 x SSD 240 GB ZFS Raid 1 for Proxmox
  • 2 x Micron 5300 PRO 3.8 TB SSD for Pool 2 named Pool-SSD
  • NICs: 6 x 100Gb Mellanox ConnectX-5
I have created 2 Pools:
  • Pool-NVMe: composed of NVMe disks (16 x 3.2 TB)
  • Pool-SSD: composed of SSD disks (10 x 3.8 TB)
PVE7 has, by default, created both with 32 PGs. Now I have to migrate about 20 VMs to this cluster. Currently these 20 VMs are on a PVE 6.4 server with 6 NVMe disks configured in ZFS RAIDZ-2, and the total space they occupy is about 7 TB.

Autoscale mode is active in the cluster, but I would like to have an optimal number of PGs per pool before migrating the 20 VMs, so the question is: do you think I have to increase the minimum number of PGs for my 2 Ceph pools? If yes, considering that 4 TB will be stored in Pool-NVMe and 3 TB in Pool-SSD, what number of PGs would you advise per pool?

Thank you
 
In general, you are better off with a bit too many PGs than too few. 32 is definitely too few if you are going to fill the pool to that extent.
The autoscaler only triggers when the PG count is off by a factor of 3 or more, so it should not be relied on while the pool is not actively used yet.

The documentation linked above seems to suggest 1024 PGs for your NVMe pool and 512 or 1024 PGs for your SSD pool.
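For reference, the usual rule of thumb (assuming size=3 replicated pools and a target of roughly 100 PGs per OSD, rounded to a power of two; the replica count is my assumption, not something stated in this thread) works out to about this:

Code:
# pg_num ≈ (number of OSDs * 100) / replica size, rounded to a power of two
echo $((16 * 100 / 3))   # Pool-NVMe: 16 OSDs -> 533, round to 512 or 1024
echo $((10 * 100 / 3))   # Pool-SSD:  10 OSDs -> 333, round to 512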
 
See my previous post regarding the "autoscaler" here: https://forum.proxmox.com/threads/c...-before-enabling-auto-scale.80105/post-354624

I strongly recommend disabling the autoscaler until you have 50+ OSDs in your cluster

Code:
ceph config set global osd_pool_default_pg_autoscale_mode {warn|off}
-or-
ceph osd pool set <pool> pg_autoscale_mode {off|warn}

and manually setting pg_num and pgp_num per the above recommendations.
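For example, with the pool names above (the exact counts are just illustrative, pick the values from your own sizing):

Code:
ceph osd pool set Pool-NVMe pg_num 1024
ceph osd pool set Pool-NVMe pgp_num 1024
ceph osd pool set Pool-SSD pg_num 512
ceph osd pool set Pool-SSD pgp_num 512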

EDIT: add command for existing pools
 
Which driver do you use for the Mellanox ConnectX-5 cards to get them running under Debian Bullseye (11)? I am trying to upgrade from 6.4.x to 7.x, but the ConnectX-6 cards won't work.

Did you run the cards in ETH mode?
 
Hi,

I'm using the driver that ships with PVE7; I only upgraded the firmware, which I found on the Mellanox site. Then I downloaded the Mellanox tools from the following link:
https://www.mellanox.com/products/adapter-software/firmware-tools

You also have to download the firmware for your card...

Follow this mini how-to:
  1. Install pve-headers, gcc, make and dkms: apt install pve-headers gcc make dkms
  2. Extract the Mellanox tools: tar zxvf fileNameHere-deb.tgz
  3. Install the Mellanox tools: ./install.sh
  4. Start the Mellanox tools: mst start
  5. Show the device name: mst status (it should return something like /dev/mst/XXXXXXXX)
  6. Update the firmware (previously downloaded): flint -d /dev/mst/XXXXXXXX -i firmware_name.bin burn
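The same steps as a single copy-pasteable block (the archive and firmware file names are the placeholders from the steps above):

Code:
apt install pve-headers gcc make dkms
tar zxvf fileNameHere-deb.tgz
./install.sh      # run from inside the extracted directory
mst start
mst status        # should return something like /dev/mst/XXXXXXXX
flint -d /dev/mst/XXXXXXXX -i firmware_name.bin burn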

Here are some useful commands:

Code:
Show info about cards:
   mlxfwmanager

Show detailed info:
   mlxconfig -d /dev/mst/XXXXXXXX query

Modify cards from IB to ETH (cards with 2 ports):
   mlxconfig -d /dev/mst/XXXXXXXX set LINK_TYPE_P1=2  (for port 1)

   mlxconfig -d /dev/mst/XXXXXXXX set LINK_TYPE_P2=2   (for port 2)
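If it helps: after changing the link type you can verify the setting with a query (as far as I know, the new value only takes effect after a reboot of the host):

Code:
mlxconfig -d /dev/mst/XXXXXXXX query | grep LINK_TYPE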

Regards
 
So, regarding the autoscaler advice above: for my cluster you advise running the following command:

Code:
ceph config set global osd_pool_default_pg_autoscale_mode off

But how can I set pg_num and pgp_num to 1024? Is it safe to do this in a production environment?

Can I use this guide:
  1. https://forum.proxmox.com/threads/pve-ceph-increasing-pg-count-on-a-production-cluster.74145/
and change pg_num and pgp_num in small increments of 128 until reaching 1024, as sketched below?
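Something like this is what I have in mind (just a sketch with Pool-NVMe as an example; waiting for HEALTH_OK between steps is my assumption, not from the guide):

Code:
# step pg_num/pgp_num up in increments of 128 and let the cluster settle in between
for pgs in $(seq 128 128 1024); do
    ceph osd pool set Pool-NVMe pg_num $pgs
    ceph osd pool set Pool-NVMe pgp_num $pgs
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
done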

Thank you
 
