Hi everyone,
I’m configuring a new Ceph cluster managed by Proxmox VE 9.0 (ceph version 19.2.3) with the following setup:
- Nodes: 3 hosts, each with 5x ThinkSystem 3.5" U.3 7500 PRO 1.92TB NVMe drives (15 OSDs total, ~28.8 TB raw capacity).
- Network: ~64 Gbit/s, MTU 9000 (public and cluster network).
- Pools:
  - `.mgr` (1 PG, for Ceph management).
  - `vms0_ceph` (RBD pool for VMs, replication size 3, min_size 2).
- Data: ~373 GiB (~1.1 TiB with replication).
- Benchmarks (`rbd bench`, 4K, 16 threads, 1 GiB total, sequential pattern; full output below):
  - Write IOPS: ~62,000
  - Read IOPS: ~75,000
- Versions: Ceph 19.2.3 (Squid), Proxmox VE 9.0.10.
Issue:
When creating the `vms0_ceph` pool, I set the PG number to 256, as I calculated the optimal PG count to be ~500 (`(15 OSDs × 100) ÷ 3 ≈ 500`) and chose 256 as a compromise. However, the Proxmox GUI and `ceph osd pool ls detail` show 32 (`pg_num 32`, `pgp_num 32`, `autoscale_mode on`), and the latest `ceph -s` output (as of 09:25 AM CEST, Oct 10, 2025) shows **33 PGs total** (1 for `.mgr`, 32 for `vms0_ceph`), all in `active+clean` state.
It seems the autoscale mode is overriding my manual setting of 256 PGs and reducing it to 32.
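If it helps with diagnosis: my understanding is that `ceph osd pool autoscale-status` shows what the autoscaler is targeting for each pool (I can post its output if useful):
Bash:
~# ceph osd pool autoscale-status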
Performance Concerns:
With only 32 PGs, I suspect the cluster is not fully utilizing the NVMe drives’ parallelism: the `rbd bench` runs below top out at roughly 62,000 write IOPS and 75,000 read IOPS, which seems low for 15 NVMe OSDs. I believe increasing the PG count to 256 or 512 would improve performance, especially for read-heavy VM workloads.
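For completeness: the `rbd bench` runs further down used the default sequential pattern; once the PG question is settled I plan to re-test with a random pattern as well (same image and parameters, just adding `--io-pattern rand`):
Bash:
~# rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand vms0_ceph/test-image
~# rbd bench --io-type read --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand vms0_ceph/test-image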
Questions:
- Is it normal for the PG autoscaler to override my manual PG setting (256) and reduce it to 32?
- Is 32 PGs too low for a cluster with 15 NVMe OSDs (ThinkSystem 3.5" U.3 7500 PRO 1.92TB) and replication factor 3?
- Have I made a mistake in my configuration, or is this expected behavior?
- Should I disable the autoscaler (`ceph osd pool set vms0_ceph pg_autoscale_mode off`) and set the PG count to 256 or 512 (see the commands I’m considering after this list)? What are the risks or considerations?
- How can I ensure the PG count stays at my desired value (256 or 512)?
- Could the benchmark results (~62,000 write IOPS vs. ~75,000 read IOPS) be related to the low PG count, or are there other factors I should investigate?
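For the autoscaler question above, these are the commands I believe would apply; please correct me if I have misunderstood how `pg_num_min` and `target_size_ratio` interact with the autoscaler:
Bash:
# Option A: disable the autoscaler for this pool and set pg_num manually
~# ceph osd pool set vms0_ceph pg_autoscale_mode off
~# ceph osd pool set vms0_ceph pg_num 256
~# ceph osd pool set vms0_ceph pgp_num 256
# Option B: keep the autoscaler on, but prevent it from going below 256 PGs
~# ceph osd pool set vms0_ceph pg_num_min 256
# Option C: keep the autoscaler on and tell it this pool will use most of the capacity
~# ceph osd pool set vms0_ceph target_size_ratio 1.0
# Check what the autoscaler now targets
~# ceph osd pool autoscale-status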
Bash:
~# rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G vms0_ceph/test-image
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 78816 78910.1 308 MiB/s
2 136816 68449.5 267 MiB/s
3 187632 62569.5 244 MiB/s
4 250048 62531 244 MiB/s
elapsed: 4 ops: 262144 ops/sec: 61957.3 bytes/sec: 242 MiB/s
root@vm1001:~# ceph version
ceph version 19.2.3 (2f03f1cd83e5d40cdf1393cb64a662a8e8bb07c6) squid (stable)
root@vm1001:~# rbd bench --io-type read --io-size 4K --io-threads 16 --io-total 1G vms0_ceph/test-image
bench type read io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 71344 71430.7 279 MiB/s
2 150704 75396.9 295 MiB/s
3 226608 75565.7 295 MiB/s
elapsed: 3 ops: 262144 ops/sec: 75371.3 bytes/sec: 294 MiB/s
Bash:
~# pveceph pool ls --noborder
Name Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name %
.mgr 3 2 1 1 1 on replicated_rule 2.8115297823205
vms0_ceph 3 2 32 32 on replicated_rule 0.04410051554
root@vm1001:~# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 26.19896 - 26 TiB 1.1 TiB 1.1 TiB 38 KiB 9.7 GiB 25 TiB 4.10 1.00 - root default
-3 8.73299 - 8.7 TiB 368 GiB 363 GiB 9 KiB 4.8 GiB 8.4 TiB 4.11 1.00 - host vm1001
0 ssd 1.74660 1.00000 1.7 TiB 58 GiB 57 GiB 1 KiB 1.1 GiB 1.7 TiB 3.25 0.79 5 up osd.0
1 ssd 1.74660 1.00000 1.7 TiB 92 GiB 91 GiB 5 KiB 624 MiB 1.7 TiB 5.12 1.25 8 up osd.1
2 ssd 1.74660 1.00000 1.7 TiB 69 GiB 68 GiB 1 KiB 1.0 GiB 1.7 TiB 3.88 0.95 6 up osd.2
3 ssd 1.74660 1.00000 1.7 TiB 80 GiB 79 GiB 1 KiB 1.1 GiB 1.7 TiB 4.49 1.10 8 up osd.3
4 ssd 1.74660 1.00000 1.7 TiB 69 GiB 68 GiB 1 KiB 952 MiB 1.7 TiB 3.83 0.94 6 up osd.4
-5 8.73299 - 8.7 TiB 367 GiB 363 GiB 8 KiB 4.2 GiB 8.4 TiB 4.11 1.00 - host vm1002
5 ssd 1.74660 1.00000 1.7 TiB 58 GiB 57 GiB 1 KiB 949 MiB 1.7 TiB 3.23 0.79 5 up osd.5
6 ssd 1.74660 1.00000 1.7 TiB 80 GiB 79 GiB 1 KiB 1.0 GiB 1.7 TiB 4.48 1.09 7 up osd.6
7 ssd 1.74660 1.00000 1.7 TiB 57 GiB 56 GiB 1 KiB 942 MiB 1.7 TiB 3.21 0.78 5 up osd.7
8 ssd 1.74660 1.00000 1.7 TiB 115 GiB 114 GiB 4 KiB 317 MiB 1.6 TiB 6.40 1.56 11 up osd.8
9 ssd 1.74660 1.00000 1.7 TiB 58 GiB 57 GiB 1 KiB 985 MiB 1.7 TiB 3.22 0.79 5 up osd.9
-7 8.73299 - 8.7 TiB 364 GiB 363 GiB 21 KiB 754 MiB 8.4 TiB 4.07 0.99 - host vm1003
10 ssd 1.74660 1.00000 1.7 TiB 80 GiB 80 GiB 5 KiB 172 MiB 1.7 TiB 4.48 1.09 7 up osd.10
11 ssd 1.74660 1.00000 1.7 TiB 34 GiB 34 GiB 4 KiB 62 MiB 1.7 TiB 1.88 0.46 3 up osd.11
12 ssd 1.74660 1.00000 1.7 TiB 80 GiB 80 GiB 4 KiB 136 MiB 1.7 TiB 4.47 1.09 8 up osd.12
13 ssd 1.74660 1.00000 1.7 TiB 125 GiB 125 GiB 4 KiB 282 MiB 1.6 TiB 7.00 1.71 11 up osd.13
14 ssd 1.74660 1.00000 1.7 TiB 45 GiB 45 GiB 4 KiB 102 MiB 1.7 TiB 2.52 0.62 4 up osd.14
TOTAL 26 TiB 1.1 TiB 1.1 TiB 46 KiB 9.7 GiB 25 TiB 4.10
MIN/MAX VAR: 0.46/1.71 STDDEV: 1.31
Bash:
~# fastfetch
.://:` `://:. root@vm1001
`hMMMMMMd/ /dMMMMMMh` ---------------
`sMMMMMMMd: :mMMMMMMMs` OS: Proxmox VE 9.0.10 x86_64
`-/+oo+/:`.yMMMMMMMh- -hMMMMMMMy.`:/+oo+/-` Host: ThinkSystem SR650 V3 (07)
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:` Kernel: Linux 6.14.11-3-pve
`/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/` Uptime: 2 days, 1 hour, 33 mins
./ooooooo+- +NMMMMMMMMN+ -+ooooooo/. Packages: 773 (dpkg)
.+ooooooo+-`oNMMMMNo`-+ooooooo+. Shell: bash 5.2.37
-+ooooooo/.`sMMs`./ooooooo+- Display (Acer B223W): 1024x768 @ 75 Hz in 22"
:oooooooo/`..`/oooooooo: Terminal: /dev/pts/2
:oooooooo/`..`/oooooooo: CPU: 2 x Intel(R) Xeon(R) Silver 4410T (40) @ 4.00 GHz
-+ooooooo/.`sMMs`./ooooooo+- GPU: ASPEED Technology, Inc. ASPEED Graphics Family
.+ooooooo+-`oNMMMMNo`-+ooooooo+. Memory: 11.01 GiB / 125.61 GiB (9%)
./ooooooo+- +NMMMMMMMMN+ -+ooooooo/. Swap: 0 B / 8.00 GiB (0%)
`/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/` Disk (/): 4.39 GiB / 93.93 GiB (5%) - ext4
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:` Local IP (vmbr10): 10.91.10.11/24
`-/+oo+/:`.yMMMMMMMh- -hMMMMMMMy.`:/+oo+/-` Locale: C
`sMMMMMMMm: :dMMMMMMMs`
`hMMMMMMd/ /dMMMMMMh`
`://:` `://:`