Ceph : number of placement groups for 5+ pools on 3 hosts x 1 OSD

galeksandrp

Mar 15, 2024
Hi.

MY CONFIG : 3 hosts with PVE 8.4.1 and Ceph Reef, plus a dedicated 10 Gb Ethernet network for Ceph.

Each host has a single OSD, which is an 8 TB CMR HDD.

WHAT I DID : Created 5 pools with default settings.

WHAT I NEED TO DO : Create 15 more pools.

PROBLEM : Ceph started screaming "too many pgs per osd".

WHY THE PROBLEM SURFACED : As far as I understand, a placement group is a thread that calculates the destination of a Ceph object.

This calculation is done independently for each pool.

That means that 128 PG threads are adequate for a single pool on 3 OSDs.

But 20 pools x 128 PGs gives 2560 placement groups per OSD (assuming the default 3 replicas, every one of the 3 OSDs holds a copy of every PG). Ceph will not be happy.
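
The arithmetic above can be checked with a small sketch. It assumes every pool uses 128 PGs and a replica size of 3 (neither the replica size nor the limit is stated in the post), and the warning threshold of 250 is only an assumed value for `mon_max_pg_per_osd`; check the real value on your cluster. The function name is made up for illustration.

```python
def pg_replicas_per_osd(pools, num_osds):
    """Average number of PG replicas each OSD has to hold.

    pools: list of (pg_num, replica_size) tuples, one per pool.
    """
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas / num_osds


ASSUMED_LIMIT = 250  # assumption: typical mon_max_pg_per_osd, verify locally

for pool_count in (5, 20):
    # assumption: 128 PGs per pool, replica size 3
    pools = [(128, 3)] * pool_count
    per_osd = pg_replicas_per_osd(pools, num_osds=3)
    status = "over" if per_osd > ASSUMED_LIMIT else "within"
    print(f"{pool_count} pools: {per_osd:.0f} PGs per OSD ({status} assumed limit)")
```

With these assumptions even the first 5 pools already put 640 PGs on each OSD, so the 15 extra pools only make an existing overshoot worse.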

QUESTION: Can I suppress this warning?
At any time only a single pool will receive writes.
Does that mean that out of 2560 potential PG threads only 128 will be started?
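
As an alternative to suppressing the warning, here is a rough sketch of how small `pg_num` would have to be per pool to fit 20 pools on 3 OSDs. It assumes a replica size of 3 and the commonly quoted rule of thumb of about 100 PGs per OSD; both numbers are assumptions, not values from this cluster, and the function name is hypothetical.

```python
def pg_num_per_pool(num_osds, num_pools, replica_size=3, target_per_osd=100):
    """Largest power-of-two pg_num per pool that stays within the target.

    The PG budget is split evenly across all pools, then rounded down
    to a power of two (minimum 1).
    """
    raw = num_osds * target_per_osd / (replica_size * num_pools)
    pg_num = 1
    while pg_num * 2 <= raw:
        pg_num *= 2
    return pg_num


for pool_count in (1, 5, 20):
    print(pool_count, "pools ->", pg_num_per_pool(3, pool_count), "PGs per pool")
```

Under these assumptions 20 pools leave room for only about 4 PGs per pool, which is far below the 128 used by default.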
 
ADDITIONAL HARDWARE INFO : Each host also has an OSD on a 1 TB enterprise U.2 NVMe drive.

These SSD-class OSDs hold the 300 GB database disk of a single VM. This VM runs 24x7.

ADDITIONAL SOFTWARE INFO : The database is configured like a ring buffer, so it only stores 1.5 months of data or so. This is a hard limitation.

Each month a script copies the database files to a destination backed by a "month" pool.

TASK I AM TRYING TO SOLVE : From time to time I need to provide fast random read access to the database from X months ago.

So if I have 20 pools this is simple - just change the device class of the pool from HDD to HDD+SSD and get a rock-solid rebalance.
 
Okay, it will go like this :

STEP 1 : Migrate 202412.raw from HDD POOL to NVME POOL, do not delete the source.
STEP 2 : Do the database work.
STEP 3 : Delete the source from HDD POOL.
STEP 4 : Migrate 202412.raw back from NVME POOL to HDD POOL, delete the source.
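
The order of the steps above can be sanity-checked with a toy simulation. This is not a migration tool: pools are modelled as plain Python sets, the pool names `hdd` and `nvme` and the function `run_plan` are made up for illustration, and no Ceph or Proxmox command is called. It only verifies that at least one copy of the image exists after every step.

```python
def run_plan(image="202412.raw"):
    """Simulate the four steps and record where the image lives after each."""
    pools = {"hdd": {image}, "nvme": set()}  # hypothetical pool names
    history = []

    def record(step):
        copies = sorted(name for name, images in pools.items() if image in images)
        # the plan is only safe if the image is never left without a copy
        assert copies, f"{step}: no copy of {image} left"
        history.append((step, copies))

    pools["nvme"].add(image)      # STEP 1: copy to NVMe, keep the HDD source
    record("step 1")
    record("step 2")              # STEP 2: database work, no data movement
    pools["hdd"].discard(image)   # STEP 3: delete the HDD source
    record("step 3")
    pools["hdd"].add(image)       # STEP 4: copy back to HDD ...
    pools["nvme"].discard(image)  # ... and delete the NVMe source
    record("step 4")
    return history


for step, copies in run_plan():
    print(step, "->", copies)
```

Note that between step 3 and step 4 the only copy sits on the NVMe pool, so any changes made during the database work are what ends up back on the HDD pool.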

Sounds plausible, will try next week.
 
ANSWER : I have decommissioned Ceph Reef 18.2.7 entirely due to the following issues.

ISSUE 01 : A Windows 11 VM has a noticeable IO slowdown [ linear IO drops to 1 MB per sec, IOPS so low that I was unable to capture them ]
- in the first ~5-10 minutes on Ceph with hyperconverged access [ VM and OSD on the same host ], compared with a local checksumming filesystem.
- After ~5-10 minutes of the VM running, linear IO and IOPS return to the expected levels.
ASSUMPTION : Not much to fix there - either screw around with resource isolation for the OSD processes or buy a disk array server.
- Ceph is actually RADOS object storage.
- Object storage can be hyperconverged all the way, because object storage client software is expected to deal with extreme latencies and instability.
- When they slapped RBD on top of RADOS, they intentionally forgot to remove the word hyperconverged.
- Because non-hyperconverged block storage software already exists and no one cares about it. It's called SAN software.

ISSUE 02 : There is a well-known bug in the Proxmox VE web UI - when some storage is stuck in an infinite IO wait, the web UI API server (or something behind it) fails miserably and all VMs and CTs show a question mark instead of their name and running status.

I had a temporary, unimportant pool in 1/1 mode on one machine of the cluster. This machine had hardware issues and hung.
I lost all ability to control the cluster until I brought that machine back with the methods available to me.

ISSUE 03 : You need to upgrade Ceph to the next major version before even considering migrating PVE 8 to 9.

WORKAROUND : For now I am using the scheme proposed by gurubert - live storage migration between pools of a local checksumming filesystem.

POSTMORTEM : We know who to blame - Ceph community advertisers marketing object storage as hyperconverged block storage. But why were they able to tell me lies in the first place?
01 : Well, I wanted a checksumming filesystem
- with the ability to live migrate between the NVMe tier and the HDD tier
- transparently [ at the storage level ].
02 : I was unable to use a checksumming filesystem at the VM level [ Windows VM ].
I think I would have been better off forgetting the tales about bit rot and setting up LVM, because LVM has pvmove.

CLOSED.
 