Low disk performance with Ceph pool storage

SB3rt_PX

Hi,
I have a disk performance issue with a Windows 2019 virtual machine.

The storage pool of my Proxmox cluster is a Ceph pool (on SSD disks).
The virtual machine runs software that writes with a 4k block size, and the performance measured with benchmark tools is very low for 4k writes.

These are the benchmark results:
- sequential: 2869 MB/s (read), 1498 MB/s (write)
- 4k: 2.61 MB/s (read), 8.39 MB/s

4K performance is very low; is there a way to improve/optimize it?
Do you have any advice?

The Ceph configuration was done following the baselines and performance is good (except at 4k block size).
I would like to improve the current 4k performance or understand what solution I can adopt.

Thanks
 
4K performance is very low; is there a way to improve/optimize it?
Are you SURE? When benchmarking 4k performance, note that
- MB/s is irrelevant. What are the IOPS?
- data patterns (sequential/random) will have a large impact on the perceived performance.

Sequential large read/write performance numbers give the warm and fuzzies but are largely inconsequential, since such IO patterns are rare in actual use. If you can describe your use case and your fio variables we can drill down further; your crush rule and the output of ceph osd tree would also help, as well as ceph.conf and the speed/MTU of your public and private interfaces.
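
For reference, that information can be gathered with the stock commands below (pool/rule/interface names are yours; the fio line is only an example of a 4k random-write test, so point the filename at something on the Ceph-backed storage):

ceph osd crush rule dump
ceph osd tree
cat /etc/ceph/ceph.conf
ip -d link show                          # shows MTU and bond details per interface
fio --name=4kwrite --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --ioengine=libaio --time_based --runtime=60 \
    --size=10G --filename=/path/on/ceph/backed/storage/fiotest

Note that without direct=1 (or with a target file on a local disk), fio mostly measures the page cache rather than Ceph.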
 
- 4k: 2.61 MB/s (read), 8.39 MB/s

What is the second rate? Guessing: writing data.

That would be 2000 IOPS. Over the wire. Very often twice, for a default "size=3, min_size=2" pool - depending on the topology. Plus time for the protocol overhead, e.g. the final "okay, it is written now. Continue..."

With a 1 GBit/s network (you did not tell us anything about it) that's more than I would expect to get.

((
Actually that's approximately what I got with my 2.5 GBit/s dedicated network for a normal use case. With 12 good OSDs on 6 nodes and iodepth=32, fio got me 9900 IOPS, shy below 10k. My own setup is history, it does not exist anymore. Some hints: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
))
 
4k: 2610/4 ≈ 650 IOPS read, 8390/4 ≈ 2097 IOPS write.
10 Gbit is theoretically 1250 MB/s, which isn't reachable due to overhead; but for 2869 MB/s you need at minimum a 25 Gbit network connection, otherwise you just measured the local cache.
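
Spelled out (assuming those benchmark numbers actually went over the wire rather than being served from a local cache):

2869 MB/s read  * 8 bit/byte ≈ 22950 Mbit/s ≈ 23 Gbit/s  -> needs at least a 25 Gbit link
1498 MB/s write * 8 bit/byte ≈ 11980 Mbit/s ≈ 12 Gbit/s  -> already more than a 10 Gbit link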
 
Are you SURE? When benchmarking 4k performance, note that
- MB/s is irrelevant. What are the IOPS?
- data patterns (sequential/random) will have a large impact on the perceived performance.

Sequential large read/write performance numbers give the warm and fuzzies but are largely inconsequential, since such IO patterns are rare in actual use. If you can describe your use case and your fio variables we can drill down further; your crush rule and the output of ceph osd tree would also help, as well as ceph.conf and the speed/MTU of your public and private interfaces.
Hi @alexskysilk,

The two Ceph interfaces are configured in an LACP bond (802.3ad); each interface is 25 Gbit/s.
The Ceph private and public networks are configured on the same LACP bond (so 50 Gbit/s).

I'll answer all your questions:
1) I executed the fio test and this is the result:
Jobs: 4 (f=4): [w(4)][100.0%][w=1223MiB/s][w=313k IOPS][eta 00m:00s]
test: (groupid=0, jobs=4): err= 0: pid=619700: Wed Jun 4 15:10:16 2025
write: IOPS=321k, BW=1256MiB/s (1317MB/s)(73.6GiB/60001msec); 0 zone resets
clat (usec): min=2, max=12156, avg=12.12, stdev=31.57
lat (usec): min=2, max=12156, avg=12.16, stdev=31.57
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 9], 50.00th=[ 10], 60.00th=[ 10],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 17], 95.00th=[ 32],
| 99.00th=[ 43], 99.50th=[ 46], 99.90th=[ 231], 99.95th=[ 490],
| 99.99th=[ 1352]
bw ( MiB/s): min= 630, max= 1421, per=100.00%, avg=1256.07, stdev=23.56, samples=476
iops : min=161482, max=364006, avg=321554.87, stdev=6031.27, samples=476
lat (usec) : 4=0.17%, 10=63.96%, 20=26.80%, 50=8.77%, 100=0.15%
lat (usec) : 250=0.05%, 500=0.05%, 750=0.02%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
cpu : usr=3.59%, sys=90.85%, ctx=66512, majf=4, minf=328
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,19285461,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=1256MiB/s (1317MB/s), 1256MiB/s-1256MiB/s (1317MB/s-1317MB/s), io=73.6GiB (79.0GB), run=60001-60001msec

2) CRUSH RULE

{
    "rule_id": 1,
    "rule_name": "ceph-pool-ssd",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -2,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

3) CEPH OSD TREE

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 41.91705 root default
-3 13.97235 host PX01
0 ssd 3.49309 osd.0 up 1.00000 1.00000
1 ssd 3.49309 osd.1 up 0.95001 1.00000
2 ssd 3.49309 osd.2 up 1.00000 1.00000
3 ssd 3.49309 osd.3 up 1.00000 1.00000
-5 13.97235 host PX02
4 ssd 3.49309 osd.4 up 0.95001 1.00000
5 ssd 3.49309 osd.5 up 1.00000 1.00000
6 ssd 3.49309 osd.6 up 1.00000 1.00000
7 ssd 3.49309 osd.7 up 1.00000 1.00000
-7 13.97235 host PX03
8 ssd 3.49309 osd.8 up 1.00000 1.00000
9 ssd 3.49309 osd.9 up 1.00000 1.00000
10 ssd 3.49309 osd.10 up 1.00000 1.00000
11 ssd 3.49309 osd.11 up 1.00000 1.00000

4) CEPH CONF
cat /etc/ceph/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.255.254.101/24
fsid = d59d29e6-5b2d-4ab3-8eec-5145eaaa4850
mon_allow_pool_delete = true
mon_host = 10.255.255.101 10.255.255.102 10.255.255.103
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.255.255.101/24

rbd cache = true
rbd cache writethrough until flush = false

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.px01]
public_addr = 10.255.255.101

[mon.px02]
public_addr = 10.255.255.102

[mon.px03]
public_addr = 10.255.255.103
 
What is the second rate? Guessing: writing data.

That would be 2000 IOPS. Over the wire. Very often twice, for a default "size=3, min_size=2" pool - depending on the topology. Plus time for the protocol overhead, e.g. the final "okay, it is written now. Continue..."

With a 1 GBit/s network (you did not tell us anything about it) that's more than I would expect to get.

((
Actually that's approximately what I got with my 2.5 GBit/s dedicated network for a normal use case. With 12 good OSDs on 6 nodes and iodepth=32, fio got me 9900 IOPS, shy below 10k. My own setup is history, it does not exist anymore. Some hints: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/
))
Hi @UdoB,
yes, the second rate is "write".

The two Ceph interfaces are configured in an LACP bond (802.3ad); each interface is 25 Gbit/s.
The Ceph private and public networks are configured on the same LACP bond (so 50 Gbit/s).
 
4k: 2610/4 ≈ 650 IOPS read, 8390/4 ≈ 2097 IOPS write.
10 Gbit is theoretically 1250 MB/s, which isn't reachable due to overhead; but for 2869 MB/s you need at minimum a 25 Gbit network connection, otherwise you just measured the local cache.
Yes, I confirm: the two Ceph interfaces are configured in an LACP bond (802.3ad); each interface is 25 Gbit/s.
The Ceph private and public networks are configured on the same LACP bond (so 50 Gbit/s).
 
The Ceph private and public networks are configured on the same LACP bond (so 50 Gbit/s).
Not 50 Gbit, 2x25. A single IO request cannot exceed 25 Gbit, since it uses a single channel on a LAG, and Ceph transactions are still single threaded. The good news is that it wouldn't really make a difference anyway, since each of your OSD nodes needs two transactions per IO anyway (one on the public interface, one on the private).
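
If you want to double-check how the bond distributes traffic, the kernel bonding driver reports the active mode and hash policy (the bond name here is just an example; use whatever your bond is called in /etc/network/interfaces):

cat /proc/net/bonding/bond0          # shows "Bonding Mode", "Transmit Hash Policy" and member link status
grep -A5 bond0 /etc/network/interfaces

With layer2 or layer3+4 hashing, any single TCP connection, and therefore any single Ceph OSD session, still rides on one 25 Gbit member link.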
 
This is not slow at all.
Despite this, the benchmark inside the Windows VM is slow (only for 4k).

I noticed this because users complain that their ERP is slow, and that is because the ERP (badly developed) performs continuous 4k reads/writes on the disk.

The Windows VM is configured following this best practice document: https://pve.proxmox.com/wiki/Windows_2019_guest_best_practices

Is there anything I can do to improve the performance of Ceph, or to optimize the Windows VM configuration for Ceph and 4k reads/writes?
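
For example, to reproduce the ERP's 4k pattern from inside the guest, a run with Microsoft's diskspd along these lines could be used (file path, duration and read/write mix are only placeholders, not values taken from the ERP):

diskspd.exe -b4K -r -w50 -o32 -t4 -d60 -Sh -c10G -L D:\disktest\testfile.dat

-b4K uses 4k blocks, -r makes the access random, -w50 is 50% writes, -o32/-t4 set queue depth and threads, -Sh disables OS and hardware caching so the result reflects the virtual disk rather than RAM, and -L records latency.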

Thanks
 
What is the type (NTFS/ReFS) and block size of the guest file system? Also, you mentioned you set up the VM using best practices, but it might be useful to validate it. Post its vmid.conf.
I attached the vmid.conf:

agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=scsi0;scsi1
cores: 8
cpu: x86-64-v2-AES,flags=+pdpe1gb;+aes
efidisk0: CEPH-POOL-SSD:vm-592-disk-0,size=128K
machine: pc-i440fx-9.2
memory: 98304
meta: creation-qemu=9.2.0,ctime=1747412332
name: SRV-TS
net0: virtio=00:50:56:84:c0:c6,bridge=vmbr20
numa: 0
onboot: 1
ostype: win11
scsi0: CEPH-POOL-SSD:vm-592-disk-1,cache=writeback,discard=on,iothread=1,size=900G,ssd=1
scsi1: CEPH-POOL-SSD:vm-592-disk-0,cache=writeback,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=********************
sockets: 1
vmgenid: *********************

The type of the guest file system is NTFS and the block size is 4k (default).
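
For reference, the NTFS allocation unit size can be verified from inside the guest with the built-in fsutil tool (the drive letter here is just an example); the "Bytes Per Cluster" line is the allocation unit size:

fsutil fsinfo ntfsinfo D: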