Ceph Performance Understanding

RRR

Member
Dec 6, 2020
I set up a Proxmox cluster with 3 servers (Intel Xeon E5-2673 and 192 GB RAM each).
There are 2 Ceph pools configured on them, separated into an NVMe pool and an SSD pool through CRUSH rules.
The public_network uses a dedicated 10 Gbit network, while the cluster_network uses a dedicated 40 Gbit network.
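In ceph.conf terms the split looks roughly like this (the subnets below are placeholders, not the real ones):
Code:
[global]
    public_network  = 192.168.10.0/24   # dedicated 10 Gbit network (client/monitor traffic)
    cluster_network = 192.168.40.0/24   # dedicated 40 Gbit network (OSD replication traffic)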
Everything seems to be working fine. I get the following rados benchmark results, which indicate much better performance with the NVMes:
Code:
Pool                 Write BW   Write     Read BW    Read      Rand Read  Rand Read
                     (MB/s)     avg IOPS  (MB/s)     avg IOPS  BW (MB/s)  avg IOPS
Proxmox1 SSD-Pool    193.514    48        837.793    207       828.15     207
Proxmox2 SSD-Pool    191.476    47        809.026    202       808.978    202
Proxmox3 SSD-Pool    192.317    48        828.607    207       825.837    206
Proxmox1 NVMe-Pool   930.343    232       1402.85    350       1393.78    348
Proxmox2 NVMe-Pool   813.821    203       1153.1     288       1232.1     308
Proxmox3 NVMe-Pool   842.248    210       1622.98    405       1401.24    350

Interestingly, when I do performance benchmarks within the VMs, I don't see much of a difference between the SSD and the NVMe pool:
Code:
VM / Pool / Host        hdparm cached  hdparm buffered  dd write  dd read   dd read
                        read (MB/s)    read (MB/s)      (MB/s)              (no cache)
SSD-Pool  @ Proxmox1    5434.11        270.33           183       928 MB/s  1.0 GB/s
SSD-Pool  @ Proxmox2    4757.98        253.57           261       1.6 GB/s  1.7 GB/s
SSD-Pool  @ Proxmox3    4877.86        255.7            263       1.7 GB/s  1.7 GB/s
NVMe-Pool @ Proxmox1    4766.4         304.63           255       1.5 GB/s  1.6 GB/s
NVMe-Pool @ Proxmox2    4816.38        302.49           251       1.5 GB/s  1.6 GB/s
NVMe-Pool @ Proxmox3    3780.34        289.11           272       1.5 GB/s  1.6 GB/s

The benchmarks inside the VMs show almost no difference between the SSD pool and the NVMe pool; in particular, the NVMe pool looks much slower than in the rados benchmark on the Proxmox hosts. I also tried different cache settings for the guests, such as writethrough, which didn't change anything. As SCSI controller I am using the recommended VirtIO SCSI controller.
Do you have any suggestions to improve the NVMe speed, or is it normal that the write speed inside the VM is only about a third of the rados benchmark?
 
Can you please post the rados bench command you used?
Rados Write Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 write -b 4M -t 16 --run-name `hostname` --no-cleanup
Rados Read Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 seq -t 16 --run-name `hostname`
Rados Random Read Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 rand -t 16 --run-name `hostname`
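The benchmark objects left behind by --no-cleanup can be removed afterwards with something like this (run name matching the one used for the write benchmark):
Bash:
rados -p <SSDPool|NVMePool> cleanup --run-name `hostname`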
Can you post the VM configs?
VM Config:
agent: enabled=1
boot: c
bootdisk: scsi0
cores: 2
ide2: SSDPool:vm-112-cloudinit,media=cdrom
ipconfig0: xxx
memory: 2048
name: SSD-TEST01
net0: virtio=3E:4C:D0:5D:A1:26,bridge=vmbr0,tag=1003
onboot: 1
scsi0: SSDPool:vm-112-disk-0,size=51404M (or NVMePool with NVMe Hosts)
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=3b022e12-e572-4e37-b01e-a7f7babc704b
sockets: 2
sshkeys: xxx
vga: std
vmgenid: xxx
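The disk above uses the default cache mode (none). When I tried other cache settings such as writethrough (mentioned earlier), the change boils down to something like this (VM 112 shown as the example):
Bash:
# reattach the existing disk with a different cache mode (writethrough as an example)
qm set 112 --scsi0 SSDPool:vm-112-disk-0,cache=writethrough,size=51404M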
How do you benchmark inside the VMs?
Bash:
sync
echo 3 > /proc/sys/vm/drop_caches                              # flush and drop the page cache first
dd if=/dev/zero of=/root/temp oflag=direct bs=128k count=256K  # our actual write performance test (O_DIRECT, 32 GiB)
echo "DD Read Test"
dd if=/root/temp of=/dev/null bs=1M count=32K                  # read back without dropping caches first
echo "DD Read without Cache"
/sbin/sysctl -w vm.drop_caches=3                               # drop caches again before the second read
dd if=/root/temp of=/dev/null bs=1M count=32K
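A variant of the read test that bypasses the guest page cache explicitly (O_DIRECT reads instead of relying on drop_caches) would be something like:
Bash:
dd if=/root/temp of=/dev/null iflag=direct bs=1M count=32K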
I also checked your benchmark paper and tried the fio test inside both VMs:
Bash:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1
Results:
SSD Write: WRITE: bw=139MiB/s (145MB/s), 139MiB/s-139MiB/s (145MB/s-145MB/s), io=81.3GiB (87.3GB), run=600021-600021msec
NVMe Write: WRITE: bw=166MiB/s (174MB/s), 166MiB/s-166MiB/s (174MB/s-174MB/s), io=97.1GiB (104GB), run=600016-600016msec
Bash:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=read --bs=4M --numjobs=1 --iodepth=1
Results:
SSD Read: READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=186GiB (199GB), run=600006-600006msec
NVMe Read: READ: bw=393MiB/s (412MB/s), 393MiB/s-393MiB/s (412MB/s-412MB/s), io=230GiB (247GB), run=600003-600003msec
 
With iodepth=1 (or a single dd), you'll be limited by the latency of the network + the CPU time of the client (librbd) + the CPU time of Ceph.

Your rados bench uses 16 threads (-t 16), so try iodepth=16 for your fio.
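Note that the psync engine is synchronous, so iodepth > 1 has no real effect there; to actually keep 16 requests in flight you need an asynchronous engine. Assuming libaio is available in the guest, something like:
Bash:
fio --name=fio --filename=/tmp/test_fio --size=9G --time_based --runtime=60 \
    --ioengine=libaio --direct=1 --rw=write --bs=4M --iodepth=16 --numjobs=1 \
    --group_reporting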
 
I am running the fio tests inside the virtual machines; they should not know anything about the Ceph storage underneath, nor should they be concerned about the performance or CPU time used for Ceph operations. I don't understand why I see 4.5 times higher write speeds for the NVMe pool in the rados benchmark (done on the Proxmox hosts), but inside the VMs I only get roughly 1.2 times higher write and read speeds with the NVMe pool compared to the SSD pool.

Anyway, I have now assigned 4 sockets with 4 cores each to the VMs and tried the fio benchmarks again with iodepth=16; nothing changed:

Code:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=16
Results:
SSD Write: WRITE: bw=139MiB/s (145MB/s), 139MiB/s-139MiB/s (145MB/s-145MB/s), io=81.2GiB (87.2GB), run=600011-600011msec
NVMe Write: WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=98.4GiB (106GB), run=600014-600014msec

Code:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=read --bs=4M --numjobs=1 --iodepth=16
Results:
SSD Read: READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=186GiB (199GB), run=600003-600003msec
NVMe Read: READ: bw=394MiB/s (413MB/s), 394MiB/s-394MiB/s (413MB/s-413MB/s), io=231GiB (248GB), run=600004-600004msec
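If I do the math on those numbers: at ~168 MiB/s with 4 MiB blocks, each write takes roughly 4/168 ≈ 24 ms end to end, so a single synchronous stream is latency-bound on both pools. That would also explain why iodepth=16 changed nothing here, since the psync engine only ever has one request in flight anyway.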
 
Hey Everyone

This may sound like a stupid question on this topic, but I would love some clarity on the numbers we are seeing.

Let me provide some background to the results for clarity.

As an example, here is the setup:


3 x hosts
256 GB RAM each
Dual-socket Intel Xeon v3/v4, 12 cores
Mirrored ZFS boot
1 x OSD per host

These are all Intel S4610 enterprise SATA SSDs.

Ceph is using:

LACP bonded 2 x 10 Gbit NICs for the cluster network (20 Gbit aggregate each way, 40 Gbit full duplex)
LACP bonded 2 x 10 Gbit NICs for the public network
MTU 9000
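To sanity check that the bonds and MTU actually deliver, something like iperf3 between two of the nodes can be used (the address is a placeholder; several parallel streams so the LACP hash can spread them across both links):
Bash:
# on node A
iperf3 -s
# on node B
iperf3 -c <node-A-ip> -P 4 -t 30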

Results:

Bash:
rados bench -p device_health_metrics 600 rand -t 16 --no-cleanup

Total time run:       600.133
Total reads made:     130825
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   871.973
Average IOPS:         217
Stddev IOPS:          14.5926
Max IOPS:             269
Min IOPS:             176
Average Latency(s):   0.0729499
Max latency(s):       0.285462
Min latency(s):       0.00259099

Bash:
rados bench -p device_health_metrics 600 write -b 4M -t 16 --no-cleanup

Total time run:         600.2
Total writes made:      63954
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     426.218
Stddev Bandwidth:       31.365
Max bandwidth (MB/sec): 464
Min bandwidth (MB/sec): 132
Average IOPS:           106
Stddev IOPS:            7.84125
Max IOPS:               116
Min IOPS:               33
Average Latency(s):     0.150138
Stddev Latency(s):      0.0355427
Max latency(s):         1.25835
Min latency(s):         0.0224209

My main question comes down to IOPS: why are they so low?

The base performance of a single disk is well in excess of the IOPS we are seeing.

We can see the network is at about 85% saturation, so shouldn't we be seeing more in terms of IOPS here?

What am I missing?

Hungry to learn more :)

""Cheers
G
 
OK, I think I have worked out the issue.

The rados benchmark isn't pushing the drives hard enough to get the max IOPS available for reads/writes.
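Doing the maths on the write run above: with 16 requests in flight and an average latency of ~0.15 s, the bench can only complete about 16 / 0.15 ≈ 107 ops/s, which matches the reported 106 average IOPS almost exactly. And since each op is a 4 MiB object, those "low" IOPS still amount to ~426 MB/s. So the IOPS figure is capped by concurrency x latency at 4 MiB per op, not by what the drives can do.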

I have performed some VM drive tests and can see much higher IOPS in the VM benchmark using CrystalMark, observing the IOPS consumed in PVE > Datacenter > Ceph performance monitoring.

Quite impressed by some basic tests so far on what can be achieved with Ceph.

Will try to get some concurrent VM fio tests going to really give it a push. I can already see that the network will be the bottleneck here: even with 10 Gbit LACP bonds and separate networks, the network will be saturated before the drives reach their full performance potential.
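What I have in mind for each VM is roughly the following (parameters are just a starting point; small blocks and a higher queue depth to push IOPS rather than bandwidth):
Bash:
fio --name=randwrite --filename=/root/fio_test --size=8G --time_based --runtime=120 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --group_reporting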

Very cool! (and quietly impressed so far)

Will report back once we have set up a number of VMs across each host and run the tests in parallel, if anyone is interested.

""Cheers
G
 
The rados benchmark isn't pushing the drives hard enough to get the max IOPS available for reads/writes.
You can start a rados benchmark on each node in the cluster; just make sure to pass the --run-name <label> parameter so the runs don't interfere with each other.
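For example, started at roughly the same time on every node (pool name is a placeholder):
Bash:
rados bench -p <pool> 600 write -b 4M -t 16 --run-name $(hostname) --no-cleanup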
 
