Ceph Performance Understanding

RRR · Dec 6, 2020
I set up a Proxmox cluster with 3 servers (Intel Xeon E5-2673 and 192 GB RAM each).
There are 2 Ceph pools configured on them, separated into an NVMe pool and an SSD pool through CRUSH rules.
The public_network uses a dedicated 10 Gbit network, while the cluster_network uses a dedicated 40 Gbit network.
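For reference, such a pool split is typically done with device-class based CRUSH rules; a minimal sketch (rule names are assumptions, pool names as used in the benchmarks below):
Bash:
# Hypothetical sketch: one replicated rule per device class, then assign the rules to the pools
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set NVMePool crush_rule replicated_nvme
ceph osd pool set SSDPool crush_rule replicated_ssd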
Everything seems to be working fine; I get the following rados benchmark results, which indicate much better performance with the NVMes:
                    SSD-Pool                          NVMe-Pool
                    Proxmox1   Proxmox2   Proxmox3   Proxmox1   Proxmox2   Proxmox3
Write     BW MB/s    193.514    191.476    192.317    930.343    813.821    842.248
Write     Avg IOPS        48         47         48        232        203        210
Read      BW MB/s    837.793    809.026    828.607    1402.85     1153.1    1622.98
Read      Avg IOPS       207        202        207        350        288        405
Rand read BW MB/s     828.15    808.978    825.837    1393.78     1232.1    1401.24
Rand read Avg IOPS       207        202        206        348        308        350

Interestingly, when I do performance benchmarks within the VMs, I don't see much of a difference between the SSD and the NVMe pool:
                          VM @ SSD-Pool                     VM @ NVMe-Pool
                          Proxmox1  Proxmox2  Proxmox3      Proxmox1  Proxmox2  Proxmox3
hdparm cached read MB/s    5434.11   4757.98   4877.86        4766.4   4816.38   3780.34
hdparm buffered MB/s        270.33    253.57     255.7        304.63    302.49    289.11
dd write MB/s                  183       261       263           255       251       272
dd read                   928 MB/s  1.6 GB/s  1.7 GB/s      1.5 GB/s  1.5 GB/s  1.5 GB/s
dd read w/o cache         1.0 GB/s  1.7 GB/s  1.7 GB/s      1.6 GB/s  1.6 GB/s  1.6 GB/s

The benchmarks inside the VMs show almost no difference between the SSD pool and the NVMe pool; in particular, the NVMe pool looks much slower than the rados benchmark on the Proxmox hosts would suggest. I also tried different cache settings for the guests, such as writethrough, which didn't change anything. As SCSI controller I am using the recommended VirtIO SCSI controller.
Do you have any suggestions to improve the NVMe speed, or is it normal that the write speed inside the VM is only about a third of the rados benchmark?
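For reference, the per-disk cache mode can be changed with qm set; a hypothetical example for the test VM below (VM ID and volume taken from the config further down, the cache mode is just an illustration):
Bash:
# Switch the cache mode of the scsi0 disk of VM 112 (e.g. to writethrough)
qm set 112 --scsi0 SSDPool:vm-112-disk-0,cache=writethrough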
 
Can you please post the rados bench command you used?
Rados Write Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 write -b 4M -t 16 --run-name `hostname` --no-cleanup
Rados Read Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 seq -t 16 --run-name `hostname`
Rados Random Read Benchmark:
Bash:
rados bench -p <SSDPool|NVMePool> 600 rand -t 16 --run-name `hostname`
Can you post the VM configs?
VM Config:
agent: enabled=1
boot: c
bootdisk: scsi0
cores: 2
ide2: SSDPool:vm-112-cloudinit,media=cdrom
ipconfig0: xxx
memory: 2048
name: SSD-TEST01
net0: virtio=3E:4C:D0:5D:A1:26,bridge=vmbr0,tag=1003
onboot: 1
scsi0: SSDPool:vm-112-disk-0,size=51404M (or NVMePool for the NVMe test VMs)
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=3b022e12-e572-4e37-b01e-a7f7babc704b
sockets: 2
sshkeys: xxx
vga: std
vmgenid: xxx
How do you benchmark inside the VMs?
Bash:
sync
echo 3 > /proc/sys/vm/drop_caches                              # drop the page cache first
dd if=/dev/zero of=/root/temp oflag=direct bs=128k count=256K  # our actual performance test (32 GiB write, direct I/O)
echo "DD Read Test"
dd if=/root/temp of=/dev/null bs=1M count=32K                  # 32 GiB read
echo "DD Read without Cache"
/sbin/sysctl -w vm.drop_caches=3                               # drop the page cache again
dd if=/root/temp of=/dev/null bs=1M count=32K                  # 32 GiB read after dropping caches
I also checked your benchmark paper and tried the fio test inside both VMs:
Bash:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1
Results:
SSD Write: WRITE: bw=139MiB/s (145MB/s), 139MiB/s-139MiB/s (145MB/s-145MB/s), io=81.3GiB (87.3GB), run=600021-600021msec
NVMe Write: WRITE: bw=166MiB/s (174MB/s), 166MiB/s-166MiB/s (174MB/s-174MB/s), io=97.1GiB (104GB), run=600016-600016msec
Bash:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=read --bs=4M --numjobs=1 --iodepth=1
Results:
SSD Read: READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=186GiB (199GB), run=600006-600006msec
NVMe Read: READ: bw=393MiB/s (412MB/s), 393MiB/s-393MiB/s (412MB/s-412MB/s), io=230GiB (247GB), run=600003-600003msec
 
With iodepth=1 (or a single dd), you'll be limited by the latency of the network + the CPU time of the client (librbd) + the CPU time of Ceph.

Your rados bench uses 16 threads (-t 16), so try with iodepth=16 for your fio.
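Note that psync is a synchronous ioengine, so iodepth values above 1 have no effect there; to actually keep 16 requests in flight, an asynchronous engine is needed. A minimal sketch of the same test with the engine swapped to libaio (all other parameters as in the tests above):
Bash:
# Same write test as above, but with an asynchronous engine so that iodepth=16 actually queues 16 requests
fio --ioengine=libaio --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=16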
 
I am running the fio tests inside the virtual machines; they should not know anything about the Ceph storage underneath, nor should they be concerned about the performance or CPU time used for Ceph operations. I don't understand why I see 4.5 times higher write speeds for the NVMe pool in the rados benchmark (done on the Proxmox hosts), but inside the VMs I only get roughly 1.2 times higher write and read speeds with the NVMe pool compared to the SSD pool.

Anyway, I have now assigned 4 sockets with 4 cores each to the VMs and tried the fio benchmarks again with iodepth=16; nothing changed:

Code:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=16
Results:
SSD Write: WRITE: bw=139MiB/s (145MB/s), 139MiB/s-139MiB/s (145MB/s-145MB/s), io=81.2GiB (87.2GB), run=600011-600011msec
NVMe Write: WRITE: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s), io=98.4GiB (106GB), run=600014-600014msec

Code:
fio --ioengine=psync --filename=/tmp/test_fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --rw=read --bs=4M --numjobs=1 --iodepth=16
Results:
SSD Read: READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=186GiB (199GB), run=600003-600003msec
NVMe Read: READ: bw=394MiB/s (413MB/s), 394MiB/s-394MiB/s (413MB/s-413MB/s), io=231GiB (248GB), run=600004-600004msec
 
Hey everyone,

This may sound like a stupid question on this topic, but I would love some clarity on the numbers we are seeing.

Let me provide some background to the results for clarity.

As an example:


3 x hosts
256 GB RAM each
Intel Xeon v3/v4, 12 cores, dual socket
Mirrored ZFS boot
1 x OSD per host

The OSDs are all Intel S4610 enterprise SATA SSDs.

Ceph is using:

LACP bonded 2 x 10 Gbit NICs for the cluster network (20 Gbit half duplex / 40 Gbit full duplex in total)
LACP bonded 2 x 10 Gbit NICs for the public network
MTU 9000 (a quick sanity check is sketched below)
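A quick sanity check for the bond mode and jumbo frames might look like this (interface name and peer address are assumptions):
Bash:
# Check the bond mode/speed and verify that 9000-byte frames pass end to end
grep -E 'Bonding Mode|Speed' /proc/net/bonding/bond0
ping -M do -s 8972 <IP-of-another-node>   # 8972 bytes payload + 28 bytes headers = 9000 MTU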

Results:

Bash:
rados bench -p device_health_metrics 600 rand -t 16 --no-cleanup

Total time run:       600.133
Total reads made:     130825
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   871.973
Average IOPS:         217
Stddev IOPS:          14.5926
Max IOPS:             269
Min IOPS:             176
Average Latency(s):   0.0729499
Max latency(s):       0.285462
Min latency(s):       0.00259099

Bash:
rados bench -p device_health_metrics 600 write -b 4M -t 16 --no-cleanup

Total time run:         600.2
Total writes made:      63954
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     426.218
Stddev Bandwidth:       31.365
Max bandwidth (MB/sec): 464
Min bandwidth (MB/sec): 132
Average IOPS:           106
Stddev IOPS:            7.84125
Max IOPS:               116
Min IOPS:               33
Average Latency(s):     0.150138
Stddev Latency(s):      0.0355427
Max latency(s):         1.25835
Min latency(s):         0.0224209

My main question comes down to IOPS: why are they so low?

The base performance of a single disk is well in excess of the IOPS we are seeing.

We can see the network is at about 85% saturation, so shouldn't we be seeing more IOPS here?

What am I missing?
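One thing to keep in mind: rados bench works on whole objects (4 MiB by default, as shown in the output above), so the reported IOPS are simply the bandwidth divided by 4 MiB rather than small-block IOPS:
Bash:
# Bandwidth divided by the 4 MiB object size reproduces the reported average IOPS
echo "871.973 / 4" | bc -l   # ~218 -> matches the 217 average IOPS of the rand read run
echo "426.218 / 4" | bc -l   # ~107 -> matches the 106 average IOPS of the write run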

Hungry to learn more :)

""Cheers
G
 
OK, I think I have worked out the issue.

The rados benchmark isn't pushing the drives hard enough to reach the maximum IOPS available for reads/writes.

I have performed some VM drive tests with CrystalMark and can see much higher IOPS in the VM benchmark, observing the IOPS consumed in PVE > Datacenter > Ceph performance monitoring.

Quite impressed from some basic tests so far on what can be achieved with Ceph.

I will try to get some concurrent VM fio tests going to really give it a push (see the sketch below), and I can already see that the network will be the bottleneck here: even with 10 Gbit LACP bonds and separate networks, the network will be saturated before the drives reach their full performance potential.
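Something along these lines per VM, started in parallel, should generate the small-block load (file name, size, and job count here are assumptions):
Bash:
# Sketch of a 4k random-write run to start in several VMs at the same time
fio --ioengine=libaio --filename=/root/fio_test --size=4G --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=120 --name=randwrite --group_reporting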

very cool! (and quietly impressed so far)

I will report back once we have set up a number of VMs across each host and run the tests in parallel, if anyone is interested.

""Cheers
G
 
The rados benchmark isn't pushing the drives hard enough to reach the maximum IOPS available for reads/writes.
You can start a rados benchmark on each node in the cluster, just make sure to pass the --run-name <label> parameter, so they don't interfere with each other.
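For example, started on every node at roughly the same time (mirroring the write benchmark posted above):
Bash:
# Run on each node in parallel; a unique --run-name per host keeps the benchmark objects apart
rados bench -p <SSDPool|NVMePool> 600 write -b 4M -t 16 --run-name "$(hostname)" --no-cleanup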
 
