CEPH 4k random read/write QD1 performance problem

jenst88

New Member
Feb 17, 2020
Hi

After days of searching and trying different things, I am looking for some advice on how to solve this problem :).

The performance issue shows up in the VMs on Proxmox. They run some old software that needs good 4K random read/write performance, so this has a big impact. With or without the Optane 900P SSDs for the BlueStore DBs, the results are much the same.

This is a newly created 3-node Proxmox VE cluster with Ceph.
Server specs:

Dell R630
2x Intel E5-2660 v3
128 GB RAM
2x 250 GB SSD (ProxmoxVE)
6x 960GB Intel D3-S4510 SSD (OSD)
1x Intel Optane 900P (Bluestore DB)

SSD direct benchmark

- fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdd --bs=4k --iodepth=1 --size=4G --readwrite=randread

-> read: IOPS=34.5k, BW=135MiB/s (141MB/s)(4096MiB/30388msec)

- fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdd --bs=4k --iodepth=1 --size=4G --readwrite=randwrite

-> write: IOPS=28.0k, BW=113MiB/s (119MB/s)(4096MiB/36198msec)

CEPH Pool benchmark

rbd_iodepth32: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

-> [r=6288KiB/s][r=1572 IOPS]


rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

-> [w=3359KiB/s][w=839 IOPS]
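
The output above comes from fio's rbd engine; a command along these lines produces that kind of run (the pool and image names are placeholders, not the actual ones from this cluster):

- fio --ioengine=rbd --pool=<testpool> --rbdname=<testimage> --direct=1 --bs=4k --iodepth=1 --numjobs=1 --rw=randread --runtime=60 --time_based --name=rbd_iodepth32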


CEPH Config

Tried both the default config and a modified config (below), but there is no noticeable difference (a quick way to compare the running config against the defaults is sketched after the config).


[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.19.40.15/24
fsid = 884a21ce-7386-4dcd-a930-2318d472fb15
mon_allow_pool_delete = true
mon_host = 10.19.40.15 10.19.40.16 10.19.40.17
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.19.40.15/24
debug bluestore = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0

[osd]
bluestore_block_wal_create = false
bluestore_block_db_create = true
bluestore_fsck_on_mkfs = false
bdev_aio_max_queue_depth = 1024
bluefs_min_flush_size = 65536
bluestore_min_alloc_size = 4096
bluestore_max_blob_size = 65536
bluestore_max_contexts_per_kv_batch = 64
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,disableWAL=false,compaction_readahead_size=2097152"

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
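
As a side note, one way to check which of these options actually deviate from the built-in defaults on a running OSD is to query its admin socket, roughly like this (osd.0 is just an example id):

ceph daemon osd.0 config diff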


Network latency

Network is 10Gbit SFP+ over Force10 S4810 switches.

21 packets transmitted, 21 received, 0% packet loss, time 507ms
rtt min/avg/max/mdev = 0.044/0.086/0.113/0.021 ms


Any ideas? :).
 
Ceph uses a 4 MB object size by default, which is a compromise between bandwidth and speed. Add cache=writeback to the VM's disk config; it translates to Ceph caching (librbd). This should increase performance, provided the OS (filesystem) in the VM knows how to flush.
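
On the Proxmox side that is just the cache option on the VM's disk line; something along these lines, where the VMID, storage name, and disk volume are placeholders:

qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writeback

The same option can be set in the GUI via the disk's Cache dropdown, or by editing the scsi0 line in the VM's config under /etc/pve/qemu-server/.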
 

The write speed (inside a Windows Server 2019 VM with Virtio SCSI single and SCSI disk) is indeed better with the write back cache but the read speed @ 4K Q1T1 is still very slow. Shouldn't read speeds be close to the actual SSD read speed?

[Attached screenshot: server2019-benchmark.PNG]
 
Interested to know if you figured this one out; I am looking at an SSD Ceph cluster for small random reads/writes as well. It seems like you should be getting far better numbers.
 
[osd]
bluestore_block_wal_create = false
bluestore_block_db_create = true
bluestore_fsck_on_mkfs = false
bdev_aio_max_queue_depth = 1024
bluefs_min_flush_size = 65536
bluestore_min_alloc_size = 4096
bluestore_max_blob_size = 65536
bluestore_max_contexts_per_kv_batch = 64
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,disableWAL=false,compaction_readahead_size=2097152"
You made lots of changes. Some of those may not be good for the intended workload.

The write speed (inside a Windows Server 2019 VM with Virtio SCSI single and SCSI disk) is indeed better with the write back cache but the read speed @ 4K Q1T1 is still very slow. Shouldn't read speeds be close to the actual SSD read speed?
9.47 MB/s = 2424 IO/s, ~60% better than the fio benchmark on the Ceph pool. Keep in mind that any 4K read has to travel over the network and is served from a 4 MB object.
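
As a rough back-of-the-envelope check (assuming latency is the only limit at queue depth 1): 1572 read IOPS works out to about 1 s / 1572 ≈ 0.64 ms per 4K read, spent on the network round trip plus OSD/BlueStore processing, while the bare SSD at 34.5k IOPS needs only ~0.03 ms per read. At QD1 the per-request cluster latency sets the ceiling, not the SSD, which is why single-threaded 4K numbers stay far below the raw drive.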
 
