Ceph 4K random read/write QD1 performance problem

jenst88
Feb 17, 2020
Hi

After days of searching and trying different things, I am looking for some advice on how to solve this problem :).

The performance issue shows up inside the VMs on Proxmox. They run some old software that needs good random 4K read/write performance, so this has a big impact. With or without the Optane 900P for the BlueStore DBs, there is not much of a difference.

Newly created 3-node Proxmox VE cluster with Ceph.
Server specs:

Dell R630
2x Intel E5-2660 v3
128 GB RAM
2x 250 GB SSD (ProxmoxVE)
6x 960GB Intel D3-S4510 SSD (OSD)
1x Intel Optane 900P (Bluestore DB)

SSD direct benchmark

- fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdd --bs=4k --iodepth=1 --size=4G --readwrite=randread

-> read: IOPS=34.5k, BW=135MiB/s (141MB/s)(4096MiB/30388msec)

- fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdd --bs=4k --iodepth=1 --size=4G --readwrite=randwrite

-> write: IOPS=28.0k, BW=113MiB/s (119MB/s)(4096MiB/36198msec)

Ceph pool benchmark

rbd_iodepth32: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

-> [r=6288KiB/s][r=1572 IOPS]


rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

-> [w=3359KiB/s][w=839 IOPS]
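
For reference, the numbers above come from fio's rbd engine at iodepth=1; the exact command isn't shown here, but a roughly equivalent invocation (pool, image and client names are placeholders) would be:

- fio --ioengine=rbd --clientname=admin --pool=&lt;poolname&gt; --rbdname=&lt;imagename&gt; --name=rbd_iodepth32 --bs=4k --iodepth=1 --size=4G --readwrite=randread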


Ceph config

Tried with the default config and with the modified config (below), but there is no noticeable difference.


[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.19.40.15/24
fsid = 884a21ce-7386-4dcd-a930-2318d472fb15
mon_allow_pool_delete = true
mon_host = 10.19.40.15 10.19.40.16 10.19.40.17
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.19.40.15/24
debug bluestore = 0/0
debug bluefs = 0/0
debug bdev = 0/0
debug rocksdb = 0/0

[osd]
bluestore_block_wal_create = false
bluestore_block_db_create = true
bluestore_fsck_on_mkfs = false
bdev_aio_max_queue_depth = 1024
bluefs_min_flush_size = 65536
bluestore_min_alloc_size = 4096
bluestore_max_blob_size = 65536
bluestore_max_contexts_per_kv_batch = 64
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,disableWAL=false,compaction_readahead_size=2097152"

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring


Network latency

Network is 10 Gbit SFP+ over Force10 S4810 switches.

21 packets transmitted, 21 received, 0% packet loss, time 507ms
rtt min/avg/max/mdev = 0.044/0.086/0.113/0.021 ms


Any ideas? :).
 
Ceph uses a 4 MB object size by default; this is a compromise between bandwidth and speed. Add cache=writeback to the VM's disk config; it translates to Ceph's librbd caching. This should increase the performance, provided the OS (filesystem) in the VM knows how to flush.
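For example, that can be set in the GUI under VM -> Hardware -> Hard Disk -> Cache, or on the CLI (VM ID, storage and volume name below are placeholders for the actual disk):

qm set &lt;vmid&gt; --scsi0 &lt;storage&gt;:vm-&lt;vmid&gt;-disk-0,cache=writeback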
 

The write speed (inside a Windows Server 2019 VM with VirtIO SCSI single and a SCSI disk) is indeed better with the writeback cache, but the read speed at 4K Q1T1 is still very slow. Shouldn't read speeds be close to the actual SSD read speed?

(Attachment: server2019-benchmark.PNG)
 
Interested to know if you figured this one out; I am looking at an SSD Ceph cluster for small random reads/writes as well. Seems like you should be getting way better numbers.
 
[osd]
bluestore_block_wal_create = false
bluestore_block_db_create = true
bluestore_fsck_on_mkfs = false
bdev_aio_max_queue_depth = 1024
bluefs_min_flush_size = 65536
bluestore_min_alloc_size = 4096
bluestore_max_blob_size = 65536
bluestore_max_contexts_per_kv_batch = 64
bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,disableWAL=false,compaction_readahead_size=2097152"
You made lots of changes. Some of those may not be good for the intended workload.
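If in doubt, the admin socket can show which options on a running OSD actually differ from the built-in defaults (run this on the node hosting the OSD; osd.0 is just an example id):

ceph daemon osd.0 config diff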

The write speed (inside a Windows Server 2019 VM with VirtIO SCSI single and a SCSI disk) is indeed better with the writeback cache, but the read speed at 4K Q1T1 is still very slow. Shouldn't read speeds be close to the actual SSD read speed?
9.47 MB/s = 2424 IO/s, ~60% better than the fio benchmark on the Ceph pool. After all, you need to remember that every 4K read has to travel over the network to the OSD holding the corresponding 4 MB object.
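As a rough, back-of-the-envelope illustration (the OSD-side latency figure is an assumption, not a measurement): at QD1 each read waits for the full round trip before the next one is issued, so the ceiling is simply 1 / per-IO latency. With ~0.09 ms network RTT (from the ping above) plus a few hundred microseconds of OSD/BlueStore/librbd processing, a 4K read costs roughly 0.4-0.6 ms, which caps you at about 1700-2500 IOPS at queue depth 1, no matter how fast the individual SSD is.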
 
