Disabling Write Cache on SSDs with Ceph

sahostking

From that link - very interesting.

Quick guide for optimizing Ceph for random reads/writes:


  • Only use SSDs and NVMe with supercaps. A hint: 99% of desktop SSDs/NVMe don't have supercaps.
  • Disable their cache with hdparm -W 0.
  • Disable powersave: governor=performance, cpupower idle-set -D 0 (see the sketch after this list).
  • Disable signatures: cephx_require_signatures = false, cephx_cluster_require_signatures = false, cephx_sign_messages = false (and use -o nocephx_require_signatures,nocephx_sign_messages for rbd map and cephfs kernel mounts).
  • For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs).
  • At least until Nautilus: [global] debug objecter = 0/0 (there is a big client-side slowdown).
  • Try to disable the rbd cache in the userspace driver (QEMU option cache=none).
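A minimal sketch of the powersave-disable step, assuming the linux-cpupower package is installed:

Code:
# set the performance governor on all cores
cpupower frequency-set -g performance
# disable every idle state with a wakeup latency above 0, i.e. all deep C-states
cpupower idle-set -D 0

Both settings are lost on reboot; see the GRUB notes further down in the thread for making them permanent.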



Add these lines to ceph.conf under [global]:

cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false
rbd_cache = false
debug objecter = 0/0

Before deploying the OSDs, also set:

min_alloc_size = 4096
prefer_deferred_size_ssd = 0

For rbd map and cephfs kernel mounts, pass the options:

-o nocephx_require_signatures,nocephx_sign_messages

For VMs, use the QEMU option cache=none.

///////////////////////////////////////

Set the scaling governor:

echo "schedutil" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor (or "performance")

If you want to set it permanently, add "cpufreq.default_governor=schedutil" to the kernel command line.
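For example, on a Debian/Proxmox node using GRUB (a sketch; the existing "quiet" flag is an assumption):

Code:
# /etc/default/grub - append the parameter to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet cpufreq.default_governor=schedutil"
# then regenerate the GRUB config and reboot:
update-grub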

Run this command to disable the write cache on the SSD:

hdparm -W 0 /dev/sdX
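Note that hdparm -W 0 does not survive a reboot. One way to make it persistent is a udev rule; a sketch (the rule file name and the sd[a-z] device match are assumptions):

Code:
# /etc/udev/rules.d/99-ssd-write-cache.rules
# on add/change of any non-rotational sd* device, switch its write cache off
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"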

Check your OSD config:
ceph daemon osd.0 config show | grep min_alloc_size

Set it with:

ceph config set osd bluestore_min_alloc_size_ssd 4096

(This only affects OSDs created afterwards; existing OSDs keep their allocation size.)
 
  • Only use SSDs and NVMe with supercaps. A hint: 99% of desktop SSDs/NVMe don't have supercaps.
Yes, sure - don't use consumer SSDs for Ceph (or ZFS). You need fast sync writes for the BlueStore journal/metadata.
  • Disable their cache with hdparm -W 0.
It depends on the model - you need to test it.
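One way to test it: run the same 4k sync-write benchmark with the cache on (hdparm -W 1) and off (hdparm -W 0) and compare. A fio sketch (this destroys data on the target; /dev/sdX is a placeholder):

Code:
fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based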
  • Disable powersave: governor=performance, cpupower idle-set -D 0
Definitely. I'm using:
idle=poll intel_idle.max_cstate=0 processor.max_cstate=1
in my GRUB config (works for Intel/AMD).
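Applied via GRUB, that would look like this (a sketch, same pattern as the governor setting earlier in the thread):

Code:
# append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   idle=poll intel_idle.max_cstate=0 processor.max_cstate=1
update-grub          # then reboot
cpupower idle-info   # verify which idle states remain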
  • Disable signatures: cephx_require_signatures = false, cephx_cluster_require_signatures = false, cephx_sign_messages = false (and use -o nocephx_require_signatures,nocephx_sign_messages for rbd map and cephfs kernel mounts)
It's better than in the past, but you can still gain a small percentage. (Be careful: you no longer have authentication, which matters if you want to share CephFS, for example.)
  • For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs)
That has been the default for the last 2-3 releases.
  • At least until Nautilus: [global] debug objecter = 0/0 (there is a big client-side slowdown)
I'm still doing it, disabling all debug logging to be sure.

  • Try to disable the rbd cache in the userspace driver (QEMU option cache=none)
I disagree. In the past (Nautilus?) it was slower for reads because of a global lock in that layer.
Since Octopus there is a new implementation (writearound), and reads are now as fast as with cache=none.
(And writeback really helps for buffered writes.)


If you want a good boost (these become defaults in the next Ceph release, Reef), here is some BlueStore tuning:

Code:
[osd]
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
bluestore_allocator = bitmap
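OSDs only pick these up on restart; a quick sketch to restart and verify on one node (osd.0 is an assumption):

Code:
systemctl restart ceph-osd@0.service
ceph daemon osd.0 config get bluestore_allocator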

and
Code:
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
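These can also be injected at runtime without restarting the daemons; a sketch for a couple of them:

Code:
ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'
ceph tell mon.* injectargs '--debug_mon 0/0 --debug_paxos 0/0'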



I'm also currently adding a patch to Proxmox to re-enable the tcmalloc memory allocator for QEMU; it gives me a 30% performance gain (60k IOPS -> 90k IOPS) on 4k randread with a single virtio disk.
I have sent the patch to the pve-devel mailing list and am waiting for the merge.
 
Hey spirit, how is your spirit doing today :)
Thanks for your help with the last issue on my Proxmox - did you read what the issue was?

Thanks for the BlueStore hacks.
A 20% performance decrease because of logging is pretty sick - it should be disabled by default and only enabled for debugging or maintenance.
I will try your two suggestions.

I am working on an autoscaler script (not PG) based on CPU, RAM and storage consumption - it drives me nuts that when you create VMs, they aren't automatically moved to the best node (resource-wise).
 
If you want to try it, I have a patched QEMU version with tcmalloc enabled:

https://mutulin1.odiso.net:/pve-qemu-kvm_7.2.0-8_amd64.deb

I am working on an autoscaler script (not PG) based on CPU, RAM and storage consumption - it drives me nuts that when you create VMs, they aren't automatically moved to the best node (resource-wise).
Maybe it could help, but a new workload balancer is coming in the next version of Ceph. (Mostly for reads: instead of always reading from the primary PG, it will try to balance reads across less-used nodes - not sure about the algorithm.)

https://www.youtube.com/watch?v=0_wmRa5Lcc4
 
Some benchmarks:

1 virtio-scsi disk (cache=none, iothread); Ceph is replica x3, 3 nodes with 6 NVMe per node.

randread 4k, iodepth=128: 100k iops

randwrite 4k, iodepth=128 (buffered, no fsync): 56k iops

randwrite 4k, iodepth=128, sync write: 20k iops
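The fio invocations behind those numbers would look roughly like this (run inside the VM; /dev/sdb is a placeholder test disk and will be overwritten):

Code:
fio --name=bench --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=128 --runtime=60 --time_based
# for the randwrite runs use --rw=randwrite; add --fsync=1 for the sync-write case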
 

Very good job. So, from the initial thread, disabling the cache is not the ultimate solution to get the last piece of performance out of the SSD.
Do you think the tweaks and disabling the cache in combination result in more performance?
 
Disabling the cache on the physical SSD really depends on the model. (As I said, I've never had problems with datacenter Intel && Samsung drives; can't tell for other models.)

About VM cache=writeback: I will send benchmarks today - it really improves performance too (without a read penalty). The only thing is that if you have a power failure on the Proxmox node, you'll lose the last writes still in the buffer (but no VM corruption, as it correctly does flush/fsync).
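In Proxmox that's just the disk's cache option (a sketch; the VM id and disk spec are assumptions):

Code:
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writeback,iothread=1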

The last write optimisation is to use an Optane drive on the Proxmox node and use the Ceph persistent writeback cache (which keeps fsync semantics). But I don't have the hardware to test it yet :)
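For reference, the client-side options for the RBD persistent write-back cache look like this (a sketch; the path and size are assumptions):

Code:
[client]
rbd_plugins = pwl_cache
rbd_persistent_cache_mode = ssd
rbd_persistent_cache_path = /mnt/optane/rbd-pwl
rbd_persistent_cache_size = 10737418240   # 10 GiB, in bytes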

The main problem currently is really low-iodepth sync writes. (It should improve in the future with the new Ceph Crimson OSD, but that won't be ready for another 1 or 2 years.)

Also, try to have the fastest CPU frequency possible to improve IOPS (both on the Proxmox nodes && the Ceph nodes). It's better to have fewer cores with a high frequency than more cores with a lower frequency.
 
Can you share the patch?
It's simply --enable-tcmalloc in the debian/rules file of the pve-qemu package:


Code:
commit a8aa39d35087076d298744cb83a6163a7063cb75 (HEAD -> master)
Author: Alexandre Derumier <aderumier@odiso.com>
Date:   Wed Feb 28 08:48:31 2024 +0100

    enable tcmalloc

diff --git a/debian/rules b/debian/rules
index 51f56c5..d73cb56 100755
--- a/debian/rules
+++ b/debian/rules
@@ -76,7 +76,9 @@ endif
            --enable-usb-redir \
            --enable-virglrenderer \
            --enable-virtfs \
-           --enable-zstd
+           --enable-zstd \
+           --enable-tcmalloc
+
 
 build: build-arch build-indep
 build-arch: build-stamp
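To build the patched package yourself, a sketch assuming the standard Proxmox build flow:

Code:
git clone git://git.proxmox.com/git/pve-qemu.git
cd pve-qemu
# add --enable-tcmalloc to debian/rules as in the diff above, then:
make deb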
 
