Disabling Write Cache on SSDs with Ceph

sahostking

From that link - very interesting.

Quick guide for optimizing Ceph for random reads/writes:


  • Only use SSDs and NVMe with supercaps. A hint: 99% of desktop SSDs/NVMe don't have supercaps.
  • Disable their cache with hdparm -W 0.
  • Disable powersave: governor=performance, cpupower idle-set -D 0 (see the sketch after this list).
  • Disable signatures: cephx_require_signatures = false, cephx_cluster_require_signatures = false, cephx_sign_messages = false (and use -o nocephx_require_signatures,nocephx_sign_messages for rbd map and cephfs kernel mounts).
  • For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs).
  • At least until Nautilus: [global] debug objecter = 0/0 (there is a big client-side slowdown).
  • Try to disable the rbd cache in the userspace driver (QEMU option cache=none).
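A minimal sketch of the powersave-disable step, assuming the linux-cpupower package is installed:

Code:
# set the performance governor on all cores
cpupower frequency-set -g performance
# disable every idle state with a wakeup latency above 0, i.e. all deep C-states
cpupower idle-set -D 0

Both settings are lost on reboot; see the GRUB notes further down in the thread for making them permanent.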



Add these lines to ceph.conf under [global]:

cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false
rbd_cache = false
debug objecter = 0/0

Before deploying the OSDs, also set:

min_alloc_size = 4096
prefer_deferred_size_ssd = 0

For rbd map and cephfs kernel mounts, pass the options:

-o nocephx_require_signatures,nocephx_sign_messages

For VMs, use the QEMU option cache=none.

///////////////////////////////////////

Set the scaling governor:

echo "schedutil" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor (or "performance")

If you want to set it permanently, add "cpufreq.default_governor=schedutil" to the kernel command line.
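For example, on a Debian/Proxmox node using GRUB (a sketch; the existing "quiet" flag is an assumption):

Code:
# /etc/default/grub - append the parameter to the existing line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet cpufreq.default_governor=schedutil"
# then regenerate the GRUB config and reboot:
update-grub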

Run this command to disable the write cache on the SSD:

hdparm -W 0 /dev/sdX
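Note that hdparm -W 0 does not survive a reboot. One way to make it persistent is a udev rule; a sketch (the rule file name and the sd[a-z] device match are assumptions):

Code:
# /etc/udev/rules.d/99-ssd-write-cache.rules
# on add/change of any non-rotational sd* device, switch its write cache off
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"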

Check your OSD config:
ceph daemon osd.0 config show | grep min_alloc_size

Set it with:

ceph config set osd bluestore_min_alloc_size_ssd 4096

(This only affects OSDs created afterwards; existing OSDs keep their allocation size.)
 
  • Only use SSDs and NVMe with supercaps. A hint: 99% of desktop SSDs/NVMe don't have supercaps.
Yes, sure - don't use consumer SSDs for Ceph (or ZFS). You need fast sync writes for the BlueStore journal/metadata.
  • Disable their cache with hdparm -W 0.
It depends on the model - you need to test it.
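One way to test it: run the same 4k sync-write benchmark with the cache on (hdparm -W 1) and off (hdparm -W 0) and compare. A fio sketch (this destroys data on the target; /dev/sdX is a placeholder):

Code:
fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based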
  • Disable powersave: governor=performance, cpupower idle-set -D 0
Definitely. I'm using:
idle=poll intel_idle.max_cstate=0 processor.max_cstate=1
in my GRUB config (works for Intel/AMD).
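Applied via GRUB, that would look like this (a sketch, same pattern as the governor setting earlier in the thread):

Code:
# append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   idle=poll intel_idle.max_cstate=0 processor.max_cstate=1
update-grub          # then reboot
cpupower idle-info   # verify which idle states remain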
  • Disable signatures: cephx_require_signatures = false, cephx_cluster_require_signatures = false, cephx_sign_messages = false (and use -o nocephx_require_signatures,nocephx_sign_messages for rbd map and cephfs kernel mounts)
It's better than in the past, but you can still gain a small percentage. (Be careful: you no longer have authentication, which matters if you want to share CephFS, for example.)
  • For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs)
That has been the default for the last 2-3 releases.
  • At least until Nautilus: [global] debug objecter = 0/0 (there is a big client-side slowdown)
I'm still doing it, disabling all debug logging to be sure.

  • Try to disable the rbd cache in the userspace driver (QEMU option cache=none)
I disagree. In the past (Nautilus?) it was slower for reads because of a global lock in that layer.
Since Octopus there is a new implementation (writearound), and reads are now as fast as with cache=none.
(And writeback really helps for buffered writes.)


If you want a good boost (these become defaults in the next Ceph release, Reef), here is some BlueStore tuning:

Code:
[osd]
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
bluestore_allocator = bitmap
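OSDs only pick these up on restart; a quick sketch to restart and verify on one node (osd.0 is an assumption):

Code:
systemctl restart ceph-osd@0.service
ceph daemon osd.0 config get bluestore_allocator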

and
Code:
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
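These can also be injected at runtime without restarting the daemons; a sketch for a couple of them:

Code:
ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'
ceph tell mon.* injectargs '--debug_mon 0/0 --debug_paxos 0/0'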



I'm also currently adding a patch to Proxmox to re-enable the tcmalloc memory allocator for QEMU; it gives me a 30% performance gain (60k IOPS -> 90k IOPS) on 4k randread with a single virtio disk.
I have sent the patch to the pve-devel mailing list and am waiting for the merge.
 
Hey spirit, how is your spirit doing today :)
Thanks for your help with the last issue on my Proxmox - did you read what the issue was?

Thanks for the BlueStore hacks.
A 20% performance decrease because of logging is pretty sick - it should be disabled by default and only enabled for debugging or maintenance.
I will try your two suggestions.

I am working on an autoscaler script (not PG) based on CPU, RAM and storage consumption - it drives me nuts that when you create VMs, they aren't automatically moved to the best node (resource-wise).
 
If you want to try it, I have a patched QEMU version with tcmalloc enabled:

https://mutulin1.odiso.net:/pve-qemu-kvm_7.2.0-8_amd64.deb

I am working on an autoscaler script (not PG) based on CPU, RAM and storage consumption - it drives me nuts that when you create VMs, they aren't automatically moved to the best node (resource-wise).
Maybe it could help, but a new workload balancer is coming in the next version of Ceph. (Mostly for reads: instead of always reading from the primary PG, it will try to balance reads across less-used nodes - not sure about the algorithm.)

https://www.youtube.com/watch?v=0_wmRa5Lcc4
 
Some benchmarks:

1 virtio-scsi disk (cache=none, iothread); Ceph is replica x3, 3 nodes with 6 NVMe per node.

randread 4k, iodepth=128: 100k iops

randwrite 4k, iodepth=128 (buffered, no fsync): 56k iops

randwrite 4k, iodepth=128, sync write: 20k iops
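The fio invocations behind those numbers would look roughly like this (run inside the VM; /dev/sdb is a placeholder test disk and will be overwritten):

Code:
fio --name=bench --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=128 --runtime=60 --time_based
# for the randwrite runs use --rw=randwrite; add --fsync=1 for the sync-write case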
 

Very good job. So, from the initial thread, disabling the cache is not the ultimate solution to get the last piece of performance out of the SSD.
Do you think the tweaks and disabling the cache in combination result in more performance?
 
Disabling the cache on the physical SSD really depends on the model. (As I said, I've never had problems with datacenter Intel && Samsung drives; can't tell for other models.)

About VM cache=writeback: I will send benchmarks today - it really improves performance too (without a read penalty). The only thing is that if you have a power failure on the Proxmox node, you'll lose the last writes still in the buffer (but no VM corruption, as it correctly does flush/fsync).
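In Proxmox that's just the disk's cache option (a sketch; the VM id and disk spec are assumptions):

Code:
qm set 100 --scsi0 ceph-pool:vm-100-disk-0,cache=writeback,iothread=1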

The last write optimisation is to use an Optane drive on the Proxmox node and use the Ceph persistent writeback cache (which keeps fsync semantics). But I don't have the hardware to test it yet :)
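For reference, the client-side options for the RBD persistent write-back cache look like this (a sketch; the path and size are assumptions):

Code:
[client]
rbd_plugins = pwl_cache
rbd_persistent_cache_mode = ssd
rbd_persistent_cache_path = /mnt/optane/rbd-pwl
rbd_persistent_cache_size = 10737418240   # 10 GiB, in bytes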

The main problem currently is really low-iodepth sync writes. (It should improve in the future with the new Ceph Crimson OSD, but that won't be ready for another 1 or 2 years.)

Also, try to have the fastest CPU frequency possible to improve IOPS (both on the Proxmox nodes && the Ceph nodes). It's better to have fewer cores with a high frequency than more cores with a lower frequency.
 
Can you share the patch?
It's simply --enable-tcmalloc in the debian/rules file of the pve-qemu package:


Code:
commit a8aa39d35087076d298744cb83a6163a7063cb75 (HEAD -> master)
Author: Alexandre Derumier <aderumier@odiso.com>
Date:   Wed Feb 28 08:48:31 2024 +0100

    enable tcmalloc

diff --git a/debian/rules b/debian/rules
index 51f56c5..d73cb56 100755
--- a/debian/rules
+++ b/debian/rules
@@ -76,7 +76,9 @@ endif
            --enable-usb-redir \
            --enable-virglrenderer \
            --enable-virtfs \
-           --enable-zstd
+           --enable-zstd \
+           --enable-tcmalloc
+
 
 build: build-arch build-indep
 build-arch: build-stamp
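To build the patched package yourself, a sketch assuming the standard Proxmox build flow:

Code:
git clone git://git.proxmox.com/git/pve-qemu.git
cd pve-qemu
# add --enable-tcmalloc to debian/rules as in the diff above, then:
make deb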
 
