Ceph Slow Performance On All Flash NVME

Teapot

Apr 9, 2019
Hello,
We have 6 servers and 36 x 3.84 TB NVMe OSDs (all enterprise Gen4 PCIe NVMe drives).
Sometimes when I import a VM disk it imports at 60 MB/s, but other times it imports at 1 GB/s.

VM Speedtest: https://prnt.sc/miiSD_s_F7xC
I checked CPU and RAM status on all nodes; everything is fine.
The NVMe disks are in writeback mode.
The CPU power policy on all nodes is set to performance.
MTU 9216, 2 x 100G LACP with layer 3+4 hashing.
Total PGs: 1024
Ceph version 17.2.


What could cause this?

Thanks.
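One way to narrow down whether the variation comes from the cluster itself or from the VM/import path is to bench the pool directly with rados bench, bypassing QEMU/librbd entirely. A sketch; the pool name `testpool` is a placeholder for a throwaway pool, do not point this at production data:

```shell
# Sketch: bench the Ceph cluster directly, bypassing the VM layer.
# "testpool" is a placeholder for a scratch pool with no real data.
rados_bench() {
  pool="${1:-testpool}"
  rados bench -p "$pool" 60 write --no-cleanup  # 60s write benchmark
  rados bench -p "$pool" 60 seq                 # sequential read of those objects
  rados -p "$pool" cleanup                      # remove the benchmark objects
}
# usage: rados_bench testpool
```

If rados bench is consistently fast while imports swing between 60 MB/s and 1 GB/s, the problem is more likely in the client path than in the OSDs.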
 
Ceph.conf

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.254.254.10/24
debug_asok = 0/0
debug_auth = 0/0
debug_bdev = 0/0
debug_bluefs = 0/0
debug_bluestore = 0/0
debug_buffer = 0/0
debug_civetweb = 0/0
debug_client = 0/0
debug_compressor = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_crypto = 0/0
debug_dpdk = 0/0
debug_eventtrace = 0/0
debug_filer = 0/0
debug_filestore = 0/0
debug_finisher = 0/0
debug_fuse = 0/0
debug_heartbeatmap = 0/0
debug_javaclient = 0/0
debug_journal = 0/0
debug_journaler = 0/0
debug_kinetic = 0/0
debug_kstore = 0/0
debug_leveldb = 0/0
debug_lockdep = 0/0
debug_mds = 0/0
debug_mds_balancer = 0/0
debug_mds_locker = 0/0
debug_mds_log = 0/0
debug_mds_log_expire = 0/0
debug_mds_migrator = 0/0
debug_memdb = 0/0
debug_mgr = 0/0
debug_mgrc = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_ms = 0/0
debug_none = 0/0
debug_objclass = 0/0
debug_objectcacher = 0/0
debug_objecter = 0/0
debug_optracker = 0/0
debug_osd = 0/0
debug_paxos = 0/0
debug_perfcounter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_rbd_mirror = 0/0
debug_rbd_replay = 0/0
debug_refs = 0/0
debug_reserver = 0/0
debug_rgw = 0/0
debug_rocksdb = 0/0
debug_striper = 0/0
debug_throttle = 0/0
debug_timer = 0/0
debug_tp = 0/0
debug_xio = 0/0
fsid = d53c769a-dfa6-42ba-9176-9da308cad967
mon_allow_pool_delete = true
mon_host = 10.254.254.11 10.254.254.13 10.254.254.10 10.254.254.14 10.254.254.12
mon_max_pg_per_osd = 800
ms_bind_ipv4 = true
ms_bind_ipv6 = false
ms_type = async
osd_pool_default_min_size = 2
osd_pool_default_size = 3
perf = True
public_network = 10.254.254.10/24
rocksdb_perf = True

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
rbd_cache = True
rbd_cache_max_dirty = 134217728
rbd_cache_max_dirty_age = 30
rbd_cache_max_dirty_object = 2
rbd_cache_size = 335544320
rbd_cache_target_dirty = 235544320
rbd_cache_writethrough_until_flush = False

[osd]
bluestore_extent_map_shard_max_size = 200
bluestore_extent_map_shard_min_size = 50
bluestore_extent_map_shard_target_size = 100
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
osd_disk_threads = 8
osd_enable_op_tracker = false
osd_max_pg_log_entries = 10
osd_memory_target = 14150664191
osd_min_pg_log_entries = 10
osd_op_threads = 16
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10

[mon.node0]
public_addr = 10.254.254.10

[mon.node1]
public_addr = 10.254.254.11

[mon.node2]
public_addr = 10.254.254.12

[mon.node3]
public_addr = 10.254.254.13

[mon.node4]
public_addr = 10.254.254.14
 
Hi @Teapot,
the configuration looks good.
What kind of switch configuration are you using for the LACP?
I have seen problems several times with MLAG on e.g. Dell, Cisco, and Arista switches.
Have you run iperf across the LACP bond yet?
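For reference, a quick bond test might look like this (a sketch; assumes iperf3 is installed on two nodes, and 10.254.254.11 is just an example target from this thread). With layer 3+4 hashing a single TCP stream only ever uses one physical link, so parallel streams are needed to exercise the whole 2x100G bond:

```shell
# Sketch: assumes iperf3 on both ends; 10.254.254.11 is an example
# target taken from the cluster network in this thread.
bond_check() {
  target="${1:-10.254.254.11}"
  iperf3 -c "$target" -t 30        # single stream: at most one 100G link
  iperf3 -c "$target" -t 30 -P 8   # 8 parallel streams: can use both links
}
# on the target node, run the server side first: iperf3 -s
```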
 
Hi @Teapot,
the configuration looks good.
What kind of switch configuration are you using for the LACP?
I have seen problems several times with MLAG on e.g. Dell, Cisco, and Arista switches.
Have you run iperf across the LACP bond yet?
Yes, I tested with iperf. With multiple threads I can get 100-110 Gbps.
Also, when I restart an OSD from the Proxmox Ceph GUI, the OSD won't start again after stopping. I have to completely delete the disk and add it back.
 
Small improvements:
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx

You can change these to "none", but I don't know if it can be changed on a running system; maybe somebody can confirm it.

I will adapt your settings from your initial post. Can I apply these entries on a running system?
 
Small improvements:
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx

You can change these to "none", but I don't know if it can be changed on a running system; maybe somebody can confirm it.

I will adapt your settings from your initial post. Can I apply these entries on a running system?
You need to restart all OSDs/MONs and VMs.
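For reference, the change being discussed would look like this in ceph.conf. Note that this disables authentication entirely, so it is only reasonable on a fully isolated, trusted cluster network:

```
[global]
auth_client_required = none
auth_cluster_required = none
auth_service_required = none
```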
 
If you want a big boost, you can try this QEMU version, compiled with tcmalloc:

https://mutulin1.odiso.net/pve-qemu-kvm_7.2.0-8_amd64.deb

(I'm currently looking at adding an option to Proxmox to choose it dynamically, but this build enables tcmalloc statically for now.)

I'm going from 60k IOPS to 90k IOPS with 4k randread on 1 virtio-scsi disk.
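For anyone wanting to reproduce that kind of number, a typical fio invocation for the 4k randread case might look like the following (a sketch; `/dev/sdb` is a placeholder for the virtio-scsi disk inside a test VM):

```shell
# Sketch: run inside the test VM; /dev/sdb is a placeholder for the
# virtio-scsi disk. randread does not write, but use a scratch disk anyway.
randread_4k() {
  dev="${1:-/dev/sdb}"
  fio --name=randread --filename="$dev" --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --numjobs=1 --direct=1 \
      --runtime=60 --time_based --group_reporting
}
# usage: randread_4k /dev/sdb
```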
 
pve-qemu-kvm 7.2.0-8 is also automatically shipped with PVE 7.4 from the Enterprise Repository.
 
pve-qemu-kvm 7.2.0-8 is also automatically shipped with PVE 7.4 from the Enterprise Repository.
As I said, this is a custom build with a different compilation option (--tcmalloc).
(Just for testing of course, but you should see a big difference in performance.)

I'm trying to get it pushed officially via the pve-devel mailing list, with a new option on the VM to choose this version.
 
Solved.
All NVMe disks in the servers are the same model, but some of them were slower. (It wasn't like that on the first day; they slowed down afterwards.)
I replaced those disks and the problem is solved.

But another problem:
when I try to restart an OSD in the Proxmox Ceph GUI, the OSD does not start.
What could this be due to?
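For reference, checks like the following can surface both issues (a sketch; the OSD id and device path are placeholders). `ceph osd perf` shows per-OSD commit/apply latency, so on a uniform all-flash cluster a persistent outlier points at a degraded drive, and the journal usually says why an OSD refuses to start:

```shell
# Sketch: the OSD id and device path below are placeholders.
# Per-OSD latency: on identical NVMe drives, persistent outliers are
# suspects for the "same model but slower" drives described above.
osd_latency() {
  ceph osd perf
}
# Drive health/wear for a suspect device:
nvme_health() {
  smartctl -a "$1" | grep -Ei 'percentage used|wear|media.*errors'
}
# Why an OSD will not start after a stop:
osd_start_log() {
  systemctl status "ceph-osd@$1"
  journalctl -u "ceph-osd@$1" --since "1 hour ago" --no-pager
}
# usage: osd_start_log 12 ; nvme_health /dev/nvme0
```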
 
If you want a big boost, you can try this QEMU version, compiled with tcmalloc:

https://mutulin1.odiso.net/pve-qemu-kvm_7.2.0-8_amd64.deb

(I'm currently looking at adding an option to Proxmox to choose it dynamically, but this build enables tcmalloc statically for now.)

I'm going from 60k IOPS to 90k IOPS with 4k randread on 1 virtio-scsi disk.
This is great! I will try it on my test cluster.
Thanks.
 
As I said, this is a custom build with a different compilation option (--tcmalloc).
(Just for testing of course, but you should see a big difference in performance.)

I'm trying to get it pushed officially via the pve-devel mailing list, with a new option on the VM to choose this version.
Has this ever found its way into the official PVE repositories?
 
