Proxmox 5 & Ceph Luminous/Bluestore super slow!?

Ekkas

Hi, I have 2 Proxmox/Ceph clusters.
One with 4 OSD (5 disks each) db+wal on NVMe
Another with 4 OSD (10 disks each) db+wal on NVMe

The first cluster was upgraded and performed slowly until all disks were converted to Bluestore; it's still not up to Jewel-level performance, but storage throughput improved. VMs run/feel a bit slower though... The second cluster, however, is just very, very slow, although it has a much higher load. Both clusters ran great on Proxmox 4.4 / Ceph Jewel.

The Proxmox nodes' performance degrades over a few hours to the point where even SSH to them can take minutes. VMs on them are very slow. I suspect it's a kernel tuning issue, but if I knew for sure I wouldn't be posting here :)

If I run:

free && sync && echo 3 > /proc/sys/vm/drop_caches && free

It can easily take 30 minutes or even more to finish. Unaffected nodes finish in a few seconds...
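For reference, a few standard diagnostics that can help show whether dirty/slab memory or disk latency is behind a slow cache drop like this (generic commands, nothing specific to this setup assumed):

# time the cache drop itself
time sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# watch dirty/writeback and slab memory while the node degrades
grep -E 'MemFree|Dirty|Writeback|Slab|SReclaimable|SUnreclaim' /proc/meminfo

# per-device utilisation and await times (iostat comes with the sysstat package)
iostat -x 2 5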
We have 40Gb InfiniBand running on the cluster, so I have the kernel tuning parameters set as shown in the Proxmox/InfiniBand wiki. The same parameters worked just fine on Proxmox 4 / Jewel.

I tried with/without some of these sysctl.conf tunings as well (applied and verified as sketched after the list):
vm.swappiness=0
vm.vfs_cache_pressure=50
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.min_free_kbytes=2097152
vm.zone_reclaim_mode=1
vm.nr_hugepages = 400
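For reference, a minimal sketch of applying and checking these values at runtime before committing them to /etc/sysctl.conf (the values shown are the ones from this post):

# load everything from /etc/sysctl.conf
sysctl -p

# or set a single value for testing, then read back a few of them
sysctl -w vm.dirty_ratio=10
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.min_free_kbytes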

But the nodes seem to die a slow death over time. After a reboot they are fine for a few hours again.
We see 40%-80% IO wait on the slow nodes, which is highly unusual; before, IO wait barely registered on the summary graphs.

I also tried changing Ceph parameters, among others:
Disabling debugging
osd_op_threads = 5
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2

rbd cache = true
rbd cache writethrough until flush = true
rbd_op_threads = 2

Can anyone suggest recommended kernel and/or Ceph tunings for a 4-node, 40-disk, 120TB Luminous cluster?
Or tell me what other info I can provide, as I'm not sure whether it's Luminous or Proxmox 5's kernel causing this.

Thanks for reading
Ekkas
 
Hi, do you use the same hardware before and after the Bluestore conversion? (Was the NVMe used for the filestore journal?)
 
My first thought is: why do you have 10x as many drives as OSDs? Isn't the whole idea behind Ceph to have 1 OSD per drive?
 
Potato, tomato... :) Sorry for misleading.
4 OSD servers (CentOS 7, kernel 4.13.1) with 10 OSD disks in each server. But I do not think the issue is Ceph, nor for that matter Proxmox, but rather a kernel/memory/IRQ/(?) issue, as even normal SSH between nodes slows down and the nodes become slower and slower over time without any obvious cause that I could see so far. I can pull stats off a slow node to help diagnose, but since there are no errors or warnings in the logs that I can see, it's non-obvious where to look...
 
You can try booting the kernel 4.4 from the old Proxmox, it should work (just to compare).

Which network card do you use?
 
Using
Intel Corporation 82576 Gigabit Network Connection (rev 01)
Subsystem: Super Micro Computer Inc 82576 Gigabit Network Connection

We have a Supermicro blade system with 12 Xeon servers at the moment.

I changed all blades to kernel 4.4.76-1-pve or earlier and speed is much better already. Time will tell about stability.
 
I changed all blades to kernel 4.4.76-1-pve or earlier and speed is much better already. Time will tell about stability.

Good to know! Is the difference really big?
 
Still struggling with stability. VMs would just hang for a while, then speed up again. We switched all VM disk caches from cache=writeback to cache=none and stability is a bit better, but we still have VMs slowing down drastically; the only clue is high IO wait on the node during that time. It can take 40 minutes before everything speeds up again. We have mostly Windows Server VMs and have disabled memory ballooning on all of them, and we keep at least 6GB of RAM unallocated (by VMs) on each node, but the problem seems to occur more often on the nodes with less RAM, although I cannot say that for sure.
 
Warning ahead: the current Ceph Luminous 12.2.x is the release candidate; for production-ready Ceph cluster packages please wait for Proxmox VE 5.1.

We switched all VM disk caches from cache=writeback to cache=none and stability is a bit better
It is a common mistake to use the qemu cache and the Ceph cache at the same time; while writes benefit, every read has higher latency, as it has to go through two caches to get to the data.

we still have VMs slowing down drastically; the only clue is high IO wait on the node during that time. It can take 40 minutes before everything speeds up again.
Are you running any Ceph services on the PVE cluster? And how does your 'ceph osd perf' look during that slowdown (best throw it in a watch -d)? Are you using HBA or RAID controllers (e.g. RAID0 with one disk)?
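For reference, the watch suggestion boils down to something like the following; the one-second interval is just an example:

# refresh 'ceph osd perf' every second and highlight the latencies that change
watch -d -n 1 'ceph osd perf'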

osd_op_threads = 5
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2

rbd cache = true
rbd cache writethrough until flush = true
rbd_op_threads = 2
Also, you need to check your settings (off the top of my head): I think your 'osd_op_threads' and 'osd_disk_threads' are counter-productive, as you put more work on a single OSD daemon for its disk. With 'osd_op_threads = 5' an OSD now has five threads working on one HDD, while the default is 2; those write to the disk at 30-second intervals. 'osd_disk_threads' is usually 1, so that OSD-intensive tasks like scrubbing or snap trimming don't interfere with normal operation.
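For reference, a minimal sketch of putting those two options back to the defaults mentioned above; verify the actual defaults for your release on a running OSD ('<id>' is a placeholder):

# check what an OSD is currently running with
ceph daemon osd.<id> config show | grep -E 'osd_op_threads|osd_disk_threads'

# ceph.conf, per the defaults described above
[osd]
osd op threads = 2
osd disk threads = 1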
 
It is a common mistake to use the qemu cache and the Ceph cache at the same time; while writes benefit, every read has higher latency, as it has to go through two caches to get to the data.
What's the recommendation? rbd cache or qemu cache? And what settings?
 
My recommendation: use qemu disk cache=none and use the Ceph (rbd) cache (sized to your needs), as it can write full 4MB objects to the OSDs.
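For reference, a minimal sketch of that split, assuming an RBD-backed VM disk; the cache sizes below are illustrative, not recommendations:

# /etc/ceph/ceph.conf on the client side
[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd cache size = 67108864        # 64 MB per client, illustrative
rbd cache max dirty = 50331648   # 48 MB, illustrative

# and in the Proxmox VM config, keep the disk at cache=none, e.g.
# scsi0: <your-rbd-storage>:vm-100-disk-1,cache=none   (storage name is a placeholder)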
 
Also add this to the ceph.conf on your Ceph clients:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0




debug_ms = 0/0 is the most important one for Luminous;
it'll be disabled by default in the next Ceph release:

https://github.com/ceph/ceph/pull/18529
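For reference, these debug levels can also be pushed to already-running daemons without a restart, assuming admin access to the cluster:

# apply at runtime to all OSDs and monitors
ceph tell osd.* injectargs '--debug_ms 0/0'
ceph tell mon.* injectargs '--debug_ms 0/0'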
 
Thank you for your recommendations.
We decreased data replication from 3 to 2 copies to get a more stable, workable cluster. We tried the debug and other recommendations, but issues still arise from time to time (it is much better with just 2 copies). The problem is that when it 'goes slow', all VMs suffer. For us it is counter-productive to trade a single point of failure for a single point of performance impact for all VMs: if Ceph goes bad or slow, all production slows to a crawl.
We are going to ditch Ceph and rather run higher-maintenance but much higher-performance ZFS nodes that replicate between themselves.
It seems that with each new version Ceph works towards larger clusters, so our little 4-node, 40-OSD clusters just cannot handle what Ceph requires of them, certainly not what one would expect from that many disks, that much RAM and CPU. Come to think of it, Ceph is VERY inefficient in terms of resources: RAM, CPU and power consumption. Among other things, Ceph Luminous now recommends a minimum of 64GB RAM on OSD nodes.
Unless you need petabytes in a single storage solution and can afford 5 or more OSD nodes, I would not recommend or deploy Ceph easily again.
 
Sad to hear that it was not a nice ride. I would still be interested in some numbers, e.g. from a rados bench and fio. Do you have some, and would you be able to share them? Thanks in advance.
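For anyone wanting to produce comparable numbers, a minimal sketch of the two benchmarks mentioned; the pool name and device path are placeholders, and the fio run is destructive on the target device:

# 60-second write benchmark against a test pool, then sequential and random reads
rados bench -p testpool 60 write --no-cleanup
rados bench -p testpool 60 seq
rados bench -p testpool 60 rand
rados -p testpool cleanup

# 4k random-write fio against an RBD-backed block device
fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --numjobs=4 --iodepth=32 --runtime=60 --time_based --filename=/dev/sdX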
 
Hi, I'm running Jewel filestore and Luminous Bluestore with 3 nodes, each with 6 SSD OSDs, 2x Intel 3GHz 12-core CPUs, 64GB RAM and 2x 10Gb Ethernet. I haven't seen any regression. I'm at around 600,000 IOPS 4k random read and 150,000 IOPS 4k random write.
 
Hi, what is your WAL and DB config? Type of hardware, sizing and other details?
 
