Proxmox 5 & Ceph Luminous/Bluestore super slow!?

Ekkas · Sep 15, 2017

Hi, I have 2 Proxmox/Ceph clusters.
One with 4 OSD (5 disks each) db+wal on NVMe
Another with 4 OSD (10 disks each) db+wal on NVMe

First cluster upgraded and performed slow until all disks were converted to Bluestore, it's still not up to Jewel level of performance but throughput on storage improved. VMs run/feel a bit slower though... But second cluster is just very very slow, although it has a much higher load. Both clusters ran on Prox4.4/CephJewel just great.

The Proxmox nodes performance degrade over a few hours to the point where even SSH to it can take minutes. VMs on it is very slow. I suspect it's a kernel tuning issue, but if I knew for sure I wouldn't be posting here

If I run:

free && sync && echo 3 > /proc/sys/vm/drop_caches && free

It can easily take 30minutes or even more to finish. Unaffected nodes finish in a few seconds...
We have 40Gb infiniband running on the cluster so I have kernel tuning parameters as shown in the Proxmox/infiniband Wiki. The same parameters worked just fine in Prox4/Jewel.

I tried with/without some of these sysctl.conf tunings as well:
vm.swappiness=0
vm.vfs_cache_pressure=50
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.min_free_kbytes=2097152
vm.zone_reclaim_mode=1
vm.nr_hugepages = 400

But nodes seem to die a slow death over time. After a reboot they are fine for a few hours again.
We get 40%-80% IOWait on slow nodes with is highly unusual. Before IOWait barely registered on the summary graphs.

I also tried to change ceph parameters, a.o.
Disabling debugging
osd_op_threads = 5
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2

rbd cache = true
rbd cache writethrough until flush = true
rbd_op_threads = 2

But if anyone can suggest recommended Kernel and/or ceph tunings for 4 OSD, 40 disks 120TB Luminous.
Or what other info can I provide as I'm not sure if it's Luminous of prox5's kernel causing it.

Thanks for reading
Ekkas

aderumier · Sep 15, 2017

hi, do you use same hardware before and after bluestore convertion ? (nvme was used for filestore journal ?)

Ekkas · Sep 16, 2017

Hi, yes, same hardware. Just upgraded in-place as described in the Wiki. Then removed the disks in batches and synced it back in as Bluestore. DB and WAL on NVMe and so was filestore journals.
I see similar issues
https://forum.proxmox.com/threads/3000-msec-ping-and-packet-drops-with-virtio-under-load.36687/
It seems the 'lesser of two evils' solution is to make the disks IDE instead of virtio.

MewBie · Sep 16, 2017

My first thought is: why do you have 10x as much drives as OSD's? Isn't the whole idea behind Ceph to have 1 OSD per drive?

Ekkas · Sep 16, 2017

Potato, Tomato...

sorry for misleading.
4 OSD servers (CentOS7 kernel 4.13.1) with 10 OSD disks in each server. But I do not think the issue is Ceph, nor for that matter Proxmox, but kernel/memory/IRQ/(?) issues as even normal SSH between nodes slow down, node become slower and slower over time without any obvious parameters that I could see so far, so if I can pull stats off a slow node to help diagnose, but since there's no errors or warning in the logs that I can see, it's non-obvious where to look...

aderumier · Sep 17, 2017

you can try to boot on kernel 4.4 from old proxmox, it should works. (just to compare).

Which network card do you use ?

Ekkas · Sep 18, 2017

Using
Intel Corporation 82576 Gigabit Network Connection (rev 01)
Subsystem: Super Micro Computer Inc 82576 Gigabit Network Connection

We have Supermicro blade system with 12 Xeon servers ATM.

Changed all blades to kernel 4.4.76-1-pve or earlier and speed is much better already. Time will tell about stability.

aderumier · Sep 19, 2017

Ekkas said:
Using
Intel Corporation 82576 Gigabit Network Connection (rev 01)
Subsystem: Super Micro Computer Inc 82576 Gigabit Network Connection

We have Supermicro blade system with 12 Xeon servers ATM.

Changed all blades to kernel 4.4.76-1-pve or earlier and speed is much better already. Time will tell about stability.

good to known ! is the difference really big ?

Ekkas · Sep 26, 2017

Still struggling with stability. VMs would just hang for a while, then speed up again. We disabled all VM disk caches from cache=writeback to cache=none and stability is a bit better but we still have VMs slow down drastically, only clue is high io-wait on the node during that time. Could take 40 minutes and everything speeds up again. We have mostly Windows server VMs and have disabled memory ballooning on all, keep at least 6GB unallocated RAM (by VMs) on each node but the problem seem to occur more often on lesser-RAM nodes although I cannot say that for sure.

Alwin · Sep 26, 2017

Warning ahead, the current Ceph Luminous 12.2.x is the release candidate, for production ready Ceph Cluster packages please wait for Proxmox VE 5.1

Ekkas said:
We disabled all VM disk caches from cache=writeback to cache=none and stability is a bit better

It is a common mistake to use qemu cache and ceph cache at the same time, while writes benefit, every read has bigger latency, as it has to check through two caches to get to the data.

Ekkas said:
we still have VMs slow down drastically, only clue is high io-wait on the node during that time. Could take 40 minutes and everything speeds up again.

Are you running any Ceph services on the PVE cluster? And how does your 'ceph osd perf' (best throw it in a watch -d) look like during that slow down? Are you using HBA or RAID controllers (eg. RAID0 with one disk)?

Ekkas said:
osd_op_threads = 5
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2
osd-max-backfills = 3
osd_disk_threads = 8
osd_op_num_threads_per_shard = 1
osd_op_num_threads_per_shard_hdd = 2
osd_op_num_threads_per_shard_ssd = 2

rbd cache = true
rbd cache writethrough until flush = true
rbd_op_threads = 2

Also you need to check your settings (top of my head), I think your 'osd_op_threads' and 'osd_disk_threads' are counter productive, as you put more work on a single OSD daemon for its disk. Now a OSD has 'osd_op_threads = 5' five threads to work on a HDD, by default it is 2, those are writing in 30sec interval to the disk. The 'osd_disk_threads' is usually 1, as OSD intensive tasks, like scrubbing or snap trimming shouldn't interfere with operation.

hansm · Sep 26, 2017

Alwin said:
It is a common mistake to use qemu cache and ceph cache at the same time, while writes benefit, every read has bigger latency, as it has to check through two caches to get to the data.

What's the recommendation? rbd cache or qemu cache? And what settings?

Alwin · Sep 26, 2017

My recommendation, use qemu disk cache=none and use ceph cache (sizing on your needs), as it can write full 4MB objects to the OSDs.

Proxmox India · Nov 9, 2017

did the situation improve ?

I plan on putting up a Infiniband 40G QDR based Ceph Cluster too !

aderumier · Nov 9, 2017

also add on your ceph.conf clients:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

debug_ms = "0/0" is the more important for luminous,
It'll be disable by default on the next ceph release

https://github.com/ceph/ceph/pull/18529

Ekkas · Nov 13, 2017

Thank you for your recommendations.
We decreased data replication from 3 to 2 copies to get a more stable workable cluster. Debug and other recommendations tried but issues still arise from time to time. (Much better with just 2 copies) Problem is when it 'goes slow' all VMs suffer. For us it is counter-productive to not have a single point of failure but a single point of performance impact for all VMs. If CEPH goes bad/slow, all production slows to a crawl.
We are going to ditch CEPH and rather have higher maintenance, but much higher performance ZFS nodes that will replicate between themselves.
It seems with increasing CEPH versions they work towards larger clusters, so our little 4-node, 40 OSDs clusters just cannot handle what CEPH requires of it, certainly not what one would expect to run off that many disks/ram/cpu. Come to think of it, CEPH is VERY inefficient i.t.o. resources, RAM, CPU and power consumption. A.o. CEPH luminous now recommend a minimum of 64GB RAM on OSD nodes.
Unless you need Petabytes in a single storage solution and can afford 5 or more OSD nodes, I would not recommend or deploy CEPH easily again.

Alwin · Nov 13, 2017

Sad to hear, that it was not a nice ride. I would still be interested in some numbers, eg, from a rados bench and fio. Do you have some and would you be able to share them? Thanks in advance.

aderumier · Nov 13, 2017

Hi, I'm running jewel filestore and luminous bluestore, with 3 nodes - each : 6 osd ssd - 2xintel 3ghz 12cores - 64g ram - 2x10gb ethernet . I don't have see any regression. I'm around 600000 iops randread 4k, 150000 iops randwrite 4k.

gcakici · Nov 21, 2017

aderumier said:
Hi, I'm running jewel filestore and luminous bluestore, with 3 nodes - each : 6 osd ssd - 2xintel 3ghz 12cores - 64g ram - 2x10gb ethernet . I don't have see any regression. I'm around 600000 iops randread 4k, 150000 iops randwrite 4k.

Hi, what is you WAL and DB config? Type of hardware, sizing and other stuff?

Search

Search

Proxmox 5 & Ceph Luminous/Bluestore super slow!?

Ekkas

Renowned Member

aderumier

Renowned Member

Ekkas

Renowned Member

MewBie

New Member

Ekkas

Renowned Member

aderumier

Renowned Member

Ekkas

Renowned Member

aderumier

Renowned Member

Ekkas

Renowned Member

Alwin

Proxmox Retired Staff

hansm

Well-Known Member

Alwin

Proxmox Retired Staff

Proxmox India

Member

aderumier

Renowned Member

Ekkas

Renowned Member

Alwin

Proxmox Retired Staff

aderumier

Renowned Member

gcakici

Renowned Member

We value your privacy