Increase performance with sched_autogroup_enabled=0

e100 · Feb 26, 2018

Changing sched_autogroup_enabled from 1 to 0 makes a HUGE difference in performance on busy Proxmox hosts.
It also helps to modify sched_migration_cost_ns.

I've tested this on Proxmox 4.x and 5.x:

Code:

echo 5000000 > /proc/sys/kernel/sched_migration_cost_ns 
echo 0 > /proc/sys/kernel/sched_autogroup_enabled

Some old servers where we ran CEPH performed so poorly they were barely useful.
I applied the above settings to the Proxmox hosts and CEPH servers, and instantly IO inside VMs backed by CEPH was 4x faster!

The improvements are not limited to CEPH, ZFS runs better too!
It seems that on any Proxmox host where I set the above settings, performance improves significantly.

Anyone else seeing similar results?

marsian · Feb 26, 2018

Haven't tried it yet, but according to a post on the PostgreSQL boards they came across a similar observation: https://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

e100 · Feb 26, 2018

Yup, that thread is what gave me the idea to try this. Found it this morning.

I suspect these settings will help anyone facing IO related performance issues.

Brian Read · Feb 27, 2018

Is there a downside to using these options?

t.lamprecht · Feb 27, 2018

Brian Read said:
Is there a downside to using these options?

The first setting (sched_migration_cost_ns) lets a process stay longer on the CPU (core) it last ran on; it is a heuristic for estimating cache misses. I.e., when you have a lot of tasks to run, for those that still have much of their data in cache it's cheaper to wait briefly and then run on the CPU (core) they last ran on, as cache misses are quite expensive in CPU cycles.
For those that have little or no data in the local CPU caches (L1, and also L2) it may be faster/better to just run the task on another CPU with less work (i.e., migrate it), as it must (re)cache its data anyway.
The heuristic uses, among other things, the time since the task last ran to estimate how much of its data is probably still cached: the longer the task has not run, the more likely it is that its data was evicted from cache to make room for another task. Setting this too high can have downsides, as cache penalties may add up, but a value that is too low is also not ideal, since task migration is not exactly free: often inter-CPU locks must be acquired to move a task to another CPU core's run queue. It seems that this value's default may be a bit too low for modern systems and a hypervisor workload, so you could try setting it to 5 ms instead of 0.5 ms and observe how your system is affected. Note that a higher CPU load can be desirable here, as the CPU basically just gets used more efficiently: less time is wasted in task migrations, so you can achieve more throughput.

The second setting (sched_autogroup_enabled) automatically groups tasks by their session, which is really great for desktop systems, but does not always seem ideal for background processes sharing the same session.
As systemd executes a setsid() for creating a new session after forking for a command execution this may not have such a great effect anymore... I only skimmed the kernel and systemd source code a bit, so I may have missed a detail; if anyone has more accurate information, they are welcome to correct me.

It's probably best to experiment a bit with the settings; they can be quickly enabled or disabled with sysctl.
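
As a rough sketch of what experimenting with sysctl could look like: the snippet below only reads the current values and prints sysctl.conf-style lines for the two settings from this thread. The helper names show_tunable and sched_conf are made up for illustration, and the sketch assumes the tunables live under /proc/sys/kernel as in the first post; on kernels that do not expose one of them there, it prints "n/a" instead of failing.

```shell
#!/bin/sh
# Print the current value of a scheduler tunable, or "n/a" if this kernel
# does not expose it under /proc/sys/kernel (assumed location, see first post).
show_tunable() {
    f="/proc/sys/kernel/$1"
    if [ -r "$f" ]; then
        printf 'kernel.%s = %s\n' "$1" "$(cat "$f")"
    else
        printf 'kernel.%s = n/a\n' "$1"
    fi
}

# Emit sysctl.conf-style lines for the two settings.
# $1: migration cost in ns (default 5000000 = 5 ms), $2: autogroup (default 0).
sched_conf() {
    printf 'kernel.sched_migration_cost_ns = %s\n' "${1:-5000000}"
    printf 'kernel.sched_autogroup_enabled = %s\n' "${2:-0}"
}

# Read-only: show what the host currently uses, then the proposed values.
show_tunable sched_migration_cost_ns
show_tunable sched_autogroup_enabled
sched_conf
```

To change a value at runtime you would, as root, run something like `sysctl -w kernel.sched_autogroup_enabled=0`. To keep it across reboots, the output of sched_conf could be written to a file under /etc/sysctl.d/ (the file name is up to you) and loaded with `sysctl --system`.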

Brian Read · Feb 27, 2018

t.lamprecht said:
The first setting (sched_migration_cost_ns) lets a process stay longer on the CPU (core) it last ran on; it is a heuristic for estimating cache misses.
.. snip...
The second setting (sched_autogroup_enabled) automatically groups tasks by their session,
..snip...

Thanks, this is hugely useful information, I am trying the change as we speak!

t.lamprecht · Feb 27, 2018

Brian Read said:
Thanks, this is hugely useful information, I am trying the change as we speak!

Another thing I just found in the scheduler documentation (man 7 sched):

The use of the cgroups(7) CPU controller to place processes in cgroups other than the root CPU cgroup overrides the effect of autogrouping.

Meaning, for virtual machines and containers the second setting should not make a real difference, as we already place those processes in their own control groups.
But the ceph daemons are in the root cgroup, AFAICS, so for a hyper-converged setup it could make a difference there.
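
A small sketch for checking this on your own host: it prints the cgroup membership and the autogroup of a given PID from /proc. The function name proc_groups is made up for illustration, and /proc/PID/autogroup only exists when the kernel was built with autogroup support, so the sketch reports files it cannot read instead of failing.

```shell
#!/bin/sh
# Show which cgroup(s) and which autogroup a process belongs to.
# $1: PID to inspect (defaults to the current shell).
proc_groups() {
    pid="${1:-$$}"
    for f in cgroup autogroup; do
        p="/proc/$pid/$f"
        if [ -r "$p" ]; then
            echo "== $p"
            cat "$p"
        else
            echo "== $p (not available)"
        fi
    done
}

# Example: inspect the current shell.
proc_groups "$$"
```

To look at a ceph daemon you could pass its PID instead, for example `proc_groups "$(pidof -s ceph-osd)"`, assuming a ceph-osd process is running on that host.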

Edit:

t.lamprecht said:
As systemd executes a setsid() for creating a new session after forking for a command execution this may not have such a great effect anymore...

Oh, and here I was wrong: as task groups compete with each other, this should state the contrary. That's the reason why disabling it could improve ceph performance at all.

Alwin · Feb 27, 2018

e100 said:
I applied the above settings to the Proxmox hosts and CEPH servers, and instantly IO inside VMs backed by CEPH was 4x faster!

@e100, could you please share how you measured, and maybe some numbers?

e100 · Feb 27, 2018

@Alwin

I used dd and 'time cp' on a VM where I could previously only copy at about 40 MB/sec; now I can get 140 MB/sec.

Alwin · Feb 28, 2018

What does a fio test show? I ask because dd without options will use caching and usually does not write directly to disk.

Example:

Code:

fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --write_lat_log=1 --name=fio --output=fio.log

I ask, because I could not see any difference on our ceph cluster in the testlab, with or without load (stress-ng, rados bench, fio).

Has anybody else tested those settings and what results did you get?
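
On the dd caching point above, a minimal sketch of a dd run that at least flushes the data to disk before reporting its rate, so the page cache does not inflate the number. The function name dd_sync_write and the sizes are made up for illustration; it assumes a dd that supports conv=fdatasync (GNU coreutils does) and writes to a temporary file rather than a raw device. Zeros from /dev/zero may be compressed away by some storage backends, so treat the result as a rough indication only.

```shell
#!/bin/sh
# Write COUNT MiB of zeros to TARGET and flush them to disk before dd exits,
# so the reported rate includes the time needed to reach stable storage.
# $1: target file, $2: size in MiB (default 16).
dd_sync_write() {
    target="$1"
    count="${2:-16}"
    dd if=/dev/zero of="$target" bs=1M count="$count" conv=fdatasync 2>&1
}

# Example run against a temporary file (removed afterwards).
tmp="$(mktemp)"
dd_sync_write "$tmp" 4
rm -f "$tmp"
```

Running the same function before and after changing the scheduler settings, on the same VM and with nothing else changed, gives a fairer comparison than plain dd, though fio as shown above remains the more controlled test.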

e100 · Feb 28, 2018

@Alwin

I've tried setting the values back to default so I can test before and after, but performance stays the same.

It's possible that there is some other explanation for my initial results.

Alwin · Feb 28, 2018

Sorry for asking the obvious. Did you reboot the hosts and/or VM? Maybe there was some swap usage (host/VM)?
