Increase performance with sched_autogroup_enabled=0

e100

Changing sched_autogroup_enabled from 1 to 0 makes a HUGE difference in performance on busy Proxmox hosts.
It also helps to modify sched_migration_cost_ns.

I've tested this on Proxmox 4.x and 5.x:
Code:
echo 5000000 > /proc/sys/kernel/sched_migration_cost_ns 
echo 0 > /proc/sys/kernel/sched_autogroup_enabled
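
If you want the change to survive a reboot, the same values can be set via sysctl instead of echoing into /proc. A minimal sketch, assuming the usual kernel.* sysctl keys are exposed by your kernel (the file name here is just an example, and newer kernels may move sched_migration_cost_ns out of /proc/sys):
Code:
# persist the tunables across reboots (file name is an example)
cat > /etc/sysctl.d/90-sched-tuning.conf <<'EOF'
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 0
EOF
# load the file immediately
sysctl -p /etc/sysctl.d/90-sched-tuning.conf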

Some old servers where we ran CEPH performed so poorly they were barely usable.
Applied the above settings to the Proxmox hosts and CEPH servers, and instantly IO inside VMs backed by CEPH was 4x faster!

The improvements are not limited to CEPH; ZFS runs better too!
It seems like performance improves significantly on any Proxmox host where I set the above settings.

Anyone else seeing similar results?
 
Is there a downside to using these options?

The first setting (sched_migration_cost_ns) allows processes to stay longer on the CPU (core) they last ran on; it's a heuristic for estimating cache misses. I.e., when you have a lot of tasks to run, for those which still have a lot of their data in cache it's cheaper to wait a little and then run again on the CPU (core) they last ran on, as cache misses are quite costly in CPU cycles.
For those which have little, or even no, data in the local CPU caches (L1, and also L2) it may be faster/better to just run the task on another CPU with less work (i.e., migrate it), as it must (re)cache its data anyway.
The heuristic uses, among other things, the time since the task's last run to estimate how much of its data is probably still cached, as the longer a task did not run, the more likely it is that its data was evicted from cache to make room for another task.

Now, setting this too high can have downsides, as the cache penalties may add up, but a value that is too low is also not ideal: task migration is not exactly free, as often inter-CPU locks must be acquired to move a task to another CPU core's run queue. But it seems that this value's default may be a bit too low for modern systems and a hypervisor workload, so you could try setting it to 5 ms instead of 0.5 ms and observe how your system is affected. Note that a higher CPU load could actually be wanted here, as the CPU basically just gets used more efficiently: less time is wasted on task migrations and you can achieve more throughput.
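
If you want to check whether the knob actually changes scheduler behaviour on your box, one rough way (just a sketch, assuming the perf tool is installed) is to count task migrations system-wide for a while before and after changing the value:
Code:
# count task migrations on all CPUs for 30 seconds via the sched:sched_migrate_task tracepoint
perf stat -a -e sched:sched_migrate_task sleep 30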

The second setting (sched_autogroup_enabled) automatically groups tasks by their session, which is really great for desktop systems, but seems to be not always ideal for background processes sharing the same session.
As systemd executes setsid() to create a new session after forking for a command execution, this may not have such a great effect anymore... I just skimmed the kernel and systemd source code a bit, and it may be that I missed a detail, so if anyone has more accurate information they are welcome to correct me :)
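
For reference, you can also look directly at the autogroup a process was placed in; a quick check, assuming the kernel was built with CONFIG_SCHED_AUTOGROUP (which it must be, otherwise the toggle would not exist):
Code:
# is autogrouping currently active?
cat /proc/sys/kernel/sched_autogroup_enabled
# which autogroup does the current shell belong to, and with which nice value?
cat /proc/$$/autogroup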

Best is probably to experiment a bit with the settings; they can be quickly enabled or disabled with sysctl.
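
For example (a sketch, using the defaults mentioned above as the revert values):
Code:
# apply the tuning at runtime
sysctl -w kernel.sched_migration_cost_ns=5000000
sysctl -w kernel.sched_autogroup_enabled=0
# revert to the usual defaults (0.5 ms, autogrouping on)
sysctl -w kernel.sched_migration_cost_ns=500000
sysctl -w kernel.sched_autogroup_enabled=1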
 
The first setting (sched_migration_cost_ns) allows processes to stay longer on the CPU (core) they last ran on; it's a heuristic for estimating cache misses.
...snip...
The second setting (sched_autogroup_enabled) automatically groups tasks by their session,
...snip...

Thanks, this is hugely useful information, I am trying the change as we speak!
 
Thanks, this is hugely useful information, I am trying the change as we speak!

Another thing I just found in the scheduler documentation (man 7 sched):

The use of the cgroups(7) CPU controller to place processes in cgroups other than the root CPU cgroup overrides the effect of autogrouping.

Meaning, for virtual machines and containers the second setting should not make a real difference, as we already place those processes in their own control groups.
But ceph daemons are in the root cgroup, AFAIS, so for a hyper-converged setup it could make a difference there.
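
To verify that on a given host (just a sketch; process names and cgroup paths will differ per setup), you can compare the cgroup a VM process sits in with the one a ceph daemon sits in:
Code:
# the QEMU process of a running VM (on PVE the binary is invoked as 'kvm') sits in its own cgroup
cat /proc/$(pidof -s kvm)/cgroup
# compare with a ceph daemon, e.g. an OSD; look at the cpu controller line
cat /proc/$(pidof -s ceph-osd)/cgroup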

Edit:
As systemd executes setsid() to create a new session after forking for a command execution, this may not have such a great effect anymore...

Oh, and here I was wrong: as task groups compete with each other, this should state the contrary. That's the reason why disabling it could improve ceph performance at all.
 
Applied the above settings to the Proxmox hosts and CEPH servers, and instantly IO inside VMs backed by CEPH was 4x faster!
@e100, could you please share with us how you measured this, and maybe some numbers?
 
What does a fio test show? dd without options will use caching and is usually not a direct write to disk.

Example:
Code:
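# note: this example writes directly to /dev/sdX and will destroy data on that device, so point it at a spare disk or a test file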
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --write_lat_log=1 --name=fio --output=fio.log

I ask because I could not see any difference on our ceph cluster in the test lab, with or without load (stress-ng, rados bench, fio).

Has anybody else tested those settings and what results did you get?
 
Sorry for asking the obvious. Did you reboot the hosts and/or VM? Maybe there was some swap usage (host/VM)?
 
