M7350v1_en_gpl

2024-09-09 08:52:07 +00:00
commit f9cc65cfda
65988 changed files with 26357421 additions and 0 deletions

@@ -0,0 +1,14 @@
00-INDEX
- this file.
sched-arch.txt
- CPU Scheduler implementation hints for architecture specific code.
sched-design-CFS.txt
- goals, design and implementation of the Completely Fair Scheduler.
sched-domains.txt
- information on scheduling domains.
sched-nice-design.txt
- How and why the scheduler's nice levels are implemented.
sched-rt-group.txt
- real-time group scheduling.
sched-stats.txt
- information on schedstats (Linux Scheduler Statistics).

@@ -0,0 +1,89 @@
CPU Scheduler implementation hints for architecture specific code
Nick Piggin, 2005
Context switch
==============
1. Runqueue locking
By default, the switch_to arch function is called with the runqueue
locked. This is usually not a problem unless switch_to may need to
take the runqueue lock. This is usually due to a wake up operation in
the context switch. See arch/ia64/include/asm/system.h for an example.
To request the scheduler call switch_to with the runqueue unlocked,
you must `#define __ARCH_WANT_UNLOCKED_CTXSW` in a header file
(typically the one where switch_to is defined).
Unlocked context switches introduce only a very minor performance
penalty to the core scheduler implementation in the CONFIG_SMP case.
2. Interrupt status
By default, the switch_to arch function is called with interrupts
disabled. Interrupts may be enabled over the call if it is likely to
introduce a significant interrupt latency by adding the line
`#define __ARCH_WANT_INTERRUPTS_ON_CTXSW` in the same place as for
unlocked context switches. This define also implies
`__ARCH_WANT_UNLOCKED_CTXSW`. See arch/arm/include/asm/system.h for an
example.
CPU idle
========
Your cpu_idle routines need to obey the following rules:
1. Preempt should now be disabled over idle routines. It should only
be enabled to call schedule(), then disabled again.
2. need_resched/TIF_NEED_RESCHED is only ever set, and will never
be cleared until the running task has called schedule(). Idle
threads need only ever query need_resched, and may never set or
clear it.
3. When cpu_idle finds (need_resched() == 'true'), it should call
schedule(). It should not call schedule() otherwise.
4. The only time interrupts need to be disabled when checking
need_resched is if we are about to sleep the processor until
the next interrupt (this doesn't provide any protection of
need_resched, it prevents losing an interrupt).
4a. Common problem with this type of sleep appears to be:
        local_irq_disable();
        if (!need_resched()) {
                local_irq_enable();
                *** resched interrupt arrives here ***
                __asm__("sleep until next interrupt");
        }
5. TIF_POLLING_NRFLAG can be set by idle routines that do not
need an interrupt to wake them up when need_resched goes high.
In other words, they must be periodically polling need_resched,
although it may be reasonable to do some background work or enter
a low CPU priority.
5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
an interrupt sleep, it needs to be cleared then a memory
barrier issued (followed by a test of need_resched with
interrupts disabled, as explained in 4).
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
Possible arch/ problems
=======================
Possible arch problems I found (and either tried to fix or didn't):
h8300 - Is such sleeping racy vs interrupts? (See #4a).
The H8/300 manual I found indicates yes, however disabling IRQs
over the sleep means only NMIs can wake it up, so can't fix easily
without doing spin waiting.
ia64 - is safe_halt call racy vs interrupts? (does it sleep?) (See #4a)
sh64 - Is sleeping racy vs interrupts? (See #4a)
sparc - IRQs on at this point(?), change local_irq_save to _disable.
- TODO: needs secondary CPUs to disable preempt (See #1)

@@ -0,0 +1,122 @@
CFS Bandwidth Control
=====================
[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.
The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a group is allowed to consume only up to
"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
group exceeds this limit (for that period), the tasks belonging to its
hierarchy will be throttled and are not allowed to run again until the next
period.
A group's unused runtime is tracked globally and is refreshed with the quota
units described above at each period boundary. As threads consume this
bandwidth it is transferred to cpu-local "silos" on a demand basis. The amount
transferred within each of these updates is tunable and described as the
"slice".
Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.
cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]
The default values are:
cpu.cfs_period_us=100ms
cpu.cfs_quota_us=-1
A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.
Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum value allowed for the quota or period is 1ms. There is also an
upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion; these are explained in
more detail below.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.
Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.
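
The quota and period semantics above can be sketched with a small Python
model. This is an illustration only, not kernel code: the function name
`simulate_group`, its parameters, and the assumption that the group asks for
the same amount of CPU time in every period are all hypothetical.

```python
def simulate_group(quota_us, period_us, demand_us_per_period, periods):
    """Toy model of CFS bandwidth control for one group.

    Returns (total_runtime_us, nr_throttled). period_us is kept only to
    mirror the cgroup interface; demand is expressed per period.
    """
    total_runtime = 0
    nr_throttled = 0
    for _ in range(periods):
        if quota_us < 0:
            # Negative quota: unconstrained group, demand is fully served.
            granted = demand_us_per_period
        else:
            granted = min(demand_us_per_period, quota_us)
            if demand_us_per_period > quota_us:
                # Quota exhausted: throttled until the next period starts.
                nr_throttled += 1
        total_runtime += granted
    return total_runtime, nr_throttled


# 10ms quota per 50ms period, group wants 30ms per period, 4 periods.
print(simulate_group(10000, 50000, 30000, 4))   # (40000, 4)
# Same demand with quota -1 (unconstrained), 2 periods.
print(simulate_group(-1, 100000, 30000, 2))     # (60000, 0)
```

In this model the group is throttled in every period in which its demand
exceeds the quota, and a negative quota never throttles.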
System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".
This is tunable via procfs:
/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.
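
The trade-off between slice size and transfer count can be illustrated with a
simplified single-CPU model. The function `run_on_cpu` and its return tuple
are assumptions made for this sketch; the real accounting involves several
CPUs and locking that are not modelled here.

```python
def run_on_cpu(global_pool_us, need_us, slice_us=5000):
    """Consume need_us of runtime on one CPU, refilling the CPU-local
    silo from the global pool one slice at a time.

    Returns (ran_us, transfers, global_left_us, local_left_us).
    """
    local = 0
    transfers = 0
    ran = 0
    while ran < need_us:
        if local == 0:
            if global_pool_us == 0:
                break  # nothing left to hand out: the group would throttle
            grant = min(slice_us, global_pool_us)
            global_pool_us -= grant
            local += grant
            transfers += 1
        step = min(local, need_us - ran)
        local -= step
        ran += step
    return ran, transfers, global_pool_us, local


print(run_on_cpu(20000, 12000, slice_us=5000))  # (12000, 3, 5000, 3000)
print(run_on_cpu(20000, 12000, slice_us=1000))  # (12000, 12, 8000, 0)
```

With the 5ms slice only 3 transfers are needed, but 3ms of runtime ends up
parked in the local silo; the 1ms slice needs 12 transfers and leaves nothing
stranded.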
Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
cpu.stat:
- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
of the group have been throttled.
This interface is read-only.
Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy.
e.g. \Sum (c_i) may exceed C
[ Where C is the parent's bandwidth, and c_i its children ]
There are two ways in which a group may become throttled:
a. it fully consumes its own quota within a period
b. a parent's quota is fully consumed within its period
In case b) above, even though the child may have runtime remaining it will not
be allowed to run until the parent's runtime is refreshed.
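
The two hierarchy rules, max(c_i) <= C being enforced while \Sum (c_i) may
exceed C, can be written down as a short check. The helper `check_children`
is hypothetical, and bandwidths are expressed here as quota/period ratios
(CPUs' worth of runtime), which is an assumption of this sketch.

```python
def check_children(parent_bw, child_bws):
    """Return (valid, oversubscribed) for a parent and its children.

    valid:          every child is individually attainable (c_i <= C)
    oversubscribed: the children together ask for more than C
    """
    valid = all(c <= parent_bw for c in child_bws)
    oversubscribed = sum(child_bws) > parent_bw
    return valid, oversubscribed


# Two children of 0.75 CPU each under a 1 CPU parent: accepted, although
# the aggregate exceeds the parent.
print(check_children(1.0, [0.75, 0.75]))  # (True, True)
# A child asking for more than its parent can never be satisfied.
print(check_children(1.0, [1.5]))         # (False, True)
```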
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms.
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */
The larger period here allows for increased burst capacity.
3. Limit a group to 20% of 1 CPU.
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.

@@ -0,0 +1,244 @@
=============
CFS Scheduler
=============
1. OVERVIEW
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.
80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.
"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.
On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
2. FEW IMPLEMENTATION DETAILS
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.
[ small detail: on "ideal" hardware, at any time all tasks would have the same
p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
would ever get "out of balance" from the "ideal" share of CPU time. ]
CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far). CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.
Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.
3. THE RBTREE
CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).
CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue. The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.
The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.
CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to
account for possible wraparounds). CFS picks the "leftmost" task from this
tree and sticks to it.
As the system progresses forwards, the executed tasks are put into the tree
more and more to the right --- slowly but surely giving a chance for every task
to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.
Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime. Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.
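
The "always run the task with the smallest vruntime" rule can be sketched in
a few lines of Python. This is a deliberately reduced model: a min-heap
stands in for the rbtree's leftmost lookup, all tasks have equal weight, and
the preemption granularity and sleeper handling described in this document
are left out. The function name `simulate_cfs` and the fixed 1ms tick are
assumptions of the sketch.

```python
import heapq


def simulate_cfs(task_names, ticks, tick_ns=1_000_000):
    """Run `ticks` scheduler ticks over equal-weight tasks and return the
    CPU time in nanoseconds that each task received."""
    # Heap entries are (vruntime, sequence, name); the sequence number
    # keeps the ordering well defined when vruntimes are equal.
    heap = [(0, seq, name) for seq, name in enumerate(task_names)]
    heapq.heapify(heap)
    runtime = {name: 0 for name in task_names}
    for _ in range(ticks):
        vruntime, seq, name = heapq.heappop(heap)  # the "leftmost" task
        runtime[name] += tick_ns
        # Equal weights: vruntime advances by the real time consumed,
        # which moves the task to the right of the timeline.
        heapq.heappush(heap, (vruntime + tick_ns, seq, name))
    return runtime


print(simulate_cfs(["a", "b", "c"], 9))
# {'a': 3000000, 'b': 3000000, 'c': 3000000}
```

Every task drifts to the right as it runs, so each one becomes the leftmost
entry again after the others have caught up, and CPU time ends up evenly
split.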
4. SOME FEATURES OF CFS
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever. There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
/proc/sys/kernel/sched_min_granularity_ns
which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads. It defaults to a setting suitable
for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.
Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c, massive_intr.c all work fine and do not impact
interactivity and produce the expected behavior.
The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively.
SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used. The balancing code got quite a bit simpler as a
result.
5. Scheduling policies
CFS implements three scheduling policies:
- SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
policy that is used for regular tasks.
- SCHED_BATCH: Does not preempt nearly as often as regular tasks
would, thereby allowing tasks to run longer and make better use of
caches but at the cost of interactivity. This is well suited for
batch jobs.
- SCHED_IDLE: This is even weaker than nice 19, but it is not a true
idle timer scheduler, in order to avoid getting into priority
inversion problems which would deadlock the machine.
SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by
POSIX.
The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.
6. SCHEDULING CLASSES
The new CFS scheduler has been designed in such a way as to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
encapsulate scheduling policy details and are handled by the scheduler core
without the core code assuming too much about them.
sched_fair.c implements the CFS scheduler described above.
sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.
Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.
This is the (partial) list of the hooks:
- enqueue_task(...)
Called when a task enters a runnable state.
It puts the scheduling entity (task) into the red-black tree and
increments the nr_running variable.
- dequeue_task(...)
When a task is no longer runnable, this function is called to keep the
corresponding scheduling entity out of the red-black tree. It decrements
the nr_running variable.
- yield_task(...)
This function is basically just a dequeue followed by an enqueue, unless the
compat_yield sysctl is turned on; in that case, it places the scheduling
entity at the right-most end of the red-black tree.
- check_preempt_curr(...)
This function checks if a task that entered the runnable state should
preempt the currently running task.
- pick_next_task(...)
This function chooses the most appropriate task eligible to run next.
- set_curr_task(...)
This function is called when a task changes its scheduling class or changes
its task group.
- task_tick(...)
This function is mostly called from time tick functions; it might lead to
process switch. This drives the running preemption.
7. GROUP SCHEDULER EXTENSIONS TO CFS
Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.
CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
grouped and divides CPU time fairly among such groups.
CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.
CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.
These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroups/cgroups.txt for more information about this filesystem.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks
# #Configure the multimedia group to receive twice the CPU bandwidth
# #that of browser group
# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares
# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks
# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
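
As a rough check of what the cpu.shares values above mean, the expected split
can be computed directly. The sketch assumes that both groups are
continuously runnable and compete for the same CPUs; an idle group does not
reserve its share. The helper name `cpu_split` is hypothetical.

```python
def cpu_split(shares):
    """Map each group to its fraction of CPU time under full contention."""
    total = sum(shares.values())
    return {group: value / total for group, value in shares.items()}


split = cpu_split({"multimedia": 2048, "browser": 1024})
for group, fraction in split.items():
    print(f"{group}: {fraction:.1%}")
```

With shares of 2048 and 1024 the multimedia group gets two thirds of the
contended CPU time and the browser group one third.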

@@ -0,0 +1,81 @@
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
locklessly updated.
Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system
although strictly it doesn't have to, but this could lead to a case where some
CPUs will never be given tasks to run unless the CPUs allowed mask is
explicitly set. A sched domain's span means "balance process load among these
CPUs".
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.
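
The three MUST rules for groups can be expressed as a small validator. This
is a sketch using Python sets in place of cpumasks; the function
`check_domain` and its error strings are made up for illustration and do not
correspond to the kernel's CONFIG_SCHED_DEBUG output.

```python
def check_domain(span, groups, cpu):
    """Check one sched domain against the group invariants.

    span:   set of CPU ids covered by the domain
    groups: list of sets; groups[0] is the group reached via ->groups
    cpu:    the CPU this domain belongs to
    Returns a list of violated rules (empty when the domain is valid).
    """
    errors = []
    union = set().union(*groups)
    if union != span:
        errors.append("union of groups != span")
    # If the group sizes add up to more than the union, some CPU is in
    # more than one group, so two groups intersect.
    if sum(len(g) for g in groups) != len(union):
        errors.append("groups overlap")
    if not groups or cpu not in groups[0]:
        errors.append("first group does not contain the domain's CPU")
    return errors


print(check_domain({0, 1, 2, 3}, [{0, 1}, {2, 3}], cpu=0))     # []
print(check_domain({0, 1, 2, 3}, [{0, 1}, {1, 2, 3}], cpu=0))  # ['groups overlap']
```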
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
In kernel/sched.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).
The latter function takes two arguments: the current CPU and whether it was idle
at the time the scheduler_tick() happened. It iterates over all sched domains
our CPU is on, starting from its base domain and going up the ->parent chain.
While doing that, it checks to see if the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.
Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.
*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.
In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Or, you could do multi-level NUMA; an Opteron, for example,
might have just one domain covering its one NUMA level.
The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.
For SMT, the architecture must define CONFIG_SCHED_SMT and provide a
cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of
all "i"'s siblings as well as "i" itself.
Architectures may override the default SD_*_INIT flags while using the
generic domain builder in kernel/sched.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This can be
done by #define'ing ARCH_HAS_SCHED_TUNE.
Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HAS_SCHED_DOMAIN, and exporting your
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.
The sched-domains debugging infrastructure can be enabled by enabling
CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
which should catch most possible errors (described above). It also prints out
the domain structure in a visual format.

@@ -0,0 +1,108 @@
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.
Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.
Unfortunately that was not that easy to implement under the old
scheduler, (otherwise we'd have done it long ago) because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):
                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |
So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)
This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling. (and would thus thrash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19.)
So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice+19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization, we only got complaints about it (still) being
too _strong_ :-)
To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design level
coupling to timeslices and granularity it was not really viable.
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative":
int nice(int inc);
asmlinkage long sys_nice(int increment)
(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.
A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
run audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.
The new scheduler in v2.6.23 addresses all three types of complaints:
To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.
To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
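
The multiplicative rule can be checked numerically. The sketch below assumes
a weight of 1024 at nice 0 and a factor of 1.25 per nice level; neither
number is stated in this document, and the kernel itself uses a precomputed
integer table, so real values are rounded. The function names are
hypothetical.

```python
def weight(nice, base=1024.0, step=1.25):
    """Assumed load weight for a nice level: each step scales by 1.25."""
    return base / (step ** nice)


def split(nice_a, nice_b):
    """CPU fractions of two always-runnable tasks sharing one CPU."""
    wa, wb = weight(nice_a), weight(nice_b)
    return wa / (wa + wb), wb / (wa + wb)


for pair in [(10, 11), (-5, -4)]:
    a, b = split(*pair)
    print(pair, f"{a:.1%} / {b:.1%}")
```

Both pairs give the same split of roughly 55% to 45%, because only the
difference between the two nice levels enters the ratio. Under the same
assumptions a nice +19 task competing with a nice 0 task gets about 1.4%,
close to the 1.5% figure quoted above.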
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.

@@ -0,0 +1,183 @@
Real-Time group scheduling
--------------------------
CONTENTS
========
0. WARNING
1. Overview
  1.1 The problem
  1.2 The solution
2. The interface
  2.1 System-wide settings
  2.2 Default behaviour
  2.3 Basis for grouping tasks
3. Future plans
0. WARNING
==========
Fiddling with these settings can result in an unstable system; the knobs are
root only and assume root knows what they are doing.
Most notable:
* very small values in sched_rt_period_us can result in an unstable
system when the period is smaller than either the available hrtimer
resolution, or the time it takes to handle the budget refresh itself.
* very small values in sched_rt_runtime_us can result in an unstable
system when the runtime is so small the system has difficulty making
forward progress (NOTE: the migration thread and kstopmachine both
are real-time processes).
1. Overview
===========
1.1 The problem
---------------
Realtime scheduling is all about determinism; a group has to be able to rely on
the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the CPU time available. Without a minimum guarantee a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. Which leaves us with just the single fixed portion.
1.2 The solution
----------------
CPU time is divided by means of specifying how much time can be spent running
in a given period. We allocate this "run time" for each realtime group which
the other realtime groups will not be permitted to use.
Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.
Let's consider an example: a fixed frame rate realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
* 0.04s = 0.032s.
This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
of 0.00015s.
The remaining CPU time will be used for user input and other tasks. Because
realtime tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.
NOTE: the above example is not fully implemented yet. We still
lack an EDF scheduler to make non-uniform periods usable.
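
The arithmetic of the example above, converted to the microsecond units used
by the interface, can be reproduced as follows. The helper `rt_budget_us` is
hypothetical and exists only to redo the calculation.

```python
def rt_budget_us(period_s, cpu_fraction):
    """Return (period_us, runtime_us) for a period in seconds and the
    fraction of CPU time the group should receive."""
    period_us = round(period_s * 1_000_000)
    runtime_us = round(period_us * cpu_fraction)
    return period_us, runtime_us


print(rt_budget_us(0.04, 0.8))    # graphics: (40000, 32000)
print(rt_budget_us(0.005, 0.03))  # audio:    (5000, 150)
```

The graphics group gets 32000us out of every 40000us and the audio group
150us out of every 5000us, matching the 0.032s and 0.00015s figures in the
text.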
2. The Interface
================
2.1 System wide settings
------------------------
The system wide settings are configured under the /proc virtual file system:
/proc/sys/kernel/sched_rt_period_us:
The scheduling period that is equivalent to 100% CPU bandwidth
/proc/sys/kernel/sched_rt_runtime_us:
A global limit on how much time realtime scheduling may use. Even without
CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
available to all realtime groups.
* Time is specified in us because the interface is s32. This gives an
operating range from 1us to about 35 minutes.
* sched_rt_period_us takes values from 1 to INT_MAX.
* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
* A run time of -1 specifies runtime == period, ie. no limit.
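
The limits listed above follow from the signed 32-bit type. A short sketch of
the range arithmetic and of a range check is given below; the function
`valid_rt_settings` is made up for illustration and only encodes the value
ranges stated in the list.

```python
INT_MAX = 2**31 - 1  # largest value an s32 can hold

# Largest representable time span, in minutes.
max_minutes = INT_MAX / 1_000_000 / 60
print(f"{max_minutes:.2f} minutes")  # 35.79 minutes


def valid_rt_settings(period_us, runtime_us):
    """Check both values against the documented ranges."""
    return 1 <= period_us <= INT_MAX and -1 <= runtime_us <= INT_MAX - 1


print(valid_rt_settings(1_000_000, 950_000))  # True
print(valid_rt_settings(0, 950_000))          # False
```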
2.2 Default behaviour
---------------------
The default values are sched_rt_period_us = 1000000 (1s) and
sched_rt_runtime_us = 950000 (0.95s). This gives 0.05s to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
realtime task will not lock up the machine but leave a little time to recover
it. By setting runtime to -1 you'd get the old behaviour back.
By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.
Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!
2.3 Basis for grouping tasks
----------------------------
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.
This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group.
For more information on working with control groups, you should read
Documentation/cgroups/cgroups.txt as well.
Group settings are checked against the following limits in order to keep the
configuration schedulable:
\Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
For now, this can be simplified to just the following (but see Future plans):
\Sum_{i} runtime_{i} <= global_runtime
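The simplified limit can be sketched as a plain sum comparison. This only illustrates the inequality above and is not the kernel's own check; the group run times in the example are made-up values.

```python
def is_schedulable(group_runtimes_us, global_runtime_us):
    """True if the summed group run times fit within the global run time."""
    return sum(group_runtimes_us) <= global_runtime_us


# Hypothetical groups checked against the default global run time.
GLOBAL_RUNTIME_US = 950_000
fits = is_schedulable([500_000, 300_000], GLOBAL_RUNTIME_US)
overcommitted = is_schedulable([500_000, 300_000, 200_000], GLOBAL_RUNTIME_US)
```

Here the first configuration uses 800000us of the 950000us available, while adding a third group of 200000us exceeds the limit.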
3. Future plans
===============
There is work in progress to make the scheduling period for each group
("<cgroup>/cpu.rt_period_us") configurable as well.
The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically it's not very useful _yet_
as it's prone to starvation without deadline scheduling.
Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.
* group A: period=100000us, runtime=50000us
- this runs for 0.05s once every 0.1s
* group B: period= 50000us, runtime=25000us
- this runs for 0.025s twice every 0.1s (or once every 0.05s).
This means that currently a while (1) loop in A will run for the full period of
B and can starve B's tasks (assuming they are of lower priority) for a whole
period.
The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
full deadline scheduling to the Linux kernel. Deadline scheduling the above
groups and treating end of the period as a deadline will ensure that they both
get their allocated time.
Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current Linux PI infrastructure is geared towards
the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).
This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.

View File

@@ -0,0 +1,154 @@
Version 15 of schedstats dropped some sched_yield counters:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
Version 14 of schedstats includes support for sched_domains, which hit the
mainline kernel in 2.6.20 although it is identical to the stats from version
12 which was in the kernel from 2.6.13-2.6.19 (version 13 never saw a kernel
release). Some counters make more sense to be per-runqueue; others to be
per-domain. Note that domains (and their associated information) will only
be pertinent and available on machines utilizing CONFIG_SMP.
In version 14 of schedstat, there is at least one level of domain
statistics for each cpu listed, and there may well be more than one
domain. Domains have no particular names in this implementation, but
the highest numbered one typically arbitrates balancing across all the
cpus on the machine, while domain0 is the most tightly focused domain,
sometimes balancing only between pairs of cpus. At this time, there
are no architectures which need more than three domain levels. The first
field in the domain stats is a bit map indicating which cpus are affected
by that domain.
These fields are counters, and only increment. Programs which make use
of these will need to start with a baseline observation and then calculate
the change in the counters at each subsequent observation. A perl script
which does this for many of the fields is available at
http://eaglet.rain.com/rick/linux/schedstat/
Note that any such script will necessarily be version-specific, as the main
reason to change versions is changes in the output format. For those wishing
to write their own scripts, the fields are described here.
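The baseline-and-delta approach described above can be sketched as follows. The function names and the sample lines are made up for illustration, and the sketch handles only lines whose fields after the label are all integers (such as the cpu<N> lines).

```python
def parse_counters(line):
    """Split a 'cpu<N> ...' line into its label and integer counters."""
    label, *fields = line.split()
    return label, [int(f) for f in fields]


def counter_deltas(baseline_line, current_line):
    """Change in each counter between two observations of the same line."""
    base_label, base = parse_counters(baseline_line)
    cur_label, cur = parse_counters(current_line)
    if base_label != cur_label or len(base) != len(cur):
        raise ValueError("observations do not describe the same line")
    return [now - before for before, now in zip(base, cur)]


# Made-up sample observations, not real kernel output.
baseline = "cpu0 10 0 500 120 300 40 9000 700 450"
current = "cpu0 12 0 560 130 330 45 9600 760 500"
deltas = counter_deltas(baseline, current)
```

Because the counters only increment, each delta is the activity that occurred between the two observations.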
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
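A minimal sketch of mapping the nine fields above onto names. The field names are descriptive labels chosen for this example, not kernel identifiers, and the sample line is made up.

```python
# Illustrative names for fields 1-9, in the order listed above.
CPU_FIELDS = (
    "yield_calls",       # 1) sched_yield() calls
    "legacy_expired",    # 2) legacy O(1) field, always zero
    "schedule_calls",    # 3) schedule() calls
    "schedule_idle",     # 4) schedule() left the processor idle
    "ttwu_calls",        # 5) try_to_wake_up() calls
    "ttwu_local",        # 6) try_to_wake_up() on the local cpu
    "run_time",          # 7) time spent running
    "wait_time",         # 8) time spent waiting to run
    "timeslices",        # 9) timeslices run on this cpu
)


def parse_cpu_line(line):
    """Return (cpu number, {field name: value}) for one cpu<N> line."""
    label, *values = line.split()
    if not label.startswith("cpu") or len(values) != len(CPU_FIELDS):
        raise ValueError("not a 9-field cpu<N> line")
    return int(label[3:]), dict(zip(CPU_FIELDS, map(int, values)))


def mean_wait_per_timeslice(stats):
    """Average wait before running, in the same unit as field 8."""
    if stats["timeslices"] == 0:
        return 0.0
    return stats["wait_time"] / stats["timeslices"]


cpu, stats = parse_cpu_line("cpu0 12 0 560 130 330 45 9600 760 500")
mean_wait = mean_wait_per_timeslice(stats)
```

Dividing field 8 by field 9 is one simple way to turn the latency counters into a per-timeslice figure; in practice it would be applied to deltas between two observations rather than to raw totals.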
Domain statistics
-----------------
One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output.)
domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
The first field is a bit mask indicating what cpus this domain operates over.
The next 24 are a variety of load_balance() statistics grouped into types
of idleness (idle, busy, and newly idle):
1) # of times in this domain load_balance() was called when the
cpu was idle
2) # of times in this domain load_balance() checked but found
the load did not require balancing when the cpu was idle
3) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was idle
4) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
5) # of times in this domain pull_task() was called when the cpu
was idle
6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
7) # of times in this domain load_balance() was called but did
not find a busier queue while the cpu was idle
8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
9) # of times in this domain load_balance() was called when the
cpu was busy
10) # of times in this domain load_balance() checked but found the
load did not require balancing when busy
11) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was busy
12) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was busy
13) # of times in this domain pull_task() was called when busy
14) # of times in this domain pull_task() was called even though the
target task was cache-hot when busy
15) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was busy
16) # of times in this domain a busier queue was found while the cpu
was busy but no busier group was found
17) # of times in this domain load_balance() was called when the
cpu was just becoming idle
18) # of times in this domain load_balance() checked but found the
load did not require balancing when the cpu was just becoming idle
19) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was just becoming idle
20) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was just becoming idle
21) # of times in this domain pull_task() was called when newly idle
22) # of times in this domain pull_task() was called even though the
target task was cache-hot when just becoming idle
23) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was just becoming idle
24) # of times in this domain a busier queue was found while the cpu
was just becoming idle but no busier group was found
Next three are active_load_balance() statistics:
25) # of times active_load_balance() was called
26) # of times active_load_balance() tried to move a task and failed
27) # of times active_load_balance() successfully moved a task
Next three are sched_balance_exec() statistics:
28) sbe_cnt is not used
29) sbe_balanced is not used
30) sbe_pushed is not used
Next three are sched_balance_fork() statistics:
31) sbf_cnt is not used
32) sbf_balanced is not used
33) sbf_pushed is not used
Next three are try_to_wake_up() statistics:
34) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
35) # of times in this domain try_to_wake_up() moved a task to the
waking cpu because it was cache-cold on its own cpu anyway
36) # of times in this domain try_to_wake_up() started passive balancing
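The layout above (a cpumask, three groups of eight load_balance() counters, then four groups of three) can be sketched as a small parser. All names here are labels chosen for this example, the sample line is made up, and the cpumask is kept as an uninterpreted string.

```python
# Illustrative names for the eight counters repeated per idleness type.
LB_FIELDS = (
    "calls", "balanced", "failed", "imbalance",
    "pull_task", "pull_task_hot", "no_busier_queue", "no_busier_group",
)
IDLE_TYPES = ("idle", "busy", "newly_idle")


def parse_domain_line(line):
    """Split one domain<N> line into the groups described above."""
    name, cpumask, *raw = line.split()
    counters = [int(v) for v in raw]
    if not name.startswith("domain") or len(counters) != 36:
        raise ValueError("not a 36-field domain<N> line")
    result = {"name": name, "cpumask": cpumask}
    # Fields 1-24: eight load_balance() counters per idleness type.
    for i, idle_type in enumerate(IDLE_TYPES):
        chunk = counters[i * 8:(i + 1) * 8]
        result[idle_type] = dict(zip(LB_FIELDS, chunk))
    # Fields 25-36: four groups of three counters each.
    result["active_load_balance"] = counters[24:27]
    result["sched_balance_exec"] = counters[27:30]   # unused
    result["sched_balance_fork"] = counters[30:33]   # unused
    result["try_to_wake_up"] = counters[33:36]
    return result


# Made-up sample: counters are simply 1..36 so positions are easy to see.
sample = "domain0 03 " + " ".join(str(n) for n in range(1, 37))
domain = parse_domain_line(sample)
```

With the sample line, each value equals its own field number, which makes it easy to confirm that every group starts at the position given in the list.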
/proc/<pid>/schedstat
---------------------
schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file, which for that process correspond to:
1) time spent on the cpu
2) time spent waiting on a runqueue
3) # of timeslices run on this cpu
A program could be easily written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at
http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
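
For readers who prefer a scripting language, the three per-process fields described above can be read with a few lines. This is a minimal sketch and not the latency.c program; the dictionary keys are illustrative, the sample text is made up, and values are left in whatever unit the kernel reports.

```python
def parse_pid_schedstat(text):
    """Parse the three fields of a /proc/<pid>/schedstat file."""
    on_cpu, waiting, timeslices = (int(v) for v in text.split())
    return {"on_cpu": on_cpu, "waiting": waiting, "timeslices": timeslices}


def read_pid_schedstat(pid):
    """Read and parse /proc/<pid>/schedstat for a running process."""
    with open("/proc/%d/schedstat" % pid) as f:
        return parse_pid_schedstat(f.read())


# Made-up file contents, not real kernel output.
stats = parse_pid_schedstat("1000 250 4\n")
```

Sampling read_pid_schedstat() twice and subtracting the results gives the time a process spent running and waiting over that interval.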