Completely Fair Scheduler

From Wikipedia, the free encyclopedia
Jump to: navigation, search
Completely Fair Scheduler
Original author(s) Ingo Molnár
Developer(s) Linux kernel developers
Written in C
Operating system Linux kernel
Type process scheduler
License GPL-2.0
Website kernel.org

The Completely Fair Scheduler (CFS) is a process scheduler which was merged into the 2.6.23 release of the Linux kernel and is the default scheduler. It handles CPU resource allocation for executing processes, and aims to maximize overall CPU utilization while also maximizing interactive performance.

Con Kolivas's work with CPU scheduling, most significantly his implementation of "fair scheduling" named Rotating Staircase Deadline, inspired Ingo Molnár to develop his CFS, as a replacement for the earlier O(1) scheduler, crediting Kolivas in his announcement.[1]

In contrast to the previous O(1) scheduler used in older Linux 2.6 kernels, the CFS scheduler implementation is not based on run queues. Instead, a red-black tree implements a "timeline" of future task execution. Additionally, the scheduler uses nanosecond granularity accounting, the atomic units by which an individual process' share of the CPU was allocated (thus making redundant the previous notion of timeslices). This precise knowledge also means that no specific heuristics are required to determine the interactivity of a process, for example.[2]

Like the old O(1) scheduler, CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. This means that interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.

Algorithm[edit]

The data structure used for the scheduling algorithm is a red-black tree in which the nodes are scheduler specific structures, entitled "sched_entity". These are derived from the general task_struct process descriptor, with added scheduler elements. These nodes are indexed by processor execution time in nanoseconds.[3] A maximum execution time is also calculated for each process. This time is based upon the idea that an "ideal processor" would equally share processing power amongst all processes. Thus, the maximum execution time is the time the process has been waiting to run, divided by the total number of processes, or in other words, the maximum execution time is the time the process would have expected to run on an "ideal processor".

When the scheduler is invoked to run a new processes, the operation of the scheduler is as follows:

  1. The left most node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution.
  2. If the process simply completes execution, it is removed from the system and scheduling tree.
  3. If the process reaches it's maximum execution time or is otherwise stopped (voluntarily or via interrupt) it is reinserted into the scheduling tree based on its new spent execution time.
  4. The new left-most node will then be selected from the tree, repeating the iteration.

If the process spends a lot of its time sleeping, then its spent time value is low and it automatically gets the priority boost when it finally needs it. Hence such tasks do not get less processor time than the tasks that are constantly running.

OS background[edit]

CFS is an implementation of a well-studied, classic scheduling algorithm called weighted fair queuing.[4]

Originally invented for packet networks, fair queuing had been previously applied to CPU scheduling under the name stride scheduling. However, CFS uses different terminology than normally applied to fair queuing. "Service error" (the amount in which a process's obtained CPU share differs from its expected CPU share) is called "wait_runtime" in Linux's implementation, and the term "queue virtual time" (QVT) was given the name "fair_clock".

The fair queuing CFS scheduler has a scheduling complexity of O(log N), where N is the number of tasks in the runqueue. Choosing a task can be done in constant time, but reinserting a task after it has run requires O(log N) operations, because the runqueue is implemented as a red-black tree.

CFS is the first implementation of a fair queuing process scheduler widely used in a general-purpose operating system.[4]

Fairer algorithms[edit]

Technically, the name "Completely Fair Scheduler" is not entirely correct, since the algorithm only guarantees the "unfair" level to be less than O(n), where n is the number of processes. There are more complicated algorithms[clarification needed] which can give better bounds over the "unfair" levels (e.g. O(log n)). [clarification needed]

The Linux kernel received a patch for CFS in November 2010 for the 2.6.38 kernel that has made the scheduler fairer for use on desktops and workstations. Developed by Mike Galbraith using ideas suggested by Linus Torvalds the patch is expected to significantly boost multi-tasking performance on most systems in that class.[5] The explanation of the basic algorithm implementation was included by Mike Galbraith in a LKML post[6] by him about the patch:

Each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase its refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly.

The primary issues solved by this are for multi-core as well as multi-cpu (SMP) systems experiencing increased interactive response times while performing other tasks that use many threads in those tasks. A simple explanation is that one will be able to still watch a video, read email and perform other typical desktop activities without glitches or choppiness while compiling the Linux kernel or a similar process such as encoding video. However there are objections on this statement.

This patch implements the tty task group creation only for fair class tasks and, as such, leaves the way open for enhancement. Even at this basic implementation this patch can make Linux on the desktop a reality for all those who have found desktop performance to be less than desired.[7] As Linus put it:[8]

So I think this is firmly one of those "real improvement" patches.

Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".

Controversy[edit]

In the same LKML post,[9] Lennart Poettering from Red Hat pointed out that this change is a policy, which is against the Unix philosophy of implementing "mechanism, not policy." He also gave an equivalent implementation[10] in shell script that was supposed to achieve the same result in userspace to prove although this patch demonstrates an optimization, there is no point to implement it in the kernel. The script was tested[11] by Markus Trippelsdorf and appeared to work better than the kernel patch.

Lennart Poettering also doubted the usefulness of the patch, as he explained to the reader,[12]

So, this patch only has an effect on people who build kernels from an xterm with make -j all day, and at the same time want to watch a movie, from a player they also start from a terminal, but from another one.

Linus Torvalds believed that the patch should be implemented and enabled in the kernel, but also recognized the value in implementing scheduler grouping as desktop policy,[13]

In fact, I don't think it would be at all wrong to have the desktop launcher have an option to "launch in a group" (although I think it would need to be named better than that) ... So I do _not_ believe that the autogroup feature should necessarily mean that you cannot do other grouping decisions too.

See also[edit]

References[edit]

External links[edit]