Completely Fair Scheduler
- Original author(s): Ingo Molnár
- Developer(s): Linux kernel developers
- Operating system: Linux kernel
The Completely Fair Scheduler (CFS) is a process scheduler that was merged into the 2.6.23 release of the Linux kernel and is the default scheduler. It handles CPU resource allocation for executing processes, aiming to maximize overall CPU utilization while also maximizing interactive performance.
Con Kolivas's work on CPU scheduling, most significantly his "fair scheduling" implementation named Rotating Staircase Deadline, inspired Ingo Molnár to develop CFS as a replacement for the earlier O(1) scheduler; Molnár credited Kolivas in his announcement.
In contrast to the previous O(1) scheduler used in older Linux 2.6 kernels, the CFS implementation is not based on run queues. Instead, a red-black tree implements a "timeline" of future task execution. Additionally, the scheduler uses nanosecond-granularity accounting as the atomic unit by which an individual process's share of the CPU is allocated, making the previous notion of timeslices redundant. This precise accounting also means that no special heuristics are required to determine, for example, how interactive a process is.
Like the old O(1) scheduler, CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. This means that interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.
The scheduler stores records of the runnable tasks in a red-black tree, using spent processor time as the key. This allows it to efficiently pick the process that has used the least amount of CPU time, which is stored in the leftmost node of the tree. The picked process's entry is removed from the tree, its spent execution time is updated, and the entry is reinserted into the tree, where it normally lands at a different position. The new leftmost node is then picked, and the cycle repeats.
If a task spends much of its time sleeping, its spent-time value stays low, so it automatically gets a priority boost when it finally needs the CPU. Hence such tasks do not receive less processor time than tasks that run constantly.
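The pick/update/reinsert loop described above can be sketched in a few lines. This is an illustrative model, not the kernel's code: a Python heap stands in for the red-black tree (both support extracting the entry with the smallest key and reinserting it in O(log N) time), and all names here (`Task`, `run_once`, the 1 ms slice) are invented for the example.

```python
import heapq

class Task:
    def __init__(self, name):
        self.name = name
        self.spent_ns = 0          # total CPU time consumed, in nanoseconds

    def __lt__(self, other):       # the heap orders tasks by spent CPU time
        return self.spent_ns < other.spent_ns

def run_once(timeline):
    """Pick the task that has run least, charge it one slice, reinsert it."""
    task = heapq.heappop(timeline)   # "leftmost node": least spent time
    task.spent_ns += 1_000_000       # pretend it ran for 1 ms
    heapq.heappush(timeline, task)   # reinserted at a new position
    return task

tasks = [Task("compiler"), Task("encoder"), Task("editor")]
timeline = list(tasks)
heapq.heapify(timeline)

order = [run_once(timeline).name for _ in range(6)]
# Over 6 slices, each of the 3 tasks is picked exactly twice: every pick
# charges the chosen task, pushing it behind the others until they catch up.
```

The real scheduler uses a red-black tree rather than a heap partly because it must also remove arbitrary entries efficiently, for example when a task blocks; a plain binary heap only supports cheap removal of the minimum.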
Originally invented for packet networks, fair queuing had previously been applied to CPU scheduling under the name stride scheduling. However, CFS uses different terminology than is normally applied to fair queuing: the "service error" (the amount by which a process's obtained CPU share differs from its expected CPU share) is called "wait_runtime" in Linux's implementation, and the "queue virtual time" (QVT) is called "fair_clock".
The fair queuing CFS scheduler has a scheduling complexity of O(log N), where N is the number of tasks in the runqueue. Choosing a task can be done in constant time, but reinserting a task after it has run requires O(log N) operations, because the runqueue is implemented as a red-black tree.
Technically, the name "Completely Fair Scheduler" is not entirely accurate, since the algorithm only guarantees that the level of "unfairness" is bounded by O(n), where n is the number of processes. More complicated algorithms exist that can give tighter bounds on unfairness (e.g. O(log n)).
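The notion of bounded unfairness can be made concrete with a toy simulation. The sketch below is hypothetical and far simpler than CFS (equal-weight tasks, fixed quanta, no sleeping): it measures each task's "service error", the gap between its fair share of the elapsed time and the CPU time it actually received, under the least-spent-first policy, and shows that in this simplified setting the error never reaches one quantum.

```python
# Illustrative model, not the kernel's code: four equal-weight tasks
# scheduled least-spent-first in fixed quanta; track the service error.

SLICE = 1          # one scheduling quantum
NTASKS = 4

spent = [0] * NTASKS
errors = []

for step in range(1, 401):
    i = spent.index(min(spent))      # pick the task that has run least
    spent[i] += SLICE
    elapsed = step * SLICE
    expected = elapsed / NTASKS      # fair share for equal-weight tasks
    # service error for every task at this instant
    errors.extend(abs(expected - s) for s in spent)

max_error = max(errors)
# With N equal tasks the spent times never differ by more than one slice,
# so the error is at most (N - 1)/N of a slice: 0.75 here.
```

The O(n) and O(log n) bounds quoted above concern the general case, with task weights and tasks that block and wake; this toy only illustrates what "service error" measures.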
In November 2010, the Linux kernel received a patch to CFS for the 2.6.38 release intended to make the scheduler fairer for use on desktops and workstations. Developed by Mike Galbraith using ideas suggested by Linus Torvalds, the patch was expected to significantly improve multitasking performance on most systems of that class. Galbraith explained the basic implementation of the algorithm in an LKML post about the patch:
Each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase its refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used.
The feature is enabled at boot by default if CONFIG_SCHED_AUTOGROUP is selected, can be disabled via the boot option noautogroup, and can also be toggled on and off at runtime.
The primary problem addressed is interactivity on multi-core and multi-CPU (SMP) systems: response times suffer while other, heavily threaded tasks run. Put simply, a user should still be able to watch a video, read email, and perform other typical desktop activities without glitches or choppiness while compiling the Linux kernel or running a similarly heavy workload such as video encoding. However, this claim has been disputed.
The patch implements tty-based task-group creation only for fair-class tasks and, as such, leaves the way open for enhancement. Even in this basic form, the patch was presented as a substantial improvement for those who had found Linux desktop performance to be less than desired. As Linus put it:
So I think this is firmly one of those "real improvement" patches.
Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".
In the same LKML thread, Lennart Poettering of Red Hat pointed out that the change implements policy, contrary to the Unix philosophy of "mechanism, not policy." He gave an equivalent implementation as a shell script that achieved the same result in userspace, to argue that although the patch demonstrates a useful optimization, there is no need to implement it in the kernel. Markus Trippelsdorf tested the script, and it appeared to work better than the kernel patch.
Poettering also doubted the usefulness of the patch, explaining:
So, this patch only has an effect on people who build kernels from an xterm with make -j all day, and at the same time want to watch a movie, from a player they also start from a terminal, but from another one.
Linus Torvalds believed that the patch should be implemented and enabled in the kernel, but also recognized the value of implementing scheduler grouping as desktop policy:
In fact, I don't think it would be at all wrong to have the desktop launcher have an option to "launch in a group" (although I think it would need to be named better than that) ... So I do _not_ believe that the autogroup feature should necessarily mean that you cannot do other grouping decisions too.
- Molnár, Ingo (2007-04-13). "[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]". linux-kernel mailing list. http://lwn.net/Articles/230501/.
- Andrews, Jeremy (2007-04-18). "Linux: The Completely Fair Scheduler". KernelTrap.
- CFS description at ibm.com
- Li, T.; Baumberger, D.; Hahn, S. (2009). "Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin". ACM SIGPLAN Notices 44 (4): 65. doi:10.1145/1594835.1504188.
- The ~200 Line Linux Kernel Patch That Does Wonders
- The Linux desktop may soon be a lot faster
- Lennart's first implementation
- Markus' test against Lennart's script
- Lennart's interpretation of the patch
- Corbet, Jonathan (2007-04-17). "Schedulers: The Plot Thickens". LWN.net.
- Corbet, J. (2007-07-02). "CFS Group Scheduling". LWN.net.