Slurm Workload Manager
Developer(s) | SchedMD
---|---
Written in | C
Operating system | Linux, BSDs
Type | Job scheduler for clusters and supercomputers
License | GNU General Public License
Website | slurm
The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
It provides three key functions:
- allocating exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,
- providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
- arbitrating contention for resources by managing a queue of pending jobs. A minimal job submission exercising these functions is sketched below.
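The sketch below shows these functions from a user's point of view: #SBATCH directives request a resource allocation, srun launches a parallel program on the allocated nodes, and sbatch places the job in the queue of pending work. It is a minimal, illustrative sketch rather than part of Slurm itself; the partition name, the mpi_hello program, and the Python wrapper around sbatch are assumptions.

```python
# Minimal sketch: compose a Slurm batch script and submit it with sbatch.
# The partition name "debug" and the program ./mpi_hello are hypothetical.
import subprocess

batch_script = """\
#!/bin/bash
#SBATCH --job-name=mpi_hello        # name shown in the queue
#SBATCH --partition=debug           # hypothetical partition name
#SBATCH --nodes=2                   # size of the requested allocation
#SBATCH --ntasks=8                  # total tasks (e.g. MPI ranks)
#SBATCH --time=00:10:00             # walltime limit used for scheduling
#SBATCH --output=mpi_hello_%j.out   # %j expands to the job ID

srun ./mpi_hello                    # launch the parallel job on the allocated nodes
"""

# sbatch reads the script from stdin and prints e.g. "Submitted batch job 12345".
result = subprocess.run(["sbatch"], input=batch_script,
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```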
Slurm is the workload manager on about 60% of the TOP500 supercomputers.[1]
Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.[2]
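The following toy sketch illustrates the best-fit idea in isolation: among the free contiguous blocks of nodes (ordered, for example, along a Hilbert curve or within a fat-tree switch), the scheduler picks the smallest block that still satisfies the request, keeping larger blocks intact for larger jobs. It is a deliberate simplification, not Slurm's actual implementation.

```python
# Toy best-fit selection over free contiguous node blocks.
# free_blocks: list of (first_node, length) pairs; nodes_needed: request size.
def best_fit(free_blocks, nodes_needed):
    candidates = [block for block in free_blocks if block[1] >= nodes_needed]
    if not candidates:
        return None                                      # request stays pending
    return min(candidates, key=lambda block: block[1])   # tightest fitting block

# Blocks of 3, 8 and 4 consecutive nodes are free; a 4-node request takes the
# 4-node block and leaves the 8-node block available for a larger job.
print(best_fit([(0, 3), (10, 8), (32, 4)], 4))           # -> (32, 4)
```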
History
Slurm began development as a collaborative effort primarily by Lawrence Livermore National Laboratory, SchedMD,[3] Linux NetworX, Hewlett-Packard, and Groupe Bull as a Free Software resource manager. It was inspired by the closed source Quadrics RMS and shares a similar syntax. The name is a reference to the soda in Futurama.[4] Over 100 people around the world have contributed to the project. It has since evolved into a sophisticated batch scheduler capable of satisfying the requirements of many large computer centers.
As of November 2021, the TOP500 list of the world's most powerful computers indicates that Slurm is the workload manager on more than half of the top ten systems.
Structure
Slurm's design is very modular with about 100 optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes. More sophisticated configurations provide database integration for accounting, management of resource limits and workload prioritization.
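As a rough illustration of the plugin-based design, the sketch below lists a few slurm.conf keys that each select a plugin (scheduling, resource selection, accounting storage) and reads them with a toy key=value parser. The key names follow the Slurm documentation, but the values shown and the parser are illustrative assumptions, not a working site configuration.

```python
# Sketch only: plugin-selecting keys from a hypothetical slurm.conf excerpt.
minimal_conf = """
ClusterName=mycluster
SlurmctldHost=head-node
SchedulerType=sched/backfill                         # scheduling plugin
SelectType=select/cons_tres                          # resource-selection plugin
AccountingStorageType=accounting_storage/slurmdbd    # accounting plugin (database-backed)
"""

# slurm.conf uses key=value lines; anything after '#' is a comment.
settings = {}
for line in minimal_conf.splitlines():
    line = line.split("#", 1)[0].strip()
    if "=" in line:
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

print(settings["AccountingStorageType"])   # -> accounting_storage/slurmdbd
```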
Features
Slurm features include:
- No single point of failure, backup daemons, fault-tolerant job options
- Highly scalable (schedules up to 100,000 independent jobs on the 100,000 sockets of IBM Sequoia)
- High performance (up to 1000 job submissions per second and 600 job executions per second)
- Free and open-source software (GNU General Public License)
- Highly configurable with about 100 plugins
- Fair-share scheduling with hierarchical bank accounts
- Preemptive and gang scheduling (time-slicing of parallel jobs)
- Integrated with database for accounting and configuration
- Resource allocations optimized for network topology and on-node topology (sockets, cores and hyperthreads)
- Advanced reservation
- Idle nodes can be powered down
- Different operating systems can be booted for each job
- Scheduling for generic resources (e.g. graphics processing units)
- Real-time accounting down to the task level (identify specific tasks with high CPU or memory usage)
- Resource limits by user or bank account
- Accounting for power consumption by job
- Support of IBM Parallel Environment (PE/POE)
- Support for job arrays (a submission sketch combining job arrays with GPU generic resources follows this list)
- Job profiling (periodic sampling of each task's CPU use, memory use, power consumption, network and file system use)
- Sophisticated multifactor job prioritization algorithms
- Support for MapReduce+
- Support for burst buffer that accelerates scientific data movement
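As an illustration of two of the listed features, the sketch below submits a job array in which each task requests one GPU as a generic resource. The process_chunk.py script and the Python wrapper around sbatch are assumptions; the sbatch options themselves (--array, --gres) are standard.

```python
# Sketch: a ten-task job array, one GPU (generic resource) per task.
# The processing script process_chunk.py is hypothetical.
import subprocess

array_script = """\
#!/bin/bash
#SBATCH --job-name=gpu_array
#SBATCH --array=0-9                    # ten independent array tasks
#SBATCH --gres=gpu:1                   # one GPU per task
#SBATCH --time=01:00:00
#SBATCH --output=gpu_array_%A_%a.out   # %A = array job ID, %a = task index

# Each task receives its own index in SLURM_ARRAY_TASK_ID.
srun python process_chunk.py --chunk "${SLURM_ARRAY_TASK_ID}"
"""

result = subprocess.run(["sbatch"], input=array_script,
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```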
The following features were announced for version 14.11 of Slurm, which was released in November 2014:[5]
- Improved job array data structure and scalability
- Support for heterogeneous generic resources
- Add user options to set the CPU governor
- Automatic job requeue policy based on exit value
- Report API use by user, type, count and time consumed
- Communication gateway nodes improve scalability
Supported platforms
Slurm is primarily developed to work alongside Linux distributions, although there is also support for a few other POSIX-based operating systems, including BSDs (FreeBSD, NetBSD and OpenBSD).[6] Slurm also supports several unique computer architectures, including:
- IBM BlueGene/Q models, including the 20 petaflop IBM Sequoia
- Cray XT, XE and Cascade
- Tianhe-2, a 33.9 petaflop system with 32,000 Intel Ivy Bridge chips and 48,000 Intel Xeon Phi chips, totaling 3.1 million cores
- IBM Parallel Environment
- Anton
License
Slurm is available under the GNU General Public License v2.
Commercial support
In 2010, the developers of Slurm founded SchedMD, which maintains the canonical source, provides development, level 3 commercial support and training services. Commercial support is also available from Bull, Cray, and Science + Computing.
See also
- Job Scheduler and Batch Queuing for Clusters
- Beowulf cluster
- Maui Cluster Scheduler
- Open Source Cluster Application Resources (OSCAR)
- TORQUE
- Univa Grid Engine
- Platform LSF
References
- ^ "Running a Job on HPC using Slurm | HPC | USC". hpcc.usc.edu. Archived from the original on 2019-03-06. Retrieved 2019-03-05.
- ^ Pascual, Jose Antonio; Navaridas, Javier; Miguel-Alonso, Jose (2009). Effects of Topology-Aware Allocation Policies on Scheduling Performance. Job Scheduling Strategies for Parallel Processing. Lecture Notes in Computer Science. Vol. 5798. pp. 138–144. doi:10.1007/978-3-642-04633-9_8. ISBN 978-3-642-04632-2.
- ^ "Slurm Commercial Support, Development, and Installation". SchedMD. Retrieved 2014-02-23.
- ^ "SLURM: Simple Linux Utility for Resource Management" (PDF). 23 June 2003. Retrieved 11 January 2016.
- ^ "Slurm - What's New". SchedMD. Retrieved 2014-08-29.
- ^ Slurm Platforms
Further reading
- Balle, Susanne M.; Palermo, Daniel J. (2008). Enhancing an Open Source Resource Manager with Multi-core/Multi-threaded Support. Job Scheduling Strategies for Parallel Processing. Lecture Notes in Computer Science. Vol. 4942. p. 37. doi:10.1007/978-3-540-78699-3_3. ISBN 978-3-540-78698-6.
- Jette, M.; Grondona, M. (June 2003). "SLURM: Simple Linux Utility for Resource Management" (PDF). Proceedings of ClusterWorld Conference and Expo. San Jose, California.
- Layton, Jeffrey B. (5 February 2009). "Caos NSA and Perceus: All-in-one Cluster Software Stack". Linux Magazine. Archived from the original on February 11, 2009.
- Yoo, Andy B.; Jette, Morris A.; Grondona, Mark (2003). SLURM: Simple Linux Utility for Resource Management. Job Scheduling Strategies for Parallel Processing. Lecture Notes in Computer Science. Vol. 2862. p. 44. CiteSeerX 10.1.1.10.6834. doi:10.1007/10968987_3. ISBN 978-3-540-20405-3.