Slurm Workload Manager - Overview

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
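For a concrete sense of the first two functions (the node count, task count, and time limit below are arbitrary illustrative values, not taken from this document), a user can request an allocation and launch work on it in a single step with srun:

    # Request 2 nodes for 8 tasks with a 10-minute limit; the request waits in
    # the queue until resources are free, then the command runs on the allocation.
    srun --nodes=2 --ntasks=8 --time=10 hostname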

Architecture

Slurm has a centralized manager, slurmctld, to monitor resources and work. There may also be a backup manager to assume those responsibilities in the event of failure. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. The slurmd daemons provide fault-tolerant hierarchical communications. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database. There is an optional slurmrestd (Slurm REST API Daemon) which can be used to interact with Slurm through its REST API. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed. The sview command graphically reports system and job status, including network topology. There is an administrative tool, scontrol, available to monitor and/or modify configuration and state information on the cluster. The administrative tool used to manage the database is sacctmgr. It can be used to identify the clusters, valid users, valid bank accounts, etc. APIs are available for all functions.
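For example, a short interactive session with these tools might look like the following sketch (the job ID and node name are illustrative, not values from this document):

    sinfo                       # report partition and node status
    srun --ntasks=1 hostname    # initiate a one-task job and wait for it to complete
    squeue --user=$USER         # report the status of this user's jobs
    scancel 1234                # terminate queued or running job 1234 (illustrative ID)
    sacct --jobs=1234           # accounting information for job 1234 and its steps
    scontrol show node lx0003   # monitor detailed state of one node (illustrative name)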

Figure 1. Slurm components

Slurm has a general-purpose plugin mechanism available to easily support various infrastructures. This permits a wide variety of Slurm configurations using a building block approach. These plugins presently include:

  • Accounting Storage: Primarily used to store historical data about jobs. When used with SlurmDBD (Slurm Database Daemon), it can also supply a limits-based system along with historical system status.
  • Account Gather Energy: Gathers energy consumption data per job or per node in the system. This plugin is integrated with the Accounting Storage and Job Account Gather plugins.
  • Authentication of communications: Provides authentication mechanism between various components of Slurm.
  • Containers: HPC workload container support and implementations.
  • Credential (Digital Signature Generation): Mechanism used to generate a digital signature, which is used to validate that job step is authorized to execute on specific nodes. This is distinct from the plugin used for Authentication since the job step request is sent from the user's srun command rather than directly from the slurmctld daemon, which generates the job step credential and its digital signature.
  • Generic Resources: Provides an interface to control generic resources, including Graphical Processing Units (GPUs).
  • Job Submit: Custom plugin to allow site specific control over job requirements at submission and update.
  • Job Accounting Gather: Gather job step resource utilization data.
  • Job Completion Logging: Log a job's termination data. This is typically a subset of data stored by an Accounting Storage Plugin.
  • Launchers: Controls the mechanism used by the 'srun' command to launch the tasks.
  • MPI: Provides different hooks for the various MPI implementations. For example, this can set MPI specific environment variables.
  • Preempt: Determines which jobs can preempt other jobs and the preemption mechanism to be used.
  • Priority: Assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
  • Process tracking (for signaling): Provides a mechanism for identifying the processes associated with each job. Used for job accounting and signaling.
  • Scheduler: Plugin determines how and when Slurm schedules jobs.
  • Node selection: Plugin used to determine the resources used for a job allocation.
  • Site Factor (Priority): Assigns a specific site_factor component of a job's multifactor priority to jobs upon submission and on an ongoing basis (e.g. as they age).
  • Switch or interconnect: Plugin to interface with a switch or interconnect. For most systems (Ethernet or InfiniBand) this is not needed.
  • Task Affinity: Provides a mechanism to bind a job and its individual tasks to specific processors.
  • Network Topology: Optimizes resource selection based upon the network topology. Used for both job allocations and advanced reservation.
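Most of these plugin types are selected through parameters in slurm.conf. The fragment below is a sketch of how one implementation might be chosen for several of them; the specific values (munge authentication, backfill scheduling, multifactor priority, cgroup-based tracking, etc.) are common examples rather than settings prescribed by this document, and the available plugin names vary by Slurm version.

    AuthType=auth/munge                                 # Authentication of communications
    SchedulerType=sched/backfill                        # Scheduler
    SelectType=select/cons_tres                         # Node selection
    PriorityType=priority/multifactor                   # Priority
    PreemptType=preempt/partition_prio                  # Preempt
    ProctrackType=proctrack/cgroup                      # Process tracking
    TaskPlugin=task/affinity                            # Task affinity
    JobAcctGatherType=jobacct_gather/cgroup             # Job accounting gather
    AccountingStorageType=accounting_storage/slurmdbd   # Accounting storage via slurmdbd
    JobSubmitPlugins=lua                                # Job submit (site-specific Lua script)
    MpiDefault=pmix                                     # MPI
    TopologyPlugin=topology/tree                        # Network topology
    GresTypes=gpu                                       # Generic resources (GPUs)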

The entities managed by these Slurm daemons, shown in Figure 2, include nodes, the compute resource in Slurm; partitions, which group nodes into logical sets; jobs, or allocations of resources assigned to a user for a specified amount of time; and job steps, which are sets of (possibly parallel) tasks within a job. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation. Slurm provides resource management for the processors allocated to a job, so that multiple job steps can be simultaneously submitted and queued until there are available resources within the job's allocation.
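As a sketch of how job steps relate to a job's allocation (the partition name, sizes, and program names below are hypothetical), a batch script submitted with the sbatch command can launch several steps inside one allocation:

    #!/bin/bash
    #SBATCH --partition=debug      # hypothetical partition
    #SBATCH --nodes=4              # the job's allocation: 4 nodes
    #SBATCH --time=00:30:00        # job time limit

    # One job step using the entire allocation
    srun --nodes=4 ./whole_allocation_step     # hypothetical program

    # Two job steps that each use part of the allocation, run concurrently
    srun --nodes=2 ./partial_step &            # hypothetical program
    srun --nodes=2 ./partial_step &
    wait

Steps requested beyond the resources currently free inside the allocation are queued until earlier steps release them, as described above.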

Figure 2. Slurm entities

Configurability

Node states monitored include: count of processors, size of real memory, size of temporary disk space, and state (UP, DOWN, etc.). Additional node information includes weight (preference in being allocated work) and features (arbitrary information such as processor speed or type). Nodes are grouped into partitions, which may contain overlapping nodes, so they are best thought of as job queues. Partition information includes: name, list of associated nodes, state (UP or DOWN), maximum job time limit, maximum node count per job, group access list, priority (important if nodes are in multiple partitions) and shared node access policy with optional over-subscription level for gang scheduling (e.g. YES, NO or FORCE:2). Bit maps are used to represent nodes, and scheduling decisions can be made by performing a small number of comparisons and a series of fast bit map manipulations. A sample (partial) Slurm configuration file follows.

#
# Sample /etc/slurm.conf
#
SlurmctldHost=linux0001  # Primary server
SlurmctldHost=linux0002  # Backup server
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT CPUs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]

Last modified 6 August 2021
