Useful Slurm commands — Research Computing University of Colorado Boulder documentation (2024)

Slurm provides a variety of tools that allow a user to manage and understand their jobs. This tutorial introduces these tools and provides details on how to use them.

Finding queuing information with squeue

The squeue command is a tool we use to pull up information about the jobs in the queue. By default, squeue prints the job ID, partition, job name, username, job state, elapsed time, number of nodes, and node list (or the reason a job is pending) for all jobs queued or running within Slurm. Usually you don’t need information on every job in the system, so we can limit the output to your own jobs with the --user flag:

$ squeue --user=your_rc-username

We can output non-abbreviated information with the --long flag. This flag prints the non-abbreviated default information with the addition of a time limit field:

$ squeue --user=your_rc-username --long

The squeue command can also report a job’s estimated start time when we add the --start flag to our command. This appends Slurm’s estimated start time for each job to the output.

Note: The start time provided by this command can be inaccurate, because it is calculated from the jobs queued or running in the system at that moment. If a job with a higher priority is queued after the command is run, your job may be delayed.

$ squeue --user=your_rc-username --start

When checking the status of a job, you may want to call squeue repeatedly to check for updates. We can accomplish this by adding the --iterate flag to our squeue command. This runs squeue every n seconds, giving a continuously updated view of the queue without retyping the command:

$ squeue --user=your_rc-username --start --iterate=n_seconds

Press ctrl-c to stop the command from looping and return to the terminal.

For more information on squeue, visit the Slurm page on squeue
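When many of your jobs are queued at once, a per-state count can be quicker to read than the full listing. The sketch below assumes squeue's --noheader flag and the %T format specifier (the job state in extended form, such as RUNNING or PENDING). The helper name count_job_states is hypothetical and not part of Slurm, and the states in the demo line are made-up sample input.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads one job state per line
# on stdin and prints "STATE COUNT" pairs, sorted by state name.
count_job_states() {
    awk 'NF { count[$1]++ } END { for (state in count) print state, count[state] }' | sort
}

# Demo on made-up sample input, so the sketch runs without Slurm installed.
printf 'RUNNING\nPENDING\nRUNNING\n' | count_job_states

# On a cluster, the input would come from squeue instead, for example:
#   squeue --user=your_rc-username --noheader --format=%T | count_job_states
```

Keeping the counting step separate from the squeue call means the same helper can be reused on saved output.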

Stopping jobs with scancel

Sometimes you may need to stop a job entirely while it’s running. The best way to accomplish this is with the scancel command. The scancel command allows you to cancel jobs you are running on Research Computing resources using the job’s ID. The command looks like this:

$ scancel your_job-id

To cancel multiple jobs, you can use a comma-separated list of job IDs:

$ scancel your_job-id1,your_job-id2,your_job-id3

For more information, visit the Slurm manual on scancel

Analyzing currently running jobs with sstat

The sstat command allows users to pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as such:

$ sstat --jobs=your_job-id

By default, sstat prints significantly more information than is usually needed. To remedy this, we can use the --format flag to choose what we want in our output. The --format flag takes a comma-separated list of variables that specify the output data:

$ sstat --jobs=your_job-id --format=var_1,var_2,...,var_N

Some of these variables are listed in the table below:

Variable     Description
avecpu       Average CPU time of all tasks in a job.
averss       Average resident set size of all tasks in a job.
avevmsize    Average virtual memory size of all tasks in a job.
jobid        The ID of the job.
maxrss       Maximum resident set size of all tasks in a job.
maxvmsize    Maximum virtual memory size of all tasks in a job.
ntasks       Number of tasks in a job.

For example, let’s print out a job’s ID, average CPU time, maximum RSS, and number of tasks. We can do this by typing out the command:

$ sstat --jobs=your_job-id --format=jobid,avecpu,maxrss,ntasks

A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the Slurm page on sstat.
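If you want to feed sstat output into a script, a delimited form is easier to parse than the aligned columns. The sketch below assumes sstat's --noheader and --parsable2 flags, which produce "|"-delimited lines without a header row. The helper name format_sstat_line is hypothetical, and the demo line is made-up sample input rather than output from a real job.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads lines of the form
#   jobid|avecpu|maxrss|ntasks
# on stdin and prints one labeled summary line per job step.
format_sstat_line() {
    while IFS='|' read -r jobid avecpu maxrss ntasks; do
        printf 'step %s: %s tasks, average CPU time %s, max RSS %s\n' \
            "$jobid" "$ntasks" "$avecpu" "$maxrss"
    done
}

# Demo on a made-up sample line, so the sketch runs without Slurm installed.
printf '12345.0|00:01:30|204800K|4\n' | format_sstat_line

# On a cluster, the input would come from sstat instead, for example:
#   sstat --jobs=your_job-id --noheader --parsable2 \
#       --format=jobid,avecpu,maxrss,ntasks | format_sstat_line
```

The order of the variables read by the loop has to match the order given to --format.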

Analyzing past jobs with sacct

The sacct command allows users to pull up status information about past jobs. This command is very similar to sstat, but is used on jobs that have previously run on the system instead of currently running jobs. We can use a job’s ID…

$ sacct --jobs=your_job-id

…or your Research Computing username…

$ sacct --user=your_rc-username

…to pull up accounting information on jobs run at an earlier time.

By default, sacct only pulls up jobs that were run on the current day. We can use the --starttime flag to tell the command to look further back than the current day:

$ sacct --jobs=your_job-id --starttime=YYYY-MM-DD

To see a non-abbreviated version of sacct output, use the --long flag:

$ sacct --jobs=your_job-id --starttime=YYYY-MM-DD --long

Formatting sacct output

Like sstat, the standard output of sacct may not provide the information we want. To remedy this, we can use the --format flag to choose what we want in our output. As with sstat, the --format flag takes a comma-separated list of variables that specify the output data:

$ sacct --user=your_rc-username --format=var_1,var_2,...,var_N

A chart of some variables is provided below:

Variable      Description
account       Account the job ran under.
avecpu        Average CPU time of all tasks in a job.
averss        Average resident set size of all tasks in a job.
cputime       Time used (elapsed time * CPU count) by a job or step.
elapsed       The job’s elapsed time, formatted as DD-HH:MM:SS.
exitcode      The exit code returned by the job script or salloc.
jobid         The ID of the job.
jobname       The name of the job.
maxdiskread   Maximum number of bytes read by all tasks in the job.
maxdiskwrite  Maximum number of bytes written by all tasks in the job.
maxrss        Maximum resident set size of all tasks in the job.
ncpus         Number of allocated CPUs.
nnodes        The number of nodes used in a job.
ntasks        Number of tasks in a job.
priority      Slurm priority.
qos           Quality of service.
reqcpus       Required number of CPUs.
reqmem        Required amount of memory for a job.
user          Username of the person who ran the job.

As an example, suppose you want to find information about jobs that were run on March 12, 2018. You want to show the job name, the number of nodes used in the job, the number of CPUs, the maxrss, and the elapsed time. Your command would look like this:

$ sacct --jobs=your_job-id --starttime=2018-03-12 --format=jobname,nnodes,ncpus,maxrss,elapsed

As another example, suppose you would like to pull up information on jobs that were run on February 21, 2018. You would like information on job ID, job name, QoS, number of nodes used, number of CPUs used, maximum RSS, CPU time, average CPU time, and elapsed time. Your command would look like this:

$ sacct --jobs=your_job-id --starttime=2018-02-21 --format=jobid,jobname,qos,nnodes,ncpus,maxrss,cputime,avecpu,elapsed

A full list of variables that specify data handled by sacct can be found with the --helpformat flag or by visiting the Slurm page on sacct.
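The elapsed field is convenient to read but awkward to compare or sum. As a minimal sketch, the helper below converts a DD-HH:MM:SS value to seconds, assuming the day prefix is omitted for jobs shorter than a day. The name elapsed_to_seconds is hypothetical and not part of Slurm, and the values in the demo are made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): converts an elapsed time of the
# form [DD-]HH:MM:SS to a number of seconds.
elapsed_to_seconds() {
    # Split on "-" and ":"; four fields means a day count is present.
    printf '%s\n' "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Demo on made-up values, so the sketch runs without Slurm installed.
elapsed_to_seconds 1-02:03:04   # 1 day, 2 hours, 3 minutes, 4 seconds
elapsed_to_seconds 00:10:05     # 10 minutes, 5 seconds
```

awk is used for the arithmetic because it reads fields such as 08 as decimal numbers, whereas shell arithmetic may treat a leading zero as octal.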

Controlling queued and running jobs using scontrol

The scontrol command provides users extended control of their jobs run through Slurm. This includes actions like suspending a job, holding a job from running, or pulling extensive status information on jobs.

To suspend a job that is currently running on the system, we can use scontrol with the suspend command. This stops a running job at its current step so that it can be resumed at a later time. We can suspend a job by typing the command:

$ scontrol suspend job_id

To resume a paused job, we use scontrol with the resume command:

$ scontrol resume job_id

Slurm also provides a utility to hold jobs that are queued in the system. Holding a job sets its priority to zero, effectively “holding” the job from being run. A job can only be held if it’s waiting on the system to be run. We use the hold command to place a job into a held state:

$ scontrol hold job_id

We can then release a held job using the release command:

$ scontrol release job_id

scontrol can also provide information on jobs using the show job command. The information provided by this command is quite extensive and detailed, so be sure to either clear your terminal window, grep certain information from the command, or redirect the output to a separate text file:

# Output to console
$ scontrol show job job_id

# Redirecting output to a text file
$ scontrol show job job_id > outputfile.txt

# Piping output to grep to find lines containing the word "Time"
$ scontrol show job job_id | grep Time
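When only one value is needed, grep still returns the whole matching line. The sketch below pulls out a single field, assuming the output consists of space-separated Key=Value pairs. The helper name job_field is hypothetical and not part of Slurm, and the sample text is made up for illustration. Values that themselves contain spaces would be cut short by this approach.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads `scontrol show job` text
# on stdin and prints the value of the field named in $1.
job_field() {
    # Put each Key=Value pair on its own line, then keep the first match.
    tr -s ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Made-up sample text, so the sketch runs without Slurm installed.
sample='   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:05:12 TimeLimit=01:00:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field TimeLimit
printf '%s\n' "$sample" | job_field JobState

# On a cluster, the input would come from scontrol instead, for example:
#   scontrol show job job_id | job_field TimeLimit
```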

For a full primer on grep and regular expressions, visit GNU’s page on Grep

For more information on scontrol, visit the Slurm page on scontrol
