Getting started with SLURM at SVI

Requirements

Access

SSH access to the SLURM login node is only available from inside the SVI network. To connect remotely, follow the instructions at https://svimrit.atlassian.net/wiki/spaces/SI/pages/728170527

Getting started with job submission scripts

A submission script is a shell script that lists the processing tasks to be carried out: the commands to run, the runtime libraries they need, and their input and/or output files. If you know what resources your tasks will consume, you can also add some of the common SBATCH directives to the script, e.g.:

Short Format    Long Format              Default    Description
------------    -----------              -------    -----------
-J jobname      --job-name=job_name      N/A        Up to 15 printable, non-whitespace characters
-p partition    --partition=partition    general    Always specify your partition (i.e. general, gpu, all)
-n count        --ntasks                 One        Controls the number of tasks to be created for the job
-c count        --cpus-per-task          One        Controls the number of CPUs allocated per task
N/A             --mem-per-cpu            N/A        Memory size per CPU
N/A             --mem=size               1000MB     Total memory size
N/A             --gres=gpu:1             N/A        Generic consumable resources e.g. GPU
-t HH:MM:SS     --time=HH:MM:SS          N/A        Specify the maximum wallclock time for your job

These are just some of the options. For a complete set see https://slurm.schedmd.com/sbatch.html or download this cheatsheet https://slurm.schedmd.com/pdfs/summary.pdf

Running Simple Batch Jobs

To submit a job to SLURM, run the sbatch command and give it a job script.

sbatch job.script

You can supply options (e.g. --ntasks=xx) to the sbatch command. If an option is also defined in the job.script file, the command-line argument overrides it.

sbatch [options] job.script
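As a sketch of a command-line override, assuming a job.script already exists in the current directory. The task count and time limit are illustrative values only, not SVI recommendations, and the guard simply keeps the snippet harmless on a machine without SLURM.

```shell
# Illustrative values only.
NTASKS=4
WALLTIME=00:30:00

# Submit only where sbatch is installed and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job.script ]; then
    # These flags take precedence over matching #SBATCH lines in job.script.
    sbatch --ntasks="$NTASKS" --time="$WALLTIME" job.script
fi
```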

An example Slurm job script

The example job is a serial job with only one process (--ntasks=1). It needs only one CPU core to run the command touch ~/helloworld.text and is allocated 100M of RAM per CPU, so in this case 100M in total. Once the job has run, you can check that it worked by running ls ~. You should see a file named helloworld.text (feel free to delete it now).
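A minimal sketch of a job script matching this description. The job name, the one-minute time limit, and the file name job.script are assumptions for illustration; the partition is the default (general) from the directive list. The submission step is guarded so the snippet does nothing harmful on a machine without SLURM.

```shell
# Write the example job script to the current directory.
cat > job.script <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=helloworld
# Default partition from the directive list.
#SBATCH --partition=general
# Serial job: one process on one CPU core.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# 100M per CPU, so 100M in total for this job.
#SBATCH --mem-per-cpu=100M
# Assumed one-minute wallclock limit.
#SBATCH --time=00:01:00

touch ~/helloworld.text
EOF

# Submit it only where the SLURM client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.script
fi
```

On the cluster, sbatch prints the ID of the submitted job, which is the ID to use if the job needs to be cancelled.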

Cancelling jobs

To cancel one job

To cancel all of your jobs
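Both cases use the scancel command. A minimal sketch, where 12345 is a placeholder job ID (use the ID printed by sbatch when the job was submitted); the guard only keeps the snippet harmless on a machine without SLURM.

```shell
# Placeholder job ID; replace it with the ID of the job to cancel.
JOB_ID=12345

if command -v scancel >/dev/null 2>&1; then
    # Cancel one job by its ID.
    scancel "$JOB_ID"

    # Cancel every job owned by the current user.
    scancel --user="$USER"
fi
```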

Interactive jobs

Starting an interactive job

You can run an interactive job like this:

Here we ask for a single core on one interactive node for one hour with the default amount of memory. The command prompt will appear as soon as the job starts.
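As a sketch of the request described above (one task on one node for one hour, with no memory option so the default applies). Whether SVI also requires a partition or account flag is not stated here, so none is shown; the guard only keeps the snippet harmless where srun or a terminal is missing.

```shell
# One-hour wallclock limit, as in the description above.
WALLTIME=01:00:00

# Start the session only where srun exists and a terminal is attached.
if command -v srun >/dev/null 2>&1 && [ -t 0 ]; then
    # --pty connects your terminal to the bash shell running in the job.
    srun --nodes=1 --ntasks-per-node=1 --time="$WALLTIME" --pty bash -i
fi
```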

This is how it looks once the interactive job starts:

Exit the bash shell to end the job. If you exceed the time or memory limits the job will also abort.

Interactive jobs follow the same policies as normal batch jobs; there are no extra restrictions. Be aware that you might be sharing the node with other users, so play nice.

Keeping interactive jobs alive

Interactive jobs die when you disconnect from the login node, whether by choice or because of a connection problem. To keep a job alive you can use a terminal multiplexer like tmux.

tmux lets you run processes as usual in your standard bash shell.

You start tmux on the login node before you get an interactive SLURM session with srun, and then do all your work inside it. After a disconnect you simply reconnect to the login node and attach to the tmux session again by typing:

or in case you have multiple sessions running:
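A sketch of both cases, assuming tmux is installed on the login node. The session name slurm-work is hypothetical; tmux list-sessions shows the real names. With only one session running, tmux attach on its own is enough.

```shell
# Hypothetical session name; take the real one from `tmux list-sessions`.
SESSION=slurm-work

# Run only where tmux is installed and a terminal is attached.
if command -v tmux >/dev/null 2>&1 && [ -t 0 ]; then
    # Show the sessions that are still running.
    tmux list-sessions

    # Reattach to one specific session by name.
    # (With a single session, plain `tmux attach` does the same job.)
    tmux attach -t "$SESSION"
fi
```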

As long as the tmux session is not closed or terminated (e.g. by a server restart) your session should continue.

To detach from a tmux session without closing it, press CTRL-B (the Ctrl key and “b” together, which is the standard tmux prefix) and then “d” (without the quotation marks). To close a session, end its bash session with CTRL-D or by typing exit. You can get a list of all tmux commands by pressing CTRL-B and then ? (question mark). See also this page for a short tutorial of tmux. Otherwise, working inside a tmux session is almost the same as working in a normal bash session.

Monitoring

  • You can see an overview of cluster usage at https://grafana.svi.edu.au/. Select Sign in with Microsoft. The first time you sign in you won’t have access to anything; after that first login, please log a ticket requesting to be added to a group.

Other Documentation

There is a lot of documentation online, so remember that Google is your friend. It may be worth getting your group to collaborate on an internal document of the commands you actually use. This is great for onboarding new team members and keeps the focus on what your team really needs.

  • https://slurm.schedmd.com/quickstart.html - This is the official documentation. This isn’t great for beginners but has everything.

  • The OzSTAR user guide at Swinburne. Not everything will apply at SVI, but it is a useful resource for getting started with SLURM.

  • - Also some good examples. As above, there will be a lot of site-specific info that won’t apply to SVI.

Notes

Try not to allocate more resources than you will actually use so that there are more resources available for everyone else.