Getting started with SLURM at SVI
Requirements
Access must be granted by IT. Ask your supervisor to log an IT support request.
You must first have followed the instructions at https://svimrit.atlassian.net/wiki/spaces/SI/pages/344883201 for RStudio Server Pro.
Access
Open OnDemand access: https://ood.svi.edu.au
SSH access: slurm-login.svi.edu.au
SSH access to the SLURM login node is only available from the internal network. To connect remotely, follow https://svimrit.atlassian.net/wiki/spaces/SI/pages/728170527
Getting started with job submission scripts
A submission script is a shell script that lists the processing tasks to be carried out, including the commands, runtime libraries, and input and/or output files they need. If you know the resources that your tasks will consume, you can also add some of the common #SBATCH directives to the script, e.g.:
Short Format   Long Format             Default   Description
------------   ---------------------   -------   -----------
-J jobname     --job-name=job_name     N/A       Up to 15 printable, non-whitespace characters
-p partition   --partition=partition   general   Always specify your partition (i.e. general, gpu, all)
-n count       --ntasks                One       Controls the number of tasks to be created for the job
-c count       --cpus-per-task         One       Controls the number of CPUs allocated per task
N/A            --mem-per-cpu           N/A       Memory size per CPU
N/A            --mem=size              1000MB    Total memory size
N/A            --gres=gpu:1            N/A       Generic consumable resources e.g. GPU
-t HH:MM:SS    --time=HH:MM:SS         N/A       Specify the maximum wallclock time for your job
These are just some of the options. For a complete set see https://slurm.schedmd.com/sbatch.html or download this cheatsheet https://slurm.schedmd.com/pdfs/summary.pdf
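As a rough sketch of how several of these directives combine, the job script below requests one task with four CPUs, 8 GB of total memory, one GPU and two hours of wallclock time on the gpu partition. The job name and all resource values are illustrative assumptions, not SVI recommendations.

```shell
#!/bin/bash
# Illustrative values only; adjust them to what your job actually needs.
#SBATCH --job-name=gpu_test
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

# SLURM sets SLURM_JOB_ID and SLURM_CPUS_PER_TASK inside a running job.
# The fallbacks only apply when the script is run by hand outside the scheduler.
job_summary="job=${SLURM_JOB_ID:-none} cpus=${SLURM_CPUS_PER_TASK:-1} host=${HOSTNAME:-unknown}"
echo "$job_summary"
```

The #SBATCH lines are ordinary comments to bash, so the script can also be run directly for a quick syntax check before it is submitted.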
Running Simple Batch Jobs
Submitting a job to SLURM is performed by running the sbatch command and specifying a job script.
sbatch job.script
You can supply options (e.g. --ntasks=xx) to the sbatch command. If an option is already defined in the job.script file, it will be overridden by the command-line argument.
sbatch [options] job.script
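If you want to keep the job ID for later (for example to cancel the job), one possible approach is a small wrapper around sbatch. The function name submit_job is a hypothetical helper, not part of SLURM; the --parsable option is a standard sbatch flag that prints only the job ID, optionally followed by a semicolon and the cluster name.

```shell
#!/bin/bash
# submit_job is a hypothetical helper, not part of SLURM.
# Usage: submit_job [sbatch options] job.script
submit_job() {
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster"
    # instead of "Submitted batch job <id>".
    out=$(sbatch --parsable "$@") || return 1
    # Strip the optional ";cluster" suffix and print only the job ID.
    echo "${out%%;*}"
}
```

For example, job_id=$(submit_job --ntasks=2 --time=00:30:00 job.script) submits the script with the command-line values taking precedence over any matching #SBATCH lines, and stores the job ID in job_id.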
An example Slurm job script
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --partition=general
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=100
#SBATCH --cpus-per-task=1
touch ~/helloworld.text
This script describes the job: it is a serial job with only one process (--ntasks=1). It needs only one CPU core to run the command touch ~/helloworld.text and has been allocated 100 MB of RAM per CPU, which in this case is 100 MB in total. You can check that this has worked by running ls ~. You should see a file named helloworld.text (feel free to delete it now).
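The memory arithmetic for --mem-per-cpu can be sketched as follows, assuming the total is simply tasks times CPUs per task times memory per CPU (the scheduler may round allocations up on some node configurations). The helper total_mem_mb is hypothetical and exists only for this illustration.

```shell
#!/bin/bash
# total_mem_mb is a hypothetical helper for this illustration.
# Usage: total_mem_mb NTASKS CPUS_PER_TASK MEM_PER_CPU_MB
total_mem_mb() {
    echo $(( $1 * $2 * $3 ))
}

# Values from the example job script: 1 task x 1 CPU x 100 MB.
total_mb=$(total_mem_mb 1 1 100)
echo "Example script requests ${total_mb} MB in total"

# A hypothetical larger job: 4 tasks x 2 CPUs each x 100 MB per CPU.
echo "Larger job requests $(total_mem_mb 4 2 100) MB in total"
```

The first line reports 100 MB and the second 800 MB, which shows how quickly a per-CPU request grows once a job uses more tasks or cores.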
Cancelling jobs
To cancel one job
scancel [JOBID]
To cancel all of your jobs
scancel -u [USERID]
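To find the job ID in the first place, the standard squeue command lists queued and running jobs. The two helpers below are a minimal sketch; their names (my_jobs, cancel_pending) are hypothetical, and they assume your cluster username matches the USER environment variable.

```shell
#!/bin/bash
# Both function names are hypothetical helpers, not part of SLURM.

# List your own jobs; the JOBID column is the value scancel expects.
my_jobs() {
    squeue -u "$USER"
}

# Cancel only your jobs that are still waiting in the queue,
# leaving jobs that are already running untouched.
cancel_pending() {
    scancel -u "$USER" --state=PENDING
}
```

Restricting the cancellation to pending jobs is a safer default than cancelling everything, since it cannot interrupt work that has already started.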
Interactive jobs
Starting an interactive job
You can run an interactive job like this:
$ srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
Here we ask for a single core on one interactive node for one hour with the default amount of memory. The command prompt will appear as soon as the job starts.
This is how it looks once the interactive job starts:
srun: job 12345 queued and waiting for resources
srun: job 12345 has been allocated resources
Exit the bash shell to end the job. If you exceed the time or memory limits the job will also abort.
Interactive jobs have the same policies as normal batch jobs; there are no extra restrictions. Be aware that you might be sharing the node with other users, so play nice.
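If the defaults are not enough, the same srun call can carry the resource options used in batch scripts. The wrapper below is a sketch under assumptions: interactive_job is a hypothetical name, the general partition is taken from the directives table, and the default values are illustrative.

```shell
#!/bin/bash
# interactive_job is a hypothetical wrapper, not part of SLURM.
# Usage: interactive_job [cpus] [memory] [time]
interactive_job() {
    local cpus=${1:-1} mem=${2:-1G} walltime=${3:-01:00:00}
    # One task with the requested CPUs, total memory and wallclock limit,
    # attached to an interactive bash shell.
    srun --partition=general --ntasks=1 \
         --cpus-per-task="$cpus" --mem="$mem" --time="$walltime" \
         --pty bash -i
}
```

Calling interactive_job 2 4G 00:30:00 would request two CPUs and 4 GB of memory for thirty minutes; calling it without arguments falls back to the illustrative defaults.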
Keeping interactive jobs alive
Interactive jobs die when you disconnect from the login node, whether by choice or because of connection problems. To keep a job alive you can use a terminal multiplexer like tmux.
tmux allows you to run processes as usual in your standard bash shell. Start tmux on the login node before you request an interactive SLURM session with srun, and then do all your work inside it. If you get disconnected, simply reconnect to the login node and attach to the tmux session again by typing:
tmux attach
or in case you have multiple sessions running:
tmux list-sessions
tmux attach -t SESSION_NUMBER
As long as the tmux session is not closed or terminated (e.g. by a server restart) your session should continue.
To detach from a tmux session without closing it, press CTRL-B (that is, the Ctrl key and “b” simultaneously; this is the standard tmux prefix) and then “d” (without the quotation marks). To close a session, end the bash session with CTRL-D or by typing exit. You can get a list of all tmux commands with CTRL-B followed by ? (question mark). See also this page for a short tutorial on tmux. Otherwise, working inside a tmux session is almost the same as working in a normal bash session.
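The attach-or-create step can be folded into one command. The helper below is a minimal sketch: work_session is a hypothetical name and "slurm" an arbitrary default session name, while the -A option is a standard tmux flag that attaches to an existing session of that name instead of failing.

```shell
#!/bin/bash
# work_session is a hypothetical helper; "slurm" is an arbitrary session name.
# Run it on the login node, then start srun inside the tmux session.
work_session() {
    local name=${1:-slurm}
    # -A attaches to the session if it already exists, otherwise creates it.
    tmux new-session -A -s "$name"
}
```

Because the same call works both for the first start and for reconnecting after a dropped connection, there is no need to remember whether a session is already running.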
Monitoring
You can see an overview of cluster usage at https://grafana.svi.edu.au/. Select Sign in with Microsoft. The first time you sign in you won’t have access to anything; please log a ticket requesting to be added to a group after your first login.
Other Documentation
There is a lot of documentation online, so remember Google is your friend. It may be worth getting your group to collaborate on an internal document of the commands that your team actually uses. This will be great for onboarding new team members.
https://slurm.schedmd.com/quickstart.html - This is the official documentation. This isn’t great for beginners but has everything.
https://supercomputing.swin.edu.au/docs/2-ozstar/oz-slurm-basics.html - This is the OzSTAR user guide at Swinburne. Not everything will apply at SVI, but it is a useful resource for getting started with SLURM.
https://docs.massive.org.au/M3/slurm/slurm-overview.html - Also has some good examples. As above, there will be a lot of site-specific info that won’t apply to SVI.
Notes
Try not to allocate more resources than you will actually use so that there are more resources available for everyone else.