
Introduction

This user's guide for the CMS supercomputer is intended to provide the minimum amount of information needed by a new user of this system. As such, it assumes that the user is familiar with the basic notions of scientific computing, in particular the basic commands of the Unix operating system, and also with the basic techniques for running applications on a supercomputer.

The information in this guide includes basic technical documentation about the CMS supercomputer, the software environment, and the available applications.

Please read it carefully, and if any doubt arises do not hesitate to contact our support email (see Getting Help below).


Connecting to CMS

Once you have a username and its associated password you can get into the CMS system by connecting to one of the two login nodes: altamira1.ifca.es for the first login node or altamira2.ifca.es for the second one (see Login Node below for more information). The password provided is temporary; you must change this initial password after connecting to IPA. Also use a strong password (do not use a word or phrase from a dictionary, and do not use a word that can be obviously tied to you).

You must use Secure Shell (ssh) tools to log in to or transfer files to CMS. We do not accept incoming connections using protocols such as telnet, ftp, rlogin, rcp, or rsh. The connection to CMS must be made in the following way:

$ ssh username@altamiraX.ifca.es (where X is 1 or 2)
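
For file transfers you can use scp or sftp in the same way; for example, to copy a local file or directory into your home directory on Altamira (the file and directory names below are only placeholders):

$ scp mydata.dat username@altamira1.ifca.es:~/
$ scp -r myproject/ username@altamira1.ifca.es:~/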

Once you are logged in to CMS you cannot make outgoing connections, for security reasons (contact us if you need one in an exceptional case).

If you cannot get access to the system after following this procedure, please contact us (see Getting Help below to know how to reach us).


Login Node

Once inside the machine you will be presented with a UNIX shell prompt, and you will normally be in your home ($HOME) directory. If you are new to UNIX, you will have to learn the basics before you can do anything useful.

The machine you log in to is the login node of Altamira (login1). This machine acts as a front end and is typically used for editing, compiling, preparing and submitting batch executions, and as a gateway for copying data in and out of Altamira.

The execution of CPU-bound programs is not permitted on this node; if a compilation needs more CPU time than allowed, it must be run through the batch queue system. It is not possible to connect directly to the compute nodes from the login node; all resource allocation is done by the batch queue system.

Compute Node

Altamira includes 158 main compute nodes, where all executions must be done. For security reasons, it is not possible to connect directly to the worker nodes; all executions must be allocated on the nodes through the batch queue system (see below how to submit jobs).

Running Jobs

SLURM is the utility used at Altamira for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution at Altamira (see the official SLURM documentation for more information on how to create a job).


NOTE: In order to keep the load on the login nodes reasonable, a 10-minute CPU time limit is set for processes running interactively on these nodes. Any execution taking longer than this limit should be carried out through the queue system.

Manage Jobs

A job is the execution unit for SLURM. A job is defined by a script containing a set of directives describing the job and the commands to execute.

These are the basic directives to submit jobs:

  • sbatch <job script>: submits a job script to the queue system (see below for job script directives).
  • squeue [-u user]: shows all the jobs submitted by all users, or only those of the given user if you specify the -u option.
  • scancel <job id>: removes a job from the queue system, cancelling its execution if it was already running (see the example session below).
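
A typical session could look like this (the script name and the job ID shown are only illustrative):

$ sbatch myjob.sh
Submitted batch job 123456
$ squeue -u username
$ scancel 123456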

Alternatively, you can also launch jobs interactively and run your session on one of the compute nodes you have requested (useful, for example, for graphical applications):

srun -N 2 --ntasks-per-node=8 --pty bash

This requests 2 nodes (-N 2) with a maximum of 8 tasks per node (--ntasks-per-node=8) and runs a login shell (bash) on the compute nodes. The --pty option is important: it gives a login prompt and a session that looks very much like a normal interactive session, but running on one of the compute nodes.

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job; you can configure them to fit your needs. These directives appear as comments in the job script, usually at the top just after the shebang line, with the following syntax:

#SBATCH --option=value

Note that these directives:

  • start with the #SBATCH prefix
  • use lowercase option names
  • have no spaces around the equals sign
  • don't expand shell variables (they are just shell comments)

This table describes the common directives you can define in your job, together with their default values (see the example below):

--job-name=...
    The name of the job as it appears in the batch queue. Default: the script name.

--partition=...
    The name of the queue in SLURM. Default: compute.

--output=...
    The name of the file to collect the standard output of the job. The %j part will be substituted by the job ID. Default: file-%j.out.

--error=...
    The name of the file to collect the standard error of the job. The %j part will be substituted by the job ID. Default: file-%j.err.

--chdir=...
    The working directory of your job (i.e. where the job will run). Default: the current working directory at the time the job was submitted (the submitting directory).

--qos=...
    Quality of Service (or queue) where the job is allocated. A queue is assigned to the user by default, so this directive is not mandatory. Default: main.

--time=...
    The limit of wall clock time. This is a mandatory field and you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Note that your job will be killed once this period elapses. The format can be: m, m:s, h:m:s, d-h, d-h:m or d-h:m:s. Default: the QOS default time limit.

--ntasks=...
    The number of processes to allocate as parallel tasks. Default: 1.

--cpus-per-task=...
    The number of processors for each task. Without this option, the controller will just try to allocate one processor per task. The number of CPUs per task must be between 1 and 16, since each node has 16 cores (one for each thread). Default: 1.

--ntasks-per-node=...
    The number of tasks allocated on each node. When an application uses more than 3.8 GB of memory per process, it is not possible to fit 16 processes in the same node with its 64 GB of memory. It can be combined with --cpus-per-task to allocate nodes exclusively (so that the product of tasks per node and CPUs per task fills the 16 cores of a node). The number of tasks per node must be between 1 and 16. Default: 1.

--mem-per-cpu=...
    The minimum memory required per allocated CPU. Default units are megabytes unless the SchedulerParameters configuration parameter includes the "default_gbytes" option for gigabytes. Default: DefMemPerCPU.
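
As an illustration, a job script header combining several of these directives could look like the following (the job name, file names and resource values are placeholders to adapt to your own case):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=compute
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000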

Job environment variables

There are also a few SLURM environment variables you can use in your scripts:

  • SLURM_JOBID: the job ID of the executing job.
  • SLURM_NPROCS: the total number of processes in the job (same as -n, --ntasks).
  • SLURM_NNODES: the actual number of nodes assigned to run your job.
  • SLURM_PROCID: the MPI rank (or relative process ID) of the current process; the range is from 0 to SLURM_NPROCS-1.
  • SLURM_NODEID: the relative node ID of the current job; the range is from 0 to SLURM_NNODES-1.
  • SLURM_LOCALID: the node-local task ID of the process within a job.
  • SLURM_NODELIST: the list of nodes on which the job is actually running.
  • SLURM_SUBMIT_DIR: the directory from which sbatch was invoked.
  • SLURM_MEM_PER_CPU: the memory available per CPU used.
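
For example, a job script can use these variables to record where and how it ran. A minimal sketch (the job name and output file name are placeholders):

#!/bin/bash
#SBATCH --job-name=env-demo
#SBATCH --output=env-demo-%j.out
#SBATCH --ntasks=2
#SBATCH --time=00:05:00
# Print some of the information SLURM provides about this job
echo "Job ID:          $SLURM_JOBID"
echo "Number of tasks: $SLURM_NPROCS"
echo "Nodes assigned:  $SLURM_NODELIST"
echo "Submitted from:  $SLURM_SUBMIT_DIR"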


Job examples

Example for a sequential job:

#!/bin/bash
#
#SBATCH --job-name=hello
#SBATCH --output=hello.out
#SBATCH --ntasks=1
#SBATCH --time=10:00
# From here the job starts
hostname   # print the name of the compute node running the job
sleep 60

Example for a parallel job:

#!/bin/bash
#
#SBATCH --job-name=hello
#SBATCH --output=hello.out
#SBATCH --ntasks=4 # The job runs on 4 cores
#SBATCH --time=10:00
# From here the job starts
srun parallel.sh
srun sleep 60
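
Here parallel.sh stands for your own script or program; srun launches one copy of it per task. A minimal sketch of such a script (the name and contents are only illustrative):

#!/bin/bash
# parallel.sh - executed once per task by srun
echo "Task $SLURM_PROCID of $SLURM_NPROCS running on $(hostname)"

Remember to make the script executable (chmod +x parallel.sh) before submitting the job.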

Example of GPU use:

Three GPUs, each with 12 CPUs (which limits the number of tasks that can run on a GPU), are now available in Altamira. You can access them by setting the --partition directive in the script, which allocates the job on one of the GPUs.

#!/bin/bash
#
#SBATCH --job-name=hello
#SBATCH --partition=gpus
#SBATCH --output=hello.out
#SBATCH --ntasks=12 # The job runs on 12 cores
#SBATCH --time=10:00
# From here the job starts
module load CUDA
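# The GPU application would be launched here; the program name below
# is only a placeholder for your own CUDA binary.
srun ./my_gpu_app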


Software

Modules Environment

The Environment Modules package provides for the dynamic modification of a user's environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application or a compilation. Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. All popular shells are supported, including bash, ksh, zsh, sh, csh, tcsh, as well as some scripting languages such as perl.

The most important commands of the module tool are: list, avail, load, unload, switch and purge.

  • module list shows all the modules you have loaded

  • module avail shows all the modules that the user is able to load

  • module load lets the user load the necessary environment variables for the selected modulefile (PATH, MANPATH, LD_LIBRARY_PATH, etc.)

  • module unload removes all environment changes made by the module load command

  • module switch acts as the module unload and module load commands at the same time
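
A short example session (the module name GCC is only illustrative; use module avail to see which modules are actually installed):

$ module avail
$ module load GCC
$ module list
$ module unload GCC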

Job submitting with Modules

The required application modules must be loaded inside the job script, so that they are available on the worker nodes when the job needs them.
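
For example, a job script that loads a module before running the application could look like this (the module and program names are only placeholders):

#!/bin/bash
#SBATCH --job-name=modules-demo
#SBATCH --output=modules-demo-%j.out
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# Load the software environment needed by the application on the worker node
module load GCC
# Run the application that requires the loaded module
./my_program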

Getting Help

IFCA provides consulting assistance to users. User support consultants are available during normal business hours, Monday to Friday, from 09:00 to 17:00 (CEST).

User questions and support are handled at:

If you need assistance, please supply us with the nature of the problem, the date and time that the problem occurred, and the location of any other relevant information, such as output or log files.


