

Introduction

This user's guide for the Altamira supercomputer provides the minimum information a new user of this system needs. It assumes that the user is familiar with the basics of scientific computing, in particular the basic commands of the Unix operating system, as well as with standard techniques for running applications on a supercomputer, such as MPI or OpenMP.

This guide includes basic technical documentation about the Altamira supercomputer, the software environment, and the available applications.

Please read it carefully, and if any doubt arises do not hesitate to contact our support group.


Requesting an account

Altamira users include researchers at the University of Cantabria, researchers granted execution time through the Spanish Supercomputing Network (RES), and other researchers. The assignment of an account and execution time requires a request form; contact us in case of doubt or for urgent requests.


System Overview

Altamira includes 158 main compute nodes, 4 additional GPU compute nodes and 2 login servers.

Each main compute node has two Intel Sandy Bridge E5-2670 processors (8 cores each, operating at 2.6 GHz with a 20 MB cache), 64 GB of RAM (i.e. 4 GB per core) and a 500 GB local disk.

Operating System (OS)

Main compute and login nodes run CentOS 7.4.

Network

The internal network in Altamira includes:

  • Infiniband Network (FDR): High-bandwidth network used for the communications of parallel applications and for data transfer.
  • Gigabit Network: Ethernet network used by the management services.

Storage

All the nodes are connected to a global storage system based on GPFS (General Parallel File System) providing a total storage of 1 PB.

File Systems

Each node has several areas of disk space for installing the system or programs and for storing files. These areas may have size or time limits; please read this whole section carefully to learn the usage policy of each filesystem. There are 2 different types of storage available inside a node:

  • GPFS Filesystems: GPFS is a distributed networked filesystem which can be accessed from all the nodes
  • Local Hard Drive: Every compute node has an internal hard drive

GPFS filesystem

The IBM General Parallel File System (GPFS) is a high-performance shared-disk file system that provides fast, reliable data access from all nodes of the cluster to a global filesystem. GPFS allows parallel applications simultaneous access to a set of files (even a single file) from any node that has the GPFS file system mounted, while providing a high level of control over all file system operations. These filesystems are recommended for most jobs, because GPFS provides high-performance I/O by "striping" blocks of data from individual files across multiple disks on multiple storage devices and reading/writing these blocks in parallel. In addition, GPFS can read or write large blocks of data in a single I/O operation, thereby minimizing overhead.

These are the GPFS filesystems available in Altamira from all nodes; a quick way to inspect them is shown after the list:

  • /gpfs/users/res → /home: users' home directories
  • /gpfs/res_projects: project data space
  • /gpfs/res_apps: installed applications
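
As a quick sanity check, you can inspect these areas from any node. A minimal sketch, assuming the mount points listed above (df reports the filesystem holding each path and its free space):

    # Show the GPFS filesystems backing each area and their available space
    df -h /gpfs/users/res /gpfs/res_projects /gpfs/res_apps

    # Your home directory is mapped onto the GPFS users area
    echo $HOME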

Batch system

The job scheduling system running on Altamira is SLURM. The version of SLURM currently installed on Altamira is 17.11.
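
You can confirm the installed version and see the available partitions from a login node using standard SLURM commands (the output depends on the local configuration):

    # Print the SLURM version
    sinfo --version

    # List partitions and node states
    sinfo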

Connecting to Altamira

Once you have a login and its associated password, you can get into the Altamira system by connecting to one of the 2 login nodes: altamira1.ifca.es for the first login node, or altamira2.ifca.es for the second.

You must use Secure Shell (ssh) tools to log in to Altamira or to transfer files to it. We do not accept incoming connections from protocols such as telnet, ftp, rlogin, rcp, or rsh. Once you are logged into Altamira you cannot make outgoing connections, for security reasons (contact us in exceptional cases).
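
For example, to open a session and to copy files in and out (here username is a placeholder for your actual account name):

    # Log in to the first login node
    ssh username@altamira1.ifca.es

    # Copy a local file to your home directory on Altamira
    scp input.dat username@altamira1.ifca.es:~/

    # Copy results back to your local machine
    scp username@altamira1.ifca.es:~/results.out .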

Running Jobs

As described above, SLURM is the utility used at Altamira for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution at Altamira.

NOTE: In order to keep the load on the login nodes under control, a 10-minute CPU-time limit is set for processes running interactively on these nodes. Any execution exceeding this limit must be carried out through the queue system.
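
If you need an interactive shell for longer than this limit, SLURM can allocate one on a compute node. A minimal sketch (the resource values are illustrative, and the availability of interactive allocations depends on the site configuration):

    # Request a 30-minute interactive shell on one core through the queue system
    srun --ntasks=1 --time=00:30:00 --pty /bin/bash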

Manage Jobs

A job is the execution unit for SLURM. A job is defined by a script containing a set of directives describing the job, and the commands to execute.
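
A minimal sketch of such a script is shown below. The #SBATCH lines are SLURM directives; the job name, resource values and application name (mpi_app) are placeholders to adapt to your own case:

    #!/bin/bash
    #SBATCH --job-name=example        # name shown by squeue
    #SBATCH --ntasks=16               # number of (MPI) tasks
    #SBATCH --time=01:00:00           # wall-clock limit (hh:mm:ss)
    #SBATCH --output=example_%j.out   # stdout file (%j expands to the job id)

    # Launch the application; srun starts one process per task
    srun ./mpi_app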

These are the basic commands to submit and manage jobs; a short usage example follows the list:

  • sbatch <job script>: submits a job script to the queue system (see below for job script directives).
  • squeue [-u user]: shows the jobs submitted by all users, or only those of a given user if you specify the -u option.
  • scancel <job id>: removes a job from the queue system, cancelling its execution if it was already running.
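
A typical session might look like this (the job id 123456 is illustrative; SLURM prints the real id when the job is submitted):

    # Submit the script; SLURM replies with the assigned job id
    sbatch job.sh

    # Check the state of your own jobs only
    squeue -u $USER

    # Cancel the job if needed
    scancel 123456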

Getting Help

IFCA provides consulting assistance to users. User support consultants are available during normal business hours, Monday to Friday, 9:00 to 17:00 (CEST).

User questions and support are handled at:

If you need assistance, please supply us with the nature of the problem, the date and time that the problem occurred, and the location of any other relevant information, such as output or log files.




