Using Llaima

Accessing the Cluster

The cluster can be accessed via ssh at llaima.uvic.ca. To have an account created on Llaima, send an email request to sysadmin@uvic.ca. The Llaima account uses your UVic NetLink-ID credentials, so a valid NetLink-ID is required to gain access to Llaima.
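Once your account has been created, log in with your NetLink-ID. A minimal example, where netlinkid is a placeholder for your own NetLink-ID:

$ ssh netlinkid@llaima.uvic.ca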
Compiling Programs

 

Currently, software can be compiled on the head node. As Llaima's utilization increases, usage statistics and system response monitoring may indicate that dedicated development hosts need to be added in the future.

Currently the following packages are available on Llaima:

  1. lam-7.0.6-5 (primary MPI libraries), PATH=/usr/include
  2. gsl-1.5-2.rhel, PATH=/usr/include/gsl
  3. gcc-3.4.4-2
  4. gcc-c++-3.4.4-2
  5. gcc4-gfortran-4.0.1-4.EL4.2
  6. gcc-g77-3.4.4-2
  7. Intel Fortran Compiler, /opt/bin/ifort32 and /opt/bin/ifort64
  8. g95-x86_64-64, PATH=/usr/bin/f90
  9. hdf5-1.6.5, PATH=/opt/hdf5
  10. fftw-3.0.1, PATH=/opt/fftw
  11. IDL 6.2, PATH=/opt/rsi
  12. IRAF, PATH=/opt/iraf
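For libraries installed outside the default compiler search paths, pass the include and library directories to the compiler explicitly. The following is a minimal sketch, assuming FFTW is installed under /opt/fftw with conventional include and lib subdirectories (my_fft_prog.c is a hypothetical source file; verify the exact layout on Llaima):

$ mpicc -I/opt/fftw/include -L/opt/fftw/lib -o my_fft_prog my_fft_prog.c -lfftw3 -lm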

Example Makefile

#
# "mpicc" adds the directories for the MPI include and lib files, so
# -I and -L for the MPI libraries are not necessary.
#

CC = mpicc

#
# Modify TOPDIR if you use your own include files and library files.
# (The values below are examples; adjust them to your own layout.)
#
TOPDIR  = $(HOME)
CFLAGS  = -I$(TOPDIR)/include
LDFLAGS = -L$(TOPDIR)/lib

# name of the binary
PROGRAM = lammpi

# source file
SRCS = lammpi.c

# object files
OBJS = $(SRCS:.c=.o)

#
# Targets (note: recipe lines must begin with a tab character)
#

default: all

all: $(PROGRAM)

$(PROGRAM): $(OBJS)
	$(CC) $(OBJS) -o $(PROGRAM) $(LDFLAGS)

.c.o:
	$(CC) $(CFLAGS) -c $<

clean:
	/bin/rm -f $(OBJS) $(PROGRAM)
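The Makefile above expects a source file named lammpi.c. A minimal sketch of such a program (an assumption for illustration; not the mpitest binary referenced later in this document) that simply reports each process's rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI environment  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process       */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut down MPI cleanly      */
    return 0;
}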

Submitting Jobs

Parallel Jobs

All jobs are submitted to the cluster from llaima.uvic.ca. Currently a single queue, called LLAIMA, exists for submitting jobs to the compute nodes.

To submit jobs to Llaima, use the qsub command. An example of submitting a job:

$ qsub lam_test.pbs

The contents of the lam_test.pbs command file are shown below. The PBS options, listed at the top of the command file, are declared with the #PBS directive. This command file will run an MPI job on 16 compute nodes using 2 CPUs per node, for a total of 32 processors.

#!/bin/bash
#PBS -l nodes=16:ppn=2
#PBS -N lam_mpi_test
#PBS -q LLAIMA

#mpirun it
/opt/bin/pbslam -v -W /home1l/ekolb/src /opt/bin/mpitest

pbslam

Please use this script to submit MPI jobs, as it synchronizes the PBS and MPI environments.

Synopsis: exec pbslam [-dfghOtTv] [-c <#>] [-D | -W dir] <prog> [<args>]

Description: Run a LAM MPI application under PBS. For proper cleanup, pbslam must be exec'ed from the top-level shell.

Options:
-c <#>  Run <#> copies of the program on the allocated nodes.
-d      Use indirect communication via LAM daemons.
-D      Use the location of <prog> as the working directory.
-f      Do not configure stdio descriptors.
-h      Print this help message.
-O      System is heterogeneous; enable data conversion.
-t      Enable tracing with generation initially off.
-T      Enable tracing with generation initially on.
-v      Verbose mode.
-W dir  Use "dir" as the working directory.

<prog>  Executable MPI application.
<args>  Arguments for the application program.

Defaults: Configure stdio; heartbeat off; don't check processor load;
data conversion off; GER off; direct communication (daemons off);
tracing disabled; one process on each PBS virtual processor in VP order.
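As a hypothetical example, the following job-script line would run 32 copies of a program from the directory the job was submitted from, using the standard PBS environment variable PBS_O_WORKDIR (my_mpi_app and input.dat are placeholder names):

/opt/bin/pbslam -v -c 32 -W $PBS_O_WORKDIR ./my_mpi_app input.dat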

Serial Jobs

Serial jobs can also be submitted to Llaima. Below is an example PBS file, titled serial.pbs, that shows how to submit a serial job to Llaima. This job will run a copy of the UNIX command sleep for ten seconds on a single compute node.


#!/bin/bash
#PBS -l nodes=1:ppn=1,mem=200mb
#PBS -N test_job
#PBS -q LLAIMA

#run it
/bin/sleep 10
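To run your own serial program instead of sleep, the body of the script would typically change to something like the following sketch, where my_serial_app is a placeholder for your own executable:

#run it
cd $PBS_O_WORKDIR
./my_serial_app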

Interactive Jobs

Interactive jobs can be submitted to Llaima. Below is an example submission that will allocate one node for one minute. The walltime can be adjusted as required.

$ qsub -I -l nodes=1:ppn=1,mem=200mb,walltime=00:01:00 -N interactive_job -q LLAIMA
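For example, a request for two processors on one node for one hour would look like the following sketch (adjust the resource list to your own requirements):

$ qsub -I -l nodes=1:ppn=2,mem=400mb,walltime=01:00:00 -N interactive_job -q LLAIMA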

Monitoring Jobs

  • Using qstat
$ qstat -u ekolb

llaima1.uvic.ca:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
601.llaima1.uvic.ca  ekolb    LLAIMA   MPI_test       --    35  -- 4096mb 12:00 R    --
$ qstat -q

server: llaima1

Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
TEST               --      --       --     --    0   0 -- E R
dque               --      --       --     --    0   0 -- E R
LLAIMA           4096mb    --       --     --    1   0 -- E R
                                                --- ---
                                                  1   0
  • Using showq
$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME

601 ekolb Running 70 12:00:00 Wed Apr 12 08:34:16

1 Active Job 70 of 70 Processors Active (100.00%)
35 of 35 Nodes Active (100.00%)

IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


Total Jobs: 1 Active Jobs: 1 Idle Jobs: 0 Blocked Jobs: 0
  • Job output is returned via PBS results files. They will exist in the directory from which you submitted your job, named job_name.o?? for STDOUT and job_name.e?? for STDERR.
$ ls -ltr lam_mpi_test.*
-rw------- 1 ekolb ekolb 707 Jan 20 13:38 lam_mpi_test.o601
-rw------- 1 ekolb ekolb 0 Jan 20 13:38 lam_mpi_test.e601
  • Using checkjob
$ checkjob -v 601


checking job 601 (RM job '601.llaima1.uvic.ca')

State: Running
Creds: user:ekolb group:ekolb class:LLAIMA qos:DEFAULT
WallTime: 00:00:11 of 12:00:00
SubmitTime: Wed Apr 12 08:44:19
(Time Queued Total: 00:00:02 Eligible: 00:00:02)

StartTime: Wed Apr 12 08:44:21
Total Tasks: 70

Req[0] TaskCount: 70 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 58M
Utilized Resources Per Task: [NONE]
Avg Util Resources Per Task: PROCS: 0.01
Max Util Resources Per Task: [NONE]
Average Utilized Memory: 4060.00 MB
Average Utilized Procs: 70.00
NodeAccess: SINGLEJOB
TasksPerNode: 2 NodeCount: 35
Allocated Nodes:
[r02u32:2][r02u30:2][r02u29:2][r02u28:2]
[r02u27:2][r02u26:2][r02u25:2][r02u24:2]
[r02u23:2][r02u17:2][r02u16:2][r02u15:2]
[r02u14:2][r02u13:2][r02u12:2][r02u10:2]
[r02u09:2][r01u32:2][r01u31:2][r01u30:2]
[r01u29:2][r01u27:2][r01u26:2][r01u25:2]
[r01u24:2][r01u23:2][r01u17:2][r01u16:2]
[r01u15:2][r01u14:2][r01u13:2][r01u12:2]
[r01u11:2][r01u10:2][r01u09:2]
Task Distribution: r02u32,r02u32,r02u30,r02u30,r02u29,r02u29,r02u28,r02u28,r02u27,r02u27,r02u26,...


IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

Reservation '601' (-00:00:11 -> 11:59:49 Duration: 12:00:00)
PE: 70.00 StartPriority: 1

Job Control

  • Cancelling a job
$ qdel 601
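After cancelling, you can confirm that the job no longer appears in the queue by checking it again, for example:

$ qstat -u ekolb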