
Using the WestGrid Clusters at UVic


Access to WestGrid resources requires a WestGrid account, which requires registration with Compute Canada.  Please see Registering with the CCDB and Getting an Account on the WestGrid site.

Once you have a WestGrid ID, the cluster's interactive nodes can be accessed by ssh to hermes.westgrid.ca or nestor.westgrid.ca. Currently there are two interactive nodes and round-robin DNS is used for load-balancing access.

Preparing and Submitting Jobs

Before a job is submitted to the cluster, it must be defined via a small script that sets up its environment and runs the program that does the actual processing. This script is then submitted to Torque. The script may be written in any of several languages, including bash, csh, and Perl, and will contain a number of directives that tell Torque which resources are required.

For an example, consider the following bash script:

#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l mem=100mb
#PBS -e $HOME/output.err
#PBS -o $HOME/output.out

hostname
date

This job definition script starts with a number of PBS directives. These directives request resources and specify where the results should be written. The first directive is particularly important: the wall time, which specifies how long the job is expected to run. In your own jobs, be generous here; if the job runs longer than the specified wall time, it will be terminated. Many users take their best estimate and multiply it by three for the wall time.

Following the directives is, simply, the rest of the script, doing whatever processing the user requires. In this example the job simply returns the hostname and the current time.

Once the script is written and the job is thus prepared, it is submitted to PBS using the qsub command:

litai05$ qsub test.sh

Torque returns a job ID that may then be used to query the status of the job, or to delete it (generally, only the leading numerical portion is necessary). When the job finishes, its output and standard error are written to files. If output paths are not specified (they were, in the example above, via the -e and -o directives), the files are written to the current working directory.
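As a sketch, the numerical portion can be extracted in bash before being passed to qstat or qdel (the job ID shown here is a hypothetical example value):

```shell
# A full job ID as returned by qsub (hypothetical example value)
JOBID="102.moab01.westgrid.uvic.ca"

# Strip everything from the first dot onward, leaving the numerical portion
JOBNUM=${JOBID%%.*}
echo "$JOBNUM"    # prints 102

# The short form can then be used to query or delete the job, e.g.:
#   qstat $JOBNUM
#   qdel $JOBNUM
```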

For more information on qsub and its directives, please consult the qsub manual page, via man qsub at the command prompt.
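As a sketch, a job script combining several commonly used qsub directives might look as follows. The job name and e-mail address are placeholders: -N names the job as shown in qstat output, -j oe merges standard error into standard output, and -m/-M request e-mail notification.

```shell
#!/bin/bash
#PBS -N example_job            # job name, as shown in qstat output
#PBS -l walltime=1:00:00       # expected run time (job is killed beyond this)
#PBS -l mem=100mb              # memory required
#PBS -j oe                     # merge standard error into standard output
#PBS -m abe                    # mail on abort, begin, and end
#PBS -M user@example.com       # placeholder address for notifications

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from;
# the fallback to "." is only so the script also runs outside Torque
cd "${PBS_O_WORKDIR:-.}"

hostname
date
```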

For parallel jobs on the Nestor cluster, your program should be compiled using an MPI-capable compiler and run using mpirun (or mpiexec), which can be called from your job script.  For example:


#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l mem=100mb
#PBS -l procs=16
#PBS -j oe

module load mpi/openmpi
mpirun hostname

This job can be submitted using qsub as before; the procs directive in the script specifies the number of processors required:

litai05$ qsub mpi.pbs

This job will be routed to the Nestor partition so it executes on nodes with a high-speed interconnect.
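Processor layout can also be requested explicitly using Torque's nodes/ppn syntax rather than a flat processor count. As a sketch (whether Nestor prefers procs= or nodes=...:ppn=... is cluster policy, so check the WestGrid documentation before relying on this form):

```shell
#!/bin/bash
# Request 2 nodes with 8 processors per node (16 processors total),
# instead of the flat "procs=16" request used above
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00
#PBS -j oe

module load mpi/openmpi    # set up the Open MPI environment
mpirun hostname            # one copy of hostname per allocated processor
```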

Checking Job Status

While a job is queued or running, its status may be viewed by issuing the qstat command to view the state of the queues. Specifying a user ID lists the jobs owned by that user:

litai05$ qstat -u renge
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
102.moab01.westg     renge    hermes   test.sh             --    --    1  100mb 01:00 R   --

The -q option instead displays a summary of the jobs assigned to a particular queue:

litai05$ qstat -q hermes
server: moab01.westgrid.uvic.ca
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
hermes             --      --       --      --    0   1 --   E R
                                               ----- -----
                                                   0     1

For more information on this command, issue man qstat from one of the interactive nodes.

Another useful command is showq. This is a Moab command that shows the state (running, idle, or blocked) of all jobs in the system, in the order in which they are running or will be dispatched. For running jobs, it lists when they started and how much of their wall time remains. For blocked or idle jobs, it lists their wall time and how long they have been queued.

ACTIVE JOBS--------------------
5208 babarpro Running 1 17:26:38 Tue Mar 1 08:19:26
5210 babarpro Running 1 17:31:50 Tue Mar 1 08:24:38
5211 babarpro Running 1 17:32:22 Tue Mar 1 08:25:10
...more jobs...
63 Active Jobs 63 of 72 Processors Active (87.50%)
32 of 36 Nodes Active (88.89%)

IDLE JOBS----------------------

0 Idle Jobs

BLOCKED JOBS----------------
5283 babarpro Idle 1 23:59:00 Mon Feb 28 23:18:19
5284 babarpro Idle 1 23:59:00 Mon Feb 28 23:18:20
...more jobs...

Total Jobs: 229 Active Jobs: 63 Idle Jobs: 0 Blocked Jobs: 166

Generally, when resources are fully utilised and jobs are queued, the queued jobs are considered "blocked", waiting for resources to become available. "Idle" jobs are those waiting for specific resources, such as a particular node.