
Job Scheduling

Job scheduling in a batch system can be very complex, with many variables.  Our scheduling software is flexible enough to meet the needs of computing centres in many industries: while a small cluster serving one research group may use a very simple scheduler, possibly even a first-come, first-served policy, a system the size of Hermes/Nestor, serving many research groups at many different institutions, requires a more sophisticated configuration.

This page describes how jobs on the Nestor and Hermes clusters are prioritised and run.  For further information, please contact support@westgrid.ca.

Wall-clock time

Wall-clock time, sometimes called simply "wall time", is the real-world time used or expected to be used by a job, regardless of the actual CPU cycles consumed.  A job that starts at four in the afternoon and completes at ten that night uses six hours of wall-clock time; a job that starts at three p.m. with a declared wall-clock time of two hours should be done by five.
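
To make the definition concrete, here is a small Python sketch of the first example; the date is arbitrary and only the elapsed time matters.

    # Sketch: wall-clock time is elapsed real time, however busy the CPU was.
    from datetime import datetime

    start = datetime(2011, 3, 1, 16, 0)   # four in the afternoon
    end = datetime(2011, 3, 1, 22, 0)     # ten that night
    print(end - start)                    # 6:00:00, i.e. six hours of wall-clock time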

Wall-clock time is an important metric in batch computing.  Although it does not necessarily reflect the actual amount of computation done (time spent transferring data to be analysed also contributes to wall-clock time), it is an upper bound on how long a job will hold the resources it has been assigned, and therefore a predictor of when the next job may be run.  Most clusters will terminate a job whose actual wall-clock time exceeds the expected wall time declared when the job was submitted.

Different clusters may prioritise jobs differently based on the declared wall-clock time, or route jobs with different wall-clock times to different execution queues or nodes.  Nestor and Hermes do not.  It is still important, however, to choose a reasonable value, since it helps other users predict when their jobs will run and helps the scheduler make effective use of available resources.

It is not necessary to pick an exact value; however, there are three factors that must be considered when choosing a wall-clock time for your job (a brief sketch follows this list):

  1. Jobs that exceed their declared wall-clock time will be terminated.
  2. Unexpected slowdowns in I/O to disk or the network may adversely affect job performance to the extent that your job may not always complete in the allotted time.  Please provide some margin for error here.
  3. The maximum walltime on Hermes and Nestor is 72 hours.
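
As a rough illustration of these factors, the following sketch pads an estimated runtime with a safety margin and refuses requests that would exceed the 72-hour limit.  The 25% margin and the HH:MM:SS output format are illustrative assumptions, not site requirements.

    # Sketch: derive a walltime request from an estimated runtime.
    MAX_WALLTIME_HOURS = 72  # maximum walltime on Hermes and Nestor

    def requested_walltime(estimated_hours, margin=0.25):
        """Pad the estimate to allow for I/O slowdowns, within the 72-hour cap."""
        padded = estimated_hours * (1.0 + margin)
        if padded > MAX_WALLTIME_HOURS:
            raise ValueError("job does not fit within the 72-hour limit; "
                             "consider splitting it into shorter jobs")
        total_seconds = int(padded * 3600)
        hours, rest = divmod(total_seconds, 3600)
        minutes, seconds = divmod(rest, 60)
        return "%02d:%02d:%02d" % (hours, minutes, seconds)

    print(requested_walltime(10))   # a ten-hour estimate becomes "12:30:00"

A padded request of this sort guards against the I/O slowdowns mentioned above without declaring so much extra time that the scheduler can no longer plan around the job.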

Fairshare

Moab's Fairshare mechanism provides a job prioritisation factor based on a credential's historical usage of resources and that credential's configured share of those resources.  On Hermes and Nestor, as on other WestGrid systems, the credential employed is the accounting group determined by the research group or groups to which each user belongs, and the share is determined by RAC allocation (or the lack thereof).

The majority of WestGrid resources are designated on a yearly basis for use by specific research groups according to expected need and other factors.  A smaller portion remains for general use.  Every user with a WestGrid account is accorded some share of resources on this basis; this share constitutes a target amount of resources to be consumed by that user or group.

Priority is given to users or groups who have not yet met their targets within a sliding time window that ends at the current time.  On Nestor and Hermes this window is two weeks long, so somebody who has not made much use of the cluster in the past two weeks will have higher priority than somebody who has been using the system consistently, so long as their shares support it.

Further complicating this mechanism is the concept of decay, in which the consumption of resources from last week is considered less relevant than the consumption from this week.  This allows past consumption to factor in without drowning out present need, so that heavy usage from ten days ago does not effectively block a user from running jobs even in the presence of other jobs submitted by users with unfulfilled targets.
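
The sketch below shows one way such a decayed usage figure could be computed.  The one-day interval and the 0.8 per-day decay factor are illustrative assumptions; the actual values configured in Moab on Hermes and Nestor are not documented here.

    # Sketch: decay-weighted fairshare usage over a two-week sliding window.
    # usage_by_day[0] is today's usage (in processor-hours, say),
    # usage_by_day[13] is usage from thirteen days ago.
    DECAY = 0.8          # assumed per-day decay factor
    WINDOW_DAYS = 14     # two-week window used on Hermes and Nestor

    def effective_usage(usage_by_day):
        """Weight each day's usage so that older consumption counts for less."""
        return sum(usage * DECAY ** age
                   for age, usage in enumerate(usage_by_day[:WINDOW_DAYS]))

    def fairshare_factor(usage_by_day, target):
        """Positive when a group is under its target, negative when over."""
        return target - effective_usage(usage_by_day)

    # Heavy usage ten days ago has largely decayed away, so this group
    # still appears to be well under its target of 400 processor-hours.
    print(fairshare_factor([0.0] * 10 + [500.0, 500.0, 0.0, 0.0], target=400.0))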

Maximum processor seconds

On a serial cluster where all jobs use one CPU, it is fairly simple to limit cluster resource consumption by the number of jobs.  In this way a policy on a 100-node cluster can be set in which any user may have at most 25 jobs running at any one time.  However, on a parallel system, this may mean that one user could consume the entire cluster by submitting 25 jobs each using four nodes.

Further, the length of time that computing resources are consumed also matters.  It may be acceptable for one user to consume 25 nodes for five days, and equally acceptable for a large parallel job to consume the entire cluster for half a day.

To meet both criteria, the metric of processor seconds is introduced.  One processor second is equivalent to one processor in use for one second.  Using this metric, the following limits are in effect:

  • Hermes: 12,960,000 processor seconds, equivalent to 100 processors for 1.5 days, or 30 processors for five days.
  • Nestor: 144,115,200 processor seconds, roughly equivalent to 2184 processors for 18 hours.

These limits will be adjusted as needed.
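
To illustrate how this metric constrains a set of running jobs, the sketch below totals processor seconds for a user's jobs (processors times requested walltime) and compares the total against the limits quoted above.  The job mix in the example is made up.

    # Sketch: per-user processor-second accounting against the quoted limits.
    LIMITS = {
        "hermes": 12_960_000,    # 100 processors for 1.5 days
        "nestor": 144_115_200,   # roughly 2184 processors for 18 hours
    }

    def processor_seconds(procs, walltime_hours):
        """One processor second = one processor in use for one second."""
        return procs * walltime_hours * 3600

    def within_limit(cluster, jobs):
        """jobs is a list of (processors, walltime_hours) pairs."""
        total = sum(processor_seconds(procs, hours) for procs, hours in jobs)
        return total, total <= LIMITS[cluster]

    # 25 four-processor jobs of 36 hours each on Hermes come to exactly
    # 12,960,000 processor seconds, right at the limit.
    print(within_limit("hermes", [(4, 36)] * 25))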

Maximum idle jobs

There are three states for a job:

  1. Running, in which the job has been assigned resources and has been dispatched to use those resources;
  2. Idle, in which the job is under consideration for running and as such is evaluated for start priority and resources; and
  3. Blocked, in which a job is registered but not queued and not currently being evaluated for priority or considered for running.

Limiting the number of jobs in the idle state for a particular user or credential means that a user who submits 5,000 jobs and then leaves on vacation has no advantage over a user who submits a smaller number more often, so long as other factors are equivalent for the two users.  Without such a limit, a user could flood the system with jobs and effectively block out other users with equal priority, even in the absence of queuing time as a prioritisation factor.

The idle-job limit per user is expressed in terms of the number of processors required by those jobs: at most 100 processors' worth of idle jobs per user on Hermes, and 2256 on Nestor.  Jobs beyond this limit remain in the blocked state and are queued (and assigned priority) once the idle jobs for that user fall under the above limit.
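
The sketch below illustrates this bookkeeping: a user's queued jobs are admitted to the idle state, in queue order, until their processor total would exceed the per-user cap, and the remainder are left blocked.  The job list and the in-order admission are illustrative assumptions.

    # Sketch: split a user's queued jobs into idle and blocked states
    # based on the per-user cap on processors in idle jobs.
    IDLE_PROC_CAP = {"hermes": 100, "nestor": 2256}

    def classify(cluster, job_procs):
        """job_procs: processors requested by each queued job, in queue order."""
        idle, blocked, used = [], [], 0
        for procs in job_procs:
            if used + procs <= IDLE_PROC_CAP[cluster]:
                idle.append(procs)      # evaluated for priority and resources
                used += procs
            else:
                blocked.append(procs)   # waits until idle jobs start or finish
        return idle, blocked

    # Twenty 8-processor jobs on Hermes: the first twelve (96 processors)
    # are idle, the remaining eight are blocked.
    print(classify("hermes", [8] * 20))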

Backfill

Backfill is a mechanism by which resources reserved for one job in the future may be used in the meantime by lower-priority jobs with lower resource requirements.

For example, consider a four-processor cluster in which one job "A" is running on one core for the next four hours.  Jobs B, C, and D are queued and waiting.  Job B is the highest priority job and, requiring four cores, is predicted to wait for four hours.  Job C is next and requires one core for five hours.  Job D is the lowest priority and requires two cores for two hours.

Job B is essentially waiting for job A to finish and release the processor it is using, at which point all four processors will be free.  Job C cannot start ahead of job B, which has higher priority, and it cannot backfill either: with a five-hour wall-clock time it would still be running when job B's reservation begins.  Job D, although the lowest priority, needs only two of the three idle cores currently reserved for job B, and with a two-hour wall-clock time it will complete before job B is projected to run, so the scheduler can start it immediately.  This is backfill.
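
A minimal sketch of that decision, under the simplifying assumption of a single future reservation (job B's) and walltimes that are taken at face value:

    # Sketch: can a waiting job backfill ahead of a reserved higher-priority job?
    def can_backfill(job_procs, job_hours, idle_procs, hours_until_reservation):
        """True if the job fits on the idle processors and will finish
        before the reserved job is due to start."""
        return job_procs <= idle_procs and job_hours <= hours_until_reservation

    # The example above: three idle processors, job B reserved in four hours.
    print(can_backfill(1, 5, 3, 4))   # job C: fits, but runs too long -> False
    print(can_backfill(2, 2, 3, 4))   # job D: fits and finishes in time -> True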

Backfill is implemented on Nestor and Hermes.

Static node allocations

Completely independent of these job prioritisation and scheduling policies are two sets of nodes on the Hermes cluster dedicated to particular research groups or functions.  Scheduling for these sets is handled separately from general user jobs by other mechanisms.

The first set of nodes is dedicated to the ATLAS experiment.  In the future this set will be merged with the rest of the Hermes cluster, the partition boundaries will be removed, and ATLAS jobs will be managed using the same mechanisms as jobs from other research groups.

The second set of nodes is used for a research computing cloud.  Nodes in this partition have special requirements and cannot currently be managed in the same way as the rest of the cluster.