Overview
The clusters are designed to run both serial and parallel jobs. The following sections explain how to write and submit a job, with examples that walk through the process.
Submitting Serial Jobs
Steps involved in submitting a serial job:
- Write a batch script/program containing the commands that need to be executed.
- Connect to the submit host for the cluster using the account that was provided to you.
- Submit the program using the qsub command.
- Check the status of the program using the qstat command.
- Check the output files. Each file name is composed of the job script name, a dot, an “o” for the stdout file or an “e” for the stderr file, and finally the unique job ID.
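The naming rule for the output files can be sketched in a few lines of shell. The script name and job ID below are example values (taken from Example 1); the real job ID is assigned by Grid Engine when the job is submitted.

```shell
# Example values only: the job ID is assigned by the scheduler at submit time
script_name="simple.sh"
job_id="2094917"

# <script name>.o<job ID> holds stdout, <script name>.e<job ID> holds stderr
stdout_file="${script_name}.o${job_id}"
stderr_file="${script_name}.e${job_id}"

echo "$stdout_file"
echo "$stderr_file"
```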
Example 1: Simple serial job
1) Connect to the submit host for the cluster.
2) Create the following script, called simple.sh:
#!/bin/sh
#This is a simple example of a Sun Grid Engine batch script
# Print date and time
date
# Sleep for 20 seconds
sleep 20
# Print date and time again
date
# Print the hostname
hostname
3) Submit the script using the following command:
ens-hpc% qsub -cwd simple.sh
The -cwd option is used so that the job runs in, and the output files are saved in, the directory from which the job was submitted. If the -cwd option is not used, the output is stored in your home directory.
Once the job is submitted, you will get the following response:
ens-hpc% Your job # (simple.sh) has been submitted
4) To check the status of your job, type:
ens-hpc% qstat -f
5) Sample output of the qstat command:
---------------------------------------------------------------------------------
ens.q@node7 BIP 0/5/12 0.00 lx-amd64
2094917 0.55500 simple.sh shaila t 01/22/2015 15:40:07 1
---------------------------------------------------------------------------------
The stdout and stderr files in this case are simple.sh.o<jobid> and simple.sh.e<jobid>. The contents of the stdout file are shown below:
[shaila@ens-hpc ~]$ more simple.sh.o2094917
Thu Jan 22 15:40:32 MST 2015
node7
Since there are no error messages, simple.sh.e2094917 is empty.
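A quick way to confirm that a job produced no error output is to test whether its stderr file is empty. The following is a minimal sketch: the helper name check_err_file is hypothetical, and the demonstration uses a temporary file in place of a real simple.sh.e<jobid> file.

```shell
# Hypothetical helper: report whether a job's stderr file contains anything
check_err_file() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
    elif [ -s "$1" ]; then
        echo "errors reported in $1"
    else
        echo "no errors in $1"
    fi
}

# Demonstration with a temporary empty file standing in for simple.sh.e<jobid>
err_file=$(mktemp)
empty_result=$(check_err_file "$err_file")
echo "$empty_result"

# Once something has been written to the file, the helper reports it
echo "Command not found" > "$err_file"
filled_result=$(check_err_file "$err_file")
echo "$filled_result"
rm -f "$err_file"
```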
Example 2: Serial MATLAB job
1) First, write a MATLAB batch job (i.e., a MATLAB .m file). A sample MATLAB batch file is given below:
ens-hpc% more sample.m
a = [1 2 3; 4 5 6];
magic(a);
a
quit
2) Next, write a script to execute the above program. The script, called mymatlabjob.sh, is given below:
ens-hpc% more mymatlabjob.sh
#!/bin/csh
# Defining various SGE parameters
#$ -cwd
#$ -N testmatlab
#$ -e myjob2.err
/usr/local/bin/matlab -nodisplay -nosplash < sample.m >& out.txt
In the script, MATLAB is invoked with the -nodisplay and -nosplash options so that it runs in command-line mode. The input to the matlab command is sample.m, and both stdout and stderr are written to out.txt.
3) Use qsub to submit the job to the grid:
ens-hpc% qsub mymatlabjob.sh
4) Part of the listing from qstat is given below:
---------------------------------------------------------------------------------
ens.q@node10 BIP 0/5/12 0.00 lx-amd64
2094918 0.55500 testmatlab shaila r 01/22/2015 15:47:52 1
---------------------------------------------------------------------------------
The output file is out.txt. Its contents are shown below:
< M A T L A B (R) >
Copyright 1984-2014 The MathWorks, Inc.
R2014a (8.3.0.532) 64-bit (glnxa64)
February 11, 2014
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> >> >>
a =
1 2 3
4 5 6
>>
Submitting Parallel Jobs
A number of parallel applications can be used on the cluster: various versions of MPICH, OpenMP, parallel MATLAB, and parallel Fluent. Steps to follow when submitting a parallel job:
- Connect to the submit host for the cluster using the account that was provided to you.
- Write a batch script/program containing the commands that need to be executed. For example, an mpich program would be written in mpich code. For a parallel MATLAB program, you would write a .m file containing the MATLAB code.
- Write a script file that will be submitted to the cluster. This should contain the execution instructions that are needed to run your code.
- Submit the program using the qsub command.
- Check the status of the program using the qstat command.
- Once the job is completed, check the output of the program.
Notes on the qsub command:
In the qsub command, you need to include the number of processors and the parallel environment you would like to use. The parallel environments available are MPI, OpenMP, matlab, and fluent_pe.
For example, if you wish to submit an mpich1 job using 4 processors, you can use the following command:
qsub -cwd -pe MPI 4 myscript
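The pieces of this command line can be assembled and checked before submitting. The following is a minimal sketch: the helper build_qsub_cmd is hypothetical, it only prints the command rather than running it, and the list of parallel environment names is copied from the notes above.

```shell
# Hypothetical helper: print the qsub command line for a parallel job
# Arguments: parallel environment, number of processors, script name
build_qsub_cmd() {
    pe_name=$1
    slots=$2
    script=$3
    # Accept only the parallel environments listed in this guide
    case "$pe_name" in
        MPI|OpenMP|matlab|fluent_pe) ;;
        *) echo "unknown parallel environment: $pe_name" >&2; return 1 ;;
    esac
    echo "qsub -cwd -pe $pe_name $slots $script"
}

qsub_cmd=$(build_qsub_cmd MPI 4 myscript)
echo "$qsub_cmd"
```

Printing the command first makes it easy to spot a mistyped environment name or a missing processor count before the job reaches the queue.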
Example 3: Simple parallel job
Here is an example of writing and submitting an mpich2 job. The sample MPI program used in this example is saved as mpi.c. Next, write a batch file that compiles this program using mpicc and runs it using mpirun. The batch file, called mpi2batch, is given below:
ens-hpc% more mpi2batch
/usr/local/mpich2/1.5/x86_64/gnu/bin/mpicc -o mpi2 mpi.c -lm
/usr/local/mpich2/1.5/x86_64/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./mpi2
In the above script, $NSLOTS is set from the number of processors requested in the qsub command. Submit the job by running the following command:
shippo% qsub -cwd -pe MPI 4 mpi2batch
The above qsub command submits the job to the MPI parallel environment and requests that the job be run on 4 processors. To check the status of your job, type:
ens-hpc% qstat -f
Sample output:
---------------------------------------------------------------------------------
ens.q@gpu8 BIP 0/12/12 0.01 lx-amd64
---------------------------------------------------------------------------------
ens.q@node1 BIP 0/4/12 1.54 lx-amd64
2168245 0.60500 mpi2batch shaila r 01/27/2015 15:15:07 4
---------------------------------------------------------------------------------
ens.q@node10 BIP 0/0/12 1.32 lx-amd64
---------------------------------------------------------------------------------
ens.q@node11 BIP 0/0/12 1.41 lx-amd64
---------------------------------------------------------------------------------
From the above output, you can see that the job is running with 4 slots, listed under ens.q@node1. The output of the program is given below:
shippo% more mpi2batch.o343
myid 5 , lnbr 4 , rnbr 0
Success: I am 5 - left and right neighbors 4 and 0.
myid 3 , lnbr 2 , rnbr 4
Success: I am 3 - left and right neighbors 2 and 4.
myid 2 , lnbr 1 , rnbr 3
Success: I am 2 - left and right neighbors 1 and 3.
myid 1 , lnbr 0 , rnbr 2
Success: I am 1 - left and right neighbors 0 and 2.
myid 4 , lnbr 3 , rnbr 5
Success: I am 4 - left and right neighbors 3 and 5.
myid 0 , lnbr 5 , rnbr 1
Success: I am 0 - left and right neighbors 5 and 1.
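The $NSLOTS variable used in the batch file is set by Grid Engine when the job runs, so it is not defined in an ordinary login shell. A minimal sketch of a fallback, assuming you want to try the script by hand outside the scheduler, is shown below. The function name slots_or_default and the default of 1 are illustrative choices, not part of the cluster configuration.

```shell
# Hypothetical helper: print the slot count granted by Grid Engine,
# or 1 when NSLOTS is unset or empty (for example, in a login shell)
slots_or_default() {
    echo "${NSLOTS:-1}"
}

np=$(slots_or_default)
echo "Would start $np process(es)"
```

The value of np could then be passed to mpirun with -np in place of $NSLOTS.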
Submitting other parallel jobs
The same steps need to be followed to submit MATLAB or Fluent parallel jobs. The only difference is the parallel environment that is given to the qsub command. For MATLAB, the parallel environment is matlab. For Fluent, the parallel environment is fluent_pe.
Additional Tips
A number of options can be included in your submission script. Rather than typing those options on the command line when issuing the qsub command, it is preferable to include them in your script. This reduces the risk of giving incorrect inputs to the qsub command. It also avoids the tedious task of retyping long commands every time a change is made.
These options need to be added to the top of your script. Each option is preceded by #$.
The following are some useful options that you can include in your script.
- Rather than typing -cwd every time the job is submitted, you can include it in the script by adding the following: #$ -cwd
- You can give a name to your job. This is different from the script name. To do this, add: #$ -N jobname
- You can include the parallel environment to which you want to submit the job, along with the number of processors: #$ -pe parallelenvironment numberofprocessors For example: #$ -pe MPI 4
- If you want to submit the job to a specific host, you can even specify a hostname as follows: #$ -l hostname=node1
- The shell can be specified as: #$ -S /bin/csh
- The output and error files can be specified by the following: #$ -o outputfilename and #$ -e errorfilename
- You can request specific resources using the -l option. For example, if you want your job to run on a host that has 20M of free virtual memory, you would include the following in your script: #$ -l h_vmem=20m
Many more options can be supplied to the qsub command. These are listed in the man page for qsub, which can be viewed by typing the following: man qsub
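Putting several of these options together, a submission script might begin as follows. This is a sketch only: the job name myjob and the file names myjob.out and myjob.err are placeholders. Outside Grid Engine the #$ lines are ordinary comments, so the script can also be run directly with sh to check it.

```shell
#!/bin/sh
# Grid Engine options (plain comments to the shell itself)
#$ -cwd
#$ -N myjob
#$ -S /bin/sh
#$ -o myjob.out
#$ -e myjob.err

# Record where and when the job ran
job_host=$(uname -n)
job_start=$(date)
echo "Job started on $job_host at $job_start"
```

With these lines in place, the job can be submitted with just qsub followed by the script name.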
Troubleshooting
The file <jobname>.e<jobid> generally contains the error messages that are encountered when the job is running on the cluster. Make sure you check this file carefully for any error messages. Quite often, there will be error messages about a file not being found. By default, the current working directory (.) is not included in the path in your .bashrc file. So, either enter the full path name in the script or include . in your path.
A common error that you will find in this file is “Command not found”. This occurs when the command that you are running is not in your path. This can be corrected either by including it in the path in the .cshrc/.bashrc file or by invoking the program using the complete path name.
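One way to catch this problem early is to test for each required command at the top of the job script. The following is a minimal sketch for an sh script; the helper name require_cmd is hypothetical, and sh is used as the example command only because it is always present.

```shell
# Hypothetical helper: succeed if the named command can be found in PATH,
# otherwise print a message to stderr and return a non-zero status
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "Command not found in PATH: $1" >&2
    return 1
}

if require_cmd sh; then
    found_sh="yes"
else
    found_sh="no"
fi
echo "sh available: $found_sh"
```

In a real job script, the check would name the program being run (for example matlab or mpirun), and the script would exit when the check fails.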
Sometimes there are unexplained errors like “No such file or directory” even when the file exists, or the job does not run at all and stays in the “Pending” state. This can occur if you created the script file on a Windows machine, which adds an extra carriage return character at the end of each line. These characters are not recognized by Unix. To fix this, type the following command:
dos2unix filename
This will remove the extra “^M” at the end of each line. Then open the file using your favorite editor (vi, emacs, pico) and make sure it does not contain any “^M” characters. Then save the file. Do NOT quit without saving. Once you have fixed the file, submit it again.
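Carriage returns can also be detected and removed with standard tools, which may help on a machine where dos2unix is not installed. The following is a sketch under that assumption, using grep and tr; the file names winjob.sh and unixjob.sh are placeholders, and the demonstration builds its own small test file in a temporary directory.

```shell
# Create a small script with Windows line endings for demonstration
work_dir=$(mktemp -d)
printf '#!/bin/sh\r\ndate\r\n' > "$work_dir/winjob.sh"

# A literal carriage return character to search for
cr_char=$(printf '\r')

# Count the lines that contain a carriage return
cr_before=$(grep -c "$cr_char" "$work_dir/winjob.sh")

# Strip all carriage returns into a new file
tr -d '\r' < "$work_dir/winjob.sh" > "$work_dir/unixjob.sh"

# grep exits non-zero when nothing matches, so guard the count
cr_after=$(grep -c "$cr_char" "$work_dir/unixjob.sh" || true)

echo "lines with carriage returns: before=$cr_before after=$cr_after"
rm -rf "$work_dir"
```

Writing to a new file rather than overwriting the original keeps the Windows copy available for comparison.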