Engineering Compute Clusters
These compute clusters are part of the College of Engineering's high-performance computing (HPC) resources for scientific research and teaching. They consist of state-of-the-art compute nodes with substantial processing power and memory, job-scheduling software, and general-purpose engineering applications.
The clusters are designed to run both serial and parallel jobs. The following sections describe how to write and submit a job, with examples to illustrate the process.
Detailed Usage Guide
Steps involved in submitting a serial job:
Here is a very simple program that prints the date and time and the hostname of the machine where the job is running. The script is called simple.sh.
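The script itself is not reproduced on this page; a minimal version consistent with that description would be:

```shell
#!/bin/bash
# simple.sh: print the current date/time and the name of the execution host
date
hostname
```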
1) Connect to the submit host for the cluster.
2) Write the job script (here, simple.sh).
3) Submit the job using qsub:
ens-hpc% qsub -cwd simple.sh
The -cwd option tells the scheduler to write the output files to the directory from which the job was submitted. If -cwd is omitted, the output files are written to your home directory.
Once the job is submitted, you will get a response of the form:
Your job <job#> ("simple.sh") has been submitted
where <job#> is the ID number assigned to the job by the grid software.
4) To check the status of your job, type:
ens-hpc% qstat -f
5) When the job has finished, its stdout and stderr are written to files in the submission directory: simple.sh.o<job#> and simple.sh.e<job#>, respectively. The contents of these files are shown below:
[shaila@ens-hpc ~]$ more simple.sh.o2094917
Thu Jan 22 15:40:12 MST 2015
Since there are no error messages, simple.sh.e2094917 is empty.
1) We first need to write a matlab batch job (i.e., a matlab .m file). A sample matlab batch file, sample.m, is given below:
a = [1 2 3; 4 5 6];
2) Next, you need to write a script, called mymatlabjob.sh, to run this job.
In the script, matlab is invoked with the -nodisplay and -nosplash options so that matlab runs in command-line mode. The input to the matlab command is sample.m and the output is stored in out.txt.
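Based on that description, mymatlabjob.sh would look something like the following sketch (it assumes the matlab command is on your PATH; otherwise invoke it with its full pathname):

```shell
#!/bin/bash
# Run MATLAB non-interactively: read commands from sample.m and
# capture everything MATLAB prints in out.txt
matlab -nodisplay -nosplash < sample.m > out.txt
```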
3) Use qsub to submit the job to the grid:
ens-hpc% qsub mymatlabjob.sh
4) Check the status of the job with qstat, as in the serial example.
5) The output file is out.txt. See the contents of this file by entering this command:
ens-hpc% more out.txt
A number of parallel applications can be used on the cluster: various versions of mpich, OpenMP, parallel matlab, and parallel fluent. Steps to follow when submitting a parallel job:
Notes on the qsub command
In the qsub command, you need to include the number of processors and the parallel environment you would like to use. The parallel environments available are MPI, OpenMP, matlab, and fluent_pe.
For example, if you wish to submit a mpich1 job using 4 processors, you can use the following command:
qsub -cwd -pe MPI 4 myscript
Here is an example of writing and submitting an mpich2 job using a sample MPI program, mpi.c. Next we need to write a batch file to compile and run this program using mpicc. The batch file, called mpi2batch, is given below:
ens-hpc% more mpi2batch
/usr/local/mpich2/1.5/x86_64/gnu/bin/mpicc -o mpi2 -lm mpi.c
/usr/local/mpich2/1.5/x86_64/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./mpi2
In the above script, the value of $NSLOTS is set by the scheduler from the number of processors requested on the qsub command line, and $TMPDIR/machines is the machine file the scheduler generates for the job. Submit the job with the following command:
ens-hpc% qsub -cwd -pe MPI 4 mpi2batch
The above qsub command submits the job to the MPI parallel environment and requests that the job be run on 4 processors. To check the status of your job, type
ens-hpc% qstat -f
From the qstat output, you can see that 4 processes are running, in this case each on a different machine. The output of the program is given below:
ens-hpc% more mpi2batch.o343
The same steps need to be followed to submit matlab or fluent parallel jobs. The only difference is in the parallel environment that is used when submitting using the qsub command. For matlab, the parallel environment is matlab. For fluent, the parallel environment is fluent_pe.
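For example, the submission commands would look like the following (the script names mymatlabjob.sh and myfluentjob.sh, and the slot counts, are placeholders):

```shell
# Same form as the MPI example; only the parallel environment differs
qsub -cwd -pe matlab 4 mymatlabjob.sh
qsub -cwd -pe fluent_pe 8 myfluentjob.sh
```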
A number of options can be included in your submission script. Rather than typing those options on the command line when issuing the qsub command, it is preferable to include them in your script. This reduces the risk of giving incorrect inputs to qsub, and it avoids the tedium of retyping long commands every time something changes.
These options are added at the top of your script, and each option is preceded by #$ .
Following are some useful options that you can include in your script.
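As an illustration, a script with embedded options might begin like this (the job name and output filename are placeholders; -cwd, -N, -o, and -j y are standard qsub options):

```shell
#!/bin/bash
#$ -cwd          # write output files to the submission directory
#$ -N myjob      # job name (placeholder)
#$ -o myjob.out  # file to receive stdout (placeholder)
#$ -j y          # merge stderr into the stdout file
# The #$ lines are comments to the shell but are read as options by qsub.
date
hostname
```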
Many more options can be supplied to qsub. They are listed in the qsub man page, which can be viewed by typing: man qsub
The file <jobname>.e<jobid> contains any error messages encountered while the job runs on the cluster. Make sure you check this file carefully. Quite often there will be messages about a file not being found. By default, . (the current working directory) is not included in the PATH set by your .bashrc file, so either enter the full pathname in the script or add . to your PATH.
A common error that you will find in this file is "Command not found". This occurs when the command you are running is not in your PATH. It can be corrected either by adding the command's directory to the PATH in your .cshrc/.bashrc file or by invoking the program with its complete pathname.
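A minimal fix, assuming a bash login shell, is to append the current directory to the search path in your .bashrc:

```shell
# Append the current directory (.) to the command search path
export PATH="$PATH:."
```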
Sometimes there are unexplained errors like "No such file or directory" even when the file exists, or the job simply never runs and stays in the pending state. This often happens when the script file was created on a Windows machine: Windows puts an extra carriage-return character at the end of each line, which Unix does not recognize. To fix this, type the following command:
dos2unix filename filename
This removes the extra "^M" at the end of each line. Then open the file in your favorite editor (vi, emacs, pico), verify that the "^M" characters are gone, and save the file. Do NOT quit without saving. Once the file is fixed, submit the job again.
This document last modified Tuesday October 03, 2017