Supported AI Discovery Infrastructure for Engineering (SAIDIE)
Current Reservations
- First research group
- Second research group
Cluster Machine List and Queue overview
Click here for a list of names and specs for each of the nodes in the cluster along with an overview of the queues that you can submit jobs to.
OnDemand Cluster Portal
Open OnDemand is a web-based portal for accessing HPC resources. A VPN is required if you are not on campus.
Getting Access and Eligibility
The SAIDIE cluster is restricted to AI faculty and their groups, along with a select few other AI researchers. Fill out a Freshservice request here and we will review your eligibility to be added to the cluster.
General Guidelines
- Use the job scheduler.
- Do not run jobs on the login node. Processes that impact the performance of the login node will be killed without notice. If you need help running your job, please contact us and we can get you started on SAIDIE.
- Be good to each other.
- This cluster is shared by several research groups. Please do your best to be fair and kind to others.
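If you need an interactive shell for compute work, request one through the scheduler rather than using the login node. A minimal sketch (the partition and time limit below are illustrative choices, not cluster-mandated values):

```shell
# Ask Slurm for a one-hour interactive session on a compute node,
# instead of running the work on the shared login node.
srun --partition=general --time=01:00:00 --pty bash
```

When the allocation is granted, you get a shell on a compute node; anything you run there counts against your job's resources, not the login node.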
Partitions
Partitions are separate queues for submitted jobs and can contain overlapping groups of nodes. When you submit your job, resources are allocated from the partition’s nodes and your job runs on one or more of the nodes in that group. You can use the sinfo or overview command to see the list of partitions you can submit to. Below is a list of the basic partitions configured on the SAIDIE cluster.
general - Use this partition for multiple short-term jobs. Add checkpointing to your program in case you need to restart it.
- This partition is capped at a max run time of 1 day.
- This is the default partition.
long-runs - This partition uses the same DGX nodes/resources as the general partition.
- This partition is capped at a max run time of 7 days.
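The choice of partition is made in your submission script’s SBATCH directives. A minimal sketch (the job name, time request, and script contents are assumptions for illustration):

```shell
# Write a minimal Slurm submission script targeting the general partition.
# general jobs are capped at 1 day; use --partition=long-runs for jobs that
# need up to 7 days of run time.
cat > sample_submission.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time=23:00:00        # must stay under the 1-day cap for general
#SBATCH --job-name=demo-job    # job name is an assumption
#SBATCH --output=%x_%j.out     # stdout/stderr file named after job name + ID

echo "running on $(hostname)"
EOF
```

Submit it with `sbatch sample_submission.sh`; Slurm rejects jobs whose `--time` request exceeds the partition’s cap.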
Storage on the Cluster
Each user gets 2 TB of space in the /home directory on our all-flash storage server.
The storage on SAIDIE is a VAST solid-state system and totals around 300 TB. We can set aside a small amount of space for groups that reserve the cluster. Contact us if you need this.
Research groups may add separate storage servers to the cluster; these will be available under the root (/) directory. Contact us for more information.
You can see your current storage usage with the get-quota command.
Please contact us to purchase more individual space or find out about more storage options.
Running a Job On the Cluster
When working with a cluster you won’t be running programs as you would on a personal computer or server. Instead, you interact with the cluster by issuing commands to the job scheduler. This is done via the command line and with scripts submitted to the scheduler to run your job. A typical workflow would involve the steps below.
- Move your data/code to the cluster folder.
- For Linux and macOS users, we recommend scp or sftp.
- For Windows we recommend WinSCP
- Write or edit your submission script to add your required scheduler options and what commands are needed to launch your program.
- Load any modules, activate virtual environments, and set any other environment variables that are required; these are not added by default.
- Submit your job to the cluster with the “sbatch” command, e.g. sbatch sample_submission.sh
- For info on how to submit, watch, cancel, or find info on your job, please see the basic usage guide.
- By default, instead of outputting to your terminal window, standard output and errors from your job will be written to a text file in the directory you submitted the job from. You can then check the output or move it to another location for further processing. The location can be changed with SBATCH options.
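The steps above can be sketched end to end. The hostname, paths, module name, and virtual environment below are assumptions for illustration, not cluster-specific values:

```shell
# 1. From your own machine, copy code/data to the cluster (hostname assumed):
#    scp -r ./myproject username@saidie.example.edu:~/myproject

# 2. On the cluster, write a submission script that sets up everything the
#    job needs; modules and environments are not loaded by default.
cat > sample_submission.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out   # stdout/stderr land here instead of your terminal

module load python                  # module name is an assumption
source ~/venvs/myenv/bin/activate   # your own virtual environment (assumed)
python ~/myproject/train.py
EOF

# 3. Submit it; sbatch prints the job number to note down:
#    sbatch sample_submission.sh
```

After the job finishes, the `%x_%j.out` file in the submission directory holds anything the program printed.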
Example Workflow
From your client machine, upload your data and connect to the cluster.
Load any needed modules.
Write your submission script. You can find this example here.
Submit your job. Make sure to take note of your job number. You can check the status of your job with the overview command.
When your job is done, check your data and output file for errors or standard output.
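As a command transcript, the workflow might look like the following. The hostname, job ID, and filenames are illustrative; `overview` is the cluster’s own status tool mentioned above:

```shell
# On your client machine (hostname is an assumption):
scp -r ./myproject username@saidie.example.edu:~/
ssh username@saidie.example.edu

# On the cluster:
sbatch sample_submission.sh    # note the job number Slurm prints back
overview                       # or: squeue -u $USER   (watch job status)
cat demo-job_12345.out         # inspect output once the job finishes
```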