The Papers
- O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, "Steady-State Scheduling on Heterogeneous Clusters: Why and How?", 18th International Parallel and Distributed Processing Symposium (IPDPS), p. 171a, 2004.
- I. Al-Azzoni and D. G. Down, "Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems", IEEE Transactions on Parallel and Distributed Systems, 19:1671-1682, 2008.
Task Scheduling
Heterogeneous Computing (HC) Systems
- Task execution time depends on the machine type
- Special-purpose hardware may execute specific tasks much faster
- Objective: minimize makespan
- The underlying bin-packing problem is NP-hard
- Heuristics are used to find near-optimal solutions
Steady-State Scheduling
Motivation
- Vastly more tasks than machines
- Makespan is much larger than task execution time
- The degradation from neglecting the packing (discrete variables) vanishes asymptotically
- Task execution time estimates are inaccurate
- Benefit from Divisible Load Theory (DLT) efficiency in non-divisible problems
Steady-State Scheduling
Modeling
Focus only on the dominant, schedule-impacting parts of the problem
Find a model with the fewest degrees of freedom (DOF), i.e. model only what you need to model
Assign portion of task types to machine types
Neglect transient effects
Convert integer valued decision variables to continuous
Temporarily ignore fine grain allocation (e.g. time ordering)
Steady-State Scheduling
Approach
- Model the system in the steady-state
- Make just enough approximations to create a tractable problem
- In stark contrast to "pose a computationally intractable problem, then find suboptimal solutions"
- Find a representative low fidelity model to solve optimally
- Analyze the system behavior using the simpler model
- Determine scheduling policies
Paper #1
Steady-State Scheduling on Heterogeneous Clusters: Why and How?
Overview
- Platform model
- Master-slave tasking
- Pipelined scatter
- Constructing a schedule
- Extensions
- Limitations
Platform Model
- Known processor performance (1 task type)
- Known network topology
- $w_i$ -- time steps per computational unit (time to compute) for node $P_i$
- $c_{ij}$ -- time steps per data unit (time to send) from node $P_i$ to $P_j$
Platform Model
Full overlap, single port model
A node can simultaneously:
- Receive data
- Send data
- Perform a computation on a different data unit
All parameters are assumed to be rational
Master-Slave Tasking
Problem Description
- Master node, $P_m$, does work and sends work to slaves
- $P_m$ receives no work
- Slaves send work amongst themselves
- Slaves either process the work themselves or forward it
- $\alpha_i$ -- fraction of time $P_i$ spends computing
- $s_{ij}$ -- fraction of time $P_i$ sends to $P_j$
Master-Slave Tasking
Example System
Master-Slave Tasking
Linear Program
$$\underset{\alpha_i,\,s_{ij}\,\vert\, e_{ij} \in E}{\text{maximize}} \quad n_{task} = \sum_i \frac{\alpha_i}{w_i}$$
subject to:
$\forall i$ | $0 \leq \alpha_i \leq 1$ | computing time is a fraction |
$\forall i,j \,\vert\, e_{ij} \in E$ | $0 \leq s_{ij} \leq 1$ | communication time is a fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$ | total sending fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$ | total receiving fraction |
Master-Slave Tasking
Linear Program
$\forall j \,\vert\, e_{jm} \in E$ | $s_{jm} = 0$ | master receives nothing |
$\forall i \,\vert\, i \neq m$ | $\sum_{j \,\vert\, e_{ji} \in E} \frac{s_{ji}}{c_{ji}} = \frac{\alpha_i}{w_i} + \sum_{j \,\vert\, e_{ij} \in E} \frac{s_{ij}}{c_{ij}}$ | tasks received = tasks processed + tasks forwarded |
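As a concrete sketch, the master-slave LP can be solved numerically with `scipy.optimize.linprog` for an assumed toy star platform (master $P_0$ feeding slaves $P_1$ and $P_2$ directly; the $w_i$ and $c_{ij}$ values below are invented for illustration, not taken from the paper):

```python
from scipy.optimize import linprog

# Assumed toy platform: master P0 with direct links to slaves P1, P2.
w = [3.0, 1.0, 2.0]              # w_i: time steps per task on P_i
c = {(0, 1): 1.0, (0, 2): 1.0}   # c_ij: time steps per task sent on e_ij

# Variables x = [a0, a1, a2, s01, s02]; maximize sum_i a_i / w_i.
obj = [-1 / w[0], -1 / w[1], -1 / w[2], 0.0, 0.0]  # linprog minimizes
A_ub = [[0, 0, 0, 1, 1]]                            # master's total sending fraction <= 1
b_ub = [1.0]
# Conservation at each slave (no forwarding in a star): s_0i / c_0i = a_i / w_i
A_eq = [[0, 1 / w[1], 0, -1 / c[0, 1], 0],
        [0, 0, 1 / w[2], 0, -1 / c[0, 2]]]
b_eq = [0.0, 0.0]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 5, method="highs")
print(round(-res.fun, 4))  # 1.3333 tasks per time step
```

Here the master always computes ($\alpha_0 = 1$) and its single outgoing port caps how much work the slaves can receive, so the steady-state throughput is $\frac{1}{3} + 1 = \frac{4}{3}$.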
Pipelined Scatter
Problem Description
- Multi-hop uniform data distribution
- One source sends messages (possibly indirectly) to a set of targets $\mathcal{P}_\text{target}$
- Each message is delivered to each $P_k \in \mathcal{P}_\text{target}$
- Objective: maximize the number of messages delivered
- $TP$ -- throughput: the fractional number of messages distributed
- $send(i,j,k)$ -- fractional number of messages sent from $i$ to $j$ with destination $k$
Pipelined Scatter
Example System
Pipelined Scatter
Linear Program
$$\underset{TP,\,s_{ij},\,send(i,j,k)}{\text{maximize}} \quad TP$$
subject to:
$\forall i,j \,\vert\, e_{ij} \in E$ | $0 \leq s_{ij} \leq 1$ | communication time is a fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$ | total sending fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$ | total receiving fraction |
Pipelined Scatter
Linear Program
$\forall i,j \,\vert\, e_{ij} \in E$ | $s_{ij} = \sum_k send(i,j,k)\, c_{ij}$ | sending time is the sum of all per-destination sending times |
$\forall i,k \,\vert\, k \neq i$ | $\sum_{j \,\vert\, e_{ji} \in E} send(j,i,k) = \sum_{j \,\vert\, e_{ij} \in E} send(i,j,k)$ | received = sent if not the target |
$\forall P_k \in \mathcal{P}_\text{target}$ | $TP = \sum_{j \,\vert\, e_{jk} \in E} send(j,k,k)$ | $TP$ is the number of messages arriving at each target |
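A minimal numerical sketch of this LP, again with `scipy.optimize.linprog`, for an assumed chain $P_0 \rightarrow P_1 \rightarrow P_2$ with targets $\{P_1, P_2\}$ and invented link costs $c_{01} = 1$, $c_{12} = 2$:

```python
from scipy.optimize import linprog

# Variables x = [TP, s01, s12, send(0,1,1), send(0,1,2), send(1,2,2)]
obj = [-1, 0, 0, 0, 0, 0]        # maximize TP (linprog minimizes)
A_eq = [
    [0, 1, 0, -1, -1, 0],        # s01 = (send(0,1,1) + send(0,1,2)) * c01
    [0, 0, 1, 0, 0, -2],         # s12 = send(1,2,2) * c12
    [0, 0, 0, 0, 1, -1],         # at P1, messages bound for P2: received = forwarded
    [1, 0, 0, -1, 0, 0],         # TP = messages arriving at target P1
    [1, 0, 0, 0, 0, -1],         # TP = messages arriving at target P2
]
b_eq = [0.0] * 5
bounds = [(0, None)] + [(0, 1)] * 2 + [(0, None)] * 3
res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(round(-res.fun, 2))  # 0.5 messages per time step
```

The bottleneck is the single-port constraint at $P_0$ (and the slow $P_1 \rightarrow P_2$ link): each message occupies the outgoing link twice, once per target, so $TP = 0.5$.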
Determine Scheduling Period
- Solve the rational linear program (in polynomial time)
- Need to find integer values for the number of:
- tasks computed: $\frac{\alpha^*_i}{w_i} T$
- messages sent: $\frac{s^*_{ij}}{c_{ij}} T$
- messages sent to $P_k$: $send^*(i,j,k) T$
- communication steps: $s^*_{ij} T$
- Period $T$ is the LCM of the denominators of the above fractions (not necessarily polynomial in problem size)
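The period computation reduces to an LCM over the denominators of the rational LP outputs. A minimal sketch using Python's `fractions` and `math.lcm` (the rates below are invented, standing in for quantities like $\alpha^*_i / w_i$):

```python
from fractions import Fraction
from math import lcm

def scheduling_period(rates):
    """Smallest T such that every rate * T is an integer:
    the LCM of the denominators of the rational rates."""
    fracs = [Fraction(r).limit_denominator() for r in rates]
    return lcm(*(f.denominator for f in fracs))

# Hypothetical rational rates from an LP solution
rates = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 6)]
T = scheduling_period(rates)
print(T)  # 12 -- each rate * 12 is an integer count per period
```

Note this also makes the caveat on the slide concrete: the LCM (and hence the period) can grow much faster than the problem size.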
Constructing a Schedule
- Split each node into a sending node and a receiving node to form a bipartite graph
- Compute a weighted edge coloring (in polynomial time)
- Each color class contains communication links that may operate simultaneously
Example
Reference
- Found a related example from another paper
- C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert, "Scheduling strategies for master-slave tasking on heterogeneous processor platforms", IEEE Transactions on Parallel and Distributed Systems, 15:319-330, 2004.
- Master-slave tasking
Example
Problem and LP Solution
Example
Finding the Bipartite Graph
Example
Schedule
Extensions
- Proof of asymptotic optimality
- Multiple communication ports (NICs)
- Scheduling startup and tear-down efficiently
- Handling fixed scheduling periods
- Dynamic computing and communication environments can be handled efficiently
- Other paper: multiple masters
Limitations of the Approach
- Pipelined multicast is still NP-Hard
- Requires nodes to be able to send and receive data simultaneously
- Determining the topology for non-trivial systems is difficult
- Shared communications links are not taken into account
Critique
- Mostly a good paper written by an expert in the field
- Compelling argument for use of steady-state approaches
- Scattered organization
- Schedule reconstruction is split across 3 sections
- Pipelined multicast is split across 2 sections
- Limitations section also contains benefits
- Missing details that would have made it easier to understand
- Decision variables
- Bipartite graph matching example
Paper #2
Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems
Overview
- Problem description
- Baseline mapping heuristics
- Linear Programming-Based Affinity Scheduling (LPAS)
- Simulation-based comparison
Problem Description
- Given a HC system description:
- Task types (classes) with known mean execution times for each machine type
- Task arrival rates
- When a task arrives, assign it to a machine immediately
- No tasks are dropped
- Objectives:
- System stability
- Minimize mean task waiting time
- Minimize the required state information
Baseline Mapping Heuristics
| | Minimum Execution Time (MET) | K-Percent Best (KPB) | Minimum Completion Time (MCT) |
| Assign task to machine | which is fastest for that task type | from $S_i^k$ with the earliest completion time, where $S_i^k$ is the set of $k\%$ fastest machines for task type $i$ | which has the earliest completion time |
| Required state | none | between MET and MCT | full |
| Issues | Ignores waiting time at the machine | Selecting $k$ is difficult; $k$ is the same for each task type | No consideration of future task arrivals |
LPAS
Nomenclature
- Let $\alpha_i$ be the arrival rate for task type $i$
- Let $\lambda$ be the normalized system capacity
- For task type $i$ on machine type $j$, let $\mu_{ij} = \frac{1}{ETC_{ij}}$ be the execution rate in tasks per second (with $ETC_{ij}$ the expected time to compute) and $\delta_{ij}$ be the proportional allocation
LPAS
Allocation LP
$$\underset{\lambda,\,\delta_{ij}}{\text{maximize}} \quad \lambda$$
subject to:
$\forall i$ | $\sum_j \delta_{ij} \mu_{ij} \geq \lambda \alpha_i$ | task arrival rate constraint |
$\forall j$ | $\sum_i \delta_{ij} \leq 1$ | machine allocation constraint |
$\forall i,j$ | $\delta_{ij} \geq 0$ | proportional allocation is positive |
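A sketch of the allocation LP with `scipy.optimize.linprog`, checked against System A from the paper. Note that the LP is degenerate, so the solver may return a different (but equally optimal) $\delta^*$ than the one the paper reports:

```python
import numpy as np
from scipy.optimize import linprog

def allocation_lp(alpha, mu):
    """LPAS allocation LP: maximize lam subject to
    sum_j delta_ij * mu_ij >= lam * alpha_i   (capacity covers arrivals)
    sum_i delta_ij <= 1                        (each machine fully shared)
    delta_ij >= 0."""
    I, J = mu.shape
    n = 1 + I * J                    # variables: [lam, delta_11 .. delta_IJ]
    c = np.zeros(n); c[0] = -1.0     # linprog minimizes, so minimize -lam
    A, b = [], []
    for i in range(I):               # lam * alpha_i - sum_j delta_ij * mu_ij <= 0
        row = np.zeros(n); row[0] = alpha[i]
        for j in range(J):
            row[1 + i * J + j] = -mu[i, j]
        A.append(row); b.append(0.0)
    for j in range(J):               # sum_i delta_ij <= 1
        row = np.zeros(n)
        for i in range(I):
            row[1 + i * J + j] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * n, method="highs")
    return res.x[0], res.x[1:].reshape(I, J)

# System A from the paper
lam, delta = allocation_lp([2.45, 2.45], np.array([[9.0, 5.0], [2.0, 1.0]]))
print(round(lam, 4))  # 1.0204
```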
LPAS
Allocation LP Solution
- Always has a solution
- $\lambda^* \leq 1$ ⇒ unstable system (overloaded/oversubscribed)
- $\lambda^* > 1$ ⇒ stabilizable system (queue sizes converge to a steady-state distribution as $t \rightarrow \infty$)
- $\lambda^* \sum_i \alpha_i$ is the maximum number of tasks per second the system can execute
LPAS
Algorithm
- Pick the solution which has the smallest number of zeros (least restrictive)
- Let $S_i = \{j \,:\, \delta^*_{ij} \neq 0\}$ be the set of assignable machines for task type $i$
- Assign a task to a machine in $S_i$ with the earliest completion time
- Values of $\delta^*_{ij}$ are not actually used
- Allocation LP is simply used to intelligently reduce the set of machines we use in MCT
- Re-solve the allocation LP only when the system parameters ($\alpha_i$ and $\mu$) change
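The dispatch rule above is just MCT restricted to $S_i$. A minimal sketch, assuming a simple finish-time bookkeeping per machine (the `free_at`/`etc` names and the 0-indexed encoding of System A are mine, for illustration):

```python
def lpas_assign(i, S, free_at, etc, now=0.0):
    """LPAS dispatch: among the LP-allowed machines S[i], send the
    arriving type-i task to the one with the earliest completion time,
    then update that machine's finish time."""
    best = min(S[i], key=lambda j: max(free_at[j], now) + etc[i][j])
    free_at[best] = max(free_at[best], now) + etc[i][best]
    return best

# System A, 0-indexed: delta* = [[0, 0.5], [1, 0.5]] gives S_0 = {1}, S_1 = {0, 1};
# etc[i][j] = 1 / mu[i][j]
S = {0: [1], 1: [0, 1]}
etc = [[1 / 9, 1 / 5], [1 / 2, 1.0]]
free_at = [0.0, 0.0]
print(lpas_assign(0, S, free_at, etc))  # 1 -- only machine 1 is allowed for type 0
print(lpas_assign(1, S, free_at, etc))  # 0 -- machine 0 is idle and faster for type 1
```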
Example 1
System A
Given a HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}9 & 5 \\ 2 & 1 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 0.5 \\ 1 & 0.5 \end{pmatrix}$
with task rates of $\delta^* \mu = \begin{pmatrix}0 & 2.5 \\ 2 & 0.5 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
- Machine 1 executes task type 1 best but is only assigned task type 2
- KPB(1) ≡ MET and both result in unstable systems (task type 2 → machine 1)
- KPB(2) ≡ MCT and both are stable but perform worse than LPAS
Example 2
Given a HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}2.5 & 2.5 \\ 2.5 & 2.5 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 1 \\ 1 & 0 \end{pmatrix}$ or $\begin{pmatrix}1 & 0 \\ 0 & 1 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
- Perfectly homogeneous system
- Infinite number of optimal solutions
- LPAS arbitrarily restricts itself
Simulation
- Simple queuing model: inter-arrival times and execution times are exponentially distributed
- MET often results in an unstable system so it is excluded
- First-Come-First-Serve (FCFS) local schedulers
- Performance metric: long-run average of number of tasks in the system (queue + executing)
- 30 Monte Carlo runs, reporting 95% confidence intervals
- A finite set of carefully crafted systems is compared
Results
System A
- Previous 2x2 HC system
- Smaller is better
- Unstable heuristics are omitted
System B
$\alpha = \begin{pmatrix} 5 & 8 \end{pmatrix}$ and $\mu = \begin{pmatrix}8 & 3 \\ 4 & 10 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0.8333 & 0 \\ 0.1667 & 1 \end{pmatrix}$
- Allocating all of machine 2 toward task type 2 helps LPAS compared to MCT
- Not restricting the allocation on machine 1 helps when compared to KPB(1)
System C1
- 3x30 system with high task to machine affinity
- Four groups of identical machines so the LP has many equal cost optimal solutions
- Solve the allocation LP over groups of equivalent machines for efficiency (and for solution uniqueness)
- KPB is unstable for $k < 14$
System D
High task and high machine heterogeneity
System E
Low task and high machine heterogeneity
MCT outperforms LPAS because there is effectively one task type, and LPAS restricts which machines it can run on when there is virtually no benefit to doing so
System F1
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}4 & 8 & 10 & 10\end{pmatrix}$
System F2
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}7 & 7 & 7 & 7\end{pmatrix}$
System G
Super low task and machine heterogeneity
MCT performs slightly better than LPAS, possibly because the smaller machine set is more restrictive without a large benefit
System H
Super high task and machine heterogeneity, high task to machine affinity
System I1
Same as system C1 but with hyper-exponential execution time distribution (twice the variance)
Note that the LPAS upper bound is a typo in the paper
System I2
Same as system C1 but with constant execution time
LPAS-2/k Heuristic
- Tries to reduce the number of machines queried for completion time when a task of type $i$ arrives
- Randomly choose at most 2 machines from $S_i$ based on $\delta^*_{ij}$
- $j_1$ selected with PMF $p_{j_1} = \frac{\delta^*_{i{j_1}} \mu_{i{j_1}}}{\lambda^* \alpha_i}$
- $j_2$ selected with PMF $p_{j_2} = \frac{\delta^*_{i{j_2}} \mu_{i{j_2}}}{\lambda^* \alpha_i - \delta^*_{i{j_1}} \mu_{i{j_1}}}$
- Now we need the actual values of $\delta^*_{ij}$, not just the indices of the nonzero elements
- Select the machine from $\{j_1,j_2\}$ with the earliest completion time
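The two-draw sampling can be sketched in pure Python; weights are $\delta^*_{ij}\mu_{ij}$ (the rate machine $j$ contributes to type $i$), and the second draw renormalizes over the remaining machines. The function name and 0-indexed machine labels are mine:

```python
import random

def lpas_2k_candidates(delta_i, mu_i, rng=random):
    """Pick at most two candidate machines for a type-i task,
    sampling without replacement with weights delta*_ij * mu_ij."""
    weights = {j: d * m for j, (d, m) in enumerate(zip(delta_i, mu_i)) if d > 0}
    total = sum(weights.values())   # equals lambda* * alpha_i when the LP constraint is tight
    j1 = rng.choices(list(weights), [w / total for w in weights.values()])[0]
    rest = {j: w for j, w in weights.items() if j != j1}
    if not rest:                    # S_i had a single machine
        return [j1]
    rtot = total - weights[j1]      # renormalize for the second draw
    j2 = rng.choices(list(rest), [w / rtot for w in rest.values()])[0]
    return [j1, j2]

# System A, task type 0 (0-indexed): delta*_0 = (0, 0.5), mu_0 = (9, 5)
print(lpas_2k_candidates([0.0, 0.5], [9.0, 5.0]))  # [1] -- only machine 1 is eligible
```

The scheduler then queries only these (at most two) machines and selects the one with the earlier completion time.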
LP-Static Heuristic
- Motivation: requires no state information (similar to MET but hopefully performs better)
- Machine $j$ is selected with PMF $p_{j} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
- Proven to always stabilize a stabilizable system
- Performs poorly in practice
System C2
- Large highly heterogeneous system (similar to system C1)
- Used to compare LP-Static and LPAS-2/k, which use less state information than LPAS and MCT
Guided-LPAS Heuristics
- No stabilizable system has been found that LPAS cannot stabilize
- However, this has not been proven, so they introduce a new heuristic
- Motivation: provable stability with similar performance to LPAS
- Achieves a steady-state distribution of $\pi_{ij} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
- Restricts the set of assignable machines to force this distribution
- Also can define Guided-LPAS-2/k
Critique
- The ordering of the sections is not as coherent as it could be
- Simulation results are presented then more heuristics are developed
- Should have compared the provably stabilizing heuristics to all systems alongside LPAS
- Could have made a quantitative argument for less state in LPAS and LPAS-2/k
- Selection of ETC matrices is questionable
- Add random ETC matrices to compare heuristics
- Add real workload traces
- Add random and round-robin scheduling for completeness
Future Directions (My Thoughts)
- MCT outperforms LPAS for homogeneous systems
- LPAS does not model all the noise sources in the system
- Use a stochastic optimization approach
- Perturb $\alpha_i$ and $\mu_{ij}$ based on the distributions to see the effect on $\delta_{ij}$
- With low heterogeneity there are many solutions that have a cost near the optimal
- With noise in the parameters to the allocation LP any of these solutions may be optimal
- Include some near-optimal solutions in $S_i$
- With this approach LPAS will be equivalent to MCT for homogeneous systems
Conclusion
- Steady-state and reduced fidelity models allow some useful insight into the system behavior
- Noise in these systems seems to be simulated but not modelled in the solutions
- My interests lie in better characterizing and developing more practical scheduling algorithms for these HC systems