The Papers

Steady-State Scheduling on Heterogeneous Clusters: Why and How?
O. Beaumont, A. Legrand, L. Marchal, and Y. Robert
Parallel and Distributed Processing Symposium, 18th International
8, 171a (2004)

Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems
I. Al-Azzoni and D. G. Down
IEEE Transactions on Parallel and Distributed Systems
19, 1671-1682 (2008)
Task Scheduling
Heterogeneous Computing (HC) Systems
 Task execution time depends on the machine type
 Special-purpose hardware may execute a specific task much faster
 Objective: minimize makespan
 The underlying bin-packing problem is NP-hard
 Heuristics are used to find near-optimal solutions
Steady-State Scheduling
Motivation
 Vastly more tasks than machines
 Makespan is much larger than the task execution time
 The percent degradation due to neglecting the packing (discrete variables) vanishes asymptotically
 Task execution time estimates are inaccurate
 Benefit from Divisible Load Theory (DLT) efficiency in non-divisible problems
Steady-State Scheduling
Modeling
Focus only on the dominant, schedule-impacting parts of the problem
Find a model with the fewest degrees of freedom (i.e., model only what you need to)
Assign portions of task types to machine types
Neglect transient effects
Convert integer-valued decision variables to continuous ones
Temporarily ignore fine-grained allocation (e.g. time ordering)
Steady-State Scheduling
Approach
 Model the system in the steady state
 Make just enough approximations to create a tractable problem
 In stark contrast to "pose a computationally intractable problem, then find suboptimal solutions"
 Find a representative low fidelity model to solve optimally
 Analyze the system behavior using the simpler model
 Determine scheduling policies
Paper #1
Steady-State Scheduling on Heterogeneous Clusters: Why and How?
Overview
 Platform model
 Master-slave tasking
 Pipelined scatter
 Constructing a schedule
 Extensions
 Limitations
Platform Model
 Known processor performance (1 task type)
 Known network topology
 $w_i$  time steps per computational unit (time to compute) for node $P_i$
 $c_{ij}$  time steps per data unit (time to send) from node $P_i$ to $P_j$
Platform Model
Full-overlap, single-port model
A node can simultaneously:
 Receive data
 Send data
 Perform a computation on a different data unit
All parameters are assumed to be rational
Master-Slave Tasking
Problem Description
 Master node, $P_m$, does work and sends work to slaves
 $P_m$ receives no work
 Slaves send work amongst themselves
 Slaves either process the work themselves or forward it
 $\alpha_i$  fraction of time $P_i$ spends computing
 $s_{ij}$  fraction of time $P_i$ sends to $P_j$
Master-Slave Tasking
Example System
Master-Slave Tasking
Linear Program
$$\underset{\alpha_i,\,s_{ij}\,\vert\, e_{ij} \in E}{\text{maximize}} \quad n_{task} = \sum_i \frac{\alpha_i}{w_i}$$
subject to:
$\forall i$  $0 \leq \alpha_i \leq 1$  computing time is a fraction 
$\forall i,j \,\vert\, e_{ij} \in E$  $0 \leq s_{ij} \leq 1$  communication time is a fraction 
$\forall i$  $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$  total sending fraction 
$\forall i$  $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$  total receiving fraction 
Master-Slave Tasking
Linear Program
$\forall j \,\vert\, e_{jm} \in E$  $s_{jm} = 0$  master receives nothing 
$\forall i \,\vert\, i \neq m$  $\sum_{j \,\vert\, e_{ji} \in E} \frac{s_{ji}}{c_{ji}}$ $= \frac{\alpha_i}{w_i} + \sum_{j \,\vert\, e_{ij} \in E} \frac{s_{ij}}{c_{ij}}$ 
tasks received = processed + forwarded 
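As an illustration (not from the paper), the master-slave LP can be handed to an off-the-shelf solver. The sketch below uses a hypothetical three-node star with made-up $w_i$ and $c_{ij}$ values and `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-node star: master P0 with slaves P1, P2 (edges 0->1, 0->2).
# w[i]: time steps per task on P_i; c0i: time steps per task sent on edge 0->i.
w = np.array([3.0, 2.0, 4.0])
c01, c02 = 3.0, 4.0

# Decision variables x = [alpha0, alpha1, alpha2, s01, s02], all in [0, 1].
# Maximize sum(alpha_i / w_i); linprog minimizes, so negate the objective.
obj = [-1 / w[0], -1 / w[1], -1 / w[2], 0.0, 0.0]

# Single-port sending at the master: s01 + s02 <= 1.
A_ub = [[0, 0, 0, 1, 1]]
b_ub = [1.0]

# Conservation at each slave (no forwarding in a star):
# tasks received = tasks processed, i.e. s0i/c0i - alpha_i/w_i = 0.
A_eq = [
    [0, -1 / w[1], 0, 1 / c01, 0],
    [0, 0, -1 / w[2], 0, 1 / c02],
]
b_eq = [0.0, 0.0]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 5)
n_task = -res.fun  # steady-state tasks completed per time step
print(f"throughput = {n_task:.4f}")
```

For these numbers the master spends its entire sending budget on the slave with the better objective gain per unit of communication time; the LP yields the steady-state throughput without constructing an explicit schedule.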
Pipelined Scatter
Problem Description
 Multihop uniform data distribution
 One source sends messages (possibly indirectly) to a set of targets $\mathcal{P}_\text{target}$
 Each message is delivered to each $P_k \in \mathcal{P}_\text{target}$
 Objective: maximize the number of messages delivered
 $TP$  throughput: the fractional number of messages distributed
 $send(i,j,k)$  fractional number of messages sent from $i$ to $j$ with destination $k$
Pipelined Scatter
Example System
Pipelined Scatter
Linear Program
$$\underset{TP,\,s_{ij},\,send(i,j,k)}{\text{maximize}} \quad TP$$
subject to:
$\forall i,j \,\vert\, e_{ij} \in E$  $0 \leq s_{ij} \leq 1$  communication time is a fraction 
$\forall i$  $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$  total sending fraction 
$\forall i$  $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$  total receiving fraction 
Pipelined Scatter
Linear Program
$\forall i,j \,\vert\, e_{ij} \in E$  $s_{ij} = \sum_k send(i,j,k) c_{ij}$ 
sending time on link $(i,j)$ is the sum over all destinations 
$\forall i,k \,\vert\, k \neq i,\; i \neq \text{source}$  $\sum_{j \,\vert\, e_{ji} \in E} send(j,i,k)$ $= \sum_{j \,\vert\, e_{ij} \in E} send(i,j,k)$ 
received = sent when $P_i$ is not the destination 
$\forall P_k \in \mathcal{P}_\text{target}$  $TP = \sum_{j \,\vert\, e_{jk} \in E} send(j,k,k)$ 
$TP$ is the messages arriving at the target 
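A minimal sketch of this LP, assuming a hypothetical three-node chain (source $P_0 \to P_1 \to P_2$, targets $\{P_1, P_2\}$) with unit communication costs; the variable layout is my own:

```python
from scipy.optimize import linprog

# Chain P0 -> P1 -> P2 with c01 = c12 = 1.
# Variables x = [TP, send(0,1,1), send(0,1,2), send(1,2,2)].
obj = [-1.0, 0, 0, 0]                 # maximize TP

A_eq = [
    [1, -1, 0, 0],                    # TP = send(0,1,1): arrivals at target P1
    [1, 0, 0, -1],                    # TP = send(1,2,2): arrivals at target P2
    [0, 0, 1, -1],                    # P1 forwards what it receives for P2
]
b_eq = [0, 0, 0]

A_ub = [
    [0, 1, 1, 0],                     # s01 = c01*(send(0,1,1)+send(0,1,2)) <= 1
    [0, 0, 0, 1],                     # s12 = c12*send(1,2,2) <= 1
]
b_ub = [1, 1]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4)
print(f"TP* = {-res.fun:.3f}")
```

The first link carries every message twice (once per target), so it saturates at $TP^* = 0.5$ messages per time step.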
Determine Scheduling Period
 Solve the rational linear program (in polynomial time)
 Need to find integer values for the number of:
 tasks computed: $\frac{\alpha^*_i}{w_i} T$
 messages sent: $\frac{s^*_{ij}}{c_{ij}} T$
 messages sent to $P_k$: $send^*(i,j,k) T$
 communication steps: $s^*_{ij} T$
 Period $T$ is the LCM of the denominators of the above fractions (not necessarily polynomial in problem size)
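Assuming the LP solution is available as exact rationals (the fractions below are made-up placeholders), the period computation is a one-liner:

```python
from fractions import Fraction
from math import lcm

# Hypothetical rational per-period quantities from the LP
# (tasks computed, messages sent, ...).
quantities = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]

# T is the LCM of the denominators, so that every quantity * T
# becomes an integer count of tasks/messages per period.
T = lcm(*(q.denominator for q in quantities))
counts = [int(q * T) for q in quantities]
print(T, counts)  # -> 6 [2, 3, 4]
```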
Constructing a Schedule
 Split each node into a sending node and a receiving node to form a bipartite graph
 Compute a weighted edge coloring (in polynomial time)
 Each color class contains communication links that may operate simultaneously
Example
Reference
 Found a related example from another paper

Scheduling strategies for master-slave tasking on heterogeneous processor platforms
C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert
IEEE Transactions on Parallel and Distributed Systems
15, 319-330 (2004)
 Master-slave tasking
Example
Problem and LP Solution
Example
Finding the Bipartite Graph
Example
Schedule
Extensions
 Proof of asymptotic optimality
 Multiple communication ports (NICs)
 Scheduling startup and teardown efficiently
 Handling fixed scheduling periods
 Dynamic computing and communication environments can be handled efficiently
 Other paper: multiple masters
Limitations of the Approach
 Pipelined multicast is still NP-hard
 Requires nodes to be able to both send and receive data
 Determining the topology for nontrivial systems is difficult
 Shared communications links are not taken into account
Critique
 Mostly a good paper written by an expert in the field
 Compelling argument for use of steadystate approaches
 Scattered organization
 Schedule construction is spread across 3 sections
 Pipelined multicast is split across 2 sections
 The limitations section also contains benefits
 Missing details that would have made it easier to understand
 Decision variables
 Bipartite graph matching example
Paper #2
Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems
Overview
 Problem description
 Baseline mapping heuristics
 Linear Programming-Based Affinity Scheduling (LPAS)
 Simulationbased comparison
Problem Description
 Given an HC system description:
 Task types (classes) with known mean execution times for each machine type
 Task arrival rates
 When a task arrives, assign it to a machine immediately
 No tasks are dropped
 Objectives:
 System stability
 Minimize mean task waiting time
 Minimize the required state information
Baseline Mapping Heuristics

|  | Minimum Execution Time (MET) | K-Percent Best (KPB) | Minimum Completion Time (MCT) |
| --- | --- | --- | --- |
| Assign task to machine | which is fastest for that task type | from $S_i^k$ with the earliest completion time, where $S_i^k$ is the set of $k\%$ fastest machines for task type $i$ | which has the earliest completion time |
| Required state | none | between MET and MCT | full |
| Issues | Ignores waiting time at the machine | Selecting $k$ is difficult; $k$ is the same for each task type | No consideration of future task arrivals |
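The three baselines can be sketched compactly. The ETC matrix and machine ready times below are made-up; `ready[j]` stands in for the "full state" (machine availability) that MCT must track:

```python
import numpy as np

# Hypothetical setup: ETC[i][j] = expected execution time of task type i
# on machine j; ready[j] = time at which machine j next becomes free.
ETC = np.array([[1.0, 4.0],
                [3.0, 2.0]])
ready = np.array([0.0, 5.0])

def met(i):
    """Minimum Execution Time: fastest machine for this task type."""
    return int(np.argmin(ETC[i]))

def mct(i):
    """Minimum Completion Time: earliest ready time + execution time."""
    return int(np.argmin(ready + ETC[i]))

def kpb(i, k):
    """K-Percent Best: MCT restricted to the k% fastest machines for type i."""
    n = max(1, round(k / 100 * ETC.shape[1]))
    best = np.argsort(ETC[i])[:n]          # the k% fastest machines
    return int(best[np.argmin(ready[best] + ETC[i, best])])

print(met(0), mct(0), kpb(0, 100))
```

With all machines admitted (k = 100%), KPB degenerates to MCT; with one machine admitted it degenerates to MET, matching the "between MET and MCT" state requirement.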
LPAS
Nomenclature
 Let $\alpha_i$ be the arrival rate for task type $i$
 Let $\lambda$ be the normalized system capacity
 For task type $i$ on machine type $j$, let $\mu_{ij} = \frac{1}{ETC_{ij}}$ be tasks per second and $\delta_{ij}$ be the proportional allocation
LPAS
Allocation LP
$$\underset{\lambda,\,\delta_{ij}}{\text{maximize}} \quad \lambda$$
subject to:
$\forall i$  $\sum_j \delta_{ij} \mu_{ij} \geq \lambda \alpha_i$  task arrival rate constraint 
$\forall j$  $\sum_i \delta_{ij} \leq 1$  machine allocation constraint 
$\forall i,j$  $\delta_{ij} \geq 0$  proportional allocation is nonnegative 
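The allocation LP is small enough to solve directly. The sketch below feeds System A (from the first example later in these notes) into `scipy.optimize.linprog`; the flat variable layout is my own:

```python
import numpy as np
from scipy.optimize import linprog

# System A: 2 task types, 2 machine types.
alpha = np.array([2.45, 2.45])          # arrival rates per task type
mu = np.array([[9.0, 5.0],              # mu[i][j] = 1 / ETC[i][j]
               [2.0, 1.0]])

# Variables x = [lambda, d11, d12, d21, d22]; maximize lambda (negate for linprog).
obj = [-1.0, 0, 0, 0, 0]

A_ub = [
    # lambda*alpha_i - sum_j d_ij*mu_ij <= 0   (task arrival rate constraints)
    [alpha[0], -mu[0, 0], -mu[0, 1], 0, 0],
    [alpha[1], 0, 0, -mu[1, 0], -mu[1, 1]],
    # sum_i d_ij <= 1                          (machine allocation constraints)
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
b_ub = [0, 0, 1, 1]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] + [(0, 1)] * 4)
lam = -res.fun
delta = res.x[1:].reshape(2, 2)
# Assignable machine sets S_i = {j : delta*_ij != 0}
S = {i: [j for j in range(2) if delta[i, j] > 1e-9] for i in range(2)}
print(f"lambda* = {lam:.4f}")
```

The recovered $\delta^*$ matches the example: machine 1 is reserved entirely for task type 2 even though it is fastest at task type 1.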
LPAS
Allocation LP Solution
 Always has a solution
 $\lambda^* \leq 1$ ⇒ unstable system (overloaded/oversubscribed)
 $\lambda^* > 1$ ⇒ stabilizable system (queue sizes converge to a steady-state distribution as $t \rightarrow \infty$)
 $\lambda^* \sum_i \alpha_i$ is the maximum number of tasks per second the system can execute
LPAS
Algorithm
 Pick the solution which has the smallest number of zeros (least restrictive)
 Let $S_i = \{j \,:\, \delta^*_{ij} \neq 0\}$ be the set of assignable machines for task type $i$
 Assign a task to the machine in $S_i$ with the earliest completion time
 Values of $\delta^*_{ij}$ are not actually used
 Allocation LP is simply used to intelligently reduce the set of machines we use in MCT
 Re-solve the allocation LP only when the system parameters ($\alpha_i$ and $\mu_{ij}$) change
Example 1
System A
Given an HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}9 & 5 \\ 2 & 1 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 0.5 \\ 1 & 0.5 \end{pmatrix}$
with task rates of $\delta^* \mu = \begin{pmatrix}0 & 2.5 \\ 2 & 0.5 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
 Machine 1 executes task type 1 best but is only assigned task type 2
 KPB(1) ≡ MET and both result in unstable systems (both task types go to machine 1, overloading it)
 KPB(2) ≡ MCT and both are stable but perform worse than LPAS
Example 2
Given an HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}2.5 & 2.5 \\ 2.5 & 2.5 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 1 \\ 1 & 0 \end{pmatrix}$ or $\begin{pmatrix}1 & 0 \\ 0 & 1 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
 Perfectly homogeneous system
 Infinite number of optimal solutions
 LPAS arbitrarily restricts itself
Simulation
 Simple queuing model: interarrival times and execution times are exponentially distributed
 MET often results in an unstable system so it is excluded
 FirstComeFirstServe (FCFS) local schedulers
 Performance metric: longrun average of number of tasks in the system (queue + executing)
 30 Monte Carlo runs, reporting 95% confidence intervals
 A finite set of carefully crafted systems is compared
Results
System A
 Previous 2x2 HC system
 Smaller is better
 Unstable heuristics are omitted
System B
$\alpha = \begin{pmatrix} 5 & 8 \end{pmatrix}$ and $\mu = \begin{pmatrix}8 & 3 \\ 4 & 10 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0.8333 & 0 \\ 0.1667 & 1 \end{pmatrix}$
 Allocating all of machine 2 toward task type 2 helps LPAS compared to MCT
 Not restricting the allocation on machine 1 helps when compared to KPB(1)
System C1
 3x30 system with high task-to-machine affinity
 Four groups of identical machines so the LP has many equal cost optimal solutions
 Solve the allocation LP with equivalent machines for efficiency (and for solution uniqueness)
 KPB is unstable for $k < 14$
System D
High task and high machine heterogeneity
System E
Low task and high machine heterogeneity
MCT outperforms LPAS because there is effectively one task type and LPAS restricts which machines it can run on (with virtually no benefit to doing so)
System F1
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}4 & 8 & 10 & 10\end{pmatrix}$
System F2
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}7 & 7 & 7 & 7\end{pmatrix}$
System G
Super low task and machine heterogeneity
MCT performs slightly better than LPAS, possibly because the smaller machine set is more restrictive without a large benefit
System H
Super high task and machine heterogeneity, high task-to-machine affinity
System I1
Same as system C1 but with hyperexponential execution time distribution (twice the variance)
Note that the LPAS upper bound is a typo in the paper
System I2
Same as system C1 but with constant execution time
LPAS2/k Heuristic
 Tries to reduce the number of machines queried for completion times when a task of type $i$ arrives
 Randomly choose at most 2 machines from $S_i$ based on $\delta^*_{ij}$
 $j_1$ selected with PMF $p_{j_1} = \frac{\delta^*_{i{j_1}} \mu_{i{j_1}}}{\lambda^* \alpha_i}$
 $j_2$ selected with PMF $p_{j_2} = \frac{\delta^*_{i j_2} \mu_{i j_2}}{\lambda^* \alpha_i - \delta^*_{i j_1} \mu_{i j_1}}$
 Now we need the actual values of $\delta^*_{ij}$, not just the indices of the nonzero elements
 Select the machine from $\{j_1,j_2\}$ with the earliest completion time
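The two weighted draws can be sketched with Python's `random.choices`; the rates are made-up, and only the renormalization structure follows the PMFs above:

```python
import random

# Hypothetical nonzero terms for one task type i:
# rates[j] = delta*_ij * mu_ij for j in S_i; their sum equals
# lambda* * alpha_i by the arrival-rate constraint.
rates = {0: 2.0, 2: 0.5}

def pick_two(rates, rng=random):
    """Draw up to two distinct machines, weighted by delta*_ij * mu_ij.

    The second draw renormalizes over the remaining machines, which
    matches p_{j2} = rate_{j2} / (lambda* alpha_i - rate_{j1}).
    """
    machines = list(rates)
    if len(machines) < 2:
        return machines
    (j1,) = rng.choices(machines, weights=[rates[j] for j in machines])
    rest = [j for j in machines if j != j1]
    (j2,) = rng.choices(rest, weights=[rates[j] for j in rest])
    return [j1, j2]

# The task then goes to whichever candidate has the earliest completion time.
candidates = pick_two(rates)
print(candidates)
```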
LPStatic Heuristic
 Motivation: requires no state information (similar to MET but hopefully performs better)
 Machine $j$ is selected with PMF $p_{j} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
 Proven to always stabilize a stabilizable system
 Performs poorly in practice
System C2
 Large highly heterogeneous system (similar to system C1)
 Used to compare LPStatic and LPAS2/k, which use less state information than LPAS and MCT
GuidedLPAS Heuristics
 No stabilizable system has been found that LPAS cannot stabilize
 However, this has not been proven, so they introduce a new heuristic
 Motivation: provable stability with similar performance to LPAS
 Achieves a steady-state distribution of $\pi_{ij} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
 Restricts the set of assignable machines to force this distribution
 Also can define GuidedLPAS2/k
Critique
 The ordering of the sections could be more coherent
 Simulation results are presented then more heuristics are developed
 Should have compared the provably stabilizing heuristics on all systems alongside LPAS
 Could have made a quantitative argument for less state in LPAS and LPAS2/k
 Selection of ETC matrices is questionable
 Add random ETC matrices to compare heuristics
 Add real workload traces
 Add random and round-robin scheduling for completeness
Future Directions (My Thoughts)
 MCT outperforms LPAS for homogeneous systems
 LPAS does not model all the noise sources in the system
 Use a stochastic optimization approach
 Perturb $\alpha_i$ and $\mu_{ij}$ based on the distributions to see the effect on $\delta_{ij}$
 With low heterogeneity there are many solutions that have a cost near the optimal
 With noise in the parameters to the allocation LP any of these solutions may be optimal
 Include some nearoptimal solutions in $S_i$
 With this approach LPAS will be equivalent to MCT for homogeneous systems
Conclusion
 Steadystate and reduced fidelity models allow some useful insight into the system behavior
 Noise in these systems seems to be simulated but not modelled in the solutions
 My interests lie in better characterizing and developing more practical scheduling algorithms for these HC systems