The Papers
- O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, "Steady-State Scheduling on Heterogeneous Clusters: Why and How?", 18th International Parallel and Distributed Processing Symposium (IPDPS), p. 171a, 2004.
- I. Al-Azzoni and D. G. Down, "Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems", IEEE Transactions on Parallel and Distributed Systems, 19:1671-1682, 2008.
Task Scheduling
Heterogeneous Computing (HC) Systems
- Task execution time depends on the machine type
- Special-purpose hardware may execute specific tasks much faster
- Objective: minimize makespan
- The underlying bin-packing problem is NP-hard
- Heuristics are used to find near-optimal solutions
Steady-State Scheduling
Motivation
- Vastly more tasks than machines
- Makespan is much larger than task execution time
- The degradation from neglecting the packing (discrete variables) vanishes asymptotically
- Task execution time estimates are inaccurate
- Benefit from Divisible Load Theory (DLT) efficiency in non-divisible problems
Steady-State Scheduling
Modeling
Focus only on the dominant, schedule-impacting parts of the problem
Find a model with the fewest degrees of freedom (DOF), i.e. model only what you need to model
Assign portion of task types to machine types
Neglect transient effects
Convert integer valued decision variables to continuous
Temporarily ignore fine grain allocation (e.g. time ordering)
Steady-State Scheduling
Approach
- Model the system in the steady-state
- Make just enough approximations to create a tractable problem
- In stark contrast to "pose a computationally intractable problem, then find suboptimal solutions"
- Find a representative low fidelity model to solve optimally
- Analyze the system behavior using the simpler model
- Determine scheduling policies
Paper #1
Steady-State Scheduling on Heterogeneous Clusters: Why and How?
Overview
- Platform model
- Master-slave tasking
- Pipelined scatter
- Constructing a schedule
- Extensions
- Limitations
Platform Model
- Known processor performance (1 task type)
- Known network topology
- $w_i$ -- time steps per computational unit (time to compute) for node $P_i$
- $c_{ij}$ -- time steps per data unit (time to send) from node $P_i$ to $P_j$
Platform Model
Full overlap, single port model
A node can simultaneously:
- Receive data
- Send data
- Perform a computation on a different data unit
All parameters are assumed to be rational
Master-Slave Tasking
Problem Description
- Master node, $P_m$, does work and sends work to slaves
- $P_m$ receives no work
- Slaves send work amongst themselves
- Slaves either process the work themselves or forward it
- $\alpha_i$ -- fraction of time $P_i$ spends computing
- $s_{ij}$ -- fraction of time $P_i$ sends to $P_j$
Master-Slave Tasking
Example System
Master-Slave Tasking
Linear Program
$$\underset{\alpha_i,\,s_{ij}\,\vert\, e_{ij} \in E}{\text{maximize}} \quad n_{task} = \sum_i \frac{\alpha_i}{w_i}$$
subject to:
$\forall i$ | $0 \leq \alpha_i \leq 1$ | computing time is a fraction |
$\forall i,j \,\vert\, e_{ij} \in E$ | $0 \leq s_{ij} \leq 1$ | communication time is a fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$ | total sending fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$ | total receiving fraction |
Master-Slave Tasking
Linear Program
$\forall j \,\vert\, e_{jm} \in E$ | $s_{jm} = 0$ | master receives nothing |
$\forall i \,\vert\, i \neq m$ | $\sum_{j \,\vert\, e_{ji} \in E} \frac{s_{ji}}{c_{ji}} = \frac{\alpha_i}{w_i} + \sum_{j \,\vert\, e_{ij} \in E} \frac{s_{ij}}{c_{ij}}$ | tasks received = tasks processed + tasks forwarded |
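As a concrete sketch, the master-slave LP can be solved numerically with `scipy.optimize.linprog` for an assumed toy star platform (master $P_0$ feeding slaves $P_1$ and $P_2$ directly; the $w_i$ and $c_{ij}$ values below are invented for illustration, not taken from the paper):

```python
from scipy.optimize import linprog

# Assumed toy platform: master P0 with direct links to slaves P1, P2.
w = [3.0, 1.0, 2.0]              # w_i: time steps per task on P_i
c = {(0, 1): 1.0, (0, 2): 1.0}   # c_ij: time steps per task sent on e_ij

# Variables x = [a0, a1, a2, s01, s02]; maximize sum_i a_i / w_i.
obj = [-1 / w[0], -1 / w[1], -1 / w[2], 0.0, 0.0]  # linprog minimizes
A_ub = [[0, 0, 0, 1, 1]]                            # master's total sending fraction <= 1
b_ub = [1.0]
# Conservation at each slave (no forwarding in a star): s_0i / c_0i = a_i / w_i
A_eq = [[0, 1 / w[1], 0, -1 / c[0, 1], 0],
        [0, 0, 1 / w[2], 0, -1 / c[0, 2]]]
b_eq = [0.0, 0.0]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 5, method="highs")
print(round(-res.fun, 4))  # 1.3333 tasks per time step
```

Here the master always computes ($\alpha_0 = 1$) and its single outgoing port caps how much work the slaves can receive, so the steady-state throughput is $\frac{1}{3} + 1 = \frac{4}{3}$.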
Pipelined Scatter
Problem Description
- Multi-hop uniform data distribution
- One source sends messages (possibly indirectly) to a set of targets $\mathcal{P}_\text{target}$
- Each message is delivered to each $P_k \in \mathcal{P}_\text{target}$
- Objective: maximize the number of messages delivered
- $TP$ -- throughput: the fractional number of messages distributed
- $send(i,j,k)$ -- fractional number of messages sent from $i$ to $j$ with destination $k$
Pipelined Scatter
Example System
Pipelined Scatter
Linear Program
$$\underset{TP,\,s_{ij},\,send(i,j,k)}{\text{maximize}} \quad TP$$
subject to:
$\forall i,j \,\vert\, e_{ij} \in E$ | $0 \leq s_{ij} \leq 1$ | communication time is a fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ij} \in E} s_{ij} \leq 1$ | total sending fraction |
$\forall i$ | $\sum_{j \,\vert\, e_{ji} \in E} s_{ji} \leq 1$ | total receiving fraction |
Pipelined Scatter
Linear Program
$\forall i,j \,\vert\, e_{ij} \in E$ | $s_{ij} = \sum_k send(i,j,k)\, c_{ij}$ | sending time is the sum of all per-destination sending times |
$\forall i,k \,\vert\, k \neq i$ | $\sum_{j \,\vert\, e_{ji} \in E} send(j,i,k) = \sum_{j \,\vert\, e_{ij} \in E} send(i,j,k)$ | received = sent if not the target |
$\forall P_k \in \mathcal{P}_\text{target}$ | $TP = \sum_{j \,\vert\, e_{jk} \in E} send(j,k,k)$ | $TP$ is the number of messages arriving at each target |
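A minimal numerical sketch of this LP, again with `scipy.optimize.linprog`, for an assumed chain $P_0 \rightarrow P_1 \rightarrow P_2$ with targets $\{P_1, P_2\}$ and invented link costs $c_{01} = 1$, $c_{12} = 2$:

```python
from scipy.optimize import linprog

# Variables x = [TP, s01, s12, send(0,1,1), send(0,1,2), send(1,2,2)]
obj = [-1, 0, 0, 0, 0, 0]        # maximize TP (linprog minimizes)
A_eq = [
    [0, 1, 0, -1, -1, 0],        # s01 = (send(0,1,1) + send(0,1,2)) * c01
    [0, 0, 1, 0, 0, -2],         # s12 = send(1,2,2) * c12
    [0, 0, 0, 0, 1, -1],         # at P1, messages bound for P2: received = forwarded
    [1, 0, 0, -1, 0, 0],         # TP = messages arriving at target P1
    [1, 0, 0, 0, 0, -1],         # TP = messages arriving at target P2
]
b_eq = [0.0] * 5
bounds = [(0, None)] + [(0, 1)] * 2 + [(0, None)] * 3
res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(round(-res.fun, 2))  # 0.5 messages per time step
```

The bottleneck is the single-port constraint at $P_0$ (and the slow $P_1 \rightarrow P_2$ link): each message occupies the outgoing link twice, once per target, so $TP = 0.5$.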
Determine Scheduling Period
- Solve the rational linear program (in polynomial time)
- Need to find integer values for the number of:
- tasks computed: $\frac{\alpha^*_i}{w_i} T$
- messages sent: $\frac{s^*_{ij}}{c_{ij}} T$
- messages sent to $P_k$: $send^*(i,j,k) T$
- communication steps: $s^*_{ij} T$
- Period $T$ is the LCM of the denominators of the above fractions (not necessarily polynomial in problem size)
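The period computation reduces to an LCM over the denominators of the rational LP outputs. A minimal sketch using Python's `fractions` and `math.lcm` (the rates below are invented, standing in for quantities like $\alpha^*_i / w_i$):

```python
from fractions import Fraction
from math import lcm

def scheduling_period(rates):
    """Smallest T such that every rate * T is an integer:
    the LCM of the denominators of the rational rates."""
    fracs = [Fraction(r).limit_denominator() for r in rates]
    return lcm(*(f.denominator for f in fracs))

# Hypothetical rational rates from an LP solution
rates = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 6)]
T = scheduling_period(rates)
print(T)  # 12 -- each rate * 12 is an integer count per period
```

Note this also makes the caveat on the slide concrete: the LCM (and hence the period) can grow much faster than the problem size.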
Constructing a Schedule
- Split each node into a sending node and a receiving node to form a bipartite graph
- Compute a weighted edge coloring (in polynomial time)
- Each color class contains communication links that may operate simultaneously
Example
Reference
- Found a related example from another paper
- C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert, "Scheduling strategies for master-slave tasking on heterogeneous processor platforms", IEEE Transactions on Parallel and Distributed Systems, 15:319-330, 2004.
- Master-slave tasking
Example
Problem and LP Solution
Example
Finding the Bipartite Graph
Example
Schedule
Extensions
- Proof of asymptotic optimality
- Multiple communication ports (NICs)
- Scheduling startup and tear-down efficiently
- Handling fixed scheduling periods
- Dynamic computing and communication environments can be handled efficiently
- Other paper: multiple masters
Limitations of the Approach
- Pipelined multicast is still NP-Hard
- Requires nodes to be able to send and receive data simultaneously
- Determining the topology for non-trivial systems is difficult
- Shared communications links are not taken into account
Critique
- Mostly a good paper written by an expert in the field
- Compelling argument for use of steady-state approaches
- Scattered organization
- Schedule reconstruction is split across 3 sections
- Pipelined multicast is split across 2 sections
- Limitations section also contains benefits
- Missing details that would have made it easier to understand
- Decision variables
- Bipartite graph matching example
Paper #2
Linear Programming-Based Affinity Scheduling of Independent Tasks on Heterogeneous Computing Systems
Overview
- Problem description
- Baseline mapping heuristics
- Linear Programming-Based Affinity Scheduling (LPAS)
- Simulation-based comparison
Problem Description
- Given a HC system description:
- Task types (classes) with known mean execution times for each machine type
- Task arrival rates
- When a task arrives, assign it to a machine immediately
- No tasks are dropped
- Objectives:
- System stability
- Minimize mean task waiting time
- Minimize the required state information
Baseline Mapping Heuristics
| | Minimum Execution Time (MET) | K-Percent Best (KPB) | Minimum Completion Time (MCT) |
| Assign task to machine | which is fastest for that task type | from $S_i^k$ with the earliest completion time, where $S_i^k$ is the set of $k\%$ fastest machines for task type $i$ | which has the earliest completion time |
| Required state | none | between MET and MCT | full |
| Issues | Ignores waiting time at the machine | Selecting $k$ is difficult; $k$ is the same for each task type | No consideration of future task arrivals |
LPAS
Nomenclature
- Let $\alpha_i$ be the arrival rate for task type $i$
- Let $\lambda$ be the normalized system capacity
- For task type $i$ on machine type $j$, let $\mu_{ij} = \frac{1}{ETC_{ij}}$ be the execution rate in tasks per second (with $ETC_{ij}$ the expected time to compute) and $\delta_{ij}$ be the proportional allocation
LPAS
Allocation LP
$$\underset{\lambda,\,\delta_{ij}}{\text{maximize}} \quad \lambda$$
subject to:
$\forall i$ | $\sum_j \delta_{ij} \mu_{ij} \geq \lambda \alpha_i$ | task arrival rate constraint |
$\forall j$ | $\sum_i \delta_{ij} \leq 1$ | machine allocation constraint |
$\forall i,j$ | $\delta_{ij} \geq 0$ | proportional allocation is positive |
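A sketch of the allocation LP with `scipy.optimize.linprog`, checked against System A from the paper. Note that the LP is degenerate, so the solver may return a different (but equally optimal) $\delta^*$ than the one the paper reports:

```python
import numpy as np
from scipy.optimize import linprog

def allocation_lp(alpha, mu):
    """LPAS allocation LP: maximize lam subject to
    sum_j delta_ij * mu_ij >= lam * alpha_i   (capacity covers arrivals)
    sum_i delta_ij <= 1                        (each machine fully shared)
    delta_ij >= 0."""
    I, J = mu.shape
    n = 1 + I * J                    # variables: [lam, delta_11 .. delta_IJ]
    c = np.zeros(n); c[0] = -1.0     # linprog minimizes, so minimize -lam
    A, b = [], []
    for i in range(I):               # lam * alpha_i - sum_j delta_ij * mu_ij <= 0
        row = np.zeros(n); row[0] = alpha[i]
        for j in range(J):
            row[1 + i * J + j] = -mu[i, j]
        A.append(row); b.append(0.0)
    for j in range(J):               # sum_i delta_ij <= 1
        row = np.zeros(n)
        for i in range(I):
            row[1 + i * J + j] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * n, method="highs")
    return res.x[0], res.x[1:].reshape(I, J)

# System A from the paper
lam, delta = allocation_lp([2.45, 2.45], np.array([[9.0, 5.0], [2.0, 1.0]]))
print(round(lam, 4))  # 1.0204
```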
LPAS
Allocation LP Solution
- Always has a solution
- $\lambda^* \leq 1$ ⇒ unstable system (overloaded/oversubscribed)
- $\lambda^* > 1$ ⇒ stabilizable system (queue sizes converge to a steady-state distribution as $t \rightarrow \infty$)
- $\lambda^* \sum_i \alpha_i$ is the maximum number of tasks per second the system can execute
LPAS
Algorithm
- Pick the solution which has the smallest number of zeros (least restrictive)
- Let $S_i = \{j \,:\, \delta^*_{ij} \neq 0\}$ be the set of assignable machines for task type $i$
- Assign a task to a machine in $S_i$ with the earliest completion time
- Values of $\delta^*_{ij}$ are not actually used
- Allocation LP is simply used to intelligently reduce the set of machines we use in MCT
- Re-solve the allocation LP only when the system parameters ($\alpha_i$ and $\mu$) change
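The dispatch rule above is just MCT restricted to $S_i$. A minimal sketch, assuming a simple finish-time bookkeeping per machine (the `free_at`/`etc` names and the 0-indexed encoding of System A are mine, for illustration):

```python
def lpas_assign(i, S, free_at, etc, now=0.0):
    """LPAS dispatch: among the LP-allowed machines S[i], send the
    arriving type-i task to the one with the earliest completion time,
    then update that machine's finish time."""
    best = min(S[i], key=lambda j: max(free_at[j], now) + etc[i][j])
    free_at[best] = max(free_at[best], now) + etc[i][best]
    return best

# System A, 0-indexed: delta* = [[0, 0.5], [1, 0.5]] gives S_0 = {1}, S_1 = {0, 1};
# etc[i][j] = 1 / mu[i][j]
S = {0: [1], 1: [0, 1]}
etc = [[1 / 9, 1 / 5], [1 / 2, 1.0]]
free_at = [0.0, 0.0]
print(lpas_assign(0, S, free_at, etc))  # 1 -- only machine 1 is allowed for type 0
print(lpas_assign(1, S, free_at, etc))  # 0 -- machine 0 is idle and faster for type 1
```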
Example 1
System A
Given a HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}9 & 5 \\ 2 & 1 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 0.5 \\ 1 & 0.5 \end{pmatrix}$
with task rates of $\delta^* \mu = \begin{pmatrix}0 & 2.5 \\ 2 & 0.5 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
- Machine 1 executes task type 1 best but is only assigned task type 2
- KPB(1) ≡ MET and both result in unstable systems (task type 2 → machine 1)
- KPB(2) ≡ MCT and both are stable but perform worse than LPAS
Example 2
Given a HC system:
$\alpha = \begin{pmatrix} 2.45 & 2.45 \end{pmatrix}$ and $\mu = \begin{pmatrix}2.5 & 2.5 \\ 2.5 & 2.5 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0 & 1 \\ 1 & 0 \end{pmatrix}$ or $\begin{pmatrix}1 & 0 \\ 0 & 1 \end{pmatrix}$
and $\lambda^* = \frac{2.5}{2.45} = 1.0204 > 1$
- Perfectly homogeneous system
- Infinite number of optimal solutions
- LPAS arbitrarily restricts itself
Simulation
- Simple queuing model: inter-arrival times and execution times are exponentially distributed
- MET often results in an unstable system so it is excluded
- First-Come-First-Serve (FCFS) local schedulers
- Performance metric: long-run average of number of tasks in the system (queue + executing)
- 30 Monte Carlo runs, reporting 95% confidence intervals
- A finite set of carefully crafted systems is compared
Results
System A
- Previous 2x2 HC system
- Smaller is better
- Unstable heuristics are omitted
System B
$\alpha = \begin{pmatrix} 5 & 8 \end{pmatrix}$ and $\mu = \begin{pmatrix}8 & 3 \\ 4 & 10 \end{pmatrix}$
Allocation LP gives:
$\delta^* = \begin{pmatrix}0.8333 & 0 \\ 0.1667 & 1 \end{pmatrix}$
- Allocating all of machine 2 toward task type 2 helps LPAS compared to MCT
- Not restricting the allocation on machine 1 helps when compared to KPB(1)
System C1
- 3x30 system with high task to machine affinity
- Four groups of identical machines so the LP has many equal cost optimal solutions
- Solve the allocation LP over groups of equivalent machines for efficiency (and for solution uniqueness)
- KPB is unstable for $k < 14$
System D
High task and high machine heterogeneity
System E
Low task and high machine heterogeneity
MCT outperforms LPAS because there is effectively one task type, and LPAS restricts which machines it can run on when there is virtually no benefit to doing so
System F1
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}4 & 8 & 10 & 10\end{pmatrix}$
System F2
High task and low machine heterogeneity with $\alpha = \begin{pmatrix}7 & 7 & 7 & 7\end{pmatrix}$
System G
Super low task and machine heterogeneity
MCT performs slightly better than LPAS, possibly because the smaller machine set is more restrictive without a large benefit
System H
Super high task and machine heterogeneity, high task to machine affinity
System I1
Same as system C1 but with hyper-exponential execution time distribution (twice the variance)
Note that the LPAS upper bound is a typo in the paper
System I2
Same as system C1 but with constant execution time
LPAS-2/k Heuristic
- Tries to reduce the number of machines queried for completion time when a task of type $i$ arrives
- Randomly choose at most 2 machines from $S_i$ based on $\delta^*_{ij}$
- $j_1$ selected with PMF $p_{j_1} = \frac{\delta^*_{i{j_1}} \mu_{i{j_1}}}{\lambda^* \alpha_i}$
- $j_2$ selected with PMF $p_{j_2} = \frac{\delta^*_{i{j_2}} \mu_{i{j_2}}}{\lambda^* \alpha_i - \delta^*_{i{j_1}} \mu_{i{j_1}}}$
- Now we need the actual values of $\delta^*_{ij}$, not just the indices of the nonzero elements
- Select the machine from $\{j_1,j_2\}$ with the earliest completion time
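The two-draw sampling can be sketched in pure Python; weights are $\delta^*_{ij}\mu_{ij}$ (the rate machine $j$ contributes to type $i$), and the second draw renormalizes over the remaining machines. The function name and 0-indexed machine labels are mine:

```python
import random

def lpas_2k_candidates(delta_i, mu_i, rng=random):
    """Pick at most two candidate machines for a type-i task,
    sampling without replacement with weights delta*_ij * mu_ij."""
    weights = {j: d * m for j, (d, m) in enumerate(zip(delta_i, mu_i)) if d > 0}
    total = sum(weights.values())   # equals lambda* * alpha_i when the LP constraint is tight
    j1 = rng.choices(list(weights), [w / total for w in weights.values()])[0]
    rest = {j: w for j, w in weights.items() if j != j1}
    if not rest:                    # S_i had a single machine
        return [j1]
    rtot = total - weights[j1]      # renormalize for the second draw
    j2 = rng.choices(list(rest), [w / rtot for w in rest.values()])[0]
    return [j1, j2]

# System A, task type 0 (0-indexed): delta*_0 = (0, 0.5), mu_0 = (9, 5)
print(lpas_2k_candidates([0.0, 0.5], [9.0, 5.0]))  # [1] -- only machine 1 is eligible
```

The scheduler then queries only these (at most two) machines and selects the one with the earlier completion time.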
LP-Static Heuristic
- Motivation: requires no state information (similar to MET but hopefully performs better)
- Machine $j$ is selected with PMF $p_{j} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
- Proven to always stabilize a stabilizable system
- Performs poorly in practice
System C2
- Large highly heterogeneous system (similar to system C1)
- Used to compare LP-Static and LPAS-2/k, which use less state information than LPAS and MCT
Guided-LPAS Heuristics
- No stabilizable system has been found that LPAS cannot stabilize
- However, this has not been proven, so they introduce a new heuristic
- Motivation: provable stability with similar performance to LPAS
- Achieves a steady-state distribution of $\pi_{ij} = \frac{\delta^*_{ij} \mu_{ij}}{\lambda^* \alpha_i}$
- Restricts the set of assignable machines to force this distribution
- Also can define Guided-LPAS-2/k
Critique
- The ordering of the sections is not as coherent as it could be
- Simulation results are presented then more heuristics are developed
- Should have compared the provably stabilizing heuristics to all systems alongside LPAS
- Could have made a quantitative argument for less state in LPAS and LPAS-2/k
- Selection of ETC matrices is questionable
- Add random ETC matrices to compare heuristics
- Add real workload traces
- Add random and round-robin scheduling for completeness
Future Directions (My Thoughts)
- MCT outperforms LPAS for homogeneous systems
- LPAS does not model all the noise sources in the system
- Use a stochastic optimization approach
- Perturb $\alpha_i$ and $\mu_{ij}$ based on the distributions to see the effect on $\delta_{ij}$
- With low heterogeneity there are many solutions that have a cost near the optimal
- With noise in the parameters to the allocation LP any of these solutions may be optimal
- Include some near-optimal solutions in $S_i$
- With this approach LPAS will be equivalent to MCT for homogeneous systems
Conclusion
- Steady-state and reduced fidelity models allow some useful insight into the system behavior
- Noise in these systems seems to be simulated but not modelled in the solutions
- My interests lie in better characterizing and developing more practical scheduling algorithms for these HC systems