Worst-Case Execution Time Analysis for Parallel Run-Time Monitoring

Daniel Lo and G. Edward Suh, Cornell University
DAC 2012, June 3-7, 2012, SF, CA, USA
Outline

1. Real Time Systems
2. Real Time Systems Space
3. Reliability and Security
4. Impact of Parallel Monitoring on WCET
5. Parallel Monitoring Challenges
   a. WCET (Basics)
6. Parallel Monitoring Advantages
7. Parallel Monitoring Architecture
8. Integer Linear Programming
9. Formulation
10. Results
11. Future Work
Real Time Systems

A system is said to be real-time if the total correctness of an operation depends upon:
1. logical correctness, and
2. the time in which it is performed.

Real-time operations are like fire brigades: arriving and arriving on time are equally important.
Real Time Systems Space (1/3)

- Home Appliances
- Transport: cars, aeroplanes, etc.
- Personal Electronics
- Robots: Mars rover, etc.
- Medical Appliances
- Buildings
Real Time Systems Space (2/3)
The space and complexity of real-time systems keep growing.
Reliability and Security

- A key concern for emerging technologies.
- A reliability or security breach in critical applications (e.g. medical applications) may cause physical damage or loss of life.

Growing need for monitoring tasks that run in parallel to hunt down security and reliability issues
Reliability & Security Issues

Untrusted I/O Operations
Uninitialized Memory Read
Control Flow Corruption

Monitoring of the task is known to significantly improve Reliability and Security
Parallel Monitoring Challenges

1. Run-time overhead
   ○ Running monitoring sequentially with the main task consumes so much execution time that it leaves designers little room to add new applications
   ○ Parallel monitoring helps but may still stall the main core

2. Power
   ○ Continuous Parallel Monitoring comes at the cost of power

3. Area
   ○ Parallel Monitoring requires additional hardware/ area

4. Scheduling Overhead for the RTOS

5. Lack of timing guarantees
   ○ Forbids inclusion in Critical RTS
   ○ Need for estimate of the WCET of the monitoring processes
   ○ RTOS needs to know the WCET
WCET
(Worst Case Execution Time)
Definition: an Upper bound on the execution time of a task [Peter1]

Required by the operating system to schedule tasks and provide real-time guarantees.

“Mr. Barnes is expecting you, but he’s currently in a chess game. So, he’ll be with you in a few minutes, or several hours.”
Parallel Monitoring

Advantages

1. Enables many new capabilities:
   a. fine-grained memory protection
   b. error bound checks
   c. hardware errors

2. Protection against large class of software attacks

3. Large reduction (on the order of tens of percent) in monitoring run time compared to single-core monitoring
Parallel Monitoring Architecture Model

- Main and Monitoring core loosely coupled through a FIFO buffer

- Forwarded Instruction:
  - determined based on monitoring technique
  - sent transparently (no explicit inst in main task)
  - triggers series of monitoring instructions

- If FIFO full:
  - Main core must stall on the forwarded instruction until a FIFO entry becomes available. This is referred to as a monitoring stall
Monitoring Technique

UMC (Uninitialized Memory Check)

a. Monitoring core detects bugs that read a memory location before it has been written.
b. Load/store instructions are forwarded by the main core to the monitoring core
c. On a store, the monitoring core sets a tag bit corresponding to the location
d. On a load, the monitoring core checks the tag bit and raises an exception if it is not set
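
The steps above can be sketched in a few lines of Python. This is a toy model only: the class name, the use of a set in place of hardware tag bits, and the addresses are assumptions made for illustration, not the monitoring core's actual implementation.

```python
class UninitializedMemoryCheck:
    """Toy model of the UMC logic: one tag bit per memory location."""

    def __init__(self):
        # Addresses whose tag bit is set (hypothetical stand-in for tag memory).
        self.initialized = set()

    def on_store(self, address):
        # A forwarded store sets the tag bit of the written location.
        self.initialized.add(address)

    def on_load(self, address):
        # A forwarded load checks the tag bit; an unset bit means the
        # location is being read before it was ever written.
        if address not in self.initialized:
            raise RuntimeError(f"uninitialized read at {address:#x}")


monitor = UninitializedMemoryCheck()
monitor.on_store(0x1000)
monitor.on_load(0x1000)      # passes: 0x1000 was written first
try:
    monitor.on_load(0x2000)  # 0x2000 was never written
    umc_result = "no error"
except RuntimeError as err:
    umc_result = str(err)
print(umc_result)
```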
Paper focus and Assumptions

1. Analysis focuses on the interactions between the main core and the monitoring core through the FIFO
2. Monitoring core is assumed to have separate memory (no cycles lost to shared resources)
3. No timing anomalies in the main core: this is required in order to assume that the worst-case monitoring stalls produce the WCET on the main core
4. WCETs of a main task and a monitoring task on different cores may be estimated individually
5. There are enough loop iterations for the FIFO to become full.
Worst Case Execution Time Analysis
Classic WCET Analysis: Implicit Path Enumeration

1. Convert the program into a control flow graph (CFG)
2. Formulate an ILP to maximize

   $t = \sum_{B \in \mathcal{B}_{\text{CFG}}} N_B \cdot c_{B,\text{max}}$

   subject to constraints on $N_B$, where

   - $\mathcal{B}_{\text{CFG}}$ : set of basic blocks in the CFG
   - $N_B$ : # of times block $B$ is executed
   - $c_{B,\text{max}}$ : max cycles to execute $B$

3. At a branch, only one successor is taken per execution
4. Put constraints on $N_B$ so that only feasible paths are counted

   The maximum value of $t$ gives the WCET
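
As a minimal sketch of the idea, the snippet below maximizes the IPET objective for a tiny, hypothetical CFG. The block names, cycle counts and loop bound are invented for illustration, and brute-force enumeration stands in for a real ILP solver.

```python
from itertools import product

# Hypothetical CFG: entry A runs once, then a loop of at most LOOP_BOUND
# iterations takes either branch B or branch C, then exit D runs once.
C_MAX = {"A": 5, "B": 20, "C": 8, "D": 3}  # assumed max cycles per block
LOOP_BOUND = 10                             # assumed loop bound


def ipet_wcet(c_max, loop_bound):
    """Maximize t = sum(N_B * c_B,max) over all feasible execution counts."""
    best_t, best_counts = -1, None
    for n_b, n_c in product(range(loop_bound + 1), repeat=2):
        # Structural constraint: each iteration takes exactly one branch,
        # so the two branch counts together cannot exceed the loop bound.
        if n_b + n_c > loop_bound:
            continue
        counts = {"A": 1, "B": n_b, "C": n_c, "D": 1}
        t = sum(counts[b] * c_max[b] for b in counts)
        if t > best_t:
            best_t, best_counts = t, counts
    return best_t, best_counts


wcet, counts = ipet_wcet(C_MAX, LOOP_BOUND)
print(wcet, counts)
```

For these assumed numbers the maximum is reached when every iteration takes the more expensive branch B, giving 5 + 10 * 20 + 3 = 208 cycles.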
Parallel Monitoring WCET Analysis

Classic ILP formulation may be extended to account for the Monitoring stalls per block:

\[ t = \sum_{B \in B_{CFG}} N_B \cdot (c_{B,\text{max}} + s_{B,\text{max}}) \]

where, \( s_{B,\text{max}} \) : max # of cycles that B is stalled due to monitoring
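
A small sketch of how the stall term changes the objective, assuming the execution counts are already fixed (in the real formulation the ILP chooses them). All block names and cycle values are hypothetical.

```python
def wcet_with_stalls(counts, c_max, s_max):
    """Evaluate t = sum over blocks of N_B * (c_B,max + s_B,max)."""
    return sum(n * (c_max[b] + s_max.get(b, 0)) for b, n in counts.items())


# Hypothetical execution counts and per-block cycle bounds.
counts = {"A": 1, "B": 10, "D": 1}
c_max = {"A": 5, "B": 20, "D": 3}
s_max = {"B": 4}  # assumed: only block B forwards instructions and can stall

t_plain = wcet_with_stalls(counts, c_max, {})       # no monitoring stalls
t_monitored = wcet_with_stalls(counts, c_max, s_max)
print(t_plain, t_monitored)
```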
Integer Linear Programming

1. Boundary conditions (constraints) define the area of interest
2. Within the area of interest, we find the points at which a function reaches its maximum
Sequential Monitoring Bound

- Conservative bound on the worst-case monitoring stall cycles
  - Monitoring task runs in line with the main task on the same core
  - WCET may be obtained in the traditional way, by treating monitoring and main execution as a single program
  - May cause a monitoring stall for every instruction
  - Extremely conservative compared to executing the monitoring task in parallel, coupled through a FIFO
FIFO Model (1/2)

- **Main task**
  - continues as long as FIFO entry available
  - stalls when FIFO full
- **WCET model needs to capture**
  - the worst-case (maximum) number of entries in the FIFO at each forwarded instruction
  - how many cycles the main task may be stalled due to the FIFO being full
FIFO Model (2/2)

Monitoring Flow Graph (MFG)
CFG is transformed so that each node contains at most one forwarded instruction
  a. the forwarded instruction is located at the end of the code represented by the node

Monitoring Load
# of cycles required for the monitoring core to process all outstanding entries in the FIFO at a given point in time
Monitoring Load

Challenge:
Mathematical modeling of the FIFO at the entry-by-entry level

Simplification:
$t_{M,\text{max}}$: increase in monitoring load for any forwarded instruction = maximum, over all forwarded instructions, of the worst-case monitoring task execution time

Bound: $0 \leq \text{Monitoring Load} \leq l_{\text{max}}$, the maximum monitoring load the FIFO can handle

$l_{\text{max}} = n_F \times t_{M,\text{max}}$

Where $n_F$: # of FIFO entries
Monitoring Flow Graph

Each node Mx represents a node of the monitoring flow graph
Worst case Stall Cycles

Change in Monitoring Load at node M

\( l_{i,M} \): monitoring load coming into node M
\( l_{o,M} \): monitoring load going out of node M
\( \Delta l_{M} \): change in the monitoring load at node M
\( t_{M,max} \): WCET of the monitoring task
\( c_{M,min} \): minimum cycles to execute node M

\[ \Delta l_{M} = \begin{cases} t_{M,max} - c_{M,min}, & \text{forwarded inst. } \in M \\ -c_{M,min}, & \text{no forwarded inst. } \in M \end{cases} \]

Output Monitoring Load at node M

\[ l_{o,M} = \begin{cases} 0, & l_{i,M} + \Delta l_{M} < 0 \\ l_{i,M} + \Delta l_{M}, & 0 \leq l_{i,M} + \Delta l_{M} \leq l_{max} \\ l_{max}, & l_{i,M} + \Delta l_{M} > l_{max} \end{cases} \]

\( l_{max} = n_F \cdot t_{M,max} \)

Input Load of node M due to Previous nodes

\[ l_{i,M} = \max_{M_{prev} \in \text{prev}(M)} l_{o,M_{prev}} \]

where \( \text{prev}(M) \) is the set of nodes with an edge into M

A stall occurs when a forwarded instruction is executed but no FIFO entry is free
\( S_{M} \): Number of cycles stalled

\[ S_{M} = \begin{cases} 0, & l_{i,M} + \Delta l_{M} < l_{max} \\ (l_{i,M} + \Delta l_{M}) - l_{max}, & l_{i,M} + \Delta l_{M} \geq l_{max} \end{cases} \]

Worst-case monitoring stall cycles for a given node M

\[ \max_{M\in M_{MFG}} \sum S_{M} \]
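
The load and stall equations can be traced by hand along one path. The sketch below does that for a single, hypothetical sequence of nodes, so the max over predecessor nodes and the MILP search over paths are left out. Function names and all numbers are assumptions for illustration.

```python
def delta_load(t_m_max, c_m_min, forwards):
    """Change in monitoring load at a node (the Delta l_M cases)."""
    return (t_m_max if forwards else 0) - c_m_min


def propagate(l_in, delta, l_max):
    """Return (l_o, S) for one node: clamp the load to [0, l_max]."""
    raw = l_in + delta
    l_out = min(max(raw, 0), l_max)
    stall = max(raw - l_max, 0)  # cycles the main core waits on a full FIFO
    return l_out, stall


def walk_path(path, n_f, t_m_max):
    """Walk one path of (c_M,min, has_forwarded_inst) nodes, FIFO empty at start."""
    l_max = n_f * t_m_max
    load, total_stall = 0, 0
    for c_m_min, forwards in path:
        load, stall = propagate(load, delta_load(t_m_max, c_m_min, forwards), l_max)
        total_stall += stall
    return load, total_stall


# Hypothetical path: four short nodes that each forward an instruction,
# followed by one long node that forwards nothing and lets the FIFO drain.
path = [(2, True)] * 4 + [(15, False)]
final_load, total_stall = walk_path(path, n_f=2, t_m_max=10)
print(final_load, total_stall)
```

With these assumed values l_max is 20 cycles. The load climbs 8, 16, then saturates at 20 with stalls of 4 and 8 cycles, and the last node drains it to 5.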
## Results

Estimated and Observed WCET (clock cycles) with and without monitoring

<table>
<thead>
<tr>
<th>Monitoring</th>
<th>Experiment</th>
<th>cnt</th>
<th>expint</th>
<th>fdct</th>
<th>fibcall</th>
<th>insertsort</th>
<th>matmult</th>
<th>ns</th>
</tr>
</thead>
<tbody>
<tr>
<td>None</td>
<td>wcet-none</td>
<td>64531</td>
<td>3483</td>
<td>1805</td>
<td>245</td>
<td>598</td>
<td>133668</td>
<td>5951</td>
</tr>
<tr>
<td></td>
<td>sim-none</td>
<td>62931</td>
<td>2293</td>
<td>1805</td>
<td>245</td>
<td>598</td>
<td>133668</td>
<td>5951</td>
</tr>
<tr>
<td>UMC</td>
<td>sequential-umc</td>
<td>103052</td>
<td>3591</td>
<td>4382</td>
<td>257</td>
<td>2489</td>
<td>357453</td>
<td>10338</td>
</tr>
<tr>
<td></td>
<td>wcet-umc</td>
<td>64550</td>
<td>3498</td>
<td>3035</td>
<td>245</td>
<td>2083</td>
<td>256120</td>
<td>5953</td>
</tr>
<tr>
<td></td>
<td>sim-umc</td>
<td>62931</td>
<td>2297</td>
<td>2564</td>
<td>245</td>
<td>1864</td>
<td>235120</td>
<td>5951</td>
</tr>
<tr>
<td>CFP</td>
<td>sequential-cfp</td>
<td>151732</td>
<td>11669</td>
<td>1976</td>
<td>794</td>
<td>1174</td>
<td>231507</td>
<td>18623</td>
</tr>
<tr>
<td></td>
<td>wcet-cfp</td>
<td>93544</td>
<td>8984</td>
<td>1805</td>
<td>547</td>
<td>677</td>
<td>133668</td>
<td>13614</td>
</tr>
<tr>
<td></td>
<td>sim-cfp</td>
<td>72540</td>
<td>5247</td>
<td>1805</td>
<td>382</td>
<td>598</td>
<td>133668</td>
<td>9824</td>
</tr>
</tbody>
</table>
## Results (Ratio)

### Ratios Comparing Results from different Experiments

<table>
<thead>
<tr>
<th>Numerator</th>
<th>Denominator</th>
<th>cnt</th>
<th>expint</th>
<th>fdct</th>
<th>fibcall</th>
<th>insertsort</th>
<th>matmult</th>
<th>ns</th>
</tr>
</thead>
<tbody>
<tr>
<td>wcet-none</td>
<td>sim-none</td>
<td>1.03</td>
<td>1.52</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td>wcet-umc</td>
<td>sim-umc</td>
<td>1.03</td>
<td>1.52</td>
<td>1.18</td>
<td>1.00</td>
<td>1.12</td>
<td>1.09</td>
<td>1.00</td>
</tr>
<tr>
<td>wcet-cfp</td>
<td>sim-cfp</td>
<td>1.29</td>
<td>1.71</td>
<td>1.00</td>
<td>1.43</td>
<td>1.13</td>
<td>1.00</td>
<td>1.39</td>
</tr>
<tr>
<td>sequential-umc</td>
<td>wcet-umc</td>
<td>1.60</td>
<td>1.03</td>
<td>1.44</td>
<td>1.05</td>
<td>1.19</td>
<td>1.40</td>
<td>1.74</td>
</tr>
<tr>
<td>sequential-cfp</td>
<td>wcet-cfp</td>
<td>1.62</td>
<td>1.30</td>
<td>1.09</td>
<td>1.45</td>
<td>1.73</td>
<td>1.73</td>
<td>1.37</td>
</tr>
<tr>
<td>wcet-umc</td>
<td>wcet-none</td>
<td>1.00</td>
<td>1.00</td>
<td>1.68</td>
<td>1.00</td>
<td>3.48</td>
<td>1.92</td>
<td>1.00</td>
</tr>
<tr>
<td>wcet-cfp</td>
<td>wcet-none</td>
<td>1.45</td>
<td>2.58</td>
<td>1.00</td>
<td>2.23</td>
<td>1.13</td>
<td>1.00</td>
<td>2.29</td>
</tr>
</tbody>
</table>
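
Each ratio row is the element-wise quotient of two rows of the cycle-count table. As a quick check, the sketch below recomputes the wcet-umc / wcet-none row from the values in that table. The variable and function names are illustrative.

```python
# Values copied from the estimated/observed WCET table (clock cycles).
BENCHMARKS = ["cnt", "expint", "fdct", "fibcall", "insertsort", "matmult", "ns"]
WCET_NONE = [64531, 3483, 1805, 245, 598, 133668, 5951]
WCET_UMC = [64550, 3498, 3035, 245, 2083, 256120, 5953]


def ratio_row(numerator, denominator):
    """Element-wise ratio, formatted to two decimals as in the ratio table."""
    return [f"{n / d:.2f}" for n, d in zip(numerator, denominator)]


umc_overhead = dict(zip(BENCHMARKS, ratio_row(WCET_UMC, WCET_NONE)))
print(umc_overhead)
```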
Conclusion

- Parallel monitoring is an attractive solution for improving the safety and reliability of future real-time systems.
- The WCET of parallel monitoring techniques needs to be analyzed before they can be applied.
- A method for estimating the WCET of tasks running on a parallel monitoring system was presented.
- Non-linear FIFO behavior is modeled as an MILP problem to produce the worst-case monitoring stall cycles.
- Worst-case monitoring stall cycles can be incorporated into traditional IPET methods for WCET estimation.
- Evaluation shows significant improvements over an estimate that assumes sequential execution of the monitoring.
- The amount of overestimation is comparable to that for a system without parallel monitoring.
Future work and Improvements

1. Tighten the WCET bound
   a. Improve by incorporating more info about the main task
   b. Incorporate loop bounds and infeasible paths
2. Reduce the time needed to solve the linear programming problem
3. Architectural features
   a. Shared memory analysis
4. Non-linear programming Techniques
Leakage-Aware Dynamic Scheduling for Real-Time Adaptive Applications on Multiprocessor Systems

Heng Yu, Bharadwaj Veeravalli and Yajun Ha, National University of Singapore
DAC June 13-18 2012
Outline

1. Adaptable Applications
2. Types of Power dissipation
3. Leakage Power
4. What all can reduce Power?
5. Slack saving in 2 processor system
6. Minimum Energy at given Frequency
7. Frequency and Min Energy Settings
8. Slack Receiver Selection
9. Guided Search Heuristics
10. Results
11. Conclusion and Improvement Area
Adaptable Applications World

**Advantages**
+ Performance quality scales with the environment
+ More program cycles and/or energy budget assigned
  - the higher the performance quality achieved (up to a threshold)

**Examples:**
1. Scalable Video Coding (SVC) scheme in H.264/ MPEG-4 standards
   + Customized service quality to accommodate network and device conditions

2. JPEG2000 codec: Multiple playback resolutions.

Instead of either completing or failing the execution, adaptive applications usually define multiple execution granularities, where finer granularities give better results
+ at the cost of increased program cycles and energy

**Strong Motivation for Low Power Vs Performance Tradeoffs**
Scalable App Example
Types of Power Dissipation

1. Dynamic Power
   + Power in Charging and discharging of Loads
   + Depends on
     - Toggle Rate and Frequency
     - Vdd

2. Leakage Power
   + Power lost when the device is off
   + Depends on Vdd, Vbs and process parameters

3. Short circuit Power
   + Depends on
     - Output load, input slew, Vdd, toggle rate, frequency
Leakage Power Trends

With technology scaling, leakage power is rising and becoming comparable to dynamic power.

Increasing Need of Leakage Power Aware Scheduling Algorithms
What all can reduce power?

1. Vdd : Supply voltage (Dynamic Voltage Scaling)
   - Pinst : directly proportional to Vdd²
   - Scaling Vdd to 0.7x cuts dynamic power roughly in half (0.7² = 0.49)

2. Vbs: Bias Voltage
   - Impacts Vt and the Leakage power

3. Frequency
   - Impacts the short-circuit and dynamic power
   - Linear relationship

4. Turn off module completely : Heavy penalty on WCET
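
The 0.7x rule of thumb follows directly from the quadratic dependence of dynamic power on Vdd. A minimal check, using the dynamic term C_eff * Vdd² * f with arbitrary placeholder values for C_eff and f:

```python
def dynamic_power(c_eff, v_dd, freq):
    """Dynamic power term: C_eff * Vdd^2 * f."""
    return c_eff * v_dd ** 2 * freq


# Placeholder values chosen only for illustration.
C_EFF = 1.0e-9   # effective switched capacitance, farads
FREQ = 1.0e9     # clock frequency, hertz

p_nominal = dynamic_power(C_EFF, 1.0, FREQ)
p_scaled = dynamic_power(C_EFF, 0.7, FREQ)
power_ratio = p_scaled / p_nominal
print(round(power_ratio, 2))
```

The ratio depends only on the voltage scaling factor, so the choice of C_eff and f does not matter here.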
Slack saving in 2 processor system

- P1:
  - a: 0-60
  - b: 60-100
  - c: 160-200

- P2:
  - a: 0-60
  - b: 60-100
  - c: 160-200

Time axis: 0-200
Minimizing Energy at given Frequency

Total Power Specified by:

\[ P = C_{\text{eff}} V_{dd}^2 f + L_g (V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_j) \]

Energy:

\[ E_{\text{cyc}} = C_{\text{eff}} V_{dd}^2 + L_g f^{-1} (V_{dd} K_3 e^{K_4 V_{dd}} e^{K_5 V_{bs}} + |V_{bs}| I_j) \]

Lg : number of logic gates in the circuit; Ld : logic depth of the critical path

Frequency Selection:

\[ f = (L_d K_6)^{-1} ((1 + K_1) V_{dd} + K_2 V_{bs} - V_{th1})^\alpha \]

By adjusting (Vdd,Vbs) values Ecyc can be minimized at a given frequency
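
One way to picture this minimization is a one-dimensional search: for a target frequency, each candidate Vbs fixes Vdd through the frequency equation, and the pair with the lowest Ecyc wins. The sketch below assumes that approach. All constants are illustrative placeholders, not values from the slides, and the grid of Vbs values is likewise hypothetical.

```python
import math

# Illustrative placeholder constants (assumed, not taken from the slides).
K1, K2, K3, K4, K5, K6 = 0.063, 0.153, 5.38e-7, 1.83, 4.19, 5.26e-12
V_TH1, ALPHA, I_J = 0.244, 1.5, 4.8e-10
C_EFF, L_D, L_G = 0.43e-9, 37, 4e6


def frequency(v_dd, v_bs):
    """Frequency equation: f = (L_d K6)^-1 ((1+K1) Vdd + K2 Vbs - Vth1)^alpha."""
    return ((1 + K1) * v_dd + K2 * v_bs - V_TH1) ** ALPHA / (L_D * K6)


def vdd_for_frequency(freq, v_bs):
    """Invert the frequency equation for Vdd at a given Vbs."""
    return ((freq * L_D * K6) ** (1 / ALPHA) + V_TH1 - K2 * v_bs) / (1 + K1)


def energy_per_cycle(v_dd, v_bs, freq):
    """Ecyc = C_eff Vdd^2 + L_g / f * (leakage current terms)."""
    leak = v_dd * K3 * math.exp(K4 * v_dd) * math.exp(K5 * v_bs) + abs(v_bs) * I_J
    return C_EFF * v_dd ** 2 + L_G / freq * leak


def best_setting(freq, v_bs_candidates):
    """Return (Ecyc, Vdd, Vbs) with the lowest energy per cycle at freq."""
    options = []
    for v_bs in v_bs_candidates:
        v_dd = vdd_for_frequency(freq, v_bs)
        options.append((energy_per_cycle(v_dd, v_bs, freq), v_dd, v_bs))
    return min(options)


TARGET_F = 1.0e9                           # hypothetical target frequency, Hz
V_BS_GRID = [-0.1 * k for k in range(11)]  # 0 V down to -1 V in 0.1 V steps
e_min, v_dd_opt, v_bs_opt = best_setting(TARGET_F, V_BS_GRID)
print(e_min, v_dd_opt, v_bs_opt)
```

A finer grid or a continuous optimizer could replace the coarse Vbs sweep. The sweep is used here only to keep the sketch short.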
Frequency and Min Energy Settings

For each frequency in the set of available frequencies $\{f_1, f_2, ..., f_j\}$, choose Vdd and Vbs to obtain the minimum leakage power.
Slack Receiver Selection (1/2)

Issues with greedy receiver selection, which passes slack only to a direct descendant task:

1. Direct receivers may not fully utilize the slack time
2. There may be additional parallel candidates for slack distribution.
The Candidate Set (abbr. CS) of a slack generator is a set of slack receivers that fully absorbs the slack time.

Candidate sets for receiving slack from T1:
{T2, T3}, {T2, T6}, {T4, T5, T6}, and {T4, T5, T3}.
Guided Search Heuristics

A guided-search heuristic selects the "best-fit" frequency levels that maximize the additional program cycles of adaptive tasks.

**Objective:**
1. Raise or lower the frequency so as to consume all the slack from the previous node
2. Constrain the search in 1 by the energy budget.

\[
\begin{aligned}
\text{Maximize} & \quad \sum_{T_i \in T} \Delta c_i \\
\text{Subject to} & \quad \frac{c_i + \Delta c_i}{f_{i,\text{new}}} \leq \frac{c_i}{f_{i,\text{old}}} + t_{s,i} + \Delta t_{oh}, \quad \forall T_i \in T \\
& \quad \sum_{T_i \in T} \left((c_i + \Delta c_i) E_{cyc}^{f_{i,\text{new}}}\right) \leq E_s + \sum_{T_i \in T} \left(c_i E_{cyc}^{f_{i,\text{old}}}\right) + \Delta E_{oh}
\end{aligned}
\]
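
The sketch below only evaluates a candidate assignment against the two constraints. It does not perform the guided search itself. The task fields, the energy-per-cycle table and all numbers are hypothetical, and the overhead terms are added exactly as they appear in the formulation above.

```python
def evaluate(tasks, e_cyc, e_slack, dt_oh=0.0, de_oh=0.0):
    """Return (feasible, total extra cycles) for a candidate assignment.

    tasks: list of dicts with keys c, dc, f_old, f_new, t_slack
    e_cyc: energy per cycle at each frequency level (hypothetical table)
    """
    # Timing constraint, checked for every task.
    timing_ok = all(
        (t["c"] + t["dc"]) / t["f_new"]
        <= t["c"] / t["f_old"] + t["t_slack"] + dt_oh
        for t in tasks
    )
    # Energy constraint, summed over all tasks.
    energy_new = sum((t["c"] + t["dc"]) * e_cyc[t["f_new"]] for t in tasks)
    energy_budget = e_slack + sum(t["c"] * e_cyc[t["f_old"]] for t in tasks) + de_oh
    objective = sum(t["dc"] for t in tasks)
    return timing_ok and energy_new <= energy_budget, objective


# Hypothetical units: cycles, cycles per millisecond, milliseconds, energy units.
E_CYC = {100: 1, 200: 3}
task = {"c": 1000, "dc": 2000, "f_old": 100, "f_new": 200, "t_slack": 5}

fits = evaluate([task], E_CYC, e_slack=8000)
too_many = evaluate([dict(task, dc=2001)], E_CYC, e_slack=8000)
print(fits, too_many)
```

In this assumed example, 2000 extra cycles exactly use up both the slack time and the slack energy. One more cycle violates the timing constraint.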
Results

Performance is over 2.5 times that of the even-energy approach and 31.6% better than the greedy approach
Conclusion, Future Work and Improvement Areas

- Novel framework proposed for leakage-aware multiprocessor dynamic scheduling of adaptive applications
- Efficient slack distribution technique demonstrated for leakage-aware dynamic scheduling
- The approach does not take the toggle rate of the system into account.
- A processor may have multiple tasks running at a given time: the analysis and algorithm need to account for multithreading
- The overhead of supporting different voltage levels needs to be studied
Questions