
Deliverable D5.1 - Full System Design: Progress Report

Contract No: ICT-323872
Contractual Date of Delivery: 30/5/2014 (M12)
Actual Date of Delivery: 12/6/2014
Main Authors: Christos Antonopoulos (CERTH)
Co-Authors: Georgios Karakonstantis (EPFL), Charalambos Chalios (QUB), Jan Riehme (RWTH Aachen), Peter Debacker (IMEC), Vassilis Vassiliadis (CERTH)
Reviewers: Nikolaos Bellas (CERTH), Spyros Lalis (CERTH), Konstantinos Parasyris (CERTH), Andy Burg (EPFL), Dimitrios Nikolopoulos (QUB), Uwe Naumann (RWTH Aachen)
Estimated Person Months: 8
Classification: Public (PU)
Report Version: 1.0

SCoRPiO: Significance-Based Computing for Reliability and Power Optimization

Disclaimer: This document reflects the contribution of the participants of the SCoRPiO project. The European Union and its agencies are not liable or otherwise responsible for the contents of this document; its content reflects the views of its authors only. This document is provided without any warranty and does not constitute any commitment by any participant as to its content, and specifically excludes any warranty of correctness or fitness for a particular purpose. The user will use this document at her/his own risk.


Abstract

Work Package 5 of the SCoRPiO project focuses on the vertical integration of the different layers of the system stack. This document serves as a checkpoint of progress at the end of the first year of the project. We present the models developed, the design decisions, the tools involved, and in general the current status of each system layer. Going a step further, in the second part of the document we specify the interfaces between consecutive layers of the system.


Contents

Listings

1 Introduction

2 System Architecture
  2.1 Overview of System Design
  2.2 Automatic Significance Characterization
    2.2.1 Mathematical Representation of the Significance-Centric Programming Model
    2.2.2 Interval Arithmetic
    2.2.3 Significance Based on Interval Arithmetic
    2.2.4 Algorithmic Differentiation
    2.2.5 Algorithmic Differentiation with Intervals
    2.2.6 Significance Based on Interval Derivatives
    2.2.7 Implementation
    2.2.8 Status Summary
  2.3 Programming Model - Compiler
    2.3.1 Programming Model
    2.3.2 Compiler
    2.3.3 Status Summary
  2.4 System Software
    2.4.1 Runtime System
    2.4.2 OS Support
    2.4.3 Status Summary
  2.5 Simulator
    2.5.1 Power Estimation and Fault Injection
    2.5.2 Status Summary
  2.6 Power and Error Modeling
    2.6.1 Characterization and Propagation of Hardware Behavior at the Software Level
    2.6.2 Reliability Modeling
    2.6.3 Status Summary
  2.7 Interfaces

3 Interface between Automatic Significance Characterization and Programming Model/Compiler
  3.1 Programmer / Compiler to Significance Analysis Interface
    3.1.1 Header Files
    3.1.2 Variable Type Change
    3.1.3 Variable Declarations
    3.1.4 Registering Variables
    3.1.5 Triggering the Analysis
    3.1.6 Multiple Analysis Runs
  3.2 Access to the Results of Significance Analysis
    3.2.1 Access to Information of scorpio_info
    3.2.2 Access to Information of scorpio_info_item
  3.3 Significance Analysis to Programmer / Compiler Interface
    3.3.1 Through scorpio_info
    3.3.2 Directly from the Binary File

4 Compiler - System Software

5 System Software - Simulator
  5.1 HW configuration and events monitoring
    5.1.1 Handling accelerator cores
    5.1.2 Handling memory
  5.2 Error handling
  5.3 Topology
  5.4 Thread manipulation/Memory transfers

6 Conclusions


List of Figures

2.1 Layering of the SCoRPiO architecture
2.2 Significance-aware architecture of the runtime system
2.3 Target SCoRPiO architecture
2.4 Flow for extracting instruction-level failure rate and energy models
2.5 From (a) the SPICE model of a NAND2 gate, to (b) the delay distribution of a NAND2 gate due to only local variations (blue) or local+global variations (green), to (c) the critical path delay of two ASICs at two different supply voltages
2.6 Performance degradation over time for 28nm planar and 14nm FinFET technology
2.7 Main interfaces of the SCoRPiO architecture
3.1 Workflow of automatic (or assisted) significance characterization
5.1 Reliability flags for a memory region


Listings

3.1 Instantiating scorpio_info
3.2 Access to individual entries of the significance information
3.3 Instantiating scorpio_info for significance retrieval


Chapter 1

Introduction

Many modern workloads, such as multimedia, machine learning, visualization, and scientific computing, can tolerate a degree of imprecision in computations and data. SCoRPiO seeks to exploit this observation and to relax reliability requirements for the hardware layer by allowing a controlled degree of imprecision to be introduced to computations and data. We investigate methods that allow the system- and application-software layers to synergistically characterize the significance of various parts of the program for the quality of the end result, and their tolerance to faults. Based on this information, extracted automatically or manually, the system software can steer computations and data to either low-power yet unreliable, or higher-power and reliable functional and storage components. In addition, the system is able to aggressively reduce its power footprint by opportunistically powering hardware modules below nominal values.

Significance-based computing lays the foundations not only for approaching the theoretical limits of energy reduction of CMOS technology, but also for moving beyond those limits by accepting hardware faults in a controlled manner. Reliability issues are not seen as a problem, but rather as an opportunity to rethink the concept of computation.

SCoRPiO is based on a vertical, multilayer approach spanning the whole system stack, from the circuit level to the application level. The close collaboration between these layers and the flow of information across them is of paramount importance for achieving the objectives of the project. The integration effort:

• Ensures that the work performed in individual work packages converges towards the SCoRPiO goal.

• Produces integrated research prototypes, which can be used to qualitatively and quantitatively evaluate the proposed techniques.

• Helps identify risks as early as possible and apply contingency plans.

The purpose of the document is twofold:

• It briefly presents the design and/or the current status of each layer of the envisioned SCoRPiO full system stack: automatic significance characterization, programming model and compiler, system software, simulated hardware, and circuit-level power and fault models, as well as the abstraction of the latter at the instruction level. In two of the layers (compiler and runtime support) we have opted to front-load the implementation effort, in order to make the vertical experimental infrastructure available as early as possible in the course of the project.

• It defines the interfaces between different layers.

The rest of the document is organized as follows: Chapter 2 outlines the system architecture, and presents the main design decisions and progress on each system layer. Chapter 3 discusses the interface between the automatic significance characterization layer and the compiler. Chapter 4 focuses on the interface presented to the compiler by system software. Chapter 5 moves further down the system stack, describing the interface between the simulated hardware and system software. Finally, Chapter 6 concludes the document.


Chapter 2

System Architecture

2.1 Overview of System Design

SCoRPiO is characterized by a layered approach. All layers, from applications down to low-level hardware power and fault models, collaborate towards the goal of low-power, high-performance, acceptable-quality computation, based on the characterization of the significance of computations and the exploitation of that information throughout the system stack. Figure 2.1 depicts the main layers of the SCoRPiO system.

Figure 2.1: Layering of the SCoRPiO architecture

The programming model allows programmers to tag computations according to their significance, to provide code for early detection of errors and low-cost approximation of results, and to set power, performance and quality-of-results constraints. The compiler implements the programming model and transfers the programmer-provided information to lower layers of the system. At the same time, it performs analysis to extract exploitable static information from the code.

The goal of automatic significance characterization is to substitute – or at least assist – the programmer in characterizing the significance of computations. The results of automatic significance characterization can be analyzed and abstracted to information that can be used by the compiler to tag the significance of computations, and to automatically generate result-checking code and approximate versions of the computations wherever applicable.

System software (runtime system and operating system) maps the computations onto hardware, according to their significance characteristics. Moreover, it is responsible for steering application execution and system configuration in the most appropriate direction in order to achieve user-specified goals.

The simulated hardware platform is the computational substrate on which system and application software is executed. It offers configurable modules which can function either reliably, or outside their normal power/performance envelope, thus offering very low-power execution at the expense of potential errors. At the same time, it provides a set of low-level error detection and/or correction mechanisms, as well as feedback to system software on its interaction with application code.

The bottom layer of the SCoRPiO stack is the circuit-level power and fault models. These models are developed with a combination of simulations and calibration based on real on-silicon prototypes. They predict power consumption, performance and fault rates at different configurations (voltage, frequency, etc.), external parameters (temperature) and phases of the lifetime of semiconductor products (aging). The simulation and calibration results are abstracted to the level of ISA instructions.

The rest of this chapter discusses each of the layers in more detail and outlines the progress achieved during the first year of the project.

2.2 Automatic Significance Characterization

In deliverables D1.1.1 and D1.1.2 of SCoRPiO a new significance characterization was formally defined, implemented and tested on an initial collection of example codes. To our knowledge, this is the first rigorous theoretical approach to quantify, or even define, the significance of numerical models and codes.

The significance analysis is based on Algorithmic Differentiation (AD) and Interval Arithmetic (IA), and requires the following input:

• A C/C++ code (which essentially can be considered as the implementation of a mathematical function y = f(x)).

• An interval (range) of input values [x].

• A significance bound ϵ ∈ IR, ϵ > 0.

Using this input as a starting point, the significance analysis should identify parts of the given code that are non-significant under certain significance conditions. An integral part of WP1 is the development of those significance conditions. Another important output of the significance analysis is simple conditions that can be used to quickly identify errors in the output of computations executed on non-reliable cores. Finally, another target is to automatically produce lower-precision, yet more efficient in terms of execution time and power consumption, versions of non-significant code, or even to replace such computations with constant or previously computed values.


2.2.1 Mathematical Representation of the Significance-Centric Programming Model

The significance-centric programming model developed in WP2 is task based, i.e. the given implementation of y = f(x) will compute the output y as a composition of K > 0 task results tk = gk(x), k = 1, . . . , K: y = h(g1(x), . . . , gK(x)). Often, the function h is some sort of reduction that composes a final scalar output y ∈ IR from the temporary task results tk, k = 1, . . . , K. Therefore we assume in this section that all involved functions f, gk (k = 1, . . . , K), and h are mappings from IR^n into IR.

In the context of SCoRPiO we are looking for efficient ways to decide whether the outputs of individual tasks tk = gk(x) are significant or not.
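
As a running illustration of this decomposition, consider the following hypothetical C++ program; the functions g1, g2 and h below are placeholders chosen for this sketch, not part of the SCoRPiO API:

    #include <cmath>
    #include <cstdio>

    // Two hypothetical task functions g_k and a reduction h,
    // so that y = f(x) = h(g1(x), g2(x)).
    static double g1(double x) { return std::sin(x); }        // task 1
    static double g2(double x) { return 0.001 * x * x; }      // task 2
    static double h(double t1, double t2) { return t1 + t2; } // reduction

    int main() {
        double x  = 0.5;                 // a single evaluation point
        double t1 = g1(x), t2 = g2(x);   // task results t_k = g_k(x)
        double y  = h(t1, t2);           // final output
        std::printf("y = %f\n", y);
    }

The significance question is then, for instance, whether the small contribution of t2 matters for the final output y over a given input range.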

2.2.2 Interval Arithmetic

Interval Arithmetic (IA) [10, 11] is a well-established concept that generalizes computations from numbers x ∈ IR to intervals (or ranges) of numbers [x] = [x̲, x̄] ⊆ IR with x̲ ≤ x̄. Intervals will be marked by square brackets, and an interval evaluation of f will be denoted by [y] = f[x]. IA is often used to compute verified solutions of problems that suffer from rounding errors, or to solve global optimization problems.

Usually, an evaluation of a function f : X → Y, y = f(x), with X ⊆ IR^n and Y ⊆ IR, computes the value y = f(x) for a single evaluation point x ∈ IR^n of the domain X of f, whereas an interval evaluation by IA computes guaranteed enclosures f[x] of all possible values of f over the input interval [x]: f[x] ⊇ {f(x) | x ∈ [x]} ⊂ Y.

In order to compute interval enclosures f[x] from intervals [x] for the input variable x and a given implementation of the function y = f(x), variables of intrinsic numeric types have to be re-declared with an interval data type, and all operations have to be replaced by interval versions.
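
A deliberately minimal sketch of such an interval type follows. It is illustrative only: a verified library such as filib++ (used in the project, see Section 2.2.7) additionally performs outward rounding, which this sketch omits:

    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    Interval operator+(Interval a, Interval b) { return {a.lo + b.lo, a.hi + b.hi}; }
    Interval operator*(Interval a, Interval b) {
        // The product range is bounded by the extreme endpoint products.
        double c[4] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
        return {*std::min_element(c, c + 4), *std::max_element(c, c + 4)};
    }
    double width(Interval a) { return a.hi - a.lo; }

    int main() {
        Interval x{0.9, 1.1};          // input range [x]
        Interval y = x * x + x;        // enclosure of f(x) = x^2 + x over [x]
        std::printf("[%f, %f], width %f\n", y.lo, y.hi, width(y));
    }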

2.2.3 Significance Based on Interval Arithmetic

The influence of the input interval [x] on the output [y] of the interval function [y] = f[x] can be measured by the interval value of the output [y]: a large width w([y]) = ȳ − y̲ of the output interval [y] indicates a strong dependency on the input interval [x], whereas a small width w([y]) points to a small influence of the input range [x] on the output y = f(x).

If w([y]) ≤ ϵ, the inputs x ∈ [x] from the interval [x] for the function f(x) can be characterized as non-significant.

Using the same rationale, one can judge the significance of the input x on the temporary task results tk = gk(x), k = 1, . . . , K, after an interval evaluation [y] = f[x], but not the significance of the task results tk for the final output y. To solve this problem with a direct (or forward) method like IA, all relevant task results tk have to be treated as independent pseudo-input variables. Depending on the function h, this approach might lead to a massive computational effort.

The adjoint mode of Algorithmic Differentiation (AD) can solve this problem more efficiently.

2.2.4 Algorithmic Differentiation

Algorithmic Differentiation (AD) [12, 3] is used in many areas of scientific computing to compute quantitative dependency information (sensitivities, derivatives). For a given implementation of a function y = f(x) with x ∈ IR, AD allows computing values of the derivative function f′(x) of f at the evaluation point specified by the input x of f: for x = a, AD computes the slope f′(a) = f′(x)|x=a of f at the point x = a. Note that AD is a semantic transformation of program code. It is neither symbolic differentiation, nor approximation of slopes by finite differences.

For f : IR^n → IR and x ∈ IR^n, the adjoint mode of AD allows to compute efficiently the gradient ∇x y = (∂y/∂x1, . . . , ∂y/∂xn)^T of a scalar output variable y = f(x) ∈ IR with respect to all inputs x ∈ IR^n by only one evaluation of an adjoint model of f(x), independently of the value of n. Moreover, derivatives ∇tk y of y with respect to all internal program variables, such as the task results tk, k = 1, . . . , K, are also available after the adjoint propagation.
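
A hand-written adjoint sketch makes the mechanism concrete; tools such as dco/c++ record and propagate this automatically, so the function below is only a worked toy example:

    #include <cmath>
    #include <cstdio>

    // Hand-written adjoint (reverse-mode) sketch for
    // y = f(x1, x2) = sin(x1) + x1 * x2.
    int main() {
        double x1 = 0.5, x2 = 2.0;

        // Forward sweep: record intermediate values.
        double t1 = std::sin(x1);
        double t2 = x1 * x2;
        double y  = t1 + t2;

        // Reverse sweep: seed dy/dy = 1 and propagate adjoints backwards.
        double y_b  = 1.0;
        double t1_b = y_b;                  // dy/dt1, available "for free"
        double t2_b = y_b;                  // dy/dt2, available "for free"
        double x1_b = t1_b * std::cos(x1)   // contribution through t1
                    + t2_b * x2;            // contribution through t2
        double x2_b = t2_b * x1;

        std::printf("y = %f, dy/dx1 = %f, dy/dx2 = %f\n", y, x1_b, x2_b);
    }

Note that the adjoints of the intermediates t1 and t2 appear as a by-product of the single reverse sweep, which is exactly the property exploited for task results above.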

2.2.5 Algorithmic Differentiation with Intervals

Combining IA and the adjoint mode of AD, one can obtain interval enclosures for the derivatives ∇[x][y] = (∂[y]/∂[x1], . . . , ∂[y]/∂[xn])^T of the output [y] with respect to the inputs [x], where the interval bounds of ∇[x][y] are formed by the minimal and maximal slopes of the function f(x) over the input range [x]. Due to the nature of adjoint-mode AD, interval derivatives ∇[tk][y] of the output [y] with respect to all task results tk, k = 1, . . . , K, are also available.

2.2.6 Significance Based on Interval Derivatives

To estimate the significance of a task result tk = gk(x) on the final value y = f(x) for a given input interval [x], we use interval arithmetic to compute the product of [tk] and its interval derivative ∇[tk][y]. This product is a linearization of h over the interval [tk] = gk[x], but also a worst-case estimation of the influence of tk on y = h(g1(x), . . . , gK(x)), taking into account the minimal and maximal slope of h over [tk]. The significance S(tk) of the task result tk = gk(x) is now simply defined as the width of this product.

To summarize, the approach to compute the significance of the task results tk, k = 1, . . . , K, is:

1. An interval evaluation of an adjoint model of y = f(x) = h(g1(x), . . . , gK(x)) is performed that

• computes enclosures of the task results [tk] = gk[x], k = 1, . . . , K, and the final value [y] from the specified input range [x],

• stores all relevant information for the propagation of interval derivatives backwards through the computation.

2. The adjoint propagation computes the derivatives ∇[tk][y] of the final value y with respect to the task results tk = gk(x), k = 1, . . . , K.

3. A task tk = gk(x), k = 1, . . . , K, is called non-significant over the input interval [x] if the significance of the task result tk is less than the provided significance bound ϵ > 0:

w([tk] · ∇[tk][y]) =: S(tk) ≤ ϵ, k = 1, . . . , K.    (2.1)
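
The test (2.1) is cheap once the analysis has produced the enclosures. The following self-contained sketch evaluates it for one task; the interval endpoints and the bound ϵ are made-up numbers standing in for analysis output:

    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    Interval mul(Interval a, Interval b) {
        double c[4] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
        return {*std::min_element(c, c + 4), *std::max_element(c, c + 4)};
    }

    int main() {
        Interval tk   {0.02, 0.03};   // enclosure [t_k] of the task result
        Interval d_tk {0.9, 1.1};     // interval derivative of y w.r.t. t_k
        double eps = 0.05;            // user-provided significance bound

        Interval p = mul(tk, d_tk);
        double S = p.hi - p.lo;       // S(t_k) = width of the product (2.1)
        std::printf("S(tk) = %f -> %s\n", S,
                    S <= eps ? "non-significant" : "significant");
    }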

2.2.7 Implementation

The significance analysis of WP1 is implemented in dco/scorpio, a specialization for an interval base type of the C++ template library dco/c++ (Derivative Code by Overloading for C++) [14, 8, 9] developed at STCE. The C++ template library filib++ [7, 6, 5], developed at the University of Würzburg, was used to provide interval data types and the associated overloaded operators and intrinsic functions.

In Chapter 3 we define an interface between the significance analysis and the compiler of the significance-centric programming model, which allows the evaluation of the significance condition (2.1) directly in the proposed programming model of WP2.


2.2.8 Status Summary

Table 2.1 summarizes the status of the different activities related to Automatic Significance Characterization at the end of M12. For more details please see D7.1.1.

Activity                                                                               Status
Mathematical foundation of significance                                                Finished
Implementation of significance characterization in DCO                                 Finished
Interface specification among compiler and DCO tool                                    Finished
Analysis of applications                                                               In progress
Exploitation of analysis results for compile-time optimization and characterization    In early stage

Table 2.1: Status of activities related to automatic significance characterization and analysis

2.3 Programming Model - Compiler

2.3.1 Programming Model

The programming model of SCoRPiO is based on OmpSs [2]. Computations are expressed in terms of tasks, which can be executed either sequentially or concurrently, according to their dependencies. Task dependencies are not specified explicitly by the programmer. Instead, they are identified automatically by analyzing the data flow between tasks, using a combination of both run- and compile-time techniques. The programming model is mainly based on pragmas, augmented with a light-weight run-time API.

SCoRPiO elevates the significance of computations to a first-class programming concern. Therefore, the programming model allows the developer to tag tasks with their significance. It should be noted that programmer-driven significance characterization is one of the two paths explored in SCoRPiO, the other being automatic – or at least assisted – characterization. Significant tasks will always be executed correctly. Non-significant tasks may be executed on less reliable hardware, at a lower power footprint, however with the possibility of errors.

Moreover, the programmer can express information on the expected ranges of input data. In the next revision of the programming model, the programmer will also be able to express performance, energy and quality-of-output constraints. This information is exploited both during the automatic significance analysis, to characterize tasks according to the criticality of their contribution to the quality of the final result, and at run time, to dynamically steer execution towards achieving the specified constraints.

One of the main goals of the programming model is the early detection and isolation of potential errors. The programmer can specify light-weight result-check functions, which are executed reliably at the end of each non-significant task that was computed on a non-reliable core. Result-check functions try to identify potential errors and limit their propagation throughout the computation by dropping the erroneous results, re-executing the task (typically at a different configuration), or substituting the results with values calculated from a simpler, approximate version of the code, or even with default values. Beyond checking the results of individual tasks, the programming model allows the developer to specify result-check functions at the granularity of groups of tasks as well, where the correctness of the “aggregate” result of the task group can be evaluated. We envision the generation of result-check functions as another area – beyond significance characterization – in which the programmer will be assisted by the outcome of the automatic significance analysis.
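
The concrete pragma syntax is not fixed in this document. The following is a hypothetical, OmpSs-style illustration of how a significance-tagged task with a result-check and an approximate fallback might look; the clause names significant, check and approximate, and all function names, are illustrative only:

    /* Hypothetical sketch, not the normative SCoRPiO syntax. */
    int  check_block(const double *dst, int n);              /* nonzero = plausible */
    void approx_block(const double *src, double *dst, int n);

    #pragma omp task in(src[0;n]) out(dst[0;n]) significant(0) \
                     check(check_block(dst, n)) approximate(approx_block(src, dst, n))
    void process_block(const double *src, double *dst, int n);

Here significant(0) would mark the task as non-significant (eligible for unreliable cores), the check clause would name the reliably executed result-check function, and the approximate clause would supply the cheap substitute used when the check fails.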

Unreliability has implications for synchronization primitives as well. For example, barrier synchronization may never be achieved should one of the participating tasks fail. In the context of SCoRPiO we support elastic synchronization: the runtime system may wait for just a percentage of the tasks participating in the barrier, or for a specific time window, before re-evaluating the situation to either wait somewhat longer or for more tasks, kill the unfinished tasks and continue after the barrier, or re-execute the non-reliable tasks participating in the barrier.

2.3.2 Compiler

The compiler developed in the context of SCoRPiO has manifold functionality:

First, it implements the programming model; it recognizes the SCoRPiO pragmas and lowers them to calls to the runtime system. This is a source-to-source transformation, implemented by extending the SCOOP [16] compilation infrastructure. The source-to-source part of the compiler also prepares programs to be analyzed by the DCO automatic differentiation tool, as discussed in Sections 2.2 and 3.1.

Another important role of the compiler is that of compile-time analysis. We evaluate polyhedral analysis [1] as a compile-time technique to associate computations with the respective data. This information can be used to implicitly estimate the significance of data and drive their placement in memory areas of different reliability and correction capabilities, to estimate the memory footprint of each task, and to automate memory transfers between the host memory and the local memory of accelerator cores.

As mentioned earlier, non-reliable tasks can be executed on unreliable cores. However, certain operations need to be protected, even when they belong to non-reliable tasks. Two typical examples are operations affecting the control flow of the code, and address generation / pointer arithmetic. Errors in these operations will typically result in crashes. Therefore, we plan to extend the ISA of the target accelerator core architecture with operations which will be protected – and thus executed reliably – even on cores configured to operate at non-reliable design points. The compiler will perform code slicing to extract the slices of code that affect control flow and address-generation instructions. These slices will then be implemented using protected instructions. All analysis passes are implemented by extending the LLVM [4] compiler infrastructure. LLVM offers a rich intermediate representation which, combined with its modular architecture, facilitates the implementation of complex analysis techniques.

The final task of the compiler is to generate binaries, for both the host and the accelerator cores. To that end, we use the native compilers for the respective architectures. We will, however, need to modify the compiler for the accelerator cores in order to support the extended ISA.

2.3.3 Status Summary

Table 2.2 summarizes the status of the different activities related to the programming model and compiler support at the end of M12. For more details please see D7.1.1.

Activity                                                                               Status
Specification of significance-centric programming model                                Finished (subject to refinement during implementation and evaluation)
Programming model support in the compiler                                              In progress (advanced stage, ahead of schedule)
Interface specification among compiler and DCO tool                                    Finished
Exploitation of analysis results for compile-time optimization and characterization    In early stage
Memory access pattern analysis                                                         In progress
Extended ISA (with protected instructions) support                                     In early stage
Porting of applications                                                                In progress

Table 2.2: Status of activities related to programming model and compiler support


2.4 System Software

2.4.1 Runtime System

The runtime system (RTS) is the layer of the SCoRPiO stack which orchestrates the execution of significance-aware applications on the underlying hardware. At the same time, the runtime system dynamically reconfigures hardware parameters (e.g. the ratio of unreliable / power-efficient to reliable / power-demanding cores) in order to execute applications within user-specified power, performance and quality-of-results constraints. We extend BDDT [15], a run-time system designed for task-parallel computation, in order to support significance-based computing.

BDDT was initially designed for dynamic dependence analysis between tasks. The programmer defines the directionality of the task arguments, in or out (i.e. whether the argument is read or written during task execution), and their size. At task issue time, the runtime analyzes dependencies between tasks and suspends the execution of a task until all its dependencies are resolved (implying that all data updates on which the task depends have been committed to memory by tasks issued earlier). BDDT implements dependence analysis at the granularity of virtual memory blocks, using a custom memory allocator. The size of memory blocks is configurable and controls a trade-off between the accuracy and the performance of dynamic dependence analysis. BDDT supports block sizes as small as 2 bytes, which enable accurate dependence analysis even at the sub-word level, at the cost of very high runtime overhead, as the dependence analyzer must scan all memory blocks that compose each writeable object to detect a potential dependence. On the other hand, the programmer has the option to use application-specific block sizes, which guarantee accuracy and minimize overhead. In numerical linear algebra libraries, for example, the block sizes used in BDDT should coincide with the array tile sizes used by the library to balance parallelism and locality. The runtime system assigns metadata to every block, which describe the directionality of the block and the tasks that access it. The blocks that compose each memory object in the program are located with O(1) overhead during the analysis.
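
The O(1) block lookup can be understood from the following sketch, which assumes the custom allocator carves memory from a contiguous arena with one metadata record per fixed-size block; the names and layout are illustrative, not BDDT's actual internals:

    #include <cstdint>
    #include <vector>

    struct BlockMeta {
        uint8_t  last_dir;   // last directionality seen (0 = in, 1 = out)
        uint32_t last_task;  // id of the last task that touched the block
    };

    struct Arena {
        uintptr_t base;                  // start of the allocator's region
        size_t    block_size;            // configurable, e.g. 2 bytes .. tile size
        std::vector<BlockMeta> meta;     // one record per block

        BlockMeta& lookup(const void* p) {
            // Pointer arithmetic maps any address to its block record in O(1).
            size_t idx = (reinterpret_cast<uintptr_t>(p) - base) / block_size;
            return meta[idx];
        }
    };

    int main() {
        Arena a{0x1000, 64, std::vector<BlockMeta>(1024)};
        void* p = reinterpret_cast<void*>(0x1000 + 130);  // falls in block 2
        a.lookup(p).last_task = 7;                        // O(1) metadata update
    }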

The original design of BDDT was extended to support the significance-based programming paradigm. Figure 2.2 outlines the architecture of the runtime system.

Memory Management

Two memory reliability domains will be available, and applications will have the option to allocate memory from the required domain on demand.

Scheduling

One of the main extensions of BDDT concerning execution control relates to scheduling techniques. The scheduling algorithm of the runtime system must be adjusted to incorporate the notion of significance. In that direction, the runtime differentiates between reliable and unreliable cores (workers). Unreliable cores operate in subnominal areas of the voltage / frequency envelope. They are characterized by low power consumption, at the expense of potential faults. Such cores execute only tasks marked as non-significant. On the other hand, significant tasks will always be executed reliably, by cores operating at a nominal voltage / frequency point, thus being protected from faults, however at a higher power consumption.

Load balancing techniques within the runtime are also redesigned in the context of significance-based execution. Unreliable workers are not allowed to steal work from the queues of reliable workers. However, the opposite may happen, if the analytical model suggests so in order to boost performance and balance the workload of different cores.


Figure 2.2: Significance-aware architecture of the runtime system

System Configuration

The decisions of the scheduler will be based on the predictions of an analytical model, which will try to derive optimal system configurations, task-to-core mappings, and data-to-memory mappings for user-defined optimization metrics, such as performance or energy consumption. The analytical model will take as input the number of significant and non-significant tasks, and produce suggestions for the optimal configuration of hardware (ratio of reliable to unreliable workers) and software (ratio of tasks that will be executed unreliably). For example, the model will decide if the number of tasks that execute unreliably should be less than the number of tasks marked as non-significant, when this improves performance.

Synchronization

Synchronization primitives are also altered to support relaxed synchronization. The runtime implements barrier-type synchronization at the granularity of task groups. The barrier can wait either for a programmer-defined ratio of the unreliable tasks to complete, or for a specified period of time, before continuing the execution. Whenever the master thread reaches a synchronization point, it sets a global state for the group being synchronized. Whenever a worker commits a task, it checks whether the required ratio of unreliable tasks has been achieved, provided that all significant tasks of the group have already committed. If this is the case, it informs the master thread, which in turn initiates the cleanup process. During the cleanup, workers remove the non-executed tasks of the group from their queues. The behavior is similar when a task-group barrier watchdog timer goes off.
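
A minimal sketch of such an elastic, task-group barrier follows. It is a simplification of the behavior described above (the structure and names are ours, not the SCoRPiO runtime's): release when a given ratio of the group's tasks has committed, or when a timeout expires; a real runtime would additionally verify that all significant tasks committed and then trigger the cleanup:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    struct ElasticBarrier {
        std::mutex m;
        std::condition_variable cv;
        int total = 0, done = 0;

        void commit() {                        // called by a worker per finished task
            std::lock_guard<std::mutex> lk(m);
            ++done;
            cv.notify_all();
        }
        // Returns true if the ratio was reached, false on timeout (watchdog case).
        bool wait(double ratio, std::chrono::milliseconds timeout) {
            std::unique_lock<std::mutex> lk(m);
            return cv.wait_for(lk, timeout,
                               [&] { return done >= ratio * total; });
        }
    };

    int main() {
        ElasticBarrier b;
        b.total = 10;
        // Worker threads would call b.commit(); here nothing commits, so the
        // master times out and would start the cleanup process.
        b.wait(0.8, std::chrono::milliseconds(50));
    }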


Programmer-directed Error Detection and Correction

Finally, the runtime supports the execution of result-check functions at the end of individual tasks or task groups. Result-check functions are executed reliably, either by configuring the processor accordingly or, more often, by implementing them with protected, reliable instructions at compile time.

2.4.2 OS Support

In the SCoRPiO software stack, the operating system executes on the host processor only. We opt to use GNU/Linux and to reuse its established facilities and interfaces as much as possible. The main extensions required relate to integrating the accelerators with the system, interacting with them, and taking into account their peculiarities in case cores are in a non-reliable operation mode.

More specifically, the OS is extended in order to be able to:

• Identify the topology of the system (cores and memory).

• Configure cores at specific voltage / frequency steppings. Some of the steppings are within the normal working envelope of the core, and thus the core operates reliably, whereas others are aggressive, resulting in potentially unreliable operation.

• Configure the power consumption / reliability of different memory regions on the accelerators, as well as the access rights to each region.

• Monitor the execution in terms of power consumption and the frequency of detectable errors.

• Handle serious, non-recoverable errors on the accelerator cores.

• Provide primitives to the higher system and application software layers for thread manipulation, synchronization and data transfers.

The respective interfaces are discussed in more detail in Chapter 5.

2.4.3 Status Summary

Table 2.3 summarizes the status of the different activities related to system software support at the end of M12. For more details please see D7.1.1.

Activity                                                                          Status
Run-time support for the significance-based programming model                     In progress (advanced stage, ahead of schedule)
Interface between the compiler and the runtime                                    Finished
Interface between system software and hardware                                    In progress (advanced stage)
Analytical models for scheduling tasks to optimize power, energy or performance   In progress
Energy accounting and control at the task level                                   Finished

Table 2.3: Status of activities related to system software support


Figure 2.3: Target SCoRPiO architecture

2.5 Simulator

The template of the many-core architecture serving as the hardware layer of the SCoRPiO project is depicted in Figure 2.3. This architecture will be implemented in the context of a full-system functional simulator. The target architecture is composed of a fully reliable ARM-based processor (the host) and a many-core accelerator. The accelerator will be modeled and enhanced to support the significance-driven paradigm. The ARM general-purpose processor runs a full-fledged operating system and serves as the coordinator of the system, while the many-core accelerator is used to increase the speed and/or the power efficiency of compute-intensive applications or parts of them.

The ARM processor and the accelerator share the main memory, which is used as the communication medium between the two. The basic component of the proposed target many-core accelerator template is a cluster of cores connected via a local, fast interconnect to the memory subsystem. The accelerator consists of one or more clusters, connected via a Network-on-Chip. The accelerator target architecture features a configurable number of simple RISC cores (i.e. 32-bit OpenRISC), with a private or shared L1 I-cache architecture. All cores within a cluster share a Tightly Coupled Data Memory (TCDM) accessible via a local interconnection.

The local interconnection is, from a behavioral point of view, a parametric Mesh-of-Trees (MoT) interconnection network (also called a logarithmic interconnect), designed to support high-performance communication between processors and memories. The module is intended to connect the processing elements to a multi-banked memory for both data and instructions. Logarithmic-interconnect data routing is based on address decoding: a first stage checks whether the requested address falls within the local memory address range or has to be directed to the main memory. To increase module flexibility this stage is optional; it enables explicit L3 data access on the data side while, on the instruction side, it can be bypassed, letting the cache controller take care of L3 memory accesses for line refills.

The interconnect provides fine-grained address interleaving on the memory banks to reduce banking conflicts in case of multiple accesses to logically contiguous data structures. The crossing latency is one clock cycle. In case of multiple conflicting requests, a round-robin scheduler arbitrates access to guarantee fairness across memory banks; a higher number of cycles is then needed, depending on the number of conflicting requests, with no latency in between. In case of no banking conflicts, data routing is done in parallel for each core, thus enabling sustainable full bandwidth for processor-memory communication. To reduce memory access time and increase shared-memory throughput, read broadcast has been implemented, at zero penalty compared with a simple read.
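
Fine-grained (word-level) interleaving maps consecutive words to consecutive banks, which is what spreads accesses to contiguous data across the TCDM. A small sketch with assumed parameters (word size and bank count are illustrative, not the simulator's configuration):

    #include <cstdint>
    #include <cstdio>

    constexpr unsigned WORD_BYTES = 4;   // assumed word size
    constexpr unsigned NUM_BANKS  = 16;  // assumed bank count

    unsigned bank_of(uint32_t addr)        { return (addr / WORD_BYTES) % NUM_BANKS; }
    unsigned offset_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }

    int main() {
        // Consecutive words land in consecutive banks, so a linear sweep
        // over a contiguous array touches all banks before reusing one.
        for (uint32_t a = 0; a < 8 * WORD_BYTES; a += WORD_BYTES)
            std::printf("addr 0x%02x -> bank %u, row %u\n",
                        a, bank_of(a), offset_in_bank(a));
    }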

On the data side, an L1 multi-ported, multi-banked TCDM is directly connected to the logarithmic interconnect. The number of memory ports is equal to the number of banks, to allow concurrent access to different memory locations. Once a read or write request is brought to the memory interface, the data is available on the negative edge of the same clock cycle, leading to a two-cycle latency for conflict-free TCDM access. As already mentioned above, if conflicts occur there is no extra latency between pending requests; once a given bank is active, it responds with no wait cycles.

The L1 instruction cache basic block has a core-side interface for instruction fetches and an external memory interface for refills. The inner structure consists of the actual memory and the cache controller logic managing the requests. The module is configurable in its total size, associativity, line size and replacement policy (FIFO, LRU, random). The basic block can be used to build different instruction cache architectures, such as private or shared.

2.5.1 Power Estimation and Fault Injection

One of the most important features of the SCoRPiO simulator platform is its ability to realistically estimate power consumption and to inject errors whenever a core operates outside its normal voltage / frequency envelope. Both power estimation and fault injection are based on models developed by studying circuit behavior at the gate level and generalizing this information to the level of instructions. This process is discussed in Section 2.6.

2.5.2 Status Summary

Table 2.4 summarizes the status of the different activities related to hardware modeling and simulation at the end of M12. For more details please see D7.1.1.

Activity                                                                                               Status
Interface between system software and hardware                                                         In progress (advanced stage)
Implementation of a many-core simulated system architecture supporting significance-based execution    In progress (advanced stage)
Extended ISA (with protected instructions) support                                                     In early stage

Table 2.4: Status of activities related to hardware modeling and simulation

2.6 Power and Error Modeling

2.6.1 Characterization and Propagation of Hardware Behavior at the Software Level

The shrinking of device dimensions and the resulting static and temporal variations in transistor characteristics lead to timing and static-noise-margin variations in the computational and storage components of today's systems. Such variations may result in delay and memory failures, which depend highly on the supply voltage and other dynamically changing environmental conditions such as temperature. The main goal of SCoRPiO is to explore the impact of such hardware-induced failures on system operation, and to develop mechanisms at the software and hardware level that synergistically exploit the error resiliency of various applications for tolerating such faults. In order to develop such mechanisms, we need to propagate the behavior of the underlying hardware to the software layer and study its impact on application and system performance. Note that the effectiveness of such mechanisms will depend on the accuracy of the behavioral and energy models.

To achieve maximum accuracy, we opt to derive the high-level models from the gate-level representation of a RISC processor, departing from the majority of existing works that rely on injecting faults at the RTL representation without a true connection to the physical realization of the target architecture. To achieve a balance between simulation time and accuracy, the high-level models are developed at the granularity of an instruction, which is the main primitive within instruction set architecture simulators.


Figure 2.4: Flow for extracting instruction-level failure rate and energy models.

However, the extraction of instruction-level failure rate (IFR) and energy (IE) models requires extensive Monte Carlo gate-level simulations of the target core under different operating conditions, and a large set of representative codes with a variety of instruction sequences and data patterns. Our characterization procedure, shown in Figure 2.4, begins with the profiling of representative sets of code from various computational kernels and the extraction of frequent data patterns and instruction sequences. Based on these, we perform several Monte Carlo simulations at the gate level using the implemented processor core, and we obtain the corresponding effective (dynamic) delay distribution of all the computational paths. This approach makes it possible to move from SPICE-level variability data to single-cell variability models in the library files, and ultimately to critical path delay distributions, as shown in Figure 2.5. Note that the whole flow depicted in Figure 2.4 needs to be repeated after every microarchitectural enhancement of the core. This allows capturing any potential change in the behavior and energy of each instruction that might occur after each enhancement of the single core, allowing accurate fault and energy modeling at the different stages of the project.

By performing such an analysis we can capture both the dependence of the circuit delay on the processed operands and the non-uniform length of the critical paths within each computational stage. In other words, we are able to capture the different degree of vulnerability to variations of each instruction, depending on the way the instruction stresses the non-uniform critical paths across the different pipeline stages.

Figure 2.5: From (a) the SPICE model of a NAND2 gate, to (b) the delay distribution of a NAND2 gate due to only local variations (blue) or local+global variations (green), to (c) the critical path delay of two ASICs at two different supply voltages. Panel (c) shows that lower voltages result in longer delays and a wider delay distribution.

Figure 2.6: Performance degradation over time for 28nm planar and 14nm FinFET technology

IFRi(cdj) for each instruction i under the different considered parameters cdj (i.e. voltage, temperature) can be quantified as the total number of violated cycles over the total simulated cycles for the instruction i. IFR can also be defined at a finer granularity, with respect to the failure probabilities of the individual stages, as:

IFRi = ( ∑s∈S IFRs,i ) / Li    (2.2)

where Li is the total number of cycles executed for instructions of type i, and IFRs,i is the failure probability of instruction i at stage s, quantified as the total number of violated cycles in stage s over the total number of cycles in stage s for the instruction i.
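
For concreteness, a small sketch of the aggregate definition (violated cycles over total simulated cycles for instruction type i); the counts below are made-up placeholders standing in for gate-level Monte Carlo output:

    #include <cstdio>
    #include <vector>

    int main() {
        // violated[s] = timing-violated cycles observed in pipeline stage s
        // for instruction type i during gate-level simulation.
        std::vector<long> violated = {3, 0, 12, 1, 0};
        long cycles_i = 100000;   // L_i: total simulated cycles for type i

        long sum = 0;
        for (long v : violated) sum += v;     // accumulate over stages s in S
        double ifr_i = static_cast<double>(sum) / cycles_i;
        std::printf("IFR_i = %g\n", ifr_i);
    }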

For producing the instruction-level energy models, we use the extracted representative data patterns and perform power analysis at different operating conditions. Specifically, for each instruction we perform several Monte Carlo simulations with relevant data patterns, and we average the energy dissipation per clock over all the executed clock cycles.

2.6.2 Reliability Modeling

Beyond variability, we also take reliability into account in our models. Moving to FinFET at 14nm technology and beyond, transistor aging can no longer be neglected when designing a digital circuit. This is mainly due to the increased importance of an effect called Bias Temperature Instability (BTI), which can cause charges to get trapped in the transistor's gate. A BTI model has been designed and calibrated with silicon data of research-grade 14nm FinFET devices. Based on these models and ring oscillator test circuits, we have evaluated the impact of BTI on 14nm performance. As Figure 2.6 shows, while the 14nm FinFET transistors initially have a lower spread in performance, over their lifetime the devices will degrade significantly more than 28nm planar transistors. Using these degradation numbers, the BTI delay degradation is integrated into the variability model as derating factors, which allow translating the variability data of a design in 28nm into a reasonable variability estimate for 14nm. At this point we would also like to mention that care has been taken to ensure that the gate-level models and the derating factors can be transferred between the groups of the consortium, even in cases where the technology vendor and the available design libraries do not match completely.


2.6.3 Status Summary

Table 2.5 summarizes the status of the different activities related to power and error modeling at the end of M12. For more details please see D7.1.1.

Activity                                                                      Status
Probabilistic fault models of hardware components - silicon validation       Finished
Identification of error-prone hardware components                            In progress
Implementation of an OpenRISC core at 28nm                                   Finished
OpenRISC timing profile optimization                                         In progress (advanced stage)
Evaluation of power efficiency / reliability design points (for memories)    In progress

Table 2.5: Status of activities related to power and error modeling

2.7 Interfaces

As we mentioned earlier, the effective collaboration between different system layers is a prerequisite for the success of SCoRPiO. Interfaces need to be defined and implemented in order to allow the flow of information across consecutive layers. Figure 2.7 outlines the three major interfaces of SCoRPiO.

The main interfaces are the following:

Figure 2.7: Main interfaces of the SCoRPiO architecture


1. The interface between the automatic significance characterization & analysis methods and the programming model & compiler infrastructure. The compiler needs to prepare source code so that it can be analyzed by the automatic characterization tools. Similarly, the compiler needs to access the outcome of the significance analysis and characterization process. This interface is described in Chapter 3.

2. The interface between the compiler and system software. The compiler has to lower applications to the API offered by the runtime system. Moreover, part of the API of the runtime (specifically, the functionality for monitoring and configuring application execution) needs to be directly available to the programmer. Chapter 4 discusses this interface in detail.

3. The interface between hardware and system software. System software needs to be able to identify, configure and use hardware resources. Moreover, system software acquires valuable feedback on the status of the system, the utilization of resources, and the hardware/software interaction. Chapter 5 is dedicated to defining the hardware/software interface.


Chapter 3

Interface between Automatic Significance Characterization and Programming Model/Compiler

Figure 3.1 depicts the workflow of automatic (or assisted) significance characterization. To use the significance analysis functionality provided by dco/scorpio, the compiler/programmer has to:

1. Start from the initial version of source code, with just task annotations.

2. Use a source-to-source pass of the SCoRPiO compiler to create an annotated version of the original source code, appropriate for dco/scorpio (Section 3.1).

3. Compile the annotated source we want to analyze for significance characterization with a C++ compiler, and link the object code against dco/scorpio. Running the executable will generate an output file containing the results of the significance analysis (Section 3.2).

Figure 3.1: Workflow of automatic (or assisted) significance characterization

4. Read the output file and make use of the significance information (Section 3.3). This part can be either programmer-driven, or undertaken by a specialized tool, external to dco/scorpio and the SCoRPiO compiler, that uses the significance output to (a) classify each task as either significant or non-significant, (b) produce result-check functions, and (c) suggest default values or approximate versions of the computation for failing tasks.

The output of the process is a significance-annotated source code, ready to be compiled by the SCoRPiO compiler.

Section 3.1 describes the interface between the compiler and the significance analysis. Access to the significance analysis results is described in Section 3.2, and Section 3.3 discusses the transfer of information from the analysis back to the compiler.

3.1 Programmer / Compiler to Significance Analysis Interface

In this section we describe the interface between the compiler and dco/scorpio, that is, the annotations necessary to perform a significance analysis, together with the syntax of the dco/scorpio run-time support.

3.1.1 Header Files

#include "dco_scorpio.hpp"
#include "scorpio_info.hpp"

The first header file makes the data type dco::ia1s::type of dco/scorpio available, together with overloaded versions of the arithmetic operators and intrinsic functions. This data type allows computing the interval values and adjoints required to compute the significance of variables, as proposed in [13].

The second header file introduces the main data structure for the significance analysis: the class scorpio_info is a container class for significance entries of variables (class scorpio_info_item). The interface described in the remainder of this chapter is defined by scorpio_info.

3.1.2 Variable Type Change

The overloaded arithmetic of dco/scorpio is activated by changing the data type of the variables relevant for the significance analysis to dco::ia1s::type, the data type provided by dco/scorpio. Relevant are all variables that depend on the input variables (their value is computed from the input values) and are used to compute the final output of the code.

Besides changing the type of the program variables to dco::ia1s::type, the programmer should also initialize them using an input range. See Section 3.1.4 for details.
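
As a minimal sketch (the variable names are ours, not from the project sources), the type change for analysis-relevant scalars looks as follows; the initialization with an input range is then performed through the registration macros of Section 3.1.4:

// Before: double x, t, y;
dco::ia1s::type x, t, y;   // analysis-relevant variables now carry intervals and adjoints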

3.1.3 Variable Declarations

The significance analysis requires at least one instance of class scorpio_info. In Listing 3.1, sivar is an instance of the scorpio_info class, and descr is an optional C string (char*) describing the intended significance analysis.


scorpio_info sivar( [descr] );

Listing 3.1: Instantiating scorpio_info

3.1.4 Registering Variables

Class scorpio_info was created to store significance information of variables. Instead of using member functions of scorpio_info directly, we recommend using three groups of pre-processor macros to register variables in the significance framework of dco/scorpio (discussed in the following sections). These macros are designed as wrappers for methods of class scorpio_info, and allow the following information to be generated automatically:

• Name of the variable

• File name where the variable is registered (more specifically: where the macro is used)

• Line number where the variable is registered.

Note that variable names are stored as given to the macro; e.g., array indices will not be expanded (that is, x[i] for i=1 will be stored as "x[i]" and not as "x[1]").

Up to six task specifiers t1, t2, ..., t6 of type int can be used to mark variables by task numbers, loop indices, array indices or global counter values. Later, during the significance evaluation by the compiler of the significance-centric programming model (or a special converter tool), this information can be used to assign significance output to the compiler-internal representation of task results.

Registering Inputs

Input variables have to be introduced to the significance analysis by registering them in the scorpio_info instance created previously. Two pre-processor macros are available to register scalar inputs.

SIINPM( sivar, var, lb, ub, t1, [t2, ..., t6] );
SIINPMD( sivar, var, lb, ub, descr, t1, [t2, ..., t6] );

Non-scalar input variables (arrays) must be registered element-wise.1

The ScorpioInfo-INPut-Macro SIINPM takes the following arguments:

Argument     Description                                Data type
sivar        Name of the scorpio_info variable          scorpio_info
var          Name of the input variable                 dco::ia1s::type
lb           Lower bound of input range                 double, int
ub           Upper bound of input range                 double, int
t1           Task specifier 1 (mandatory)               int
t2,...,t6    Additional task specifiers (optional)      int

Macro SIINPMD takes a description (char*) of the variable as an additional mandatory argument (in front of the first task identifier t1).

1This is the case at the moment; proper interfaces for higher-dimensional objects will be provided at a later stage of the project.


Registering Intermediate Values

Scalar intermediate variables can be added to the analysis framework by two pre-processor macros.

SIADDM( sivar, var, t1, [t2, ..., t6] );
SIADDMD( sivar, var, descr, t1, [t2, ..., t6] );

Non-scalar intermediate variables (arrays) must again be registered element-wise.

The ScorpioInfo-ADD-Macro SIADDM takes the following arguments:

Argument     Description                                Data type
sivar        Name of the scorpio_info variable          scorpio_info
var          Name of the intermediate variable          dco::ia1s::type
t1           Task specifier 1 (mandatory)               int
t2,...,t6    Additional task specifiers (optional)      int

Macro SIADDMD takes a description (char*) of the variable as an additional mandatory argument (in front of the first task identifier t1).

Note that task results should be registered as intermediate variables if they are not also final outputs of the complete computation.

Registering Final Outputs

Final output variables are the outputs of the complete computation. We are interested in the significance of task results (as intermediate values) for these final outputs.

Two pre-processor macros allow registering scalar final outputs in the scorpio_info instance.

SIOUTM( sivar, var, sfct, t1, [t2, ..., t6] );
SIOUTMD( sivar, var, sfct, descr, t1, [t2, ..., t6] );

Non-scalar output variables (arrays) must again be registered element-wise. Note that if more than one final output is registered, only the aggregate significance of intermediate variables can be computed.

The ScorpioInfo-OUTput-Macro SIOUTM takes the following arguments:

Argument     Description                                Data type
sivar        Name of the scorpio_info variable          scorpio_info
var          Name of the final output variable          dco::ia1s::type
sfct         Significance scaling factor                double
t1           Task specifier 1 (mandatory)               int
t2,...,t6    Additional task specifiers (optional)      int

Macro SIOUTMD takes a description (char*) of the variable as an additional mandatory argument (in front of the first task identifier t1).

3.1.5 Triggering the Analysis

The significance analysis of the data stored in a scorpio_info instance is activated by the method analyze of class scorpio_info.


sivar.analyze()

where sivar denotes an instance of scorpio_info.

After this call, significance values for all registered variables are available.
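
To make the flow above concrete, the following minimal sketch registers one input, one intermediate (task result) and one final output, and then triggers the analysis. The computation and the variable names are hypothetical; only the dco/scorpio types and macros are taken from the interface described in this chapter. The program is compiled with a C++ compiler and linked against the dco/scorpio library.

#include "dco_scorpio.hpp"
#include "scorpio_info.hpp"

int main() {
    scorpio_info sivar( "toy significance analysis" );

    dco::ia1s::type x, t, y;            // analysis-relevant variables

    SIINPM( sivar, x, 0.0, 2.0, 0 );    // input x with range [0, 2], task specifier 0

    t = 3.0 * x + 1.0;                  // task result
    SIADDM( sivar, t, 1 );              // registered as an intermediate, task specifier 1

    y = t * t;                          // final output of the computation
    SIOUTM( sivar, y, 1.0, 2 );         // scaling factor 1.0, task specifier 2

    sivar.analyze();                    // significance values are now available
    return 0;
}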

3.1.6 Multiple Analysis Runs

In the case of more than one final output, individual significance analysis runs for every output can be performed with just one scorpio_info instance. To do so, the significance computed for one output has to be reset; then the next output can be registered and the analysis triggered again.

To reset the significance values in a scorpio_info instance, the method reset of class scorpio_info is used:

sivar.reset()

where sivar denotes an instance of scorpio_info.
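
A sketch of this pattern, continuing the hypothetical example above with two final outputs y1 and y2 (we assume that after reset only the new output needs to be registered; whether previously registered outputs must be re-registered is not specified here):

SIOUTM( sivar, y1, 1.0, 2 );   // first final output
sivar.analyze();
// ... store or process the significance results for y1 ...
sivar.reset();                 // clear the computed significance values
SIOUTM( sivar, y2, 1.0, 3 );   // second final output
sivar.analyze();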

3.2 Access to the Results of Significance Analysis

3.2.1 Access to Information of scorpio info

scorpio_info was designed as a container class for significance information, in the tradition of container classes from the STL (Standard Template Library). Usually, only the following methods of scorpio_info are used:

Method signature                           Description
scorpio_info_item operator[]( int n )      returns a copy of the n-th entry
int size()                                 number of scorpio_info_item entries stored
void clear( void )                         frees the memory used internally

Nevertheless, other methods of the STL container class interface have been implemented too:

void push_back( const scorpio_info_item & si );

int size( void ) const ;

int capacity() const ;

void reserve( int n ) ;

void resize( int n ) ;

//------------------

// iterator support

//------------------

std::vector<scorpio_info_item>::iterator begin() ;

std::vector<scorpio_info_item>::iterator end() ;

std::vector<scorpio_info_item>::reverse_iterator rbegin() ;

std::vector<scorpio_info_item>::reverse_iterator rend() ;

Moreover, for an easy and quick output of significance results, the C++ stream output operator << has been overloaded too:

std::ostream& operator<<( std::ostream &s, scorpio_info &sis);


3.2.2 Access to Information of scorpio info item

Individual significance entries stored in an instance of scorpio_info can be retrieved by the element access operator []. If si1 is an instance of scorpio_info with at least one scorpio_info_item stored, the following code would access the first element (with index 0):

si1[0]

Listing 3.2: Access to individual entries of the significance information.

The following methods of scorpio_info_item can be used to retrieve the significance information from significance entries:

Method signature                Description
double significance()           significance value, default criterion
double significance( int i )    significance value, i-th criterion
interval value()                interval value
interval adjoint()              interval adjoint
std::string name()              name and description (joined)
std::string file()              file name of registration
int line()                      line number of registration
int tsk1()                      1st task identifier
int tsk2()                      2nd task identifier
int tsk3()                      3rd task identifier
int tsk4()                      4th task identifier
int tsk5()                      5th task identifier
int tsk6()                      6th task identifier

Moreover, for an easy and quick output of significance results, the C++ stream output operator << has been overloaded too:

std::ostream& operator<<( std::ostream &s, const scorpio_info_item &si);
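
As an illustration, a small helper (ours, not part of dco/scorpio) that walks the stored entries and prints their significance, using only the access methods listed above:

#include <iostream>
#include "scorpio_info.hpp"

void dump_results( scorpio_info &sivar ) {
    for ( int n = 0; n < sivar.size(); ++n ) {
        scorpio_info_item si = sivar[n];      // copy of the n-th entry
        std::cout << si.name() << " (task " << si.tsk1() << "): "
                  << si.significance() << std::endl;
    }
}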

3.3 Significance Analysis to Programmer / Compiler Interface

3.3.1 Through scorpio info

The class scorpio_info was designed as an abstraction layer between the significance analysis itself and the compiler of the significance-centric programming model of WP2. Instances of scorpio_info can operate in two basic modes: significance computation/storage, and significance retrieval from external files.

The generated significance information can be stored in an external file via the method write_to_file of class scorpio_info:

sivar.write_to_file( fname )

where sivar is an instance of scorpio_info, and fname the name of the file to create (char* or std::string).

To access the significance information stored in an external file, the compiler of the significance-centric programming model (or a special converter tool) has to create an instance of scorpio_info too, but it does not need to activate the significance analysis part:


scorpio_info sivar( false, [descr] );

Listing 3.3: Instantiating scorpio_info for significance retrieval

where sivar is the identifier for the scorpio_info instance. The first argument false of the constructor prevents activation of the significance analysis. descr is an optional C string (char*) describing the significance results, which can be read from an external file by the method read_from_file:

sivar.read_from_file( fname );

Here sivar is an instance of scorpio_info, and fname the name of the file to read (char* or std::string).

After the retrieval of the significance information via the read_from_file() method, the access functions discussed in Section 3.2 can be used to access the significance data from within the compiler or the converter.
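
A sketch of the retrieval side, using only the methods documented above; the file name and the processing loop are illustrative:

scorpio_info sivar( false, "retrieval of stored results" );   // analysis disabled
sivar.read_from_file( "significance.out" );
for ( std::vector<scorpio_info_item>::iterator it = sivar.begin();
      it != sivar.end(); ++it ) {
    // classify tasks, generate result-check functions, etc.,
    // based on it->significance() and the task identifiers
}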

3.3.2 Directly from the Binary File

As an alternative to using the abstract interface discussed in Section 3.3.1, tools may opt to read the analysis information directly from the binary file. The content of the binary file adheres to the following hierarchical scheme:

task(source file, source line, group id, task id) At the top level lies the task construct. It uniquely identifies a task by its position in a source file (source file and source line), as well as by a task id and group id, which uniquely identify a task in the context of a single application run. The task construct essentially envelopes all significance information pertaining to the execution of a single task, in the form of the following two containers.

arguments scalar The arguments scalar container holds information about the scalar arguments of a task. There is exactly one such construct in each task container.

argument A scalar argument is uniquely identified by its name in the source code. It contains four fields, in the exact order described below.

name The argument name.

type Type can be either IN, OUT or AUXILIARY. The first two values indicate that the specific scalar argument is involved in the data flow graph of the tasks, whereas AUXILIARY simply states that the value of this scalar parameter might be read.

value(lower bound, upper bound) The value field contains the lower and upper bounds of the parameter value.

adjoint(lower bound, upper bound) The adjoint field contains the lower and upper bounds of the parameter adjoint value.

arguments array The arguments array container holds information about the array arguments of a task. The DCO essentially manages arrays as multiple scalar variables; therefore the DCO analysis will provide information for all elements of an array. For each array argument there is exactly one construct in each task container.

argument An array argument is uniquely identified by its name in the source code. It contains four fields, in the exact order described below.

name The argument name.

type Type can be either IN, OUT or AUXILIARY. The first two values indicate that the specific array argument is involved in the data flow graph of the tasks, whereas AUXILIARY simply states that the elements of this array parameter might be read.


index information The index information describes the number of dimensions of the array and defines the bounds of each index-space dimension.

dimension multitude Specifies the number of dimensions.

(lower bound, upper bound) Defines the bounds for the i-th dimension.

value(lower bound, upper bound), adjoint(lower bound, upper bound) For each element of the array, defines the lower and upper bound of the value and adjoint, in lexicographical order.

The proposed binary DCO-to-compiler interface format is outlined below.

Each task construct begins with the magic byte 0x00, followed by information identifying the source file name and source line on which the task was invoked. This information consists of an unsigned char denoting the length of the file name, followed by a string of unsigned characters. Afterwards, the group id and task id fields are present, each represented as an unsigned long number. After this task header, information about the scalar arguments of the task follows.

Scalar arguments:
A scalar argument starts with the delimiter byte 0x02, followed by the name, type, value and adjoint fields. The name information starts with a single byte (name length), which denotes the length of the string specifying the argument name. The next name length bytes contain the actual argument name. The following field is type, a single byte with three valid values: IN=0x01, OUT=0x02 or AUXILIARY=0x03. The next field, value, consists of two doubles: one for the lower bound and one for the upper bound of the argument value. Finally, the adjoint field also consists of two doubles: one for the lower bound and one for the upper bound of the adjoint.

If the next byte is 0x02, another scalar argument is present. If, however, a byte with value 0x01 is next, the data that follow describe the array arguments of the task.

Array arguments:
An array argument starts with the delimiter byte 0x02, followed by the name, type, dimensions and value-adjoint fields. The name and type fields are identical to those of scalar arguments. Information bounding the array element indices follows. It consists of a single unsigned char named dimensions length, which specifies the number of dimensions in the array index space. For each dimension, a pair of index bounds follows, each consisting of two unsigned longs. The final field is named value-adjoint. Its first bytes form an unsigned long (value adjoint vector length), which specifies how many value-adjoint tuples will be read. Each tuple consists of four doubles, as follows: value lower bound, value upper bound, adjoint lower bound, adjoint upper bound. The tuples are stored in lexicographical order with respect to the index space defined by the dimensions field.

If the next byte is 0x02, a new array argument is parsed (consistent with the argument delimiter of Table 3.1). Otherwise, if the next byte has the value 0x00, a new task container is present.
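
For illustration, a minimal reader for the scalar-argument record described above; the struct and function names are ours, error handling is omitted, and we assume the producer's native sizes for double:

#include <cstdio>
#include <string>
#include <vector>

struct ScalarArg {
    std::string name;
    unsigned char type;             // IN=0x01, OUT=0x02, AUXILIARY=0x03
    double value_lb, value_ub;      // bounds of the argument value
    double adjoint_lb, adjoint_ub;  // bounds of the adjoint
};

// Assumes the 0x02 argument delimiter has already been consumed from f.
ScalarArg read_scalar_arg( FILE *f ) {
    ScalarArg a;
    unsigned char name_length = 0;
    fread( &name_length, 1, 1, f );                 // name length
    std::vector<char> buf( name_length );
    fread( buf.data(), 1, name_length, f );         // argument name
    a.name.assign( buf.begin(), buf.end() );
    fread( &a.type, 1, 1, f );                      // type byte
    fread( &a.value_lb, sizeof(double), 1, f );     // value lower bound
    fread( &a.value_ub, sizeof(double), 1, f );     // value upper bound
    fread( &a.adjoint_lb, sizeof(double), 1, f );   // adjoint lower bound
    fread( &a.adjoint_ub, sizeof(double), 1, f );   // adjoint upper bound
    return a;
}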

Table 3.1 presents the delimiters used in the binary scheme, and Table 3.2 outlines the bytecodes which represent the various argument types. Finally, Table 3.3 summarizes the DCO tool output binary format in EBNF notation.

Delimiters               Value
Task delimiter           0x00
Array arguments start    0x01
Argument delimiter       0x02

Table 3.1: Delimiters

Argument Type    Value
IN               0x01
OUT              0x02
AUXILIARY        0x03

Table 3.2: Argument type codes


DCOOutput = {TaskDesc}.
TaskDesc = TaskDelim FileName FileLineNr GroupID TaskID [ScalarArgList] [ArrayArgList].
TaskDelim = 0x00.
ScalarArgList = {ScalarArg}.
ScalarArg = ArgDelim ArgName Type ScalarValueRange ScalarAdjointRange.
ScalarValueRange = RealLowerBound RealUpperBound.
ScalarAdjointRange = RealLowerBound RealUpperBound.
ArrayArgList = ArrayDelim {ArrayArg}.
ArrayArg = ArgDelim ArgName Type Dim {IndexRange}Dim AdjDim {ArrayAdjointRange}AdjDim.
IndexRange = IntLowerBound IntUpperBound.
ArrayAdjointRange = ScalarValueRange ScalarAdjointRange.
FileName = String.
ArgName = String.
String = Len {Char}Len.
Len = UnsignedChar.
FileLineNr = UnsignedLongInt.
GroupID = UnsignedLongInt.
TaskID = UnsignedLongInt.
Type = IN | OUT | AUX.
IN = 0x01.
OUT = 0x02.
AUX = 0x03.
ArrayDelim = 0x01.
ArgDelim = 0x02.
Dim = UnsignedChar.
AdjDim = UnsignedChar.
RealLowerBound = Double.
RealUpperBound = Double.
IntLowerBound = UnsignedLong.
IntUpperBound = UnsignedLong.

Table 3.3: DCO Tool Output Format in EBNF Notation.


Chapter 4

Compiler - System Software

The API of BDDT used by the compiler to lower the pragma-based programmer specifications to equivalent calls to the runtime system (RTS) is the following:

void task_call( uint32_t funcid, uint32_t total_args, uint32_t groupid,
                uint8_t significance, uint8_t (*taskcheck)(void *),
                void *taskcheck_args, size_t args_size, uint32_t redo,
                void * /* parameters of the task invocation */ )

Creates a task for execution with the RTS. Information about the closure of the task-function to be executed is passed, as well as information concerning the significance of the task.

• significance specifies the significance of the spawned task. This information is provided by the programmer and can be used by the RTS to map the task to a core of appropriate reliability.

• groupid identifies the task group to which the task belongs. Each task group has a unique id.

• taskcheck is a function pointer to a result-check function, which will be called after the task completes or fails. It allows the programmer to run checks that validate the integrity of the computations or data. The result-check function is executed reliably and returns either SGNF_SUCCESS, which denotes that the task has produced acceptable results, or SGNF_REDO, which denotes that the task failed and should be re-executed if the maximum number of re-executions has not been reached. Once this number is reached, the task silently fails. If no result-check function is present, the return value is assumed to be SGNF_SUCCESS.

• taskcheck_args is a pointer to the arguments of the result-check function.

• args_size is the length in bytes of the arguments passed to the taskcheck function.

• redo denotes the maximum number of times that this task will be re-executed in case of failure.

void task_wait( uint32_t groupid, uint8_t (*groupcheck)(void *),
                void *groupcheck_args, size_t args_size, uint8_t type,
                uint32_t time_ms, double ratio, uint32_t redo )

Synchronization primitive for tasks. One task can wait for all tasks of the work-group, for a percentage of them, or for a specific amount of time.

• groupid identifies the group of tasks for which the RTS will wait before the execution continues.


• groupcheck is a pointer to a result-check function which will be executed after the waiting conditions have been met. It helps to check the integrity of the computations of a group of tasks. Again, this function can return SGNF_FAILURE, SGNF_REDO or SGNF_SUCCESS.

• groupcheck_args is a pointer to the arguments of the group result-check function.

• type defines the events for which the synchronization function will wait. There are three options, which can be combined bit-wise. SYNC_TIME waits for a specified amount of time before resuming, provided that all significant tasks have completed. SYNC_RATIO waits for a specified ratio of non-significant (and potentially unreliable) tasks to finish before resuming. SYNC_ALL, the default option, waits for all significant and non-significant tasks of the workgroup to complete.

• time_ms defines the time (in milliseconds) for which the statement will wait for non-significant tasks to complete. If the SYNC_TIME flag is not set, the value is ignored.

• ratio defines the ratio of non-significant tasks that must complete before the execution resumes. If the SYNC_RATIO flag is not set, this argument is ignored.

• redo specifies the maximum number of times that this group is allowed to re-execute. If the groupcheck function returns SGNF_REDO more than redo times, the execution of the group is marked as failed.
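
To illustrate the API, the following sketch shows what lowered code might look like for one non-significant task followed by a ratio-based barrier. The function id, argument marshalling and the check function are illustrative assumptions, not the actual output of the SCoRPiO compiler; only the task_call/task_wait signatures and the SGNF_*/SYNC_* names come from the interface above.

// Hypothetical result-check function; the validation logic is application-specific.
uint8_t blur_check( void *args ) {
    /* ... validate the task's output ... */
    return SGNF_SUCCESS;
}

void spawn_and_wait( void *block ) {
    // Spawn a non-significant task (funcid 12) in group 7, retried at most twice.
    task_call( /*funcid=*/12, /*total_args=*/1, /*groupid=*/7,
               /*significance=*/0, blur_check, /*taskcheck_args=*/block,
               /*args_size=*/sizeof(void *), /*redo=*/2, block );
    // Resume once 90% of the group's non-significant tasks have finished.
    task_wait( /*groupid=*/7, /*groupcheck=*/NULL, NULL, 0,
               /*type=*/SYNC_RATIO, /*time_ms=*/0, /*ratio=*/0.9, /*redo=*/0 );
}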

Moreover, the runtime offers 8 run-time calls, which are part of the API and which the programmer can use directly in her code. These calls were initially introduced in D2.1. The run-time calls with the prefix sgfn_task may only be used inside a task result-check function. The rest of the API calls may be used in any part of the host code, as well as in group result-check functions:

int sgfn_task_status_get()
sgfn_task_status_get queries the run-time system to find out whether the task has crashed during its latest execution. This could either mean that the hardware has detected an unrecoverable fault and reported it to the run-time, or that the task failed to complete in a timely manner with respect to user-specified time constraints. sgfn_task_status_get returns 0 if no crash has been detected, 1 otherwise.

void sgfn_task_significant_set(uint8_t significance)
sgfn_task_significant_set is used to set the significance information of a given task. It can only be invoked in the context of a task-level result-check function. In the event that a task has failed, the end developer may find it necessary to modify the significance characterization of the task before it is re-executed.

void sgfn_task_executions_max_set(uint32_t number)
sgfn_task_executions_max_set can only be invoked in the context of a task-level result-check function. It is used to set the maximum number of times a task may be re-executed.

int sgfn_task_reexecutions_get()
sgfn_task_reexecutions_get is used to find out the number of re-executions of this task.
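
A hedged sketch of a task result-check function combining these calls; the escalation policy (re-run reliably after the first failed retry) is purely illustrative:

uint8_t block_check( void *args ) {
    if ( sgfn_task_status_get() == 1 ) {          // the task crashed
        if ( sgfn_task_reexecutions_get() >= 1 )
            sgfn_task_significant_set( 1 );       // escalate: re-execute reliably
        return SGNF_REDO;
    }
    return SGNF_SUCCESS;                          // results acceptable
}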

The next four run-time calls can be used to manage, as well as query information about, the execution of a specified task group. These can be invoked anywhere in the code after or during the execution of a task-group result-check function.

void sgfn_group_reexecutions_max_set(const char *lbl, uint32_t number)
sgfn_group_reexecutions_max_set is used to set the maximum number of times a task group specified by lbl may be re-executed.



int sgfn_group_return_value(const char *lbl)
sgfn_group_return_value can be used to retrieve the return value of the last executed result-check function for the task group specified by lbl.

int sgfn_group_reexecutions_get(const char *lbl)
sgfn_group_reexecutions_get can be used to retrieve the number of times the task group identified by lbl has been re-executed.

int sgfn_group_status_get(const char *lbl)
sgfn_group_status_get returns the execution status for the task group identified by lbl. A group is considered to have crashed if either of the following is true:

1. a significant task has crashed, or

2. the number of insignificant tasks which completed successfully does not meet user-specified constraints. These constraints are set using the ratio() and all clauses of the taskwait #pragma.


Chapter 5

System Software - Simulator

5.1 HW configuration and event monitoring

5.1.1 Handling accelerator cores

The runtime system will be able to define the execution state of each core, as well as get information about execution, such as statistics on the number of errors that happened over a time interval, by accessing special registers. The microarchitecture will provide access to these HW registers through special read/write instructions.

The runtime system will be able to define the reliability level of each core using an interface similar to ACPI P-states. The cores will need to implement only the part of the ACPI standard needed to achieve such functionality. Two P-states will be supported, corresponding to two different tuples of voltage and frequency (V, f). Cores in state P0 will operate reliably, using nominal frequency and voltage. Cores in state P1 will operate unreliably, at reduced voltage and frequency.

Each core will provide a control register (PSTATE_CTL), the value of which will determine the P-state in which the core will operate. If more reliability levels are defined in the future, this will correspond to a larger number of possible P-states for the core. Similarly, these states will be enforced by writing the corresponding value to the control register.

Similarly, the runtime system will have access to the HW counters of the Execution Monitoring Unit (EMU) of each core. In each EMU, there will be the following fixed HW counters:

• CNT_TOTAL_INST The total number of instructions executed by the core

• CNT_APPROX_INST The number of instructions executed approximately (without error correction) by the core

• CNT_TOTAL_ERRORS The total number of errors detected in the core

• CNT_UNCOR_ERRORS The number of errors detected in the core that were not corrected

• CNT_TOTAL_CYCLES The total number of cycles for a task executed on the core

There will also be an extra control register, HWCNT_CTL, for enabling/disabling event counting and flushing the counters. Writing 1 or 0 to the first bit (bit 0) of this register will enable or disable the counting of events, respectively. If bit 1 of HWCNT_CTL is set, the HW counters will be reset.
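
As a hedged sketch of how the RTS might drive these registers: hw_read/hw_write below stand in for the special read/write instructions, and the register identifiers are symbolic, so none of the names are a defined API.

#include <cstdint>

enum Reg { PSTATE_CTL, HWCNT_CTL, CNT_TOTAL_ERRORS, CNT_UNCOR_ERRORS };

// Stubs standing in for the special register read/write instructions.
uint64_t hw_read( uint32_t /*core*/, Reg /*r*/ ) { return 0; }
void hw_write( uint32_t /*core*/, Reg /*r*/, uint64_t /*v*/ ) {}

void run_unreliably_and_count( uint32_t core ) {
    hw_write( core, PSTATE_CTL, 1 );    // P1: reduced voltage/frequency
    hw_write( core, HWCNT_CTL, 0x2 );   // bit 1: reset the HW counters
    hw_write( core, HWCNT_CTL, 0x1 );   // bit 0: enable event counting
    /* ... execute a task on the core ... */
    uint64_t errors      = hw_read( core, CNT_TOTAL_ERRORS );
    uint64_t uncorrected = hw_read( core, CNT_UNCOR_ERRORS );
    (void) errors; (void) uncorrected;  // e.g., feed back into task scheduling
    hw_write( core, PSTATE_CTL, 0 );    // back to reliable P0
}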

In order to prevent data flow between reliable and unreliable data, each core can be configured to allow or ban access to specific ranges of the memory of the accelerator, depending on its reliability level. For this reason, each core in an accelerator will have a private Access Table (AT). The AT will contain one flag per memory region, defining whether the core is allowed to access the region while executing in unreliable mode. ATs will be modifiable at runtime. The runtime will be able to set these flags before a task is executed and, consequently, control the access to sensitive memory data. For example, an unreliable core will not be allowed to access data in a range of physical addresses that contain significant data. In this way, the runtime will be able to enforce the significance semantics as described by the programmer and isolate significant from non-significant data, and the hardware will have a straightforward way to detect memory violations during unreliable execution. If an error during the execution of an unreliable core leads to a memory access to an area that the core is not allowed to access, the hardware will identify it as a violation and raise an exception to the host core. A base register MEM_PROT will be available to each core and will point to the start of the access table. This table will contain one flag per memory range of the accelerator's tightly coupled memory.

5.1.2 Handling memory

Similar functionality will be provided for memory. All cores in an accelerator share a tightly coupled memory (TCDM), which is physically divided into regions. There is no MMU, so there is a 1:1 mapping of physical regions to memory pages. Each region has a reliability level, which depends on the protection and correction mechanisms implemented by the region and on its refresh rate. During execution, the runtime system can configure the refresh rate of a region and enable/disable ECC support in regions that implement it. In this way the RTS will be able to control the reliability level of the region.

A Reliability Table (RT) will contain information about the protection mechanisms of each region and its refresh rate. This table will be common to all cores, since all cores of an accelerator have access to a single TCDM. A base register MEM_REL, common to all cores, will point to the start of the memory area containing this information. For each region there will be information about which protection mechanisms are available in the region, and about its refresh rate. For those values that can be altered at runtime, i.e., refresh rate and enabled/disabled ECC, the RTS will be able to define the desired state by writing the corresponding value to the proper field.

Figure 5.1: Reliability flags for a memory region (bit fields, from bit 0 to bit 7: refresh rate, ECC, 8T/6T, currently unused)

Figure 5.1 shows the layout of the flags in the RT. Bits 0-2 of each RT entry define the refresh rate of the region cells. There is provision for 8 different refresh-rate settings; the exact values depend on the hardware implementation. Bit 3 shows whether ECC is enabled for this region: value 0 means that ECC is disabled, while 1 means that it is enabled. If the implementation allows enabling/disabling ECC for the region, it should also allow writing to this bit. Bit 4 shows the type of cells used for this region: if its value is 1, then 8T cells have been used to construct the region, making it more reliable; if bit 4 is 0, then 6T cells have been used. This bit is not writeable. The rest of the bits are currently unused.
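
A small sketch of decoding and updating an RT entry according to this layout; the constant and function names are ours, not a defined interface:

#include <cstdint>

const uint8_t RT_REFRESH_MASK = 0x07;     // bits 0-2: refresh-rate setting
const uint8_t RT_ECC_BIT      = 1u << 3;  // bit 3: ECC enabled
const uint8_t RT_8T_BIT       = 1u << 4;  // bit 4: 8T (1) vs 6T (0) cells, read-only

uint8_t rt_refresh_rate( uint8_t entry ) { return entry & RT_REFRESH_MASK; }
bool    rt_ecc_enabled( uint8_t entry )  { return entry & RT_ECC_BIT; }
bool    rt_is_8t( uint8_t entry )        { return entry & RT_8T_BIT; }

// Request a new refresh-rate setting and enable ECC (if the region supports it).
uint8_t rt_configure( uint8_t entry, uint8_t rate ) {
    return (entry & ~(RT_REFRESH_MASK | RT_ECC_BIT))
           | (rate & RT_REFRESH_MASK) | RT_ECC_BIT;
}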

There will also exist HW counters for events in memory:

Deliverable D5.1 - Full System Design: Progress Report 36 / 41

SCoRPiO: Significance-Based Computing for Reliability and Power Optimization

• CNT_MEM_TOTAL_ERRORS Total number of errors that manifested in this region

• CNT_MEM_CORR_ERRORS Number of errors that manifested in this region and were corrected by ECC

Similarly to the event counting on cores, there will be a control register MEMCNT_CTL for memory events, with the exact same functionality.

Moreover, the runtime system will be able to dictate that the allocation of data happens in either reliable or unreliable memory regions. Using the information provided by the Reliability Table, the runtime will be able to allocate for the application, on demand, memory ranges from different physical memory regions (reliable or unreliable). We will utilize this to build a library (similar to libnuma of Linux) with which the user or the runtime system will be able to enforce the allocation of data in a reliable or unreliable region. A basic set of functions to allocate memory on the local memory of the core will be the following (a usage sketch follows the list):

• void *sig_alloc_onnode(size_t size, int region)

• void *sig_realloc_onnode(void *oldaddr, size_t old_size, size_t new_size)

• void *sig_free(void *addr)

• void sig_preferred(int reliability)
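
A hedged usage sketch of these calls; the region ids, the sizes, and the assumptions that region 0 is reliable and that a reliability argument of 0 means "unreliable preferred" are illustrative:

void example_allocation() {
    // Significant data in a reliable region (region 0, by assumption).
    double *significant = (double *) sig_alloc_onnode( 1024 * sizeof(double), 0 );
    // Prefer unreliable regions for subsequent allocations (assumed encoding).
    sig_preferred( 0 );
    double *approx = (double *) sig_alloc_onnode( 4096 * sizeof(double), 1 );
    /* ... run tasks on the two buffers ... */
    sig_free( approx );
    sig_free( significant );
}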

5.2 Error handling

Executing processors and memories in low-reliability mode may cause errors. When an EMU detects such a violation, it raises an interrupt to the host processor and passes information about the core which produced the error, the type of the error, and the exact instruction that caused it. The host OS will use sigqueue to send a signal SIGSCRP to the runtime system, passing the information about the error in the pointer void *si_ptr.
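
A sketch of how the runtime system might receive such notifications via standard POSIX signal handling; the numeric value of SIGSCRP and the layout of the payload behind si_ptr are project-specific assumptions:

#include <csignal>

void scrp_handler( int sig, siginfo_t *info, void *ctx ) {
    void *si_ptr = info->si_value.sival_ptr;  // error descriptor from the host OS
    /* ... decode core id, error type and faulting instruction; notify the RTS ... */
}

void install_scrp_handler( int sigscrp ) {    // signal number passed in: assumption
    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;                 // deliver the sigqueue payload in siginfo_t
    sa.sa_sigaction = scrp_handler;
    sigaction( sigscrp, &sa, nullptr );
}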

5.3 Topology

The topology will be accessible via the sysfs virtual filesystem of Linux. Information on both the cores and the memory configuration will be available through this interface. The information about the cores will include: a) the accelerator they belong to, within the cluster; b) the range of physical addresses each core may access; c) the capabilities of the core: state (reliable, unreliable) and access mode per memory region. A user can change the state of a core by writing to the file that represents the state. As far as the memory is concerned, there is information for each region of all TCDMs in the cluster. The available information is: a) the range of physical addresses it includes; b) the protection/correction mechanisms available in this region; c) which of these mechanisms are currently enabled; d) the reliability state of the region (refresh rate). The reliability state of the region can likewise be changed by writing to the respective file. In the same fashion, a user can enable or disable protection/correction mechanisms.

5.4 Thread manipulation/Memory transfers

For handling creation, termination and binding of threads, as well as memory transfers between accelerators or between accelerators and the host memory, there will be a library provided by the hardware layer that bypasses the OS. Briefly, we present this interface (a usage sketch follows the list):

• uint32_t thread_create() Creates a new thread and returns its (unique) id to the runtime system


• uint8_t thread_bind(uint32_t thread_id, uint32_t core_id) Binds the thread with id thread_id to core core_id. It returns 0 in case of success or -1 in case of failure.

• thread_status_t thread_status(uint32_t thread_id) This call returns the status of the thread with id thread_id in a struct of type thread_status_t, which contains information about the thread, such as its running state.

• void *memcpy(void *dest, void *src, size_t size, uint32_t flags) Copies data between different physical memories. It copies size bytes from the memory area starting at address src to the memory area starting at address dest. It returns a pointer to dest.
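
A hedged usage sketch of this library (we assume its header is in scope); the core id, buffer sizes and flags value are illustrative:

void offload_example( void *host_buf, void *accel_buf, size_t n ) {
    uint32_t tid = thread_create();                  // new thread, id returned to the RTS
    if ( thread_bind( tid, /*core_id=*/3 ) != 0 )    // place it on core 3
        return;
    memcpy( accel_buf, host_buf, n, /*flags=*/0 );   // host-to-accelerator transfer
    thread_status_t st = thread_status( tid );       // poll the thread's running state
    (void) st;
}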


Chapter 6

Conclusions

In this document, we presented the main components of the vertical, multi-layer SCoRPiO project approach towards significance-driven, power-efficient computations on hardware substrates configured for aggressive power savings at the expense of reliability. We briefly described the design of each layer and outlined the progress during the first year of the project.

In the second part of the document, we went a step further and defined the main interfaces:

• Between the compiler & programming model and the automatic significance analysis infrastructure.

• Between the compiler & programming model and the system software layers.

• At the hardware/software boundary.

In the following months, the first prototypes of the implementation of all system layers will be available. At that point, it will be possible to perform the first qualitative and quantitative evaluation of the SCoRPiO project approach. As we mentioned earlier in the document, we have opted to front-load the implementation effort, in order to make the vertical experimental infrastructure available as soon as possible in the course of the project.

As we have discussed in the description of work, integration has been designed as an iterative, spiral approach. As a result, we expect that both the functionality of the different layers and the interfaces among them may have to be refined in future versions of the system.


Bibliography

[1] Cedric Bastoul. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT '04, pages 7-16, Washington, DC, USA, 2004. IEEE Computer Society.

[2] Alejandro Duran, Eduard Ayguade, Rosa M. Badia, Jesus Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02):173-193, 2011.

[3] A. Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2nd edition, 2008.

[4] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis and transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pages 75-88, San Jose, CA, USA, March 2004.

[5] Michael Lerch, German Tischler, Jürgen Wolff von Gudenberg, Werner Hofschuster, and Walter Krämer. FILIB++ interval library. www2.math.uni-wuppertal.de/˜xsc/software/filib.html.

[6] Michael Lerch, German Tischler, Jürgen Wolff von Gudenberg, Werner Hofschuster, and Walter Krämer. The interval library filib++ 2.0 - design, features and sample programs. Preprint 2001/4, Universität Wuppertal, 2001.

[7] Michael Lerch, German Tischler, Jürgen Wolff von Gudenberg, Werner Hofschuster, and Walter Krämer. Filib++, a fast interval library supporting containment computations. ACM Trans. Math. Softw., 32(2):299-324, 2006.

[8] Johannes Lotz, Klaus Leppkes, and Uwe Naumann. dco/c++ - Derivative Code by Overloading in C++. Technical Report AIB-2011-06, RWTH Aachen, May 2011.

[9] Johannes Lotz, Uwe Naumann, and Jörn Ungermann. Hierarchical algorithmic differentiation: A case study. In Recent Advances in Algorithmic Differentiation, pages 187-196. Springer, 2012.

[10] Ramon E. Moore. Interval Analysis. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1966.

[11] Ramon E. Moore, R. Baker Kearfott, and Michael J. Cloud. Introduction to Interval Analysis. Society for Industrial and Applied Mathematics, 1st edition, January 2009.

[12] U. Naumann. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Software, Environments, and Tools. SIAM, 2011.

[13] Jan Riehme and Uwe Naumann. Initial Significance Based Computing Modeling: Significance Analysis for Numerical Models, SCoRPiO Deliverable D1.1.1. Technical report, RWTH Aachen, 2013.

[14] Software and Tools for Scientific Engineering, RWTH Aachen University, Germany. Derivative Code by Overloading in C++ (dco/c++). http://www.stce.rwth-aachen.de/software/dco_cpp.html.


[15] George Tzenakis, Angelos Papatriantafyllou, John Kesapides, Polyvios Pratikakis, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. BDDT: Block-level dynamic dependence analysis for deterministic task-based parallelism. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 301-302, New York, NY, USA, 2012. ACM.

[16] Foivos S. Zakkak. Scoop: Language extensions and compiler optimizations for task-based programming models.Master’s thesis, University of Crete, School of Sciences and Engineering, Computer Science Department, 2012.
