
LLNL Summer School 07/08/2014

What is OCR?

TG Team (presenters: Romain Cledat & Bala Seshasayee)

July 8, 2014

https://xstack.exascale-tech.com/wiki/

This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

OCR

• OCR
  – Open Community Runtime
  – Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
• The term ‘OCR’ is used to refer to
  – A programming model
  – A user-level API
  – A runtime framework
  – One of several reference runtime implementations
• In this talk
  – Presentation of the programming model
  – Presentation of the API and implementations through demos

TG X-Stack project goals

• Design a software stack to meet Exascale goals
  – Target a strawman architecture
  – Provide a programming model, API, reference implementation and tools
• Concerns
  – Extreme hardware parallelism
  – Data locality
  – Fine grained resource management
  – Resiliency
  – Power and energy and not just performance
  – Platform independence

Dataflow programming model

[Figure: data-flow graph of a Fibonacci computation. Nodes: mainEdt, three fibIterEdt, sumEdt, doneEdt. Edge labels: N, N-1, N-2, Fib(N-1), Fib(N-2), Fib(N).]

Runtime maps the constructed data-flow graph to architecture

[Figure: target architecture. Blocks of 8 PEs with a Service Core and 1MB L2, grouped under a Shared LLC and connected by an Interconnect.]

Legend:
• EDT: A non-blocking unit of work. Runnable once all pre-slots are satisfied.
• Datablock: Data shared between EDTs
• Creation link: Source EDT creates destination
• Event/Data link: Source EDT provides data to the destination
• Both creation and event/data link

OCR level of abstraction

TBB user-friendly API

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<int>( 1, n ), avg );
    }

hides…

    if(!range.empty()) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

hides…

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

hides…

OCR’s level of abstraction is at the very bottom

OCR concepts

• Common
  – All objects globally and uniquely identifiable and relocatable
• Computation
  – Event Driven Task (EDT)
  – Does not perform synchronization
  – Distinct from the notion of thread or core
• Data
  – Data-block (DB)
  – Relocatable consecutive chunk of data
• Synchronization, links
  – Events
  – Runtime-visible
• Slots
  – Positional end-points for dependences

Outline

• Simplest OCR Concepts – EDTs, datablocks
• Example – producer/consumer
• Clarifying concepts – what EDTs can/can’t do, DBs are/aren’t
• Example – simple synchronization
• More concepts – events, slots
• Example – complex synch
• More concepts – latch events

OCR concepts: 3 core building blocks

• Event Driven Task (EDT)
• N pre-slots (known at creation time)
• Available states on a slot:
  – Connected (attached to another slot) or unconnected
  – Satisfied or unsatisfied

[Figure: an EDT with pre-slots numbered 0 to N, each receiving Data.]

• A globally visible name space of data blocks
  – Explicitly created
  – EDTs can only access data either created by them or passed through their pre-slots

[Figure: EDT1 linked to EDT2.]

• EDT1 creates EDT2
• EDT1 provides data on EDT2’s pre-slot
  – Possibly through an indirection chain

OCR execution model for EDTs

• EDTs
  – 0..N in/out pre-slots
    • Slots are initially “unconnected” and “unsatisfied”
    • At creation time, the number of incoming slots must be known
  – An EDT executes after all pre-slots are “satisfied”
    • Satisfaction of pre-slots can happen in any order
  – An EDT can access memory:
    • Data-blocks:
      – passed in through one of its in/out slots (the EDT gets a C pointer)
      – created by the EDT
    • Stack and ephemeral heap (local)
    • NO global memory
  – An EDT, during its execution, can at any time:
    • Write to any accessible data-blocks
    • Manipulate the dependence graph for future (not yet runnable) EDTs
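The firing rule above (a fixed number of pre-slots, satisfied in any order, with the task running only once all of them are satisfied) can be sketched in a few lines of C++. This is a toy model, not the OCR API: the class name `ToyEdt`, its `satisfy` method and the use of raw `const void*` pointers for data-blocks are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an EDT: NOT the OCR API, only the firing rule.
class ToyEdt {
public:
    using Body = std::function<void(const std::vector<const void*>&)>;

    // The number of pre-slots is fixed at creation time.
    // (A zero-slot EDT would be runnable at once; this sketch omits that case.)
    ToyEdt(std::size_t numPreSlots, Body body)
        : data_(numPreSlots, nullptr),
          satisfied_(numPreSlots, false),
          remaining_(numPreSlots),
          body_(std::move(body)) {
        assert(body_ != nullptr);
    }

    // Satisfy one pre-slot with a pointer to a data-block. nullptr is allowed
    // and models a pure control dependence. Returns true if this call made
    // the EDT run.
    bool satisfy(std::size_t slot, const void* dataBlock) {
        if (slot >= satisfied_.size() || satisfied_[slot] || hasRun_) return false;
        satisfied_[slot] = true;
        data_[slot] = dataBlock;
        if (--remaining_ > 0) return false;  // still waiting on other slots
        hasRun_ = true;
        body_(data_);  // runs to completion without blocking
        return true;
    }

    bool hasRun() const { return hasRun_; }

private:
    std::vector<const void*> data_;
    std::vector<bool> satisfied_;
    std::size_t remaining_;
    bool hasRun_ = false;
    Body body_;
};
```

A caller would construct a `ToyEdt` with two pre-slots and satisfy slot 1 before slot 0; the body runs only on the second call, whichever slot it targets.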

Example 1: Producer/Consumer

• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling

[Figure: "Concept" view and "OCR" view side by side, each showing Producer EDT, Data and Consumer EDT. Legend: creation link; event/data link; both creation & event.]

Example 2: Simple synchronization

• Control dependence is no different than a data dependence

[Figure: "Concept" view and "OCR" view side by side, each showing Step 1 EDT, Step 2-a EDT and Step 2-b EDT. In the OCR view the links are labeled Ø.]

OCR execution model for runtime EDTs

• Runtime EDTs
  – Created by the runtime to handle more complex synchronization situations
  – 0..N pre slots
    • Slots are initially “unconnected” and “unsatisfied”
  – Runtime EDTs have a “trigger” rule that determines when they “satisfy” their outgoing edges and what gets propagated
    • Latch runtime EDT (multi-party synchronization)
      – 2 pre slots; “waiting-on” count and current count
      – When: satisfy outgoing edges when number of satisfies on both pre slots matches (similar to reference count in TBB)
      – What: NULL (incoming data-blocks are ignored)
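The latch trigger rule described above can be illustrated with a small counter-based sketch. It is a toy model under stated assumptions, not the OCR implementation: the names `ToyLatch`, `kWaitingOn` and `kCurrent` are hypothetical, and the latch is assumed to fire at most once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical latch: NOT the OCR API. Two pre-slots, each counting how many
// times it has been satisfied.
class ToyLatch {
public:
    enum Slot { kWaitingOn = 0, kCurrent = 1 };

    explicit ToyLatch(std::function<void(const void*)> onTrigger)
        : onTrigger_(std::move(onTrigger)) {
        assert(onTrigger_ != nullptr);
    }

    // Record one satisfy on the given pre-slot; incoming data is ignored.
    // Returns true if this call triggered the latch.
    bool satisfy(Slot slot) {
        if (triggered_) return false;
        ++count_[slot];
        // "When": the number of satisfies on both pre-slots matches.
        if (count_[kWaitingOn] != count_[kCurrent]) return false;
        triggered_ = true;
        onTrigger_(nullptr);  // "What": NULL is propagated downstream
        return true;
    }

    bool triggered() const { return triggered_; }

private:
    std::size_t count_[2] = {0, 0};
    bool triggered_ = false;
    std::function<void(const void*)> onTrigger_;
};
```

One consequence of this rule, at least in the sketch, is that every party has to be registered on the "waiting-on" slot before the matching completions arrive; otherwise the counts can meet early and the latch fires too soon.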

Example 3: In place parallel update

[Figure: "Concept" view with Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data. "OCR" view with the same EDTs and Data plus a Sync REDT; some links are labeled Ø.]

OCR ecosystem

[Figure: layered view of the OCR ecosystem. Labels recovered from the slide:
  • Programming platforms: C, Array DSL; CnC; Hero Code; HC; HTA; PIL; CnC Translator; HC Compiler; R-Stream
  • OCR API + Tuning Annotations; Open Community Runtime
  • OCR implementations: OCR targeting TG; OCR targeting x86
  • Low-level compilers: LLVM; GCC
  • Platforms / Evaluation platforms: FSim - TG Architecture; x86; Cluster]

Take-aways

• OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
• Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
• Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach

Preliminary results

• On some code, OCR matches or bests OMP
• Simple scheduler, no data-blocks (very preliminary but promising)

Areas of investigation

• Development of a specification:
  – Memory model
• Tuning hints and annotations
• More expressive support for collectives


Case Study: FFT in OCR


Background

• Final-year undergraduate project at Oregon State University
• OCR implementation of Fast Fourier Transform
  – Cooley-Tukey algorithm
  – Evolution from serial version
  – OCR behavior

Algorithm

• Divide-and-conquer
• Data-flow friendly

[Figure: FFT butterfly diagram. Source: Wikimedia Commons]

Versions

• (1) Serial implementation – 1 EDT running the entire program
• (2) Naive parallelization – division of DFT is carried out by EDTs recursively, combination of outputs is done by 1 EDT for each step of recursion
• (3) Bounded parallelization – both stages of butterfly are parallelized, but only up to a user-specified block size (to minimize scheduling overhead)
• (4) Bounded parallelization with datablocks – previous implementations operated on a single datablock; this uses 3 datablocks (input, output real & imaginary terms)
• Scope for better parallelism
  – Finer datablocks
  – Staggered creation of EDTs in the combination phase
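The effect of the block-size bound in versions (2) and (3) can be sketched with a serial recursive Cooley-Tukey FFT that only counts the tasks a task-parallel version would spawn. This is an illustration under assumptions, not the project's code: `fftRec`, `fftBounded`, `FftResult` and the `blockSize` parameter are hypothetical names, only division tasks are counted (combination tasks are not), and no EDTs are actually created.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Cplx = std::complex<double>;

struct FftResult {
    std::vector<Cplx> values;
    std::size_t tasks;  // division tasks a bounded task-parallel version would create
};

// Recursive radix-2 Cooley-Tukey step. `tasks` is non-null while the current
// call still counts as its own task; it is null once the work has been folded
// into a serial block.
inline std::vector<Cplx> fftRec(const std::vector<Cplx>& x, std::size_t blockSize,
                                std::size_t* tasks) {
    const std::size_t n = x.size();
    if (n <= 1) return x;

    // Divide: even- and odd-indexed samples.
    std::vector<Cplx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }

    // Spawn two child tasks only while the sub-problem exceeds the block size.
    const bool spawn = (tasks != nullptr) && (n > blockSize);
    if (spawn) *tasks += 2;
    const std::vector<Cplx> e = fftRec(even, blockSize, spawn ? tasks : nullptr);
    const std::vector<Cplx> o = fftRec(odd, blockSize, spawn ? tasks : nullptr);

    // Combine: one butterfly per output pair.
    const double pi = std::acos(-1.0);
    std::vector<Cplx> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const Cplx w = std::polar(1.0, angle);
        out[k] = e[k] + w * o[k];
        out[k + n / 2] = e[k] - w * o[k];
    }
    return out;
}

inline FftResult fftBounded(const std::vector<Cplx>& x, std::size_t blockSize) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length only
    std::size_t tasks = 1;                // the root task
    std::vector<Cplx> values = fftRec(x, blockSize, &tasks);
    return {std::move(values), tasks};
}
```

With `blockSize` set to 1 the count grows as 2n - 1 for an input of length n, which mirrors the naive version; raising `blockSize` trades task count for larger serial blocks.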

Behavior

Version                          No. of EDTs   Mean EDT longevity (µs)   Load variance across cores (%)   Running time (s)
Serial                           2             1673420                   70.7                             3.36
Naïve parallel                   12582913      253                       5.1                              877.0
Bounded parallel                 1793          1982                      2.7                              0.46
Bounded parallel w/ datablocks   1793          1946                      2.9                              0.45

• OCR X86 running FFT on 232 sized dataset
  – 2.9GHz Xeon 16 cores; 8 cores made available to OCR
• Balance to be achieved between number and size of EDTs
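As a rough reading of the running-time column, the speedup S over the serial version and the parallel efficiency E can be worked out as below. Treating the serial run as a fair single-core baseline and dividing by the 8 cores made available to OCR are assumptions made here, not claims from the slides.

```latex
\begin{align*}
S_{\text{bounded}} &= \frac{T_{\text{serial}}}{T_{\text{bounded}}} = \frac{3.36}{0.46} \approx 7.3,
  & E_{\text{bounded}} &= \frac{S_{\text{bounded}}}{8} \approx 0.91 \\
S_{\text{bounded+DB}} &= \frac{3.36}{0.45} \approx 7.5,
  & E_{\text{bounded+DB}} &= \frac{S_{\text{bounded+DB}}}{8} \approx 0.93 \\
\frac{T_{\text{naive}}}{T_{\text{serial}}} &= \frac{877.0}{3.36} \approx 261
\end{align*}
```

Under those assumptions, the bounded versions come close to linear scaling on 8 cores, while the naive version, with roughly 12.6 million short-lived EDTs, runs about 261 times slower than the serial one.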


Backup

Strawman architecture

• Heterogeneous
• Hierarchical architecture
• Tapered memory bandwidth
• Global, shared address space
• Software managed non-coherent memories
• Functional simulator available

[Figure: strawman architecture hierarchy. Labels recovered from the slide:
  • Execution Engine (XE): DP FP FMAC, 32KB I$, 64KB SP, RF?; application specific
  • Control Engine (CE): GP Int, 32KB I$, 64KB SP, RF?; system SW
  • Block (8 XE + CE): 1MB shared L2
  • Cluster (16 Blocks): 8MB Shared LLC, Interconnect
  • Processor Chip (16 Clusters): 64MB Shared LLC, Interconnect]

OCR vs other solutions

                          CnC             MPI                        OCR                  OpenMP          TBB
Execution model           Tasks           Bulk Sync                  Fine-grained tasks   Bulk Sync       Tasks
Memory model              Shared memory   Explicit message passing   Explicit; global     Shared memory   Shared memory
Separation of concerns?   Yes             No                         Yes                  No              Yes (but can dig deeper)

OCR concepts: building blocks

• EDT
  – N pre slots (N known at creation time)
  – Optional attached “completion event”
• Data
  – No pre slots
  – Post slot always “satisfied”
• Evt
  – N pre slots (N fixed by type of event, NOT determined by user)
  – Post slot initially “unsatisfied”
• Slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied

[Figure: Evt, EDT and Data boxes. Pre slots are numbered 0 to N; post slots allow multiple connections.]

OCR concepts: add dependence

[Figure: add dependence takes two arguments. Argument 1 is Data or an Evt; Argument 2 is an EDT or an Evt, each with slots 0 to N. The result is that the two are connected. The slide shows 1 of 4 possible combinations.]

OCR concepts: satisfy

[Figure: satisfy takes two arguments. Argument 1 is an EDT or an Evt, each with slots 0 to N; Argument 2 is Data or NULL. The result is that the target is satisfied/triggered. The slide shows 1 of 4 possible combinations.]

Example 1: Producer/Consumer

• Dynamic dependence construction
• Producer and consumer never know about each other
• Focus on minimum needed for placement and scheduling

[Figure: "Concept" view with Producer EDT, Data and Consumer EDT. "OCR" view with Producer EDT, Data, Evt and Consumer EDT, annotated with the calls (1) dbCreate, (*) addDep, (2) edit Data, (3) satisfy. Legend: who executes call; data dependence; control dependence.]
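The call sequence on this slide, (1) dbCreate, (*) addDep, (2) edit Data, (3) satisfy, can be mimicked with a toy event that sits between producer and consumer. The sketch below is not the OCR API: `ToyEvent`, `addDependence`, `producer` and `consumer` are hypothetical names, a `std::vector<int>` stands in for a data-block, and delivering data immediately to a late connection is an assumption of the sketch.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using DataBlock = std::vector<int>;                // toy data-block
using DataPtr = std::shared_ptr<const DataBlock>;
using Sink = std::function<void(DataPtr)>;         // stands in for a consumer pre-slot

// Hypothetical event: forwards whatever satisfies it to every connected sink.
class ToyEvent {
public:
    // (*) addDep: connect the event's post-slot to a consumer's pre-slot.
    void addDependence(Sink sink) {
        if (satisfied_) sink(data_);               // late connection: deliver right away
        else sinks_.push_back(std::move(sink));
    }

    // (3) satisfy: hand the data-block to everything connected so far.
    void satisfy(DataPtr data) {
        if (satisfied_) return;                    // a second satisfy is ignored here
        satisfied_ = true;
        data_ = std::move(data);
        for (auto& sink : sinks_) sink(data_);
        sinks_.clear();
    }

private:
    bool satisfied_ = false;
    DataPtr data_;
    std::vector<Sink> sinks_;
};

// Producer: knows only the event, never the consumer.
inline void producer(ToyEvent& out) {
    auto block = std::make_shared<DataBlock>(3);        // (1) dbCreate
    for (int i = 0; i < 3; ++i) (*block)[i] = i + 1;    // (2) edit Data
    out.satisfy(std::move(block));                      // (3) satisfy
}

// Consumer: knows only what arrives on its pre-slot, never the producer.
inline int consumer(const DataPtr& in) {
    assert(in != nullptr);
    int sum = 0;
    for (int v : *in) sum += v;
    return sum;
}
```

Whoever builds the graph connects a consumer with `addDependence` and then lets `producer` run; neither function refers to the other, which is the point the slide makes about producer and consumer never knowing about each other.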

Example 2: Simple synchronization

• Control dependence is no different than a data dependence

[Figure: "Concept" view with Step 1 EDT, Step 2-a EDT and Step 2-b EDT. "OCR" view with the same EDTs plus an Evt, annotated with the calls (*) addDep and (1) satisfy; the satisfy carries NULL.]

Example 3: In place parallel update

[Figure: "Concept" view with Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data. "OCR" view with Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Finish EDT and Wrapup EDT, annotated with the calls (1) dbCreate, (1) edtCreate (twice), (2) addDep (twice), (3) edtCreate (twice), (4) addDep.]

Example 4: Single assignment update

[Figure: "Concept" view with Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data. "OCR" view with the same EDTs plus Data, Data1, Data2, Evt1 and Evt2, annotated with the calls (1) dbCreate, (1) edtCreate (twice), (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate (twice), (5) satisfy (twice).]