LLNL Summer School 07/08/2014 What is OCR? Traleika Glacier Team (presenters: Romain Cledat & Bala Seshasayee) July 8, 2014

TRANSCRIPT

  • Slide 1

LLNL Summer School 07/08/2014
What is OCR?
Traleika Glacier Team (presenters: Romain Cledat & Bala Seshasayee)
July 8, 2014
https://xstack.exascale-tech.com/wiki/

This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

  • Slide 2

OCR

- Open Community Runtime
- Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
- The term OCR is used to refer to:
    - A programming model
    - A user-level API
    - A runtime framework
    - One of several reference runtime implementations
- In this talk:
    - Presentation of the programming model
    - Presentation of the API and implementations through demos

  • Slide 3

Traleika Glacier (TG) X-Stack project goals

- Design a software stack to meet Exascale goals
    - Target a strawman architecture
    - Provide a programming model, API, reference implementation and tools
- Concerns
    - Extreme hardware parallelism
    - Data locality
    - Fine grained resource management
    - Resiliency
    - Power and energy and not just performance
    - Platform independence

  • Slide 4

Dataflow programming model

Runtime maps the constructed data-flow graph to architecture.

Figure labels: mainEdt, fibIterEdt, sumEdt, doneEdt; N, N-2, N-1, Fib(N-2), Fib(N-1), Fib(N); Shared LLC, Interconnect.

Legend:
- EDT: a non-blocking unit of work. Runnable once all dependences are satisfied.
- Datablock: data shared between EDTs.
- Creation link: source EDT creates destination.
- Dependence: source EDT satisfies one of the destination's dependences.
- Both creation and dependence link.

  • Slide 5

OCR level of abstraction

TBB user-friendly API:

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range ( 1, n ), avg );
    }

hides:

    if(!range.empty()) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

OCR's level of abstraction is at the very bottom.

  • Slide 6

High level OCR concepts

Event Driven Task (EDT)
- Distinct from the notion of a thread/core
- Executes when all required data-blocks have been provided to it
- Creates other EDTs and provides data-blocks to them

Data
- Globally visible namespace of data-blocks
- Explicitly created and destroyed
- Only available global memory
- Data-blocks can move

Dependence (EDT 1 to EDT 2)
- EDT 1 provides data to EDT 2
- EDT 1 creates EDT 2
- Visible to the runtime

Figure labels: EDT; Accessible data-blocks; Data-blocks for other EDTs; Create other EDTs.

  • Slide 7

Example 1: Producer/Consumer

- Dynamic dependence construction
- Focus on minimum needed for placement and scheduling

Figure labels (Concept): Consumer EDT, Producer EDT, Data.
Figure labels (OCR): Consumer EDT, Producer EDT, Data.

  • Slide 8

EDTs
- An EDT executes after all its dependences are satisfied
    - The number of dependences must be known at creation time
    - Dependence satisfaction can occur in any order
- An EDT can, during its execution:
    - Create other EDTs and data-blocks (DBs)
    - Manipulate the dependence graph for future (not ready) EDTs
    - Access stack and ephemeral local heap, but NO global memory
    - Access data-blocks passed in as a dependence or created by the EDT

DBs
- Contiguous block of global memory visible to any EDT
- OCR enforces no restrictions on its access

(Slide title: OCR execution model)

  • Slide 9

Example 2a: Simple synchronization

Steps 1, 2-a, and 2-b need not know about each other; they may all have been created by another EDT.

Figure labels (Concept): Step 1 EDT, Step 2-a EDT, Step 2-b EDT.
Figure labels (OCR): Step 1 EDT, Step 2-a EDT, Step 2-b EDT, Evt1.

  • Slide 10

Events

- Events are used to:
    - Satisfy one or more of an EDT's dependents
    - Dynamically change the flow graph
- Events capture the concepts of:
    - Data dependence: data-blocks flow along the edges
    - Pure control dependence (as in the example)

  • Slide 11

Example 2b: Multiple dependences

Slots are used to order the dependences of an EDT (akin to the order of arguments in a C function).

Figure labels (Concept): Step 2 EDT, Step 1-b EDT, Step 1-a EDT.
Figure labels (OCR): Step 2 EDT, Step 1-b EDT, Step 1-a EDT, E1, E2.

  • Slide 12

Slots of an EDT

- Each EDT dependence has one slot assigned to it
- Each slot can optionally receive a data-block
- Slots are initially unsatisfied; events connected to the slot propagate the satisfied state
- An EDT becomes runnable once all of its slots are satisfied, with the order of satisfaction unimportant
- Slots also optionally carry the arguments of an EDT, making the ordering of slots important

Slot states: Unconnected & Unsatisfied; Connected & Unsatisfied (after Add Dependence); Connected & Satisfied (after Event Satisfaction).

  • Slide 13

Example 3a: No implied ordering constraints

Parallel_1 and Parallel_2 can execute in parallel even if they share access to a data-block.

Figure labels (Concept): Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Shared Data.
Figure labels (OCR): Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Shared Data, E1, E2.

  • Slide 14
Example 3b: Single assignment update

Figure labels (Concept): Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data.
Figure labels (OCR): Setup EDT, Data, Parallel_1 EDT, Data1, Parallel_2 EDT, Data2, Wrapup EDT.

  • Slide 15

Example 4: FFT with a finish-EDT

Figure labels: Setup EDT, Done EDT, Finish EDT, FFT, DFT, Twiddle, X, Even, Odd, FFT(Even), FFT(Odd), FFT(X).

  • Slide 16

Finish-EDT

- All EDTs have a completion event associated with them
    - The event becomes satisfied when the EDT completes and carries the data-block returned by the EDT
- A finish-EDT's completion event has a special semantic
    - The event becomes satisfied when the EDT and all of its children complete
    - The event carries no data-block
- Use cases
    - Localized barrier-like synchronization
    - Allows for an unknown number of ancestors
- Note that no EDT waits for completion

  • Slide 17

API cheat sheet

EDT
- Templates: ocrEdtTemplateCreate(), ocrEdtTemplateDestroy()
- Tasks: ocrEdtCreate(), ocrEdtDestroy()

DBs
- Datablock management: ocrDbCreate(), ocrDbDestroy()
- Datablock usage: ocrDbRelease()

Events
- Event management: ocrEventCreate(), ocrEventDestroy()
- Event satisfaction: ocrEventSatisfy()

Dependence definition: ocrAddDependence()

Miscellaneous
- Entry point of OCR: mainEdt()
- Shutdown: ocrShutdown()

  • Slide 18

OCR ecosystem

Figure labels: FSim - TG Architecture; Low-level compilers; Platforms; OCR implementations; LLVM; OCR targeting TG; C, Array DSL; CnC; Hero Code; HC; CnC Translator; HC Compiler; R-Stream; HTA; PIL; Programming platforms; OCR API + Tuning Annotations; Open Community Runtime; x86; GCC; OCR targeting x86; Cluster; Evaluation platforms.

  • Slide 19

Handwritten
- LULESH
- Fast Fourier Transform, Stream, HPCG
- Synthetic Aperture Radar (SAR)
- Cholesky factorization
- CoMD

High-level tools generated
- LULESH (from Concurrent Collections)
- NAS Parallel Benchmarks (from Hierarchical Tile Array)
    - FFT, Conjugate Gradient, Integer Sort, Embarrassingly Parallel, etc.
- Jacobi, Successive Over-relaxation, etc. (RStream)

(Slide title: List of applications available on OCR)

  • Slide 20

Take-aways

- OCR API is at the assembly level; other tools are meant to sit between it and programmers (but heroes do exist)
- In OCR, the structure of the application and the application are mingled
    - Complicates the writing of the program
    - Much more scalable and adaptable approach
- Very few concepts, a multitude of ways to use them
    - EDTs, data-blocks, dependences
    - What is the most expressive and natural way to use them for you?
- Several implementations
    - X86 (you have a copy)
    - TG-FSim (demoed) and TG-Emul
    - MPI (demoed)

  • Slide 21

Case Study: FFT in OCR

  • Slide 22

Background

- Final year undergraduate project at Oregon State University
- OCR implementation of Fast Fourier Transform
    - Cooley-Tukey algorithm
    - Evolution from serial version
    - OCR behavior

  • Slide 23

Algorithm

- Divide-and-conquer
- Data-flow friendly

[Figure. Source: Wikimedia Commons]

  • Slide 24

Serial implementation

[Figure. Source: Wikimedia Commons]

  • Slide 25

Naive implementation

[Figure. Source: Wikimedia Commons]

  • Slide 26

Bounded implementation

[Figure. Source: Wikimedia Commons]

  • Slide 27

Bounded implementation with datablock

[Figure. Source: Wikimedia Commons]

  • Slide 28

Behavior

Table columns: Version; No.
of EDTs; Mean EDT longevity (us); Load variance across cores (%); Running time (s)

Rows (cell values run together in the transcript):
- Serial: 2167342070.73.36
- Naive parallel: 125829132535.1877.0
- Bounded parallel: 179319822.70.46
- Bounded parallel w/ datablocks: 179319462.90.45

Notes
- OCR X86 running FFT on a 2^32-sized dataset
- 2.9GHz Xeon, 16 cores; 8 cores made available to OCR
- Balance to be achieved between number and size of EDTs

  • Slide 29

Summary

- Serial implementation
- Naive parallelization: recursive division of DFT
- Bounded parallelization: division bounded by a working set size
- Bounded parallelization with datablocks: additionally, use 3 datablocks (input, real, imaginary portions)
- Possible next steps for better parallelism
    - Finer datablocks
    - Staggered creation of EDTs in the combination phase

  • Slide 30

TODO: Take-aways (FIXME)

- OCR API is at the assembly level; other tools are meant to sit between it and programmers
- Few simple concepts, multiple ways to use them
    - Interested in determining best use
- Dependence graph built on the fly:
    - Complicates the writing of the program
    - Scalable approach

  • Slide 31

Areas of investigation (FIXME)

- Development of a specification:
    - Memory model
    - Tuning hints and annotations
    - More expressive support for collectives

  • Slide 32

Backup

  • Slide 33

Strawman architecture

Intel Confidential / Internal Use Only

- Heterogeneous hierarchical architecture
- Tapered memory bandwidth
- Global, shared address space
- Software managed non-coherent memories
- Functional simulator available

Figure labels, in slide order:
- Execution Engine (XE): DP FP FMAC, DP FP FMAC; 32KB I$; 64KB SP; RF?; Application specific
- Control Engine (CE): GP Int, GP Int; 32KB I$; 64KB SP; RF?; System SW
- XE; CE; 1MB shared L2; Block (8 XE + CE); Cluster (16 Blocks); 8MB Shared LLC; Interconnect
- Processor Chip (16 Clusters)

  • Slide 34

OCR concepts: building blocks

- EDT (pre slots 0..N)
    - N pre slots (N known at creation time)
    - Optional attached completion event
- Data
    - No pre slots
    - Post slot always satisfied
- Evt (pre slots 0..N)
    - N pre slots (N fixed by type of event, NOT determined by user)
    - Post slot initially unsatisfied
- A slot is:
    - Connected (attached to another slot) or unconnected
    - Satisfied (user-triggered or runtime-triggered) or unsatisfied

Figure labels: Pre slots; Post slots (multiple connections).

  • Slide 35

OCR concepts: add dependence

- Argument 1: Data OR Evt
- Argument 2: EDT OR Evt
- => 1 of 4 possible combinations
- Result: connected

  • Slide 36

OCR concepts: satisfy

- Argument 1: EDT OR Evt
- Argument 2: Data OR NULL
- => 1 of 4 possible combinations
- Result: satisfied/triggered, carrying Data

  • Slide 37

Example 1: Producer/Consumer

- Dynamic dependence construction
- Producer and consumer never know about each other
- Focus on minimum needed for placement and scheduling

Figure labels (Concept): Consumer EDT, Producer EDT, Data.
Figure labels (OCR): Evt, Consumer EDT, Producer EDT, Data.
Figure call labels: (1) dbCreate; (*) addDep; (3) satisfy; (2) edit Data.
Legend: Who executes call; Data dependence; Control dependence.

  • Slide 38

Example 2: Simple synchronization

Control dependence is no different than a data dependence.

Figure labels (Concept): Step 1 EDT, Step 2-a EDT, Step 2-b EDT.
Figure labels (OCR): Evt, Step 1 EDT, Step 2-a EDT, Step 2-b EDT.
Figure call labels: (1) satisfy; (*) addDep; NULL.

  • Slide 39

Example 3: In place parallel update

Figure labels (Concept): Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data.
Figure labels (OCR): Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Finish EDT, Wrapup EDT.
Figure call labels, in slide order: (1) dbCreate; (1) edtCreate; (3) edtCreate; (4) addDep; (2) addDep; (3) edtCreate.

  • Slide 40

Example 4: Single assignment update

Figure labels (Concept): Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data.
Figure labels (OCR): Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT.
Figure call labels, in slide order: (1)
dbCreate; (1) edtCreate; (2) addDep; (4) dbCreate; (5) satisfy; (3) addDep; (1) evtCreate.
Further figure labels: Data1, Data2, Evt1, Evt2.

  • Slide 41

Preliminary results

- On some code, OCR matches or bests OMP
- Simple scheduler, no data-blocks (very preliminary but promising)

  • Slide 42

OCR vs other solutions

Columns: CnC | MPI | OCR | OpenMP | TBB

- Execution model: Tasks | Bulk Sync | Fine-grained tasks | Bulk Sync | Tasks
- Memory model: Shared memory | Explicit message passing | Explicit; global | Shared memory
- Separation of concerns?: Yes | No | Yes | No | Yes (but can dig deeper)
- Synchronization: ? | API | Explicit | Implicit & Explicit mechanisms | ?
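
The execution model described in the talk (EDTs with ordered slots, events that propagate satisfaction, and completion events that carry the data-block an EDT returns) can be illustrated with a small toy simulation. The sketch below is not the OCR API and not any OCR implementation: `Runtime`, `Edt`, `Event`, `edt_create`, `add_dependence`, `event_satisfy` and `demo` are hypothetical Python stand-ins loosely named after the calls on the API cheat sheet, the scheduler is a single-threaded queue, and data-blocks are plain Python objects. It mirrors Example 2b: Step 2 has two slots fed by the completion events of Step 1-a and Step 1-b.

```python
from collections import deque


class Event:
    """Toy event: remembers its payload and the slots connected to it."""

    def __init__(self):
        self.satisfied = False
        self.payload = None
        self.sinks = []  # (edt, slot) pairs waiting on this event


class Edt:
    """Toy EDT: a function plus a fixed number of dependence slots."""

    def __init__(self, func, depc):
        self.func = func
        self.slots = [None] * depc       # data-block received on each slot
        self.satisfied = [False] * depc  # slots start unsatisfied
        self.completion = Event()        # satisfied when the EDT finishes


class Runtime:
    """Hypothetical single-threaded scheduler (an assumption of this sketch)."""

    def __init__(self):
        self.ready = deque()  # EDTs whose slots are all satisfied

    def edt_create(self, func, depc):
        edt = Edt(func, depc)
        if depc == 0:  # nothing to wait for: runnable immediately
            self.ready.append(edt)
        return edt

    def add_dependence(self, source, edt, slot):
        """Connect an Event, or hand a data-block directly, to one slot."""
        if isinstance(source, Event) and not source.satisfied:
            source.sinks.append((edt, slot))
        elif isinstance(source, Event):
            self._satisfy_slot(edt, slot, source.payload)
        else:
            self._satisfy_slot(edt, slot, source)

    def event_satisfy(self, event, payload=None):
        """Satisfy an event; payload=None models a pure control dependence."""
        event.satisfied = True
        event.payload = payload
        for edt, slot in event.sinks:
            self._satisfy_slot(edt, slot, payload)
        event.sinks.clear()

    def _satisfy_slot(self, edt, slot, payload):
        if edt.satisfied[slot]:
            raise ValueError("slot already satisfied")
        edt.satisfied[slot] = True
        edt.slots[slot] = payload
        if all(edt.satisfied):  # order of satisfaction does not matter
            self.ready.append(edt)

    def run(self):
        """Run ready EDTs to completion, one at a time, until none remain."""
        while self.ready:
            edt = self.ready.popleft()
            result = edt.func(self, edt.slots)
            self.event_satisfy(edt.completion, result)


def demo():
    rt = Runtime()
    out = []

    def step_1a(rt, deps):
        return {"value": 40}

    def step_1b(rt, deps):
        return {"value": 2}

    def step_2(rt, deps):
        a, b = deps  # slot 0, slot 1: slot order fixes argument order
        out.append(a["value"] - b["value"])

    step2 = rt.edt_create(step_2, depc=2)
    first_b = rt.edt_create(step_1b, depc=0)  # will run before step_1a
    first_a = rt.edt_create(step_1a, depc=0)
    rt.add_dependence(first_b.completion, step2, slot=1)
    rt.add_dependence(first_a.completion, step2, slot=0)
    rt.run()
    return out
```

In `demo()`, slot 1 is satisfied before slot 0, yet Step 2 still sees its inputs in slot order, which is the point made on the "Slots of an EDT" slide. Calling `event_satisfy(evt)` with no payload corresponds to the NULL case in the simple-synchronization example. A real runtime would additionally place EDTs and data-blocks across cores and memories, which this sketch does not attempt.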