Attacking the programming model wall

DESCRIPTION

Marco Danelutto, Dept. Computer Science, Univ. of Pisa. Belfast, February 28th 2013.

Setting the scenario (HW): market pressure; multicores (Moore law from components to cores; simpler cores, shared memory, cache coherent, full interconnect); manycores.

TRANSCRIPT
Page 1

Attacking the programming model wall
Marco Danelutto
Dept. Computer Science, Univ. of Pisa
Belfast, February 28th 2013
Page 2

Setting the scenario (HW)

- Market pressure
- HW advances
- Power wall
Page 3

Market pressure

- Moore law: from components to cores
- New needs: gesture/voice interaction, 3D graphics
- Supercomputing: new applications, larger data sizes
Page 4

Multicores

Moore law from components to cores. Simpler cores, shared memory, cache coherent, full interconnect.

| Name     | Cores | Contexts | Sockets | Cores per board |
|----------|-------|----------|---------|-----------------|
| AMD 6176 | 12    | 1        | 4       | 48              |
| E5-4650  | 8     | 2        | 4       | 64              |
| SPARC T4 | 8     | 8        | 4       | 512             |
Page 5

Manycores

Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe).
Options for cache coherence; more complex inter-core communication protocols.

| Name        | Core type | Cores | Contexts | Interconnect | Mem controllers |
|-------------|-----------|-------|----------|--------------|-----------------|
| TileraPro64 | VLIW      | 64    | 64       | Mesh         | 4               |
| Intel PHI   | IA-64     | 60    | 240      | Ring         | 8 (2-way)       |
Page 6

GPUs

ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe).
Data parallel computations only.

| Name         | Cores | Memory interface | Memory bandwidth |
|--------------|-------|------------------|------------------|
| nVidia C2075 | 448   | 384 bit          | 144 GB/s         |
| nVidia K20X  | 2688  | 384 bit          | 250 GB/s         |
Page 7

FPGA

Low-scale manufacturing; accelerators (mission-critical sw); GP computing (PCIe co-processors, CPU socket replacement); possibly hosting GP CPU/cores.
Non-standard programming tools.

| Name     | Cells     | Block RAM | Memory bandwidth |
|----------|-----------|-----------|------------------|
| Artix 7  | 215,000   | 13 Mb     | 1,066 MB/s       |
| Virtex 7 | 2,000,000 | 68 Mb     | 1,866 MB/s       |
Page 8

Power wall

- Power cost > hw cost
- Thermal dissipation cost
- FLOP/Watt is "the must"
Page 9

Power wall (2)

Reducing idle costs
- E4 CARMA CLUSTER: ARM + nVIDIA; spare Watt → GPU

Reducing the cooling costs
- Eurotech AURORA TIGON: Intel technology, water cooling; spare Watt → CPU
Page 10

Setting the scenario (SW)

- Close-to-metal programming models
- SIMD / MIMD abstraction programming models
- High-level, high-productivity programming models
Page 11

Programming models

Low abstraction level
- Pros: performance/efficiency; heterogeneous hw targeting
- Cons: huge application programmer responsibilities; portability (functional, performance); quantitative parallelism exploitation

High abstraction level
- Pros: expressive power; separation of concerns; qualitative parallelism exploitation
- Cons: performance/efficiency; hw targeting
Page 12

Separation of concerns

Functional
- What has to be computed
- Function from input data to output data
- Domain specific
- Application dependent

Non-functional
- How the result is computed
- Parallelism, power management, security, fault tolerance, …
- Target hw specific
- Factorizable
Page 13

Supported programming paradigms

Current programming frameworks, ordered by expressive power: OpenCL, OpenMP, MPI, CILK, TBB
Page 14

Urgencies

Market pressures + HW advances + low-level programming models → need for:
a) parallel programming models
b) parallel programmers
Page 15

Structured parallel programming

Algorithmic skeletons
- From the HPC community
- Started in the early '90s (M. Cole's PhD thesis)
- Pre-defined parallel patterns, exposed to programmers as programming constructs/library calls

Parallel design patterns
- From the SW engineering community
- Started in the early '00s
- "Recipes" to handle parallelism (name, problem, forces, solutions, …)
Page 16

Similarities

- High level abstractions
- Clear programming model
- Compiling tools + run time systems
Page 17

Algorithmic skeletons

- Common, parametric, reusable parallelism exploitation patterns (from the HPC community)
- Exposed to programmers as constructs, library calls, objects, higher order functions, components, ...
- Composable: two-tier model, "stream parallel" skeletons with inner "data parallel" skeletons
Page 18

Sample classical skeletons

Stream parallel
- Parallel computation of different items from an input stream
- Task farm (master/worker), pipeline

Data parallel
- Parallel computation on (possibly overlapped) partitions of the same input data
- Map, stencil, reduce, scan, mapreduce
Page 19

Evolution of the concept

- '90s: complex patterns, no composition; targeting clusters; mostly libraries (RTS)
- '00s: simple data/stream parallel patterns; composable, targeting COW/NOW; libraries + first compilers
- '10s: optimized, composable building blocks; targeting clusters of heterogeneous multicores; quite complex tool chain (compiler + RTS)
Page 20

Evolution of the concept (2)

- '90s: Cole's PhD thesis skeletons; P3L (Pisa); SCL (Imperial College London)
- '00s: Lithium/Muskel (Pisa), Skandium (INRIA); Muesli (Muenster), SkeTo (Tokyo); OSL (Orleans), Mallba (La Laguna)
- '10s: SkePU (Linkoping); FastFlow (Pisa/Torino); TBB? (Intel), TPL? (Microsoft)
Page 21

Implementing skeletons

Template based
- Skeleton implemented by instantiating a "concurrent activity graph template"
- Performance models used to instantiate quantitative parameters
- P3L, Muesli, SkeTo, FastFlow

Macro Data Flow based
- Skeleton program compiled to macro data flow graphs
- Rewriting/refactoring compiling process
- Parallel MDF graph interpreter
- Muskel, Skipper, Skandium
Page 22

Refactoring skeletons

Formally proven rewriting rules:

Farm(Δ) = Δ
Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2)
Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))
Page 23

Sample refactoring: normal form

- Pipe(Farm(Δ1), Δ2): service time = max_{i=1,2} {stage_i}; Nw = nw(farm) + 1
- Pipe(Δ1, Δ2): higher service time; Nw = 2
- SeqComp(Δ1, Δ2): sequential service time; Nw = 1
- Farm(SeqComp(Δ1, Δ2)): service time < original, with fewer resources (normal form)
Page 24

Performance modelling

Page 25

Sample performance models

- Pipeline service time: max_{i=1..k} { serviceTime(Stage_i) }
- Pipeline latency: Σ_{i=1..k} serviceTime(Stage_i)
- Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers }
- Map latency: partitionTime + workerTime + gatherTime
Page 26

Key strengths

- Full parallel structure of the application exposed to the skeleton framework: exploited by optimizations and by support for autonomic non-functional concern management
- Framework responsibility for architecture targeting: write once, run everywhere code, with architecture-specific compiler and back end (run time) tools
- Only functional debugging required of application programmers
Page 27

Ideally

- Expressive power reduces time to deploy
- Parallel structure exposed guarantees performance
Page 28

Assessments

- Separation of concerns: application programmer → WHAT; system programmer → HOW
- Inversion of control: structure suggested, interpreted by tools
- Performance: close to hand-coded programs, at a fraction of the development time
Page 29

Parallel design patterns

Carefully describe a parallelism exploitation pattern, including:
- Applicability
- Forces
- Possible implementations/problem solutions

As text, at different levels of abstraction.
Page 30

Pattern spaces

- Finding concurrency
- Algorithm space
- Supporting structure
- Implementation mechanism
Page 31

Patterns

Collapsed in algorithmic skeletons:
- application programmer → concurrency and algorithm spaces
- skeleton implementation (system programmer) → supporting structures and implementation mechanisms
Page 32

Structured parallel programming: design patterns

Problem → design patterns (follow, learn, use) → programming tools → low level code
Page 33

Structured parallel programming: skeletons

Problem → skeleton library (instantiate, compose) → high level code
Page 34

Structured parallel programming

Problem → design patterns (use knowledge to instantiate, compose skeletons) → high level code
Page 35

Working unstructured

Tradeoffs:
- CPU/GPU threads
- Processes/threads
- Coarse/fine grain tasks

Target architecture dependent decisions on the concurrent activity set: threads/procs, synchronization, memory.
Page 36

Threads/processes

- Creation: thread pool vs. on-the-fly creation
- Pinning: operating system dependent effectiveness
- Memory management: embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
Page 37

Memory

- Cache-friendly algorithms: minimization of cache coherency traffic; data alignment/padding
- Memory wall: 1-2 memory interfaces per 4-8 cores; 4-8 memory interfaces per 60-64 cores (+ internal routing)
Page 38

Synchronization

- High level, general purpose mechanisms: passive wait, high latency
- Low level mechanisms: active wait, smaller latency
- Eventually: synchronization on memory (fences)
Page 39

Devising parallelism degree

- Ideally: as many parallel activities as necessary to sustain the input data rate
- Base measures: estimated input pressure & task processing time, communication overhead
- Compile vs. run time choices: try to devise statically some optimal values; adjust the initial settings dynamically based on observations
Page 40

NUMA memory exploitation

- Auto scheduling: idle workers request tasks from a "global" queue; far nodes request fewer than near ones
- Affinity scheduling: tasks scheduled on the producing cores
- Round-robin allocation of dynamically allocated chunks
Page 41

More separation of concerns

- Algorithmic code: computing the application results out of the input data
- Non-functional code: programming performance, security, fault tolerance, power management
Page 42

Behavioural skeletons

The structured parallel algorithm code (e.g. a tree of Pipe, Map and Seq skeletons) exposes sensors & actuators; an NFC autonomic manager reads them and runs an ECA rule based program.

- Autonomic manager: executes a MAPE loop. At each iteration, an ECA (Event Condition Action) rule system is executed using the monitored values, possibly operating actions on the structured parallel pattern.
- Sensors: determine what can be perceived of the computation.
- Actuators: determine what can be affected/changed in the computation.
Page 43

Sample rules

- Event: inter-arrival time changes. Condition: faster than the service time. Action: increase the parallelism degree.
- Event: fault at a worker. Condition: service time low. Action: recruit a new worker resource.
Page 44

BS assessments

Page 45

Yes, nice, but then?

We have MPI, OpenMP, CUDA, OpenCL …
Page 46

FastFlow

Full C++, skeleton based, streaming parallel processing framework.

- Streaming network patterns: pipeline, farm; composable, customizable
- Arbitrary streaming networks: lock-free SPMC, MPSC & MPMC queues
- Simple streaming networks: lock-free SPSC queue; general threading model

http://mc-fastflow.sourceforge.net
Page 47

Bring skeletons to your desk

- Full POSIX/C++ compliance: g++, make, gprof, gdb, pthread, …
- Reuse existing code: proper wrappers
- Run from laptops to clusters & clouds: same skeleton structure
Page 48

Basic abstraction: ff_node

An ff_node has an input channel and an output channel; its svc() method processes each input task.

```cpp
class RedEye: public ff_node {
  ...
  int svc_init() { ... }
  void svc_end() { ... }
  void *svc(void *task) {
    Image *in = (Image *)task;
    Image *out = ...;
    return (void *)out;
  }
  ...
};
```
Page 49

Basic stream parallel skeletons

- Farm(Worker, Nw): embarrassingly parallel computations on streams; computes Worker in parallel (Nw copies); emitter + string of workers + collector implementation
- Pipeline(Stage1, …, StageN): StageK processes the output of Stage(K-1) and delivers to Stage(K+1)
- Feedback(Skel, Cond): routes results from Skel back to the input, or forward to the output, depending on Cond
Page 50

Setting up a pipeline

```cpp
ff_pipeline myImageProcessingPipe;

ff_node *startNode = new Reader(...);
ff_node *redEye    = new RedEye();
ff_node *light     = new LightCalibration();
ff_node *sharpen   = new Sharpen();
ff_node *endNode   = new Writer(...);

myImageProcessingPipe.addStage(startNode);
myImageProcessingPipe.addStage(redEye);
myImageProcessingPipe.addStage(light);
myImageProcessingPipe.addStage(sharpen);
myImageProcessingPipe.addStage(endNode);

myImageProcessingPipe.run_and_wait_end();
```
Page 51

Refactoring (farm introduction)

```cpp
ff_farm<> thirdStage;             // a farm replacing the sequential stage
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());
thirdStage.add_workers(w);
...
myImageProcessingPipe.addStage(thirdStage);  // instead of the Sharpen node
...
```
Page 52

Refactoring (map introduction)

```cpp
ff_farm<> thirdStage;
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());
thirdStage.add_workers(w);
Emitter em;    // scatters data to the workers
Collector co;  // collects results from the workers
thirdStage.add_emitter(em);
thirdStage.add_collector(co);
...
myImageProcessingPipe.addStage(thirdStage);
...
```
Page 53

FastFlow accelerator

1. Create a suitable skeleton accelerator
2. Offload tasks from the main (sequential) business logic code
3. The accelerator exploits the "spare cores" on your machine
Page 54

FastFlow accelerator

```cpp
ff_farm<> farm(true);  // create the accelerator
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Worker);
farm.add_workers(w);
farm.add_collector(new Collector);
farm.run_then_freeze();

while (...) {
  ...
  farm.offload(x);     // offload tasks
  ...
}
...
while (farm.load_result(&res)) {
  ...                  // eventually process results
}
```
Page 55

0.5 msec tasks

Page 56

5 msec tasks

Page 57

50 msec tasks

Page 58

Scalability (vs. PLASMA)

Page 59

Farm vs. Macro Data Flow
Page 60

FastFlow evolution

- Data parallelism: generalized map; reduce
- Distributed version: transport level channels; distributed orchestration of FF applications
- Heterogeneous hw targeting: simple map & reduce offloading; fair scheduling to CPU/GPU cores
Page 61

Distributed version
Page 62

Cloud offloading

Application runs on a COW → not enough throughput → monitored on a single cloud resource → performance model: number of necessary cloud resources → distributed version (local + cloud) with the required throughput
Page 63

Cloud offloading (2)

Page 64

GPU offloading

Performance modelling of the percentage of tasks to offload (in a map or in a reduce)

Page 65

GPU map offloading
Page 66

Moving to MIC

FastFlow ported to Tilera Pro64 (paper this afternoon):
- with minimal intervention to ensure functional portability
- and some more changes made to support better hw optimizations
Page 67

FastFlow on Tilera Pro64

Page 68

Conclusions

Page 69

Thanks to Marco Aldinucci, Massimo Torquati, Peter Kilpatrick, Sonia Campa, Giorgio Zoppi, Daniele Buono, Silvia Lametti, Tudor Serban.

Any questions?