Deep Dive, University of Florida (02/2015)

Deep Dive Meeting February 3-4, 2015


Page 1: Deep Dive, University of Florida (02/2015)

Deep Dive Meeting

February 3-4, 2015

Page 2: Deep Dive, University of Florida (02/2015)
Page 3: Deep Dive, University of Florida (02/2015)

Deep Dive University of Florida

February 3-4, 2015

Current Attendee List:

Bob Voigt, NNSA HQ, [email protected]
Matt Bement, LANL, [email protected]
David Daniel, LANL, [email protected]
Dave Nystrom, LANL, [email protected]
Maya Gokhale, LLNL, [email protected]
Martin Schulz, LLNL, [email protected]
Jim Ang, SNL, [email protected]
Arun Rodrigues, SNL, [email protected]
Jeremy Wilke, SNL, [email protected]

S. Balachandar "Bala", University of Florida, [email protected]
Alan George, University of Florida, [email protected]
Rafi Haftka, University of Florida, [email protected]
Herman Lam, University of Florida, [email protected]
Sanjay Ranka, University of Florida, [email protected]
Greg Stitt, University of Florida, [email protected]
Tom Jackson, University of Florida, [email protected]
Tania Banerjee, University of Florida, [email protected]

University of Florida Students:
Dylan Rudolph, [email protected]
Nalini Kumar, [email protected]
Carlo Pascoe, [email protected]
Kasim Alli, [email protected]
Chris Hajas, [email protected]
Mohammed Gadou
Michael Retherford

NOTE: There will be a $125.00 registration fee per person to cover all expenses associated with the meeting. Among other things, this will cover breakfast and lunch for two days, dinner Tuesday night, coffee breaks, etc. Please make checks payable to the University of Florida. A receipt will be available at the meeting.

NOTE: The meeting is all day Tuesday and ½ day on Wednesday. We will provide transportation to the airport as needed. Please make reservations at the University Hilton.

Page 4: Deep Dive, University of Florida (02/2015)

UF Deep-Dive Agenda: Tuesday, February 3, 2015

8:20 Van pickup at Hilton

8:30 – 9:00 Breakfast

9:00 – 9:30 Welcome and Deep-Dive Overview (3 Sessions)

1. Behavioral emulation (BE): modeling & simulation/emulation methods
2. CS issues (performance, energy, and thermal)
3. Use of reconfigurable computing to accelerate behavioral emulation

* Each of the three deep-dive sessions is designed to be interactive: a combination of short presentations by UF and Tri-lab researchers, intermixed with discussion, demonstrations, etc.

9:30 – 11:30 Session 1: Behavioral Emulation: Modeling & Simulation/Emulation Methods

UF topics:
o Behavioral characterization
o Parameter estimation

Tri-lab topics:
o Overview of FastForward 2 and DesignForward 2 (Jim Ang, SNL)
o Multi-scale architectural simulation with the Structural Simulation Toolkit (Arun Rodrigues, SNL)

11:30 – 12:30 Lunch

12:30 – 2:00 Session 1 (continued): Behavioral Emulation: Beyond Device Level

UF topics:
o Synchronization for speed
o Congestion modeling
o Behavioral characterization & modeling beyond device level

Tri-lab topics:
o Using discrete event simulation for programming model exploration at extreme scale (Jeremy Wilke, SNL)
o ASC next-generation code projects (David Daniel, LANL)

2:00 – 5:00 Session 2: CS Issues (Performance, Energy, and Thermal)

UF topics:
o Performance and autotuning for hybrid architectures
o Energy and thermal optimization
o Dynamic load balancing

Tri-lab topics:
o Performance, energy, and thermal benchmarking (Jim Ang, SNL)
o Why power is a performance issue: utilizing overprovisioned systems (Martin Schulz, LLNL)

* There will be an afternoon coffee break in this time slot

6:30 Dinner (University Hilton)

Page 5: Deep Dive, University of Florida (02/2015)

Wednesday, February 4, 2015

8:20 Van pickup

8:30 – 9:00 Breakfast

9:00 – 11:00 Session 3: Use of Reconfigurable Computing to Accelerate Behavioral Emulation

UF topics:
o Efficient mapping of behavioral emulation objects (BEOs) onto a system of FPGAs
o Demo of current single-FPGA prototype
o Transitioning to multiple FPGAs
o Challenges associated with maximizing emulation speed while maintaining scalability/usability

Tri-lab topic:
o FPGA-based emulation of processing near memory (Maya Gokhale, LLNL)

11:00 – 12:00 Open discussion and planning for action items

12:00 Box lunch; transportation to airport as needed.


CCMT

Behavioral Emulation for Design-Space Exploration of CCMT Apps

Principal Investigators: Dr. Alan George, Dr. Herman Lam, Dr. Greg Stitt
Student Project Leaders: Nalini Kumar, Carlo Pascoe, Dylan Rudolph
NSF Center for High-Performance Reconfigurable Computing (CHREC)
ECE Department, University of Florida

Page 9: Deep Dive, University of Florida (02/2015)

Outline
• Project context, scope, & focus
• Behavioral Emulation approach
• Research thrusts

Context: DOE Co-design

Page 10: Deep Dive, University of Florida (02/2015)

Approach: BEOs & Behavioral Emulation Flow

[Figure: apps & kernels (skeleton apps at macro-scale, mini-apps at meso-scale, kernels at micro-scale) map to Application BEOs (AppBEOs); existing and future-gen/notional systems & architectures (system at macro-scale, node at meso-scale, device at micro-scale) map to Architecture BEOs (ArchBEOs). Both feed the simulation/emulation platform, which supports testbed benchmarking & experimentation, behavioral simulation (SW) or emulation (HW) experimentation, and notional systems exploration, driving application and architecture design-space exploration. Example AppBEO code: init(device); mem_init(A); mem_init(B); broadcast(A, comm_grp); scatter(B, B*, comm_grp); compute(dot_product, A, B*);]

Page 11: Deep Dive, University of Florida (02/2015)

Scope and Focus

Application Design-Space Exploration (DSE)
• Given characteristics of promising exascale architectures (at device, node, and system level)
  – C/o vendor roadmaps & future technologies (e.g., FastForward 2 & DesignForward 2)
  – Explore different ways to parallelize exascale applications
• Focused study
  – Not intending to perform DSE on all types of exascale apps
  – Focus on exascale apps relevant to our CCMT Center
• Along with the Behavioral Emulation approach (discussed later), this focus allows for optimizations & modeling techniques not available in general-purpose system simulators

Approach: Behavioral Emulation

• How may we study Exascale before the age of Exascale?
  – Analytical studies: systems are too complicated
  – Software simulation: simulations are too slow at scale
  – Behavioral emulation: to be defined herein
  – Functional emulation: systems too massive and complex
  – Prototype device: future technology, does not exist
  – Prototype system: future technology, does not exist
• Many pros and cons with the various methods
  – We believe behavioral emulation is most promising in terms of balance of project goals (accuracy, speed, and scalability, as well as versatility)

Page 12: Deep Dive, University of Florida (02/2015)

Behavioral Emulation (BE)

• Component-based, coarse-grained simulation
  – Fundamental constructs called BE Objects (BEOs) act as surrogates
  – BEOs characterize & represent behavior of app, device, node, & system objects as fabrics of interconnected ArchBEOs (with AppBEOs) up to Exascale
• Multi-scale simulation
  – Hierarchical method based upon experimentation, abstraction, exploration
• Multi-objective simulation
  – Performance, power, reliability, and other environmental factors

Fundamental Design of an Arch BEO

Arch BEO: abstract model (surrogate) of an architecture object; the basic primitive in the BE approach to studies of Exascale systems.

Emulation Plane
• Mimic appropriate behavior of the modeled object
• Interact with other BEOs via tokens to support emulation studies

Management Plane
• Measure, collect, and/or calculate metrics and statistics
• Support architectural exploration

Metrics
• Performance factors (execution time, speedup, latency, throughput, etc.)
• Environmental factors (power, energy, cooling, temperature)
• Dependability factors (reliability, availability, redundancy, overhead)

[Figure: Architecture Behavioral Emulation Object (BEO) with an emulation plane (computation, communication, power, and reliability models) and a management plane (measurement, data collection, & synchronization); tokens flow to/from other BEOs.]
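The two-plane design above can be sketched in code. This is an illustrative sketch only, not the UF implementation: the class name, the model callables, and all numeric costs are assumptions made for the example.

```python
# Illustrative sketch (not the UF implementation): an architecture BEO with
# an emulation plane (pluggable behavioral models) and a management plane
# (metrics collection). All names and costs here are invented for the demo.

class ArchBEO:
    """Abstract surrogate for one architecture object (device, node, ...)."""

    def __init__(self, name, compute_model, comm_model):
        self.name = name
        # Emulation plane: behavioral models plugged in at construction.
        self.compute_model = compute_model   # op -> predicted time cost
        self.comm_model = comm_model         # token -> transit time cost
        self.clock = 0.0                     # local simulated time
        # Management plane: measurement and data collection.
        self.metrics = {"ops": 0, "tokens": 0, "busy_time": 0.0}

    def execute(self, op, *args):
        """Internal event: advance the local clock by the modeled cost."""
        dt = self.compute_model(op, *args)
        self.clock += dt
        self.metrics["ops"] += 1
        self.metrics["busy_time"] += dt
        return dt

    def emit_token(self, dest, payload):
        """Send event: produce a token stamped with the local clock."""
        self.metrics["tokens"] += 1
        return {"src": self.name, "dest": dest,
                "timestamp": self.clock, "payload": payload}

# Toy usage: a device BEO whose compute model charges 2.0 time units per
# "dot_product" and 0.5 per anything else (made-up numbers).
beo = ArchBEO("device0",
              compute_model=lambda op, *a: 2.0 if op == "dot_product" else 0.5,
              comm_model=lambda token: 1.0)
beo.execute("mem_init")
beo.execute("dot_product")
token = beo.emit_token("device1", "result")
```

The split mirrors the slide: swapping a compute or power model changes emulated behavior without touching the metrics bookkeeping.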

Page 13: Deep Dive, University of Florida (02/2015)

Conclusions: Research Thrusts

• Behavioral Characterization (Session 1 morning)
  – How do we build, calibrate, then validate performance models?
• Parameter Estimation (Session 1 morning)
  – How do we efficiently capture behavior in surrogates?
• Synchronization & Congestion (Session 1 afternoon)
  – How do we handle sync and congestion at scale?
• Management & Visualization
  – How do we measure & analyze massive systems & apps?
• Reconfigurable Architectures (Session 3)
  – How do we exploit FPGA hardware for speed & scale?
• Resilience & Energy (starting after Y1)
  – How do we extend beyond performance attributes?

(Thrusts span two research areas: BE Modeling Research and Platform Research)

Page 14: Deep Dive, University of Florida (02/2015)

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

DOE's Fast Forward and Design Forward R&D Projects: Influence Exascale Hardware

James A. Ang, Ph.D.
Manager, Scalable Computer Architectures
Sandia National Laboratories, Albuquerque, NM

University of Florida CCMT Exascale Deep Dive Workshop
Gainesville, FL, February 3-4, 2015

SAND2015-0626 PE

Exascale Hardware Challenges

• Left to the Invisible Hand, industry follows an evolutionary path focused on Peak Flops
• In the era of Dennard Scaling, our ad hoc approach to integration of MPPs with COTS microprocessors was acceptable
• With the end of Dennard scaling, this is no longer able to meet DOE Mission Application Requirements

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith, 2004

Page 15: Deep Dive, University of Florida (02/2015)

Exascale Hardware Challenges – cont.

• We need to Motivate and Influence Architectural Changes
  – Processor/node Architectures
  – System Architectures
• Our Investments are not only in Architectures
  – We cannot just develop new Exascale Architectures and throw it over the wall to our application developers
  – We need Hardware/Software Co-design
• The transition of the DOE Legacy Code base is another important challenge
  – Challenge should influence future hardware thru Co-design

[Figure: node stack with a multi-core processor layer, memory layers, and a network layer]

Industry Engagement is Vital

• We need industry involvement
  – Avoid one-off, stove-piped solutions
  – Continued "product" availability and upgrades beyond DOE support
• Industry cannot and will not solve the problem alone
  – Business model obligates industry to optimize for profit, beat competitors
  – Industry investments heavily weighted towards near-term, evolutionary improvements with small margin over competitors
  – Industry funding for long-term technology R&D is limited and constrained
  – Industry does not understand DOE Applications and Algorithms
• How can we impact industry?
  – Work with those that have strong advocate(s) within the company
  – Fund research, development and demonstration of long-term technologies that clearly show potential as future mass-market products (or product components)
  – Corollary: do not fund product development (as part of DOE R&D portfolio)
  – Industry will incorporate promising technologies into future product lines

Page 16: Deep Dive, University of Florida (02/2015)

NNSA/ASC and SC/ASCR are partnering to Influence Industry

• Aligned Hardware Architecture Efforts
  – April 2011: MOU signed between SC and NNSA
  – July 2011: Issued RFI on Critical Technologies for Exascale
  – July 2012: Established Fast Forward node-level Critical/Cross-Cutting Technology R&D projects
  – October 2013: Established Design Forward interconnect R&D projects
  – November 2014: Fast Forward 2: Exascale Node Designs
  – TBD: Design Forward 2: Conceptual Designs of Exascale Systems
• Aligned joint Advanced Technology platform procurements
  – CORAL: Oak Ridge, Argonne, and Lawrence Livermore National Labs
  – APEX: Los Alamos, Lawrence Berkeley and Sandia National Labs

Fast Forward Program

• Objective: Accelerate transition of innovative ideas from processor and memory architecture research into future products
• Evaluate advanced research concepts and develop quantitative evidence of their benefit for DOE applications, using Proxy apps and collaborating on Co-design
  – Engage DOE application teams to understand technology trends/constraints (how it impacts their code development)
  – Understand how to program these new features
• Quantitative evidence to lower risk to adoption of innovative ideas by product teams
• Critical Node Technologies and Designs for Extreme-scale Computing

Page 17: Deep Dive, University of Florida (02/2015)

Fast Forward Program

• Fast Forward 1 (July 2012 – Sept. 2014)
  – AMD: Heterogeneous processor, Processing-in-memory and 2-level Memory
  – IBM: Advanced Memory Concepts
  – Intel: Core energy efficiency and Processing-near-memory
  – Intel/Whamcloud: Storage reliability, I/O API, burst buffer management
  – Nvidia: Memory hierarchy, processor/packaging/programming
• Fast Forward 2 (Nov 2014 – 2016)
  – AMD: Near-threshold voltage logic, other low-power computing technologies, and new standardized memory interface
  – Cray: alternative processor design points including ARM microprocessors
  – IBM: investigate next-generation standardized memory interface
  – Intel: energy-efficient node and system architectures, including software targeted at developing extreme-scale systems
  – Nvidia: focus on energy efficiency, programmability and resilience

Design Forward Program

• Objective: R&D of interconnect architectures and conceptual designs for future extreme-scale computers
• Oct. 2013 – 2015, Design Forward 1: Interconnect Networks
  – Overall Interconnect Architecture
  – Interconnect Integration with Processor and Memory
  – Multiple Communication Library Progression and Interaction
  – Interconnect Fabrics and Management
  – Protocol Support
  – Scalability
• Start is imminent, Design Forward 2: System design and integration
  – Overall System Architecture
  – Energy Utilization
  – Resilience and Reliability
  – Data Movement through the System
  – Packaging Density
  – System Software
  – Programming Environment

Page 18: Deep Dive, University of Florida (02/2015)

AMD

• Processor Research
  – Heterogeneous nodes which blend CPU and GPU cores
  – Improved energy efficiency
  – Efficient communication and data movement across the die
  – Simplified programming models
• Memory Research
  – Investigating new memory technologies
  – Reduced data movement
  – Higher performance
  – Reduced energy consumption
  – New Memory Interface: standardized, robust interface to support integration of heterogeneous memory and cores
• Software Tools
  – HSA Foundation

[Figure: Concept Node Design]

Source: AMD FastForward Project Overview (https://asc.llnl.gov/fastforward/AMD-FF.pdf)

Intel

• Processor Research
  – Lightweight processor cores
  – Fast synchronization
  – Specialized aspects of ISA and processor for data movement
  – Tapered access to memory
• Interconnect Research
  – Tapering-bandwidth networks
  – Integration of NICs into processor
  – Intelligent data movement to reduce power
• Software Tools
  – Open Community Runtime (OCR)
  – Exploration of OpenMP and MPI as legacy environment

Source: Intel FastForward Project Overview (https://asc.llnl.gov/fastforward/Intel-FF.pdf and IPDPS 2013 talks)

Page 19: Deep Dive, University of Florida (02/2015)

NVIDIA

• Processor Research
  – Temporal SIMT and Scalarization
  – Reduce effect of wide vectors
  – Coherency and consistency across system
  – Hierarchical memory systems
• Interconnect Research
  – Open standards for the data center
  – Support direct GPU messaging
• Programmability
  – Global address spaces (PGAS)
  – Efficient cross-machine collectives
  – Fast synchronization
  – Active messages
  – Heterogeneous cores

Source: NVIDIA FastForward Project Overview (https://asc.llnl.gov/fastforward/Nvidia-FF.pdf)

IBM

• Memory Research
  – Novel computation near memory
  – Reduction in data movement and associated overhead
  – Advances in programming models, compiler and runtime environment
  – Leverage of emerging memory technologies
  – Advances in memory efficiency
  – Advances in memory system integration, power and reliability management
• Impact
  – Large reduction of data movement
  – Significant improvement in system-level performance, power efficiency, and reliability
  – Successful exploitation of novel architecture features while abstracting the hardware complexity, enabled by evolutionary and revolutionary approaches

Source: IBM FastForward Project Overview (https://asc.llnl.gov/fastforward/IBM-FF.pdf)

Page 20: Deep Dive, University of Florida (02/2015)

Cray

• Network Communication API
  – NIC functions to enable efficient execution of network API
  – Structures required to achieve scalability of a diverse range of traffic patterns?
  – Novel functions in future cores to facilitate efficient wakeup on the arrival of new data?
• Network Protocol
  – How can the NICs generate simple, small, HPC-optimized packets at a sufficient rate?
  – Interoperable protocols in support of heterogeneous, adaptive designs
  – What flexibility is needed to allow vendor differentiation?
• Network Management API
  – What are the important management functions to provision?
  – What structure of system management best serves those functions?
  – Standardized APIs to allow management of a variety of high-performance networks

ASC and ASCR are partnering on Joint Advanced Technology System Procurements

• The APEX (LANL, LBNL, and SNL) collaboration is intended to result in the procurement of two platforms in ~2020
  – NERSC/ASCR procurement of NERSC-9
  – ACES/ASC procurement of ATS-3 (Advanced Technology System)
• Both platforms will focus on meeting both mission needs and pursuing Advanced Technology concepts
  – We expect to use Non-Recurring Engineering investment to guide and improve system performance and productivity

Page 21: Deep Dive, University of Florida (02/2015)

High-level Design Philosophy for ATS-3

• Delivered application performance is the primary driver in support of mission requirements
  – Peak FLOPS requirement will not appear in RFP
• Advanced technology development is assumed to be necessary to meet mission needs
  – Accelerate development of yet-to-be-identified key technologies
  – 3rd round of NRE (Trinity/NERSC-8, CORAL, APEX)
• APEX are pre-exascale platforms
  – MUST support path to exascale programming models
  – While supporting existing mission needs
• Support MPI + OpenMP (threads)
  – Matured on Trinity/Cori and CORAL platforms
  – Additional support for other, yet-to-be-identified, MPI+X programming models

APEX Capability Improvement

• An increase in predictive capability requires increases in the fidelity of both geometric and physics models
  – This implies usable large platform memory capacity
• APEX must demonstrate a significant capability improvement
  – Improvement measured relative to Trinity (ATS-1) and Cori (NERSC-8)
  – Improvement as a function of performance (total time to solution), increased geometries, increased physics capabilities, power/energy efficiency, resilience and other factors
• Previous DOE investments assumed to be an integral part of production computing for APEX
  – Trinity/NERSC-8 NRE projects: Burst Buffer and Advanced Power Management
  – Fast Forward and Design Forward projects
• Potential Path Forward project
  – NRE could take select technologies the final yards towards production

Page 22: Deep Dive, University of Florida (02/2015)

Fast Forward and Design Forward Impact

• APEX Team is performing Market Surveys
  – Vendors visiting in phases starting in January 2015
  – IBM, Intel, Cray, Nvidia, AMD, SGI, HP, Micron, Broadcom, ARM, etc.
• Fast Forward and Design Forward accomplishments and progress have direct influence over the development of APEX technical requirements
• Developing NRE strategy
  – We started early to enable a richer range of NRE topics

Page 23: Deep Dive, University of Florida (02/2015)

Research Thrust: BE Characterization

Introduction
• Summary of topics that will be discussed in this session

Page 24: Deep Dive, University of Florida (02/2015)

Distributed Behavioral Emulation

• Goal: Enable fast and scalable simulation of Exascale systems
  – Requires efficient simulation representation and synchronization mechanisms for PDES*
• Different from other approaches in our definitions of processes, events, and event timings
  – Three kinds of events: send event, receive event, internal event
  – Relations between events are defined w.r.t. logical time: correspondence, causality, concurrency

*Parallel Discrete Event Simulation

How do we define processes and generate events?
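A minimal sketch can make these definitions concrete. This is illustrative only (not the UF code): each logical process keeps a Lamport-style logical clock, and the three event kinds update it so that a receive is always ordered after its matching send (causality); events with no such ordering are concurrent.

```python
# Minimal PDES sketch (illustrative, not the UF implementation): logical
# processes (LPs) with a Lamport-style logical clock and the three event
# kinds named above: send, receive, and internal.

from dataclasses import dataclass, field

@dataclass
class LP:
    name: str
    clock: int = 0
    log: list = field(default_factory=list)

    def internal(self, label):
        self.clock += 1
        self.log.append((self.clock, "internal", label))

    def send(self, label):
        self.clock += 1
        self.log.append((self.clock, "send", label))
        return self.clock  # timestamp carried by the message

    def recv(self, label, sent_ts):
        # Causality: the receive must follow the send in logical time.
        self.clock = max(self.clock, sent_ts) + 1
        self.log.append((self.clock, "recv", label))

p0, p1 = LP("proc0"), LP("proc1")
p0.internal("compute")
ts = p0.send("B*")          # send event on proc0
p1.recv("B*", ts)           # matching receive event on proc1
p1.internal("dot_product")  # concurrent w.r.t. later proc0 events
```

Here correspondence is the send/recv pair, causality is the `max(..., sent_ts) + 1` rule, and any two unordered events (e.g., later work on `p0` vs. `p1`) are concurrent.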

PDES Mapping of Matrix Multiply

• Assuming one thread per processor, one logical process is generated for each processor core
  – ProcBEOs (logical processes) read events from AppBEOs (event queues)
• Partial ordering of events by assigning integer timestamps (logical clock)
• Real-clock timestamps are used to estimate execution time

Pseudo-code for parallel matrix multiply (C = B x A):

if (node == 0) {
  broadcast (A, comm_grp);
  barrier ();
  scatter (B, B*, comm_grp);
  compute (dot_product, A, B*);
  gather (result, comm_grp);
} else {
  broadcast (A, comm_grp);
  barrier ();
  recv (B, B*, node_0);
  compute (dot_product, A, B*);
  send (result, node_0);
}

[Figure: timing diagram for the matrix multiply distributed simulation on a 4-core CPU; events e11 ... e45 on ProcBEO1-ProcBEO4, ordered along the time axis]

Where do we get the timestamps from?
How do we define processes?
How do we generate events?

Page 25: Deep Dive, University of Florida (02/2015)

Performance Models (1)

• Calibration data is used to develop interpolation models that predict execution time (performance models)
  – Data have varying dimension (one-dimensional: dot product; multi-dimensional: matrix multiply)
• We are using Kriging interpolation for multi-dimensional interpolation
  – More about performance models in the next talk of this session

[Figure: calibration loop. Training/calibration data from an experimental testbed, cycle-accurate device simulator, Fast Forward 2 vendors, etc. trains an interpolation model execution_time = f(...); the model predicts execution time for test inputs; if the estimate exceeds an error threshold, more calibration data is collected.]
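To make the calibrate-then-predict loop concrete, here is a minimal ordinary-kriging sketch. It is a stand-in, not the project's models: the Gaussian covariance, the hand-picked length scale, and the calibration numbers are all assumptions for illustration.

```python
# Sketch of a kriging-style performance model (assumed details: Gaussian
# covariance, fixed length scale, no nugget; the real calibration data and
# model choices are the project's own).
import numpy as np

def fit_kriging(X, y, length_scale=200.0):
    """Ordinary kriging over calibration points X (n x d) with times y (n)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / length_scale**2)
    # Augment with the unbiasedness constraint (weights sum to 1).
    A = np.block([[K, np.ones((n, 1))], [np.ones((1, n)), np.zeros((1, 1))]])

    def predict(x):
        x = np.asarray(x, dtype=float)
        k = np.exp(-((X - x) ** 2).sum(-1) / length_scale**2)
        w = np.linalg.solve(A, np.append(k, 1.0))[:n]
        return float(w @ y)

    return predict

# Toy calibration: kernel execution time vs. problem size (made-up numbers
# standing in for testbed measurements).
sizes = [[64], [128], [256], [512]]
times = [0.9, 3.1, 11.8, 47.2]   # e.g., milliseconds
model = fit_kriging(sizes, times)
t_pred = model([192])            # interpolated estimate for an unseen size
```

With no nugget term the model reproduces the calibration points exactly, which matches the slide's use of interpolation (rather than regression) over measured data.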

How do we generate timestamps for internal events?

What about send and receive events?

Performance Models (2)

How do we generate timestamps for send & receive events?

• In order to update the timestamps of sending and receiving processes, we have to account for:
  – Time taken by the message to reach its destination
  – If the receiving process is busy, the time spent by the message waiting in the queue
  – Time the receiving process spends waiting for the message to arrive
• In BE, each send event generates an event token with a timestamp equal to its local clock
• We use network performance models to estimate the time taken by a token to reach its destination
  – Qualitative parameters are used to mimic movement of packets in the network
  – Quantitative parameters help in estimating communication time
    • Some quantitative parameters are functions of independent variables (e.g., latency)
    • Others are fixed information about the network (e.g., hop time)

Page 26: Deep Dive, University of Florida (02/2015)

ProcBEO & CommBEO Calibration

How do we represent the physical system with LPs in Behavioral Emulation? Each BEO represents a simulation LP.

ProcBEOs emulate processing units
• Initialization, computation, etc. are internal events
• Interactions with other BEOs are send/receive events
• Update local clocks based on performance models for compute operations
  – A compute operation can be decomposed several ways
  – e.g., matrix multiply can be broken down into either multiply and add operations, dot products, or smaller matrix multiplies
  – Need to account for any non-negligible overheads

CommBEOs emulate network switches
• Mimic communication on the network by sending/receiving event tokens, instead of real packets, to other CommBEOs
• Update the timestamp of the token at each hop through the network

Demo: AppBEO

• AppBEO for the 3D matrix multiply kernel used in Nek5000
  – AppBEO instructions are compiled into events that ProcBEOs can understand
  – Timestamps (estimates of event execution time) are generated and compiled pre-simulation whenever possible

Page 27: Deep Dive, University of Florida (02/2015)

Demo: ProcBEO

• ProcBEO instance to model a Tile-Gx36 processing unit
  – AppBEO instructions are resolved by the ProcBEO
  – The local processor clock is updated based on the event being processed (computation or communication)

Demo: CommBEO

• CommBEO instance to model a Tile-Gx36 switch
  – Event tokens are forwarded based on the destination
  – A virtual machine block informs the local ProcBEO of the CommBEO clock

Page 28: Deep Dive, University of Florida (02/2015)

Demo: Simulation

• Using the App-Proc-Comm BEO stack, we can define and simulate the behavior of a many-core device
  – This simulation has been set up for an 81-core device, of which only 36 cores are active in the simulation
  – Connections between the device cores are described using a routing table
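A small sketch shows how token forwarding over such a routing structure might look. Dimension-order (XY) routing and the per-hop cost are assumptions made for this example; the demo's actual routing table and timings are its own.

```python
# Sketch of CommBEO-style forwarding on a 2-D mesh. Dimension-order (XY)
# routing and a unit per-hop cost are assumptions for illustration.

HOP_COST = 1.0  # assumed per-hop timestamp increment

def route_xy(src, dest):
    """Dimension-order route between (x, y) switch coordinates."""
    x, y = src
    path = []
    while x != dest[0]:            # move along x first ...
        x += 1 if dest[0] > x else -1
        path.append((x, y))
    while y != dest[1]:            # ... then along y
        y += 1 if dest[1] > y else -1
        path.append((x, y))
    return path

def forward(token, dest):
    """Each hop through a CommBEO advances the token's timestamp."""
    for hop in route_xy(token["at"], dest):
        token["at"] = hop
        token["timestamp"] += HOP_COST
    return token

token = {"at": (0, 0), "timestamp": 10.0, "payload": "B*"}
token = forward(token, (5, 3))   # 5 + 3 = 8 hops across the mesh
```

In the full simulation, each switch on the path would be its own CommBEO LP; here the loop stands in for that chain of forwarding events.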

Behavioral Emulation Workflow

Step 1: Calibration
– Microbenchmarks
  • Computation
  • Communication

Step 2: Validation
– Kernels (e.g., 2D matrix multiply)
  • Computation
  • Communication

Step 3: Prediction
– Kernels on next-gen Tile-Gx72
– Kernels on anticipated Intel Knights Landing (KNL)

More on performance models in the next talk in this session.

Page 29: Deep Dive, University of Florida (02/2015)

Results: Communication Microbenchmarks

[Figure: % error vs. transfer size (32-bit words) when simulating Gather, Scatter, and Broadcast; series for 2, 4, 8, 16, and 32 cores; errors range roughly from -70% to +10%]

Simulation setup:
– Communication patterns: tree broadcast, naïve gather, naïve scatter
– BEOs modeled: Tilera iMesh network CommBEOs

Observations:
– Simulations under-predict execution time in most cases; calibration can be improved to account for setup overhead
– Accuracy broadly improves with increasing number of cores and transfer size
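The measured patterns differ sharply in how their cost grows with core count. A back-of-envelope step-count model (a simplification assumed here, not the calibrated BEO models) illustrates why:

```python
# Rough step counts for the benchmarked patterns (simplified model, assumed
# for illustration: one message per round per participant, uniform cost):
# a binomial-tree broadcast finishes in ceil(log2(p)) rounds, while a naive
# root-sequential scatter/gather pushes p - 1 messages through the root.
import math

def tree_broadcast_steps(p):
    return math.ceil(math.log2(p)) if p > 1 else 0

def naive_scatter_steps(p):
    return p - 1

for p in (2, 4, 8, 16, 32):
    print(p, tree_broadcast_steps(p), naive_scatter_steps(p))
```

At 32 cores the tree broadcast needs only 5 rounds versus 31 sequential messages for the naïve scatter, which is one reason per-pattern calibration (including setup overhead) matters.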

Results: Parallel 2D Matrix Multiply

[Figure: prediction error (%) vs. number of processor cores (2 to 32) for coarse-grained and fine-grained decompositions, for matrix sizes 64x64 through 1024x1024; raw data available in Appendix]

Simulation setup: compute models for matrix multiply, loop overhead, & network parameters

Observations:
– Abstraction of compute details improves simulation accuracy, for all problem sizes, at a one-time cost of training effort
– Accuracy of simulations is a function of domain, no. of samples, & other kriging parameters
– Fewer cores means a larger share of the work is performed by each processor; for fine-grained decomposition, more error is incurred:
  • Computation dominates communication, resulting in high total error
  • Error in the dot-product model gets multiplied several times over

Page 30: Deep Dive, University of Florida (02/2015)

Performance Prediction

How do we simulate future or notional systems? With some confidence in the Behavioral Emulation approach, we can proceed to study next-generation devices
– Ability to evaluate what-if scenarios by changing BEO parameters

1. Tile-Gx72: existing Tilera many-core processor
   – Largest device made by Tilera: 72 cores
   – Cores in Tile-Gx72 are identical to cores in Tile-Gx36
   – To simulate Tile-Gx72, we scale the simulation to 72 ProcBEOs & CommBEOs
2. Knights Landing (KNL): anticipated Intel many-core processor
   – Rumored to have Xeon Phi-type cores with a mesh network
   – To simulate anticipated Knights Landing:
     • Calibrate Xeon Phi ProcBEOs based on existing Xeon Phi processor cores
     • Use validated CommBEOs developed for the iMesh network
   – 64-core device: similar in size to existing Xeon Phi
   – 100-core device: probable size; larger than existing devices

Interaction with Fast Forward 2 vendors will provide key information for carrying out useful simulations.

CCMT| 16

Results: Prediction for Tile-Gx72 & Intel KNL

[Chart: 2D matrix multiply on Tile-Gx36 vs. Tile-Gx72; execution time (ms, log scale 1 to 10000) and speedup (0.0 to 2.0) vs. matrix size; speedups: 0.88 (64x64), 1.03 (128x128), 1.25 (256x256), 1.55 (512x512), 1.87 (1024x1024)]

Larger matrix sizes utilize the Gx72 device better

[Chart: 2D matrix multiply on Tile-Gx36, KNL 64, and KNL 100; execution time (ms, log scale 1 to 100000) and speedup (0 to 10) vs. matrix size (128x128 to 2048x2048); series: Tile-Gx36, KNL 64, KNL 100, Speedup (Gx36 vs KNL64), Speedup (Gx36 vs KNL100)]

Communication overshadows computation on larger KNL100 device, resulting in no speedup over KNL64


A different application algorithm (2D block decomposition) may scale better on the KNL100

Simulation setup (Tile-Gx72):
– 72-core Tile device (Gx72)
– Twice as many cores, laid out in a 9-by-8 mesh

Simulation setup (KNL):
– BEOs modeled: CommBEOs for the Tilera iMesh, ProcBEOs for the Xeon Phi


CCMT| 17

Identified Issues (1)


• How do we systematically collect data?
– Benchmarking automation is needed
– For a wide range of app and system parameters, on targeted platforms
– Benchmark suite with a basic set of programming methods (e.g., OpenMP, OpenCL, & MPI)

• How do we automate the modeling process?
– Device-independent techniques

• How do we easily repeat experiments?
– Need automatic porting of the benchmark suite to new platforms and to upgrades of existing platforms

• How do we select practical techniques for interpolation on multi-dimensional data for a given computation?
– e.g., with matrix multiply: m, n, p, data type, memory affinity

• How do we determine the appropriate granularity for compute decomposition?
– Multiscale approach

Next talk in this session

CCMT| 18

• Code to skeleton apps
– Transforming applications from C/C++/Fortran source to high-level instructions on the AppBEO

• Simulation synchronization
– Global vs. distributed simulator clock

• Scalability of the software simulator
– Message-passing simulator
– Leverage and integrate SST Macro/Micro

• How do we model CommBEO congestion?
– Event timing with CommBEO ingress
– Exploring flit-level, packet-based, flow-based, & hybrid models


Identified Issues (2)

We will present our thoughts on these issues in the afternoon session


CCMT


Questions?

CCMT| 20

References

System (macro-scale) simulators:
– C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky, and J. Mayo, "A Simulator for Large-Scale Parallel Architectures", International Journal of Parallel and Distributed Systems, vol. 1, no. 2, pp. 57-73, 2010. [SST MACRO]
– E. Grobelny, D. Bueno, I. Troxel, A. D. George, and J. S. Vetter, "FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications", Simulation, vol. 83, no. 10, pp. 721-745, Oct. 2007. [FASE]
– G. Zheng, G. Kakulapati, and L. V. Kale, "BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines", 18th IPDPS, p. 78, 2004. [BIGSIM]
– A. D. George, R. B. Fogarty, J. S. Markwell, and M. D. Miars, "An Integrated Simulation Environment for Parallel and Distributed System Prototyping", Simulation, vol. 72, pp. 283-294, May 1999. [ISE]
– A. Symons and V. L. Narasimhan, "PARSIM: Message PAssing computeR SIMulator", IEEE First International Conference on Algorithms and Architectures for Parallel Processing (ICAPP), vol. 2, pp. 621-630, 1995. [PARSIM]



CCMT| 21

References

Device (micro-scale) & node (meso-scale) simulators:
– Z. Dong, J. Wang, G. Riley, and S. Yalamanchili, "An Efficient Front-End for Timing-Directed Parallel Simulation of Multi-Core System", 7th International ICST Conference on Simulation Tools and Techniques (SIMUTools 2014), March 2014. [MANIFOLD]
– J. Wang, J. Beu, S. Yalamanchili, and T. Conte, "Designing Configurable, Modifiable and Reusable Components for Simulation of Multicore Systems", 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, November 2012. [MANIFOLD]
– M. Hsieh, R. Riesen, K. Thompson, W. Song, and A. Rodrigues, "SST: A Scalable Parallel Framework for Architecture-Level Performance, Power, Area and Thermal Simulation", The Computer Journal, vol. 55, no. 2, pp. 181-191, 2012. [SST MICRO]
– M. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, and W. Song, "A Framework for Architecture-Level Power, Area, and Thermal Simulation and Its Application to Network-on-Chip Design Exploration", ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 63-68, 2011. [SST MICRO]

Object-oriented system modeling:
– J. C. Browne, E. Houstis, and J. R. Purdue, "POEMS: End-to-End Performance Models for Dynamic Parallel and Distributed Systems"


CCMT| 22

References

Hardware emulation:
– Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson, "A Case for FAME: FPGA Architecture Model Execution", ISCA '10, June 19-23, 2010, Saint-Malo, France, pp. 290-301.
– J. Wawrzynek, D. A. Patterson, S. Lu, and J. C. Hoe, "RAMP: A Research Accelerator for Multiple Processors", 2006.

Supercomputer-specific modeling & simulation:
– S. R. Alam, R. F. Barrett, M. R. Fahey, J. M. Larkin, and P. H. Worley, "Cray XT4: An Early Evaluation for Petascale Scientific Simulation", 2007.
– A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, "A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple", pp. 1-10, November 2006.

Analytical modeling:
– L. Carrington, A. Snavely, and N. Wolter, "A Performance Prediction Framework for Scientific Applications", Future Generation Computer Systems, vol. 22, no. 3, pp. 336-346.
– N. Jindal, V. Lotrich, E. Deumens, and B. A. Sanders, "SIPMaP: A Tool for Modeling Irregular Parallel Computations in the Super Instruction Architecture", IPDPS 2013.



CCMT

APPENDIX

CCMT| 24

AppBEO representation
• Need a representation of applications that the simulator can understand
– AppBEOs are lists of instructions processed by ProcBEOs
– A small and simple description allows easy development
  • Developer does not need to worry about creating working application code
– The intermediate format can be compiled into a format specific to the simulation platform

AppBEO (high-level description)

// Define group as nodes 0-3
VAR commGrp=0:3
// Broadcast matrix A (dataSize=64*64/2) to group
Bcast(int32,2048,0,commGrp)
// Barrier sync
Barrier(commGrp)
// Scatter 1/4 of matrix B (dataSize=(64*64)/(4*2)) to each node
Scatter(int32,512,0,commGrp)
// Perform dot product of vector size 64 of int32
DotProduct(int32,64)
// Gather solutions from matrices (dataSize=(64*64)/(4*2))
Gather(int32,512,commGrp)
Done

Intermediate format (AppBEO for node0)
send 1 1 129971 1
recv 4
send 2 2 129971 1
recv 8
send 13 1 381 1
recv 12
send 16 1 32420 1
recv 17
send 18 2 32420 1
recv 19
send 20 3 32420 1
recv 21
advt 5753856

Human Readable Intermediate Format (debug mode)

// Bcast(int32,2048,0,commGrp)
send 1 1 129971 1     Send broadcast to node 1
recv 4                Receive acknowledgement for broadcast from node 1
send 2 2 129971 1     Send broadcast to node 2
recv 8                Receive acknowledgement for broadcast from node 2
// Barrier(commGrp)
send 13 1 381 1       Send barrier to node 1
recv 12               Received barrier from node 0
// Scatter(int32,512,0,commGrp)
send 16 1 32420 1     Scatter from master to node 1
recv 17               Receive acknowledgement for scatter from 1
send 18 2 32420 1     Scatter from master to node 2
recv 19               Receive acknowledgement for scatter from 2
send 20 3 32420 1     Scatter from master to node 3
recv 21               Receive acknowledgement for scatter from 3
// DotProduct(int32,64)
advt 5753856          Advance timer for compute time in dot product
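As a sketch of how a high-level AppBEO collective might be lowered to a flat intermediate instruction list like the one above, the toy translator below expands a broadcast into per-destination send/ack pairs. The tuple fields (event id, destination, cost) and the `lower_bcast` helper are hypothetical illustrations; the real encoding is simulator-specific.

```python
def lower_bcast(group, root, cost):
    """Expand a Bcast over `group` into (send, recv) instruction pairs:
    the root sends to every other member and waits for an acknowledgement.
    Field meanings are assumed for illustration only."""
    instrs = []
    event = 1
    for dest in group:
        if dest == root:
            continue
        instrs.append(("send", event, dest, cost))  # send payload to dest
        instrs.append(("recv", event + 1))          # wait for dest's ack
        event += 2
    return instrs
```

For a four-node group this yields three send/recv pairs, mirroring the shape (though not the exact event numbering) of the node0 listing above.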


CCMT| 25

Compute Microbenchmarks

Sobel Filtering (Tile-Gx36)
image size    testbed (us)    simulation (us)    % error
320x240         142161.26        144614.40          1.73
480x320         286130.51        289228.80          1.08
640x480         574425.74        578457.60          0.70
800x600         899691.47        903840.00          0.46
1024x768       1483695.36       1480851.46         -0.19
1280x1024      2518175.66       2468085.76         -1.99
1600x1200      3618511.74       3615360.00         -0.09

Dot Product
vector size    testbed (us)    simulation (us)    % error
10        0.59      3.30     458.65
20        1.14      3.28     187.29
30        1.68      3.25      93.67
40        2.22      3.23      45.56
50        2.76      3.21      16.26
60        3.30      3.30      -0.05
70        3.84      3.84      -0.03
80        4.38      4.38      -0.01
90        4.92      4.86      -1.27
100       5.46      4.83     -11.46
200      10.86     10.87       0.05
300      16.27     16.27      -0.03
400      21.67     21.67      -0.01
500      27.07     27.07       0.01
600      32.48     32.47      -0.03
700      37.88     37.89       0.02
800      43.28     43.28       0.00
900      48.69     48.69       0.00
1000     54.09     54.08      -0.02

Matrix Multiply

matrix size        testbed    sim (fine-grain)    error (fine-grain)    sim (coarse-grain)    error (coarse-grain)
4x4                   6.56          69.312              956.59                11.365                73.25
8x8                  41.79         280.96               572.31                65.041                55.64
16x16               312.42        1159.424              271.11               383.281                22.68
32x32              2446.10        4885.504               99.73              2879.644                17.72
64x64             19255.99       23097.344               19.95             19374.552                 0.62
128x128          172640.13      167739.392               -2.84            153122.202               -11.31
256x256         1379730.55     1275658.24                -7.54           1292484.776                -6.32
512x512        10971014.13     9933422.592               -9.46          10121067.52                 -7.75

* Execution times are reported in microseconds
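The %error columns in these tables are consistent with the signed relative error of the simulation against the testbed; a minimal helper, assuming exactly that definition:

```python
def percent_error(testbed, simulation):
    """Signed prediction error of the simulation relative to the
    testbed measurement, in percent (negative = under-prediction)."""
    return (simulation - testbed) / testbed * 100.0
```

For example, the 320x240 Sobel row (testbed 142161.26 us, simulation 144614.40 us) gives roughly +1.73%, matching the table.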

CCMT| 26

Communication Microbenchmarks

testbed (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
2                    2.08        4.00        5.75        8.96       16.35
4                    2.18        3.71        5.99        9.12       15.63
8                    2.23        3.84        6.11        9.37       15.80
16                   2.60        4.30        6.82       10.08       16.59
32                   2.97        5.22        8.37       11.53       17.98
64                   3.96        7.18       11.18       15.67       26.45
128                  5.90       11.13       18.48       28.59       45.13
256                 10.49       19.75       38.63       57.60       87.11
512                 18.07       36.38       71.45      105.59      193.42
1024                35.08       68.94      136.00      204.33      306.40
2048                67.67      134.28      265.65      400.31      598.27
4096               132.59      265.22      526.19      789.34     1187.09
8192               264.97      524.96     1050.17     1574.81     2366.27
16384              526.41     1044.68     2101.43     3149.43     4735.29
32768             1040.36     2091.25     4184.37     6292.76     9479.93

simulation (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
2                   0.763       1.526       3.053       4.578       6.867
4                   0.883       1.766       3.533       5.298       7.947
8                   0.945       1.890       3.781       5.670       8.505
16                  1.181       2.362       4.725       7.086      10.629
32                  1.643       3.286       6.573       9.858      14.787
64                  2.613       5.226      10.453      15.678      23.517
128                 4.547       9.094      18.189      27.282      40.923
256                 8.832      17.664      35.329      52.992      79.488
512                16.800      33.600      67.201     100.800     151.200
1024               32.802      65.604     131.209     196.812     295.218
2048               65.455     130.910     261.821     392.730     589.095
4096              130.353     260.706     521.413     782.118    1173.177
8192              259.719     519.438    1038.877    1558.314    2337.471
16384            5.18E+02    1036.382    2072.765    3109.146    4663.719
32768            1036.670    2073.340    4146.681    6220.020    9330.030

Tile-Gx36

testbed (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
64                   6.15       13.41       36.51       52.34      114.12
128                  8.73       20.62       47.45       92.92      174.96
256                 12.90       32.51       72.91      148.05      306.32
512                 20.96       57.86      130.39      275.00      561.61
1024                38.01      107.46      245.89      525.27     1075.48
2048                72.20      208.07      480.80     1027.08     2200.94
4096               139.06      410.39      949.67     2029.95     4175.83
8192               273.31      810.71     1886.74     4014.38     8283.43
16384              544.03     1627.38     3772.34     8039.68    16573.40
32768             1086.51     3229.03     7510.44    16008.20    33024.70

simulation (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
64                  2.614       7.844      18.312      39.276      81.264
128                 4.548      13.646      31.850      68.286     141.218
256                 8.833      26.501      61.845     132.561     274.053
512                16.801      50.405     117.621     252.081     521.061
1024               32.803      98.411     229.635     492.111    1017.123
2048               65.456     196.370     458.206     981.906    2029.366
4096              130.354     391.064     912.492    1955.376    4041.204
8192              259.720     779.162    1818.054    3895.866    8051.550
16384             518.192    1554.578    3627.358    7772.946   16064.180
32768            1036.671    3110.015    7256.711   15550.130   32137.030

testbed (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
2                    2.25        4.31        8.18       16.37       34.22
4                    2.35        4.55        9.04       18.12       37.93
8                    2.44        4.80        9.47       19.09       38.42
16                   2.67        5.52       11.30       22.75       47.26
32                   3.09        7.02       14.76       30.07       62.53
64                   4.16        9.91       21.48       44.79       92.63
128                  5.98       15.72       35.16       73.96      153.31
256                 10.58       28.98       66.20      140.57      290.44
512                 18.68       53.50      124.32      264.59      545.31
1024                34.75      103.78      239.20      511.73     1058.20
2048                67.97      201.70      471.07     1008.84     2084.43
4096               133.39      400.91      934.42     2001.22     4138.49
8192               266.98      798.12     1858.98     3989.01     8248.35
16384              537.05     1593.76     3719.23     7972.09    16501.70
32768             1064.45     3181.46     7435.59    15934.10    32989.50

simulation (us)
transfer size     2 cores     4 cores     8 cores    16 cores    32 cores
2                   1.145       3.437       8.029      17.241      35.725
4                   1.265       3.797       8.869      19.041      39.445
8                   1.327       3.983       9.303      19.971      41.367
16                  1.563       4.691      10.955      23.511      48.683
32                  2.025       6.077      14.189      30.441      63.005
64                  2.995       8.987      20.979      44.991      93.075
128                 4.929      14.789      34.517      74.001     153.029
256                 9.214      27.644      64.512     138.276     285.864
512                17.182      51.548     120.288     257.796     532.872
1024               33.184      99.554     232.302     497.826    1028.934
2048               65.837     197.513     460.873     987.621    2041.177
4096              130.735     392.207     915.159    1961.091    4053.015
8192              260.101     780.305    1820.721    3901.581    8063.361
16384             518.573    1555.721    3630.025    7778.661   16075.990
32768            1037.052    3111.158    7259.378   15555.850   32148.840

Table groups: Gather, Broadcast, Scatter

All times are reported in microseconds

Page 36: Deep Dive, University of Florida (02/2015)

CCMT| 27

Parallel 2D Matrix Multiply (fine-grained): Tile-Gx36

Sim (ms)
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         0.1304 0.0655 11.5241 0.0658 11.7888 | 0.2607 0.0984 5.7620 0.0996 6.2268 | 0.2607 0.1176 2.8810 0.1203 3.6510
128x128       0.5182 0.2597 83.8615 0.2601 84.9026 | 1.0364 0.3911 41.9308 0.3922 43.7565 | 1.0364 0.4582 20.9654 0.4609 23.9679
256x256       2.0732 1.0367 637.7964 1.0371 641.9463 | 4.1464 1.5546 318.8982 1.5557 326.1610 | 4.1464 1.8181 159.4491 1.8207 171.3914
512x512       8.2983 4.1432 4966.5802 4.1436 4983.1683 | 16.5965 6.2196 2483.2901 6.2208 2512.3332 | 16.5965 7.2567 1241.6451 7.2594 1289.3649
1024x1024     33.8551 16.6480 39250.2968 16.6484 39317.4514 | 67.7103 24.8948 19625.1484 24.8959 19742.6556 | 67.7103 29.0023 9812.5742 29.0049 10006.0127

Testbed (ms)
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         0.1343 0.0661 9.7008 0.0676 10.0319 | 0.2671 0.1013 4.8422 0.1026 5.3593 | 0.2658 0.1217 2.4253 0.1233 3.2394
128x128       0.5338 0.2612 76.2128 0.2679 77.6801 | 1.0638 0.3893 38.0983 0.4019 40.1810 | 1.0642 0.4606 19.1066 0.4737 22.3124
256x256       2.1424 1.0479 606.9473 1.0712 614.4742 | 4.2790 1.5803 303.4670 1.6047 312.7736 | 4.2792 1.8725 151.7895 1.8683 165.0370
512x512       8.7386 4.4178 4846.7818 4.4391 4890.1128 | 17.4151 6.5211 2422.9984 6.4870 2467.7218 | 17.3390 7.6500 1211.9521 7.4919 1269.4627
1024x1024     35.2284 17.6642 38738.2033 17.6520 39021.2287 | 71.3471 26.7471 19369.4312 26.4181 19615.3064 | 71.7815 31.7918 9688.5531 30.6422 9949.3548

% error
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         -2.91 -0.94 18.79 -2.61 17.51 | -2.41 -2.82 19.00 -2.98 16.19 | -1.92 -3.35 18.79 -2.47 12.71
128x128       -2.93 -0.58 10.04 -2.92 9.30 | -2.58 0.45 10.06 -2.41 8.90 | -2.61 -0.52 9.73 -2.70 7.42
256x256       -3.23 -1.07 5.08 -3.19 4.47 | -3.10 -1.63 5.08 -3.05 4.28 | -3.10 -2.91 5.05 -2.55 3.85
512x512       -5.04 -6.22 2.47 -6.66 1.90 | -4.70 -4.62 2.49 -4.10 1.81 | -4.28 -5.14 2.45 -3.10 1.57
1024x1024     -3.90 -5.75 1.32 -5.69 0.76 | -5.10 -6.93 1.32 -5.76 0.65 | -5.67 -8.77 1.28 -5.34 0.57

CCMT| 28

Parallel 2D Matrix Multiply (fine-grained): Tile-Gx36

Sim (ms)
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         0.2607 0.1326 1.4405 0.1383 2.5149 | 0.2607 0.1412 0.7203 0.1530 2.2305
128x128       1.0364 0.4921 10.4827 0.4978 14.6032 | 1.0364 0.5211 5.2413 0.5329 11.0018
256x256       4.1464 1.9554 79.7245 1.9611 96.1017 | 4.1464 2.0294 39.8623 2.0412 62.6345
512x512       16.5965 7.7729 620.8225 7.7787 686.1851 | 16.5965 8.0516 310.4113 8.0634 401.2533
1024x1024     67.7103 31.0982 4906.2871 31.1039 5171.6415 | 67.7103 32.1370 2453.1436 32.1488 2822.1686

Testbed (ms)
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         0.2647 0.1378 1.2140 0.1412 2.3351 | 0.2636 0.1476 0.6238 0.1558 2.2542
128x128       1.0654 0.5024 9.5857 0.5109 13.8897 | 1.0551 0.5337 4.8142 0.5525 10.8169
256x256       4.2764 2.0088 76.0019 2.0239 93.4666 | 4.2866 2.1775 38.0804 2.1386 62.2750
512x512       17.2959 8.2301 606.6027 7.9981 678.9970 | 17.2911 8.7498 303.6608 8.3159 402.1725
1024x1024     72.6566 35.4219 4848.8904 32.6373 5164.7570 | 71.9278 37.0273 2427.4457 33.5693 2832.1929

% error
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         -1.52 -3.83 18.65 -2.08 7.70 | -1.10 -4.30 15.47 -1.75 -1.05
128x128       -2.72 -2.05 9.36 -2.55 5.14 | -1.78 -2.37 8.87 -3.55 1.71
256x256       -3.04 -2.66 4.90 -3.10 2.82 | -3.27 -6.80 4.68 -4.55 0.58
512x512       -4.04 -5.55 2.34 -2.74 1.06 | -4.02 -7.98 2.22 -3.04 -0.23
1024x1024     -6.81 -12.21 1.18 -4.70 0.13 | -5.86 -13.21 1.06 -4.23 -0.35


CCMT| 29

Parallel 2D Matrix Multiply (coarse-grained): Tile-Gx36

Sim (ms)
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         0.1304 0.0655 9.7517 0.0658 10.0164 | 0.2607 0.0984 4.7197 0.0996 5.1845 | 0.2607 0.1176 2.2172 0.1203 2.9872
128x128       0.5182 0.2597 76.2505 0.2601 77.2916 | 1.0364 0.3911 40.2694 0.3922 42.0952 | 1.0364 0.4582 18.8820 0.4609 21.8845
256x256       2.0732 1.0367 652.5564 1.0371 656.7064 | 4.1464 1.5546 318.1159 1.5557 325.3787 | 4.1464 1.8181 167.3377 1.8207 179.2800
512x512       8.2983 4.1432 5043.3792 4.1436 5059.9673 | 16.5965 6.2196 2508.1097 6.2208 2537.1528 | 16.5965 7.2567 1271.9594 7.2594 1319.6792
1024x1024     33.8551 16.6480 8935.1544 16.6484 9002.3090 | 67.7103 24.8948 5254.7738 24.8959 5372.2809 | 67.7103 29.0023 2503.3172 29.0049 2696.7557

Testbed (ms)
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         0.1343 0.0661 9.7008 0.0676 10.0319 | 0.2671 0.1013 4.8422 0.1026 5.3593 | 0.2658 0.1217 2.4253 0.1233 3.2394
128x128       0.5338 0.2612 76.2128 0.2679 77.6801 | 1.0638 0.3893 38.0983 0.4019 40.1810 | 1.0642 0.4606 19.1066 0.4737 22.3124
256x256       2.1424 1.0479 606.9473 1.0712 614.4742 | 4.2790 1.5803 303.4670 1.6047 312.7736 | 4.2792 1.8725 151.7895 1.8683 165.0370
512x512       8.7386 4.4178 4846.7818 4.4391 4890.1128 | 17.4151 6.5211 2422.9984 6.4870 2467.7218 | 17.3390 7.6500 1211.9521 7.4919 1269.4627
1024x1024     35.2284 17.6642 38738.2033 17.6520 39021.2287 | 71.3471 26.7471 19369.4312 26.4181 19615.3064 | 71.7815 31.7918 9688.5531 30.6422 9949.3548

% error
matrix size   Bcast Scatter Compute Gather Total (2 threads) | (4 threads) | (8 threads)
64x64         -2.91 -0.94 0.52 -2.61 -0.15 | -2.41 -2.82 -2.53 -2.98 -3.26 | -1.92 -3.35 -8.58 -2.47 -7.78
128x128       -2.93 -0.58 0.05 -2.92 -0.50 | -2.58 0.45 5.70 -2.41 4.76 | -2.61 -0.52 -1.18 -2.70 -1.92
256x256       -3.23 -1.07 7.51 -3.19 6.87 | -3.10 -1.63 4.83 -3.05 4.03 | -3.10 -2.91 10.24 -2.55 8.63
512x512       -5.04 -6.22 4.06 -6.66 3.47 | -4.70 -4.62 3.51 -4.10 2.81 | -4.28 -5.14 4.95 -3.10 3.96
1024x1024     -3.90 -5.75 -76.93 -5.69 -76.93 | -5.10 -6.93 -72.87 -5.76 -72.61 | -5.67 -8.77 -74.16 -5.34 -72.90

CCMT| 30

Parallel 2D Matrix Multiply (coarse-grained): Tile-Gx36

Sim (ms)
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         0.2607 0.1326 1.1213 0.1383 2.1956 | 0.2607 0.1412 0.6697 0.1530 2.1799
128x128       1.0364 0.4921 9.2816 0.4978 13.4021 | 1.0364 0.5211 5.4840 0.5329 11.2444
256x256       4.1464 1.9554 78.8943 1.9611 95.2714 | 4.1464 2.0294 41.5034 2.0412 64.2756
512x512       16.5965 7.7729 652.3573 7.7787 717.7198 | 16.5965 8.0516 328.8056 8.0634 419.6477
1024x1024     67.7103 31.0982 1611.9654 31.1039 1877.3199 | 67.7103 32.1370 1611.8781 32.1488 1980.9031

Testbed (ms)
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         0.2647 0.1378 1.2140 0.1412 2.3351 | 0.2636 0.1476 0.6238 0.1558 2.2542
128x128       1.0654 0.5024 9.5857 0.5109 13.8897 | 1.0551 0.5337 4.8142 0.5525 10.8169
256x256       4.2764 2.0088 76.0019 2.0239 93.4666 | 4.2866 2.1775 38.0804 2.1386 62.2750
512x512       17.2959 8.2301 606.6027 7.9981 678.9970 | 17.2911 8.7498 303.6608 8.3159 402.1725
1024x1024     72.6566 35.4219 4848.8904 32.6373 5164.7570 | 71.9278 37.0273 2427.4457 33.5693 2832.1929

% error
matrix size   Bcast Scatter Compute Gather Total (16 threads) | (32 threads)
64x64         -1.52 -3.83 -7.64 -2.08 -5.97 | -1.10 -4.30 7.37 -1.75 -3.29
128x128       -2.72 -2.05 -3.17 -2.55 -3.51 | -1.78 -2.37 13.91 -3.55 3.95
256x256       -3.04 -2.66 3.81 -3.10 1.93 | -3.27 -6.80 8.99 -4.55 3.21
512x512       -4.04 -5.55 7.54 -2.74 5.70 | -4.02 -7.98 8.28 -3.04 4.35
1024x1024     -6.81 -12.21 -66.76 -4.70 -63.65 | -5.86 -13.21 -33.60 -4.23 -30.06


CCMT| 31

Parallel Sobel Filtering

Testbed (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (2 cores) | (4 cores) | (8 cores)
320x240      0.308 35.666 36.179 0.630 72.464 | 0.481 17.848 18.210 0.942 37.351 | 0.576 8.923 9.131 1.097 19.737
480x320      0.620 71.616 72.651 1.265 145.456 | 0.956 35.740 36.495 1.877 74.774 | 1.137 17.857 18.265 2.174 39.408
640x480      1.245 142.968 146.393 2.542 292.448 | 1.908 71.638 73.206 3.773 149.754 | 2.262 35.795 36.614 4.359 79.069
800x600      1.950 223.252 229.986 3.966 458.540 | 2.977 112.061 114.598 5.951 234.704 | 3.509 56.025 57.241 6.828 123.262
1024x768     3.227 365.944 377.464 6.498 752.560 | 4.864 183.752 189.415 9.681 385.713 | 5.728 91.946 93.859 11.290 202.016
1280x1024    5.419 611.279 661.535 10.860 1286.660 | 8.098 306.565 335.381 16.132 660.897 | 9.519 153.350 159.848 18.831 338.492

Simulation (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (2 cores) | (4 cores) | (8 cores)
320x240      0.306 35.750 36.557 0.604 73.223 | 0.463 17.875 18.278 0.902 37.531 | 0.550 8.938 9.139 1.044 19.695
480x320      0.610 71.501 73.114 1.210 146.440 | 0.920 35.750 36.557 1.808 75.047 | 1.086 17.875 18.278 2.096 39.360
640x480      1.219 143.002 146.227 2.422 292.875 | 1.833 71.501 73.114 3.623 150.083 | 2.156 35.750 36.557 4.201 78.688
800x600      1.903 223.440 228.480 3.784 457.613 | 2.861 111.720 114.240 5.667 234.501 | 3.355 55.860 57.120 6.580 122.939
1024x768     3.114 366.084 374.342 6.209 749.755 | 4.683 183.042 187.171 9.289 384.197 | 5.485 91.521 93.585 10.802 201.418
1280x1024    5.190 610.140 623.903 10.370 1249.609 | 7.796 305.070 311.951 15.498 640.328 | 9.127 152.535 155.976 18.024 335.687

Error %
Image size   Scatter Compute_Gx Compute_Gy Gather Total (2 cores) | (4 cores) | (8 cores)
320x240      -0.58 0.24 1.04 -4.11 1.05 | -3.69 0.15 0.38 -4.18 0.48 | -4.63 0.16 0.09 -4.79 -0.21
480x320      -1.67 -0.16 0.64 -4.31 0.68 | -3.78 0.03 0.17 -3.69 0.37 | -4.46 0.10 0.08 -3.58 -0.12
640x480      -2.13 0.02 -0.11 -4.72 0.15 | -3.94 -0.19 -0.13 -3.97 0.22 | -4.69 -0.12 -0.16 -3.62 -0.48
800x600      -2.43 0.08 -0.65 -4.57 -0.20 | -3.88 -0.30 -0.31 -4.77 -0.09 | -4.39 -0.30 -0.21 -3.64 -0.26
1024x768     -3.50 0.04 -0.83 -4.44 -0.37 | -3.72 -0.39 -1.18 -4.05 -0.39 | -4.25 -0.46 -0.29 -4.32 -0.30
1280x1024    -4.23 -0.19 -5.69 -4.52 -2.88 | -3.72 -0.49 -6.99 -3.93 -3.11 | -4.11 -0.53 -2.42 -4.28 -0.83

Tile-Gx36

CCMT| 32

Parallel Sobel Filtering: Tile-Gx36

Testbed (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (16 cores) | (32 cores)
320x240      0.654 4.460 4.577 1.186 10.946 | 0.748 2.233 2.288 1.257 6.625
480x320      1.262 8.954 9.147 2.356 21.731 | 1.398 4.470 4.586 2.450 13.028
640x480      2.483 17.889 18.340 4.680 43.487 | 2.698 8.971 9.218 4.850 25.876
800x600      3.843 28.010 28.677 7.312 68.064 | 4.141 14.013 14.386 7.610 40.337
1024x768     6.249 45.989 46.921 12.012 111.072 | 6.648 23.021 23.493 12.437 65.822
1280x1024    10.356 76.749 79.732 20.314 185.698 | 10.961 38.438 39.749 21.044 110.190

Simulation (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (16 cores) | (32 cores)
320x240      0.605 4.469 4.570 1.098 10.790 | 0.664 2.234 2.285 1.084 6.366
480x320      1.187 8.938 9.139 2.219 21.532 | 1.270 4.469 4.570 2.230 12.636
640x480      2.346 17.875 18.278 4.454 43.003 | 2.493 8.938 9.139 4.507 25.175
800x600      3.639 27.930 28.560 6.982 67.160 | 3.849 13.965 14.280 7.092 39.284
1024x768     5.925 45.761 46.793 11.485 110.012 | 6.234 22.880 23.396 11.694 64.303
1280x1024    9.839 76.268 77.988 19.200 183.343 | 10.305 38.134 38.994 19.592 107.123

Error %
Image size   Scatter Compute_Gx Compute_Gy Gather Total (16 cores) | (32 cores)
320x240      -7.49 0.20 -0.16 -7.46 -1.42 | -11.14 0.07 -0.13 -13.77 -3.91
480x320      -5.93 -0.19 -0.08 -5.81 -0.92 | -9.11 -0.03 -0.35 -9.00 -3.01
640x480      -5.53 -0.08 -0.33 -4.83 -1.11 | -7.61 -0.37 -0.86 -7.07 -2.71
800x600      -5.31 -0.29 -0.41 -4.52 -1.33 | -7.06 -0.34 -0.74 -6.81 -2.61
1024x768     -5.18 -0.50 -0.27 -4.39 -0.95 | -6.22 -0.61 -0.41 -5.97 -2.31
1280x1024    -4.99 -0.63 -2.19 -5.49 -1.27 | -5.98 -0.79 -1.90 -6.90 -2.78


CCMT| 33

Parallel 2D Matrix Multiply (fine-grained)

time (ms)
matrix size   Bcast Scatter Compute Gather Total (Tile-Gx36) | (Tile-Gx72)

64x64 0.26 0.16 0.64 0.17 2.32 0.26 0.19 0.20 0.21 2.64

128x128 1.04 0.59 4.66 0.60 11.08 1.04 0.63 1.59 0.66 10.73

256x256 4.15 2.29 35.43 2.30 60.81 4.15 2.33 12.64 2.36 48.52

512x512 16.60 9.09 275.92 9.10 377.14 16.60 9.26 100.93 9.28 244.03

1024x1024 67.71 36.28 2180.57 36.30 2591.75 67.71 36.79 806.49 36.82 1388.01

Tile-Gx72

CCMT| 34

Parallel Sobel Filtering

Image size    Tile-Gx36 (ms)   Tile-Gx72 (ms)   Speedup
320x240            6.13             4.67          1.31
480x320           12.32             8.49          1.45
640x480           23.80            16.20          1.47
800x600           37.57            25.81          1.46
1024x768          60.67            41.31          1.47
1280x1024        100.37            66.88          1.50
1600x1200        146.77            98.55          1.49

Tile-Gx72

Time (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (Tile-Gx36) | (Tile-Gx72)

320x240 0.74 2.01 2.06 1.21 6.13 0.92 1.12 1.14 1.27 4.67

480x320 1.38 4.17 4.26 2.40 12.32 1.60 2.09 2.13 2.46 8.49

640x480 2.66 8.04 8.23 4.77 23.80 2.97 4.02 4.11 4.87 16.20

800x600 4.05 12.85 13.14 7.42 37.57 4.46 6.70 6.85 7.57 25.81

1024x768 6.50 20.74 21.20 12.12 60.67 7.05 10.73 10.97 12.35 41.31

1280x1024 10.67 34.32 35.09 20.18 100.37 11.43 17.16 17.55 20.52 66.88

1600x1200 15.43 50.27 51.41 29.55 146.77 16.37 25.70 26.28 29.99 98.55


CCMT| 35

Parallel 2D Matrix Multiply (fine-grained)

matrix size   Tile-Gx36 (ms)   Gx72 (ms)   KNL 64 (ms)   KNL 100 (ms)   Speedup (Gx36 vs KNL64)   Speedup (Gx36 vs KNL100)
128x128            11.08          10.73          8.54          10.50             1.30                      1.05
256x256            60.81          48.52         33.60          41.20             1.81                      1.48
512x512           377.14         244.03        135.00         165.00             2.79                      2.29
1024x1024        2591.75        1388.01        555.00         673.00             4.67                      3.85
2048x2048       18000.00             --       2270.00        2721.73             7.93                      6.61

KNL 64 KNL 100

Time (ms)
matrix size   Bcast Scatter Compute Gather Total (KNL 64) | (KNL 100)
128x128       1.04 0.56 0.07 0.58 8.54 | 1.04 0.47 0.05 0.51 10.50
256x256       4.15 2.07 0.38 2.09 33.60 | 4.15 1.74 0.24 1.77 41.20
512x512       16.60 8.21 2.28 8.24 135.00 | 16.60 6.77 1.46 6.81 165.00
1024x1024     67.70 32.60 15.10 32.70 555.00 | 67.70 26.90 9.69 26.90 673.00

CCMT| 36

Parallel Sobel Filtering: KNL 64, KNL 100

time (ms)
Image size   Scatter Compute_Gx Compute_Gy Gather Total (KNL64) | (KNL100)
320x240      0.76 0.12 0.12 1.02 2.22 | 1.05 0.10 0.10 1.31 2.86
480x320      1.55 0.26 0.26 2.44 4.72 | 1.77 0.16 0.16 2.50 4.90
640x480      2.66 0.49 0.49 4.37 8.22 | 3.20 0.35 0.35 4.90 9.11
800x600      4.38 0.80 0.80 7.56 13.70 | 4.28 0.49 0.49 6.68 12.30
1024x768     6.55 1.27 1.27 11.50 20.80 | 7.43 0.87 0.87 12.43 21.90
1280x1024    10.80 2.11 2.11 19.40 34.60 | 11.95 1.37 1.37 20.61 35.60
1600x1200    15.60 3.09 3.09 28.70 50.70 | 16.00 1.98 1.98 28.30 48.60
1920x1080    16.70 3.34 3.34 31.20 54.80 | 17.97 2.22 2.22 32.55 55.30


Image size   Tile-Gx36 (ms)   KNL 64 (ms)   KNL 100 (ms)   Speedup (Gx36 vs KNL64)   Speedup (Gx36 vs KNL100)
320x240            6.13            2.22           2.86              2.76                      2.15
480x320           12.32            4.72           4.90              2.61                      2.51
640x480           23.80            8.22           9.11              2.90                      2.61
800x600           37.57           13.70          12.30              2.74                      3.05
1024x768          60.67           20.80          21.90              2.92                      2.77
1280x1024        100.37           34.60          35.60              2.90                      2.82
1600x1200        146.77           50.70          48.60              2.89                      3.02
1920x1080        158.00           54.80          55.30              2.88                      2.86


CCMT| 1

Performance Modeling

• Calibration data are used to develop interpolation models that predict execution time (performance models)
– Data have varying dimension (one-dimensional: dot product; multi-dimensional: matrix multiply)
• We are using kriging for multi-dimensional interpolation
– More about performance models in the next talk of this session

[Diagram: calibration loop. Training/calibration data (from an experimental testbed, a cycle-accurate device simulator, Fast Forward 2 vendors, etc.) are used to train an interpolation model execution_time = f(...); the model produces predicted execution times (estimates for test inputs); if an estimate exceeds the error threshold, the model is retrained with more data]

How do we generate timestamps for internal events?

What about send and receive events?

CCMT| 2

Performance Modeling

Motivation:
• Behavioral emulation requires that simulators do not perform cycle-accurate (or otherwise complex) operations
• However, simulators must still know the time required for an operation (e.g., matrix multiply) based on input sizes (e.g., [M, N, P] = [256, 100, 42])

Goals:
• Estimate values of specific non-integral numerical parameters (e.g., computation time) before or during simulation, without access to the real generators (e.g., the target CPU) of those parameters
– Methods that produce these parameters are surrogates for the real generators
– Any access to the target platforms is assumed to be strictly in advance of the simulation
• Determine the efficacy of possible models in multi-dimensional domains
• Perform uncertainty analysis to determine the degree to which estimation introduces error into the simulator


CCMT| 3

Performance Modeling

Approach:
• On the target platform or simulator, take a number of representative samples (of the parameter of interest) within the expected domain
• Using these samples, interpolate any other needed values just prior to, or during, the simulation

Kriging:
• A product of the geostatistics community, used for many-dimensional sparse interpolation
• Other (likely less accurate) options:
– Radial basis functions
– Nearest-neighbor
– Convex-hull linear interpolation

Universal Kriging:
• Inputs: variogram (spatial relationship of the data), polynomial degree, samples
• Internal: computes weights for each of the samples
• Outputs: interpolated values, estimate of variance at interpolation points
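For contrast with kriging, one of the simpler options listed above can be sketched in a few lines: a Gaussian kernel-weighted average over the samples (a normalized-RBF smoother, not exact RBF interpolation or kriging). The `kernel_interpolate` helper and its `length_scale` smoothing parameter are illustrative assumptions:

```python
import math

def kernel_interpolate(samples, x, length_scale=32.0):
    """Gaussian kernel-weighted average of measured execution times.
    samples: list of (input_size, measured_time) pairs.
    A crude surrogate sketch, far simpler than universal kriging,
    which would also return a variance estimate at each point."""
    weights = [math.exp(-((x - xi) / length_scale) ** 2) for xi, _ in samples]
    total = sum(weights)
    return sum(w * t for w, (_, t) in zip(weights, samples)) / total
```

Unlike kriging, this smoother has no variogram and produces no variance estimate, which is exactly the information the uncertainty analysis above needs.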

CCMT| 4

Performance Modeling
Aside: What do our data look like? (1 of 3: Easy)

• Matrix multiplication is relatively easy, but still non-trivial, to interpolate
• Banding near numbers that are additive combinations of powers of two
• Bands can have times 2-10x longer than their neighbors
• Likely due to cache particularities in the system

• Example details:
– Platform: x86 Ivy Bridge quad-core
– Single core used
– Triple-loop textbook method
– C, GCC, -O2


CCMT| 5

Performance Modeling
Aside: What do our data look like? (2 of 3: Hard)

• FFTs (FFTW) are the most difficult benchmarks in the set
• Computation time is strongly related to how composite the input size is
• Adjacent samples can jump by more than an order of magnitude
• Interpolating to obtain an average error fraction of just below 1 is difficult

• Example details:
– Platform: x86 Ivy Bridge quad-core
– Single core used
– FFTW
– C, GCC, -O2

CCMT| 6

Performance Modeling
Aside: What do our data look like? (3 of 3: Other)

• CUDA BLAS DGEMM is different
– Divided into blocks
– Not symmetrical about the diagonal
– Partitions are differently sized

• Example details:
– Platform: Quadro K600 GPU

CCMT| 7

Performance Modeling
Approach: Interpolation Process (Part 1 of 4)

Step One: obtain ordered, but slightly randomized, samples within the domain
• Randomization prevents aliasing along areas of unusual time
• Determining the number of samples to use is one of the research goals
• Even sampling may not be ideal, but to do otherwise requires unavailable a priori knowledge
• Example details:
  – These samples cover 0.17% of the domain, a relatively small amount
  – Notably, they mostly missed the bands of high computation time
    • This will matter during evaluation time
    • It may, or may not, be a good thing
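Step One's "ordered but slightly randomized" sampling can be sketched as follows (a hypothetical helper; the 0.4 jitter fraction is an illustrative choice that keeps the samples ordered while breaking grid alignment with periodic features such as the power-of-two bands):

```python
import numpy as np

def jittered_samples(lo, hi, n, jitter=0.4, seed=0):
    """Ordered but slightly randomized sample points in [lo, hi].

    Starts from an even grid and perturbs each point by at most
    `jitter` of the grid spacing; since jitter < 0.5 the samples
    remain strictly increasing.
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(lo, hi, n)
    step = (hi - lo) / (n - 1)
    noise = rng.uniform(-jitter, jitter, size=n) * step
    noise[0] = noise[-1] = 0.0          # keep the domain endpoints
    return grid + noise
```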

CCMT| 8

Performance Modeling
Approach: Interpolation Process (Part 2 of 4)

Step Two: using those samples, construct an interpolation model
• After entering these parameters, we have a model which can be substituted for the real data in the simulator (the surrogate)
• It will not be perfect, as it is subject to both the samples and the parameters
• For example: this model loses most of the banding of the original data

CCMT| 9

Performance Modeling
Approach: Interpolation Process (Part 3 of 4)

Step Three: evaluate the model with another dataset and an error metric
• Test data are completely random within the domain
• Our error metric: average mean-squared fractional error
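One plausible reading of "average mean-squared fractional error" (an assumption; the slides do not give the exact formula) is to square the fractional, i.e. relative, error at each random test point and average:

```python
import numpy as np

def mean_squared_fractional_error(predicted, measured):
    """Square the fractional (relative) error at each test point,
    then average. Assumes measured values are nonzero, which holds
    for execution times."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    frac = (predicted - measured) / measured
    return float(np.mean(frac ** 2))
```

A fractional metric keeps the score meaningful when timings span orders of magnitude, as they do for the FFT benchmarks.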

CCMT| 10

Performance Modeling
Approach: Interpolation Process (Part 4 of 4)

Step Four: if the error was too great, go back to Step One, but use different parameters
• Obtaining good parameters for Kriging (without much knowledge of the underlying processes) is difficult, so we let computers do it for us
  – Specifically, the Step Four to Step One transition (revising parameters as needed) is done by genetic algorithms (GAs)
    • GAs simulate the evolutionary process by only allowing the best genomes (solutions) to pass on their genes to the next generation
    • The process is computationally intensive, as many Kriging evaluations are required, but a near-optimal solution is likely to result
  – The GAs stop only when a good-enough solution is found, which may take up to several days
• After all iterations, we can incorporate the model into the simulator
• It is necessary to obtain good Kriging parameters for many different sample fractions, so we can know how many samples are needed
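The Step Four to Step One loop can be sketched as a toy GA over a single Kriging-style parameter. This is illustrative only: the real search covers several parameters at once, and `fitness` would wrap a full build-the-surrogate-and-score-it cycle.

```python
import random

def ga_tune(fitness, lo, hi, pop_size=20, generations=30, seed=1):
    """Toy genetic-algorithm loop in the spirit of the slides: keep
    the best genomes, and breed the next generation by blending and
    mutating them. A genome here is one parameter in [lo, hi];
    `fitness` returns an error to minimize."""
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 4]          # survivors pass on genes
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = 0.5 * (a + b) + rng.gauss(0.0, 0.05 * (hi - lo))
            children.append(min(hi, max(lo, child)))
        pop = elite + children
    return min(pop, key=fitness)
```

Because the elites are carried over unchanged, the best genome never gets worse from one generation to the next.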

CCMT| 11

Demo: Modeling on x86
There are a number of tunable parameters which must be selected for each stage:
• Sampling
  – Domain of interest
  – Sampling strategy (linear, random, logarithmic, etc.)
  – Number of points
  – Noise-reduction strategy
• Model construction
  – Interpolation method (Kriging, RBF, etc.)
  – Method parameters
• Error quantification
  – Number of evaluation points
  – Evaluation sampling method
  – Error metric

CCMT| 12

Performance Modeling (Results)

Single-Dimensional Benchmarks (One Input Parameter)

Multi-Dimensional Benchmarks (Two and Three Input Parameters)

CCMT| 13

Performance Modeling (Results)

Kriging versus Nearest-Neighbor

• Kriging outperforms nearest-neighbor interpolation in almost all cases (improvement values greater than unity)
• There is little or no improvement for FFT
• For the high-algorithmic-complexity algorithms, Kriging is much better
• Kriging's improvement is larger for sparser sampling

CCMT| 14

Identified Issues

• Kriging requires the selection of a number of interpolation parameters, the choice of which is not obvious
• We must select a sampling strategy (or set of strategies), each with a number of parameters which must be picked
• In order to effectively model functions like the FFTs, an alternative to Kriging must be used
  – This may be a type of domain partitioning used in conjunction with Kriging, or something entirely different
• Performing Kriging can be cumbersome on some platforms, so it may be excluded from use at run time (which may not be necessary)

CCMT

CCMT Research Thrust

Synchronization & Congestion

CCMT| 2

Synchronization & Congestion
Motivation:
• Current BE infrastructure ensures causality but isn't optimized for large-scale simulation
• Communication events in BE abstract away details of network packet transfers, making robust congestion modeling necessary
Goal: adapt synchronization and congestion-modeling techniques to support simulation experiments with millions of behavioral objects

[Figure: simulator landscape, arranging simulators by fidelity (fine-grained to coarse-grained) and scalability: Gem5, SimpleScalar, Veloce, Palladium, RAMP, MMT, GPGPU-Sim, SST Micro, POEMS, FASE, SST/Macro, FSim, BigSim, ROSS, Parsim, Manifold, and analytical methods]

Disclaimer: this is a first attempt at classifying different types of simulators. Our goal is to understand the unique features and associated advantages of each simulator for adoption choice.

Can we find events that can be executed in parallel?

CCMT| 3

Expressing Causality

• Maintaining causality ensures correctness, and exposing concurrency helps in determining the maximum amount of parallelism in the simulation
• Causal history: all events in an event's past that can affect its outcome
  – Causal history is enough to determine the ordering of events
• Lamport clocks, while not consistent with real time, are consistent with causality
  – Partial ordering of events by assigning timestamps using a logical clock
• Vector time is consistent and can characterize causality
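A minimal sketch of the two mechanisms named above (not CCMT code): Lamport timestamps give a partial order consistent with causality, while vector time characterizes causality exactly.

```python
class LamportClock:
    """Logical clock: timestamps are consistent with causality,
    though not with real time."""
    def __init__(self):
        self.time = 0

    def tick(self):                      # local event or send
        self.time += 1
        return self.time

    def receive(self, msg_time):         # merge the sender's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time


def happened_before(va, vb):
    """Vector time characterizes causality: event a causally precedes
    event b iff a's vector is componentwise <= b's and they differ."""
    return all(x <= y for x, y in zip(va, vb)) and va != vb
```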

CCMT

Do we need complete causal history of an event?

CCMT| 4

Search for Efficient Representation

• The size of time vectors is a limiting factor for scalable simulation
  – Several methods exist for compressing message timestamps, trading off simulation speed, storage, and complete ordering of events
  – Utilize only direct dependencies, or use causal distributed breakpoints
• Creating concurrent regions may be cheaper than tracking full causality
  – A dependence block can be regarded as a single, atomic event
  – This approach offers some scope for abstracting execution details
• There may also be merit in focusing only on the state transitions

Can we sacrifice accuracy to speed up simulations?

What are our options for synchronizing execution of these events?

CCMT| 5

Causality & Event Synchronization

• Non-aggressive vs. aggressive
  – Conservative vs. optimistic
  – Hybrids
• "Limited" optimistic
  – Window-based: events with timestamps within some agreed-upon window are executed between synchronizations
  – Space-based: LPs are divided into clusters; each cluster executes optimistically, and interaction between clusters is conservative
  – Penalty-based: LPs are either penalized and blocked or favored and not blocked, depending on rollback behavior
  – Knowledge-based: optimistic execution; a broadcast message is sent out if an error is detected
  – Probabilistic: periodic probabilistic synchronization of LPs
  – State-based: optimism is continuously adjusted based on local state information
• Application dependent

How is event synchronization handled in existing simulators?

Conservative approach (pessimistic estimates): avoid all causality errors
Optimistic approach (detection & recovery): allow errors, "roll back" to recover
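The window-based "limited optimistic" scheme above can be sketched as a batching loop. This is a simplification: real LPs would execute the events in each window concurrently between barriers, while here we only group them.

```python
import heapq

def simulate_windowed(events, window):
    """Window-based synchronization sketch: between global
    synchronizations, only events whose timestamps fall within an
    agreed-upon window past the current minimum are executed.
    `events` is a list of (timestamp, name) pairs."""
    heap = list(events)
    heapq.heapify(heap)
    batches = []
    while heap:
        floor = heap[0][0]
        batch = []
        while heap and heap[0][0] < floor + window:
            batch.append(heapq.heappop(heap))
        batches.append(batch)            # barrier between windows
    return batches
```

A wider window exposes more parallelism per barrier but raises the chance of causality violations that must be rolled back.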

CCMT| 6

Synchronization Schemes

• The majority of coarse-grained simulators use conservative schemes *
  – BigSim is trace-driven, ParSim is time-driven, and FSim lacks a timing model
  – SST uses a component-based back-end with a global queue and MPI barriers
• ROSS uses the optimistic Time Warp protocol for event ordering
• We envision adopting one, or a combination, of multiple approaches tuned for BE
  – Synchronization scheme fixed or selected by the user as per their need
  – Schemes for different scales of simulation or simulation platform
  – Multi-pass simulation

* List of references at end

How can we specialize and tune these approaches for BE?

CCMT| 7

Parameters for Scalable PDES Design

• Various options available to explore:
  – Partitioning: how should LPs be clustered?
  – Adaptability: how does the simulator change based on state?
  – Aggressiveness: how much conditional knowledge should be processed?
  – Accuracy: how much error can be tolerated?
  – Risk: how far should a potentially incorrect message be propagated?
  – Synchrony: what is the degree of temporal binding or coupling of LPs?
  – Knowledge embedding: how much knowledge of an LP's behavioral attributes is embedded in the simulation?
  – Knowledge dissemination/acquisition: how much does an LP initiate transmission/requests of information to/from other LPs?

How can we specialize and tune these approaches for BE?

The choice of synchronization mechanism will dictate our methods and ability to model congestion on the network.

CCMT| 8

Congestion Modeling

• Many recent simulators and frameworks have low-level network models
  – SST Micro uses high-fidelity component models to simulate CMP and SMP systems
  – SST/Macro uses packet-level, flow-based, and hybrid train models for system networks
  – FSim and xSim use fine-grained network models
  – FSim, xSim, and BigSim allow high-level latency models and detailed models of the communication fabric
• Explore existing congestion models for use in Behavioral Emulation
  – Fine-grained vs. coarse-grained packet-level models
  – Analytical flow-based or train-based models
  – Queuing-theory models
  – System-specific models derived from experiments

How can we specialize and tune these approaches for BE?

CCMT| 9

Congestion Modeling

• Based on tradeoff studies, devise congestion models for BE
  – Adopt one, or a combination, of multiple approaches tuned for BE
  – For the most promising approaches, study variations in congestion behavior and scalability with the choice of synchronization scheme
• Tuning for Behavioral Emulation
  – Congestion model fixed or selected by the user as per their need
  – Different models for different simulation levels (on-chip, inter-node, inter-rack)
  – Models for different levels of the simulation platform (exploit different levels of parallelism)
• The choice of synchronization mechanisms and congestion models can significantly affect the design and speed of emulation in hardware

CCMT

Questions?

CCMT| 11

References

System (macro-scale) simulators:
– C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky, and J. Mayo, "A simulator for large-scale parallel architectures", International Journal of Parallel and Distributed Systems, vol. 1, no. 2, pp. 57-73, 2010. [SST/Macro]
– E. Grobelny, D. Bueno, I. Troxel, A. D. George, and J. S. Vetter, "FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications", Simulation, vol. 83, no. 10, pp. 721-745, Oct. 2007. [FASE]
– G. Zheng, G. Kakulapati, and L. V. Kale, "BigSim: A parallel simulator for performance prediction of extremely large parallel machines", 18th IPDPS, p. 78, 2004. [BigSim]
– A. D. George, R. B. Fogarty, J. S. Markwell, and M. D. Miars, "An Integrated Simulation Environment for Parallel and Distributed System Prototyping", Simulation, vol. 72, pp. 283-294, May 1999. [ISE]
– A. Symons and V. L. Narasimhan, "Parsim: message PAssing computeR SIMulator", IEEE First International Conference on Algorithms and Architectures for Parallel Processing (ICAPP), vol. 2, pp. 621-630, 1995. [Parsim]

CCMT| 12

References

Synchronization in PDES:
– R. Schwarz and F. Mattern, "Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail", Distributed Computing, vol. 7, no. 3, pp. 149-174, 1994.
– F. Mattern, "Virtual Time and Global States in Distributed Systems", Proc. Workshop on Parallel and Distributed Algorithms, Chateau de Bonas, Oct. 1988, M. Cosnard et al. (eds.), Elsevier / North-Holland, pp. 215-226, 1989.
– L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM, vol. 21, no. 7, pp. 558-565, July 1978.
– C. J. Fidge, "Logical Time in Distributed Computing Systems", IEEE Computer, vol. 24, no. 8, pp. 28-33, Aug. 1991.
– P. C. Bates and J. C. Wileden, "High-Level Debugging of Distributed Systems: The Behavioral Abstraction Approach", Journal of Systems and Software, vol. 4, no. 3, pp. 255-264, Dec. 1983.

CCMT| 13

References

Device (micro-scale) & node (meso-scale) simulators:
– Z. Dong, J. Wang, G. Riley, and S. Yalamanchili, "An Efficient Front-End for Timing-Directed Parallel Simulation of Multi-Core Systems", 7th International ICST Conference on Simulation Tools and Techniques (SIMUTools 2014), March 2014. [Manifold]
– J. Wang, J. Beu, S. Yalamanchili, and T. Conte, "Designing Configurable, Modifiable and Reusable Components for Simulation of Multicore Systems", 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, November 2012. [Manifold]
– M. Hsieh, R. Riesen, K. Thompson, W. Song, and A. Rodrigues, "SST: A Scalable Parallel Framework for Architecture-Level Performance, Power, Area and Thermal Simulation", The Computer Journal, vol. 55, no. 2, pp. 181-191, 2012. [SST Micro]
– M. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, and W. Song, "A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration", ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 63-68, 2011. [SST Micro]

Object-oriented system modeling:
– J. C. Browne, E. Houstis, and J. R. Purdue, "POEMS: End-to-End Performance Models for Dynamic Parallel and Distributed Systems"

CCMT| 14

References

Hardware emulation:
– Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanovic, and D. Patterson, "A Case for FAME: FPGA Architecture Model Execution", ISCA '10, June 19-23, 2010, Saint-Malo, France, pp. 290-301.
– J. Wawrzynek, D. A. Patterson, S. Lu, and J. C. Hoe, "RAMP: A Research Accelerator for Multiple Processors", 2006.

Supercomputer-specific modeling & simulation:
– S. R. Alam, R. F. Barrett, M. R. Fahey, J. M. Larkin, and P. H. Worley, "Cray XT4: An Early Evaluation for Petascale Scientific Simulation", 2007.
– A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, "A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple", pp. 1-10, Nov. 2006.

Analytical modeling:
– L. Carrington, A. Snavely, and N. Wolter, "A performance prediction framework for scientific applications", Future Generation Computer Systems, vol. 22, no. 3, pp. 336-346.
– N. Jindal, V. Lotrich, E. Deumens, and B. A. Sanders, "SIPMaP: A Tool for Modeling Irregular Parallel Computations in the Super Instruction Architecture", IPDPS 2013.

CCMT| 1

Scalability Methods
Goal: reduce simulator execution time by exploiting domain-specific simplifying assumptions

• Intended to supplement other simulation efforts by providing techniques and methods which can be used (primarily) prior to simulation time to speed up the simulators
• Here, we focus on the simplifying assumptions which are a result of behavioral emulation (as opposed to general DES problems):
  – "We need not be concerned with the result, just the time it took to get there"
  – "We need not be concerned with handling application-specified non-determinism (RNGs)"
  – "We need not be concerned with handling run-time conditional operations"
    • Acceptable: if (my_rank != 0) { do_something(); }
    • Not acceptable: if (current_time >= 1.5) { do_something(); }
• There may be other sets of simplifying assumptions as a result of being concerned mostly with one application (CMT-related) and one machine size (exascale)

CCMT| 2

Scalability Methods
• Current methods of interest:
  – Global task-graph manipulation (done at compile time; see compilation overview below)
    • Generate a global task graph
    • Manipulate this task graph to reduce the number of total tasks required
  – Micro-scale symmetry exploitation
    • Find code blocks (within a simulation process) which are isomorphic to blocks in other simulation processes
    • Publish these code blocks to a "cache" (or model them in advance) to avoid repetition at the micro scale

[Figure: compilation pipeline. Application Description, Machine Description, and Machine Configuration feed a Parser, which produces an Abstract Syntax Tree; a Code Generator produces Process Code; a Task-Graph Generator produces a Task Graph; a Task-Graph Modifier and a Task-Graph Code Generator produce Abstract Process Code, which goes to the simulator]

CCMT| 3

Task-Graph Manipulation

Example AppBEO:

    define :: dataSize << 128*1024/mpi.maxrank >>
    define :: me << mpi.myrank >>

    import(mpi)
    mpi.perform("initialize", [])

    for (i, [1..30]) {
      if ( (me % 2) = 0 ) {
        mpi.send(me + 1, dataSize)
        mpi.recv(me - 1, dataSize)
        mpi.send(me - 1, dataSize)
        mpi.recv(me + 1, dataSize)
      }
      if ( (me % 2) = 1 ) {
        mpi.recv(me - 1, dataSize)
        mpi.send(me + 1, dataSize)
        mpi.recv(me + 1, dataSize)
        mpi.send(me - 1, dataSize)
      }
      mpi.perform("fft2d", [1024, 128])
    }
    mpi.perform("finalize", [])

• To be able to statically generate a global task graph for all processes, a limited language is necessary (example AppBEO above)
• The language can support:
  – Unconditional looping
  – Compile-time-evaluated conditionals
  – Function calls without side effects
  – Basic macros
• The language must avoid:
  – Variable manipulation and assignment (all expressions must be evaluated at compile time)
  – Run-time conditional statements
  – Random number generators
  – Anything which gives the program a state which is not just the location in the instruction stream
• These are acceptable compromises, as we require only a description language, not a Turing-complete one

CCMT| 4

Task-Graph Manipulation

[Figure: original graph, vertical combination, and block combination]

• High-level process:
  – Produce this global graph (it has about 10^12 entries for a substantial application on an exascale machine), or produce a highly symmetric compressed task graph (if permitted by the application)
  – Section off the graph at a natural hardware level (e.g., per node; one section of a node with two cores)
  – Apply combination and simplification rules to reduce the number of graph nodes
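The "vertical combination" rule can be sketched as repeated merging of single-successor/single-predecessor task pairs. The graph representation, ids, and costs below are hypothetical, chosen only to illustrate the reduction.

```python
def vertically_combine(tasks, succ):
    """Merge task u into its sole successor v whenever u has exactly
    one successor and v has exactly one predecessor; the merged task's
    cost is the sum. `tasks` maps task id to cost; `succ` maps task
    id to a list of successor ids."""
    pred_count = {t: 0 for t in tasks}
    for u in succ:
        for v in succ[u]:
            pred_count[v] += 1
    changed = True
    while changed:
        changed = False
        for u in list(tasks):
            outs = succ.get(u, [])
            if len(outs) == 1 and pred_count[outs[0]] == 1:
                v = outs[0]
                tasks[u] += tasks.pop(v)       # absorb v's cost into u
                succ[u] = succ.pop(v, [])      # u inherits v's successors
                changed = True
                break
    return tasks, succ
```

On a pure chain this collapses the whole chain to one node, which is exactly the reduction that shrinks the 10^12-entry global graph.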

CCMT| 5

Micro-Scale Symmetry Exploitation

• Task-graph simplification will produce multi-operation blocks which require new performance models
• We generate (at compile time, or run time) a model for each such block, and then use the model, but do not re-simulate the internals of the block
  – A block is modeled based on the relative timing of its input signals
  – A block is limited to roughly a maximum of 4-6 inputs before this modeling becomes untenable
• This method is only helpful if there is enough symmetry to find many isomorphic blocks in the simulation (likely)
  – This is likely because each node in a large application tends to do very similar things as the other nodes

[Figure: simplified graph, with a model for a multi-operation block]
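Micro-scale symmetry exploitation amounts to memoizing block models keyed on block structure and relative input timing. A toy cache illustrates the idea; the latency formula is a placeholder, not a real performance model.

```python
class BlockModelCache:
    """Cache of performance models for multi-operation blocks: the
    first time a block with a given structural signature and relative
    input timing is seen, simulate it in detail and store the result;
    isomorphic blocks elsewhere reuse the cached model."""
    def __init__(self):
        self.cache = {}
        self.simulations = 0

    def latency(self, signature, input_times):
        key = (signature, tuple(input_times))
        if key not in self.cache:
            self.simulations += 1        # stand-in for a detailed simulation
            self.cache[key] = sum(input_times) + 10.0
        return self.cache[key]
```

The savings grow with the number of isomorphic blocks: a million identical nodes cost one detailed simulation instead of a million.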

CCMT| 6

Issues and Conclusions

• Known plausible limitations to task-graph simplification:
  – Must be able to generate the global task graph (may not be feasible due to time constraints)
  – Must be able to effectively manipulate this graph in a timely manner
  – The application must be written such that the task graph can be simplified
• Known plausible limitations to micro-scale symmetry exploitation:
  – The application must be written such that there is enough symmetry to exploit across processes (likely)
  – It must be possible to generate models fairly quickly
• Conclusions:
  – If they are feasible, these methods could permit huge simulator speedups for large machines
  – They may also permit less accurate, but very fast, analytical solutions for determining the run time of a simulated machine/application pair

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL-PRES-xxxxxx

Lawrence Livermore National Laboratory

[LLNL presentation slides; most slide text is not recoverable from this transcript. Recoverable figure fragments:]

• [Figure: input data from multiple measurements; available visualization domains; selected visualization: 3D torus; choice of mappings: present data on nodes or links?; drag selection to map data to visualization]

• [Figure: sp-mz benchmark under a 4500 W power bound. Axes: Nodes (8-32), Cores (4-16), Processor Power Bound in Watts (51, 65, 80, 95, 115); metrics: Avg. Watts and Seconds; annotations: max power per processor, best configuration]

• [Figure: ParaDiS timesteps 1-500; critical path, node max load imbalance, common case]

• [Figure: power (Watts) versus time]

CCMT

Hardware-Software Co-design of CMT-Nek Codes
Performance, Energy and Thermal Issues

Tania Banerjee and Sanjay Ranka
Computer and Information Science and Engineering

CCMT T5 | 2

Long Term Goals

10^6 - 10^9 cores

• Parallelization and UQ of Rocflu and CMT-Nek beyond a million cores
• Parallel performance and load balancing
• Single-processor (hybrid) performance
• Energy management and thermal issues

CCMT T5 | 3

Hybrid Multicores: Performance, Energy and Thermal Management

10^1 - 10^4 cores

• Code generation for hybrid cores
  – Support for multiple types of cores
  – Support for vectorization
• Multi-objective optimization
  – Energy
  – Performance
• Thermal constraints

CCMT T5 | 4

Hybrid Multicores: Performance, Energy and Thermal Management

• Multiple elements can be optimized for energy
  – Processor (dynamic voltage scaling)
  – Caches (dynamic cache reconfiguration)
  – Buses
  – Memory
• Multi-objective optimization
  – Energy
  – Performance
• Multiple constraints
  – Thermal issues
  – Packaging issues

[Figure: feasible space in the energy-time plane, with design points A and B marked]

Page 72: Deep Dive, University of Florida (02/2015)

CCMT| 5

Multi-core processors

Intel 48-core Processor

The number of cores is growing

8 cores AMD Opteron 16-Core Processor

Nvidia Kepler Processor

Nvidia Fermi GPU Nvidia Maxwell GPU

CCMTT5 6

Multiple Cores

Common:
� Multiple flows of Control
� Multiple Local Memories

Differences:
� Synchronization
� Communication

Single Processor Performance (Hybrid Multicores): GPU Cores

Common:
� Single/Multiple flows of Control
� Multiple Local Memories

Differences:
� Amount of Local memory
� Communication

Page 73: Deep Dive, University of Florida (02/2015)

CCMT| 7

Our Work

7

Optimization Goals and Environments

Performance (P): execution time, throughput, makespan
Energy (E): total energy consumed, energy due to leakage power
Temperature (T): maximum temperature, average temperature, spatial and temporal gradients

CCMT8

Performance, Energy and Thermal Levers

L1 Cache Reconfiguration

L2 Cache Reconfiguration

DVS of Cores

DVS of Buses

Page 74: Deep Dive, University of Florida (02/2015)

CCMT| 9

Our work

Develop an integrated framework for multicore machines that addresses:

1. Computation2. Energy3. Temperature

Challenging Multi-objective optimization and system issues.

CCMT| 10

Spectral Element Method

� ∂u/∂r(i, j, k) = Σ_l A(i, l) · u(l, j, k)

� ∂u/∂s(i, j, k) = Σ_l B(l, j) · u(i, l, k)

� ∂u/∂t(i, j, k) = Σ_l C(l, k) · u(i, j, l)

� If Nx = Ny = Nz = N, then B = C = A^T

� Complexity: O(N^4)
� N is typically between 5 and 25

– A large number of small matrix multiplications

[Figure: element in physical (x, y, z) and reference (r, s, t) coordinates]

The derivative computing kernel requires 75-80% of the total execution time of CMT-Nek.
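The contraction above can be sketched with NumPy; this is only an illustrative restatement of the dudr-4loop Fortran kernel shown later, with random data and an assumed element size:

```python
import numpy as np

N = 10                       # points per direction (typically 5-25)
rng = np.random.default_rng(0)
a = rng.random((N, N))       # 1-D derivative operator A
u = rng.random((N, N, N))    # solution values on one spectral element

# dudr(i, j, k) = sum_l a(i, l) * u(l, j, k), a small matrix multiply
dudr = np.einsum('il,ljk->ijk', a, u)

# Reference: the literal four-loop translation of the kernel
ref = np.zeros((N, N, N))
for k in range(N):
    for j in range(N):
        for i in range(N):
            for l in range(N):
                ref[i, j, k] += a[i, l] * u[l, j, k]

assert np.allclose(dudr, ref)
```

Because N is small, the work is many independent small matrix multiplications rather than one large one, which is what makes autotuning the loop nest worthwhile.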

Page 75: Deep Dive, University of Florida (02/2015)

CCMT| 11

Spectral Elements:Derivatives and codes

� Similarly, 5loop versions and 5loop-fused versions were considered.

Algorithm: dudr-4loop
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      do l = 1, Nx
        dudr(i, j, k) = dudr(i, j, k) + a(i, l) * u(l, j, k, ie)
      enddo
    enddo
  enddo
enddo

Algorithm: dudr-4loop-fused
do k = 1, Nz * Ny
  do i = 1, Nx
    do l = 1, Nx
      dudr(i, k) = dudr(i, k) + a(i, l) * u(l, k, ie)
    enddo
  enddo
enddo

CCMT| 12

Optimizations

� Autotuning– Apply loop transformations

• Loop permutation• Loop unroll

– CHiLL applies loop transformation automatically on the target code

Related Work: C. Chen, J. Chame, M.W. Hall, CHiLL: A Framework for Composing High-Level Loop Transformations, Technical Report 08-897, University of Southern California, Computer Science Department, 2008.

Page 76: Deep Dive, University of Florida (02/2015)

CCMT| 13

Loop permutation

� Simple when you have perfect nesting

do k = 1, nz1
  do j = 1, ny1
    do i = 1, nx1
      statement
    enddo
  enddo
enddo

do i = 1, nx1
  do j = 1, ny1
    do k = 1, nz1
      statement
    enddo
  enddo
enddo

do i = 1, nx1
  do k = 1, nz1
    do j = 1, ny1
      statement
    enddo
  enddo
enddo

do j = 1, ny1
  do k = 1, nz1
    do i = 1, nx1
      statement
    enddo
  enddo
enddo

do j = 1, ny1
  do i = 1, nx1
    do k = 1, nz1
      statement
    enddo
  enddo
enddo

do k = 1, nz1
  do i = 1, nx1
    do j = 1, ny1
      statement
    enddo
  enddo
enddo

CCMT| 14

Loop unroll

� Duplicate the loop body and adjust the loop header and data indexes
� May be applied to outer-level loops too
� Unroll factors are preferably divisors of the iteration space
� Reduces the number of limit checks for the iterator
� Exposes vectorization opportunities to the back-end compiler
  ─ c(i:i+4, j, k) = a(j, i:i+4) * b(i:i+4, k)
� Code size increases, which may result in higher I-cache miss rates

do k = 1, 10
  do j = 1, 10
    do i = 1, 10
      c(i, j, k) = a(j, i) * b(i, k)
    enddo
  enddo
enddo

do k = 1, 10
  do j = 1, 10
    do i = 1, 10, 2
      c(i, j, k) = a(j, i) * b(i, k)
      c(i+1, j, k) = a(j, i+1) * b(i+1, k)
    enddo
  enddo
enddo

Page 77: Deep Dive, University of Florida (02/2015)

CCMT| 15

Possible Combinations

Number of implementations for Nx = Ny = Nz = 10:
4! * 4^4 = 24 * 256 = 6144 variants
Total number of variants: 98240
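The 6144 count for the 4loop version can be checked directly: one loop order for the 4 loops times 4 unroll-factor choices for each of the 4 loops (the per-loop choice count of 4 follows the slide's 4^4; the 98240 total sums over the other code versions, whose breakdown is not given here):

```python
from math import factorial

permutations = factorial(4)     # orderings of the 4 loops
unroll_choices = 4 ** 4         # 4 unroll factors per loop, 4 loops
variants = permutations * unroll_choices
print(variants)                 # 6144
```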

Algorithm: dudr-4loop
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      do l = 1, Nx
        dudr(i, j, k) = dudr(i, j, k) + a(i, l) * u(l, j, k, ie)
      enddo
    enddo
  enddo
enddo

CCMT| 16

CHiLL example

CHiLL script

Input code

Output code

Algorithm: dudr-4loop-fused
do k = 1, Nz * Ny
  do i = 1, Nx
    do l = 1, Nx
      dudr(i, k) = dudr(i, k) + a(i, l) * u(l, k, ie)
    enddo
  enddo
enddo

CHiLL script:
permute([2, 1, 3])
unroll(0, 1, 1)
unroll(0, 2, 2)
unroll(0, 3, 5)

Page 78: Deep Dive, University of Florida (02/2015)

CCMT| 17

Genetic Algorithm

� We use genetic algorithms to search the exploration space efficiently.

� “Hello world!” example
  ─ Start with an arbitrary 12-character string
  ─ Goal is to generate “Hello world!”
  ─ Probability of coming up with the target string in one try: 1/95^12

A "Hello World!" Genetic Algorithm Example by James Matthews athttp://www.generation5.org/content/2003/gahelloworld.asp

CCMT| 18

GA Characteristics

� Individuals
  ─ Each individual consists of a 12-letter string
  ─ Each individual has a fitness value

� Fitness function:
  ─ Sum of the distance of each letter from the target letter

� Population
  ─ Individuals make up a population
  ─ Population size = 2048 for this problem
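A minimal sketch of this “Hello world!” GA, using the fitness function above with single-point crossover, random point mutation and truncation selection; the population size, mutation rate and elite fraction here are illustrative, not the slide's settings:

```python
import random
import string

TARGET = "Hello world!"
CHARS = string.printable[:95]        # the 95 printable ASCII characters

def fitness(s):                      # sum of per-letter distances to target
    return sum(abs(ord(a) - ord(b)) for a, b in zip(s, TARGET))

def crossover(p1, p2):               # single random cut point
    cut = random.randrange(len(TARGET))
    return p1[:cut] + p2[cut:]

def mutate(s, rate=0.25):            # random point mutation
    if random.random() < rate:
        i = random.randrange(len(s))
        s = s[:i] + random.choice(CHARS) + s[i + 1:]
    return s

def run(pop_size=400, generations=200, elite=0.1, seed=1):
    random.seed(seed)
    pop = ["".join(random.choice(CHARS) for _ in TARGET)
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        if history[-1] == 0:
            break
        parents = pop[:int(elite * pop_size)]   # truncation selection
        pop = parents + [mutate(crossover(random.choice(parents),
                                          random.choice(parents)))
                         for _ in range(pop_size - len(parents))]
    return pop[0], history

best, history = run()
print(best, history[-1])
```

Because the elite parents are carried over unchanged, the best fitness never worsens from one generation to the next.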

Page 79: Deep Dive, University of Florida (02/2015)

CCMT| 19

GA Operators

� Operators are used to create new individuals from existing ones
  ─ Creates a new generation

� Possible operations
  ─ Crossover
    • Parent 1: GWTc')kv2%8@
    • Parent 2: 4K)?vM^pE`Yp
    • Result: GK)?vM^pE`Yp
  ─ Mutation
    • Random point mutation
    • GK)?vM^pE`Zp

CCMT| 20

Genetic Algorithm

� Algorithm:
Initialize population
Do i = 1, max_iter
  calculate fitness of population
  sort population based on fitness
  print the member p with the best fitness
  if p.fitness is 0 then break
  apply crossover and mutation operations on pairs of members to create a new population
Enddo

Page 80: Deep Dive, University of Florida (02/2015)

CCMT| 21

Genetic Algorithm

� Output:– Best: RV[S`(yxj)p! (188)– Best: 8mkCrJvrhsT& (153)– Best: Yakor7yiIvg (132)– Best: GvthH$vrryU" (106)– Best: BiXpb wqwXg& (82)– Best: Sdmul0wqwXe' (75)– Best: J]ndm"wqwvl% (53)– Best: ?_jyk"uonnk (52)– Best: J]ndm"wqwac! (43)– Best: J]ndm"wquqg& (42)– Best: Chkmo"vtuqg& (34)– Best: Gllho wpuig! (22)– Best: Hdmul wqqmf" (21)

– Best: Hdmul wqqmf" (21)– Best: Ldnlp wqqmf" (15)– Best: Ldnlp wqqmf" (15)– Best: Jckop wqrnc! (14)– Best: Hejlp"wqrlg" (11)– Best: Ifklp wqrnc! (9)– Best: Ifklp wqrnc! (9)– Best: Hejlm wprnc! (8)– Best: Jenlo wprld! (5)– Best: Genlo wprld! (4)– Best: Genlo wprld! (4)– Best: Hellp wormd" (3)– Best: Helko wprld! (2)

-- Best: Helko wprld! (2)– Best: Helmo world! (1)– Best: Hello wormd! (1)– Best: Hello world! (0)

CCMT| 22

Genetic Algorithm

� Iterations: 30
� Total number of individuals studied: 30 × 2048 = 61440

Page 81: Deep Dive, University of Florida (02/2015)

CCMT| 23

GA (Our application)

� Individuals
  ─ Base code
  ─ Permutation sequence
  ─ Maximum of 5 unroll factors

CCMT| 24

GA (Our application)

� Fitness function
  ─ Time
  ─ Energy

� Population size
  ─ 100 individuals

� Operators
  ─ Mutation
  ─ Crossover

Page 82: Deep Dive, University of Florida (02/2015)

CCMT| 25

GA (Our application)

� Mutation

CCMT| 26

GA (Our application)

� Crossover

Page 83: Deep Dive, University of Florida (02/2015)

CCMT| 27

GA (Our application)

� As a result of mutation and crossover, certain incompatibilities may arise.

� Example: the base code of individual P mutates to 4loop

CCMT| 28

GA (Our application)

� Knowledge-based GA: not 100% random
  ─ Inclusion of the target individual in the initial population
    • Target individual = CMT-Nek multiplication algorithm
  ─ Crossover: loop permutation has a greater role in performance, so inherit the loop permutation sequence from the better-performing parent
  ─ Fixing incompatibilities in the crossover operation

Page 84: Deep Dive, University of Florida (02/2015)

CCMT| 29

GA (Our application)

[Flowchart] Input: n. Generate the initial population; for i = 1..n, generate the algorithm for the ith individual, compile and run the matrix multiplication, and set the individual's fitness. Sort the individuals and report the best one. If the stopping criterion is met, stop; otherwise create a new generation and repeat.

CCMT| 30

GA (Our application)

� Stopping criteria
  ─ The last three generations result in the same best individual
  ─ A pre-defined maximum number of iterations is reached
  ─ Improvement in performance of the best individual compared to the target individual is x%
    • x is set dynamically
    • First 5 iterations, x = 60%
    • Next 5 iterations, x = 50%
    • …
    • Next 5 iterations, performance of the best individual is better than that of the target individual

Page 85: Deep Dive, University of Florida (02/2015)

CCMT| 31

GA Output

� Matrix size 12:
  ─ Itr: 1 Best Algo: 4loop Unroll factors: 3 1 2 4 Permute: 4235 Fitness: 4.57044
  ─ Itr: 2 Best Algo: 4loopfused Unroll factors: 6 6 3 12 Permute: 423 Fitness: 3.33914
  ─ Itr: 3 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236
  ─ Itr: 4 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236
  ─ Itr: 5 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236

CCMT| 32

Performance And Energy

� Software implementations:
  ─ CMT-Nek
  ─ 4loop version
  ─ 4loop-fused version
  ─ 5loop version
  ─ 5loop-fused version

� CPU platforms:
  ─ IBM Blue Gene/Q
  ─ AMD Opteron 6378

Performance and Energy Benchmarking of Spectral Element Solvers, Tania Banerjee and Sanjay Ranka (in preparation)

Page 86: Deep Dive, University of Florida (02/2015)

CCMT| 33

Architectures

� BG/Q node
  ─ Cores: 16
  ─ Each core: 4-way SMT, 1.6 GHz
  ─ 204.8 GFLOPS peak performance
  ─ 55 W peak power

� Dell 6145 node
  ─ 4 AMD Opteron CPUs
  ─ Each CPU: 16 cores, 2.4 GHz
  ─ 614.4 GFLOPS peak performance
  ─ 115 W peak power

CCMT| 34

IBM BG/Q (Performance)

� Comparable total number of grid nodes, but performance with the 10x10x10 matrix size is 20% better than with 16x16x16

� Matrix size 10x10x10, 100 elements
  ─ 51% improvement versus CMT-Nek (~2 times)
  ─ 34 GFLOPS average

� Matrix size 16x16x16, 25 elements
  ─ 61% improvement versus CMT-Nek (~2.53 times)
  ─ 12.7 GFLOPS average

[Bar charts: runtime (seconds) of the dudr, dudt and duds derivative kernels for CMT-Nek, 5loop-fused, 4loop and 4loop-fused, for the 10x10x10 and 16x16x16 cases]

Page 87: Deep Dive, University of Florida (02/2015)

CCMT| 35

AMD Opteron (Performance)

� Matrix size 10x10x10, 100 elements
  ─ GNU compilers
  ─ 43% improvement versus CMT-Nek (~1.72 times)
  ─ 209 GFLOPS

� Matrix size 16x16x16, 25 elements
  ─ GNU compilers
  ─ 42% improvement versus CMT-Nek (~1.73 times)
  ─ 80 GFLOPS

[Bar charts: runtime (seconds) of dudr, dudt and duds for CMT-Nek, 5loop, 5loop-fused, 4loop and 4loop-fused on the AMD Opteron, for the 10x10x10 and 16x16x16 cases]

� Comparable total number of grid nodes, but performance with the 10x10x10 matrix size is 10% better than with 16x16x16

CCMT| 36

Energy Measurements on IBM BG/Q

� Environmental Monitoring (EMON) APIs
� MonEQ wrapper library
  ─ Reports power consumption by domains
  ─ Utilizes interrupts for more frequent current/voltage readings
  ─ APIs:
    • MonEQ_Initialize
    • MonEQ_Finalize
    • MonEQ_StartPowerTag
    • MonEQ_EndPowerTag

Related Work: S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, M. E. Papka, Measuring Power Consumption on IBM Blue Gene/Q, IEEE International Symposium on Parallel and Distributed Processing Workshops, 2013.

Page 88: Deep Dive, University of Florida (02/2015)

CCMT| 37

BG/Q Power Domains

� Power is measured on a node-board basis
� Power Domains:
  ─ Core logic power
  ─ Chip memory interface and SDRAM-DDR3
  ─ Optical module power
  ─ Optical module power + PCI Express
  ─ HSS network transceiver power for Compute + Link chip
  ─ Link chip core power
  ─ Core array power

CCMT| 38

Monitoring Power

� Power consumed by the basic dudt-4loop implementation for matrix size 10x10x10

� After the initial start, power consumption is constant for all domains

Page 89: Deep Dive, University of Florida (02/2015)

CCMT| 39

Energy versus Performance Plots

[Scatter plots of energy (Joules) versus runtime (seconds): dudt 4loop-fused; dudr 4loop; dudt 4loop; dudt 5loop-fused]

CCMT| 40

IBM BG/Q (Energy)

� Observations:
  ─ matrix size 10x10x10, 100 elements
  ─ 55% reduction in energy versus CMT-Nek

[Bar charts: energy consumption (Joules) and runtime (seconds) of the derivative kernels for CMT-Nek, 5loop-fused, 4loop and 4loop-fused]

Page 90: Deep Dive, University of Florida (02/2015)

CCMT| 41

IBM BG/Q (Energy)

� Observations:
  ─ matrix size 16x16x16, 25 elements
  ─ 56.8% reduction in energy versus CMT-Nek
  ─ Consumes 40% more energy compared to the 10x10x10 case

[Bar charts: energy consumption (Joules) and runtime (seconds) of dudr, dudt and duds for CMT-Nek, 5loop-fused, 4loop and 4loop-fused, 16x16x16 case]

CCMT| 42

GA Results

� Hipergator (Performance)
� Teller@Sandia (Energy)
  ─ 104-node cluster
  ─ AMD Fusion A10-5800K
  ─ 4 cores operating at 3.8 GHz
  ─ Used PowerInsight to measure power

Related Work: J. H. Laros III, P. Pokorny, and D. DeBonis, PowerInsight – A Commodity Power Measurement Capability, The Third International Workshop on Power Measurement and Profiling in conjunction with IEEE IGCC 2013, 2013.

Page 91: Deep Dive, University of Florida (02/2015)

CCMT| 43

Results (Hipergator)

CCMT| 44

Results (Hipergator)

• Between 9.7% and 38.6% improvement, with an average improvement of 28.2%
• Maximum improvement for N=12

Page 92: Deep Dive, University of Florida (02/2015)

CCMT| 45

Results (SNL)

• Between 27% and 45% improvement, with an average improvement of 37%
• Maximum improvement for N=8 and N=12

CCMT| 46

Results (SNL)

• Between 23% and 45% improvement, with an average improvement of 34%
• Average power consumption is about the same for the various implementations across different matrix sizes
• Improvement in energy consumption heavily reflects improvement in runtime

Page 93: Deep Dive, University of Florida (02/2015)

CCMT| 47

Energy Versus Performance

[Scatter plot: energy (Joules) versus runtime (seconds)]

CCMT| 48

Energy Versus Performance

[Scatter plot: energy (Joules) versus runtime (seconds)]

Page 94: Deep Dive, University of Florida (02/2015)

CCMT| 49

Integration with CMT-Nek

Algorithm:
For all spectral elements do
  Compute dudr
  Compute duds
  Compute dudt
  Populate du for the (x, y, z) coordinate system using dudr, duds and dudt above
Enddo

File: navier1.f
Subroutine: conv1
Inputs: du, u — where du is a derivative matrix populated in the subroutine and u is the function matrix

CCMT| 50

GPU Architecture

Tesla K20c: 13 processors, 192 cores each, 48 KB shared memory, 64K registers, 1170 GFLOP/s peak

Page 95: Deep Dive, University of Florida (02/2015)

CCMT| 51

GPU Implementation

▪ Optimizations:
  ─ The derivative operator matrices D and D^T are brought only once per block from device memory to shared memory.
  ─ Alternatively, D and D^T are stored in registers instead of shared memory.

Related work: A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices, C. Jhurani, P. Mullowney, Journal of Parallel and Distributed Computing September 2014.

CCMT| 52

GPU Performance

▪ Compared with
  ─ CUGEMM
  ─ `Combined': computation of dudr, duds and dudt happens in one kernel, reusing function data as much as possible.
  ─ `Separate': computation of dudr, duds and dudt happens in separate kernels, independent of each other.

▪ Platform: Tesla K20c

Page 96: Deep Dive, University of Florida (02/2015)

CCMT| 53

GPU (Performance)

▪ Observations
  ─ Performance increases nearly linearly with matrix size
  ─ Over 180 GFLOPS for matrix size 16x16x16
  ─ 39% improvement versus CUGEMM for matrix size 16x16x16

CCMT| 54

Energy Modeling

▪ nvidia-smi gives instantaneous power
▪ Run the kernel thousands of times
▪ Measure power every second
▪ Divide GFLOPS by the power consumed to obtain performance per watt
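The procedure above can be sketched as a small post-processing step. The power samples and the sustained GFLOP/s rate below are hypothetical stand-ins for readings one might collect once per second (e.g. via `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`):

```python
# Hypothetical once-per-second power samples (watts)
samples_w = [98.0, 121.5, 122.0, 121.8, 99.0]
interval_s = 1.0

elapsed_s = len(samples_w) * interval_s
energy_j = sum(p * interval_s for p in samples_w)   # E = sum of P * dt
avg_power_w = energy_j / elapsed_s

sustained_gflops = 180.0          # assumed sustained rate over the run
gflops_per_watt = sustained_gflops / avg_power_w
print(round(energy_j, 1), round(avg_power_w, 2), round(gflops_per_watt, 2))
```

Running the kernel thousands of times keeps the measured window long relative to the one-second sampling interval, so this rectangle-rule integration is adequate.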

Page 97: Deep Dive, University of Florida (02/2015)

CCMT| 55

GPU (Energy)

▪ Observations:
  ─ Power consumed was nearly the same for each kernel
  ─ Hence performance/watt is dominated by the performance results.

CCMT| 56

Hybrid processors: Performance and energy

CPU  | GPU   | CPU Time (ms) | GPU Time (ms) | CPU Energy (mJ) | GPU Energy (mJ) | Total Time (ms) | Total Energy (mJ)
0    | 10000 | 0.0000        | 0.8333        | 0.0000          | 0.8333          | 0.8333          | 0.8333
10   | 9990  | 0.0333        | 0.8325        | 0.4167          | 0.8325          | 0.8325          | 1.2492
20   | 9980  | 0.0667        | 0.8317        | 0.8333          | 0.8317          | 0.8317          | 1.6650
40   | 9960  | 0.1333        | 0.8300        | 1.6667          | 0.8300          | 0.8300          | 2.4967
80   | 9920  | 0.2667        | 0.8267        | 3.3333          | 0.8267          | 0.8267          | 4.1600
160  | 9840  | 0.5333        | 0.8200        | 6.6667          | 0.8200          | 0.8200          | 7.4867
320  | 9680  | 1.0667        | 0.8067        | 13.3333         | 0.8067          | 1.0667          | 14.1400
640  | 9360  | 2.1333        | 0.7800        | 26.6667         | 0.7800          | 2.1333          | 27.4467
1280 | 8720  | 4.2667        | 0.7267        | 53.3333         | 0.7267          | 4.2667          | 54.0600
2560 | 7440  | 8.5333        | 0.6200        | 106.6667        | 0.6200          | 8.5333          | 107.2867

Page 98: Deep Dive, University of Florida (02/2015)

CCMT| 57

Interpolation

� Optimizations:
  ─ Matrix multiplication
  ─ Ordering of operations
    • Expensive operations can be done earlier, while the matrix size is still small
    • xyz, xzy, yxz, yzx, zxy, zyx

CCMT| 58

Lightweight Distributed Metric Service

� LDMS: data collection tool at LANL
� Gives temperature numbers after logging in
� Sample output:
  1421325154.002116, 2116, 1600, 30, 30, 30, 32
� Units of temperature: degrees Celsius
� Sensors are placed inside the core.

Page 99: Deep Dive, University of Florida (02/2015)

CCMT| 59

Conclusions

� We benchmarked the most compute-intensive kernel of CMT-Nek for performance and energy.
� Our work highlights autotuning as an important strategy for improving both performance and energy across different architectures
  ─ We obtained between 42-61% improvement in performance and about 55% improvement in energy requirement
� We coupled genetic algorithms with autotuning to perform a smart search
� We are currently working on spectral interpolation and temperature measurements

CCMT| 60

Dynamic Voltage Scaling

60


Page 100: Deep Dive, University of Florida (02/2015)

CCMT| 61

Reconfigurable Cache

61

Capacity Tuning

Associativity Tuning

Line Size Tuning

Zhang et al., ACM TECS 2005

CCMT| 62

Our Research

62

Page 101: Deep Dive, University of Florida (02/2015)

CCMT| 63

Optimal Cache Configurations

63

CCMT64

Cache Reconfiguration in Multi-core Systems

W. Wang, P. Mishra and S. Ranka, DAC, 2011

Page 102: Deep Dive, University of Florida (02/2015)

CCMT65

Cache Reconfiguration and Partitioning

Dynamic cache reconfiguration in L1 caches and partitioning in L2 cache are highly correlated.

CCMT66

Algorithm

• We assume the task mapping is given

• We statically profile each independent task for all L1 cache configurations and L2 cache partitioning factors
  ─ Greatly reduced design space size

• Step 1: We employ a dynamic-programming-based algorithm to find the optimal L1 cache configurations for each core (with multiple tasks) separately, under all L2 partition factors

• Step 2: We then find the optimal L2 cache partition scheme

Page 103: Deep Dive, University of Florida (02/2015)

CCMT67

Algorithm Illustration

67

Step 1

Step 2

CCMT68

Our approach can achieve 29% energy savings compared with CP and up to 14% savings compared with DCR + UCP

Page 104: Deep Dive, University of Florida (02/2015)

CCMT| 69

Thermal Issues of Multi-core Processors

� The power density of multi-core processors has doubled every three years, and this rate is expected to increase as frequencies scale faster than operating voltages

� A small increase of 10°C in temperature may result in a 2× reduction in the lifespan of the device

� The cost of the cooling system increases super-linearly with power consumption

Power and heat flux trend in the desktop processor

CCMT| 70

Managing Temperature: Motivation
Temperature varies across the cores [Sarood2011]

Tilera Processor

Page 105: Deep Dive, University of Florida (02/2015)

CCMT| 71

Thermal RC model

• P: power consumption
• Ti: initial temperature
• t: execution time

There are three major factors affecting the on-chip transient temperature: the average power of the processor, the initial temperature and the execution time

T(t) = P·R + T_A − (P·R + T_A − T_i) · e^(−t/(R·C))

CCMT| 72

Thermal Management

� Sensor based
  ─ [Mukherjee06, Sharifi08, Cochran09]
  ─ Pros: low computation overhead
  ─ Cons: imprecision and noise in the raw data of the temperature sensor; fixed position of the sensor

� Model based
  ─ [Skadron03, Huang04, Liu06, Rao07]
  ─ Pros: estimates the temperature accurately
  ─ Cons: high computation overhead

Page 106: Deep Dive, University of Florida (02/2015)

CCMT| 73

HotSpot

CCMT| 74

Steady State Temperature

� Assume P is the power of a thermal element, T_i is the initial temperature and T_A is the ambient temperature; then the transient temperature at time t is:

  T_t = P·R + T_A − (P·R + T_A − T_i) · e^(−t/(R·C))

� As t → ∞, we get the steady-state temperature:

  T_s = P·R + T_A

� Time to reach the steady-state temperature: 20 ms – 20 s
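The transient and steady-state expressions can be evaluated directly; the R, C, power and temperature values below are illustrative only, not taken from the slides:

```python
import math

def transient_T(P, R, C, T_A, T_i, t):
    """T(t) = P*R + T_A - (P*R + T_A - T_i) * exp(-t / (R*C))."""
    T_ss = P * R + T_A               # steady state as t -> infinity
    return T_ss - (T_ss - T_i) * math.exp(-t / (R * C))

# Illustrative values: 40 W element, R = 0.8 K/W, RC = 2 s,
# 45 C ambient, element starting at 50 C
P, R, C, T_A, T_i = 40.0, 0.8, 2.5, 45.0, 50.0
for t in (0.0, 1.0, 5.0, 50.0):
    print(t, round(transient_T(P, R, C, T_A, T_i, t), 2))
```

At t = 0 the formula returns the initial temperature T_i, and for t much larger than R·C it approaches the steady state P·R + T_A, consistent with the 20 ms – 20 s settling range above.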

Page 107: Deep Dive, University of Florida (02/2015)

CCMT| 75

– G: thermal conductance between thermal elements (cores and heatsinks)

– A: thermal conductance between thermal elements and the outside environment

– Thermal conductance is the quantity of heat transferred with a temperature difference of one kelvin, measured in W/K

Matrix Model

CCMT| 76

Matrix Model

� Suppose there are u cores and v heatsinks, T_m is the steady-state temperature of thermal element m, and T_A is the ambient temperature. For each m ∈ M:

  Σ_{n∈M} G(m, n)·(T(m) − T(n)) + A(m)·(T(m) − T_A) = P(m)

In matrix form:

  R · (T − T_A) = P        (1)

where R is a (u+v) × (u+v) matrix with

  R(i, j) = −G(i, j)                      if i ≠ j
  R(i, i) = A(i) + Σ_{k≠i} G(i, k)

Page 108: Deep Dive, University of Florida (02/2015)

CCMT| 77

Matrix Model

� In a multi-core processor with u cores:

  T_core = T_A · I + C · P

  ─ T_core: 1×u vector of steady-state core temperatures
  ─ T_A: ambient temperature; I = [1, 1, · · · , 1]^T
  ─ C: u×u matrix; the change in temperature of the mth core caused by the nth core is C_{m,n} times the change in thermal power of the nth core
  ─ P: 1×u vector of core powers

CCMT| 78

Matrix Model

� Generation of the u×u matrix C (row 0): perturb P0 by α

  HotSpot: [P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
  HotSpot: [P0+α, P1, …, Pu−1] → [T′0, T′1, …, T′u−1]
  Subtract: [C00, C01, …, C0u−1] = ([T′0, …, T′u−1] − [T0, …, Tu−1]) / α

  C00 C01 C02 ... C0u−1

Page 109: Deep Dive, University of Florida (02/2015)

CCMT| 79

Matrix Model

� Generation of matrix C (row 1): perturb P1 by α

  HotSpot: [P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
  HotSpot: [P0, P1+α, …, Pu−1] → [T′0, T′1, …, T′u−1]
  Subtract: [C10, C11, …, C1u−1] = ([T′0, …, T′u−1] − [T0, …, Tu−1]) / α

  C00 C01 C02 ... C0u−1
  C10 C11 C12 ... C1u−1

CCMT| 80

Matrix Model

� Generation of matrix C (row 2): perturb P2 by α

  HotSpot: [P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
  HotSpot: [P0, P1, P2+α, …, Pu−1] → [T′0, T′1, …, T′u−1]
  Subtract: [C20, C21, …, C2u−1] = ([T′0, …, T′u−1] − [T0, …, Tu−1]) / α

  C00 C01 C02 ... C0u−1
  C10 C11 C12 ... C1u−1
  C20 C21 C22 ... C2u−1

Page 110: Deep Dive, University of Florida (02/2015)

CCMT| 81

Matrix Model

� Generation of matrix C (row u−1): perturb Pu−1 by α

  HotSpot: [P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
  HotSpot: [P0, P1, …, Pu−1+α] → [T′0, T′1, …, T′u−1]
  Subtract: [Cu−1,0, Cu−1,1, …, Cu−1,u−1] = ([T′0, …, T′u−1] − [T0, …, Tu−1]) / α

  C00 C01 C02 ... C0u−1
  C10 C11 C12 ... C1u−1
  C20 C21 C22 ... C2u−1
  ...
  Cu−1,0 Cu−1,1 Cu−1,2 ... Cu−1,u−1
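The perturb-one-core-at-a-time procedure can be sketched as follows. The slides drive HotSpot here; this sketch substitutes a toy hidden linear model in its place (an assumption, purely to make the example self-contained), so the recovered C can be checked against the hidden coefficients:

```python
import numpy as np

u = 4                                   # cores
rng = np.random.default_rng(0)

# Stand-in for HotSpot: a hidden linear model T = T_A + M @ P.
T_A = 45.0
M_true = 0.5 * np.eye(u) + 0.05 * rng.random((u, u))
def hotspot(P):
    return T_A + M_true @ P

P0 = np.full(u, 40.0)                   # baseline power vector
T0 = hotspot(P0)
alpha = 1.0                             # power perturbation (watts)

C = np.zeros((u, u))
for n in range(u):                      # perturb one core at a time
    P = P0.copy()
    P[n] += alpha
    C[n] = (hotspot(P) - T0) / alpha    # row n: response to core n's power

print(np.round(C, 3))
```

For a linear stand-in model the recovered row n is exactly column n of the hidden coefficient matrix; with real HotSpot runs the slides average C over 20 different power configurations to smooth out nonlinearity.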

CCMT| 82

C-Matrix of a 4 core processor

Page 111: Deep Dive, University of Florida (02/2015)

CCMT| 83

Matrix Model

� Compact– Quite simple equations to compute the steady-

state temperature of cores, much easier than HotSpot

� Accurate– Achieve very close steady-state temperature to

HotSpot (Experiments)� Limitations

– Only steady-state temperature– Still need HotSpot to generate C matrix

CCMT| 84

Evaluation

� Multicore parameters:
  ─ 4-core and 16-core processors
  ─ Each core is abstracted as an 8mm × 8mm square chip
  ─ For 4-core processors each core dissipates 100 W at maximum voltage; for the 16-core processor, 50 W per core
  ─ The default thermal configuration in HotSpot

� Matrix C generation:
  ─ Use 20 different power configurations to generate 20 different C matrices and average them

� DAG generation:
  ─ 32, 64, 128, 256, 512 tasks
  ─ execution time: 20 – 60 time units
  ─ the probability of any two nodes having an edge between them is set to 0.1

Page 112: Deep Dive, University of Florida (02/2015)

CCMT| 85

Evaluation

� Matrix Model vs HotSpot --- Peak Temperature

CCMT| 86

Evaluation

� Matrix Model vs HotSpot -- Computation Time

Page 113: Deep Dive, University of Florida (02/2015)

CCMT| 87

Temperature Aware Scheduling

� Uniform voltage without throttling
  ─ Each core can work at full voltage or has to be shut off
� Uniform voltage with throttling
  ─ The voltage of each core has to be the same but can be varied between a maximum and a minimum
� Non-uniform voltage with throttling
  ─ The voltage of each core can be varied independently

� Problem: For each of the above three types of processors, determine the workload distribution for each core so that the total throughput across all cores is maximized and the maximum temperature of any core is bounded by a given threshold

• Determine the distribution of data-parallel workloads on a multicore processor so that the total throughput across all cores is maximized and the maximum temperature of any core is bounded by a given threshold

• w_i denotes the workload assigned to core i
• t_i denotes the running time of the workload on core i
• T_i denotes the peak temperature on core i
• T_th denotes the temperature threshold for all cores

  max Σ_{i=1}^{m} w_i / t_i
  s.t. T_i ≤ T_th, i ∈ {1, 2, ..., m}

Uniform voltage without throttling

Page 114: Deep Dive, University of Florida (02/2015)

CCMT| 89

Uniform voltage without throttling

Let x_i indicate whether core i is active:

  max Σ_{i∈M} x_i
  s.t. T_A + Σ_{j∈M} C_ij · x_j · P_j ≤ T_th, ∀i ∈ M
       x_i = 0 or 1, ∀i ∈ M

For a fixed number of active cores Z, minimize the peak temperature T_p:

  min T_p
  s.t. T_p ≥ T_A + Σ_{j∈M} C_ij · x_j · P_j, ∀i ∈ M
       Σ_{i∈M} x_i = Z
       x_i = 0 or 1, ∀i ∈ M        (*)

CCMT| 90

Uniform voltage with throttling

Optimization Problem

  max x
  s.t. T_A + x · Σ_{j∈M} C_ij · P_j ≤ T_th, ∀i ∈ M        (*)
       0 ≤ x ≤ 1

─ Simplify (*) as follows:

  x ≤ D / Σ_j C_ij, ∀i ∈ M

  where D is equal to (T_th − T_A) / P

─ x is given by:

  x = min(1, min_i D / Σ_j C_ij)
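The closed form for the common throttling level x can be sketched directly; the thermal coefficient matrix, per-core power and temperatures below are illustrative values, not the slides' experimental configuration:

```python
import numpy as np

# Illustrative 4-core thermal coefficients (degrees C per watt)
C = np.array([[0.50, 0.05, 0.03, 0.01],
              [0.05, 0.50, 0.05, 0.03],
              [0.03, 0.05, 0.50, 0.05],
              [0.01, 0.03, 0.05, 0.50]])
P = 40.0                                 # uniform per-core maximum power (W)
T_A, T_th = 45.0, 70.0                   # ambient and threshold temperatures

D = (T_th - T_A) / P                     # D = (T_th - T_A) / P
x = min(1.0, float((D / C.sum(axis=1)).min()))   # x = min(1, min_i D / sum_j C_ij)

# Check the constraint T_A + x * P * sum_j C_ij <= T_th on every core
T = T_A + x * P * C.sum(axis=1)
print(round(x, 4), np.round(T, 2))
```

The hottest core's constraint is tight (its temperature lands exactly on T_th), which is what makes the min over cores the right choice of x.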

Page 115: Deep Dive, University of Florida (02/2015)

CCMT| 91

Non-uniform voltage with throttling

Optimization Problem

  max Σ_{i∈M} x_i
  s.t. T_A + Σ_{j∈M} C_ij · x_j · P_j ≤ T_th, ∀i ∈ M        (**)
       0 ≤ x_i ≤ 1, ∀i ∈ M        (***)

CCMT| 92

Non-uniform voltage with throttling

� Three possible cases:
  • Case 1: The threshold temperature is high and all cores can execute at their maximum voltage without exceeding the threshold. In this case all x_i are set to 1.
  • Case 2: The threshold temperature is low and requires all x_i to be less than 1. In this case the x_i are all bounded by Equation (**).
  • Case 3: The threshold temperature is such that the x_i values of some of the cores are limited by the constraint given by Equation (***).

Page 116: Deep Dive, University of Florida (02/2015)

CCMT| 93

Non-uniform voltage with throttling

� Case 2: all temperature constraints (**) are tight, so

  max Σ_{i∈M} x_i
  s.t. C · X = D
       0 ≤ X ≤ 1

which is solved by

  X = C^(−1) · D
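A sketch of the Case 2 solve, where every core's temperature constraint is tight. The coefficient matrix, powers and threshold below are illustrative (the threshold is chosen low enough that every x_i comes out strictly between 0 and 1, i.e. genuinely in Case 2):

```python
import numpy as np

# Illustrative thermal coefficients and per-core powers
C = np.array([[0.50, 0.05, 0.03, 0.01],
              [0.05, 0.50, 0.05, 0.03],
              [0.03, 0.05, 0.50, 0.05],
              [0.01, 0.03, 0.05, 0.50]])
P = np.full(4, 40.0)
T_A, T_th = 45.0, 60.0                 # tight threshold forces throttling

# Tight constraints: T_A + sum_j C_ij * x_j * P_j = T_th for all i,
# i.e. (C * P) X = D with D_i = T_th - T_A
D = np.full(4, T_th - T_A)
X = np.linalg.solve(C * P, D)          # X = (C P)^(-1) D

T = T_A + (C * P) @ X                  # every core sits exactly at T_th
print(np.round(X, 3))
```

If any component of X fell outside [0, 1], Case 2 would not apply and the algorithm would fall through to the Case 3 approximation described next.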

CCMT| 94

Non-uniform voltage with throttling

� Case 3:
  • Check Case 1 followed by Case 2. If both fail, then an approximation is used: assume that all x_i values are the same and apply the algorithm for uniform voltage with throttling.

Page 117: Deep Dive, University of Florida (02/2015)

CCMT| 95

Evaluation

� CPU: multi-core processors with 4, 16, 32 and 64 cores

� The ambient temperature used was 45.5oC. The maximum allowable temperature was set to 70oC

� Metric: coreper t throughpuMaximum

t throughpuTotal

Cores ofNumber Effective �

Number of cores | Floorplan | Area per core | Maximum power per core
4               | 2×2 grid  | 8mm×8mm       | 40 W
16              | 4×4 grid  | 4mm×4mm       | 10 W
32              | 8×4 grid  | 3mm×3mm       | 5 W
64              | 8×8 grid  | 2mm×2mm       | 2.5 W

CCMT| 96

Evaluation

� Uniform voltage without throttling
  ─ MIP (Mixed Integer Programming): the solution derived by our algorithm that minimizes the maximum temperature for the optimal value of P
  ─ BestP: consider all subsets of size P and find the subset with the lowest maximum temperature
  ─ BestP+1: consider all subsets of size P+1 and find the subset with the lowest maximum temperature
  ─ WorstP: consider all subsets of size P and find the subset with the highest maximum temperature

Page 118: Deep Dive, University of Florida (02/2015)

CCMT| 97

Evaluation

� Uniform voltage without throttling

CCMT| 98

Evaluation

� Uniform voltage with throttling– Throughput comparison

Page 119: Deep Dive, University of Florida (02/2015)

CCMT| 99

Evaluation

� Uniform voltage with throttling– Computation time comparison

CCMT| 100

Evaluation

� Non-uniform voltage with throttling– Throughput comparison

Page 120: Deep Dive, University of Florida (02/2015)

CCMT| 101

Evaluation

� Non-uniform voltage with throttling– Computation time

CCMT| 102

Decomposing Hot Tasks

� Partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature– To the best of our knowledge, our work is the first

attempt to develop efficient task partitioning algorithms to demonstrate significant temperature reduction.

� Several heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks – 1) for a periodic set of tasks with common period– 2) for a periodic set of tasks with individual period

1We define “hot” tasks as tasks with higher average power consumption, and “cool” tasks as tasks with lower average power consumption.

Page 121: Deep Dive, University of Florida (02/2015)

CCMT| 103

Related Work

� Dynamic Voltage and Frequency Scaling (reduce power)
  ─ DVFS can be used to reduce power consumption by lowering the supply voltage and operating frequency, thereby reducing the on-chip temperature [Brooks2001, Rao2008, Kadin2008, Ebi2009, …]
  ─ Cons: faces a serious problem in time-constrained applications

� Temperature-aware task sequencing algorithm (reduce initial temperature)
  ─ Reduces peak temperature compared to a random sequence [Jayaseelan2008]
  ─ Cons: fails to reduce temperature when one or more of the “hot”¹ tasks are long

CCMT| 104

Temperature-aware task partitioning algorithm

Illustrative example

Page 122: Deep Dive, University of Florida (02/2015)

CCMT| 105

Temperature-aware task partitioning algorithm

Illustrative example

CCMT| 106

Temperature-aware task partitioning algorithm

Illustrative example

Page 123: Deep Dive, University of Florida (02/2015)

CCMT| 107

Temperature-aware task partitioning algorithm

Illustrative example

CCMT| 108

Temperature-aware task partitioning algorithm

Illustrative example

Page 124: Deep Dive, University of Florida (02/2015)

CCMT| 109

Temperature-aware task partitioning algorithm

Illustrative example

CCMT| 110

Temperature-aware task partitioning algorithm

Illustrative example

Task Partitioning Algorithm can achieve a lower peak temperature than Task Sequencing Algorithm

Page 125: Deep Dive, University of Florida (02/2015)

CCMT| 111

Experiments

� Platform:
  � CPU:
    • ARM Cortex A8 (Simplescalar)
    • 2-wide in-order issue, 32KB instruction cache
    • 1.5 GHz clock speed
  � Power simulator:
    • Wattch
  � Temperature evaluation:
    • HotSpot
    • Ambient temperature:
� Tasks: synthetic tasks and real benchmarks are used
� Algorithms: compared with the task sequencing algorithm and the EDF algorithm.

CCMT| 112

Experiments

[Figure: simulation flow. Benchmarks feed Simplescalar/CACTI; the resulting cache and core-component statistics drive Wattch, which produces per-task power. Per-task power, ambient temperature, CPU thermal parameters, CPU frequency, and cache configurations feed the temperature-aware task partitioning algorithms, which report the peak temperature.]

Thermal model: T(t) = P·R + T_A + (T_init − P·R − T_A)·e^(−t/(R·C))
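The single-core thermal model T(t) = P·R + T_A + (T_init − P·R − T_A)·e^(−t/(R·C)) can be evaluated directly. The sketch below is illustrative only: the parameter values (P, R, C, and the temperatures) are made up, not the calibrated Wattch/HotSpot parameters used in the experiments.

```python
import math

def temperature(t, P, R, C, T_amb, T_init):
    """Lumped-RC transient model:
    T(t) = P*R + T_A + (T_init - P*R - T_A) * exp(-t / (R*C))."""
    T_ss = P * R + T_amb  # steady-state temperature while running at average power P
    return T_ss + (T_init - T_ss) * math.exp(-t / (R * C))

# A "hot" task (higher average power) drives the core toward a higher steady state
# than a "cool" task starting from the same initial temperature:
hot = temperature(5.0, P=3.0, R=10.0, C=0.5, T_amb=45.0, T_init=50.0)
cool = temperature(5.0, P=1.0, R=10.0, C=0.5, T_amb=45.0, T_init=50.0)
```

This is why interleaving cool subtasks between hot subtasks lowers the peak: the exponential decay pulls the temperature back toward the cooler steady state before the next hot subtask runs.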

Page 126: Deep Dive, University of Florida (02/2015)

CCMT| 113

Experiments: Periodic tasks with common period

Real benchmarks:

set1 patricia, adpcm, rijndael, susan, crc, FFT, dijkstra, epic

set2 patricia, djpeg, adpcm, sha, FFT, rijndael, susan, rijndael

set3 sha, djpeg, FFT, rijndael, dijkstra, epic, rijndael, susan

set4 rijndael, dijkstra, FFT, gsm, sha, patricia, pegwit, djpeg

CCMT| 114

Experiments: Periodic tasks with common period

� Temperature comparison:

Task partitioning algorithm (TPA) can reduce the peak temperature by up to

5.88°C compared with task sequencing algorithm (TSA)

Page 127: Deep Dive, University of Florida (02/2015)

CCMT| 115

Periodic tasks with individual period

EDF scheduling with task partitioning (EDFp) can also reduce the peak temperature by up to 6°C

CCMT| 116

Experiments

� Overhead:

Average context switch per task for TPA is as low as 2 (left figure). Average context switch per task for EDFp is also lower than 2 (right figure).

They are tolerable in many practical scenarios1.

1 Context switch time on ARM CPU can be less than 10us [SEGGER]

Page 128: Deep Dive, University of Florida (02/2015)

CCMT| 117

Conclusion

� We propose to partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature

� We propose two heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks – 1) for a periodic set of tasks with common

period– 2) for a periodic set of tasks with individual

period

CCMT| 118

Temperature-aware Scheduling for Multicores

� Multicore Processors:– Multiple heating sources– Heat interaction between neighboring cores

Page 129: Deep Dive, University of Florida (02/2015)

CCMT| 119

Inter Core Scheduling

CCMT| 120

Experiments

� Platform:
• CPU: Simplescalar, ARM Cortex A9 (multicore), 2-width out-of-order issue, 32KB instruction cache, 1.2GHz clock speed
• Power simulator: Wattch
• Temperature evaluation: HotSpot; ambient temperature: 45.15°C
� Tasks: Synthetic tasks and real benchmarks are used
� Algorithms: Min-Min, PDTM [Yeo2008], TPS-1(δ=0.33ms), TPS-2(δ=0.66ms), TPS-3(δ=1.32ms), TPS-4(δ=2.64ms)

Page 130: Deep Dive, University of Florida (02/2015)

CCMT| 121

Experiments

[Figure: simulation flow. Benchmarks feed Simplescalar/CACTI; the resulting cache and core-component statistics drive Wattch, which produces per-task power. Per-task power, ambient temperature, CPU thermal parameters, CPU frequency, and cache configurations feed HotSpot and the temperature-aware task partitioning algorithms, which report the peak temperature.]

CCMT| 122

Experiments

� Multicore: – Real benchmarks

Page 131: Deep Dive, University of Florida (02/2015)

CCMT| 123

Experiments

� Multicore: Synthetic tasks

TPS algorithm reduces the peak temperature by up to 11.68°C compared with Min-Min algorithm.

PDTM can achieve similar peak temperature reduction, but requires 33% more makespan.

CCMT| 124

Experiments

� Multicore: – Real benchmarks

TPS algorithm reduces the peak temperature by up to 9.92°C compared with Min-Min algorithm, and 4.52°C compared with PDTM algorithm.

PDTM can achieve similar peak temperature reduction, but requires 44% more makespan.

Page 132: Deep Dive, University of Florida (02/2015)

CCMT| 125

Experiments

� Multicore: – TPS vs PDTM: peak temperature undersame

makespan

TPS-1 relaxed algorithm reduces the peak temperature by up to 20°C compared with PDTM algorithm.

CCMT| 126

Experiments

� Multicore:– TPS vs PDTM: scalability

Page 133: Deep Dive, University of Florida (02/2015)

CCMT| 127

Conclusions

Core throttling guided by a thermal model can greatly improve performance

Heuristics using the transient thermal model achieve larger improvements than methods using the steady-state model

Computing time cost: transient is larger than steady-state
– It is worthwhile because it is calculated offline.

Different initial configurations of f and t result in only small differences in the outcome

– In practice it is better to use the steady-state solution as the initial configuration

CCMT| 128

Conclusion

� We propose to partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature

� We propose heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks on both single core and multicore processors

� Experimental results show that our algorithm outperforms existing state-of-the-art thermal-aware scheduling algorithms in terms of peak temperature and makespan.

Page 134: Deep Dive, University of Florida (02/2015)

CCMT| 129

Transient Thermal models

Steady-state thermal model

– Efficient but does not capture transient effects (worst case scenario)

Transient-state thermal model:
– If the average power of a core is P over a time period t, then the temperature at the end of this period T(t) is given by:

T(t) = G⁻¹P + T_A + e^(−C⁻¹G·t)·(T_init − G⁻¹P − T_A)

where G is the thermal conductance matrix, C is the thermal capacitance matrix, T_A is the ambient temperature, and T_init is the initial temperature.
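The transient model above can be integrated numerically instead of evaluated in closed form. The sketch below uses simple forward-Euler stepping on two thermally coupled cores; the values of G, C, and P are illustrative placeholders, not HotSpot-calibrated parameters.

```python
def simulate(T0, P, G, Cap, T_amb, dt, steps):
    """Forward-Euler integration of Cap_i * dT_i/dt = P_i - sum_j G_ij * (T_j - T_amb)
    for a small set of cores (G: conductance matrix, Cap: per-node capacitance)."""
    n = len(T0)
    T = list(T0)
    for _ in range(steps):
        T = [T[i] + dt * (P[i] - sum(G[i][j] * (T[j] - T_amb) for j in range(n))) / Cap[i]
             for i in range(n)]
    return T

# Two coupled cores: negative off-diagonals model heat exchanged between neighbors.
G = [[2.0, -1.0], [-1.0, 2.0]]
T_end = simulate([45.0, 45.0], P=[3.0, 3.0], G=G, Cap=[1.0, 1.0],
                 T_amb=45.0, dt=0.01, steps=2000)
# Steady state satisfies G*(T - T_amb) = P, i.e. both cores settle near 48.0 here.
```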

CCMT| 130

Approach

1. We propose a solution to the convex optimization problem with the simple thermal model to solve the problem of maximizing throughput under the temperature constraint.
2. We also propose a heuristic algorithm with the transient thermal model to solve the problem with higher accuracy.

Page 135: Deep Dive, University of Florida (02/2015)

CCMT| 131

Evaluation – Matrix Multiplication

General scheme

– Higher throughput than without HLB

– Around 10% throughput improvement over the base solution

– With very large workloads, the solutions of the heuristic and the base will converge

Homogeneous-scaling scheme

Non-scaling scheme

Hengxing Tan, and Sanjay Ranka, Thermal-aware Scheduling for Data Parallel Workloads on Multi-Core Processors, ISCC 2014 (Work partially supported by NSF)

CCMT| 132

Future Work: Energy and Thermal Management

� Varying Architectural Elements
─ Processor (Dynamic Voltage Scaling)
─ Caches (Dynamic Cache Reconfiguration)
─ Buses
─ Memory

� Developing Optimized Libraries
─ Energy
─ Performance
─ Temperature

[Figure: energy vs. time tradeoff, showing the feasible space with solutions A and B]

Page 136: Deep Dive, University of Florida (02/2015)

CCMT| 133

Selected Publications


� Jaeyeon Kang, Sanjay Ranka: Energy-Efficient Dynamic Scheduling on Parallel Machines. HiPC –International Conference on High Performance Computing, 2008: 208-219.

� Jaeyeon Kang and Sanjay Ranka, Dynamic Algorithms for Energy Minimization on Parallel Machines., Proceeding of Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP), 2008, pp. 399-406.

� Jaeyeon Kang and Sanjay Ranka, DVS based Energy Minimization Algorithm for Parallel Machines, Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2008, pp. 1-12.

� Zhe Wang and Sanjay Ranka, A Simple Thermal Model for Multi-core Processors and Its Application to Slack Allocation, Proceedings of International Parallel and Distributed Processing Symposium 2010, pp. 1-11.

� Weixun Wang, Prabhat Mishra and Sanjay Ranka, “Dynamic Reconfiguration in Real-Time Systems: Energy, Performance, Reliability and Thermal Perspectives”, Springer, 2012 (Expected)

� Weixun Wang, Sanjay Ranka and Prabhat Mishra, “Energy-Aware Dynamic Reconfiguration Algorithms for Real-Time Multitasking Systems”, SUSCOM, Issue. 1, pages 35-45, 2011 (Invited Paper)

� Weixun Wang, Prabhat Mishra and Sanjay Ranka, “Energy Optimization of Cache Hierarchy in Real-Time Multicore Systems”, TCAD, under review

� Weixun Wang and Prabhat Mishra, “PreDVS: Preemptive Dynamic Voltage Scaling for Real-Time Multitasking Systems”, TODAES, under review

� Weixun Wang, Sanjay Ranka and Prabhat Mishra, “Energy-Aware Dynamic Slack Allocation for Real-Time Multitasking Systems”, SUSCOM, under review

CCMT| 134

Managing Temperature: Approaches

High temperature leads to performance loss.

[Rajan2008]

Temperature Thresholding -

• Lower workloads at • Increase workloads at

Zigzag throttling of processor speeds

– Zigzag effects cause more loss
– Processor will put “hot” cores into low power state.

Page 137: Deep Dive, University of Florida (02/2015)

CCMT| 135

Modeling Thermal Behavior (HotSpot)

CCMT| 136

Energy Levers

Page 138: Deep Dive, University of Florida (02/2015)

CCMT| 137

Relaxation
� Most real architectures only support discrete frequency settings to scale
– E.g. given options 1.0 GHz, 1.5 GHz, 2.0 GHz for dual cores
– Results: 1.4 GHz, 1.99 GHz
� Relaxation
– Naive: downward relaxation
• 1.0 GHz, 1.5 GHz
– Our method: relaxing the result frequency to the neighboring discrete values
• 1.0 GHz, 2.0 GHz
– Practically, choosing among 2 or 4 neighbors is good enough
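The neighbor-relaxation idea can be sketched as enumerating, for each core, the discrete levels on either side of the continuous solution. This is an illustrative sketch (the slide's dual-core example of 1.0/1.5/2.0 GHz levels is used below); the actual selection among candidates would be made by the thermal model, which is omitted here.

```python
import bisect
import itertools

def neighbor_levels(f, levels):
    """Discrete frequency levels adjacent to a continuous solution f (levels sorted)."""
    i = bisect.bisect_left(levels, f)
    lo = levels[max(i - 1, 0)]
    hi = levels[min(i, len(levels) - 1)]
    return sorted({lo, hi})

def candidate_settings(freqs, levels):
    """Cartesian product of per-core neighbor levels: the relaxation search space.
    Naive downward relaxation keeps only the all-lower combination."""
    return list(itertools.product(*(neighbor_levels(f, levels) for f in freqs)))

# Continuous solver result (1.4 GHz, 1.99 GHz) with levels {1.0, 1.5, 2.0} GHz:
cands = candidate_settings([1.4, 1.99], [1.0, 1.5, 2.0])
```

Among the four candidates, (1.0, 1.5) is the naive downward choice, while the slide's preferred setting (1.0, 2.0) is also in the search space.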

CCMT| 138

Solution with the steady state model

Assumption:
– Applications run for a long time with constant frequency
– Each core completes work simultaneously

Each core will arrive at its steady-state temperature

Use convex solver:

  max Σ_{i=1..m} f_i
  s.t. Σ_{j=1..m} G_ij⁻¹ f_j³ ≤ T_th − T_A, i ∈ {1,2,...,m}
       F_min ≤ f_i ≤ F_max

Page 139: Deep Dive, University of Florida (02/2015)

CCMT| 139

Solution with transient model

Related dual objective: minimize the makespan across all cores with given workloads and temperature threshold

A general solution uses a non-linear solver such as SQP:

  min max_i t_i
  s.t. β f_i t_i ≥ W_i, i ∈ {1,2,...,m}
       T_i(t) ≤ T_th

CCMT| 140

Iterative Refinement Process

Page 140: Deep Dive, University of Florida (02/2015)

CCMT| 141

Additional constraints

Homogeneous-scaling: cores run at the same frequency
Objective: max Σ_{i=1..m} f_i t_i / max_i t_i

Non-scaling: cores run at fixed frequency but some cores could be turned off
Objective: max Σ_{i=1..m} F t_i / max_i t_i

CCMT| 142

Additional Constraints (with simple model)

Homogeneous-scaling: cores run at the same frequency
– Convex problem:

  max f
  s.t. Σ_{j=1..m} G_ij⁻¹ f³ ≤ T_th − T_A, i ∈ {1,2,...,m}
       F_min ≤ f ≤ F_max

Non-scaling: cores run at fixed frequency
– Mixed integer linear problem:

  max Σ_{i=1..m} x_i
  s.t. Σ_{j=1..m} G_ij⁻¹ F_max³ x_j ≤ T_th − T_A, i ∈ {1,2,...,m}
       x_i = 0 or 1

Page 141: Deep Dive, University of Florida (02/2015)

CCMT| 143

Additional Constraints (with transient model)

Homogeneous-scaling: cores run at the same frequency:

  min max_i t_i
  s.t. Σ_{i=1..m} f t_i ≥ W, F_min ≤ f ≤ F_max,
       T_i(t) ≤ T_th, i ∈ {1,2,...,m}

Non-scaling: cores run at fixed frequency but some cores could be turned off:

  min max_i t_i
  s.t. Σ_{i=1..m} t_i ≥ W / F_max,
       T_i(t) ≤ T_th, i ∈ {1,2,...,m}

The heuristic can work on both problems

CCMT| 144

Heuristic – local search

Based on Heat Load Balance (HLB): slicing workloads and moving them from hot cores to cool cores

1: Start from an initial configuration including f and t
• Distribute total workloads evenly to cores with
2: Move a workload unit
• From: the core decided by peak temperature
• To: the core decided by max gradient, in terms of frequency slice
3: Repeat step 2 until the same peak temperature on all cores
4: If
• Continue moving workload units; From: decided by gradient
Else if
• Move workload units backward (reducing f by increasing t)
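The move loop above can be sketched as a toy local search. This simplified version uses per-core heat load as a stand-in for the thermal model's peak temperature, so the "hottest" and "coolest" cores are just the most and least loaded ones; the real heuristic would pick the source by peak temperature and the destination by maximum temperature gradient.

```python
def heat_load_balance(load, unit=1, max_iter=10000):
    """Toy HLB local search: repeatedly move one workload unit from the core
    with the highest heat load to the core with the lowest, stopping when no
    move can reduce the maximum (proxy for equal peak temperature, step 3)."""
    load = list(load)
    for _ in range(max_iter):
        hot = max(range(len(load)), key=load.__getitem__)
        cool = min(range(len(load)), key=load.__getitem__)
        if load[hot] - load[cool] <= unit:
            break  # already balanced to within one unit
        load[hot] -= unit
        load[cool] += unit
    return load
```

For example, `heat_load_balance([10, 2, 3])` redistributes the fifteen units evenly across the three cores.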

Page 142: Deep Dive, University of Florida (02/2015)

CCMT| 145

Scheduling time

CCMT| 146

Temperature-aware task partitioning algorithm

� Two major challenges:– 1) Number of Partitions– 2) Sequencing of Subtasks

� Two broad scenarios:– 1) A periodic set of tasks with common period.

All the tasks have the same arrival time and deadline.

– 2) A set of periodic tasks with individual period. Each task may have different arrival time and deadline.

Page 143: Deep Dive, University of Florida (02/2015)

CCMT| 147

Scenario 1: Periodic tasks with common period

� Given a periodic set of N heterogeneous tasks L, let Pi be the average power consumption during the execution time ci of task τi.

� The goal is to find a sequence of these tasks using task partitioning to minimize the peak temperature

CCMT| 148

Algorithm: Periodic tasks with common period

� Sort the tasks based on the power profile from coolest to hottest

� Group the sorted tasks into k categories with an equal number of tasks.

� Partition tasks in category j, 2 <= j <= k, into 2^(j−1) equal subtasks. Partition tasks in category 1 into 2 equal subtasks.

� for i = 1 to k − 1 do
– Interleave tasks of the ith category with tasks of the (i+1)th category to form the new (i+1)th category
� end for
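The steps above can be sketched as follows. This is one plausible reading of the algorithm, not the authors' implementation: tasks are named tuples of (name, power), the task count is assumed divisible by k, and "interleave" is taken as strict alternation of hotter and cooler subtasks.

```python
def partition_and_interleave(tasks, k):
    """Common-period sketch: sort coolest-to-hottest, group into k categories,
    split category j tasks into 2**(j-1) subtasks (category 1 into 2), then
    fold categories together by alternating hot and cool subtasks."""
    tasks = sorted(tasks, key=lambda t: t[1])          # coolest first
    n = len(tasks) // k                                # assumes len divisible by k
    cats = [tasks[i * n:(i + 1) * n] for i in range(k)]
    seqs = []
    for j, cat in enumerate(cats, start=1):
        parts = 2 if j == 1 else 2 ** (j - 1)
        seqs.append([name for name, _ in cat for _ in range(parts)])
    merged = seqs[0]
    for nxt in seqs[1:]:
        out = []
        pad = merged + [None] * (len(nxt) - len(merged))
        for hot_sub, cool_sub in zip(nxt, pad):        # hot, cool, hot, cool, ...
            out.append(hot_sub)
            if cool_sub is not None:
                out.append(cool_sub)
        merged = out
    return merged

# The slide's example: T1 (power 20), T2 (power 25), T3 (power 15), k = 3.
seq = partition_and_interleave([("T1", 20), ("T2", 25), ("T3", 15)], 3)
```

For the example, the hot task T2 ends up split into four subtasks with cooler subtasks of T1 and T3 in between, so no two identical subtasks run back to back.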

Page 144: Deep Dive, University of Florida (02/2015)

CCMT| 149

Periodic tasks with common period

3:

2:

1:

T2 : power = 25

T1: power = 20

T3: power = 15

CCMT| 150

Periodic tasks with common period

3:

2:

1:

T2

T1

T3

T2 T2 T2

T1

T3

Page 145: Deep Dive, University of Florida (02/2015)

CCMT| 151

Periodic tasks with common period

3:

1&2:

T2

T1T3

T2 T2 T2

T1T3

CCMT| 152

Periodic tasks with common period

1&2&3: T2 T1 T2 T3 T2 T1 T2 T3

Page 146: Deep Dive, University of Florida (02/2015)

CCMT| 153

Scenario 2: Periodic tasks with individual period

� A set of N periodic heterogeneous tasks in a set L, where each task has its own period pi. The arrival time ai is equal to the start time of its period and the deadline di is equal to the end time of its period.

� The goal is to find a sequence of these tasks using task partitioning to minimize the peak temperature

CCMT| 154

Algorithm: Periodic tasks with individual period

Use EDF scheduler to get the initial schedule of these tasks
while loop for M times do
    Calculate the thermal profile of the task sequence; find the “hot” task instance τh where the peak temperature occurs.
    Partition the task instances whose execution period overlaps with the arrival time or deadline of the “hot” task instance.
    In the hot interval, remove all the subparts of τh and calculate the available slack for each “cool” task instance.
    while there are parts of τh unassigned and some “cool” task instance has available slack do
        for each “cool” task instance τci in the hot interval do
            if slacki > 0 then
                Append one unit of τh into τci and update the slack for all “cool” task instances
            end if
        end for
    end while
    If there are still some subparts of τh unassigned, scan the hot interval and assign them uniformly into the idle time.
end while

Page 147: Deep Dive, University of Florida (02/2015)

CCMT| 155

Periodic tasks with individual period

� Partition the task instances whose execution period overlaps with the arrival time and/or deadline of the “hot” task instance

� Slack allocation:

EST_i = max(a_h, a_i, EST_pred + c_pred)
LST_i = min(d_h, d_i, LST_succ − c_i)
slack_i = LST_i − EST_i − c_i
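The slack computation reads directly as code. A minimal sketch, with the symbols named as on the slide (a_h/d_h are the hot instance's arrival and deadline; EST_pred + c_pred is the predecessor's earliest finish, LST_succ − c_i the latest start that still meets the successor):

```python
def slack(a_h, d_h, a_i, d_i, c_i, est_pred, c_pred, lst_succ):
    """Earliest start, latest start, and slack for one "cool" task instance
    inside the hot interval of the "hot" instance tau_h."""
    est = max(a_h, a_i, est_pred + c_pred)   # cannot start before arrivals or predecessor
    lst = min(d_h, d_i, lst_succ - c_i)      # must finish before deadlines and successor
    return est, lst, lst - est - c_i         # slack left after executing for c_i
```

For example, with a_h=0, d_h=10, a_i=0, d_i=8, c_i=2, EST_pred=0, c_pred=1, LST_succ=9, the instance can absorb 4 units of the hot task.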

CCMT| 156

Periodic tasks with individual period

Page 148: Deep Dive, University of Florida (02/2015)

CCMT| 157

Periodic tasks with individual period

Page 149: Deep Dive, University of Florida (02/2015)

Load Balancing: Particles and Mesh

T5 2

Computation Partitioning: Types of Parallelism

� Independent Parallelism
– Bundled simulations using DAKOTA (e.g. UQ, parametric variations)

� Task Parallelism
– Independent models that are simulated concurrently (e.g. fluid, particle coupling at microscale)

� Data Parallelism
– Parallelization of Eulerian grid
– Parallelization of Lagrangian particles


Page 150: Deep Dive, University of Florida (02/2015)

T5 3

[Figure: interactions between the fluid model and solid model: across immersed interfaces / across cells and elements; Lagrangian/Eulerian / levels of AMR]

Communication Mapping: Types of Interactions

T5 4

– Preferential particle clustering
– Lagrangian remap
– Computational power focusing
– Extreme event UQ-driven
– Computational steering
– Adaptive mesh refinement

Load Balancing: Types of Adaptivity

Page 151: Deep Dive, University of Florida (02/2015)

CUDA Memory model

� A Thread block has R/W access to Shared memory

� A block grid has R/W access to Global memory

� A block grid has read-only access to Constant memory

� Global memory (of order 4GB) resides in DRAM and has much higher access latency than shared memory

[Figure: CUDA memory model. The host and a grid of thread blocks; each thread has registers, each block has shared memory, and the grid accesses global and constant memory. Images from CUDA programming guide]

Global memory access
� When accessing global memory, peak performance is achieved when all threads in a half warp access contiguous memory locations.

[Figure: non-coalesced vs. coalesced access patterns. Images from CUDA programming guide]

Page 152: Deep Dive, University of Florida (02/2015)

Shared memory access
� Shared memory is divided into banks
� Multiple simultaneous accesses to a bank result in a bank conflict
� Conflicting accesses are serialized

[Figure: threads mapped one-to-one onto banks (no conflicts) vs. several threads accessing the same bank (conflicting access). Images from CUDA programming guide]

4 Phases of PIC algorithm

1. Charge Deposition Phase
2. Field Solve Phase – compute the forces (Poisson equations) needed for particle motion from the accumulated particle charges
3. Force Gathering Phase
4. Particle Push Phase
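Two of the four phases, deposition and gathering, can be illustrated on a simple 1D periodic grid with linear (cloud-in-cell) weights. This is an illustrative simplification only: the slides' PIC runs on an unstructured triangular mesh, where the same scatter/gather happens on triangle vertices after the search phase locates each particle's enclosing triangle.

```python
def deposit(positions, charges, ngrid, dx):
    """Phase 1 (charge deposition): scatter each particle's charge onto its two
    neighboring grid points with linear weights (periodic 1D grid)."""
    rho = [0.0] * ngrid
    for x, q in zip(positions, charges):
        i = int(x / dx) % ngrid
        w = x / dx - int(x / dx)            # fractional offset within the cell
        rho[i] += q * (1.0 - w)
        rho[(i + 1) % ngrid] += q * w
    return rho

def gather(field, positions, dx):
    """Phase 3 (force gathering): interpolate the grid field back to particles
    using the same linear weights, so deposit and gather are adjoint."""
    ngrid = len(field)
    out = []
    for x in positions:
        i = int(x / dx) % ngrid
        w = x / dx - int(x / dx)
        out.append((1.0 - w) * field[i] + w * field[(i + 1) % ngrid])
    return out
```

Charge is conserved by construction: the two weights for each particle sum to one.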

Page 153: Deep Dive, University of Florida (02/2015)

PIC algorithm on a triangular mesh
� Irregular structure makes partitioning complex.
� Each particle requires a search to find the enclosing triangle
� This step forms an additional Search Phase in the PIC algorithm flow
� The search phase is one of the most time-consuming steps in the PIC flow

Fig: Mesh from ORNL used for XGC1 benchmarks

GPU Parallelization using Mesh coloring
� Triangles in the mesh are considered as nodes of a graph
� Triangles with at least one common vertex are assigned different colors
� Every GPU kernel works only on one color
� Pros
– No conflicting access
� Cons
– Needs multiple kernel invocations.
– For efficient indexing, we need to maintain color-sorted order of triangles and particles, which involves additional computation
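The coloring constraint can be met with a simple greedy pass. A minimal sketch (the slides do not specify the coloring algorithm; greedy coloring is one standard choice): each triangle takes the smallest color not used by any already-colored triangle sharing a vertex with it.

```python
from collections import defaultdict

def color_triangles(triangles):
    """Greedy coloring so that triangles sharing any vertex get different colors;
    a kernel processing one color then has no conflicting vertex updates."""
    by_vertex = defaultdict(list)               # vertex -> triangles touching it
    for t, verts in enumerate(triangles):
        for v in verts:
            by_vertex[v].append(t)
    color = {}
    for t, verts in enumerate(triangles):
        used = {color[n] for v in verts for n in by_vertex[v] if n in color}
        c = 0
        while c in used:                        # smallest color not used by neighbors
            c += 1
        color[t] = c
    return color
```

The number of distinct colors is the number of kernel invocations per phase, which is the "cons" noted above.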

Page 154: Deep Dive, University of Florida (02/2015)

GPU Parallelization using Mesh Partitioning
� Particles, triangles and vertices of triangles form the partitioning entities in the PIC problem
� Mesh is partitioned into regions using a virtual rectangular grid
� Each region is mapped to a GPU block

[Figure: mesh partitioned into Regions 1–4, each mapped to a GPU block]

Triangular Mesh Partitioning
� Triangles that cross region boundaries are referred to as shadow triangles. The vertices of shadow triangles are termed shadow vertices
� Shadow vertices and triangles are replicated
� Particles, triangles and vertices are represented using linear arrays

[Figure: Regions 1–4 with shadow triangles and shadow vertices along the region boundaries]

Page 155: Deep Dive, University of Florida (02/2015)

Replication of shadow entities
� Replication ensures that each block can compute independently of the other blocks
� After the computation, an aggregation step merges the values from all shadow vertices

[Figure: per-region linear arrays: vertices, shadow vertices, triangles, shadow triangles, and particles for Region 1 and Region 2]

GPU kernels
� Bucket sort for triangles, vertices and particles
� For each simulation iteration:
– Triangle search
– Field solve phase
– Force aggregation of shadow vertices
– Particle push phase
– Re-sorting of moved particles

Page 156: Deep Dive, University of Florida (02/2015)

Triangle density based partitioning
� Ensures effective load balancing across regions
� Expensive pre-processing step
� Need to use a spatial indexing data structure like a KD-tree to partition triangles
� During the search phase particles traverse the KD-tree
� KD-tree is not very well suited for GPUs

[Figure: density-based partition into Regions 1–6 of varying size]

Partitioning using Level 1 grid
� The virtual rectangular grid partitions the mesh into regions
� Pre-processing step is very fast
� Load imbalance due to differences in triangle density
� The linear search for triangles can be a bottleneck

[Figure: uniform grid partition into Regions 1–4]

Page 157: Deep Dive, University of Florida (02/2015)

Level 2 Partitioning (Partition Region into Sub-Regions)
� Used only for the search phase
� No replication of shadow vertices across sub-regions, as the vertices are read-only while searching.
� Requires sorting of triangles and vertices in sub-region order, which is again part of pre-processing
� Sub-region-order sorting of particles has to be performed after each iteration

Incremental Sort – Uniform Partitioning
� After each simulation iteration, particles have to be re-sorted
� For an efficient bucket sort, all the sub-region counters (particle count per sub-region) should be present in shared memory
� As the number of sub-regions increases, they won’t fit in the shared memory of the GPU
� In reality most of the particles will move only to adjacent sub-regions.
� Keep only the regions (Y) adjacent to region (X) in shared memory

[Figure: sub-region X surrounded by its eight adjacent sub-regions Y]
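The adjacency observation above determines which counters must sit in shared memory. A minimal sketch on a uniform row-major grid (illustrative; sub-region indexing and counter layout are assumptions, not the paper's data structures):

```python
def neighborhood(region, nrows, ncols):
    """Sub-region X plus its adjacent sub-regions Y (the 3x3 block around X):
    the only per-sub-region counters the incremental sort needs on chip."""
    r, c = divmod(region, ncols)
    return {(r + dr) * ncols + (c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if 0 <= r + dr < nrows and 0 <= c + dc < ncols}

def bucket_counts(particles, nrows, ncols, cell):
    """Particle count per sub-region (the histogram a bucket sort is built on)."""
    counts = [0] * (nrows * ncols)
    for x, y in particles:
        counts[int(y // cell) * ncols + int(x // cell)] += 1
    return counts
```

Since a particle that started in X lands in `neighborhood(X, ...)` after one push step, at most 9 counters per origin sub-region need shared-memory residence regardless of the total sub-region count.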

Page 158: Deep Dive, University of Florida (02/2015)

Non-uniform Partitioning using Level 2 grid
� Level 2 grid is not uniform
� Dense in regions where triangle density is higher.
� Non-uniformity creates asymmetry and requires more complex pre-processing and indexing methods.
� Incremental sort becomes very complex

Field solve phase
� Most of the flops are executed in this phase
� Each region is mapped to a GPU block.
� The GPU block loads the vertices and shadow vertices in a region to shared memory
� Each thread operates on a set of particles in the region
� The force is updated on vertices/shadow vertices in shared memory.
– Different particles can update the same vertex, hence atomic updates are used. This doesn’t consume many cycles, as atomic updates in shared memory are very fast.
� Once the block has completed execution, the values are written back to global memory
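The accumulation pattern above, where many particles contribute to the same vertex, can be shown serially. A minimal stand-in (names are illustrative): on the GPU the `+=` is an atomicAdd into shared memory, so concurrent contributions to one vertex are serialized by hardware instead of by this loop.

```python
def accumulate_forces(vertex_of_particle, contrib, nverts):
    """Serial stand-in for the shared-memory atomic adds in the field-solve phase:
    contributions from particles mapped to the same vertex accumulate."""
    force = [0.0] * nverts
    for v, f in zip(vertex_of_particle, contrib):
        force[v] += f      # atomicAdd on the GPU; plain add here
    return force
```

The per-region results for replicated shadow vertices would then be merged in the separate aggregation kernel listed earlier.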

Page 159: Deep Dive, University of Florida (02/2015)

Experimental Results
� Mesh from ORNL used for XGC1 benchmarks
� 1.8 million triangles
� Randomly distributed 18 million particles
� Level 1 partitioning uses a 32 x 32 rectangular grid (regions)
� NVIDIA Tesla T10 GPU with 4GB global memory, 16KB shared memory and 240 computing cores

Comparison of triangle search time (10 simulation iterations)
� Uniform and non-uniform partitioning give similar performance when there is a sufficient number of GPU blocks
� Simpler uniform partitioning would be a better choice

Uniform partitioning:
GPU blocks | Time (ms)
4096       | 3111.11
9216       | 1366.21
16384      | 877.23
25600      | 609
36864      | 500.92
50176      | 427

Non-uniform partitioning:
GPU blocks | Time (ms)
1024       | 12561.06
2779       | 7235.16
22471      | 989.88
33464      | 428.51

Page 160: Deep Dive, University of Florida (02/2015)

Particle sorting time (Non-uniform partitioning)
� ~20X speedup when using shared memory for sorting

Order of sub-region (max # of triangles in sub-region) | Sorting using shared memory (ms) | Sorting without shared memory (ms)
10   | 185.68 | 1492.55
200  | 183.88 | 1881.56
1000 | 183.55 | 2441.08
5000 | 183.99 | 3773.92

Particle Incremental Sorting time (Uniform Partitioning)
� Relatively independent of the number of blocks used

Number of GPU blocks | Particles per thread | Time (ms)
35157 | 2   | 61.49
17579 | 4   | 60.49
4395  | 16  | 60.73
1099  | 64  | 61.13
550   | 128 | 65.79
314   | 224 | 63.89
276   | 255 | 65.07

Page 161: Deep Dive, University of Florida (02/2015)

Conclusion
� Methodologies to parallelize PIC on a triangular mesh using GPUs
� Shadow entities (replication) provide a simpler and efficient solution
� Algorithms discussed are scalable with the size of the mesh and the number of particles, and can be easily ported to a multi-GPU framework

Page 162: Deep Dive, University of Florida (02/2015)

CCMT

Research Thrust: NGEE Reconfigurable Platform

CCMT| 2

Introduction

Background:
� Behavioral emulation (BE) approach: manages Exascale complexity via
– BEOs (abstraction of object behavior; not cycle accurate)
– Multi-scale (abstraction at micro, meso, and macro levels)

Goal: Research & develop toolset to scale BE approach of system simulation up to Exascale while maintaining required performance (speed)
– Software PDES behavioral simulator
– Hardware-accelerated behavioral emulator

Approach (for behavioral emulator):
� Explore methods of mapping BEOs onto systems of reconfigurable processors
� Investigate use of large-scale reconfigurable supercomputing, RSC (e.g., Novo-G#, next-gen RSC) in emulation of Exa/extreme-scale systems

Related research:
� Multi-FPGA systems (Novo-G, Catapult, BEE3-based cluster, Bluehive)
� Multi-FPGA system interconnect (Novo-G#, BEE3-based cluster)
� FPGA-accelerated architectural emulation (RAMP)
� Recent interest in FPGA-based heterogeneous computing for big data and data centers: Microsoft, IBM, Intel, Oracle, Google, Baidu

Page 163: Deep Dive, University of Florida (02/2015)

CCMT| 3

Session 3 Outline � Introduction: motivation, goal, & approach� Mapping BEOs onto reconfigurable (RC) platform

– Novo-G# reconfigurable supercomputer– Basic appBEO, procBEO, & commBEO designs

� Current single-FPGA prototype (NGEEv1)– Current single-device prototype & demo– Additional results & SMP performance comparison

� Transitioning to multiple FPGAs – Questions, identified issues, & possible directions– Direct FPGA-to-FPGA communication via

3D interconnect in Novo-G#– New potential NGEE target architecture– Proposed scalability measure

� NGEEv2 single-FPGA design– Effect of BE V2 methodology improvements

on our FPGA acceleration efforts– Multiple-FPGA considerations

� Conclusions


CCMT| 4

Novo-G Reconfigurable Supercomputer

� Developed and deployed at CHREC – Most powerful reconfigurable computer

in (academic) world– 2012 Alexander Schwarzkopf Prize for

Technology Innovation @ NSF Center

� Apps acceleration– In key science domains: bioinformatics,

finance, image & video processing

� Hardware emulation– Behavioral emulation of future-gen

systems, up to Exascale

� 2014 upgrade– 64 GiDEL ProceV (Stratix V D8)– 4x4x4 3D-torus or 6D-hypercube– 6 Rx-Tx links per FPGA– 4x 10 Gbps per link

Novo-G Annual Growth2009: 24 GiDEL ProcStar III cards (96 top-end

Stratix-III FPGAs), each with 4.25GB SDRAM2010: 24 more ProcStar III cards (96 more Stratix-III

FPGAs), each with 4.25GB SDRAM2011: 24 ProcStar IV cards (96 top-end Stratix-IV

FPGAs), each with 8.50GB SDRAM2012: 24 more ProcStar IV cards (96 more Stratix-IV

FPGAs), each with 8.50GB SDRAM2014: 64 ProceV cards (64 top-end Stratix-V FPGAs),

with high-speed 4x4x4 torus or 6D-hypercube

Page 164: Deep Dive, University of Florida (02/2015)

CCMT| 5

Novo-G ProceV Upgrade w/ 3D Torus

Novo-G# (Novo-jee-sharp)• 32 GiDEL ProceV (Stratix V D8)• 4x4x2 3d-torus or 5d-hypercube• 6 Rx-Tx links per FPGA• 4x 10 Gbps per link• Data-link layer: SerialLite III protocol

• Full-duplex, CRC32 protection, in-band or out-of-band flow control

• Physical layer: Interlaken protocol• 64B/67B encoding, multi-lane sync.

CXP to 3-QSFP Cable(provides connectivity

for 3D torus)

4x4x2 Torus(soon to be 4x4x4)

CXP Port (underneath):12x10Gbps channels 8-lane PCI Express Gen3

ProceV Board

QSFP+ daughterboard3x QSFP+ Ports:4x10Gbps

channels each

Stratix V D8device


8 ProceV nodes

Upgraded Novo-G

Special contributions by Abhijeet Lawande via cost-share from CHREC

CCMT| 6

Novo-G# 3D Torus Protocol Stack

� 3-layer 3D torus protocol stack (shown above) based on IP from Altera and GiDEL
� Basic point-to-point services provided by Interlaken and SerialLite-III; network-oriented services provided by RTL code
� Direct FPGA interconnect crucial to the scalability of NGEE

[Figure: network services of the 3D-torus FPGA architecture. From top to bottom: application; Layer 3 router (network layer: dimension-order routing, collective routing, source data buffering) and Layer 2 switch (data-link layer: physical addressing, packet switching, congestion control), both available through RTL design; protocol IP (data-link layer: data framing, error detection (CRC); physical layer: clock recovery, line coding, multi-lane sync), available through IP; transceivers]

Page 165: Deep Dive, University of Florida (02/2015)

CCMT| 7

Mapping AppBEO onto Single FPGA
� High-level appBEO script (abstraction of target app) mapped to custom machine code (MIF file)
� Stream of instructions for procBEOs
� Instruction delivery options
– Pulled from on-chip ROM
– Pushed from CPU (instr. stream from CPUs through external memories)

Optimization / Exploration

CCMT| 8

Mapping ProcBEO onto Single FPGA

� Mimics “real” processor under study
– Instruction decoding, timekeeping
– No real computation: interpolation of compute operations
– Generates tokens to emulate comm packets
� Lightweight processing elements
� Initial prototype
– One-to-one mapping of procBEOs to interpolation & comm resources

Optimization / Exploration

Page 166: Deep Dive, University of Florida (02/2015)

CCMT| 9

Mapping CommBEO onto Single FPGA
� System-specific: fabric is explicit emulation of target architecture
� Consists of token buffers, arbiter, router, network timer
� Packets transferred contain characteristics of (not real) data

Optimization / Exploration

CCMT| 10

Session 3 Outline � Introduction: motivation, goal, & approach� Mapping BEOs onto reconfigurable (RC) platform

– Novo-G# reconfigurable supercomputer– Basic appBEO, procBEO, & commBEO designs

� Current single-FPGA prototype (NGEEv1)– Current single-device prototype & demo– Additional results & SMP performance comparison

� Transitioning to multiple FPGAs – Questions, identified issues, & possible directions– Direct FPGA-to-FPGA communication via

3D interconnect in Novo-G#– New potential NGEE target architecture– Proposed scalability measure

� NGEEv2 single-FPGA design– Effect of BE V2 methodology improvements

on our FPGA acceleration efforts– Multiple-FPGA considerations

� Conclusions


Page 167: Deep Dive, University of Florida (02/2015)

CCMT| 11

Current Single-FPGA Prototype (NGEEv1)

Functioning prototype running on single FPGA of Novo-G
� No optimization (i.e., max-resource implementation)
� Current core density of 90 for Stratix IV, 256 for Stratix V
– Each core contains one each of appBEO, procBEO, and commBEO
– Stratix IV currently limited by FPGA block RAM, not logic
• 9x10 mesh on Stratix IV: logic 19%, block memory 100%
– Higher core density on Stratix V
• 16x16 mesh on Stratix V: logic 94%, block memory 100%
� appBEO scripts stored in on-chip block RAMs as memory initialization files (MIFs)
� Proc interpolation resources replaced with MIF pre-processing
� Explicit emulation of target communication fabric without congestion modeling
� Separate management plane fabric collecting management tokens for postmortem analysis (e.g., simulation visualization)

CCMT| 12

DEMO: FPGA-specific appBEOs
� Generate memory initialization files (mif) to configure FPGA simulator
– R script to convert appBEO instructions into custom NGEE-specific machine code
– Generates core-level instruction streams for configuring the simulator’s FPGA bit file

Page 168: Deep Dive, University of Florida (02/2015)

CCMT| 13

DEMO: Simulator Setup & Execution
• MIF files assembled into the FPGA bit file & loaded onto the FPGA
• Custom C driver initiates the simulator and collects results from the FPGA management plane
• Management plane logs core-level events & streams them to the host

CCMT| 14

Additional Results & SMP Performance Comparison

Experimental Setup
• Experiments with TileGX36, next-gen TileGX72, & anticipated Intel Xeon Phi Knights Landing architecture
• Single Stratix IV E530 vs. SMP on a single quad-core Xeon E5520 CPU @ 2.27 GHz
• Proc/comm configurations: Tile 6x6, Tile 9x8, Knights Landing 9x8
• App configuration: work equally distributed to all available cores for each proc/comm configuration
• Apps: 2D MM & Sobel filtering
• App kernels executed 250 times to amortize simulator overheads
• Compare management results to SMP for equivalency/correctness
• Compare execution times to SMP for performance improvement


CCMT| 15

Performance Comparison: 3 Data Points

Platform               App (workload)               Simulated Time  Prediction Error*  FPGA Sim Time  SMP Sim Time  Speedup
Tile 6x6               2D MM, 1024x1024, 36 cores   2.82x10^9 ns    -0.35%             35.7 us        4.82 ms       ~135x
Tile 6x6               Sobel, 800x600, 36 cores     9.27x10^7 ns    -2.61%             54.2 us        8.08 ms       ~149x
Tile 9x8 (next-gen)    2D MM, 1024x1024, 72 cores   1.66x10^9 ns    to be determined   81.1 us        10.24 ms      ~126x
Tile 9x8 (next-gen)    Sobel, 800x600, 72 cores     2.58x10^7 ns    to be determined   102.0 us       17.94 ms      ~176x
KNL 9x8 (anticipated)  2D MM, 1024x1024, 72 cores   5.87x10^8 ns    to be determined   81.1 us        10.24 ms      ~126x
KNL 9x8 (anticipated)  Sobel, 800x600, 72 cores     1.37x10^7 ns    to be determined   102.0 us       17.94 ms      ~176x

* Prediction error: simulated time relative to (consistent with) SMP results.
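The speedup column is simply the SMP simulation time divided by the FPGA simulation time; the short check below recomputes the two Tile 6x6 rows from the reported measurements.

```python
# Sanity check on the speedup column: speedup = SMP sim time / FPGA sim time,
# recomputed for the Tile 6x6 rows of the table above.

rows = {  # app: (FPGA sim time in us, SMP sim time in ms)
    "2D MM 1024x1024 / 36 cores": (35.7, 4.82),
    "Sobel 800x600 / 36 cores":   (54.2, 8.08),
}

for app, (fpga_us, smp_ms) in rows.items():
    speedup = smp_ms * 1000 / fpga_us  # convert ms -> us before dividing
    print(f"{app}: ~{round(speedup)}x")
```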

CCMT| 16



CCMT| 17

Transitioning to Multiple FPGAs
With Novo-G# as the target system architecture, the current design must be modified to allow communication between commBEOs instantiated on different FPGAs over the direct FPGA interconnect.
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?

CCMT| 18

Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
  – Added communication infrastructure & overhead
  – Multi-layer communication protocol & virtual network fabric
  – Modified ISA, packet structure, management tokens, inter-device bandwidth allocation
  – General design considerations (e.g., arbitrary no. of resources vs. hardcoded limits of a single FPGA)
  – Likely reduced BEO density
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?


CCMT| 19

Direct FPGA-to-FPGA Communication via 3D Interconnect

[Figure: per-FPGA router design. App logic connects through internal transmitter/receiver pairs to a positive router (+X, +Y, +Z) and a negative router (-X, -Y, -Z); each direction is served by an external transmitter/receiver pair over 10 Gbps transceiver (XCVR) IP links, with 256-bit internal datapaths spanning the router_clk and xcvr_clk domains. A packet is a 256-bit header flit followed by data flits. Header fields: destination X/Y/Z, payload size, source X/Y/Z, packet number (8 bits), and reserved bits (24 bits). App-side interface signals: start_of_packet, end_of_packet, size, packet_num, valid, data, source, dest.]
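The header fields can be packed into the 256-bit flit with straightforward bit manipulation. The slide only partially specifies the field widths, so the 4-bit coordinates and exact ordering below are assumptions for illustration, not the NGEE wire format.

```python
# Illustrative pack/unpack of the header flit described above. Field widths
# marked with the coordinate fields are assumed (4 bits each); only the 8-bit
# packet number and 24 reserved bits are stated on the slide.

FIELDS = [  # (name, width in bits), packed MSB-first into the 256-bit flit
    ("dest_x", 4), ("dest_y", 4), ("dest_z", 4),
    ("payload_size", 8),
    ("src_x", 4), ("src_y", 4), ("src_z", 4),
    ("packet_num", 8),
    ("reserved", 24),
]
FLIT_BITS = 256

def pack(values):
    word, used = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word = (word << width) | v
        used += width
    return word << (FLIT_BITS - used)  # zero-pad the rest of the flit

def unpack(word):
    word >>= FLIT_BITS - sum(w for _, w in FIELDS)  # drop the padding
    out = {}
    for name, width in reversed(FIELDS):  # last-packed field is in the LSBs
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

hdr = {"dest_x": 3, "dest_y": 1, "dest_z": 2, "payload_size": 5, "packet_num": 7}
print(unpack(pack(hdr))["dest_x"])
```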

CCMT| 20

Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
  – Added communication infrastructure & overhead
  – Multi-layer communication protocol & virtual network fabric
  – Modified ISA, packet structure, management tokens, inter-device bandwidth allocation
  – General design description (i.e., arbitrary no. of resources vs. hardcoded limits of a single FPGA)
  – Likely reduced BEO density
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?


CCMT| 21

Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
  – BEO wait times
  – Inter-FPGA event queuing, flow control, queue sizing, event reordering
  – Sharing of BEO resources
  – Proposed scalability measurement & its use to inform multi-FPGA design decisions
Q3. Would the use of other system architectures be more advantageous given new BE requirements?

CCMT| 22

Scalability Studies & Projections

Definitions:
• Emulation system: behavioral emulation platform such as Novo-G#
• Emulated system: appBEOs (e.g., modeling the CMT app) stimulating archBEOs (e.g., modeling Blue Gene/L)

Open questions to be answered in the future:
• For a given emulation system architecture (e.g., #FPGAs, BEO core density, core design, interconnect architecture), what are the limits of an emulated system?
  – Including size (e.g., #BEOs) and emulation performance
• For given requirements of an emulated system (e.g., macro-scale emulation with Blue Gene/L), what emulation system resources are necessary?
  – Including #FPGAs, core density, interconnect architecture, etc.
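The second question reduces, at its crudest, to dividing the target BEO count by per-FPGA core density. The sketch below uses the 256 cores/FPGA reported for Stratix V; the 64K-node target size is an illustrative assumption, not a stated CCMT requirement.

```python
# Back-of-the-envelope resource sizing for the second open question: how many
# FPGAs does an emulated system of a given size need? Core density comes from
# the reported Stratix V figure; the target sizes are hypothetical.

import math

def fpgas_needed(target_beos, cores_per_fpga=256):
    """Minimum device count, ignoring inter-FPGA communication constraints."""
    return math.ceil(target_beos / cores_per_fpga)

print(fpgas_needed(65536))  # e.g., a hypothetical 64K-node emulated system
print(fpgas_needed(1000))   # small macro-scale study
```

Real sizing would also have to account for interconnect topology and the reduced BEO density expected once multi-FPGA communication logic is added, so these counts are lower bounds.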


CCMT| 23

Potential Scalability Measure
Objective:
• Compare scalability(HW) vs. scalability(SW)
  – HW: hardware approach; SW: software SMP approach

Potential scalability measure for HW:
• Ideally, the entire simulation system is on a single large FPGA; thus, communication between BEOs is at on-chip rate
• Baseline: validated BE model for single-FPGA performance (PfS) of NGEE (i.e., a BE model of the FPGA running other BE models)
• Scalability issues arise when BEOs communicate across FPGAs
  – Off-chip communication is much more costly
• Approach: validated BE model for multiple-FPGA performance (PfM) of NGEE (possible after multi-FPGA experiments)
• Proposed measure: SM(HW) = PfS / PfM

[Figure: emulated systems mapped onto notional FPGAs, and a notional plot of parallel efficiency vs. number of devices comparing the FPGA and SMP approaches.]
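The proposed measure and the parallel-efficiency comparison can be expressed in a few lines. Performance is taken here as emulated events per second, and all numbers are made up; the slide leaves the exact normalization of PfS and PfM open, so this is only one plausible reading.

```python
# Sketch of the proposed measure SM(HW) = PfS / PfM, plus the parallel
# efficiency used in the notional FPGA-vs-SMP plot. Values are hypothetical.

def scalability_measure(pf_single, pf_multi):
    """SM(HW): single-FPGA performance over multi-FPGA performance for the
    same emulated workload; values near 1 mean little off-chip penalty."""
    return pf_single / pf_multi

def parallel_efficiency(pf_one_device, pf_n_devices, n):
    """Achieved aggregate performance relative to ideal linear scaling."""
    return pf_n_devices / (n * pf_one_device)

# Hypothetical: one FPGA sustains 1e9 events/s; 8 FPGAs together sustain 6.4e9
print(parallel_efficiency(1e9, 6.4e9, 8))
```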

CCMT| 24



CCMT| 25

Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?

CCMT| 26

New Potential NGEE Target Architecture

Node architecture: POWER8 server with
• 2 CPUs
• 2 CAPI-attached accelerators
• 4-accelerator 1D torus

System configuration:
• 4 POWER8 servers
• 16 Nallatech boards
• 16-board 2D torus
• Up to 32 FPGAs w/ dual-chip boards
• CAPI enables a hardware kernel bypass
• Resource pool functionality & OpenCL support


CCMT| 27


CCMT| 28

NGEEv2 Single-FPGA Design
Updated design based on current and possible future developments in the fundamental BE methodology, emulation system architecture, etc.
• Effect on FPGA acceleration efforts going forward?
  – Both single- & multi-FPGA considerations
• Identified issues
  – BEv2 modifications: e.g., congestion modeling, global task graph manipulation, micro-scale symmetry exploitation, multi-pass simulation, etc.
• Possible approaches and directions
  – Alternative acceleration approaches?
  – Alternate target system architectures?


CCMT| 29

Conclusions
Progress:
• Working single-FPGA prototype (micro-scale) with max-resource implementation & management plane (no optimization)
• Beginning stages of performance optimization & scalability evaluation
• New design (NGEEv2) ideation

Plans for March:
• Prototype NGEE platform operating on multiple FPGAs
• Showcase results from optimization studies
  – Increased BEO density per FPGA
• Performance comparison with software-based SMP simulator for multiple appBEO scripts
• Upgraded Novo-G# (4x4x4 torus) supporting BE
• New performance/scalability predictions for fully upgraded Novo-G#