Deep Dive, University of Florida (02/2015)
TRANSCRIPT
Deep Dive Meeting
February 3-4, 2015
Deep Dive University of Florida
February 3-4, 2015
Current Attendee List:
Bob Voigt, NNSA HQ, [email protected]
Matt Bement, LANL, [email protected]
David Daniel, LANL, [email protected]
Dave Nystrom, LANL, [email protected]
Maya Gokhale, LLNL, [email protected]
Martin Schulz, LLNL, [email protected]
Jim Ang, SNL, [email protected]
Arun Rodrigues, SNL, [email protected]
Jeremy Wilke, SNL, [email protected]
S. Balachandar "Bala", University of Florida, [email protected]
Alan George, University of Florida, [email protected]
Rafi Haftka, University of Florida, [email protected]
Herman Lam, University of Florida, [email protected]
Sanjay Ranka, University of Florida, [email protected]
Greg Stitt, University of Florida, [email protected]
Tom Jackson, University of Florida, [email protected]
Tania Banerjee, University of Florida, [email protected]

University of Florida Students:
Dylan Rudolph, [email protected]
Nalini Kumar, [email protected]
Carlo Pascoe, [email protected]
Kasim Alli, [email protected]
Chris Hajas, [email protected]
Mohammed Gadou
Michael Retherford
NOTE: There will be a $125.00 registration fee per person to cover all expenses associated with the meeting. In addition to other things, this will cover breakfast and lunch for two days, dinner Tuesday night, coffee breaks, etc. Please make checks payable to the University of Florida. A receipt will be available at the meeting.
NOTE: The meeting is all day Tuesday and ½ day on Wednesday. We will provide transportation to the airport as needed. Please make reservations at the University Hilton.
UF Deep dive agenda: Tuesday, February 3, 2015
8:20 Van pickup at Hilton
8:30 – 9:00 Breakfast
9:00 – 9:30 Welcome and Deep-Dive Overview (3 Sessions)
1. Behavioral emulation (BE): modeling & simulation/emulation methods
2. CS issues (performance, energy, and thermal)
3. Use of reconfigurable computing to accelerate behavioral emulation
* Each of the three deep-dive sessions is designed to be interactive: a combination of short presentations by UF and Tri-lab researchers, intermixed with discussion, demonstrations, etc.
9:30 – 11:30 Session 1: Behavioral Emulation: Modeling & Simulation/Emulation Methods
UF topics:
o Behavioral characterization
o Parameter estimation
Tri-lab topics:
o Overview of FastForward 2 and DesignForward 2 (Jim Ang, SNL)
o Multi-scale architectural simulation with the Structural Simulation Toolkit (Arun Rodrigues, SNL)
11:30 – 12:30 Lunch
12:30 – 2:00 Session 1 (continued): Behavioral Emulation: Beyond Device Level
UF topics:
o Synchronization for speed
o Congestion modeling
o Behavioral characterization & modeling beyond device level
Tri-lab topics:
o Using discrete event simulation for programming model exploration at extreme-scale (Jeremy Wilke, SNL)
o ASC next-generation code projects (David Daniel, LANL)
2:00 – 5:00 Session 2: CS Issues (Performance, Energy, and Thermal)
UF topics:
o Performance and autotuning for hybrid architectures
o Energy and thermal optimization
o Dynamic load balancing
Tri-lab topics:
o Performance, energy, and thermal benchmarking (Jim Ang, SNL)
o Why power is a performance issue: utilizing overprovisioned systems (Martin Schulz, LLNL)
* There will be an afternoon coffee break in this time slot
6:30 Dinner (University Hilton)
Wednesday February 4, 2015
8:20 Van pickup
8:30 – 9:00 Breakfast
9:00 – 11:00 Session 3: Use of Reconfigurable Computing to Accelerate Behavioral Emulation
UF topics:
o Efficient mapping of behavioral emulation objects (BEOs) onto a system of FPGAs
o Demo of current single FPGA prototype
o Transitioning to multiple FPGAs
o Challenges associated with maximizing emulation speed while maintaining scalability/usability
Tri-lab topic:
o FPGA-based emulation of processing near memory (Maya Gokhale, LLNL)
11:00 – 12:00 Open discussion and planning for action items
12:00 Box lunch; transportation to airport as needed.
Behavioral Emulation for Design-Space Exploration of CCMT Apps
Principal Investigators: Dr. Alan George, Dr. Herman Lam, Dr. Greg Stitt
Student Project Leaders: Nalini Kumar, Carlo Pascoe, Dylan Rudolph
NSF Center for High-Performance Reconfigurable Computing (CHREC)
ECE Department, University of Florida
Outline
- Project context, scope, & focus
- Behavioral Emulation approach
- Research thrusts
Context: DOE Co-design
Approach: BEOs & Behavioral Emulation Flow

[Figure: Behavioral Emulation flow. Apps & kernels (skeleton apps at macro-scale, mini-apps at meso-scale, kernels at micro-scale) become Application BEOs (AppBEOs); existing systems & architectures and future-gen systems & notional architectures (system at macro-scale, node at meso-scale, device at micro-scale) become Architecture BEOs (ArchBEOs). Both feed a simulation/emulation platform supporting testbed benchmarking & experimentation, behavioral simulation (SW) or emulation (HW) experimentation, notional systems exploration, and architecture/application design-space exploration. Example AppBEO sequence:

  init (device);
  mem_init (A);
  mem_init (B);
  broadcast (A, comm_grp);
  scatter (B, B*, comm_grp);
  compute (dot_product, A, B*);
]
Scope and Focus

Application Design-Space Exploration (DSE)
- Given characteristics of promising exascale architectures (at device, node, and system level),
  – c/o vendor roadmaps & future technologies (e.g., FastForward 2 & DesignForward 2),
  – explore different ways to parallelize exascale applications
- Focused study
  – Not intending to perform DSE on all types of exascale apps
  – Focus on exascale apps relevant to our CCMT Center, along with the Behavioral Emulation approach (discussed later)
- This focus allows for optimizations & modeling techniques not available to general-purpose system simulators
Approach: Behavioral Emulation

- How may we study Exascale before the age of Exascale?
  – Analytical studies – systems are too complicated
  – Software simulation – simulations are too slow at scale
  – Behavioral emulation – to be defined herein
  – Functional emulation – systems too massive and complex
  – Prototype device – future technology, does not exist
  – Prototype system – future technology, does not exist
- Many pros and cons with various methods
  – We believe behavioral emulation is most promising in terms of balance of project goals (accuracy, speed, and scalability, as well as versatility)
Behavioral Emulation (BE)
- Component-based, coarse-grained simulation
  – Fundamental constructs called BE Objects (BEOs) act as surrogates
  – BEOs characterize & represent behavior of app, device, node, & system objects as fabrics of interconnected ArchBEOs (with AppBEOs) up to Exascale
- Multi-scale simulation
  – Hierarchical method based upon experimentation, abstraction, exploration
- Multi-objective simulation
  – Performance, power, reliability, and other environmental factors
Fundamental Design of an Arch BEO

Arch BEO: abstract model (surrogate) of an architecture object; the basic primitive in the BE approach to studies of Exascale systems.

Emulation Plane
- Mimic appropriate behavior of the modeled object
- Interact with other BEOs via tokens to support emulation studies

Management Plane
- Measure, collect, and/or calculate metrics and statistics
- Support architectural exploration

Metrics
- Performance factors (execution time, speedup, latency, throughput, etc.)
- Environmental factors (power, energy, cooling, temperature)
- Dependability factors (reliability, availability, redundancy, overhead)

[Figure: Architecture Behavioral Emulation Object (BEO). The emulation plane contains computation, communication, power, and reliability models; the management plane handles measurement, data collection, & synchronization; tokens flow to/from other BEOs.]
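The two-plane design above can be sketched as a toy Python class. This is a hypothetical illustration of the idea, not the project's implementation; names such as `ArchBEO`, `Token`, and `ns_per_op` are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """Message exchanged between BEOs, carrying a simulated timestamp."""
    src: str
    dst: str
    payload: str
    timestamp: int   # simulated time (ns) at which the token was emitted

@dataclass
class ArchBEO:
    """Coarse-grained surrogate for one architecture object (device/node/system)."""
    name: str
    ns_per_op: int               # computation model: simulated ns per abstract op
    clock: int = 0               # local simulated clock (emulation plane)
    stats: dict = field(default_factory=dict)  # metrics (management plane)

    def compute(self, n_ops: int) -> None:
        # Emulation plane: advance the local clock per the computation model.
        self.clock += n_ops * self.ns_per_op
        # Management plane: collect metrics for later analysis.
        self.stats["ops"] = self.stats.get("ops", 0) + n_ops

    def send(self, dst: "ArchBEO", payload: str) -> Token:
        return Token(self.name, dst.name, payload, self.clock)

    def receive(self, tok: Token, link_latency: int) -> None:
        # The receiver cannot proceed before the token arrives.
        self.clock = max(self.clock, tok.timestamp + link_latency)

# Usage: two device-level BEOs exchanging one token.
a, b = ArchBEO("dev0", ns_per_op=1), ArchBEO("dev1", ns_per_op=1)
a.compute(1000)                              # dev0 clock advances to 1000 ns
b.receive(a.send(b, "A"), link_latency=2000) # dev1 resumes at 3000 ns
```

The point of the sketch is the separation of concerns: the emulation plane only advances clocks and moves tokens, while the management plane only records metrics.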
Conclusions: Research Thrusts
- Behavioral Characterization (Session 1 morning)
  – How do we build, calibrate, then validate performance?
- Parameter Estimation (Session 1 morning)
  – How do we efficiently capture behavior in surrogates?
- Synchronization & Congestion (Session 1 afternoon)
  – How do we handle sync and congestion at scale?
- Management & Visualization
  – How do we measure & analyze massive systems & apps?
- Reconfigurable Architectures (Session 3)
  – How do we exploit FPGA hardware for speed & scale?
- Resilience & Energy (starting after Y1)
  – How do we extend beyond performance attributes?

[Rotated sidebar labels: BE Modeling Research; Platform Research.]
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
DOE's Fast Forward and Design Forward R&D Projects: Influence Exascale Hardware
James A. Ang, Ph.D., Manager, Scalable Computer Architectures, Sandia National Laboratories, Albuquerque, NM
University of Florida CCMT Exascale Deep Dive Workshop, Gainesville, FL, February 3-4, 2015
SAND2015-0626 PE
Exascale Hardware Challenges

[Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith, 2004]

- Left to the Invisible Hand
- Industry follows an evolutionary path focused on Peak Flops
- In the Era of Dennard Scaling, our ad hoc approach to integration of MPPs with COTS microprocessors was acceptable
- With the end of Dennard scaling, this is no longer able to meet DOE Mission Application Requirements
Exascale Hardware Challenges – cont.

- We need to Motivate and Influence Architectural Changes
  - Processor/node architectures
  - System architectures
- Our investments are not only in architectures
  - We cannot just develop new Exascale architectures and throw them over the wall to our application developers
  - We need Hardware/Software Co-design
- The transition of the DOE Legacy Code base is another important challenge
  - This challenge should influence future hardware thru Co-design

[Figure: layered system diagram with a network layer, memory layers, and a multi-core processor layer.]
Industry Engagement is Vital

- We need industry involvement
  - Avoid one-off, stove-piped solutions
  - Continued "product" availability and upgrades beyond DOE support
- Industry cannot and will not solve the problem alone
  - Business model obligates industry to optimize for profit, beat competitors
  - Industry investments are heavily weighted towards near-term, evolutionary improvements with small margin over competitors
  - Industry funding for long-term technology R&D is limited and constrained
  - Industry does not understand DOE Applications and Algorithms
- How can we impact industry?
  - Work with those that have strong advocate(s) within the company
  - Fund research, development and demonstration of long-term technologies that clearly show potential as future mass-market products (or product components)
  - Corollary: do not fund product development (as part of DOE R&D portfolio)
  - Industry will incorporate promising technologies into future product lines
NNSA/ASC and SC/ASCR are partnering to Influence Industry

- Aligned Hardware Architecture Efforts
  - April 2011: MOU signed between SC and NNSA
  - July 2011: Issued RFI on Critical Technologies for Exascale
  - July 2012: Established Fast Forward node-level Critical/Cross-Cutting Technology R&D projects
  - October 2013: Established Design Forward interconnect R&D projects
  - November 2014: Fast Forward 2: Exascale Node Designs
  - TBD: Design Forward 2: Conceptual Designs of Exascale Systems
- Aligned joint Advanced Technology platform procurements
  - CORAL: Oak Ridge, Argonne, and Lawrence Livermore National Labs
  - APEX: Los Alamos, Lawrence Berkeley and Sandia National Labs
Fast Forward Program

- Objective: Accelerate transition of innovative ideas from processor and memory architecture research into future products
- Evaluate advanced research concepts and develop quantitative evidence of their benefit for DOE applications, using proxy apps and collaborating on Co-design
  - Engage DOE application teams to understand technology trends/constraints (how it impacts their code development)
  - Understand how to program these new features
- Quantitative evidence to lower risk to adoption of innovative ideas by product teams
- Critical Node Technologies and Designs for Extreme-scale Computing
Fast Forward Program (cont.)

- Fast Forward 1 (July 2012 – Sept. 2014)
  - AMD: heterogeneous processor, processing-in-memory and 2-level memory
  - IBM: advanced memory concepts
  - Intel: core energy efficiency and processing-near-memory
  - Intel/Whamcloud: storage reliability, I/O API, burst buffer management
  - Nvidia: memory hierarchy, processor/packaging/programming
- Fast Forward 2 (Nov 2014 – 2016)
  - AMD: near-threshold-voltage logic, other low-power computing technologies, and a new standardized memory interface
  - Cray: alternative processor design points including ARM microprocessors
  - IBM: investigate next-generation standardized memory interface
  - Intel: energy-efficient node and system architectures, including software targeted at developing extreme-scale systems
  - Nvidia: focus on energy efficiency, programmability and resilience
Design Forward Program

- Objective: R&D of interconnect architectures and conceptual designs for future extreme-scale computers
- Oct. 2013 – 2015, Design Forward 1: Interconnect Networks
  - Overall interconnect architecture
  - Interconnect integration with processor and memory
  - Multiple communication library progression and interaction
  - Interconnect fabrics and management
  - Protocol support
  - Scalability
- Start is imminent, Design Forward 2: system design and integration
  - Overall system architecture
  - Energy utilization
  - Resilience and reliability
  - Data movement through the system
  - Packaging density
  - System software
  - Programming environment
AMD

- Processor Research
  - Heterogeneous nodes which blend CPU and GPU cores
  - Improved energy efficiency
  - Efficient communication and data movement across the die
  - Simplified programming models
- Memory Research
  - Investigating new memory technologies
  - Reduced data movement
  - Higher performance
  - Reduced energy consumption
  - New Memory Interface: standardized, robust interface to support integration of heterogeneous memory and cores
- Software Tools
  - HSA Foundation

[Figure: Concept Node Design]
Source: AMD FastForward Project Overview (https://asc.llnl.gov/fastforward/AMD-FF.pdf)
Intel

- Processor Research
  - Lightweight processor cores
  - Fast synchronization
  - Specialized aspects of ISA and processor for data movement
  - Tapered access to memory
- Interconnect Research
  - Tapering bandwidth networks
  - Integration of NICs into processor
  - Intelligent data movement to reduce power
- Software Tools
  - Open Community Runtime (OCR)
  - Exploration of OpenMP and MPI as legacy environment

Source: Intel FastForward Project Overview (https://asc.llnl.gov/fastforward/Intel-FF.pdf and IPDPS2013 talks)
NVIDIA

- Processor Research
  - Temporal SIMT and scalarization
  - Reduce effect of wide vectors
  - Coherency and consistency across system
  - Hierarchical memory systems
- Interconnect Research
  - Open standards for the data center
  - Support direct GPU messaging
- Programmability
  - Global address spaces (PGAS)
  - Efficient cross-machine collectives
  - Fast synchronization
  - Active messages
  - Heterogeneous cores

Source: NVIDIA FastForward Project Overview (https://asc.llnl.gov/fastforward/Nvidia-FF.pdf)
IBM

- Memory Research
  - Novel computation near memory
  - Reduction in data movement and associated overhead
  - Advances in programming models, compiler and runtime environment
  - Leverage of emerging memory technologies
  - Advances in memory efficiency
  - Advances in memory system integration, power and reliability management
- Impact
  - Large reduction of data movement
  - Significant improvement in system-level performance, power efficiency, and reliability
  - Successful exploitation of novel architecture features while abstracting the hardware complexity, enabled by evolutionary and revolutionary approaches

Source: IBM FastForward Project Overview (https://asc.llnl.gov/fastforward/IBM-FF.pdf)
Cray

- Network Communication API
  - NIC functions to enable efficient execution of the network API
  - Structures required to achieve scalability of a diverse range of traffic patterns?
  - Novel functions in future cores to facilitate efficient wakeup on the arrival of new data?
- Network Protocol
  - How can the NICs generate simple, small, HPC-optimized packets at a sufficient rate?
  - Interoperable protocols in support of heterogeneous, adaptive designs
  - What flexibility is needed to allow vendor differentiation?
- Network Management API
  - What are the important management functions to provision?
  - What structure of system management best serves those functions?
  - Standardized APIs to allow management of a variety of high-performance networks
ASC and ASCR are partnering on Joint Advanced Technology System Procurements

- The APEX (LANL, LBNL, and SNL) collaboration is intended to result in the procurement of two platforms in ~2020
  - NERSC/ASCR procurement of NERSC-9
  - ACES/ASC procurement of ATS-3 (Advanced Technology System)
- Both platforms will focus on meeting both mission needs and pursuing Advanced Technology concepts
  - We expect to use Non-Recurring Engineering (NRE) investment to guide and improve system performance and productivity
High-level Design Philosophy for ATS-3

- Delivered application performance is the primary driver in support of mission requirements
  - Peak FLOPS requirement will not appear in the RFP
- Advanced technology development is assumed to be necessary to meet mission needs
  - Accelerate development of yet-to-be-identified key technologies
  - 3rd round of NRE (Trinity/NERSC-8, CORAL, APEX)
- APEX are pre-exascale platforms
  - MUST support a path to exascale programming models
  - While supporting existing mission needs
- Support MPI + OpenMP (threads)
  - Matured on Trinity/Cori and CORAL platforms
  - Additional support for other, yet-to-be-identified MPI+X programming models
APEX Capability Improvement

- An increase in predictive capability requires increases in the fidelity of both geometric and physics models
  - This implies usable large platform memory capacity
- APEX must demonstrate a significant capability improvement
  - Improvement measured relative to Trinity (ATS-1) and Cori (NERSC-8)
  - Improvement as a function of performance (total time to solution), increased geometries, increased physics capabilities, power/energy efficiency, resilience and other factors
- Previous DOE investments are assumed to be an integral part of production computing for APEX
  - Trinity/NERSC-8 NRE projects: Burst Buffer and Advanced Power Management
  - Fast Forward and Design Forward projects
- Potential Path Forward project
  - NRE could take select technologies the final yards towards production
Fast Forward and Design Forward Impact

- APEX Team is performing market surveys
  - Vendors visiting in phases starting in January 2015: IBM, Intel, Cray, Nvidia, AMD, SGI, HP, Micron, Broadcom, ARM, etc.
- Fast Forward and Design Forward accomplishments and progress have direct influence over the development of APEX technical requirements
- Developing NRE strategy
  - We started early to enable a richer range of NRE topics
CCMT Research Thrust: BE Characterization

Introduction
- Summary of topics that will be discussed in this session
Distributed Behavioral Emulation

- Goal: Enable fast and scalable simulation of Exascale systems
  – Requires efficient simulation representation and synchronization mechanisms for PDES (Parallel Discrete Event Simulation)
- Different from other approaches in our definitions of processes, events, and event timings
  – Three kinds of events: send event, receive event, internal event
  – Relations between events are defined w.r.t. logical time: correspondence, causality, concurrency

How do we define processes and generate events?
PDES Mapping of Matrix Multiply

- Assuming one thread per processor, one logical process is generated for each processor core
  – ProcBEOs (logical processes) read events from AppBEOs (event queues)
- Partial ordering of events is obtained by assigning integer timestamps (a logical clock)
- Real-clock timestamps are used to estimate execution time

Pseudo-code for parallel matrix multiply (C = B x A):

  if (node == 0) {
      broadcast (A, comm_grp);
      barrier ();
      scatter (B, B*, comm_grp);
      compute (dot_product, A, B*);
      gather (result, comm_grp);
  } else {
      broadcast (A, comm_grp);
      barrier ();
      recv (B, B*, node_0);
      compute (dot_product, A, B*);
      send (result, node_0);
  }
[Figure: timing diagram for the matrix multiply distributed simulation on a 4-core CPU, showing events e11–e45 on ProcBEO1–ProcBEO4 over time.]
Where do we get the timestamps from?
How do we define processes?
How do we generate events?
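A toy sketch of the mapping above, with hypothetical names: one logical process (ProcBEO) per core, each with an integer logical clock, and a receive rule that preserves causality (Lamport-style).

```python
class LogicalProcess:
    def __init__(self, pid):
        self.pid = pid
        self.logical_clock = 0   # integer timestamps give a partial order
        self.log = []            # (timestamp, event) pairs

    def execute(self, event, recv_stamp=0):
        # A receive event must be ordered after the matching send event;
        # internal and send events simply tick the local clock.
        self.logical_clock = max(self.logical_clock, recv_stamp) + 1
        self.log.append((self.logical_clock, event))
        return self.logical_clock

# Usage: node 0 broadcasts A; node 1 receives it and then computes.
p0, p1 = LogicalProcess(0), LogicalProcess(1)
send_stamp = p0.execute("send: broadcast(A)")            # clock -> 1
p1.execute("recv: broadcast(A)", recv_stamp=send_stamp)  # clock -> 2
p1.execute("internal: compute(dot_product)")             # clock -> 3
```

The real-clock timestamps used to estimate execution time would replace the `+ 1` tick with durations from the performance models discussed next.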
Performance Models (1)

- Calibration data is used to develop interpolation models that predict execution time; these are the performance models
  – Data have varying dimension (one-dimensional: dot product; multi-dimensional: matrix multiply)
- We are using Kriging interpolation for the multi-dimensional case
  – More about performance models in the next talk of this session

[Figure: calibration loop. Training/calibration data from an experimental testbed, a cycle-accurate device simulator, Fast Forward 2 vendors, etc. train an interpolation model (execution_time = f()); the model estimates execution time for test inputs; if the estimate exceeds an error threshold, more calibration data is gathered.]

How do we generate timestamps for internal events?
What about send and receive events?
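Since the slides name Kriging as the interpolation method, a toy numpy sketch of Kriging-style prediction may help (simple kriging with a Gaussian covariance and zero mean). The calibration points and times below are made up, and `length`/`nugget` are assumed hyperparameters, not the project's calibrated values.

```python
import numpy as np

def gaussian_cov(X1, X2, length=1.0):
    """Gaussian (squared-exponential) covariance between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length**2))

def krige(X_train, y_train, X_test, length=1.0, nugget=1e-10):
    """Simple kriging: predict y at X_test from calibration samples."""
    K = gaussian_cov(X_train, X_train, length) + nugget * np.eye(len(X_train))
    k_star = gaussian_cov(X_test, X_train, length)
    weights = np.linalg.solve(K, y_train)   # covariance-weighted combination
    return k_star @ weights

# Usage: calibrate on (matrix size, core count) -> measured time, then predict.
X = np.array([[64, 2], [64, 4], [128, 2], [128, 4]], float)
y = np.array([1.0, 0.6, 7.5, 4.1])    # made-up execution times
X_norm = (X - X.mean(0)) / X.std(0)   # normalize inputs before kriging
pred = krige(X_norm, y, X_norm)       # at training points, prediction ~ data
```

A real calibration workflow would fit the covariance length from the data and query the model at unseen (size, cores) points rather than the training points.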
Performance Models (2)

How do we generate timestamps for send & receive events?

- To update the timestamps of sending and receiving processes, we have to account for:
  – the time taken by the message to reach its destination
  – if the receiving process is busy, the time spent by the message waiting in the queue
  – the time the receiving process spends waiting for the message to arrive
- In BE, each send event generates an event token with a timestamp equal to its local clock
- We use network performance models to estimate the time taken by a token to reach its destination
  – Qualitative parameters are used to mimic the movement of packets in the network
  – Quantitative parameters help in estimating communication time
    • Some quantitative parameters are functions of independent variables (e.g., latency)
    • Others are fixed information about the network (e.g., hop time)
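The three timing contributions above can be combined in a few lines. This is a sketch with hypothetical names and made-up parameter values; the real models interpolate calibrated data rather than use fixed constants.

```python
def network_delay(hops, hop_time, latency):
    """Quantitative network model: fixed per-hop cost plus size-dependent latency."""
    return hops * hop_time + latency

def receive_timestamp(recv_clock, token_timestamp, hops, hop_time, latency):
    # The token arrives at send time + network delay; the receiver resumes at
    # the later of its own clock (it may still be busy, so the message waits
    # in the queue) and the arrival time (the receiver waits for the message).
    arrival = token_timestamp + network_delay(hops, hop_time, latency)
    return max(recv_clock, arrival)

# Usage: sender emitted a token at t=100; 3 hops of 5 time units, latency 20.
t = receive_timestamp(recv_clock=90, token_timestamp=100,
                      hops=3, hop_time=5, latency=20)
# arrival = 100 + 15 + 20 = 135; the receiver was free at t=90, so it resumes at 135
```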
ProcBEO & CommBEO Calibration

How do we represent a physical system with LPs in Behavioral Emulation? Each BEO represents a simulation LP.

ProcBEOs emulate processing units
- Initialization, computation, etc. are internal events
- Interactions with other BEOs are send/receive events
- Local clocks are updated based on performance models for compute operations
  – A compute operation can be decomposed several ways; e.g., a matrix multiply can be broken down into multiply and add operations, dot products, or smaller matrix multiplies
  – Need to account for any non-negligible overheads

CommBEOs emulate network switches
- Mimic communication on the network by sending/receiving event tokens (instead of real packets) to other CommBEOs
- Update the timestamp of the token at each hop through the network
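A toy discrete-event loop for this LP view, with hypothetical names and an assumed fixed `HOP_DELAY`; a real CommBEO would use calibrated per-hop times and a routing table.

```python
import heapq

HOP_DELAY = 5   # assumed per-hop switch delay (arbitrary units)

def simulate(initial_events):
    """initial_events: list of (timestamp, kind, payload) tuples."""
    fel = list(initial_events)    # future event list, ordered by timestamp
    heapq.heapify(fel)
    trace = []
    while fel:
        t, kind, payload = heapq.heappop(fel)
        trace.append((t, kind))
        if kind == "send":        # token enters the network at the first switch
            heapq.heappush(fel, (t + HOP_DELAY, "hop", payload))
        elif kind == "hop" and payload["hops_left"] > 1:
            payload["hops_left"] -= 1      # a CommBEO forwards the token onward
            heapq.heappush(fel, (t + HOP_DELAY, "hop", payload))
        elif kind == "hop":       # last switch delivers to the destination ProcBEO
            heapq.heappush(fel, (t, "recv", payload))
    return trace

# Usage: an internal compute event at t=0, then a send at t=10 whose token
# crosses two switches (2 hops of 5 units each) and is delivered at t=20.
trace = simulate([(0, "compute", None), (10, "send", {"hops_left": 2})])
```

Each "hop" event stands in for a CommBEO updating the token's timestamp; no packet payload actually moves, which is what makes the approach fast at scale.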
Demo: AppBEO
- AppBEO for the 3D matrix multiply kernel used in Nek5000
  – AppBEO instructions are compiled into events that ProcBEOs can understand
  – Timestamps (estimates of event execution time) are generated and compiled pre-simulation whenever possible
Demo: ProcBEO
- ProcBEO instance to model a Tile-Gx36 processing unit
  – AppBEO instructions are resolved by the ProcBEO
  – The local processor clock is updated based on the event being processed (computation or communication)
Demo: CommBEO
- CommBEO instance to model a Tile-Gx36 switch
  – The event token is forwarded based on the destination
  – A virtual machine block informs the local ProcBEO of the CommBEO clock
Demo: Simulation
- Using the App-Proc-Comm BEO stack, we can define and simulate the behavior of a many-core device
  – This simulation has been set up for an 81-core device, of which only 36 cores are active in the simulation
  – Connections between the device cores are described using a routing table
Behavioral Emulation Workflow

Step 1: Calibration
  – Microbenchmarks
    • Computation
    • Communication
Step 2: Validation
  – Kernels
    • 2D matrix multiply (computation, communication)
Step 3: Prediction
  – Kernels on next-gen Tile-Gx72
  – Kernels on anticipated Intel Knights Landing (KNL)

More on performance models in the next talk in this session
CCMT| 13
[Charts: % error vs. transfer size (32-bit words) for simulating Gather, Scatter, and Broadcast; one series each for 2, 4, 8, 16, and 32 cores]
Results: Communication Microbenchmarks
Simulation setup:
– Communication patterns: Tree Broadcast, Naïve Gather, Naïve Scatter
– BEOs modeled: Tilera iMesh network CommBEOs
Observations:
– Simulations under-predict execution time in most cases; calibration can be improved to account for setup overhead
– Accuracy broadly improves with increasing number of cores and transfer size
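The % error plotted here is presumably the relative difference between simulated and measured time; a negative value then means the simulation under-predicts. A minimal sketch, with example values taken from one of the appendix Tile-Gx36 communication tables (512-word transfer, 2 cores):

```python
# Presumed error metric: (simulated - measured) / measured, as a percentage.
# Negative values indicate that the simulation under-predicts execution time.
def percent_error(simulated_us: float, measured_us: float) -> float:
    return 100.0 * (simulated_us - measured_us) / measured_us

# Example values from the appendix (512-word transfer, 2 cores):
print(round(percent_error(16.8, 18.07), 1))  # -7.0
```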
CCMT| 14
Results: Parallel 2D Matrix Multiply
Raw data available in Appendix
[Chart: Prediction Error (%) vs. no. of processor cores (2 to 32), coarse-grained decomposition]
Simulation setup: compute models for matrix multiply, loop overhead, & network parameters
Observations:
– Abstraction of compute details improves simulation accuracy at a one-time cost of training effort
– Accuracy of simulations is a function of domain, no. of samples, & other kriging parameters
[Chart: Prediction Error (%) vs. no. of processor cores (2 to 32), fine-grained decomposition; series for matrix sizes 64x64, 128x128, 256x256, 512x512, and 1024x1024]
Abstraction improves simulation accuracy for all problem sizes
Fewer cores means a larger share of the work is performed by each processor; fine-grained decomposition then incurs more error:
• Computation dominates communication, resulting in high total error
• Error in the dot-product model gets multiplied several times over
![Page 30: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/30.jpg)
CCMT| 15
Performance Prediction
With some confidence in the Behavioral Emulation approach, we can proceed to study next-generation devices
– Ability to evaluate what-if scenarios by changing BEO parameters
1. Tile-Gx72: existing Tilera many-core processor
– Largest device made by Tilera: 72 cores
– Cores in the Tile-Gx72 are identical to cores in the Tile-Gx36
– To simulate the Tile-Gx72, we scale the simulation to 72 ProcBEOs & CommBEOs
2. Knight's Landing (KNL): anticipated Intel many-core processor
– Rumored to have Xeon Phi-type cores with a mesh network
– To simulate the anticipated Knight's Landing:
  • Calibrate Xeon Phi ProcBEOs based on existing Xeon Phi processor cores
  • Use validated CommBEOs developed for the iMesh network
– 64-core device: similar in size to the existing Xeon Phi
– 100-core device: probable size; larger than existing devices
How do we simulate future or notional systems?
Interaction with Fast Forward 2 vendors will provide key information for carrying out useful simulations
CCMT| 16
Results: Prediction for Tile-Gx72 & Intel KNL
[Chart: 2D matrix multiply on Tile-Gx36 vs. Tile-Gx72, execution time (ms, log scale) and speedup for matrix sizes 64x64 to 1024x1024; speedups: 0.88, 1.03, 1.25, 1.55, 1.87]
Larger matrix sizes utilize the Gx72 device better
[Chart: 2D matrix multiply on Tile-Gx36 vs. KNL 64 and KNL 100, execution time (ms, log scale) and speedup (Gx36 vs. KNL64, Gx36 vs. KNL100) for matrix sizes 128x128 to 2048x2048]
Communication overshadows computation on larger KNL100 device, resulting in no speedup over KNL64
A different application algorithm (2D block decomposition) may scale better on the KNL100
Simulation setup (Gx72):
– 72-core Tile device: twice as many cores, laid out in a 9-by-8 mesh

Simulation setup (KNL):
– BEOs modeled: CommBEOs for Tilera iMesh, ProcBEOs for Xeon Phi
![Page 31: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/31.jpg)
CCMT
Identified Issues (1)
17
• How do we systematically collect data?
  – Benchmarking automation is needed
    – For a wide range of app and system parameters, on targeted platforms
    – Benchmark suite with a basic set of prog. methods (e.g., OpenMP, OpenCL, & MPI)
• How do we automate the modeling process?
  – Device-independent techniques
• How do we easily repeat experiments?
  – Need automatic porting of the benchmark suite
    – On new platforms
    – On upgrades of existing platforms
• How do we select practical techniques for interpolation on multi-dimensional data for a given computation?
  – e.g., with matrix multiply: m, n, p, data type, memory affinity
• How do we determine the appropriate granularity for compute decomposition?
  – Multiscale approach
Next talk in this session
CCMT
• Code to skeleton apps
  – Transforming applications from C/C++/Fortran source to high-level instructions on the AppBEO
• Simulation synchronization
  – Global vs. distributed simulator clock
• Scalability of software simulator
  – Message-passing simulator
  – Leverage and integrate SST Macro/Micro
• How do we model CommBEO congestion?
  – Event timing with CommBEO ingress
  – Exploring flit-level, packet-based, flow-based, & hybrid models
18
Identified Issues (2)
We will present our thoughts on these issues in the afternoon session
![Page 32: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/32.jpg)
CCMT
CCMT
Questions?
CCMT| 20
References
System (macro-scale) Simulators– C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky,
and J. Mayo, “A simulator for large-scale parallel architectures” International Journal of Parallel and Distributed Systems, vol. 1, no. 2, pp. 57-73, 2010. SST MACRO
– E. Grobelny, D. Bueno, I. Troxel, A. D. George, and J. S. Vetter, "FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications", Simulation, vol. 83, no. 10, pp. 721-745, Oct. 2007. FASE
– G. Zheng, G. Kakulapati, L. V. Kale, “Bigsim: A parallel simulator for performance prediction of extremely large parallel machines”, 18th IPDPS, pp. 78, 2004. BIGSIM
– A. D. George, R. B. Fogarty, J. S. Markwell, and M. D. Miars, “An Integrated Simulation Environment for Parallel and Distributed System Prototyping”, Simulation, vol. 72, pp. 283-294, May 1999. ISE
– A. Symons, V. L. Narasimhan, "Parsim: message PAssing computeR SIMulator", IEEE First International Conference on Algorithms and Architectures for Parallel Processing (ICAPP), vol. 2, pp. 621-630, 1995. PARSIM
20
![Page 33: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/33.jpg)
CCMT| 21
References
Device (micro-scale) & Node (meso-scale) Simulators– Z. Dong, J. Wang, G. Riley, and S. Yalamanchili, “An Efficient Front-End for Timing-Directed
Parallel Simulation of Multi-Core System”, 7th International ICST Conference on Simulation Tools and Techniques (SIMUTools 2014), March 2014. MANIFOLD
– J. Wang, J. Beu, S. Yalamanchili, and T. Conte. “Designing Configurable, Modifiable and Reusable Components for Simulation of Multicore Systems”, 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, November 2012. MANIFOLD
– M. Hsieh, R. Riesen, K. Thompson, W. Song, A. Rodrigues, "SST: A Scalable Parallel Framework for Architecture-Level Performance, Power, Area and Thermal Simulation", Computer Journal, vol. 55, no. 2, pp. 181-191, 2012. SST MICRO
– M. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, W. Song, "A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration", SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 63-68, 2011. SST MICRO
Object-oriented System Modeling– J. C. Browne, E. Houstis, and J. R. Purdue, “POEMS – End to End Performance Models for
Dynamic Parallel and Distributed Systems”
21
CCMT| 22
References
Hardware Emulation– Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanovi, and D. Patterson, “A Case for
FAME : FPGA Architecture Model Execution”, ISCA’10, June 19–23, 2010, Saint-Malo, France, 290–301.
– J. Wawrzynek, D. A. Patterson, S. Lu, and J. C. Hoe, “RAMP: A Research Accelerator for Multiple Processors”, 2006.
Supercomputer-specific Modeling & Simulation– S. R. Alam, R.F. Barrett, M. R. Fahey, J. M. Larkin, and P.H. Worley, “Cray XT4 : An
Early Evaluation for Petascale Scientific Simulation”, 2007.– A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, “A Performance
Comparison Through Benchmarking and Modeling of Three Leading Supercomputers : Blue Gene / L , Red Storm , and Purple”, (November), 1–10, 2006.
Analytical Modeling – L. Carrington, A. Snavely, and N. Wolter, “A performance prediction framework
for scientific applications”. Future Generation Computer Systems, 22(3), 336–346.
– N. Jindal, V. Lotrich, E. Deumens, B.A. Sanders, and I. Sci, “ SIPMaP : A Tool for Modeling Irregular Parallel Computations in the Super Instruction Architecture”, IPDPS 2013
22
![Page 34: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/34.jpg)
CCMT
APPENDIX
CCMT| 24
AppBEO representation
• Need a representation of applications that the simulator can understand
– AppBEOs are lists of instructions processed by ProcBEOs
– A small and simple description allows easy development
  • Developer does not need to worry about creating working application code
– Intermediate format can be compiled into a format specific to the simulation platform

AppBEO (high-level description)
// Define group as nodes 0-3
VAR commGrp=0:3
// Broadcast matrix A (dataSize=64*64/2) to group
Bcast(int32,2048,0,commGrp)
// Barrier sync
Barrier(commGrp)
// Scatter 1/4 of matrix B (dataSize=(64*64)/(4*2)) to each node
Scatter(int32,512,0,commGrp)
// Perform dot product of vector size 64 of int32
DotProduct(int32,64)
// Gather solutions from matrices (dataSize=(64*64)/(4*2))
Gather(int32,512,commGrp)
Done
Intermediate format (AppBEO for node 0)

send 1 1 129971 1
recv 4
send 2 2 129971 1
recv 8
send 13 1 381 1
recv 12
send 16 1 32420 1
recv 17
send 18 2 32420 1
recv 19
send 20 3 32420 1
recv 21
advt 5753856
Human Readable Intermediate Format (debug mode)
// Bcast(int32,2048,0,commGrp)
send 1 1 129971 1    Send broadcast to node 1
recv 4               Receive acknowledgement for broadcast from node 1
send 2 2 129971 1    Send broadcast to node 2
recv 8               Receive acknowledgement for broadcast from node 2
// Barrier(commGrp)
send 13 1 381 1      Send barrier to node 1
recv 12              Received barrier from node 0
// Scatter(int32,512,0,commGrp)
send 16 1 32420 1    Scatter from master to node 1
recv 17              Receive acknowledgement for scatter from 1
send 18 2 32420 1    Scatter from master to node 2
recv 19              Receive acknowledgement for scatter from 2
send 20 3 32420 1    Scatter from master to node 3
recv 21              Receive acknowledgement for scatter from 3
// DotProduct(int32,64)
advt 5753856         Advance timer for compute time in dot product
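A toy reader for this intermediate format is sketched below. In the real simulator, send/recv timing is resolved by the Proc/CommBEOs; in this hedged sketch only the `advt` (advance-timer) instruction moves the clock, and sends/recvs are merely counted.

```python
# Toy reader for the intermediate format above (illustrative only).
# Only `advt` advances the clock here; send/recv lines are just counted,
# whereas the real simulator would time them via Proc/CommBEO models.
def run(trace: str) -> tuple:
    clock = sends = recvs = 0
    for line in trace.strip().splitlines():
        fields = line.split()
        if fields[0] == "send":
            sends += 1
        elif fields[0] == "recv":
            recvs += 1
        elif fields[0] == "advt":
            clock += int(fields[1])
    return clock, sends, recvs

print(run("send 1 1 129971 1\nrecv 4\nadvt 5753856"))  # (5753856, 1, 1)
```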
![Page 35: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/35.jpg)
CCMT| 25
Compute Microbenchmarks
Sobel Filtering (Tile-Gx36)

image size | testbed (us) | simulation (us) | % error
320x240 | 142161.26 | 144614.40 | 1.73
480x320 | 286130.51 | 289228.80 | 1.08
640x480 | 574425.74 | 578457.60 | 0.70
800x600 | 899691.47 | 903840.00 | 0.46
1024x768 | 1483695.36 | 1480851.46 | -0.19
1280x1024 | 2518175.66 | 2468085.76 | -1.99
1600x1200 | 3618511.74 | 3615360.00 | -0.09
Dot Product (Tile-Gx36)

vector size | testbed (us) | simulation (us) | % error
10 | 0.59 | 3.30 | 458.65
20 | 1.14 | 3.28 | 187.29
30 | 1.68 | 3.25 | 93.67
40 | 2.22 | 3.23 | 45.56
50 | 2.76 | 3.21 | 16.26
60 | 3.3 | 3.30 | -0.05
70 | 3.84 | 3.84 | -0.03
80 | 4.38 | 4.38 | -0.01
90 | 4.92 | 4.86 | -1.27
100 | 5.46 | 4.83 | -11.46
200 | 10.86 | 10.87 | 0.05
300 | 16.27 | 16.27 | -0.03
400 | 21.67 | 21.67 | -0.01
500 | 27.07 | 27.07 | 0.01
600 | 32.48 | 32.47 | -0.03
700 | 37.88 | 37.89 | 0.02
800 | 43.28 | 43.28 | 0.00
900 | 48.69 | 48.69 | 0.00
1000 | 54.09 | 54.08 | -0.02
Matrix Multiply (Tile-Gx36)

matrix size | testbed | sim (fine-grain) | error (fine-grain) | sim (coarse-grain) | error (coarse-grain)
4x4 | 6.56 | 69.312 | 956.59 | 11.365 | 73.25
8x8 | 41.79 | 280.96 | 572.31 | 65.041 | 55.64
16x16 | 312.42 | 1159.424 | 271.11 | 383.281 | 22.68
32x32 | 2446.1 | 4885.504 | 99.73 | 2879.644 | 17.72
64x64 | 19255.99 | 23097.344 | 19.95 | 19374.552 | 0.62
128x128 | 172640.13 | 167739.392 | -2.84 | 153122.202 | -11.31
256x256 | 1379730.55 | 1275658.24 | -7.54 | 1292484.776 | -6.32
512x512 | 10971014.13 | 9933422.592 | -9.46 | 10121067.52 | -7.75
* Execution times are reported in microseconds
CCMT| 26
Communication Microbenchmarks (Tile-Gx36)

Gather (testbed)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
2 | 2.08 | 4.00 | 5.75 | 8.96 | 16.35
4 | 2.18 | 3.71 | 5.99 | 9.12 | 15.63
8 | 2.23 | 3.84 | 6.11 | 9.37 | 15.80
16 | 2.60 | 4.30 | 6.82 | 10.08 | 16.59
32 | 2.97 | 5.22 | 8.37 | 11.53 | 17.98
64 | 3.96 | 7.18 | 11.18 | 15.67 | 26.45
128 | 5.90 | 11.13 | 18.48 | 28.59 | 45.13
256 | 10.49 | 19.75 | 38.63 | 57.60 | 87.11
512 | 18.07 | 36.38 | 71.45 | 105.59 | 193.42
1024 | 35.08 | 68.94 | 136.00 | 204.33 | 306.40
2048 | 67.67 | 134.28 | 265.65 | 400.31 | 598.27
4096 | 132.59 | 265.22 | 526.19 | 789.34 | 1187.09
8192 | 264.97 | 524.96 | 1050.17 | 1574.81 | 2366.27
16384 | 526.41 | 1044.68 | 2101.43 | 3149.43 | 4735.29
32768 | 1040.36 | 2091.25 | 4184.37 | 6292.76 | 9479.93

Gather (simulation)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
2 | 0.763 | 1.526 | 3.053 | 4.578 | 6.867
4 | 0.883 | 1.766 | 3.533 | 5.298 | 7.947
8 | 0.945 | 1.89 | 3.781 | 5.67 | 8.505
16 | 1.181 | 2.362 | 4.725 | 7.086 | 10.629
32 | 1.643 | 3.286 | 6.573 | 9.858 | 14.787
64 | 2.613 | 5.226 | 10.453 | 15.678 | 23.517
128 | 4.547 | 9.094 | 18.189 | 27.282 | 40.923
256 | 8.832 | 17.664 | 35.329 | 52.992 | 79.488
512 | 16.8 | 33.6 | 67.201 | 100.8 | 151.2
1024 | 32.802 | 65.604 | 131.209 | 196.812 | 295.218
2048 | 65.455 | 130.91 | 261.821 | 392.73 | 589.095
4096 | 130.353 | 260.706 | 521.413 | 782.118 | 1173.177
8192 | 259.719 | 519.438 | 1038.877 | 1558.314 | 2337.471
16384 | 5.18E+02 | 1036.382 | 2072.765 | 3109.146 | 4663.719
32768 | 1036.67 | 2073.34 | 4146.681 | 6220.02 | 9330.03

Broadcast (testbed)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
64 | 6.15 | 13.41 | 36.51 | 52.34 | 114.12
128 | 8.73 | 20.62 | 47.45 | 92.92 | 174.96
256 | 12.90 | 32.51 | 72.91 | 148.05 | 306.32
512 | 20.96 | 57.86 | 130.39 | 275.00 | 561.61
1024 | 38.01 | 107.46 | 245.89 | 525.27 | 1075.48
2048 | 72.20 | 208.07 | 480.80 | 1027.08 | 2200.94
4096 | 139.06 | 410.39 | 949.67 | 2029.95 | 4175.83
8192 | 273.31 | 810.71 | 1886.74 | 4014.38 | 8283.43
16384 | 544.03 | 1627.38 | 3772.34 | 8039.68 | 16573.40
32768 | 1086.51 | 3229.03 | 7510.44 | 16008.20 | 33024.70

Broadcast (simulation)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
64 | 2.614 | 7.844 | 18.312 | 39.276 | 81.264
128 | 4.548 | 13.646 | 31.85 | 68.286 | 141.218
256 | 8.833 | 26.501 | 61.845 | 132.561 | 274.053
512 | 16.801 | 50.405 | 117.621 | 252.081 | 521.061
1024 | 32.803 | 98.411 | 229.635 | 492.111 | 1017.123
2048 | 65.456 | 196.37 | 458.206 | 981.906 | 2029.366
4096 | 130.354 | 391.064 | 912.492 | 1955.376 | 4041.204
8192 | 259.72 | 779.162 | 1818.054 | 3895.866 | 8051.55
16384 | 518.192 | 1554.578 | 3627.358 | 7772.946 | 16064.18
32768 | 1036.671 | 3110.015 | 7256.711 | 15550.13 | 32137.03

Scatter (testbed)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
2 | 2.25 | 4.31 | 8.18 | 16.37 | 34.22
4 | 2.35 | 4.55 | 9.04 | 18.12 | 37.93
8 | 2.44 | 4.80 | 9.47 | 19.09 | 38.42
16 | 2.67 | 5.52 | 11.30 | 22.75 | 47.26
32 | 3.09 | 7.02 | 14.76 | 30.07 | 62.53
64 | 4.16 | 9.91 | 21.48 | 44.79 | 92.63
128 | 5.98 | 15.72 | 35.16 | 73.96 | 153.31
256 | 10.58 | 28.98 | 66.20 | 140.57 | 290.44
512 | 18.68 | 53.50 | 124.32 | 264.59 | 545.31
1024 | 34.75 | 103.78 | 239.20 | 511.73 | 1058.20
2048 | 67.97 | 201.70 | 471.07 | 1008.84 | 2084.43
4096 | 133.39 | 400.91 | 934.42 | 2001.22 | 4138.49
8192 | 266.98 | 798.12 | 1858.98 | 3989.01 | 8248.35
16384 | 537.05 | 1593.76 | 3719.23 | 7972.09 | 16501.70
32768 | 1064.45 | 3181.46 | 7435.59 | 15934.10 | 32989.50

Scatter (simulation)
transfer size | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
2 | 1.145 | 3.437 | 8.029 | 17.241 | 35.725
4 | 1.265 | 3.797 | 8.869 | 19.041 | 39.445
8 | 1.327 | 3.983 | 9.303 | 19.971 | 41.367
16 | 1.563 | 4.691 | 10.955 | 23.511 | 48.683
32 | 2.025 | 6.077 | 14.189 | 30.441 | 63.005
64 | 2.995 | 8.987 | 20.979 | 44.991 | 93.075
128 | 4.929 | 14.789 | 34.517 | 74.001 | 153.029
256 | 9.214 | 27.644 | 64.512 | 138.276 | 285.864
512 | 17.182 | 51.548 | 120.288 | 257.796 | 532.872
1024 | 33.184 | 99.554 | 232.302 | 497.826 | 1028.934
2048 | 65.837 | 197.513 | 460.873 | 987.621 | 2041.177
4096 | 130.735 | 392.207 | 915.159 | 1961.091 | 4053.015
8192 | 260.101 | 780.305 | 1820.721 | 3901.581 | 8063.361
16384 | 518.573 | 1555.721 | 3630.025 | 7778.661 | 16075.99
32768 | 1037.052 | 3111.158 | 7259.378 | 15555.85 | 32148.84

All times are reported in microseconds
![Page 36: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/36.jpg)
CCMT| 27
Parallel 2D Matrix Multiply (fine-grained), Tile-Gx36
Columns per thread count: Bcast, Scatter, Compute, Gather, Total

Sim (ms)
matrix size | 2 threads | 4 threads | 8 threads
64x64 | 0.1304 0.0655 11.5241 0.0658 11.7888 | 0.2607 0.0984 5.7620 0.0996 6.2268 | 0.2607 0.1176 2.8810 0.1203 3.6510
128x128 | 0.5182 0.2597 83.8615 0.2601 84.9026 | 1.0364 0.3911 41.9308 0.3922 43.7565 | 1.0364 0.4582 20.9654 0.4609 23.9679
256x256 | 2.0732 1.0367 637.7964 1.0371 641.9463 | 4.1464 1.5546 318.8982 1.5557 326.1610 | 4.1464 1.8181 159.4491 1.8207 171.3914
512x512 | 8.2983 4.1432 4966.5802 4.1436 4983.1683 | 16.5965 6.2196 2483.2901 6.2208 2512.3332 | 16.5965 7.2567 1241.6451 7.2594 1289.3649
1024x1024 | 33.8551 16.6480 39250.2968 16.6484 39317.4514 | 67.7103 24.8948 19625.1484 24.8959 19742.6556 | 67.7103 29.0023 9812.5742 29.0049 10006.0127

Testbed (ms)
matrix size | 2 threads | 4 threads | 8 threads
64x64 | 0.1343 0.0661 9.7008 0.0676 10.0319 | 0.2671 0.1013 4.8422 0.1026 5.3593 | 0.2658 0.1217 2.4253 0.1233 3.2394
128x128 | 0.5338 0.2612 76.2128 0.2679 77.6801 | 1.0638 0.3893 38.0983 0.4019 40.1810 | 1.0642 0.4606 19.1066 0.4737 22.3124
256x256 | 2.1424 1.0479 606.9473 1.0712 614.4742 | 4.2790 1.5803 303.4670 1.6047 312.7736 | 4.2792 1.8725 151.7895 1.8683 165.0370
512x512 | 8.7386 4.4178 4846.7818 4.4391 4890.1128 | 17.4151 6.5211 2422.9984 6.4870 2467.7218 | 17.3390 7.6500 1211.9521 7.4919 1269.4627
1024x1024 | 35.2284 17.6642 38738.2033 17.6520 39021.2287 | 71.3471 26.7471 19369.4312 26.4181 19615.3064 | 71.7815 31.7918 9688.5531 30.6422 9949.3548

% error
matrix size | 2 threads | 4 threads | 8 threads
64x64 | -2.91 -0.94 18.79 -2.61 17.51 | -2.41 -2.82 19.00 -2.98 16.19 | -1.92 -3.35 18.79 -2.47 12.71
128x128 | -2.93 -0.58 10.04 -2.92 9.30 | -2.58 0.45 10.06 -2.41 8.90 | -2.61 -0.52 9.73 -2.70 7.42
256x256 | -3.23 -1.07 5.08 -3.19 4.47 | -3.10 -1.63 5.08 -3.05 4.28 | -3.10 -2.91 5.05 -2.55 3.85
512x512 | -5.04 -6.22 2.47 -6.66 1.90 | -4.70 -4.62 2.49 -4.10 1.81 | -4.28 -5.14 2.45 -3.10 1.57
1024x1024 | -3.90 -5.75 1.32 -5.69 0.76 | -5.10 -6.93 1.32 -5.76 0.65 | -5.67 -8.77 1.28 -5.34 0.57
CCMT| 28
Parallel 2D Matrix Multiply (fine-grained), Tile-Gx36 (16 and 32 threads)
Columns per thread count: Bcast, Scatter, Compute, Gather, Total

Sim (ms)
matrix size | 16 threads | 32 threads
64x64 | 0.2607 0.1326 1.4405 0.1383 2.5149 | 0.2607 0.1412 0.7203 0.1530 2.2305
128x128 | 1.0364 0.4921 10.4827 0.4978 14.6032 | 1.0364 0.5211 5.2413 0.5329 11.0018
256x256 | 4.1464 1.9554 79.7245 1.9611 96.1017 | 4.1464 2.0294 39.8623 2.0412 62.6345
512x512 | 16.5965 7.7729 620.8225 7.7787 686.1851 | 16.5965 8.0516 310.4113 8.0634 401.2533
1024x1024 | 67.7103 31.0982 4906.2871 31.1039 5171.6415 | 67.7103 32.1370 2453.1436 32.1488 2822.1686

Testbed (ms)
matrix size | 16 threads | 32 threads
64x64 | 0.2647 0.1378 1.2140 0.1412 2.3351 | 0.2636 0.1476 0.6238 0.1558 2.2542
128x128 | 1.0654 0.5024 9.5857 0.5109 13.8897 | 1.0551 0.5337 4.8142 0.5525 10.8169
256x256 | 4.2764 2.0088 76.0019 2.0239 93.4666 | 4.2866 2.1775 38.0804 2.1386 62.2750
512x512 | 17.2959 8.2301 606.6027 7.9981 678.9970 | 17.2911 8.7498 303.6608 8.3159 402.1725
1024x1024 | 72.6566 35.4219 4848.8904 32.6373 5164.7570 | 71.9278 37.0273 2427.4457 33.5693 2832.1929

% error
matrix size | 16 threads | 32 threads
64x64 | -1.52 -3.83 18.65 -2.08 7.70 | -1.10 -4.30 15.47 -1.75 -1.05
128x128 | -2.72 -2.05 9.36 -2.55 5.14 | -1.78 -2.37 8.87 -3.55 1.71
256x256 | -3.04 -2.66 4.90 -3.10 2.82 | -3.27 -6.80 4.68 -4.55 0.58
512x512 | -4.04 -5.55 2.34 -2.74 1.06 | -4.02 -7.98 2.22 -3.04 -0.23
1024x1024 | -6.81 -12.21 1.18 -4.70 0.13 | -5.86 -13.21 1.06 -4.23 -0.35
![Page 37: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/37.jpg)
CCMT| 29
Parallel 2D Matrix Multiply (coarse-grained), Tile-Gx36
Columns per thread count: Bcast, Scatter, Compute, Gather, Total

Sim (ms)
matrix size | 2 threads | 4 threads | 8 threads
64x64 | 0.1304 0.0655 9.7517 0.0658 10.0164 | 0.2607 0.0984 4.7197 0.0996 5.1845 | 0.2607 0.1176 2.2172 0.1203 2.9872
128x128 | 0.5182 0.2597 76.2505 0.2601 77.2916 | 1.0364 0.3911 40.2694 0.3922 42.0952 | 1.0364 0.4582 18.8820 0.4609 21.8845
256x256 | 2.0732 1.0367 652.5564 1.0371 656.7064 | 4.1464 1.5546 318.1159 1.5557 325.3787 | 4.1464 1.8181 167.3377 1.8207 179.2800
512x512 | 8.2983 4.1432 5043.3792 4.1436 5059.9673 | 16.5965 6.2196 2508.1097 6.2208 2537.1528 | 16.5965 7.2567 1271.9594 7.2594 1319.6792
1024x1024 | 33.8551 16.6480 8935.1544 16.6484 9002.3090 | 67.7103 24.8948 5254.7738 24.8959 5372.2809 | 67.7103 29.0023 2503.3172 29.0049 2696.7557

Testbed (ms)
matrix size | 2 threads | 4 threads | 8 threads
64x64 | 0.1343 0.0661 9.7008 0.0676 10.0319 | 0.2671 0.1013 4.8422 0.1026 5.3593 | 0.2658 0.1217 2.4253 0.1233 3.2394
128x128 | 0.5338 0.2612 76.2128 0.2679 77.6801 | 1.0638 0.3893 38.0983 0.4019 40.1810 | 1.0642 0.4606 19.1066 0.4737 22.3124
256x256 | 2.1424 1.0479 606.9473 1.0712 614.4742 | 4.2790 1.5803 303.4670 1.6047 312.7736 | 4.2792 1.8725 151.7895 1.8683 165.0370
512x512 | 8.7386 4.4178 4846.7818 4.4391 4890.1128 | 17.4151 6.5211 2422.9984 6.4870 2467.7218 | 17.3390 7.6500 1211.9521 7.4919 1269.4627
1024x1024 | 35.2284 17.6642 38738.2033 17.6520 39021.2287 | 71.3471 26.7471 19369.4312 26.4181 19615.3064 | 71.7815 31.7918 9688.5531 30.6422 9949.3548

% error
matrix size | 2 threads | 4 threads | 8 threads
64x64 | -2.91 -0.94 0.52 -2.61 -0.15 | -2.41 -2.82 -2.53 -2.98 -3.26 | -1.92 -3.35 -8.58 -2.47 -7.78
128x128 | -2.93 -0.58 0.05 -2.92 -0.50 | -2.58 0.45 5.70 -2.41 4.76 | -2.61 -0.52 -1.18 -2.70 -1.92
256x256 | -3.23 -1.07 7.51 -3.19 6.87 | -3.10 -1.63 4.83 -3.05 4.03 | -3.10 -2.91 10.24 -2.55 8.63
512x512 | -5.04 -6.22 4.06 -6.66 3.47 | -4.70 -4.62 3.51 -4.10 2.81 | -4.28 -5.14 4.95 -3.10 3.96
1024x1024 | -3.90 -5.75 -76.93 -5.69 -76.93 | -5.10 -6.93 -72.87 -5.76 -72.61 | -5.67 -8.77 -74.16 -5.34 -72.90
CCMT| 30
Parallel 2D Matrix Multiply (coarse-grained), Tile-Gx36 (16 and 32 threads)
Columns per thread count: Bcast, Scatter, Compute, Gather, Total

Sim (ms)
matrix size | 16 threads | 32 threads
64x64 | 0.2607 0.1326 1.1213 0.1383 2.1956 | 0.2607 0.1412 0.6697 0.1530 2.1799
128x128 | 1.0364 0.4921 9.2816 0.4978 13.4021 | 1.0364 0.5211 5.4840 0.5329 11.2444
256x256 | 4.1464 1.9554 78.8943 1.9611 95.2714 | 4.1464 2.0294 41.5034 2.0412 64.2756
512x512 | 16.5965 7.7729 652.3573 7.7787 717.7198 | 16.5965 8.0516 328.8056 8.0634 419.6477
1024x1024 | 67.7103 31.0982 1611.9654 31.1039 1877.3199 | 67.7103 32.1370 1611.8781 32.1488 1980.9031

Testbed (ms)
matrix size | 16 threads | 32 threads
64x64 | 0.2647 0.1378 1.2140 0.1412 2.3351 | 0.2636 0.1476 0.6238 0.1558 2.2542
128x128 | 1.0654 0.5024 9.5857 0.5109 13.8897 | 1.0551 0.5337 4.8142 0.5525 10.8169
256x256 | 4.2764 2.0088 76.0019 2.0239 93.4666 | 4.2866 2.1775 38.0804 2.1386 62.2750
512x512 | 17.2959 8.2301 606.6027 7.9981 678.9970 | 17.2911 8.7498 303.6608 8.3159 402.1725
1024x1024 | 72.6566 35.4219 4848.8904 32.6373 5164.7570 | 71.9278 37.0273 2427.4457 33.5693 2832.1929

% error
matrix size | 16 threads | 32 threads
64x64 | -1.52 -3.83 -7.64 -2.08 -5.97 | -1.10 -4.30 7.37 -1.75 -3.29
128x128 | -2.72 -2.05 -3.17 -2.55 -3.51 | -1.78 -2.37 13.91 -3.55 3.95
256x256 | -3.04 -2.66 3.81 -3.10 1.93 | -3.27 -6.80 8.99 -4.55 3.21
512x512 | -4.04 -5.55 7.54 -2.74 5.70 | -4.02 -7.98 8.28 -3.04 4.35
1024x1024 | -6.81 -12.21 -66.76 -4.70 -63.65 | -5.86 -13.21 -33.60 -4.23 -30.06
![Page 38: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/38.jpg)
CCMT| 31
Parallel Sobel Filtering, Tile-Gx36
Columns per core count: Scatter, Compute_Gx, Compute_Gy, Gather, Total

Testbed (ms)
image size | 2 cores | 4 cores | 8 cores
320x240 | 0.308 35.666 36.179 0.630 72.464 | 0.481 17.848 18.210 0.942 37.351 | 0.576 8.923 9.131 1.097 19.737
480x320 | 0.620 71.616 72.651 1.265 145.456 | 0.956 35.740 36.495 1.877 74.774 | 1.137 17.857 18.265 2.174 39.408
640x480 | 1.245 142.968 146.393 2.542 292.448 | 1.908 71.638 73.206 3.773 149.754 | 2.262 35.795 36.614 4.359 79.069
800x600 | 1.950 223.252 229.986 3.966 458.540 | 2.977 112.061 114.598 5.951 234.704 | 3.509 56.025 57.241 6.828 123.262
1024x768 | 3.227 365.944 377.464 6.498 752.560 | 4.864 183.752 189.415 9.681 385.713 | 5.728 91.946 93.859 11.290 202.016
1280x1024 | 5.419 611.279 661.535 10.860 1286.660 | 8.098 306.565 335.381 16.132 660.897 | 9.519 153.350 159.848 18.831 338.492

Simulation (ms)
image size | 2 cores | 4 cores | 8 cores
320x240 | 0.306 35.750 36.557 0.604 73.223 | 0.463 17.875 18.278 0.902 37.531 | 0.550 8.938 9.139 1.044 19.695
480x320 | 0.610 71.501 73.114 1.210 146.440 | 0.920 35.750 36.557 1.808 75.047 | 1.086 17.875 18.278 2.096 39.360
640x480 | 1.219 143.002 146.227 2.422 292.875 | 1.833 71.501 73.114 3.623 150.083 | 2.156 35.750 36.557 4.201 78.688
800x600 | 1.903 223.440 228.480 3.784 457.613 | 2.861 111.720 114.240 5.667 234.501 | 3.355 55.860 57.120 6.580 122.939
1024x768 | 3.114 366.084 374.342 6.209 749.755 | 4.683 183.042 187.171 9.289 384.197 | 5.485 91.521 93.585 10.802 201.418
1280x1024 | 5.190 610.140 623.903 10.370 1249.609 | 7.796 305.070 311.951 15.498 640.328 | 9.127 152.535 155.976 18.024 335.687

Error %
image size | 2 cores | 4 cores | 8 cores
320x240 | -0.58 0.24 1.04 -4.11 1.05 | -3.69 0.15 0.38 -4.18 0.48 | -4.63 0.16 0.09 -4.79 -0.21
480x320 | -1.67 -0.16 0.64 -4.31 0.68 | -3.78 0.03 0.17 -3.69 0.37 | -4.46 0.10 0.08 -3.58 -0.12
640x480 | -2.13 0.02 -0.11 -4.72 0.15 | -3.94 -0.19 -0.13 -3.97 0.22 | -4.69 -0.12 -0.16 -3.62 -0.48
800x600 | -2.43 0.08 -0.65 -4.57 -0.20 | -3.88 -0.30 -0.31 -4.77 -0.09 | -4.39 -0.30 -0.21 -3.64 -0.26
1024x768 | -3.50 0.04 -0.83 -4.44 -0.37 | -3.72 -0.39 -1.18 -4.05 -0.39 | -4.25 -0.46 -0.29 -4.32 -0.30
1280x1024 | -4.23 -0.19 -5.69 -4.52 -2.88 | -3.72 -0.49 -6.99 -3.93 -3.11 | -4.11 -0.53 -2.42 -4.28 -0.83
CCMT| 32
Parallel Sobel Filtering, Tile-Gx36 (16 and 32 cores)
Columns per core count: Scatter, Compute_Gx, Compute_Gy, Gather, Total

Testbed (ms)
image size | 16 cores | 32 cores
320x240 | 0.654 4.460 4.577 1.186 10.946 | 0.748 2.233 2.288 1.257 6.625
480x320 | 1.262 8.954 9.147 2.356 21.731 | 1.398 4.470 4.586 2.450 13.028
640x480 | 2.483 17.889 18.340 4.680 43.487 | 2.698 8.971 9.218 4.850 25.876
800x600 | 3.843 28.010 28.677 7.312 68.064 | 4.141 14.013 14.386 7.610 40.337
1024x768 | 6.249 45.989 46.921 12.012 111.072 | 6.648 23.021 23.493 12.437 65.822
1280x1024 | 10.356 76.749 79.732 20.314 185.698 | 10.961 38.438 39.749 21.044 110.190

Simulation (ms)
image size | 16 cores | 32 cores
320x240 | 0.605 4.469 4.570 1.098 10.790 | 0.664 2.234 2.285 1.084 6.366
480x320 | 1.187 8.938 9.139 2.219 21.532 | 1.270 4.469 4.570 2.230 12.636
640x480 | 2.346 17.875 18.278 4.454 43.003 | 2.493 8.938 9.139 4.507 25.175
800x600 | 3.639 27.930 28.560 6.982 67.160 | 3.849 13.965 14.280 7.092 39.284
1024x768 | 5.925 45.761 46.793 11.485 110.012 | 6.234 22.880 23.396 11.694 64.303
1280x1024 | 9.839 76.268 77.988 19.200 183.343 | 10.305 38.134 38.994 19.592 107.123

Error %
image size | 16 cores | 32 cores
320x240 | -7.49 0.20 -0.16 -7.46 -1.42 | -11.14 0.07 -0.13 -13.77 -3.91
480x320 | -5.93 -0.19 -0.08 -5.81 -0.92 | -9.11 -0.03 -0.35 -9.00 -3.01
640x480 | -5.53 -0.08 -0.33 -4.83 -1.11 | -7.61 -0.37 -0.86 -7.07 -2.71
800x600 | -5.31 -0.29 -0.41 -4.52 -1.33 | -7.06 -0.34 -0.74 -6.81 -2.61
1024x768 | -5.18 -0.50 -0.27 -4.39 -0.95 | -6.22 -0.61 -0.41 -5.97 -2.31
1280x1024 | -4.99 -0.63 -2.19 -5.49 -1.27 | -5.98 -0.79 -1.90 -6.90 -2.78
![Page 39: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/39.jpg)
CCMT| 33
Parallel 2D Matrix Multiply (fine-grained)

time (ms): matrix size | Tile-Gx36 (Bcast, Scatter, Compute, Gather, Total) | Tile-Gx72 (Bcast, Scatter, Compute, Gather, Total)
64x64 0.26 0.16 0.64 0.17 2.32 0.26 0.19 0.20 0.21 2.64
128x128 1.04 0.59 4.66 0.60 11.08 1.04 0.63 1.59 0.66 10.73
256x256 4.15 2.29 35.43 2.30 60.81 4.15 2.33 12.64 2.36 48.52
512x512 16.60 9.09 275.92 9.10 377.14 16.60 9.26 100.93 9.28 244.03
1024x1024 67.71 36.28 2180.57 36.30 2591.75 67.71 36.79 806.49 36.82 1388.01
Tile-Gx72
CCMT| 34
Parallel Sobel Filtering
Image size | Tile-Gx36 | Tile-Gx72 | Speedup
320x240 | 6.13 | 4.67 | 1.31
480x320 | 12.32 | 8.49 | 1.45
640x480 | 23.80 | 16.20 | 1.47
800x600 | 37.57 | 25.81 | 1.46
1024x768 | 60.67 | 41.31 | 1.47
1280x1024 | 100.37 | 66.88 | 1.50
1600x1200 | 146.77 | 98.55 | 1.49
Tile-Gx72
Time (ms): Image size | Tile-Gx36 (Scatter, Compute_Gx, Compute_Gy, Gather, Total) | Tile-Gx72 (Scatter, Compute_Gx, Compute_Gy, Gather, Total)
320x240 0.74 2.01 2.06 1.21 6.13 0.92 1.12 1.14 1.27 4.67
480x320 1.38 4.17 4.26 2.40 12.32 1.60 2.09 2.13 2.46 8.49
640x480 2.66 8.04 8.23 4.77 23.80 2.97 4.02 4.11 4.87 16.20
800x600 4.05 12.85 13.14 7.42 37.57 4.46 6.70 6.85 7.57 25.81
1024x768 6.50 20.74 21.20 12.12 60.67 7.05 10.73 10.97 12.35 41.31
1280x1024 10.67 34.32 35.09 20.18 100.37 11.43 17.16 17.55 20.52 66.88
1600x1200 15.43 50.27 51.41 29.55 146.77 16.37 25.70 26.28 29.99 98.55
![Page 40: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/40.jpg)
CCMT| 35
Parallel 2D Matrix Multiply (fine-grained)
matrix size | Tile-Gx36 (ms) | Gx72 (ms) | KNL 64 (ms) | KNL 100 (ms) | Speedup (Gx36 vs KNL64) | Speedup (Gx36 vs KNL100)
128x128 | 11.08 | 10.73 | 8.54 | 10.50 | 1.30 | 1.05
256x256 | 60.81 | 48.52 | 33.60 | 41.20 | 1.81 | 1.48
512x512 | 377.14 | 244.03 | 135.00 | 165.00 | 2.79 | 2.29
1024x1024 | 2591.75 | 1388.01 | 555.00 | 673.00 | 4.67 | 3.85
2048x2048 | 18000.00 | -- | 2270.00 | 2721.73 | 7.93 | 6.61
Time (ms): matrix size | KNL 64 (Bcast, Scatter, Compute, Gather, Total) | KNL 100 (Bcast, Scatter, Compute, Gather, Total)
128x128 | 1.04 0.56 0.07 0.58 8.54 | 1.04 0.47 0.05 0.51 10.50
256x256 | 4.15 2.07 0.38 2.09 33.60 | 4.15 1.74 0.24 1.77 41.20
512x512 | 16.60 8.21 2.28 8.24 135.00 | 16.60 6.77 1.46 6.81 165.00
1024x1024 | 67.70 32.60 15.10 32.70 555.00 | 67.70 26.90 9.69 26.90 673.00
CCMT| 36
Parallel Sobel Filtering, KNL 64 & KNL 100

time (ms): Image size | KNL64 (Scatter, Compute_Gx, Compute_Gy, Gather, Total) | KNL100 (Scatter, Compute_Gx, Compute_Gy, Gather, Total)
320x240 | 0.76 0.12 0.12 1.02 2.22 | 1.05 0.10 0.10 1.31 2.86
480x320 | 1.55 0.26 0.26 2.44 4.72 | 1.77 0.16 0.16 2.50 4.90
640x480 | 2.66 0.49 0.49 4.37 8.22 | 3.20 0.35 0.35 4.90 9.11
800x600 | 4.38 0.80 0.80 7.56 13.70 | 4.28 0.49 0.49 6.68 12.30
1024x768 | 6.55 1.27 1.27 11.50 20.80 | 7.43 0.87 0.87 12.43 21.90
1280x1024 | 10.80 2.11 2.11 19.40 34.60 | 11.95 1.37 1.37 20.61 35.60
1600x1200 | 15.60 3.09 3.09 28.70 50.70 | 16.00 1.98 1.98 28.30 48.60
1920x1080 | 16.70 3.34 3.34 31.20 54.80 | 17.97 2.22 2.22 32.55 55.30

Image size | Tile-Gx36 | KNL 64 | KNL 100 | Speedup (Gx36 vs KNL64) | Speedup (Gx36 vs KNL100)
320x240 | 6.13 | 2.22 | 2.86 | 2.76 | 2.15
480x320 | 12.32 | 4.72 | 4.9 | 2.61 | 2.51
640x480 | 23.80 | 8.22 | 9.11 | 2.90 | 2.61
800x600 | 37.57 | 13.7 | 12.3 | 2.74 | 3.05
1024x768 | 60.67 | 20.8 | 21.9 | 2.92 | 2.77
1280x1024 | 100.37 | 34.6 | 35.6 | 2.90 | 2.82
1600x1200 | 146.77 | 50.7 | 48.6 | 2.89 | 3.02
1920x1080 | 158.00 | 54.8 | 55.3 | 2.88 | 2.86

All times are reported in milliseconds
![Page 41: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/41.jpg)
CCMT| 1
Performance Modeling
• Calibration data is used to develop interpolation models that predict execution time: the performance models
– Data have varying dimension (one-dimensional: Dot Product; multi-dimensional: Matrix Multiply)
• We use Kriging for multi-dimensional interpolation
– More about performance models in the next talk of this session
execution_time = f()Train interpolation model
Training/calibration data Predicted execution time
Estimate for test inputs
Exceedserror threshold?Experimental testbed,
Cycle-accurate Device Simulator,Fast Forward 2 vendors,
etc.
How do we generate timestamps for internal events?
What about send and receive events?
CCMT| 2
Performance Modeling

Motivation:
• Behavioral emulation requires that simulators not perform cycle-accurate (or otherwise complex) operations
• However, simulators must still know the time required for an operation (e.g., matrix multiply) based on its input sizes (e.g., [M, N, P] = [256, 100, 42])

Goals:
• Estimate values of specific non-integral numerical parameters (e.g., computation time) before or during simulation, without access to the real generators (e.g., the target CPU) of those parameters
  – Methods that produce these parameters are surrogates for the real generators
  – Any access to the target platforms is assumed to occur strictly in advance of the simulation
• Determine the efficacy of candidate models in multi-dimensional domains
• Perform uncertainty analysis to determine the degree to which estimation introduces error into the simulator
![Page 42: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/42.jpg)
CCMT| 3
Performance Modeling

Approach:
• On the target platform or simulator, take a number of representative samples (of the parameter of interest) within the expected domain
• Using these samples, interpolate any other needed values just prior to, or during, the simulation

Kriging:
• A product of the geostatistics community, used for many-dimensional sparse interpolation
• Other (likely less accurate) options:
  – Radial basis functions
  – Nearest-neighbor
  – Convex-hull linear interpolation

Universal Kriging:
• Inputs: variogram (spatial relationship of the data), polynomial degree, samples
• Internal: computes weights for each of the samples
• Outputs: interpolated values, estimate of variance at interpolation points
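To make the weight computation concrete, here is a minimal sketch of ordinary Kriging (the simpler relative of the universal Kriging above, with a constant rather than polynomial trend). The Gaussian variogram model and its parameters are illustrative assumptions, not the variogram actually fitted in this work:

```python
import numpy as np

def gaussian_variogram(h, sill=1.0, rng=1.0):
    # Illustrative variogram: dissimilarity grows from 0 toward `sill`
    # over a characteristic distance `rng`
    return sill * (1.0 - np.exp(-(h / rng) ** 2))

def ordinary_krige(xs, ys, x_star, sill=1.0, rng=1.0):
    """Predict the value and Kriging variance at x_star from 1-D samples."""
    n = len(xs)
    # Kriging system: pairwise variogram values between samples, bordered
    # by a row/column of ones for the Lagrange multiplier (unbiasedness)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gaussian_variogram(np.abs(xs[:, None] - xs[None, :]), sill, rng)
    A[n, :n] = A[:n, n] = 1.0
    b = np.append(gaussian_variogram(np.abs(x_star - xs), sill, rng), 1.0)
    w = np.linalg.solve(A, b)          # sample weights + Lagrange multiplier
    estimate = w[:n] @ ys              # weighted combination of sample values
    variance = w @ b                   # estimation variance at x_star
    return estimate, variance
```

At a sample location the weights collapse onto that sample, so Kriging interpolates it exactly with zero estimated variance; between samples the variance grows, which is the uncertainty estimate mentioned above.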
CCMT| 4
Performance Modeling

Aside: What do our data look like? (1 of 3: Easy)
• Matrix multiplication is relatively easy, but still non-trivial, to interpolate
  – Banding near sizes that are additive combinations of powers of two
  – Bands can be 2-10x slower than their neighbors
  – Likely due to cache particularities of the system
• Example details:
  – Platform: x86 Ivy Bridge quad-core
  – Single core used
  – Triple-loop textbook method
  – C, GCC, -O2
![Page 43: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/43.jpg)
CCMT| 5
Performance Modeling

Aside: What do our data look like? (2 of 3: Hard)
• FFTs (FFTW) are the most difficult benchmarks in the set
  – Computation time is strongly related to how composite the input size is
  – Adjacent samples can jump by more than an order of magnitude
  – Interpolating to an average error fraction even just below 1 is difficult
• Example details:
  – Platform: x86 Ivy Bridge quad-core
  – Single core used
  – FFTW
  – C, GCC, -O2
CCMT| 6
Performance Modeling

Aside: What do our data look like? (3 of 3: Other)
• CUDA BLAS DGEMM is different
  – Divided into blocks
  – Not symmetrical about the diagonal
  – Partitions are differently sized
• Example details:
  – Platform: Quadro K600 GPU
![Page 44: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/44.jpg)
CCMT| 7
Performance Modeling

Approach: Interpolation Process (Part 1 of 4)

Step One: obtain ordered, but slightly randomized, samples within the domain
• Randomization prevents aliasing along areas of unusual time
• Determining the number of samples to use is one of the research goals
• Even sampling may not be ideal, but doing otherwise requires a priori knowledge we do not have
• Example details:
  – These samples cover 0.17% of the domain, a relatively small amount
  – Notably, they mostly missed the bands of high computation time
    • This will matter at evaluation time
    • It may, or may not, be a good thing
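Step One can be sketched as jittered grid sampling: evenly spaced points, each nudged by a bounded random offset, so the set stays ordered but no longer lines up with periodic features of the data (such as the power-of-two bands). The jitter fraction and seed below are illustrative choices:

```python
import numpy as np

def jittered_grid(lo, hi, n, jitter=0.4, seed=0):
    """Evenly spaced sample points, each perturbed by a small random offset.

    `jitter` is the maximum offset as a fraction of the grid spacing; keeping
    it below 0.5 guarantees the samples remain in increasing order.
    """
    rng = np.random.default_rng(seed)
    step = (hi - lo) / n
    centers = lo + step * (np.arange(n) + 0.5)   # regular grid of cell centers
    offsets = rng.uniform(-jitter, jitter, n) * step
    return np.clip(centers + offsets, lo, hi)
```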
CCMT| 8
Performance Modeling

Approach: Interpolation Process (Part 2 of 4)

Step Two: using those samples, construct an interpolation model
• After entering these parameters, we have a model that can be substituted for the real data in the simulator (the surrogate)
• It will not be perfect, as it is subject to both the samples and the parameters
• For example: this model loses most of the banding of the original data
![Page 45: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/45.jpg)
CCMT| 9
Performance Modeling

Approach: Interpolation Process (Part 3 of 4)

Step Three: evaluate the model with another dataset and an error metric
• Test data are completely random within the domain
• Our error metric: average mean-squared fractional error
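One plausible reading of that metric (the slides do not spell out the formula, so the exact definition below is an assumption) is the mean of the squared relative errors over the random test points:

```python
import numpy as np

def mean_squared_fractional_error(predicted, actual):
    # Mean of the squared relative (fractional) errors; dimensionless, so
    # benchmarks with very different absolute run times stay comparable.
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(((predicted - actual) / actual) ** 2)
```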
CCMT| 10
Performance Modeling

Approach: Interpolation Process (Part 4 of 4)

Step Four: if the error was too great, go back to Step One, but with different parameters
• Obtaining good Kriging parameters (without much knowledge of the underlying processes) is difficult, so we let computers do it for us
  – Specifically, the Step Four-to-Step One transition (revising parameters as needed) is handled by genetic algorithms (GAs)
    • GAs simulate the evolutionary process by allowing only the best genomes (solutions) to pass their genes on to the next generation
    • The process is computationally intensive, as many Kriging evaluations are required, but a near-optimal solution is likely to result
  – The GAs stop only when a good-enough solution is found, which may take up to several days
• After all iterations, we can incorporate the model into the simulator
• Good Kriging parameters must be found for many different sample fractions, so that we know how many samples are needed
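The GA-driven Step Four-to-Step One loop can be sketched generically. Truncation selection, blend crossover, and the population settings below are illustrative defaults, not the project's actual GA configuration; the `fitness` argument stands in for the Kriging model's error on held-out test points:

```python
import random

def evolve(fitness, bounds, pop_size=20, generations=40, seed=1):
    """Minimal GA: keep the best quarter of each generation unchanged,
    fill the rest with blended, mutated children of those survivors.
    `fitness` is the quantity to minimize; `bounds` is a (lo, hi) pair
    per genome coordinate."""
    rnd = random.Random(seed)
    pop = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 4]          # best genomes survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rnd.sample(elite, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]      # blend crossover
            child = [min(hi, max(lo, g + rnd.gauss(0, 0.1 * (hi - lo))))
                     for g, (lo, hi) in zip(child, bounds)]  # bounded mutation
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)
```

For example, minimizing f(x, y) = (x - 3)^2 + (y + 1)^2 drives the population toward (3, -1); in the real workflow the genome would instead hold variogram and sampling parameters.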
![Page 46: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/46.jpg)
CCMT| 11
Demo: Modeling on x86

There are a number of tunable parameters which must be selected for each stage:
• Sampling
  – Domain of interest
  – Sampling strategy (linear, random, logarithmic, etc.)
  – Number of points
  – Noise-reduction strategy
• Model construction
  – Interpolation method (Kriging, RBF, etc.)
  – Method parameters
• Error quantification
  – Number of evaluation points
  – Evaluation sampling method
  – Error metric
CCMT| 12
Performance Modeling (Results)
Single-Dimensional Benchmarks (One Input Parameter)
Multi-Dimensional Benchmarks (Two and Three Input Parameters)
![Page 47: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/47.jpg)
CCMT| 13
Performance Modeling (Results)
Kriging versus Nearest-Neighbor
• Kriging outperforms nearest-neighbor interpolation in almost all cases (values greater than unity)
• There is little or no improvement for FFT
• For the high-algorithmic-complexity algorithms, Kriging is much better
• Kriging's advantage grows as the sampling becomes more sparse
CCMT| 14
Identified Issues
• Kriging requires the selection of a number of interpolation parameters, the choice of which is not obvious
• We must select a sampling strategy (or set of strategies), each with a number of parameters to be picked
• To effectively model functions like the FFTs, an alternative to Kriging must be used
  – This may be a form of domain partitioning used in conjunction with Kriging, or something entirely different
• Performing Kriging can be cumbersome on some platforms, so it may be excluded from run-time use (which may not be necessary)
![Page 48: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/48.jpg)
CCMT
Research Thrust
Synchronization & Congestion
CCMT| 2
Synchronization & Congestion

Motivation:
• The current BE infrastructure ensures causality but is not optimized for large-scale simulation
• Communication events in BE abstract away the details of network packet transfers, making robust congestion modeling necessary

Goal: Adapt synchronization and congestion-modeling techniques to support simulation experiments with millions of behavioral objects
[Figure: simulator landscape, arranged by granularity against scalability; approximate grouping from the slide layout – fine-grained: Gem5, SimpleScalar, Veloce, Palladium, RAMP, GPGPUSim; mid-range: MMT, Manifold, SST Micro; coarse-grained: POEMS, FASE, SST Macro, FSim, BigSim, ROSS, Parsim, with analytical methods at the coarse extreme.]

Disclaimer: This is a first attempt at classifying different types of simulators. Our goal is to understand the unique features and associated advantages of each simulator to inform adoption choices.
Can we find events that can be executed in parallel?
![Page 49: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/49.jpg)
CCMT| 3
Expressing Causality
• Maintaining causality ensures correctness, and exposing concurrency helps determine the maximum amount of parallelism in the simulation
• Causal history – all events in an event's past that can affect its outcome
  – The causal history is enough to determine the ordering of events
• Lamport clocks are not consistent with real time, but are consistent with causality
  – Partial ordering of events by assigning timestamps using a logical clock
• Vector time is consistent with causality and can fully characterize it
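The key property can be sketched with a few lines of vector-clock bookkeeping (the process count, ids, and two-process scenario in the test are illustrative): component-wise comparison of vector timestamps decides happened-before exactly, which scalar Lamport clocks cannot do.

```python
def vc_tick(clock, me):
    # Local or send event on process `me`: increment its own component
    c = list(clock)
    c[me] += 1
    return c

def vc_recv(clock, msg_clock, me):
    # Receive event: component-wise max with the message's timestamp,
    # then tick the receiver's own component
    merged = [max(a, b) for a, b in zip(clock, msg_clock)]
    return vc_tick(merged, me)

def happened_before(u, v):
    # u -> v iff u <= v component-wise and u != v; stamps that are
    # incomparable in this partial order belong to concurrent events
    return all(a <= b for a, b in zip(u, v)) and u != v
```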
CCMT
Do we need complete causal history of an event?
CCMT| 4
Search for an Efficient Representation

• The size of time vectors is a limiting factor for scalable simulation
  – Several methods exist for compressing message timestamps, trading off simulation speed, storage, and complete ordering of events
  – Utilize only direct dependencies, or use causal distributed breakpoints
• Creating concurrent regions may be cheaper than tracking full causality
  – A dependence block can be regarded as a single, atomic event
  – This approach offers some scope for abstracting execution details
• There may also be merit in focusing only on the state transitions
Can we sacrifice on accuracy to speed up simulations?
What are our options for synchronizing execution of these events?
![Page 50: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/50.jpg)
CCMT| 5
Causality & Event Synchronization
• Non-aggressive vs. aggressive
  – Conservative vs. optimistic
  – Hybrids
• “Limited” optimistic
  – Window-based: events with timestamps within some agreed-upon window are executed between synchronizations
  – Space-based: LPs are divided into clusters; each cluster executes optimistically, while interaction between clusters is conservative
  – Penalty-based: LPs are either penalized and blocked, or favored and not blocked, depending on rollback behavior
  – Knowledge-based: optimistic execution; a broadcast message is sent out if an error is detected
  – Probabilistic: periodic probabilistic synchronization of LPs
  – State-based: optimism is continuously adjusted based on local state information
• Application dependent
How is event synchronization handled in existing simulators?
Conservative approach (pessimistic estimates): avoid all causality errors
Optimistic approach (detection & recovery): allow errors, “roll back” to recover
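As a toy model of the window-based scheme listed above: all LPs meet at a barrier, agree on the global minimum timestamp, and then each executes only its events falling within the agreed window before the next barrier. Event tuples and the sequential driver loop are illustrative; real LPs would run concurrently between barriers:

```python
def window_simulate(event_queues, window):
    """Window-based synchronization sketch. `event_queues` holds one list
    of (timestamp, name) events per LP; each round, events with timestamps
    in [t_min, t_min + window) may execute without further coordination."""
    executed = []
    queues = [sorted(q) for q in event_queues]   # copies; originals untouched
    while any(queues):
        t_min = min(q[0][0] for q in queues if q)   # barrier: agree on window
        for q in queues:
            while q and q[0][0] < t_min + window:
                executed.append(q.pop(0))
    return executed
```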
CCMT| 6
Synchronization Schemes
• The majority of coarse-grained simulators use conservative schemes*
  – BigSim is trace-driven, ParSim is time-driven, and FSim lacks a timing model
  – SST uses a component-based back-end with a global queue and MPI barriers
• ROSS uses the optimistic Time Warp protocol for event ordering
• We envision adopting one approach, or a combination of several, tuned for BE
  – Sync scheme fixed, or selected by the user per their needs
  – Schemes for different scales of simulation or simulation platforms
  – Multi-pass simulation
* List of references at end
How can we specialize and tune these approaches for BE?
How is event synchronization handled in existing simulators?
![Page 51: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/51.jpg)
CCMT| 7
Parameters for Scalable PDES Design
• Various options available to explore:
  – Partitioning: how should LPs be clustered?
  – Adaptability: how does the simulator change based on state?
  – Aggressiveness: how much conditional knowledge should be processed?
  – Accuracy: how much error can be tolerated?
  – Risk: how far should a potentially incorrect message be propagated?
  – Synchrony: what is the degree of temporal binding or coupling of LPs?
  – Knowledge embedding: how much knowledge of an LP's behavioral attributes is embedded in the simulation?
  – Knowledge dissemination/acquisition: how much does an LP initiate transmission/requests of information to/from other LPs?
How can we specialize and tune these approaches for BE?
But the choice of synchronization mechanism will dictate our methods and our ability to model congestion on the network.
CCMT| 8
Congestion Modeling
• Many recent simulators and frameworks have low-level network models
  – SST Micro uses high-fidelity component models to simulate CMP and SMP systems
  – SST/Macro uses packet-level, flow-based, and hybrid train models for system networks
  – FSim and xSim use fine-grained network models
  – FSim, xSim, and BigSim also allow high-level latency models alongside detailed models of the communication fabric
• Explore existing congestion models for use in Behavioral Emulation
  – Fine-grained vs. coarse-grained packet-level models
  – Analytical flow-based or train-based models
  – Queuing-theory models
  – System-specific models derived from experiments
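One of the candidate families above, queuing-theory models, can be illustrated with the classic M/M/1 latency formula; the parameters are illustrative, and this is not a model drawn from any of the cited simulators:

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean sojourn time W = 1 / (mu - lambda) for an M/M/1 queue: a
    simple analytical stand-in for link congestion, where latency blows
    up as offered load approaches link capacity."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: offered load >= capacity")
    return 1.0 / (service_rate - arrival_rate)
```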
How can we specialize and tune these approaches for BE?
![Page 52: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/52.jpg)
CCMT| 9
Congestion Modeling
• Based on tradeoff studies, devise congestion models for BE
  – Adopt one approach, or a combination of several, tuned for BE
  – For the most promising approaches, study variations in congestion behavior and scalability with the choice of synchronization scheme
• Tuning for Behavioral Emulation
  – Congestion model fixed, or selected by the user per their needs
  – Different models for different simulation levels (on-chip, inter-node, inter-rack)
  – Models for different levels of the simulation platform (exploiting different levels of parallelism)
• The choice of synchronization mechanisms and congestion models can significantly affect the design and speed of emulation in hardware
CCMT
Questions?
![Page 53: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/53.jpg)
CCMT| 11
References
System (macro-scale) Simulators– C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky,
and J. Mayo, “A simulator for large-scale parallel architectures” International Journal of Parallel and Distributed Systems, vol. 1, no. 2, pp. 57-73, 2010. SST MACRO
– E. Grobelny, D. Bueno, I. Troxel, A.D. George, and J.S. Vetter, “FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications, Simulation”, Simulation, Vol. 83, No. 10, pp. 721-745, Oct. 2007. FASE
– G. Zheng, G. Kakulapati, L. V. Kale, “Bigsim: A parallel simulator for performance prediction of extremely large parallel machines”, 18th IPDPS, pp. 78, 2004. BIGSIM
– A. D. George, R. B. Fogarty, J. S. Markwell, and M. D. Miars, “An Integrated Simulation Environment for Parallel and Distributed System Prototyping”, Simulation, vol. 72, pp. 283-294, May 1999. ISE
– A. Symons and V. L. Narasimhan, "Parsim-message PAssing computeR SIMulator," IEEE First International Conference on Algorithms and Architectures for Parallel Processing (ICAPP), vol. 2, pp. 621-630, 1995. PARSIM
CCMT| 12
References
Synchronization in PDES
– R. Schwarz and F. Mattern, “Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail”, Distributed Computing, Vol. 7, No. 3, pp. 149-174, 1994.
– F. Mattern, “Virtual Time and Global States in Distributed Systems”, Proc. Workshop on Parallel and Distributed Algorithms, Chateau de Bonas, Oct. 1988, M. Cosnard et al. (eds.), Elsevier/North Holland, pp. 215-226, 1989.
– L. Lamport. “Time, Clocks, and the Ordering of Events in a Distributed System”. Communications of the ACM,Vol. 21, No. 7, pp. 558-565, July 1978.
– C.J. Fidge. “Logical Time in Distributed Computing Systems”. IEEE Computer, Vol. 24, No. 8, pp. 28-33, Aug. 1991.
– P.C. Bates and J.C. Wileden. “High-Level Debugging of Distributed Systems: The Behavioral Abstraction Approach”. Journal of Systems and Software, Vol. 4, No. 3, pp. 255-264, Dec. 1983.
![Page 54: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/54.jpg)
CCMT| 13
References
Device (micro-scale) & Node (meso-scale) Simulators– Z. Dong, J. Wang, G. Riley, and S. Yalamanchili, “An Efficient Front-End for Timing-Directed
Parallel Simulation of Multi-Core System”, 7th International ICST Conference on Simulation Tools and Techniques (SIMUTools 2014), March 2014. MANIFOLD
– J. Wang, J. Beu, S. Yalamanchili, and T. Conte. “Designing Configurable, Modifiable and Reusable Components for Simulation of Multicore Systems”, 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, November 2012. MANIFOLD
– M. Hsieh, R. Riesen, K. Thompson, W. Song, and A. Rodrigues, “SST: A Scalable Parallel Framework for Architecture-Level Performance, Power, Area and Thermal Simulation”, Computer Journal, vol. 55, no. 2, pp. 181-191, 2012. SST MICRO
– M. Hsieh, A. Rodrigues, R. Riesen, K. Thompson, and W. Song, “A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration”, SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 63-68, 2011. SST MICRO
Object-oriented System Modeling– J. C. Browne, E. Houstis, and J. R. Purdue, “POEMS – End to End Performance Models for
Dynamic Parallel and Distributed Systems”
CCMT| 14
References
Hardware Emulation
– Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson, “A Case for FAME: FPGA Architecture Model Execution”, ISCA ’10, June 19-23, 2010, Saint-Malo, France, pp. 290-301.
– J. Wawrzynek, D. A. Patterson, S. Lu, and J. C. Hoe, “RAMP: A Research Accelerator for Multiple Processors”, 2006.
Supercomputer-specific Modeling & Simulation– S. R. Alam, R.F. Barrett, M. R. Fahey, J. M. Larkin, and P.H. Worley, “Cray XT4 : An
Early Evaluation for Petascale Scientific Simulation”, 2007.– A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, “A Performance
Comparison Through Benchmarking and Modeling of Three Leading Supercomputers : Blue Gene / L , Red Storm , and Purple”, (November), 1–10, 2006.
Analytical Modeling – L. Carrington, A. Snavely, and N. Wolter, “A performance prediction framework
for scientific applications”. Future Generation Computer Systems, 22(3), 336–346.
– N. Jindal, V. Lotrich, E. Deumens, and B. A. Sanders, “SIPMaP: A Tool for Modeling Irregular Parallel Computations in the Super Instruction Architecture”, IPDPS 2013.
![Page 55: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/55.jpg)
CCMT| 1
Scalability Methods

Goal: Reduce simulator execution time by exploiting domain-specific simplifying assumptions

• Intended to supplement other simulation efforts by providing techniques and methods that can be used (primarily) prior to simulation time to speed up the simulators
• Here, we focus on the simplifying assumptions that follow from behavioral emulation (as opposed to general DES problems):
  – “We need not be concerned with the result, just the time it took to get there”
  – “We need not be concerned with handling application-specified non-determinism (RNGs)”
  – “We need not be concerned with handling run-time conditional operations”
    • Acceptable: if (my_rank != 0) { do_something(); }
    • Not acceptable: if (current_time >= 1.5) { do_something(); }
• There may be other sets of simplifying assumptions that follow from being concerned mostly with one application (CMT-related) and one machine size (exascale)
CCMT| 2
Scalability Methods

• Current methods of interest:
  – Global task-graph manipulation (done at compile time; see the compilation overview below)
    • Generate a global task graph
    • Manipulate this task graph to reduce the total number of tasks required
  – Micro-scale symmetry exploitation
    • Find code blocks (within a simulation process) that are isomorphic to blocks in other simulation processes
    • Publish these code blocks to a “cache” (or model them in advance) to avoid repetition at the micro scale
[Diagram: compilation flow – Application Description, Machine Description, and Machine Configuration enter a Parser, producing an Abstract Syntax Tree; a Code Generator emits Process Code; a Task-Graph Generator produces the Task Graph, which a Task-Graph Modifier transforms; a Task-Graph Code Generator then emits Abstract Process Code to the simulator.]
![Page 56: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/56.jpg)
CCMT| 3
Task-Graph Manipulation

Example AppBEO:

    define :: dataSize << 128*1024/mpi.maxrank >>
    define :: me << mpi.myrank >>

    import(mpi)
    mpi.perform("initialize", [])
    for (i, [1..30]) {
        if ( (me % 2) = 0 ) {
            mpi.send(me + 1, dataSize)
            mpi.recv(me - 1, dataSize)
            mpi.send(me - 1, dataSize)
            mpi.recv(me + 1, dataSize)
        }
        if ( (me % 2) = 1 ) {
            mpi.recv(me - 1, dataSize)
            mpi.send(me + 1, dataSize)
            mpi.recv(me + 1, dataSize)
            mpi.send(me - 1, dataSize)
        }
        mpi.perform("fft2d", [1024, 128])
    }
    mpi.perform("finalize", [])

• To statically generate a global task graph for all processes, a limited language is necessary (example AppBEO above)
• The language can support:
  – Unconditional looping
  – Compile-time-evaluated conditionals
  – Function calls without side effects
  – Basic macros
• The language must avoid:
  – Variable manipulation and assignment (all expressions must be evaluated at compile time)
  – Run-time conditional statements
  – Random number generators
  – Anything that gives the program a state beyond its location in the instruction stream
• These are acceptable compromises, as we require only a description language, not a Turing-complete one
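Because such a program has no run-time state, it can be fully unrolled at compile time into a flat per-rank task list. A hypothetical expander for the AppBEO-style example above (the task tuples and naming are illustrative, not the project's actual task-graph generator):

```python
def expand_rank(me, maxrank, iters=30):
    """Statically unroll the example for one rank: every loop bound and
    conditional depends only on compile-time constants (me, maxrank), so
    the whole program flattens into a list of task tuples."""
    data = 128 * 1024 // maxrank
    tasks = [("perform", "initialize")]
    for _ in range(iters):
        if me % 2 == 0:   # even ranks send first, mirroring the example
            tasks += [("send", me + 1, data), ("recv", me - 1, data),
                      ("send", me - 1, data), ("recv", me + 1, data)]
        else:             # odd ranks receive first
            tasks += [("recv", me - 1, data), ("send", me + 1, data),
                      ("recv", me + 1, data), ("send", me - 1, data)]
        tasks.append(("perform", "fft2d", 1024, 128))
    tasks.append(("perform", "finalize"))
    return tasks
```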
CCMT| 4
Task-Graph Manipulation
[Figure: Original Graph; Vertical Combination; Block Combination]

• High-level process:
  – Produce the global graph (about 10^12 entries for a substantial application on an exascale machine), or produce a highly symmetric compressed task graph (if permitted by the application)
  – Section off the graph at a natural hardware level (e.g., per node)
  – Apply combination and simplification rules to reduce the number of graph nodes (the figure shows one section of a node with two cores)
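The vertical-combination rule can be sketched as a rewrite on a successor map: a task with exactly one successor, where that successor has exactly one predecessor, absorbs it. The dictionary encoding and merge policy below are an illustrative stand-in for the actual task-graph modifier:

```python
def vertical_combine(succ):
    """Collapse linear chains in a task graph given as {task: [successors]}.

    Merging is only legal when the edge is the child's sole incoming edge,
    so forks and joins are left untouched."""
    preds = {n: 0 for n in succ}
    for outs in succ.values():
        for m in outs:
            preds[m] = preds.get(m, 0) + 1
    changed = True
    while changed:
        changed = False
        for n in list(succ):
            outs = succ.get(n, [])
            if len(outs) == 1 and outs[0] != n and outs[0] in succ \
                    and preds.get(outs[0], 0) == 1:
                m = outs[0]
                succ[n] = succ.pop(m)   # n absorbs m and inherits m's edges
                del preds[m]
                changed = True
                break
    return succ
```

A pure chain collapses to a single task, while a fork-join (diamond) is left unchanged because the join has two predecessors.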
![Page 57: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/57.jpg)
CCMT| 5
Micro-Scale Symmetry Exploitation

• Task-graph simplification will produce multi-operation blocks that require new performance models
• We generate (at compile time, or run time) a model for each such block, then use the model without re-simulating the block's internals
  – The block is modeled based on the relative timing of its input signals
  – A block is limited to roughly 4-6 inputs before this modeling becomes untenable
• This method is only helpful if there is enough symmetry to find many isomorphic blocks in the simulation (likely)
  – Likely because each node in a large application tends to do very similar things to the other nodes

[Figure: Simplified Graph; Model for Multi-Operation Block]
CCMT| 6
Issues and Conclusions
• Known plausible limitations to task-graph simplification:
  – Must be able to generate the global task graph (may not be feasible due to time constraints)
  – Must be able to manipulate this graph effectively and in a timely manner
  – The application must be written such that the task graph can be simplified
• Known plausible limitations to micro-scale symmetry exploitation:
  – The application must be written such that there is enough symmetry to exploit across processes (likely)
  – It must be possible to generate models fairly quickly
• Conclusions:
  – If feasible, these methods could permit huge simulator speedups for large machines
  – They may also permit less accurate, but very fast, analytical solutions for determining the run time of a simulated machine|application pair
![Page 58: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/58.jpg)
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
LLNL-PRES-xxxxxx
Lawrence Livermore National Laboratory
![Page 59: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/59.jpg)
[Figure: FP Ops, Aluminum]
![Page 60: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/60.jpg)
[Screenshot: input data from multiple measurements; available visualization domains; selected visualization: 3D Torus; choice of mappings: present data on nodes or links?; drag selection to map data to visualization]
![Page 61: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/61.jpg)
![Page 62: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/62.jpg)
![Page 63: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/63.jpg)
![Page 64: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/64.jpg)
[Figure: sp−mz benchmark under a 4500 W power bound — runtime (seconds) as a function of Nodes (8−32), Cores (4−16), and Processor Power Bound (51, 65, 80, 95, 115 Watts); annotations mark max power per processor (avg. watts) and the best configuration]
![Page 65: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/65.jpg)
![Page 66: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/66.jpg)
![Page 67: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/67.jpg)
[Figure: ParaDiS timesteps 1-500 — critical path node, max load imbalance, and common case]
![Page 68: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/68.jpg)
[Figure: two power traces — Watts (0-3000) versus Time (0-400)]
![Page 69: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/69.jpg)
![Page 70: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/70.jpg)
CCMT
Hardware Software Co-design of CMT-Nek Codes: Performance, Energy and Thermal Issues
Tania Banerjee and Sanjay Ranka, Computer and Information Science and Engineering
CCMTT5 2
Long Term Goals
10^6 10^7 10^8 10^9 cores
• Parallelization and UQ of Rocflu and CMT-Nek beyond a million cores
• Parallel Performance and Load Balancing
• Single Processor (Hybrid) Performance
• Energy Management and Thermal Issues
![Page 71: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/71.jpg)
CCMTT5 3
Hybrid Multicores: Performance, Energy and Thermal Management
10^1 10^2 10^3 10^4 cores
� Code Generation for hybrid cores
─ Support for multiple types of cores
─ Support for vectorization
� Multi-objective optimization
─ Energy
─ Performance
� Thermal Constraints
CCMTT5 4
Hybrid Multicores, Performance, Energy and Thermal Management
� Multiple elements can be optimized for energy
─ Processor (Dynamic Voltage Scaling)
─ Caches (Dynamic Cache Reconfiguration)
─ Buses
─ Memory
� Multi-objective optimization
─ Energy
─ Performance
� Multiple constraints
─ Thermal issues
─ Packaging issues
[Figure: feasible space in the Energy-Time plane, with candidate operating points A and B]
![Page 72: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/72.jpg)
CCMT| 5
Multi-core processors
Intel 48-core Processor
The number of cores is growing
8 cores AMD Opteron 16-Core Processor
Nvidia Kepler Processor
Nvidia Fermi GPU Nvidia Maxwell GPU
CCMTT5 6
Single Processor Performance (Hybrid Multicores)

Multiple Cores
Common
� Multiple flows of control
� Multiple local memories
Differences
� Synchronization
� Communication

GPU Cores
Common
� Single/multiple flows of control
� Multiple local memories
Differences
� Amount of local memory
� Communication
![Page 73: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/73.jpg)
CCMT| 7
Our Work
Optimization goals and environments:
Performance (P): execution time, throughput, makespan
Energy (E): total energy consumed, energy due to leakage power
Temperature (T): maximum temperature, average temperature, spatial and temporal gradients
CCMT8
Performance, Energy and Thermal Levers
L1 Cache Reconfiguration
L2 Cache Reconfiguration
DVS of Cores
DVS of Buses
![Page 74: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/74.jpg)
CCMT| 9
Our work
Develop an integrated framework for multicore machines that addresses:
1. Computation
2. Energy
3. Temperature
This raises challenging multi-objective optimization and system issues.
CCMT| 10
Spectral Element Method
� u_r(i, j, k) = Σ_l A(i, l) · u(l, j, k)
� u_s(i, j, k) = Σ_l u(i, l, k) · B(l, j)
� u_t(i, j, k) = Σ_l u(i, j, l) · C(l, k)
� If Nx = Ny = Nz = N
– Then B = C = A^T
� Complexity: O(N^4)
� N is typically between 5 and 25
– A large number of small matrix multiplications
[Figure: physical coordinates (x, y, z) map to reference coordinates (r, s, t)]
The derivative computing kernel requires 75-80% of the total execution time of CMT-Nek.
![Page 75: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/75.jpg)
CCMT| 11
Spectral Elements:Derivatives and codes
� Similarly, 5loop versions and 5loop-fused versions were considered.
Algorithm: dudr-4loop
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      do l = 1, Nx
        dudr(i, j, k) = dudr(i, j, k) + a(i, l) * u(l, j, k, ie)
      enddo
    enddo
  enddo
enddo

Algorithm: dudr-4loop-fused
do k = 1, Nz * Ny
  do i = 1, Nx
    do l = 1, Nx
      dudr(i, k) = dudr(i, k) + a(i, l) * u(l, k, ie)
    enddo
  enddo
enddo
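Both variants compute the same contraction: the fused form simply treats (j, k) as one long axis, which turns the kernel into a single matrix product. A NumPy sketch (array names and shapes are illustrative, not taken from CMT-Nek) that checks the two forms agree:

```python
import numpy as np

N = 10          # Nx = Ny = Nz
ie = 0          # a single element
rng = np.random.default_rng(0)
a = rng.standard_normal((N, N))         # 1-D derivative operator
u = rng.standard_normal((N, N, N, 1))   # field values for one element

# Naive 4-loop reference, mirroring dudr-4loop.
dudr_ref = np.zeros((N, N, N))
for k in range(N):
    for j in range(N):
        for i in range(N):
            for l in range(N):
                dudr_ref[i, j, k] += a[i, l] * u[l, j, k, ie]

# Fused view: collapse (j, k) into one axis and do a single matrix product.
dudr_fused = (a @ u[:, :, :, ie].reshape(N, N * N)).reshape(N, N, N)

print(np.allclose(dudr_ref, dudr_fused))  # True
```

This is why the fused variants are attractive: one large GEMM replaces many small inner loops.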
CCMT| 12
Optimizations
� Autotuning
– Apply loop transformations
  • Loop permutation
  • Loop unroll
– CHiLL applies loop transformations automatically on the target code
Related Work: C. Chen, J. Chame, M.W. Hall, CHiLL: A Framework for Composing High-Level Loop Transformations, Technical Report 08-897, University of Southern California, Computer Science Department, 2008.
![Page 76: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/76.jpg)
CCMT| 13
Loop permutation
� Simple when you have perfect nesting
do k = 1, nz1
  do j = 1, ny1
    do i = 1, nx1
      statement
    enddo
  enddo
enddo

do i = 1, nx1
  do j = 1, ny1
    do k = 1, nz1
      statement
    enddo
  enddo
enddo

do i = 1, nx1
  do k = 1, nz1
    do j = 1, ny1
      statement
    enddo
  enddo
enddo

do j = 1, ny1
  do k = 1, nz1
    do i = 1, nx1
      statement
    enddo
  enddo
enddo

do j = 1, ny1
  do i = 1, nx1
    do k = 1, nz1
      statement
    enddo
  enddo
enddo

do k = 1, nz1
  do i = 1, nx1
    do j = 1, ny1
      statement
    enddo
  enddo
enddo
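Since the body is identical in every variant, the six orders are just the permutations of the three loop indices; a one-line sanity check in Python:

```python
from itertools import permutations

# With perfect nesting, the loop order over (k, j, i) can be chosen
# freely; 3! = 6 orders, matching the six variants above.
orders = list(permutations(("k", "j", "i")))
print(len(orders))  # 6
```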
CCMT| 14
Loop unroll
� Duplicate the loop body
� Adjust the loop header and data indexes
� May be applied to outer-level loops too
� Unroll factors are preferably divisors of the iteration space
� Reduces the number of limit checks for the iterator
� Exposes the possibility of vectorization to the back-end compiler
– c(i:i+4, j, k) = a(j, i:i+4) * b(i:i+4, k)
� Code size increases, which may result in higher I-cache miss rates

do k = 1, 10
  do j = 1, 10
    do i = 1, 10
      c(i, j, k) = a(j, i) * b(i, k)
    enddo
  enddo
enddo

do k = 1, 10
  do j = 1, 10
    do i = 1, 10, 2
      c(i, j, k) = a(j, i) * b(i, k)
      c(i+1, j, k) = a(j, i+1) * b(i+1, k)
    enddo
  enddo
enddo
![Page 77: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/77.jpg)
CCMT| 15
Possible Combinations
Number of implementations for Nx = Ny = Nz = 10:
= 4! * 4^4
= 24 * 256 = 6144 variants
Total number of variants = 98240

Algorithm: dudr-4loop
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      do l = 1, Nx
        dudr(i, j, k) = dudr(i, j, k) + a(i, l) * u(l, j, k, ie)
      enddo
    enddo
  enddo
enddo
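The 6144 figure follows directly from the structure of the search space: 4! orders of the four loops times four unroll-factor choices per loop. A quick check (the assumption that each loop has exactly four candidate unroll factors is mine, inferred from the 4^4 term):

```python
from math import factorial

loops = 4                           # k, j, i, l in dudr-4loop
unroll_choices = 4                  # candidate unroll factors per loop (assumed)

orders = factorial(loops)           # 24 loop permutations
unrolls = unroll_choices ** loops   # 256 unroll-factor combinations
variants = orders * unrolls
print(variants)  # 6144
```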
CCMT| 16
CHiLL example
Input code:
Algorithm: dudr-4loop-fused
do k = 1, Nz * Ny
  do i = 1, Nx
    do l = 1, Nx
      dudr(i, k) = dudr(i, k) + a(i, l) * u(l, k, ie)
    enddo
  enddo
enddo

CHiLL script:
permute([2, 1, 3])
unroll(0, 1, 1)
unroll(0, 2, 2)
unroll(0, 3, 5)

Output code: (not shown)
![Page 78: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/78.jpg)
CCMT| 17
Genetic Algorithm
� We use genetic algorithms to search the exploration space efficiently.
� “Hello world!” example
– Start with an arbitrary 12-character string
– Goal is to generate “Hello world!”
– Probability of coming up with the target string in one try: 1/95^12
A "Hello World!" Genetic Algorithm Example by James Matthews athttp://www.generation5.org/content/2003/gahelloworld.asp
CCMT| 18
GA Characteristics
� Individuals
– Each individual consists of a 12-letter string
– Each individual has a fitness value
� Fitness function
– Sum of the distance of each letter from the target letter
� Population
– Individuals make up a population
– Population size = 2048 for this problem
![Page 79: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/79.jpg)
CCMT| 19
GA Operators
� Operators are used to create new individuals from existing ones
– This creates a new generation
� Possible operations
– Crossover
  • Parent 1: GWTc')kv2%8@
  • Parent 2: 4K)?vM^pE`Yp
  • Result: GK)?vM^pE`Yp
– Mutation
  • Random point mutation
  • GK)?vM^pE`Zp
CCMT| 20
Genetic Algorithm
� Algorithm:
Initialize population
Do i = 1, max_iter
  calculate fitness of population
  sort population based on fitness
  print the member p with the best fitness
  if p.fitness is 0 then break
  apply crossover and mutation operations on pairs of members to create a new population
Enddo
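The pseudocode above translates almost line-for-line into Python. The sketch below is a minimal implementation; the population size, elite fraction, and mutation rate are my choices, not the slide's 2048-member setup:

```python
import random

TARGET = "Hello world!"
CHARS = [chr(c) for c in range(32, 127)]  # 95 printable ASCII characters

def fitness(s):
    # Sum of distances of each letter from the target letter (0 is perfect).
    return sum(abs(ord(a) - ord(b)) for a, b in zip(s, TARGET))

def crossover(p1, p2):
    # Single-point crossover: prefix of one parent, suffix of the other.
    cut = random.randrange(1, len(TARGET))
    return p1[:cut] + p2[cut:]

def mutate(s):
    # Random point mutation: replace one character with a random one.
    i = random.randrange(len(s))
    return s[:i] + random.choice(CHARS) + s[i + 1:]

def evolve(pop_size=400, max_iter=500, elite=40, mutation_rate=0.25):
    pop = ["".join(random.choice(CHARS) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(max_iter):
        pop.sort(key=fitness)
        if fitness(pop[0]) == 0:
            break  # exact match found
        parents = pop[:elite]
        children = []
        while len(children) < pop_size - elite:
            child = crossover(random.choice(parents), random.choice(parents))
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        pop = parents + children  # elitism: the best survive unchanged
    return min(pop, key=fitness)

random.seed(0)
best = evolve()
print(best, fitness(best))
```

With elitism, the best fitness never worsens from one generation to the next, so the run steadily converges toward the target even though the individual operators are random.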
![Page 80: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/80.jpg)
CCMT| 21
Genetic Algorithm
� Output:– Best: RV[S`(yxj)p! (188)– Best: 8mkCrJvrhsT& (153)– Best: Yakor7yiIvg (132)– Best: GvthH$vrryU" (106)– Best: BiXpb wqwXg& (82)– Best: Sdmul0wqwXe' (75)– Best: J]ndm"wqwvl% (53)– Best: ?_jyk"uonnk (52)– Best: J]ndm"wqwac! (43)– Best: J]ndm"wquqg& (42)– Best: Chkmo"vtuqg& (34)– Best: Gllho wpuig! (22)– Best: Hdmul wqqmf" (21)
– Best: Hdmul wqqmf" (21)– Best: Ldnlp wqqmf" (15)– Best: Ldnlp wqqmf" (15)– Best: Jckop wqrnc! (14)– Best: Hejlp"wqrlg" (11)– Best: Ifklp wqrnc! (9)– Best: Ifklp wqrnc! (9)– Best: Hejlm wprnc! (8)– Best: Jenlo wprld! (5)– Best: Genlo wprld! (4)– Best: Genlo wprld! (4)– Best: Hellp wormd" (3)– Best: Helko wprld! (2)
– Best: Helko wprld! (2)
– Best: Helmo world! (1)
– Best: Hello wormd! (1)
– Best: Hello world! (0)
CCMT| 22
Genetic Algorithm
� Iterations: 30
� Total number of individuals studied: 30 * 2048 = 61440
![Page 81: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/81.jpg)
CCMT| 23
GA (Our application)
� Individuals
– Base code
– Permutation sequence
– Maximum of 5 unroll factors
CCMT| 24
GA (Our application)
� Fitness function
– Time
– Energy
� Population size
– 100 individuals
� Operators
– Mutation
– Crossover
![Page 82: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/82.jpg)
CCMT| 25
GA (Our application)
� Mutation
CCMT| 26
GA (Our application)
� Crossover
![Page 83: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/83.jpg)
CCMT| 27
GA (Our application)
� As a result of mutation and crossover, certain incompatibilities may arise.
� Example: the base code of individual P mutates to 4loop
CCMT| 28
GA (Our application)
� Knowledge-based GA: not 100% random
– Inclusion of the target individual in the initial population
  • Target individual = CMT-Nek multiplication algorithm
– Crossover: loop permutation has a greater role in performance, so inherit the loop permutation sequence from the better-performing parent
– Fixing incompatibilities in the crossover operation
![Page 84: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/84.jpg)
CCMT| 29
GA (Our application)
[Flowchart: Input n → generate initial population → for i = 1..n: generate the algorithm for the ith individual, compile and run the matrix multiplication, set the ith individual's fitness value → sort individuals → report the best individual → if the stopping criterion is met, stop; otherwise create a new generation and repeat]
CCMT| 30
GA (Our application)
� Stopping criteria
– The last three generations result in the same best individual
– A pre-defined maximum number of iterations is reached
– Improvement in performance of the best individual compared to the target individual is x%
  • x is set dynamically
  • First 5 iterations, x = 60%
  • Next 5 iterations, x = 50%
  • …
  • Next 5 iterations, performance of the best individual is better than that of the target individual
![Page 85: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/85.jpg)
CCMT| 31
GA Output
� Matrix size 12:
Itr: 1 Best Algo: 4loop Unroll factors: 3 1 2 4 Permute: 4235 Fitness: 4.57044
Itr: 2 Best Algo: 4loopfused Unroll factors: 6 6 3 12 Permute: 423 Fitness: 3.33914
Itr: 3 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236
Itr: 4 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236
Itr: 5 Best Algo: 4loop Unroll factors: 3 1 1 2 Permute: 4235 Fitness: 3.15236
CCMT| 32
Performance And Energy
� Software implementations:
– CMT-Nek
– 4loop version
– 4loop-fused version
– 5loop version
– 5loop-fused version
� CPU platforms:
– IBM Blue Gene/Q
– AMD Opteron 6378

Performance and Energy Benchmarking of Spectral Element Solvers, Tania Banerjee and Sanjay Ranka (in preparation)
![Page 86: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/86.jpg)
CCMT| 33
Architectures
� BG/Q node
– Cores: 16
– Each core: 4-way SMT, 1.6 GHz
– 204.8 GFLOPS peak performance
– 55 W peak power
� Dell 6145 node
– 4 AMD Opteron CPUs
– Each CPU: 16 cores, 2.4 GHz
– 614.4 GFLOPS peak performance
– 115 W peak power
CCMT| 34
IBM BG/Q (Performance)
� Comparable number of total nodes, but performance with 10x10x10 matrix size is 20% better than with 16x16x16
� Matrix size 10x10x10, 100 elements
� 51% improvement versus CMT-Nek (~ 2 times)
� 34 GFLOPS average
� Matrix size: 16x16x16, 25 elements
� 61% improvement versus CMT-Nek (~ 2.53 times)
� 12.7 GFLOPS average
[Two bar charts: runtime (seconds) of the dudr, dudt and duds derivatives for CMT-Nek, 5loop-fused, 4loop and 4loop-fused, one chart per matrix size]
![Page 87: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/87.jpg)
CCMT| 35
AMD Opteron (Performance)
� Matrix size: 10x10x10, 100 elements
� GNU compilers
� 43% improvement versus CMT-Nek (~1.72 times)
� 209 GFLOPS
� Matrix size: 16x16x16, 25 elements
� GNU compilers
� 42% improvement versus CMT-Nek (~1.73 times)
� 80 GFLOPS
[Two bar charts: runtime (seconds) of the dudr, dudt and duds derivatives for CMT-Nek, 5loop, 5loop-fused, 4loop and 4loop-fused, one chart per matrix size]
� Comparable number of total nodes, but performance with 10x10x10 matrix size is 10% better than with 16x16x16
CCMT| 36
Energy Measurements on IBM BG/Q
� Environmental Monitoring (EMON) APIs
� MonEQ wrapper library
– Reports power consumption by domains
– Utilizes interrupts for more frequent current/voltage readings
– APIs:
  • MonEQ_Initialize
  • MonEQ_Finalize
  • MonEQ_StartPowerTag
  • MonEQ_EndPowerTag
Related Work: S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, M. E. Papka, Measuring Power Consumption on IBM Blue Gene/Q, IEEE International Symposium on Parallel and Distributed Processing Workshops, 2013.
![Page 88: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/88.jpg)
CCMT| 37
BG/Q Power Domains
� Power is measured on a node-board basis
� Power Domains:
– Core logic power
– Chip memory interface and SDRAM-DDR3
– Optical module power
– Optical module power + PCIExpress
– HSS network transceiver power for Compute + Link Chip
– Link chip core power
– Core array power
CCMT| 38
Monitoring Power
� Power consumed by the basic dudt-4loop implementation for matrix size 10x10x10
� After the initial start, power consumption is constant for all domains
![Page 89: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/89.jpg)
CCMT| 39
Energy versus Performance Plots
[Four scatter plots of Energy (Joules) versus Runtime (seconds): dudt 4loop-fused; dudr 4loop; dudt 4loop; dudt 5loop-fused]
CCMT| 40
IBM BG/Q (Energy)
� Observations:
– Matrix size 10x10x10, 100 elements
– 55% reduction in energy versus CMT-Nek
[Bar charts: energy consumption (Joules) and runtime (seconds) by derivative for CMT-Nek, 5loop-fused, 4loop and 4loop-fused]
![Page 90: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/90.jpg)
CCMT| 41
IBM BG/Q (Energy)
� Observations:
– Matrix size 16x16x16, 25 elements
– 56.8% reduction in energy versus CMT-Nek
– Consumes 40% more energy compared to the 10x10x10 case
[Bar charts: energy consumption (Joules) and runtime (seconds) of dudr, dudt and duds for CMT-Nek, 5loop-fused, 4loop and 4loop-fused]
CCMT| 42
GA Results
� Hipergator (Performance)
� Teller@Sandia (Energy)
– 104-node cluster
– AMD Fusion A10-5800K
– 4 cores operating at 3.8 GHz
– Used PowerInsight to measure power
Related Work: J. H. Laros III, P. Pokorny, and D. DeBonis, PowerInsight - A Commodity Power Measurement Capability, The Third International Workshop on Power Measurement and Profiling, in conjunction with IEEE IGCC 2013, 2013.
![Page 91: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/91.jpg)
CCMT| 43
Results (Hipergator)
CCMT| 44
Results (Hipergator)
• Between 9.7% and 38.6% improvement; average improvement of 28.2%
• Maximum improvement for N=12
![Page 92: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/92.jpg)
CCMT| 45
Results (SNL)
• Between 27% and 45% improvement; average improvement of 37%
• Maximum improvement for N=8 and N=12
CCMT| 46
Results (SNL)
• Between 23% and 45% improvement; average improvement of 34%
• Average power consumption is about the same for the various implementations across different matrix sizes
• Improvement in energy consumption heavily reflects improvement in runtime
![Page 93: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/93.jpg)
CCMT| 47
Energy Versus Performance
[Scatter plot "Energy Versus Performance": Energy (Joules) versus Runtime (seconds)]
CCMT| 48
Energy Versus Performance
[Scatter plot "Energy Versus Performance": Energy (Joules) versus Runtime (seconds)]
![Page 94: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/94.jpg)
CCMT| 49
Integration with CMT-Nek
Algorithm:
For all spectral elements do
  Compute dudr
  Compute duds
  Compute dudt
  Populate du for the (x, y, z) coordinate system using dudr, duds and dudt above
Enddo

File: navier1.f
Subroutine: conv1
Inputs: du, u — where du represents a derivative matrix populated in the subroutine and u is the function matrix
CCMT| 50
GPU Architecture
Tesla K20c:
– 13 processors
– 192 cores
– 48k shared memory
– 64k registers
– 1170 GFLOP/s peak
![Page 95: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/95.jpg)
CCMT| 51
GPU Implementation
▪ Optimizations:
– The derivative operator matrices D and DT are brought from device memory to shared memory only once per block.
– The derivative operator matrices D and DT are stored in registers instead of shared memory.
Related work: C. Jhurani, P. Mullowney, A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices, Journal of Parallel and Distributed Computing, September 2014.
CCMT| 52
GPU Performance
▪ Compared with:
– CUGEMM
– `Combined': computation of dudr, duds and dudt happens in one kernel, reusing function data as much as possible
– `Separate': computation of dudr, duds and dudt happens in separate kernels, independent of each other
▪ Platform: Tesla K20c
![Page 96: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/96.jpg)
CCMT| 53
GPU (Performance)
▪ Observations
– Performance increases nearly linearly with matrix size
– Over 180 GFLOPS for matrix size 16x16x16
– 39% improvement versus CUGEMM for matrix size 16x16x16
CCMT| 54
Energy Modeling
▪ nvidia-smi gives instantaneous power
▪ Run the kernel thousands of times
▪ Measure every second
▪ Divide GFLOPs by the power consumed
![Page 97: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/97.jpg)
CCMT| 55
GPU (Energy)
▪ Observations:
– Power consumed was nearly the same for each kernel
– Hence performance/watt is dominated by the performance results
CCMT| 56
Hybrid processors: Performance and energy

| CPU | GPU | CPU Time (ms) | GPU Time (ms) | CPU Energy (mJ) | GPU Energy (mJ) | Total Time (ms) | Total Energy (mJ) |
|-----|-----|---------------|---------------|-----------------|-----------------|-----------------|-------------------|
| 0 | 10000 | 0.0000 | 0.8333 | 0.0000 | 0.8333 | 0.8333 | 0.8333 |
| 10 | 9990 | 0.0333 | 0.8325 | 0.4167 | 0.8325 | 0.8325 | 1.2492 |
| 20 | 9980 | 0.0667 | 0.8317 | 0.8333 | 0.8317 | 0.8317 | 1.6650 |
| 40 | 9960 | 0.1333 | 0.8300 | 1.6667 | 0.8300 | 0.8300 | 2.4967 |
| 80 | 9920 | 0.2667 | 0.8267 | 3.3333 | 0.8267 | 0.8267 | 4.1600 |
| 160 | 9840 | 0.5333 | 0.8200 | 6.6667 | 0.8200 | 0.8200 | 7.4867 |
| 320 | 9680 | 1.0667 | 0.8067 | 13.3333 | 0.8067 | 1.0667 | 14.1400 |
| 640 | 9360 | 2.1333 | 0.7800 | 26.6667 | 0.7800 | 2.1333 | 27.4467 |
| 1280 | 8720 | 4.2667 | 0.7267 | 53.3333 | 0.7267 | 4.2667 | 54.0600 |
| 2560 | 7440 | 8.5333 | 0.6200 | 106.6667 | 0.6200 | 8.5333 | 107.2867 |
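The Total columns are consistent with the CPU and GPU portions running concurrently: total time is the maximum of the two times, while total energy is the sum of the two energies (this reading is inferred from the numbers, not stated on the slide). Checking one row in Python:

```python
# Row "320 / 9680" from the table above.
cpu_time, gpu_time = 1.0667, 0.8067        # ms
cpu_energy, gpu_energy = 13.3333, 0.8067   # mJ

total_time = max(cpu_time, gpu_time)       # concurrent execution: slower side dominates
total_energy = cpu_energy + gpu_energy     # energies always add

print(round(total_time, 4), round(total_energy, 4))  # 1.0667 14.14
```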
![Page 98: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/98.jpg)
CCMT| 57
Interpolation
� Optimizations:
– Matrix multiplication
– Ordering of operations
  • Expensive operations can be done earlier, while the matrix size is still small
  • xyz, xzy, yxz, yzx, zxy, zyx
CCMT| 58
Lightweight Distributed Metric Service
� LDMS: data collection tool at LANL
� Gives temperature numbers
� After logging in,
� Output:
1421325154.002116, 2116, 1600, 30, 30, 30, 32
� Units of temperature: degrees Celsius
� Sensors are placed inside the core
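The record format is not documented on the slide; the field meanings below (a timestamp, two ids, a frequency, then four per-core temperatures) are a guess from the single sample shown. A parsing sketch:

```python
# Sample LDMS record from the slide; field meanings are assumed.
record = "1421325154.002116, 2116, 1600, 30, 30, 30, 32"
fields = [f.strip() for f in record.split(",")]

timestamp = float(fields[0])
temps = [int(f) for f in fields[-4:]]   # per-core temperatures, deg C

print(temps)  # [30, 30, 30, 32]
```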
![Page 99: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/99.jpg)
CCMT| 59
Conclusions
� We benchmarked the most compute-intensive kernel of CMT-Nek for performance and energy.
� Our work highlights autotuning as an important strategy for improving both performance and energy across different architectures.
– We obtained 42-61% improvement in performance and about 55% improvement in energy requirement.
� We coupled genetic algorithms with autotuning to perform a smart search.
� Currently working on spectral interpolation and temperature measurements.
CCMT| 60
Dynamic Voltage Scaling
60
![Page 100: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/100.jpg)
CCMT| 61
Reconfigurable Cache
61
Capacity Tuning
Associativity Tuning
Line Size Tuning
Zhang et al., ACM TECS 2005
CCMT| 62
Our Research
62
![Page 101: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/101.jpg)
CCMT| 63
Optimal Cache Configurations
63
CCMT64
Cache Reconfiguration in Multi-core Systems
W. Wang, P. Mishra and S. Ranka, DAC, 2011
![Page 102: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/102.jpg)
CCMT65
Cache Reconfiguration and Partitioning
Dynamic cache reconfiguration in L1 caches and partitioning in L2 cache are highly correlated.
CCMT66
Algorithm
• We assume task mapping is given
• We statically profile each independent task for all L1 cache configurations and L2 cache partitioning factors
– Greatly reduced design space size
• Step 1: We employ a dynamic programming based algorithm to find the optimal L1 cache configurations for each core (with multiple tasks) separately under all L2 partition factors
• Step 2: We then find the optimal L2 cache partition scheme
![Page 103: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/103.jpg)
CCMT67
Algorithm Illustration
67
Step 1
Step 2
CCMT68
Our approach can achieve 29% energy savings compared with CP, and up to 14% savings compared with DCR + UCP.
![Page 104: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/104.jpg)
CCMT| 69
Thermal Issues of Multi-core Processors
� The power density of multi-core processors has doubled every three years and this rate is expected to increase as frequencies scale faster than operating voltages
� A small increase of 10 °C in temperature may result in a 2× reduction in the lifespan of the device
� The cost of the cooling system increases super-linearly with power consumption
Power and heat flux trend in the desktop processor
CCMT| 70
Managing Temperature: Motivation
Temperature varies on multiple cores
[Sarood2011]
Tilera Processor
![Page 105: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/105.jpg)
CCMT| 71
Thermal RC model
• P: power consumption
• Ti: initial temperature
• t: execution time
• R, C: thermal resistance and capacitance; TA: ambient temperature

There are three major factors affecting the on-chip transient temperature: the average power of the processor, the initial temperature and the execution time.

T(t) = P·R + TA + (Ti − P·R − TA) · e^(−t/(R·C))
CCMT| 72
Thermal Management
� Sensor based
– [Mukherjee06, Sharifi08, Cochran09]
– Pros: low computation overhead
– Cons: imprecision and noise in the raw data from the temperature sensor; fixed position of the sensor
� Model based
– [Skadron03, Huang04, Liu06, Rao07]
– Pros: estimates the temperature accurately
– Cons: high computation overhead
![Page 106: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/106.jpg)
CCMT| 73
HotSpot
CCMT| 74
Steady State Temperature
� Assume P is the power of a thermal element, Ti is the initial temperature and TA is the ambient temperature; then the transient temperature at time t is:

Tt = P·R + TA + (Ti − P·R − TA) · e^(−t/(R·C))

� As t → ∞, we get the steady-state temperature:

Ts = P·R + TA

� Time to reach the steady-state temperature: 20 ms - 20 s
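A quick numerical check of the two formulas (the R, C, P and temperature values are illustrative, not from the slides):

```python
import math

R, C = 0.5, 10.0        # thermal resistance (K/W) and capacitance (J/K)
P = 40.0                # power (W)
T_A, T_i = 45.0, 50.0   # ambient and initial temperatures (deg C)

def T(t):
    # Transient temperature from the thermal RC model.
    return P * R + T_A + (T_i - P * R - T_A) * math.exp(-t / (R * C))

T_s = P * R + T_A       # steady-state temperature: 65.0

print(T(0.0))           # 50.0 -- starts at the initial temperature
print(abs(T(100.0) - T_s) < 1e-6)  # True -- converges to the steady state
```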
![Page 107: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/107.jpg)
CCMT| 75
– G: thermal conductance between thermal elements (cores and heatsinks)
– A: thermal conductance between thermal elements and outside environment
– Thermal conductance is the quantity of heat transferred with a temperature difference of one kelvin, measured in W/K
Matrix Model
CCMT| 76
Matrix Model
� Suppose there are u cores and v heatsinks, Tm is the steady-state temperature of thermal element m, and TA is the ambient temperature. The heat balance for each thermal element m ∈ M is:

Σ_{n∈M} G(m, n)·(Tn − Tm) + A(m)·(TA − Tm) + P(m) = 0,  ∀m ∈ M

In matrix form:

R·(T − TA) = P    (1)

where R_ij = Σ_{k=0, k≠i}^{u+v−1} G(i, k) + A(i) if i = j, and R_ij = −G(i, j) if i ≠ j.
![Page 108: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/108.jpg)
CCMT| 77
Matrix Model
� In a multi-core processor with u cores
– Tcore: 1×u vector of steady-state temperatures of the cores
– TA: ambient temperature; I = [1, 1, · · · , 1]T
– C: u×u matrix. The change in temperature of the mth core caused by the nth core is given by Cm,n times the change in thermal power of the nth core
– P: 1×u vector of core powers
$$T_{core} = T_A \cdot I + C \cdot P$$
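Evaluating the model is a single matrix-vector product; a sketch with a made-up 2-core C matrix (all values illustrative):

```python
import numpy as np

# T_core = T_A * I + C @ P  (steady-state core temperatures)
T_A = 45.0
I = np.ones(2)
C = np.array([[0.15, 0.05],     # illustrative thermal-influence matrix
              [0.05, 0.15]])
P = np.array([100.0, 50.0])     # per-core power in watts
T_core = T_A * I + C @ P
print(T_core)                   # [62.5 57.5]
```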
CCMT| 78
Matrix Model
� Generation of matrix C
– C is a u×u matrix, built one row at a time
– HotSpot[P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
– HotSpot[P0+α, P1, …, Pu−1] → [T′0, T′1, …, T′u−1]
– Subtract: row 0 = ([T′] − [T]) / α → [C00, C01, …, C0,u−1]
![Page 109: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/109.jpg)
CCMT| 79
Matrix Model
� Generation of matrix C
– HotSpot[P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
– HotSpot[P0, P1+α, …, Pu−1] → [T′0, T′1, …, T′u−1]
– Subtract: row 1 = ([T′] − [T]) / α → [C10, C11, …, C1,u−1]
CCMT| 80
Matrix Model
� Generation of matrix C
– HotSpot[P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
– HotSpot[P0, P1, P2+α, …, Pu−1] → [T′0, T′1, …, T′u−1]
– Subtract: row 2 = ([T′] − [T]) / α → [C20, C21, …, C2,u−1]
![Page 110: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/110.jpg)
CCMT| 81
Matrix Model
� Generation of matrix C
– HotSpot[P0, P1, …, Pu−1] → [T0, T1, …, Tu−1]
– HotSpot[P0, P1, …, Pu−1+α] → [T′0, T′1, …, T′u−1]
– Subtract: row u−1 = ([T′] − [T]) / α → [Cu−1,0, Cu−1,1, …, Cu−1,u−1]
– Repeating the perturbation for every core fills all u rows of C
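The row-by-row perturbation procedure can be sketched as follows. HotSpot is replaced here by a toy linear thermal model purely for illustration, so the recovered C matches it exactly; all names and values are assumptions:

```python
import numpy as np

def build_C(thermal_sim, P_base, alpha=1.0):
    """One extra simulator run per core: perturb core n's power by alpha,
    subtract the baseline temperatures, and divide by alpha."""
    T_base = thermal_sim(P_base)
    u = len(P_base)
    C = np.zeros((u, u))
    for n in range(u):
        P = P_base.copy()
        P[n] += alpha
        C[n, :] = (thermal_sim(P) - T_base) / alpha
    return C

# Toy stand-in for HotSpot: a linear, symmetric thermal model.
M = np.array([[0.12, 0.03, 0.01],
              [0.03, 0.12, 0.03],
              [0.01, 0.03, 0.12]])
toy_hotspot = lambda P: 45.0 + M @ P
C = build_C(toy_hotspot, np.array([100.0, 100.0, 100.0]))
```

Because the toy model is linear and symmetric, the finite-difference C recovers M exactly; with a real simulator the slides average C over several power configurations instead.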
CCMT| 82
C-Matrix of a 4 core processor
![Page 111: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/111.jpg)
CCMT| 83
Matrix Model
• Compact – quite simple equations to compute the steady-state temperatures of the cores, much easier than HotSpot
• Accurate – achieves steady-state temperatures very close to HotSpot (see Experiments)
• Limitations
– Only steady-state temperature
– Still needs HotSpot to generate the C matrix
CCMT| 84
Evaluation
• Multicore parameters:
– 4-core and 16-core processors
– Each core is abstracted as an 8mm × 8mm square chip
– For the 4-core processor, each core dissipates 100W at maximum voltage; for the 16-core processor, 50W per core
– The default thermal configuration in HotSpot
• Matrix C generation
– Use 20 different power configurations to generate 20 different C matrices and average them
• DAG generation:
– 32, 64, 128, 256, 512 tasks
– execution time: 20 – 60 time units
– the probability of any two nodes having an edge between them is set to 0.1
![Page 112: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/112.jpg)
CCMT| 85
Evaluation
� Matrix Model vs HotSpot --- Peak Temperature
CCMT| 86
Evaluation
� Matrix Model vs HotSpot -- Computation Time
![Page 113: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/113.jpg)
CCMT| 87
Temperature Aware Scheduling
• Uniform voltage without throttling – each core can work at full voltage or has to be shut off
• Uniform voltage with throttling – the voltage of each core has to be the same but can be varied between a maximum and a minimum
• Non-uniform voltage with throttling – the voltage of each core can be varied independently
• Problem: for each of the above three types of processors, determine the workload distribution for each core so that the total throughput across all cores is maximized and the maximum temperature of any core is bounded by a given threshold
• Determine data parallel workloads distribution on multicore processor, so that the total throughput across all cores is maximized and the maximum temperature for any core is bounded by a given threshold
• wi denotes the workload assigned to core i
• ti denotes the running time of the workload on core i
• Ti denotes the peak temperature on core i
• Tth denotes the temperature threshold for all cores
$$\max\ \frac{\sum_{i=1}^{m} w_i}{\max_i t_i} \quad \text{s.t. } T_i \le T_{th},\ i \in \{1,2,\dots,m\}$$
Uniform voltage without throttling
![Page 114: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/114.jpg)
CCMT| 89
Uniform voltage without throttling
$$\max\ z = \sum_{i \in M} x_i \quad \text{s.t. } T_A + \sum_{j \in M} C_{ij}\, x_j P_j \le T_{th},\ i \in M; \qquad x_i = 0 \text{ or } 1,\ i \in M$$

$$\min\ T_p \quad \text{s.t. } T_A + \sum_{j \in M} C_{ij}\, x_j P_j \le T_p,\ i \in M; \qquad \sum_{i \in M} x_i = Z; \qquad x_i = 0 \text{ or } 1,\ i \in M \qquad (*)$$
CCMT| 90
Uniform voltage with throttling
Optimization Problem
– Simplify (*) as follows:
$$\max\ x \quad \text{s.t. } T_A + \sum_{j \in M} C_{ij}\, x P \le T_{th},\ i \in M; \qquad 0 \le x \le 1$$
– where D is equal to
$$D = \frac{T_{th} - T_A}{P}$$
– x is given by:
$$x = \min\left( \min_{i \in M} \frac{D}{\sum_j C_{ij}},\ 1 \right)$$
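The closed form is one line of code; a sketch with an illustrative 2-core C matrix (values are assumptions, not from the slides):

```python
import numpy as np

def uniform_throttle(C, P, T_A, T_th):
    """x = min(1, min_i D / sum_j C[i][j]),  D = (T_th - T_A) / P."""
    D = (T_th - T_A) / P
    return min(1.0, float((D / C.sum(axis=1)).min()))

C = np.array([[0.10, 0.05],
              [0.05, 0.10]])
x = uniform_throttle(C, P=100.0, T_A=45.0, T_th=55.0)
print(x)   # D = 0.1, row sums = 0.15, so x = 2/3
```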
![Page 115: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/115.jpg)
CCMT| 91
Non-uniform voltage with throttling
Optimization Problem
$$\max \sum_{i \in M} x_i \quad \text{s.t. } T_A + \sum_{j \in M} C_{ij}\, x_j P \le T_{th},\ i \in M \quad (**); \qquad 0 \le x_i \le 1,\ i \in M \quad (***)$$
CCMT| 92
Non-uniform voltage with throttling
• Three possible cases:
• Case 1: The threshold temperature is high and all cores can execute at their maximum voltage without exceeding the threshold. In this case all xi are set to 1
• Case 2: The threshold temperature is low and requires all xi to be less than 1. In this case the xi are all bounded by Equation (**)
• Case 3: The threshold temperature is such that the xi values of some of the cores are limited by the constraint given by Equation (***)
![Page 116: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/116.jpg)
CCMT| 93
Non-uniform voltage with throttling
� Case 2
When all thermal constraints (**) are tight, the problem reduces to:
$$\max \sum_{i \in M} x_i \quad \text{s.t. } C \cdot X = D; \qquad 0 \le x_i \le 1$$
with solution
$$X = C^{-1} \cdot D$$
where D is the vector with entries (Tth − TA)/P.
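Case 2 is a plain linear solve; a sketch (matrix and D values are illustrative; `np.linalg.solve` avoids forming C⁻¹ explicitly):

```python
import numpy as np

C = np.array([[0.10, 0.04],
              [0.04, 0.10]])
D = np.array([0.08, 0.08])        # (T_th - T_A)/P for each core, illustrative
X = np.linalg.solve(C, D)         # X = C^{-1} D
print(X)                          # symmetric case: both cores get the same x
assert np.all((0 <= X) & (X <= 1))   # Case 2 requires every x_i in [0, 1]
```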
CCMT| 94
Non-uniform voltage with throttling
• Case 3:
• Check Case 1 followed by Case 2. If both fail, then an approximation is used in Case 3 by assuming that all xi values are the same and applying the algorithm for uniform voltage with throttling
![Page 117: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/117.jpg)
CCMT| 95
Evaluation
� CPU: multi-core processors with 4, 16, 32 and 64 cores
• The ambient temperature used was 45.5°C. The maximum allowable temperature was set to 70°C
• Metric: Effective Number of Cores = Total throughput / Maximum throughput per core
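The metric is a simple ratio; a sketch (numbers illustrative):

```python
def effective_number_of_cores(per_core_throughput, max_per_core):
    """Total throughput divided by the maximum throughput of one core."""
    return sum(per_core_throughput) / max_per_core

# Four cores, two of them throttled to half speed:
print(effective_number_of_cores([10.0, 10.0, 5.0, 5.0], max_per_core=10.0))  # 3.0
```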
Number of cores | Floorplan | Area per core | Maximum power per core
4 | 2×2 grid | 8mm×8mm | 40 Watt
16 | 4×4 grid | 4mm×4mm | 10 Watt
32 | 8×4 grid | 3mm×3mm | 5 Watt
64 | 8×8 grid | 2mm×2mm | 2.5 Watt
CCMT| 96
Evaluation
� Uniform voltage without throttling– MIP (Mixed Integer Programming): The solution
derived by our algorithm that minimizes the maximum temperature for the optimal value of P
– BestP: Consider all subsets of size P. Find the subset that corresponds to the lowest maximum temperature
– BestP+1: Consider all subsets of size P+1. Find the subset that corresponds to the lowest maximum temperature
– WorstP: Consider all subsets of size P. Find the subset that corresponds to the highest maximum temperature
![Page 118: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/118.jpg)
CCMT| 97
Evaluation
� Uniform voltage without throttling
CCMT| 98
Evaluation
� Uniform voltage with throttling– Throughput comparison
![Page 119: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/119.jpg)
CCMT| 99
Evaluation
� Uniform voltage with throttling– Computation time comparison
CCMT| 100
Evaluation
� Non-uniform voltage with throttling– Throughput comparison
![Page 120: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/120.jpg)
CCMT| 101
Evaluation
� Non-uniform voltage with throttling– Computation time
CCMT| 102
Decomposing Hot Tasks
� Partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature– To the best of our knowledge, our work is the first
attempt to develop efficient task partitioning algorithms to demonstrate significant temperature reduction.
• Several heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks
– 1) for a periodic set of tasks with a common period
– 2) for a periodic set of tasks with individual periods
1We define “hot” tasks as tasks with higher average power consumption, and “cool” tasks as tasks with lower average power consumption.
![Page 121: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/121.jpg)
CCMT| 103
Related Work
• Dynamic Voltage and Frequency Scaling (reduce power)
– Dynamic Voltage and Frequency Scaling (DVFS) can be used to reduce power consumption by lowering the supply voltage and operating frequency, thereby reducing the on-chip temperature [Brooks2001, Rao2008, Kadin2008, Ebi2009, …]
– Cons: faces a serious problem in time-constrained applications
• Temperature-aware task sequencing algorithm (reduce initial temperature)
– Reduces peak temperature compared to a random sequence [Jayaseelan2008]
– Cons: fails to reduce temperature in cases when one or more of the “hot”1 tasks are long
CCMT| 104
Temperature-aware task partitioning algorithm
Illustrative example
![Page 122: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/122.jpg)
CCMT| 105
Temperature-aware task partitioning algorithm
Illustrative example
CCMT| 106
Temperature-aware task partitioning algorithm
Illustrative example
![Page 123: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/123.jpg)
CCMT| 107
Temperature-aware task partitioning algorithm
Illustrative example
CCMT| 108
Temperature-aware task partitioning algorithm
Illustrative example
![Page 124: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/124.jpg)
CCMT| 109
Temperature-aware task partitioning algorithm
Illustrative example
CCMT| 110
Temperature-aware task partitioning algorithm
Illustrative example
The Task Partitioning Algorithm can achieve a lower peak temperature than the Task Sequencing Algorithm
![Page 125: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/125.jpg)
CCMT| 111
Experiments
• Platform:
• CPU:
• ARM Cortex A8 (Simplescalar)
• 2-width in-order issue, 32KB instruction cache
• 1.5GHz clock speed
• Power simulator:
• Wattch
• Temperature evaluation:
• HotSpot
• Ambient temperature:
• Tasks: synthetic tasks and real benchmarks are used
• Algorithms: compared with the task sequencing algorithm and the EDF algorithm
CCMT| 112
Experiments
Simulation flow: benchmarks → Simplescalar (core component statistics) + CACTI (cache statistics) → Wattch (power of tasks) → temperature-aware task partitioning algorithms (inputs: ambient temperature, CPU thermal parameters, CPU frequency, cache configurations, …) → peak temperature, using
$$T(t) = P \cdot R + T_A + (T_i - P \cdot R - T_A)\, e^{-t/RC}$$
![Page 126: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/126.jpg)
CCMT| 113
Experiments: Periodic tasks with common period
Real benchmarks:
set1 patricia, adpcm, rijndael, susan, crc, FFT, dijkstra, epic
set2 patricia, djpeg, adpcm, sha, FFT, rijndael, susan, rijndael
set3 sha, djpeg, FFT, rijndael, dijkstra, epic, rijndael, susan
set4 rijndael, dijkstra, FFT, gsm, sha, patricia, pegwit, djpeg
CCMT| 114
Experiments: Periodic tasks with common period
� Temperature comparison:
The task partitioning algorithm (TPA) can reduce the peak temperature by up to
5.88°C compared with the task sequencing algorithm (TSA)
![Page 127: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/127.jpg)
CCMT| 115
Periodic tasks with individual period
EDF scheduling with task partitioning (EDFp) can also reduce the peak temperature by up to 6°C
CCMT| 116
Experiments
� Overhead:
The average number of context switches per task for TPA is as low as 2 (left figure). The average number of context switches per task for EDFp is also lower than 2 (right figure).
They are tolerable in many practical scenarios1.
1 Context switch time on an ARM CPU can be less than 10us [SEGGER]
![Page 128: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/128.jpg)
CCMT| 117
Conclusion
� We propose to partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature
• We propose two heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks
– 1) for a periodic set of tasks with a common period
– 2) for a periodic set of tasks with individual periods
CCMT| 118
Temperature-aware Scheduling for Multicores
� Multicore Processors:– Multiple heating sources– Heat interaction between neighboring cores
![Page 129: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/129.jpg)
CCMT| 119
Inter Core Scheduling
CCMT| 120
Experiments
� Platform:� CPU:
• Simplescalar, ARM Cortex A9 (multicore)• 2-width out-of-order issue, 32KB instruction cache• 1.2GHz clock speed.
� Power simulator: • Wattch
� Temperature evaluation: • Temperature simulator: HotSpot• Ambient temperature: 45.15oC
• Tasks: synthetic tasks and real benchmarks are used
• Algorithms: Min-Min, PDTM [Yeo2008], TPS-1 (δ=0.33ms), TPS-2 (δ=0.66ms), TPS-3 (δ=1.32ms), TPS-3 (δ=2.64ms)
![Page 130: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/130.jpg)
CCMT| 121
Experiments
Simulation flow: benchmarks → Simplescalar (core component statistics) + CACTI (cache statistics) → Wattch (power of tasks) → temperature-aware task partitioning algorithms (inputs: ambient temperature, CPU thermal parameters, CPU frequency, cache configurations, …) → HotSpot → peak temperature
CCMT| 122
Experiments
� Multicore: – Real benchmarks
![Page 131: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/131.jpg)
CCMT| 123
Experiments
� Multicore: Synthetic tasks
The TPS algorithm reduces the peak temperature by up to 11.68°C compared with the Min-Min algorithm.
PDTM can achieve a similar peak temperature reduction, but requires 33% more makespan.
CCMT| 124
Experiments
� Multicore: – Real benchmarks
The TPS algorithm reduces the peak temperature by up to 9.92°C compared with the Min-Min algorithm, and by 4.52°C compared with the PDTM algorithm.
PDTM can achieve a similar peak temperature reduction, but requires 44% more makespan.
![Page 132: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/132.jpg)
CCMT| 125
Experiments
• Multicore:
– TPS vs PDTM: peak temperature under the same makespan
The TPS-1 relaxed algorithm reduces the peak temperature by up to 20°C compared with the PDTM algorithm.
CCMT| 126
Experiments
� Multicore:– TPS vs PDTM: scalability
![Page 133: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/133.jpg)
CCMT| 127
Conclusions
A thermal model that addresses core throttling can greatly improve performance
Heuristics with the transient thermal model achieve better improvements than methods with the steady-state model
Computing time cost: transient is larger than steady-state
– It is worthwhile because it is calculated offline
Different initial configurations of f and t may result in small differences in the achieved results
– Practically, it is better to use the steady-state solution as the initial configuration
CCMT| 128
Conclusion
� We propose to partition the “hot” tasks into multiple subtasks and interleave these subtasks with “cool” tasks to reduce the overall maximum temperature
� We propose heuristic task partitioning algorithms using “cool” tasks to interleave “hot” tasks on both single core and multicore processors
• Experimental results show that our algorithm outperforms existing state-of-the-art thermal-aware scheduling algorithms in terms of peak temperature and makespan.
![Page 134: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/134.jpg)
CCMT| 129
Transient Thermal models
Steady-state thermal model
– Efficient but does not capture transient effects (worst-case scenario)
Transient-state thermal model:
– If the average power of a core is P over a time period t, then the temperature at the end of this period, T(t), is given by:
G is the thermal conductance matrix; C is the thermal capacitance matrix
TA is the ambient temperature; Ti is the initial temperature
CCMT| 130
Approach
1. We propose a solution to the convex optimization problem with the simple thermal model to solve the problem of maximizing throughput under the temperature constraint.
2. We also propose a heuristic algorithm with the transient thermal model to solve the problem with higher accuracy.
![Page 135: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/135.jpg)
CCMT| 131
Evaluation – Matrix Multiplication
General scheme
– Higher throughput improvement than without HLB
– Around 10% throughput improvement over the base solution
– With a very large workload, the heuristic and base solutions will converge
Homogeneous-scaling scheme
Non-scaling scheme
Hengxing Tan and Sanjay Ranka, “Thermal-aware Scheduling for Data Parallel Workloads on Multi-Core Processors,” ISCC 2014 (work partially supported by NSF)
CCMT| 132
Future Work: Energy and Thermal Management
� Varying Architectural Elements─ Processor (Dynamic Voltage Scaling)─ Caches (Dynamic Cache Reconfiguration)─ Buses─ Memory
� Developing Optimized Libraries – Energy─ Performance─ Temperature
[Figure: feasible space between solutions A and B in the Energy vs. Time plane]
![Page 136: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/136.jpg)
CCMT| 133
Selected Publications
133
� Jaeyeon Kang, Sanjay Ranka: Energy-Efficient Dynamic Scheduling on Parallel Machines. HiPC –International Conference on High Performance Computing, 2008: 208-219.
� Jaeyeon Kang and Sanjay Ranka, Dynamic Algorithms for Energy Minimization on Parallel Machines., Proceeding of Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP), 2008, pp. 399-406.
� Jaeyeon Kang and Sanjay Ranka, DVS based Energy Minimization Algorithm for Parallel Machines, Proceedings of IEEE International Parallel and Distributed Processing Symposium, 2008, pp. 1-12.
� Zhe Wang and Sanjay Ranka, A Simple Thermal Model for Multi-core Processors and Its Application to Slack Allocation, Proceedings of International Parallel and Distributed Processing Symposium 2010, pp. 1-11.
� Weixun Wang, Prabhat Mishra and Sanjay Ranka, “Dynamic Reconfiguration in Real-Time Systems: Energy, Performance, Reliability and Thermal Perspectives”, Springer, 2012 (Expected)
� Weixun Wang, Sanjay Ranka and Prabhat Mishra, “Energy-Aware Dynamic Reconfiguration Algorithms for Real-Time Multitasking Systems”, SUSCOM, Issue. 1, pages 35-45, 2011 (Invited Paper)
� Weixun Wang, Prabhat Mishra and Sanjay Ranka, “Energy Optimization of Cache Hierarchy in Real-Time Multicore Systems”, TCAD, under review
� Weixun Wang and Prabhat Mishra, “PreDVS: Preemptive Dynamic Voltage Scaling for Real-Time Multitasking Systems”, TODAES, under review
� Weixun Wang, Sanjay Ranka and Prabhat Mishra, “Energy-Aware Dynamic Slack Allocation for Real-Time Multitasking Systems”, SUSCOM, under review
CCMT| 134
Managing Temperature: Approaches
High temperature leads to performance loss [Rajan2008].
Temperature Thresholding –
• Lower workloads at
• Increase workloads at
Zigzag throttling of processor speeds
– Zigzag effects cause more loss
– The processor will put “hot” cores into a low-power state.
![Page 137: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/137.jpg)
CCMT| 135
Modeling Thermal Behavior (HotSpot)
CCMT136
Energy Levers
![Page 138: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/138.jpg)
CCMT| 137
Relaxation
• Most real architectures only support discrete frequency settings
– E.g., given the options 1.0 GHz, 1.5 GHz, 2.0 GHz for dual cores
– Results: 1.4 GHz, 1.99 GHz
• Relaxation
– Naive: downward relaxation
• 1.0 GHz, 1.5 GHz
– Our method: relaxing the result frequency to the neighboring discrete value
• 1.0 GHz, 2.0 GHz
– Practically, choosing among 2 or 4 neighbors is good enough
CCMT| 138
Solution with the steady state model
Assumption:
– Applications run for a long time at constant frequency
– Each core completes its work simultaneously
Each core will arrive at its steady-state temperature
Use a convex solver:
$$\max \sum_{i=1}^{m} f_i \quad \text{s.t. } \sum_{j} (G^{-1})_{ij}\, \beta f_j^3 \le T_{th} - T_A,\ i \in \{1,2,\dots,m\}; \qquad F_{min} \le f_i \le F_{max}$$
![Page 139: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/139.jpg)
CCMT| 139
Solution with transient model
Related dual objective: minimize the makespan across all cores with given workloads and temperature threshold
A general solution uses a non-linear solver such as SQP
$$\min\ \max_i t_i \quad \text{s.t. } \beta f_i t_i = W_i; \qquad T_i \le T_{th},\ i \in \{1,2,\dots,m\}$$
CCMT| 140
Iterative Refinement Process
![Page 140: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/140.jpg)
CCMT| 141
Additional constraints
Homogeneous-scaling: cores run at the same frequency f. Objective:
$$\max\ \frac{\sum_{i=1}^{m} \beta f\, t_i}{\max_i t_i}$$
Non-scaling: cores run at a fixed frequency F, but some cores could be turned off. Objective:
$$\max\ \frac{\sum_{i=1}^{m} \beta F\, t_i}{\max_i t_i}$$
CCMT| 142
Additional Constraints (with simple model)
Homogeneous-scaling: cores run at the same frequency
– Convex problem
$$\max\ f \quad \text{s.t. } \sum_{j} (G^{-1})_{ij}\, \beta f^3 \le T_{th} - T_A,\ i \in \{1,2,\dots,m\}; \qquad F_{min} \le f \le F_{max}$$
Non-scaling: cores run at a fixed frequency
– Mixed integer linear problem
$$\max \sum_{i=1}^{m} \beta F\, x_i \quad \text{s.t. } \sum_{j} (G^{-1})_{ij}\, \beta F^3 x_j \le T_{th} - T_A,\ i \in \{1,2,\dots,m\}; \qquad x_i = 0 \text{ or } 1$$
![Page 141: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/141.jpg)
CCMT| 143
Additional Constraints (with transient model)
Homogeneous-scaling: cores run at the same frequency
$$\min\ \max_i t_i \quad \text{s.t. } \sum_{i=1}^{m} \beta f\, t_i = W; \qquad T_i \le T_{th}; \qquad F_{min} \le f \le F_{max},\ i \in \{1,2,\dots,m\}$$
Non-scaling: cores run at a fixed frequency but some cores could be turned off
$$\min\ \max_i t_i \quad \text{s.t. } \sum_{i=1}^{m} \beta F_{max}\, t_i = W; \qquad T_i \le T_{th},\ i \in \{1,2,\dots,m\}$$
The heuristic can work on both problems
CCMT| 144
Heuristic – local search
Based on Heat Load Balance (HLB): slice workloads and move them from hot cores to cool cores
1: Start from an initial configuration of f and t
• Distribute total workloads evenly to cores
2: Move a workload unit
• From: the core decided by peak temperature
• To: the core decided by max gradient, in terms of frequency slice
3: Repeat step 2 until the peak temperature is the same on all cores
4: If
• Continue moving workload units; From: decided by gradient
Else if
• Move workload units backward (reducing f by increasing t)
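Steps 1–3 of the local search can be sketched with the steady-state matrix model standing in for the thermal evaluation; the C matrix, unit size, and stopping rule are simplifying assumptions:

```python
import numpy as np

def heat_load_balance(C, W_total, T_A, step, iters=10000):
    """Toy HLB loop: move one workload unit at a time from the hottest
    core to the coolest core until the peak temperatures balance."""
    u = C.shape[0]
    w = np.full(u, W_total / u)           # step 1: even initial distribution
    for _ in range(iters):
        T = T_A + C @ w                   # steady-state temperatures
        hot, cool = int(T.argmax()), int(T.argmin())
        if T[hot] - T[cool] < 1e-9:       # step 3: balanced, stop
            break
        w[hot] -= step                    # step 2: move one unit
        w[cool] += step
    return w

C = np.array([[0.20, 0.05],               # core 0 heats up faster (illustrative)
              [0.05, 0.10]])
w = heat_load_balance(C, W_total=100.0, T_A=45.0, step=1.0)
print(w)                                  # more work shifted to the cooler core
```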
![Page 142: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/142.jpg)
CCMT| 145
Scheduling time
CCMT| 146
Temperature-aware task partitioning algorithm
• Two major challenges:
– 1) Number of partitions
– 2) Sequencing of subtasks
� Two broad scenarios:– 1) A periodic set of tasks with common period.
All the tasks have the same arrival time and deadline.
– 2) A set of periodic tasks with individual period. Each task may have different arrival time and deadline.
![Page 143: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/143.jpg)
CCMT| 147
Scenario 1: Periodic tasks with common period
• Given a periodic set L of N heterogeneous tasks, let Pi be the average power consumption during the execution time ci of task τi.
� The goal is to find a sequence of these tasks using task partitioning to minimize the peak temperature
CCMT| 148
Algorithm: Periodic tasks with common period
• Sort the tasks based on the power profile from coolest to hottest
• Group the sorted tasks into k categories with equal numbers of tasks
• Partition tasks in category j, 2 ≤ j ≤ k, into 2^(j−1) equal subtasks. Partition tasks in category 1 into 2 equal subtasks.
• for i = 1 to k − 1 do
– Interleave tasks of the ith category with tasks of the (i+1)th category to form the new (i+1)th category
• end for
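The sort/group/split/interleave steps can be sketched as below. The split counts follow the slide (category 1 → 2 subtasks, category j → 2^(j−1) for j ≥ 2); the exact interleaving order is one even-spreading variant and may differ from the figures:

```python
def interleave(cool, hot):
    """Spread the cooler subtask sequence evenly through the hotter one."""
    ratio = max(len(hot) // max(len(cool), 1), 1)
    out, ci = [], 0
    for i, h in enumerate(hot, 1):
        out.append(h)
        if i % ratio == 0 and ci < len(cool):
            out.append(cool[ci])
            ci += 1
    out.extend(cool[ci:])
    return out

def common_period_schedule(tasks, k):
    """tasks: list of (name, avg_power); returns a subtask sequence."""
    tasks = sorted(tasks, key=lambda t: t[1])          # coolest to hottest
    size = len(tasks) // k
    cats = [tasks[i * size:(i + 1) * size] for i in range(k)]
    split = []
    for j, cat in enumerate(cats, 1):
        n = 2 if j == 1 else 2 ** (j - 1)              # subtasks per task
        split.append([name for name, _ in cat for _ in range(n)])
    seq = split[0]
    for j in range(1, k):                              # merge categories upward
        seq = interleave(seq, split[j])
    return seq

# The three-task example from the slides: T3 (power 15), T1 (20), T2 (25).
print(common_period_schedule([("T3", 15), ("T1", 20), ("T2", 25)], k=3))
```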
![Page 144: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/144.jpg)
CCMT| 149
Periodic tasks with common period
3: T2: power = 25
2: T1: power = 20
1: T3: power = 15
CCMT| 150
Periodic tasks with common period
3: T2 T2 T2 T2
2: T1 T1
1: T3 T3
![Page 145: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/145.jpg)
CCMT| 151
Periodic tasks with common period
3: T2 T2 T2 T2
1&2: T1 T3 T1 T3
CCMT| 152
Periodic tasks with common period
1&2&3: T2 T1 T3 T2 T2 T2 T1 T3
![Page 146: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/146.jpg)
CCMT| 153
Scenario 2: Periodic tasks with individual period
• A set L of N periodic heterogeneous tasks, where each task has its own period pi. The arrival time ai is equal to the start time of its period, and the deadline di is equal to the end time of its period.
� The goal is to find a sequence of these tasks using task partitioning to minimize the peak temperature
CCMT| 154
Algorithm: Periodic tasks with individual period
Use an EDF scheduler to get the initial schedule of these tasks
while loop for M times do
  Calculate the thermal profile of the task sequence; find the “hot” task instance τh where the peak temperature occurs
  Partition the task instances whose execution period overlaps with the arrival time or deadline of the “hot” task instance
  In the hot interval, remove all the subparts of τh and calculate the available slack for each “cool” task instance
  while there are parts of τh unassigned and some “cool” task instance has available slack do
    for each “cool” task instance τci in the hot interval do
      if slacki > 0 then
        Append one unit of τh to τci and update the slack for all “cool” task instances
      end if
    end for
  end while
  If there are still some subparts of τh unassigned, scan the hot interval and assign them uniformly into the idle time
end while
![Page 147: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/147.jpg)
CCMT| 155
Periodic tasks with individual period
� Partition the task instances whose execution period overlaps with the arrival time and/or deadline of the “hot” task instance
• Slack allocation:
$$EST_i = \max(a_h,\ a_i,\ EST_{pred} + c_{pred})$$
$$LST_i = \min(d_h,\ d_i,\ LST_{succ} - c_i)$$
$$slack_i = LST_i - EST_i$$
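The slack computation is directly executable; a sketch with illustrative times (units arbitrary, not from the slides):

```python
def slack(a_h, d_h, a_i, d_i, est_pred, c_pred, lst_succ, c_i):
    """EST_i = max(a_h, a_i, EST_pred + c_pred)
       LST_i = min(d_h, d_i, LST_succ - c_i)
       slack_i = LST_i - EST_i"""
    est_i = max(a_h, a_i, est_pred + c_pred)
    lst_i = min(d_h, d_i, lst_succ - c_i)
    return lst_i - est_i

# Hot task active over [0, 10]; cool task i arrives at 2 with deadline 12,
# its predecessor finishes by 1, its successor must start by 9, c_i = 3.
print(slack(a_h=0, d_h=10, a_i=2, d_i=12, est_pred=0, c_pred=1, lst_succ=9, c_i=3))  # 4
```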
CCMT| 156
Periodic tasks with individual period
![Page 148: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/148.jpg)
CCMT| 157
Periodic tasks with individual period
![Page 149: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/149.jpg)
Load Balancing: Particles and Mesh
T5 2
Computation Partitioning: Types of Parallelism
Independent Parallelism
• Bundled simulations using DAKOTA (e.g., UQ, parametric variations)
• Task Parallelism
• Independent models that are simulated concurrently (e.g., fluid/particle coupling at the microscale)
• Data Parallelism
• Parallelization of the Eulerian grid
• Parallelization of Lagrangian particles
![Page 150: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/150.jpg)
T5 3
Fluid Model Solid Model
Across immersed interfacesAcross cells and elements
Lagrangian/EulerianLevels of AMR
Communication Mapping: Types of Interactions
T5 4
Preferential particle clusteringLagrangian remap
Computational power focusing
Extreme event UQ-drivenComputational steering
Adaptive mesh refinement
Load Balancing: Types of Adaptivity
![Page 151: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/151.jpg)
CUDA Memory model
• A thread block has R/W access to shared memory
• A block grid has R/W access to global memory
• A block grid has read-only access to constant memory
• Global memory (of order 4GB) resides in DRAM and has much higher access latency than shared memory
[Figure: CUDA memory model — the Host, a Grid containing Blocks (0,0) and (1,0), each with Shared Memory and per-thread Registers, plus Global Memory and Constant Memory shared across the grid]
Images from CUDA programming guide
Global memory access
• When accessing global memory, peak performance is achieved when all threads in a half warp access contiguous memory locations.
Non-coalesced access
Coalesced access
Images from CUDA programming guide
![Page 152: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/152.jpg)
Shared memory access
• Shared memory is divided into banks
• Multiple simultaneous accesses to a bank result in a bank conflict
• Conflicting accesses are serialized
[Figure: left — threads 0–15 map one-to-one onto banks 0–15 (no conflicts); right — several threads access the same bank (conflicting access)]
Images from CUDA programming guide
4 Phases of the PIC algorithm
1. Charge deposition phase
2. Field solve phase: compute the forces (Poisson equations) needed for particle motion from the accumulated particle charges
3. Force gathering phase
4. Particle push phase
![Page 153: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/153.jpg)
PIC algorithm on a triangular mesh
• Irregular structure makes partitioning complex
• Each particle requires a search to find the enclosing triangle
• This step forms an additional Search Phase in the PIC algorithm flow
• The search phase is one of the time-consuming steps in the PIC flow
Fig: Mesh from ORNL used for XGC1 benchmarks
GPU Parallelization Using Mesh Coloring
• Triangles in the mesh are treated as nodes of a graph
• Triangles sharing at least one vertex are assigned different colors
• Each GPU kernel works only on one color
• Pros: no conflicting accesses
• Cons: needs multiple kernel invocations; for efficient indexing, triangles and particles must be kept in color-sorted order, which involves additional computation
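The coloring step above can be sketched as a greedy graph coloring over the vertex-sharing relation. This is an illustrative host-side version; the function name and data layout are hypothetical, not from the deck.

```python
from collections import defaultdict

def color_triangles(triangles):
    """Greedy coloring: triangles sharing a vertex get different colors.

    `triangles` is a list of 3-tuples of vertex ids. Returns one color per
    triangle; each GPU kernel launch would then process one color class
    without conflicting vertex updates.
    """
    by_vertex = defaultdict(list)            # vertex -> triangles using it
    for t, verts in enumerate(triangles):
        for v in verts:
            by_vertex[v].append(t)
    colors = {}
    for t, verts in enumerate(triangles):
        taken = {colors[n] for v in verts for n in by_vertex[v] if n in colors}
        colors[t] = next(c for c in range(len(triangles)) if c not in taken)
    return [colors[t] for t in range(len(triangles))]

# Two triangles sharing vertex 1 must differ; the third is independent.
print(color_triangles([(0, 1, 2), (1, 3, 4), (5, 6, 7)]))  # [0, 1, 0]
```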
![Page 154: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/154.jpg)
GPU Parallelization Using Mesh Partitioning
• Particles, triangles, and triangle vertices form the partitioning entities in the PIC problem
• The mesh is partitioned into regions using a virtual rectangular grid
• Each region is mapped to a GPU block
Fig: Regions 1–4 of the virtual grid, each mapped to a GPU block

Triangular Mesh Partitioning
• Triangles that cross region boundaries are referred to as shadow triangles; the vertices of shadow triangles are termed shadow vertices
• Shadow vertices and triangles are replicated
• Particles, triangles, and vertices are represented using linear arrays
Fig: Shadow triangles and shadow vertices along the boundaries of Regions 1–4
![Page 155: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/155.jpg)
Replication of Shadow Entities
• Replication ensures that each block can compute independently of the other blocks
• After the computation, an aggregation step merges the values from all shadow vertices
Fig: Linear-array layout — vertices, shadow vertices, triangles, shadow triangles, and particles, grouped by region (Region 1, Region 2)
GPU Kernels
• Bucket sort for triangles, vertices, and particles
• For each simulation iteration:
  – Triangle search
  – Field solve phase
  – Force aggregation of shadow vertices
  – Particle push phase
  – Re-sorting of moved particles
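The bucket sort that puts each region's entities into a contiguous slice of the linear arrays can be sketched as a counting sort. This is a hypothetical serial stand-in for the GPU kernel: on the device the counts and prefix sum would be computed in parallel.

```python
def bucket_sort_particles(particle_regions, num_regions):
    """Counting-sort particles into region order (host-side model).

    On the GPU this is done with per-region counters and a prefix sum so
    that each region's particles occupy a contiguous slice of the linear
    particle array. Returns (sorted particle indices, region start offsets).
    """
    counts = [0] * num_regions
    for r in particle_regions:
        counts[r] += 1
    starts, acc = [], 0
    for c in counts:                 # exclusive prefix sum over counts
        starts.append(acc)
        acc += c
    order, cursor = [0] * len(particle_regions), list(starts)
    for p, r in enumerate(particle_regions):
        order[cursor[r]] = p         # scatter particle p into its region slot
        cursor[r] += 1
    return order, starts

order, starts = bucket_sort_particles([2, 0, 1, 0, 2], 3)
print(order, starts)   # [1, 3, 2, 0, 4] [0, 2, 3]
```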
![Page 156: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/156.jpg)
Triangle-Density-Based Partitioning
• Ensures effective load balancing across regions
• Expensive pre-processing step
• Needs a spatial indexing data structure such as a KD-tree to partition triangles
• During the search phase, particles traverse the KD-tree
• The KD-tree is not well suited for GPUs
Fig: Regions 1–6 of unequal size produced by density-based partitioning
Partitioning Using a Level 1 Grid
• The virtual rectangular grid partitions the mesh into regions
• The pre-processing step is very fast
• Load imbalance due to differences in triangle density
• The linear search for triangles can be a bottleneck
Fig: Regions 1–4 of the uniform Level 1 grid
![Page 157: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/157.jpg)
Level 2 Partitioning (Partition a Region into Sub-Regions)
• Used only for the search phase
• No replication of shadow vertices across sub-regions, as the vertices are read-only while searching
• Requires sorting of triangles and vertices in sub-region order, which is again part of pre-processing
• Sub-region-order sorting of particles has to be performed after each iteration
Incremental Sort – Uniform Partitioning
• After each simulation iteration, particles have to be re-sorted
• For an efficient bucket sort, the per-sub-region particle counts for all sub-regions should reside in shared memory
• As the number of sub-regions increases, the counts no longer fit in the GPU's shared memory
• In practice, most particles move only to adjacent sub-regions
• Therefore, keep only the sub-regions (Y) adjacent to sub-region (X) in shared memory
Fig: Sub-region X surrounded by its eight adjacent sub-regions Y
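The adjacency trick can be sketched as follows: only the 3x3 neighborhood of sub-region X gets counters in the small (shared-memory-sized) table, and anything else falls into an overflow bucket. The function name and the overflow handling are illustrative assumptions, not details given in the deck.

```python
def adjacent_counts(moves, x, grid_w, grid_h):
    """Count where sub-region X's particles ended up, keeping only X and
    its adjacent sub-regions Y in a small (shared-memory-sized) table.

    `moves` lists each particle's destination sub-region index; anything
    outside the 3x3 neighborhood of X is tallied in an overflow bucket
    (assumed here to be handled later in global memory). Indices are
    row-major on a grid_w x grid_h Level 2 grid.
    """
    xr, xc = divmod(x, grid_w)
    neighborhood = {r * grid_w + c
                    for r in range(max(0, xr - 1), min(grid_h, xr + 2))
                    for c in range(max(0, xc - 1), min(grid_w, xc + 2))}
    counts, overflow = {s: 0 for s in neighborhood}, 0
    for dest in moves:
        if dest in counts:
            counts[dest] += 1
        else:
            overflow += 1
    return counts, overflow

counts, overflow = adjacent_counts([5, 5, 6, 9, 15], x=5, grid_w=4, grid_h=4)
print(counts[5], counts[6], counts[9], overflow)  # 2 1 1 1
```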
![Page 158: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/158.jpg)
Non-Uniform Partitioning Using a Level 2 Grid
• The Level 2 grid is not uniform: denser where triangle density is higher
• Non-uniformity creates asymmetry and requires more complex pre-processing and indexing methods
• Incremental sort becomes very complex
Field Solve Phase
• Most of the flops are executed in this phase
• Each region is mapped to a GPU block
• The GPU block loads the vertices and shadow vertices of its region into shared memory
• Each thread operates on a set of particles in the region
• The force is updated on vertices/shadow vertices in shared memory
• Different particles can update the same vertex, so atomic updates are used; this does not consume many cycles, since atomic updates in shared memory are very fast
• Once the block completes execution, the values are written back to global memory
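The scatter-add at the heart of this phase can be sketched serially. This Python stand-in makes the race explicit: several particles touch the same vertex, which is exactly why the GPU kernel uses atomic adds on its shared-memory copy of the vertex array.

```python
def deposit_forces(particle_vertex, particle_force, num_vertices):
    """Serial stand-in for the per-block scatter-add of the field solve phase.

    Several particles may contribute to the same vertex, which is why the
    GPU kernel uses atomic adds into the shared-memory vertex array; here
    the accumulation is just a sequential loop.
    """
    forces = [0.0] * num_vertices      # shared-memory vertex array
    for v, f in zip(particle_vertex, particle_force):
        forces[v] += f                 # atomicAdd on the GPU
    return forces                      # written back to global memory

# Particles 0 and 2 both contribute to vertex 1.
print(deposit_forces([1, 0, 1], [0.5, 2.0, 1.5], 3))  # [2.0, 2.0, 0.0]
```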
![Page 159: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/159.jpg)
Experimental Results
• Mesh from ORNL used for XGC1 benchmarks
• 1.8 million triangles
• 18 million randomly distributed particles
• Level 1 partitioning uses a 32 x 32 rectangular grid (regions)
• NVIDIA Tesla T10 GPU with 4 GB global memory, 16 KB shared memory, and 240 computing cores
Comparison of Triangle Search Time (10 simulation iterations)
• Uniform and non-uniform partitioning give similar performance when there is a sufficient number of GPU blocks
• The simpler uniform partitioning would therefore be a better choice

Uniform partitioning:
GPU blocks   Time (ms)
4096         3111.11
9216         1366.21
16384        877.23
25600        609
36864        500.92
50176        427

Non-uniform partitioning:
GPU blocks   Time (ms)
1024         12561.06
2779         7235.16
22471        989.88
33464        428.51
![Page 160: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/160.jpg)
Particle Sorting Time (Non-Uniform Partitioning)
• ~20x speedup when using shared memory for sorting

Order of sub-region              Sorting using shared    Sorting without shared
(max # of triangles in region)   memory (time in ms)     memory (time in ms)
10                               185.68                  1492.55
200                              183.88                  1881.56
1000                             183.55                  2441.08
5000                             183.99                  3773.92
Particle Incremental Sorting Time (Uniform Partitioning)
• Relatively independent of the number of blocks used

Number of GPU blocks   Particles per thread   Time (ms)
35157                  2                      61.49
17579                  4                      60.49
4395                   16                     60.73
1099                   64                     61.13
550                    128                    65.79
314                    224                    63.89
276                    255                    65.07
![Page 161: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/161.jpg)
Conclusion
• Methodologies to parallelize PIC on a triangular mesh using GPUs
• Shadow entities (replication) provide a simpler and efficient solution
• The algorithms discussed scale with mesh size and particle count, and can be easily ported to a multi-GPU framework
![Page 162: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/162.jpg)
CCMT
Research Thrust: NGEE Reconfigurable Platform
CCMT| 2
Introduction
Background:
• Behavioral emulation (BE) approach: manages Exascale complexity via
  – BEOs (abstraction of object behavior; not cycle accurate)
  – Multi-scale (abstraction at micro, meso, and macro levels)
Goal: Research & develop a toolset to scale the BE approach of system simulation up to Exascale while maintaining the required performance (speed)
  – Software PDES behavioral simulator
  – Hardware-accelerated behavioral emulator
Approach (for the behavioral emulator):
• Explore methods of mapping BEOs onto systems of reconfigurable processors
• Investigate use of large-scale reconfigurable supercomputing, RSC (e.g., Novo-G#, next-gen RSC) in emulation of Exa/extreme-scale systems
Related research:
• Multi-FPGA systems (Novo-G, Catapult, BEE3-based cluster, Bluehive)
• Multi-FPGA system interconnect (Novo-G#, BEE3-based cluster)
• FPGA-accelerated architectural emulation (RAMP)
• Recent interest in FPGA-based heterogeneous computing for big data and data centers: Microsoft, IBM, Intel, Oracle, Google, Baidu
![Page 163: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/163.jpg)
CCMT| 3
Session 3 Outline
• Introduction: motivation, goal, & approach
• Mapping BEOs onto reconfigurable (RC) platform
  – Novo-G# reconfigurable supercomputer
  – Basic appBEO, procBEO, & commBEO designs
• Current single-FPGA prototype (NGEEv1)
  – Current single-device prototype & demo
  – Additional results & SMP performance comparison
• Transitioning to multiple FPGAs
  – Questions, identified issues, & possible directions
  – Direct FPGA-to-FPGA communication via 3D interconnect in Novo-G#
  – New potential NGEE target architecture
  – Proposed scalability measure
• NGEEv2 single-FPGA design
  – Effect of BE v2 methodology improvements on our FPGA acceleration efforts
  – Multiple-FPGA considerations
• Conclusions
CCMT| 4
Novo-G Reconfigurable Supercomputer
• Developed and deployed at CHREC
  – Most powerful reconfigurable computer in the (academic) world
  – 2012 Alexander Schwarzkopf Prize for Technology Innovation @ NSF Center
• Apps acceleration
  – In key science domains: bioinformatics, finance, image & video processing
• Hardware emulation
  – Behavioral emulation of future-gen systems, up to Exascale
• 2014 upgrade
  – 64 GiDEL ProceV (Stratix V D8)
  – 4x4x4 3D-torus or 6D-hypercube
  – 6 Rx-Tx links per FPGA
  – 4x 10 Gbps per link

Novo-G Annual Growth
2009: 24 GiDEL ProcStar III cards (96 top-end Stratix-III FPGAs), each with 4.25 GB SDRAM
2010: 24 more ProcStar III cards (96 more Stratix-III FPGAs), each with 4.25 GB SDRAM
2011: 24 ProcStar IV cards (96 top-end Stratix-IV FPGAs), each with 8.50 GB SDRAM
2012: 24 more ProcStar IV cards (96 more Stratix-IV FPGAs), each with 8.50 GB SDRAM
2014: 64 ProceV cards (64 top-end Stratix-V FPGAs), with high-speed 4x4x4 torus or 6D-hypercube
![Page 164: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/164.jpg)
CCMT| 5
Novo-G ProceV Upgrade w/ 3D Torus
Novo-G# ("Novo-gee-sharp")
• 32 GiDEL ProceV (Stratix V D8)
• 4x4x2 3D-torus or 5D-hypercube
• 6 Rx-Tx links per FPGA
• 4x 10 Gbps per link
• Data-link layer: SerialLite III protocol
  – Full-duplex, CRC32 protection, in-band or out-of-band flow control
• Physical layer: Interlaken protocol
  – 64B/67B encoding, multi-lane sync.
Fig: ProceV board (Stratix V D8 device; QSFP+ daughterboard with 3x QSFP+ ports, 4x10 Gbps channels each; CXP port underneath with 12x10 Gbps channels; 8-lane PCI Express Gen3), CXP-to-3-QSFP cable providing connectivity for the 3D torus, 8 ProceV nodes in the upgraded Novo-G, and the 4x4x2 torus (soon to be 4x4x4)
Special contributions by Abhijeet Lawande via cost-share from CHREC
CCMT| 6
Novo-G# 3D Torus Protocol Stack
• 3-layer 3D-torus protocol stack (shown above) based on IP from Altera and GiDEL
• Basic point-to-point services provided by Interlaken and SerialLite III; network-oriented services provided by RTL code
• Direct FPGA interconnect is crucial to the scalability of NGEE
Fig: Protocol stack of the 3D-torus FPGA architecture — Application; Layer 3 Router (network layer: dimension-order routing, collective routing, source data buffering); Layer 2 Switch (data-link layer: physical addressing, packet switching, congestion control); Protocol IP (data-link layer: data framing, error detection (CRC)); Transceivers (physical layer: clock recovery, line coding, multi-lane sync.). Network services are provided through RTL design; the lower layers through IP.
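The network layer's dimension-order routing can be sketched as follows. This is an illustrative Python model of routing on the Novo-G# torus, assuming shortest-path wrap-around in each dimension; the actual RTL router is not shown in the deck.

```python
def next_hop(cur, dest, dims=(4, 4, 2)):
    """Dimension-order routing on a torus: correct X first, then Y, then Z,
    taking the shorter wrap-around direction in each dimension.

    `cur` and `dest` are (x, y, z) coordinates; returns a port label
    ('+X', '-X', ...) or None when the packet is at its destination.
    """
    for axis, name in enumerate("XYZ"):
        size, c, d = dims[axis], cur[axis], dest[axis]
        if c != d:
            forward = (d - c) % size       # hops going in the + direction
            return "+" + name if forward <= size - forward else "-" + name
    return None

print(next_hop((0, 0, 0), (3, 1, 0)))  # '-X': wrap-around is shorter
print(next_hop((3, 1, 0), (3, 2, 1)))  # '+Y': X already matches
```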
![Page 165: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/165.jpg)
CCMT| 7
Mapping AppBEO onto a Single FPGA
• High-level appBEO script (abstraction of target app) mapped to custom machine code (MIF file)
• Stream of instructions for procBEOs
• Instruction delivery options
  – Pulled from on-chip ROM
  – Pushed from CPU (instruction stream from CPUs through external memories)
Optimization / Exploration
CCMT| 8
Mapping ProcBEO onto a Single FPGA
• Mimics "real" processor under study
  – Instruction decoding, timekeeping
  – No real computation: interpolation of compute operations
  – Generates tokens to emulate comm packets
• Lightweight processing elements
• Initial prototype
  – One-to-one mapping of procBEOs to interpolation & comm resources
Optimization / Exploration
![Page 166: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/166.jpg)
CCMT| 9
Mapping CommBEO onto a Single FPGA
• System-specific: fabric is explicit emulation of target architecture
• Consists of token buffers, arbiter, router, network timer
• Packets transferred contain characteristics of (not real) data
Optimization / Exploration
CCMT| 10
Session 3 Outline (repeated)
![Page 167: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/167.jpg)
CCMT| 11
Current Single-FPGA Prototype (NGEEv1)
Functioning prototype running on a single FPGA of Novo-G
• No optimization (i.e., max-resource implementation)
• Current core density of 90 for Stratix IV, 256 for Stratix V
  – Each core contains one each of appBEO, procBEO, and commBEO
  – Stratix IV currently limited by FPGA block RAM, not logic
    • 9x10 mesh on Stratix IV: logic 19%, block memory 100%
  – Higher core density on Stratix V
    • 16x16 mesh on Stratix V: logic 94%, block memory 100%
• appBEO scripts stored in on-chip block RAMs as memory initialization files (MIFs)
• Proc interpolation resources replaced with MIF pre-processing
• Explicit emulation of target communication fabric without congestion modeling
• Separate management-plane fabric collecting management tokens for postmortem analysis (e.g., simulation visualization)
CCMT| 12
DEMO: FPGA-Specific appBEOs
• Generate memory initialization files (MIFs) to configure the FPGA simulator
  – R script to convert appBEO instructions into custom NGEE-specific machine code
  – Generates core-level instruction streams for configuring the simulator's FPGA bit file
![Page 168: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/168.jpg)
CCMT| 13
DEMO: Simulator Setup & Execution
• MIF files assembled into the FPGA bit file & loaded into the FPGA
• Custom C driver initiates the simulator and collects results from the FPGA management plane
• Management plane logs core-level events & streams them to the host
CCMT| 14
Additional Results & SMP Performance Comparison
Experimental Setup
• Experiments with TileGX36, next-gen TileGX72, & anticipated Intel Xeon Phi Knights Landing architecture
• Single Stratix IV E530 vs. SMP on a single quad-core Xeon E5520 CPU @ 2.27 GHz
• Proc/comm configurations: Tile 6x6, Tile 9x8, Knights Landing 9x8
• App configuration: work equally distributed to all available cores for each proc/comm configuration
• Apps: 2D MM & Sobel filtering
• App kernels executed 250 times to amortize simulator overheads
• Compare management results to SMP for equivalency/correctness
• Compare execution times to SMP for performance improvement
![Page 169: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/169.jpg)
CCMT| 15
Performance Comparison: 3 Data Points

Tile 6x6:
Benchmark                              Simulated Time (consistent   Prediction   FPGA Sim   SMP Sim    Speedup
                                       with SMP results)            Error        Time       Time
2D MM, 1024x1024 across 36 cores       2.82x10^9 ns                 -0.35%       35.7 us    4.82 ms    ~135x
Sobel, 800x600 image across 36 cores   9.27x10^7 ns                 -2.61%       54.2 us    8.08 ms    ~149x

Tile 9x8 (next-gen):
Benchmark                              Simulated Time (consistent   Prediction        FPGA Sim   SMP Sim    Speedup
                                       with SMP results)            Error             Time       Time
2D MM, 1024x1024 across 72 cores       1.66x10^9 ns                 To be determined  81.1 us    10.24 ms   ~126x
Sobel, 800x600 image across 72 cores   2.58x10^7 ns                 To be determined  102.0 us   17.94 ms   ~176x

KNL 9x8 (anticipated):
Benchmark                              Simulated Time (consistent   Prediction        FPGA Sim   SMP Sim    Speedup
                                       with SMP results)            Error             Time       Time
2D MM, 1024x1024 across 72 cores       5.87x10^8 ns                 To be determined  81.1 us    10.24 ms   ~126x
Sobel, 800x600 image across 72 cores   1.37x10^7 ns                 To be determined  102.0 us   17.94 ms   ~176x
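The reported speedups are just the ratio of SMP to FPGA simulation time after converting units. A quick Python check of the Tile 6x6 data points:

```python
def speedup(smp_ms, fpga_us):
    """SMP-to-FPGA speedup: SMP time in ms over FPGA time in microseconds."""
    return smp_ms * 1000.0 / fpga_us

# Reproduce the reported ratios for the Tile 6x6 data points.
print(round(speedup(4.82, 35.7)))   # 135 (2D MM, reported ~135x)
print(round(speedup(8.08, 54.2)))   # 149 (Sobel, reported ~149x)
```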
CCMT| 16
Session 3 Outline (repeated)
![Page 170: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/170.jpg)
CCMT| 17
Transitioning to Multiple FPGAs
With Novo-G# as the target system architecture, this requires modifying the current design to allow communication between commBEOs instantiated on different FPGAs over the direct FPGA interconnect
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?
CCMT| 18
Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
  – Added communication infrastructure & overhead
  – Multi-layer communication protocol & virtual network fabric
  – Modified ISA, packet structure, management tokens, inter-device bandwidth allocation
  – General design considerations (e.g., arbitrary no. of resources vs. hardcoded limits of a single FPGA)
  – Likely reduced BEO density
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?
![Page 171: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/171.jpg)
CCMT| 19
Direct FPGA-to-FPGA Communication via 3D Interconnect
Fig: Per-FPGA router design — app logic with internal transmitter/receiver; a positive router (+X, +Y, +Z) and a negative router (-X, -Y, -Z), each direction served by external transmitters/receivers over transceiver (XCVR) IP at 10 Gbps with 256-bit datapaths; separate router_clk and xcvr_clk domains. A packet is a 256-bit header flit followed by data flits (the last flit may be partially reserved); app-side signals include start_of_packet, end_of_packet, valid, data, size, source, dest, and packet_num.

Header format (field widths as listed on the slide):
Dest (X, Y, Z)     12 bits
Payload size        8 bits
Source (X, Y, Z)   12 bits
Packet number       8 bits
Reserved           24 bits
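The header fields above can be packed into a 64-bit value (12 + 8 + 12 + 8 + 24 bits). The deck gives the widths but not the exact bit positions, so the layout below is an assumption for illustration: fields are packed in the listed order, with each 12-bit coordinate triple split into three 4-bit X/Y/Z fields.

```python
def pack_header(dest, src, payload_size, packet_num):
    """Pack the torus header fields into an integer (hypothetical layout).

    Field order and bit positions are assumptions; the slide only gives
    widths (Dest 12, Payload size 8, Source 12, Packet number 8,
    Reserved 24). Each (x, y, z) triple uses 4 bits per axis.
    """
    def xyz(coord):
        x, y, z = coord
        return (x << 8) | (y << 4) | z     # 4 bits per axis
    h = xyz(dest)
    h = (h << 8) | payload_size
    h = (h << 12) | xyz(src)
    h = (h << 8) | packet_num
    return h << 24                          # reserved (zeroed) low bits

hdr = pack_header((1, 2, 3), (0, 0, 1), 16, 7)
print(hex(hdr))  # 0x1231000107000000
```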
CCMT| 20
Transitioning to Multiple FPGAs (Q1 slide repeated)
![Page 172: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/172.jpg)
CCMT| 21
Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
  – BEO wait times
  – Inter-FPGA event queuing, flow control, queue sizing, event reordering
  – Sharing of BEO resources
  – Proposed scalability measurement & its use to inform multi-FPGA design decisions
Q3. Would the use of other system architectures be more advantageous given new BE requirements?
CCMT| 22
Scalability Studies & Projections

Definitions:
• Emulation system: behavioral emulation platform such as Novo-G#
• Emulated system: appBEOs (e.g., modeling CMT app) stimulating archBEOs (e.g., modeling Blue Gene/L)

Open questions to be answered in the future:
• For a given emulation system architecture (e.g., #FPGAs, BEO core density, core design, interconnect arch, etc.), what are the limits of an emulated system?
  – Including size (e.g., #BEOs) and emulation performance
• For given requirements of an emulated system (e.g., macro-scale emulation with Blue Gene/L), what emulation system resources are necessary?
  – Including #FPGAs, core density, interconnect arch, etc.
![Page 173: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/173.jpg)
CCMT| 23
Potential Scalability Measure
Objective:
• Compare scalability(HW) vs. scalability(SW)
  – HW: hardware approach; SW: software SMP approach
Potential scalability measure for HW:
• Ideally the entire simulation system is on a single large FPGA; thus, communication between BEOs is at on-chip rate
• Baseline: validated BE model for single-FPGA performance (PfS) of NGEE (i.e., a BE model of the FPGA running other BE models)
• Scalability issues arise when BEOs communicate across FPGAs
  – Off-chip communication is much more costly
• Approach: validated BE model for multiple-FPGA performance (PfM) of NGEE (possible after multi-FPGA experiments)
Potential scalability measure: SM(HW) = PfS / PfM
Fig: Emulated system mapped onto a notional single FPGA vs. multiple devices, and a sketch of parallel efficiency vs. no. of devices for FPGA and SMP
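The proposed measure is a simple ratio, shown here with hypothetical placeholder numbers (the deck defines the formula but reports no PfS/PfM values yet):

```python
def scalability_measure(pfs, pfm):
    """SM(HW) = PfS / PfM from the slide: single-FPGA performance over
    multi-FPGA performance for the same emulated system.

    A ratio near 1 means off-chip BEO communication costs little relative
    to on-chip rates; larger ratios indicate poorer multi-FPGA scaling.
    """
    return pfs / pfm

# Hypothetical: the multi-FPGA run achieves 80% of the single-FPGA rate.
print(scalability_measure(1.0, 0.8))  # 1.25
```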
CCMT| 24
Transitioning to Multiple FPGAs (Q2 slide repeated)
![Page 174: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/174.jpg)
CCMT| 25
Transitioning to Multiple FPGAs
Q1. Effect on current single-FPGA design?
Q2. Implications on speed/scalability?
Q3. Would the use of other system architectures be more advantageous given new BE requirements?
CCMT| 26
New Potential NGEE Target Architecture
Node architecture: POWER8 server with
  – 2 CPUs
  – 2 CAPI-attached accelerators
  – 4-accelerator 1D torus
System configuration:
  – 4 POWER8 servers
  – 16 Nallatech boards
  – 16-board 2D torus
  – Up to 32 FPGAs w/ dual-chip boards
  – CAPI enables a hardware kernel bypass
Resource pool functionality & OpenCL support
![Page 175: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/175.jpg)
CCMT| 27
Session 3 Outline (repeated)
CCMT| 28
NGEEv2 Single-FPGA Design
Updated design based on current and possible future developments in the fundamental BE methodology, emulation system architecture, …
• Effect on FPGA acceleration efforts going forward?
  – Both single- & multi-FPGA considerations
• Identified issues
  – BEv2 modifications: e.g., congestion modeling, global task graph manipulation, micro-scale symmetry exploitation, multi-pass simulation, …
• Possible approaches and directions
  – Alternative acceleration approaches?
  – Alternative target system architectures?
![Page 176: Deep Dive, University of Florida (02/2015)](https://reader031.vdocument.in/reader031/viewer/2022020119/589452a31a28ab0e388bc4ae/html5/thumbnails/176.jpg)
CCMT| 29
Conclusions
Progress:
• Working single-FPGA prototype (micro-scale) with max-resource implementation & management plane (no optimization)
• Beginning stages of performance optimization & scalability evaluation
• New design (NGEEv2) ideation

Plans for March:
• Prototype NGEE platform operating on multiple FPGAs
• Showcase results from optimization studies
  – Increased BEO density per FPGA
• Performance comparison with software-based SMP simulator for multiple appBEO scripts
• Upgraded Novo-G# (4x4x4 torus) supporting BE
• New performance/scalability predictions for fully upgraded Novo-G#