Parallelism for the Masses: Opportunities and Challenges
TRANSCRIPT
Andrew A. Chien, Vice President of Research
Intel Corporation
University of Washington/Microsoft Research Institute on the Concurrency Challenge
August 5, 2008
Parallelism for the Masses: Opportunities and Challenges
These are my opinions, not necessarily those of my employer.
Outline
• 4 Historic Opportunities
• What’s Going on at Intel
• Which Parallelism Opportunity
• Wouldn’t it be nice if (Challenges)
• Education and Computing Fundamentals
• Questions
Opportunity #1: Highly Portable, Parallel Software
• A broad range of systems (servers, desktops, laptops, MIDs, smart phones, …) is converging to…
– A single framework with parallelism and a selection of CPUs and specialized elements
– Energy efficiency and performance are core drivers
– Must become “forward scalable”
Parallelism becomes widespread – all software is parallel.
Create standard models of parallelism in architecture, expression, and implementation.
“Forward Scalable” Software
Software with forward scalability can be moved unchanged from N -> 2N -> 4N cores with continued performance increases.
[Figure: # of cores over time – Today: 8 -> 16, 16 -> 32; Academic research: 128 -> 256, 256 -> 512.]
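To make the idea concrete, here is a minimal C++ sketch (not from the talk) of a forward-scalable kernel: it asks the machine how many hardware threads it has and partitions the work accordingly, so the same binary keeps using more cores as it moves from N to 2N to 4N.

```cpp
// Minimal sketch (not Intel code): a data-parallel loop that partitions work
// across however many hardware threads the machine reports, so the same
// binary can use N, 2N, or 4N cores without change.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void scale_all(std::vector<double>& data, double factor) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(data.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&data, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= factor;   // independent iterations: no sharing, no races
        });
    }
    for (auto& w : workers) w.join();
}
```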
Opportunity #2: Major Architectural Support for Programmability
• Single-core growth and aggressive frequency scaling are weakening as competitors to other types of architecture innovation
Architecture innovations for functionality – programmability, observability, GC, … – are now possible.
Don’t ask for small incremental changes; be bold and ask for LARGE changes… that make a LARGE difference.
[Timeline: Mid ’60s to mid ’80s – language support, integration. Mid ’80s to mid 200x – GHz scaling, issue scaling. Terascale era – programming support, parallelism, system integration.]
New Golden Age of Architectural Support for Programming?
Opportunity #3: High Performance, High Level Programming Approaches
• Single-chip integration enables closer coupling (cores, caches) and innovation in inter-core coordination
– Eases performance concerns
– Supports irregular, unstructured parallelism
Forward-scalable performance with good efficiency may be possible without detailed control.
Transactional, functional, side-effect free, declarative, object-oriented, and many other high-level models will thrive.
Huge Opportunity: Manycore != SMP on Die
• Less locality sensitive; efficient sharing
• Runtime techniques more effective for dynamic, irregular data and programs
• Can we do less tuning? And program at a higher level?

Parameter           SMP          Tera-scale   Improvement
On-die bandwidth    12 GB/s      ~1.2 TB/s    ~100X
On-die latency      400 cycles   20 cycles    ~20X
Opportunity #4: Parallelism Can Add New Kinds of Capability and Value
• Additional core and computational capability will be available on-chip and can be exploited at modest additional cost
– Single-chip design enables enhancement at low cost
Deploy parallelism to improve the application, the software, and the user experience.
Traditional: user interface, visual computing, large data analysis, machine learning, activity inference
New: Security – network and platform monitoring; Software – health monitoring, debugging, and lifecycle monitoring; Performance – dynamic tuning and customization
Intel Research: Log-Based Architectures (joint w/CMU)
Problem Statement
• Hypothesis: Architectural support can significantly improve the performance of online software correctness-checking tools (lifeguards).
• Approach: As the application runs, hardware captures an execution log; the log is transported via the processor cache and consumed by a lifeguard running on a separate core.
Research Activities
• Design of architecture support for logging.
• Design/evaluation of log compression.
• Port of lifeguards to LBA.
• Linux system software implementation.
• Design of acceleration hardware.
• Detailed performance evaluation.
• Design for monitoring parallel applications (ongoing).
Performance Comparisons
• Binary instrumentation: 20-40X slowdown
• LBA baseline architecture: 3.2-4.3X slowdown
• LBA w/acceleration: 2-50% slowdown
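The LBA work itself is hardware support, but the division of labor can be illustrated with a purely software analogue. The sketch below is a hypothetical illustration, not the LBA design: the application thread emits allocation events into a log, and a lifeguard thread, which would ideally run on a separate core, consumes the log and flags double frees.

```cpp
// Software-only analogue of the lifeguard idea (hypothetical sketch, not the
// actual LBA hardware design): the application thread emits a log of
// alloc/free events; a lifeguard thread consumes the log and flags double frees.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <unordered_set>

struct Event { bool is_alloc; void* addr; };

std::mutex              log_mtx;
std::condition_variable log_cv;
std::queue<Event>       log_q;
bool                    done = false;

void emit(Event e) {                       // called by the application thread
    { std::lock_guard<std::mutex> lk(log_mtx); log_q.push(e); }
    log_cv.notify_one();
}

void lifeguard() {                         // would run on a separate core
    std::unordered_set<void*> live;
    for (;;) {
        std::unique_lock<std::mutex> lk(log_mtx);
        log_cv.wait(lk, [] { return !log_q.empty() || done; });
        if (log_q.empty()) return;
        Event e = log_q.front(); log_q.pop();
        lk.unlock();
        if (e.is_alloc) live.insert(e.addr);
        else if (!live.erase(e.addr))
            std::printf("lifeguard: double free of %p\n", e.addr);
    }
}

int main() {
    std::thread guard(lifeguard);

    int*  p    = new int(42);
    void* addr = p;
    emit({true, addr});
    delete p;
    emit({false, addr});
    emit({false, addr});                   // deliberately bad: flagged by the lifeguard

    { std::lock_guard<std::mutex> lk(log_mtx); done = true; }
    log_cv.notify_one();
    guard.join();
}
```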
What’s Going on at Intel
Intel Ct -- C for Throughput Computing
• Ct adds parallel collection objects & methods to C++
– Works with standard C++ compilers (ICC, GCC, VC++)
• Ct abstracts away architectural details
– Vector ISA width / core count / memory model / cache sizes
• Ct forward-scales software written today
– Ct is designed to be dynamically retargetable to SSE, AVX, Larrabee, and beyond
• Ct is deterministic
– No data races
See whatif.intel.com
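The following sketch is written in the spirit of Ct rather than with the actual Ct API (ParVec, map, and sum are hypothetical names): the programmer expresses element-wise operations and reductions over a collection and never mentions vector width, core count, or cache sizes, which is what leaves a runtime free to retarget the code.

```cpp
// Hypothetical sketch in the spirit of Ct (not the actual Ct API): the
// programmer writes element-wise operations over a parallel collection and
// never mentions SSE width, core count, or cache sizes.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename T>
class ParVec {                        // hypothetical parallel collection
public:
    explicit ParVec(std::vector<T> v) : data_(std::move(v)) {}

    // Element-wise map; a real runtime could run this with SIMD across many cores.
    ParVec map(std::function<T(T)> f) const {
        std::vector<T> out(data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) out[i] = f(data_[i]);
        return ParVec(std::move(out));
    }

    // Deterministic reduction: no data races are expressible at this level.
    T sum() const {
        T acc = T();
        for (const T& x : data_) acc += x;
        return acc;
    }

private:
    std::vector<T> data_;
};

// Usage: an "a * x" then sum computation, expressed without architectural detail.
double scaled_sum(const ParVec<double>& x, double a) {
    return x.map([a](double v) { return a * v; }).sum();
}
```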
Intel® Concurrent Collections for C/C++
The work of the domain expert:
• Semantic correctness
• Constraints required by the application
The work of the tuning expert:
• Architecture
• Actual parallelism
• Locality
• Overhead
• Load balancing
• Distribution among processors
• Scheduling within a processor
The flow runs from the application problem, to the Concurrent Collections spec, to the mapping onto the target platform. This supports serious separation of concerns:
• The domain expert does not need to know about parallelism.
• The tuning expert does not need to know about the domain.
[Example graph: an image feeding Classifier1, Classifier2, and Classifier3.]
http://whatif.intel.com
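As a toy illustration of the slide’s classifier graph (an image feeding Classifier1, Classifier2, and Classifier3), the sketch below separates the two roles: the domain expert writes the steps as ordinary functions, and the mapping onto the platform, here just std::async, is a separate concern. This is not the Concurrent Collections API; all names are illustrative.

```cpp
// Toy sketch of the slide's classifier graph, not the Intel Concurrent
// Collections API.  The domain expert only states what each step computes;
// whether the steps run serially or in parallel is left to the mapping below.
#include <future>
#include <string>
#include <vector>

struct Image { std::vector<unsigned char> pixels; };

// Domain expert: pure steps, no mention of threads or scheduling.
std::string classifier1(const Image& img) { return img.pixels.empty() ? "empty" : "faces?"; }
std::string classifier2(const Image& img) { return img.pixels.size() > 1000 ? "large" : "small"; }
std::string classifier3(const Image&)     { return "outdoor?"; }

// Tuning expert: one possible mapping of the graph onto a platform.
std::vector<std::string> run_graph(const Image& img) {
    auto f1 = std::async(std::launch::async, classifier1, std::cref(img));
    auto f2 = std::async(std::launch::async, classifier2, std::cref(img));
    auto f3 = std::async(std::launch::async, classifier3, std::cref(img));
    return { f1.get(), f2.get(), f3.get() };
}
```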
Intel: Making Parallel Computing Pervasive
Enabling Parallel Computing
• Academic Research (UPCRCs) – academic research seeking disruptive innovations 7-10+ years out
• Intel Tera-scale Research – joint HW/SW R&D program to enable Intel products 3-7+ years in the future
• Software Products – wide array of leading multi-core SW development tools & info available today
• Community and Experimental Tools – TBB open sourced; STM-enabled compiler on whatif.intel.com; parallel benchmarks at Princeton’s PARSEC site
• Multi-core Education – Multi-core Education Program (400+ universities, 25,000+ students; 2008 goal: double this); Intel® Academic Community; Threading for Multi-core SW community; multi-core books
Multicores and Manycores
• Dunnington – x86, 6 cores
– 16 MB L3, 12 cores/socket
• Nehalem (new uArch)
– 4 cores, 4-wide, IMC + QPI
– 2-way SMT
• LRB – x86 + graphics
– Large # of cores
– Coherent caches
– Shared L2
“Polaris” Teraflops Research Processor
• Deliver tera-scale performance
– TFLOP @ 62 W (desktop power), 16 GF/W
– Frequency target 5 GHz, 80 cores
– Bi-section B/W of 256 GB/s
– Link bandwidth in hundreds of GB/s
• Prototype two key technologies
– On-die interconnect fabric
– 3D stacked memory
• Develop a scalable design methodology
– Tiled design approach
– Mesochronous clocking
– Power-aware capability
[Die plot: 12.64 mm × 21.72 mm full chip; single tile 1.5 mm × 2.0 mm; I/O areas, PLL, and TAP visible.]
Technology: 65 nm, 1 poly, 8 metal (Cu)
Transistors: 100 million (full chip), 1.2 million (tile)
Die area: 275 mm² (full chip), 3 mm² (tile)
C4 bumps: 8390
Terascale Chip Research Questions
• Cores
– How many? What size?
– Homogeneous, heterogeneous
– Programmable, configurable, fixed-function
• Chip level
– Interconnect: topology, bandwidth
– Coordination
– Management
• Memory hierarchy
– # of levels, sharing, inclusion
– Bandwidth, novel technology
– Integration/packaging
• I/O bandwidth
– Silicon-based photonics
– Terabit links
Manycore chips (circa 2015)?
Source: CTWatch Quarterly, Feb 2007
Tera-scale Computing Research
Platform
• Essential: 3D stacked memory, cache hierarchy, virtualization/partitioning, scalable OSs, I/O & networking
• Complementary: CMOS radio, photonics, EESA
Programming
• Essential: speculative multithreading, transactional memory, workload analysis, compilers & libraries, tools
• Complementary: Diamond media indexing, activity recognition, usage models
Microprocessor
• Essential: scalable memory, multi-core architectures, specialized cores, scalable fabrics, energy-efficient circuits
• Complementary: Si process technology, CMOS VR, resiliency
100+ Projects Worldwide
Which Parallelism Opportunity?
Which Parallelism Opportunity?
• Data Center, Desktop/Laptop, Mobile/embedded
Single Chip Parallelism
Internet
Data Center & HPC
Unified IA and Parallelism Vision – a Foundation Across Platforms
CE (Canmore), Ultra-Mobile and Low Cost PC Platforms
Intel® Core™ 2 and Nehalem Architecture – Notebook, Desktop and Server
[Figure: platforms arranged along power and performance axes]
System on a Chip
Visual Computing and Next Generation Gaming Console
Integration Options – Example: Graphics
Two Students in 2015
“How many cores does your computer have?”
“I’ve got an Intel thousand-core with 100 OOO cores!”
End Users don’t care about core counts; they care about capability
Chip Real Estate and Performance/Value
• Tukwila – upcoming Intel Itanium processor
– 4 cores, 2B transistors, 30 MB cache
– 50/50 split on transistors
• 1% of chip area = 1/3 MB = 30M transistors
• 0.1% of chip area = 1/30 MB = 3M transistors
• How much performance benefit do these elements provide?
• What incremental performance benefit would you expect for the last core in a 100-core? 1000-core?
Which Programming Community?
• “Quick functionality, adequate performance”
– Matlab, Mathematica, R, SAS, etc.
– Visual Basic, Perl, Python, Ruby/Rails
• “Mix of productivity and performance”
– Java, C# (managed languages + rich libraries)
• “Performance programmers”
– C++ and STL
– C and Fortran
• HPC has focused nearly exclusively on “performance programmers”
[Figure: the tiers span a spectrum from higher productivity to higher efficiency.]
A different perspective…
• Guy: “…the parallel part isn’t what makes this hard…”
• Andrew: “… it’s the ugliness of the low level machine details and programming models that makes both parallelism and concurrency difficult…”
• Can we create a world in which the upper-layer programmers (the “Joes”) can write lots of 100-fold parallel programs?
Don’t forget to focus on high-productivity programmers. They write the most code (and the killer apps!) and are least tolerant of low-level abstractions and tedium.
Parallelism for the Masses Focus
• Clients are where the action is – smartphones, laptops, and interaction with the physical world
• Easy-to-use parallelism for the rapid innovators and the moderate-programming-effort crowd
• Incremental performance improvement matters – more cores translate into higher performance – but program efficiency is less critical
• => 100Ms of systems, Ms of programmers
HPC… what have we learned about parallelism and how to approach it…
• Large scale – extreme scalability is possible, and typically comes with scaling of problems and data
• Portable expression of parallelism matters
• Multi-version programming (algorithms and implementations) is a good idea
• High-level program analysis is a critical technology
• Autotuning is a good idea (see the sketch after this list)
• Working with domain experts is a good idea
• Locality is hard, modularity is hard, data structures are hard, efficiency is hard…
• Of course, this list is not exhaustive…
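As a tiny illustration of the multi-version and autotuning lessons, the following sketch (illustrative, not from the talk) keeps two implementations of the same kernel and, at startup, picks whichever runs faster on the current machine.

```cpp
// Minimal autotuning / multi-version sketch: keep two implementations of the
// same kernel, time both on a sample input, and use the faster one here.
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

using Kernel = double (*)(const std::vector<double>&);

double sum_simple(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double sum_unrolled(const std::vector<double>& v) {      // alternative version
    double a = 0.0, b = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) { a += v[i]; b += v[i + 1]; }
    if (i < v.size()) a += v[i];
    return a + b;
}

Kernel pick_faster() {
    std::vector<double> sample(1 << 20, 1.0);
    auto time_it = [&](Kernel k) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = k(sample);                // keep the call alive
        (void)sink;
        return std::chrono::steady_clock::now() - t0;
    };
    return time_it(sum_simple) <= time_it(sum_unrolled) ? sum_simple : sum_unrolled;
}
```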
HPC… lessons not to learn …
• Programmer effort doesn’t matter
• Hardware efficiency matters
• Low-level programming tools are acceptable
• Low-level control is an acceptable path to performance
• Horizontal locality and control of communication is the critical performance concern
Wouldn’t it be nice if? (Challenges)
• Programmers could write parallel code with minimal effort
– Perhaps only a simple, limited model
– Serve “fast prototyping” programmers (but all tiers of programmers); a short time to working, parallel code…
– Reasonable performance (and “forward scalability”) delivered by the implementation
• Programmers could migrate gracefully from this high-level view to lower levels of specification (see the sketch below)
– To improve performance
– To increase control or provide information to the underlying runtime system
– Parallelism, locality, “hints”, etc.
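A hedged sketch of what graceful migration might look like in plain C++ (the grain-size parameter is an illustrative stand-in for a tuning hint, not a specific Intel API): version 1 only says the work can be split, while version 2 computes the same result but accepts one extra piece of information for performance.

```cpp
// Sketch of "graceful migration" (assumed example): version 1 is the
// high-level form; version 2 is the same computation with a grain-size hint
// handed to the implementation.  Both assume x.size() == y.size().
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Version 1: high level.  Just say the two halves are independent.
double dot_v1(const std::vector<double>& x, const std::vector<double>& y) {
    std::size_t half = x.size() / 2;
    auto lo = std::async(std::launch::async, [&] {
        double s = 0; for (std::size_t i = 0; i < half; ++i) s += x[i] * y[i]; return s;
    });
    double hi = 0;
    for (std::size_t i = half; i < x.size(); ++i) hi += x[i] * y[i];
    return lo.get() + hi;
}

// Version 2: same semantics, plus a grain-size "hint" controlling task size,
// the kind of extra information a programmer might add when tuning.
double dot_v2(const std::vector<double>& x, const std::vector<double>& y,
              std::size_t grain) {
    std::vector<std::future<double>> parts;
    for (std::size_t b = 0; b < x.size(); b += grain) {
        std::size_t e = std::min(x.size(), b + grain);
        parts.push_back(std::async(std::launch::async, [&, b, e] {
            double s = 0; for (std::size_t i = b; i < e; ++i) s += x[i] * y[i]; return s;
        }));
    }
    double total = 0;
    for (auto& p : parts) total += p.get();
    return total;
}
```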
Wouldn’t it be nice if?
• There were an easy, modular way to express locality.
• Computation = Algorithms + Data structures
• Efficient Computation = Algorithms + Data Structures + Locality
• How do we associate these in our programming models? Explicitly, implicitly, or otherwise?
– A + DS + L
– (A + DS) + L
– A + (DS + L)
– (A + L) + DS
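A minimal example of why locality deserves a place in the formula: the same algorithm over the same data structure, where only the traversal order relative to the row-major layout changes.

```cpp
// Same algorithm, same data structure; one traversal matches the row-major
// layout (good locality) and one strides across it (poor locality).
#include <cstddef>
#include <vector>

double sum_row_major(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
    double s = 0;
    for (std::size_t r = 0; r < rows; ++r)          // walks memory contiguously
        for (std::size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];
    return s;
}

double sum_col_major(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
    double s = 0;
    for (std::size_t c = 0; c < cols; ++c)          // strides by `cols` each step
        for (std::size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];
    return s;
}
```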
Wouldn’t it be nice if?
• Parallel program composition preserved correctness of the composed parts
• Interactions when necessary were controllable at the programmer’s semantic level
• These interactions were supported efficiently by the hardware
• The TM vision, but there are still many more approaches to explore
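A miniature version of the composition problem follows, with an atomic block emulated by a single global lock, since standard C++ has no transactional memory; the point is what TM-style composition would buy, not how it would be implemented.

```cpp
// Why composition is hard, in miniature: deposit() and withdraw() are each
// thread-safe, but transfer_broken() composed from them is not atomic as a
// whole; another thread can observe the money "in flight".  An atomic block
// in the TM style would restore composability; here it is only emulated.
#include <mutex>

struct Account {
    std::mutex m;
    long balance = 0;
    void deposit(long v)  { std::lock_guard<std::mutex> g(m); balance += v; }
    void withdraw(long v) { std::lock_guard<std::mutex> g(m); balance -= v; }
};

// Not atomic as a whole: between the two calls, the total appears to change.
void transfer_broken(Account& from, Account& to, long v) {
    from.withdraw(v);
    to.deposit(v);
}

// Emulated "atomic { ... }" block: one global lock stands in for what
// transactional memory would provide without serializing everything.
std::mutex tm_global;
void transfer_atomic(Account& from, Account& to, long v) {
    std::lock_guard<std::mutex> g(tm_global);
    from.withdraw(v);
    to.deposit(v);
}
```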
Wouldn’t it be nice if?
• Parallel data structures could be made correct under parallel program composition
• Interactions when necessary were controllable at the programmer’s semantic level
• These interactions were supported efficiently by the hardware
• The lock-free vision? Or a generalized version for arbitrary user data structures
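A minimal sketch of the lock-free vision for one data structure, a Treiber-style stack whose push and pop use compare-and-swap; popped nodes are deliberately never freed, because safe memory reclamation (and the ABA hazard it prevents) is precisely where generalizing this to arbitrary user data structures gets hard.

```cpp
// Treiber-style lock-free stack: push/pop use compare-and-swap instead of
// locks.  To stay simple and correct, popped nodes are never freed; safe
// reclamation is the hard part this sketch sidesteps.
#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T v) {
        Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // on failure, n->next was refreshed with the current head; retry
        }
    }

    bool pop(T& out) {
        Node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
            // on failure, n was refreshed with the current head; retry
        }
        if (!n) return false;
        out = n->value;   // node intentionally leaked (see comment above)
        return true;
    }
};
```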
Education and the Foundation of Computing
Parallelism Implications for Computing Education in 2015 (and NOW!)
• Parallelism is in all computing systems, and should be front and center in education
– An integral part of early introduction and experience with programming
– An integral part of early algorithms and theory
– An integral part of software architecture and engineering
• Observation: No biologist, physicist, chemist, or engineer thinks about the world with a fundamentally sequential foundation.
• The world is parallel! We can design large-scale concurrent systems of staggering complexity.
Are All Applications Parallel?
[Figure: Today vs. the Manycore Era; 100% / ??%; # of applications growing rapidly.]
Summary
• Four Historic Opportunities
– #1: Highly portable parallel software
– #2: Major architecture support for programmability
– #3: High-level, high-performance parallel programming
– #4: Parallelism adds new kinds of capability and value
• Which Parallelism Opportunity?
– All kinds of clients need parallelism; how do we reach the largest community of programmers?
• Some ambitious but reasonable “Challenges”
– Programmability, access to performance, portable locality
Wanted: Breakthroughs in Parallel Programming
Questions?
Copyright © Intel Corporation. *Other names and brands may be claimed as the property of others. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, or used for any public purpose under any circumstance, without the express written consent and permission of Intel Corporation.