Parallelism for the Masses: Opportunities and Challenges
TRANSCRIPT
Andrew A. Chien, Vice President of Research
Intel Corporation
University of Washington/Microsoft Research Institute on the Concurrency Challenge
August 5, 2008
Parallelism for the Masses: Opportunities and Challenges
These are my opinions, not necessarily those of my employer.
Outline
• 4 Historic Opportunities
• What’s Going on at Intel
• Which Parallelism Opportunity
• Wouldn’t it be nice if (Challenges)
• Education and Computing Fundamentals
• Questions
Opportunity #1: Highly Portable, Parallel Software
• A broad range of systems (servers, desktops, laptops, MIDs, smart phones, …) is converging to…
– A single framework with parallelism and a selection of CPUs and specialized elements
– Energy efficiency and performance are core drivers
– Must become “forward scalable”
Parallelism becomes widespread – all software is parallel.
Create standard models of parallelism in architecture, expression, and implementation.
“Forward Scalable” Software
Software with forward scalability can be moved unchanged from N -> 2N -> 4N cores with continued performance increases.
[Figure: # of cores over time – Today: 8 -> 16, 16 -> 32; Academic research: 128 -> 256, 256 -> 512.]
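To make the idea concrete, here is a minimal C++ sketch (not from the talk) of a forward-scalable kernel: it asks the machine how many hardware threads it has and partitions the work accordingly, so the same binary keeps using more cores as it moves from N to 2N to 4N.

```cpp
// Minimal sketch (not Intel code): a data-parallel loop that partitions work
// across however many hardware threads the machine reports, so the same
// binary can use N, 2N, or 4N cores without change.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void scale_all(std::vector<double>& data, double factor) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(data.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&data, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= factor;   // independent iterations: no sharing, no races
        });
    }
    for (auto& w : workers) w.join();
}
```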
Opportunity #2: Major Architectural Support for Programmability
• Single-core growth and aggressive frequency scaling are weakening as competitors to other types of architecture innovation
Architecture innovations for functionality – programmability, observability, GC, … – are now possible.
Don’t ask for small incremental changes; be bold and ask for LARGE changes… that make a LARGE difference.
[Timeline: Mid ’60s to mid ’80s – language support, integration. Mid ’80s to mid 200x – GHz scaling, issue scaling. Terascale era – programming support, parallelism, system integration.]
New Golden Age of Architectural Support for Programming?
Opportunity #3: High Performance, High Level Programming Approaches
• Single-chip integration enables closer coupling (cores, caches) and innovation in inter-core coordination
– Eases performance concerns
– Supports irregular, unstructured parallelism
Forward-scalable performance with good efficiency may be possible without detailed control.
Transactional, functional, side-effect free, declarative, object-oriented, and many other high-level models will thrive.
Huge Opportunity: Manycore != SMP on Die
• Less locality sensitive; efficient sharing
• Runtime techniques more effective for dynamic, irregular data and programs
• Can we do less tuning? And program at a higher level?

Parameter           SMP          Tera-scale   Improvement
On-die bandwidth    12 GB/s      ~1.2 TB/s    ~100X
On-die latency      400 cycles   20 cycles    ~20X
Opportunity #4: Parallelism Can Add New Kinds of Capability and Value
• Additional core and computational capability will be available on-chip and can be exploited at modest additional cost
– Single-chip design enables enhancement at low cost
Deploy parallelism to improve the application, the software, and the user experience.
Traditional: user interface, visual computing, large data analysis, machine learning, activity inference
New: Security – network and platform monitoring; Software – health monitoring, debugging, and lifecycle monitoring; Performance – dynamic tuning and customization
Intel Research: Log-Based Architectures (joint w/CMU)
Problem Statement
• Hypothesis: Architectural support can significantly improve the performance of online software correctness-checking tools (lifeguards).
• Approach: As the application runs, hardware captures an execution log; the log is transported via the processor cache and consumed by a lifeguard running on a separate core.
Research Activities
• Design of architecture support for logging.
• Design/evaluation of log compression.
• Port of lifeguards to LBA.
• Linux system software implementation.
• Design of acceleration hardware.
• Detailed performance evaluation.
• Design for monitoring parallel applications (ongoing).
Performance Comparisons
• Binary instrumentation: 20-40X slowdown
• LBA baseline architecture: 3.2-4.3X slowdown
• LBA w/acceleration: 2-50% slowdown
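The LBA work itself is hardware support, but the division of labor can be illustrated with a purely software analogue. The sketch below is a hypothetical illustration, not the LBA design: the application thread emits allocation events into a log, and a lifeguard thread, which would ideally run on a separate core, consumes the log and flags double frees.

```cpp
// Software-only analogue of the lifeguard idea (hypothetical sketch, not the
// actual LBA hardware design): the application thread emits a log of
// alloc/free events; a lifeguard thread consumes the log and flags double frees.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <unordered_set>

struct Event { bool is_alloc; void* addr; };

std::mutex              log_mtx;
std::condition_variable log_cv;
std::queue<Event>       log_q;
bool                    done = false;

void emit(Event e) {                       // called by the application thread
    { std::lock_guard<std::mutex> lk(log_mtx); log_q.push(e); }
    log_cv.notify_one();
}

void lifeguard() {                         // would run on a separate core
    std::unordered_set<void*> live;
    for (;;) {
        std::unique_lock<std::mutex> lk(log_mtx);
        log_cv.wait(lk, [] { return !log_q.empty() || done; });
        if (log_q.empty()) return;
        Event e = log_q.front(); log_q.pop();
        lk.unlock();
        if (e.is_alloc) live.insert(e.addr);
        else if (!live.erase(e.addr))
            std::printf("lifeguard: double free of %p\n", e.addr);
    }
}

int main() {
    std::thread guard(lifeguard);

    int*  p    = new int(42);
    void* addr = p;
    emit({true, addr});
    delete p;
    emit({false, addr});
    emit({false, addr});                   // deliberately bad: flagged by the lifeguard

    { std::lock_guard<std::mutex> lk(log_mtx); done = true; }
    log_cv.notify_one();
    guard.join();
}
```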
What’s Going on at Intel
Intel Ct -- C for Throughput Computing
• Ct adds parallel collection objects & methods to C++
– Works with standard C++ compilers (ICC, GCC, VC++)
• Ct abstracts away architectural details
– Vector ISA width / core count / memory model / cache sizes
• Ct forward-scales software written today
– Ct is designed to be dynamically retargetable to SSE, AVX, Larrabee, and beyond
• Ct is deterministic
– No data races
See whatif.intel.com
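The following sketch is written in the spirit of Ct rather than with the actual Ct API (ParVec, map, and sum are hypothetical names): the programmer expresses element-wise operations and reductions over a collection and never mentions vector width, core count, or cache sizes, which is what leaves a runtime free to retarget the code.

```cpp
// Hypothetical sketch in the spirit of Ct (not the actual Ct API): the
// programmer writes element-wise operations over a parallel collection and
// never mentions SSE width, core count, or cache sizes.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

template <typename T>
class ParVec {                        // hypothetical parallel collection
public:
    explicit ParVec(std::vector<T> v) : data_(std::move(v)) {}

    // Element-wise map; a real runtime could run this with SIMD across many cores.
    ParVec map(std::function<T(T)> f) const {
        std::vector<T> out(data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) out[i] = f(data_[i]);
        return ParVec(std::move(out));
    }

    // Deterministic reduction: no data races are expressible at this level.
    T sum() const {
        T acc = T();
        for (const T& x : data_) acc += x;
        return acc;
    }

private:
    std::vector<T> data_;
};

// Usage: an "a * x" then sum computation, expressed without architectural detail.
double scaled_sum(const ParVec<double>& x, double a) {
    return x.map([a](double v) { return a * v; }).sum();
}
```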
Intel® Concurrent Collections for C/C++
The work of the domain expert:
• Semantic correctness
• Constraints required by the application
The work of the tuning expert:
• Architecture
• Actual parallelism
• Locality
• Overhead
• Load balancing
• Distribution among processors
• Scheduling within a processor
The flow runs from the application problem, to the Concurrent Collections spec, to the mapping onto the target platform. This supports serious separation of concerns:
• The domain expert does not need to know about parallelism.
• The tuning expert does not need to know about the domain.
[Example graph: an image feeding Classifier1, Classifier2, and Classifier3.]
http://whatif.intel.com
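As a toy illustration of the slide’s classifier graph (an image feeding Classifier1, Classifier2, and Classifier3), the sketch below separates the two roles: the domain expert writes the steps as ordinary functions, and the mapping onto the platform, here just std::async, is a separate concern. This is not the Concurrent Collections API; all names are illustrative.

```cpp
// Toy sketch of the slide's classifier graph, not the Intel Concurrent
// Collections API.  The domain expert only states what each step computes;
// whether the steps run serially or in parallel is left to the mapping below.
#include <future>
#include <string>
#include <vector>

struct Image { std::vector<unsigned char> pixels; };

// Domain expert: pure steps, no mention of threads or scheduling.
std::string classifier1(const Image& img) { return img.pixels.empty() ? "empty" : "faces?"; }
std::string classifier2(const Image& img) { return img.pixels.size() > 1000 ? "large" : "small"; }
std::string classifier3(const Image&)     { return "outdoor?"; }

// Tuning expert: one possible mapping of the graph onto a platform.
std::vector<std::string> run_graph(const Image& img) {
    auto f1 = std::async(std::launch::async, classifier1, std::cref(img));
    auto f2 = std::async(std::launch::async, classifier2, std::cref(img));
    auto f3 = std::async(std::launch::async, classifier3, std::cref(img));
    return { f1.get(), f2.get(), f3.get() };
}
```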
Intel: Making Parallel Computing Pervasive
Enabling Parallel Computing
• Academic Research (UPCRCs) – academic research seeking disruptive innovations 7-10+ years out
• Intel Tera-scale Research – joint HW/SW R&D program to enable Intel products 3-7+ years in the future
• Software Products – wide array of leading multi-core SW development tools & info available today
• Community and Experimental Tools – TBB open sourced; STM-enabled compiler on whatif.intel.com; parallel benchmarks at Princeton’s PARSEC site
• Multi-core Education – Multi-core Education Program (400+ universities, 25,000+ students; 2008 goal: double this); Intel® Academic Community; Threading for Multi-core SW community; multi-core books
Multicores and Manycores
• Dunnington – x86, 6 cores
– 16 MB L3, 12 cores/socket
• Nehalem (new uArch)
– 4 cores, 4-wide, IMC + QPI
– 2-way SMT
• LRB – x86 + graphics
– Large # of cores
– Coherent caches
– Shared L2
“Polaris” Teraflops Research Processor
• Deliver tera-scale performance
– TFLOP @ 62 W (desktop power), 16 GF/W
– Frequency target 5 GHz, 80 cores
– Bi-section B/W of 256 GB/s
– Link bandwidth in hundreds of GB/s
• Prototype two key technologies
– On-die interconnect fabric
– 3D stacked memory
• Develop a scalable design methodology
– Tiled design approach
– Mesochronous clocking
– Power-aware capability
[Die plot: 12.64 mm × 21.72 mm full chip; single tile 1.5 mm × 2.0 mm; I/O areas, PLL, and TAP visible.]
Technology: 65 nm, 1 poly, 8 metal (Cu)
Transistors: 100 million (full chip), 1.2 million (tile)
Die area: 275 mm² (full chip), 3 mm² (tile)
C4 bumps: 8390
Terascale Chip Research Questions
• Cores
– How many? What size?
– Homogeneous, heterogeneous
– Programmable, configurable, fixed-function
• Chip level
– Interconnect: topology, bandwidth
– Coordination
– Management
• Memory hierarchy
– # of levels, sharing, inclusion
– Bandwidth, novel technology
– Integration/packaging
• I/O bandwidth
– Silicon-based photonics
– Terabit links
Manycore chips (circa 2015)?
Source: CTWatch Quarterly, Feb 2007
Tera-scale Computing Research
Platform
• Essential: 3D stacked memory, cache hierarchy, virtualization/partitioning, scalable OSs, I/O & networking
• Complementary: CMOS radio, photonics, EESA
Programming
• Essential: speculative multithreading, transactional memory, workload analysis, compilers & libraries, tools
• Complementary: Diamond media indexing, activity recognition, usage models
Microprocessor
• Essential: scalable memory, multi-core architectures, specialized cores, scalable fabrics, energy-efficient circuits
• Complementary: Si process technology, CMOS VR, resiliency
100+ Projects Worldwide
Which Parallelism Opportunity?
Which Parallelism Opportunity?
• Data Center, Desktop/Laptop, Mobile/embedded
Single Chip Parallelism
Internet
Data Center & HPC
Unified IA and Parallelism Vision – a Foundation Across Platforms
CE (Canmore), Ultra-Mobile and Low Cost PC Platforms
Intel® Core™ 2 and Nehalem Architecture – Notebook, Desktop and Server
[Figure: platforms arranged along power and performance axes]
System on a Chip
Visual Computing and Next Generation Gaming Console
Integration Options – Example: Graphics
Two Students in 2015
“How many cores does your computer have?”
“I’ve got an Intel thousand-core with 100 OOO cores!”
End Users don’t care about core counts; they care about capability
Chip Real Estate and Performance/Value
• Tukwila – upcoming Intel Itanium processor
– 4 cores, 2B transistors, 30 MB cache
– 50/50 split on transistors
• 1% of chip area = 1/3 MB = 30M transistors
• 0.1% of chip area = 1/30 MB = 3M transistors
• How much performance benefit do these elements provide?
• What incremental performance benefit would you expect for the last core in a 100-core? 1000-core?
Which Programming Community?
• “Quick functionality, adequate performance”
– Matlab, Mathematica, R, SAS, etc.
– Visual Basic, Perl, Python, Ruby/Rails
• “Mix of productivity and performance”
– Java, C# (managed languages + rich libraries)
• “Performance programmers”
– C++ and STL
– C and Fortran
• HPC has focused nearly exclusively on “performance programmers”
[Figure: the tiers span a spectrum from higher productivity to higher efficiency.]
A different perspective…
• Guy: “…the parallel part isn’t what makes this hard…”
• Andrew: “… it’s the ugliness of the low level machine details and programming models that makes both parallelism and concurrency difficult…”
• Can we create a world in which the upper-layer programmers (the “Joes”) can write lots of 100-fold parallel programs?
Don’t forget to focus on high-productivity programmers. They write the most code (and the killer apps!) and are least tolerant of low-level abstractions and tedium.
Parallelism for the Masses Focus
• Clients are where the action is – smartphones, laptops, and interaction with the physical world
• Easy-to-use parallelism for the rapid innovators and the moderate-programming-effort crowd
• Incremental performance improvement matters – more cores translate into higher performance – but program efficiency is less critical
• => 100Ms of systems, Ms of programmers
HPC… what have we learned about parallelism and how to approach it…
• Large scale – extreme scalability is possible, and typically comes with scaling of problems and data
• Portable expression of parallelism matters
• Multi-version programming (algorithms and implementations) is a good idea
• High-level program analysis is a critical technology
• Autotuning is a good idea (see the sketch after this list)
• Working with domain experts is a good idea
• Locality is hard, modularity is hard, data structures are hard, efficiency is hard…
• Of course, this list is not exhaustive…
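As a tiny illustration of the multi-version and autotuning lessons, the following sketch (illustrative, not from the talk) keeps two implementations of the same kernel and, at startup, picks whichever runs faster on the current machine.

```cpp
// Minimal autotuning / multi-version sketch: keep two implementations of the
// same kernel, time both on a sample input, and use the faster one here.
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

using Kernel = double (*)(const std::vector<double>&);

double sum_simple(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double sum_unrolled(const std::vector<double>& v) {      // alternative version
    double a = 0.0, b = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) { a += v[i]; b += v[i + 1]; }
    if (i < v.size()) a += v[i];
    return a + b;
}

Kernel pick_faster() {
    std::vector<double> sample(1 << 20, 1.0);
    auto time_it = [&](Kernel k) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = k(sample);                // keep the call alive
        (void)sink;
        return std::chrono::steady_clock::now() - t0;
    };
    return time_it(sum_simple) <= time_it(sum_unrolled) ? sum_simple : sum_unrolled;
}
```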
HPC… lessons not to learn …
• Programmer effort doesn’t matter
• Hardware efficiency matters
• Low-level programming tools are acceptable
• Low-level control is an acceptable path to performance
• Horizontal locality and control of communication is the critical performance concern
Wouldn’t it be nice if? (Challenges)
• Programmers could write parallel code with minimal effort
– Perhaps only a simple, limited model
– Serve “fast prototyping” programmers (but all tiers of programmers); a short time to working, parallel code…
– Reasonable performance (and “forward scalability”) delivered by the implementation
• Programmers could migrate gracefully from this high-level view to lower levels of specification (see the sketch below)
– To improve performance
– To increase control or provide information to the underlying runtime system
– Parallelism, locality, “hints”, etc.
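A hedged sketch of what graceful migration might look like in plain C++ (the grain-size parameter is an illustrative stand-in for a tuning hint, not a specific Intel API): version 1 only says the work can be split, while version 2 computes the same result but accepts one extra piece of information for performance.

```cpp
// Sketch of "graceful migration" (assumed example): version 1 is the
// high-level form; version 2 is the same computation with a grain-size hint
// handed to the implementation.  Both assume x.size() == y.size().
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Version 1: high level.  Just say the two halves are independent.
double dot_v1(const std::vector<double>& x, const std::vector<double>& y) {
    std::size_t half = x.size() / 2;
    auto lo = std::async(std::launch::async, [&] {
        double s = 0; for (std::size_t i = 0; i < half; ++i) s += x[i] * y[i]; return s;
    });
    double hi = 0;
    for (std::size_t i = half; i < x.size(); ++i) hi += x[i] * y[i];
    return lo.get() + hi;
}

// Version 2: same semantics, plus a grain-size "hint" controlling task size,
// the kind of extra information a programmer might add when tuning.
double dot_v2(const std::vector<double>& x, const std::vector<double>& y,
              std::size_t grain) {
    std::vector<std::future<double>> parts;
    for (std::size_t b = 0; b < x.size(); b += grain) {
        std::size_t e = std::min(x.size(), b + grain);
        parts.push_back(std::async(std::launch::async, [&, b, e] {
            double s = 0; for (std::size_t i = b; i < e; ++i) s += x[i] * y[i]; return s;
        }));
    }
    double total = 0;
    for (auto& p : parts) total += p.get();
    return total;
}
```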
Wouldn’t it be nice if?
• There were an easy, modular way to express locality.
• Computation = Algorithms + Data structures
• Efficient Computation = Algorithms + Data Structures + Locality
• How do we associate these in our programming models? Explicitly, implicitly, or otherwise?
– A + DS + L
– (A + DS) + L
– A + (DS + L)
– (A + L) + DS
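A minimal example of why locality deserves a place in the formula: the same algorithm over the same data structure, where only the traversal order relative to the row-major layout changes.

```cpp
// Same algorithm, same data structure; one traversal matches the row-major
// layout (good locality) and one strides across it (poor locality).
#include <cstddef>
#include <vector>

double sum_row_major(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
    double s = 0;
    for (std::size_t r = 0; r < rows; ++r)          // walks memory contiguously
        for (std::size_t c = 0; c < cols; ++c)
            s += a[r * cols + c];
    return s;
}

double sum_col_major(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
    double s = 0;
    for (std::size_t c = 0; c < cols; ++c)          // strides by `cols` each step
        for (std::size_t r = 0; r < rows; ++r)
            s += a[r * cols + c];
    return s;
}
```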
Wouldn’t it be nice if?
• Parallel program composition preserved correctness of the composed parts
• Interactions when necessary were controllable at the programmer’s semantic level
• These interactions were supported efficiently by the hardware
• The TM vision, but there are still many more approaches to explore
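A miniature version of the composition problem follows, with an atomic block emulated by a single global lock, since standard C++ has no transactional memory; the point is what TM-style composition would buy, not how it would be implemented.

```cpp
// Why composition is hard, in miniature: deposit() and withdraw() are each
// thread-safe, but transfer_broken() composed from them is not atomic as a
// whole; another thread can observe the money "in flight".  An atomic block
// in the TM style would restore composability; here it is only emulated.
#include <mutex>

struct Account {
    std::mutex m;
    long balance = 0;
    void deposit(long v)  { std::lock_guard<std::mutex> g(m); balance += v; }
    void withdraw(long v) { std::lock_guard<std::mutex> g(m); balance -= v; }
};

// Not atomic as a whole: between the two calls, the total appears to change.
void transfer_broken(Account& from, Account& to, long v) {
    from.withdraw(v);
    to.deposit(v);
}

// Emulated "atomic { ... }" block: one global lock stands in for what
// transactional memory would provide without serializing everything.
std::mutex tm_global;
void transfer_atomic(Account& from, Account& to, long v) {
    std::lock_guard<std::mutex> g(tm_global);
    from.withdraw(v);
    to.deposit(v);
}
```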
Wouldn’t it be nice if?
• Parallel data structures could be made correct under parallel program composition
• Interactions when necessary were controllable at the programmer’s semantic level
• These interactions were supported efficiently by the hardware
• The lock-free vision? Or a generalized version for arbitrary user data structures
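A minimal sketch of the lock-free vision for one data structure, a Treiber-style stack whose push and pop use compare-and-swap; popped nodes are deliberately never freed, because safe memory reclamation (and the ABA hazard it prevents) is precisely where generalizing this to arbitrary user data structures gets hard.

```cpp
// Treiber-style lock-free stack: push/pop use compare-and-swap instead of
// locks.  To stay simple and correct, popped nodes are never freed; safe
// reclamation is the hard part this sketch sidesteps.
#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T v) {
        Node* n = new Node{v, head_.load(std::memory_order_relaxed)};
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // on failure, n->next was refreshed with the current head; retry
        }
    }

    bool pop(T& out) {
        Node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
            // on failure, n was refreshed with the current head; retry
        }
        if (!n) return false;
        out = n->value;   // node intentionally leaked (see comment above)
        return true;
    }
};
```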
Education and the Foundation of Computing
Parallelism Implications for Computing Education in 2015 (and NOW!)
• Parallelism is in all computing systems, and should be front and center in education
– An integral part of early introduction and experience with programming
– An integral part of early algorithms and theory
– An integral part of software architecture and engineering
• Observation: No biologist, physicist, chemist, or engineer thinks about the world with a fundamentally sequential foundation.
• The world is parallel! We can design large-scale concurrent systems of staggering complexity.
Are All Applications Parallel?
[Figure: Today vs. the Manycore Era; 100% / ??%; # of applications growing rapidly.]
Summary
• Four Historic Opportunities
– #1: Highly portable parallel software
– #2: Major architecture support for programmability
– #3: High-level, high-performance parallel programming
– #4: Parallelism adds new kinds of capability and value
• Which Parallelism Opportunity?
– All kinds of clients need parallelism; how do we reach the largest community of programmers?
• Some ambitious but reasonable “Challenges”
– Programmability, access to performance, portable locality
Wanted: Breakthroughs in Parallel Programming
Questions?
Copyright © Intel Corporation. *Other names and brands may be claimed as the property of others. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, or used for any public purpose under any circumstance, without the express written consent and permission of Intel Corporation.