04/19/23 \Seminar\Spain-00-01 1
From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model
Guang R. Gao
Computer Architecture & Parallel Systems Laboratory (CAPSL)
University of Delaware
Outline
• Introduction
• The EARTH Execution and Architecture Model
• The EARTH Programming Model and Threaded-C
• Application Studies and Performance Evaluation
• Related Work and Conclusions
Main Challenges:
Scalable High-Performance Parallel Systems for both Class A and Class B Applications
Challenges: The “Killer Latency Problem”
Latency due to:
- Communication
- Synchronization
- Task spawning
- Load balancing
[Figure: two nodes, each labeled P, C, M, and NI, connected through a Network]
SP2 is hard enough; PC clusters are much worse!
Meeting High-End Application Challenges:
• Observation I: Many such applications have “Bad Latencies” demanding good support of adaptive fine-grain parallelism
[Petaflop-2 Conference, 99-2]
Here Comes the Surprise!
[Theobald’s Ph.D. thesis, May 1999]
Observation II: It is not necessarily too hard to “generate” and “program” fine-grain threads!
However, it may be hard to statically group them into coarse-grain threads!
A Base Adaptive Fine-Grain Multithreaded Execution Model
C1 (Abundance): a very large pool of threads
C2 (Ultra-lightweight): threads can be spawned as easily and as quickly as possible
C3 (Mobility): threads can be adaptively migrated as easily and as quickly as possible
Motivation of The EARTH Project
How to exploit fine-grain multithreading on a parallel system, given off-the-shelf microprocessors?
Two Types of Fine-Grain Threads
• A parallel function invocation
• Strand/Fiber - a function body can be divided into several “strands/fibers”
Threads and Fibers
• A fiber becomes enabled once it has received all of its input signals
• An enabled fiber may be selected for execution when the required hardware resource has been allocated
• After a fiber finishes execution, a signal is sent to each destination fiber to update the corresponding sync slot
[Figure: fibers within a frame, a parallel function invocation, a procedure call, and SYNC ops]
Note: The role of strand!
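The enabling rule on this slide can be sketched in a few lines. This is a minimal Python model, not Threaded-C and not the EARTH runtime: the `Fiber` class, its `pending` counter (standing in for a sync slot), and the `run` function are all hypothetical names introduced for illustration.

```python
from collections import deque


class Fiber:
    """A fiber with one sync slot that counts the signals still missing."""

    def __init__(self, name, inputs_needed, destinations=()):
        self.name = name
        self.pending = inputs_needed      # sync slot: signals not yet received
        self.destinations = list(destinations)


def run(fibers):
    """Fire fibers in dependence order and return the names in firing order."""
    ready = deque(f for f in fibers if f.pending == 0)  # enabled from the start
    fired = []
    while ready:
        fiber = ready.popleft()           # select an enabled fiber for execution
        fired.append(fiber.name)          # "execute" it (body omitted here)
        for dest in fiber.destinations:   # signal every destination fiber
            dest.pending -= 1             # update its sync slot
            if dest.pending == 0:         # all input signals have arrived
                ready.append(dest)
    return fired


# Hypothetical example: c needs signals from both a and b before it can run.
c = Fiber("c", inputs_needed=2)
a = Fiber("a", inputs_needed=0, destinations=[c])
b = Fiber("b", inputs_needed=0, destinations=[c])
order = run([a, b, c])
```

In this sketch a single queue plays the role of the "required hardware resource"; a real system would hand enabled fibers to whichever execution unit is free.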
The Execution Model of Fibers
• Dependence-Driven firing rule for fibers
• A fiber is atomic and ultra-lightweight
• Relation with dataflow model (Dennis72)
[Figure: fibers annotated with numeric counts; labels: Fibers, Signal, Token]
The Threaded C Language
• Threaded C = ANSI C + extensions for multithreading
• Extensions include:
– Threaded functions
– Threaded synchronization
– Support for global addresses
– Data transfer primitives
• Threaded C is:
– The “instruction set” of the EARTH processor
– A target language for high-level compilers
[Figure: compilation flow with labels Users, C, FORTRAN, High-Level Language Translation, Threaded C, Threaded C Compiler, EARTH Platforms]
An Evolutionary Path for EARTH
[Figure: evolutionary path of EARTH implementations, with labels MANNA-dual/spn, SU-ext, SU-int, CPU/SU node configurations, and the SEMi Simulation Platform (Theobald99); targets: parallel machines, PC clusters, ...]
Platforms for EARTH
• MANNA:
– MANNA is an architecture testbed from GMD
– Benchmarking platform for fine-grain multithreading
• EARTH-SP2
• EARTH-Beowulf (Linux based)
• EARTH-SUN/SMP/Cluster
Unique Advantages of EARTH-MANNA Platform
• We can push OS completely out of the way!
• We can design the EARTH runtime system from a very low level up
• The invaluable experience/lessons learned from EARTH-MANNA are essential for the successful migration of the EARTH model to other platforms (e.g. the IBM SP-2 story, etc.)
Summary of Recent Experimental Results (Kevin99)
• Impressive speedup and scalability (scalable even for fine-grain parallel programs with high overhead, e.g. fib)
• Enhanced Programmability (N-queen-p example)
• Broad applicability
Experiments
• Example 1 (assorted benchmarks): fib, nqueen, paraffin, tomcatv, matrix-multiply, etc.
• Example 2: Adaptive unstructured grids
• Example 3: Wavelet computation
Performance of N-Queens(12) [Theobald99]
• 117.8-fold speedup on a 120-node simulation!
• 1,637,099 tokens are generated!
• On average, 30+ tokens are maintained per processor
• N-Queens is a useful HTMT benchmark after all! (Phil Murkey)
Coarse-Grain Applications
• A 116-fold speedup on a 120-node machine is achieved for Cannon’s matrix multiply algorithm!
• Deep software systolic-style implementation to exploit parallelism
• Fine-grain mechanisms
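To put the two reported speedups (117.8 on 120 nodes for N-Queens(12), 116 on 120 nodes for Cannon’s algorithm) on a common scale, they can be read as parallel efficiency, i.e. speedup divided by node count. The helper name `parallel_efficiency` below is an assumption for illustration; only the input numbers come from the slides.

```python
def parallel_efficiency(speedup, nodes):
    """Return speedup divided by node count (1.0 means perfect linear scaling)."""
    return speedup / nodes


# Input figures are taken from the slides; everything else is derived.
nqueens_eff = parallel_efficiency(117.8, 120)   # N-Queens(12), 120-node simulation
cannon_eff = parallel_efficiency(116, 120)      # Cannon's matrix multiply, 120 nodes

print(f"N-Queens(12): {nqueens_eff:.1%} efficiency")
print(f"Cannon:       {cannon_eff:.1%} efficiency")
```

Both values come out above 96%, so each benchmark stays close to linear scaling at this machine size.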
Example 2 --- Adaptive Unstructured Mesh Computation
Observation
• The critical part of the framework is mesh adaptation and load balancing
• The partitioning problem is in better shape; the remapping problem remains open
[Flowchart: Initialization, Partitioning, Mapping, Solution/Execution, and Finalization stages, with decision points Adapt?, Balanced?, and Expensive? (Y/N branches) leading to Repartitioning and Remapping]
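The control flow named by the flowchart labels on this slide can be sketched as a loop. The branch wiring is lost in this transcript, so the code below is one plausible reading, not the authors' framework: remapping is skipped when it is judged too expensive. Every callback name, the `cost_limit` parameter, and the toy load numbers are hypothetical; the one-time Initialization, Partitioning, Mapping, and Finalization stages are left out.

```python
def adaptive_loop(steps, solve, needs_adapt, adapt, is_balanced,
                  repartition, remap_cost, remap, cost_limit):
    """Run `steps` solver iterations with optional adaptation and rebalancing.

    Every callback is a hypothetical placeholder supplied by the caller.
    Returns a log of the actions taken, useful for tracing the control flow.
    """
    log = []
    for _ in range(steps):
        solve()                                     # Solution / Execution
        log.append("solve")
        if not needs_adapt():                       # Adapt? -> N
            continue
        adapt()                                     # refine or coarsen the mesh
        log.append("adapt")
        if is_balanced():                           # Balanced? -> Y
            continue
        new_partition = repartition()               # Repartitioning
        log.append("repartition")
        if remap_cost(new_partition) > cost_limit:  # Expensive? -> Y
            log.append("skip-remap")                # keep the old mapping
            continue
        remap(new_partition)                        # Remapping
        log.append("remap")
    return log


# Toy usage: per-node loads stand in for a mesh; all numbers are made up.
loads = [4, 4, 4, 4]


def toy_adapt():
    loads[0] += 8                                   # new work lands on node 0


def toy_balanced():
    return max(loads) - min(loads) <= 1


def toy_repartition():
    total, n = sum(loads), len(loads)
    return [total // n + (1 if i < total % n else 0) for i in range(n)]


def toy_remap_cost(new):
    return sum(abs(a - b) for a, b in zip(loads, new)) // 2   # work units moved


def toy_remap(new):
    loads[:] = new


trace = adaptive_loop(1, lambda: None, lambda: True, toy_adapt, toy_balanced,
                      toy_repartition, toy_remap_cost, toy_remap, cost_limit=10)
```

Separating repartitioning (computing a new partition) from remapping (moving data to realize it) mirrors the slide's observation that the two problems are at different levels of maturity.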
[Figure: The Mapping After a Few Iterations (Node 0, Node 1, ..., Node N)]
[Figure: The Initial Picture (Node 0, Node 1, ..., Node N)]
Initial Results
• About 3000 lines of Threaded-C code
• Migration >= 70% (good)
• Unbiased variance = 3-5% (very good)
• A good speedup on EARTH-MANNA has been observed
Example 3 --- Adaptive Wavelet Transformation
• The load evolution pattern changes dynamically, but is statically predictable
• Need adaptive load redistribution/grouping
• Mapping onto EARTH [IPPS99]
HTMT Architecture
[Figure: HTMT architecture showing SPELLs, an array of SPIM modules, and an array of DPIM modules]
Extensions to Current EARTH Model
• Percolation Model
• Memory Model: Location Consistency
• Load balancing and percolation
HTMT Percolation Model
[Figure: percolation between SRAM-PIM and the CRYOGENIC AREA. Labels: Run Time System; Parcel Invocation & Termination; Parcel Assembly & Disassembly; Parcel Dispatcher & Dispenser; I-Pool, T-Pool, A-Pool, D-Pool; DMA; Split-Phase Synchronization to SRAM; start, done; CRAM; SCP Execution Unit]
The System Software Architecture
Note:
• The Threaded-C compiler has part of its functions embedded in the RTS
• The RTS will work with the architecture and OS layers to provide the PXM interface
• The performance models are defined across all layers
[Figure: layered software stack. Layers: Applications; high-level languages (e.g. parallel C, etc.); high-level language compiler; HTMT-C/Threaded-C; Threaded-C Compiler and Tool Set; RTS; OS; Hardware Architectures. Interfaces: Threaded-C Compiler - RTS interface; RTS-OS interface; RTS-hardware architecture interface; PXM Interface. Side label: Performance Models]
Evolution of Multithreaded Architecture Models
Non-dataflow based:
• CDC 6600 (1964)
• MASA (Halstead, 1986)
• HEP (B. Smith, 1978)
• Cosmic Cube (Seitz, 1985)
• J-Machine (Dally, 1988-93)
• M-Machine (Dally, 1994-98)
Dataflow model inspired:
• Static Dataflow (Dennis, 1972, MIT)
• MIT TTDA (Arvind, 1980)
• Manchester (Gurd & Watson, 1982)
• *T/Start-NG (MIT/Motorola, 1991-)
• SIGMA-I (Shimada, 1988)
• Arg-Fetching Dataflow (Dennis, Gao, 1987-88)
• MDFA (Gao, 1989-93)
• MTA (Hum, Theobald, Gao, 94)
• Monsoon (Papadopoulos & Culler, 1988)
• P-RISC (Nikhil & Arvind, 1989)
• EM-5/4/X, RWC-1 (1992-97)
• EARTH (PACT95, ISCA96, Theobald99)
• Iannucci’s (1988-92)
Others: Multiscalar (1994), SMT (1995), etc.
Also shown in the chart:
• Flynn’s Processor (1969)
• CHoPP’77, CHoPP’87
• TAM (Culler, 1990)
• Tera (B. Smith, 1990-)
• Alewife (Agarwal, 1989-96)
• Cilk (Leiserson)
• XMT (Vishkin)
Acknowledgement (Incomplete List)
• Erik Altman
• Haiying Cai
• Nasser Elmasri
• Gerd Heber
• Laurie J. Hendren
• Herbert Hum
• Alberto Jimenez
• Prasad Kakulavarapu
• Cheng Li
• Olivier Maquelin
• Andres Marquez
• Shashank Nemawarkar
• Zach Ruiz
• Sean Ryan
• V.C. Sreedhar
• Xinan Tang
• Kevin Theobald
• Ruppa Thulasiram
• Parimala Thulasiraman
• Xinmin Tian
• Yingchun Zhu
• J. Nelson Amaral
NSERC, FCAR, DARPA, NSA, NSF, NASA