Polar Opposites: Next Generation Languages & Architectures
![Page 1: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/1.jpg)
Polar Opposites: Next Generation Languages & Architectures
Kathryn S McKinley
The University of Texas at Austin
![Page 2: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/2.jpg)
Collaborators
• Faculty
  – Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss
• Graduate Students
  – Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU)
• Research Staff
  – Jim Burrill, Sam Guyer, Bill Yoder
![Page 3: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/3.jpg)
Computing in the Twenty-First Century
• New and changing architectures
  – Hitting the microprocessor wall
  – TRIPS - an architecture for future technology
• Object-oriented languages
  – Java and C# becoming mainstream
• Key challenges and approaches
  – Memory gap, parallelism
  – Language & runtime implementation efficiency
  – Orchestrating a new software/hardware dance
  – Break down artificial system boundaries
![Page 4: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/4.jpg)
Technology Scaling Hitting the Wall
[Figure: a 20 mm chip edge at the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes]
Analytically … Qualitatively …
Either way … partitioning for on-chip communication is key
![Page 5: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/5.jpg)
End of the Road for Out-of-Order SuperScalars
• Clock ride is over
  – Wire and pipeline limits
  – Quadratic out-of-order issue logic
  – Power, a first order constraint
• Major vendors ending processor lines
• Problems for any architectural solution
  – ILP - instruction level parallelism
  – Memory latency
![Page 6: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/6.jpg)
Where are Programming Languages?
• High Productivity Languages
  – Java, C#, Matlab, S, Python, Perl
• High Performance Languages
  – C/C++, Fortran
• Why not both in one?
  – Interpretation/JIT vs compilation
  – Language representation
    • Pointers, arrays, frequent method calls, etc.
  – Automatic memory management costs
Obscure ILP and memory behavior
![Page 7: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/7.jpg)
Outline
• TRIPS
  – Next generation tiled EDGE architecture
  – ILP compilation model
• Memory system performance
  – Garbage collection influence
  – The GC advantage
    • Locality, locality, locality
    • Online adaptive copying
  – Cooperative software/hardware caching
![Page 8: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/8.jpg)
TRIPS
• Project Goals
  – Fast clock & high ILP in future technologies
  – Architecture sustains 1 TRIPS in 35 nm technology
  – Cost-performance scalability
  – Find the right hardware/software balance
    • New balance reduces hardware complexity & power
  – New compiler responsibilities & challenges
• Hardware/Software Prototype
  – Proof-of-concept of scalability and configurability
  – Technology transfer
![Page 9: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/9.jpg)
TRIPS Prototype Architecture
![Page 10: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/10.jpg)
Execution Substrate
[Figure: execution substrate. Labeled components: global control and branch predictor, I-cache H, I-caches 0 to 3, D-cache/LSQ 0 to 3, register banks (columns 0 to 3), and an execution array of execution nodes]
Interconnect topology & latency exposed to compiler scheduler
![Page 11: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/11.jpg)
Large Instruction Window
[Figure: an execution node with control, router, ALU, and instruction buffer entries of the form "opcode src1 src2"]
Out-of-order instruction buffers form a logical “z-dimension” in each node
4 logical frames of 4 x 4 instructions
• Instruction buffers add depth to execution array
  – 2D array of ALUs; 3D volume of instructions
• Entire 3D volume exposed to compiler
![Page 12: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/12.jpg)
Execution Model
• SPDI - static placement, dynamic issue
  – Dataflow within a block
  – Sequential between blocks
• TRIPS compiler challenges
  – Create large blocks of instructions
    • Single entry, multiple exit, predication
  – Schedule blocks of instructions on a tile
  – Resource limitations
    • Registers, memory operations
![Page 13: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/13.jpg)
Block Execution Model
• Program execution
  – Fetch and map block to TRIPS grid
  – Execute block, produce result(s)
  – Commit results
  – Repeat
• Block dataflow execution
  – Each cycle, execute a ready instruction at every node
  – Single read of registers and memory locations
  – Single write of registers and memory locations
  – Update the PC to successor block
• TRIPS core may speculatively execute multiple blocks (as well as instructions)
• TRIPS uses branch prediction and register renaming between blocks, but not within a block
[Figure: control-flow graph from start to end through blocks A, B, C, D, and E]
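The block dataflow rule ("each cycle, execute a ready instruction at every node") can be illustrated with a toy interpreter. This is a minimal sketch, not the TRIPS ISA: the instruction tuple format, the `run_block` name, the `OPS` table, and the model of one issue per node per cycle with results visible at the end of the cycle are all assumptions made for illustration.

```python
import operator

# Hypothetical instruction format: (node, op, sources, dest)
#   node    - grid coordinate the compiler statically placed the instruction on
#   sources - names of the two values read (block inputs or earlier results)
#   dest    - name of the value produced
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_block(instructions, inputs):
    """Execute one block in dataflow order; return (values, cycles)."""
    values = dict(inputs)      # block inputs are read once, up front
    pending = list(instructions)
    cycles = 0
    while pending:
        busy_nodes = set()
        ready = []
        for inst in pending:
            node, op, srcs, dest = inst
            # Dynamic issue: an instruction fires once all operands have
            # arrived, and each node issues at most one instruction per cycle.
            if node not in busy_nodes and all(s in values for s in srcs):
                busy_nodes.add(node)
                ready.append(inst)
        if not ready:
            raise ValueError("block contains an instruction that can never fire")
        # Results become visible to consumers at the end of the cycle.
        results = {dest: OPS[op](*(values[s] for s in srcs))
                   for _, op, srcs, dest in ready}
        values.update(results)
        pending = [inst for inst in pending if inst not in ready]
        cycles += 1
    return values, cycles


block = [
    ((0, 0), "add", ("a", "b"), "t1"),
    ((0, 1), "mul", ("a", "c"), "t2"),   # independent of t1, placed on another node
    ((1, 0), "sub", ("t1", "t2"), "r"),  # waits for both producers
]
values, cycles = run_block(block, {"a": 2, "b": 3, "c": 4})
```

In this example the two independent instructions fire in the same cycle because they sit on different nodes; placing them on the same node would serialize them and add a cycle.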
![Page 14: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/14.jpg)
Just Right Division of Labor
• TRIPS architecture
  – Eliminates short-term temporaries
  – Out-of-order execution at every node in grid
  – Exploits ILP, hides unpredictable latencies
    • without superscalar quadratic hardware
    • without VLIW guarantees of completion time
• Scale compiler - generate ILP
  – Large hyperblocks - predicate, unroll, inline, etc.
  – Schedule hyperblocks
    • Map independent instructions to different nodes
    • Map communicating instructions to same or close nodes
  – Let hardware deal with unpredictable latencies (loads)
Exploits Hardware and Compiler Strengths
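The two scheduling goals above (spread independent instructions, keep communicating instructions close) can be sketched as a greedy placement pass. This is an illustrative heuristic under stated assumptions, not the Scale compiler's scheduler: the `place` function, the cost formula (Manhattan hops to producers plus current node load), and the default 4 x 4 grid with 4 slots per node are all hypothetical choices.

```python
from itertools import product


def place(instructions, rows=4, cols=4, slots_per_node=4):
    """Greedy placement sketch.

    instructions: list of (name, [producer names]) in dependence order.
    Returns {name: (row, col)}.
    """
    load = {node: 0 for node in product(range(rows), range(cols))}
    placement = {}
    for name, producers in instructions:
        def cost(node):
            # Communication cost: Manhattan hops from each producer's node.
            hops = sum(abs(node[0] - placement[p][0]) +
                       abs(node[1] - placement[p][1]) for p in producers)
            # Contention cost: instructions already competing for this node.
            # The node itself breaks ties deterministically.
            return (hops + load[node], node)

        candidates = [n for n in load if load[n] < slots_per_node]
        best = min(candidates, key=cost)
        placement[name] = best
        load[best] += 1
    return placement


placement = place([("a", []), ("b", []), ("c", ["a", "b"])])
```

Here `a` and `b` have no producers, so the load term pushes them onto different nodes; `c` consumes both and lands on a node adjacent to (or shared with) its producers.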
![Page 15: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/15.jpg)
High Productivity Programming Languages
• Interpretation/JIT vs compilation
• Language representation
  – Pointers, arrays, frequent method calls, etc.
• Automatic memory management costs
MMTk in IBM Jikes RVM – ICSE’04, SIGMETRICS’04
  – Memory Management Toolkit for Java
  – High Performance, Extensible, Portable
  – Mark-Sweep, Copying SemiSpace, Reference Counting
  – Generational collection, Beltway, etc.
![Page 16: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/16.jpg)
Allocation Choices
• Bump-Pointer
  – Fast (increment & bounds check)
  – Can't incrementally free & reuse: must free en masse
• Free-List
  – Relatively slow (consult list for fit)
  – Can incrementally free & reuse cells
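The trade-off between the two allocators can be sketched in a few lines. This is a minimal model over integer addresses, not MMTk's implementation: the class names, the first-fit policy, and the lack of coalescing in the free list are assumptions made to keep the example short.

```python
class BumpPointer:
    """Allocate by advancing a cursor; reclaim only by resetting it."""

    def __init__(self, size):
        self.cursor = 0
        self.limit = size

    def alloc(self, nbytes):
        # The whole fast path: a bounds check and an increment.
        if self.cursor + nbytes > self.limit:
            return None
        addr = self.cursor
        self.cursor += nbytes
        return addr

    def free_all(self):
        # Individual cells cannot be reused; space is reclaimed en masse.
        self.cursor = 0


class FreeList:
    """Allocate by searching a list of holes; cells are freed one at a time."""

    def __init__(self, size):
        self.free = [(0, size)]  # (address, length) holes, no coalescing

    def alloc(self, nbytes):
        # First fit: consult the list for a hole that is large enough.
        for i, (addr, length) in enumerate(self.free):
            if length >= nbytes:
                if length == nbytes:
                    del self.free[i]
                else:
                    self.free[i] = (addr + nbytes, length - nbytes)
                return addr
        return None

    def free_cell(self, addr, nbytes):
        # A freed cell becomes reusable immediately.
        self.free.append((addr, nbytes))
```

The bump pointer never searches, which is why its fast path is shorter, but a full region stays full until everything in it is released. The free list pays for the search and gets incremental reuse in return.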
![Page 17: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/17.jpg)
Allocation Choices
• Bump pointer
  – ~70 bytes IA32 instructions, 726MB/s
• Free list
  – ~140 bytes IA32 instructions, 654MB/s
• Bump pointer 11% faster in tight loop
  – < 1% in practical setting
  – No significant difference (?)
• Second order effects?
  – Locality??
  – Collection mechanism??
![Page 18: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/18.jpg)
Implications for Locality
• Compare SS & MS mutator
  – Mutator time
  – Mutator memory performance: L1, L2 & TLB
![Page 19: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/19.jpg)
javac
[Figure: four panels for javac (mutator time, L1 misses, L2 misses, TLB misses). Each plots the normalized metric against normalized heap size (1 to 6) for MarkSweep and SemiSpace.]
![Page 20: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/20.jpg)
pseudojbb
[Figure: four panels for pseudojbb (mutator time, L1 misses, L2 misses, TLB misses). Each plots the normalized metric against normalized heap size (1 to 6) for MarkSweep and SemiSpace.]
![Page 21: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/21.jpg)
db
[Figure: four panels for db (mutator time, L1 misses, L2 misses, TLB misses). Each plots the normalized metric against normalized heap size (1 to 6) for MarkSweep and SemiSpace.]
![Page 22: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/22.jpg)
Locality & Architecture
![Page 23: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/23.jpg)
MS/SS Crossover 1.6GHz PPC
[Figure: normalized total time (1 to 3) against heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on the 1.6GHz PPC]
![Page 24: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/24.jpg)
MS/SS Crossover 1.9GHz AMD
[Figure: normalized total time (1 to 3) against heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on the 1.6GHz PPC and 1.9GHz AMD]
![Page 25: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/25.jpg)
MS/SS Crossover 2.6GHz P4
[Figure: normalized total time (1 to 3) against heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on the 1.6GHz PPC, 1.9GHz AMD, and 2.6GHz P4]
![Page 26: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/26.jpg)
MS/SS Crossover 3.2GHz P4
[Figure: normalized total time (1 to 3) against heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on the 1.6GHz PPC, 1.9GHz AMD, 2.6GHz P4, and 3.2GHz P4]
![Page 27: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/27.jpg)
MS/SS Crossover
[Figure: normalized total time (1 to 3) against heap size relative to minimum (1 to 6) for SemiSpace and MarkSweep on the 1.6GHz PPC, 1.9GHz AMD, 2.6GHz P4, and 3.2GHz P4. The crossover points are marked for 1.6GHz, 1.9GHz, 2.6GHz, and 3.2GHz, with annotations "locality" and "space".]
![Page 28: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/28.jpg)
Locality in Memory Management
• Explicit memory management on its way out
  – Key GC vs Explicit MM insights 20 yrs old
  – Technology has changed and is still changing
• Generational and Beltway Collectors
  – Significant collection time benefits over full heap collectors
  – Collect young objects
  – Infrequently collect old space
  – Copying nursery attains similar locality effects as full heap
![Page 29: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/29.jpg)
Where are the Misses?
_209_db
[Figure: total accesses (in millions, 0 to 2000), split into hits and misses, for the Boot Image, Immortal, LOS, Older Gen, and Nursery spaces under a generational copying collector]
![Page 30: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/30.jpg)
Copy Order
• Static copy orders
  – Breadth first - Cheney scan
  – Depth first, hierarchical
  – Problem: one size does not fit all
• Static profiling per class
  – Inconsistent with JIT
• Object sampling
  – Too expensive in our experience
• OOR - Online Object Reordering
  – OOPSLA’04
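To make the difference between the static copy orders concrete, the sketch below computes the order in which a copying collector would lay out objects under breadth-first (Cheney-style) and depth-first traversal. It models only the ordering, not the copying itself; the function names and the dictionary representation of the object graph are assumptions for illustration.

```python
from collections import deque


def breadth_first_order(roots, fields):
    """Cheney-style order: objects are laid out in the order they are
    discovered, and already-copied objects are scanned first-in, first-out."""
    order, seen, queue = [], set(), deque()
    for root in roots:
        if root not in seen:
            seen.add(root)
            order.append(root)
            queue.append(root)
    while queue:
        obj = queue.popleft()
        for child in fields.get(obj, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def depth_first_order(roots, fields):
    """Depth-first order: each object is followed by its first child's subtree."""
    order, seen = [], set()
    stack = list(reversed(roots))
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        # Push children in reverse so the first field is visited first.
        stack.extend(reversed(fields.get(obj, [])))
    return order


# Hypothetical object graph: a -> b, c; b -> d, e; c -> f
graph = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"]}
bfs = breadth_first_order(["a"], graph)  # a, b, c, d, e, f
dfs = depth_first_order(["a"], graph)    # a, b, d, e, c, f
```

Breadth-first places siblings together (`b` next to `c`), while depth-first places a parent next to its first child's subtree (`b` next to `d`). Which one matches the program's access pattern depends on the program, which is the "one size does not fit all" problem.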
![Page 31: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/31.jpg)
OOR Overview
• Records object accesses in each method (excludes cold basic blocks)
• Finds hot methods by dynamic sampling
• Reorders objects with hot fields in higher generation during GC
• Copies hot objects into separate region
![Page 32: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/32.jpg)
Static Analysis Example
Method Foo {
  Class A a;
  try {
    … = a.b;
    …
  } catch (Exception e) {
    … a.c
  }
}
Compiler
  – Hot BB: collect access info
  – Cold BB: ignore
Access List:
  1. A.b
  2. ….
  ….
![Page 33: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/33.jpg)
Adaptive Sampling
Method Foo {
  Class A a;
  try {
    … = a.b;
    …
  } catch (Exception e) {
    … a.c
  }
}
Adaptive Sampling: Foo is hot
Foo Accesses:
  1. A.b
  2. ….
  ….
A.b is hot
[Figure: objects A and B, with fields b, ….. and c]
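The step from "Foo is hot" to "A.b is hot" can be sketched as a join between the compiler's per-method access lists and the sampler's method counts. This is a simplified model, not the Jikes RVM adaptive system: the `hot_fields` function, the fixed sample threshold, and the `Bar` method with field `C.d` are hypothetical.

```python
from collections import Counter


def hot_fields(access_lists, samples, threshold):
    """Return the fields accessed by hot methods.

    access_lists: {method: [field, ...]} recorded by the compiler from hot
                  basic blocks only (cold blocks such as catch handlers are
                  ignored, so A.c never appears for Foo).
    samples:      method names observed by the sampler, one per sample.
    threshold:    samples needed before a method counts as hot (assumed).
    """
    counts = Counter(samples)
    hot = set()
    for method, count in counts.items():
        if count >= threshold:
            hot.update(access_lists.get(method, []))
    return hot


access_lists = {"Foo": ["A.b"], "Bar": ["C.d"]}
samples = ["Foo", "Foo", "Bar", "Foo"]
result = hot_fields(access_lists, samples, threshold=3)
```

Because the access lists are computed once at compile time, the run-time cost is limited to the method sampling the adaptive system already performs.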
![Page 34: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/34.jpg)
Advice Directed Reordering
• Example
  – Assume (1,4), (4,7) and (2,6) are hot field accesses
  – Order: 1,4,7,2,6 : 3,5
[Figure: object graph over objects 1 to 7]
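One way to obtain the order in the example is to give referents of hot fields priority over all other pending objects during the copy. The sketch below is an assumption-laden model rather than the OOR collector itself: the edge set of the graph is hypothetical (the slide's figure is not recoverable here; only the hot pairs come from the example), and the two-queue policy is one plausible reading of "advice directed".

```python
from collections import deque


def advice_directed_order(roots, fields, hot_edges):
    """Copy order in which referents of hot fields are processed before
    any referent reached only through a cold field."""
    order, seen = [], set()
    hot, cold = deque(), deque(roots)
    while hot or cold:
        obj = hot.popleft() if hot else cold.popleft()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        for child in fields.get(obj, []):
            if child in seen:
                continue
            # Advice lookup: is (parent, child) a hot field access?
            target = hot if (obj, child) in hot_edges else cold
            target.append(child)
    return order


# Hypothetical graph: 1 -> 2, 3, 4; 4 -> 5, 7; 2 -> 6
fields = {1: [2, 3, 4], 4: [5, 7], 2: [6]}
hot_edges = {(1, 4), (4, 7), (2, 6)}
order = advice_directed_order([1], fields, hot_edges)
```

With this graph the result is 1, 4, 7, 2, 6, 3, 5: each hot pair ends up adjacent in the copy order, and the objects with no hot access (3 and 5) trail at the end.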
![Page 35: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/35.jpg)
OOR System Overview
[Figure: OOR system overview. Source code passes through the baseline compiler to executing code. Adaptive sampling identifies hot methods for the optimizing compiler, which adds entries to the access info database and registers hot field accesses. The GC looks up this advice when copying objects, which affects locality. Legend: OOR addition, Jikes RVM, input/output.]
![Page 36: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/36.jpg)
Cost of OOR
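| Benchmark | Default | OOR | Difference |
|---|---|---|---|
| jess | 4.39 | 4.43 | 0.84% |
| jack | 5.79 | 5.82 | 0.57% |
| raytrace | 4.63 | 4.61 | -0.59% |
| mtrt | 4.95 | 4.99 | 0.70% |
| javac | 12.83 | 12.70 | -1.05% |
| compress | 8.56 | 8.54 | 0.20% |
| pseudojbb | 13.39 | 13.43 | 0.36% |
| db | 18.88 | 18.88 | -0.03% |
| antlr | 0.94 | 0.91 | -2.90% |
| gcold | 1.21 | 1.23 | 1.49% |
| hsqldb | 160.56 | 158.46 | -1.30% |
| ipsixql | 41.62 | 42.43 | 1.93% |
| jython | 37.71 | 37.16 | -1.44% |
| ps-fun | 129.24 | 128.04 | -1.03% |
| Mean | | | -0.19% |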
![Page 37: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/37.jpg)
Performance db
![Page 38: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/38.jpg)
Performance jython
![Page 39: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/39.jpg)
Performance javac
![Page 40: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/40.jpg)
Software is not enough
Hardware is not enough
• Problem: inefficient use of cache
• Hardware limitations: set associativity, cannot predict the future
• Cooperative Software/Hardware Caching
  – Combines high level compiler analysis with dynamic miss behavior
• Lightweight ISA support conveys compiler’s global view to hardware
  – Compiler-guided cache replacement (evict-me)
  – Compiler-guided region prefetching
  – ISCA’03, PACT’02
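The idea behind compiler-guided replacement can be sketched with a single cache set in which each line carries an evict-me bit. This is a toy model under stated assumptions, not the mechanism as published: the `EvictMeSet` class, the LRU fallback policy, and the way the hint is passed on each access are all hypothetical.

```python
from collections import OrderedDict


class EvictMeSet:
    """One cache set with LRU replacement plus a per-line evict-me bit."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> evict_me flag, oldest first

    def access(self, tag, evict_me=False):
        """Touch a line; return True on a hit. `evict_me` is the compiler's
        hint that the line will not be reused soon."""
        hit = tag in self.lines
        if hit:
            self.lines.pop(tag)
        elif len(self.lines) == self.ways:
            # Prefer a line the compiler marked; otherwise fall back to LRU.
            lru = next(iter(self.lines))
            victim = next((t for t, flag in self.lines.items() if flag), lru)
            del self.lines[victim]
        self.lines[tag] = evict_me  # (re)insert as most recently used
        return hit


hinted = EvictMeSet(ways=2)
hinted.access("x")
hinted.access("stream", evict_me=True)  # compiler: no further reuse expected
hinted.access("y")                      # evicts "stream", not the older "x"
hinted_hit = hinted.access("x")

plain = EvictMeSet(ways=2)
plain.access("x")
plain.access("stream")
plain.access("y")                       # plain LRU evicts "x"
plain_hit = plain.access("x")
```

Hardware alone cannot know that `stream` is dead, so LRU evicts the line that is about to be reused; the hint lets the compiler's global view override that choice.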
![Page 41: Polar Opposites: Next Generation Languages & Architectures](https://reader035.vdocument.in/reader035/viewer/2022062723/56813cb0550346895da65cfe/html5/thumbnails/41.jpg)
Exciting Times
• Dramatic architectural changes
  – Execution tiles
  – Cache & Memory tiles
• Next generation system solutions
  – Moving hardware/software boundaries
  – Online optimizations
  – Key compiler challenges (same old…)
ILP and Cache Memory Hierarchy