Utsunomiya University / November 12, 2004 / Seminar@UW-Madison
Profile-Based Dynamic Optimization Research for Future Computer Systems
Takanobu Baba
Department of Information Science
Utsunomiya University, Japan
http://aquila.is.utsunomiya-u.ac.jp
November 12, 2004
Brief history of 'my' research
• 1970's: The MPG System, a machine-independent efficient microprogram generator
• 1980's: MUNAP, a two-level microprogrammed multiprocessor computer
• 1990's: A-NET, a language-architecture integrated approach for parallel object-oriented computation
A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
• Massively parallel computation on the A-NET multicomputer
• Each node consists of a PE and a router.
• The PE has a language-oriented, typical CISC architecture.
• The programmable router is topology-independent.
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
YAWARA: A Meta-Level Optimizing Computer System
Background
• Moore's Law will be maintained by semiconductor technology.
• How can we utilize the huge number of transistors to speed up program execution?
• Our idea is to use some of the chip area to dynamically and autonomously tune the configuration of an on-chip multiprocessor.
[Figure: base- and meta-level organization. The base-level processor executes instructions and data from memory and produces the results of computation; the meta-level processor receives a profile of control and data and returns the results of optimization to the base level.]
Design considerations
• HW vs. SW reconfiguration → SW reconfiguration
• Static vs. dynamic reconfiguration → both static and dynamic reconfiguration capability
• Homogeneous vs. heterogeneous architecture → unified homogeneous structure
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread

Basic concepts of thread-level reconfiguration
[Figure: the base level runs Computing Threads (CTs) on the application; at the meta level, a Management Thread (MT) coordinates Profiling Threads (PTs), which profile the application, and Optimizing Threads (OTs), which apply optimizations through memory.]
Execution model
[Figure: two execution models. Profiling-centric: a Profiling Thread (PT) collects the profile of a Computing Thread (CT); when the optimization-initiation condition is satisfied, the sleeping Management Thread (MT) wakes up and activates an Optimizing Thread (OT). Computing-centric: the CT collects its own profile, and the MT activates an OT once the optimization-initiation condition is satisfied.]
Change of configurations by meta-level optimization
[Figure: three snapshots of the thread-engine array; as meta-level optimization proceeds, the assignment of MT, PT, OT, and CT roles to the engines changes over time.]
The YAWARA System
• an implementation of the computation model
• the SW system consists of static and dynamic optimization systems
• the HW system includes uniformly structured thread engines (TEs); each TE can execute base- and meta-level threads
The spirit of YAWARA: "A flexible method prevails where a rigid one fails."
Software System
[Figure: source code (C/C++, Java, Fortran, …) passes through the SOS (Static Optimization System), guided by code-analysis info and static feedback of the execution profile, to produce an executable image; the image runs on the Thread Engines (TEs) under the DOS (Dynamic Optimization System), which applies dynamic feedback from the run-time profile and yields the execution results.]
Hardware System
[Figure: an array of Thread Engines (TEs) connected by a network. Each TE contains network-in and network-out ports, a register file, execution units (INT x 4 + FP x 1), I$ and D$, a thread-data cache, a thread-code cache holding threads 0 through N, execution control, a profiling buffer with a profiling controller, and feedback-directed resource control.]
Example application: compress
[Figure: control-flow graphs of the hot region of compress, annotated with a hot loop, a hot path, and the application's phased behavior; hot paths #0 and #1 are mapped onto speculative threads, and execution continues on a path-predictor hit (#0) or switches on a miss (#1).]
• hot loop / hot path detection (PT, OT)
• speculative multithreading profiling (PT)
• speculative multithreading code generation, helper threads generation, path predictor generation (OT)
• speculative multithreading using the path prediction mechanism (CT)
• management thread (MT)
Conclusion: YAWARA
• We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
• We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
• We are now developing the software system and the simulator.
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
YAWARA@PDCS2004
Occurrence ratios of the top-two paths

  benchmark/function    #1 path   #2 path
  compress/compress      54.5%     22.4%
  ijpeg/forward_DCT      48.2%     42.1%
  m88ksim/killtime       97.0%      3.0%
  li/sweep               80.7%     19.3%

The top two paths occupy 80-100% of execution; the remainder goes to other paths.
Two-level path prediction
• Introducing two-level branch prediction:
  – a history register keeps the sequence of #1-path executions (1: #1 path, 0: the other paths)
  – a counter table counts #1-path executions
• The history pattern (e.g., 1101) indexes an entry of the counter table (v0-v15, e.g., v13); given a threshold X, if v13 >= X, predict the #1 path; otherwise predict the #2 path.

Single Path Predictor (SPP)
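The SPP above can be sketched in a few lines. This is a minimal illustration, not the YAWARA implementation: the 4-bit history, 2-bit saturating counters, and the decrement-on-miss update rule are my assumptions; only the history/table indexing and the threshold test come from the slide.

```python
class SinglePathPredictor:
    """Two-level path predictor: a history register indexes a counter table."""

    def __init__(self, history_bits=4, threshold=2, counter_max=3):
        self.history_bits = history_bits
        self.history = 0                        # recent #1-path outcomes
        self.table = [0] * (1 << history_bits)  # counters v0..v15
        self.threshold = threshold              # threshold X on the slide
        self.counter_max = counter_max

    def predict(self):
        # If the counter selected by the current history pattern is at or
        # above the threshold X, predict the #1 path; otherwise the #2 path.
        return 1 if self.table[self.history] >= self.threshold else 2

    def update(self, taken_path):
        took_first = (taken_path == 1)
        v = self.table[self.history]
        # Saturating increment on a #1-path execution, decrement otherwise
        # (update policy assumed).
        self.table[self.history] = min(v + 1, self.counter_max) if took_first \
            else max(v - 1, 0)
        # Shift the outcome (1: #1 path, 0: other paths) into the history.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(took_first)) & mask
```

After a warm-up run of #1-path executions the predictor answers 1; a cold predictor answers 2 because every counter starts below the threshold.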
Another path predictor: Dual Path Predictor (DPP)
• Two history-register/counter-table pairs: one tracks #1-path executions, the other #2-path executions.
• The #1-path history (e.g., 1101) selects a #1-path counter (e.g., v13) and the #2-path history (e.g., 0010) selects a #2-path counter (e.g., v2); if v13 >= v2, predict the #1 path, otherwise predict the #2 path.
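The DPP can be sketched the same way. Again the counter widths and the update policy are assumptions; only the per-path history/table pairing and the v13 >= v2 comparison come from the slide.

```python
class DualPathPredictor:
    """DPP: one (history, counter table) pair per path; the larger counter wins."""

    def __init__(self, history_bits=4, counter_max=3):
        self.mask = (1 << history_bits) - 1
        # One history register and counter table for each of the two paths.
        self.history = {1: 0, 2: 0}
        self.table = {1: [0] * (1 << history_bits),
                      2: [0] * (1 << history_bits)}
        self.counter_max = counter_max

    def predict(self):
        # Compare the #1-path counter against the #2-path counter;
        # ties go to the #1 path, as in "if v13 >= v2 predict #1".
        c1 = self.table[1][self.history[1]]
        c2 = self.table[2][self.history[2]]
        return 1 if c1 >= c2 else 2

    def update(self, taken_path):
        for path in (1, 2):
            hit = int(taken_path == path)
            v = self.table[path][self.history[path]]
            # Saturating update per path (policy assumed).
            self.table[path][self.history[path]] = \
                min(v + 1, self.counter_max) if hit else max(v - 1, 0)
            self.history[path] = ((self.history[path] << 1) | hit) & self.mask
```

A cold DPP predicts the #1 path (tie); after a run of #2-path executions the #2-path counter dominates and the prediction flips.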
Single Speculation (SS)
[Figure: a chain of threads speculating on the #1 path; on a prediction hit, speculative execution continues.]
• When a thread fails: abort the succeeding threads, run the recovery process, and execute the non-speculative thread.
• Speculation failure degrades performance.
Double Speculation (DS)
• Even when the first speculation fails, the secondary choice has a high hit probability, because the top-two paths are dominant:

  benchmark/function    #1 path   #2 path   expected #2 hit
  compress/compress      54.5%     22.4%      49.2%
  ijpeg/forward_DCT      48.2%     42.1%      81.3%
  m88ksim/killtime       97.0%      3.0%       100%
  li/sweep               80.7%     19.3%       100%
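The "expected #2 hit" figures follow directly from the occurrence ratios: if the #1 path (probability p1) was not taken, the chance that the #2 path (probability p2) was taken instead is p2 / (1 - p1). A quick check against the slide's numbers (a verification sketch only):

```python
def expected_second_hit(p1, p2):
    """P(#2 path taken | #1 path not taken) = p2 / (1 - p1), in percent."""
    return round(100 * p2 / (1 - p1), 1)

ratios = {  # benchmark/function: (#1 path, #2 path) occurrence ratios
    "compress/compress": (0.545, 0.224),
    "ijpeg/forward_DCT": (0.482, 0.421),
    "m88ksim/killtime":  (0.970, 0.030),
    "li/sweep":          (0.807, 0.193),
}
for name, (p1, p2) in ratios.items():
    print(name, expected_second_hit(p1, p2))
```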
Double Speculation (DS)
• If the first (#1 path) speculation fails, a secondary speculation is issued on the #2 path; if the secondary speculation succeeds, the performance loss is not so large.
[Figure: timeline of #1-path speculative threads; when one fails, the recovery process runs while a secondary #2-path speculation continues alongside further #1-path speculation.]
Evaluation flow
1. hot-path detection (SIMCA)
2. path-history acquisition (SIMCA) → path execution history
3. thread-code generation → thread codes: #1-path speculative thread, #2-path speculative thread, non-speculative thread
4. performance estimator → speculation hit ratio, speed-up ratio
Prediction success ratio
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP on compress and forward_DCT.]
Prediction success ratio (cont.)
[Figure: prediction success ratio (%) versus history length (1-16) for SPP and DPP on killtime and sweep.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1-16) for SS and DS on compress and forward_DCT, with S100 and P1only shown for comparison.]
Speed-up ratio (cont.)
[Figure: speed-up ratio versus history length (1-16) for SS and DS on killtime and sweep, with S100 and P1only shown for comparison.]
Conclusions: Two-Path-Limited Speculative Multithreading
• We proposed a path prediction method, path predictors, and speculation methods for path-based speculative multithreading.
• Preliminary performance estimation results were shown.
Current and future works
• Accurate and detailed evaluation with various applications (SPEC 2000, MediaBench, …)
• Integration into our dynamic optimization framework YAWARA
Current dynamic optimization projects
• Computation-oriented:
  – YAWARA: a meta-level optimizing computer system
  – HAGANE: binary-level multithreading
• Communication-oriented:
  – Spec-All: an aggressive read/write access speculation method for DSM systems
  – Cross-Line: an adaptive router using dynamic information
HAGANE: Binary-Level Multithreading
Background
• Multithread programming is not so easy.
  → automatic multithreading system
• However, source codes are not always available.
  → multithreading at the binary level
Binary Translator & Optimizer System
[Figure: the source binary code is translated by the STO (Static Translator & Optimizer), using analysis info and the execution profile, into statically translated multithreaded binary code; at run time the DTO (Dynamic Translator & Optimizer) translates the process memory image, guided by execution profile info, into dynamically translated multithreaded binary code for the multithread processor.]
Thread Pipelining Model
• Loop iterations are mapped onto threads.
• Each thread i passes through four stages: Continuation, TSAG (Target Store Address Generation), Computation, and Write-back; successive threads i+1 and i+2 overlap their stages in pipeline fashion.
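The stage overlap can be illustrated with a toy schedule. This is a simplification I am assuming for illustration: every stage takes one time step and each thread starts exactly one step after its predecessor (the real model also enforces data-dependence ordering through the TSAG stage).

```python
STAGES = ["Continuation", "TSAG", "Computation", "Write-back"]

def pipeline_schedule(num_threads):
    """Map each time step to the 'thread i: stage' work active at that step,
    assuming a one-step skew between successive threads."""
    schedule = {}
    for i in range(num_threads):
        for s, stage in enumerate(STAGES):
            # Thread i enters stage s at time i + s (pipelined overlap).
            schedule.setdefault(i + s, []).append(f"thread {i}: {stage}")
    return schedule

for t, work in sorted(pipeline_schedule(3).items()):
    print(t, work)
```

With three threads the schedule spans six time steps instead of the twelve a serial execution would take, which is the point of the pipelining.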
Example translation

Source Binary Code:

    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
  $BB1:
    l.s   $f0,0($a0[4])
    l.s   $f2,0($a1[5])
    mul.s $f0,$f0,$f2
    addiu $v1[3],$v1[3],1
    add.s $f4,$f4,$f0
    slti  $v0[2],$v1[3],5000
    addiu $a1[5],$a1[5],4
    addiu $a0[4],$a0[4],4
    bne   $v0[2],$zero[0],$BB1
  $BB2:
    mov.s $f0,$f4
    jr    $ra[31]

Translated Code (stages: Cont. / TSAG / Comp. / W.B.):

    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
    bstr
    slti  $v0[2],$v1[3],5000
    beq   $v0[2],$zero[0],$ST_LL0
    addu  $t0[8],$a0[4],$zero[0]
    addu  $t1[9],$a1[5],$zero[0]
    addi  $v1[3],$v1[3],1
    addi  $a0[4],$a0[4],4
    addi  $a1[5],$a1[5],4
    lfrk
    wtsagd
    addu  $t2[10],$sp[28],$zero[0]
    altsw $t2[10]
    tsagd
    l.s   $f0,0($t0[8])
    l.s   $f2,0($t1[9])
    l.s   $f4,0($t2[10])
    mul.s $f0,$f0,$f2
    add.s $f4,$f4,$f0
    sttsw $t2[10],$f4
  $ST_LL0:
    estr
    mov.s $f0,$f4
    jr    $ra[31]

・ Thread-management instructions (bstr, lfrk, wtsagd, altsw, tsagd, sttsw, estr) and overhead code for multithreading are added.
Superthreaded Architecture
[Figure: multiple Thread Processing Units, each with an execution unit, a communication unit, a memory buffer, and a write-back unit, share an L1 instruction cache and an L1 data cache.]
m88ksim (SPECint95)
[Figure: speedup ratio versus number of thread units (4, 8, 16) for No Unroll and Unroll 4/8/16.]
• poor speedup ratios
• loop unrolling does not affect the performance
• the number of iterations is quite small
ijpeg (SPECint95)
[Figure: speedup ratio versus number of thread units (4, 8, 16) for No Unroll and Unroll 4/8/16.]
• the thread code size is too small to hide the thread-management overhead
• loop unrolling is effective in achieving good speedup ratios
• excessive loop unrolling causes performance degradation
• the number of iterations is not so large
swim (SPECfp95)
[Figure: speedup ratio versus number of thread units (4, 8, 16) for No Unroll and Unroll 4/8/16.]
• good speedup ratios
• loop unrolling is effective in achieving linear speedup
• the number of iterations is large
Conclusion: HAGANE
• We evaluated binary-level multithreading using several SPEC95 benchmark programs.
• The performance evaluation results indicate that:
  – the thread code size should be large enough to improve the performance
  – loop unrolling is effective for small loop bodies
  – excessive loop unrolling degrades performance
A Methodology of Binary-Level Variable Analysis for Multithreading
HAGANE@PDCS2004
Background and Objective
• Usually, loop iterations are interrelated through memory variables, such as induction variables.
• However, it is difficult to analyze this kind of dependency at the binary level.
• A binary-level variable analysis method is therefore strongly required for binary-level multithreading.
Example source:

    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Example Binary Code:

    lw    $a1[5], 16($s8[30])
    lw    $v1[3], 16($s8[30])
    lw    $a0[4], 16($s8[30])
    sll   $v1[3], $v1[3], 0x2
    addu  $v1[3], $v1[3], $a2[6]
    lw    $v0[2], 16($s8[30])
    lw    $v1[3], -4($v1[3])
    addiu $v0[2], $v0[2], 1
    sw    $v0[2], 16($s8[30])
    lw    $v0[2], 16($s8[30])
    sll   $a1[5], $a1[5], 0x1
    sll   $a0[4], $a0[4], 0x2
    sll   $v0[2], $v1[3], 0x1
    addu  $v0[2], $v0[2], $v1[3]
    lw    $v1[3], 16($s8[30])
    addu  $a0[4], $a0[4], $a2[6]
    addu  $a1[5], $a1[5], $v0[2]
    sw    $a1[5], 0($a0[4])
    slt   $v1[3], $v1[3], $a3[7]

[Figure: successive threads j and j+1 each perform i++, load a[i-1] (via -4($v1[3])), and store a[i] (via 0($a0[4])), so the iterations are linked by a memory-carried dependency.]
Binary-Level Variable Analysis
(1) Register values are analyzed using data-flow trees.
(2) When the register values used for memory references are judged to be the same, that memory location is regarded as a virtual register.
(3) Steps (1) and (2) are repeated using the virtual registers.
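Step (2) can be sketched as follows. The representation of a normalized address as a (base-register-version, offset) pair and all names here are illustrative assumptions; the point is only that references with provably equal addresses collapse onto one virtual register.

```python
def assign_virtual_registers(mem_refs):
    """Map memory references whose normalized addresses are judged equal
    to a single virtual register, so later analysis passes can treat the
    memory location like a register.

    mem_refs: list of (instr_index, normalized_address), where the
    normalized address is a hashable form such as ('$29#1', -8).
    """
    vreg_of = {}   # normalized address -> virtual register name
    mapping = []   # (instr_index, virtual register)
    for idx, addr in mem_refs:
        if addr not in vreg_of:
            vreg_of[addr] = f"$V{len(vreg_of) + 1}"
        mapping.append((idx, vreg_of[addr]))
    return mapping

# Two references through 0($29#1) share a location; 4($29#1) does not.
refs = [(1, ("$29#1", 0)), (4, ("$29#1", 0)), (7, ("$29#1", 4))]
print(assign_virtual_registers(refs))
```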
Construction of Dataflow Tree

    addiu $29#1, $29#0, -8
    sw    $0, 0($29#1)
    addu  $5#1, $0, $0
    lw    $2#1, 0($29#1)
    addu  $3#1, $5#1, $4#0
    addiu $5#2, $5#1, 1
    addu  $2#2, $2#1, $3#1
    sw    $2#2, 0($29#1)
    slti  $2#3, $5#2, 100
    bne   $2#3, $0, L1

[Figure: the data-flow tree rooted at $2#2, built from the version-numbered instructions above: $2#2 = $2#1 + $3#1, where $3#1 = $5#1 + $4#0 and $5#1 = $0 + $0.]
Example Normalization
[Figure: the data-flow tree $2#2 = sll($7#1, 2) + 14, with $7#1 = $4#0 + 0, is normalized to the canonical form $2#2 = $4#0 * 4 + 14.]
Detection of Loop Induction Variables
• A loop induction variable is a register that has an inter-iteration dependency and increases by a fixed value between iterations (e.g., $V2#2 = $V2#1 + 1).
• The concept of the virtual register makes it possible to detect induction variables held in memory.
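The detection rule can be sketched as follows, assuming each register's per-iteration update has already been normalized to the form new = old + constant; the input encoding is illustrative, not the method's actual data structure.

```python
def find_induction_variables(iteration_defs):
    """iteration_defs maps a register name to its per-iteration update,
    expressed as ('add', source_register, constant) after normalization.
    A register qualifies as a loop induction variable when it is updated
    from its own previous-iteration value by a fixed integer stride
    (e.g. the slide's $V2#2 = $V2#1 + 1 on a virtual register)."""
    result = {}
    for reg, (op, src, const) in iteration_defs.items():
        if op == "add" and src == reg and isinstance(const, int):
            result[reg] = const   # register -> stride
    return result

defs = {
    "$V2": ("add", "$V2", 1),   # memory-resident counter: induction variable
    "$5":  ("add", "$5", 1),    # register counter: induction variable
    "$2":  ("add", "$3", 1),    # depends on another register: not one
}
print(find_induction_variables(defs))
```

Because virtual registers appear in `iteration_defs` just like real ones, induction variables on memory fall out of the same check.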
Application
• 101.tomcatv from the SPECfp95 benchmark suite
• Fortran-to-C translator ver. 19940927
• GCC cross compiler ver. 2.7.2.3 for SIMCA
• Data set: test
• The six innermost loops (#1-#6) were selected
• They have induction variables on memory
Speedup Ratios

  Loop:     #1      #2      #3      #4      #5      #6      ALL
  Speedup:  9.804   1.643   5.178   1.800   3.583   2.611   5.361
Conclusion: Binary-Level Variable Analysis
• We proposed a binary-level variable analysis method.
• The method makes it possible to detect induction variables and their increment/decrement values.
• The detected information allows us to multithread binary codes that may not be multithreadable without our algorithm.
• We attained up to a 9.8x speedup by the multithreading.
Summary
• We presented the dynamic optimization projects at our laboratory.
• The results quantitatively show the performance improvement in each project.
What's the next step of computer architecture research?
• From performance to reliability? Or low power?
  e.g., dependable computing
• Architecture for new device technologies?
  e.g., quantum computing
However, if we stick to conventional high-performance computing research, what is the promising way?