DUSD(Labs)
Breaking the Memory Wall for Scalable Microprocessor Platforms
Wen-mei Hwu
with
John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player,
Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta
University of Illinois at Urbana-Champaign
PACT Keynote, October 1, 2004

Semiconductor computing platform challenges
[Figure labels: performance, billion transistors, power, cost, security, feature set, reliability, Mem. Latency/Bandwidth, Power Constraints, O/S limitations, S/W inertia, wire load, process variation, leakage, fab cost; platform components: Microprocessors, DSP/ASIP, Intelligent RAM, Reconfigurability, accelerators]

ASIC/ASIP economics
Optimistically, ASIC/ASSP revenues growing 10–20% / year
Engineering portion of budget is supposed to be trimmed every year (but never is)
Chip development costs rising faster than increased revenues and decreased engineering costs can make up the difference
Implies 40% fewer IC designs (doing more applications) every process generation!!

Number of IC Designs ≤ (Total ASIC/ASSP Revenues × Engineering Costs) / Per-chip Development Cost
Annotated changes: Total ASIC/ASSP Revenues up 10-20%, Engineering Costs down 5-20%, Per-chip Development Cost up 30-100%, Number of IC Designs down 40%
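
A quick check of the bound above, as a minimal sketch. It assumes each term changes by the midpoint of its annotated range (15%, 12.5%, and 65%); the slide gives only the ranges, and the function names are illustrative rather than taken from the talk.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of affordable IC designs after one process generation to the
 * number before it, using
 *   designs <= (revenues * engineering costs) / per-chip development cost.
 * All three arguments are fractional changes (0.15 means 15%). */
double design_count_ratio(double revenue_growth,
                          double engineering_trim,
                          double dev_cost_growth)
{
    assert(dev_cost_growth > -1.0);  /* development cost must stay positive */
    return (1.0 + revenue_growth) * (1.0 - engineering_trim)
           / (1.0 + dev_cost_growth);
}

/* Print the ratio for the assumed midpoint case. */
void print_midpoint_case(void)
{
    double r = design_count_ratio(0.15, 0.125, 0.65);
    printf("designs after / designs before = %.2f\n", r);
}
```

With these assumed midpoints the ratio comes out near 0.61, which is consistent with the slide's figure of roughly 40% fewer designs per generation.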

ASIPs: non-traditional programmable platforms
Level of concurrency must be comparable to ASICs
ASIPs will be on-chip, high-performance multi-processors
[Figure: network processor block diagram with an XScale Core, 16 Microengines, Hash Engine, Scratch-pad SRAM, RFIFO, TFIFO, CSRs, four QDR SRAM blocks, three RDRAM blocks, PCI, and SPI4 / CSIX]

Example embedded ASSP implementations
Intel IXP1200 Network Processor
Philips Nexperia (Viper)
[Figure: chip block diagrams; labels include MIPS and VLIW]

What about the general purpose world
Clock frequency increase of computing engines is slowing down
  Power budget hinders higher clock frequency
  Device variation limits deeper pipelining
  Most future perf. improvement will come from concurrency and specialization
Size increase of single-thread computing engines is slowing down
  Power budget limits number of transistors activated by each instruction
  Need finer-grained units for defect containment
  Wire delay is becoming a primary limiter in large, monolithic designs
The approach to covering all applications with a primarily single execution model is showing limitations

Impact of Transistor Variations
[Figure: 130nm scatter plot of Normalized Frequency (0.9 to 1.4) versus Normalized Leakage (Isb, 1 to 5); annotations: Frequency ~30%, Leakage Power ~5X]
Source: Shekhar Borkar, Intel

Metal Interconnects
Interconnect RC Delay
[Figure: four plots against technology generation. Line Cap (Relative), scale 0 to 1, generations 500 to 32. Line Res (Relative), log scale 1 to 1000, generations 500 to 32. RC Delay (Relative), log scale 1 to 100, generations 500 to 32. Delay (ps), log scale 1 to 10000, generations 350 to 65, comparing Clock Period with RC delay of 1mm interconnect. Annotations: Low-K ILD, Copper Interconnect, 0.7x Scaled RC Delay]
Source: Shekhar Borkar, Intel

Measured SPECint2000 Performance on real hardware with same fabrication technology
[Figure: bar chart, scale 0 to 1200, for GEOMEAN, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf]
Legend: Intel ecc 7.1 20030909; IMPACT 20031010; gcc 3.2; Pentium4 2GHz (Willamette)
McKinley 1GHz/3MB, 2P HP zx6000, 8GB RAM, linux 2.4.21-imp-smp
Date: October 2003

Convergence of future computing platforms
General processor cores
  Very low power compute and memory structures
  O/S provides lightweight access to custom features
Acceleration logic
  Application specific logic
  High-bandwidth, distributed storage (RAM, registers)
  To developer, behave like software components
Memory system
  Data delivery to processor
  O/S and virtual memory issues
  Intelligent memory controllers
Application processors
  Lightweight compute engines
  High-bandwidth, distributed storage (RAM, registers)
  High-bandwidth, scalable interconnect

Breaking the memory wall with distributed memory and data movement
[Figure: GPP, MAIN MEMORY, MTM, and ACC blocks with LOCAL MEMORY]
Memory transfer module schedules system-wide bulk data movement
General-purpose processor orchestrates activity
Accelerated activities and associated private data are localized for bandwidth, power efficiency
Accelerators can use scheduled, streaming communication...
or can operate on locally-buffered data pushed to them in advance

Parallelization with deep analysis: Deconstructing von Neumann [IWLS2004]
Memory dataflow that enables
  Extraction of independent memory access streams
  Conversion of implicit flows through memory into explicit communication
Applicability to mass software base requires pointer analysis, control flow analysis, array dependence analysis
[Figure, left: CPU executing the code below, with data objects in DRAM: res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2]
Weight_Ai (Az, F_ga3, Ap3)
Weight_Ai (Az, F_g4, Ap4)
Residu (Ap3, &syn_subfr[i],)
Copy (Ap3, h, 11)
Set_zero (&h[11], 11)
Syn_filt (Ap4, h, h, 22, &h)
tmp = h[0] * h[0];
for (i = 1 ; i < 22 ; i++) tmp = tmp + h[i] * h[i];
tmp1 = tmp >> 8;
tmp = h[0] * h[1];
for (i = 1 ; i < 21 ; i++) tmp = tmp + h[i] * h[i+1];
tmp2 = tmp >> 8;
if (tmp2 <= 0) tmp2 = 0;
else tmp2 = tmp2 * MU;
tmp2 = tmp2/tmp1;
preemphasis (res2, temp2, 40)
Syn_filt (Ap4, res2, &syn_p),40, mem_syn_pst, 1);
agc (&syn[i_subfr], &syn)29491, 40)
[Figure, right: PE's running Weight_Ai, Weight_Ai, Copy+Set_zero, Residu, Syn_filt, Corr0/Corr1, preemph, agc, Syn_filt, with the same data objects (res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2) and DRAM]

Memory bottleneck example (G.724 Decoder Post-filter, C code)
Problem:
  Production/consumption occur with different patterns across 3 kernels
  Anti-dependence in preemphasis function (loop reversal not applicable)
  Consumer must wait until producer finishes
Goal:
  Convert memory access to inter-cluster communication
[Figure: timeline of Residu, preemphasis, and Syn_filt communicating through res in MEM; access ranges shown along the timeline: [0:39], [39:0], [39:0], [0:39]]

Breaking the memory bottleneck
Remove anti-dependence by array renaming
Apply loop reversal to match producer/consumer I/O
Convert array access to inter-component communication
[Figure: timeline of Residu, preemphasis, and Syn_filt linked by res and res2]
Interprocedural pointer analysis + array dependence test + array access pattern summary + interprocedural memory data flow
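
The first two steps on this slide can be sketched in C. This is a simplified stand-in for a preemphasis kernel, assuming plain int arithmetic, a 40-sample block (matching the [0:39] ranges), and hypothetical function names; the arithmetic of the actual decoder is not reproduced here.

```c
#include <assert.h>
#include <string.h>

#define N 40  /* block length, matching the [0:39] ranges on the slides */

/* In-place form: x[i] needs the OLD x[i-1], so the loop has to run from
 * high to low indices. This is the anti-dependence on x. */
void preemph_in_place(int x[N], int g, int prev)
{
    for (int i = N - 1; i >= 1; i--)
        x[i] = x[i] - g * x[i - 1];
    x[0] = x[0] - g * prev;
}

/* Renamed form: results go to a separate array, so no input element is
 * overwritten and the loop can be reversed to run forward. A forward
 * loop matches a producer that emits in[0], in[1], ... in order. */
void preemph_renamed(const int in[N], int out[N], int g, int prev)
{
    assert(in != out);  /* renaming requires two distinct arrays */
    for (int i = 0; i < N; i++) {
        int older = (i == 0) ? prev : in[i - 1];
        out[i] = in[i] - g * older;
    }
}

/* Returns 1 if both forms produce the same block for a ramp input. */
int forms_agree(void)
{
    int a[N], b[N], out[N];
    for (int i = 0; i < N; i++)
        a[i] = b[i] = 3 * i + 1;
    preemph_in_place(a, 2, 5);
    preemph_renamed(b, out, 2, 5);
    return memcmp(a, out, sizeof a) == 0;
}
```

In the renamed form each out[i] depends only on in[i] and in[i-1], so a consumer could start as soon as the first elements arrive instead of waiting for the whole array.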

A prototyping experience with the Xilinx ML300
Full system environment
  Linux running on PowerPC
  Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  Virtex 2 Pro FPGA logic treated as software components
Removing memory bottleneck
  Random memory access converted to dataflow
  Memory objects assigned to distributed Block RAM
SW / HW communication
  PLB vs. OCM interface
[Figure: PPC connected through PLB and OCM to BRAM, a PLB - BRAM Interface, FPGA logic, and DDR]

Initial results from our ML300 testbed
Case study: GSM vocoder
  Main filter in FPGA
  Rest in software running under Linux with customized support
  Straightforward software/accelerator communications pattern
  Fits in available resources on Xilinx ML300 V2P7
  Performance compared to all-software execution, with communication overhead
[Figure: bar chart of projected filter latency in cycles (axis 0 to 3000, then 14000 to 16000) for Software and for Naïve and Optimized hardware implementations; annotations: ~8x, ~32x]
[Figure: hardware implementation datapath over cycles 0 to 2, with multipliers, adders, and subtractors fed by a Shift Register]

Grand challenge
Moving the mass-market software base to heterogeneous computing architectures
  Embedded computing platforms in the near term
  General purpose computing platforms in the long run
[Figure: Applications and Systems Software connected to Platforms through Programming models, Restructuring compilers, Communications and storage management, Accelerator architectures, OS support]

Slicing through software layers
[Figure: a Total System Customizer turns the layered design paradigm (Application, Middleware Layer, O/S Layer, Device Drivers, Circuits) into an automated custom implementation (Integrated Application Software, Composite Low-layer, Custom Circuits). An intermediate experimental step keeps Application, Middleware, O/S Layer, Drivers, and Circuits and adds a custom feature with a user level driver.]
Automation delivers productivity and efficiency

Taking the first step: pointer analysis
To what can this variable point? (points-to)
Can these two variables point to the same thing? (alias)
Fundamental to unraveling communications through memory: programmers like modularity and pointers!
Pointer analysis is abstract execution
  Model all possible executions of the program
  Has to include important facets, or result won’t be useful
  Has to ignore irrelevant details, or result won’t be timely
  Unrealizable dataflow = artifacts of “corners cut” in the model
Typically, emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis time – for typical alias uses, may be OK…
…but we have new applications that can benefit from higher accuracy
  Data flow unraveling for logic synthesis and heterogeneous systems

How to be fast, safe and accurate?
An efficient, accurate, and safe pointer analysis based on the following two key ideas
[Figure: organization chart of CEO, VPs, managers, and workers, labeled Increased Abstraction]
Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher level component
The algorithm can locally cut its losses (like a bulkhead) …
… to avoid a global explosion in problem size

One facet: context sensitivity
Context sensitivity – avoids unrealizable data flow by distinguishing proper calling context
What assignments do a and g receive?
  CI: a and g each receive 1 and 3
  CS: g receives only 1 and a receives only 3
Typical reactions to CS costs
  Forget it, live with lots of unrealizable dataflow
  Combine it with a “cheapener” like the lossy compression of a Steensgaard analysis
We want to do better, but we may sometimes need to mix CS and CI to keep analysis fast
Example
  int g;
  iris()
    int a;
    jade1(&g, 1)
    jade2(&a, 3)
  jade(int *p, int q)
    int r;
    *p := q;
    r := g + 5;
    x := z
Desired results
  g := 1
  a := 3
  r := 6
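
The difference between the CI and CS answers for this example can be sketched in C. This is a minimal illustration, not the analysis from the talk: it models only the store *p := q, represents sets of small constants as bitmasks, and all type and function names are assumptions made for the sketch.

```c
#include <assert.h>

/* Abstract locations of the example. */
enum { LOC_G, LOC_A, NUM_LOCS };

/* One call of jade(p, q): p points to `target`, q holds `value`. */
struct call { int target; unsigned value; };

static const struct call calls[] = {
    { LOC_G, 1 },   /* jade(&g, 1) */
    { LOC_A, 3 },   /* jade(&a, 3) */
};
enum { NUM_CALLS = sizeof calls / sizeof calls[0] };

/* Set of small constants as a bitmask: bit v set means "may hold v". */
typedef unsigned valset;
#define VAL(v) (1u << (v))

/* Context-insensitive: merge the actuals of every call into one target
 * set for p and one value set for q, then apply *p := q once. */
void solve_ci(valset out[NUM_LOCS])
{
    unsigned p_targets = 0;   /* bitmask over locations */
    valset q_values = 0;
    for (int c = 0; c < NUM_CALLS; c++) {
        p_targets |= 1u << calls[c].target;
        q_values |= VAL(calls[c].value);
    }
    for (int l = 0; l < NUM_LOCS; l++)
        out[l] = ((p_targets >> l) & 1u) ? q_values : 0;
}

/* Context-sensitive: apply *p := q separately for each call, so each
 * target receives only the value passed alongside it. */
void solve_cs(valset out[NUM_LOCS])
{
    for (int l = 0; l < NUM_LOCS; l++)
        out[l] = 0;
    for (int c = 0; c < NUM_CALLS; c++) {
        assert(calls[c].target < NUM_LOCS);
        out[calls[c].target] |= VAL(calls[c].value);
    }
}
```

In this sketch solve_ci reports that both g and a may hold 1 or 3, while solve_cs reports only 1 for g and only 3 for a, matching the CI and CS answers listed on the slide.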

Context Insensitive (CI)
Assignments collected from the program:
  p := &g;
  p := &a;
  q := 1;
  q := 3;
  *p := q;
  r := g + 5;
Collecting all the assignments in the program and solving them simultaneously yields a context insensitive solution
Unfortunately, this leads to three spurious solutions:
  a := 1
  g := 3
  r := 8

Context Sensitive (CS): Naïve process
Program:
  int g;
  iris()
    int a;
    jade1(&g, 1)
    jade2(&a, 3)
  jade(int *p, int q)
    int r;
    *p := q;
    r := g + 5;
    x := z
Parameter bindings: p := &g; p := &a; q := 1; q := 3;
One copy of jade per call:
  jade1: p1 := &g; q1 := 1; *p1 := q1; r1 := g + 5; x1 := z1
  jade2: p2 := &a2; q2 := 3; *p2 := q2; r2 := g + 5; x2 := z2
Excess statements unnecessary and costly
Retention of side effect still leads to spurious results
Results shown: g := 1, a := 3, g := 3, a := 1, r := 6, r := 8

CS: “Accurate and Efficient” approach
Program:
  int g;
  iris()
    int a;
    jade1(&g, 1)
    jade2(&a, 3)
  jade(int *p, int q)
    int r;
    *p := q;
    r := g + 5;
    x := z
Parameter bindings: p := &g; p := &a; q := 1; q := 3;
Compact summary of jade used: *p := q; r := 6
Summary accounts for all side-effects. DELETE assignment to prevent contamination
Summary applied per call:
  jade1: p1 := &g1; q1 := 1; *p1 := q1;
  jade2: p2 := &a2; q2 := 3; *p2 := q2;
Now, only correct result derived: g := 1, a := 3

Analyzing large, complex programs [SAS2004]
Analysis time in seconds
Benchmark   INACCURATE Context Insensitive   PREV Context-Sensitive   NEW Context-Sensitive
espresso    2                                9                        1
li          1                                1332                     1
ijpeg       2                                85                       1
perl        4                                408                      11
gcc         52                               HOURS                    124
perlbmk     155                              MONTHS                   198
gap         62                               3350                     117
vortex      5                                136                      3
twolf       1                                2                        1
This results in an efficient analysis process without loss of accuracy
Originally, problem size exploded as more contexts were encountered
New algorithm contains problem size with each additional context
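
Why the problem size explodes with depth can be illustrated with a toy model in C. The model is an assumption made for this sketch, not the measured data or the algorithm from the talk: one function per call graph level, each calling the next level from a fixed number of call sites. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed toy call graph: levels 0..depth, one function per level, and
 * each function calls the next level's function from `sites` distinct
 * call sites. No overflow checking; keep the inputs small. */

/* Naive exhaustive inlining analyzes one copy of a function per calling
 * context, which is sites^level copies at each level. */
uint64_t naive_copies(unsigned sites, unsigned depth)
{
    assert(sites >= 1);
    uint64_t total = 0, at_level = 1;
    for (unsigned level = 0; level <= depth; level++) {
        total += at_level;
        at_level *= sites;
    }
    return total;
}

/* With one compact summary per function, each function body is analyzed
 * once and every call site costs one summary application. */
uint64_t summary_work(unsigned sites, unsigned depth)
{
    uint64_t bodies = (uint64_t)depth + 1;
    uint64_t applications = (uint64_t)sites * depth;
    return bodies + applications;
}
```

In this toy model, two call sites per function and a depth of 39 give naive_copies(2, 39) = 2^40 - 1, about 1.1 × 10^12 copies, against summary_work(2, 39) = 118 units. The gap grows exponentially with depth for naive inlining and only linearly with summaries.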
[Figure: two log-scale plots of problem size versus Call Graph Depth (1 to 39, from main() to leaves) for 008.espresso, 099.go, 130.li, 124.m88ksim, 175.vpr, 134.perl, 176.gcc, 254.gap, 255.vortex. Naïve Exhaustive Inlining: scale 1E+00 to 1E+14, annotated 10^12. New Compaction Algorithm: scale 1E+00 to 1E+05, annotated 10^4]

Example application and current challenges [PASTE2004]
Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
Example: Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools
[Figure: 132.ijpeg, Discovered Objects (1 to 1000) versus Observed Connectivity (1 to 10000), log scales, for ANALYSIS SCOPE 0 to 3; "A multitude of distinct objects" is marked BETTER and "A few, highly-connected objects" is marked WORSE]

From benchmarks to broad application code base
The long term trend is for all code to go through a compiler and be managed by a runtime system
  Microsoft code base to go through Phoenix – OpenIMPACT participation
  Open source code base to go through GCC/OpenIMPACT under Gelato
The compiler and runtime will perform deep analysis to allow tools to have visibility into software
  Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime, etc.
[Figure: layers labeled Applications, Operating systems, Libraries, Compiler, Runtime and Tools, Hardware]

Global memory dataflow analysis
Integrates analyses to deconstruct memory “black box”
  Interprocedural pointer analysis: allow programmer to use language and modularity without losing transformability
  Array access pattern analysis: figure out communication among loops that communicate through arrays
  Control and data flow analyses: enhance resolution by understanding program structure
  Heap analysis extends analysis to much wider software base
SSA-based inductor detection and dependence test have been integrated into IMPACT environment

Example on deriving memory data flow
  main(...) {
    int A[100];
    foo(A, 64);
    bar(A+1, 64);
  }
  foo (int *s, int L) {
    int *p=s, i;
    for (i=0; i<L; i++) { *p = ...; p++; }
  }
  bar (int *t, int M) {
    int *q=t, i;
    for (i=0; i<M; i++) { … = *q; q++; }
  }
Loop body: foo has Write *p; bar has Read *q
Pointer relation analysis restates p/q in terms of s/t
Summary for the whole loop (procedure body):
  foo: Write from *(s) to *(s+L-1) with stride 1
  bar: Read from *(t) to *(t+M-1) with stride 1
Procedure call, parameter mapping:
  foo writes A[0:63] stride 1
  bar reads A[1:64] stride 1
Data flow analysis determines that A[64] is not from foo
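
The last two steps of this example (parameter mapping and the data flow check) can be sketched in C. This is a minimal illustration that assumes stride-1 accesses to a single array; the struct and function names are hypothetical, since the slide shows only the resulting ranges.

```c
#include <assert.h>

/* Summary of a stride-1 access to one array: the inclusive index range
 * touched. The range is empty when lo > hi. */
struct range { int lo, hi; };

/* Map a callee summary "*(ptr) .. *(ptr+len-1)" onto the caller's
 * array, where the actual argument is A + offset. */
struct range map_to_caller(int offset, int len)
{
    assert(len > 0);
    struct range r = { offset, offset + len - 1 };
    return r;
}

/* Elements of `read` that `write` also covers: data that may flow from
 * the writer to the reader. */
struct range covered(struct range read, struct range write)
{
    struct range r;
    r.lo = read.lo > write.lo ? read.lo : write.lo;
    r.hi = read.hi < write.hi ? read.hi : write.hi;
    return r;
}

/* Elements of `read` above the end of `write`: these cannot come from
 * the writer. */
struct range not_covered_above(struct range read, struct range write)
{
    struct range r;
    r.lo = read.lo > write.hi + 1 ? read.lo : write.hi + 1;
    r.hi = read.hi;
    return r;
}
```

Mapping foo(A, 64) and bar(A+1, 64) gives a write range of A[0:63] and a read range of A[1:64]; the covered part is A[1:63], and the only element left over is A[64], which matches the slide's conclusion that A[64] is not produced by foo.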

Conclusions and outlook
Heterogeneous multiprocessor systems will be the model for both general purpose and embedded computing platforms in the future
  Both are motivated by powerful trends
  Shorter term adoption for embedded systems
  Longer term for general purpose systems
Programming models and parallelization of traditional programs to channel software to these new platforms
  Feasibility of deep pointer analysis demonstrated
  Many need to participate in solving this grand challenge problem