

Department of Information Technology, Uppsala University, http://it.uu.se/

UART: Uppsala Architecture Research Team

Software Decoupled Access-Execute

Kim-Anh Tran, Konstantinos Koukos, Stefanos Kaxiras, Alexandra Jimborean

DAE = Decoupling Uses from Loads

Why DAE? DAE + Frequency Scaling Saves Energy

How Can We Improve DAE?

Selecting Loads: Creating Alternative Access Phases

Problem: Hoisting all loads into an Access Phase is infeasible: address computation overhead diminishes the benefits!

Evaluating Versions at Run-time and Selecting the Best Performing One

Overcoming Address Recomputation: Hoisting Loads to Access Phases

Info

1. Which loads to select for prefetching?

2. How to avoid address recomputation?

Run-time Evaluation Phase: evaluate each combination!

Pick the best combination for the remaining iterations.

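[Figure: run-time per loop slice. Horizontal axis: iteration (0 to 60), divided into loop slices (slice 0, slice 1, ..., slice 5, ...). Vertical axis: run-time. Slices are labeled Orig or with an Access version (A0, A1, A2) followed by E.]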

[Figure: CPU frequency during execution, Original vs. DAE. Original: a single CPU frequency f_opt. DAE: CPU frequency switches between f_min and f_max.]

Memory-bound: run at low frequency.

Compute-bound: run at high frequency.

Original code: compromise on one frequency to balance energy and performance.

DAE helps to save energy by adjusting the frequency.

Problem: Address recomputation in the Execute Phase is expensive.

Create one Access Phase for each level of indirection.

Legend:

Orig: Original Code

An: n-th Access Version

E: Execute Phase (same for all An)

Run each version (the Original and the alternative Access-Execute versions) for a couple of iterations.

One version is identified as the best performing: settle on this combination!
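The evaluate-then-settle scheme described above could be sketched as follows. This is a minimal Python model, not the poster's implementation: the function name `select_version`, the `trial_slices` parameter, and the use of wall-clock timing via `time.perf_counter` are assumptions made for illustration.

```python
import time


def select_version(versions, slices, trial_slices=2):
    """Time each version on a few loop slices, then settle on the fastest.

    versions: dict mapping a label (e.g. "Orig", "A1+E") to a callable that
              processes one loop slice; all versions must give the same result.
    slices:   list of loop slices (any per-slice work description).
    Returns (label of the chosen version, list of per-slice results).
    """
    if trial_slices < 1:
        raise ValueError("trial_slices must be at least 1")
    timings = {}
    results = []
    pos = 0
    # Evaluation phase: every version does real work on its own trial slices.
    for label, run in versions.items():
        trial = slices[pos:pos + trial_slices]
        if not trial:
            break  # loop too short to evaluate the remaining versions
        start = time.perf_counter()
        for s in trial:
            results.append(run(s))
        # Average time per slice, so short trials stay comparable.
        timings[label] = (time.perf_counter() - start) / len(trial)
        pos += len(trial)
    # Settle on the best performing version for the remaining slices.
    best = min(timings, key=timings.get) if timings else None
    for s in slices[pos:]:
        results.append(versions[best](s))
    return best, results
```

With two versions and six slices, the first four slices are spent on evaluation and the last two run with whichever version was faster. Because the trial slices do useful work, the evaluation itself does not repeat any iteration.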

Original loop:

for (...) {
    L1 = ld x[i]
    L2 = ld y[i]
    L1 + L2
}

After unrolling by 2:

for (...) {
    L1 = ld x[i]
    L2 = ld y[i]
    L1 + L2
    L3 = ld x[i+1]
    L4 = ld y[i+1]
    L3 + L4
}

After in-place DAE:

for (...) {
    Access Phase:
    L1 = ld x[i]
    L2 = ld y[i]
    L3 = ld x[i+1]
    L4 = ld y[i+1]
    Execute Phase:
    L1 + L2
    L3 + L4
}

Register: transfer of data from Access to Execute via registers.
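The unroll-then-regroup transformation can be mimicked in a few lines of Python. This is only a model of the ordering of loads and uses: Python gives no control over registers or instruction scheduling, so local variables stand in for registers here, and the function names `original` and `in_place_dae` are hypothetical.

```python
def original(x, y):
    """Baseline loop: each iteration loads two values and uses them at once."""
    out = []
    for i in range(len(x)):
        l1 = x[i]  # L1 = ld x[i]
        l2 = y[i]  # L2 = ld y[i]
        out.append(l1 + l2)
    return out


def in_place_dae(x, y):
    """Loop unrolled by 2, with all loads grouped ahead of their uses."""
    out = []
    n = len(x)
    i = 0
    while i + 1 < n:
        # Access Phase: the four loads of both unrolled iterations.
        l1, l2 = x[i], y[i]
        l3, l4 = x[i + 1], y[i + 1]
        # Execute Phase: consume the values directly from the locals.
        out.append(l1 + l2)
        out.append(l3 + l4)
        i += 2
    # Remainder iteration when the trip count is odd.
    if i < n:
        out.append(x[i] + y[i])
    return out
```

Both functions return the same list for any pair of equally long inputs; only the order in which loads and additions are issued differs. Without unrolling there would be a single iteration's loads to group, which is why the transformation unrolls first.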

[Figure: alternative Access Phases for a loop that loads x[i], a[x[i]], b[a[x[i]]], and y[i]. The versions are alternatives; each pairs one Access Phase with the same Execute Phase.]

Version 1 (1-indirection): the Access Phase covers x[i] and y[i].

Version 2 (2-indirection): the Access Phase covers x[i], a[x[i]], and y[i].

Version 3 (3-indirection): the Access Phase covers x[i], a[x[i]], b[a[x[i]]], and y[i].
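To make the cost of deeper versions concrete, the sketch below lists which elements each Access Phase version would touch for one iteration. It is an illustrative Python model under stated assumptions: the function name `access_phase` and the (array name, index) output format are hypothetical, and a real Access Phase would prefetch these addresses rather than return them.

```python
def access_phase(level, i, x, a, b, y):
    """Return the (array name, index) pairs covered by the Access Phase
    of the version with the given indirection level, for iteration i."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    # Version 1: direct accesses only; addresses depend on i alone.
    touched = [("x", i), ("y", i)]
    if level >= 2:
        xi = x[i]  # x[i] must be loaded to form the address of a[x[i]]
        touched.append(("a", xi))
    if level >= 3:
        axi = a[xi]  # a[x[i]] must be loaded to form the address of b[a[x[i]]]
        touched.append(("b", axi))
    return touched
```

Each extra level needs the value loaded at the previous level before its own address is known, so deeper versions cover more loads but pay for a longer chain of dependent address computations. This is the trade-off that the run-time evaluation resolves.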

[Figure: DAE splits a loop into an Access Phase and an Execute Phase that communicate through the cache.]

Access Phase (memory access): prefetch into cache.

Execute Phase (computation): use data.

Why DAE within one loop? By applying DAE within one loop, we may transfer data via registers: the Access phase loads data into a register, and the Execute phase directly uses the register.

Why unrolling? Unrolling is required as we now apply DAE within the loop.

Reference: A. Jimborean, K. Koukos, V. Spiliopoulos, D. Black-Schaffer, S. Kaxiras. Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling. In Proc. of CGO'14.

Contact: [email protected]