Licentiate Thesis Seminar, Uppsala University, 25/9 – 2003
Efficient Synchronization and Coherence for Nonuniform Communication Architectures

Zoran Radovic


Page 1

[email protected] Licentiate Thesis Seminar Sept 25, 2003

Licentiate Thesis Seminar
Uppsala University, 25/9 – 2003

Efficient Synchronization and Coherence for Nonuniform Communication Architectures

Zoran Radovic
[email protected]

Page 2

Introduction: Cache

[Figure: a processor (P) with a cache ($) in front of Memory, which holds A and B. The cache is described as a “scratch pad”.]

Program: A = 5; B = A + 75
Values after execution: A: 5, B: 80

Page 3

Introduction: Cache Coherence

[Figure: processors P1, P2 and P3, running a web server, a database server, etc., share one Memory holding A and B. Copies of A: 5 appear in the caches; after A = A + 1, the value A: 6 is delivered by a cache-to-cache transfer. B: 80 results from A = 5; B = A + 75.]

Code fragments on the slide: A:=0; A:=56; Y:=X; BARRIER; LOCK(CS); UNLOCK

Label: Cache Coherence

Page 4

Inside a Real Thing ...

Page 5

Nonuniform Memory Access Architecture (NUMA)

[Figure: eight processors (P), each with a cache ($), grouped around two memories that are connected by a switch. Access time ratio: 1 : 2 – 10.]

Many NUMA optimizations have been proposed:
-- Page migration speeds up accesses to “private” data
-- Page replication speeds up reads of “shared” data

These do not help communication, e.g., cache-to-cache transfers.

Page 6

Nonuniform Communication Architecture (NUCA)

A “new” property of NUMAs: NUCA

[Figure: eight processors (P), each with a cache ($), grouped around two memories that are connected by a switch. NUCA ratio: 1 : 2 – 10.]

NUCA examples (NUCA ratios):
-- 1992: Stanford DASH (~ 4.5)
-- 1996: Sequent NUMA-Q (~ 10)
-- 1999: Sun WildFire (~ 6)
-- 2000: Compaq DS-320 (~ 3.5)
-- Future (Today): CMP, SMT (~ 10)

NUCA optimizations are getting important for future architectures!

Page 7

Outline

-- Introduction
-- NUCA Locks
   -- Paper A: RH Lock
   -- Paper B: HBO Locks
-- Beating the Real Thing …
   -- Paper C: DSZOOM – Software-based Shared Memory
   -- Paper D: THROOM – POSIX Front-end
   -- Paper E: SAIT & Write Permission Cache (WPC)
-- Contributions
-- Future Work

Page 8

Synchronization Basics

Locks are used to protect critical section (CS) data.

CS examples:
-- Bank account status
-- Global counters
-- Number of on-line visitors
-- …

A:=0
BARRIER

LOCK(L)
A:=A+1
UNLOCK(L)

LOCK(L)
B:=A+5
UNLOCK(L)
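The LOCK/UNLOCK pattern on this slide can be sketched in C with a POSIX mutex. This is only an illustration under stated assumptions: the slide does not say which thread runs which critical section, so the sketch assumes two threads, and it uses thread creation in place of the BARRIER. The names `run_sync_basics`, `increment_a` and `compute_b` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Shared CS data protected by lock L, as on the slide. */
static int A, B;
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;

#define LOCK(l)   pthread_mutex_lock(&(l))
#define UNLOCK(l) pthread_mutex_unlock(&(l))

/* First critical section: A := A + 1. */
static void *increment_a(void *arg) {
    (void)arg;
    LOCK(L);
    A = A + 1;
    UNLOCK(L);
    return NULL;
}

/* Second critical section: B := A + 5. */
static void *compute_b(void *arg) {
    (void)arg;
    LOCK(L);
    B = A + 5;
    UNLOCK(L);
    return NULL;
}

/* Runs the two critical sections on separate threads (an assumption).
 * A is initialised before either thread starts, which stands in for
 * the BARRIER. Returns 0 on success, -1 if a thread cannot be created. */
int run_sync_basics(int *a_out, int *b_out) {
    pthread_t t1, t2;
    assert(a_out != NULL && b_out != NULL);
    A = 0;
    B = 0;
    if (pthread_create(&t1, NULL, increment_a, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, compute_b, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    *a_out = A;
    *b_out = B;
    return 0;
}
```

After `run_sync_basics` returns, A is always 1, while B is 5 or 6 depending on which thread entered its critical section first. The lock guarantees that each update is atomic, not that the sections run in a particular order.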

Page 9

Synchronization Example

Locks are used to protect critical section (CS) data.

[Figure: processors P1, P2 and P4, each with a cache ($), share one Memory that holds the CS flag and the CS data. One processor goes through Lock CS flag, Update CS data, Unlock, while the others Test / Spin; after the lock handover another processor goes through the same steps. Annotations: “lock handover”, “CS efficiency”, “Write BUSY token to the flag…”.]

Page 10

Large System Synchronization

[Figure: three nodes, each with its own Memory and processors with caches ($) (P1, P2, … P4; P5, P6, … P8; P9, P10, … P12), connected by a switch. Some processors run Lock, Update, Unlock while the others repeatedly Test the lock.]

Three problems under contention with Spin (Test&Set) locks:
1) Test and invalidation traffic
2) Lock handover
3) CS efficiency
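A spin lock of the kind discussed here can be sketched with C11 atomics. This is a generic test-and-test&set (TATAS) sketch, not code from the thesis; the names `tatas_lock`, `tatas_acquire` and friends are hypothetical. Every waiter spins on the same flag, which is where the test and invalidation traffic comes from.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { FREE = 0, BUSY = 1 };

typedef struct {
    atomic_int flag;   /* the CS flag: FREE or BUSY */
} tatas_lock;

void tatas_init(tatas_lock *l) {
    assert(l != NULL);
    atomic_init(&l->flag, FREE);
}

/* One attempt: write the BUSY token and report whether the flag was FREE. */
bool tatas_try_acquire(tatas_lock *l) {
    return atomic_exchange_explicit(&l->flag, BUSY,
                                    memory_order_acquire) == FREE;
}

void tatas_acquire(tatas_lock *l) {
    for (;;) {
        /* "Test": spin on plain loads, which can be served from the
         * local cache until the holder's unlock invalidates the line. */
        while (atomic_load_explicit(&l->flag, memory_order_relaxed) == BUSY)
            ;
        /* "Test&Set": only now issue the expensive atomic write. */
        if (tatas_try_acquire(l))
            return;
    }
}

void tatas_release(tatas_lock *l) {
    atomic_store_explicit(&l->flag, FREE, memory_order_release);
}
```

When the holder releases the lock, every spinning processor sees the flag change and races for it with an atomic write; only one wins, and all the others have still generated traffic.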

Page 11

Vasaloppet
“Contention Problem in Sweden”

Traditional cross-country ski race, 90 km …

85.6533 km to go… CS

Page 12

Spin Locks Under Contention

[Figure: Critical Section (CS) Cost versus Amount of Contention, with curves for Spin locks and Spin locks w/ backoff.]

IF (more contention) THEN less efficient CS …

“The more important the slower it runs…”
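For readers unfamiliar with the “w/ backoff” variant, a minimal sketch in C11 follows. It assumes exponential backoff with a fixed ceiling; the constants `BACKOFF_BASE` and `BACKOFF_CAP` and all function names are illustrative choices, not values from the thesis.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { FREE = 0, BUSY = 1 };

#define BACKOFF_BASE 16u     /* illustrative starting delay   */
#define BACKOFF_CAP  1024u   /* illustrative ceiling on delay */

/* Doubles the delay, saturating at BACKOFF_CAP. */
unsigned next_backoff(unsigned current) {
    unsigned doubled = current * 2u;
    return doubled > BACKOFF_CAP ? BACKOFF_CAP : doubled;
}

/* Busy-waits for roughly `turns` loop iterations without touching the lock. */
static void spin_delay(unsigned turns) {
    for (volatile unsigned i = 0; i < turns; i++)
        ;
}

/* Acquires the lock; returns the number of failed attempts. */
unsigned backoff_acquire(atomic_int *flag) {
    unsigned delay = BACKOFF_BASE;
    unsigned failures = 0;
    assert(flag != NULL);
    while (atomic_exchange_explicit(flag, BUSY,
                                    memory_order_acquire) != FREE) {
        failures++;
        spin_delay(delay);            /* stay off the interconnect */
        delay = next_backoff(delay);
    }
    return failures;
}

void backoff_release(atomic_int *flag) {
    atomic_store_explicit(flag, FREE, memory_order_release);
}
```

Backoff lowers the traffic per waiter, but a waiter may still be sleeping when the lock becomes free, so the handover can be delayed.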

Page 13

Making it Scalable: Queues …

-- First-come, first-served order
-- Starvation avoidance
-- Maximal fairness
-- Reduced traffic

Queue-based locks:
-- HW: QOLB ‘89
-- SW: MCS ‘91
-- SW: CLH ‘93
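As an illustration of a software queue-based lock, here is a sketch of an MCS-style lock in C11. It follows the commonly published structure of the algorithm (a tail pointer plus one queue node per waiter); the type and function names are hypothetical, and sequentially consistent atomics are used for simplicity rather than performance.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node mcs_node;
struct mcs_node {
    _Atomic(mcs_node *) next;   /* successor in the queue, if any      */
    atomic_bool locked;         /* true while this waiter must spin    */
};

typedef struct {
    _Atomic(mcs_node *) tail;   /* last waiter, or NULL when lock free */
} mcs_lock;

void mcs_init(mcs_lock *l) {
    assert(l != NULL);
    atomic_init(&l->tail, NULL);
}

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Enqueue at the tail; arrival order defines service order. */
    mcs_node *pred = atomic_exchange(&l->tail, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);
        /* Each waiter spins on its own node, not on a shared flag. */
        while (atomic_load(&me->locked))
            ;
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* A successor is enqueueing; wait until it links itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);   /* hand the lock over */
}
```

Because each waiter spins on its own node, a release touches only the successor's flag, which is what keeps the traffic down as contention grows.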

Page 14

Queue-based Locks

[Figure: CS Cost versus Amount of Contention, with curves for Spin locks, Spin locks w/ backoff and Queue-based locks.]

IF (more contention) THEN constant CS cost …

Page 15

Raytrace Speedup

[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) for TATAS and MCS on Sun WildFire (WF), 2 nodes with 14 CPUs each, NUCA Ratio = 6.]

Page 16

This Thesis

[Figure: CS Cost versus Amount of Contention, with curves for Spin locks, Spin locks w/ backoff, Queue-based locks and NUCA locks.]

IF (more contention) THEN more efficient CS …

“The more important the faster it runs…”

Page 17

Raytrace Speedup

[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) for TATAS, MCS and NUCA Locks on Sun WildFire (WF), 2 nodes with 14 CPUs each.]

Page 18

NUCA Locks

[Figure: three nodes, each with its own Memory and processors with caches ($), connected by a switch, annotated with Test and Lock/Unlock activity.]

1) Reduces traffic (one CPU per node is testing…)
2) Improves lock handover
3) More efficient CS (local traffic is cheaper)
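One way to make a lock favour handovers inside a node is to store the holder's node id in the lock word and let waiters choose their backoff from it. The sketch below illustrates that general idea in C11; it is not the RH or HBO algorithm from the thesis, and the constants, the value `HBO_FREE` and the function names are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>

#define HBO_FREE (-1)               /* lock word when nobody holds the lock */
#define LOCAL_BACKOFF_CAP  64u      /* illustrative: holder is in my node   */
#define REMOTE_BACKOFF_CAP 4096u    /* illustrative: holder is elsewhere    */

/* Chooses the backoff ceiling from where the current holder runs. */
unsigned hbo_backoff_cap(int holder_node, int my_node) {
    return holder_node == my_node ? LOCAL_BACKOFF_CAP : REMOTE_BACKOFF_CAP;
}

static void spin_delay(unsigned turns) {
    for (volatile unsigned i = 0; i < turns; i++)
        ;
}

/* Acquires the lock by swapping my node id into a free lock word. */
void hbo_acquire(atomic_int *lock, int my_node) {
    unsigned delay = 1;
    assert(my_node >= 0);
    for (;;) {
        int observed = HBO_FREE;
        if (atomic_compare_exchange_strong_explicit(
                lock, &observed, my_node,
                memory_order_acquire, memory_order_relaxed))
            return;
        /* The CAS failed, so `observed` holds the holder's node id. */
        unsigned cap = hbo_backoff_cap(observed, my_node);
        spin_delay(delay);
        delay = delay * 2u > cap ? cap : delay * 2u;
    }
}

void hbo_release(atomic_int *lock) {
    atomic_store_explicit(lock, HBO_FREE, memory_order_release);
}
```

Waiters in the holder's node retry sooner than remote waiters, so the lock and the CS data tend to stay within one node for several handovers. The price is reduced fairness towards remote nodes.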

Page 19

Application Performance
Raytrace Speedup

[Figure: Speedup (0 to 8) versus Number of Processors (0 to 28) on WF, with curves for TATAS and MCS.]

Page 20

Application Performance
Raytrace Speedup

[Figure: Speedup (0 to 8) versus Number of Processors (0 to 28) on WF, with curves for TATAS, MCS, HBO, HBO_GT and RH Lock.]

Page 21

Total Traffic: Raytrace

[Figure: bar chart on a 0.0 to 1.4 scale for TATAS, TATAS_EXP, MCS and HBO_GT, split into Local Transactions and Global Transactions.]

Page 22

Outline

-- Introduction
-- NUCA Locks
   -- Paper A: RH Lock
   -- Paper B: HBO Locks
-- Beating the Real Thing …
   -- Paper C: DSZOOM – Software-based Shared Memory
   -- Paper D: THROOM – POSIX Front-end
   -- Paper E: SAIT & Write Permission Cache (WPC)
-- Contributions
-- Future Work

Page 23

Servers vs. Clusters

[Figure: the shared-memory code fragments A:=0; A:=56; Y:=X; BARRIER; LOCK(CS); UNLOCK appear twice, once for each system type, with a “??” annotation.]

Page 24

Popular Solutions

Solution 1: more hardware (HW-DSM)
-- Transparent for programmers
-- Usually good scalability
-- Expensive, hard verification, long time to market …

Solution 2: simple HW + software (SW-DSM)
-- Can use more complex (adaptive) protocols
-- Traditionally poor scalability for many programs
-- Shorter time to market, simple to upgrade/customize

Page 25

The DSZOOM proposal

Page 26

DSZOOM Cluster

DSZOOM Nodes:
-- Each node consists of an unmodified workstation/server
-- Server’s hardware provides memory protocols for caches and memory within each machine

+ DSZOOM Cluster Network:
-- “Standard” and fast cluster interconnect
-- Inexpensive user-level remote memory access

+ DSZOOM software:
-- Memory protocols between nodes, synchronization

Page 27

Problems with Traditional SW-DSMs

-- Large coherence units (4-8 kB): False Sharing!
-- Weaker Memory Models
[e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]

-- Protocol agent messaging is slow
-- Most efficiency lost in interrupt/poll

[Figure: two nodes, each with CPUs, Mem and a Prot. agent; one node issues LD a.]

Page 28

Our proposal: DSZOOM

-- Run the entire protocol in the requesting processor
-- No protocol agent communication!
-- Assumes user-level remote memory access: put, get, and atomics [InfiniBand]
-- Fine-grain memory protocols (64 bytes)
-- Hardware-like memory models [Shasta, Blizzard, Sirocco]

Page 29

Global Coherency Action
Read data modified in a third node: 3-hop read

[Figure: “Blocking directory” protocol across Node 1, Node 2 and Node 3. The Requestor executes LD a; the numbered steps are 1. atomic, 2a. atomic, 2b. get (data), 3a. put and 3b. put. Other labels: DIR, Mem, Write Perm.]

Page 30

Squeezing protocols into binaries…

Surrounding code:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...

Load in the Original Program:
    ld [%o1 + 64], %o0

The same load in the DSZOOM Program, followed by Fast-path Protocol Code:
    ld [%o1 + 64], %o0
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop

Slow-path Protocol Code (C-code)
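The fast-path check in the DSZOOM snippet (mask the loaded value with 255, compare with 170, branch if not equal) can be written out in C as follows. This is a reading of the assembly under stated assumptions: that the `bne` skips the protocol code when the low byte is not 170, and that falling through enters the slow path. The slide does not say what 170 represents; `slow_path_load` is a hypothetical stand-in that only counts calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAGIC_BYTE 170u   /* 0xAA, the constant compared in the snippet */

static unsigned slow_path_calls;   /* counts entries into the slow path */

/* Hypothetical slow path: a real system would run its coherence
 * protocol here. This stand-in counts the call and reloads the word. */
static uint32_t slow_path_load(const uint32_t *addr) {
    slow_path_calls++;
    return *addr;
}

/* Mirrors the instrumented load: load, mask with 255, compare with 170. */
uint32_t instrumented_load(const uint32_t *addr) {
    assert(addr != NULL);
    uint32_t value = *addr;               /* ld  [%o1 + 64], %o0         */
    if ((value & 255u) != MAGIC_BYTE)     /* mov/and/cmp/bne: fast path  */
        return value;
    return slow_path_load(addr);          /* fall through: slow path     */
}

unsigned slow_path_count(void) {
    return slow_path_calls;
}
```

In this reading the common case costs a handful of extra instructions per load, and the C-code slow path is entered only when the loaded value matches the marker.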

Page 31

Compilation Process

[Figure: tool flow with the following elements: Unmodified Application, GNU gcc, a.out, EEL, Memory Protocols (C-code), Parallel Programming Constructs, DSZOOM Run-Time Library, link, (Un)executable.]

Page 32

Results
Execution Times in Seconds (16 CPUs)

[Figure: bar chart of Execution time [seconds] (0 to 10) comparing E6000 16 CPUs, CC-NUMA 2x8 and DSZOOM-WF 2x8, with HW and SW labels.]

Page 33

Outline

-- Introduction
-- NUCA Locks
   -- Paper A: RH Lock
   -- Paper B: HBO Locks
-- Beating the Real Thing …
   -- Paper C: DSZOOM – Software-based Shared Memory
   -- Paper D: THROOM – POSIX Front-end
   -- Paper E: SAIT & Write Permission Cache (WPC)
-- Contributions
-- Future Work

Page 34

THROOM
Towards Higher Transparency …

[Figure: the DSZOOM compilation flow (Unmodified Application, GNU gcc, a.out, EEL, Memory Protocols (C-code), Parallel Programming Constructs, DSZOOM Run-Time Library, link, (Un)executable) shown together with a flow that starts from the a.out of an Unmodified POSIX thread (Pthread) Application and uses EEL and Memory Protocols (C-code).]

Transparent runtime support:
-- memory allocation
-- thread creation / termination
-- synchronization
-- I/O…

Page 35

SAIT Overview

SAIT = SPARC Assembler Instrumentation Tool
-- Instrument assembler files
-- More information about programs is available
-- Support for liveness analysis

[Figure: tool flow with the following elements: Source File, cc, .s assembler output, SAIT, snippets.txt, .s instrumented assembler, ld, link, User Library (e.g., protocols), calls, a.out.]

Used in several UART projects!

Page 36

Write Permission Cache (WPC)

Store instrumentation is expensive…

[Figure: processors (P) in front of Memory, each with a WPC. A Store A raises the question “Write Permission?”. Label: “Write permission: A, B, D”.]
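The idea of caching a write permission so that repeated stores skip the expensive check can be sketched as below. This is a simplified, single-entry model written for illustration: it assumes that a store to the most recently permitted 64-byte unit needs only a compare, and that any other store runs the full check. `acquire_write_permission` is a hypothetical stand-in for that check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u          /* coherence unit size used in this sketch */
#define NO_LINE UINT64_MAX      /* marks an empty WPC entry                */

typedef struct {
    uint64_t cached_line;       /* unit whose write permission is cached */
    unsigned slow_checks;       /* number of full permission checks run  */
} wpc1;

void wpc_init(wpc1 *w) {
    assert(w != NULL);
    w->cached_line = NO_LINE;
    w->slow_checks = 0;
}

/* Hypothetical stand-in for the expensive part of a store check. */
static void acquire_write_permission(wpc1 *w, uint64_t line) {
    w->slow_checks++;
    w->cached_line = line;
}

/* Store check: a cheap compare on a WPC hit, the full check on a miss. */
void wpc_store(wpc1 *w, uint8_t *memory, uint64_t addr, uint8_t value) {
    uint64_t line = addr / LINE_BYTES;
    assert(w != NULL && memory != NULL);
    if (line != w->cached_line)
        acquire_write_permission(w, line);
    memory[addr] = value;
}
```

With this model, three stores to addresses 0, 8 and 63 trigger one full check, and a following store to address 64 triggers a second one.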

Page 37

Contributions

-- Nonuniform Communication Architecture (NUCA)
-- Several NUCA-locks that exploit NUCAs:
   -- RH lock
   -- Three HBO locks
-- DSZOOM: Novel SW-DSM system
-- THROOM: Supporting POSIX binaries on clusters
-- SAIT: SPARC Assembler Instrumentation Tool
-- WPC: Write Permission Cache

Page 38

Future Work

-- NUCA locks for the DSZOOM system
-- Instrumentation optimizations
   -- Compiler support
   -- Optimizing backend
-- Further WPC studies/optimizations
-- Protocol optimizations
   -- Adaptive Invalidate/Update
   -- “Push based” protocols

Page 39

Licentiate Thesis Seminar
Uppsala University, 25/9 – 2003

Efficient Synchronization and Coherence for Nonuniform Communication Architectures

Zoran Radovic
[email protected]

Page 40

Fairness Study
2-node Sun WildFire, 28 CPUs

[Figure: Number of Finished Processors (0 to 28) versus Time [seconds] (0 to 15) for TATAS, MCS and HBO_GT.]

Page 41

Traditional Microbenchmark

For each thread:

    for (i = 0; i < iterations; i++) {
        LOCK(L);
        /* null/small Critical Section */
        UNLOCK(L);
    }

Page 42

Lock performance
Traditional microbenchmark

[Figure, first panel: Time [microseconds] (0 to 60) versus Number of Processors (0 to 28) on WF for TATAS, MCS and HBO_GT.]

[Figure, second panel: Node handoffs [%] (0 to 100) versus Number of Processors (0 to 28) for TATAS, MCS and HBO_GT.]

Page 43

New Microbenchmark

    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work); // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }

-- More realistic node handoffs for queue-locks
-- Constant number of processors
-- Control the “amount of contention”
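A runnable approximation of this microbenchmark loop is sketched below. It is not the benchmark used in the thesis: a POSIX mutex stands in for the lock under test, the delays are plain busy loops, the random delay comes from a small linear congruential generator, and timing is left out. All names and constants are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define LOCK(l)   pthread_mutex_lock(&(l))
#define UNLOCK(l) pthread_mutex_unlock(&(l))
#define MAX_THREADS 8

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;        /* CS data, updated only under L */

typedef struct {
    int iterations;
    unsigned critical_work;        /* loop turns spent inside the CS  */
    unsigned static_work;          /* fixed loop turns outside the CS */
    unsigned random_max;           /* upper bound on the random part  */
    unsigned seed;                 /* per-thread generator state      */
} bench_args;

static void delay(unsigned turns) {
    for (volatile unsigned i = 0; i < turns; i++)
        ;
}

/* Small linear congruential generator; returns a value in [0, max]. */
static unsigned next_random(unsigned *seed, unsigned max) {
    *seed = *seed * 1103515245u + 12345u;
    return (*seed >> 16) % (max + 1u);
}

static void *bench_thread(void *p) {
    bench_args *a = p;
    for (int i = 0; i < a->iterations; i++) {
        LOCK(L);
        shared_counter++;
        delay(a->critical_work);                       /* CS             */
        UNLOCK(L);
        delay(a->static_work);                         /* static_delay() */
        delay(next_random(&a->seed, a->random_max));   /* random_delay() */
    }
    return NULL;
}

/* Runs `nthreads` workers and returns the final value of the counter. */
long run_benchmark(int nthreads, int iterations, unsigned critical_work) {
    pthread_t tid[MAX_THREADS];
    bench_args args[MAX_THREADS];
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    shared_counter = 0;
    for (int t = 0; t < nthreads; t++) {
        args[t] = (bench_args){iterations, critical_work, 50u, 50u,
                               (unsigned)t + 1u};
        if (pthread_create(&tid[t], NULL, bench_thread, &args[t]) != 0)
            abort();
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return shared_counter;
}
```

Raising `critical_work` relative to the delays outside the critical section increases the amount of contention while the number of threads stays constant.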

Page 44

Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure, first panel: Time [seconds] (3 to 12) versus critical_work (0 to 2000) on WF, 2 nodes with 14 CPUs each, for TATAS, MCS and HBO_GT.]

[Figure, second panel: Node handoffs [%] (0 to 60) versus critical_work (0 to 2000).]

Page 45

Results (2)
Normalized Execution Time Breakdowns (16 CPUs)

[Figure: stacked bars (0% to 100%) broken down into Store, Load, Locks, Barriers, ILC and Task. Labels: SW, 8 8, EEL.]

Page 46

Instrumentation Performance

Program          Problem Size                     %LD    %ST    Instrumentation Overhead
FFT              1,048,576 points (48.1 MB)       19.0   16.5   1.38
LU-Cont          1024×1024, block 16 (8.0 MB)     15.5    9.4   1.59
LU-Non-Cont      1024×1024, block 16 (8.0 MB)     16.7   11.1   1.50
Radix            4,194,304 items (36.5 MB)        15.6   11.6   1.13
Barnes-Hut       16,384 bodies (32.8 MB)          23.8   31.1   1.03
FMM              32,768 particles (8.1 MB)        17.5   13.6   1.06
Ocean-Cont       514×514 (57.5 MB)                27.0   23.9   1.34
Ocean-Non-Cont   258×258 (22.9 MB)                11.6   28.0   1.24
Radiosity        Room (29.4 MB)                   26.3   27.2   1.07
Raytrace         Car (32.2 MB)                    19.0   18.1   1.21
Water-nsq        2,197 mols., 2 steps (2.0 MB)    13.4   16.2   1.06
Water-sp         2,197 mols., 2 steps (1.5 MB)    15.7   13.9   1.09
Average                                           18.4   18.3   1.22

Page 47

1-entry WPC

[Figure: Average # Stores until UNLOCK (0 to 32) for FFT, LU-cont, LU-non-cont, Radix, Barnes, Cholesky, FMM, Ocean-cont, Ocean-non-cont, Radiosity, Raytrace, Water-nsq and Water-sp, with bars for 64 bytes, 128 bytes and 256 bytes.]

Page 48

2-entry WPC

[Figure: Average # Stores until UNLOCK (0 to 32) for FFT, LU-cont, LU-non-cont, Radix, Barnes, Cholesky, FMM, Ocean-cont, Ocean-non-cont, Radiosity, Raytrace, Water-nsq and Water-sp, with bars for 64 bytes, 128 bytes and 256 bytes.]