
Software Transactional Memory for Large Scale Clusters

Robert L. Bocchino Jr. and Vikram S. Adve, University of Illinois at Urbana-Champaign

Bradford L. Chamberlain, Cray Inc.

Transactional Memory (TM)

Can simplify parallel programming

Well studied for small-scale, cache-coherent platforms

No prior work on TM for large scale platforms

• Potentially thousands of processors

• Distributed memory, no cache coherence

• Slow communication between nodes


Why STM On Clusters?

TM is a natural fit for PGAS Languages

• UPC, CAF, Titanium, Chapel, Fortress, X10, ...

• Address space is global (unlike message passing)

• But data distribution is explicit (unlike cc-NUMA, DSM)

Commodity clusters are in widespread use

Software transactional memory (STM) is a natural choice

• Communication done in software anyway

• Could leverage hardware TM support if it exists


What’s New About STM on Clusters?

                                                Classic STM        Cluster STM
  Read and write                                Words of data      Blocks of data
  Primary overhead                              Extra scalar ops   Extra remote ops
  Heap address space                            Uniform            Partitioned
  STM metadata                                  Uniform            Distributed
  Distributing computation for data locality    N/A                on p { ... }

Research Contributions

First STM designed for high performance on large clusters

• Block data movement

• Computation migration

• Distributed metadata

Experimental evaluation of prototype

• Performance vs. locks

• New design tradeoffs

Decomposition of STM design space into eight axes


Outline

Interface Design

Algorithm Design

Evaluation

• Cluster STM vs. Manual Locking

• Read Locks vs. Read Validation

Conclusion


Interface Design

Compiler target, not primarily for programmers

Correct use guarantees serializability of transactions

Chapel:

    atomic {
      // Array assignment
      cache = A;
      compute(cache);
      A = cache;
    }

The compiler lowers this to the Cluster STM API:

    stm_start(...);
    stm_get(&cache, &A, ...);
    compute(&cache);
    stm_put(&A, &cache, ...);
    stm_commit(...);

Interface Summary

Transaction start and commit

Transactional memory allocation

Block data movement to and from transactional store

Remote execution of transactional computations

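For concreteness, a C-style sketch of what this API surface might look like is below. The handle type stm_tx_t, the proc_t type, the allocation call stm_tx_malloc, and the exact signatures are illustrative assumptions; only the operation names stm_start, stm_commit, stm_get, stm_put, stm_read, stm_write, and stm_on come from the interface above.

    /* Hypothetical sketch of the Cluster STM API surface; types and exact
     * signatures are assumptions, not the actual interface. */
    #include <stddef.h>

    typedef struct stm_tx stm_tx_t;   /* per-transaction descriptor (assumed) */
    typedef int proc_t;               /* processor (locale) id (assumed) */

    /* Transaction start and commit */
    stm_tx_t *stm_start(proc_t src_proc);
    int       stm_commit(stm_tx_t *tx);   /* nonzero on successful commit (assumed) */

    /* Transactional memory allocation (name is an assumption) */
    void *stm_tx_malloc(stm_tx_t *tx, proc_t proc, size_t size);

    /* Block data movement between a private buffer and the transactional store */
    void stm_get  (stm_tx_t *tx, proc_t work_proc, void *dest, const void *src, size_t size);
    void stm_put  (stm_tx_t *tx, proc_t work_proc, void *dest, const void *src, size_t size);
    void stm_read (stm_tx_t *tx, void *dest, const void *src, size_t size);
    void stm_write(stm_tx_t *tx, void *dest, const void *src, size_t size);

    /* Remote execution of a transactional computation on work_proc */
    void stm_on(stm_tx_t *tx, proc_t work_proc,
                void (*function)(stm_tx_t *, void *), void *arg);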

Block Data Movement

(Figure: processor p1 holds a transactional location C and a private, non-transactional buffer B; processor p2 holds a transactional location A. All ops occur on p1.)

    stm_get(work_proc=p2, src=A, dest=B, size=n, ...)
    stm_put(work_proc=p2, src=B, dest=A, size=n, ...)
    stm_read(src=C, dest=B, size=n, ...)
    stm_write(src=B, dest=C, size=n, ...)
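As an illustration (not from the paper), these block-movement calls might be used as in the following C-style sketch, which reuses the assumed declarations from the API sketch above; the function name, buffer size, and doubling loop are made up.

    /* Illustrative only: fetch a remote block, compute on a private copy,
     * and write it back, using the assumed signatures sketched earlier. */
    void update_remote_block(proc_t p1, proc_t p2,
                             double *A /* lives on p2 */, size_t n /* n <= 256 */) {
        double B[256];                 /* private, non-transactional buffer on p1 */

        stm_tx_t *tx = stm_start(p1);
        stm_get(tx, p2, B, A, n * sizeof(double));    /* remote block A -> buffer B */
        for (size_t i = 0; i < n; i++)                /* compute on the private copy */
            B[i] *= 2.0;
        stm_put(tx, p2, A, B, n * sizeof(double));    /* buffer B -> remote block A */
        if (!stm_commit(tx)) {
            /* conflict detected: a real runtime would abort and retry */
        }
    }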

Remote Work for Exploiting Locality

(Figure: p1 operating on data owned by p2. Issued from p1, every access is a remote put or get; shipping the computation to p2 turns them into local reads and writes.)

    stm_on(work_proc=p2, function=f, ...)

Chapel: on p2 { f(...); }

Remote Work Inside Transaction

Chapel: atomic { on p2 { f(...); } }

(Figure: p1 owns the transaction and ships f to p2; f’s reads and writes are local ops on p2, and commit requires only a single commit message.)

On p1:

    stm_start(src_proc=p1)
    stm_on(src_proc=p1, work_proc=p2, function=f, ...)
    stm_commit(src_proc=p1)

On p2, inside f (local ops):

    stm_read(src_proc=p1)
    stm_write(src_proc=p1)

Transaction Inside Remote Work

Chapel: on p2 { atomic { f(...); } }

(Figure: p1 ships f to p2; the transaction starts, reads, writes, and commits entirely on p2, so all transactional ops are local.)

On p1:

    stm_on(src_proc=p1, work_proc=p2, function=f, ...)

On p2, inside f (local ops):

    stm_start(src_proc=p1)
    stm_read(src_proc=p1)
    stm_write(src_proc=p1)
    stm_commit(src_proc=p1)
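A C-style sketch of the two nestings, again against the assumed declarations from the API sketch above; the argument record, the counter, and the convention of passing NULL for a non-transactional on-clause are all illustrative assumptions.

    /* Sketch of both nestings against the assumed API above. */
    typedef struct { proc_t src_proc; long *counter; } arg_t;   /* made-up argument record */

    static void bump(stm_tx_t *tx, void *p) {
        arg_t *a = p;
        long v;
        stm_read(tx, &v, a->counter, sizeof v);       /* local on p2 */
        v += 1;
        stm_write(tx, a->counter, &v, sizeof v);      /* local on p2 */
    }

    /* Chapel: atomic { on p2 { bump(...); } } */
    void remote_work_inside_tx(proc_t p1, proc_t p2, arg_t *a) {
        stm_tx_t *tx = stm_start(p1);
        stm_on(tx, p2, bump, a);      /* bump's accesses run locally on p2 */
        stm_commit(tx);               /* single commit message from p1 to p2 */
    }

    /* Chapel: on p2 { atomic { bump(...); } } */
    static void bump_tx(stm_tx_t *ignored, void *p) {
        (void)ignored;                /* no enclosing transaction */
        arg_t *a = p;
        stm_tx_t *tx = stm_start(a->src_proc);
        bump(tx, a);
        stm_commit(tx);               /* start, accesses, and commit all local to p2 */
    }

    void tx_inside_remote_work(proc_t p1, proc_t p2, arg_t *a) {
        a->src_proc = p1;
        stm_on(NULL, p2, bump_tx, a); /* NULL: non-transactional on-clause (assumption) */
    }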

Outline

Interface Design

Algorithm Design

Evaluation

• Cluster STM vs. Manual Locking

• Read Locks vs. Read Validation

Conclusion


Algorithm Design

Shared metadata

Transaction-local metadata

STM design choices


Shared Metadata

Some metadata must be visible to all transactions

• Read and write locks

• Validation ID

Memory overhead compromise

• Each metadata word guards one conflict detection unit (CDU)

• CDU size is s ≥ 1 words

• s > 1 may introduce false sharing

Keep metadata on same processor as corresponding CDU
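A rough sketch of how such shared metadata might be laid out, assuming one word per CDU that packs a write bit with either a reader count or a validation ID; the field widths, the CDU size, and the address-to-CDU mapping below are illustrative, not the prototype's actual layout.

    #include <stddef.h>
    #include <stdint.h>

    #define CDU_WORDS  4                        /* s: illustrative CDU size in words */
    #define WORD_BYTES sizeof(uint64_t)

    /* One shared metadata word per CDU, kept on the CDU's owner processor.
     * Layout is an assumption: high bit = write lock, low bits = reader count
     * (read locks) or validation ID (read validation). */
    typedef uint64_t stm_meta_t;
    #define META_WRITE_BIT (UINT64_C(1) << 63)
    #define META_COUNT(m)  ((m) & ~META_WRITE_BIT)

    /* Map an address in a processor's partition to the index of its CDU (and
     * hence its metadata word); the mapping itself is illustrative. */
    static inline size_t cdu_index(uintptr_t addr, uintptr_t partition_base) {
        return (addr - partition_base) / (CDU_WORDS * WORD_BYTES);
    }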

Transaction-Local Metadata

(Figure: p1 and p2 running the remote-work-inside-transaction sequence: stm_start(src_proc=p1), stm_on(src_proc=p1), stm_read(src_proc=p1), stm_write(src_proc=p1), stm_commit(src_proc=p1).)

p2 stores tx metadata for p1

p2 uses stored metadata to commit on behalf of p1

Keep metadata local to processor where access occurred
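One plausible realization of this rule, under the same assumptions as the sketches above: each processor keeps a read set and a write log per source transaction that has touched its partition, so it can commit (or abort) locally when the single commit message arrives. The structure names and fields are made up.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-access record, kept on the processor where the access
     * happened and indexed by the source processor that owns the transaction. */
    typedef struct {
        void    *addr;        /* CDU or block in this processor's partition */
        size_t   size;
        uint64_t saved_meta;  /* reader count / validation ID seen at access time */
    } access_rec_t;

    typedef struct {
        access_rec_t *reads;   size_t n_reads;
        access_rec_t *writes;  size_t n_writes;  /* plus undo data or write buffer */
    } tx_local_log_t;

    /* One slot per possible source processor: when the single commit (or abort)
     * message for src_proc arrives, this processor applies or discards its part
     * of the transaction using only local state. */
    #define MAX_PROCS 512
    static tx_local_log_t tx_logs[MAX_PROCS];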

STM Design Choices

Eight-axis design space (see paper)

Choose four axes to explore (sketched as configuration options below)

• CDU size

• Read locks (RL) vs. read validation (RV)

• Undo log (UL) vs. write buffer (WB)

• Early acquire (EA) vs. late acquire (LA)

Some tradeoffs come out differently on clusters

Discuss RL vs. RV here

See paper for additional results
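As a way to picture the four explored axes, they can be thought of as configuration options; the names and the sample settings below are illustrative, not the prototype's actual options.

    #include <stddef.h>

    /* Illustrative configuration of the four explored design axes. */
    typedef enum { READ_LOCKS, READ_VALIDATION } read_policy_t;
    typedef enum { UNDO_LOG, WRITE_BUFFER } write_policy_t;
    typedef enum { EARLY_ACQUIRE, LATE_ACQUIRE } acquire_policy_t;

    typedef struct {
        size_t           cdu_words;    /* s >= 1: words per conflict detection unit */
        read_policy_t    read_policy;  /* RL vs. RV */
        write_policy_t   write_policy; /* UL vs. WB */
        acquire_policy_t acquire;      /* EA vs. LA */
    } stm_config_t;

    static const stm_config_t example_config = { 4, READ_LOCKS, WRITE_BUFFER, EARLY_ACQUIRE };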

Outline

Interface Design

Algorithm Design

Evaluation

• Cluster STM vs. Manual Locking

• Read Locks vs. Read Validation

Conclusion


Evaluation

Benchmarks

• Microbenchmarks: Intset, Hashmap Swap

• Graph clustering: SSCA 2 Kernel 4

Machine

• Intel Xeon cluster

• Two cores per node

• Myrinet


Outline

Interface Design

Algorithm Design

Evaluation

• Cluster STM vs. Manual Locking

• Read Locks vs. Read Validation

Conclusion


Intset Results

(Figure: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); curves for Locks and STM at 6M and 100M operations.)

Switching problem sizes at p = 8

Bump at p < 4

Good scaling after p = 4; Locks, STM about the same

SSCA 2 Results

(Figure: time in seconds (log scale) vs. number of processors, 1 to 512 (log scale); curves for locks and STM.)

Bump is smaller

Good scaling after p = 4; STM overhead about 2.5x

Outline

Interface Design

Algorithm Design

Evaluation

• Cluster STM vs. Manual Locking

• Read Locks vs. Read Validation

Conclusion


Implementation

Read locks (RL)

• Metadata word holds write bit and reader count

• Abort on attempt to read or write a conflicting CDU

Read validation (RV)

• Metadata word holds write bit and validation ID

• Increment ID with each write

• Abort at commit if the ID has changed since the CDU was read

Validation requires an extra remote op at commit

No global cache ⇒ no helpful cache effects
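To make the per-read cost of the two schemes concrete, here is a rough C sketch reusing the assumed metadata layout from the shared-metadata sketch above; the remote_load/remote_cas helpers are stand-ins for whatever remote atomic operations the runtime provides, and retry/abort paths are omitted.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed helpers, not a real API: atomic compare-and-swap and plain load on
     * a metadata word that may live on a remote processor. */
    extern bool     remote_cas(uint64_t *meta, uint64_t expected, uint64_t desired);
    extern uint64_t remote_load(const uint64_t *meta);

    #define WRITE_BIT (UINT64_C(1) << 63)

    /* Read locks (RL): acquire a read lock at read time; a conflict aborts now.
     * (A CAS that fails because another reader raced in would be retried.) */
    static bool rl_read_acquire(uint64_t *meta) {
        uint64_t m = remote_load(meta);
        if (m & WRITE_BIT) return false;          /* writer holds the CDU: abort */
        return remote_cas(meta, m, m + 1);        /* bump the reader count */
    }

    /* Read validation (RV): just record the validation ID at read time ... */
    static uint64_t rv_read_record(const uint64_t *meta) {
        return remote_load(meta);                 /* write bit + validation ID */
    }

    /* ... and re-check it at commit, costing an extra remote op per CDU read. */
    static bool rv_validate_at_commit(const uint64_t *meta, uint64_t recorded) {
        return remote_load(meta) == recorded;     /* ID changed => abort */
    }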

Ratio of RV to RL Runtimes

(Figure: ratio of RV to RL runtime, 0 to 4, vs. number of processors, 1 to 512, for Intset and SSCA 2.)

Results Summary

Good scalability to p = 512

Good performance

• Nearly identical to locks on microbenchmarks

• Some STM overhead for SSCA 2

Different design tradeoffs

• RL outperforms RV

• EA outperforms LA (see paper)

• Minimal penalty for WB (see paper) 


Outline

Interface Design

Algorithm Design

Evaluation

Conclusion


Conclusion

Presented design and evaluation of Cluster STM

• First STM for high performance on large scale clusters

• Good performance, scaling to p = 512

• New evaluation of design tradeoffs

Future work

• Exploiting shared memory within a node

• Nested parallelism in a transaction

• Dynamic spawning of threads


Thank You
