WormBench: A Configurable Application for Evaluating Transactional Memory Systems

MEDEA Workshop

26.10.2008

Ferad Zyulkyarov (1, 2), Sanja Cvijic (3), Osman Unsal (1), Adrian Cristal (1), Eduard Ayguade (1, 2), Tim Harris (4), Mateo Valero (1, 2)

(1) Barcelona Supercomputing Center, (2) Universitat Politecnica de Catalunya, (3) Belgrade University, (4) Microsoft Research Cambridge UK

Outline

• Transactional Memory

• Idea

• Motivation

• WormBench Features

• WormBench main components

• WormBench input – run configuration

• Analysis

• Modeling STAMP’s genome

• Conclusion

Transactional Memory

atomic{ < statements >}
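As a rough illustration of what an atomic block promises to the programmer, the sketch below emulates `atomic` in Python with a single global re-entrant lock. This is only an assumption made for illustration: a real TM system executes transactions speculatively and rolls back on conflict, which a lock does not do. The names `atomic`, `increment` and `counter` are hypothetical.

```python
import threading
from contextlib import contextmanager

# One global lock stands in for the TM runtime (illustrative assumption).
_global_lock = threading.RLock()

@contextmanager
def atomic():
    """Run the enclosed statements as one indivisible step."""
    with _global_lock:
        yield

counter = {"value": 0}

def increment(times):
    for _ in range(times):
        with atomic():
            # Read-modify-write on shared data, protected by the block.
            counter["value"] += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 40000
```

The point of the sketch is the programming model: the programmer marks the critical section and leaves the choice of synchronization mechanism to the runtime.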

Idea

• Inspired by the Snake game

• Worms are active objects

• Worms move in a BenchWorld

• On every move Worms do computation

Motivation - General

• We don’t know how exactly to write TM applications

• Converting applications from locks 1:1 is not the correct approach
– For example, would it make sense to convert a lock-based application 1:1 into message-passing synchronization?

Motivation - Existing TM Applications (1/2)

• STAMP [IISWC’2008]
– Specific to TL2 [ISCA’2007]
– Does not have a lock-based implementation
– tm_write() and tm_read() are carefully used, thus assuming a perfect compiler

• STMBench7 [EuroSys’2007]
– Suitable for STM
– Data structures are too big (700,000 bytes)
– Transactions are too long (10 tx/s)

Motivation - Existing TM Applications (2/2)

• SPLASH-2 [ISCA’1995]
– Embarrassingly parallel
– Fine-grain locking
– Not suitable for the intended TM usage pattern (coarse-grain locking)

• Haskell STM Benchmark [CF’2007]
– Implemented in a declarative language
– Depends on constraints enforced by the language and type system (TVar, monads)

WormBench’s Goal

• Unify the features of existing TM applications

• A tool for instrumenting multi-threaded applications

• Set of run configurations to serve as a baseline for evaluating TM systems against each other and against locks

• Specific run configuration that stresses a particular design or implementation aspect of a TM system such as the sizes of internally used buffers.

WormBench Features (1/2)

• Implemented in the imperative language C#
– Compiled with Bartok

• Follows object-oriented programming concepts

• Critical sections are marked with atomic
– Can be used to test the compiler infrastructure

• Represents typical parallel application with shared data

• Highly configurable through run configurations

WormBench Features (2/2)

• Suitable for HTM, STM and Hybrid TM variants

• No assumptions about TM system design and implementation

• Lock based and transactional implementation for comparison purposes

• Sanity check verification for the underlying TM system

Main Objects in WormBench

• BenchWorld
– BenchWorldNode

• Worm
– Body
– Head

• Message

Example

• Worm
– Body length 8
– Head size 4

• Operations
– Sum – ahead
– Average – right
– Min – ahead
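The example above can be sketched in code under a few assumptions that the slides do not state: each entry is read as "perform the operation, then move in the given direction", the head covers a square of head_size x head_size nodes, and the BenchWorld wraps around at its edges. The body is omitted for brevity, and all names (`Worm`, `head_values`, `OPERATIONS`) are hypothetical.

```python
from dataclasses import dataclass

# Directions as (row delta, column delta): up, right, down, left.
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]
TURNS = {"ahead": 0, "right": 1, "left": -1}

@dataclass
class Worm:
    row: int
    col: int
    heading: int = 1    # index into DIRECTIONS; 1 means facing right
    head_size: int = 4  # assumed: head covers head_size x head_size nodes

    def head_values(self, world):
        """Values of the nodes under the head (wrapping at the edges)."""
        n = len(world)
        return [world[(self.row + r) % n][(self.col + c) % n]
                for r in range(self.head_size)
                for c in range(self.head_size)]

    def move(self, turn, world):
        """Rotate according to `turn`, then advance by one node."""
        self.heading = (self.heading + TURNS[turn]) % 4
        dr, dc = DIRECTIONS[self.heading]
        n = len(world)
        self.row = (self.row + dr) % n
        self.col = (self.col + dc) % n

OPERATIONS = {
    "sum": sum,
    "average": lambda values: sum(values) / len(values),
    "min": min,
}

# An 8x8 BenchWorld whose node values are simply 8 * row + column.
world = [[r * 8 + c for c in range(8)] for r in range(8)]
worm = Worm(row=0, col=0)

results = []
for op, turn in [("sum", "ahead"), ("average", "right"), ("min", "ahead")]:
    results.append(OPERATIONS[op](worm.head_values(world)))
    worm.move(turn, world)
```

In a transactional version, each loop iteration (operation plus move) would be the natural candidate for one atomic block.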

WormBench Input – Run configuration

• Size of the BenchWorld;

• Number of worms (number of threads);

• Body length of each worm;

• Head size of each worm;

• The number and type of worm operations that each worm has to perform while moving
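One possible in-memory shape for such a run configuration is sketched below. The field names and the sanity check are assumptions for illustration; the slides do not describe the actual file format or class layout.

```python
from dataclasses import dataclass, field

@dataclass
class WormConfig:
    body_length: int
    head_size: int
    # Sequence of (operation, move) pairs, e.g. ("sum", "ahead").
    operations: list = field(default_factory=list)

@dataclass
class RunConfig:
    world_size: int  # BenchWorld is assumed square: world_size x world_size
    worms: list = field(default_factory=list)

    @property
    def num_threads(self):
        # One worm per thread, as stated on the slide.
        return len(self.worms)

    def validate(self):
        """Hypothetical sanity check: every head must fit in the world."""
        for worm in self.worms:
            if worm.head_size > self.world_size:
                raise ValueError("worm head does not fit in the BenchWorld")

config = RunConfig(
    world_size=128,
    worms=[
        WormConfig(body_length=8, head_size=4,
                   operations=[("sum", "ahead"), ("average", "right")]),
        WormConfig(body_length=1, head_size=1,
                   operations=[("min", "ahead")]),
    ],
)
config.validate()
```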

Instantiates Common Sync Scenarios (1/2)

• Object access serializability
– Guarding a shared variable with locks

• Two-phase locking and its derivatives
– Locking protocol which attempts non-blocking fine-grain locking while avoiding deadlock

• Multiple granularity locking
– Fine-grain locking technique used to lock a region in a collection or hierarchical data structure
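A minimal sketch of non-blocking region locking in the spirit of these scenarios: all node locks for a region are acquired before any work is done and released together afterwards, and a failed attempt backs off instead of waiting, so no deadlock can form. The helper names `try_lock_region` and `unlock_region` are hypothetical and not part of WormBench.

```python
import threading

SIZE = 4
# One lock per BenchWorld node (fine-grain locking).
locks = [[threading.Lock() for _ in range(SIZE)] for _ in range(SIZE)]

def try_lock_region(cells):
    """Try to lock every (row, col) in `cells` without blocking.

    Returns the list of acquired locks, or None if any node was busy,
    in which case everything acquired so far is released again.
    """
    acquired = []
    for r, c in sorted(cells):  # fixed global order
        lock = locks[r][c]
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            unlock_region(acquired)
            return None
    return acquired

def unlock_region(acquired):
    """Release phase: give back all locks, newest first."""
    for lock in reversed(acquired):
        lock.release()

region_a = {(0, 1), (0, 2)}
region_b = {(0, 0), (0, 1)}  # overlaps region_a at node (0, 1)

held_a = try_lock_region(region_a)
first_attempt_b = try_lock_region(region_b)   # fails, backs off
rolled_back = not locks[0][0].locked()        # (0, 0) was released again
unlock_region(held_a)
second_attempt_b = try_lock_region(region_b)  # now succeeds
```

A caller that receives None would typically retry later, which is the lock-based counterpart of a transaction abort and re-execution.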

Instantiates Common Sync Scenarios (2/2)

• Dining Philosophers
– Deadlock scenario

• Barrier synchronization
– Worms wait until the whole group (or all worms) reaches a certain point in execution
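The barrier scenario can be illustrated with Python's standard `threading.Barrier`. This is a sketch of the synchronization pattern only, not of how WormBench implements it; the `worm` function and the `log` list are hypothetical.

```python
import threading

NUM_WORMS = 4
barrier = threading.Barrier(NUM_WORMS)
log = []
log_lock = threading.Lock()

def worm(worm_id):
    # Phase 1: each worm does its moves independently.
    with log_lock:
        log.append(("moved", worm_id))
    # No worm continues until all worms have reached this point.
    barrier.wait()
    # Phase 2: runs only after every worm has finished phase 1.
    with log_lock:
        log.append(("resumed", worm_id))

threads = [threading.Thread(target=worm, args=(i,)) for i in range(NUM_WORMS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
```

Whatever the thread interleaving, every "moved" entry precedes every "resumed" entry in the log.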

Retry or Conditional Atomic

Retry

Mostly neglected utilization of retry or conditional atomic.

Currently Available Worm Operations (1/2)

• Read-only
– Sum
– Average
– Min
– Max
– Median

• I/O
– Checkpoint
– Undo
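The read-only operations map directly onto standard reductions. The sketch below assumes they are applied to the values of the nodes under the worm's head; the sample values and the `READ_ONLY_OPS` table are hypothetical.

```python
from statistics import mean, median

# Each read-only operation reads its input nodes and writes nothing shared.
READ_ONLY_OPS = {
    "sum": sum,
    "average": mean,
    "min": min,
    "max": max,
    "median": median,
}

head = [7, 3, 9, 1]  # example node values under the head
results = {name: op(head) for name, op in READ_ONLY_OPS.items()}
```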

Currently Available Worm Operations (2/2)

• Read dominated
– Replace min with average
– Replace max with average
– Replace median with average
– Replace min and max

• Write dominated
– Sort
– Transpose

• Leave message – for complex synchronization scenarios
– Goto node message
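To make the difference between read-dominated and write-dominated operations concrete, here is a sketch of one operation of the first kind and two of the second, assuming the head is a square matrix of node values. Details such as replacing only the first minimum are assumptions, and the function names are hypothetical.

```python
def replace_min_with_average(head):
    """Read every node, write one: the first minimum becomes the average."""
    values = [v for row in head for v in row]
    average = sum(values) / len(values)
    lowest = min(values)
    for row in head:
        for i, value in enumerate(row):
            if value == lowest:
                row[i] = average
                return

def sort_head(head):
    """Read every node, write every node: values sorted in row-major order."""
    n = len(head)
    values = sorted(v for row in head for v in row)
    for r in range(n):
        head[r][:] = values[r * n:(r + 1) * n]

def transpose_head(head):
    """Swap nodes across the main diagonal, in place."""
    n = len(head)
    for r in range(n):
        for c in range(r + 1, n):
            head[r][c], head[c][r] = head[c][r], head[r][c]

a = [[4, 1], [3, 2]]
replace_min_with_average(a)

b = [[4, 1], [3, 2]]
sort_head(b)

c = [[1, 2], [3, 4]]
transpose_head(c)
```

The first function writes a single node regardless of head size, while the other two can write most or all of the head, which is why their write sets grow with the head size.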

Worm Operations – Execution Distribution

| Op | B[1.1]H[1.1] | B[4.4]H[4.4] | B[8.8]H[8.8] | B[1.8]H[1.8] |
|---|---|---|---|---|
| Sum | 0.42 | 0.433 | 0.194 | 0.31 |
| Avg | 0.42 | 0.278 | 0.324 | 0.434 |
| Median | 0.839 | 3.649 | 9.354 | 5.14 |
| Min | 0.315 | 0.588 | 0.278 | 0.372 |
| Max | 0.42 | 0.588 | 0.33 | 0.537 |
| Rep Max with Avg | 1.364 | 0.711 | 0.427 | 0.743 |
| Rep Min with Avg | 0.735 | 0.742 | 0.537 | 0.702 |
| Rep Med with Avg | 2.518 | 4.793 | 11.412 | 6.689 |
| Rep Max and Min | 2.099 | 0.588 | 0.634 | 0.929 |
| Rep Med with Min | 2.728 | 5.009 | 11.186 | 7.06 |
| Rep Med and Max | 2.518 | 5.257 | 11.387 | 7.122 |
| Sort | 1.679 | 6.586 | 11.257 | 7.184 |
| Transpose | 1.154 | 3.247 | 2.369 | 3.365 |
| Checkpoint | 1.12 | 1.45 | 1.982 | 1.522 |
| Undo | 1.06 | 1.32 | 1.85 | 1.488 |
| Total | 19.389 | 35.239 | 63.521 | 42.597 |
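As a quick consistency check on the first column (B[1.1]H[1.1]), the per-operation values can be summed and compared with the reported total, and then normalized into each operation's share. The values are copied from the table; the slide does not state their unit, so the sketch treats them as plain numbers.

```python
# First column of the execution distribution table, B[1.1]H[1.1].
column = {
    "Sum": 0.42, "Avg": 0.42, "Median": 0.839, "Min": 0.315, "Max": 0.42,
    "Rep Max with Avg": 1.364, "Rep Min with Avg": 0.735,
    "Rep Med with Avg": 2.518, "Rep Max and Min": 2.099,
    "Rep Med with Min": 2.728, "Rep Med and Max": 2.518,
    "Sort": 1.679, "Transpose": 1.154, "Checkpoint": 1.12, "Undo": 1.06,
}
REPORTED_TOTAL = 19.389

total = sum(column.values())
matches_reported = abs(total - REPORTED_TOTAL) < 1e-6

# Fraction of the column total contributed by each operation.
shares = {op: value / total for op, value in column.items()}
largest = max(shares, key=shares.get)
```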

Worm Operations – TM Characteristics

Op1 2 4 8

R W R W R W R W1 11 3 11 4 11 6 11 102 11 3 11 4 11 6 11 103 11 3 11 4 11 6 11 104 11 3 11 4 11 6 11 105 11 3 11 4 11 6 11 106 14 4 14 5 14 7 14 117 14 4 14 5 14 7 14 118 14 4 14 5 14 7 14 119 14 4 14 5 14 7 14 1110 14 4 14 5 14 7 14 1111 14 4 14 5 14 7 14 1112 16 4 16 5 16 7 16 1113 16 4 16 5 16 7 16 1114 11 3 11 4 11 6 11 1115 11 3 11 4 11 6 11 11

Worm Head Size is fixed to 1 and body length is 1, 2, 4, 8

Worm Operations – TM Characteristics

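| Op | R (1) | W (1) | R (2) | W (2) | R (4) | W (4) | R (8) | W (8) |
|---|---|---|---|---|---|---|---|---|
| 1 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 2 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 3 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 4 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 5 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 6 | 14 | 4 | 17 | 4 | 29 | 4 | 77 | 4 |
| 7 | 14 | 4 | 17 | 4 | 29 | 4 | 77 | 4 |
| 8 | 14 | 4 | 17 | 4 | 29 | 4 | 77 | 4 |
| 9 | 14 | 4 | 17 | 5 | 29 | 5 | 77 | 5 |
| 10 | 14 | 4 | 17 | 5 | 29 | 5 | 77 | 5 |
| 11 | 14 | 4 | 17 | 5 | 29 | 5 | 77 | 5 |
| 12 | 16 | 4 | 19 | 7 | 31 | 19 | 79 | 67 |
| 13 | 16 | 4 | 19 | 7 | 31 | 19 | 79 | 67 |
| 14 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |
| 15 | 11 | 3 | 14 | 3 | 26 | 3 | 74 | 3 |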

Body Length is fixed to 1 and head size is 1, 2, 4, 8
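The numbers for operations 1 to 5 in the two TM characteristics tables happen to follow a simple pattern, which the sketch below checks: with head size 1, writes equal body length + 2, and with body length 1, reads equal head size squared + 10. This is an observation drawn from the tabulated values, not a formula given in the slides; a squared term would be consistent with a square head, but that too is an assumption.

```python
SIZES = [1, 2, 4, 8]

# Operation 1, head size fixed to 1, body length varied (first table).
reads_by_body = [11, 11, 11, 11]
writes_by_body = [3, 4, 6, 10]

# Operation 1, body length fixed to 1, head size varied (second table).
reads_by_head = [11, 14, 26, 74]
writes_by_head = [3, 3, 3, 3]

# Hypothesis 1: each extra body node costs exactly one extra write.
writes_fit = all(w == body + 2 for body, w in zip(SIZES, writes_by_body))

# Hypothesis 2: reads grow with the area of the head.
reads_fit = all(r == head * head + 10 for head, r in zip(SIZES, reads_by_head))

# The other dimension stays flat in each experiment.
flat = len(set(reads_by_body)) == 1 and len(set(writes_by_head)) == 1
```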

Analyzing Sample Run Configurations

• Lock vs Transactions

• Change in BenchWorld size

• Change in worms’ body length and head size

• Initialization of worms for smaller BenchWorld

Lock vs Transactions

Throughput ~ Worms ~ BenchWorld

Relationship between throughput, worms’ size and BenchWorld

Initializing Worms for Smaller BenchWorld

How the conflict rate is affected when worms are initialized for a smaller BenchWorld.

Averaged Results

Worms initialized for 128x128 and run in 128x, 256x, 512x and 1024x size BenchWorld

Modeling Genome App. From STAMP

• To obtain the results shown in Table IV we used the following run configuration:
– Worm body length = 1
– Worm head size = 4
– BenchWorld of size 52x52
– Worm operations: randomly generated stream of worm operations, where the ratio between the worm operations was Operations (1:2:3:4:5:6:7:8:9:10:11:12:13:14:15) = Ratio (1:1:1:0:0:2:1:1:1:1:1:1:2:0:0)
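A randomly generated operation stream with this ratio can be sketched with a weighted random choice. The operation numbers 1 to 15 and the weights come from the configuration above; the function name, the seed and the stream length are hypothetical.

```python
import random
from collections import Counter

OPERATION_IDS = list(range(1, 16))
# Ratio from the run configuration: a weight of 0 disables an operation.
RATIO = [1, 1, 1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 2, 0, 0]

def generate_stream(length, seed=None):
    """Draw `length` operation ids with probability proportional to RATIO."""
    rng = random.Random(seed)
    return rng.choices(OPERATION_IDS, weights=RATIO, k=length)

stream = generate_stream(13_000, seed=42)
counts = Counter(stream)
```

With weights summing to 13, operations 6 and 13 are each expected to make up about 2/13 of the stream, the other enabled operations about 1/13 each, and operations 4, 5, 14 and 15 never appear.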

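| T# | Commit Rate (Gen.) | Commit Rate (WB) | Read per TX (Gen.) | Read per TX (WB) | Write per TX (Gen.) | Write per TX (WB) | Speedup (Gen.) | Speedup (WB) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 36.362 | 31.480 | 1.374 | 1.962 | 1 | 1 |
| 2 | 0.998 | 0.998 | 34.260 | 31.609 | 1.373 | 1.962 | 2.177 | 1.4 |
| 4 | 0.994 | 0.995 | 37.974 | 31.815 | 1.371 | 1.962 | 3.474 | 2.2 |
| 8 | 0.985 | 0.987 | 46.219 | 32.300 | 1.377 | 1.963 | 5.435 | 2.867 |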
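One way to read the speedup columns is as parallel efficiency, that is, speedup divided by thread count. The sketch below applies this standard definition to the values in the table; efficiency itself is not reported in the slides.

```python
THREADS = [1, 2, 4, 8]

# Speedup columns from the table (Gen. = STAMP genome, WB = WormBench).
SPEEDUP = {
    "Genome": [1, 2.177, 3.474, 5.435],
    "WormBench": [1, 1.4, 2.2, 2.867],
}

# Efficiency = speedup / number of threads (1.0 means perfect scaling).
efficiency = {
    name: [s / t for s, t in zip(values, THREADS)]
    for name, values in SPEEDUP.items()
}

for name, values in efficiency.items():
    print(name, [round(v, 3) for v in values])
```

At 8 threads this gives roughly 0.68 for genome and 0.36 for the WormBench model, so the model matches genome's commit rate and write-set size more closely than its scalability.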

Future Work

• Toolset that automatically generates a run configuration representing a user-defined transactional and runtime behavior, e.g.:
– Commit rate = 80%
– Reads per TX = 6
– Writes per TX = 2
– Runtime = 100 moves/ms

• Implement BenchWorld as
– Linked list
– Sparse matrix

Future Work

• Understand how the Messaging works in BenchWorld

• Prepare a baseline set of run configurations to benchmark TM systems (HTM, STM and hybrid TMs)

• Fine grain version using two-phase locking

Conclusion

• WormBench is a highly configurable workload for TM

• TM design and implementation independent

• Critical sections defined by language level atomic blocks

• Coarse lock based version

• Sanity check for the overall TM system

• But it is still a small application that does not exercise language extensions for TM and their semantics

The End
