
Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Chris Colohan (1), Anastassia Ailamaki (1), J. Gregory Steffan (2) and Todd C. Mowry (1,3)

(1) Carnegie Mellon University  (2) University of Toronto  (3) Intel Research Pittsburgh

2

Chip Multiprocessors are Here!

2 cores now, soon will have 4, 8, 16, or 32 Multiple threads per core How do we best use them?

IBM Power 5

AMD Opteron

Intel Yonah

3

Multi-Core Enhances Throughput

[Diagram: users submit transactions to the database server; the DBMS accesses the database]

Cores can run concurrent transactions and improve throughput

4

Using Multiple Cores

[Diagram: the same database server setup as before]

Can multiple cores improve transaction latency?

5

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries

Short queries dominate in commercial workloads

6

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-transaction parallelism: each thread spans multiple queries

Hard to add to existing systems! Need to change the interface, add latches and locks, worry about correctness of parallel execution…

7

Parallelizing transactions

SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock;
  quantity--;
  UPDATE stock WITH quantity;
  INSERT item INTO order_line;
}

DBMS

Intra-transaction parallelism: breaks the transaction into threads

Hard to add to existing systems! Need to change the interface, add latches and locks, worry about correctness of parallel execution…

Thread Level Speculation (TLS) makes parallelization easier.

8

Thread Level Speculation (TLS)

[Figure: sequential execution of *p=, *q=, =*p, =*q vs. parallel execution split over time into Epoch 1 and Epoch 2]

9

Thread Level Speculation (TLS)

[Figure: in parallel execution, Epoch 2's read =*p conflicts with Epoch 1's later store *p=: Violation! Epoch 2 restarts]

Use epochs
Detect violations
Restart to recover
Buffer state

Oldest epoch: never restarts, no buffering
Worst case: sequential
Best case: fully parallel

Data dependences limit performance.

10

A Coordinated Effort

Transaction Programmer: choose epoch boundaries
DBMS Programmer: remove performance bottlenecks
Hardware Developer: add TLS support to the architecture

11

So what’s new?

Intra-transaction parallelism
  Without changing the transactions
  With minor changes to the DBMS
  Without having to worry about locking
  Without introducing concurrency bugs
  With good performance

Halves transaction latency on four cores

12

Related Work

Optimistic Concurrency Control (Kung82)

Sagas (Molina&Salem87)

Transaction chopping (Shasha95)

13

Outline

Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

14

Case Study: New Order (TPC-C)

Only dependence is the quantity field Very unlikely to occur (1/100,000)

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

78% of transaction execution time

15

Case Study: New Order (TPC-C)

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;

TLS_foreach(item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

16

Outline

Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

17

Dependences in DBMS

[Figure: dependences between epochs over time]

18

Dependences in DBMS

[Figure: dependences between epochs over time]

Dependences serialize execution!

Example: statistics gathering
  pages_pinned++
  TLS maintains serial ordering of increments
  To remove, use per-CPU counters (see the sketch below)

Performance tuning:
  Profile execution
  Remove bottleneck dependence
  Repeat
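A minimal sketch of the per-CPU counter idea in C (all names here are hypothetical, not the DBMS's actual code): each CPU increments its own padded slot, so epochs on different CPUs never touch the same cache line, and the total is only summed when the statistic is read.

#define MAX_CPUS   32
#define CACHE_LINE 64

/* One counter slot per CPU, padded so slots never share a cache line. */
struct percpu_counter {
    struct { long count; char pad[CACHE_LINE - sizeof(long)]; } slot[MAX_CPUS];
};

/* Each epoch updates only its own CPU's slot: no inter-epoch dependence. */
static inline void counter_inc(struct percpu_counter *c, int cpu) {
    c->slot[cpu].count++;
}

/* Readers (e.g., a statistics report) sum all slots. */
static inline long counter_read(struct percpu_counter *c) {
    long total = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        total += c->slot[i].count;
    return total;
}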

19

Buffer Pool Management

[Figure: a CPU calls get_page(5) on the buffer pool; the page's reference count goes to 1, then put_page(5) returns it to 0]

20

Buffer Pool Management

[Figure: two epochs each call get_page(5) then put_page(5); the reference count serializes them]

TLS ensures the first epoch gets the page first. Who cares?

TLS maintains original load/store order
Sometimes this is not needed

21

Buffer Pool Management

[Figure: each get_page(5)/put_page(5) pair escapes speculation; the reference count stays at 0 around the pair]

= Escape Speculation

Isolated: undoing get_page will not affect other transactions
Undoable: have an operation (put_page) which returns the system to its initial state

• Escape speculation
• Invoke operation
• Store undo function
• Resume speculation

22

Buffer Pool Management

[Figure: re-ordering a put_page(5)/get_page(5) pair across epochs]

Not undoable!

= Escape Speculation

23

Buffer Pool Management

[Figure: each epoch escapes speculation for get_page(5) and holds the matching put_page(5) until the end of the epoch]

Delay put_page until end of epoch
Avoid dependence

= Escape Speculation

24

Removing Bottleneck Dependences

We introduce three techniques:

Delay operations until non-speculative (see the sketch below)
  Mutex and lock acquire and release
  Buffer pool, memory, and cursor release
  Log sequence number assignment

Escape speculation
  Buffer pool, memory, and cursor allocation

Traditional parallelization
  Memory allocation, cursor pool, error checks, false sharing
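A minimal sketch of the first technique, delaying an operation until the epoch is non-speculative (the tls_ names are assumptions for illustration, not the paper's API): while speculative, the epoch records the operation in a queue; the queue is drained only when the epoch becomes the oldest (homefree) and commits.

#include <stdlib.h>

struct deferred_op {
    void (*fn)(void *);          /* operation to run once non-speculative */
    void *arg;
    struct deferred_op *next;
};

static struct deferred_op *deferred_head;   /* per-epoch queue */

/* Called while speculative: record the operation instead of doing it. */
void tls_defer(void (*fn)(void *), void *arg) {
    struct deferred_op *op = malloc(sizeof *op);
    op->fn = fn;
    op->arg = arg;
    op->next = deferred_head;
    deferred_head = op;
}

/* Called at commit, when the epoch is oldest and can no longer be
 * violated: perform everything that was postponed. */
void tls_drain_deferred(void) {
    struct deferred_op *op = deferred_head;
    while (op) {
        struct deferred_op *next = op->next;
        op->fn(op->arg);
        free(op);
        op = next;
    }
    deferred_head = NULL;
}

For example, a lock release could be postponed with tls_defer((void (*)(void *))lock_release, &lock).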

25

Outline

Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

26

Experimental Setup

Detailed simulation
  Superscalar, out-of-order, 128-entry reorder buffer
  Memory hierarchy modeled in detail

TPC-C transactions on BerkeleyDB
  In-core database
  Single user
  Single warehouse
  Measure interval of 100 transactions
  Measuring latency, not throughput

[Diagram: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache and the rest of the memory system]

27

Optimizing the DBMS: New Order

[Bar chart: time (normalized) for New Order as optimizations are applied cumulatively: Sequential, No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging; each bar split into Idle CPU, Violated, Cache Miss, and Busy time]

Cache misses increase
Other CPUs not helping
Can't optimize much more
26% improvement

28

Optimizing the DBMS: New Order

[Same bar chart as the previous slide]

This process took me 30 days and <1200 lines of code.

29

Other TPC-C Transactions

[Bar chart: time (normalized) for New Order, Delivery, Stock Level, Payment, and Order Status; bars split into Idle CPU, Failed, Cache Miss, and Busy time]

3/5 transactions speed up by 46-66%

30

Conclusions

TLS makes intra-transaction parallelism practical
Reasonable changes to transaction, DBMS, and hardware
Halves transaction latency

31

Needed backup slides (not done yet)

2 proc. results
Shared caches may change how you want to extract parallelism!
  Just have lots of transactions: no sharing
  TLS may have more sharing

Any questions?

For more information, see: www.colohan.com

Backup Slides Follow…

LATCHES

35

Latches

Mutual exclusion between transactions
Cause violations between epochs
  Read-test-write cycle: a RAW dependence
Not needed between epochs
  TLS already provides mutual exclusion!

36

Latches: Aggressive Acquire

[Timeline: the original epoch runs Acquire; latch_cnt++ … work … latch_cnt--; Release. Speculative epochs run latch_cnt++ … work … (enqueue release), then commit work and latch_cnt-- once homefree]

Large critical section

37

Latches: Lazy Acquire

[Timeline: the original epoch runs Acquire … work … Release. Speculative epochs run (enqueue acquire) … work … (enqueue release), then Acquire; commit work; Release once homefree]

Small critical sections
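A minimal sketch of lazy acquire (hypothetical names; in the real design the hardware enqueues these operations): while speculative, the epoch only records which latches it wanted, since TLS already keeps the epochs ordered; at commit it holds each latch just long enough to publish its work, keeping the critical sections small.

#define MAX_LATCHES 16

typedef struct { volatile int held; } latch_t;

/* Stand-in latch primitives for the sketch. */
static void latch_acquire(latch_t *l) { while (__sync_lock_test_and_set(&l->held, 1)) ; }
static void latch_release(latch_t *l) { __sync_lock_release(&l->held); }

static latch_t *wanted[MAX_LATCHES];   /* latches this epoch would have held */
static int n_wanted;

/* Speculative: enqueue the acquire; the matching release is implicit. */
void tls_lazy_acquire(latch_t *l) { wanted[n_wanted++] = l; }
void tls_lazy_release(latch_t *l) { (void)l; }

/* Homefree: take each latch only around the commit work. */
void tls_commit_epoch(void (*commit_work)(void)) {
    for (int i = 0; i < n_wanted; i++) latch_acquire(wanted[i]);
    commit_work();                       /* publish the epoch's buffered writes */
    for (int i = n_wanted - 1; i >= 0; i--) latch_release(wanted[i]);
    n_wanted = 0;
}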

HARDWARE

39

TLS in Database Systems

[Figure: non-database TLS uses small epochs; TLS in database systems uses much larger epochs, with concurrent transactions running alongside]

Large epochs:
• More dependences: must tolerate them
• More state: bigger buffers

40

Feedback Loop

"I know this is parallel!"

for() { do_work(); }  becomes  par_for() { do_work(); }

Must…Make…Faster

[Cartoon: the programmer thinks, gets feedback from the system, and iterates, again and again]

41

Violations == Feedback

[Figure: sequential vs. parallel execution; Epoch 2's =*p conflicts with *p=, causing a violation; the feedback reports addresses 0x0FD8, 0xFD20, 0x0FC0, 0xFC18]

Must…Make…Faster

42

Eliminating Violations

[Figure: eliminating the *p dependence exposes a later violation on *q]

Optimization may make it slower?

43

Tolerating Violations: Sub-epochs

[Figure: with sub-epochs, a violation on *q rewinds only to the latest sub-epoch boundary rather than to the start of the epoch]

44

Sub-epochs

Started periodically by hardware
  How many? When to start?

Hardware implementation
  Just like epochs
  Use more epoch contexts
  No need to check violations between sub-epochs within an epoch

[Figure: sub-epoch boundaries limit the rewind on a *q violation]

45

Old TLS Design

[Diagram: four CPUs with private L1 caches over a shared L2; invalidations flow between the L1s]

Buffer speculative state in write-back L1 cache
Detect violations through invalidations
Rest of system only sees committed data
Restart by invalidating speculative lines

Problems:
• L1 cache not large enough
• Later epochs only get values on commit

46

New Cache Design

[Diagram: four CPUs with private L1 caches over a shared L2; invalidation coherence runs between L2 caches]

Buffer speculative and non-speculative state for all epochs in L2
Speculative writes immediately visible to L2 (and later epochs)
Detect violations at lookup time
Invalidation coherence between L2 caches
Restart by invalidating speculative lines

47

New Features

[Diagram: four CPUs with private L1 caches over a shared L2]

Speculative state in L1 and L2 cache
Cache line replication (versions)
Data dependence tracking within cache
Speculative victim cache

48

Scaling

[Bar charts: time (normalized) for Sequential, 2 CPUs, 4 CPUs, and 8 CPUs; one chart for the original benchmark and one modified with 50-150 items/transaction; bars split into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]

49

Evaluating a 4-CPU system

[Bar chart: time (normalized) for Sequential (original benchmark run on 1 CPU), TLS Seq (parallelized benchmark run on 1 CPU), No Sub-epoch (without sub-epoch support), Baseline (parallel execution), and No Speculation (ignore violations: the Amdahl's Law limit); bars split into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, and Instruction Execution]

50

Sub-epochs: How many/How big?

[Chart: time (normalized) vs. number of sub-epochs/instructions per sub-epoch]

• Supporting more sub-epochs is better
• Spacing depends on the location of violations
• Even spacing is good enough

51

Query Execution

Actions taken by a query:
  Bring pages into buffer pool
  Acquire and release latches & locks
  Allocate/free memory
  Allocate/free and use cursors
  Use B-trees
  Generate log entries

These generate violations.

52

Applying TLS

1. Parallelize loop
2. Run benchmark
3. Remove bottleneck
4. Go to 2

53

Outline

Hardware Developer

Transaction Programmer

DBMS Programmer

54

Violation Prediction

[Figure: a predictor watches the *q store/load pair; when the dependence is predicted, the dependent load waits and the R1/W2 violation is eliminated]

55

Violation Prediction

[Figure: the predictor synchronizes the dependent load with the last store]

Predictor problems:
  Large epochs: many predictions
  Failed prediction: violation
  Incorrect prediction: large stall

Two predictors required:
  Last store
  Dependent load

56

TLS Execution

[Diagram: four CPUs with private L1s over a shared L2, executing *p=, *q=, =*p, =*q; Epoch 2's =*p raises a violation]

57-61

TLS Execution

[Diagram sequence: step-by-step execution; as each epoch loads or stores *p and *q, per-CPU SM (speculatively modified) and SL (speculatively loaded) bits are set on the cache lines, until Epoch 2's =*p conflicts with Epoch 1's *p= and raises a violation]

62

Replication

[Diagram: cache state where one line holds two epochs' changes]

Can't invalidate a line if it contains two epochs' changes

63

Replication

[Diagram: the conflicting line is replicated, one copy per epoch]

64

Replication

[Diagram: replicated cache lines]

Makes epochs independent
Enables sub-epochs

65

Sub-epochs

[Diagram: one epoch divided into sub-epochs 1a-1d, each with its own SM/SL context]

Uses more epoch contexts
Detection/buffering/rewind is "free"
More replication:
  Speculative victim cache

66

get_page() wrapper

page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;

    /* No violations while calling get_page() */
    tls_escape_speculation();

    /* May get bad input data from the speculative thread! */
    check_get_arguments(id);

    /* Only one epoch per transaction at a time */
    tls_acquire_mutex(&mut);
    ret = get_page(id);
    tls_release_mutex(&mut);

    /* How to undo get_page() */
    tls_on_violation(put, ret);

    tls_resume_speculation();
    return ret;
}

Wraps get_page()

Isolated: undoing this operation does not cause cascading aborts
Undoable: easy way to return the system to its initial state

Can also be used for:
  Cursor management
  malloc()
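A sketch of the same escape-speculation pattern applied to malloc(), as the slide suggests (hypothetical code reusing the slide's tls_ calls): perform the real allocation while escaped, and register free() as the undo action in case the epoch is violated.

void *malloc_wrapper(size_t size) {
    static tls_mutex mut;
    void *ret;

    tls_escape_speculation();        /* leave speculative mode */
    tls_acquire_mutex(&mut);
    ret = malloc(size);              /* isolated: touches only allocator state */
    tls_release_mutex(&mut);
    tls_on_violation(free, ret);     /* undoable: free() restores the initial state */
    tls_resume_speculation();
    return ret;
}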

72

TPC-C Benchmark

[Diagram: a Company has Warehouses 1…W; each warehouse has Districts 1…10; each district has Customers 1…3k]

73

TPC-C Benchmark

[TPC-C schema diagram. Table sizes: Warehouse (W), District (W*10), Customer (W*30k), Order (W*30k+), Order Line (W*300k+), New Order (W*9k+), History (W*30k+), Stock (W*100k), Item (100k). Edge cardinalities shown: 10, 3k, 1+, 1+, 0-1, 5-15, 3+W, 100k]

74

What is TLS?

while(cond) {
  x = hash[i];
  ...
  hash[j] = y;
  ...
}

[Figure: the loop unrolled over time; successive iterations access hash[3], hash[10], hash[19], hash[21], hash[33], hash[30], hash[10], hash[25]]

75

What is TLS?

[Figure: the same loop, with iterations assigned as Threads 1-4 to Processors A-D]

76

What is TLS?

[Figure: Thread 4's read of hash[10] conflicts with Thread 2's store to hash[10]: Violation!]

77

What is TLS?

[Figure: each thread ends with attempt_commit(); the violated Thread 4 must discard its work]

78

What is TLS?

[Figure: the violated Thread 4 redoes its work and commits on the second attempt]

79

TLS Hardware Design

What's new?
  Large threads
  Epochs will communicate
  Complex control flow
  Huge legacy code base

How does hardware change?
  Store state in L2 instead of L1
  Reversible atomic operations
  Tolerate dependences
    Aggressive update propagation (implicit forwarding)
    Sub-epochs

80

L1 Cache Line

SL bit: "L2 cache knows this line has been speculatively loaded"
  On violation or commit: clear

SM bit: "This line contains speculative changes"
  On commit: clear
  On violation: SM line becomes Invalid

Otherwise, just like a normal cache

[Line format: SL | SM | Valid | LRU | Tag | Data]
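A software model of the line metadata as a C struct, purely illustrative (real hardware keeps these as bits in the tag array; the Stale bit is introduced on the next slide):

struct l1_line {
    unsigned sl    : 1;   /* speculatively loaded; cleared on violation or commit */
    unsigned sm    : 1;   /* speculative changes; commit clears, violation invalidates */
    unsigned stale : 1;   /* may be outdated by speculative work (used while escaped) */
    unsigned valid : 1;
    unsigned lru   : 2;   /* replacement state */
    unsigned long  tag;
    unsigned char  data[64];
};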

81

Escaping Speculation

Speculative epoch wants to make a system-visible change!

Ignore SM lines while escaped

Stale bit: "This line may be outdated by speculative work."
  On violation or commit: clear

[Line format: SL | SM | Valid | LRU | Tag | Data | Stale]

82

L1 to L2 communication

L2 sees all stores (write through)
L2 sees the first load of an epoch (NotifySL message)

L2 can track data dependences!

83

L1 Changes Summary

Add three bits to each line: SL, SM, Stale
Modify tag match to recognize the bits
Add a queue of NotifySL requests

[Line format: SL | SM | Valid | LRU | Tag | Data | Stale]

84

L2 Cache Line

[Line format: per-CPU SL and SM bits, fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data]

A cache line can be:
  Modified by one CPU
  Loaded by multiple CPUs

85

Cache Line Conflicts

Three classes of conflict:
  Epoch 2 stores, epoch 1 loads: need the "old" version to load
  Epoch 1 stores, epoch 2 stores: need to keep changes separate
  Epoch 1 loads, epoch 2 stores: need to be able to discard the line on violation

Need a way of storing multiple conflicting versions in the cache

86

Cache line replication

On conflict, replicate line Split line into two copies Divide SM and SL bits at split point Divide directory bits at split point

87

Replication Problems

Complicates line lookup
  Need to find all replicas and select the "best"
  Best == most recent replica

Change management
  On write, update all later copies
  Also need to find all more speculative replicas to check for violations
  On commit, must get rid of stale lines: Invalidation Required Buffer (IRB)

88

Victim Cache

How do you deal with a full cache set? Use a victim cache
  Holds evicted lines without losing SM & SL bits
  Must be fast: every cache lookup needs to know:
    Do I have the "best" replica of this line? (critical path)
    Do I cause a violation? (not on the critical path)

89

Summary of Hardware Support

Sub-epochs: violations hurt less!
Shared cache TLS support: faster communication, more room to store state
RAOs: don't speculate on known operations; reduces the amount of speculative state

90

Summary of Hardware Changes

Sub-epochs: checkpoint register state; needs replicas in cache
Shared cache TLS support: speculative L1, replication in L1 and L2, speculative victim cache, Invalidation Required Buffer
RAOs: suspend/resume speculation, mutexes, "undo list"

91

TLS Execution

[Diagram: four CPUs with private L1s over a shared L2; versions of *p and *q move between caches, and an invalidation triggers the violation on =*p]


95

Problems with Old Cache Design

Database epochs are large
  L1 cache not large enough
Sub-epochs add more state
  L1 cache not associative enough
Database epochs communicate
  L1 cache only communicates committed data

96

Intro Summary

TLS makes intra-transaction parallelism easy

Divide transaction into epochs
Hardware support:
  Detect violations
  Restart to recover
  Sub-epochs mitigate the penalty
  Buffer state
New process: modify software to avoid violations and improve performance

97

The Many Faces of Ogg

[Cartoon: Ogg cheers "Money! pwhile() {}" while the programmer runs while(too_slow) make_faster(); and must tune the reorder buffer…]

98

The Many Faces of Ogg

[Cartoon: Ogg now says "Duh… pwhile() {}" while the programmer keeps running while(too_slow) make_faster(); and tuning the reorder buffer…]

99

Removing Bottlenecks

Three general techniques:
  Partition data structures: malloc
  Postpone operations until non-speculative: latches and locks, log entries
  Handle speculation manually: buffer pool

100

Bottlenecks Encountered

Buffer pool
Latches & locks
Malloc/free
Cursor queues
Error checks
False sharing
B-tree performance optimization
Log entries

101

The Many Faces of Ogg

[Cartoon: more "Duh… pwhile() {}" and a final "Money! pwhile() {}" as the programmer continues while(too_slow) make_faster(); and reorder-buffer tuning…]

102

Performance on 4 CPUs

[Charts: unmodified benchmark vs. modified benchmark]

103

Incremental Parallelization

[Figure: epochs laid out over time on 4 CPUs]

104

Scaling

[Charts: unmodified benchmark vs. modified benchmark]

105

Parallelization is Hard

[Chart: performance improvement vs. programmer effort; hand parallelization gives high improvement for high effort, a parallelizing compiler gives little of either, and repeated tuning steps move toward "what we want"]

106

Case Study: New Order (TPC-C)

Begin transaction {
  Read customer info
  Read & increment order #
  Create new order
  For each item in order {
    Get item info
    Decrement count in stock
    Record order info
  }
} End transaction

The item loop is 80% of transaction execution time

111

The Many Faces of Ogg

[Cartoon: Ogg again, "Duh… pwhile() {}" and "Money! pwhile() {}", while the programmer runs while(too_slow) make_faster(); and must tune the reorder buffer…]

Step 2: Changing the Software

113

No problem!

Loop is easy to parallelize using TLS!
  Not really: calls into the DBMS invoke complex operations
  Ogg needs to do some work
Many operations in the DBMS are parallel
  Not written with TLS in mind!

114

Resource Management

Mutexes: acquired and released
Locks: locked and unlocked
Cursors: pushed and popped from a free stack
Memory: allocated and freed
Buffer pool entries: acquired and released

115

Mutexes: Deadlock?

Problem: re-ordered acquire/release operations may have introduced deadlock

Solutions:
  Avoidance: static acquire order (see the sketch below)
  Recovery: detect deadlock and violate
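A minimal sketch of the avoidance option (hypothetical code): give every mutex a fixed global rank and always acquire in rank order, so a cycle of waiters cannot form.

typedef struct { int rank; volatile int held; } ranked_mutex;

static void mutex_lock(ranked_mutex *m) { while (__sync_lock_test_and_set(&m->held, 1)) ; }

/* Acquire two mutexes in rank order, regardless of the caller's order.
 * A total order on ranks rules out deadlock cycles. */
void acquire_pair(ranked_mutex *a, ranked_mutex *b) {
    if (a->rank > b->rank) { ranked_mutex *t = a; a = b; b = t; }
    mutex_lock(a);
    mutex_lock(b);
}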

116

Locks

Like mutexes, but: Allows multiple readers No memory overhead when not held Often held for much longer

Treat similarly to mutexes

117

Cursors

Used for traversing B-trees Pre-allocated, kept in pools

118

Maintaining Cursor Pool

[Figure sequence: each epoch does Get, Use, Release on a shared free stack (head pointer); when two epochs Get and Release concurrently, the head pointer dependence causes a violation]

122

Parallelizing Cursor Pool

Use per-CPU pools:
  Modify code: each CPU gets its own pool
  No sharing == no violations!
  Requires a cpuid() instruction

[Figure: two independent pools, one per CPU, each with its own head]
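A minimal sketch of a per-CPU cursor pool (hypothetical code; cpuid() stands for whatever returns the current CPU number, approximated here with sched_getcpu() on Linux): each CPU pushes and pops only its own free list, so epochs on different CPUs never touch the same head pointer.

#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>

#define MAX_CPUS 8

typedef struct cursor { struct cursor *next; /* ... b-tree traversal state ... */ } cursor_t;

static cursor_t *pool_head[MAX_CPUS];   /* one free stack per CPU */

static int cpuid(void) { int c = sched_getcpu(); return c < 0 ? 0 : c % MAX_CPUS; }

cursor_t *cursor_get(void) {
    cursor_t **head = &pool_head[cpuid()];
    cursor_t *c = *head;                /* no cross-CPU dependence on head */
    if (c) *head = c->next;
    return c;                           /* NULL: caller allocates a fresh cursor */
}

void cursor_release(cursor_t *c) {
    cursor_t **head = &pool_head[cpuid()];
    c->next = *head;
    *head = c;
}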

123

Memory Allocation

Problem: malloc() metadata causes dependences

Solutions:
  Per-CPU memory pools (the same idea as the cursor pools above)
  Parallelized free list

124

The Log

Append records to a global log
  Appending causes a dependence
  Can't parallelize: the global log sequence number (LSN)
Generate log records in buffers
Assign LSNs when homefree (see the sketch below)
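A minimal sketch of deferred LSN assignment (hypothetical code, not BerkeleyDB's): while speculative, each epoch appends records to a private buffer with no LSN; once homefree, it claims a contiguous LSN range and stamps the records, so only non-speculative code touches the global counter.

#include <string.h>

typedef unsigned long lsn_t;

struct log_rec { lsn_t lsn; int len; char body[120]; };

static struct log_rec log_buf[256];   /* per-epoch private log buffer */
static int n_recs;
static lsn_t next_lsn;                /* global; touched only when homefree */

/* Speculative: buffer the record, leave the LSN unassigned. */
void log_append(const void *body, int len) {
    log_buf[n_recs].len = len;
    memcpy(log_buf[n_recs].body, body, (size_t)len);
    n_recs++;
}

/* Homefree: claim a contiguous LSN range and flush to the real log. */
void log_flush_at_commit(void) {
    lsn_t base = next_lsn;            /* safe: only the oldest epoch runs this */
    next_lsn += (lsn_t)n_recs;
    for (int i = 0; i < n_recs; i++) {
        log_buf[i].lsn = base + (lsn_t)i;
        /* ... write log_buf[i] to the global log here ... */
    }
    n_recs = 0;
}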

125

B-Trees

Leaf pages contain free space counts
  Inserts of random records: o.k.
  Inserting adjacent records: dependence on decrementing the count
Page splits: infrequent

126

Other Dependences

Statistics gathering Error checks False sharing

127

Related Work

Lots of work in TLS:
  Multiscalar (Wisconsin)
  Hydra (Stanford)
  IACOMA (Illinois)
  RAW (MIT)

Hand parallelizing using TLS:
  Manohar Prabhu and Kunle Olukotun (PPoPP'03)

Any questions?

129

Why is this a problem?

B-tree insertion into the ORDERLINE table
  Key is ol_n
The DBMS does not know that keys will be sequential
Each insert usually updates the same b-tree page

for(ol_n=0; ol_n<15; ol_n++) {
  INSERT INTO ORDERLINE (ol_n, ol_item, ol_cnt)
  VALUES (:ol_n, :ol_item, :ol_cnt);
}

130

Sequential Btree Inserts

[Figure: successive inserts into the same leaf page; each insert fills a slot and decrements the page's free-space count from 4 down to 0]

132

Outline

Store state in L2 instead of L1
Reversible atomic operations
Tolerate dependences
  Aggressive update propagation (implicit forwarding)
  Sub-epochs
Results and analysis

133

Outline

Store state in L2 instead of L1
Reversible atomic operations
Tolerate dependences
  Aggressive update propagation (implicit forwarding)
  Sub-epochs
Results and analysis

134

Tolerating dependences

Aggressive update propagation
  Get for free!
Sub-epochs
  Periodically checkpoint epochs: every N instructions?
  Picking N may be interesting
  Perhaps checkpoints could be set before the location of previous violations?

135

Outline

Store state in L2 instead of L1
Reversible atomic operations
Tolerate dependences
  Aggressive update propagation (implicit forwarding)
  Sub-epochs
Results and analysis

136

Why not faster?

Possible reasons: Idle cpus RAO mutexes Violations Cache effects Data dependences

137

Why not faster?

Possible reasons:
  Idle cpus
    9 epochs/region on average
    Two bundles of four and one of one: ¼ of cpu cycles wasted!
  RAO mutexes
  Violations
  Cache effects
  Data dependences

138

Why not faster?

Possible reasons:
  Idle cpus
  RAO mutexes
    Not implemented yet. Oops!
  Violations
  Cache effects
  Data dependences

139

Why not faster?

Possible reasons:
  Idle cpus
  RAO mutexes
  Violations
    21/969 epochs violated
    Distance 1, "magic synchronized": 2.2 Mcycles (over 4 cpus), about 1.5%
  Cache effects
  Data dependences

140

Why not faster?

Possible reasons:
  Idle cpus
  RAO mutexes
  Violations
  Cache effects
    Deserves its own slide.
  Data dependences

141

Cache effects of speculation

Only 20% of references are speculative!
Speculative references have a small impact on the non-speculative hit rate (<1%)
Speculative refs miss a lot in L1: 9-15% for reads, 2-6% for writes
L2 saw a HUGE increase in traffic: 152k refs to 3474k refs
Spec/non-spec lines are thrashing from the L1s

142

Why not faster?

Possible reasons:
  Idle cpus
  RAO mutexes
  Violations
  Cache effects
  Data dependences
    Oh yeah! The b-tree item count
    Split up the b-tree insert? alloc and write; do the alloc as an RAO
    Needs more thought

143

L2 Cache Line

[Line format as before: per-CPU SL and SM bits, fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data; two sets shown]

144

Why are you here?

Want faster database systems
Have funky new hardware: Thread Level Speculation (TLS)

How can we apply TLS to database systems?

Side question: is this a VLDB or an ASPLOS talk?

145

How?

Divide transaction into TLS-threads
Run TLS-threads in parallel; maintain sequential semantics
Profit!

146

Why parallelize transactions?

Decrease transaction latency
Increase concurrency while avoiding the concurrency control bottleneck
  A.k.a.: use more CPUs, same # of transactions

The obvious: database performance matters

147

Shopping List

What do we need? (research scope)

Cheap hardware: Thread Level Speculation (TLS)
  Minor changes allowed.
Important database application: TPC-C
  Almost no changes allowed!
Modular database system: BerkeleyDB
  Some changes allowed.

148

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

149

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

150

What’s new?

Database operations are:
  Large
  Complex

Large TLS-threads
  Lots of dependences
  Difficult to analyze

Want: programmer optimization effort = faster program

151

Hardware changes summary

Must tolerate dependences: prediction? implicit forwarding?
May need larger caches
May need larger associativity

152

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

153

Parallelization Strategy

1. Pick a benchmark
2. Parallelize a loop
3. Analyze dependences
4. Optimize away dependences
5. Evaluate performance
6. If not satisfied, goto 3

154

Outline

TLS Hardware
The Benchmark (TPC-C)
Changing the database system
  Resource management
  The log
  B-trees
  False sharing
Results
Conclusions

155

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

156

Results

Viola simulator:
  Single CPI
  Perfect violation prediction
  No memory system
  4 cpus
  Exhaustive dependence tracking

Currently working on an out-of-order superscalar simulation (cello)

10 transaction warm-up
Measure 100 transactions

157

Outline

TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

158

Conclusions

TLS can improve transaction latency
Violation predictors are important, iff dependences must be tolerated
TLS makes hand parallelizing easier

159

Improving Database Performance

How to improve performance:
  Parallelize the transaction
  Increase the number of concurrent transactions

Both of these require independence of database operations!

160

Case Study: New Order (TPC-C)

Begin transaction {
  Read customer info (customer, warehouse)
  Read & increment order # (district)
  Create new order (orders, neworder)
  For each item in order {
    Get item info (item)
    Decrement count in stock (stock)
    Record order info (orderline)
  }
} End transaction

Parallelize this loop

165

Implementing on a Real DB

Using BerkeleyDB
  "Table" == "Database"
  Give the database an arbitrary key, it will return arbitrary data (bytes)
  Use structs for keys and rows
The database provides ACID through:
  Transactions
  Locking (page level)
  Storage management
Provides indexing using b-trees

166

Parallelizing a Transaction

For each item in order {
  Get item info (item)
  Decrement count in stock (stock)
  Record order info (order line)
}

• Get cursor from pool
• Use cursor to traverse b-tree
• Find row, lock page for row
• Release cursor to pool

168

Maintaining Cursor Pool

[Figure sequence, as before: concurrent Get/Use/Release on the shared pool head causes a violation]

172

Parallelizing Cursor Pool 1

Use per-CPU pools, as before:
  Modify code: each CPU gets its own pool
  No sharing == no violations!
  Requires a cpuid() instruction

[Figure: two pools, one per CPU]

173

Parallelizing Cursor Pool 2

Dequeue and enqueue: atomic and unordered
Delay the enqueue until the end of the thread
  Forces separate pools
  Avoids modification of the data struct

[Figure: Get/Use/Release with the enqueue deferred]

174

Parallelizing Cursor Pool 3

Atomic unordered dequeue & enqueue
The cursor struct is "TLS unordered"
  The struct is defined as a byte range in memory

[Figure: several epochs Get/Use/Release concurrently]

175

Parallelizing Cursor Pool 4

Mutex-protect dequeue & enqueue; declare the pointer to the cursor struct to be "TLS unordered"
  Any access through the pointer does not have TLS applied
  The pointer is tainted; any copies of it keep this property

176

Problems with 3 & 4

What exactly is the boundary of a structure?
How do you express the concept of an object in a loosely-typed language like C?
  A byte range or a pointer is only an approximation.
  Dynamically allocated sub-components?

177

Mutexes in a TLS world

Two types of threads: "real" threads and TLS threads
Two types of mutexes: inter-real-thread and inter-TLS-thread

178

Inter-real-thread Mutexes

Acquire == get the mutex for all TLS threads
Release == release for the current TLS thread
  May still be held by another TLS thread!

179

Inter-TLS-thread Mutexes

Should never interact between two real threads
Implies no TLS ordering between TLS threads while the mutex is held
But what to do on a violation?
  Can't just throw away changes to memory
  Must undo operations performed in the critical section

180

Parallelizing Databases using TLS

Split transactions into threads
The threads created are large
  60k+ instructions
  16kB of speculative state
  More dependences between threads

How do we design a machine which can handle these large threads?

181

The “Old” Way

[Diagram: two clusters of four processors; each processor has a private L1 holding speculative state, each cluster shares an L2 holding committed state, over an L3 and the memory system]

182

The “Old” Way

Advantages
  Each epoch has its own L1 cache
  Epoch state does not intermix

Disadvantages
  L1 cache is too small! Full cache == dead meat
  No shared speculative memory

183

The “New” Way

L2 cache is huge! State of the art in caches, Power5:
  1.92MB 10-way L2
  32kB 4-way L1
Shared speculative memory "for free"
Keeps TLS logic off of the critical path

184

TLS Shared L2 Design

L1: write-through, write-no-allocate [CullerSinghGupta99]
  Easy to understand and reason about
  Writes visible to L2: simplifies shared speculative memory
L2 cache: shared cache architecture with replication
Rest of memory: distributed TLS coherence

185

TLS Shared L2 Design

[Diagram: the same two-cluster machine; the shared L2 holds the "real" speculative state, the L1s hold cached speculative state]

Explain from the top down

186

TLS Shared L2 Design

[Same diagram as the previous slide]

Part II: Dealing with dependences

188

Predictor Design

How do you design a predictor that:
  Identifies violating loads
  Identifies the last store that causes them
  Only triggers when they cause a problem
  Has very very high accuracy
???

189

Sub-epoch design

Like checkpointing
  Leave "holes" in the epoch # space
  Every 5k instructions, start a new epoch
Uses more cache to buffer changes
  More strain on associativity/victim cache
Uses more epoch contexts

190

Summary

Supporting large epochs needs:
  Buffer state in L2 instead of L1
  Shared speculative memory
  Replication
  Victim cache
  Sub-epochs

Any questions?
