LECTURE 5: A Brief History of TM

Precursors of Computing: ENIAC

• 5000 ops/second
• $486k in 1946
• 19k vacuum tubes
• 200 kW
• 67 cubic meters

Latest trends: Intel Nehalem

• 1.9 billion transistors
• 12 billion ops per second
• 4 microprocessors
• 8 MB of on-chip memory
• 100 W
• 246 square millimeters

The Way: Not just Chip Frequency!

• 1970s: Programmable controllers, single chip microprocessors
• 1980s: Instruction pipelines, cache hierarchies
• 1990s: Speculative execution, superscalar processors
• 2000s: Multicore chips, embedded computing

Pipelining

• Split the processing of an instruction into a series of independent steps

• Classic pipeline
  – Instruction Fetch (IF)
  – Instruction Decode (ID)
  – Execute (EX)
  – Memory Access (MEM)
  – Register Write Back (WB)

Pipelining

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Instr 1  IF       ID       EX       MEM      WB
Instr 2           IF       ID       EX       MEM
Instr 3                    IF       ID       EX
Instr 4                             IF       ID

Different parts of the CPU used for different stages of the pipeline

Pipelining

• Throughput: Set by the speed of the slowest stage instead of the time for a whole instruction
• More expensive design
• Performance of a pipelined processor depends on the executing program, and is harder to predict than that of a non-pipelined processor
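The throughput claim can be checked with a small cycle count. The sketch below is an idealized model, assuming every stage takes exactly one cycle and there are no stalls or hazards; the class and method names are made up for illustration.

```java
// Idealized cycle counts: every stage takes one cycle, no stalls or hazards.
final class PipelineTiming {
    // Without pipelining, each instruction occupies the CPU for all stages.
    static long unpipelinedCycles(long instructions, int stages) {
        return instructions * stages;
    }

    // With pipelining, the first instruction needs `stages` cycles to finish,
    // and every later instruction completes one cycle after its predecessor.
    static long pipelinedCycles(long instructions, int stages) {
        if (instructions == 0) return 0;
        return stages + (instructions - 1);
    }

    public static void main(String[] args) {
        int stages = 5; // IF, ID, EX, MEM, WB
        for (long n : new long[] {1, 4, 1000}) {
            System.out.println(n + " instructions: "
                + unpipelinedCycles(n, stages) + " cycles unpipelined, "
                + pipelinedCycles(n, stages) + " cycles pipelined");
        }
    }
}
```

For long instruction streams the pipelined count approaches one instruction per cycle, which is why the slowest stage, rather than the whole instruction, bounds throughput. Real programs fall short of this because of dependencies and branches.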

Superscalar

• Executes multiple instructions per clock cycle by simultaneously dispatching to redundant functional units

• Think of it as multiple parallel pipelines, each processing instructions from a single stream

• Limitation: Degree of intrinsic parallelism in the stream

Out of Order Execution (OOE)

• Multiple instructions fetched
• Instructions dispatched to an instruction queue (also called instruction buffer or reservation stations)

• Instruction waits in the queue until the input operands are available

• Note that the instruction may leave the queue before earlier instructions

• Results are queued

Speculation in ILP

• Pipelining, OOE, and superscalar execution all involve some degree of "speculation"
• Example: Branch prediction
• There has always been some speculation "circuitry" in processors

Forms of parallelism

• Functional: Perform tasks that are functionally different in parallel, e.g. building a house – plumber, carpenter, electrician

• Pipeline: Perform tasks that are different in a particular order, e.g. lunch buffet

• Data: Perform the same task on different data, e.g. grading exams, MapReduce

Limitations of ILP

• Finite amount of ILP in any sequence of instructions

• Another possibility: Thread Level Parallelism (Functional parallelism)

• How to get multiple threads?
  – Write parallel programs
  – Thread level speculation
  – Code parallelization

Thread Level Speculation

• Takes a sequence of instructions
• Arbitrarily breaks it into a sequenced group of threads that may run in parallel
• Allows for oblivious parallelization of sequential programs
• Parallelization by speculation dynamically finds parallelism at runtime, and thus is not conservative

Code parallelization

• Implemented in compilers, e.g. SUIF

• Problems: Hard to identify dependencies between pieces of code and data at compile time

CMP (Chip Multiprocessors)

• Forward data between parallel threads
• Detect when reads occur too early
• Safely discard speculative state after violations
• Retire speculative writes in correct order

• Examples: Stanford HYDRA, Wisconsin Multiscalar, CMU Stampede (1995-2000)

Cache Coherence

• Consistency of data stored in local caches of a shared resource (Wiki definition)

• Protocols
  – MESI
  – MOESI
  – MOSI
  – MSI

MAIN MEMORY

INTERCONNECTION NETWORK

CACHE CACHE CACHE CACHE

P1 P2 P3 P4

2-state Invalidation Cache Protocol

States: VALID, INVALID

Transitions (X / Y: Action X / Reaction Y):
• INVALID to VALID: PrRd / BusRd
• INVALID to INVALID: PrWr / BusWr
• VALID to VALID: PrRd / --
• VALID to VALID: PrWr / BusWr

Legend:
• PrRd: Processor Read
• PrWr: Processor Write
• BusRd: Fetch a cache block
• BusWr: Write through one word
• --: No action

Write through, no allocation. Valid indicates cache presence.

2-State Protocol

• Simple hardware and protocol

• Requires high bandwidth (every write goes on bus!)

3-state Protocol (MSI)

• Modified

• Shared

• Invalid

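MSI State Diagram

• Modified: PrRd / -- and PrWr / -- stay in Modified; BusRd / BusWB moves to Shared; BusRdX / BusWB moves to Invalid
• Shared: PrRd / -- and BusRd / -- stay in Shared; PrWr / BusRdX moves to Modified; BusRdX / -- moves to Invalid
• Invalid: PrRd / BusRd moves to Shared; PrWr / BusRdX moves to Modified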
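The MSI transitions can be written down as a small state machine for a single cache line. This is a minimal sketch of the textbook MSI protocol using the event names from the diagram; the class `MsiLine` and its enums are hypothetical names, and the bus itself is not modelled.

```java
// Sketch of the MSI transitions for one cache line in one cache.
final class MsiLine {
    enum State { MODIFIED, SHARED, INVALID }
    enum Event { PR_RD, PR_WR, BUS_RD, BUS_RDX }     // processor and snooped bus events
    enum BusAction { NONE, BUS_RD, BUS_RDX, BUS_WB } // what this cache puts on the bus

    State state = State.INVALID;

    // Applies one event, updates the state, and returns the bus action
    // this cache generates in response.
    BusAction handle(Event e) {
        switch (state) {
            case INVALID:
                if (e == Event.PR_RD) { state = State.SHARED; return BusAction.BUS_RD; }
                if (e == Event.PR_WR) { state = State.MODIFIED; return BusAction.BUS_RDX; }
                return BusAction.NONE; // snooped traffic does not affect an invalid line
            case SHARED:
                if (e == Event.PR_WR) { state = State.MODIFIED; return BusAction.BUS_RDX; }
                if (e == Event.BUS_RDX) { state = State.INVALID; }
                return BusAction.NONE; // PrRd and BusRd leave the line shared
            case MODIFIED:
                if (e == Event.BUS_RD) { state = State.SHARED; return BusAction.BUS_WB; }
                if (e == Event.BUS_RDX) { state = State.INVALID; return BusAction.BUS_WB; }
                return BusAction.NONE; // PrRd and PrWr hit locally
            default:
                throw new IllegalStateException();
        }
    }
}
```

Unlike the 2-state protocol, a write to a line that is already Modified generates no bus traffic, which is where the bandwidth saving comes from.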

Further Improvements

• MESI: Illinois protocol

• MOESI

FIRST TRANSACTIONAL MEMORIES

Precursors: Knight (1986)

• Idea of TLS
• Two caches per processor
• The first proposal to use caches and cache coherence to maintain and enforce ordering among speculatively parallelized regions of a sequential code in the presence of unknown memory dependencies

The term "Transactional Memory"

• Introduced by Herlihy and Moss in 1991

• Idea: Adapt the cache coherence protocol so that transactional accesses are monitored

ISCA 93

• Six new instructions
  – Load-transactional
  – Load-transactional-exclusive
  – Store-transactional
  – Commit
  – Abort
  – Validate
• New processor flags
  – TACTIVE: Is a transaction currently active?
  – TSTATUS: Is the active transaction in progress, or aborted?

Transactional Cache

• States: MESI
• Additional transactional tags: EMPTY, NORMAL, XCOMMIT, XABORT
• Transactional operations create two entries: one with XCOMMIT and one with XABORT
• Modifications are made to the XABORT entry on a store

Three extra bus cycles

• T_READ: On a transactional load

• T_RFO: On a transactional load exclusive, or a store

• BUSY: Full cache or other reasons (prevent deadlocks or mutual aborts)

Load_transactional

• LT:
  – Search TxCache for an XABORT entry. Return its value if one exists
  – If there is no XABORT entry, search for a NORMAL entry. Change its tag to XABORT, and allocate a second entry with tag XCOMMIT and the same data
  – Else, issue a T_READ cycle. Behaves as Goodman's read. Two entries are created, tagged XABORT and XCOMMIT

Load_transactional_exclusive

• Similar to LT

• Instead of T_READ, T_RFO is used on a miss

Store_transactional

• Similar to LTX

• Changes the XABORT entry's data too

Validate

• Returns the TSTATUS flag

• If the TSTATUS flag is FALSE
  – Sets TSTATUS to TRUE
  – Sets TACTIVE to FALSE

Abort

• Discards the XABORT entries (sets their tags to EMPTY)
• Sets the tags of XCOMMIT entries to NORMAL
• Sets TSTATUS to TRUE
• Sets TACTIVE to FALSE

Commit

• Discards the XCOMMIT entries (sets their tags to EMPTY)
• Sets the tags of XABORT entries to NORMAL
• Sets TSTATUS to TRUE
• Sets TACTIVE to FALSE
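The tag bookkeeping for transactional stores, commit, and abort can be sketched in software. This is an illustrative model only, not the hardware design: `TxCache`, `fill`, `read`, and `finish` are hypothetical names, only the cache-hit case is modelled, and bus cycles such as T_RFO are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the transactional cache tags: XCOMMIT keeps the old value,
// XABORT holds the tentative value.
final class TxCache {
    enum Tag { EMPTY, NORMAL, XCOMMIT, XABORT }

    static final class Entry {
        final int address;
        int data;
        Tag tag;
        Entry(int address, int data, Tag tag) {
            this.address = address; this.data = data; this.tag = tag;
        }
    }

    final List<Entry> entries = new ArrayList<>();
    boolean tactive = false;
    boolean tstatus = true;

    // Installs a non-transactional line (stands in for an ordinary cache fill).
    void fill(int address, int data) {
        entries.add(new Entry(address, data, Tag.NORMAL));
    }

    private Entry find(int address, Tag tag) {
        for (Entry e : entries) {
            if (e.address == address && e.tag == tag) return e;
        }
        return null;
    }

    // Transactional store, hit case only; a miss would issue T_RFO (not modelled).
    void storeTransactional(int address, int value) {
        tactive = true;
        Entry x = find(address, Tag.XABORT);
        if (x == null) {
            x = find(address, Tag.NORMAL);
            if (x == null) throw new IllegalStateException("miss: T_RFO not modelled");
            entries.add(new Entry(address, x.data, Tag.XCOMMIT)); // keeps the old value
            x.tag = Tag.XABORT;                                   // will hold the new value
        }
        x.data = value;
    }

    // Value seen by the running processor: tentative value first, else committed value.
    int read(int address) {
        Entry x = find(address, Tag.XABORT);
        if (x == null) x = find(address, Tag.NORMAL);
        if (x == null) throw new IllegalStateException("miss");
        return x.data;
    }

    void commit() { finish(Tag.XCOMMIT, Tag.XABORT); }
    void abort()  { finish(Tag.XABORT, Tag.XCOMMIT); }

    // Drops entries tagged `discard` and promotes entries tagged `keep` to NORMAL.
    private void finish(Tag discard, Tag keep) {
        for (Entry e : entries) {
            if (e.tag == discard) e.tag = Tag.EMPTY;
            else if (e.tag == keep) e.tag = Tag.NORMAL;
        }
        tstatus = true;
        tactive = false;
    }
}
```

Commit and abort are symmetric: each only rewrites tags, so neither needs to copy data, which is what makes both operations cheap in the cache.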

Digression

• Why transactional memories instead of locks?

• Locks create several problems and require programmers to use them properly
  – Priority inversion: A lower-priority process that holds a lock is preempted while a higher-priority process needs the lock
  – Convoying: A process holding a lock is descheduled, and no other process that needs the lock can progress
  – Deadlock: Two or more processes attempt to lock the same set of objects in different orders

Digression

• Transactional memory was invented as a faster means of performing lock-free synchronization

• That is why the earliest TM implementations have no misspeculation. Their aborts are due to capacity constraints (HTM) or lock contention

Speculative Lock Elision (SLE)

• Another reason to use TM!
• Speculatively execute critical sections guarded by locks
• Use cache coherence and rollback for recovery from misspeculation

Hardware TMs in general

• Great idea, efficient implementations
• Limitations
  – High cost of implementation
  – Small transactional buffer sizes
  – Context switches

• Solutions: Unbounded HTM

SOFTWARE TM

Advantage

• More flexible than hardware, allows experimenting with a variety of algorithms

• Fewer limitations imposed by fixed size hardware, like caches

Access Granularity

• Detects conflicting accesses on objects / words / regions

• Object: Easy implementation, but many false conflicts

• Word: Fewer false conflicts
• Region: Less overhead than words

Update

• How the global memory is updated: Direct / deferred

• Direct: The transaction directly modifies the object itself, logs the original value in order to restore in case of abort

• Deferred: The transaction makes local modifications, and changes global memory only on commit
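The difference between the two policies is easiest to see side by side. The sketch below is single-threaded and ignores conflict detection entirely; `DirectTx`, `DeferredTx`, and the integer-array memory are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Single-threaded sketch contrasting direct and deferred update.
final class UpdatePolicies {
    // Direct update: write in place, remember original values in an undo log.
    static final class DirectTx {
        private final int[] memory;
        private final Map<Integer, Integer> undoLog = new HashMap<>();
        DirectTx(int[] memory) { this.memory = memory; }

        void write(int addr, int value) {
            undoLog.putIfAbsent(addr, memory[addr]); // log only the original value
            memory[addr] = value;
        }
        int read(int addr) { return memory[addr]; }
        void commit() { undoLog.clear(); }           // nothing to copy on commit
        void abort() {
            for (Map.Entry<Integer, Integer> e : undoLog.entrySet()) {
                memory[e.getKey()] = e.getValue();   // restore original values
            }
            undoLog.clear();
        }
    }

    // Deferred update: buffer writes privately, publish them on commit.
    static final class DeferredTx {
        private final int[] memory;
        private final Map<Integer, Integer> writeBuffer = new HashMap<>();
        DeferredTx(int[] memory) { this.memory = memory; }

        void write(int addr, int value) { writeBuffer.put(addr, value); }
        int read(int addr) { return writeBuffer.getOrDefault(addr, memory[addr]); }
        void commit() {
            for (Map.Entry<Integer, Integer> e : writeBuffer.entrySet()) {
                memory[e.getKey()] = e.getValue();   // publish buffered writes
            }
            writeBuffer.clear();
        }
        void abort() { writeBuffer.clear(); }        // nothing to undo on abort
    }
}
```

The trade-off shows up in which operation does the work: direct update makes commit cheap and abort expensive, while deferred update makes abort cheap and pays on commit and on every read that must consult the buffer.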

Conflict Detection

• When are the conflicts detected: Eager / lazy / mixed
• What is a conflict: Multiple accesses to the same location, at least one of which is a write
• For commit, a transaction must acquire every location updated. Eager if acquired at the first update operation, lazy if done at the time of commit.

• Mixed: Eagerly detects write/write conflicts, and lazily detects read/write conflicts
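The definition of a conflict can be stated as a predicate over read and write sets. This is a minimal sketch assuming each transaction tracks the locations it read and wrote as sets of integer addresses; `ConflictCheck` is a hypothetical name, and the sketch says nothing about when the check runs, which is exactly what eager and lazy detection differ on.

```java
import java.util.Set;

final class ConflictCheck {
    // Two transactions conflict if they access a common location
    // and at least one of the two accesses is a write.
    static boolean conflicts(Set<Integer> readsA, Set<Integer> writesA,
                             Set<Integer> readsB, Set<Integer> writesB) {
        return intersects(writesA, writesB)   // write/write
            || intersects(writesA, readsB)    // A writes what B read
            || intersects(readsA, writesB);   // B writes what A read
    }

    private static boolean intersects(Set<Integer> x, Set<Integer> y) {
        for (Integer v : x) {
            if (y.contains(v)) return true;
        }
        return false;
    }
}
```

Two transactions that only read the same location never conflict, which is why read-heavy workloads tend to scale well under TM.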

STM: 1995

• Memory to be accessed in a transaction known in advance

• Lock-free: Transactions help each other

• Motivation: Replace N-word CAS, implement lock-free data structures etc

The System Model

We assume that every shared memory location supports these 4 operations:
• Write_i(L,v) - thread i writes v to L
• Read_i(L,v) - thread i reads v from L
• LL_i(L,v) - thread i reads v from L and marks that L was read by i
• SC_i(L,v) - thread i writes v to L and returns success if L is marked as read by i. Otherwise it returns failure.
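The four operations can be emulated in software to make their semantics concrete. The sketch below is an assumption-laden model rather than real hardware LL/SC: `LlScCell` is a hypothetical class, threads are identified by integer ids, and it assumes the usual rule that any successful write clears all outstanding read marks.

```java
import java.util.HashSet;
import java.util.Set;

// Emulates one shared location supporting Read, Write, LL and SC.
final class LlScCell<V> {
    private V value;
    private final Set<Integer> readers = new HashSet<>(); // threads whose LL mark is still valid

    LlScCell(V initial) { this.value = initial; }

    synchronized V read() { return value; }

    // A plain write changes the value, so it invalidates every outstanding LL mark.
    synchronized void write(V v) { value = v; readers.clear(); }

    // LL_i: read the value and mark the location as read by thread i.
    synchronized V ll(int thread) { readers.add(thread); return value; }

    // SC_i: succeeds only if thread i's mark survived, i.e. nobody wrote since its LL.
    synchronized boolean sc(int thread, V v) {
        if (!readers.contains(thread)) return false;
        value = v;
        readers.clear();
        return true;
    }
}
```

If two threads both LL the same location, only the first SC succeeds; the second fails because the first write cleared its mark. The STM code relies on this to let several helping threads race on the same record safely.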

Thread

class Rec {
    boolean stable = false;
    boolean,int status = (false,0); // pseudocode pair (outcome, failed location): can have two values…
    boolean allWritten = false;
    int version = 0;
    int size = 0;
    int locs[] = {null};
    int oldValues[] = {null};
}

Each thread is defined by an instance of the Rec class (short for "record").

The Rec instance defines the current transaction the thread is executing (only one transaction at a time).

The STM Object

[Figure: the STM object holds two arrays.]
• Memory: This is the shared memory
• Ownerships: Pointers to threads, i.e. to their records (status, version, size, locs[], oldValues[])

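Flow of a transaction

[Flowchart: Thread i calls startTransaction on the STM.]
• startTransaction: initialize, then transaction, then acquireOwnerships
• Status (Null, 0): agreeOldValues, calcNewValues, updateMemory, releaseOwnerships, then Success
• Status (Failure, failed loc): releaseOwnerships, then check isInitiator? The initiator initiates a helping transaction on the failed loc (isInitiator := F); the transaction ends in Failure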

The STM Object

public class STM {
    int memory[];
    Rec ownerships[];

    public boolean, int[] startTransaction(Rec rec, int[] dataSet) {...};

    private void initialize(Rec rec, int[] dataSet) {...};
    private void transaction(Rec rec, int version, boolean isInitiator) {...};
    private void acquireOwnerships(Rec rec, int version) {...};
    private void releaseOwnerships(Rec rec, int version) {...};
    private void agreeOldValues(Rec rec, int version) {...};
    private void updateMemory(Rec rec, int version, int[] newValues) {...};
}

Implementation

public boolean, int[] startTransaction(Rec rec, int[] dataSet) {
    initialize(rec, dataSet);
    rec.stable = true;                     // this notifies other threads that I can be helped
    transaction(rec, rec.version, true);
    rec.stable = false;
    rec.version++;
    if (rec.status) return (true, rec.oldValues);
    else return false;
}

rec – The thread that executes this transaction.
dataSet – The locations in memory it needs to own.

Implementation

private void transaction(Rec rec, int version, boolean isInitiator) {
    acquireOwnerships(rec, version);            // try to own locations
    (status, failedLoc) = LL(rec.status);
    if (status == null) {                       // success in acquireOwnerships
        if (version != rec.version) return;
        SC(rec.status, (true, 0));
    }
    (status, failedLoc) = LL(rec.status);
    if (status == true) {                       // execute the transaction
        agreeOldValues(rec, version);
        int[] newVals = calcNewVals(rec.oldValues);
        updateMemory(rec, version, newVals);
        releaseOwnerships(rec, version);
    }
    else {                                      // failed in acquireOwnerships
        releaseOwnerships(rec, version);
        if (isInitiator) {
            Rec failedTrans = ownerships[failedLoc];
            if (failedTrans == null) return;
            else {                              // execute the transaction that owns the location you want
                int failedVer = failedTrans.version;
                if (failedTrans.stable) transaction(failedTrans, failedVer, false);
            }
        }
    }
}

rec – The thread that executes this transaction.
version – Serial number of the transaction.
isInitiator – Am I the initiating thread or the helper?

Another thread owns the locations I need and it hasn't finished its transaction yet. So I go out and execute its transaction in order to help it.

Implementation

private void acquireOwnerships(Rec rec, int version) {
    for (int j = 1; j <= rec.size; j++) {
        while (true) {
            int loc = rec.locs[j];
            if (LL(rec.status) != null) return;  // transaction completed by some other thread
            Rec owner = LL(ownerships[loc]);
            if (rec.version != version) return;
            if (owner == rec) break;             // location is already mine
            if (owner == null) {                 // acquire location
                if (SC(rec.status, (null, 0))) {
                    if (SC(ownerships[loc], rec)) { break; }
                }
            }
            else {                               // location is taken by someone else
                if (SC(rec.status, (false, j))) return;
            }
        }
    }
}

If I'm not the last one to read this field, it means that another thread is trying to execute this transaction. Loop until I succeed or until the other thread completes the transaction.

Implementation

// Copy the dataSet to my private space
private void agreeOldValues(Rec rec, int version) {
    for (int j = 1; j <= rec.size; j++) {
        int loc = rec.locs[j];
        if (LL(rec.oldValues[loc]) != null) {
            if (rec.version != version) return;
            SC(rec.oldValues[loc], memory[loc]);
        }
    }
}

// Selectively update the shared memory
private void updateMemory(Rec rec, int version, int[] newValues) {
    for (int j = 1; j <= rec.size; j++) {
        int loc = rec.locs[j];
        int oldValue = LL(memory[loc]);
        if (rec.allWritten) return;              // work is done
        if (rec.version != version) return;
        if (oldValue != newValues[j]) SC(memory[loc], newValues[j]);
    }
    if (!LL(rec.allWritten)) {
        if (rec.version != version) return;      // stale helper: do not mark a newer transaction
        SC(rec.allWritten, true);
    }
}

DSTM: 2003

• Object granularity

• Deferred update

• Eager conflict detection

• Indirection

• Validation

TL2: 2006

• Lock based

• Smart idea: Keep validation fast

• Many recent STMs use TL2 as their base

Trends

• Initially: Lock-free

• Then: Obstruction-free

• Now: Mostly Lock based

• Reason: Simplicity pays off!

Homework 2

• Q1. Review a paper on HTM/STM
  – McRT STM
  – Bartok STM
  – Swiss TM
  – Tiny STM
  – Log TM

Question 2

• Understand the importance of different validation steps in DSTM and TL2

• Due date for Homework 2: 25 November

References

• Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model (Archibald and Baer, TOCS 1986)
• Transactional Memory: Architectural Support for Lock-Free Data Structures (Maurice Herlihy and J. Eliot B. Moss, ISCA 1993)
• Software Transactional Memory (Nir Shavit and Dan Touitou, PODC 1995)
• STM for Dynamic-sized Data Structures (Maurice Herlihy et al., PODC 2003)
• Transactional Locking II (Dave Dice et al., DISC 2006)

Next Lecture

• Correctness Properties in TM

• Formal Semantics of TM
