an update on haskell h/stm - share. · pdf filelowered implementation level of tm runtime....

32
An Update on Haskell H/STM 1 Ryan Yates and Michael L. Scott University of Rochester TRANSACT 10, 6-15-2015 1 This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and by support from the IBM Canada Centres for Advanced Studies. 1/16

Upload: doankien

Post on 06-Mar-2018

214 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

An Update on Haskell H/STM1

Ryan Yates and Michael L. Scott

University of Rochester

TRANSACT 10, 6-15-2015

1This work was funded in part by the National Science Foundation undergrants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and bysupport from the IBM Canada Centres for Advanced Studies.

1/16

Page 2: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Outline

Haskell TM.Our implementations using HTM.Performance results.Future work.

Slides: http://goo.gl/0ZFJXJ Paper: http://goo.gl/Er29ef

2/16

Page 3: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell TM

At last TRANSACT we reported around 280 open source librariesin the Haskell ecosystem that depend on STM. Now there are over400.

Reasons for using Haskell STM:

It is easy!

Expressive API with retry and orElse.

Trivial to build into libraries.

3/16

Page 4: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM Example

transA = do

v <- dequeue(queue1)

return v

transB = do

v <- dequeue(queue2)

if someCondition(v)

then return v

else retry

...

mainLoop = do

a <- atomically(transA ‘orElse‘ transB ‘orElse‘ ...)

handleRequest(a)

mainLoop

4/16

Page 5: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Existing Haskell STM Implementations

Glasgow Haskell Compiler (GHC), 7.8.

Explicit transactional variables.

Object based.

Lazy value-based validation.

5/16

Page 6: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Existing Haskell STM Implementations

Coarse-grain Lock (STM-Coarse)

Serialize commits with a global lock.

Similar to NOrec [Dalessandro et al., 2010,Dalessandro et al., 2011, Riegel et al., 2011].

Fine-grain Locks (STM-Fine)

Lock for each TVar.

Two-phase commit.

Similar to OSTM [Fraser, 2004].

6/16

Page 7: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell with HTM

Hybrid TM (Hybrid)

Three levels for transactions (CF [Matveev and Shavit, 2013]).

Full transactions in hardware.

Software transaction, commit in hardware.

Full software fallback.

HLE Commit (HTM-Coarse, HTM-Fine)

Like coarse-grain and fine-grain lock STMs, but usinghardware transactions to elide locks around commit.

7/16

Page 8: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Red-black tree performance

Last year

200 nodes, 84,000 tree operations per second on 4 threads.

Now

50,000 nodes, 24,000,000 tree operations per second on 72 threads.

8/16

Page 9: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Red-black tree performance

What changed?

Constant space2 metadata tracking for retry.

Lowered implementation level of TM runtime.

Avoid reevaluating expensive thunks.

Fixed PRNG.

Also improved benchmarking for more accurate measurement.

2Nearly.

9/16

Page 10: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Results (Intel c© XeonTM E5-2699 v3 two socket, 36-core)

1 8 18 36 54 72

0

0.5

1

1.5

2

2.5·107

Threads

Tre

eop

erat

ion

sp

erse

con

d

HashMapSTM-CoarseHTM-Coarse

HybridSTM-FineHTM-Fine

10/16

Page 11: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Results (bad PRNG) (Intel c© XeonTM E5-2699 v3 two socket, 36-core)

1 8 18 36 54 72

0

0.5

1

1.5

2

2.5·107

Threads

Tre

eop

erat

ion

sp

erse

con

d

HashMapSTM-CoarseHTM-Coarse

HybridSTM-FineHTM-Fine

11/16

Page 12: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Results (single socket) (Intel c© XeonTM E5-2699 v3 two socket, 36-core)

1 2 4 6 8 10 12 14 16 18

0

0.5

1

1.5

·107

Threads

Tre

eop

erat

ion

sp

erse

con

d

HashMapSTM-CoarseHTM-Coarse

HybridSTM-FineHTM-Fine

12/16

Page 13: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Results retry (Intel c© XeonTM E5-2699 v3 two socket, 36-core)

0 20 40 60

0

2

4

6

·106

Threads

Qu

eue

tran

sfer

sp

erse

con

d

STM-CoarseHTM-Coarse

HybridSTM-FineHTM-Fine

13/16

Page 14: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Future Implementation Work TStruct

Flexible transactional variable granularity.

Unboxed mutable variables.

right

left

parent

color

value

key

hash

lock

Node

right

left

parent

color

value

key

hash

lock

Node

color

right

left

parent

value

key

Node

hash

value

TVar

color

right

left

parent

value

key

Node

14/16

Page 15: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Future Implementation Work (orElse)

Supporting orElse in Hardware Transactions

t1 = atomically (a ‘orElse‘ b)

Atomically choose second branch when the first retrys.

No direct support in hardware for a partial rollback.

If the first transaction does not write to any TVars, there isnothing to roll back.

Keep a TRec while running the first transaction.

Or rewrite the first transaction to delay writes until after thechoice to retry.

15/16

Page 16: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Summary

We have a much better understanding of the performanceissues.

Performance on a concurrent set is competative at scale.

Good performance for infrequent retry use cases.

HTM is a useful and flexible tool that helps performance.

We have a roadmap for future improvements.

Slides: http://goo.gl/0ZFJXJ Paper: http://goo.gl/Er29ef

16/16

Page 17: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

17/16

Page 18: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Supporting retry

Existing retry Implementation

When retry is encountered, add the thread to the watch listof each TVar in the transaction’s TRec.

When a transaction commits, wake up all transactions inwatch lists on TVars it writes.

18/16

Page 19: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Supporting retry

Hardware Transactions

Replace watch lists with bloom filters for read sets.

Support read-only retry directly in HTM.

Record write-set during HTM then perform wake-ups afterHTM commit.

19/16

Page 20: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Wakeup Structure

Committed writer transactions search blocked thread read-setsin a short transaction eliding a global wakeup lock.

Committing HTM read-only retry transactions atomicallyinsert themselves in the wakeup structure by writing theglobal wakeup lock inside the hardware transaction.

Releases lock when successfully blocked.

Aborts wakeup transaction (short and cheap).

Serializes HTM retry transactions (rare anyway).

20/16

Page 21: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Future Implementation Work (orElse)

Existing orElse Implementation

Atomic choice between transactions biased toward the first.

Nested TRecs allow for partial rollback.

If the first transaction encounters retry, throw away thewrites, but merge the reads and move to the secondtransaction.

21/16

Page 22: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM Metadata Structure

new

old

tvar

...

new

old

tvar

new

old

tvar

index

prev

TRec

prev

next

thread

Watch Queue

prev

next

thread

Watch Queue

watch

value

TVar

color

right

left

parent

value

key

Node

22/16

Page 23: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell HTM Metadata Structure

...

thread

read-set

Wakeup

write-set

read-set

HTRec

hash

value

TVar

color

right

left

parent

value

key

Node

23/16

Page 24: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell Before TStruct

color

right

left

parent

value

key

Node

hash

value

TVar

color

right

left

parent

value

key

Node

24/16

Page 25: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell with TStruct

right

left

parent

color

value

key

hash

lock

Node

right

left

parent

color

value

key

hash

lock

Node

25/16

Page 26: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM TQueue Implementation

data TQueue a = TQueue (TVar [a]) (TVar [a])

dequeue :: TQueue a -> a -> STM ()

dequeue (TQueue _ write) v = modifyTVar write (v:)

enqueue :: TQueue a -> STM a

enqueue (TQueue read write) =

readTVar read >>= \case(v:vs) -> writeTVar read vs >> return v

[] -> reverse <$> readTVar write >>= \case[] -> retry

(v:vs) -> do writeTVar write []

writeTVar read vs

return v

26/16

Page 27: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM Implementation

Fairly standard commit protocol, but missing optimizations frommore recent work.

Commit

Coarse grain: perform writes while holding the global lock.

Fine grain:

Acquire locks for writes while validating.

Check that read-only variables are still valid while holding thewrite locks.

Perform writes and release locks.

27/16

Page 28: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM

Broken code that we are not allowed to write!

transferBad :: TVar Int -> TVar Int -> Int -> STM ()

transferBad accountX accountY value = do

x <- readTVar accountX

y <- readTVar accountY

writeTVar accountX (x + v)

writeTVar accountY (y - v)

if x < 0

then launchMissles

else return ()

28/16

Page 29: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Haskell STM

Broken code that we are not allowed to write!

thread :: IO ()

thread = do

transfer a b 200

transfer a c 300

29/16

Page 30: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

C ABI vs Cmm ABI

GHC’s runtime support for STM is written in C.

Code is generated in Cmm and calls into the runtime areessentially foreign calls with significant extra overhead.

We avoid this by writing the HTM support in Cmm.

Typeclass machinery could allow deeper code specialization.

30/16

Page 31: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

Lazy Evaluation

Lazy evaluation may lead to false conflicts due to the updatestep that writes back the fully evaluated value.

One solution could be to delay performing updates (to sharedvalues) until after a transaction commits.

Races here are fine as any update must represent the samevalue.

31/16

Page 32: An Update on Haskell H/STM - share. · PDF fileLowered implementation level of TM runtime. Avoid reevaluating expensive thunks. Fixed PRNG. Also improved benchmarking for more accurate

References

[Dalessandro et al., 2011] Dalessandro, L., Carouge, F., White, S., Lev, Y., Moir, M., Scott, M. L., and Spear,M. F. (2011).Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory.In Proc. of the 16th Intl. Symp. on Architectural Support for Programming Languages and Operating Systems(ASPLOS), pages 39–52, Newport Beach, CA.

[Dalessandro et al., 2010] Dalessandro, L., Spear, M. F., and Scott, M. L. (2010).NOrec: Streamlining STM by abolishing ownership records.In Proc. of the 15th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP), pages 67–78,Bangalore, India.

[Fraser, 2004] Fraser, K. (2004).Practical lock-freedom.PhD thesis, University of Cambridge Computer Laboratory.

[Matveev and Shavit, 2013] Matveev, A. and Shavit, N. (2013).Reduced hardware NOrec.In 5th Workshop on the Theory of Transactional Memory (WTTM), Jerusalem, Israel.

[Riegel et al., 2011] Riegel, T., Marlier, P., Nowack, M., Felber, P., and Fetzer, C. (2011).In Proc. of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 53–64,San Jose, CA.

32/16