Page 1: Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa

An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer

Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa

Department of Information Science, Faculty of Science, University of Tokyo

Page 2: Background

“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated

Languages with fine-grain threads
• Promising approach to handle the complexity

Page 3: Motivation

Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?

Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.

Page 4: Goal

Case study to better understand the effectiveness of fine-grain threads

C + Solaris threads (approach w/o fine-grain threads)
VS.
our language Schematic (approach with fine-grain threads)

in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP

Page 5: Overview

Applications (RNA & CKY)

Solutions without fine-grain threads

Solutions with fine-grain threads

Performance evaluation

Page 6: Case Study 1: RNA - protein secondary structure prediction -

Algorithm: simple node traversal + pruning

Finding a path
• satisfying a certain condition
• with the largest weight

[Figure: unbalanced tree]

Page 7: Case Study 2: CKY - context-free grammar parser -

She is a girl whose mother is a teacher.

Calculation of matrix elements
• depends on all s
• calculation time varies significantly from element to element
• actual size ≒ 100
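The dependency pattern of the CKY matrix can be sketched as follows. This is a minimal recognizer, not the program measured in the slides; the toy grammar, the lexicon, and the names `cky_table` and `recognizes` are assumptions made for illustration. Element (i, j) reads every pair of elements (i, k) and (k + 1, j), which is why the work per element varies so much.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (not taken from the slides).
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}
LEXICON = {"she": {"NP"}, "eats": {"V"}, "a": {"Det"}, "fish": {"N"}}


def cky_table(words):
    """Fill the upper-triangular CKY matrix for a list of words."""
    n = len(words)
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            # Element (i, j) depends on all split points k with i <= k < j.
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return table


def recognizes(words):
    """True if the start symbol S derives the whole word list."""
    return "S" in cky_table(words)[0][len(words) - 1] if words else False
```

In a parallel version each element would be computed by its own task, and every read of `table[i][k]` or `table[k + 1][j]` would become a synchronization point.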

Page 9: Solution without Fine-grain Threads (RNA)

To create a thread for each node → large overhead

[Figure: processors (P P P) taking tasks from a shared Task Pool; communication with memory]
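A task pool of this kind can be sketched as below. It is only an illustration in Python, not the C + Solaris threads program from the slides: the tree layout (`make_node`), the function name `max_path_weight`, and the use of `queue.Queue` as the pool are all assumptions. A fixed set of worker threads takes nodes from the shared pool and puts newly discovered children back, so no thread is created per node.

```python
import queue
import threading


def make_node(weight, *children):
    """Hypothetical tree node: a weight plus a list of child nodes."""
    return {"weight": weight, "children": list(children)}


def max_path_weight(root, num_workers=3):
    """Largest root-to-leaf weight, computed by workers sharing a task pool."""
    tasks = queue.Queue()            # the shared task pool
    best = [float("-inf")]
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:         # sentinel: the pool has been drained
                tasks.task_done()
                return
            node, weight = item
            weight += node["weight"]
            if node["children"]:
                for child in node["children"]:
                    tasks.put((child, weight))   # tasks appear at runtime
            else:
                with lock:
                    best[0] = max(best[0], weight)
            tasks.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    tasks.put((root, 0))
    tasks.join()                     # wait until every queued node is processed
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return best[0]
```

All of the pool handling, locking, and termination logic here is what the programmer has to write by hand in this approach.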

Page 10: Solution without Fine-grain Threads (CKY)

Calculating 1 element → 0 ~ 200 synchronizations

How to implement?
• small delay → simple spin
• large delay → block wait

Decision strategy?
• trial & error
• prediction

[Figure: processors (P P P)]
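One common way to combine the two waiting styles is to spin for a bounded number of polls and then fall back to a blocking wait. The sketch below assumes that strategy; the class `Cell`, its methods, and the `spin_limit` parameter are hypothetical names, and the slides do not say which decision strategy the C program finally used.

```python
import threading


class Cell:
    """One matrix element whose value is produced by another thread."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def put(self, value):
        """Producer side: store the value, then wake any waiting readers."""
        self._value = value
        self._ready.set()

    def get(self, spin_limit=1000):
        """Consumer side: spin briefly, then block if the value is still missing."""
        # Small delay expected: poll a bounded number of times (simple spin).
        for _ in range(spin_limit):
            if self._ready.is_set():
                return self._value
        # Large delay: stop polling and block until put() is called.
        self._ready.wait()
        return self._value
```

Choosing `spin_limit` is the "trial & error or prediction" problem: too small and short waits pay the cost of blocking, too large and long waits burn processor time.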

Page 12: Language with Fine-grain Threads

Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))

• future: thread creation
• touch: synchronization
• r1, r2: channel
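For readers without a Schematic system, the same program can be approximated in Python. This is a rough analogue under stated assumptions: the `Future` class below is hypothetical, it creates a real OS thread eagerly for every future (far heavier than Schematic's fine-grain threads), and it does not propagate exceptions.

```python
import threading


class Future:
    """Runs fn(*args) in a new thread; touch() waits for the result."""

    def __init__(self, fn, *args):
        self._result = None
        self._thread = threading.Thread(target=self._run, args=(fn, args))
        self._thread.start()             # thread creation

    def _run(self, fn, args):
        self._result = fn(*args)

    def touch(self):
        self._thread.join()              # synchronization
        return self._result


def fib(x):
    """Same recursion as the Schematic example, with fib(0) = fib(1) = 1."""
    if x < 2:
        return 1
    r1 = Future(fib, x - 1)
    r2 = Future(fib, x - 2)
    return r1.touch() + r2.touch()
```

Keep the argument small when trying it (for example `fib(10)`), because every call spawns two threads.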

Page 13: Thread Management in Schematic

• Lazy Task Creation [Mohr et al. 91]

[Figure: stacks of future frames on PE A and PE B]
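The stack discipline behind lazy task creation can be sketched with a double-ended queue: the owning PE pushes and pops at the top like an ordinary call stack, while an idle PE takes the oldest entry from the bottom. This is a simplified, single-threaded model only. The class `TaskStack` and its method names are assumptions, and it stores whole tasks, whereas the real technique works on the continuations of pending futures.

```python
from collections import deque


class TaskStack:
    """Per-PE stack of pending futures: owner uses the top, thieves the bottom."""

    def __init__(self):
        self._items = deque()

    def push(self, task):
        """Owner: record a future that could become a separate task."""
        self._items.append(task)

    def pop(self):
        """Owner: continue with the most recently pushed future, if any."""
        return self._items.pop() if self._items else None

    def steal(self):
        """Idle PE: take the oldest future, which usually holds the most work."""
        return self._items.popleft() if self._items else None
```

For example, after pushing `"a"`, `"b"`, `"c"`, a thief calling `steal()` gets `"a"` while the owner's next `pop()` returns `"c"`.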

Page 14: Synchronization on Register

• StackThreads [Taura 97]

[Figure: values on PE A and PE B, labelled "register" or "memory"]

Page 15: Synchronization by Code Duplication

Source: work A; work B (touch r)

work A
if (r has value) {
    work B ver. 1;
} else {
    c = closure(cont, fv1, ...);
    put_closure(r, c);
    /* switch to another work */
    ...
}

cont(c, v) {
    work B ver. 2;
}

+ heuristics to decide which to duplicate

[Figure labels: simple spin, block wait]
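The two code paths can be illustrated with a small single-threaded model. Everything named here (`Placeholder`, `put_value`, `work_a_then_b`, the `log` list) is hypothetical and only mirrors the shape of the slide's pseudocode. If the value is already present, the fast copy of work B runs inline. Otherwise the rest of the computation is packaged as a closure and attached to the placeholder.

```python
class Placeholder:
    """Result slot of a future: holds a value or the closures waiting for it."""

    _EMPTY = object()

    def __init__(self):
        self._value = Placeholder._EMPTY
        self._waiting = []

    def has_value(self):
        return self._value is not Placeholder._EMPTY

    def value(self):
        return self._value

    def put_closure(self, cont):
        """Register a continuation to run once the value arrives."""
        self._waiting.append(cont)

    def put_value(self, v):
        """Store the value and run every continuation that was waiting."""
        self._value = v
        waiting, self._waiting = self._waiting, []
        for cont in waiting:
            cont(v)


def work_a_then_b(r, log):
    """work A, then work B, which needs the value of r (the touch)."""
    log.append("work A")
    if r.has_value():
        # Version 1 of work B: runs inline, no closure is allocated.
        log.append(("work B ver. 1", r.value() + 1))
    else:
        # Version 2 of work B: duplicated code wrapped in a continuation.
        def cont(v):
            log.append(("work B ver. 2", v + 1))
        r.put_closure(cont)
        # The caller is now free to switch to other work.
```

Both versions compute the same thing. Duplicating work B trades code size for a fast path that avoids closure creation.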

Page 17: What description can be omitted in Schematic?

• Management of fine-grain tasks
• Synchronization details

Schematic    C + thread
future   ⇔   manipulation of task pool + load balance
touch    ⇔   manipulation of comm. medium + aggressive optimizations

Page 18: Codes for Parallel Execution (RNA)

C

int search_node(...) {
    if (condition) {
    } else {
        child = ...;
        ...
        search_node(...);
        ...
        ...
        ...
    }
}

Schematic

(define (search_node)
  (if condition
      'done
      (let ((child ..))
        ...
        ...
        (search_node)
        ...
        ...
        ...)))

                                       C                   Schematic
whole                                  1566 lines          453 lines
parallel (for parallel execution)      537 lines (34 %)    29 lines (6.4 %)

Page 19: Performance Evaluation (Condition)

• Sun Ultra Enterprise 10000 (UltraSparc 250 MHz × 64)
• Solaris 2.5.1
• Solaris thread (user-level thread)
• GC time not included
• Runtime type check omitted

Page 20: Performance Evaluation (Sequential)

[Figure: bar chart of normalized elapsed time (axis 0 to 3) for RNA and CKY; series: C, Schematic]

Page 21: Performance Evaluation (Parallel)

[Figure: speedup (axis 0 to 50) versus # of PEs (axis 0 to 60); series: C (RNA), Schematic (RNA), C (CKY), Schematic (CKY)]

Page 22: Related Work

ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
    • namespace management
    • data locality
    • object-consistency model

Page 23: Conclusion

We demonstrated the usefulness of fine-grain multithread languages
• Task pool-like execution with simple description
• Aggressive optimizations for synchronization

We showed the experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C

Page 24: Performance Evaluation (Other Applications 1/2)

[Figure: bar chart of normalized elapsed time (axis 0 to 4, one bar labelled 14.7) for Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2; series: C, Schematic]

Page 25: Performance Evaluation (Other Applications 2/2)

[Figure: speedup (axis 0 to 50) versus # of PEs (axis 0 to 60); series: Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2]

Page 26: Identifying Overheads

[Figure: bar chart of normalized elapsed time (axis 0 to 1000); bars: normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., C]