on-line parallel tomography

1

On-line Parallel Tomography

Shava Smallen, UCSD

2

Talk Outline

I) Introduction to On-line Parallel Tomography

II) Tunable On-line Parallel Tomography

III) User-directed application-level scheduler

IV) Experiments

V) Conclusion

3

What is tomography?

• A method for reconstructing the interior of an object from its projections

• At the National Center for Microscopy and Imaging Research (NCMIR), tomography is applied to electron microscopy to study specimens at the cellular and subcellular level

4

Example

Tomogram of spiny dendrite (images courtesy of Steve Lamont)

5

Parallel Tomography at NCMIR

• Embarrassingly parallel

[Figure: diagram of a specimen with X, Y, and Z axes, labeled with slice, projection, and scanline]

6

NCMIR Usage Scenarios

Off-line parallel tomography (off-line PT)

– Data resides somewhere on secondary storage

– Single, high quality tomogram

– Reduce turnaround time

– Previous work (HCW’00)

On-line parallel tomography (on-line PT)

– Data streamed from the electron microscope

• long makespan, configuration errors, etc.

– Iteratively computed tomogram

– Soft real-time execution

7

On-line PT

• Real-time feedback on quality of data acquisition

1) First projection acquired from microscope

2) Generate coarse tomogram

3) Iteratively refine tomogram using subsequent projections (refresh)

• Update each voxel value

• Size of tomogram is constant

8

NCMIR Target Platform

• Multi-user, heterogeneous resources

– NCMIR cluster

• SGI Indigo2, SGI Octane, SUN ULTRA, SUN Enterprise

• IRIX, Solaris

– Meteor cluster

• Pentium III dual proc

• Linux, PBS

– Blue Horizon

• AIX, Loadleveler, Maui Scheduler

On-line PT Architecture

[Figure: architecture diagram. Labels: projection, scanlines, preprocessor, network, ptomo (five instances), slices, writer, tomogram]

10

On-line PT Design

1) Frame on-line parallel tomography as a tunable application

– Resource limitations / dynamic

– Availability of alternate configurations [Chang, et al]

• each configuration corresponds to different output quality and resource usage

2) Coupled with user-directed application-level scheduler (AppLeS)

– adaptive scheduler

– promote application performance

11

On-line PT Configuration

• Triple: (f, r, su)

• Reduction factor (f)

– Reduce resolution of data → reduce both computation and communication

• Projections per refresh (r)

– Reduce refinement frequency → reduce communication

• Service units (su)

– Increase cost of execution → increase computational power

12

User Preferences

• Best configuration (f, r, su) = (1, 1, 0)

• Several possible configurations → user specifies bounds

– projections should be at least size 256x256

• 1 ≤ f ≤ 4 or 1 ≤ f ≤ 8

– user could tolerate up to a 10 minute wait

• 1 ≤ r ≤ 13

– reasonable upper bound

• 0 ≤ su ≤ (50 x acquisition period x c)
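
The bounds above can be sketched in code. This is a minimal illustration, not the scheduler's implementation. It assumes that the reduction factor f divides each projection dimension by f, and that waiting for r projections takes r acquisition periods. The 1024-pixel projection size and the 45-second acquisition period are hypothetical inputs that do not appear on the slide, and because c is not defined there, the su bound is passed in directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """User-specified upper bounds on a configuration triple (f, r, su)."""
    f_max: int   # largest reduction factor
    r_max: int   # largest number of projections per refresh
    su_max: int  # largest number of service units


def derive_bounds(projection_size, min_size, max_wait_s,
                  acquisition_period_s, su_max):
    """Turn user preferences into bounds on (f, r, su).

    Assumption: f divides each projection dimension by f, and waiting for
    r projections takes r * acquisition_period_s seconds.
    """
    f_max = projection_size // min_size
    r_max = int(max_wait_s // acquisition_period_s)
    return Bounds(f_max=f_max, r_max=r_max, su_max=su_max)


# Hypothetical inputs: 1024x1024 projections, 45 s between projections,
# and an arbitrary su bound of 100.
bounds = derive_bounds(projection_size=1024, min_size=256,
                       max_wait_s=600, acquisition_period_s=45, su_max=100)
print(bounds)
```

Under these assumed inputs the sketch gives f up to 4 and r up to 13; a 2048-pixel projection would give f up to 8.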

13

User-directed

• Feasible?

– Use dynamic load information

– if work allocation found

• Better?

– e.g.

1. (1, 6, 4) - best f

2. (2, 2, 8) - good su/r

3. (2, 1, 20) - best r

Triple: (reduction factor, projections per refresh, service units)

User-directed AppLeS

[Figure: flow diagram between the User and the User-directed AppLeS. Labels: generate request, process request, find work allocation, feasible, infeasible, display triples, review triples, accepts one, rejects all, adjust request, execute on-line PT]

15

Triple Search

• Search parameter space

– If triple satisfies constraints → feasible

• Constrained optimization problem based on soft real-time execution

– compute constraint

– transfer constraint

• Heuristics to reduce search space

– e.g. assume user will always choose (1,2,1) over (1,2,4)
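
The heuristic in the last bullet can be read as a dominance rule: when two feasible triples share f and r, only the one with the smaller su needs to be shown to the user. The sketch below implements that reading. The function name and the sample triples are illustrative assumptions, not taken from the scheduler itself.

```python
def prune_by_cost(triples):
    """Keep, for each (f, r) pair, only the triple with the smallest su."""
    cheapest = {}
    for f, r, su in triples:
        key = (f, r)
        if key not in cheapest or su < cheapest[key]:
            cheapest[key] = su
    return sorted((f, r, su) for (f, r), su in cheapest.items())


# Hypothetical set of feasible triples; (1, 2, 4) is dropped because
# (1, 2, 1) offers the same f and r at lower cost.
feasible = [(1, 2, 4), (1, 2, 1), (2, 1, 20), (2, 2, 8)]
pruned = prune_by_cost(feasible)
print(pruned)
```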

16

Work Allocation

[Figure: diagram of work allocation. Labels: user constraints, compute constraints, transfer constraints, cost, cpu availability, processor availability, ptomo-to-writer bandwidth, subnet-to-writer bandwidth]

• Multiple mixed-integer programs → approximate solution
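
To make the idea of a work allocation concrete, the sketch below splits a number of slices across machines in proportion to their cpu availability. It is deliberately not the mixed-integer formulation named on the slide, and it ignores the bandwidth and cost inputs; it is a much simpler proportional stand-in, and the machine names and availability figures are hypothetical.

```python
def allocate_slices(num_slices, cpu_availability):
    """Split num_slices across machines in proportion to cpu availability.

    cpu_availability maps a machine name to the fraction of a cpu expected
    to be free. Largest-remainder rounding makes the integer shares sum
    exactly to num_slices.
    """
    total = sum(cpu_availability.values())
    if total <= 0:
        raise ValueError("no cpu capacity available")
    exact = {m: num_slices * a / total for m, a in cpu_availability.items()}
    shares = {m: int(x) for m, x in exact.items()}
    leftover = num_slices - sum(shares.values())
    # Hand the remaining slices to the machines with the largest fractions.
    by_fraction = sorted(exact, key=lambda m: exact[m] - shares[m],
                         reverse=True)
    for m in by_fraction[:leftover]:
        shares[m] += 1
    return shares


# Hypothetical machines: one idle, two half-loaded.
allocation = allocate_slices(10, {"a": 1.0, "b": 0.5, "c": 0.5})
print(allocation)
```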

17

Experiments

• Impact of dynamic information on scheduler performance

• Usefulness of tunability in Grid environments

• Scheduling latency

18

Dynamic Information

• We fix the triple and let schedulers determine work allocation

                Infinite bandwidth    Dynamic bandwidth
Dedicated cpu   wwa                   wwa+bw
Dynamic cpu     wwa+cpu               AppLeS

19

Simulation

• Evaluate schedulers

– Repeatability

– Long makespan

– several resource environments

• Simgrid (Casanova [CCGrid’2001])

– API for evaluating scheduling algorithms

• tasks

• resources modeled using traces

– E.g. Parameter sweep applications [HCW’00]

• Simtomo

20

Performance Metric

• Relative refresh lateness

[Figure: diagram labeled expected refresh period, actual refresh period, and relative refresh lateness]

21

NCMIR experiments

• Traces (8 machines)– 8 hour work day on March 8th, 2001

• Ran simulations throughout day at 10 minute intervals

[Figure: time axis from 8:00 am to 4:00 pm]

22

Perfect Load Predictions

[Figure: mean relative refresh lateness (log scale, 10^0 to 10^4) vs. hours since 3/8/2001 - 8:00 PST (0 to 8), for wwa, wwa+cpu, wwa+bw, and AppLeS]

23

Imperfect Load Predictions

[Figure: mean relative refresh lateness (log scale, 10^0 to 10^4) vs. hours since 3/8/2001 - 8:00 PST (0 to 8)]


24

Synthetic Grids

• Bandwidth predictability

– Average prediction error

– pi ∈ {L, M, H}

– p1 p2 p3

• e.g. LMH

– 27 types

– 2510 Grids x 4 schedulers

– 10,040 simulations

[Figure: topology with cluster1, cluster2, cluster3, and the writer; links labeled p1, p2, p3]
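
The counts on this slide follow from simple arithmetic, assuming each of p1, p2, and p3 independently takes one of the three levels L, M, H. The short check below enumerates the Grid types and multiplies out the number of simulations; the variable names are illustrative.

```python
from itertools import product

LEVELS = "LMH"  # the three predictability levels named on the slide

# One Grid type per assignment of a level to (p1, p2, p3).
grid_types = ["".join(p) for p in product(LEVELS, repeat=3)]
num_grids = 2510
num_schedulers = 4
num_simulations = num_grids * num_schedulers

print(len(grid_types))      # 27
print("LMH" in grid_types)  # True
print(num_simulations)      # 10040
```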

25

Relative Scheduler Performance

[Figure: bar chart of number of runs (0 to 3000) in which each scheduler (wwa, wwa+cpu, wwa+bw, AppLeS) ranked 1st, 2nd, 3rd, or 4th. Values printed on the slide: 705.89, 658.91, 127.10, 1.07]

26

Partial Ordering

• Performance vs. bandwidth predictability

• Grid predictability

– Partial orders using p1 p2 p3

– Comparable/Not Comparable

• e.g. HML is comparable to HLL

• e.g. HLM is not comparable to LHM

• HHH, HHM, HMM, HLM, MLM, LLM, LLL
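
The comparability examples above are consistent with a component-wise order on (p1, p2, p3) with L < M < H, although the slide does not define the order explicitly. The sketch below assumes that reading; the function names are illustrative.

```python
RANK = {"L": 0, "M": 1, "H": 2}


def dominates(a, b):
    """True if Grid type a is at least as high as b in every position."""
    return all(RANK[x] >= RANK[y] for x, y in zip(a, b))


def comparable(a, b):
    """Two Grid types are comparable if one dominates the other."""
    return dominates(a, b) or dominates(b, a)


chain = ["HHH", "HHM", "HMM", "HLM", "MLM", "LLM", "LLL"]
is_chain = all(dominates(a, b) for a, b in zip(chain, chain[1:]))

print(comparable("HML", "HLL"))  # True
print(comparable("HLM", "LHM"))  # False
print(is_chain)                  # True
```

Under this assumed order, the sequence in the last bullet is a chain: each Grid type dominates the one after it.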

27

Example Partial Order

[Figure: relative refresh lateness (seconds, log scale, 10^0 to 10^4) for the Grid types HHH, HHM, HMM, HLM, MLM, LLM, LLL]


28

Tunability Experiments

• How useful is tunability?

– variability

• Fixed topology

– categorized traces

• L, M, H

– v1 v2 v3 v4 v5

– 243 Grid types

[Figure: topology with cluster1, cluster2, supercomputer, and the writer; labels v1 to v5]

29

Tunability Experiments

• Run over a 2 day period

– back-to-back

– assume single user model

• f, r, su

• Set of triples chosen

– T = {1,…,61}

[Figure: 3D plot with axes f, r, and su; one axis is scaled by 10^4]

30

Tunability Results

[Figure: fraction of changes (0 to 1) for each parameter: f, r, su]

• Count how many times a triple changed per 2-day simulation

• e.g.

– 12.9%

– 25.7%
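
One possible way to tally such changes is sketched below: given the sequence of triples chosen in back-to-back runs, it reports the fraction of consecutive runs in which each parameter changed. The slide does not say exactly how the fractions were computed, so this definition and the sample sequence are assumptions.

```python
def change_fractions(triples):
    """Fraction of consecutive runs in which f, r, and su each changed."""
    pairs = list(zip(triples, triples[1:]))
    names = ("f", "r", "su")
    if not pairs:
        return {name: 0.0 for name in names}
    return {
        name: sum(prev[i] != cur[i] for prev, cur in pairs) / len(pairs)
        for i, name in enumerate(names)
    }


# Hypothetical sequence of triples chosen in five back-to-back runs.
chosen = [(1, 2, 0), (1, 2, 0), (2, 2, 0), (2, 1, 0), (2, 1, 4)]
fractions = change_fractions(chosen)
print(fractions)
```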

31

Scheduling Latency

[Figure: number of experiments (0 to 7000) vs. seconds (0 to 10)]

• Time to search for feasible triples

• e.g.

– 88% under 1 sec

– 63% under 1 sec

32

Conclusions and Future Work

• Grid-enabled version of on-line parallel tomography

– Tunable application

• Tunability is useful in Grid environments

– User-directed AppLeS

• Importance of bandwidth predictability

– e.g. rescheduling

• Scheduling latency is nominal

• Production use
