
Page 1: Dorian Arnold Computer Science Department University of ...icl.cs.utk.edu/20/presentations/Dorian_Arnold.pdfScalable Systems Lab PhD Student, U. of Wisconsin, ’01 -’08 Scalable,

Department of Computer Science

Dorian Arnold

Computer Science Department

University of New Mexico


Scalable Systems Lab

Life Before ICL


My ICL Tenure

January 1999 – August 2001

NetSolve Project

ICL Collaborators:

◦ Henri, Sudesh, Sathish, Jakob, Thara, Michelle, Dieter, Keith S., Tsinghua, Victor, Susan, Shirley, Keith M., Ganapathy, Nathan, David


Life After ICL

PhD Student, U. of Wisconsin, ’01 - ’08
◦ Scalable, reliable communication infrastructure
◦ Scalable, lightweight tools and applications

Asst. Professor, U. of New Mexico, since ’09
◦ Carry-over from dissertation work (naturally)
◦ Autonomous Infrastructure
◦ Virtualization for HPC
◦ Thin Computing


The Method behind the Madness

What is HPC?
◦ It depends … on who you ask
◦ Computer architect? OS researcher? Applied mathematician? (Dead-end career) Domain scientist? Distributed systems researcher!

Many domains of expertise
◦ One should not need cross-domain expertise to use HPC resources effectively


My Research Foci

Make HPC systems easier to use without sacrificing performance and reliability


Tree-based Overlay Networks


We need Scalable Tools and Applications

[Figure: one application front-end (FE) connected directly to eight application back-ends (BE)]

Data management
◦ Large volumes

Data analysis
◦ Centralized analysis leads to computational bottlenecks

Many resources to manage
◦ E.g. control channels

Doesn’t Scale!


Key TBŌN Abstractions

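[Figure: TBŌN trees with a front-end (FE) at the root, communication processes (CP) as internal nodes, and back-ends (BE) as leaves; one tree spans all eight back-ends and a second spans a sub-group of five. Labels: Packet Filter, Filter State]

Filters
◦ Executed by processes
◦ Persistent state

Channels
◦ Reliable
◦ Order-preserving

Streams
◦ Define sub-groups
◦ Distinguish dataflows
◦ Specify filter routine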
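The abstractions above can be sketched in a few lines of C++. This is only an illustrative model, not MRNet code: `Node`, `Filter`, `build_tree`, `reduce_upstream`, and `max_filter` are hypothetical names, packets are plain integers, and filter state and channels are left out. It shows one wave of data moving from the back-ends to the front-end, with every internal process applying the stream's filter to what its children sent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the slide's abstractions; these are not MRNet types.
using Packet = int;
// A filter reduces the packets arriving from a node's children to one packet.
using Filter = std::function<Packet(const std::vector<Packet>&)>;

struct Node {
    std::vector<std::unique_ptr<Node>> children;  // empty for a back-end (BE)
    Packet local_value = 0;                       // data produced by a BE
};

// One upstream wave: each internal node (CP or FE) applies the stream's
// filter to the packets received from its children.
Packet reduce_upstream(const Node& node, const Filter& filter) {
    if (node.children.empty()) return node.local_value;
    std::vector<Packet> inputs;
    for (const auto& child : node.children)
        inputs.push_back(reduce_upstream(*child, filter));
    return filter(inputs);
}

// Build a roughly balanced tree with the given fan-out over values[begin, end).
std::unique_ptr<Node> build_tree(const std::vector<Packet>& values,
                                 std::size_t begin, std::size_t end,
                                 std::size_t fanout) {
    assert(fanout >= 2 && begin < end);
    auto node = std::make_unique<Node>();
    std::size_t span = end - begin;
    if (span == 1) {
        node->local_value = values[begin];
        return node;
    }
    std::size_t parts = std::min(fanout, span);
    for (std::size_t i = 0; i < parts; ++i) {
        std::size_t lo = begin + span * i / parts;
        std::size_t hi = begin + span * (i + 1) / parts;
        node->children.push_back(build_tree(values, lo, hi, fanout));
    }
    return node;
}

// Example filter routine: integer maximum.
Packet max_filter(const std::vector<Packet>& in) {
    assert(!in.empty());
    return *std::max_element(in.begin(), in.end());
}
```

Calling `reduce_upstream(*build_tree(values, 0, values.size(), 2), max_filter)` gives the same answer as a flat scan over `values`, but no single process ever handles more than `fanout` packets per wave.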


The Multicast/Reduction Network

MRNet is our prototype TBŌN
◦ Developed by Arnold, Roth and Miller

Used by many research installations
◦ Paradyn (University of Wisconsin)
◦ Stack Trace Analysis Tool (LLNL)
◦ TauOverMRNet (University of Oregon)
◦ TBON-FS (University of Wisconsin)
◦ Image Analysis (University of Wisconsin)
◦ CEPBA-Tools (Universitat Politècnica de Catalunya)
◦ Open|SpeedShop (Krell Institute)
◦ TotalView (TotalView Tech.)

Ongoing collaborations: RENCI, Juelich
◦ Scalable Tool Communication Infrastructure?


Stack Trace Analysis Tool: Extreme Scale Debugging

STAT is a lightweight debugging aid that uses stack traces to classify process equivalence and profile the application.

Thousands of tasks reduce to a few classes.

Analyze representatives with a full debugger.

Temporal analysis determines tasks’ relative progress.

Goal: Scale up to machine sizes and scale down the information
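The idea of reducing thousands of tasks to a few classes can be sketched as follows. This is a simplified illustration rather than STAT's implementation: it assumes two tasks are equivalent only when their stack traces match exactly, and `classify`, `merge`, and `representatives` are hypothetical names. `classify` plays the role of a daemon grouping its local tasks, and `merge` plays the role of a filter combining the classes reported by two subtrees.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using StackTrace = std::vector<std::string>;             // outermost frame first
using Classes = std::map<StackTrace, std::vector<int>>;  // trace -> task ranks

// Daemon side (assumed): group local tasks whose stack traces are identical.
Classes classify(const std::vector<std::pair<int, StackTrace>>& samples) {
    Classes classes;
    for (const auto& sample : samples)
        classes[sample.second].push_back(sample.first);
    return classes;
}

// Filter side (assumed): merge the classes reported by two subtrees.
Classes merge(Classes a, const Classes& b) {
    for (const auto& entry : b) {
        auto& ranks = a[entry.first];
        ranks.insert(ranks.end(), entry.second.begin(), entry.second.end());
    }
    return a;
}

// Pick one representative rank per class to hand to a full debugger.
std::vector<int> representatives(const Classes& classes) {
    std::vector<int> reps;
    for (const auto& entry : classes) {
        assert(!entry.second.empty());
        reps.push_back(entry.second.front());
    }
    return reps;
}
```

The amount of data that reaches the front-end grows with the number of distinct traces, not with the number of tasks, which is the "scale down the information" half of the goal.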


Stack Trace Analysis Tool

[Figure: STAT architecture on an MRNet tree. Labels: Application Processes (MPI), STAT Daemon (BE), MRNet Communication Process (CP), STAT Filter, STAT Front-end (FE)]


STAT Performance on BlueGene/L

[Figure: merge latency (seconds, 0 to 0.6) versus number of application processes (0K to 208K) for 1-deep, 2-deep, and 3-deep trees in VN mode]

Zero to 212,992 in 0.4!


Observations and Recovery Approach

Explicit state replication is expensive
◦ Use inherent information redundancy

Strong data consistency is expensive
◦ Use weak data consistency

Global coordination is expensive
◦ Use localized protocols to satisfy global requirements

Information must be disseminated globally
◦ Use TBŌN for efficient, scalable dissemination


State Compensation

Compensate for lost state using inherently redundant information from surviving processes
◦ Avoid overhead of explicit data replication

State composition
◦ Lightweight mechanism for idempotent aggregations

State decomposition
◦ For non-idempotent aggregations
◦ Requires two coordination phases

No overhead in the absence of failures

O(log(N)) processes participate in failure recovery
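A minimal sketch of state composition, assuming the aggregation is set union (an idempotent operation): when an internal process fails, each orphaned child forwards its own filter state to the process that adopts it, and the adopter merges that state into its own. Elements the adopter had already seen are absorbed harmlessly, so no replica of the failed process is needed. The names `compose` and `recover` are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using State = std::set<int>;

// Idempotent aggregation: merging an element that is already present
// leaves the state unchanged.
State compose(State parent, const State& child) {
    parent.insert(child.begin(), child.end());
    return parent;
}

// Recovery (assumed protocol): the adopting process composes each orphan's
// filter state with its own.
State recover(State adopter, const std::vector<State>& orphan_states) {
    assert(!orphan_states.empty());
    for (const State& s : orphan_states)
        adopter = compose(std::move(adopter), s);
    return adopter;
}
```

For example, an adopter holding {1,3} that takes in orphans holding {1,5} and {1,8,9} ends up with {1,3,5,8,9}, and composing {1,5} a second time changes nothing. A non-idempotent aggregation such as a sum would double-count the resent values, which is why it needs the heavier state decomposition mechanism.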


State Composition Example

[Figure sequence over six slides: the integer-union TBŌN, annotated with each process's filter state, shown side by side for a failed and a non-failed execution. The internal process holding {1,5,8} fails; its orphaned children, holding {1,5} and {1,8,9}, forward their filter states (1,5 and 1,8,9) up the tree. In both executions the root's state grows from {1,3} to {1,3,4,5,7,8,9}.]

Use orphans’ states to compensate for failure

Output stream converges to that of non-failed execution


Robust, Scalable Data Aggregation

[Figure: Failure Recovery Latency. Recovery latency (milliseconds, 0 to 90) versus fan-out at the failed process for the FIMS, MAX, and AVG filters; series: l(overall), l(new_parent), l(connect), l(compensate), l(cleanup)]

[Figure: Application Throughput as Failures are Injected. Throughput (packets/second, 7 to 15) versus time (0 to 180 seconds)]


Autonomous Middleware

Goal
Efficient, scalable systems from:
◦ System-agnostic applications
◦ Application-agnostic systems

[Figure: two plots, system ease-of-use versus system knowledge in the application, and system generality versus application knowledge in the system]

Approach: Dynamic, Autonomous Operation
◦ Self-configuring: Automatic TBŌN topology configuration
◦ Self-monitoring: TBŌN health and performance
◦ Self-healing: TBŌN fault tolerance and failure recovery
◦ Self-optimizing: Dynamic TBŌN reconfiguration to improve performance

Challenges
• Reliable service at scale
• Choosing the “best” TBŌN topologies?
– Load and system characteristics may vary over time
• Online improvement of TBŌN performance?
– Throughput, latency, resource consumption, startup costs, …
• Flexible, elegant solution space

[Figure: control loop of Monitoring, Detecting, Deciding, and Acting between Sensors and Effectors; edge labels: Events, Symptoms, Decisions, Actions]


Problem Statement: Application Efficiency and Scalability

[Figure: two plots, system ease-of-use versus necessary system knowledge in the application, and system generality versus necessary application knowledge in the system]

1. How much system-specific knowledge does the application (developer) need?

2. How much application-specific knowledge does the system (developer) need?

3. How far can we get answering “NONE” and “NONE”?


The Approach: An Autonomous TBŌN Infrastructure

TBŌN Autonomy, aka the self-* properties:
• Self-configuring
– Automatic TBŌN topology configuration
• Self-monitoring
– TBŌN health and performance
• Self-healing
– TBŌN fault tolerance and failure recovery
• Self-optimizing
– Dynamic TBŌN reconfiguration to improve performance


Research Challenges

• How can we provide a reliable TBŌN service in the presence of failures?

• How do we choose the “best” TBŌN topologies?
– Application load and system characteristics may vary over time

• How can we dynamically improve TBŌN performance?
– Throughput, latency, resource consumption, startup costs, …

• Can we design a flexible, elegant solution space?


“Performance Failures”

• What is a performance failure?
– Generally, a sub-optimal topology, or
– Realizing (much) less than optimal performance
  • Data aggregation latency and throughput
  • Resource under-utilization
  • Imbalanced topologies
– Per application?
– Per flow/stream?


Per Flow Topologies

• “best” topology depends upon
– Participating end-points
– Data aggregation operation
– Application data rate
– …

• “best” is different for different streams!
– How can we efficiently enable different topologies for different flows?


TBŌN Components for Autonomy

[Figure: control loop of Monitoring, Detecting, Deciding, and Acting between Sensors and Effectors; edge labels: Events, Symptoms, Diagnosis, Decisions, Actions]

Sensors: hw/sw characteristics, runtime events, etc.

Monitoring: collecting/correlating events to identify patterns and symptoms, e.g. threshold checking, etc.

Detecting: evaluating symptoms to determine whether problems exist and action is necessary, e.g. do we have a bottleneck?

Deciding: determining how best to modify the topology

Acting: effecting the recommended topology changes

Key Challenges:
• Decentralization
• Low (background) overhead
• Rapid execution
• Must provide more benefits than drawbacks!
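One pass through the monitoring, detecting, and deciding stages might look like the sketch below. Everything in it is an assumption made for illustration: the event type (a queue length per node), the two thresholds, and the `ReduceFanOut` action are hypothetical and are not taken from MRNet or from the slides.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Every name and threshold below is a hypothetical illustration.
struct Event { int node; double queue_length; };         // emitted by a sensor
struct Symptom { int node; double mean_queue_length; };  // produced by monitoring
enum class Action { None, ReduceFanOut };                // applied by an effector

// Monitoring: correlate one node's events and apply a threshold check.
std::optional<Symptom> monitor(int node, const std::vector<Event>& events,
                               double threshold) {
    double sum = 0.0;
    int count = 0;
    for (const Event& e : events) {
        if (e.node == node) { sum += e.queue_length; ++count; }
    }
    if (count == 0 || sum / count <= threshold) return std::nullopt;
    return Symptom{node, sum / count};
}

// Detecting: decide whether the symptom amounts to a bottleneck.
bool is_bottleneck(const Symptom& s, double severe) {
    return s.mean_queue_length >= severe;
}

// Deciding: pick a topology change for the acting stage to carry out.
Action decide(bool bottleneck) {
    return bottleneck ? Action::ReduceFanOut : Action::None;
}

// One pass of the loop for a single node.
Action control_step(int node, const std::vector<Event>& events) {
    const double threshold = 5.0, severe = 15.0;  // assumed values
    assert(threshold <= severe);
    std::optional<Symptom> symptom = monitor(node, events, threshold);
    if (!symptom) return Action::None;
    return decide(is_bottleneck(*symptom, severe));
}
```

Because `control_step` looks only at one node's events, it can run locally at each process, which is one way to approach the decentralization and low-overhead challenges listed above.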


Example TBŌN Reductions

Simple
◦ Min, max, sum, count, average
◦ Concatenate

Complex
◦ Clock synchronization [Roth, Arnold, Miller ’03]
◦ Time-aligned aggregation [Roth, Arnold, Miller ’03]
◦ Vision algorithms [Arnold, Pack, Miller ’06]
◦ Graph Analysis [Arnold et al. ’07], [Roth, Miller ’05]
◦ Equivalence relations
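Of the simple reductions, average is the one that cannot be computed by applying the same operation at every level: an average of sub-tree averages is wrong whenever the sub-trees differ in size. A common remedy, sketched here with hypothetical names (`Partial`, `leaf`, `combine`, `finish`), is to carry a (sum, count) pair up the tree and divide only at the front-end.

```cpp
#include <cassert>
#include <vector>

// Partial aggregate carried between tree levels.
struct Partial {
    double sum;
    long count;
};

// A back-end contributes one sample.
Partial leaf(double x) { return Partial{x, 1}; }

// Filter at an internal process: add up the children's partial aggregates.
Partial combine(const std::vector<Partial>& in) {
    Partial out{0.0, 0};
    for (const Partial& p : in) {
        out.sum += p.sum;
        out.count += p.count;
    }
    return out;
}

// Front-end: turn the final partial aggregate into the average.
double finish(const Partial& p) {
    assert(p.count > 0);
    return p.sum / static_cast<double>(p.count);
}
```

With samples 1, 2, 3 under one internal process and 10 under another, the combined pair is (16, 4) and the average is 4, whereas averaging the two sub-tree averages (2 and 10) would give 6.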


Integer Union Example

[Figure sequence over seven slides: integers enter at the leaves of a TBŌN whose processes each run a union filter. Every filter state starts empty ({ }); each process adds incoming integers to its state and forwards only the values it has not seen before (for example 1,3 and later 4,5,8). The root's state grows to {1,3,4,5,7,8,9}.]
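The behaviour in this example can be sketched as a filter with persistent state: it remembers every integer it has already forwarded and passes along only new ones. `UnionFilter` is a hypothetical class written for illustration and does not use the MRNet filter signature.

```cpp
#include <cassert>
#include <set>
#include <vector>

class UnionFilter {
public:
    // Add incoming integers to the persistent state and return only the
    // values that were not already present, in arrival order.
    std::vector<int> operator()(const std::vector<int>& incoming) {
        std::vector<int> fresh;
        for (int v : incoming) {
            if (seen_.insert(v).second) fresh.push_back(v);
        }
        return fresh;
    }

    // The filter state, as drawn next to each process in the figures.
    const std::set<int>& state() const { return seen_; }

private:
    std::set<int> seen_;  // persists across invocations
};
```

A process whose children first deliver 1 and 3 forwards 1,3; when they later deliver 3, 4, 1, 5 it forwards only 4,5 and its state becomes {1,3,4,5}, matching the intermediate states shown in the example.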


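MRNet Front-end Interface

// Simplified front-end from the slide; see the MRNet documentation for exact signatures.
void front_end_main() {
    int result;  // receives the aggregated maximum

    Network * net = new Network(topology_file);  // instantiate the TBŌN

    Communicator * comm = net->get_BroadcastCommunicator();  // all back-ends

    // Stream over all back-ends, integer-max filter, wait for every child.
    Stream * stream = new Stream(comm, IMAX_FILT, WAITFORALL);

    stream->send("%s", "go");     // multicast the request downstream

    stream->recv("%d", &result);  // one reduced reply arrives upstream
}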


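MRNet Back-end Interface

#include <cstring>  // strcmp

// Simplified back-end from the slide; see the MRNet documentation for exact signatures.
void back_end_main() {
    Stream * stream;
    Packet * p;
    char * s;
    int tag;

    Network * net = new Network();

    net->recv(&tag, &p, &stream);  // wait for a packet from the front-end
    p->unpack("%s", &s);

    if (strcmp(s, "go") == 0) {    // compare string contents, not pointers
        stream->send("%d", rand_int);
    }
}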


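MRNet Filter Interface

// Simplified integer-max filter from the slide; get_int() is the slide's accessor.
void imax_filter(vector<Packet> & packets_in,
                 vector<Packet> & packets_out) {
    int result = INT_MIN;  // from <climits>
    for (size_t i = 0; i < packets_in.size(); i++) {
        result = max(result, packets_in[i].get_int());
    }

    Packet p("%d", result);

    packets_out.push_back(p);  // passed by reference so the caller sees the output
}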