netsolve

NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/netsolve

Upload: werner

Post on 15-Jan-2016


Page 1: NetSolve

NetSolve

Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve

Page 2: NetSolve

Objectives

Harnessing vast computational resources on the network
  Hardware
  Software

Convenient for the scientific computing community
  Reducing installation and programming overhead
  Masking complexity related to distributed computing

Page 3: NetSolve

Computation-Sharing Models: Proxy Computing

[Figure: client and server; labels: Data, Code, Data, Code]

Computation on the server

Page 4: NetSolve

Computation-Sharing Models: Code Shipping

[Figure: client and server; labels: Code, Data, Code]

Computation on the client

Page 5: NetSolve

Computation-Sharing Models: Remote Computation

[Figure: client and server; labels: Data, Data, Code]

Computation on the server

Page 6: NetSolve

Design issues

Platform independence to accommodate heterogeneity
User friendly
Extensibility
Load balancing
Fault tolerance

Page 7: NetSolve

NetSolve Architecture

[Figure: architecture diagram; surviving labels: “OS”, Resources]

Page 8: NetSolve

NetSolve Organization and Operation

Page 9: NetSolve

NetSolve Client Interface

C, Fortran, Java, Matlab, and Mathematica

>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
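The send / probe / wait pattern above can be mimicked locally. The following is a minimal sketch only, not the NetSolve client library: the function names `send`, `probe`, `wait` and `solve_linear_system` are hypothetical, and a thread pool stands in for the remote server.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Hypothetical stand-in for the computation a remote server would run.
    """
    n = len(b)
    # Build an augmented copy so the caller's data is left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x

# Assumption: a local thread pool plays the role of the server side.
_pool = ThreadPoolExecutor(max_workers=4)

def send(a, b):
    """Submit a request and return a handle immediately."""
    return _pool.submit(solve_linear_system, a, b)

def probe(request):
    """Report whether the result is ready, without blocking."""
    return request.done()

def wait(request):
    """Block until the result is available, then return it."""
    return request.result()

if __name__ == "__main__":
    request = send([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
    print("ready" if probe(request) else "not ready")
    print(wait(request))
```

Because `send` returns at once, a client can issue several requests before waiting on any of them, which is how a non-blocking interface exposes parallelism.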

Page 10: NetSolve

NetSolve Wrappers

Problem description file for extensibility

@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION
Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile

Compiled into wrappers around scientific libraries
XDR for platform-independent data transfer
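To illustrate what platform-independent transfer means in practice, here is a minimal sketch of XDR-style encoding for an array of doubles: a 4-byte big-endian element count followed by 8-byte big-endian IEEE 754 values. The function names are hypothetical, and this is not the NetSolve wire format, only the general idea of fixing byte order and field widths so that heterogeneous machines agree.

```python
import struct

def encode_double_array(values):
    """Encode a list of floats as count (uint32) + doubles, all big-endian."""
    return struct.pack(">I", len(values)) + struct.pack(f">{len(values)}d", *values)

def decode_double_array(payload):
    """Inverse of encode_double_array."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))

if __name__ == "__main__":
    wire = encode_double_array([1.0, -2.5, 3.25])
    print(len(wire), decode_double_array(wire))
```

The explicit `>` prefix forces big-endian layout regardless of the host CPU, so sender and receiver never depend on each other's native representation.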

Page 11: NetSolve

NetSolve Load Balancing

Assigning a task to the “best” machine
  Establishing a performance model
    Network delay, server properties, task properties
  Measuring and monitoring dynamic system states

Load balancing at a finer granularity
  Parallelism through non-blocking interface
  Task migration
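One way to picture a performance model built from network delay, server properties, and task properties is the sketch below. The formula (transfer time plus compute time on the unloaded share of the CPU) and every field name are assumptions made for illustration; the slides do not give the model NetSolve actually uses.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float      # network latency to the server, in seconds
    bandwidth_bps: float  # network bandwidth, in bytes per second
    peak_flops: float     # peak floating-point rate of the server
    load: float           # fraction of the CPU already in use, 0.0 <= load < 1.0

def estimated_time(server, data_bytes, work_flops):
    """Assumed model: time to ship the data plus time to compute on the free CPU share."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bps
    compute = work_flops / (server.peak_flops * (1.0 - server.load))
    return transfer + compute

def pick_best(servers, data_bytes, work_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, work_flops))

if __name__ == "__main__":
    servers = [
        Server("fast-far", 0.05, 1e6, 1e9, 0.5),
        Server("slow-near", 0.001, 1e8, 2e8, 0.0),
    ]
    print(pick_best(servers, data_bytes=1e6, work_flops=1e8).name)
    print(pick_best(servers, data_bytes=1e3, work_flops=1e10).name)
```

The two calls show why the “best” machine depends on the task: a data-heavy request favours the nearby server, while a compute-heavy one favours the faster server even when it is busy and far away.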

Page 12: NetSolve

NetSolve Fault Tolerance

Inter-server fault tolerance
  Fault tolerance among NetSolve servers

Intra-server fault tolerance
  Fault tolerance within a NetSolve server

Page 13: NetSolve

NetSolve Fault Tolerance: Inter-server Fault Tolerance

Performed by NetSolve agents
Basic approach
  Failure detection + task reallocation
  Overload detection + task migration

Introducing NetSolve storage servers
  Store checkpoints or any information related to fault tolerance (must be platform-independent)
  No reliance on failed or overloaded server for task migration
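The "failure detection + task reallocation" approach can be sketched as a retry loop over a ranked list of servers. Everything here is hypothetical (the `ServerFailure` exception, the callable servers, the function names); it only illustrates the control flow an agent might follow, not NetSolve's implementation.

```python
class ServerFailure(Exception):
    """Raised by a (simulated) server that cannot complete a task."""

def run_with_reallocation(task, servers):
    """Try servers in ranked order; on failure, reassign the task to the next one.

    Returns the result and the names of the servers that failed along the way.
    """
    failed = []
    for server in servers:
        try:
            return server(task), failed
        except ServerFailure:
            failed.append(server.__name__)  # failure detected, move on
    raise RuntimeError("all servers failed")

# Simulated servers, for illustration only.
def crashed_server(task):
    raise ServerFailure("no response")

def healthy_server(task):
    return sum(task)

if __name__ == "__main__":
    result, failed = run_with_reallocation([1, 2, 3], [crashed_server, healthy_server])
    print(result, failed)
```

The client never sees the failure: it gets the result from whichever server eventually completes the task.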

Page 14: NetSolve

NetSolve Fault Tolerance: Intra-server Fault Tolerance

Not a new problem
Could be invisible to NetSolve
Can take advantage of platform-specific features for fault tolerance
Possible integration with inter-server fault tolerance

Page 15: NetSolve

Diskless Checkpointing: Checksums and Reverse Computation

Diskless checkpointing eliminates the need for stable storage
N servers + a checkpointing server
  At any point, consistent checkpoints taken at N servers (stored in memory)
  A checksum of checkpoints stored at the checkpointing server
Rollback using reverse computation
State recovery using the checksum
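State recovery from a checksum can be illustrated with XOR parity, one common choice of checksum for diskless checkpointing. The slides do not say which checksum is used, so the parity scheme and the function names below are assumptions. The sketch covers only recovery of one lost checkpoint, not rollback by reverse computation.

```python
from functools import reduce

def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def make_checksum(checkpoints):
    """Combine the N in-memory checkpoints into one parity block.

    Assumes all checkpoints have the same length.
    """
    return reduce(xor_bytes, checkpoints)

def recover(surviving, checksum):
    """Rebuild the single lost checkpoint from the survivors and the parity block."""
    return reduce(xor_bytes, surviving, checksum)

if __name__ == "__main__":
    checkpoints = [b"abcd", b"wxyz", b"1234"]   # state held in memory at N = 3 servers
    parity = make_checksum(checkpoints)         # held by the checkpointing server
    lost = recover([checkpoints[0], checkpoints[2]], parity)
    print(lost)
```

Since XOR is its own inverse, XORing the parity block with the N-1 surviving checkpoints cancels them out and leaves exactly the missing one, so a single server failure needs no stable storage to recover from.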

Page 16: NetSolve

Applications

MCell with NetSolve
  Large code, small data

Matlab with NetSolve
  Tradeoffs between parallelism and overhead

IPARS with NetSolve
ImageVision with NetSolve

Page 17: NetSolve
Page 18: NetSolve

Integration with ScaLAPACK

Page 19: NetSolve

Integration with Condor

Page 20: NetSolve

Integration with Ninf

Page 21: NetSolve

Conclusion

An interesting infrastructure for sharing computational resources
  Both software and hardware

Convenience, performance, and reliability
Playground for fault tolerance
  Both general and specific