Page 1: Distributed Software Systems: Cyberinfrastructure and Geoinformatics Chaitan Baru

Distributed Software Systems: Cyberinfrastructure and Geoinformatics

Chaitan Baru

San Diego Supercomputer Center

Integrated Cyberinfrastructure System
Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee

[Diagram: layered stack with side labels]
• Layers: Applications (Geosciences, Environmental Sciences, Neurosciences, High Energy Physics, …); Development Tools & Libraries; Middleware Services; Hardware
• Groupings: Domain-specific Cybertools (software); Shared Cybertools (software); Distributed Resources (computation, storage, communication, etc.)
• Side labels: Education and Training; Discovery & Innovation

Community Cyberinfrastructure Projects
Adapted from: Prof. Mark Ellisman, UC San Diego

[Diagram: shared layers beneath a set of science domains]
• Science domains: Biomedical Informatics (BIRN); High Energy Physics (GriPhyN); Geosciences (GEON); Ecological Observatories (NEON); Earthquake Engineering (NEES); Ocean Observing (ORION)
• Layers: Your Specific Tools & User Apps.; Friendly Work-Facilitating Portals (Authentication - Authorization - Auditing - Workflows - Visualization - Analysis); Development Tools & Libraries; Middleware Services; Distributed Computing, Instruments and Data Resources; Hardware
• Side labels: Shared Tools; Science Domains

Data, Tools, & Computation
• Data
– Field observations
– Laboratory analyses
– Sensor-based data (land, airborne, satellite)
• Tools
– QA/QC, simple transformations and analyses
– Complex models
• Computation
– Community codes
– Access to high-performance computing
– Data Intensive Computing

Variety of Geoinformatics Efforts
• Data collection
– Digital data collection in the field
– When does it become “cyberinfrastructure”?
• Database curation
– E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc.
– When does it become “tools” and “community codes”?
• Software Development
– Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, …
– Community codes: SCEC-CME, CIG, …

Variety of Geoinformatics Efforts
• High Performance Computing
– LiDAR data management
– Seismic analyses
– Petascale initiative
• Data Integration
– E.g. CUAHSI HIS
– Also, a pressing need in projects like EarthScope

Cyberinfrastructure
To provide access to all of these “resources” and support “interoperability” among them

[Diagram: Cyberinfrastructure: The Common Platform Across Distributed Projects]
• Data Collection
• Data Management and Curation
• Tool Development
• Modeling and Integration

Example: USArray Data Flow
• Deploy field sensor arrays
– Across US
• Collect data from sensor arrays and perform QA/QC
– One of the sites is SIO, San Diego
• Archive data for community access
– IRIS, Seattle
EarthScope/USArray: Single project, multiple participants.

Example: LiDAR Workflow
Courtesy: Chris Crosby, ASU
D. Harding, NASA

[Diagram: workflow stages] Survey; Point Cloud (x, y, z, …); Interpolate / Grid; Analyze / “Do Science”

Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, …

GEON Cyberinfrastructure
• Funded by NSF IT Research program
• Multi-institution collaboration between IT and Earth Science researchers
• GEON Cyberinfrastructure provides:
– Authenticated access to data and Web services
– Registration of data sets, tools, and services with metadata
– Search for data, tools, and services, using ontologies
– Scientific workflow environment and access to HPC
– Data and map integration capability
– Scientific data visualization and GIS mapping

Key Informatics Areas
• Portals
– Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, …
• Data Integration
– Search, discovery and integration of data from heterogeneous information sources (“mediation” and “semantic integration”)
• Use of workflow systems, and access to HPC
– Ability to “program” at a higher level of abstraction
– Sharing of models, along with “provenance” information
– Gateways to HPC environments
• Management of Geospatial Information
– Using GIS capabilities, map services, geospatial data integration
• Visualization of 3D, 4D geospatial data and information

Distributed System Definition
• A Distributed System is
– one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet
• A Distributed Database System is
– one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently

Distributed System Models
• Client – Server
[Diagram labels: Client A, Client B, Client C; Network; Server 1; arrows: invocation, response]
• Peer to Peer
[Diagram labels: Process 1, Process 2, Process 3; Network]
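The invocation/response exchange of the client-server model can be sketched in a few lines. This is only an illustration, not something from the slides: the JSON message format and the `add` operation are assumptions, and client and server run in one process over a local socket pair rather than across a real network.

```python
import json
import socket
import threading


def serve_once(conn):
    """Server side: read one request message and send back one response."""
    with conn:
        request = json.loads(conn.recv(4096).decode("utf-8"))
        # Hypothetical operation set: only "add" is understood.
        result = sum(request["args"]) if request["op"] == "add" else None
        conn.sendall(json.dumps({"result": result}).encode("utf-8"))


def invoke(op, args):
    """Client side: send an invocation message and wait for the response."""
    client_sock, server_sock = socket.socketpair()
    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()
    with client_sock:
        message = json.dumps({"op": op, "args": args}).encode("utf-8")
        client_sock.sendall(message)
        response = json.loads(client_sock.recv(4096).decode("utf-8"))
    worker.join()
    return response["result"]
```

The two sides share no state; everything they know about each other travels in the two messages, which is the point of the "only by passing messages" definition.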

Remote Service Invocation
• TCP/IP
– Basic Internet protocol for computer communications
– Platform for building a number of other open or proprietary, “higher-level” communications protocols
• Communication at a higher level of abstraction
• http
– Open protocol based on TCP/IP for the Web
– Fixed set of “verbs” (actions) used to transfer HTML documents
• CORBA, Java RMI
– Protocols based on an object model

SDSC Storage Resource Broker: “Virtualizing” storage
http://www.sdsc.edu/srb

[Diagram labels]
• SRB; MCAT
• Archives: HPSS, ADSM, UniTree, DMF
• Databases: DB2, Oracle, Sybase
• File Systems: Unix, NT, Mac OSX
• Metadata: Dublin Core; Resource, Mthd, User; User Defined; Application Meta-data
• Remote Proxies; DataCutter; Metadata Extraction
• Interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Web; Prolog Predicate
• User

SRB Client/Server Model
[Diagram labels: SRB Client; Network; SRB Server; SRB Server B; SRB peer-to-peer protocol; Oracle Client; Oracle Server; HPSS Client; HPSS Server]
Data are requested using an SRB ID and a “file abstraction” (open, close, read, write)

OpenDAP
• Client/Server model
[Diagram labels: OpenDAP Clients; Network; OpenDAP Servers]

OpenDAP
From: Peter Cornillon & Jim Gallagher, http://www.opendap.org/support/stennis_tutorial.html

[Diagram labels]
• Servers and data formats: Matlab, HDF4, JDBC, FreeForm, FITS, CDF, CEDAR, netCDF, DSP, JGOFS, CODAR, ESML, General; Tables, SQL, Flat Binary
• Clients: netCDF C, netCDF Java, IDV, Ferret, GrADS, VisAD, ncBrowse, Matlab, Excel, IDL, Access, Matlab Client, IDL Client

OpenDAP Data Request
• Data are requested with a URL.
• http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst?sst[10:10][0:90][0:180]
• URL parts, in order: Protocol, Machine name, OPeNDAP server, Directory, File name, followed by the Constraint (?sst[10:10][0:90][0:180])
• User can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL
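Appending a constraint expression to a dataset URL is plain string assembly. The sketch below only mirrors the URL shape shown on the slide; the function names are hypothetical, and it does not contact a server or cover the rest of the OPeNDAP protocol.

```python
def constrained_url(dataset_url, variable, index_ranges):
    """Append a constraint selecting index ranges of one variable."""
    constraint = variable + "".join(
        f"[{first}:{last}]" for first, last in index_ranges
    )
    return f"{dataset_url}?{constraint}"


def split_request(url):
    """Split a request URL into the dataset URL and its constraint."""
    dataset_url, _, constraint = url.partition("?")
    return dataset_url, constraint
```

For the slide's example, the variable is `sst` and the three index ranges are `(10, 10)`, `(0, 90)` and `(0, 180)`.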

Remote Service Invocation with Web Services
• A Web Service is a simple protocol for invoking remote services on the Web. It is:
– A network “endpoint”, i.e. server, that implements one or more “ports”.
• Each port is defined by the message types it accepts and the messages it returns.
– Specified by a “Web Services Description Language” (WSDL) XML document.
• Given the WSDL for a web service you know all you need to interact with it.
• Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow.
– It is the basis for MS .NET, IBM Websphere, SUN, Oracle, BEA, HP, …
– It is the basis for the new Grid standards like WSRF and OGSA.

Web Site vs Web Service
From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Web Site
– Designed to pass http get/post/put requests between a browser and a web server.
– Google has a web site.
• Web Service
– Designed for services to talk to other services by exchanging XML messages
– Google also provides a web service so Google may be used in distributed apps
[Diagram labels: Client’s Browser; Web Server; Web Service (three instances)]

Grid Services
From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004
• Grid: A distributed, heterogeneous set of resources
– Integrated by a pervasive layer of services
– Goal: allow users to view it as a single system
• More than the Internet (which forms part of the resource layer)
• Builds on the Web by building on web services

[Diagram labels]
• Open Grid Service Architecture Layer: Security; Data Management Service; Accounting Service; Logging; Event Service; Policy; Administration & Monitoring; Grid Orchestration; Registries and Name binding; Reservations and Scheduling
• Web Services Resource Framework – Web Services Notification
• Physical Resource Layer

Access Interfaces and Levels of Access
• Web service, native application program interface, ODBC/JDBC, filesystem

[Diagram: levels of access and how each is exposed]
– filesystem: Mount remote filesystems
– DBMS: Expose ODBC/JDBC interface (and full SQL)
– Web Server “stack”: URLs and http
– SOAP server stack: WSDL and SOAP
– Application Program: Application can also be “wrapped” as a Web Service
– SRB, OpenDAP, etc…

Authentication
• Client – Server models
[Diagram labels: User; Client A (Client-side authentication); Network; Server 1 (Server-side authentication); Server 2 and Server 3, each marked “?”]

Common Authentication
[Diagram labels: Certificate Authority; Client; Obtain Credentials; Invoke with Credentials; Verify Credentials; Server 1; Server 2; Server 3]

Grid Account Management Architecture (GAMA): Single sign-on in GEON (also used in a number of other projects)
Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra

[Diagram labels]
• GAMA server: CACL; Myproxy; CAS; OGSA Grid services wrapper; Servlet container; DB
• Portal server 1; Portal server 2: gridportlets; gama; GridSphere; Servlet container; Java keystore (two instances)
• Stand-alone applications
• Arrows: create user; import user; retrieve credential (two instances)

Systems Issues
• Load Balancing, Failover, Replication
[Diagram labels: Client; Server 1; Server 2; Server 3]
– Multiple servers for load balancing, failover
– Data replication
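One simple way to combine load balancing and failover on the client side is round-robin selection that skips servers that fail. The sketch below is an assumption-laden illustration, not a scheme described on the slide: servers are modelled as plain Python callables, `make_server` is a hypothetical stand-in for a network call, and a failure is signalled by `ConnectionError`.

```python
class FailoverClient:
    """Round-robin over replicated servers, skipping any that fail."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.next_index = 0  # where the next request starts

    def call(self, request):
        last_error = None
        for offset in range(len(self.servers)):
            index = (self.next_index + offset) % len(self.servers)
            try:
                result = self.servers[index](request)
            except ConnectionError as error:
                last_error = error  # failover: try the next server
                continue
            # Load balancing: the next request starts after this server.
            self.next_index = (index + 1) % len(self.servers)
            return result
        raise ConnectionError("all servers failed") from last_error


def make_server(name, up=True):
    """Hypothetical stand-in for a remote server."""
    def handle(request):
        if not up:
            raise ConnectionError(f"{name} is down")
        return f"{name} handled {request}"
    return handle
```

This only works for read requests if every server can answer them, which is where data replication comes in.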

Distributed Data Access
• What is the issue?
• Ability to access data stored in multiple, different databases using a single request, e.g.
– Get geologic information from multiple geologic databases
– Get employee information from all branches
• Ability to update data stored in multiple databases, e.g.
– Transfer salary amount from University to my bank account
– Transfer funds from Visa account to vendor’s account

Distributed data access
[Diagram labels: Client; Database 1; Database 2; Database 3]
• Homogeneous: mySQL, mySQL, mySQL
• Heterogeneous: mySQL, Oracle, DB2
• mySQL, Excel, ASCII flat file
How about creating a “cached” local copy?
Sources may be data repositories or metadata catalogs

Data Warehousing
[Diagram labels: Client; Data Warehouse (common schema); Data Source 1, Data Source 2, Data Source 3, each connected to the warehouse by ETL]
• ETL
– Extract
– Transform
– Load
1. Load data from sources to warehouse
2. Query processing interaction only between client and warehouse
But, warehouse data could be “stale”, i.e. out of synch with source data…
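The two warehousing steps, and the staleness problem, can be seen in a small sketch. Everything here is assumed for illustration: the two in-memory "sources", their field names, the `Samples` warehouse table, and the use of SQLite as the warehouse.

```python
import sqlite3

# Hypothetical sources, each with its own local field names.
SOURCE_1 = [{"SampleID": "S1-001", "Rock_Type": "Granite"}]
SOURCE_2 = [{"id": "S2-017", "rock": "basalt"}]


def extract_transform(rows, id_key, rock_key):
    """Extract rows from one source and transform them to the common schema."""
    return [(row[id_key], row[rock_key].capitalize()) for row in rows]


def load_warehouse():
    """Step 1: load every source into a fresh in-memory warehouse table."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE Samples (SampleID TEXT, Rock_Type TEXT)")
    rows = extract_transform(SOURCE_1, "SampleID", "Rock_Type")
    rows += extract_transform(SOURCE_2, "id", "rock")
    warehouse.executemany("INSERT INTO Samples VALUES (?, ?)", rows)
    warehouse.commit()
    return warehouse


def rock_types(warehouse):
    """Step 2: the client queries the warehouse only, never the sources."""
    query = "SELECT Rock_Type FROM Samples ORDER BY SampleID"
    return [row[0] for row in warehouse.execute(query)]
```

A row added to `SOURCE_2` after `load_warehouse()` has run does not appear in `rock_types(...)` until the ETL step is repeated, which is the staleness the slide warns about.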

Data integration via middleware
[Diagram labels: Client; Data integration Middleware (aka Mediator); Database 1; Database 2; Database 3]
1. Each client request goes to sources, via middleware
2. Result collected by middleware and returned to client

Warehousing vs Mediation
• Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema
• Mediation: Modify user query to match schemas exported by each source
– But, which schema does the user query?
– The Integrated View Schema
– Sources “export” a view (the export schema)
• Federated databases
– Local sources belong to different “administrative domains”, i.e. different owners.
– Local autonomy

The Canonical Mediator / Wrapper Architecture
[Diagram labels, top to bottom]
• Client Application, issuing query Q1
• Mediator (Integrated view in mediator data model, e.g. relational, XML); Cached data
• Sub-queries Q11, Q12, Q13, Q14, one per Wrapper
• Four Wrappers; each has an Export view in mediator data model and a Local view in local data model; source query q14 shown for the fourth
• Data source 1, Data source 2, Data source 3, Data source 4, each with a Local schema
Wrapper processes could execute at sources, at mediator, or elsewhere
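The division of labour between mediator and wrappers can be sketched as two small classes. This is a minimal illustration under stated assumptions: sources are in-memory lists of dictionaries, the only query form is an equality test on one field, and the class and field names are hypothetical.

```python
class Wrapper:
    """Translates between a source's local schema and the export view."""

    def __init__(self, rows, field_map):
        self.rows = rows
        self.field_map = field_map  # mediator field name -> local field name

    def query(self, field, value):
        # Rewrite the mediator-level field name to the local one.
        local_field = self.field_map[field]
        matches = [row for row in self.rows if row[local_field] == value]
        # Return matching rows renamed to the mediator's field names.
        return [
            {name: row[local] for name, local in self.field_map.items()}
            for row in matches
        ]


class Mediator:
    """Sends each client query to every wrapper and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results


# Hypothetical sources with different local schemas.
SOURCE_1_ROWS = [{"SampleID": "A1", "Rock_Type": "Granite"}]
SOURCE_2_ROWS = [{"id": "B7", "rock": "Granite"}, {"id": "B8", "rock": "Basalt"}]

mediator = Mediator([
    Wrapper(SOURCE_1_ROWS, {"SampleID": "SampleID", "Rock_Type": "Rock_Type"}),
    Wrapper(SOURCE_2_ROWS, {"SampleID": "id", "Rock_Type": "rock"}),
])
```

Unlike a warehouse, nothing is copied ahead of time: each call to `mediator.query` reads the sources as they are at that moment.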

Example: A Relational Mediator
[Diagram labels: Client Application; Mediator (Relational data model); two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]

Example: A Shape-file Based Mediator
[Diagram labels: Client Application; Mediator (Shape file-based data model); two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]

Example: An XML Mediator
[Diagram labels: User / Applications; Mediator (XML-based data model, e.g. GML); three Wrappers, over a Relational DBMS (e.g. PostGIS), a Shape file, and an XML file (e.g. ArcXML)]

User Authentication and Access Control
[Diagram labels: Client Application; Mediator; two Wrappers; Data source 1; Data source 2]
1. User authenticates to system
2. User connects to mediator (passes credentials to mediator)
3. Mediator connects to sources
a) Using original user credentials
b) Or, mapped credentials (role-based access)
4. Need to define users or roles in sources
How about using GAMA for authentication?

Different types of heterogeneity in data integration
• Platform heterogeneity: different OS platforms
• DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2
• Data type heterogeneity
• Schema heterogeneity
• Heterogeneity in units, accuracy, resolution
• Semantic heterogeneity

Schema Integration
• A long standing Computer Science problem
• Simple case
– Mediator View: (SampleID varchar, Rock_Type varchar, Age int)
– In Source2 Table, map Age to int
Source 1 Table: Sample ID: varchar; Rock type: varchar; Age: int; …
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: varchar; …
Wrapper: convert between int and varchar for Age
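The wrapper's job in this simple case is a type conversion in both directions. A minimal sketch, assuming Source 2 rows arrive as dictionaries keyed by the mediator's field names and that every stored Age is a numeric string; both function names are hypothetical.

```python
def export_source2_row(row):
    """Export a Source 2 row in the mediator view: Age varchar becomes int."""
    return {
        "SampleID": row["SampleID"],
        "Rock_Type": row["Rock_Type"],
        "Age": int(row["Age"].strip()),
    }


def age_for_source2(age):
    """Convert a mediator-side int Age into the varchar form Source 2 stores."""
    return str(age)
```

Real data would also need a policy for Age strings that are not numeric, which this sketch does not attempt.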

Another integration scenario
– Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
– In Source 2 Table, parse Age to obtain sub-components of the field
Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar; Era: varchar; Period: varchar (e.g. Phanerozoic, Mesozoic, Jurassic)
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: varchar (e.g. “Phanerozoic/mesozoic;jur”)
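Parsing the composite Age field might look like the sketch below. It assumes the one format shown on the slide, with "/" after the eon and ";" after the era, and a hypothetical abbreviation table that knows only the "jur" abbreviation from the example; a real wrapper would need the source's full conventions.

```python
import re

# Hypothetical abbreviation table; only the slide's example is covered.
PERIOD_ABBREVIATIONS = {"jur": "Jurassic"}


def parse_age(age_text):
    """Split a Source 2 Age string into Eon, Era and Period components."""
    eon, era, period = re.split(r"[/;]", age_text.strip())
    period = PERIOD_ABBREVIATIONS.get(period.lower(), period.capitalize())
    return {"Eon": eon.capitalize(), "Era": era.capitalize(), "Period": period}
```

Normalizing capitalization matters because the mediator compares these values against names such as "Mesozoic" in Source 1.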

A more advanced integration scenario
• Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
– Same as Source1 table schema
• Query: Get rock types for all rocks from the Jurassic period
Source 1 Table: Sample ID: varchar; Rock type: varchar; Eon: varchar; Era: varchar; Period: varchar (e.g. Phanerozoic, Mesozoic, Jurassic)
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int (e.g. 150)

Doing the integration
• Query sent to mediator:
SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic'
• Query to Source 1:
SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic'
• For Source2, need to map Period='Jurassic' to Age values
Source 2 Table: Sample ID: varchar; Rock type: varchar; Age: int
Geologic_Time Table: Eon: varchar; Era: varchar; Period: varchar; Min: int; Max: int

Query “fragment” sent to Source 2
• SELECT DISTINCT (S2.Rock_Type)
FROM Source2_Table S2, Geologic_Time_Table GT
WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max)
Where is the Geologic_Time table stored?
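The rewritten query can be tried against an in-memory SQLite database. This sketch sidesteps the slide's closing question by assuming the Geologic_Time table sits next to Source 2. The sample rows are invented, and the Jurassic bounds (145 to 201, in millions of years) are rounded values used only for illustration.

```python
import sqlite3


def jurassic_rock_types():
    """Run the Source 2 query fragment against invented sample data."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age INTEGER)"
    )
    db.execute(
        "CREATE TABLE Geologic_Time_Table "
        "(Eon TEXT, Era TEXT, Period TEXT, Min INTEGER, Max INTEGER)"
    )
    db.executemany(
        "INSERT INTO Source2_Table VALUES (?, ?, ?)",
        [("S2-1", "Sandstone", 150), ("S2-2", "Granite", 300),
         ("S2-3", "Sandstone", 160)],
    )
    # Approximate bounds, assumed for the example.
    db.execute(
        "INSERT INTO Geologic_Time_Table VALUES "
        "('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201)"
    )
    rows = db.execute(
        "SELECT DISTINCT S2.Rock_Type "
        "FROM Source2_Table S2, Geologic_Time_Table GT "
        "WHERE GT.Period = 'Jurassic' "
        "AND S2.Age >= GT.Min AND S2.Age <= GT.Max "
        "ORDER BY S2.Rock_Type"
    )
    return [row[0] for row in rows]
```

If the lookup table lived at the mediator instead, the mediator would first resolve the period to a numeric range and send Source 2 a plain range predicate on Age.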

Data Integration Carts™
• Integrating data sets without explicitly creating views
• An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
– Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region”
• Need gazetteer / spatial ontology to determine Rocky Mountain region
• Need to know classification of datasets (as gravity and geology)
• Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
– Plot gravity point data that fall within polygons of rocks of given type
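The two spatial steps in this request, intersecting dataset extents and keeping the points that fall inside rock polygons, can be sketched with plain geometry. This is an illustration only: it assumes planar (x, y) coordinates, simple polygons given as vertex lists, and a ray-casting test. The function names are hypothetical and are not GEON APIs.

```python
def boxes_intersect(a, b):
    """Bounding boxes are (min_x, min_y, max_x, max_y) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(point, polygon):
    """Ray casting: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def gravity_points_in_rocks(points, rock_polygons, rock_type):
    """Keep gravity points inside any polygon of the requested rock type."""
    wanted = [poly for kind, poly in rock_polygons if kind == rock_type]
    return [p for p in points if any(point_in_polygon(p, poly) for poly in wanted)]
```

The bounding-box test is cheap and works from metadata alone, so it can discard whole datasets before any point data are fetched.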

Ad hoc integration
[Diagram labels: GEONsearch; Search Metadata Catalog; “Geologic and gravity data in Rocky Mountains”; Data Integration Cart™ Query; Plot map; Map]

Data Registration
[Diagram labels]
• Rock Classification Ontology: Igneous; Granite; Quartz monzonite
• Spatial Ontology: Location; Latitude; Longitude; Point; Polygon
• Gravity dataset: (X, Y); Metadata
• Geologic dataset: Lat, Long, RockType; Metadata
• Item Registration (Schema registration); Item Detail Registration

Another complex query
• Query: Get rock types for all rocks from the Mesozoic era
– Easy to do for Source 1: Era = “Mesozoic”
– For Source 2:
• Need to find numeric age range for Mesozoic
– Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
• Select all Source 2 Table records whose Age falls within the Mesozoic age range
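Deriving an era's age range from its subclasses can be sketched as a small lookup. The subclass list comes from the slide, but everything else is assumed: the dictionaries stand in for an ontology and the Geologic_Time table, the age bounds are rounded values in millions of years, and the function names are hypothetical.

```python
# Stand-in for the ontology: era -> its period subclasses.
PERIODS_BY_ERA = {"Mesozoic": ["Triassic", "Jurassic", "Cretaceous"]}

# Stand-in for Geologic_Time: period -> (Min, Max), approximate values.
PERIOD_AGE_RANGE = {
    "Triassic": (201, 252),
    "Jurassic": (145, 201),
    "Cretaceous": (66, 145),
}


def era_age_range(era):
    """Span the age ranges of all period subclasses of the era."""
    ranges = [PERIOD_AGE_RANGE[period] for period in PERIODS_BY_ERA[era]]
    return min(low for low, _ in ranges), max(high for _, high in ranges)


def rock_types_for_era(records, era):
    """records are (SampleID, Rock_Type, Age) tuples as in Source 2."""
    low, high = era_age_range(era)
    return sorted({rock for _, rock, age in records if low <= age <= high})
```

Taking the minimum and maximum over the subclasses assumes the periods are contiguous, which holds for the three Mesozoic periods.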