lessons on process and standards in other science communities



Lessons on Process and Standards in other science communities

IMAG Model Sharing Strategies Workshop

NIH April 10 2007

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

http://grids.ucs.indiana.edu/ptliupages/presentations/

http://www.infomall.org

What is a Model Electronically?

This should have a label: a URI

It should have a collection of data or metadata defining it

It might have some way of building composite models by joining multiple smaller models together
• Need to be able to define connections

Maybe there are also “mechanisms” to manipulate model or evolve it in time

A computer program defines the data as values and the mechanisms as subroutines/methods
• Programs can be Fortran, Python, C#, Prolog
• Declarative or Imperative; Scripted or Compiled

However in spite of software engineering, computer programs are very hard to share and re-use
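The ingredients listed on this slide (a URI label, defining data and metadata, connections between smaller models, and a mechanism that evolves the model in time) can be sketched as a small data structure. This is only an illustration: the class name `Model`, the `decay` mechanism, and the `urn:example:` URIs are assumptions made for the sketch, not part of any standard discussed in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]


@dataclass
class Model:
    uri: str                                               # the label
    metadata: Dict[str, str] = field(default_factory=dict)
    state: State = field(default_factory=dict)             # data defining the model
    mechanism: Optional[Callable[[State, float], State]] = None  # evolves state in time
    parts: List["Model"] = field(default_factory=list)     # smaller models joined together
    # Each connection is (source URI, source variable, target URI, target variable).
    connections: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def step(self, dt: float) -> None:
        """Advance this model and its parts by dt, then copy values along connections."""
        if self.mechanism is not None:
            self.state = self.mechanism(self.state, dt)
        for part in self.parts:
            part.step(dt)
        by_uri = {part.uri: part for part in self.parts}
        for src, src_var, dst, dst_var in self.connections:
            by_uri[dst].state[dst_var] = by_uri[src].state[src_var]


def decay(state: State, dt: float) -> State:
    """Hypothetical mechanism: first-order decay of a concentration."""
    return {**state, "concentration": state["concentration"] * (1.0 - 0.1 * dt)}


source = Model(uri="urn:example:model:source",
               state={"concentration": 2.0}, mechanism=decay)
sink = Model(uri="urn:example:model:sink", state={"inflow": 0.0})
composite = Model(
    uri="urn:example:model:composite",
    parts=[source, sink],
    connections=[(source.uri, "concentration", sink.uri, "inflow")],
)
composite.step(1.0)
print(source.state, sink.state)
```

Sharing such a model means agreeing on the label, the data layout, and the meaning of the connections, which is where the standards questions in the rest of the talk come in.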

What are Questions?

What are the models we are trying to define?

What is the Process to decide on needed standards and their Syntax?

Are we mainly concerned about data defining the model and/or the programs that build the model?

Where are overlaps between IMAG requirements and other computer science or science fields?

Is the barrier to sharing models “science” (i.e. it is not clear what the common interfaces are) or “systematization” (we agree on interface points but don’t have a common syntax)?

Some Examples

There are many examples of relevant efforts to encourage sharing of models

DMSO (Defense Modeling and Simulation Office) produced HLA (High Level Architecture) as a (pre-CORBA/Web Service) way of defining military models as discrete event simulations
• Good but out of date

The Open Geospatial Consortium (OGC) http://www.opengeospatial.org/ is a consortium of 339 organizations setting excellent standards for Geographical Information Systems
• We could develop a BIS (Biological Information System)?

The International Virtual Observatory Alliance (IVOA) http://www.ivoa.net/ comprises 16 organizations (each of which is a collection like EVO, the European Virtual Observatory) and is defining sharing standards for astronomy data

Virtual Observatory Astronomy Grid: Integrate Experiments

[Figure: sky images labeled Radio, Far-Infrared, Visible, and Visible + X-ray, together with a Dust Map and a Galaxy Density Map]

OGC Standards I

Standard: Geography Markup Language (GML)
Definition: GML is an XML grammar written in XML Schema for the modeling, transport, and storage of geographic information. GML provides a variety of kinds of objects for describing geography including features, coordinate reference systems, geometry, topology, time, units of measure and generalized values.
Specification: ISO/TC 211/WG 19136; OGC 03-105r1; Version: 3.1.0; Date: 2004-02-07; Pages: 601

Standard: Observations and Measurements (O&M)
Definition: The general models and XML encodings for observations and measurements, including but not restricted to those using sensors. Based on GML.
Specification: OGC 05-087r3; Version: 0.13.0; Date: 2006-02-24; Pages: 136

Standard: Sensor Model Language (SensorML)
Definition: The general models and XML encodings for sensors.
Specification: OGC 05-086; Date: 2005-10-05; Version: 1.0; Pages: 110

Standard: Web Feature Service (WFS)
Definition: WFS allows a client to retrieve and update geospatial data encoded in GML from multiple Web Feature Services. The specification defines interfaces for data access and manipulation operations on geographic features, using HTTP as the distributed computing platform. Via these interfaces, a Web user or service can combine, use and manage geodata -- the feature information behind a map image -- from different sources.
Specification: OGC 04-094; Date: 2005-05-03; Version: 1.1.0; Pages: 131

OGC Standards II

Standard: Web Map Service (WMS)
Definition: A Web Map Service (WMS) produces maps of spatially referenced data dynamically from geographic information. This International Standard defines a “map” to be a portrayal of geographic information as a digital image file suitable for display on a computer screen.
Specification: OGC 06-042; Date: 2006-03-15; Version: 1.3.0; Pages: 85

Standard: Web Coverage Service (WCS)
Definition: WCS extends the WMS interface to allow access to geospatial “coverages” (raster data sets) that represent values or properties of geographic locations, rather than WMS generated maps (pictures).
Specification: OGC 03-065r6; Date: 2003-08-27; Version: 1.0.0; Pages: 67

Standard: Catalogue Services
Definition: Catalogue Service Implementation Specification defines a common interface that enables diverse but conformant applications to perform discovery, browse and query operations against distributed heterogeneous catalog servers.
Specification: OGC 02-087r3; Date: 2002-12-13; Version: 1.1.1; Pages: 239

Standard: Filter Encoding
Definition: Filter Encoding defines an XML encoding for filter expressions. A filter expression constrains property values to create a subset of a group of objects. The goal, typically, is to operate on just those objects by, for example, rendering them in a different color or saving them to another format.
Specification: OGC 04-095; Date: 2005-05-03; Version: 1.1.0; Pages: 40
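To make the WMS entry concrete, the following sketch assembles a WMS 1.3.0 GetMap request as an HTTP GET URL using the standard key-value parameters. The endpoint `http://example.org/wms` and the layer names are hypothetical, and a real server should be asked for its capabilities first to learn which layers and coordinate reference systems it supports.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def wms_getmap_url(base_url, layers, bbox, width, height,
                   crs="CRS:84", image_format="image/png"):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (min_x, min_y, max_x, max_y) in the axis order of `crs`;
    CRS:84 uses longitude, latitude order.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",                      # empty value requests default styles
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layers, for illustration only.
url = wms_getmap_url("http://example.org/wms", ["rivers", "railroads"],
                     (-119.0, 34.0, -118.0, 35.0), 256, 256)
query = parse_qs(urlparse(url).query)
print(url)
```

The response to such a request is an image, which matches the definition of a “map” as a portrayal suitable for display rather than the underlying feature data.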

WMS uses WFS that uses data sources

[Figure: a Client and WMS send GetFeature requests to a WFS Server and receive FeatureCollection responses. The WFS Server issues SQL Queries against data sources for Railroads [a-b], Rivers [a-d], Bridges [1-5], and Interstate Highways [12-18].]

GML fragment that defines an earthquake fault:

<gml:featureMember>
  <fault>
    <name> Northridge2 </name>
    <segment> Northridge2 </segment>
    <author> Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
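A client that receives such a fault feature can read it with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; it assumes the `gml` prefix is bound to the namespace `http://www.opengis.net/gml` (the slide omits the declaration, so it is added here), and the function name `parse_fault` is made up for the example.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# The fragment from the slide, with the namespace declaration added (an assumption).
FRAGMENT = """
<gml:featureMember xmlns:gml="http://www.opengis.net/gml">
  <fault>
    <name> Northridge2 </name>
    <segment> Northridge2 </segment>
    <author> Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
"""


def parse_fault(xml_text):
    """Extract the name, author, and (x, y) coordinate pairs of a fault feature."""
    root = ET.fromstring(xml_text)
    fault = root.find("fault")
    coords_text = fault.find(f".//{{{GML_NS}}}coordinates").text
    # gml:coordinates separates tuples by whitespace and values by commas.
    points = [tuple(float(v) for v in pair.split(","))
              for pair in coords_text.split()]
    return {
        "name": fault.findtext("name").strip(),
        "author": fault.findtext("author").strip(),
        "points": points,
    }


fault = parse_fault(FRAGMENT)
print(fault)
```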

OGC Standards

Typify a common competition: there is a similar effort by a Technical Committee of the International Organization for Standardization (ISO/TC211).

Are very complex: the GML specification itself is over 600 pages

Underlie the success of GIS, enabled first through ESRI (ArcInfo) and Minnesota Map Server and now through Google Maps

Are built in XML (as they should be) but for efficiency one
• Transmits through binary XML
• Stores in SQL databases not in XML databases

Define some things (catalog) which are unnecessary as they are provided by a broader community

Observations and Measurements work for any time series and so are also broader but there is no competition!

OGC Standards Structure

Have a language GML that defines the field; this would be CellML and SBML in the case of Biology and CML for ChemInformatics

Have a user interface (the Map) captured as a Web Map Service

Have a “pixel data” service WCS, the Web Coverage Service

Have a “vector” (feature, property) data service WFS, the Web Feature Service
• Note any Earth Science simulation or data analysis can be thought of as accepting WFS compatible data and producing WFS or WCS compatible output

Grid Workflow Datamining in Earth Science

Work with Scripps Institute

Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California

[Figure labels: NASA GPS; Earthquake; Real Time; Archival; Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS)]

Data Federation

The IVOA activities are aimed largely at supporting interoperable data repositories that can feed into the image processing filtering needed to extract signals
• There is not so much simulation

ChemInformatics has most data in NIH’s PubChem but will need to federate additional repositories such as those produced by individual Chemistry groups and the raw data from NIH screening centers

Every county (total 92) in Indiana has its own GIS and something equivalent to a WFS holding information not yet known to Google! (e.g. our house pinpoint address and assessment)
• Need to federate all these to support state agencies

So federation of distributed resources is a major issue and WFS uses “capabilities” to support this
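One way to picture capability-based federation: each feature server advertises the feature types it holds (WFS does this through its GetCapabilities operation), and a federating service builds an index from those advertisements so that a single query fans out only to the servers that can answer it. The classes and the county data below are hypothetical stand-ins, not a WFS client.

```python
class FeatureServer:
    """Stand-in for one county's feature service."""

    def __init__(self, name, features_by_type):
        self.name = name
        self.features_by_type = features_by_type

    def get_capabilities(self):
        # Advertise which feature types this server can answer for.
        return sorted(self.features_by_type)

    def get_feature(self, type_name):
        return list(self.features_by_type.get(type_name, []))


class Federator:
    """Routes each query to the servers whose capabilities include the type."""

    def __init__(self, servers):
        self.index = {}
        for server in servers:
            for type_name in server.get_capabilities():
                self.index.setdefault(type_name, []).append(server)

    def get_feature(self, type_name):
        results = []
        for server in self.index.get(type_name, []):
            results.extend(server.get_feature(type_name))
        return results


# Hypothetical county holdings, for illustration only.
county_a = FeatureServer("county-a", {"parcels": ["a-1", "a-2"], "roads": ["a-r1"]})
county_b = FeatureServer("county-b", {"parcels": ["b-1"]})
state = Federator([county_a, county_b])
print(state.get_feature("parcels"))
```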

Indiana County Map Grid

GIS Grid of “Indiana Map” and ~10 Indiana counties with accessible Map (Feature) Servers from different vendors. Grids federate different data repositories (cf Astronomy VO federating different observatory collections)

[Figure: a Browser + Google Map API client is served by a Cache Server and the Google Maps Server. A Tile Server reaches the Cass County Map Server (OGC Web Map Server), the Hamilton County Map Server (AutoDesk), and the Marion County Map Server (ESRI ArcIMS), each through its own Adapter.]

Browser client fetches image tiles for the bounding box using Google Map API.

Tile Server requests map tiles at all zoom levels with all layers. These are converted to uniform projection, indexed, and stored. Overlapping images are combined.

Must provide adapters for each Map Server type.

The cache server fulfills Google map calls with cached tiles that fill the requested bounding box.
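The adapter-plus-cache arrangement described above can be sketched as follows. This is a toy: tiles are strings rather than images, "combining" is just collecting one layer per county server, and the adapter functions stand in for vendor-specific clients that the real system would need for each Map Server type.

```python
from typing import Callable, Dict, List, Tuple

TileKey = Tuple[int, int, int]            # (zoom, x, y)
Adapter = Callable[[TileKey], str]        # returns a stand-in for image data


def cass_adapter(key: TileKey) -> str:
    """Hypothetical adapter for an OGC Web Map Server."""
    return f"cass{key}"


def marion_adapter(key: TileKey) -> str:
    """Hypothetical adapter for an ESRI ArcIMS server."""
    return f"marion{key}"


class TileCache:
    """Answers tile requests from the cache, calling adapters only on a miss."""

    def __init__(self, adapters: Dict[str, Adapter]) -> None:
        self.adapters = adapters
        self.tiles: Dict[TileKey, List[str]] = {}
        self.backend_requests = 0

    def get_tile(self, key: TileKey) -> List[str]:
        if key not in self.tiles:
            layers = []
            for name in sorted(self.adapters):
                layers.append(self.adapters[name](key))
                self.backend_requests += 1
            self.tiles[key] = layers      # stand-in for the combined image
        return self.tiles[key]


cache = TileCache({"cass": cass_adapter, "marion": marion_adapter})
first = cache.get_tile((3, 1, 2))
second = cache.get_tile((3, 1, 2))        # served from the cache
print(first, cache.backend_requests)
```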

[Figure: map search results] Searched on Transit/Transportation

Service or Web service Approach

One uses GML, CML etc. to define the data in a system and one uses services to capture “methods” or “programs”

In eScience, important services fall in three classes
• Simulations
• Data access, storage, federation, discovery
• Filters for data mining and manipulation

Services use something like WSDL (Web Service Definition Language) to define interoperable interfaces (see OPAL talk!)

WSDL establishes a “contract” independent of implementation between two services or a service and a client

Services should be loosely coupled which normally means they are coarse grain

Services will be composed (linked together) by mashups (typically scripts) or workflow (often XML – BPEL)

Software Engineering and Interoperability/Standards are closely related
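The idea of a contract that is independent of implementation can be illustrated in a single process with a structural interface. The sketch below is an analogy only: it uses Python's `typing.Protocol` where a real deployment would use a WSDL document, and the service and class names are invented for the example.

```python
from typing import Dict, Protocol


class SimulationService(Protocol):
    """The contract: one coarse-grained operation taking and returning a message."""

    def run(self, request: Dict[str, float]) -> Dict[str, float]:
        ...


class DirectSimulation:
    """One implementation of the contract."""

    def run(self, request: Dict[str, float]) -> Dict[str, float]:
        return {"result": request["x"] * request["x"]}


class PowSimulation:
    """A second, independent implementation honoring the same contract."""

    def run(self, request: Dict[str, float]) -> Dict[str, float]:
        return {"result": pow(request["x"], 2)}


def client(service: SimulationService, x: float) -> float:
    # The client depends only on the contract, never on a concrete class.
    return service.run({"x": x})["result"]


results = [client(impl, 3.0) for impl in (DirectSimulation(), PowSimulation())]
print(results)
```

Swapping one implementation for the other requires no change to the client, which is the property that a language-neutral interface definition aims to provide across machines and programming languages.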

Philosophy of Web Service Grids

Much of Distributed Computing was built by natural extensions of computing models developed for sequential machines

This leads to the distributed object (DO) model represented by Java and CORBA
• RPC (Remote Procedure Call) or RMI (Remote Method Invocation) for Java

Key people think this is not a good idea as it scales badly and ties distributed entities together too tightly
• Distributed Objects Replaced by Services

Note CORBA was considered too complicated in both organization and proposed infrastructure
• and Java was considered as “tightly coupled to Sun”
• So there were other reasons to discard

Thus replace distributed objects by services connected by “one-way” messages and not by request-response messages
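The difference between one-way messaging and request-response can be seen in a small single-process sketch. The sender drops a message in a mailbox and continues immediately; any result comes back later as another one-way message addressed to the sender's own mailbox. The `Mailbox` class, the message fields, and the doubling "service" are all assumptions made for illustration.

```python
from collections import deque


class Mailbox:
    """A queue of messages; send() returns at once without waiting for a reply."""

    def __init__(self):
        self.queue = deque()

    def send(self, message):
        self.queue.append(message)


def run_service(inbox, mailboxes):
    """Process queued messages; each result is sent as a new one-way message."""
    while inbox.queue:
        message = inbox.queue.popleft()
        result = {"id": message["id"], "value": message["value"] * 2}
        mailboxes[message["reply_to"]].send(result)


mailboxes = {"client": Mailbox(), "service": Mailbox()}

# The client fires two messages and does not block on either.
mailboxes["service"].send({"id": 1, "value": 10, "reply_to": "client"})
mailboxes["service"].send({"id": 2, "value": 21, "reply_to": "client"})
pending_before = len(mailboxes["client"].queue)

run_service(mailboxes["service"], mailboxes)      # happens later, independently
replies = list(mailboxes["client"].queue)
print(pending_before, replies)
```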

Web services

Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles.

Web Services interact by exchanging messages in SOAP format

The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.

[Figure: layers of a Web Service. Resources (Databases, Humans, Programs, Computational resources, Devices); service logic (BPEL, Java, .NET); message processing (SOAP and WSDL); SOAP messages of the following form are exchanged between services.]

<env:Envelope>
  <env:Header> ... </env:Header>
  <env:Body> ... </env:Body>
</env:Envelope>
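A SOAP message with the Envelope, Header, and Body structure can be produced with an ordinary XML library. The sketch below uses the SOAP 1.2 envelope namespace; the operation name `RunSimulation` and its `modelUri` argument are hypothetical, and a real service would define its payload in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"   # SOAP 1.2 envelope namespace
ET.register_namespace("env", SOAP_ENV)


def build_envelope(body_payload, headers=()):
    """Wrap a payload element in env:Envelope / env:Header / env:Body."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    for header_element in headers:
        header.append(header_element)
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    body.append(body_payload)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and argument, for illustration only.
request = ET.Element("RunSimulation")
ET.SubElement(request, "modelUri").text = "urn:example:model:cell"
message = build_envelope(request)

# The receiver parses the envelope and dispatches on the Body content.
parsed = ET.fromstring(message)
model_uri = parsed.find(f"{{{SOAP_ENV}}}Body/RunSimulation/modelUri").text
print(message)
```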

A typical Web Service

In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, totally compiled away (inlining)

The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python

[Figure labels: Portal Service; Security; Catalog; Payment (Credit Card); Warehouse (Shipping control); Web Services; WSDL interfaces]

CICC Web Service Infrastructure

Portal Services: RSS Feeds, User Profiles, Collaboration as in Sakai

Grid Services: Service Registry, Job Submission and Management

Local Clusters, IU Big Red, TeraGrid, Open Science Grid

Varuna.net: Quantum Chemistry

OSCAR Document Analysis, InChI Generation/Search, Computational Chemistry (Gamess, Jaguar etc.)

Cheminformatics Services
• Core functionality: Fingerprints, Similarity, Descriptors, 2D diagrams, File format conversion
• Applications: Docking, Filtering, Druglikeness, Arbitrary R code (PkCell)

Statistics Services
• Computation functionality: Regression, Classification, Clustering, Sampling distributions
• Applications: Predictive models, Feature selection, 2D plots, Toxicity predictions, Anti-cancer activity predictions, Mutagenicity predictions, Pharmacokinetic parameters

Database Services
• 3D structures by: CID, SMARTS, 3D Similarity
• Docking scores/poses by: CID, SMARTS, Protein, Docking scores
• PubChem related data by: CID, SMARTS

Where Does The Functionality Come From?

Indiana University: VOTables, NCI DTP predictions, Database services

Cambridge University: InChI generation / search, OSCAR

OpenEye: Docking

DigitalChemistry: BCI fingerprints, DivKMeans

CDK: Cheminformatics

University of Michigan: PkCell

R Foundation: R package

NIH: PubChem, PubMed

gNova Consulting

European Chemicals Bureau: ToxTree toxicity predictions

Service Modeling Language (SML)

Submitted to W3C by industry giants 21 March 2007

A model in SML is realized as a set of interrelated XML documents. The XML documents contain information about the parts of an IT service, as well as the constraints that each part must satisfy for the IT service to function properly. Constraints are captured in two ways:

Schemas: these are constraints on the structure and content of the documents in a model. SML uses a profile of XML Schema 1.0 as the schema language. SML also defines a set of extensions to XML Schema to support inter-document references.

Rules: these are Boolean expressions that constrain the structure and content of documents in a model. SML uses a profile of Schematron (goes between documents) and XPath 1.0 for rules.
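The flavor of a rule that spans several documents can be shown with a toy checker. This is not SML, Schematron, or XPath 1.0 syntax: the element names (`service`, `hostRef`, `needsMemoryMB`, `host`, `memoryMB`) and the rule itself are invented, and the check is written directly in Python to show what "a Boolean constraint across interrelated XML documents" means.

```python
import xml.etree.ElementTree as ET

# A hypothetical model: three interrelated XML documents keyed by name.
DOCUMENTS = {
    "host-1.xml": "<host id='host-1'><memoryMB>4096</memoryMB></host>",
    "db.xml": ("<service name='db'><hostRef>host-1.xml</hostRef>"
               "<needsMemoryMB>2048</needsMemoryMB></service>"),
    "web.xml": ("<service name='web'><hostRef>host-2.xml</hostRef>"
                "<needsMemoryMB>512</needsMemoryMB></service>"),
}


def check_model(documents):
    """Rule: every service must reference an existing host with enough memory."""
    parsed = {name: ET.fromstring(text) for name, text in documents.items()}
    errors = []
    for name, doc in parsed.items():
        if doc.tag != "service":
            continue
        ref = doc.findtext("hostRef")
        host = parsed.get(ref)
        if host is None:
            errors.append(f"{name}: unresolved reference {ref}")
        elif int(host.findtext("memoryMB")) < int(doc.findtext("needsMemoryMB")):
            errors.append(f"{name}: host {ref} has too little memory")
    return errors


errors = check_model(DOCUMENTS)
print(errors)
```

Running such checks on a proposed change before applying it to the live system is the validation use that the next slide attributes to models.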

Models in SML

Models focus on capturing all invariant aspects of a service/system that must be maintained for the service/system to be functional.

Models are units of communication and collaboration between designers, implementers, operators, and users; and can easily be shared, tracked, and revision controlled. This is important because complex services are often built and maintained by a variety of people playing different roles.

Models drive modularity, re-use, and standardization. Most real-world complex services and systems are composed of sufficiently complex parts. Re-use and standardization of services/systems and their parts is a key factor in reducing overall production and operation cost and in increasing reliability.

Models represent a powerful mechanism for validating changes before applying the changes to a service/system. Also, when changes happen in a running service/system, they can be validated against the intended state described in the model. The actual service/system and its model together enable a self-healing service/system, the ultimate objective. Models of a service/system must necessarily stay decoupled from the live service/system to create the control loop.

Models enable increased automation of management tasks. Automation facilities exposed by the majority of IT services/systems today could be driven by software, not people, for reliable initial realization of a service/system as well as for ongoing lifecycle management.

Structured v Unstructured Metadata

The schemas that are defined by GML etc. are structured definitions

The traditional semantic web approach is largely based on structured metadata (OWL) that one can analyze precisely

UML was for example used by OGC in developing standards

In the “real world”, unstructured annotation has been very successful as seen in Connotea, del.icio.us and CiteULike

How to set standards

If one is Google, you can just define the standard and not bother to discuss it!
• Google maps does not support OGC standards

The growth in distributed computing has spurred a great deal of standards work because the different parts of a system are built by different people and must interoperate

Often meet every few weeks to build a standard in 12 months

OASIS defines a process and doesn’t define an architecture

W3C is most prestigious

OGF (Open Grid Forum) has an eScience section that is currently led by me

Or do it outside any standards body as in fact most domain specific standards are done
• Note IVOA has meetings from time to time at OGF to coordinate their astronomy standards with general Grid standards

The Grid and Web Service Institutional Hierarchy

1: Container and Run Time (Hosting) Environment (Apache Axis, .NET etc.)

2: System Services and Features (WS-* from OASIS/W3C/Industry): Handlers like WS-RM, Security, UDDI Registry

3: Generally Useful Services and Features (OGSA and other GGF, W3C) such as “Collaborate”, “Access a Database” or “Submit a Job”

4: Application or Community of Interest (CoI) Specific Services such as “Map Services”, “Run BLAST” or “Simulate a Missile”

[Figure labels beside the levels: XBML, XTCE, VOTABLE, CML, CellML; OGSA GS-* and some WS-*, GGF/W3C/…., XGSP (Collab); WS-* from OASIS/W3C/Industry; Apache Axis, .NET etc.]

Must set standards to get interoperability

The Ten areas covered by the 60 core WS-* Specifications

WS-* Specification Area: Examples

1. Core Service Model: XML, WSDL, SOAP

2. Service Internet: WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MTOM

3. Notification: WS-Notification, WS-Eventing (Publish-Subscribe)

4. Workflow and Transactions: BPEL, WS-Choreography, WS-Coordination

5. Security: WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation

6. Service Discovery: UDDI, WS-Discovery

7. System Metadata and State: WSRF, WS-MetadataExchange, WS-Context

8. Management: WSDM, WS-Management, WS-Transfer

9. Policy and Agreements: WS-Policy, WS-Agreement

10. Portals and User Interfaces: WSRP (Remote Portlets)

Activities in Global Grid Forum Working Groups

GGF Area: GS-* and OGSA Standards Activities

1. Architecture: High Level Resource/Service Naming (level 2 of slide 6), Integrated Grid Architecture

2. Applications: Software Interfaces to Grid, Grid Remote Procedure Call, Checkpointing and Recovery, Interoperability to Job Submittal services, Information Retrieval

3. Compute: Job Submission, Basic Execution Services, Service Level Agreements for Resource use and reservation, Distributed Scheduling

4. Data: Database and File Grid access, Grid FTP, Storage Management, Data replication, Binary data specification and interface, High-level publish/subscribe, Transaction management

5. Infrastructure: Network measurements, Role of IPv6 and high performance networking, Data transport

6. Management: Resource/Service configuration, deployment and lifetime, Usage records and access, Grid economy model

7. Security: Authorization, P2P and Firewall Issues, Trusted Computing

Two-level Programming I

• The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model
• We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies
– C++ Java or Fortran Monte Carlo module
– Data streaming from a sensor or Satellite
– Specialized (JDBC) database access
• Such services accept and produce data from users, files and databases
• The Grid is built by coordinating such services assuming we have solved the problem of programming the service

[Figure: a Service exchanging Data]

Two-level Programming II

The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams

Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs

Such interpretative environments are the single processor analog of Grid Programming

Some projects like GrADS from Rice University are looking at integration between service and composition levels but dominant effort looks at each level separately

[Figure: Service1, Service2, Service3, and Service4 linked together]
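The two levels can be seen in miniature below: each "service" is an ordinary function programmed on its own (level one), and a short script composes them into an application much as a shell pipeline composes programs (level two). The service names, message fields, and threshold are made up, and real Grid services would exchange messages over a network rather than call each other in one process.

```python
from functools import reduce


def compose(*services):
    """Chain services so that each one's output message feeds the next."""
    def pipeline(message):
        return reduce(lambda msg, service: service(msg), services, message)
    return pipeline


# Level one: individual services, each written independently.
def sensor_service(msg):
    return {**msg, "samples": [1.0, 2.0, 40.0, 3.0]}


def filter_service(msg):
    # Drop readings above a hypothetical plausibility threshold.
    return {**msg, "samples": [s for s in msg["samples"] if s < 10.0]}


def analysis_service(msg):
    return {**msg, "mean": sum(msg["samples"]) / len(msg["samples"])}


# Level two: the composition script.
workflow = compose(sensor_service, filter_service, analysis_service)
result = workflow({"station": "hypothetical-01"})
print(result)
```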

Grid Workflow Data Assimilation in Earth Science

Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts

[Figure: typical graphical interface to service composition]

3 Layer Programming Model

Level 1: Application Programming (MPI, Fortran, C++ etc.)

Level 2: Application Semantics (Metadata, Ontology) “Programming” (Semantic Web)

Level 3: Workflow Programming (BPEL)

[Figure: the three levels sit on a Basic Web Service Infrastructure hosting Web Service 1, WS 2, WS 3, and WS 4]

Workflow can be built on top of NaradaBrokering as messaging layer

[Figure: a Grid of services connected by SOAP Messages. Many Sensor Services (SS), Filter Services (FS), Other Services (OS), and MetaData (MD) services, together with a Database and a Portal, link to Another Grid and Another Service at several points. Data is refined in stages: Raw Data, Data, Information, Knowledge, Wisdom, Decisions.]

Information Management/Processing

SOAP messages transport information expressed in a semantically rich fashion between sources and services that enhance and transform information so that the complete system provides the Data to Information to Knowledge transformation
• Semantic Web technologies like RDF and OWL help us have rich expressivity

We build application specific information management/transformation systems (ASIS) for each application domain

One special domain is the system itself where the metadata associated with services, sessions, Grids, messages, streams and workflow is itself managed and supported by an SIIS

Generalizing a GIS

Geographical Information Systems (GIS) have been hugely successful in all fields that study the earth and related worlds
• They define Geography Syntax (GML) and ways to store, access, query, manipulate and display geographical features
• In SOA, GIS corresponds to a domain specific XML language and a suite of services for the different functions above

However such a universal information model has not been developed in other areas even though there are many fields in which it appears possible
• BIS Biological Information System
• MIS Military Information System
• IRIS Information Retrieval Information System
• PAIS Physics Analysis Information System
• SIIS Service Infrastructure Information System

ASIS Application Specific Information System I

a) Discovery capabilities that are best done using WS-* standards

b) Domain specific metadata and data including search/store/access interface (cf WFS). Let’s call the generalization ASFS (Application Specific Feature Service)
• Language to express domain specific features (cf GML). Let’s call this ASL (Application Specific Language)
• Tools to manipulate information expressed in the language and key data of the application (cf coordinate transformations). Let’s call this ASTT (Application Specific Tools and Transformations)
• ASL must support data sources such as sensors (cf OGC metadata and data sensor standards) and repositories. Sensors need (common across applications) support of streams of data
• Queries need to support archived (find all relevant data in past) and streaming (find all data in future with given properties)
• Note all AS Services behave like Sensors and all sensors are wrapped as services
• Any domain will have “raw data” (binary) and that which has been filtered to ASL. Let’s call the raw form ASBD (Application Specific Binary Data)
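The distinction between archived and streaming queries can be sketched with one small class. An archived query filters what is already stored; a streaming query registers a predicate and a callback so that matching data is delivered as it arrives in the future. The class `FeatureService`, its method names, and the feature fields are assumptions for illustration, not an ASFS or WFS interface.

```python
class FeatureService:
    """Toy feature store supporting both archived and streaming queries."""

    def __init__(self):
        self.archive = []
        self.subscriptions = []

    def query_archived(self, predicate):
        # Find all relevant data in the past.
        return [f for f in self.archive if predicate(f)]

    def subscribe(self, predicate, callback):
        # Find all data in the future with the given properties.
        self.subscriptions.append((predicate, callback))

    def publish(self, feature):
        # A sensor (or any service acting like one) delivers a new feature.
        self.archive.append(feature)
        for predicate, callback in self.subscriptions:
            if predicate(feature):
                callback(feature)


def is_strong(feature):
    return feature["magnitude"] >= 5.0


service = FeatureService()
service.publish({"id": "e1", "magnitude": 5.4})
service.publish({"id": "e2", "magnitude": 3.1})

past = service.query_archived(is_strong)

streamed = []
service.subscribe(is_strong, streamed.append)
service.publish({"id": "e3", "magnitude": 6.0})
service.publish({"id": "e4", "magnitude": 2.0})
print(past, streamed)
```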

ASIS Application Specific Information System II

c) A visualization service generalizing WMS for GIS. Let’s call this ASVS (Application Specific Visualization Services)

The ASVS should both visualize information and provide a way of navigating (cf GetFeatureInfo) the database (the ASFS)

The ASVS can itself be federated and presents an ASFS output interface

d) There should be an application service interface for ASIS from which all ASIS services inherit

e) There will be other user services interfacing to ASIS

All user and system services will input and output data in ASL using filters to cope with ASBD

[Figure: AS Tool (generic), AS “Sensor”, AS Repository, AS Service (user defined), and ASVS Display exchange Messages using ASL; functions shown are Filter, Transformation, Reasoning, Data-mining, Analysis]

Mashups v Workflow?

Mashup Tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63

Workflow Tools are reviewed by Gannon and Fox http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf

Both include scripting in PHP, Python, sh etc. as both implement distributed programming at level of services

Mashups use all types of service interfaces and do not have the potential robustness (security) of Grid service approach

Typically “pure” HTTP (REST)

Web 2.0 APIs

http://www.programmableweb.com/apis currently (March 3 2007) lists 388 Web 2.0 APIs with GoogleMaps the most used in Mashups

This site acts as a “UDDI” or “OGC Catalog” for Web 2.0

The List of Web 2.0 APIs

Each site has API and its features

Divided into broad categories

Only a few used a lot (34 APIs used in more than 10 mashups)

RSS feed of new APIs

3 more Mashups each day

For a total of 1609 on March 3 2007

Note ClearForest runs Semantic Web Services Mashup competitions (not workflow competitions)

Some Mashup types: aggregators, search aggregators, visualizers, mobile, maps, games

Growing number of commercial Mashup Tools

APIs/Mashups per Protocol Distribution

[Figure: chart of Number of APIs and Number of Mashups by protocol (REST; SOAP; XML-RPC; REST, XML-RPC; REST, XML-RPC, SOAP; REST, SOAP; JS; Other). Labeled APIs: google maps, netvibes, live.com, virtual earth, google search, amazon S3, amazon ECS, flickr, ebay, youtube, 411sync, del.icio.us, yahoo! search, yahoo! geocoding, technorati, yahoo! images, trynt, yahoo! local.]