
Page 1: Persistent Management of Distributed Data

National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center

Reagan W. Moore
General Atomics, Inc.
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/

Page 2: Topics

• Data management systems
  – Data collections, digital libraries
• Distributed data management
  – Data grids
• Persistent data management
  – Persistent archives
• Common infrastructure for data management

Page 3: Data Collections

• Define the context for describing a collection of digital entities
  – Context specified by metadata attributes
  – Provenance, origin of the digital entities
  – Administrative, location of the digital entities
  – Technical, purpose of the digital entities
• Support organization of attributes as hierarchy of sub-collections
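A minimal sketch of the idea on this slide, assuming a collection is simply a named node that carries metadata attributes and nested sub-collections. The class name `Collection`, its methods, and all example values are hypothetical and are not taken from the presentation or from any SRB interface.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A collection: metadata attributes plus nested sub-collections."""
    name: str
    # Attribute groups follow the slide: provenance, administrative, technical.
    attributes: dict = field(default_factory=dict)
    subcollections: dict = field(default_factory=dict)

    def add_subcollection(self, name, **attributes):
        """Create a child collection and return it so calls can be chained."""
        child = Collection(name, dict(attributes))
        self.subcollections[name] = child
        return child

    def paths(self, prefix=""):
        """Yield the hierarchical path of this collection and all descendants."""
        path = f"{prefix}/{self.name}"
        yield path
        for child in self.subcollections.values():
            yield from child.paths(path)


# Hypothetical example data.
root = Collection("survey", {"provenance": "hypothetical sky survey"})
images = root.add_subcollection("images", administrative="site-a disk")
images.add_subcollection("raw", technical="uncalibrated frames")
all_paths = list(root.paths())
```

Browsing the hierarchy then reduces to walking `paths()`, while the attributes stay attached to the level they describe.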

Page 4: Digital Libraries

• Provide services on the data collection
  – Ingestion, loading of attribute values
  – Extensibility, definition of new attributes
  – Discovery, queries on attributes
  – Browsing, hierarchical listing
  – Presentation, formatting specified data models

Page 5: Data Grids

• Manage data in a distributed environment
  – Logical name space, provide global identifier
  – Data access, storage system abstraction
  – Replication, disaster backup
  – Uniform access, common API across file systems, archives, and databases
  – Single sign-on, authenticate across administration domains

Page 6: Persistent Archives

• Manage technology evolution
  – Storage system abstraction, support data migration across storage systems
  – Information repository abstraction, support catalog migration to new databases
  – Logical name space, support global persistent identifier

Page 7: Storage Resource Broker

• Integration of collection-based management of digital entities, with
  – Remote data access through storage system abstraction
  – Catalog access through information repository abstraction
  – Automation through collection-owned data

Page 8: Capabilities

• Support legacy systems
• Integrate archives with file systems
• Share distributed data
• Maintain persistent collection
• Control data access

Page 9: Digital Entities

• Digital entities are “images of reality”, made of
  – Data, the bits (zeros and ones) put on a storage system
  – Information, the attributes used to assign semantic meaning to the data
  – Knowledge, the structural relationships described by a data model
• Every digital entity requires information and knowledge to be correctly interpreted and displayed

Page 10: Digital Entities

• Files
  – Text documents, images, spreadsheets, binary files
• URLs
• Database query commands
• Databases

Page 11: Digital Entities

• Register digital entities into a catalog
• Assign metadata to describe each digital entity
• Separate management of the associated data bits from management of the metadata
• Support manipulation of each digital entity data type

Page 12: Technology Management

[Figure: diagram with the labels Old Storage System, New Operating System, Old Application, Digital Object, Old Display System, Wrap Storage System, Wrap Display System, and Migrate Encoding Format.]

Page 13: Preservation of Data

• Migration
  – Preserve the data bits
  – Preserve the digital entity name
  – Preserve the information and knowledge content for presentation by new applications
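The three preservation goals on this slide can be sketched as a migration step between two storage systems. This is an illustration under stated assumptions, not the SRB procedure: the dictionaries standing in for storage systems and the catalog, the function `migrate`, the `new/` location prefix, and the use of SHA-256 to verify the bits are all hypothetical choices.

```python
import hashlib


def checksum(bits: bytes) -> str:
    """Fingerprint of the data bits (SHA-256 is an assumed choice)."""
    return hashlib.sha256(bits).hexdigest()


def migrate(catalog, old_store, new_store, name):
    """Copy an entity's bits to a new store; the logical name never changes."""
    entry = catalog[name]
    bits = old_store[entry["location"]]
    new_location = f"new/{entry['location']}"
    new_store[new_location] = bits
    # Preserve the data bits: verify the copy before updating the catalog.
    if checksum(new_store[new_location]) != entry["sha256"]:
        raise ValueError(f"checksum mismatch while migrating {name}")
    # Only the physical location changes; name and data model are untouched.
    entry["location"] = new_location
    entry["store"] = "new"
    return entry


# Hypothetical example data.
old_store = {"vol1/a.dat": b"\x00\x01\x02"}
new_store = {}
catalog = {
    "/survey/a": {
        "store": "old",
        "location": "vol1/a.dat",
        "sha256": checksum(b"\x00\x01\x02"),
        "data_model": "hypothetical-format-v1",
    }
}
entry = migrate(catalog, old_store, new_store, "/survey/a")
```

The catalog key `/survey/a` plays the role of the preserved entity name, and the `data_model` field stands in for the information and knowledge content that a new application would need.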

Page 14: Migration Advantages

• By migrating the digital entity encoding format to new standards, more sophisticated technologies can be applied to express the information and knowledge content inherent in collections of digital entities.
• Requires the ability to associate a data model with each digital entity

Page 15: Uniform API

• Provide common access semantics
• Map from the interface preferred by your application to the interfaces required by legacy storage systems
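One common way to realize such a mapping is an adapter layer: the application codes against a single interface, and one driver per storage system translates to whatever that system requires. The sketch below assumes this design; the class names, the in-memory dictionaries, and the staging step for the archive are hypothetical stand-ins, not the SRB driver interface.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common access semantics that every storage system must provide."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, bits: bytes) -> None: ...


class FileSystemDriver(StorageDriver):
    """Stands in for a file system; a dict replaces real disk I/O."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, bits):
        self._files[path] = bits


class ArchiveDriver(StorageDriver):
    """Stands in for a legacy archive that must stage data before reading."""

    def __init__(self):
        self._tape = {}
        self._staged = {}  # hypothetical disk cache in front of the tape

    def _stage(self, path):
        self._staged[path] = self._tape[path]

    def read(self, path):
        if path not in self._staged:
            self._stage(path)
        return self._staged[path]

    def write(self, path, bits):
        self._tape[path] = bits
        self._staged.pop(path, None)  # drop any stale staged copy


def copy(src: StorageDriver, dst: StorageDriver, path: str) -> None:
    """Application code sees only the common interface."""
    dst.write(path, src.read(path))


fs, archive = FileSystemDriver(), ArchiveDriver()
fs.write("report.txt", b"hello")
copy(fs, archive, "report.txt")
result = archive.read("report.txt")
```

Supporting another storage system then means writing one more driver, while `copy` and every other caller remain unchanged.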

Page 16: SDSC Storage Resource Broker & Meta-data Catalog: Common APIs

[Figure: SRB and Meta-data Catalog architecture diagram. Labels recovered from the diagram:
– Application
– Access APIs: C, C++, Libraries; Unix Shell; Linux I/O; DLL / Python; Java, NT Browsers; Web WSDL; GridFTP
– Consistency Management / Authorization-Authentication; Prime Server
– Logical Name Space; Latency Management; Data Transport; Metadata Transport
– Catalog Abstraction: Databases (DB2, Oracle, Sybase)
– Storage Abstraction: Archives (HPSS, ADSM, UniTree, DMF); Databases (DB2, Oracle, Postgres); File Systems (Unix, NT, Mac OSX); HRM
– Servers]

Page 17: Discovery Transparencies

• Naming transparency - find a data set without knowing its name
  – Map from attributes to a global file name
• Location transparency - access a data set without knowing where it is
  – Map from global file name to local file name
• Access transparency - access a data set without knowing the type of storage system
  – Federated client-server architecture
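The three transparencies compose into a chain of lookups: attributes to global name, global name to local name, and local name to a storage-specific access routine. The sketch below illustrates that chain with in-memory tables; every table, name, and function here is hypothetical, and the lambdas only describe what a real driver would do.

```python
# Hypothetical catalog contents; every name below is illustrative.
ATTRIBUTES = {
    "lfn:/survey/img001": {"instrument": "cam-a", "year": 2001},
    "lfn:/survey/img002": {"instrument": "cam-b", "year": 2001},
}
REPLICAS = {
    "lfn:/survey/img001": ("archive", "/tape/0017/img001"),
    "lfn:/survey/img002": ("filesystem", "/data/img002"),
}
DRIVERS = {
    "archive": lambda path: f"staged and read {path}",
    "filesystem": lambda path: f"read {path}",
}


def find(**query):
    """Naming transparency: attributes -> global names."""
    return sorted(
        name for name, attrs in ATTRIBUTES.items()
        if all(attrs.get(k) == v for k, v in query.items())
    )


def locate(global_name):
    """Location transparency: global name -> (storage type, local name)."""
    return REPLICAS[global_name]


def access(global_name):
    """Access transparency: the same call whatever the storage type."""
    storage_type, local_name = locate(global_name)
    return DRIVERS[storage_type](local_name)


matches = find(instrument="cam-a")
outcome = access(matches[0])
```

The caller supplies only attributes; the global name, the physical location, and the storage type are all resolved behind the three functions.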

Page 18: SDSC Storage Resource Broker & Meta-data Catalog: Transparencies

[Figure: the SRB and Meta-data Catalog architecture diagram from page 16, repeated here under the title "Transparencies".]

Page 19: Persistent Collection

• Maintain authenticity
  – Authenticate all accesses
  – Assign roles for access control lists (curation, write, annotate, read)
  – Manage audit trails of all operations
• Collection-owned data
  – All accesses through the data management system
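A small sketch of how role-based access control and an audit trail can be combined when all accesses go through the data management system. The four role names come from the slide; the ordering of the roles (each role including the ones before it), the class `PersistentCollection`, and its methods are assumptions made for illustration.

```python
# Role ordering is an assumption: each role includes the ones before it.
ROLES = ["read", "annotate", "write", "curation"]


class PersistentCollection:
    """Collection-owned data: every access goes through this object."""

    def __init__(self):
        self._data = {}
        self._acl = {}          # user -> role
        self.audit_trail = []   # (user, operation, name, allowed)

    def grant(self, user, role):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self._acl[user] = role

    def _check(self, user, needed, operation, name):
        role = self._acl.get(user)
        allowed = role is not None and ROLES.index(role) >= ROLES.index(needed)
        # Record every attempt, including refused ones.
        self.audit_trail.append((user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def put(self, user, name, bits):
        self._check(user, "write", "put", name)
        self._data[name] = bits

    def get(self, user, name):
        self._check(user, "read", "get", name)
        return self._data[name]


coll = PersistentCollection()
coll.grant("alice", "curation")
coll.grant("bob", "read")
coll.put("alice", "doc", b"v1")
value = coll.get("bob", "doc")
try:
    coll.put("bob", "doc", b"v2")
    denied = False
except PermissionError:
    denied = True
```

Because the data dictionary is private to the collection, the audit trail sees every operation, which is the point of collection-owned data.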

Page 20: SDSC Storage Resource Broker & Meta-data Catalog: Persistency

[Figure: the SRB and Meta-data Catalog architecture diagram from page 16, repeated here under the title "Persistency".]

Page 21: Preservation (Similar requirements to a data grid)

• Name transparency
  – Find a file by attributes (map from attributes to global name)
• Location transparency
  – Access a file by a global identifier (map from global to local file name)
• Access transparency
  – Use same API to access data in archive or file cache
• Authenticity
  – Disaster recovery, replicate data across storage systems
  – Audit and process management

Page 22: SDSC Storage Resource Broker & Meta-data Catalog: Preservation

[Figure: the SRB and Meta-data Catalog architecture diagram from page 16, repeated here under the title "Preservation".]

Page 23: Convergence of Technologies

• Data grids as basis for distributed data management
  – Federation of distributed resources
  – Creation of logical name space to automate discovery
• Distributed data collections
  – Discovery based on attributes
  – Distributed data storage systems
• Digital libraries
  – Development of services for manipulating, viewing data
• Persistent archives
  – Management of technology evolution

Page 24: Data Naming Ontologies

• Concept space: Discipline concepts
• Data grid: Global Identifier
• Collection: Discipline attributes
• Archive / file systems: Local file name
• Data model: Attributes that describe data structure

Page 25: Knowledge Creation Roadmap

• Knowledge syntax (consensus)
  – RDF, XMI, Topic Map
• Knowledge management (recursive operations)
  – Oracle parallel database
• Knowledge manipulation (spatial/procedural rules)
  – Generation of inference rules and mapping to data models
• Knowledge generation (scalable inference engine)
  – Application of inference rules in inference engine

Page 26: Knowledge Based Data Grid Roadmap

[Figure: grid of three levels (Knowledge, Information, Data) by three service columns (Ingest Services, Management, Access Services). Labels recovered from the diagram:
– Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
– Information: Attributes, Semantics; Information Repository; Attribute-based Query
– Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
– Other labels: (Model-based Access); (Data Handling System - SRB); MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL]

Page 27: Further Information

http://www.npaci.edu/DICE