National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Persistent Management of Distributed Data
Reagan W. Moore
General Atomics, Inc.
San Diego Supercomputer Center
[email protected]
http://www.npaci.edu/DICE/
Topics
• Data management systems
  – Data collections, digital libraries
• Distributed data management
  – Data grids
• Persistent data management
  – Persistent archives
• Common infrastructure for data management
Data Collections
• Define the context for describing a collection of digital entities
  – Context specified by metadata attributes
  – Provenance, origin of the digital entities
  – Administrative, location of the digital entities
  – Technical, purpose of the digital entities
• Support organization of attributes as a hierarchy of sub-collections
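The points above can be sketched as a small data structure. This is a minimal illustration, not the SRB implementation; the class names (Entity, Collection), the walk method, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A digital entity described by three categories of metadata."""
    name: str
    provenance: dict = field(default_factory=dict)      # origin of the entity
    administrative: dict = field(default_factory=dict)  # location of the entity
    technical: dict = field(default_factory=dict)       # purpose of the entity


@dataclass
class Collection:
    """A collection holding entities and nested sub-collections."""
    name: str
    entities: list = field(default_factory=list)
    subcollections: list = field(default_factory=list)

    def walk(self, prefix=""):
        """Yield (path, entity) pairs for this collection and everything below it."""
        path = f"{prefix}/{self.name}"
        for entity in self.entities:
            yield f"{path}/{entity.name}", entity
        for sub in self.subcollections:
            yield from sub.walk(path)


# Hypothetical sample data.
survey = Collection("survey", entities=[
    Entity("image001",
           provenance={"instrument": "telescope-A"},
           administrative={"site": "sdsc"},
           technical={"use": "calibration"}),
])
root = Collection("astronomy", subcollections=[survey])
paths = [path for path, _ in root.walk()]
```

Walking the root collection lists each entity under the path of its sub-collection, which is the hierarchical view a browsing service would present.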
Digital Libraries
• Provide services on the data collection
  – Ingestion, loading of attribute values
  – Extensibility, definition of new attributes
  – Discovery, queries on attributes
  – Browsing, hierarchical listing
  – Presentation, formatting specified data models
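The five services can be illustrated with a toy in-memory catalog. This is a sketch under stated assumptions: DigitalLibrary and its method names are hypothetical, and a real digital library would store its attributes in a database rather than a dictionary.

```python
class DigitalLibrary:
    """Toy catalog offering the five services listed above."""

    def __init__(self, attributes):
        self.attributes = set(attributes)  # attribute names the catalog knows
        self.records = {}                  # entity path -> attribute values

    def define_attribute(self, name):
        """Extensibility: add a new attribute to the schema."""
        self.attributes.add(name)

    def ingest(self, path, **values):
        """Ingestion: load attribute values for one entity."""
        unknown = set(values) - self.attributes
        if unknown:
            raise KeyError(f"undefined attributes: {sorted(unknown)}")
        self.records.setdefault(path, {}).update(values)

    def discover(self, **criteria):
        """Discovery: return paths whose attributes match every criterion."""
        return sorted(path for path, rec in self.records.items()
                      if all(rec.get(k) == v for k, v in criteria.items()))

    def browse(self, prefix):
        """Browsing: list the entities under a sub-collection path."""
        root = prefix.rstrip("/") + "/"
        return sorted(path for path in self.records if path.startswith(root))

    def present(self, path):
        """Presentation: format one record as a line of text."""
        rec = self.records[path]
        return path + ": " + ", ".join(f"{k}={rec[k]}" for k in sorted(rec))


# Hypothetical sample data.
library = DigitalLibrary(["creator"])
library.define_attribute("year")
library.ingest("/maps/coast", creator="survey team", year=1999)
library.ingest("/maps/inland", creator="survey team", year=2001)
library.ingest("/photos/pier", creator="visitor", year=1999)
```

Here discover(year=1999) queries on an attribute regardless of where the entity sits in the hierarchy, while browse("/maps") lists one sub-collection.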
Data Grids
• Manage data in a distributed environment
  – Logical name space, provide global identifier
  – Data access, storage system abstraction
  – Replication, disaster back up
  – Uniform access, common API across file systems, archives, and databases
  – Single sign-on, authenticate across administration domains
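A minimal sketch of how a logical name space and replication fit together, assuming a hypothetical StorageSystem abstraction with put and get calls. Authentication and single sign-on are left out, and none of these names come from the SRB itself.

```python
class StorageSystem:
    """Minimal storage abstraction: every backend offers put/get by local name."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self._blobs = {}

    def put(self, local_name, data):
        self._blobs[local_name] = data

    def get(self, local_name):
        if not self.online:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._blobs[local_name]


class DataGrid:
    """Maps a global identifier to replicas held on several storage systems."""

    def __init__(self):
        self.replicas = {}  # global id -> list of (storage, local name)

    def store(self, global_id, data, targets):
        """Write the data to every target and record each replica."""
        for storage, local_name in targets:
            storage.put(local_name, data)
            self.replicas.setdefault(global_id, []).append((storage, local_name))

    def read(self, global_id):
        """Return the data from the first replica that responds."""
        for storage, local_name in self.replicas[global_id]:
            try:
                return storage.get(local_name)
            except ConnectionError:
                continue
        raise ConnectionError(f"no replica of {global_id} is reachable")


# Hypothetical sample data: one replica on disk, one in an archive.
disk = StorageSystem("disk")
archive = StorageSystem("archive")
grid = DataGrid()
grid.store("/grid/run42", b"payload", [(disk, "/tmp/r42"), (archive, "tape:0042")])
disk.online = False              # simulate losing the disk copy
data = grid.read("/grid/run42")  # served from the archive replica
```

The caller only ever uses the global identifier; which replica answers is decided inside the grid.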
Persistent Archives
• Manage technology evolution
  – Storage system abstraction, support data migration across storage systems
  – Information repository abstraction, support catalog migration to new databases
  – Logical name space, support global persistent identifier
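Migration across storage systems can be sketched as copy, verify, then repoint the persistent identifier. In this illustration plain dictionaries stand in for the storage systems and the catalog; the function name migrate and all sample values are assumptions.

```python
def migrate(catalog, logical_name, old_store, new_store, new_label, new_local_name):
    """Copy the bits to new storage, then repoint the logical name at the copy."""
    old_local_name = catalog[logical_name]["local_name"]
    data = old_store[old_local_name]
    new_store[new_local_name] = data
    if new_store[new_local_name] != data:  # verify before switching over
        raise IOError("copy verification failed")
    catalog[logical_name] = {"store": new_label, "local_name": new_local_name}
    del old_store[old_local_name]          # retire the old copy last


# Hypothetical sample data.
old_tape = {"vol7/file3": b"bits"}
new_disk = {}
catalog = {"/archive/report": {"store": "old_tape", "local_name": "vol7/file3"}}

migrate(catalog, "/archive/report", old_tape, new_disk, "new_disk", "pool1/report.dat")
```

The logical name /archive/report is the same before and after the move; only the catalog entry behind it changes.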
Storage Resource Broker
• Integration of collection-based management of digital entities, with
  – Remote data access through storage system abstraction
  – Catalog access through information repository abstraction
  – Automation through collection-owned data
Capabilities
• Support legacy systems
• Integrate archives with file systems
• Share distributed data
• Maintain persistent collection
• Control data access
Digital Entities
• Digital entities are “images of reality”, made of
  – Data, the bits (zeros and ones) put on a storage system
  – Information, the attributes used to assign semantic meaning to the data
  – Knowledge, the structural relationships described by a data model
• Every digital entity requires information and knowledge to be correctly interpreted and displayed
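A small example shows why the bits alone are not enough. The attribute names and the data model below are hypothetical; the point is only that the same twelve bytes become meaningful once a data model (knowledge) and descriptive attributes (information) are supplied.

```python
import struct

# Data: the raw bits as stored.
data = struct.pack("<3f", 20.5, 21.0, 19.75)

# Information: attributes that give the bits semantic meaning (hypothetical).
information = {"quantity": "temperature", "units": "celsius"}

# Knowledge: the structure described by a data model (hypothetical).
knowledge = {"byte_order": "<", "element_type": "f", "count": 3}


def interpret(data, information, knowledge):
    """Decode the bits with the data model, then label them with the attributes."""
    fmt = f"{knowledge['byte_order']}{knowledge['count']}{knowledge['element_type']}"
    values = struct.unpack(fmt, data)
    return [f"{information['quantity']} = {v} {information['units']}" for v in values]


lines = interpret(data, information, knowledge)
```

Without the data model the bytes cannot be decoded, and without the attributes the decoded numbers have no stated meaning.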
Digital Entities
• Files
  – Text documents, images, spreadsheets, binary files
• URLs
• Database query commands
• Databases
Digital Entities
• Register digital entities into a catalog
• Assign metadata to describe each digital entity
• Separate management of the associated data bits from management of the metadata
• Support manipulation of each digital entity data type
Technology Management
[Figure: diagram with the labels Digital Object, Old Storage System, New Operating System, Old Application, and Old Display System, and the options Wrap Storage System, Wrap Display System, and Migrate Encoding Format.]
Preservation of Data
• Migration
  – Preserve the data bits
  – Preserve the digital entity name
  – Preserve the information and knowledge content for presentation by new applications
Migration Advantages
• By migrating the digital entity encoding format to new standards, more sophisticated technologies can be applied to express the information and knowledge content inherent in collections of digital entities.
• Requires the ability to associate a data model with each digital entity
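As an illustration of migrating an encoding format, the sketch below re-encodes a CSV entity as JSON. The choice of formats, the dictionary layout, and the function name are assumptions for the example; the slides do not prescribe particular formats.

```python
import csv
import io
import json


def migrate_csv_to_json(entity):
    """Re-encode a CSV entity as JSON, keeping its name and recording the new data model."""
    rows = list(csv.DictReader(io.StringIO(entity["data"])))
    fields = list(rows[0].keys()) if rows else []
    return {
        "name": entity["name"],  # the persistent name is unchanged
        "data_model": {"format": "json", "fields": fields},
        "data": json.dumps(rows),
    }


# Hypothetical sample entity.
old = {
    "name": "/archive/stations",
    "data_model": {"format": "csv"},
    "data": "id,city\n1,San Diego\n2,La Jolla\n",
}
new = migrate_csv_to_json(old)
```

The migrated entity carries its data model with it, so a later application can tell how to parse the data without guessing.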
Uniform API
• Provide common access semantics
• Map from the interface preferred by your application to the interfaces required by legacy storage systems
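The mapping idea can be sketched with two adapters that expose the same read and write calls. Both driver classes are hypothetical stand-ins: the in-memory one only imitates a legacy archive and does not call any real archive API.

```python
import os
import tempfile


class FileSystemDriver:
    """Adapter for a local file system."""

    def __init__(self, root):
        self.root = root

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


class InMemoryArchiveDriver:
    """Stand-in for a legacy archive that has its own native calls."""

    def __init__(self):
        self._volumes = {}

    def write(self, name, data):
        self._volumes[name] = bytes(data)  # a real driver would call the archive here

    def read(self, name):
        return self._volumes[name]


def copy(name, source, target):
    """Application code sees one interface, whatever the backend."""
    target.write(name, source.read(name))


with tempfile.TemporaryDirectory() as root:
    disk = FileSystemDriver(root)
    archive = InMemoryArchiveDriver()
    disk.write("notes.txt", b"hello")
    copy("notes.txt", disk, archive)
    archived = archive.read("notes.txt")
```

Because copy depends only on the shared semantics, supporting another legacy system means writing one more driver, not changing the application.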
[Figure: SDSC Storage Resource Broker & Meta-data Catalog, Common APIs. Labeled components: Application; Access APIs (Unix shell; Java, NT browsers; Web, WSDL; GridFTP; C, C++ libraries; Linux I/O; DLL / Python); Servers (consistency management / authorization-authentication; prime server; logical name space; latency management; data transport; metadata transport); Storage abstraction (archives: HPSS, ADSM, UniTree, DMF; databases: DB2, Oracle, Postgres; file systems: Unix, NT, Mac OSX; HRM); Catalog abstraction (databases: DB2, Oracle, Sybase).]
Discovery Transparencies
• Naming transparency - find a data set without knowing its name
  – Map from attributes to a global file name
• Location transparency - access a data set without knowing where it is
  – Map from global file name to local file name
• Access transparency - access a data set without knowing the type of storage system
  – Federated client-server architecture
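The three transparencies chain together as successive lookups. The tables and function names below are hypothetical and hold a single made-up record; they only show the order of the mappings.

```python
# Hypothetical catalog contents.
attribute_index = {("project", "climate"): ["/grid/climate/run1"]}              # attributes -> global names
location_map = {"/grid/climate/run1": ("hpss-archive", "/hpss/u/42/run1.dat")}  # global -> (resource, local name)
resource_types = {"hpss-archive": "archive"}                                    # resource -> storage type


def find(attribute, value):
    """Naming transparency: look up global names by attribute."""
    return attribute_index.get((attribute, value), [])


def locate(global_name):
    """Location transparency: resolve a global name to a resource and local name."""
    return location_map[global_name]


def access(global_name):
    """Access transparency: the storage type is chosen here, hidden from the caller."""
    resource, local_name = locate(global_name)
    return f"{resource_types[resource]} driver reads {local_name}"


names = find("project", "climate")
result = access(names[0])
```

The caller supplies only an attribute and a value; the global name, the local name, and the storage type are all resolved on its behalf.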
[Figure: SDSC Storage Resource Broker & Meta-data Catalog, Transparencies. Same architecture diagram as under Uniform API: Application, Access APIs, Servers, Storage Abstraction, Catalog Abstraction.]
Persistent Collection
• Maintain authenticity
  – Authenticate all accesses
  – Assign roles for access control lists (curation, write, annotate, read)
  – Manage audit trails of all operations
• Collection-owned data
  – All accesses through the data management system
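A minimal sketch of role checks combined with an audit trail. The four role names come from the list above; treating them as an ordered ladder in which a stronger role implies the weaker ones is an assumption, as are the class and method names.

```python
from datetime import datetime, timezone

# Roles from weakest to strongest; the ordering is an assumption for this sketch.
ROLES = ["read", "annotate", "write", "curation"]


class PersistentCollection:
    """All accesses go through this object, so each one can be checked and logged."""

    def __init__(self, acl):
        self.acl = acl         # user -> role
        self.audit_trail = []  # every attempted operation is recorded
        self._data = {}

    def _check(self, user, needed, operation, name):
        role = self.acl.get(user)
        allowed = role is not None and ROLES.index(role) >= ROLES.index(needed)
        self.audit_trail.append(
            (datetime.now(timezone.utc), user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def write(self, user, name, data):
        self._check(user, "write", "write", name)
        self._data[name] = data

    def read(self, user, name):
        self._check(user, "read", "read", name)
        return self._data[name]


# Hypothetical users and data.
archive = PersistentCollection({"alice": "curation", "bob": "read"})
archive.write("alice", "report", b"v1")
content = archive.read("bob", "report")
try:
    archive.write("bob", "report", b"v2")
except PermissionError:
    denied = True
else:
    denied = False
```

Refused operations are logged as well as successful ones, so the audit trail covers every attempt made against the collection.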
[Figure: SDSC Storage Resource Broker & Meta-data Catalog, Persistency. Same architecture diagram as under Uniform API: Application, Access APIs, Servers, Storage Abstraction, Catalog Abstraction.]
Preservation (Similar requirements to a data grid)
• Name transparency
  – Find a file by attributes (map from attributes to global name)
• Location transparency
  – Access a file by a global identifier (map from global to local file name)
• Access transparency
  – Use same API to access data in archive or file cache
• Authenticity
  – Disaster recovery, replicate data across storage systems
  – Audit and process management
[Figure: SDSC Storage Resource Broker & Meta-data Catalog, Preservation. Same architecture diagram as under Uniform API: Application, Access APIs, Servers, Storage Abstraction, Catalog Abstraction.]
Convergence of Technologies
• Data grids as basis for distributed data management
  – Federation of distributed resources
  – Creation of logical name space to automate discovery
• Distributed data collections
  – Discovery based on attributes
  – Distributed data storage systems
• Digital libraries
  – Development of services for manipulating, viewing data
• Persistent archives
  – Management of technology evolution
Data Naming Ontologies
• Concept space: discipline concepts
• Data grid: global identifier
• Collection: discipline attributes
• Archive / file systems: local file name
• Data model: attributes that describe data structure
Knowledge Creation Roadmap
• Knowledge syntax (consensus)
  – RDF, XMI, Topic Map
• Knowledge management (recursive operations)
  – Oracle parallel database
• Knowledge manipulation (spatial/procedural rules)
  – Generation of inference rules and mapping to data models
• Knowledge generation (scalable inference engine)
  – Application of inference rules in inference engine
Knowledge Based Data Grid Roadmap
[Figure: grid of Knowledge, Information, and Data levels against Ingest Services, Management, and Access Services. Labels at the Knowledge level: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse. Labels at the Information level: Attributes, Semantics; Information Repository; Attribute-based Query. Labels at the Data level: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query. Vertical interface labels: MCAT/HDF, Grids, XML DTD, SDLIP, XTM DTD, Rules - KQL. Other labels: (Model-based Access), (Data Handling System - SRB).]
Further Information
http://www.npaci.edu/DICE