
Data Design Implementation and support for Build 2b

November 30, 2011

Steve Hughes

Topics

• Overview
• Key Requirements and Drivers
• Build 2b Deliverables
• Build 2b Deployment
• Issues
• Next Steps

PDS4 Architecture

Data Architecture Concepts

[Figure: concept diagram. Boxes: Information Model, Planetary Science Data Dictionary, Label Schema, Product, Tagged Data Object (Information Object), Data Object, Class, Data Element. Relationship labels: Used to Create, Describes, Extracted/Specialized, has, Expressed As, Validates.]


DRIVERS FOR PDS4 Build 2a

Recommendation to MC (2009): Replace the PDS3 ad hoc information model with a PDS4 information model that is managed using modern tools.
Implementation:
• The PDS4 Information Model has been designed and managed using the Protégé Ontology Modeling Tool.

Recommendation to MC (2009): Replace ad hoc PDS3 product definitions with PDS4 products that are defined in the model.
Implementation:
• The PDS4 Products and their components are defined using the modeling tool.
• The modeling tool provides rigorous definitions.
• The Product definition is based on the Open Archival Information System (OAIS) Reference Model, an ISO standard.

Recommendation to MC (2009): Require data product formats to be derivations from a core set. Support transformation from the core set.
Implementation:
• Four fundamental data structures have been defined.
• Additional data structures are subclasses of the four fundamental structures.
• Software written for the fundamental structures is inherited by the subclasses.

DRIVERS FOR PDS4 Build 2a (continued)

Recommendation to MC (2009): Replace “homegrown” PDS data dictionary structure with an international standard.
Implementation:
• The PDS4 Data Dictionary structure is based on the ISO/IEC 11179 specification.

Recommendation to MC (2009): Adopt a modern data language/grammar (XML) where possible for all tool implementations.
Implementation:
• The PDS4 Information Model is implemented in XML.

DRIVERS FOR PDS4 Build 2a (continued)

Requirement: 1.3.X – Provide Data Dictionary
Implementation:
• The PDS4 data dictionary database was developed and is compliant with the ISO/IEC 11179 specification.
• It is used to produce both data dictionary documents and data dictionary products for the registry and data dictionary service.

Requirement: 1.4.1 PDS will define a standard for organizing, formatting, and documenting planetary science data
Implementation:
• The PDS4 Information Model defines the archive organization, data formats, and product labeling standards.
• The PDS4 Standards Reference documents additional requirements.

Requirement: 1.4.2 PDS will maintain a dictionary of terms, values, and relationships for standardized description of planetary science data
Implementation:
• The PDS4 Data Dictionary defines the attributes, classes, and relationships for describing planetary science data.

Requirement: 1.4.3 PDS will define a standard grammar for describing planetary science data
Implementation:
• XML and XML Schema 1.1 have been adopted for the PDS4 implementation.

DRIVERS FOR PDS4 Build 2a (continued)

Requirement: 1.4.4 PDS will establish minimum content requirements for a data set (primary and ancillary data)
Implementation:
• The PDS4 Information Model defines observational and ancillary product types. These products are collected into PDS4 Collections and Archive Bundles.

Requirement: 1.4.5 PDS will, for each mission or other major data provider, produce a list of the minimum components required for archival data
Implementation:
• The PDS4 Information Model defines the archive bundle and its product collections. The archive bundle and its collections are customized for each mission.

Requirement: 3.1.2 PDS will develop and maintain online interfaces for discipline-specific searching
Implementation:
• The PDS4 Information Model and Data Dictionary define the information that is needed for search.

Requirement: 2.3.1 PDS will develop and publish procedures for determining syntactic and semantic compliance with its standards
Implementation:
• The adoption of XML and XML Schema 1.1 provides syntactic and semantic standards.
• They provide utilities and tools for validation.


Build 2a Scope

• Begin supporting PDS4 label design for LADEE and MAVEN; Begin planning/testing migration

• Support the Policy on Acceptable PDS4 Data Formats

• Support transition of the central catalog to the registry infrastructure

• Deploy early PDS4 software tools and services

Build 2a Deliverables


Document/Artifact Processes

1 Introduction Data Provider

2 Concepts Document Standards Development

3 Glossary

4 Jumpstart Guide

5 Data Provider’s Handbook

6 Standards Reference

7 Data Dictionary

8 Example Products

10 Generic Schemas

11 Information Model

PDS4 Documents in Context

[Figure: diagram of the PDS4 document suite. Boxes and their captions: Concepts Document (Big Picture); Standards Reference (Requirements, User Friendly); XML Schemas (Blueprints); PDS4 Product Labels (Deliverables); Data Dictionary (Definitions); PDS4 Information Model Specification (Requirements, Engineering Specification); Data Provider’s Handbook (Cookbook); Registry Configuration File (Object Descriptions); Registry (Product Tracking and Cataloging); Introduction to PDS4 Documentation; Jumpstart; Data Dictionary Tutorial. One group is labeled Informative. Arrow labels: derive, generates, references, creates/validates, instruct, configures. Legend: Complete, Some TBD.]

Data Format Deliverables vis-à-vis Policy

PDS shall accept the following PDS4 data formats:

Policy: Fixed-width binary and ASCII tables that are composed of identically structured records
Deliverable:
• Table_Base - The Table Base class defines a heterogeneous repeating record of scalars.
• Table_Character and Table_Binary are defined as types of Table_Base.

Policy: N-dimensional arrays of homogeneous binary elements (N<=16)
Deliverable:
• Array_Base - The Array Base class defines a homogeneous N-dimensional array of scalars.
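The two structures described above can be illustrated with a minimal Python sketch. It is not PDS software: the record layout (one integer, one float, one 6-byte string), the big-endian byte order, the last-axis-fastest storage order, and the function names are all assumptions chosen for illustration.

```python
import math
import struct

# Hypothetical record layout: a 4-byte integer, an 8-byte float and a
# 6-byte ASCII string per record (big-endian, no padding between fields).
RECORD = struct.Struct(">id6s")


def pack_table(rows):
    """Concatenate identically structured records into one byte string."""
    return b"".join(RECORD.pack(n, x, s.encode("ascii")) for n, x, s in rows)


def unpack_table(data):
    """Split a byte string back into (int, float, str) records."""
    if len(data) % RECORD.size:
        raise ValueError("data is not a whole number of records")
    return [
        (n, x, s.decode("ascii").rstrip("\x00 "))
        for n, x, s in RECORD.iter_unpack(data)
    ]


def unpack_array(data, shape, code="h"):
    """Read a homogeneous array of big-endian scalars of one struct type.

    Assumes the last axis varies fastest; returns nested lists.
    """
    if not 1 <= len(shape) <= 16 or any(n < 1 for n in shape):
        raise ValueError("shape must have 1 to 16 positive dimensions")
    flat = struct.unpack(f">{math.prod(shape)}{code}", data)

    def nest(values, dims):
        # Peel off the first axis and recurse on the remaining ones.
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]
        return [nest(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]

    return nest(flat, tuple(shape))
```

The table reader handles mixed scalar types within one repeating record, while the array reader assumes a single scalar type for every element, which is the distinction the two base classes draw.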

Data Format Deliverables vis-à-vis Policy (continued)

Policy: Variable-width character 'spreadsheets' that are composed of repeating, M-field, stream-delimited records where the fields themselves are (separately) delimited and may have variable widths (M>0)
Deliverable:
• Delimited_Table - The Delimited_Table class defines a simple table (spreadsheet) with delimited fields and records.
• It is defined as a type of Parsable_Byte_Stream.

Policy: NAIF/SPICE files
Deliverable:
• The SPICE_Kernel_Binary and SPICE_Kernel_Text classes describe SPICE files.

Policy: PDS shall accept ASCII text and PDF/A formats for PDS4 documentation. PDS shall accept JPEG, GIF, and TIFF images for figures accompanying documents. PDS shall accept any of the approved structures and formats for browse products.
Deliverable:
• Product_Document - A Product Document is a product consisting of a single logical document comprised of one or more document formats.
• ASCII Text and PDF/A are currently allowed as document formats.
• JPEG, GIF, TIFF, and PNG are allowed as non-science image formats.
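As a rough illustration of the Delimited_Table idea above (repeating records of M delimited, variable-width fields), here is a small Python sketch using the standard csv module. The function name, the comma delimiter, and the field-count check are assumptions for the example, not part of the PDS4 definition.

```python
import csv
import io


def read_delimited_table(text, field_delimiter=",", expected_fields=None):
    """Parse delimited records into lists of string fields.

    If expected_fields is given, every record must have exactly that
    many fields (M > 0), otherwise a ValueError is raised.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=field_delimiter)
    rows = [row for row in reader if row]  # skip completely empty lines
    if expected_fields is not None:
        for number, row in enumerate(rows, start=1):
            if len(row) != expected_fields:
                raise ValueError(
                    f"record {number} has {len(row)} fields, "
                    f"expected {expected_fields}"
                )
    return rows
```

Fields keep their variable widths, including empty ones; converting them to numbers or dates would be a separate step driven by the label.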

The Deliverables from 10K

[Figure: overview diagram. Labels: PDS4 Model, PDS4 Products, PDS4 Data Formats, Base, Extensions/Restrictions.]

PDS4 Observational Product

[Figure: class diagram of the observational product. Components shown: Identification_Area, Cross_Reference_Area, Observation_Area, File_Area, Digital_Object, Subject_Area, Bibliographic_Reference, Mission_Area, Node_Area, Observing_System, Reference_Entry, Data_Standards [1]. Cardinality labels appearing in the diagram: [0..1], [1], [1..*], [0..*].]

Data Standards Development Process

• Domain expertise was captured in the PDS4 Information Model as an ontology.
• The model represents a consensus of the domain experts.
• The model is the single source for the PDS4 Data Standards, for example the generated XML Schemas.

[Figure: process diagram. Labels: Domain Knowledge, PDS4 Information Model, Information Modeling Tool, Filter and Translator, XML Schema (Generic), the last shown four times.]
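The "single source" idea above can be sketched in a few lines of Python: a schema is generated from a model rather than written by hand, so the two cannot drift apart. The toy model, the tuple layout, and the generate_schema function are hypothetical and are not the actual PDS4 filter and translator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical toy model: class name -> list of
# (attribute name, XSD type, minimum occurrences, maximum or None for unbounded).
MODEL = {
    "Identification_Area": [
        ("logical_identifier", "xs:string", 1, 1),
        ("version_id", "xs:string", 1, 1),
        ("alternate_title", "xs:string", 0, None),
    ],
}


def generate_schema(model):
    """Translate the model into an XML Schema document (as a string)."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    for class_name, attributes in model.items():
        ctype = ET.SubElement(schema, f"{{{XS}}}complexType", name=class_name)
        sequence = ET.SubElement(ctype, f"{{{XS}}}sequence")
        for name, xsd_type, low, high in attributes:
            ET.SubElement(
                sequence,
                f"{{{XS}}}element",
                name=name,
                type=xsd_type,
                minOccurs=str(low),
                maxOccurs="unbounded" if high is None else str(high),
            )
    return ET.tostring(schema, encoding="unicode")


if __name__ == "__main__":
    print(generate_schema(MODEL))
```

Changing a cardinality or adding an attribute in MODEL and regenerating is the only way the schema changes in this sketch, which mirrors the role the slide assigns to the information model.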


Build 2b Deployment

• Resolve build 2a liens (to be discussed) and generate a build 2b deployment

• Generate a release of the information model, companion documents and supporting tutorial material

• Generate new schemas

• Generate registry configuration information

• Post key documents to PDS website


Chart of Review Comments

Total: 1173

Total: 1935

Build 2a Identified Liens

Lien: Need to finalize and freeze the information model for Build 2b incorporating high priority changes identified in Build 2a.
Brief Explanation: Address issues found with the information model focusing primarily on the core components of the product labels and the aggregate products, collections and bundles.

Lien: Need capabilities to support local data dictionary validation and the creation of schema and human-readable definition lists.
Brief Explanation: There is a lack of instructions for creating, validating, and using local keywords and classes (this includes lack of support for generating human-readable definition lists for peer review).

Build 2a Identified Liens (continued)

Lien: Need to baseline the current documentation; need to provide additional information/changes.
Brief Explanation: Documents are still overlapping, not up to date, inconsistent in areas, and have gaps.

Lien: Need to finalize and freeze the XML Schema for Build 2b incorporating the extension schemas currently under testing by the DDWG.
Brief Explanation: Newer “extension” style schemas are not yet mature enough to be used by an external data provider. They seem to be preferred over the older but stable “flat” schemas that were available for the node exercises. Both are currently produced and produce similar labels.


Build 2b Actions – Jan ‘12

• Finalize and freeze the information model for Build 2b incorporating high priority changes identified in Build 2a.
• Use existing capabilities to support local data dictionary validation and the creation of schema and human-readable definition lists.
• Baseline the current documentation
• Add any additional information/changes to an online resource (e.g., wiki)
• Finalize and freeze the XML Schema for Build 2b incorporating the extension schemas currently under testing by the DDWG

Conclusion

• The PDS4 Information Model represents the DDWG consensus.
• A large number of decisions resulting from much discussion were captured in the model.
• All had a say, not everyone always got their way.
• On the scheduled date the model will be frozen and the PDS4 Data Standards will be generated and deployed.
• The schemas, the dictionary, and all other generated artifacts will be consistent with the model.
• The current consensus, as reflected in the model, will be operational.

Acknowledgements*

Ed Bell, Richard Chen, Dan Crichton, Amy Culver, Patty Garcia, Ed Grayzeck, Ed Guinness, Mitch Gordon, Sean Hardman, Lyle Huber, Steve Hughes, Chris Isbell, Steve Joy, Ronald Joyner, Debra Kazden, Todd King, Joe Mafi, Mike Martin, Thomas Morgan, Lynn Neakrase, Paul Ramirez, Anne Raugh, Mark Rose, Elizabeth Rye, Boris Semenov, Dick Simpson, Susie Slavney

Peter Allan, David Heather, Michel Gangloff, Santa Martinez, Thomas Roatsch, Alain Sarkissian

* Anyone who sat through a DDWG 2-hour telecon or provided useful input.

Thank You

Questions and Answers


Backup


Too Many {objects, classes, schemas, …}

• Abstract (vacuous) classes are used for organizational purposes.
• These are not included in the schemas and many are being deleted.
• Subclasses of the four fundamental structures are used to partition the set of allowed structures, for example the Array_2D_Image subclass of Array_Base.
• Question to be answered, does the PDS want to provide software specific to Array_2D_Image?
• All Array_Base software works for any Array_2D_Image.
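The point that Array_Base software works for any Array_2D_Image is ordinary subclass inheritance. A minimal Python sketch follows; the class and function names, the lines/samples parameters, and the last-index-fastest storage order are assumptions for illustration, not PDS4 code.

```python
class ArrayBase:
    """Homogeneous N-dimensional array stored as a flat list of scalars."""

    def __init__(self, shape, values):
        expected = 1
        for n in shape:
            expected *= n
        if len(values) != expected:
            raise ValueError("number of values does not match shape")
        self.shape = tuple(shape)
        self.values = list(values)

    def element(self, *index):
        """Return one element; the last index is assumed to vary fastest."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of range")
            offset = offset * n + i
        return self.values[offset]


class Array2DImage(ArrayBase):
    """A two-dimensional specialization; it adds no new behavior here."""

    def __init__(self, lines, samples, values):
        super().__init__((lines, samples), values)


def maximum(array):
    """Generic software written against ArrayBase only."""
    return max(array.values)


image = Array2DImage(2, 3, [1, 2, 3, 4, 5, 6])
print(image.element(1, 0), maximum(image))
```

Whether to also provide software specific to the subclass, the question raised on the slide, is then a matter of adding methods to Array2DImage without touching the generic code.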

Too Many {objects, classes, schemas, …} (continued)

• Subclasses of a product component are used to provide specificity, for example, the subclass Bundle_Member_Entry.
• There are three methods: change the name, change the namespace (new file), or use optional attributes.
• Some specific subclasses are used for special purposes, for example Table_Field_Checksum in an Inventory.
• Consider using Schematron Assert statements to validate.
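A Schematron assert pairs a context (which elements to visit) with a test that must hold for each of them and a message to report when it does not. The Python sketch below only mimics that pattern with the standard library; it is not Schematron itself, and the label fragment, the child element name value, and the 32-hex-digit rule are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical label fragment; real PDS4 labels are structured differently.
LABEL = """<Inventory>
  <Table_Field_Checksum><value>0123456789abcdef0123456789abcdef</value></Table_Field_Checksum>
  <Table_Field_Checksum><value>not-a-checksum</value></Table_Field_Checksum>
</Inventory>"""


def is_md5_hex(element):
    """Test: the child 'value' must be exactly 32 lowercase hex digits."""
    value = element.findtext("value", default="")
    return re.fullmatch(r"[0-9a-f]{32}", value) is not None


# Each rule: (context path, test function, message reported on failure).
RULES = [
    (".//Table_Field_Checksum", is_md5_hex,
     "checksum must be 32 lowercase hex digits"),
]


def check(root, rules):
    """Apply every rule to every matching element; collect failure messages."""
    failures = []
    for context, test, message in rules:
        for element in root.findall(context):
            if not test(element):
                failures.append(f"{element.tag}: {message}")
    return failures


if __name__ == "__main__":
    for failure in check(ET.fromstring(LABEL), RULES):
        print(failure)
```

The appeal of this style is that a special-purpose constraint lives in a rule rather than in a dedicated subclass, which is the trade-off the slide suggests considering.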

Too Many {objects, classes, schemas, …} (continued)

• Some classes result from the process of normalization, for example array_axis and array_element.
• Emperor Joseph II: …And there are simply too many notes, that's all. Just cut a few and it will be perfect. Mozart: Which few did you have in mind, Majesty?

Action Item Flowchart

By the numbers

• Fundamental Data Structures – 4
• Lines of Schema Code: Flat 18K; Master 4k-6k
• Classes dropped (Master) – nn
• SimpleTypes dropped (Master) – 200
• Actionable items closed – 1.5K
• Actionable items open – < 50
• Issues from reviews – 1k+

Totals

Category        Internal   IPDA   External   Readiness   Total
Narrative             11      4         18          15      48
Documentation        143    152        250          87     632
Actionable             1     15         16          31      63
Discussion            13     76         42          43     174
Research               8      5         33          44      90
Kudo                  34     24         29           1      88
System/Tools           4      6          3          22      35
Discipline             1      4         14          10      29
Process                0     13          0           1      14
Total                215    299        405         254    1173
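The row and column totals in the table above can be cross-checked with a short Python tally. The counts are copied from the table; the variable names are only illustrative.

```python
COLUMNS = ("Internal", "IPDA", "External", "Readiness")

# Review comment counts per category, in COLUMNS order.
COUNTS = {
    "Narrative": (11, 4, 18, 15),
    "Documentation": (143, 152, 250, 87),
    "Actionable": (1, 15, 16, 31),
    "Discussion": (13, 76, 42, 43),
    "Research": (8, 5, 33, 44),
    "Kudo": (34, 24, 29, 1),
    "System/Tools": (4, 6, 3, 22),
    "Discipline": (1, 4, 14, 10),
    "Process": (0, 13, 0, 1),
}

# Sum across each row, down each column, and overall.
row_totals = {category: sum(values) for category, values in COUNTS.items()}
column_totals = tuple(sum(column) for column in zip(*COUNTS.values()))
grand_total = sum(column_totals)

for name, total in zip(COLUMNS, column_totals):
    print(f"{name}: {total}")
print(f"Total: {grand_total}")
```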

Post Build 2b – Summer ‘12

• Develop discipline level classes for the next phase of data set migration

• Refine the document suite and its organization.
• Support development of tools scheduled for the next build.
• Support development of data dictionary and local data dictionary services.

Capability Matrix

[Four slides of capability matrix tables; only the slide title survives in the transcript.]