Data, Metadata, and Ontology in Ecology
Matthew B. Jones
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
and many major collaborators: Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others
April 24, 2007
Scaling-up Synthesis
• More than 400 projects at NCEAS
– have produced over 1000 publications that synthesize and re-use existing data
– massive investment in compiling, integrating, and analyzing data
• Building a custom database for each project is not logistically feasible
• Instead, need loosely-coupled systems that accommodate heterogeneity
Dilemma: no unified model
• No single database suffices
– Data warehouses use federated schemas
  • any data that does not fit is not captured
  • original data transformed to fit federation
    – this is a form of data integration for one purpose
– Numerous data warehouses exist
  • not extensible for all data
  • VegBank, ClimbDB, GenBank, PDB, etc.
• Metadata-based data collections
– Loosely-coupled metadata and data collections
– No constraints on data schemas
– Data discovery based on metadata
– Dynamic data loading and query based on metadata descriptions
Data Collections
[Diagram: the metadata components that describe a data collection: Identity and Discovery Information; Coverage: Space, Time, Taxa; Methods; Logical Data Model; Physical Data Format; Access and Distribution]
What is EML?
• Ecological Metadata Language
A …
• modular
• extensible
• comprehensive
EML: Selected relationships
[Timeline figure, 1991 to 2009: EML versions 1.0.0, 1.3.0, 1.4.x, 2.0.0, and 2.0.1, shown in relation to CSDGM 1.0, the Michener ’97 paper, the ESA FLED Report, NBII BDP, ISO 19115, Dublin Core, XML 1.0, and OBOE]
A simple EML example
eml
  packageId: sbclter.316.18
  system: knb
  dataset
    title: Kelp Forest Community Dynamics: Benthic Fish
    creator
      individualName
        surName: Reed
    contact
      individualName
        surName: Evans
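The tree in this example maps onto an XML document. The following is a minimal sketch of such a document and of reading it back with Python's standard library. It is not a schema-valid EML record: the nesting is assumed from the slide (with Reed taken as the creator and Evans as the contact, which the diagram as extracted does not make certain), and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified EML-like document; only the fields shown on the slide are included.
EML_DOC = """\
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
         packageId="sbclter.316.18" system="knb">
  <dataset>
    <title>Kelp Forest Community Dynamics: Benthic Fish</title>
    <creator><individualName><surName>Reed</surName></individualName></creator>
    <contact><individualName><surName>Evans</surName></individualName></contact>
  </dataset>
</eml:eml>
"""


def summarize(eml_text):
    """Pull the identity and discovery fields out of an EML-like document."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")  # child elements are not namespace-qualified
    return {
        "packageId": root.get("packageId"),
        "system": root.get("system"),
        "title": dataset.findtext("title"),
        "creators": [e.text for e in dataset.findall("creator/individualName/surName")],
        "contacts": [e.text for e in dataset.findall("contact/individualName/surName")],
    }


summary = summarize(EML_DOC)
print(summary["packageId"], "-", summary["title"])
```

A discovery tool would index fields like these across many packages rather than reading one document at a time.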
Data Discovery
Geographic, Temporal, and Taxonomic coverage
Logical Model: Attribute structure
• Describes data tables and their variables/attributes
• A typical data table with 10 attributes
– some metadata are likely apparent, others ambiguous
– missing value code is present
– definitions need to be explicit, as well as data typing
YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .
[Callouts on the table: species codes, value bounds, date format, code definitions]
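To make the point about explicit definitions and typing concrete, here is a minimal sketch of attribute-level metadata driving how raw rows are interpreted. The `ATTRIBUTES` dictionary is a hypothetical, simplified stand-in for EML attribute descriptions, not EML syntax. The species code set is only what appears in the sample rows, and the lower bound on SIZE is an assumption.

```python
from datetime import datetime

HEADER = "YEAR MONTH DATE SITE TRANSECT SECTION SP_CODE SIZE OBS_CODE NOTES".split()
ROWS = [
    "2001 8 2001-08-22 ABUR 1 0-20 CLIN 5 06 .",
    "2001 8 2001-08-22 ABUR 2 21-40 NF 0 06 .",
    "2001 8 2001-08-27 AHND 1 0-20 NF 0 03 .",
]

# Hypothetical attribute metadata; columns not listed are treated as plain text.
ATTRIBUTES = {
    "YEAR": {"kind": "integer"},
    "MONTH": {"kind": "integer", "min": 1, "max": 12},
    "DATE": {"kind": "date", "format": "%Y-%m-%d"},
    "SP_CODE": {"kind": "code", "codes": {"CLIN", "OPIC", "COTT", "NF"}},
    "SIZE": {"kind": "integer", "min": 0},       # assumed value bound
    "OBS_CODE": {"kind": "text"},                # text keeps the leading zero
    "NOTES": {"kind": "text", "missing": "."},   # "." is the missing value code
}


def parse_value(raw, meta):
    """Convert one raw field according to its attribute metadata."""
    if raw == meta.get("missing"):
        return None
    kind = meta.get("kind", "text")
    if kind == "integer":
        value = int(raw)
        if "min" in meta and value < meta["min"]:
            raise ValueError(f"{value} is below the minimum {meta['min']}")
        if "max" in meta and value > meta["max"]:
            raise ValueError(f"{value} is above the maximum {meta['max']}")
        return value
    if kind == "date":
        return datetime.strptime(raw, meta["format"]).date()
    if kind == "code" and raw not in meta["codes"]:
        raise ValueError(f"unknown code {raw!r}")
    return raw


def parse_row(line):
    """Split a whitespace-delimited row and type each field by column name."""
    fields = line.split()
    if len(fields) != len(HEADER):
        raise ValueError("row does not match the declared attributes")
    return {name: parse_value(raw, ATTRIBUTES.get(name, {}))
            for name, raw in zip(HEADER, fields)}


records = [parse_row(row) for row in ROWS]
print(records[0])
```

Without the metadata, a generic loader would likely read OBS_CODE as the number 6 and NOTES as the literal string ".".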
EML Measurement Scale
Category  Scale     Description                                    Example
Textual   Nominal   Categories                                     Male, Female
Textual   Ordinal   Ordered categories                             Low, Medium, High
Numeric   Interval  Equidistant on number scale                    3 Celsius
Numeric   Ratio     Equidistant on number scale, meaningful ratio  5 meter
Dates     Datetime  Points on calendar timescale                   6-Oct-2004
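One practical use of recording the measurement scale is deciding which operations are meaningful for an attribute. The sketch below encodes the usual scale hierarchy. The operation names and the `can_apply` helper are hypothetical, and treating datetime values like interval values is an assumption.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"
    DATETIME = "datetime"


# Operations that are meaningful on each scale (illustrative names).
ALLOWED = {
    Scale.NOMINAL: {"equal"},
    Scale.ORDINAL: {"equal", "order"},
    Scale.INTERVAL: {"equal", "order", "difference"},
    Scale.RATIO: {"equal", "order", "difference", "ratio"},
    Scale.DATETIME: {"equal", "order", "difference"},  # assumed interval-like
}


def can_apply(operation, scale):
    """Return True if the operation is meaningful on the given scale."""
    return operation in ALLOWED[scale]


# "Twice as long" is meaningful for 5 meter, but "twice as hot" is not for 3 Celsius.
print(can_apply("ratio", Scale.RATIO), can_apply("ratio", Scale.INTERVAL))
```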
Logical Model: unit Dictionary
• Consistent assignment of measurement units
– Quantitative definitions in terms of SI units
– ‘unitType’ expresses dimensionality
  • time, length, mass, energy are all ‘unitType’s
  • second, meter, gram, pound, joule are all ‘unit’s
[Diagram: the UnitType “Mass” with the Units “kilogram” and “gram”, related by a factor of 1000]
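The unit dictionary idea can be sketched as a lookup table in which every unit names its unitType and a multiplier to the SI unit of that type. This is an illustrative structure under those assumptions, not the EML unit dictionary format; `UNITS` and `convert` are hypothetical names.

```python
# unit name -> (unitType, multiplier that converts a value to the SI unit)
UNITS = {
    "second": ("time", 1.0),
    "meter": ("length", 1.0),
    "kilogram": ("mass", 1.0),
    "gram": ("mass", 0.001),
    "pound": ("mass", 0.45359237),
    "joule": ("energy", 1.0),
}


def convert(value, from_unit, to_unit):
    """Convert between two units, refusing to mix different unitTypes."""
    from_type, from_factor = UNITS[from_unit]
    to_type, to_factor = UNITS[to_unit]
    if from_type != to_type:
        raise ValueError(f"cannot convert {from_type} to {to_type}")
    return value * from_factor / to_factor


print(convert(1, "kilogram", "gram"))
```

Because dimensionality lives in the unitType, an integration tool can reject a request such as converting grams to meters before any arithmetic happens.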
Collating metadata
• Most scientists know all of this information about their data
– EML simply provides a standardized format for recording the information
• Enables data exchange across organizations and software systems
Knowledge Network for Biocomplexity (KNB)
[Diagram: the KNB network, with nodes labeled KNB 1, KNB II, PISCO, LTER (GCE, AND, ... (26)), NCEAS, ESA, and OBFS]
Building a community data network
• Simplified data sharing
• Immediate change tracking
• Redundant backup
• Data maintained by individuals
• Access controlled by individuals
EML-described data in the KNB
[Chart: cumulative count of data packages in the KNB by year, 2002 to 2006; vertical axis from 0 to 12000]
Kepler: dynamic data loading
Data source from EcoGrid (metadata-driven ingestion)
R processing script:
res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)
Kepler supports dynamic data loading:
• Data sources are discovered via metadata queries
• EML metadata allows arbitrary schemas to be loaded into an embedded database
• Data queries can be performed before data flows downstream
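The loading step described above can be sketched with an in-memory SQLite database: the table definition is generated from attribute metadata, so no schema is fixed in advance. This illustrates the idea only and is not Kepler's implementation. `TABLE_METADATA` is a hypothetical stand-in for an EML entity description, and the data values are made up (the column names follow the R script).

```python
import sqlite3

# Hypothetical attribute metadata: column names with storage types.
TABLE_METADATA = {
    "name": "weather",
    "attributes": [("T_AIR", "REAL"), ("BARO", "REAL")],
}
ROWS = [(10.0, 1012.0), (15.0, 1009.5), (20.0, 1007.0)]  # made-up values


def load(conn, meta, rows):
    """Create a table from the metadata description and insert the rows."""
    table = meta["name"]
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in meta["attributes"])
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in meta["attributes"])
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


conn = sqlite3.connect(":memory:")
load(conn, TABLE_METADATA, ROWS)

# Query before the data flows downstream, e.g. keep only the warmer readings.
warm = conn.execute(
    'SELECT "T_AIR", "BARO" FROM "weather" WHERE "T_AIR" >= ? ORDER BY "T_AIR"',
    (15.0,),
).fetchall()
print(warm)
```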
Importance of semantics
• So far we’ve dealt only with the logical data model
– any semantics in EML are expressed in natural language
• The computer doesn’t really understand:
– what is being measured
– how measurements relate to one another
– how semantics map to logical structure
• Analysis depends on understanding the semantic contextual relationships among data measurements
– e.g., density measured within subplot
Observation ontology (OBOE)
Goal: semantically describe the structure of scientific observation and measurement as found in a data set
Provide extension points for loading specialized domain ontologies
Entities represent real-world objects or concepts that can be measured.
Observations are made about particular entities.
Every measurement has a characteristic, which defines the property of the entity being measured.
Observations can provide context for other observations.
slide from J. Madin
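The four statements on this slide can be mirrored in a few small data classes. This is a sketch of the concepts only: OBOE itself is an ontology, and the Python class names, the `describe` helper, and the example values here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """A real-world object or concept that can be measured."""
    name: str


@dataclass
class Characteristic:
    """The property of an entity that a measurement refers to."""
    name: str


@dataclass
class Measurement:
    characteristic: Characteristic
    value: object


@dataclass
class Observation:
    """An observation of one entity, optionally within the context of others."""
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    context: List["Observation"] = field(default_factory=list)


def describe(obs):
    """Render an observation with its measurements and direct context."""
    parts = ", ".join(f"{m.characteristic.name}={m.value}" for m in obs.measurements)
    where = "".join(f" within {c.entity.name}" for c in obs.context)
    return f"{obs.entity.name}: {parts}{where}"


# Illustrative values: a tree height observed within a plot of known area.
plot_obs = Observation(Entity("Plot"), [Measurement(Characteristic("Area"), 100.0)])
tree_obs = Observation(
    Entity("Tree"),
    [Measurement(Characteristic("Height"), 12.2)],
    context=[plot_obs],
)
print(describe(tree_obs))
```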
Semantic annotation
Observation Ontology
Data set
Mapping between data and the ontology via semantic annotation
slide from J. Madin
• Relational data lacks critical semantic information
– no way for computer to determine that “Ht.” represents a “height” measurement
– no way for computer to determine if Plot is nested within Site or vice-versa
– no way for computer to determine if the Temp applies to Site or Plot or Species
Date   Site       Plot  Species  Height
10/12  Hendricks  1     AHYA     12.2
10/12  Hendricks  1     AHYA     11.0
10/12  Hendricks  1     AHYA     9.7
…      …          …     …        …
[Annotation of the table above. Entity labels: Time, Space, Space, Organism. Characteristic labels: Date, LocationName, Height, TaxonomicName, Label, Area. Three hasContext links.]

Tree  Plot  Species  Count
A     1     AHYA     3
A     2     AHYA     2
A     3     AHYA     8
…     …     …        …
[Annotation of the table above. Entity labels: Organism, Space, Organism. Characteristic labels: Label, Abundance, TaxonomicName, Replicate, Area. Two hasContext links.]
[Figure panels labeled A, B, C]
Observation ontology
slide from J. Madin
Extension points
Observation: A high-level assertion that a thing was observed
Entity: All things (concrete and conceptual) that are observable
Entity extension: An extension point for domain-specific terms
Context: Asserts a “containment” relationship between entities
Context: Context is transitive
Measurement: Observations are composed of measurements, which relate measurable characteristics to the entity being observed
Characteristic
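Because context is transitive, the full context of an entity can be derived from the directly asserted links. A minimal sketch follows, assuming for illustration that a Tree is observed within a Plot and the Plot within a Site. The nesting, the `DIRECT_CONTEXT` table, and the `all_contexts` helper are all hypothetical.

```python
def all_contexts(direct, start):
    """Return every entity reachable from start by following context links."""
    seen = []
    stack = list(direct.get(start, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against cyclic assertions
            seen.append(current)
            stack.extend(direct.get(current, []))
    return seen


# Directly asserted context only: Tree -> Plot and Plot -> Site.
DIRECT_CONTEXT = {"Tree": ["Plot"], "Plot": ["Site"]}

print(all_contexts(DIRECT_CONTEXT, "Tree"))
```

Only two links are asserted, yet the Tree is also found to be within the Site. This is the kind of inference that lets a measurement recorded at the Site level be applied to the organisms observed there.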
Summary
• EML captures critical metadata
• OBOE adds critical semantic descriptions
• Data discovery and integration tools can be built that leverage metadata and ontologies
• Metadata and ontologies permit:
– Loosely-coupled systems
– Schema independence in data systems
– Semantic data integration
– Capturing the data as collected, rather than a derived product
Vegetation Schema Questions
• Vegetation schema
– Exchange standard or federation?
• Can we accommodate all data that is collected in vegetation plots?
– or just a transformed subset
• XML? RDF? OWL? other?
• Should a vegetation schema link to other evolving community standards?
– EML?
– OBOE?
Questions?
• http://www.nceas.ucsb.edu/ecoinformatics/
• http://knb.ecoinformatics.org/
• http://seek.ecoinformatics.org/
• http://kepler-project.org/
• Knowledge Representation Working Group
• Mark Schildhauer, Matt Jones (NCEAS)
• Shawn Bowers, Bertram Ludaescher, Dave Thau (UCD)
• Deana Pennington (UNM)
• Serguei Krivov, Ferdinando Villa (UVM)
• Corinna Gries, Peter McCartney (ASU)
• Rich Williams (Microsoft)
Acknowledgements
• This material is based upon work supported by:
• The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
• The Andrew W. Mellon Foundation.
• Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence