A Conversation on Issues in Data Vetting,
Metadata,
& Data Warehousing for Production and Storage
in the Consortium Environment
Ed Rockwell and Josh Wilcox
GEOG 482 / 582
GIS Data Management
November 13, 2013
The Goals of Our Discussion
Two overarching topics
1. Vetting geospatial data
• Quality measures?
• How do we go about doing it?
2. Querying metadata
• What is required?
Look at the workflow process and possible system architecture
Discuss protocols and workflows for vetting data and metadata as a first
step for developing such protocols for the Puget Sound Region gdb.
This is a consortium database
environment
Why are we doing this?
We are doing this for two reasons:
1.It needs to be done for Puget Sound Region gdb.
2.We need to do it for Assignment 6
• Feature datasets submitted in Assignment 6 will be
vetted (have data quality assessed) for inclusion in
the Puget Sound Region GDB.
Our Plan for Today’s Discussion
• Look at examples from other organizations that have implemented databases:
• Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products (2001)
• The modENCODE Data Coordination Center: Lessons in Harvesting Comprehensive Experimental Details (2011)
• Western Electricity Coordinating Council (2012)
• Discussion on data quality:
• Beyond Accuracy: What Data Quality Means to Data Consumers (1996)
• Data Quality Assessment (2002)
• Data in Science
• Standards for Environmental Measurement Using GIS: Toward a Protocol for Protocols (2006)
• Provide next steps for setting up protocol for Puget Sound Region gdb
Data Collection, Vetting Examples From
Different Fields
Project | Subject Area / Industry
• Earth System Science Workbench (ESSW) | Database designed to track satellite imagery.
• modENCODE Data Coordination Center | National Human Genome Research Institute effort to track DNA experiments.
• Western Electricity Coordinating Council's Environmental Data Task Force | Identify and catalog GIS and cultural datasets for the purpose of evaluating transmission alternatives.
Earth System Science Workbench: A Data
Management Infrastructure for Earth Science
Products (2001)
• Earth System Science Workbench (ESSW) is a data management infrastructure designed for tracking satellite imagery and related data.
• Consists of:
• Lab Notebook metadata service
• No-Duplicate Write-Once-Read-Many (ND-WORM) storage service
• The Lab Notebook server receives data from a researcher's workstation (acting as a client).
• The Lab Notebook is a Java client/server application.
• The Lab Notebook server collects the specific metadata values sent from a client and constructs XML documents from these values according to previously defined metadata templates.
• These XML documents are then transferred to a relational database.
• The ND-WORM process gives each file a unique identifier based on the file's content; if a duplicate file comes in, ND-WORM records the new file name as an alias but does not save the file again.
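The content-based identification above can be sketched in a few lines: hash the bytes, store them once, and keep every submitted name as an alias. This is a minimal in-memory illustration, not the paper's implementation; SHA-256 is our choice of digest, since the exact algorithm is not specified here.

```python
import hashlib

def content_id(data: bytes) -> str:
    # Unique ID derived from file content; SHA-256 is an assumption,
    # the slide does not name the digest algorithm ND-WORM uses.
    return hashlib.sha256(data).hexdigest()

class NDWormStore:
    """Minimal sketch of a no-duplicate, write-once store keyed by content hash."""

    def __init__(self) -> None:
        self.files = {}    # content id -> stored bytes (written once)
        self.aliases = {}  # content id -> every file name submitted for it

    def put(self, name: str, data: bytes) -> bool:
        """Store the bytes once; duplicates only add the new name as an alias.
        Returns True if the bytes were actually written."""
        cid = content_id(data)
        self.aliases.setdefault(cid, []).append(name)
        if cid in self.files:
            return False  # duplicate content: alias recorded, file not saved
        self.files[cid] = data
        return True
```

Submitting the same bytes under a second name records the alias but stores nothing new, which is the space-saving behavior described above.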
ESSW’s System Architecture
Earth System Science Workbench: A Data Management Infrastructure
for Earth Science Products
The modENCODE Data Coordination Center:
Lessons in Harvesting Comprehensive
Experimental Details (2011)
• Design principles for a Data Coordination Center in the Life
Sciences field
Perhaps the greatest challenge in making a large and diverse
body of data available to the greater community is providing
easy lookup of relevant submissions.
• Two approaches to metadata
• Controlled Approach: Many required items specified through
controlled vocabulary (CV) terms
• Looser approach: free-text forms encourage a high rate of
deposition, but often result in less-consistent and
underspecified descriptions of experimental details.
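The trade-off between the two approaches can be made concrete: a controlled field rejects anything outside the vocabulary, while a free-text field deposits whatever is submitted. A minimal sketch, with an invented three-term CV for illustration (real modENCODE vocabularies are far larger and curated):

```python
# Hypothetical controlled vocabulary for a single metadata field; the real
# modENCODE CVs (organism, assay type, etc.) are much larger.
ASSAY_CV = {"ChIP-seq", "RNA-seq", "DNase-seq"}

def validate_assay(value: str):
    """Controlled approach: accept only CV terms, returning (ok, message)."""
    if value in ASSAY_CV:
        return True, value
    return False, (f"'{value}' is not a controlled-vocabulary term; "
                   f"resubmit with one of {sorted(ASSAY_CV)}")

def accept_free_text(value: str):
    """Looser approach: anything non-empty is deposited; consistency is not enforced."""
    return bool(value.strip()), value.strip()
```

The controlled path pushes effort to submission time; the free-text path accepts "rna seq", "RNAseq", and "RNA-seq" as three different descriptions of the same assay.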
DCC Workflow
Submitting data to the modENCODE DCC is a four-step process.
1.Discussions between a data provider and a DCC to determine the required metadata and data formats for a given category of submission.
2.Submission of data to the DCC.
3. A series of automated and manual QC checks; if a submission does not pass, it is returned to the data provider for modification.
4.Once a submission satisfies all requirements, it is distributed to the community through the GBrowse genome browser, modMine query interface, graphical submission filtering tool and the public repositories.
The modENCODE Data Coordination Center: lessons in
harvesting comprehensive experimental details.
A Work Flow Example, WECC
1. The Western Electricity Coordinating Council (WECC) is a non-profit corporation made up of various stakeholders in the bulk electricity industry in the Western US, Baja California (Mexico), and Alberta and British Columbia (Canada).
2. Its Environmental Data Task Force (EDTF) is currently conducting a data review process to identify and catalog GIS datasets for the purpose of planning and evaluating potential transmission alternatives.
Environmental Update and Review
Protocol Process Flow Chart
Western Electricity Coordinating Council (2012)
WECC / EDTF Data Quality Review
Process
Reviewer – The initials of the analyst who performed the quality assessment for a given data set.
Spreadsheet entry: Initials of reviewer (DM; JW; EP; KA)
Review Date – The date the analyst performed the quality assessment for a given data set.
Spreadsheet entry: Date (Year_MonthDay)
Metadata – A metadata record is a file of information, usually presented as an XML document, which captures the basic characteristics of a data or information resource. The metadata entry will indicate whether electronic metadata exists for the dataset and, if so, its level of completion.
The reviewer will read the entire metadata record, using ArcCatalog's metadata viewer.
Spreadsheet entry: C = Complete
SC = Substantially complete (80% or more complete)
PC = Partially complete (10-80% complete)
A = Absent (0-10% complete)
Add comment if: There is something stated in the metadata that bears upon the reliability or quality of the dataset beyond the quality components described herein.
WECC / EDTF Data Quality Review Process,
cont.
Lineage – This entry indicates whether the lineage (history of processing) can be known through examination of the metadata.
Spreadsheet entry:
Y = Yes, the lineage information is substantially complete
N = No, the lineage information is absent or incomplete
Compilation Scale – This entry stores the map scale, or aerial scale, at which the dataset was compiled. This information may be found under “Data Quality Information” or in the content descriptions of the metadata, or otherwise may be found on the data source’s website.
Spreadsheet entry: Scale denominator (e.g., 24,000 for a 1:24,000 scale map)
Positional Accuracy – This entry stores the stated or inferred horizontal accuracy of the features. If positional accuracy is not explicitly provided in the metadata, it may be estimated by the map compilation scale (if known) using the following guide from National Map Accuracy Standards:
For maps on publication scales larger than 1:20,000, no more than 10 percent of the points tested shall be in error by more than 1/30 inch, measured on the publication scale; for maps on publication scales of 1:20,000 or smaller, 1/50 inch.
Add comment if: Entry is calculated from map scale.
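The NMAS rule quoted above can be turned into a small helper that infers horizontal accuracy, in feet on the ground, from a compilation scale denominator. This is an illustrative sketch to accompany the protocol text, not part of the WECC spreadsheet itself:

```python
def nmas_horizontal_accuracy_ft(scale_denominator: int) -> float:
    """Infer NMAS horizontal accuracy, in ground feet, from map scale.

    Scales larger than 1:20,000 (denominator < 20,000) allow 1/30 inch of
    error measured at publication scale; 1:20,000 and smaller allow 1/50 inch.
    """
    map_error_in = 1 / 30 if scale_denominator < 20_000 else 1 / 50
    # Convert map inches to ground inches, then ground inches to feet.
    return map_error_in * scale_denominator / 12
```

For a 1:24,000 quadrangle this gives 24,000/50 map inches = 480 ground inches = 40 ft, the familiar NMAS figure for that scale.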
An Example: WECC EDTF Assessment
Steps
Operationally, the GIS analyst will perform the fitness-for-use assessment through the following steps:
1.Make a copy of the inventory spreadsheet for editing. The spreadsheet is stored on ICF’s server location, K:\Irvine\GIS\Projects\WECC\00843_10\reference documents\About Data\Data Inventory Documents.
2.Begin assessment of a particular dataset.
3.Download and, if necessary, unzip the data to the established project folders.
4.Open the dataset (or a representative example of it) in ArcCatalog.
5.Determine whether metadata is present.
6.If metadata is present, thoroughly review it using the different viewing styles in ArcCatalog, depending on what item is being investigated.
7.Inspect the metadata to try to assess each of the other quality components (as listed above), and complete the spreadsheet entries for those components.
8.After reviewing the metadata, peruse the data itself, or a representative sample, in ArcCatalog or ArcMap, for:
i. Geometry integrity
ii. Attribution integrity
iii. General usability
9.If there are any remaining unknown values for any quality components in the assessment, attempt to ascertain quality through the data source’s website or other published sources.
10.Record the comments made by Subject Matter Experts (SMEs), such as members of the EDTF, as to their opinion of the usability of a dataset, in the Quality Comments field. Record the reviewer's initials and the date of the assessment in the Quality Comments field.
11.Save the spreadsheet with a name indicating the version date.
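One way to keep spreadsheet entries consistent across reviewers is to validate them at entry time. The sketch below models a review row using the codes defined in the preceding slides; the class and its validation are our illustration, not ICF's actual tooling, and the field names are assumptions based on the entries described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Codes from the WECC/EDTF metadata entry: Complete / Substantially /
# Partially complete / Absent.
METADATA_CODES = {"C", "SC", "PC", "A"}

@dataclass
class QualityReview:
    dataset: str
    reviewer: str                            # analyst initials, e.g. "JW"
    review_date: str                         # "Year_MonthDay", e.g. "2013_1113"
    metadata: str                            # one of METADATA_CODES
    lineage: str                             # "Y" or "N"
    compilation_scale: Optional[int] = None  # scale denominator, e.g. 24000
    comments: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject codes outside the protocol's vocabulary before they reach the sheet.
        if self.metadata not in METADATA_CODES:
            raise ValueError(f"metadata must be one of {sorted(METADATA_CODES)}")
        if self.lineage not in {"Y", "N"}:
            raise ValueError("lineage must be 'Y' or 'N'")
```

A typo like a metadata code of "X" then fails loudly at step 7 instead of surfacing later as an inconsistent spreadsheet value.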
Data Quality
“Thanks to computers, huge databases brimming with
information are at our fingertips, just waiting to be tapped. They
can be mined to find sales prospects among existing customers;
they can be analyzed to unearth costly corporate habits; they can
be manipulated to divine future trends. Just one problem: Those
huge databases may be full of junk. ... In a world where people
are moving to total quality management, one of the critical areas
is data.”
- Wall Street Journal
Data Quality Challenge
• Most DQ measures are developed ad hoc based on project
context;
• Very few task-independent “fundamental principles” have
been defined;
• Data quality assessments can be both subjective and
objective.
Note: These ideas are from Data Quality Assessment, Pipino et al.
Standard quantitative measures
• Simple ratio: free-of-error, completeness, consistency, concise
representation, relevancy, ease of manipulation;
• Min/max operation: believability, timeliness;
• Weighted average: believability.
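The three quantitative forms from Pipino et al. can be written down directly; the completeness example in the first docstring is our own illustration of a simple ratio.

```python
def simple_ratio(desired_outcomes: int, total_outcomes: int) -> float:
    """Simple ratio: desired outcomes over total outcomes, e.g. completeness
    measured as the fraction of non-null cells in a table."""
    return desired_outcomes / total_outcomes if total_outcomes else 0.0

def min_operation(*indicator_scores: float) -> float:
    """Min operation: a dimension rated by several indicators (e.g.
    believability) scores no better than its weakest indicator."""
    return min(indicator_scores)

def weighted_average(scores_and_weights) -> float:
    """Weighted average of (score, weight) pairs; weights should lie in
    [0, 1] and sum to 1."""
    return sum(score * weight for score, weight in scores_and_weights)
```

The min operation is deliberately conservative: one weak indicator drags the whole dimension down, which suits dimensions like believability where a single red flag matters.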
Data Quality Research Approach
• Intuitive: selects attributes of interest based on researchers’
experience;
• Theoretical: focuses on how data may become deficient during the data manufacturing process;
• Empirical: analyzes data collected from data consumers to
identify attributes of interest.
Empirical Approach
• Based on methods developed in marketing research:
• Identify consumer needs;
• Define the hierarchical structure of those needs;
• Measure their importance re: product purchase.
• Assumption: data equivalent to products -> data users
equivalent to consumers
Empirical Approach
• Collect DQ attributes from consumers (first survey);
• Collect importance ratings (second survey) and structure them into a hierarchy of data users' data quality needs;
• Assess data quality based on these hierarchical dimensions;
• Compare data quality to a benchmark from a best-practices
organization;
• Measure the distance between IS professional and data
consumer assessments.
Challenges
• Considering the viewpoint of data users:
• Requires the inclusion of subjective dimensions;
• Increases the difficulty of classifying certain dimensions, e.g.,
completeness and timeliness.
DQ Categories and Dimensions
DQ attributes
The PSP/DQ Model
Example of the IQ Benchmark Gap
Example of the IQ Role Gap
Further research
• Developing a questionnaire to measure perceived data
quality;
• Crowdsourcing data quality improvements;
• Using a framework checklist during data requirements
analysis.
Some quick takeaways from Data Sharing
in the Sciences
1. Data Integrity (whole, consistent, correct):
• It's expensive to fix data; it's better to create processes that ensure it is never compromised.
• Fixity: Create a checksum or digital signature. Check data in a repository on a regular schedule.
2. Versioning:
• NIH (National Institutes of Health) accepts only data that support publications (i.e., final data, as opposed to supporting or pre-publication data).
• NASA data hierarchy: Level 0 is unprocessed instrument data at full resolution; Level 1A is unprocessed instrument data at full resolution, time-referenced and annotated with ancillary information (including radiometric and geometric calibration coefficients and geo-referencing parameters); Level 1B is Level 1A data processed to sensor units; Levels 2 through 4 are progressively more derived products.
3. Persistence: Maintaining digital data over time.
Note: These ideas are from "Data Sharing in the Sciences," Annual Review of Information Science & Technology, Kowalczyk and Shankar.
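The fixity point above amounts to recording a digest at ingest and recomputing it on a schedule. A minimal sketch, with SHA-256 as our assumed choice of digest:

```python
import hashlib

def fixity_record(data: bytes) -> str:
    """Compute a checksum at ingest time, stored alongside the file."""
    return hashlib.sha256(data).hexdigest()

def verify_fixity(data: bytes, recorded: str) -> bool:
    """Scheduled repository check: recompute and compare; a mismatch
    flags silent corruption or tampering."""
    return fixity_record(data) == recorded
```

Run over every file in the repository on a regular schedule, this catches bit rot long before a user opens a damaged dataset.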
A Proposed Protocol for GIS Data
• Written from the perspective of a researcher in the Health Sciences field studying physical activity, using the prevalent Planning and Transportation data.
• Studies reporting environmental variables often fail to explain how the variables are derived in a manner that would allow replication by other investigators.
• Examples of Problems with Consistency: Street Patterns, Land-Use
(What is Mixed Use?)
• Accuracy: valuation; some municipalities underestimate home values by 5 to 7%, others by 9 to 10%.
Note: These ideas are from “Standards for Environmental Measurement Using GIS: Toward a Protocol
for Protocols," Forsyth et al.
A Proposed GIS Protocol for Data
1. Basic Concept: A statement of the concept that the variable is intended to represent, with a discussion about its place in the literature and previous use.
2. Basic Formula, or Basic Definition, Basic Procedure: A more specific formula or definition of the variable, but without enough detail to create a GIS-based measure.
3. Detailed Definition: An even more specific formula, including data sources and the spatial unit at which the variable is measured (which affects the measurement).
4. Comments and Explanations: The questions likely to occur when operationalizing formulae.
5. GIS Approach: A description of the measurement in outline, in a form that a GIS expert could use to perform measures, or that someone using a different software program could use to develop their own steps.
6. GIS Steps: Detailed GIS instructions, using Arc 8 or Arc 9, designed to be comprehensible to infrequent users of GIS.
What would be next steps for a
Puget Sound Region gdb protocol?