A Conversation on Issues in Data Vetting,
Metadata,
& Data Warehousing for Production and Storage
in the Consortium Environment
Ed Rockwell and Josh Wilcox
GEOG 482 / 582
GIS Data Management
November 13, 2013
The Goals of Our Discussion
Two overarching topics
1. Vetting geospatial data
• Quality measures?
• How do we go about doing it?
2. Querying metadata
• What is required?
Look at the workflow process and possible system architecture
Discuss protocols and workflows for vetting data and metadata as a first
step for developing such protocols for the Puget Sound Region gdb.
This is a consortium database
environment
Why are we doing this?
We are doing this for two reasons:
1.It needs to be done for Puget Sound Region gdb.
2.We need to do it for Assignment 6
• Feature datasets submitted in Assignment 6 will be
vetted (have data quality assessed) for inclusion in
the Puget Sound Region GDB.
Our Plan for Today’s Discussion
• Look at examples from other organizations that have implemented databases:
• Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products (2001)
• The modENCODE Data Coordination Center: Lessons in Harvesting Comprehensive Experimental Details (2011)
• Western Electricity Coordinating Council (2012)
• Discussion on data quality:
• Beyond Accuracy: What Data Quality Means to Data Consumers (1996)
• Data Quality Assessment (2002)
• Data in Science
• Standards for Environmental Measurement Using GIS: Toward a Protocol for Protocols (2006)
• Provide next steps for setting up protocol for Puget Sound Region gdb
Data Collection, Vetting Examples From
Different Fields
Project | Subject Area / Industry
• Earth System Science Workbench (ESSW) | Database designed to track satellite imagery.
• modENCODE Data Coordination Center | National Human Genome Research Institute effort to track DNA experiments.
• Western Electricity Coordinating Council's Environmental Data Task Force | Identify and catalog GIS and cultural datasets for the purpose of evaluating transmission alternatives.
Earth System Science Workbench: A Data
Management Infrastructure for Earth Science
Products (2001)
• Earth System Science Workbench (ESSW) is a data management infrastructure designed for tracking satellite imagery and related data.
• Consists of:
• Lab Notebook metadata service
• No-Duplicate Write-Once-Read-Many (ND-WORM) storage service
• The Lab Notebook server receives data from a researcher's workstation (acting as a client).
• The Lab Notebook is a Java client/server application.
• The Lab Notebook server collects the specific metadata values sent from a client and constructs XML documents from these values according to previously defined metadata templates.
• These XML documents are then transferred to a relational database.
• The ND-WORM process gives each file a unique identifier based on the file's content; if a duplicate file comes in, ND-WORM records the new file name as an alias but does not save the file again.
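The content-based identification above can be sketched in a few lines: hash the bytes, store them once, and keep every submitted name as an alias. This is a minimal in-memory illustration, not the paper's implementation; SHA-256 is our choice of digest, since the exact algorithm is not specified here.

```python
import hashlib

def content_id(data: bytes) -> str:
    # Unique ID derived from file content; SHA-256 is an assumption,
    # the slide does not name the digest algorithm ND-WORM uses.
    return hashlib.sha256(data).hexdigest()

class NDWormStore:
    """Minimal sketch of a no-duplicate, write-once store keyed by content hash."""

    def __init__(self) -> None:
        self.files = {}    # content id -> stored bytes (written once)
        self.aliases = {}  # content id -> every file name submitted for it

    def put(self, name: str, data: bytes) -> bool:
        """Store the bytes once; duplicates only add the new name as an alias.
        Returns True if the bytes were actually written."""
        cid = content_id(data)
        self.aliases.setdefault(cid, []).append(name)
        if cid in self.files:
            return False  # duplicate content: alias recorded, file not saved
        self.files[cid] = data
        return True
```

Submitting the same bytes under a second name records the alias but stores nothing new, which is the space-saving behavior described above.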
ESSW’s System Architecture
Earth System Science Workbench: A Data Management Infrastructure
for Earth Science Products
The modENCODE Data Coordination Center:
Lessons in Harvesting Comprehensive
Experimental Details (2011)
• Design principles for a Data Coordination Center in the Life
Sciences field
Perhaps the greatest challenge in making a large and diverse
body of data available to the greater community is providing
easy lookup of relevant submissions.
• Two approaches to metadata
• Controlled Approach: Many required items specified through
controlled vocabulary (CV) terms
• Looser approach: free-text forms encourage a high rate of
deposition, but often result in less-consistent and
underspecified descriptions of experimental details.
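The trade-off between the two approaches can be made concrete: a controlled field rejects anything outside the vocabulary, while a free-text field deposits whatever is submitted. A minimal sketch, with an invented three-term CV for illustration (real modENCODE vocabularies are far larger and curated):

```python
# Hypothetical controlled vocabulary for a single metadata field; the real
# modENCODE CVs (organism, assay type, etc.) are much larger.
ASSAY_CV = {"ChIP-seq", "RNA-seq", "DNase-seq"}

def validate_assay(value: str):
    """Controlled approach: accept only CV terms, returning (ok, message)."""
    if value in ASSAY_CV:
        return True, value
    return False, (f"'{value}' is not a controlled-vocabulary term; "
                   f"resubmit with one of {sorted(ASSAY_CV)}")

def accept_free_text(value: str):
    """Looser approach: anything non-empty is deposited; consistency is not enforced."""
    return bool(value.strip()), value.strip()
```

The controlled path pushes effort to submission time; the free-text path accepts "rna seq", "RNAseq", and "RNA-seq" as three different descriptions of the same assay.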
DCC Workflow
Submitting data to the modENCODE DCC is a four-step process.
1.Discussions between a data provider and a DCC to determine the required metadata and data formats for a given category of submission.
2.Submission of data to the DCC.
3. A series of automated and manual QC checks; if a submission does not pass, it is returned to the data provider for modification.
4.Once a submission satisfies all requirements, it is distributed to the community through the GBrowse genome browser, modMine query interface, graphical submission filtering tool and the public repositories.
The modENCODE Data Coordination Center: lessons in
harvesting comprehensive experimental details.
A Work Flow Example, WECC
1. The Western Electricity Coordinating Council (WECC) is a non-profit corporation made up of various stakeholders in the bulk electricity industry in the Western US, Baja California (Mexico), and Alberta and British Columbia (Canada).
2. Its Environmental Data Task Force (EDTF) is currently conducting a data review process to identify and catalog GIS datasets for the purpose of planning and evaluating potential transmission alternatives.
Environmental Update and Review
Protocol Process Flow Chart
Western Electricity Coordinating Council (2012)
WECC / EDTF Data Quality Review
Process
Reviewer – The initials of the analyst who performed the quality assessment for a given data set.
Spreadsheet entry: Initials of reviewer (DM; JW; EP; KA)
Review Date – The date the analyst performed the quality assessment for a given data set.
Spreadsheet entry: Date (Year_MonthDay)
Metadata – A metadata record is a file of information, usually presented as an XML document, which captures the basic characteristics of a data or information resource. The metadata entry will indicate whether electronic metadata exists for the dataset and, if so, its level of completion.
The reviewer will read the entire metadata record, using ArcCatalog's metadata viewer.
Spreadsheet entry: C = Complete
SC = Substantially complete (80% or more complete)
PC = Partially complete (10-80% complete)
A = Absent (0-10% complete)
Add comment if: There is something stated in the metadata that bears upon the reliability or quality of the dataset beyond the quality components described herein.
WECC / EDTF Data Quality Review Process,
cont.
Lineage – This entry indicates whether the lineage (history of processing) can be known through examination of the metadata.
Spreadsheet entry:
Y = Yes, the lineage information is substantially complete
N = No, the lineage information is absent or incomplete
Compilation Scale – This entry stores the map scale, or aerial scale, at which the dataset was compiled. This information may be found under “Data Quality Information” or in the content descriptions of the metadata, or otherwise may be found on the data source’s website.
Spreadsheet entry: Scale denominator (e.g., 24,000 for a 1:24,000 scale map)
Positional Accuracy – This entry stores the stated or inferred horizontal accuracy of the features. If positional accuracy is not explicitly provided in the metadata, it may be estimated by the map compilation scale (if known) using the following guide from National Map Accuracy Standards:
For maps on publication scales larger than 1:20,000, no more than 10 percent of the points tested shall be in error by more than 1/30 inch, measured on the publication scale; for maps on publication scales of 1:20,000 or smaller, 1/50 inch.
Add comment if: Entry is calculated from map scale.
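The NMAS rule quoted above can be turned into a small helper that infers horizontal accuracy, in feet on the ground, from a compilation scale denominator. This is an illustrative sketch to accompany the protocol text, not part of the WECC spreadsheet itself:

```python
def nmas_horizontal_accuracy_ft(scale_denominator: int) -> float:
    """Infer NMAS horizontal accuracy, in ground feet, from map scale.

    Scales larger than 1:20,000 (denominator < 20,000) allow 1/30 inch of
    error measured at publication scale; 1:20,000 and smaller allow 1/50 inch.
    """
    map_error_in = 1 / 30 if scale_denominator < 20_000 else 1 / 50
    # Convert map inches to ground inches, then ground inches to feet.
    return map_error_in * scale_denominator / 12
```

For a 1:24,000 quadrangle this gives 24,000/50 map inches = 480 ground inches = 40 ft, the familiar NMAS figure for that scale.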
An Example: WECC EDTF Assessment
Steps
Operationally, the GIS analyst will perform the fitness-for-use assessment through the following steps:
1.Make a copy of the inventory spreadsheet for editing. The spreadsheet is stored on ICF’s server location, K:\Irvine\GIS\Projects\WECC\00843_10\reference documents\About Data\Data Inventory Documents.
2.Begin assessment of a particular dataset.
3.Download and, if necessary, unzip the data to the established project folders.
4.Open the dataset (or a representative example of it) in ArcCatalog.
5.Determine whether metadata is present.
6.If metadata is present, thoroughly review it using the different viewing styles in ArcCatalog, depending on what item is being investigated.
7.Inspect the metadata to try to assess each of the other quality components (as listed above), and complete the spreadsheet entries for those components.
8.After reviewing the metadata, peruse the data itself, or a representative sample, in ArcCatalog or ArcMap, for:
i. Geometry integrity
ii. Attribution integrity
iii. General usability
9.If there are any remaining unknown values for any quality components in the assessment, attempt to ascertain quality through the data source’s website or other published sources.
10.Record the comments made by Subject Matter Experts (SMEs), such as members of the EDTF, as to their opinion of the usability of a dataset, in the Quality Comments field. Record the reviewer's initials and the date of the assessment in the Quality Comments field.
11.Save the spreadsheet with a name indicating the version date.
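One way to keep spreadsheet entries consistent across reviewers is to validate them at entry time. The sketch below models a review row using the codes defined in the preceding slides; the class and its validation are our illustration, not ICF's actual tooling, and the field names are assumptions based on the entries described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Codes from the WECC/EDTF metadata entry: Complete / Substantially /
# Partially complete / Absent.
METADATA_CODES = {"C", "SC", "PC", "A"}

@dataclass
class QualityReview:
    dataset: str
    reviewer: str                            # analyst initials, e.g. "JW"
    review_date: str                         # "Year_MonthDay", e.g. "2013_1113"
    metadata: str                            # one of METADATA_CODES
    lineage: str                             # "Y" or "N"
    compilation_scale: Optional[int] = None  # scale denominator, e.g. 24000
    comments: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject codes outside the protocol's vocabulary before they reach the sheet.
        if self.metadata not in METADATA_CODES:
            raise ValueError(f"metadata must be one of {sorted(METADATA_CODES)}")
        if self.lineage not in {"Y", "N"}:
            raise ValueError("lineage must be 'Y' or 'N'")
```

A typo like a metadata code of "X" then fails loudly at step 7 instead of surfacing later as an inconsistent spreadsheet value.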
Data Quality
“Thanks to computers, huge databases brimming with
information are at our fingertips, just waiting to be tapped. They
can be mined to find sales prospects among existing customers;
they can be analyzed to unearth costly corporate habits; they can
be manipulated to divine future trends. Just one problem: Those
huge databases may be full of junk. ... In a world where people
are moving to total quality management, one of the critical areas
is data.”
- Wall Street Journal
Data Quality Challenge
• Most DQ measures are developed ad hoc based on project
context;
• Very few task-independent “fundamental principles” have
been defined;
• Data quality assessments can be both subjective and
objective.
Note: These ideas are from Data Quality Assessment, Pipino et al.
Standard quantitative measures
• Simple ratio: free-of-error, completeness, consistency, concise
representation, relevancy, ease of manipulation;
• Min/max operation: believability, timeliness;
• Weighted average: believability.
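The three quantitative forms from Pipino et al. can be written down directly; the completeness example in the first docstring is our own illustration of a simple ratio.

```python
def simple_ratio(desired_outcomes: int, total_outcomes: int) -> float:
    """Simple ratio: desired outcomes over total outcomes, e.g. completeness
    measured as the fraction of non-null cells in a table."""
    return desired_outcomes / total_outcomes if total_outcomes else 0.0

def min_operation(*indicator_scores: float) -> float:
    """Min operation: a dimension rated by several indicators (e.g.
    believability) scores no better than its weakest indicator."""
    return min(indicator_scores)

def weighted_average(scores_and_weights) -> float:
    """Weighted average of (score, weight) pairs; weights should lie in
    [0, 1] and sum to 1."""
    return sum(score * weight for score, weight in scores_and_weights)
```

The min operation is deliberately conservative: one weak indicator drags the whole dimension down, which suits dimensions like believability where a single red flag matters.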
Data Quality Research Approach
• Intuitive: selects attributes of interest based on researchers’
experience;
• Theoretical: focuses on how data may become deficient during the data manufacturing process;
• Empirical: analyzes data collected from data consumers to
identify attributes of interest.
Empirical Approach
• Based on methods developed in marketing research:
• Identify consumer needs;
• Define the hierarchical structure of those needs;
• Measure their importance re: product purchase.
• Assumption: data equivalent to products -> data users
equivalent to consumers
Empirical Approach
• Collect DQ attributes from consumers (first survey);
• Collect importance ratings (second survey) and structure them into a hierarchy of data users' data quality needs;
• Assess data quality based on these hierarchical dimensions;
• Compare data quality to a benchmark from a best-practices
organization;
• Measure the distance between IS professional and data
consumer assessments.
Challenges
• Considering the viewpoint of data users:
• Requires the inclusion of subjective dimensions;
• Increases the difficulty of classifying certain dimensions, e.g.,
completeness and timeliness.
DQ Categories and Dimensions
DQ attributes
The PSP/DQ Model
Example of the IQ Benchmark Gap
Example of the IQ Role Gap
Further research
• Developing a questionnaire to measure perceived data
quality;
• Crowdsourcing data quality improvements;
• Using a framework checklist during data requirements
analysis.
Some quick takeaways from Data Sharing
in the Sciences
1. Data Integrity (whole, consistent, correct):
• It's expensive to fix data; it's better to create processes that ensure it is never compromised.
• Fixity: Create a checksum or digital signature. Check data in a repository on a regular schedule.
2. Versioning:
• NIH (National Institutes of Health) accepts only data that support publications (i.e., final data, as opposed to supporting or pre-publication data).
• NASA data hierarchy: Level 0 is unprocessed instrument data at full resolution; Level 1A is unprocessed instrument data at full resolution, time-referenced and annotated with ancillary information (including radiometric and geometric calibration coefficients and geo-referencing parameters); Level 1B is Level 1A data processed to sensor units; Levels 2 through 4 are progressively more derived products.
3. Persistence: Maintaining digital data over time.
Note: These ideas are from "Data Sharing in the Sciences," Annual Review of Information Science & Technology, Kowalczyk and Shankar.
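The fixity point above amounts to recording a digest at ingest and recomputing it on a schedule. A minimal sketch, with SHA-256 as our assumed choice of digest:

```python
import hashlib

def fixity_record(data: bytes) -> str:
    """Compute a checksum at ingest time, stored alongside the file."""
    return hashlib.sha256(data).hexdigest()

def verify_fixity(data: bytes, recorded: str) -> bool:
    """Scheduled repository check: recompute and compare; a mismatch
    flags silent corruption or tampering."""
    return fixity_record(data) == recorded
```

Run over every file in the repository on a regular schedule, this catches bit rot long before a user opens a damaged dataset.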
A Proposed Protocol for GIS Data
• Written from the perspective of a researcher in the Health Sciences field studying physical activity, using the prevalent Planning and Transportation data.
• Studies reporting environmental variables often fail to explain how the variables are derived in a manner that would allow replication by other investigators.
• Examples of Problems with Consistency: Street Patterns, Land-Use
(What is Mixed Use?)
• Accuracy: valuation; some municipalities underestimate home values by 5 to 7%, others by 9 to 10%.
Note: These ideas are from “Standards for Environmental Measurement Using GIS: Toward a Protocol
for Protocols," Forsyth et al.
A Proposed GIS Protocol for Data
1. Basic Concept: A statement of the concept that the variable is intended to represent, with a discussion about its place in the literature and previous use.
2. Basic Formula, or Basic Definition, Basic Procedure: A more specific formula or definition of the variable, but without enough detail to create a GIS-based measure.
3. Detailed Definition: An even more specific formula, including data sources and the spatial unit at which the variable is measured (which affects the measurement).
4. Comments and Explanations: The questions likely to occur when operationalizing formulae.
5. GIS Approach: A description of the measurement in outline, in a form that a GIS expert could use to perform measures, or that someone using a different software program could use to develop their own steps.
6. GIS Steps: Detailed GIS instructions, using Arc 8 or Arc 9, designed to be comprehensible to infrequent users of GIS.
What would be next steps for a
Puget Sound Region gdb protocol?