A Geospatial Data Catalog and Metadata Management Tools
for the U.S. Environmental Protection Agency’s
Western Ecology Division
David L. Bradford
Geosciences
Oregon State University
Introduction
• U.S. EPA Summer Internship: Western Ecology Division, Corvallis, OR
• Large amount of GIS data (4 TB) representing 20+ years’ worth of research
• Common national datasets
• Virtually no metadata and no central index
• Hard to know whether/where data exist
• MISSION: come up with a catalog for these geospatial data…
• …with one intern, no budget, no new infrastructure, and do it all in 14 weeks?
Introduction
• Background: the Western Ecology Division (WED) & the need for metadata
• Research questions, hypothesis: give them a fish or teach them to fish?
• Approach: system development life cycle
• Results: EPA Synchronizer, GeoData Gateway, & metadata “harvesting”
• Discussion & Conclusions: automating metadata creation, overcoming institutional inertia
Background
• June through September 2007
• The WED: a laboratory under the National Health & Environmental Effects Research Laboratory (NHEERL)
• EPA Office of Research & Development (ORD)
• Project team: Connie Burdick, Denis White, Randy Comeleo, Patrick Clinton, & yours truly
• Help from: Office of Environmental Information (OEI) GeoData Gateway team
Metadata
• Information about data
• Self-indexing, fitness for purpose, how to manipulate (Green & Bossomaier, 2002; Longley et al., 2005)
• Time-consuming (i.e., expensive) to create (e.g., Ma, 2007)
• A “hassle” for the analyst
• Standard: Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) (FGDC, 1998)
• LINCHPIN: GOOD METADATA
• Objective: Tools to create standards-compliant metadata and automate the process as much as possible
Existing EPA Process
• WED projects launched, GIS data created
• Different PIs, different goals, shared analysts
• Before: informal “over-the-cubicle-wall” communication was sufficient to manage data; could get by without metadata
• Now: informal methods breaking down
• GIS analysts/contractors recently dispersing to different offices, buildings, sites
• Data now require multiple disk volumes
Existing Resources & Infrastructure
• Data storage: Windows NT-based servers (2.5 TB), Linux RAID server (1.5 TB)
• Web server: Windows NT-based (IIS)
• ESRI ArcGIS Suite, ArcObjects Libraries
• EPA Metadata Editor (EME)
• Second Copy (batch file copy utility)
• GeoData Gateway (GDG)
• Microsoft Visual Studio 2005 Integrated Development Environment (IDE)
Other Parameters and Constraints
• Budget: 1 summer intern
• Team: 4 analysts, 1 developer (the intern), 1 GDG administrator, local tech support
• Users: 14 GIS analysts (half contract staff); ~50 local GIS data “consumers”
• Data: 4 TB (coverages & shapefiles)
Other Parameters and Constraints (cont.)
• Standards & Policies
– FGDC-CSDGM
– EPA National Geospatial Data Policy
– EPA Metadata Technical Specification v1.0
– GeoData Gateway Governance Structure
• Primary constraint: Don’t relocate the data! Interlinked, interdependent datasets
Challenges
• Can an effective geospatial catalog system be assembled, using existing EPA resources, that has minimal long-term administrative costs?
• Can such a system be more than just a one-time inventory, i.e., can the solution be sustained by the WED GIS community long after the programmer leaves?
Propositions
• A sustainable geospatial catalog solution can be developed using existing or freely available (e.g., open source) tools, software components, and EPA resources
• Regardless of architecture, in order to be self-sustaining, it will require that primary GIS users implement a policy of creating consistent metadata
• The system cannot be fully implemented within 14 weeks
Approach
• System Development Life Cycle
– Identify the need: done
– Requirements Analysis: identify resources, constraints, functionality, user interfaces
– Architectural Design: weigh options, choose strategy, develop “blueprint”
Approach (cont.)
• System Development Life Cycle (cont.)
– Software Development: code missing components, unit test
– Integrated System Testing: implement components and test entire system
– User Training and Implementation: “roll it out”
Results: Requirements Analysis
• Support existing processes
• Use existing infrastructure
• Arcane, “homegrown” solution: No
• Low maintenance solution: Yes
• User interfaces:
– ArcGIS-Integrated
– Web Portal
• Don’t relocate datasets
Results: Architectural Design
1. Metadata creation/maintenance
• GIS analyst responsibility
• But, as automated as possible using EPA Synchronizer - new software tool
• Edit/validate metadata using EPA Metadata Editor (EME) - existing tool
• EPA Synchronizer uses EME Defaults Database (local MS Access database)
• Once this step happens, the rest is magic
Results: Architectural Design
2. Internal “harvesting” of metadata
• Weekly server process that runs automatically (Second Copy)
• Locates all new & modified metadata files contained within specified disk volumes
• Copies metadata files (including their containing directory structure) to a “web accessible folder” (WAF) on the WED’s intranet server
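The harvesting step above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Second Copy configuration actually used at the WED. It assumes metadata are stored as .xml files alongside the datasets; the function name, the paths, and the cutoff time are hypothetical.

```python
import shutil
from pathlib import Path


def harvest_metadata(volume_root, waf_root, since_epoch):
    """Copy .xml files modified after since_epoch into the WAF.

    Assumption: metadata live in .xml files next to the datasets.
    The directory structure below volume_root is reproduced under waf_root.
    """
    volume_root = Path(volume_root)
    waf_root = Path(waf_root)
    copied = []
    for src in sorted(volume_root.rglob("*.xml")):
        # Skip directories and anything not touched since the last run.
        if not src.is_file() or src.stat().st_mtime <= since_epoch:
            continue
        dest = waf_root / src.relative_to(volume_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves the modification time
        copied.append(dest)
    return copied
```

A weekly scheduled job could call this once per disk volume, passing the time of the previous run as since_epoch. The datasets themselves are never moved, which respects the "don't relocate the data" constraint.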
Results: Architectural Design
3. GeoData Gateway (GDG) metadata harvest
• ESRI GIS Portal Toolkit server (the catalog system)
• maintained by EPA Office of Environmental Information
• Configured to automatically harvest the WED’s metadata from the WAF
• Validates metadata and posts to GDG catalog
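As a rough illustration of what a validation pass might look for, the sketch below checks an FGDC-CSDGM record for three elements that an analyst must supply by hand. It is not the GDG's validator, which checks far more of the standard. The choice of three elements and the function name are assumptions; the element paths use the CSDGM short names.

```python
import xml.etree.ElementTree as ET

# CSDGM short-name paths, relative to the <metadata> root element.
# Checking only these three is an assumption made for illustration.
REQUIRED_PATHS = {
    "Title": "idinfo/citation/citeinfo/title",
    "Abstract": "idinfo/descript/abstract",
    "Purpose": "idinfo/descript/purpose",
}


def missing_elements(xml_text):
    """Return the labels of required elements that are absent or blank."""
    root = ET.fromstring(xml_text)
    missing = []
    for label, path in REQUIRED_PATHS.items():
        node = root.find(path)
        if node is None or not (node.text or "").strip():
            missing.append(label)
    return missing
```

A record that returns an empty list has passed this minimal check; anything else would go back to the analyst for editing in EME.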
Results: Architectural Design
4. Users search GDG using ArcCatalog or a web browser
• full-text searchable on any metadata element value
• can search using geographic extent (completely within or overlapping)
• results returned include full local path to actual dataset
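The two geographic search modes can be expressed as simple bounding-box tests. The sketch below is illustrative only and says nothing about how the GDG implements its search. It assumes extents are given in decimal degrees and do not cross the antimeridian; the Extent type and function names are hypothetical.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """Bounding box in decimal degrees (assumed not to cross 180 degrees)."""
    west: float
    south: float
    east: float
    north: float


def is_within(dataset, search):
    """True if the dataset extent lies completely inside the search box."""
    return (search.west <= dataset.west and dataset.east <= search.east
            and search.south <= dataset.south and dataset.north <= search.north)


def overlaps(dataset, search):
    """True if the dataset extent and the search box share any area or edge."""
    return (dataset.west <= search.east and search.west <= dataset.east
            and dataset.south <= search.north and search.south <= dataset.north)
```

Every extent that is completely within the search box also overlaps it, so "overlapping" is the broader of the two searches.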
Results: Software Development
• Synchronization: the term used by ESRI to describe the update of metadata using internal dataset info
© 2002 ESRI
Results: Software Development
• A custom tool, called the EPA Synchronizer, was developed based on ESRI white paper and sample code
• Written in Visual Basic using ArcObjects libraries
• Can automatically create most of the metadata, pulling values from two sources: dataset, and EME defaults database
• User then inserts Title, Abstract, Purpose, & Supplemental Info using EME
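The two-source idea behind the EPA Synchronizer can be sketched as a merge of organizational defaults with values read from the dataset. The actual tool is written in Visual Basic against ArcObjects and an Access database; the Python below only mirrors the logic. The field names are hypothetical, and letting dataset-derived values override defaults is an assumption.

```python
# Fields the analyst fills in by hand afterwards (per the slide above).
ANALYST_FIELDS = ("title", "abstract", "purpose", "supplemental_info")


def synchronize(dataset_values, defaults):
    """Merge defaults with dataset-derived values and list what is still missing.

    Assumption: values read from the dataset take precedence over defaults.
    """
    record = dict(defaults)
    record.update(dataset_values)
    todo = [name for name in ANALYST_FIELDS if not record.get(name)]
    return record, todo
```

The returned todo list is what remains for the analyst to enter in EME.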
Results: Unit Testing
Remainder of process is automated.
Results: Integrated System Testing
• Identify major commonly-used national and regional datasets
• Start process of creating metadata for them
• Automated processes for harvesting metadata would be triggered
• Full system test would be enabled
• This step has barely begun
Results: User Training and Implementation
• Implementation has not yet occurred
• Draft of instructional user documentation completed, focused on metadata creation and catalog searching
• Technical instructions detail installation and configuration of software tools, harvesting processes, and GDG administration
• Catalog (create metadata for) select existing datasets
• Create metadata for new datasets
Discussion
• Seemingly monumental challenge at first, but untapped existing resources emerged (GDG, EME, Second Copy, web server)
• Federated approach:
– autonomy in data maintenance
– non-intrusive data access
– no changes to data structure
• An elegant, minimalist solution
Discussion
• But the jury is still out.
• Odds of success would increase with:
– Dedicated permanent staff vs. temporary; GIS service and support requires GIS skills, administrative skills, and IT skills (Longley et al., 2005; Longstreth, 1995)
– A champion in the organization; someone needs to foster a high level of support for the project (Obermeyer, 1995)
– Conscious effort to overcome institutional inertia; turf battles, unwillingness to reorganize can kill a project (Evans and Ferreira, 1995)
– Formalized quality control of digital information
– Less paranoia, less government red tape
Conclusion
• Data used in a shared environment become cleaner – more complete and correct (Craig, 1995)
• Useful legacy datasets will receive new metadata
• Some unseen hurdles remain; will need a champion to see it through
• GDG team has plans to bundle EPA Synchronizer with EPA Metadata Editor
Obermeyer and Pinto, 1994
Literature Cited
Craig, William J. (1995). Why We Can’t Share Data: Institutional Inertia. In: Onsrud, H.J. and G. Rushton (Eds.) Sharing Geographic Information. Rutgers University, Center for Urban Policy Research, New Brunswick, New Jersey: 107-118.
ESRI (2002). Creating a Custom Metadata Synchronizer, An ESRI White Paper. July 2002. ESRI, Redlands, CA. http://www.esri.com, last accessed November 26, 2007.
Evans, John and J. Ferreira Jr. (1995). Sharing Spatial Information in an Imperfect World: Interactions Between Technical and Organizational Issues. In: Onsrud, H.J. and G. Rushton (Eds.) Sharing Geographic Information. Rutgers University, Center for Urban Policy Research, New Brunswick, New Jersey: 448-460a.
FGDC (1998). FGDC-STD-001-1998, Content Standard for Digital Geospatial Metadata, Federal Geographic Data Committee, June 1998.
Green, David and T. Bossomaier (2002). Online GIS and Spatial Metadata. Taylor & Francis, London; New York.
Longley, Paul A., M.F. Goodchild, D.J. Maguire, and D.W. Rhind (2005). Geographic Information Systems and Science, 2nd Ed. John Wiley & Sons, Ltd, Chichester, West Sussex, England.
Longstreth, Karl (1995). GIS Collection Development, Staffing, and Training. Journal of Academic Librarianship, vol. 21 no. 4: 267-275.
Ma, Jin (2007). SPEC Kit 298: Metadata. Association of Research Libraries, Washington, DC.
Obermeyer, Nancy J. (1995). Reducing Inter-Organizational Conflict to Facilitate Sharing Geographic Information. In: Onsrud, H.J. and G. Rushton (Eds.) Sharing Geographic Information. Rutgers University, Center for Urban Policy Research, New Brunswick, New Jersey: 138-148.
Obermeyer, Nancy J. and J.K. Pinto (1994). Managing Geographic Information Systems. The Guilford Press, New York.
Questions?