Introduction to Field Station Databases John Porter Department of Environmental Sciences University of Virginia

Upload: sarah-lucas

Post on 26-Dec-2015


Page 1: Introduction to Field Station Databases

John Porter
Department of Environmental Sciences
University of Virginia

Page 2: Roadmap

• Why do we need field station databases?
• Challenges for Ecological Databases
• Database characteristics and types
• Evolving a Database
• Software Tools and Hardware

Page 3: WHY have Scientific Databases?

• Improvement of data quality
  – multiple users provide multiple opportunities for detecting and correcting problems in data
• Cost
  – data costs less to save than to collect again
  – with environmental data, often data cannot be collected again at any cost

Page 4: WHY have Scientific Databases?

• Environmental Policy and Management
  – environmental policy decisions require data that are regional or national, but most ecological data are collected at smaller scales
  – numerous Federal initiatives
    • NII - National Information Infrastructure
    • FGDC - Federal Geographic Data Committee

Page 5: WHY have Scientific Databases?

• New Science
  – Long Term
    • long-term studies depend on databases to retain project history
  – Synthesis
    • use of data for a purpose other than the one for which they were collected
  – Integrated, multidisciplinary projects
    • depend on databases to facilitate sharing of data

Page 6: Attracting Researchers

Which do you choose?

• Field Station A
  – Beautiful mountain forest setting
  – Modern Laboratories
• Field Station B
  – Beautiful mountain forest setting
  – Modern Laboratories
  – Climate and Meteorological Data
  – Biodiversity Data
  – Soils Data
  – Topographic Data

Page 7: Challenges

• Resources
  – Equipment
  – Operational expenses
  – Personnel

Page 8: Challenges for Scientific Databases

• Long-term perspective
  – without databases, most data do not outlive the project that collected them
  – goal: data that are accessible and interpretable 20 years in the future
    • technological - need persistent media that do not become technologically obsolete
    • contextual - need to capture context of data collection
    • semantic - terms need to be well-defined
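The contextual and semantic challenges above can be made concrete with a small metadata record. This is only a sketch under stated assumptions: the required field names, the example values, and the choice of JSON as a plain-text format are illustrative, not something the slides prescribe.

```python
import json

# Hypothetical required fields; a real station would follow its own metadata standard.
REQUIRED_FIELDS = ("title", "collected_by", "site", "methods", "variables")


def missing_metadata(record):
    """Return the names of required fields that are absent or empty (context)."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def undefined_variables(record):
    """Return variables that lack a definition or a unit (semantics)."""
    return [
        var["name"]
        for var in record.get("variables", [])
        if not var.get("definition") or not var.get("unit")
    ]


# Illustrative record; every value here is made up.
record = {
    "title": "Daily air temperature",
    "collected_by": "Example Field Station staff",
    "site": "Meteorological tower 1",
    "methods": "Shielded thermistor at 2 m height, logged hourly, averaged daily",
    "variables": [
        {"name": "date", "definition": "Calendar date of observation", "unit": "YYYY-MM-DD"},
        {"name": "tavg", "definition": "Mean daily air temperature", "unit": "degrees Celsius"},
    ],
}

# Plain-text JSON avoids tying the documentation to one vendor's file format.
metadata_text = json.dumps(record, indent=2, sort_keys=True)

print(missing_metadata(record))
print(undefined_variables(record))
```

Running the two checks before a dataset is archived is one way to catch gaps while the person who collected the data can still fill them in.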

Page 9: Challenges for Scientific Databases

• Deal with Diversity
  – science means asking NEW questions
    • new kinds of queries
  – scientific data is heterogeneous and diverse
  – scientific users have different backgrounds and goals
  – the user community for a given database will be dynamic

Page 10: Characteristics of Ecological Data

[Figure: kinds of data plotted by data volume per dataset (low to high) against complexity/metadata requirements (low to high). Labeled items: satellite images, weather stations, business data, gene sequences, GIS, primary productivity, population data, biodiversity surveys, soil cores. Two regions are marked: "Most Ecological Data" and "Most Software".]

Page 11: Database Characteristics

“Deep” vs “Wide”

“Deep”
• Relatively few kinds of data
• Large numbers of observations
• Sophisticated query and analysis tools

“Wide”
• Many different types of data
• Smaller number of observations of each type
• Few analysis tools

Page 12: Examples of Scientific Databases

• Large Databases
  – GENBANK - genetic sequence data
  – PDB - protein structure database - 6K+ atomic coordinate entries
  – funding >$1 million/year
  – excellent examples of need for database solutions that scale
  – substantial focus on specialized tools and storage

Page 13: Examples of Scientific Databases

• LTER Sites
  – approximately 15% of site funding
  – focus on long-term data
  – diverse approaches to data management at different sites dictated by
    • locations of researchers
    • types of data collected
  – testbed for “practical data management”

Page 14: Examples of Scientific Databases

• WWW pages of individual researchers or research projects
  – can provide access to data
  – typically do not utilize standards for metadata (documentation)
  – typically provide no query tools

Page 15: Evolving a Database

• Development of a database is an evolutionary process
• Implement system based on current priorities - but think ahead!
• Seek scalable solutions
  – avoid bottlenecks
  – adding the 1000th piece of data should be as easy as adding the first (or easier)
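One way to read the "1000th piece of data" point is that registering a dataset should be a single repeatable step, with no hand-edited index to grow alongside it. The sketch below assumes a hypothetical catalog table in SQLite (via Python's built-in `sqlite3` module); the table name, columns, and file names are invented for illustration.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create (or open) a catalog with one row per dataset."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dataset (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL UNIQUE,
               file_name TEXT NOT NULL
           )"""
    )
    return conn


def add_dataset(conn, title, file_name):
    """Register a dataset; the steps are identical for the 1st and the 1000th."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO dataset (title, file_name) VALUES (?, ?)",
            (title, file_name),
        )
    return cur.lastrowid


conn = open_catalog()
for n in range(1, 1001):
    add_dataset(conn, f"Example dataset {n}", f"dataset_{n:04d}.csv")

count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
print(count)
```

The same call would work unchanged against a file-based catalog by passing a path to `open_catalog`.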

Page 16: Developing a Database - Questions to Ask

• Why is this database NEEDED?
• Who will be the USERS of the database?
• What types of QUESTIONS should the database be able to answer?
• What INCENTIVES will be available for data providers?

Page 17: Library Model

• Individual with 20 books
  – just randomly put on shelves
• Individual with 500 books
  – sort books on shelves based on topic or alphabetically
• Library
  – complex cataloging system
  – controlled keyword and subject vocabularies
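A controlled keyword vocabulary, the last step in the library model above, might look something like the following. The vocabulary terms and synonyms are hypothetical; the point is only that free-text keywords are mapped to a short list of preferred terms and anything unrecognized is flagged rather than silently accepted.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "meteorology": {"weather", "climate data"},
    "biodiversity": {"species survey", "species list"},
    "soils": {"soil cores", "soil"},
}


def normalize_keyword(keyword):
    """Map a free-text keyword to its preferred term, or None if unrecognized."""
    cleaned = keyword.strip().lower()
    for preferred, synonyms in VOCABULARY.items():
        if cleaned == preferred or cleaned in synonyms:
            return preferred
    return None


def catalog_keywords(keywords):
    """Split keywords into sorted preferred terms and a list of rejected inputs."""
    accepted, rejected = set(), []
    for keyword in keywords:
        term = normalize_keyword(keyword)
        if term is None:
            rejected.append(keyword)
        else:
            accepted.add(term)
    return sorted(accepted), rejected


print(catalog_keywords(["Weather", "soil cores", "Soils", "misc stuff"]))
```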

Page 18: Commonly Used Types of Software

• Input and Analysis tools
• Metadata Tools
• Information sharing tools – WWW
• Database Management Systems (DBMS)

Page 19: Input and Analysis

Spreadsheets
• Good
  – Widely used, easy to learn for simple graphical and statistical analyses
  – Commonly already installed on most computers
• Bad
  – Can encourage “bad practices” – create data that can’t easily be used
  – Poor support for sophisticated analyses
  – Lack of auditability – hard to “back track” how data were manipulated

Page 20: Statistical Packages

• Examples: SAS, SPSS, Statistica, etc.
• Good
  – Powerful analysis tools
  – Auditable: can store programs – fully document details of analysis
• Bad
  – Harder to learn
  – Less common on computers
  – Can be expensive
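The auditability point can be illustrated with a stored program: every manipulation is written down and can be rerun against untouched raw data. The slides name SAS, SPSS, and Statistica; the sketch below uses plain Python instead, purely as a stand-in, and the CSV contents and the -999 missing-value code are made-up assumptions.

```python
import csv
import io
import statistics

# Made-up raw data standing in for a file exported from a data logger.
RAW_CSV = """station,obs_date,air_temp_c
TOWER1,2001-06-01,22.0
TOWER1,2001-06-02,24.5
TOWER1,2001-06-03,-999
TOWER2,2001-06-01,19.5
"""

MISSING_VALUE = -999.0  # assumed missing-value code for this example


def mean_temp_by_station(csv_text):
    """Return {station: mean temperature}, skipping missing-value codes."""
    temps = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row["air_temp_c"])
        if value == MISSING_VALUE:  # step 1: drop missing values
            continue
        temps.setdefault(row["station"], []).append(value)
    # step 2: average what remains for each station
    return {station: statistics.mean(values) for station, values in temps.items()}


print(mean_temp_by_station(RAW_CSV))
```

Because the raw text is never modified, anyone can "back track" from the result to the exact steps that produced it.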

Page 21: Other Input

• DBMS – Database Management Systems
  – We’ll talk more about these later…

Page 22: Database Management System (DBMS) Types

• Filesystem-based
  – simple
  – inefficient
  – few capabilities
• Hierarchical
  – phylogenetic structures
  – geographical images
• Network
  – very flexible
  – not widely used
• Relational
  – widely-used, mature
  – table-oriented
  – restricted range of structures
• Object-oriented
  – developing - few commercial implementations
  – diverse structures
  – extensible

Page 23: DBMS Advantages and Disadvantages

• Advantages
  – additional capabilities
    • sorting
    • query
    • integrity checking
  – easy access to data
• Disadvantages
  – few graphical or statistical capabilities
  – proprietary formats may limit archival quality of data
  – require expertise and resources to administer
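The three capabilities listed under Advantages (sorting, query, integrity checking) can be seen in a few lines with SQLite, used here only because it ships with Python; the slides do not mention it. The table layout, station names, and the temperature range in the CHECK constraint are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           station TEXT NOT NULL,
           obs_date TEXT NOT NULL,
           air_temp_c REAL CHECK (air_temp_c BETWEEN -90 AND 60),
           PRIMARY KEY (station, obs_date)
       )"""
)

rows = [
    ("TOWER1", "2001-06-02", 24.5),
    ("TOWER1", "2001-06-01", 22.0),
    ("TOWER2", "2001-06-01", 19.5),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)

# Query and sorting: days warmer than 20 degrees, in date order.
warm_days = conn.execute(
    "SELECT station, obs_date FROM observation "
    "WHERE air_temp_c > ? ORDER BY obs_date, station",
    (20.0,),
).fetchall()


def try_insert(row):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO observation VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Integrity checking: an impossible value and a duplicate key are both refused.
rejected_range = not try_insert(("TOWER2", "2001-06-02", 999.0))
rejected_duplicate = not try_insert(("TOWER1", "2001-06-01", 21.0))

print(warm_days, rejected_range, rejected_duplicate)
```

The constraints live in the table definition, so every tool that writes to the database gets the same checks without reimplementing them.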

Page 24: Choosing a DBMS

• What tasks do you want the DBMS to accomplish?
  – query
  – sorting
  – analysis
• Is there a type of DBMS whose structure best mirrors that of the underlying data?

Page 25: Database Management Systems

• Commercial Products
  – Microsoft ACCESS (part of Microsoft Office)
  – Microsoft SQL Server
  – Oracle
• Freeware
  – MySQL
  – PostgreSQL
  – MiniSQL

Page 26: DBMS Backends

• Increasingly, DBMS are being used as tools that support the “behind the scenes” activities of web sites
  – You may not interact with the database itself, but rather with a TOOL that interacts with the database
• Tools such as Content Management Systems (CMS) use programs that in turn use DBMS to perform their functions
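The idea of a TOOL standing between the user and the database can be sketched as a small class whose methods hide all SQL. Everything here is hypothetical: the class name, the table, and the species records are invented, and a real CMS is far more elaborate.

```python
import html
import sqlite3


class SpeciesListTool:
    """Hypothetical tool: users call these methods and never write SQL."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE sighting (species TEXT NOT NULL, year INTEGER NOT NULL)"
        )

    def record_sighting(self, species, year):
        """Store one sighting in the backend database."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO sighting (species, year) VALUES (?, ?)", (species, year)
            )

    def species_seen(self, since_year):
        """Return an alphabetical list of species sighted in or after since_year."""
        cur = self._conn.execute(
            "SELECT DISTINCT species FROM sighting WHERE year >= ? ORDER BY species",
            (since_year,),
        )
        return [row[0] for row in cur]

    def as_html(self, since_year):
        """Render the kind of page fragment a web site might display."""
        items = "".join(
            f"<li>{html.escape(name)}</li>" for name in self.species_seen(since_year)
        )
        return f"<ul>{items}</ul>"


tool = SpeciesListTool()
tool.record_sighting("Quercus alba", 1998)
tool.record_sighting("Acer rubrum", 2001)
tool.record_sighting("Acer rubrum", 2002)
print(tool.as_html(2000))
```

A visitor to the web page sees only the rendered list; the queries that produce it stay behind the scenes.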

Page 27: Information Sharing Tools

• WWW servers
  – Apache Web Server
    • Free
    • Based on open standards
    • Runs on PCs, Macintosh and Unix
  – Microsoft Web Server
    • Free, often distributed with Windows
    • Links to Microsoft tools
    • Proprietary - runs only under Windows

Page 28: WWW Servers

• Need dedicated Internet address that is connected to the network all the time
  – A high-speed connection is desirable
• Need space to store web content
• The web server need not be local
  – Locally-created WWW pages can be uploaded to a remote server
    • e.g., field station can use server at main university campus and use a modem or even floppy disks to transfer content

Page 29: What is the “Best Software”?

• SORRY! – there is no one list that is the correct answer for everyone!
• A knowledgeable user, rather than the particular software used, controls what can be accomplished
• Costs
  – Cost of software
  – Cost of administration
  – Life-cycle costs
  – Costs of migration

Page 30: Computer Systems

• UNIX/Linux
  – mature, full-functioned system
    • strong on multitasking
    • more reliable and robust
  – steep learning curve
  – lots of free software
  – software can be expensive
  – wide array of WWW tools
• PCs & Macs
  – rapid improvements in operating system design facilitate network access
  – software & hardware inexpensive
  – tools are more user-friendly
  – number of tools rapidly growing

Page 31: Cautionary Notes - Lessons from the Worm Community System

Page 32: Final Thoughts

• Ecological databases are increasingly setting the boundaries for science itself
• Databases evolve, but they don’t spontaneously generate

[Figure: Database Building Blocks: Connectivity, Content, Organization]