Bioinformatics Databases: Fundamentals of Database Technology & Data Organization Kristen Chambers Director of Bioinformatics Dartmouth Medical School BioInformatics @ Dartmouth Medical School

Post on 21-Dec-2015



Bioinformatics Databases: Fundamentals of Database Technology & Data Organization

Kristen Chambers
Director of Bioinformatics
Dartmouth Medical School

BioInformatics @ Dartmouth Medical School


How can data be organized?

• Paper (i.e. in notebooks)
• Flat files
  – Collection of data records
  – Minimal structure, no metadata
  – Application program must contain relationship information
• Database
  – Hierarchical
  – Network
  – Relational
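The flat-file point can be illustrated with a minimal Python sketch. The file carries no metadata, so the program has to hard-code what each column means. The file contents, the field names, and the `parse_flat_file` helper are all hypothetical.

```python
import io

# Hypothetical flat file: tab-separated records, no header row, no metadata.
FLAT_FILE = "seq001\tgeneA\t1200\nseq002\tgeneB\t800\n"

# The relationship information lives in the program, not in the data:
# the code must "know" that column 0 is an ID, column 1 a gene name, and so on.
FIELD_NAMES = ("seq_id", "gene_name", "length")


def parse_flat_file(handle):
    """Turn each tab-separated line into a dict using hard-coded field names."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        record = dict(zip(FIELD_NAMES, line.split("\t")))
        record["length"] = int(record["length"])  # typing is also the program's job
        records.append(record)
    return records


records = parse_flat_file(io.StringIO(FLAT_FILE))
print(records[0])
```

If the column order in the file changes, nothing in the file signals it; only the program can be updated to match.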


What is a relational database?

A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (“Codd’s 12 Rules”). There are many database systems that use tables but don’t conform to all of the principles. These are often called “semirelational” systems.

from Understanding SQL, Martin Gruber


Practically speaking...

• A database is a body of information stored in two dimensions (rows and columns)
  – Rows are records
  – Columns are attributes of those record entities

• The groups of rows and columns, or tables, are largely independent of each other

• The power of the database lies in the relationships that you construct among the tables

• A database is self-describing: it contains metadata, which is a description of its own structure
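As a hedged illustration of "self-describing", the sketch below asks a database for its own structure. It assumes SQLite (through Python's standard `sqlite3` module) purely because it needs no server; the `Sequence` table is a made-up example, and other DBMSs expose their catalogs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (sname TEXT PRIMARY KEY, length INTEGER)")

# SQLite keeps a catalog table, sqlite_master, that describes the database itself.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Sequence)")]

print(tables)
print(columns)
```

No application code had to remember the column names or types; they were read back from the database's metadata.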


What is a Database Management System (DBMS)?

• A set of programs which define, administer and process databases and their associated applications

• A scalable DBMS can run on multiple platforms (varying sizes)

• A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data

Examples: Oracle, Sybase, 4D, MS Access …


Features of a Relational Database

• Rows (records) are in no particular order

• Columns (fields) are ordered, numbered and named; names should indicate content of the field

• Primary key uniquely identifies each row - ensures that no row is empty, and that every row is different from every other row

• Two-step commit process
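A minimal sketch of two of these features, assuming SQLite via Python's `sqlite3` module. The slide does not spell out what "two-step commit" refers to; the example shows only the ordinary commit/rollback cycle inside a single database, which is an assumption. Table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sequence (sname TEXT PRIMARY KEY NOT NULL, length INTEGER)")

conn.execute("INSERT INTO Sequence VALUES ('seqA', 1200)")
conn.commit()  # make the first row permanent

duplicate_rejected = False
try:
    conn.execute("INSERT INTO Sequence VALUES ('seqB', 800)")
    conn.execute("INSERT INTO Sequence VALUES ('seqA', 999)")  # same key again
    conn.commit()
except sqlite3.IntegrityError:
    # The primary key refuses a second 'seqA' row.
    duplicate_rejected = True
    conn.rollback()  # also undoes the uncommitted 'seqB' insert

names = [row[0] for row in conn.execute("SELECT sname FROM Sequence ORDER BY sname")]
print(duplicate_rejected, names)
```

Because the failed batch is rolled back as a unit, the table is left exactly as it was after the last successful commit.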


Features of a Relational Database

• A view is a subset of the database that an application (or user) can process

• The database schema is the structure of the entire database

• A constraint is a condition you apply to an attribute of a table
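The sketch below shows a view and a constraint side by side, assuming SQLite via Python's `sqlite3` module. The `KnownGenes` rows, the `LongGenes` view, and the 1000-base cutoff are hypothetical values chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE KnownGenes (
        gname  TEXT PRIMARY KEY,
        length INTEGER CHECK (length > 0)  -- constraint on a single attribute
    )
""")
conn.executemany("INSERT INTO KnownGenes VALUES (?, ?)",
                 [("geneA", 1200), ("geneB", 300)])

# A view exposes a subset of the database to an application or user.
conn.execute(
    "CREATE VIEW LongGenes AS SELECT gname FROM KnownGenes WHERE length >= 1000")
long_genes = [row[0] for row in conn.execute("SELECT gname FROM LongGenes")]

constraint_enforced = False
try:
    conn.execute("INSERT INTO KnownGenes VALUES ('geneC', -5)")  # violates CHECK
except sqlite3.IntegrityError:
    constraint_enforced = True

print(long_genes, constraint_enforced)
```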


Relationships between tables

• One-to-One, Many-to-One, Many-to-Many

• A “join” is an operation that combines data from multiple tables into a single result table

• An E-R (entity-relationship) diagram is the basic graphic used to describe the structure of a database

SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
FROM Sequence, KnownGenes
WHERE KnownGenes.length = Sequence.length;
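To see the join query produce a single result table, the following sketch runs it against an in-memory SQLite database through Python's `sqlite3` module. The table and column names come from the query; the rows are hypothetical, and the choice of SQLite is an assumption made so the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sequence   (sname TEXT, length INTEGER);
    CREATE TABLE KnownGenes (gname TEXT, length INTEGER);
    INSERT INTO Sequence   VALUES ('seqA', 1200), ('seqB', 800);
    INSERT INTO KnownGenes VALUES ('geneA', 1200), ('geneB', 450);
""")

# Rows from the two tables are paired wherever their lengths are equal.
rows = conn.execute("""
    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence, KnownGenes
    WHERE KnownGenes.length = Sequence.length
""").fetchall()

print(rows)
```

Only `seqA` and `geneA` share a length, so the result table has one row; `seqB` and `geneB` have no partner and drop out.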


E-R Diagram


The tool for communicating with relational databases: SQL

• Structured Query Language (SQL)

• A query is a question you ask the database, and SQL retrieves the appropriate answer set

• Interactive SQL (command line) vs. RAD (rapid application development) tool

• Standardization issue: ANSI (American National Standards Institute)


Data Types

• Types of data indicate functions that are possible between related fields

• Each field is assigned one data type (imposes structure on data)

• Examples: text (CHAR, VARCHAR), number (INT, DEC); date, time, money, binary

• Standardization issue: ANSI (American National Standards Institute)
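A small sketch of how a field's data type determines which functions make sense on it, assuming SQLite via Python's `sqlite3` module. The `Specimen` table and its values are hypothetical. One caveat tied to the standardization bullet: SQLite treats declared types as affinity hints rather than enforcing them strictly, whereas many other DBMSs reject mismatched values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Specimen (
        label     VARCHAR(20),
        aliquots  INT,
        collected DATE
    )
""")
conn.executemany("INSERT INTO Specimen VALUES (?, ?, ?)",
                 [("tube-1", 4, "2004-01-15"), ("tube-2", 2, "2004-02-01")])

# The declared type of each column is part of the table's definition.
declared = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(Specimen)")}

# Arithmetic is meaningful on the numeric field ...
total_aliquots = conn.execute("SELECT SUM(aliquots) FROM Specimen").fetchone()[0]

# ... while date functions are meaningful on the date field.
first_month = conn.execute(
    "SELECT strftime('%m', MIN(collected)) FROM Specimen").fetchone()[0]

print(declared["aliquots"], total_aliquots, first_month)
```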


A word about database design:

• Designing a database is not trivial

• The value is not in the data, but in the structure

• Design to facilitate the retrieval and interpretation of the data


Example: BioInformatics Core Technology

• Reusable ‘core’ modules, with customizable components

• Standard business logic framework controls transactions (middle layer)

• Metadata-based back-end data storage (facilitates data sharing)


BioInformatics Core Technology

[Architecture diagram; labels recovered from the figure]

• Data storage: Sybase
  – Authntic.db
  – Events.db
  – Questions.db
  – Specimen.db
  – Others...
  – (label: Specific to study)
• Database Access: Generic SQL Method
  – make/get/destroyConnection()
  – prepareTheCall/Statement()
  – executeQuery/Update()
• Tools: Auth Tools, Spec Tools, Quest Tools, Event Tools, Utility Tools, HTML Tools
• Modules: Authentication, Event Track, Utilities, Questions, Spec Track, HTML
• Web Apps
• Specimen Tracking
  – define/create/edit/destroyItem()
  – define/create/edit/destroyPkg()
  – add/deleteItemFrmPkg()
  – send/receivePkg()
• ISQL/Reports
  – create/edit/destroyReportQuery
  – add/edit/deleteQueryParam
• User operations
  – create/edit/retireUser
  – grant/revokeUserPermissions


Life science has become a field which generates an enormous amount of un-integrated data.

How can methods for data organization help to solve this problem?


What is Data Integration?

• Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources - flat files, databases, spreadsheets, URLs...)


Sample integration problem: Cancer Biomarker Discovery

• Clinical center collects blood samples from 1000 individuals with colon cancer

• Expression analysis reveals that protein ‘x’ is over-expressed in these samples, relative to controls

• Could this be a colon cancer biomarker?


Understanding transcription factors for protein ‘x’ production

Show me all genes in the public literature that are putatively related to protein ‘x’, have more than 4-fold expression differential between affected and normal tissue and are homologous to known transcription factors.

Q1: Find homologs (SEQUENCE)
Q2: Find genes with 4-fold differential (EXPRESSION)
Q3: Show me genes in public literature (LITERATURE)

(Q1 ∩ Q2 ∩ Q3)
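Since the question requires all three conditions to hold at once, the combined answer can be read as the set of genes common to the three query results. A minimal Python sketch follows, under the assumption that each source returns comparable gene identifiers; the identifiers and variable names are hypothetical.

```python
# Hypothetical result sets: gene identifiers returned by each data source.
q1_homologs = {"geneA", "geneB", "geneC"}    # Q1, from SEQUENCE data
q2_four_fold = {"geneB", "geneC", "geneD"}   # Q2, from EXPRESSION data
q3_literature = {"geneB", "geneC", "geneE"}  # Q3, from LITERATURE data

# A gene qualifies only if it appears in every one of the three answers.
candidates = q1_homologs & q2_four_fold & q3_literature

print(sorted(candidates))
```

The hard part in practice is the assumption itself: the three sources rarely share one naming scheme, which is the integration problem the following slides address.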


Key components to integration

• Accessing without modifying original data sources
• Handling redundant, conflicting, missing, changing (versions) data
• Normalizing analytical data from different data sources
• Conforming terminology to industry standards
• Accessing the integrated data as a single repository
• Including metadata in repository


Approaches to Integration: where are the key issues addressed?

• Federated database (poses constraints on original data sources; fragility in reliance on source systems)

• Data warehousing (ETL layer, original data sources untouched, required understanding of domain, sophisticated update/archive processes)

• Integrating data source profiles

• Indexed Flat Files

• Others….


Data Warehousing

Source Data → E (Extraction) → T (Transformation) → L (Load) → Data Warehouse
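The extract, transform, and load stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source is a hypothetical CSV export, the warehouse is an in-memory SQLite database, and the function names `extract`, `transform`, and `load` are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with inconsistent formatting.
SOURCE_CSV = "gene,fold_change\n geneA ,4.5\nGENEB,1.2\n"


def extract(text):
    """Read rows from the source without modifying the source itself."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize gene names and convert fold changes to numbers."""
    return [(r["gene"].strip().lower(), float(r["fold_change"])) for r in rows]


def load(conn, rows):
    """Write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression (gene TEXT, fold_change REAL)")
    conn.executemany("INSERT INTO expression VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT gene, fold_change FROM expression ORDER BY gene").fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors the diagram: the source is only read, and all cleanup happens before anything reaches the warehouse.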


Metadata: one key to success

• Describes data types, relationships, histories, etc.

• Back-end (supports developers), front-end (supports users and application)

Data value: 55
Metadata values:
  Data element name: vehicle speed
  Unit: miles per hour
  Description: the average velocity of a vehicle
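The vehicle-speed example can be expressed as a small Python sketch in which a bare value is stored together with the metadata that makes it interpretable. The `DataElement` and `Measurement` classes are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """Metadata: what a value means, independent of any single value."""
    name: str
    unit: str
    description: str


@dataclass(frozen=True)
class Measurement:
    """A data value paired with the metadata needed to interpret it."""
    value: float
    element: DataElement

    def describe(self):
        return f"{self.element.name}: {self.value} {self.element.unit}"


speed = DataElement(
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)
reading = Measurement(55, speed)
print(reading.describe())
```

On its own, 55 could be an age, a temperature, or a count; with the metadata attached, a user or another application can interpret it without asking whoever recorded it.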


Standards: the final frontier

• Naming conventions

• Standard coordinate systems

• Unify interpretations of single object types

• Unify software solutions to the same problem (also data formats)

• Standards for metadata (incompatible or missing metadata)


Developing Standards for Life Sciences Research

• Discovery science does not lend itself well to constraints (especially system constraints)

• Decentralized data management infrastructure, competition

• Wildly varying skill levels for data and information management

Several groups (Bio-Ontologies, HGNC, OMG, etc.) and national research initiatives (EDRN, caBIG, etc.) are taking the lead in the effort to create ‘workable’ standards.


New approach to integration: Cancer Biomarker Discovery

• Network of distributed data ‘silos’ (does not perturb data sources)

• Centralized query and ‘business logic’ servers, accessed through web interface

• CORBA framework ‘manages’ XML profile definitions across the web

• A profile is a set of resource definitions implemented in XML for data sources residing in one or more distributed systems
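The slides do not show what a profile looks like, so the following is only a guess at the general idea: a set of resource definitions written in XML and read by a central server. Every element name, attribute name, and value below is invented for this sketch, and the CORBA layer is left out entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: the element and attribute names are assumptions,
# not the format used by the system described in the slides.
PROFILE_XML = """
<profile name="biomarker-study">
  <resource id="specimens" type="database" location="site-a"/>
  <resource id="expression" type="flat-file" location="site-b"/>
</profile>
"""


def read_profile(xml_text):
    """Map each resource id to its (type, location) pair."""
    root = ET.fromstring(xml_text)
    return {
        res.get("id"): (res.get("type"), res.get("location"))
        for res in root.findall("resource")
    }


resources = read_profile(PROFILE_XML)
print(resources)
```

A query server could consult such a mapping to decide which distributed source to contact, leaving the sources themselves untouched.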
