Biological Databases, Integration, and Semantic Web Kei Cheung, Ph.D. Yale Center for Medical Informatics Genomics and Bioinformatics, December 4, 2006


Biological Databases, Integration, and Semantic Web

Kei Cheung, Ph.D.

Yale Center for Medical Informatics

Genomics and Bioinformatics, December 4, 2006

Outline

• Database introduction
  – Overview
  – Query language

• Database integration
  – Issues

• Semantic Web approach to database integration
  – Overview of Semantic Web

Introduction

• The Human Genome Project has transformed the biological sciences into information sciences

• Advances in the biological sciences depend on:
  – creation of new knowledge
  – effective information management

• Future progress in biological research will be highly dependent on the ability of the scientific community to both deposit and utilize stored information on-line.

• The database challenge for the future will be to develop new ways to acquire, store and retrieve not only biological data, but also the biological context for these data.

Variety of Biological Databases

• Different data categories
  – DNA sequence, gene expression, protein structure, pathway, etc.

• Community vs. lab-specific vs. proprietary databases

• Mega vs. medium vs. boutique databases

• One thing in common: many of them are Web accessible

Food for thought

• Will a biological database be different from a biological journal?

What is a database?

• A database is a collection of records stored in a computer in a systematic way, so that a computer program can consult it to answer questions.

• The items retrieved in answer to queries become information that can be used to make decisions.

• The computer program used to manage and query a database is known as a database management system (DBMS)
  – E.g., Oracle, MS Access, MySQL

Database components

• The central concept of a database is that of a collection of records, or pieces of knowledge

• For a given database, there is a structural description of the type of facts held in that database: this description is known as a schema

• The schema describes the objects that are represented in the database, and the relationships among them.

Data Model

• There are a number of different ways of organizing a schema (i.e., of modeling the database structure): these are known as data models.
  – Relational model
  – Hierarchical model
  – Network model
  – Object-oriented model

Query Language

• A query language is a computer language used to create, modify, retrieve and manipulate data in databases

• SQL (Structured Query Language) is a well-known query language for relational databases
  – SQL is an ANSI standard language for RDBMSs
  – Different RDBMS vendors may provide slightly different SQL syntax or additional proprietary extensions that are applicable only to their systems

SQL

• CREATE TABLE
• INSERT
• SELECT
• UPDATE
• DELETE
• CREATE VIEW

CREATE TABLE

CREATE TABLE <tablename> (
    <column1> <data type1> [<constraint1>],
    <column2> <data type2> [<constraint2>],
    <column3> <data type3> [<constraint3>],
    …
);

Example

CREATE TABLE sgd_features (
    sgd_id VARCHAR(20) NOT NULL PRIMARY KEY,
    feature_type VARCHAR(20) NOT NULL DEFAULT 'ORF',
    quality VARCHAR(20),
    feature_name VARCHAR(20),
    standard_name VARCHAR(20),
    chromosome INT(2) NOT NULL,
    start_coord INT(10) NOT NULL,
    stop_coord INT(10) NOT NULL,
    strand CHAR(1) NOT NULL,
    description VARCHAR(500)
);
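A minimal sketch of how this table definition could be tried out, assuming Python's standard sqlite3 module and an in-memory SQLite database (the slides do not say which DBMS was used). The coordinate column is named stop_coord here, the name the later queries use, and the row inserted at the end is invented purely to show the DEFAULT constraint.

```python
import sqlite3

# In-memory SQLite database (an assumption; any RDBMS with similar SQL would do).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sgd_features (
        sgd_id        VARCHAR(20)  NOT NULL PRIMARY KEY,
        feature_type  VARCHAR(20)  NOT NULL DEFAULT 'ORF',
        quality       VARCHAR(20),
        feature_name  VARCHAR(20),
        standard_name VARCHAR(20),
        chromosome    INT(2)       NOT NULL,
        start_coord   INT(10)      NOT NULL,
        stop_coord    INT(10)      NOT NULL,
        strand        CHAR(1)      NOT NULL,
        description   VARCHAR(500)
    )
""")

# PRAGMA table_info yields one row per column: (cid, name, type, notnull, default, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sgd_features)")]

# Invented row that omits feature_type, so the DEFAULT 'ORF' constraint applies.
conn.execute(
    "INSERT INTO sgd_features (sgd_id, chromosome, start_coord, stop_coord, strand) "
    "VALUES ('demo1', 1, 10, 20, 'W')"
)
default_type = conn.execute(
    "SELECT feature_type FROM sgd_features WHERE sgd_id = 'demo1'"
).fetchone()[0]
print(columns, default_type)
```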

INSERT

INSERT INTO <table> (<column1>, …, <columnN>)
VALUES (<value1>, …, <valueN>);

Example

…
INSERT INTO sgd_features (sgd_id, feature_type, feature_name, chromosome, start_coord, stop_coord, strand, description)
VALUES ('S000006692', 'tRNA', 'tQ(UUG)C', 3, 168368, 168297, 'C', 'tRNA-Gln');
…
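As a sketch of the same INSERT issued from a program, the snippet below uses Python's sqlite3 module (an assumption, not the slides' setup) with "?" placeholders, and targets the sgd_features table with a reduced column set. It also shows the PRIMARY KEY constraint rejecting a second row with the same sgd_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced sgd_features table: only the columns named in the INSERT above.
conn.execute("""
    CREATE TABLE sgd_features (
        sgd_id       TEXT NOT NULL PRIMARY KEY,
        feature_type TEXT NOT NULL DEFAULT 'ORF',
        feature_name TEXT,
        chromosome   INTEGER NOT NULL,
        start_coord  INTEGER NOT NULL,
        stop_coord   INTEGER NOT NULL,
        strand       TEXT NOT NULL,
        description  TEXT
    )
""")

insert_sql = (
    "INSERT INTO sgd_features (sgd_id, feature_type, feature_name, chromosome, "
    "start_coord, stop_coord, strand, description) VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
)
row = ("S000006692", "tRNA", "tQ(UUG)C", 3, 168368, 168297, "C", "tRNA-Gln")
conn.execute(insert_sql, row)  # placeholders leave value quoting to the driver

# A second row with the same primary key violates the PRIMARY KEY constraint.
try:
    conn.execute(insert_sql, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

stored = conn.execute("SELECT COUNT(*) FROM sgd_features").fetchone()[0]
print(stored, duplicate_rejected)
```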

SELECT

SELECT [DISTINCT] <col1> [as <alias1>] [, <col2> [as <alias2>], ...]
FROM <table1> [as <alias1>] [, <table2> [as <alias2>], …]
WHERE <Boolean conditions>
[additional clauses];

Typical Conditional Operators: =, >, >=, <, <=, <>, LIKE, IN

Additional Clauses: ORDER BY, GROUP BY, HAVING, LIMIT

Built-in functions: UPPER, LOWER, SUBSTRING, LENGTH, COUNT, MAX, MIN, AVG, etc

DISTINCT

SELECT DISTINCT chromosome
FROM sgd_features
WHERE (feature_type='ORF');

The answer is: 1,2,4,9,10,15,16,17
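To see what DISTINCT does, the sketch below runs the same query with Python's sqlite3 module on four invented rows (not the SGD data behind the answer above). An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT, feature_type TEXT, chromosome INTEGER)")
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?, ?)", [
    ("a1", "ORF", 1),
    ("a2", "ORF", 1),   # same chromosome as a1: DISTINCT reports it once
    ("a3", "ORF", 4),
    ("a4", "tRNA", 3),  # filtered out by the WHERE clause
])

rows = conn.execute("""
    SELECT DISTINCT chromosome
    FROM sgd_features
    WHERE feature_type = 'ORF'
    ORDER BY chromosome
""").fetchall()
chromosomes = [r[0] for r in rows]
print(chromosomes)
```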

count function

SELECT COUNT(*)
FROM sgd_features
WHERE (start_coord < 300000) AND (feature_name LIKE 'Y%');

The answer is 6
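The combination of COUNT, a numeric comparison, and a LIKE pattern can be sketched with Python's sqlite3 module on invented rows (so the count differs from the slide's answer). The feature names here are placeholders, not real SGD names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (feature_name TEXT, start_coord INTEGER)")
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?)", [
    ("Yname1", 100000),  # matches both conditions
    ("Yname2", 250000),  # matches both conditions
    ("Yname3", 400000),  # start_coord too large
    ("tname4", 50000),   # name does not start with Y
])

# 'Y%' matches any name that begins with Y; % stands for zero or more characters.
matching = conn.execute("""
    SELECT COUNT(*)
    FROM sgd_features
    WHERE (start_coord < 300000) AND (feature_name LIKE 'Y%')
""").fetchone()[0]
print(matching)
```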

string function

SELECT LENGTH(feature_name)
FROM sgd_features
WHERE (sgd_id='S000007274');

The answer is 5

math function

SELECT MIN(start_coord)
FROM sgd_features
WHERE (strand='W');

ORDER BY

SELECT sgd_id, feature_type, feature_name, chromosome
FROM sgd_features
WHERE (feature_name LIKE 'Y%')
ORDER BY start_coord DESC;

GROUP BY

This allows an aggregate function to be applied to each group of rows that share the same value in the grouping column(s).

SELECT feature_type, AVG(stop_coord-start_coord) as avg_diff
FROM sgd_features
WHERE (strand = 'W')
GROUP BY feature_type;
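A small sketch of the grouping, assuming Python's sqlite3 module and four invented rows: the WHERE clause first keeps the 'W' strand rows, then AVG is computed once per feature_type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features "
    "(feature_type TEXT, start_coord INTEGER, stop_coord INTEGER, strand TEXT)"
)
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?, ?, ?)", [
    ("ORF", 100, 1300, "W"),   # length 1200
    ("ORF", 2000, 3600, "W"),  # length 1600
    ("tRNA", 500, 572, "W"),   # length 72
    ("ORF", 9000, 8000, "C"),  # excluded: strand is not 'W'
])

averages = conn.execute("""
    SELECT feature_type, AVG(stop_coord - start_coord) AS avg_diff
    FROM sgd_features
    WHERE strand = 'W'
    GROUP BY feature_type
    ORDER BY feature_type
""").fetchall()
print(averages)
```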

HAVING

This works like the WHERE clause, except that it filters the groups produced by GROUP BY rather than the individual rows, so its condition can use aggregate functions.

SELECT feature_type, AVG(stop_coord-start_coord) as avg_diff
FROM sgd_features
WHERE (strand = 'W')
GROUP BY feature_type
HAVING (AVG(stop_coord-start_coord) > 1000);
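The order of evaluation (WHERE on rows, then GROUP BY, then HAVING on groups) can be checked with a sketch like the following, which assumes Python's sqlite3 module and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features "
    "(feature_type TEXT, start_coord INTEGER, stop_coord INTEGER, strand TEXT)"
)
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?, ?, ?)", [
    ("ORF", 100, 1300, "W"),
    ("ORF", 2000, 3600, "W"),
    ("tRNA", 500, 572, "W"),
    ("ORF", 9000, 8000, "C"),
])

# WHERE drops the 'C' row, GROUP BY forms the ORF and tRNA groups,
# and HAVING keeps only groups whose average length exceeds 1000.
long_types = conn.execute("""
    SELECT feature_type, AVG(stop_coord - start_coord) AS avg_diff
    FROM sgd_features
    WHERE strand = 'W'
    GROUP BY feature_type
    HAVING AVG(stop_coord - start_coord) > 1000
""").fetchall()
print(long_types)
```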

LIMIT [start, ] rows

Returns only the specified number of rows.

SELECT *
FROM sgd_features
WHERE (feature_type='ORF')
LIMIT 3;
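A sketch of both LIMIT forms, assuming Python's sqlite3 module (SQLite accepts the "LIMIT start, rows" form as well) and invented identifiers. An ORDER BY is added because, without it, which rows LIMIT returns is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT, feature_type TEXT)")
# Invented identifiers for illustration only.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, 'ORF')",
    [("a1",), ("a2",), ("a3",), ("a4",), ("a5",)],
)

# LIMIT 3: at most three rows.
first_three = [r[0] for r in conn.execute(
    "SELECT sgd_id FROM sgd_features WHERE feature_type = 'ORF' ORDER BY sgd_id LIMIT 3"
)]

# LIMIT 1, 2: skip one row, then return at most two.
skip_one_take_two = [r[0] for r in conn.execute(
    "SELECT sgd_id FROM sgd_features WHERE feature_type = 'ORF' ORDER BY sgd_id LIMIT 1, 2"
)]
print(first_three, skip_one_take_two)
```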

Join

SELECT s.feature_name, s.feature_type, s.chromosome, g.bio_function, g.bio_process, g.cell_location
FROM sgd_features as s, gene_ont as g
WHERE (s.sgd_id=g.sgdid);

[Tables shown on the slide: sgd_features, gene_ont]
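The join can be sketched on two tiny invented tables, assuming Python's sqlite3 module. Only a subset of the columns is modeled, and the cell values are placeholders. The comma-plus-WHERE form used on the slide and the explicit JOIN ... ON form return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, feature_name TEXT)")
conn.execute("CREATE TABLE gene_ont (sgdid TEXT, bio_process TEXT)")
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?)",
                 [("a1", "Yname1"), ("a2", "Yname2"), ("a3", "Yname3")])
conn.executemany("INSERT INTO gene_ont VALUES (?, ?)",
                 [("a1", "process one"), ("a3", "process two")])

# Slide-style join: list both tables, match the keys in WHERE.
implicit = conn.execute("""
    SELECT s.feature_name, g.bio_process
    FROM sgd_features AS s, gene_ont AS g
    WHERE s.sgd_id = g.sgdid
    ORDER BY s.feature_name
""").fetchall()

# Equivalent explicit inner join.
explicit = conn.execute("""
    SELECT s.feature_name, g.bio_process
    FROM sgd_features AS s
    JOIN gene_ont AS g ON s.sgd_id = g.sgdid
    ORDER BY s.feature_name
""").fetchall()
print(implicit)
```

The row for a2 is absent from the result because it has no matching entry in gene_ont; an inner join keeps only matched pairs.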

UPDATE

UPDATE <table>
SET <col1>=<val1> [, <col2>=<val2>, …]
[WHERE clause];

Example

UPDATE sgd_features
SET quality='Verified'
WHERE sgd_id='S00000010';
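A sketch of an UPDATE restricted by a WHERE clause, assuming Python's sqlite3 module, the sgd_features table, and invented identifiers. The cursor's rowcount reports how many rows the statement changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, quality TEXT)")
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?)",
                 [("a1", "Uncharacterized"), ("a2", "Uncharacterized")])

cursor = conn.execute(
    "UPDATE sgd_features SET quality = ? WHERE sgd_id = ?",
    ("Verified", "a1"),
)
rows_changed = cursor.rowcount

# Only the row selected by the WHERE clause is modified.
qualities = dict(conn.execute("SELECT sgd_id, quality FROM sgd_features"))
print(rows_changed, qualities)
```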

DELETE

DELETE FROM <table> [WHERE clause];

Example

DELETE FROM sgd_features
WHERE sgd_id IN ('S00000010', 'S000003599');

DELETE FROM sgd_features;
(this deletes all data in the table)
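The difference between the two DELETE forms can be sketched as follows, assuming Python's sqlite3 module and invented identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY)")
# Invented identifiers for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?)", [("a1",), ("a2",), ("a3",)])

def count_rows():
    return conn.execute("SELECT COUNT(*) FROM sgd_features").fetchone()[0]

# With a WHERE clause, only the listed rows are removed.
conn.execute("DELETE FROM sgd_features WHERE sgd_id IN ('a1', 'a2')")
after_selective = count_rows()

# Without a WHERE clause, every row is removed; the empty table itself remains.
conn.execute("DELETE FROM sgd_features")
after_full = count_rows()
print(after_selective, after_full)
```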

CREATE VIEW

CREATE VIEW <viewname> [(<col1>, <col2>, …)]
AS SELECT …;

Example (VIEW)

CREATE VIEW sgd_features_ORF_W
AS SELECT *
FROM sgd_features
WHERE feature_type='ORF' AND strand='W';
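A view stores the query, not a copy of the data. The sketch below, assuming Python's sqlite3 module and invented rows, defines a view named sgd_features_ORF_W over sgd_features and shows it picking up a row added to the base table afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT, feature_type TEXT, strand TEXT)")
# Invented rows for illustration only.
conn.executemany("INSERT INTO sgd_features VALUES (?, ?, ?)", [
    ("a1", "ORF", "W"),
    ("a2", "ORF", "C"),
    ("a3", "tRNA", "W"),
])

conn.execute("""
    CREATE VIEW sgd_features_ORF_W
    AS SELECT *
    FROM sgd_features
    WHERE feature_type = 'ORF' AND strand = 'W'
""")

def view_ids():
    return [r[0] for r in conn.execute(
        "SELECT sgd_id FROM sgd_features_ORF_W ORDER BY sgd_id")]

before = view_ids()
# A new matching row in the base table appears in the view without redefining it.
conn.execute("INSERT INTO sgd_features VALUES ('a4', 'ORF', 'W')")
after = view_ids()
print(before, after)
```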

Other Database Topics

• Normalization

• Query optimization

• Maintenance

Database Integration

Needs for database integration

• Biological data are more meaningful in context, and no single DB supplies a complete context for a given biological research study

• New hypotheses are derived by generalizing across a multitude of examples from different DBs

• Integration of related data enables validation and consistency checking

Example

• Find the genes for a metabolic pathway that are localized within a genome (e.g., find the clusters of genes involved in tryptophan biosynthesis, and in histidine biosynthesis, in the E. coli genome)

• This query involves integrating data from pathway databases and genome mapping databases

Issues

Growth in the number of biological databases (published in NAR DB issue)

Database Size

Database Complexity (e.g., DNA Microarray Database)

Different Levels of Heterogeneity

• This problem arises from the fact that different databases are designed and developed independently to address local needs.

• There are two broad levels of heterogeneity
  – Syntactic heterogeneity
  – Semantic heterogeneity

Syntactic Heterogeneity

• Technological heterogeneity
  – Difference in hardware platforms, operating systems, access methods (e.g., HTTP, ODBC, etc.)
  – Difference in database engines, query languages, data access interfaces, etc.

• Different data formats/models
  – Structured vs. unstructured data, files vs. databases, etc.
  – Object-oriented vs. relational data model

Semantic Heterogeneity

• Nomenclature problem
  – Gene/protein symbols/names (based on phenotype, sequence, function, organisms, etc.)
    • TSC1
    • ABCC2
    • ALDH
    • Sonic Hedgehog
  – ID proliferation
    • Different ID schemes: 1OF1 (PDB ID) and P06478 (SwissProt ID) correspond to Herpes Thymidine Kinase
    • Lexical variation: GO1234, GO:1234, GO-1234
  – Synonyms vs. homonyms
    • Dopamine receptor D2: DRD2, DRD-2, D2
    • Armadillo (fruit flies) vs. β-catenin (mice)
    • PSM1 (human) = PSM2 (yeast); PSM1 (yeast) = PSM2 (human)
    • PSA: prostate specific antigen, puromycin-sensitive aminopeptidase, psoriatic arthritis, professional skaters association

• “Biologists would rather share their toothbrush than a gene name … Gene nomenclature is beyond redemption”, said Michael Ashburner
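The lexical-variation case (GO1234, GO:1234, GO-1234) is the easiest of these problems to handle mechanically. A minimal sketch in Python follows; the function name normalize_go_id and the choice of "GO:" as the canonical separator are assumptions made for illustration, and the harder synonym and homonym cases would need a curated mapping rather than a pattern.

```python
import re

# Accept "GO" followed by an optional ":" or "-" and then digits.
_GO_PATTERN = re.compile(r"^GO[:\-]?(\d+)$", re.IGNORECASE)

def normalize_go_id(raw):
    """Map lexical variants of a GO-style identifier to one canonical spelling."""
    match = _GO_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"not a recognizable GO-style identifier: {raw!r}")
    return "GO:" + match.group(1)

variants = ["GO1234", "GO:1234", "GO-1234"]
normalized = {normalize_go_id(v) for v in variants}
print(normalized)
```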

Other Semantic Heterogeneities

• Inconsistent values
  – The same set of markers may be found in different orders across different mapping databases (e.g., GDB and DB/12)

• Different units of measurement
  – Kb vs. bp

• Different data coding schemes
  – over-expressed vs. 2-fold change
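For the units case, an integration layer typically converts every source to one unit before comparing values. A minimal sketch in Python, where the function name to_bp and the unit table are assumptions made for illustration:

```python
# Number of base pairs per unit; keys are lower-cased unit labels.
BP_PER_UNIT = {"bp": 1, "kb": 1_000, "mb": 1_000_000}

def to_bp(value, unit):
    """Convert a length given in bp, Kb, or Mb to base pairs."""
    return round(value * BP_PER_UNIT[unit.lower()])

# Two hypothetical sources report the same length in different units.
from_source_a = to_bp(1.5, "Kb")
from_source_b = to_bp(1500, "bp")
print(from_source_a == from_source_b)
```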

Use of Standards to Address the Heterogeneity Issue

• Data specification standards (e.g. MIAME)

• Data representation standards
  – Syntax (e.g., XML, ASN.1)
  – Structure (e.g., MAGE-ML)

• Standard vocabulary/ontology (e.g., MGED ontology working group, gene ontology, etc)

Semantic Web

Semantic Web

• It provides a standard framework that allows data to be integrated and reused across application, enterprise, and community boundaries

• It is a web of data linked up in such a way as to be easily processable by machines, on a global scale.

• It is about two things:
  – It is about common formats for interchange of data
  – It is about language for recording how the data relates to real world objects.

• It is a collaborative effort led by the World Wide Web Consortium (W3C) with participation from a large number of researchers and industrial partners

Semantic Web for the Life Sciences

• “… the life sciences are a flagship area for the semantic web …” (Tim Berners-Lee)

• “… Today, boundaries that inhibit data sharing limit innovation in research and clinical settings alike, and impede the efficient delivery of care. Semantic web technologies give us a chance to solve this problem, resulting, ultimately, in faster drug targeting, more accurate reporting, and better patient outcomes.” (Susan Hockfield, President of MIT)

• Semantic Web Health Care and Life Sciences Interest Group (SW HCLSIG)
  – http://www.w3.org/2001/sw/hcls/

Component Technologies of Semantic Web

• Uniform Resource Identifier (URI)
  – A standard means of addressing resources on the Web
  – e.g., http://en.wikipedia.org/wiki/Protein_P53

• Ontology
  – Specification of a conceptualization of a knowledge domain

• Ontological Language
  – Resource Description Framework (RDF)
  – Web Ontology Language (OWL)
  – Both RDF and OWL have XML serialization

• Database and Tool
  – RDF databases: Sesame, Kowari, Oracle
  – OWL Reasoners: Racer, Pellet, FaCT

RDF Statement

An RDF statement consists of:
• Subject: resource identified by a URI
• Predicate: property (as defined in a namespace identified by a URI)
• Object: property value (literal) or a resource

For example, the dbSNP Website is a subject, creator is a predicate, NCBI is an object.

A resource can be described by multiple statements.
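The subject-predicate-object structure can be sketched with plain Python tuples. This is an illustration of the data model only, not an RDF library; the Statement type and the describe helper are names invented here, while the URIs are the dbSNP, NCBI, and Dublin Core ones used in this deck.

```python
from collections import namedtuple

# One RDF statement: subject and predicate are URIs, obj is a URI or a literal.
Statement = namedtuple("Statement", ["subject", "predicate", "obj"])

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP"

statements = [
    Statement(DBSNP, DC + "creator", "http://www.ncbi.nlm.nih.gov"),  # object is a resource
    Statement(DBSNP, DC + "language", "en"),                          # object is a literal
]

def describe(resource, stmts):
    """Collect every predicate/object pair stated about one resource."""
    return {s.predicate: s.obj for s in stmts if s.subject == resource}

description = describe(DBSNP, statements)
print(description)
```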

Graphical & XML Representation

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"></dc:creator>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>

[Graph: the node http://www.ncbi.nlm.nih.gov/SNP has two outgoing arcs]
  http://purl.org/dc/elements/1.1/creator → http://www.ncbi.nlm.nih.gov
  http://purl.org/dc/elements/1.1/language → en

RDF Schema (RDFS)

• RDF Schema terms:
  – Class
  – Property
  – type
  – subClassOf
  – range
  – domain

• Example:
  <Person, type, Class>
  <has_parent, type, Property>
  <Family_member, subClassOf, Person>
  <"Joe Smith", type, Family_member>
  <has_parent, range, Family_member>
  <has_parent, domain, Family_member>
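What the schema terms buy is inference: new type statements follow from subClassOf, domain, and range. The sketch below hand-rolls three such rules in plain Python over the example triples. It is a simplified illustration, not a full RDFS reasoner (subClassOf transitivity, for instance, is left out), and the statement about "Mary Smith" is a hypothetical addition so that the domain and range rules have something to act on.

```python
triples = {
    ("Person", "type", "Class"),
    ("has_parent", "type", "Property"),
    ("Family_member", "subClassOf", "Person"),
    ("Joe Smith", "type", "Family_member"),
    ("has_parent", "range", "Family_member"),
    ("has_parent", "domain", "Family_member"),
    ("Joe Smith", "has_parent", "Mary Smith"),  # hypothetical extra statement
}

def infer_types(facts):
    """Apply the subClassOf, domain, and range rules until nothing new appears."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                # x type C, C subClassOf D  =>  x type D
                if p == "type" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "type", o2))
                # s p o, p domain C  =>  s type C
                if p2 == "domain" and s2 == p:
                    new.add((s, "type", o2))
                # s p o, p range C  =>  o type C
                if p2 == "range" and s2 == p:
                    new.add((o, "type", o2))
        if new <= known:
            return known
        known |= new

closure = infer_types(triples)
inferred = closure - triples
print(sorted(inferred))
```

Under these rules, Joe Smith is inferred to be a Person, and Mary Smith is inferred to be a Family_member (from the range of has_parent) and therefore also a Person.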

Web Ontology Language (OWL)

• OWL builds on top of RDF

• It semantically extends RDFS

• It is based on description logics

• Three species of OWL
  – OWL-Lite
  – OWL-DL
  – OWL-Full

Current Syntactic Web vs. Future Semantic Web

From Vision to Implementation: Data Mashup

Data Mashup

• It refers to web resources that weave data from different web resources into a new service

• Disciplines like biosciences would benefit greatly from data mashups

• A data mashup is difficult to create unless the data are shared in a standard machine-readable format

Nature’s Data Mashup: A Google Earth Application

Tracking of Avian Flu

The End