
Biological Databases, Integration, and Semantic Web

Kei Cheung, Ph.D.

Yale Center for Medical Informatics

Genomics and Bioinformatics, December 4, 2006

Outline

• Database introduction
– Overview
– Query language

• Database integration
– Issues

• Semantic Web approach to database integration
– Overview of Semantic Web

Introduction

• The Human Genome Project has transformed the biological sciences into information sciences

• Advances in the biological sciences depend on:
– creation of new knowledge
– effective information management

• Future progress in biological research will be highly dependent on the ability of the scientific community to both deposit and utilize stored information on-line.

• The database challenge for the future will be to develop new ways to acquire, store and retrieve not only biological data, but also the biological context for these data.

Variety of Biological Databases

• Different data categories
– DNA sequence, gene expression, protein structure, pathway, etc.

• Community vs. lab-specific vs. proprietary databases

• Mega vs. medium vs. boutique databases

• One thing in common: many of them are Web accessible

Food for thought

• Will a biological database be different from a biological journal?

What is a database?

• A database is a collection of records stored in a computer in a systematic way, so that a computer program can consult it to answer questions.

• The items retrieved in answer to queries become information that can be used to make decisions.

• The computer program used to manage and query a database is known as a database management system (DBMS)
– E.g., Oracle, MS Access, MySQL

Database components

• The central concept of a database is that of a collection of records, or pieces of knowledge

• For a given database, there is a structural description of the type of facts held in that database: this description is known as a schema

• The schema describes the objects that are represented in the database, and the relationships among them.

Data Model

• There are a number of different ways of organizing a schema (i.e., of modeling the database structure): these are known as data models.
– Relational model
– Hierarchical model
– Network model
– Object-oriented model

Query Language

• A query language is a computer language used to create, modify, retrieve and manipulate data in databases

• SQL (Structured Query Language) is a well-known query language for relational databases
– SQL is an ANSI standard language for RDBMSs
– Different RDBMS vendors may provide slightly different SQL syntax or additional proprietary extensions that are applicable only to their systems

• CREATE TABLE
• INSERT
• SELECT
• UPDATE
• DELETE
• CREATE VIEW

CREATE TABLE

CREATE TABLE <tablename> (
<column1> <data type1> [<constraint1>],
<column2> <data type2> [<constraint2>],
<column3> <data type3> [<constraint3>],
…
);

Example

CREATE TABLE sgd_features (
sgd_id VARCHAR(20) NOT NULL PRIMARY KEY,
feature_type VARCHAR(20) NOT NULL DEFAULT 'ORF',
quality VARCHAR(20),
feature_name VARCHAR(20),
standard_name VARCHAR(20),
chromosome INT(2) NOT NULL,
start_coord INT(10) NOT NULL,
stop_coord INT(10) NOT NULL,
strand CHAR(1) NOT NULL,
description VARCHAR(500)
);
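As a minimal sketch, the table definition can be tried out without installing a database server by running it through Python's built-in sqlite3 module. SQLite is an assumption here (the slides appear to use MySQL-style types, which SQLite accepts as type names), and the last coordinate column is named stop_coord to match the later queries.

```python
import sqlite3

# In-memory SQLite database (an assumption; the slides do not name a specific DBMS here).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sgd_features (
        sgd_id        VARCHAR(20)  NOT NULL PRIMARY KEY,
        feature_type  VARCHAR(20)  NOT NULL DEFAULT 'ORF',
        quality       VARCHAR(20),
        feature_name  VARCHAR(20),
        standard_name VARCHAR(20),
        chromosome    INT(2)       NOT NULL,
        start_coord   INT(10)      NOT NULL,
        stop_coord    INT(10)      NOT NULL,
        strand        CHAR(1)      NOT NULL,
        description   VARCHAR(500)
    )
""")

# PRAGMA table_info yields one row per column: (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sgd_features)")]
print(columns)
```

Listing the columns back from the database is a quick way to confirm that the schema was created as intended.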

INSERT

INSERT INTO <table> (<column1>, …, <columnN>)
VALUES (<value1>, …, <valueN>);

Example

…
INSERT INTO sgd_features (sgd_id, feature_type, feature_name, chromosome, start_coord, stop_coord, strand, description)
VALUES ('S000006692', 'tRNA', 'tQ(UUG)C', 3, 168368, 168297, 'C', 'tRNA-Gln');
…
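The INSERT above can be sketched in Python with sqlite3, assuming SQLite as the DBMS and a reduced version of the table that keeps only the columns named in the statement. Values are passed as `?` parameters rather than pasted into the SQL string, and a second insert of the same sgd_id shows the PRIMARY KEY constraint rejecting a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced table for illustration: only the columns used by the INSERT are kept.
conn.execute("""
    CREATE TABLE sgd_features (
        sgd_id       TEXT    NOT NULL PRIMARY KEY,
        feature_type TEXT    NOT NULL DEFAULT 'ORF',
        feature_name TEXT,
        chromosome   INTEGER NOT NULL,
        start_coord  INTEGER NOT NULL,
        stop_coord   INTEGER NOT NULL,
        strand       TEXT    NOT NULL,
        description  TEXT
    )
""")

insert_sql = (
    "INSERT INTO sgd_features (sgd_id, feature_type, feature_name, chromosome, "
    "start_coord, stop_coord, strand, description) VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
)
# The row shown on the slide.
row = ("S000006692", "tRNA", "tQ(UUG)C", 3, 168368, 168297, "C", "tRNA-Gln")
conn.execute(insert_sql, row)
conn.commit()

stored = conn.execute("SELECT * FROM sgd_features").fetchall()

# Inserting the same primary key a second time violates the PRIMARY KEY constraint.
try:
    conn.execute(insert_sql, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(stored, duplicate_rejected)
```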

SELECT

SELECT [DISTINCT] <col1> [AS <alias1>] [, <col2> [AS <alias2>], ...]
FROM <table1> [AS <alias1>] [, <table2> [AS <alias2>], …]
WHERE <Boolean conditions>
[additional clauses];

Typical Conditional Operators: =, >, >=, <, <=, <>, LIKE, IN

Additional Clauses: ORDER BY, GROUP BY, HAVING, LIMIT

Built-in functions: UPPER, LOWER, SUBSTRING, LENGTH, COUNT, MAX, MIN, AVG, etc

DISTINCT

SELECT DISTINCT chromosome
FROM sgd_features
WHERE (feature_type='ORF');

The answer is: 1,2,4,9,10,15,16,17
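To see how DISTINCT collapses repeated values, here is a small sketch using Python's sqlite3 (SQLite is an assumption). The rows are made up for illustration and are not real SGD records, so the result differs from the answer on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features ("
    "sgd_id TEXT PRIMARY KEY, feature_type TEXT NOT NULL, chromosome INTEGER NOT NULL)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?)",
    [("D1", "ORF", 1), ("D2", "ORF", 1), ("D3", "ORF", 4), ("D4", "tRNA", 3), ("D5", "ORF", 4)],
)

# ORDER BY is added here only to make the output order deterministic.
chromosomes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT chromosome FROM sgd_features "
        "WHERE feature_type = 'ORF' ORDER BY chromosome"
    )
]
print(chromosomes)  # [1, 4]: chromosome 3 holds only a tRNA, and duplicates are collapsed
```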

count function

SELECT COUNT(*)
FROM sgd_features
WHERE (start_coord < 300000) AND (feature_name LIKE 'Y%');

The answer is 6
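The same kind of count can be sketched with Python's sqlite3, assuming SQLite and a handful of made-up rows (so the count is not the 6 reported on the slide). The `%` in the LIKE pattern matches any run of characters, so `'Y%'` selects names beginning with Y.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features ("
    "sgd_id TEXT PRIMARY KEY, feature_name TEXT, start_coord INTEGER NOT NULL)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?)",
    [
        ("D1", "YDEMO1W", 100000),  # matches both conditions
        ("D2", "YDEMO2C", 350000),  # name matches, but start_coord is too large
        ("D3", "tDEMO3", 200000),   # coordinate matches, but name does not start with Y
        ("D4", "YDEMO4W", 299999),  # matches both conditions
    ],
)

count = conn.execute(
    "SELECT COUNT(*) FROM sgd_features "
    "WHERE (start_coord < 300000) AND (feature_name LIKE 'Y%')"
).fetchone()[0]
print(count)  # 2
```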

string function

SELECT LENGTH(feature_name)
FROM sgd_features
WHERE (sgd_id='S000007274');

The answer is 5

math function

SELECT MIN(start_coord)
FROM sgd_features
WHERE (strand='W');

ORDER BY

SELECT sgd_id, feature_type, feature_name, chromosome
FROM sgd_features
WHERE (feature_name LIKE 'Y%')
ORDER BY start_coord DESC;
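A short sketch of the ORDER BY query, run through Python's sqlite3 on made-up rows (SQLite and the sample data are assumptions). Note that the sort column, start_coord, does not have to appear in the SELECT list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features ("
    "sgd_id TEXT PRIMARY KEY, feature_type TEXT, feature_name TEXT, "
    "chromosome INTEGER, start_coord INTEGER)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?, ?, ?)",
    [
        ("D1", "ORF", "YDEMO1W", 1, 500),
        ("D2", "ORF", "YDEMO2C", 2, 9000),
        ("D3", "tRNA", "tDEMO3", 3, 7000),
        ("D4", "ORF", "YDEMO4W", 1, 3000),
    ],
)

rows = conn.execute(
    "SELECT sgd_id, feature_type, feature_name, chromosome "
    "FROM sgd_features WHERE (feature_name LIKE 'Y%') "
    "ORDER BY start_coord DESC"
).fetchall()
ordered_ids = [r[0] for r in rows]
print(ordered_ids)  # ['D2', 'D4', 'D1']: largest start_coord first, tRNA row filtered out
```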

GROUP BY

This allows an aggregate function to be computed separately for each group of rows that share the same value in the listed column(s).

SELECT feature_type, AVG(stop_coord-start_coord) AS avg_diff
FROM sgd_features
WHERE (strand = 'W')
GROUP BY feature_type;
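The grouping can be checked by hand on a tiny table. The sketch below uses Python's sqlite3 with made-up coordinates (both SQLite and the data are assumptions): WHERE first drops the non-W row, then AVG is computed once per feature_type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features ("
    "sgd_id TEXT PRIMARY KEY, feature_type TEXT, "
    "start_coord INTEGER, stop_coord INTEGER, strand TEXT)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?, ?, ?)",
    [
        ("D1", "ORF", 1000, 2000, "W"),   # difference 1000
        ("D2", "ORF", 5000, 8000, "W"),   # difference 3000
        ("D3", "tRNA", 100, 172, "W"),    # difference 72
        ("D4", "ORF", 9000, 9500, "C"),   # excluded by the WHERE clause
        ("D5", "tRNA", 300, 380, "W"),    # difference 80
    ],
)

avg_by_type = dict(
    conn.execute(
        "SELECT feature_type, AVG(stop_coord - start_coord) AS avg_diff "
        "FROM sgd_features WHERE (strand = 'W') GROUP BY feature_type"
    ).fetchall()
)
print(avg_by_type)  # ORF: (1000 + 3000) / 2 = 2000.0, tRNA: (72 + 80) / 2 = 76.0
```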

HAVING

This is similar to the WHERE clause, except that it filters the groups produced by GROUP BY after the aggregate functions have been computed, whereas WHERE filters individual rows beforehand.

SELECT feature_type, AVG(stop_coord-start_coord) AS avg_diff
FROM sgd_features
WHERE (strand = 'W')
GROUP BY feature_type
HAVING (AVG(stop_coord-start_coord) > 1000);
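A minimal sketch of the difference between the two filters, using Python's sqlite3 and made-up rows (assumptions, not part of the original slides): WHERE removes individual rows before grouping, while HAVING removes whole groups after the average has been computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features ("
    "sgd_id TEXT PRIMARY KEY, feature_type TEXT, "
    "start_coord INTEGER, stop_coord INTEGER, strand TEXT)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?, ?, ?)",
    [
        ("D1", "ORF", 1000, 2000, "W"),
        ("D2", "ORF", 5000, 8000, "W"),
        ("D3", "tRNA", 100, 172, "W"),
        ("D4", "ORF", 9000, 9500, "C"),
        ("D5", "tRNA", 300, 380, "W"),
    ],
)

long_groups = conn.execute(
    "SELECT feature_type, AVG(stop_coord - start_coord) AS avg_diff "
    "FROM sgd_features "
    "WHERE (strand = 'W') "                            # row-level filter, applied first
    "GROUP BY feature_type "
    "HAVING (AVG(stop_coord - start_coord) > 1000)"    # group-level filter, applied last
).fetchall()
print(long_groups)  # [('ORF', 2000.0)]: the tRNA group (average 76.0) is dropped
```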

LIMIT [start, ] rows

Returns only the specified number of rows.

SELECT *
FROM sgd_features
WHERE (feature_type='ORF')
LIMIT 3;
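Both forms of LIMIT can be sketched with Python's sqlite3, which accepts the same `LIMIT start, rows` shorthand (SQLite and the sample rows are assumptions). An ORDER BY is added because, without one, which rows LIMIT returns is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, feature_type TEXT)")
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?)",
    [("D1", "ORF"), ("D2", "ORF"), ("D3", "tRNA"), ("D4", "ORF"), ("D5", "ORF"), ("D6", "ORF")],
)

# First three ORF rows.
first_three = [
    r[0]
    for r in conn.execute(
        "SELECT sgd_id FROM sgd_features WHERE (feature_type = 'ORF') "
        "ORDER BY sgd_id LIMIT 3"
    )
]
# Skip three ORF rows, then return the next two (the LIMIT start, rows form).
next_two = [
    r[0]
    for r in conn.execute(
        "SELECT sgd_id FROM sgd_features WHERE (feature_type = 'ORF') "
        "ORDER BY sgd_id LIMIT 3, 2"
    )
]
print(first_three, next_two)  # ['D1', 'D2', 'D4'] ['D5', 'D6']
```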

SELECT s.feature_name, s.feature_type, s.chromosome, g.bio_function, g.bio_process, g.cell_location
FROM sgd_features AS s, gene_ont AS g
WHERE (s.sgd_id=g.sgdid);

[Figure: the sgd_features and gene_ont tables]
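The two-table query above is a join: each sgd_features row is paired with the gene_ont row that carries the same identifier. The sketch below runs it in Python's sqlite3 and compares it with the equivalent explicit INNER JOIN form. SQLite, the reduced column sets, and all annotation values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, feature_name TEXT)")
conn.execute(
    "CREATE TABLE gene_ont ("
    "sgdid TEXT, bio_function TEXT, bio_process TEXT, cell_location TEXT)"
)
# Made-up rows for illustration only; these are not real SGD or Gene Ontology records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?)",
    [("D1", "YDEMO1W"), ("D2", "YDEMO2C"), ("D3", "tDEMO3")],
)
conn.executemany(
    "INSERT INTO gene_ont VALUES (?, ?, ?, ?)",
    [
        ("D1", "kinase activity", "signal transduction", "cytoplasm"),
        ("D2", "DNA binding", "transcription", "nucleus"),
    ],
)

# Join written as on the slide: two tables in FROM, matching condition in WHERE.
implicit = conn.execute(
    "SELECT s.feature_name, g.bio_function, g.bio_process, g.cell_location "
    "FROM sgd_features AS s, gene_ont AS g "
    "WHERE (s.sgd_id = g.sgdid) ORDER BY s.sgd_id"
).fetchall()

# Equivalent explicit join syntax.
explicit = conn.execute(
    "SELECT s.feature_name, g.bio_function, g.bio_process, g.cell_location "
    "FROM sgd_features AS s INNER JOIN gene_ont AS g ON s.sgd_id = g.sgdid "
    "ORDER BY s.sgd_id"
).fetchall()

joined_names = [r[0] for r in implicit]
print(joined_names)  # ['YDEMO1W', 'YDEMO2C']: D3 has no gene_ont row, so it drops out
```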

UPDATE

UPDATE <table>
SET <col1>=<val1> [, <col2>=<val2>, …]
[WHERE clause];

Example

UPDATE sgd_features
SET quality='Verified'
WHERE sgd_id='S00000010';
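A small sketch of an UPDATE in Python's sqlite3, on made-up rows (SQLite and the data are assumptions). The cursor's rowcount reports how many rows the WHERE clause matched, which is a useful sanity check before committing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, quality TEXT)")
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?)",
    [("D1", "Uncharacterized"), ("D2", "Dubious")],
)

cursor = conn.execute(
    "UPDATE sgd_features SET quality = ? WHERE sgd_id = ?", ("Verified", "D1")
)
updated_rows = cursor.rowcount  # number of rows changed by the UPDATE
conn.commit()

qualities = dict(conn.execute("SELECT sgd_id, quality FROM sgd_features").fetchall())
print(updated_rows, qualities)  # 1 {'D1': 'Verified', 'D2': 'Dubious'}
```

Without the WHERE clause, the same statement would set quality for every row in the table.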

DELETE

DELETE FROM <table> [WHERE clause];

Example

DELETE FROM sgd_features
WHERE sgd_id IN ('S00000010', 'S000003599');

DELETE FROM sgd_features;
(this deletes all data in the table)
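The two forms of DELETE can be compared on a throwaway table. This is a sketch in Python's sqlite3 with made-up identifiers (assumptions): the first statement removes only the listed rows, the second empties the table while leaving its definition in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY)")
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?)", [("D1",), ("D2",), ("D3",), ("D4",)]
)

# Targeted delete: only the rows whose sgd_id is in the list are removed.
conn.execute("DELETE FROM sgd_features WHERE sgd_id IN ('D1', 'D3')")
remaining = [r[0] for r in conn.execute("SELECT sgd_id FROM sgd_features ORDER BY sgd_id")]

# Unqualified delete: every row is removed, but the table itself still exists.
conn.execute("DELETE FROM sgd_features")
rows_left = conn.execute("SELECT COUNT(*) FROM sgd_features").fetchone()[0]

print(remaining, rows_left)  # ['D2', 'D4'] 0
```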

CREATE VIEW

CREATE VIEW <viewname> [(<col1>, <col2>, …)]
AS SELECT …;

Example (VIEW)

CREATE VIEW sgd_features_ORF_W
AS SELECT *
FROM sgd_features
WHERE feature_type='ORF' AND strand='W';
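A view stores the query, not a copy of the data, so it always reflects the current contents of the underlying table. The sketch below illustrates this with Python's sqlite3 and made-up rows (both are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sgd_features (sgd_id TEXT PRIMARY KEY, feature_type TEXT, strand TEXT)"
)
# Made-up rows for illustration only; these are not real SGD records.
conn.executemany(
    "INSERT INTO sgd_features VALUES (?, ?, ?)",
    [("D1", "ORF", "W"), ("D2", "ORF", "C"), ("D3", "tRNA", "W")],
)

conn.execute(
    "CREATE VIEW sgd_features_ORF_W AS "
    "SELECT * FROM sgd_features WHERE feature_type = 'ORF' AND strand = 'W'"
)
before = [r[0] for r in conn.execute("SELECT sgd_id FROM sgd_features_ORF_W ORDER BY sgd_id")]

# A row added to the base table afterwards shows up in the view automatically.
conn.execute("INSERT INTO sgd_features VALUES ('D4', 'ORF', 'W')")
after = [r[0] for r in conn.execute("SELECT sgd_id FROM sgd_features_ORF_W ORDER BY sgd_id")]

print(before, after)  # ['D1'] ['D1', 'D4']
```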
