The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics University of Texas Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang


Post on 27-Dec-2015


Page 1:

The MoBIoS Project: Molecular Biological Information System

Daniel P. Miranker

Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics

University of Texas

Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Page 2:

Problem:

In the life sciences, database management systems (DBMS) serve as glorified file managers.

Little use of sophisticated data and pattern-based retrieval

Real scientific and technological problems

Page 3:

When biological data is put into an RDBMS

• Primary data is stored in text or blob fields
– Annotations may be relational
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST

Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. Coli    membrane   AGGCCTA

Page 4:

Linear Data Scans, O(n), Endemic in Life Sciences

• Sequences: DNA, RNA, protein databases
• Mass spectra: proteomics
• Small molecules & protein structure: protein interaction, rational drug design
• Pathways (graphs)
• Phylogenies (graphs, trees in particular)

Page 5:

Scope: To Find Common Ground, Both Biology and DBMSs Have to Move

[Diagram: DBMS and Biological Information System moving toward each other]

Metric-Space Database as the Common Ground

Page 6:

A metric space is a pair, M = (D, d), where

• D is a set of points
• d is a [metric] distance function with the following properties:

d(x, y) = d(y, x) (symmetry)
d(x, y) >= 0, d(x, x) = 0 (non-negativity)
d(x, z) <= d(x, y) + d(y, z) (triangle inequality)

[Diagram: triangle with vertices x, y, z]
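As a small illustration of the definition, the three properties can be checked by brute force on a finite point set. The sketch below is not from the slides: the helper names `hamming` and `is_metric` and the sample 3-letter fragments are assumptions chosen for illustration, with Hamming distance standing in for d.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def is_metric(points, d) -> bool:
    """Brute-force check of the three metric properties on a finite set."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):            # symmetry
            return False
        if d(x, y) < 0 or d(x, x) != 0:   # non-negativity, d(x, x) = 0
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):   # triangle inequality
            return False
    return True

# Hypothetical sample points (3-letter fragments).
points = ["AAC", "ACA", "CAA", "AAA", "ATC"]
print(is_metric(points, hamming))                          # Hamming distance
print(is_metric(points, lambda a, b: hamming(a, b) ** 2))  # squared Hamming
```

Squaring the distance keeps symmetry and non-negativity but breaks the triangle inequality on these points, which is why the second check fails.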

Page 7:

Definition - By Analogy

A Spatial Database Management System:

• Extends a relational DBMS
• Special indexes for 2D and 3D data; k-d and R-trees
• New data types
• Geographic information systems: topographic maps, buildings and the like

A Metric-Space Database Management System:

• Extends a relational DBMS
• Special indexes for metric spaces
• New data types
• Biological information system: life science data types

Page 8:

Develop index structures to support distance & nearest-neighbor queries

• Well studied in main memory
– But by no means a closed problem
• In databases (external/disk-based methods)
– Embryonic
– Many myths
• Often assumed to be the basis of multimedia database systems

Page 9:

How to build a metric-space index

• Three algorithmic classes [Tasan, Ozsoyoglu 04]
– Vantage points
– Hyperplanes
– Bounding spheres

Page 10:

Vantage Point Method [Burkhard & Keller 73]

Page 11:

Vantage Point Method

Choose a point, VP

And a radius, R

Page 12:

Vantage Point Method

Choose a point, VP

And a radius, R

• Given VP, R

The predicates

• d(VP, x) < R
• d(VP, x) ≥ R

Divide the set into two halves (equal halves when R is the median distance from VP)

• apply recursively
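The recursive split can be sketched as follows. This is a minimal illustration rather than the MoBIoS implementation: the `VPNode` class, the choice of the first point as VP, and the use of Hamming distance are all assumptions made for the example. With a discrete distance, ties at the median mean the two halves are not always exactly equal.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

def hamming(a: str, b: str) -> int:
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

@dataclass
class VPNode:
    vp: str                             # the vantage point
    radius: float = 0.0                 # split radius R
    inside: Optional["VPNode"] = None   # points with d(vp, x) < R
    outside: Optional["VPNode"] = None  # points with d(vp, x) >= R

def build(points: List[str], d: Callable[[str, str], float]) -> Optional[VPNode]:
    """Recursively split the points around a vantage point."""
    if not points:
        return None
    vp, rest = points[0], points[1:]    # assumption: naive VP choice
    if not rest:
        return VPNode(vp)
    dists = sorted(d(vp, x) for x in rest)
    radius = dists[len(dists) // 2]     # median distance from VP
    inside = [x for x in rest if d(vp, x) < radius]
    outside = [x for x in rest if d(vp, x) >= radius]
    return VPNode(vp, radius, build(inside, d), build(outside, d))

def collect(node: Optional[VPNode]) -> List[str]:
    """All points stored in a subtree."""
    if node is None:
        return []
    return [node.vp] + collect(node.inside) + collect(node.outside)

data = ["AAA", "AAC", "ACA", "CAA", "CCA", "CCC", "TTT", "TTA"]
tree = build(data, hamming)
print(tree.vp, tree.radius)
print(collect(tree.inside), collect(tree.outside))
```

Each recursive call removes the vantage point from the set it splits, so the recursion terminates even when every remaining point falls on one side.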

Page 13:

Query q, range r

[Diagram: query point q with a sphere of radius r around it]

Page 14:

Query q, range r

[Diagram: sphere of radius R around VP; query q with range r lying outside it]

If

• d(q, VP) > R + r

then

• all neighbors are outside the sphere

By the triangle inequality, any x with d(q, x) <= r has d(VP, x) >= d(q, VP) - r > R.
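The pruning rule can be tried on a single vantage-point split. The sketch below is an illustration under assumptions: the function names, the sample fragments, and the use of Hamming distance are not from the slides, and the second test (skipping the outside partition when d(q, VP) + r < R) is the symmetric case, added here because it follows from the same triangle-inequality argument.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def split(points, vp, R, d):
    """Index-build step: partition the points by the two predicates."""
    inside = [x for x in points if d(vp, x) < R]
    outside = [x for x in points if d(vp, x) >= R]
    return inside, outside

def range_query(inside, outside, vp, R, q, r, d):
    """Return (matches, number of points actually compared with q)."""
    dq = d(q, vp)
    candidates = []
    if dq <= R + r:        # pruned when d(q, VP) > R + r
        candidates += inside
    if dq + r >= R:        # symmetric case (assumption, see lead-in)
        candidates += outside
    return [x for x in candidates if d(q, x) <= r], len(candidates)

data = ["AAA", "AAC", "ACA", "CAA", "CCA", "CCC", "TTT", "TTA"]
inside, outside = split(data, "AAA", 2, hamming)

# q far from VP: the inside partition is skipped.
print(range_query(inside, outside, "AAA", 2, "TTT", 0, hamming))
# q at VP with a small range: the outside partition is skipped.
print(range_query(inside, outside, "AAA", 2, "AAA", 0, hamming))
```

In both example queries only four of the eight points are compared with q, yet the answers agree with a full scan.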

Page 15:

Multi-vantage point method

Page 16:

Multi-vantage point method

• Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
– Choose number k & d
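To make the projection idea concrete, the sketch below maps each point to its vector of distances to k vantage points and filters candidates coordinate by coordinate. It is a flat pivot table rather than the tree the slides describe, and the names `project` and `range_query`, the two vantage points, and the Hamming distance are assumptions for illustration. The filter is safe because |d(VPi, q) - d(VPi, x)| <= d(q, x) by the triangle inequality.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def project(x, vps, d):
    """Coordinates of x: one axis per vantage point."""
    return tuple(d(vp, x) for vp in vps)

def range_query(points, vps, q, r, d):
    """Return (matches, number of points that survived the axis filter)."""
    table = {x: project(x, vps, d) for x in points}  # built once, offline
    qv = project(q, vps, d)
    survivors = [x for x, xv in table.items()
                 if all(abs(a - b) <= r for a, b in zip(qv, xv))]
    return [x for x in survivors if d(q, x) <= r], len(survivors)

data = ["AAA", "AAC", "ACA", "CAA", "CCA", "CCC", "TTT", "TTA"]
vps = ["AAA", "CCC"]   # hypothetical choice of k = 2 vantage points
print(project("CCA", vps, hamming))
print(range_query(data, vps, "TTT", 1, hamming))
```

Only the survivors need a real distance computation against q; the rest are rejected from their stored coordinates alone.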

Page 17:

Myths

• Solved problem; M-trees [Ciaccia et al. 96, 97]
– I can’t get them to work on anything but their original synthetic data generator
• A good choice for vantage points is to find corners [Yianilos 93] (farthest-first clustering)
– Might be true for Euclidean spaces
– Early result, not true for our data
• High-dimensional indexing always asymptotically reduces to linear scans
– Formal result based on an assumption of uniform data distributions

Page 18:

Comparison of Three Methods of Metric-Space Indexing

[Plot: number of distance calculations vs. radius (0 to 10) for RBT, GHT, and MVPT]

[Plot: number of I/Os vs. radius (0 to 10) for RBT, GHT, and MVPT]

Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT

Page 19:

Open problems

• Is there a general metric-space index structure that is good for most workloads?
– We are optimistic that MVP trees, with further tuning, will be a useful answer
– Hyperplane methods are fair game; there is circumstantial evidence that they are a key component of Google’s search engine
• No work addresses clustering data pages on disk
• Metric-space join algorithms

Page 20:

Biological Models are Usually Based on Similarity

Similarity:

• Biologists like scoring functions that reward each similar feature with a positive number
• Intuitive

Distance:

• More similar → smaller numbers
• Identical → 0

Page 21:

But Do Metric Models Capture Biology?

• Metrics are a subset of possible mathematical models

Page 22:

Sequence Problem 1

Sequence similarity based on weighted edit distance

Accepted weight matrices, PAM & BLOSUM, are not metric

Log-odds matrices: negative values

Defy simple algebraic normalization [Taylor & Jones 93, Linial et al. 97]
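A toy check shows why a log-odds-style score table cannot be used directly as a metric. The scores below are hypothetical values made up for illustration, not entries from PAM or BLOSUM, and `metric_violations` is an assumed helper name.

```python
# Hypothetical log-odds-style similarity scores for two residues.
# Treat every number here as illustrative only.
scores = {
    ("A", "A"): 2, ("W", "W"): 5,
    ("A", "W"): -1, ("W", "A"): -1,
}

def metric_violations(symbols, s):
    """List the metric properties that the score table s breaks."""
    problems = set()
    for a in symbols:
        if s[(a, a)] != 0:
            problems.add("d(x,x) != 0")
        for b in symbols:
            if s[(a, b)] < 0:
                problems.add("negative value")
            if s[(a, b)] != s[(b, a)]:
                problems.add("asymmetric")
    return sorted(problems)

print(metric_violations(["A", "W"], scores))
```

Rewarding identity with a positive self-score and penalizing unlikely substitutions with negative scores is natural for similarity, but both conflict with the non-negativity property of a distance.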

Page 23:

Our First Result: mPAM [Xu & Miranker 04]

Dayhoff et al.’s PAM Derivation [74]

• Took a set of closely related protein sequences
• Developed a phylogenetic tree
• Counted substitutions to transform one sequence to another
• Tree determines a measure of time

Page 24:

PAM vs. mPAM: t = 1/f

Using original substitution counts

PAM: frequency of substitution

S(a, b | t) = \log \frac{P(b | a, t)}{q_b}

mPAM: expected time between substitutions

D(a, b) = \frac{1}{\log\left(1 - \sum_x P(a, x) P(b, x)\right)}

Page 25:

Sequence Problem 2

• Sequences are long units (the unit of identity for storage and retrieval)
– Genes
– Chromosomes
• Analysis comprises comparing small substrings

Page 26:

Soln: Sequence View

• New view type
• Breaks sequences into q-grams

    CREATE SEQUENCEVIEW rice_sview AS
    SELECT CREATE FRAGMENTS (…, 3, 1)
    FROM …
    WHERE …
    USING HAMMING-DISTANCE
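Breaking a sequence into overlapping q-grams can be sketched in a few lines. This is an illustration, not the MoBIoS code: the function name `fragments` is hypothetical, and reading the arguments 3 and 1 as fragment length and step, with 1-based offsets, is an assumption.

```python
def fragments(seq: str, q: int = 3, step: int = 1):
    """Return (offset, fragment) pairs for every q-gram of seq.

    Offsets are 1-based (an assumption made for this sketch).
    """
    return [(i + 1, seq[i:i + q]) for i in range(0, len(seq) - q + 1, step)]

print(fragments("ATCAAA"))
# [(1, 'ATC'), (2, 'TCA'), (3, 'CAA'), (4, 'AAA')]
```

A sequence of length n yields n - q + 1 fragments at step 1, so the view is roughly as large as the underlying data and is a natural candidate for indexing rather than storing row by row.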

Page 27:

Materialize as an Index

Genomes

Rowid   Seq
R1      ACAACA
R2      ATCAAA
R3      …

Rowid   Offset   Logical Fragment
R1      1        ACA
R1      2        CAA
R1      3        AAC
R1      4        ACA
…       …        …
R2      1        ATC
R2      2        TCA
R2      3        CAA
R2      4        AAA
…       …        …

[Index diagram: fragment rows grouped under nodes with distance predicates D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1, D(AAA) ≤ 2]

Page 28:

Status

• Started with McKoi
– A Java open source object-relational DBMS
– (Think of Postgres written in Java)
• Added
– Biological data types
– Metric-space index
– Extending SQL engine (in progress)

Page 29:

Computed in MoBIoS: Compare Arabidopsis Genome × Rice Genome

1. Locate nucleotide patterns of the form of a primer pair candidate:

   Rice:  [18 matching nucleotides] gap 400 to 3000 long [18 matching nucleotides]
   Arab.: [18 matching nucleotides] gap 400 to 3000 long [18 matching nucleotides]

2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates

• Usual implementations O(n^2), n = 10^9

Page 30:

mSQL Query to locate candidate primer pairs

    SELECT merge(R1.fragment, A1.fragment)
    FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
    WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
      AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
      AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) >= 400
      AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) <= 3000
      AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) >= 400
      AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) <= 3000
    GROUP BY R1.fragment, A1.fragment;
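The predicates of this query can be mimicked by brute force on toy data. The sketch below is only an illustration of what the WHERE clause selects: the function names, the two short sequences, and the scaled-down fragment length and gap bounds are all hypothetical, and the merge and GROUP BY steps are left out. It is quadratic, which is exactly the cost an index is meant to avoid.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def fragments(seq: str, q: int):
    """(1-based offset, q-gram) pairs for seq."""
    return [(i + 1, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_pairs(rice, arab, q, min_gap, max_gap, max_dist=1):
    """Offsets (r1, r2, a1, a2) satisfying the distance and gap predicates."""
    r_frags, a_frags = fragments(rice, q), fragments(arab, q)
    # Fragment pairs that are similar across the two genomes.
    similar = [(ro, ao) for ro, rf in r_frags for ao, af in a_frags
               if hamming(rf, af) <= max_dist]
    pairs = []
    for r1, a1 in similar:
        for r2, a2 in similar:
            # Second match must follow the first by an allowed gap in both.
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                pairs.append((r1, r2, a1, a2))
    return pairs

# Toy sequences: two shared 4-mers separated by unrelated filler.
rice = "ACGT" + "TTTTTT" + "GGCA"
arab = "ACGT" + "CCCCCCC" + "GGCA"
print(candidate_pairs(rice, arab, q=4, min_gap=8, max_gap=12, max_dist=0))
```

The toy call uses exact matching (max_dist=0) because 4-letter fragments are too short for a tolerance of one mismatch to be selective; the query on the slide uses a tolerance of 1.0.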

Page 31:

Query Plan

Inputs: Arab. Genome, O(n); Rice Genome, O(m)

1. Offline: Build Sequence View, O(n log n)
2. Compare, O(m log n): Indexed Nested Loop
3. Eliminate Duplicates
4. Eliminate Low-Complexity Primers (LZ compression)
5. Merge Overlapping Primers

Result: ~10,000 conserved primer pair candidates

Page 32:

Preliminary Results

• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
– 15 matched other plant genes and the primers
– At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis

Page 33:

MoBIoS Architecture (Molecular Biological Information System)

Page 34:

Analysing Mass Spectra

Spectrum = Histogram of mass/charge ratios of a collection of peptides

Similarity = Shared peaks count = Inner product

(0100101) • (0111100) = 2

Page 35:

Cosine Distance Approximates Inner Product

D_{rs} = 1 - \frac{x_r x_s'}{(x_r x_r')^{1/2} (x_s x_s')^{1/2}}

(x_r and x_s are row vectors; ' denotes transpose)

We have shown that we can store and retrieve mass spectra using cosine distance, and that it scales
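As a worked instance, the cosine distance can be computed for the two binary spectra used in the shared-peaks example, (0100101) and (0111100). The function name `cosine_distance` is an assumption, and the sketch presumes neither spectrum is all zeros.

```python
import math

def cosine_distance(xr, xs):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(xr, xs))      # shared peaks count
    norm_r = math.sqrt(sum(a * a for a in xr))
    norm_s = math.sqrt(sum(b * b for b in xs))
    return 1.0 - dot / (norm_r * norm_s)

r = [0, 1, 0, 0, 1, 0, 1]   # 3 peaks
s = [0, 1, 1, 1, 1, 0, 0]   # 4 peaks
print(cosine_distance(r, s))   # 1 - 2 / (sqrt(3) * 2), about 0.4226
print(cosine_distance(r, r))   # identical spectra give a distance near 0
```

The inner product of 2 is normalized by the peak counts of both spectra, so a spectrum with many peaks does not look similar to everything just because it overlaps more often.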

Page 36:

mSQL Query for Protein Identification by Mass-Spec.

Signature Database Look

    SELECT Prot.accesion_id, Prot.sequence
    FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
    WHERE MS.enzyme = DS.enzyme = E and
          Cosine_Distance(S, MS.spectrum, range1) and
          DS.accession_id = MS.accession_id = Prot.accesion_id and
          DS.ms_peak = P and
          MPAM250(PS, DS.sequence, range2);

Page 37:

Matching Electrostatic Shape of Molecules

Page 38:

Still benefit from grid services:

• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
• Make a service call to the grid for these operations only
• Mirror data contents to minimize I/O
• Since the need is intermittent, one grid serves many MoBIoS servers

[Diagram labels: GRID, Mirror DB-Contents, MoBIoS Server, recluster, New index, Shape match (FEM), Distance (real), High speed I/O]

Page 39:

Hyper-planes [Uhlmann 91]

• If d(x, h1) < d(x, h2) then x is assigned to h1

[Diagram: point x between pivots h1 and h2]
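The assignment rule can be sketched as a one-level partition. The function name `partition`, the pivots, and the sample fragments are assumptions for illustration; sending ties to h2 is also a choice made here, since the slide only states the strict case.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def partition(points, h1, h2, d):
    """Assign each point to the closer of two pivots (ties go to h2)."""
    near_h1, near_h2 = [], []
    for x in points:
        if d(x, h1) < d(x, h2):
            near_h1.append(x)
        else:
            near_h2.append(x)
    return near_h1, near_h2

data = ["AAA", "AAC", "CCA", "CCC", "ACC", "CAA"]
print(partition(data, "AAA", "CCC", hamming))
# (['AAA', 'AAC', 'CAA'], ['CCA', 'CCC', 'ACC'])
```

Applying the same split recursively to each side, with new pivots chosen inside it, gives a tree in the same way the vantage-point split does.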

Page 40:

Develop a Hierarchical Clustering

Hierarchy of bounding spheres, (center, radius)

• Bounding spheres may overlap
• Inspired by R-trees

[Diagram: overlapping bounding spheres labeled A to F]