databases as analytical engines for drug discovery

Databases as Analytical Engines for Drug Discovery

Susie StephensPrincipal Product Manager, Life Sciences

Oracle Corporation

[email protected]

Outline

Data Challenges Case Studies Summary

Flat files

Distributed query

Transparent Gateway

External Sites

MySQL

Generic Connectivity

DBlinks

UltraSearch

Sybase SRS

Transparent Gateway

External Table

Access Distributed Data

Integrate a Variety of Data Types

CLOBs XML Text Images Video Relational Users Defined

Objects

Nucleotide Sequences Gene Expression Data Papers Cell Histology Images Protein Folding Video SwissProt KEGG Chemical Structures

XMLXML

Partitioning Oracle Data Guard Real Application Clusters (RAC) Automated Storage Management Adaptive Instance Tuning Automated Application and SQL

Tuning Automated Database Diagnostic

Monitor (ADDM) Scheduling

Manage Vast Quantities of Data

0

10TB

20TB

30TB

40TB

50TB

Integrated communications

Single enterprise search

Flexible access Fine grained access

control Auditing Workflow Personalized portal

Collaborate Securely

Find Patterns and Insights

Oracle Data Mining– Find relationships & clusters

Oracle Discoverer & Oracle OLAP– Interactive query & drill-down

Statistics– mean, stdev, median, correlations, linear

regression

Oracle Text– Cluster & Classify documents of interest

Table Functions– Implement complex algorithms within the

database

Outline


Regular Expression Searches

• A powerful method of describing both simple & complex patterns for searching & manipulating

• A multilingual regular expression support for SQL & PL/SQL string types

• Follows POSIX style Regexp syntax

• Support standard Regexp operators

• Includes common extensions such as case-insensitive matching, sub-expression back-references, etc.

• Compatible with popular Regexp implementations like GNU, Perl, Awk

Case Study: Retrieve Protein Data from SGD using Regular Expressions

Case study courtesy of Prolexys Pharmaceuticals, Inc.

HTTP Raw Data</script></head><body><body bgcolor='#FFFFFF'><table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"><input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /></form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table><table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table><p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4 MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK*</pre><hr size="2" width="75%"><table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"><input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a></form></td></tr></table></body></html>

Function to Parse out AA Sequence

create or replace function orf2seq ( p_orf in varchar2) return varchar2 is v_stream clob; strt number;begin -- Retrieve the HTTP stream: v_stream := httpuritype.getclob(httpuritype.createuri( 'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf|| '&flankl=0&flankr=0&map=p3map') );

-- Trim off the head of the stream: strt := dbms_lob.instr(v_stream, 'Submit', 1, 1);

-- Strip out control characters, new lines, etc.: v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', '');

-- Return the AA sequence: return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}'));end;

AA Sequence for ORF ‘YDR099W’

SQL> select orf2seq('YDR099W') from dual;

ORF2SEQ('YDR099W')--------------------------------------------------------------------------------

MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK

Elapsed: 00:00:01.24

SQL> insert into pseq (orf_id, sequence)2 values ('YDR099W', orf2seq('YDR099W'));

Case Study: Motif Searching in Proteins

PROSITE database of protein sequence motifs

ID TYR_PHOSPHO_SITE; PATTERN AC PS00007DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE)DE Tyrosine kinase phosphorylation sitePA [RK]-x(2,3)-[DE]-x(2,3)-YCC /TAXO-RANGE=??E?V; CC /SITE=5,phosphorylationCC /SKIP-FLAG=TRUEDO PDOC00007

Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt

TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y– R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA

Oracle10g Regular Expression Equivalent– [RK].{2,3}[DE].{2,3}[Y]


SQL to Retrieve All Proteins Interacting with TKP

select distinct substr(a.refseq_id, 1, 9) refseq_id, length(a.seq_string_varchar) seq_length, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4from target_db a, y2h_interaction_p bwhere a.refseq_id like 'NP%' and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]') and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) =

b.prey_refseq);

Query Results

REFSEQ_ID SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS------------ ---------- ----------- ----------- ----------- -----------NP_003961 1465 14 202 347 537NP_003968 330 241 0 0 0NP_003983 490 8 50 62 93NP_004001 3562 3085 0 0 0...

MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFGEDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVGNLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLGKNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVADFGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHRTRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEERSFHTSRNPSR

SQL to Retrieve Motif Frequency by Protein

selectc.refseq_id "Refseq ID",rs2desc(c.refseq_id) "Protein Description",a.cnt "Repetitions",b.ps_ac "Prosite AC",b.descr "Motif Description"

frommotif_data a,ps_data b,target_dbp c

wherea.ps_ac = b.ps_acand a.sequence_id = c.sequence_id

order by3 desc, 1

;

Query Results

Refseq ID Protein Description Repetitions Prosite AC Motif Description--------------- ------------------------------ ----------- ------------ ------------------------------NP_055995.2 spectrin repeat containing, 145 PS00006 Casein kinase II phosphorylation site nuclear envelope 2 NP_056363.1 bullous pemphigoid antigen 1, 132 PS00006 Casein kinase II phosphorylation site 230/240kDa NP_001139.2 ankyrin 2, neuronal 115 PS00006 Casein kinase II phosphorylation site NP_066267.1 ankyrin 3, node of Ranvier 110 PS00006 Casein kinase II phosphorylation site (ankyrin G) NP_056363.1 bullous pemphigoid antigen 1, 102 PS00005 Protein kinase C phosphorylation site 230/240kDa NP_005520.2 heparan sulfate proteoglycan 2 97 PS00008 N-myristoylation site (perlecan)NP_066267.1 ankyrin 3, node of Ranvier 97 PS00005 Protein kinase C phosphorylation site (ankyrin G) P_001139.2 ankyrin 2, neuronal 96 PS00005 Protein kinase C phosphorylation site NP_115495.1 monogenic, audiogenic seizure 95 PS00006 Casein kinase II phosphorylation site susceptibility 1 homolog (mouse) ...

Regular Expression Searches Quote

"Thanks to Oracle 10g's Regular Expressions (RE) query support, it's no longer necessary to export data from the database, process

it with a RE enabled tool and then import the data back into the database. Now, RE processing can be handled with a single query." - Marcel Davidson, Head of Database Administration,

Myriad Proteomics

Oracle Data Mining BLAST

Implemented using a table function interface

BLAST search functions can be placed in SQL queries

Different functions for match & align

Combination of SQL queries & BLAST is very powerful & flexible

C A T G0 0 1 0 1

Case Study: BLAST as a Sequence Identification Tool Identify protein with high sequence similarity and the

functional class

select function, COUNT(seq_id) f_count

from (select t.seq_id, t.score, t.expect, g.function

from SwissProt_DB g,

Table(BLASTP_MATCH(

‘AEQAERYDDMAAAMKRY’,

cursor (select seq_id, sequence

from SwissProt_DB),

5)) t /* expect_value */

where t.seq_id = g.seq_id)

group by function /* swissprot kw */

order by f_count

BLASTP_MATCH

SwissProt_DB

seq_id, score, expect

query_sequence, parameters

seq_id, function

t.seq_id = g.seq_id

GROUP BY

function, f_count

SwissProt_DB

Case Study: Homology Search between Yeast and Human Data

A

B

X

Y

C

Z

Yeast Protein Interactome Human Protein Interactome

Homology Mapping

Interlogs: (A|X, B|Y) and (A|X, B|Z)

Determined experimentally

with Y2HDetermined

experimentally with Y2H

Inferred through BLAST


Batch BLAST: Human (query) vs. Yeast (subject)

for v1 in c1 loopinsert into yeast_human_homolog (

human_refseq, yeast_orf_name, score, expect

)select

v1.refseq_id,t.t_seq_id,t.score,t.expect

fromtable ( blastp_match (

v1.sequence_string,cursor ( select a.yeast_acn, a.yeast_seq

from yeast_prot_seq a ))

) twhere

t.expect < 0.00001;

end loop;

BLAST Results

Yeast Yeast Human Human Expect 1 Expect 2 Gene 1 Gene 2 Refseq 1 Refseq 2------- ------- ----------- ----------- -------- --------YAR018C YIL061C NP_XXXXX1.1 NP_YYYYY1.1 4.79E-12 4.58E-06 YBL016W YDL159W NP_XXXXX2.1 NP_YYYYY2.1 1.11E-08 5.25E-10 YBL016W YDL159W NP_XXXXX3.1 NP_YYYYY3.1 2.63E-10 9.04E-11 YBL016W YDL159W NP_XXXXX4.1 NP_YYYYY4.1 4.57E-07 8.33E-09 YBL016W YDL159W NP_XXXXX5.1 NP_YYYYY5.1 1.57E-22 1.11E-08 YBL063W YIL061C NP_XXXXX6.1 NP_YYYYY6.1 3.17E-64 8.67E-06 YBL063W YIL061C NP_XXXXX7.1 NP_YYYYY7.1 2.30E-06 4.58E-06 YBR109C YDR356W NP_XXXXX8.1 NP_YYYYY8.1 1.78E-07 7.74E-11 YBR109C YDR356W NP_XXXXX9.1 NP_YYYYY9.1 1.24E-08 7.74E-11 YBR109C YDR356W NP_XXXX10.1 NP_YYYY10.1 5.19E-07 2.80E-20 YBR109C YDR356W NP_XXXX11.1 NP_YYYY11.1 3.92E-10 4.39E-11 YBR109C YFR014C NP_XXXX12.1 NP_YYYY12.1 3.67E-48 6.91E-17 YBR109C YOL016C NP_XXXX13.1 NP_YYYY13.1 3.67E-48 1.82E-17

Yeast Interactors

Human Interactors

Interlogs

BLAST Quote

"Oracle 10g's new BLAST feature will enable us to easily integrate multiple types of genomic and proteomic data for complicated queries used in the mining of our proprietary protein-protein

interaction and cDNA sequence datasets." - Jake Chen, Principal Bioinformatics Scientist, Myriad Proteomics

Spatial Network Data Model

Data model for managing graph (link-node) structures

Rich graph analysis functions Supports variety of network

structures (hierarchical, directed, undirected, random, scale-free)

Framework for applying network constraints and rules (e.g. path length, cost, minimum bounding rectangle)

Bundled Java visualiser & APIs for 3rd party tools, application development

Case Study: Integration Architecture

NREF EMBL GO KEGG BIND AFCS

Nodes Edges Graph

NDM layer (semantic layer)

NativeFormats

Data type determines available routes

Routes can be determined using semantics

Network Route

Distributed Database layer

Case study courtesy of Beyond Genomics, Inc.

Network Data Model Quote

"Beyond Genomics, Inc., as a leading systems biology company, believes that Oracle 10g's network data model will significantly

advance the integration of metabolomic, proteomic, transcriptomic, and clinical data sets and the applications that derive value from

these data." – Eric Neumann, Vice President Strategic Informatics, Beyond Genomics, Inc.

Oracle Data Mining

Unsupervised Learning– Hierarchical K-means Cluster– O-Cluster– Non-Negative Matrix Factorization– Apriori

Supervised Learning– Naïve Bayes– Adaptive Bayes Network– Support Vector Machines– PredictorVariance

ODM can mine structured data, text data, or structured and text data

K-Means Clustering

• Hierarchical k-means produces tree of clusters

• All splits are binary

• Each cluster has a centroid & a histogram

• Achieves a reliable solution in a single run

• Ranked rules that describe attributes for cluster

• Cluster assignments are probabilistic using a Bayesian model

• Operates on very deep datasets by using a summarization module

42 Tumor Samples:

• Normal Cerebellum [MD] (4)

• Malignant Gliomas [MGlio] (10)

• Medulloblastomas [MD] (10)

• Rhabdoid tumors [Rhabdoid] (10)

• Primitive Neuroectodermal [PNET] (8)

* Pomeroy et al Nature 415, 24, p436 (2002).

Case Study: Brain Tumor Clustering• Collection of 42 Human Brain tumors* and 7,129 gene expression

profiles

• Clustering of samples according to their gene expression profiles

• It is an example of class and taxonomy discovery

• Does the data cluster according to the known biological classes?

ODM Hierarchical k-Means Clustering

Node 1

Node 2 Node 3

Node 6Node 4 Node 7Node 5 Glioblastoma Normal Medulloblastoma Rhabdoid Cluster Cluster Cluster Cluster

Rhabdoid

Ncer

MD

MGlio

PNET

Literature Results using Hierarchical Clustering

From Pomeroy et al Nature 415, 24, p436 (2002).

Rhabdoid

Ncer

MD

MGlio

PNET

Association Rules

• Captures frequent co-occurrences of items/attribute values

(A, B) => C occurrence or A and B together implies C

• Can be applied in different scenarios• Market basket analysis • Pattern discovery• Predictive applications

• ODM uses SQL-based implementation of Apriori algorithm

Case Study: Analysis of Trends in a Patient Group

Clinical Table of 60 Medulloblastoma Patients

7 Clinical attributes:

Subtype: classic or desmoplastic medulloblastoma

Size (tumor size): T1-T4

Stage: M0-M4

Sex: M, F

Age (range): 0-5, 5-10, 10-15….

Outcome: S (treatment success),

F (treatment failure)

Chemo (regime type): 0,1,2,3,4,5,6 * Pomeroy et al Nature 415, 24, p436 (2002).

Association Rules Results

Over 100 rules reflecting factual or known relationships in data:

Age=1 THEN Sex=M

(confidence = 0.8)

Interpretation: Most 5-10 year-old patients are male

Subtype=Desmoplastic THEN Stage=M0

(confidence = 0.79)

Interpretation: Most desmoplastic patients in the study have stage M0

Association Rules Results

Other interesting trends:

Stage=M0 THEN Outcome=S

(confidence = 0.74)

Interpretation: Stage M0 vs non-M0 is a predictor of treatment outcome

Stage=M0 AND Size=T3 AND Chemo=1 THEN Outcome=S

(confidence = 0.92)

Interpretation: Most patients with stage M0, size T3 who received chemo regime 1 had good response to treatment

Support Vector Machines

• SVM provides a very general multi-purpose and powerful classifier

• SVM does not require feature selection and can work well with thousands of input features

• SVM is accurate and can approximate complex functional relationships

• SVM works in binary, multi-class, sparse (text) classification and regression

• SVM is easy to train and apply and can be used in discovery mode or in production automated methodologies

Multiple Examples (14) of normal human tissue and tumors

• Could a single model distinguish normal vs cancer?

• Train set: 200 samples, test set: 80 samples

• Microarrays profiles for 7,129 genes

Case Study: Classification of Normal Human Tissue and Tumors

S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001)

Normal Tissue vs. Cancer

Normal vs. Cancer (Multiple types)

SVM Test Set Predictions

Predicted

Normal Cancer

Actual Normal 16 10

Cancer 3 51 Test set accuracy: 83.75%

(Naïve Bayes = 75%)

Support Vector Machines Results

Classification of Multiple Tumor Types

• S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001)

• C. Yeang et al, Procs. of ISMB 2001. Bioinformatics Discovery Note, 1:1-7, (2001)

DNA Microarray Data for 14 Tumor Classes

Published Datasets

• Gene expression profiles for 7,129 genes

• Datasets tumor type composition:

• 9 minutes training time on 500MHz Netra

• 78.3% accuracy for multi-tumor molecular classification

Results of Multiple Tumor Type Analysis

Tumor Class # Train # Test Tumor Class # Train # Test

Breast (BR) 8 3 Uterus (UT) 8 2

Prostate (PR) 8 2 Leukemia (LE) 24 6

Lung (LU) 8 3 Renal (RE) 8 3

Colorectal (CO) 8 5 Pancreas (PA) 8 3

Lymphoma (LY) 16 6 Ovary (OV) 8 3

Bladder (BL) 8 3 Mesothelioma (MS) 8 3

Melanoma (ML) 8 2 Brain (BR) 16 4

Outline


Databases have functionality to access and integrate distributed data

There are data management, performance and security benefits to performing analytics in databases

A range of analytical functionality is now available in databases

Summary

databases as analytical engines for drug discovery

Documents