databases as analytical engines for drug discovery
DESCRIPTION
Databases as Analytical Engines for Drug Discovery. Susie Stephens Principal Product Manager, Life Sciences Oracle Corporation [email protected]. Outline. Data Challenges Case Studies Summary. Access Distributed Data. External Sites. UltraSearch. Distributed query. MySQL. - PowerPoint PPT PresentationTRANSCRIPT
Databases as Analytical Engines for Drug Discovery
Susie StephensPrincipal Product Manager, Life Sciences
Oracle Corporation
Outline
Data Challenges Case Studies Summary
Flat files
Distributed query
Transparent Gateway
External Sites
MySQL
Generic Connectivity
DBlinks
UltraSearch
Sybase SRS
Transparent Gateway
External Table
Access Distributed Data
Integrate a Variety of Data Types
CLOBs XML Text Images Video Relational Users Defined
Objects
Nucleotide Sequences Gene Expression Data Papers Cell Histology Images Protein Folding Video SwissProt KEGG Chemical Structures
XMLXML
Partitioning Oracle Data Guard Real Application Clusters (RAC) Automated Storage Management Adaptive Instance Tuning Automated Application and SQL
Tuning Automated Database Diagnostic
Monitor (ADDM) Scheduling
Manage Vast Quantities of Data
0
10TB
20TB
30TB
40TB
50TB
Integrated communications
Single enterprise search
Flexible access Fine grained access
control Auditing Workflow Personalized portal
Collaborate Securely
Find Patterns and Insights
Oracle Data Mining– Find relationships & clusters
Oracle Discoverer & Oracle OLAP– Interactive query & drill-down
Statistics– mean, stdev, median, correlations, linear
regression
Oracle Text– Cluster & Classify documents of interest
Table Functions– Implement complex algorithms within the
database
Outline
Data Challenges Case Studies Summary
Regular Expression Searches
• A powerful method of describing both simple & complex patterns for searching & manipulating
• A multilingual regular expression support for SQL & PL/SQL string types
• Follows POSIX style Regexp syntax
• Support standard Regexp operators
• Includes common extensions such as case-insensitive matching, sub-expression back-references, etc.
• Compatible with popular Regexp implementations like GNU, Perl, Awk
Case Study: Retrieve Protein Data from SGD using Regular Expressions
Case study courtesy of Prolexys Pharmaceuticals, Inc.
HTTP Raw Data</script></head><body><body bgcolor='#FFFFFF'><table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"><input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /></form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table><table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table><p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4 MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK*</pre><hr size="2" width="75%"><table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"><input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a></form></td></tr></table></body></html>
Function to Parse out AA Sequence
create or replace function orf2seq ( p_orf in varchar2) return varchar2 is v_stream clob; strt number;begin -- Retrieve the HTTP stream: v_stream := httpuritype.getclob(httpuritype.createuri( 'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf|| '&flankl=0&flankr=0&map=p3map') );
-- Trim off the head of the stream: strt := dbms_lob.instr(v_stream, 'Submit', 1, 1);
-- Strip out control characters, new lines, etc.: v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', '');
-- Return the AA sequence: return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}'));end;
AA Sequence for ORF ‘YDR099W’
SQL> select orf2seq('YDR099W') from dual;
ORF2SEQ('YDR099W')--------------------------------------------------------------------------------
MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK
Elapsed: 00:00:01.24
SQL> insert into pseq (orf_id, sequence)2 values ('YDR099W', orf2seq('YDR099W'));
Case Study: Motif Searching in Proteins
PROSITE database of protein sequence motifs
ID TYR_PHOSPHO_SITE; PATTERN AC PS00007DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE)DE Tyrosine kinase phosphorylation sitePA [RK]-x(2,3)-[DE]-x(2,3)-YCC /TAXO-RANGE=??E?V; CC /SITE=5,phosphorylationCC /SKIP-FLAG=TRUEDO PDOC00007
Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt
TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y– R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA
Oracle10g Regular Expression Equivalent– [RK].{2,3}[DE].{2,3}[Y]
Case study courtesy of Prolexys Pharmaceuticals, Inc.
SQL to Retrieve All Proteins Interacting with TKP
select distinct substr(a.refseq_id, 1, 9) refseq_id, length(a.seq_string_varchar) seq_length, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4from target_db a, y2h_interaction_p bwhere a.refseq_id like 'NP%' and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]') and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) =
b.prey_refseq);
Query Results
REFSEQ_ID SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS------------ ---------- ----------- ----------- ----------- -----------NP_003961 1465 14 202 347 537NP_003968 330 241 0 0 0NP_003983 490 8 50 62 93NP_004001 3562 3085 0 0 0...
MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFGEDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVGNLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLGKNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVADFGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHRTRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEERSFHTSRNPSR
SQL to Retrieve Motif Frequency by Protein
selectc.refseq_id "Refseq ID",rs2desc(c.refseq_id) "Protein Description",a.cnt "Repetitions",b.ps_ac "Prosite AC",b.descr "Motif Description"
frommotif_data a,ps_data b,target_dbp c
wherea.ps_ac = b.ps_acand a.sequence_id = c.sequence_id
order by3 desc, 1
;
Query Results
Refseq ID Protein Description Repetitions Prosite AC Motif Description--------------- ------------------------------ ----------- ------------ ------------------------------NP_055995.2 spectrin repeat containing, 145 PS00006 Casein kinase II phosphorylation site nuclear envelope 2 NP_056363.1 bullous pemphigoid antigen 1, 132 PS00006 Casein kinase II phosphorylation site 230/240kDa NP_001139.2 ankyrin 2, neuronal 115 PS00006 Casein kinase II phosphorylation site NP_066267.1 ankyrin 3, node of Ranvier 110 PS00006 Casein kinase II phosphorylation site (ankyrin G) NP_056363.1 bullous pemphigoid antigen 1, 102 PS00005 Protein kinase C phosphorylation site 230/240kDa NP_005520.2 heparan sulfate proteoglycan 2 97 PS00008 N-myristoylation site (perlecan)NP_066267.1 ankyrin 3, node of Ranvier 97 PS00005 Protein kinase C phosphorylation site (ankyrin G) P_001139.2 ankyrin 2, neuronal 96 PS00005 Protein kinase C phosphorylation site NP_115495.1 monogenic, audiogenic seizure 95 PS00006 Casein kinase II phosphorylation site susceptibility 1 homolog (mouse) ...
Regular Expression Searches Quote
"Thanks to Oracle 10g's Regular Expressions (RE) query support, it's no longer necessary to export data from the database, process
it with a RE enabled tool and then import the data back into the database. Now, RE processing can be handled with a single query." - Marcel Davidson, Head of Database Administration,
Myriad Proteomics
Oracle Data Mining BLAST
Implemented using a table function interface
BLAST search functions can be placed in SQL queries
Different functions for match & align
Combination of SQL queries & BLAST is very powerful & flexible
C A T G0 0 1 0 1
Case Study: BLAST as a Sequence Identification Tool Identify protein with high sequence similarity and the
functional class
select function, COUNT(seq_id) f_count
from (select t.seq_id, t.score, t.expect, g.function
from SwissProt_DB g,
Table(BLASTP_MATCH(
‘AEQAERYDDMAAAMKRY’,
cursor (select seq_id, sequence
from SwissProt_DB),
5)) t /* expect_value */
where t.seq_id = g.seq_id)
group by function /* swissprot kw */
order by f_count
BLASTP_MATCH
SwissProt_DB
seq_id, score, expect
query_sequence, parameters
seq_id, function
t.seq_id = g.seq_id
GROUP BY
function, f_count
SwissProt_DB
Case Study: Homology Search between Yeast and Human Data
A
B
X
Y
C
Z
Yeast Protein Interactome Human Protein Interactome
Homology Mapping
Interlogs: (A|X, B|Y) and (A|X, B|Z)
Determined experimentally
with Y2HDetermined
experimentally with Y2H
Inferred through BLAST
Case study courtesy of Prolexys Pharmaceuticals, Inc.
Batch BLAST: Human (query) vs. Yeast (subject)
for v1 in c1 loopinsert into yeast_human_homolog (
human_refseq, yeast_orf_name, score, expect
)select
v1.refseq_id,t.t_seq_id,t.score,t.expect
fromtable ( blastp_match (
v1.sequence_string,cursor ( select a.yeast_acn, a.yeast_seq
from yeast_prot_seq a ))
) twhere
t.expect < 0.00001;
end loop;
BLAST Results
Yeast Yeast Human Human Expect 1 Expect 2 Gene 1 Gene 2 Refseq 1 Refseq 2------- ------- ----------- ----------- -------- --------YAR018C YIL061C NP_XXXXX1.1 NP_YYYYY1.1 4.79E-12 4.58E-06 YBL016W YDL159W NP_XXXXX2.1 NP_YYYYY2.1 1.11E-08 5.25E-10 YBL016W YDL159W NP_XXXXX3.1 NP_YYYYY3.1 2.63E-10 9.04E-11 YBL016W YDL159W NP_XXXXX4.1 NP_YYYYY4.1 4.57E-07 8.33E-09 YBL016W YDL159W NP_XXXXX5.1 NP_YYYYY5.1 1.57E-22 1.11E-08 YBL063W YIL061C NP_XXXXX6.1 NP_YYYYY6.1 3.17E-64 8.67E-06 YBL063W YIL061C NP_XXXXX7.1 NP_YYYYY7.1 2.30E-06 4.58E-06 YBR109C YDR356W NP_XXXXX8.1 NP_YYYYY8.1 1.78E-07 7.74E-11 YBR109C YDR356W NP_XXXXX9.1 NP_YYYYY9.1 1.24E-08 7.74E-11 YBR109C YDR356W NP_XXXX10.1 NP_YYYY10.1 5.19E-07 2.80E-20 YBR109C YDR356W NP_XXXX11.1 NP_YYYY11.1 3.92E-10 4.39E-11 YBR109C YFR014C NP_XXXX12.1 NP_YYYY12.1 3.67E-48 6.91E-17 YBR109C YOL016C NP_XXXX13.1 NP_YYYY13.1 3.67E-48 1.82E-17
Yeast Interactors
Human Interactors
Interlogs
BLAST Quote
"Oracle 10g's new BLAST feature will enable us to easily integrate multiple types of genomic and proteomic data for complicated queries used in the mining of our proprietary protein-protein
interaction and cDNA sequence datasets." - Jake Chen, Principal Bioinformatics Scientist, Myriad Proteomics
Spatial Network Data Model
Data model for managing graph (link-node) structures
Rich graph analysis functions Supports variety of network
structures (hierarchical, directed, undirected, random, scale-free)
Framework for applying network constraints and rules (e.g. path length, cost, minimum bounding rectangle)
Bundled Java visualiser & APIs for 3rd party tools, application development
Case Study: Integration Architecture
NREF EMBL GO KEGG BIND AFCS
Nodes Edges Graph
NDM layer (semantic layer)
NativeFormats
Data type determines available routes
Routes can be determined using semantics
Network Route
Distributed Database layer
Case study courtesy of Beyond Genomics, Inc.
Network Data Model Quote
"Beyond Genomics, Inc., as a leading systems biology company, believes that Oracle 10g's network data model will significantly
advance the integration of metabolomic, proteomic, transcriptomic, and clinical data sets and the applications that derive value from
these data." – Eric Neumann, Vice President Strategic Informatics, Beyond Genomics, Inc.
Oracle Data Mining
Unsupervised Learning– Hierarchical K-means Cluster– O-Cluster– Non-Negative Matrix Factorization– Apriori
Supervised Learning– Naïve Bayes– Adaptive Bayes Network– Support Vector Machines– PredictorVariance
ODM can mine structured data, text data, or structured and text data
K-Means Clustering
• Hierarchical k-means produces tree of clusters
• All splits are binary
• Each cluster has a centroid & a histogram
• Achieves a reliable solution in a single run
• Ranked rules that describe attributes for cluster
• Cluster assignments are probabilistic using a Bayesian model
• Operates on very deep datasets by using a summarization module
42 Tumor Samples:
• Normal Cerebellum [MD] (4)
• Malignant Gliomas [MGlio] (10)
• Medulloblastomas [MD] (10)
• Rhabdoid tumors [Rhabdoid] (10)
• Primitive Neuroectodermal [PNET] (8)
* Pomeroy et al Nature 415, 24, p436 (2002).
Case Study: Brain Tumor Clustering• Collection of 42 Human Brain tumors* and 7,129 gene expression
profiles
• Clustering of samples according to their gene expression profiles
• It is an example of class and taxonomy discovery
• Does the data cluster according to the known biological classes?
ODM Hierarchical k-Means Clustering
Node 1
Node 2 Node 3
Node 6Node 4 Node 7Node 5 Glioblastoma Normal Medulloblastoma Rhabdoid Cluster Cluster Cluster Cluster
Rhabdoid
Ncer
MD
MGlio
PNET
Literature Results using Hierarchical Clustering
From Pomeroy et al Nature 415, 24, p436 (2002).
Rhabdoid
Ncer
MD
MGlio
PNET
Association Rules
• Captures frequent co-occurrences of items/attribute values
(A, B) => C occurrence or A and B together implies C
• Can be applied in different scenarios• Market basket analysis • Pattern discovery• Predictive applications
• ODM uses SQL-based implementation of Apriori algorithm
Case Study: Analysis of Trends in a Patient Group
Clinical Table of 60 Medulloblastoma Patients
7 Clinical attributes:
Subtype: classic or desmoplastic medulloblastoma
Size (tumor size): T1-T4
Stage: M0-M4
Sex: M, F
Age (range): 0-5, 5-10, 10-15….
Outcome: S (treatment success),
F (treatment failure)
Chemo (regime type): 0,1,2,3,4,5,6 * Pomeroy et al Nature 415, 24, p436 (2002).
Association Rules Results
Over 100 rules reflecting factual or known relationships in data:
Age=1 THEN Sex=M
(confidence = 0.8)
Interpretation: Most 5-10 year-old patients are male
Subtype=Desmoplastic THEN Stage=M0
(confidence = 0.79)
Interpretation: Most desmoplastic patients in the study have stage M0
Association Rules Results
Other interesting trends:
Stage=M0 THEN Outcome=S
(confidence = 0.74)
Interpretation: Stage M0 vs non-M0 is a predictor of treatment outcome
Stage=M0 AND Size=T3 AND Chemo=1 THEN Outcome=S
(confidence = 0.92)
Interpretation: Most patients with stage M0, size T3 who received chemo regime 1 had good response to treatment
Support Vector Machines
• SVM provides a very general multi-purpose and powerful classifier
• SVM does not require feature selection and can work well with thousands of input features
• SVM is accurate and can approximate complex functional relationships
• SVM works in binary, multi-class, sparse (text) classification and regression
• SVM is easy to train and apply and can be used in discovery mode or in production automated methodologies
Multiple Examples (14) of normal human tissue and tumors
• Could a single model distinguish normal vs cancer?
• Train set: 200 samples, test set: 80 samples
• Microarrays profiles for 7,129 genes
Case Study: Classification of Normal Human Tissue and Tumors
S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001)
Normal Tissue vs. Cancer
Normal vs. Cancer (Multiple types)
SVM Test Set Predictions
Predicted
Normal Cancer
Actual Normal 16 10
Cancer 3 51 Test set accuracy: 83.75%
(Naïve Bayes = 75%)
Support Vector Machines Results
Classification of Multiple Tumor Types
• S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001)
• C. Yeang et al, Procs. of ISMB 2001. Bioinformatics Discovery Note, 1:1-7, (2001)
DNA Microarray Data for 14 Tumor Classes
Published Datasets
• Gene expression profiles for 7,129 genes
• Datasets tumor type composition:
• 9 minutes training time on 500MHz Netra
• 78.3% accuracy for multi-tumor molecular classification
Results of Multiple Tumor Type Analysis
Tumor Class # Train # Test Tumor Class # Train # Test
Breast (BR) 8 3 Uterus (UT) 8 2
Prostate (PR) 8 2 Leukemia (LE) 24 6
Lung (LU) 8 3 Renal (RE) 8 3
Colorectal (CO) 8 5 Pancreas (PA) 8 3
Lymphoma (LY) 16 6 Ovary (OV) 8 3
Bladder (BL) 8 3 Mesothelioma (MS) 8 3
Melanoma (ML) 8 2 Brain (BR) 16 4
Outline
Data Challenges Case Studies Summary
Databases have functionality to access and integrate distributed data
There are data management, performance and security benefits to performing analytics in databases
A range of analytical functionality is now available in databases
Summary