Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University


Post on 18-Jan-2016


Page 1: Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University

Exploring and Exploiting the Biological Maze

Zoé Lacroix

Arizona State University

Page 2

Data collection queries

Scientific protocol
– Must be able to reproduce the process

Involve multiple resources
– Data sources
– Applications

Page 3

Expressing scientific protocols

Scientific protocols mix design and implementation

Design
– What the protocol does (tasks)
– Scientific objects involved

Implementation
– How the protocol is executed
– Data sources and applications

Page 4

Expressing scientific protocols

Scientific protocols are driven by their implementation
– Scientists use the resources they know
  • data (quality)
  • access to data
  • format, limits, etc.
– Scientists may not exploit better resources because they do not know them

Queries should be driven by the design; the implementation should meet the design needs

Page 5

Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.

*Courtesy of Dr. Marta Janer, Institute for Systems Biology

Page 6

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

From GenBank, dbEST and the Riken Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.
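The bookkeeping in the last sentence of this step (which query amino acids correspond to which nucleotides) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes an ungapped tblastn hit on the forward strand with 1-based inclusive coordinates, and the function name and parameters are hypothetical. Gapped or reverse-strand hits would need the alignment itself.

```python
from typing import Dict, Tuple


def map_residues_to_nucleotides(
    q_start: int, q_end: int, s_start: int
) -> Dict[int, Tuple[int, int, int]]:
    """Map each query amino-acid position to the codon that encodes it.

    Assumptions (not from the slides): the hit is ungapped, lies on the
    forward strand, and all coordinates are 1-based and inclusive.
    q_start/q_end are positions in the query protein; s_start is the
    nucleotide position in the target sequence where the hit begins.
    """
    mapping = {}
    for aa_pos in range(q_start, q_end + 1):
        # Each residue advances the nucleotide coordinate by one codon.
        first = s_start + 3 * (aa_pos - q_start)
        mapping[aa_pos] = (first, first + 1, first + 2)
    return mapping


if __name__ == "__main__":
    hit = map_residues_to_nucleotides(q_start=10, q_end=12, s_start=100)
    for aa_pos, codon in hit.items():
        print(aa_pos, codon)
```

With such a table, a SNP reported at a nucleotide position in an EST can be traced back to the residue of the query protein it may affect.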

Page 7

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

Data sources: GenBank, dbEST, the Riken Clone Collection

Page 8

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

Tools: tblastn

Page 9

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

Tasks: collect the EST and full-length cDNA sequences that match the query proteins; map the query protein to the target DNA sequences

Page 10

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

Scientific objects: EST and full-length cDNA sequences, query proteins, target DNA sequences, amino acids, nucleotides

Page 11

Pipeline Selecting Target Proteins*

[Diagram: SMART, Swiss-Prot; BIND, DIP, CEY2H; sigpep; blast x D.mel]

Step 1 = retrieve all proteins from SMART and Swiss-Prot by textual search with the keyword “apoptosis”

Step 2 = retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”

Step 3 = retrieve their binding partners from DIP, BIND and the C. elegans dataset

Step 4 = run the sequences through a signal peptide prediction program such as SigPep to check for the presence of signal peptides in each of them

Step 5 = run a BLAST homology search of the retrieved sequences against proteins predicted from the Drosophila melanogaster genome, which might yield additional candidates

Output = final set of signal peptide proteins involved in apoptosis

*Courtesy of Dr. Terry Gaasterland, The Rockefeller University

Page 12

Design and implementation

Step | Task | Implementation
Input | Relevant keyword for which the proteins are required |
Step 1 | All proteins with the keyword and with a signal peptide feature must be retrieved | SMART, Swiss-Prot
Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND
Step 3 | The integrated final set is run through a signal peptide prediction program | SigPep
Step 4 | Homology search of the retrieved sequences against proteins predicted from the specific genome yields additional candidates | BLAST
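One way to keep the two columns of this table apart in software is to store the task (design) and the resources (implementation) as separate fields, so that a resource can be swapped without touching the design. The sketch below is only an illustration of that idea; the class and function names are hypothetical and the task strings are shortened from the table.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class ProtocolStep:
    name: str
    task: str                                            # design: what the step does
    resources: List[str] = field(default_factory=list)   # implementation: how


PIPELINE = [
    ProtocolStep("Step 1", "Retrieve proteins with keyword and signal peptide feature",
                 ["SMART", "Swiss-Prot"]),
    ProtocolStep("Step 2", "Retrieve binding partners", ["DIP", "BIND"]),
    ProtocolStep("Step 3", "Run signal peptide prediction", ["SigPep"]),
    ProtocolStep("Step 4", "Homology search against a predicted proteome", ["BLAST"]),
]


def resources_used(steps: List[ProtocolStep]) -> List[str]:
    """List every resource the implementation touches, in first-use order."""
    seen: List[str] = []
    for step in steps:
        for resource in step.resources:
            if resource not in seen:
                seen.append(resource)
    return seen


def rebind(steps: List[ProtocolStep], step_name: str,
           new_resources: List[str]) -> List[ProtocolStep]:
    """Return a copy of the pipeline with one step bound to other resources.

    The task (design) of every step is left unchanged.
    """
    return [replace(s, resources=list(new_resources)) if s.name == step_name else s
            for s in steps]


if __name__ == "__main__":
    print(resources_used(PIPELINE))
```

Calling `rebind(PIPELINE, "Step 2", ["DIP"])` changes only the implementation of Step 2; the tasks stay as designed.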

 

Page 13

Expressing scientific pipelines with BioNavigation

Queries are expressed at a conceptual level (design)

[Figure: conceptual level, showing the scientific classes DNA Seq., Disease, Gene, Citation and Protein Seq.]

Page 14

Conceptual graph

Labeled edges
– Scientifically meaningful edges

[Figure: conceptual graph. Nodes: Gene, NucleotideSequence, DNA, RNA, mRNA, Protein. Edge labels: isA (four edges), transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom]
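A conceptual graph with labeled edges can be held as a list of (source, label, target) triples. In the sketch below the node names and edge labels come from the slide, but the endpoints of each edge are assumptions: the transcript does not preserve which nodes each edge connects. The helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# ASSUMPTION: edge endpoints are guessed from the biology; only the node
# names and the edge labels are taken from the slide.
EDGES: List[Tuple[str, str, str]] = [
    ("DNA", "isA", "NucleotideSequence"),
    ("RNA", "isA", "NucleotideSequence"),
    ("mRNA", "isA", "RNA"),
    ("Gene", "isA", "DNA"),
    ("DNA", "transcribesTo", "RNA"),
    ("RNA", "isTranscribedFrom", "DNA"),
    ("mRNA", "translatesTo", "Protein"),
    ("Protein", "isTranslatedFrom", "mRNA"),
]


def build_index(edges: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Index targets by (source node, edge label)."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for source, label, target in edges:
        index[(source, label)].add(target)
    return index


def follow(index: Dict[Tuple[str, str], Set[str]], start: str,
           labels: List[str]) -> Set[str]:
    """Return the nodes reached from `start` by following `labels` in order."""
    frontier = {start}
    for label in labels:
        frontier = {t for node in frontier for t in index.get((node, label), set())}
    return frontier


if __name__ == "__main__":
    idx = build_index(EDGES)
    print(follow(idx, "mRNA", ["translatesTo"]))
    print(follow(idx, "mRNA", ["isA", "isA"]))
```

Because each edge carries a label, a query can ask for a specific relationship (for example translatesTo) rather than any link between two classes.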

Page 15

Conceptual graph

[Figure: the same conceptual graph (nodes Gene, NucleotideSequence, DNA, RNA, mRNA, Protein; edge labels isA, transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom), with seven additional edges labeled IsRelatedTo]

Page 16

Mapping to physical resources

[Figure: two levels. Conceptual level (scientific classes): DNA Seq., Disease, Gene, Citation, Protein Seq. Physical level (data sources): OMIM, GenBank, PubMed, HUGO, NCBI Protein]
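The mapping from the conceptual level to the physical level can be sketched as a dictionary from scientific class to candidate data sources. The transcript does not preserve which source is attached to which class, so the pairing below is an assumption based on what each database holds. Swiss-Prot does not appear on this slide; it is added here as a second protein source only to show a class with more than one candidate. The function name is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# ASSUMPTION: class-to-source pairing inferred from the content of each
# database, not recovered from the figure. Swiss-Prot is an illustrative
# addition that is not on this slide.
SOURCES_FOR_CLASS: Dict[str, List[str]] = {
    "Disease": ["OMIM"],
    "Gene": ["HUGO"],
    "DNA Seq.": ["GenBank"],
    "Protein Seq.": ["NCBI Protein", "Swiss-Prot"],
    "Citation": ["PubMed"],
}


def physical_plans(conceptual_path: List[str]) -> List[Tuple[str, ...]]:
    """Enumerate every sequence of data sources that implements a path
    expressed over scientific classes."""
    candidates = [SOURCES_FOR_CLASS[cls] for cls in conceptual_path]
    return list(product(*candidates))


if __name__ == "__main__":
    for plan in physical_plans(["Disease", "Protein Seq.", "Citation"]):
        print(" -> ".join(plan))
```

One query written at the conceptual level then yields several candidate implementations, which is where the choice between resources arises.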


Page 18

Exploring biological metadata

“Return all citations that are related to some disease or condition”

Diabetes: 11
Aging: 71
Cancer: 391

[Figure: paths (P1), (P2) and (P3) over the sources OMIM, NUCLEOTIDE, PROTEIN and PUBMED]

• Link: Entrez provides an index with the Links in the display option from each entry
• Parse: Parsing each entry to retrieve its related entries
• All: Entrez provides an index with the Links in the display option, which allows looking at a set of entries at a time

Page 19

Selecting biological resources

3 resources that look the same
– Are they the same?

3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?

Page 20

Results for the disease conditions diabetes, aging and cancer

Condition  Method  P1      P2      P3
Diabetes   Link    43,890  42,969  59,959
Diabetes   Parse   43,747  43,090  51,906
Diabetes   All     44,037  43,581  49,719
Aging      Link    48,393  51,712  60,129
Aging      Parse   48,398  51,855  61,260
Aging      All     48,393  51,474  60,938
Cancer     Link    56,315  54,487  62,686
Cancer     Parse   56,315  54,607  63,367
Cancer     All     56,532  52,488  60,033

 

Page 21

Overlap results for the disease condition diabetes

Link
      P1      P2      P3
P1    100%    25.82%  21.95%
P2    25.28%  100%    70.00%
P3    29.98%  97.68%  100%

Parse
      P1      P2      P3
P1    100%    23.93%  22.87%
P2    29.18%  100%    81.20%
P3    33.60%  97.81%  100%

All
      P1      P2      P3
P1    100%    24.75%  24.29%
P2    24.64%  100%    79.49%
P3    27.42%  90.68%  100%
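The slides do not say how the overlap percentages are defined. A sketch of one plausible reading: the entry in row Pi, column Pj is the share of Pj's results that Pi also returns, that is |Pi ∩ Pj| / |Pj|. This reading is an assumption, but the Link figures are consistent with it if the diabetes Link values 42,969 (P2) and 59,959 (P3) from the results slide are result-set sizes: 97.68% of 42,969 is about 41,972 shared entries, and 41,972 / 59,959 is about 70.00%. The function names and the toy identifiers below are hypothetical.

```python
from typing import Dict, Set, Tuple


def overlap_matrix(results: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Percentage overlap between result sets.

    ASSUMED convention: entry (row, col) = |row ∩ col| / |col| * 100,
    so the matrix is not symmetric when the sets differ in size.
    """
    matrix = {}
    for row, row_ids in results.items():
        for col, col_ids in results.items():
            shared = len(row_ids & col_ids)
            matrix[(row, col)] = 100.0 * shared / len(col_ids) if col_ids else 0.0
    return matrix


def implied_shared(percent: float, col_size: int) -> float:
    """Number of shared entries implied by an overlap percentage."""
    return percent / 100.0 * col_size


if __name__ == "__main__":
    toy = {"P1": {"a", "b", "c", "d"},
           "P2": {"c", "d", "e", "f", "g", "h", "i", "j"}}
    m = overlap_matrix(toy)
    print(m[("P1", "P2")], m[("P2", "P1")])

    # Consistency check on the Link figures for P2 and P3.
    shared = implied_shared(97.68, 42_969)
    print(round(shared), round(100.0 * shared / 59_959, 2))
```

Under this convention a small result set can be almost entirely contained in a larger one (high percentage in its column) while covering only part of it (lower percentage in the other column).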

Page 22

Evaluating resources

Similar applications
– Different outputs

Similar data sources
– Different output

Number of resources
– Different output

Order of resources
– Different output

Page 23

Exploiting semantics of resources

Number of entries

Characterization of entries (number of attributes)

Time

Page 24

Exploiting the semantics of links

Page 25

BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)

Conceptual graph
– No labeled links

Queries
– Regular expressions of concepts

ESearch
– Path cardinality: number of instances of paths of the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) of entries e1 of S1 linked to an entry e2 of S2.
– Target object cardinality: number of distinct objects retrieved from the final data source.
– Evaluation cost: cost of the evaluation plan, which involves both the local processing cost and remote network access delays.
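The first two metrics can be illustrated on toy data. The sketch below is not ESearch itself: it assumes the links between consecutive sources are available as lists of distinct (entry, entry) pairs, and all entry identifiers and function names are made up for the example. Evaluation cost is left out because it depends on processing and network measurements the slides do not give.

```python
from typing import Dict, List, Sequence, Tuple

Link = Tuple[str, str]


def path_instances(start_entries: Sequence[str],
                   links_per_hop: Sequence[Sequence[Link]]) -> List[Tuple[str, ...]]:
    """Enumerate every chain of linked entries along a path of sources.

    links_per_hop[i] holds the distinct (source entry, target entry) pairs
    between the i-th and (i+1)-th source on the path.
    """
    paths: List[Tuple[str, ...]] = [(entry,) for entry in start_entries]
    for links in links_per_hop:
        successors: Dict[str, List[str]] = {}
        for src, dst in links:
            successors.setdefault(src, []).append(dst)
        paths = [p + (dst,) for p in paths for dst in successors.get(p[-1], [])]
    return paths


def path_cardinality(paths: Sequence[Tuple[str, ...]]) -> int:
    """Number of path instances in the result."""
    return len(paths)


def target_object_cardinality(paths: Sequence[Tuple[str, ...]]) -> int:
    """Number of distinct entries reached in the final source."""
    return len({p[-1] for p in paths})


if __name__ == "__main__":
    # Hypothetical entries: two OMIM records, two proteins, two citations.
    omim = ["o1", "o2"]
    omim_to_protein = [("o1", "p1"), ("o1", "p2"), ("o2", "p2")]
    protein_to_pubmed = [("p1", "c1"), ("p2", "c1"), ("p2", "c2")]

    paths = path_instances(omim, [omim_to_protein, protein_to_pubmed])
    print(path_cardinality(paths), target_object_cardinality(paths))
```

For a path of length 1 the enumeration reduces to the linked pairs themselves, which matches the definition above. The two metrics can differ widely: many path instances may end in the same few target objects.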

Page 26

Work in progress

Conceptual graph
– Labeled links

Queries
– Complex dataflows

Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications

Page 27

Representing the conceptual graph in Protégé

Page 28

Visualization Limitations in Protégé

Using the GraphViz plugin
– Shows only IsA hierarchy

Page 29

TgiViz plugin

Page 30

Conclusion

Scientists need support to select resources to express their protocols

Semantics of resources may be exploited to enhance the data collection process

Need for a repository of biological metadata (BioMetaDatabase)