Data Mining for Bioinformatics Craig A. Struble, Ph.D. Marquette University [email protected]

Post on 10-May-2015


Page 1: Data Mining for Bioinformatics

Data Mining for Bioinformatics

Craig A. Struble, Ph.D. Marquette University [email protected]

Page 2: Data Mining for Bioinformatics

Overview
• Survey of KDD for Bioinformatics
  • KDD overview
  • Bioinformatics data
  • Survey of KDD steps
• Case Study: miRNA Project
  • Identifying the problem
  • Data collection with Perl
  • Selection/cleansing
  • Future work…
• Next Time

Page 3: Data Mining for Bioinformatics

Knowledge Discovery in Databases

[Figure: the KDD process. Data → Cleaning, Integration → Data Warehouse → Selection, Transformation → Prepared data → Data Mining → Patterns → Evaluation, Visualization → Knowledge. Additional label: Knowledge Base.]

Page 4: Data Mining for Bioinformatics

Bioinformatics Data
• DNA Sequences
• Genes
  • Location, introns, exons, function, etc.
• Gene products
  • RNA, Proteins
• Pathways
  • Signaling, metabolic, genomic, etc.

Page 5: Data Mining for Bioinformatics

Bioinformatics Data
• Experimental
  • Gene expression, knockouts, etc.
• Literature
  • Diseases, viruses, bacteria
  • Organisms
  • Textbooks
• Expert knowledge
  • Unpublished insights
• Etc.

Page 6: Data Mining for Bioinformatics

KDD for Bioinformatics

[Figure: the KDD process adapted to bioinformatics. Sources: Genomic, Literature, Experimental → Data → Normalization, Curation, Validation, etc. → Data Warehouse → Sampling, Expressed Genes, Homologs, etc. → Prepared data → Clustering, SVMs, ILP, Classification, etc. → Patterns → Evaluation, Visualization → Knowledge. Additional labels: Expert Knowledge; "Often not explicitly implemented".]

Page 7: Data Mining for Bioinformatics

Data Collection and Cleansing
• Perl scripts (BioPerl)
• From literature
  • Read a paper and enter the information
  • Supplemental data for papers
• Public databases
  • GenBank
  • Stanford Microarray Database
  • SWISS-PROT
  • Etc.

Page 8: Data Mining for Bioinformatics

Data Cleansing
• Remove invalid, redundant, or otherwise useless data
• Extrapolate missing data values
• Data formatting/transformation
  • Binning, normalization, scaling, etc.
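The transformations named on this slide can be sketched in a few lines of Perl. This is a minimal illustration, not code from the miRNA project: the function names and the expression values are made up, and equal-width binning is only one of several possible binning schemes.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Rescale values linearly into [0, 1]; a constant column maps to all zeros.
sub min_max_scale {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    return map { $hi == $lo ? 0 : ($_ - $lo) / ($hi - $lo) } @values;
}

# Centre on the mean and divide by the sample standard deviation.
sub z_score {
    my @values = @_;
    my $n = @values;
    return map { 0 } @values if $n < 2;    # too few values to estimate spread
    my $mean = sum(@values) / $n;
    my $sd   = sqrt(sum(map { ($_ - $mean) ** 2 } @values) / ($n - 1));
    return map { $sd == 0 ? 0 : ($_ - $mean) / $sd } @values;
}

# Assign each value to one of $bins equal-width bins, numbered from 0.
sub equal_width_bins {
    my ($bins, @values) = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    my $width = ($hi - $lo) / $bins;
    return map {
        my $index = $width == 0 ? 0 : int(($_ - $lo) / $width);
        $index >= $bins ? $bins - 1 : $index;    # the maximum joins the last bin
    } @values;
}

my @expression = (120, 480, 1035, 2200, 5000);    # made-up expression levels
print join(' ', map { sprintf '%.3f', $_ } min_max_scale(@expression)), "\n";
print join(' ', equal_width_bins(4, @expression)), "\n";
```

Min-max scaling is sensitive to outliers, so z-scores are often the safer default when a few values are extreme.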

Page 9: Data Mining for Bioinformatics

Data Selection
• Database queries for specific genes, organisms, sequences, etc.
• Statistical analysis (microarray)
• Random sampling
• Etc.
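Random sampling, listed above, needs nothing beyond core Perl. The sketch below assumes the candidate items fit in memory; the identifiers are hypothetical placeholders, not real accessions.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return $k items drawn uniformly at random without replacement.
sub random_sample {
    my ($k, @items) = @_;
    $k = @items if $k > @items;    # cannot draw more items than exist
    return () if $k <= 0;
    my @shuffled = shuffle(@items);
    return @shuffled[0 .. $k - 1];
}

my @identifiers = map { "SEQ$_" } 1 .. 20;    # hypothetical identifiers
print join(', ', random_sample(5, @identifiers)), "\n";
```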

Page 10: Data Mining for Bioinformatics

Data Mining Techniques
• Statistical
  • Principal Component Analysis
  • ANOVA
  • Outlier analysis
  • Discrimination
  • Some clustering techniques (K-Means)

Page 11: Data Mining for Bioinformatics

Data Mining Techniques
• Machine Learning
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
  • Inductive Logic Programming
  • Fuzzy Logic
  • Rough Sets
  • Bayesian Belief Networks

Page 12: Data Mining for Bioinformatics

Data Mining Techniques
• More Techniques
  • Clustering
  • Self Organizing Maps
  • Hidden Markov Models
  • Maximum Likelihood Estimators
  • Association Rules
  • …

Page 13: Data Mining for Bioinformatics

Kinds of Techniques
• Unsupervised
  • Technique makes no assumption about a priori knowledge
  • Useful when not much is known
• Supervised
  • Attach class labels to data items
  • Identify (or learn about) properties that distinguish classes

Page 14: Data Mining for Bioinformatics

Kinds of Techniques
• Unsupervised
  • Clustering
  • SOMs
• Supervised
  • Support Vector Machines
  • Neural Networks
  • Bayesian Belief Networks
  • HMMs

Page 15: Data Mining for Bioinformatics

Kinds of Techniques
• Supervised techniques require training
  • Data split into training and test sets
  • Many kinds of validation
    • N-way cross validation
    • Leave one out testing
    • Etc…
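As a rough illustration of N-way cross validation, the sketch below splits item indices into folds and pairs each test fold with the remaining training indices. The function names are hypothetical. Leave one out testing is the special case where the number of folds equals the number of items.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split item indices 0 .. $n-1 into $k folds of near-equal size.
sub k_folds {
    my ($n, $k) = @_;
    my @order = shuffle(0 .. $n - 1);
    my @folds = map { [] } 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $order[$_] for 0 .. $#order;
    return @folds;
}

# For each fold, return a pair [ train indices, test indices ].
sub cross_validation_splits {
    my ($n, $k) = @_;
    my @folds = k_folds($n, $k);
    my @splits;
    for my $i (0 .. $#folds) {
        my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $#folds;
        push @splits, [ \@train, $folds[$i] ];
    }
    return @splits;
}

my @splits = cross_validation_splits(10, 5);
for my $i (0 .. $#splits) {
    my ($train, $test) = @{ $splits[$i] };
    printf "fold %d: train=%d test=%d\n", $i + 1, scalar @$train, scalar @$test;
}
```

Every item appears in exactly one test fold, so each item is used for testing once and for training in all other rounds.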

Page 16: Data Mining for Bioinformatics

Visualization of Results
• Graphs/Charts
• Rules
  • If expression of X < 1035, then tissue is cancerous
• Largely dependent on the technique used

Page 17: Data Mining for Bioinformatics

Case Study: miRNA Project
• Started January 2002
• Participants
  • Dr. Craig Struble
  • Dr. Stephen Munroe
  • Dr. John Simms
  • Parthav Jailwala
  • Peigang Li
• http://bistro.mscs.mu.edu/miRNA

Page 18: Data Mining for Bioinformatics

Case Study: miRNA Project
• Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864 (2001).
• Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853-858 (2001).
• Hutvágner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838 (2001).
• Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-862 (2001).

Page 19: Data Mining for Bioinformatics

Research Questions
• Can we identify features of existing miRNAs that can be used to predict the existence of other miRNA genes?
• Which mRNAs (messenger RNAs) are targeted by miRNAs?
• What other family-wide behavioral and structural questions can be answered about miRNAs?

Page 20: Data Mining for Bioinformatics

Current Implementation

[Figure: pipeline diagram. Labels: GenBank; Perl script (three instances); miRNA library; BLAST reports; Homolog library; Multiple Sequence Alignment. Stage annotations: Data warehouse; Data Selection/Cleansing; Initial mining and cleansing.]

Page 21: Data Mining for Bioinformatics

Perl
• Practical Extraction and Report Language
• Language of choice for many bioinformaticians
• Excellent support for parsing/transforming data
• http://www.perl.com

Page 22: Data Mining for Bioinformatics

Data Collection with Perl
• E.g., using Entrez

Page 23: Data Mining for Bioinformatics

Data Collection with Perl
• Construct a URL to search and access information in Entrez
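Search terms often contain spaces or brackets, so the pieces of such a URL usually need percent-encoding before they are joined. The sketch below uses only core Perl and assumes ASCII input. The helper names and the search term are illustrative, and the base URL is the one shown in these slides, which may no longer be served.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes ASCII input; wider characters would need UTF-8 encoding first.
sub percent_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Join a base URL and an ordered list of key/value pairs into one URL.
sub build_query_url {
    my ($base, @pairs) = @_;
    my @parts;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        push @parts, percent_encode($key) . '=' . percent_encode($value);
    }
    return $base . '?' . join('&', @parts);
}

print build_query_url(
    'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi',
    cmd  => 'Search',
    db   => 'nucleotide',
    term => 'miRNA AND Homo sapiens[Organism]',    # illustrative search term
), "\n";
```

Passing the parameters as an ordered list rather than a hash keeps them in a predictable order, which makes the resulting URLs easier to compare and log.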

Page 24: Data Mining for Bioinformatics

Data Collection with Perl
• Use LWP module
  • Makes network connections easy
• Use BioPerl (http://www.bioperl.org)
  • Perl modules/objects for handling bioinformatics data
  • Handles connections to databases

Page 25: Data Mining for Bioinformatics

Sample Perl Script

#!/usr/local/bin/perl
#
# Simple Entrez Query in Perl
# Craig A. Struble
#
use strict;
use warnings;

# For internet requests and protocols
use LWP;

# A user agent for testing
my $ua = LWP::UserAgent->new;
$ua->agent('miRNA/0.1 ');

# URL base for Entrez search
my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';

Page 26: Data Mining for Bioinformatics

Script (cont.)

# Building up the URL for the Entrez search
my $search_URL = $NCBI_ENTREZ          # URL base
               . 'cmd=Search'          # Command
               . '&db=nucleotide'      # Database
               . '&dispmax=100'        # Max results
               . '&term=miRNA'         # Search term
               . '&doptcmdl=FASTA';    # Result format

# Make an HTTP GET request for an Entrez search
my $req = HTTP::Request->new(GET => $search_URL);
$req->push_header(Connection => 'Keep-Alive');

# Get the response
my $res = $ua->request($req);

Page 27: Data Mining for Bioinformatics

Script (cont.)

# Check the response. If it's OK, print out the content
if ($res->is_success) {
    print $res->content;
} else {
    print $res->error_as_HTML;
    exit 1;
}

Page 28: Data Mining for Bioinformatics

Sample Result

<input name="showndispmax" type="hidden" value="100"><input name="page" type="hi

dden" value="0"></table></td></tr>

</table><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><inp

ut name="uid" type="checkbox" value="17646034"><b>1: </b>AJ421749. Homo sapiens

micr...[gi:17646034]</td>

<td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cm

d=Display&amp;dopt=nucleotide_pubmed&amp;from_uid=17646034">PubMed, </a></SPAN>

<SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cmd=Display&amp;dopt

=nucleotide_taxonomy&amp;from_uid=17646034">Taxonomy</a></SPAN>

</td>

</tr></table></dt></dl><pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens m

icroRNA miR-27

TTCACAGTGGCTAAGTTCCGCT

</pre><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><input

name="uid" type="checkbox" value="17646061"><b>2: </b>AJ421776. Drosophila mela

no...[gi:17646061]</td>

Page 29: Data Mining for Bioinformatics

Parsing Result
• The result is a big, ugly HTML file
• Need to extract the data in <pre> tags
• Fortunately, Perl can come to the rescue!

Page 30: Data Mining for Bioinformatics

Parsing Result with Perl

#!/usr/local/bin/perl
use strict;
use warnings;

# Use an HTML parser
use HTML::TreeBuilder;

# Extract FASTA entries from each file on the command line
foreach my $file_name (@ARGV) {
    # Build an HTML parse tree
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);

    # FASTA entries are in PRE tags
    my @entries = $tree->find_by_tag_name('pre');

    # Print out each entry
    foreach my $entry (@entries) {
        my @children = $entry->content_list;
        print $children[0] . "\n";    # first child is text content
    }

    # Free the parse tree, which holds circular references
    $tree->delete;
}

Page 31: Data Mining for Bioinformatics

Processed Results

>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
TATCACAGCCATTTTGACGAGT
>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
TATCACAGCCATTTTGACGAGT
>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
TATCACAGCCATTTTGATGAGT
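Once the entries are plain FASTA, they can be loaded into records for the later steps. The sketch below is a core-Perl illustration with hypothetical function names. In practice BioPerl's Bio::SeqIO does the same job, and the header layout assumed here is simply the one visible in the results above.

```perl
use strict;
use warnings;

# Parse FASTA text into a list of { header => ..., sequence => ... } records.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;
        if ($line =~ /^>(.*)/) {
            push @records, { header => $1, sequence => '' };
        } elsif (@records) {
            (my $residues = $line) =~ s/\s+//g;
            $records[-1]{sequence} .= $residues;
        }
    }
    return @records;
}

# Split a header of the form "gi|<gi>|<db>|<accession>|<locus> <description>".
sub parse_ncbi_header {
    my ($header) = @_;
    my ($ids, $description) = split / /, $header, 2;
    my (undef, $gi, $db, $accession, $locus) = split /\|/, $ids;
    return { gi => $gi, db => $db, accession => $accession,
             locus => $locus, description => $description };
}

my $fasta = <<'END';
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
END

for my $record (parse_fasta($fasta)) {
    my $info = parse_ncbi_header($record->{header});
    printf "%s\t%d nt\t%s\n", $info->{accession},
        length($record->{sequence}), $info->{description};
}
```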

Page 32: Data Mining for Bioinformatics

Getting BLAST Reports
• Can automate getting BLAST reports with Perl
• URL format documentation is available at http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
• Perl code not displayed

Page 33: Data Mining for Bioinformatics

Parsing BLAST Reports
• Use BioPerl Bio::Tools::BPlite
• Find high scoring pairs that contain surrounding sequence
• BLAST also reports original sequence hits
• Extract out matching sequence with up and downstream surrounding sequence

Page 34: Data Mining for Bioinformatics

Perl Script

#!/usr/local/bin/perl
#
# Create homolog database from BLAST reports
# Author: Craig A. Struble

# Various BioPerl modules to use
use Bio::Tools::BPlite;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;

Page 35: Data Mining for Bioinformatics

Script (cont.)

###############################################################################
# Function: rev_comp
# Description: Calculates the reverse complement of a DNA sequence.
###############################################################################
sub rev_comp {
    my @seqs;
    foreach my $seq (@_) {
        # Work on a copy: $seq aliases the caller's argument, so editing it
        # in place would silently change the caller's data.
        (my $rc = reverse $seq) =~ tr/AaCcTtGg/TtGgAaCc/;
        push @seqs, $rc;
    }
    # wantarray checks whether we were called in list context
    return wantarray ? @seqs : $seqs[0];
}
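A short usage sketch shows what the wantarray check buys: one string back in scalar context, every result back in list context. The function is restated here so the example runs on its own. This restatement works on copies of its arguments and, like the slide's version, ignores IUPAC ambiguity codes.

```perl
use strict;
use warnings;

# Restated so the demo is self-contained; complements A/C/G/T only.
sub rev_comp {
    my @seqs;
    foreach my $seq (@_) {
        (my $rc = reverse $seq) =~ tr/AaCcTtGg/TtGgAaCc/;
        push @seqs, $rc;
    }
    return wantarray ? @seqs : $seqs[0];
}

my $single = rev_comp('TTCACAGTGGCTAAGTTCCGCT');    # scalar context
my @several = rev_comp('ATGC', 'aacg');             # list context
print "$single\n";
print "@several\n";
```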

Page 36: Data Mining for Bioinformatics

Script (cont.)

###############################################################################
# Function: around_seq
# Description: Returns the upstream and downstream sequence around an HSP
# Parameters: hsp - the high scoring pair
#             seq - the sequence of reference
#             upstream - number of basepairs upstream
#             downstream - number of basepairs downstream
###############################################################################
sub around_seq {
    my ($hsp, $seq, $upstream, $downstream) = @_;

    # Code deleted due to space

    return $subseq;
}
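The body of around_seq is not shown in the slides. As a rough idea of the coordinate arithmetic such a function needs, here is a hypothetical helper that works on a plain string and 1-based inclusive coordinates rather than on BioPerl objects. It is not the author's implementation, and a minus-strand hit would additionally need reverse complementing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the match at 1-based, inclusive coordinates
# $start..$end plus up to $upstream bases before it and $downstream bases
# after it, clamped to the ends of the sequence.
sub flanking_window {
    my ($sequence, $start, $end, $upstream, $downstream) = @_;
    ($start, $end) = ($end, $start) if $start > $end;

    my $from = $start - $upstream;
    $from = 1 if $from < 1;
    my $to = $end + $downstream;
    $to = length($sequence) if $to > length($sequence);

    return substr($sequence, $from - 1, $to - $from + 1);
}

my $toy = 'AAAACCCCGGGGTTTT';    # toy sequence, not real data
print flanking_window($toy, 5, 8, 2, 3), "\n";
```

Clamping matters for hits near the ends of a GenBank entry, where fewer flanking bases exist than were requested.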

Page 37: Data Mining for Bioinformatics

Script (cont.)

# Open the BLAST report
open(my $blast_fh, '<', $ARGV[0]) or die "open failed: $!";
my $report = Bio::Tools::BPlite->new(-fh => $blast_fh);
my $gb = Bio::DB::GenBank->new;

# Open output file
my $out = Bio::SeqIO->new('-file' => ">$ARGV[1]", '-format' => 'fasta');

# Amount up and downstream to get
my $upstream = $ARGV[2];
my $downstream = $ARGV[3];

Page 38: Data Mining for Bioinformatics

Script (cont.)

while (my $sbjct = $report->nextSbjct) {
    my ($db, $accv, $acc, $rest) = split /\|| /, $sbjct->name;
    my $seq = $gb->get_Seq_by_acc($acc);
    print $seq->accession_number . "\n";

    while (my $hsp = $sbjct->nextHSP) {
        my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
        my $subseq = Bio::Seq->new(
            '-seq'              => $seqstr,
            '-accession_number' => $seq->accession_number,
            '-display_id'       => $seq->accession_number . "_"
                                 . $hsp->subject->start . ".."
                                 . $hsp->subject->end . "_"
                                 . $hsp->subject->strand,
        );
        $out->write_seq($subseq);
    }
}

Page 39: Data Mining for Bioinformatics

Results

>AC084471_10966..10987_-1
TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT
TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG
TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC
AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA
>AF274345_1763..1784_1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
>Z70203_12425..12446_-1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
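The display IDs above pack accession, hit coordinates, and strand into one string, following the pattern built in the preceding script. A small sketch for unpacking them again is shown below. The function name is hypothetical, and it assumes the strand is always 1 or -1.

```perl
use strict;
use warnings;

# Split "<accession>_<start>..<end>_<strand>" into its four fields.
# Returns nothing when the ID does not follow that pattern.
sub parse_display_id {
    my ($id) = @_;
    my ($accession, $start, $end, $strand) =
        $id =~ /^(.+)_(\d+)\.\.(\d+)_(-?1)$/
        or return;
    return { accession => $accession, start => $start,
             end => $end, strand => $strand };
}

for my $id ('AC084471_10966..10987_-1', 'AF274345_1763..1784_1') {
    my $hit = parse_display_id($id) or next;
    printf "%s: %d bp hit on strand %s\n", $hit->{accession},
        $hit->{end} - $hit->{start} + 1, $hit->{strand};
}
```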

Page 40: Data Mining for Bioinformatics

Multiple Sequence Alignment
• Currently using clustalw/clustalx
• Eventually generate web pages with sequence alignments
• Investigate conserved regions of the surrounding sequence
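One simple way to look for conserved regions is to score each alignment column by how often its most common base occurs, then report runs of high-scoring columns. The sketch below is only an illustration of that idea. It assumes the aligned sequences are already available as equal-length strings with '-' for gaps, and the function names, threshold, and toy alignment are made up.

```perl
use strict;
use warnings;

# For each alignment column, return the fraction of sequences that share
# the most common non-gap character.
sub column_conservation {
    my @aligned = @_;
    my @scores;
    for my $col (0 .. length($aligned[0]) - 1) {
        my %count;
        for my $row (@aligned) {
            my $char = uc substr($row, $col, 1);
            $count{$char}++ unless $char eq '-';
        }
        my ($top) = sort { $b <=> $a } values %count;
        push @scores, (defined $top ? $top : 0) / scalar(@aligned);
    }
    return @scores;
}

# Return [start, end] pairs (0-based columns) for runs of at least
# $min_length columns whose score is at least $threshold.
sub conserved_regions {
    my ($scores, $threshold, $min_length) = @_;
    my (@regions, $start);
    for my $i (0 .. $#$scores + 1) {
        my $ok = $i <= $#$scores && $scores->[$i] >= $threshold;
        if ($ok) {
            $start = $i unless defined $start;
        } elsif (defined $start) {
            push @regions, [ $start, $i - 1 ] if $i - $start >= $min_length;
            undef $start;
        }
    }
    return @regions;
}

my @alignment = ('ACGT-A', 'ACGTTA', 'ACCTTA');    # toy alignment
my @scores = column_conservation(@alignment);
printf "columns %d..%d\n", @$_ for conserved_regions(\@scores, 1, 2);
```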

Page 41: Data Mining for Bioinformatics

Multiple Sequence Alignment

Page 42: Data Mining for Bioinformatics

Future Work
• Process homolog library with RNA fold prediction software (mFold)
• Collect together fold structure information and other information
• Transform into logical representation for ILP analysis
• Store data in a database (Postgres)
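A logical representation for ILP typically means ground facts in Prolog syntax. The sketch below shows one possible shape for that transformation. Every predicate name (sequence, organism, seq_length) and the record layout are hypothetical, chosen for illustration only, since the slides do not specify the representation.

```perl
use strict;
use warnings;

# Quote a string as a Prolog atom, escaping backslashes and single quotes.
sub quote_atom {
    my ($text) = @_;
    $text =~ s/\\/\\\\/g;
    $text =~ s/'/\\'/g;
    return "'$text'";
}

# Turn one record into Prolog-style facts (hypothetical predicate names).
sub to_facts {
    my ($record) = @_;
    my $id = lc $record->{id};
    $id =~ s/[^a-z0-9_]/_/g;
    $id = "s_$id" unless $id =~ /^[a-z]/;    # unquoted atoms start lowercase
    return (
        sprintf('sequence(%s).', $id),
        sprintf('organism(%s, %s).', $id, quote_atom($record->{organism})),
        sprintf('seq_length(%s, %d).', $id, length($record->{sequence})),
    );
}

print "$_\n" for to_facts({
    id       => 'AJ421749',
    organism => 'Homo sapiens',
    sequence => 'TTCACAGTGGCTAAGTTCCGCT',
});
```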

Page 43: Data Mining for Bioinformatics

Next Time
• Applications of
  • Clustering
  • Neural Networks
  • Support Vector Machines
  • Etc.
• Available tools to use, etc.