Data Mining for Bioinformatics
TRANSCRIPT
Overview
• Survey of KDD for Bioinformatics
  • KDD overview
  • Bioinformatics data
  • Survey of KDD steps
• Case Study: miRNA Project
  • Identifying the problem
  • Data collection with Perl
  • Selection/cleansing
  • Future work…
• Next Time
Knowledge Discovery in Databases
[Figure: the KDD process. Data -> cleaning/integration -> data warehouse -> selection/transformation -> prepared data -> data mining -> patterns -> evaluation/visualization -> knowledge, with a knowledge base.]
Bioinformatics Data
• DNA Sequences
• Genes
  • Location, introns, exons, function, etc.
• Gene products
  • RNA, Proteins
• Pathways
  • Signaling, metabolic, genomic, etc.
Bioinformatics Data
• Experimental
  • Gene expression, knockouts, etc.
• Literature
  • Diseases, viruses, bacteria
  • Organisms
  • Textbooks
• Expert knowledge
  • Unpublished
  • Insights
  • Etc.
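KDD for Bioinformatics
[Figure: the KDD process applied to bioinformatics. Data sources: genomic, literature, experimental. Cleaning/integration: normalization, curation, validation, etc. Selection/transformation: sampling, expressed genes, homologs, etc. Data mining: clustering, SVMs, ILP, classification, etc. Results: patterns, evaluation/visualization, knowledge, expert knowledge. Annotation on the diagram: "Often not explicitly implemented".]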
Data Collection and Cleansing
• Perl scripts (BioPerl)
• From literature
  • Read a paper and enter the information
  • Supplemental data for papers
• Public databases
  • GenBank
  • Stanford Microarray Database
  • SWISS-Prot
  • Etc.
Data Cleansing
• Remove invalid, redundant, or otherwise useless data
• Extrapolate missing data values
• Data formatting/transformation
  • Binning, normalization, scaling, etc.
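As a rough illustration of the formatting/transformation step, the sketch below applies min-max scaling followed by equal-width binning to a list of expression values, using only core Perl. The function names `min_max_scale` and `equal_width_bin`, the sample intensities, and the choice of three bins are assumptions made for this example; they do not come from the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Rescale values linearly into [0, 1]; a constant list maps to all zeros.
sub min_max_scale {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    my $range = $hi - $lo;
    my @scaled;
    for my $v (@values) {
        push @scaled, $range == 0 ? 0 : ($v - $lo) / $range;
    }
    return @scaled;
}

# Assign each value in [0, 1] to one of $bins equal-width bins (0 .. $bins - 1).
sub equal_width_bin {
    my ($bins, @scaled) = @_;
    my @binned;
    for my $v (@scaled) {
        my $idx = int($v * $bins);
        $idx = $bins - 1 if $idx >= $bins;   # the maximum falls in the last bin
        push @binned, $idx;
    }
    return @binned;
}

my @expression = (120, 480, 1035, 2400, 300);   # made-up intensities
my @scaled     = min_max_scale(@expression);
my @binned     = equal_width_bin(3, @scaled);

for my $i (0 .. $#expression) {
    printf "%6d  %.3f  bin %d\n", $expression[$i], $scaled[$i], $binned[$i];
}
```

Equal-width bins are the simplest choice; heavily skewed data such as raw microarray intensities may be better served by a log transform before scaling.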
Data Selection
• Database queries for specific genes, organisms, sequences, etc.
• Statistical analysis (microarray)
• Random sampling
• Etc.
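Random sampling, one of the selection methods listed above, might be sketched as follows. The helper name `random_sample` and the accession list are assumptions for illustration; `shuffle` is the core `List::Util` function.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw up to $k items from @items uniformly at random, without replacement.
sub random_sample {
    my ($k, @items) = @_;
    $k = @items if $k > @items;      # cannot draw more items than exist
    return () if $k <= 0;
    my @shuffled = shuffle @items;
    return @shuffled[0 .. $k - 1];
}

# Hypothetical pool of accession numbers to sample from
my @accessions = qw(AJ421749 AJ421776 AJ421775 AJ421774 AJ421773);
my @subset     = random_sample(3, @accessions);
print "sampled: @subset\n";
```

Because the draw is random, the printed subset differs from run to run.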
Data Mining Techniques
• Statistical
  • Principal Component Analysis
  • ANOVA
  • Outlier analysis
  • Discrimination
  • Some clustering techniques (K-Means)
Data Mining Techniques
• Machine Learning
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
  • Inductive Logic Programming
  • Fuzzy Logic
  • Rough Sets
  • Bayesian Belief Networks
Data Mining Techniques
• More Techniques
  • Clustering
  • Self Organizing Maps
  • Hidden Markov Models
  • Maximum Likelihood Estimators
  • Association Rules
  • …
Kinds of Techniques
• Unsupervised
  • Technique makes no assumption about a priori knowledge
  • Useful when not much is known
• Supervised
  • Attach class labels to data items
  • Identify (or learn about) properties that distinguish classes
Kinds of Techniques
• Unsupervised
  • Clustering
  • SOMs
• Supervised
  • Support Vector Machines
  • Neural Networks
  • Bayesian Belief Networks
  • HMMs
Kinds of Techniques
• Supervised techniques require training
  • Data split into training and test sets
  • Many kinds of validation
    • N-way cross validation
    • Leave one out testing
    • Etc…
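The training/test split behind N-way cross validation can be sketched in core Perl as below. Items are dealt into folds round-robin so the result is deterministic; in practice the items would usually be shuffled first. The function names `make_folds` and `cross_validation_splits` and the sample identifiers are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Deal @items into $n folds round-robin; returns one array reference per fold.
sub make_folds {
    my ($n, @items) = @_;
    die "fold count must be between 1 and the number of items\n"
        if $n < 1 || $n > @items;
    my @folds;
    push @folds, [] for 1 .. $n;
    for my $i (0 .. $#items) {
        push @{ $folds[$i % $n] }, $items[$i];
    }
    return @folds;
}

# For each fold, hold it out as the test set and train on all the others.
sub cross_validation_splits {
    my ($n, @items) = @_;
    my @folds = make_folds($n, @items);
    my @splits;
    for my $held_out (0 .. $#folds) {
        my @train;
        for my $f (0 .. $#folds) {
            push @train, @{ $folds[$f] } unless $f == $held_out;
        }
        push @splits, { train => \@train, test => [ @{ $folds[$held_out] } ] };
    }
    return @splits;
}

my @samples = qw(s1 s2 s3 s4 s5 s6);   # hypothetical sample identifiers
for my $split (cross_validation_splits(3, @samples)) {
    print "train: @{ $split->{train} }  test: @{ $split->{test} }\n";
}

# Leave-one-out testing is the special case with one fold per item.
my @loo = cross_validation_splits(scalar @samples, @samples);
print scalar(@loo), " leave-one-out splits\n";
```

Each split would then be used to train a model on `train` and score it on `test`, with the scores averaged across splits.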
Visualization of Results
• Graphs/Charts
• Rules
  • If expression of X < 1035, then tissue is cancerous
• Largely dependent on the technique used
Case Study: miRNA Project
• Started January 2002
• Participants
  • Dr. Craig Struble
  • Dr. Stephen Munroe
  • Dr. John Simms
  • Parthav Jailwala
  • Peigang Li
• http://bistro.mscs.mu.edu/miRNA
Case Study: miRNA Project
• Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864 (2001).
• Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853-858 (2001).
• Hutvágner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838 (2001).
• Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-862 (2001).
Research Questions
• Can we identify features of existing miRNAs that can be used to predict the existence of other miRNA genes?
• Which mRNAs (messenger RNAs) are targeted by miRNAs?
• What other family-wide behavioral and structural questions can be answered about miRNAs?
Current Implementation
[Figure: pipeline diagram. GenBank -> Perl script -> miRNA library -> Perl script -> BLAST reports -> Perl script -> homolog library -> multiple sequence alignment. Stage labels: data warehouse; data selection/cleansing; initial mining and cleansing.]
Perl
• Practical Extraction and Report Language
• Language of choice for many bioinformaticians
• Excellent support for parsing/transforming data
• http://www.perl.com
Data Collection with Perl
E.g., using Entrez
Data Collection with Perl
Construct a URL to search and access information in Entrez
Data Collection with Perl
• Use LWP module
  • Makes network connections easy
• Use BioPerl (http://www.bioperl.org)
  • Perl modules/objects for handling bioinformatics data
  • Handles connections to databases
Sample Perl Script
#!/usr/local/bin/perl
#
# Simple Entrez Query in Perl
# Craig A. Struble
#
use strict;
use warnings;
# For internet requests and protocols
use LWP;
# A user agent for testing
my $ua = LWP::UserAgent->new;
# The trailing space makes LWP append its own default agent string
$ua->agent('miRNA/0.1 ');
# URL base for Entrez search
my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';
Script (cont.)
# Building up the URL for the Entrez search
my $search_URL = $NCBI_ENTREZ     # URL base
    . 'cmd=Search'                # Command
    . '&db=nucleotide'            # Database
    . '&dispmax=100'              # Max results
    . '&term=miRNA'               # Search term
    . '&doptcmdl=FASTA';          # Result format
# Make an HTTP GET request for an Entrez search
my $req = HTTP::Request->new(GET => $search_URL);
$req->push_header(Connection => 'Keep-Alive');
# Get the response
my $res = $ua->request($req);
Script (cont.)
# Check the response. If it's OK, print out the content
if ($res->is_success) {
    print $res->content;
} else {
    print $res->error_as_HTML;
    exit 1;
}
Sample Result
<input name="showndispmax" type="hidden" value="100"><input name="page" type="hi
dden" value="0"></table></td></tr>
</table><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><inp
ut name="uid" type="checkbox" value="17646034"><b>1: </b>AJ421749. Homo sapiens
micr...[gi:17646034]</td>
<td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cm
d=Display&dopt=nucleotide_pubmed&from_uid=17646034">PubMed, </a></SPAN>
<SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cmd=Display&dopt
=nucleotide_taxonomy&from_uid=17646034">Taxonomy</a></SPAN>
</td>
</tr></table></dt></dl><pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens m
icroRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
</pre><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><input
name="uid" type="checkbox" value="17646061"><b>2: </b>AJ421776. Drosophila mela
no...[gi:17646061]</td>
Parsing Result
• Result is a big, ugly HTML file
• Need to take out data in <pre> tags
• Fortunately, Perl can come to the rescue!
Parsing Result with Perl
#!/usr/local/bin/perl
use strict;
use warnings;
# Use an HTML parser
use HTML::TreeBuilder;
# Extract out FASTA entries for each file on the command line
foreach my $file_name (@ARGV) {
    # Build an HTML parse tree
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);
    # FASTA entries are in PRE tags
    my @entries = $tree->find_by_tag_name('pre');
    # Print out each entry
    foreach my $entry (@entries) {
        my @children = $entry->content_list;
        print $children[0] . "\n"; # first child is text content
    }
    # Free the parse tree once the file has been processed
    $tree->delete;
}
Processed Results
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
TATCACAGCCATTTTGACGAGT
>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
TATCACAGCCATTTTGACGAGT
>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
TATCACAGCCATTTTGATGAGT
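Once the FASTA entries have been extracted, later steps need them as structured records. The sketch below reads FASTA text into header/sequence pairs using only core Perl; BioPerl's `Bio::SeqIO` offers the same capability. The function name `read_fasta` is an assumption, and the two records are copied from the processed results above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA records from a filehandle.
# Returns a list of hash references with 'header' and 'sequence' keys.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;               # tolerate Windows line endings
        next if $line =~ /^\s*$/;       # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, { header => $1, sequence => '' };
        } elsif (@records) {
            $records[-1]{sequence} .= $line;   # sequences may span lines
        }
    }
    return @records;
}

my $fasta = <<'END';
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
END

# Open the string as an in-memory file so the reader can be exercised
# without touching the disk.
open my $fh, '<', \$fasta or die "cannot open in-memory FASTA: $!";
my @records = read_fasta($fh);
close $fh;

for my $rec (@records) {
    # Header fields are separated by '|': gi, GI number, database, accession
    my (undef, $gi, undef, $acc) = split /\|/, $rec->{header};
    printf "%s\tgi=%s\tlength=%d\n", $acc, $gi, length $rec->{sequence};
}
```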
Getting BLAST Reports
• Can automate getting BLAST reports with Perl
• URL format documentation is available at http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
• Perl code not displayed
Parsing BLAST Reports
• Use BioPerl Bio::Tools::BPlite
• Find high scoring pairs that contain surrounding sequence
• BLAST also reports original sequence hits
• Extract out matching sequence with up and downstream surrounding sequence
Perl Script
#!/usr/local/bin/perl
#
# Create homolog database from BLAST reports
# Author: Craig A. Struble
# Various BioPerl modules to use
use Bio::Tools::BPlite;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;
Script (cont.)
###############################################################################
# Function: rev_comp
# Description: Calculates the reverse complement of a DNA sequence.
###############################################################################
sub rev_comp {
    my @seqs;
    foreach my $seq (@_) {
        # Work on a copy: elements of @_ alias the caller's variables
        my $rc = reverse $seq;
        $rc =~ tr/AaCcTtGg/TtGgAaCc/;
        push @seqs, $rc;
    }
    # wantarray checks whether we were called in list context
    return wantarray ? @seqs : $seqs[0];
}
Script (cont.)
###############################################################################
# Function: around_seq
# Description: Returns the upstream and downstream sequence around an HSP
# Parameters: hsp - the high scoring pair
#             seq - the sequence of reference
#             upstream - number of basepairs upstream
#             downstream - number of basepairs downstream
###############################################################################
sub around_seq {
    my ($hsp, $seq, $upstream, $downstream) = @_;
    # Code deleted due to space
    return $subseq;
}
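The body of around_seq is not shown in the slides. As a rough idea of the coordinate arithmetic such a function needs, here is a hypothetical stand-alone sketch that works on plain strings rather than BioPerl objects. The name `flank_window`, its parameter order, and its conventions (1-based inclusive plus-strand coordinates, strand given as 1 or -1, windows clamped to the sequence ends) are all assumptions, not the author's implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of a DNA string (the caller's string is not modified).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Return the hit at $start..$end plus flanking bases, read on the hit's strand.
sub flank_window {
    my ($sequence, $start, $end, $strand, $upstream, $downstream) = @_;
    ($start, $end) = ($end, $start) if $start > $end;

    # On the minus strand, "upstream" lies at higher plus-strand coordinates.
    my ($left, $right) = $strand < 0
        ? ($downstream, $upstream)
        : ($upstream, $downstream);

    # Clamp the window to the ends of the sequence.
    my $from = $start - $left;
    $from = 1 if $from < 1;
    my $to = $end + $right;
    $to = length($sequence) if $to > length($sequence);

    my $window = substr($sequence, $from - 1, $to - $from + 1);
    return $strand < 0 ? reverse_complement($window) : $window;
}

my $genome = 'AAACCCGGGTTT';   # toy sequence; the "hit" is CCC at 4..6
print flank_window($genome, 4, 6,  1, 2, 3), "\n";   # plus-strand window
print flank_window($genome, 4, 6, -1, 2, 3), "\n";   # minus-strand window
```

A BioPerl version would take the coordinates and strand from the HSP's subject and the bases from the sequence object instead of plain scalars.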
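Script (cont.)
# Expect: BLAST report, output FASTA file, upstream and downstream lengths
die "usage: $0 blast_report out.fasta upstream downstream\n" unless @ARGV == 4;
# Open the BLAST report
open(my $blast_fh, '<', $ARGV[0]) or die "open failed for $ARGV[0]: $!";
my $report = Bio::Tools::BPlite->new(-fh => $blast_fh);
my $gb = Bio::DB::GenBank->new;
# Open output file
my $out = Bio::SeqIO->new('-file' => ">$ARGV[1]", '-format' => 'fasta');
# Amount up and downstream to get
my $upstream = $ARGV[2];
my $downstream = $ARGV[3];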
Script (cont.)
while (my $sbjct = $report->nextSbjct) {
    # Split the subject name on '|' or space:
    # database, versioned accession, accession, rest
    my ($db, $accv, $acc, $rest) = split /\|| /, $sbjct->name;
    my $seq = $gb->get_Seq_by_acc($acc);
    print $seq->accession_number . "\n";
    while (my $hsp = $sbjct->nextHSP) {
        my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
        # display_id encodes accession, hit range, and strand
        my $subseq = Bio::Seq->new(
            '-seq'              => $seqstr,
            '-accession_number' => $seq->accession_number,
            '-display_id'       => $seq->accession_number . "_"
                                 . $hsp->subject->start . ".."
                                 . $hsp->subject->end . "_"
                                 . $hsp->subject->strand,
        );
        $out->write_seq($subseq);
    }
}
Results
>AC084471_10966..10987_-1
TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT
TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG
TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC
AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA
>AF274345_1763..1784_1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
>Z70203_12425..12446_-1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
Multiple Sequence Alignment
• Currently using clustalw/clustalx
• Eventually generate web pages with sequence alignments
• Investigate conserved regions of the surrounding sequence
Multiple Sequence Alignment
Future Work
• Process homolog library with RNA fold prediction software (mFold)
• Collect together fold structure information and other information
• Transform into logical representation for ILP analysis
• Store data in a database (Postgres)
Next Time
• Applications of
  • Clustering
  • Neural Networks
  • Support Vector Machines
  • Etc.
• Available tools to use, etc.