Data Mining for Bioinformatics

Craig A. Struble, Ph.D.
Marquette University

Overview

• Survey of KDD for Bioinformatics
  • KDD overview
  • Bioinformatics data
  • Survey of KDD steps
• Case Study: miRNA Project
  • Identifying the problem
  • Data collection with Perl
  • Selection/cleansing
  • Future work…
• Next Time


Knowledge Discovery in Databases

[Figure: the KDD process. Data → (cleaning, integration) → Data Warehouse → (selection, transformation) → Prepared data → (data mining) → Patterns → (evaluation, visualization) → Knowledge → Knowledge Base]


Bioinformatics Data

• DNA Sequences
• Genes
  • Location, introns, exons, function, etc.
• Gene products
  • RNA, Proteins
• Pathways
  • Signaling, metabolic, genomic, etc.


Bioinformatics Data

• Experimental
  • Gene expression, knockouts, etc.
• Literature
  • Diseases, viruses, bacteria
• Organisms
• Textbooks
  • Expert knowledge
• Unpublished Insights
• Etc.


KDD for Bioinformatics

[Figure: the KDD process applied to bioinformatics. Genomic, literature, and experimental data → (normalization, curation, validation, etc.) → Data Warehouse → (sampling, expressed genes, homologs, etc.) → Prepared data → (clustering, SVMs, ILP, classification, etc.) → Patterns → (evaluation, visualization) → Knowledge / Expert Knowledge. Annotation on the figure: "Often not explicitly implemented".]


Data Collection and Cleansing

• Perl scripts (BioPerl)
• From literature
  • Read a paper and enter the information
  • Supplemental data for papers
• Public databases
  • GenBank
  • Stanford Microarray Database
  • SWISS-Prot
  • Etc.


Data Cleansing

• Remove invalid, redundant, or otherwise useless data
• Extrapolate missing data values
• Data formatting/transformation
  • Binning, normalization, scaling, etc.
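As a rough illustration of the formatting/transformation step, the sketch below rescales a list of values to the range [0, 1] and then assigns each one to an equal-width bin. The function names (min_max_normalize, bin_index), the choice of four bins, and the expression values are all assumptions made for this example; they do not come from the miRNA project.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Rescale values linearly to the range [0, 1] (min-max normalization).
sub min_max_normalize {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    return (0) x @values if $hi == $lo;    # constant input: avoid dividing by zero
    return map { ($_ - $lo) / ($hi - $lo) } @values;
}

# Assign a value in [0, 1] to one of $bins equal-width bins, numbered from 0.
sub bin_index {
    my ($value, $bins) = @_;
    my $index = int($value * $bins);
    return $index >= $bins ? $bins - 1 : $index;    # 1.0 goes into the last bin
}

# Hypothetical expression measurements, for illustration only.
my @expression = (120, 480, 1035, 2200, 4120);
my @scaled     = min_max_normalize(@expression);
my @binned     = map { bin_index($_, 4) } @scaled;

printf "%6d -> %.3f (bin %d)\n", $expression[$_], $scaled[$_], $binned[$_]
    for 0 .. $#expression;
```

Min-max scaling is only one option; other schemes (z-scores, log transforms) may suit microarray data better, depending on the analysis.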


Data Selection

• Database queries for specific genes, organisms, sequences, etc.
• Statistical analysis (microarray)
• Random sampling
• Etc.
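Random sampling can be sketched in a few lines with the core List::Util module. The helper name random_sample is an assumption for this example; the accession numbers are the ones that appear in the processed Entrez results later in the talk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return $n items chosen uniformly at random, without replacement.
sub random_sample {
    my ($n, @items) = @_;
    $n = @items if $n > @items;    # cannot draw more items than exist
    my @shuffled = shuffle(@items);
    return @shuffled[0 .. $n - 1];
}

my @accessions = qw(AJ421749 AJ421776 AJ421775 AJ421774 AJ421773);
my @sample     = random_sample(3, @accessions);
print "Sampled: @sample\n";
```

Shuffling the whole list is simple but not the cheapest approach; for very large data sets a streaming method such as reservoir sampling may be preferable.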


Data Mining Techniques

• Statistical
  • Principal Component Analysis
  • ANOVA
  • Outlier analysis
  • Discrimination
  • Some clustering techniques (K-Means)


Data Mining Techniques

• Machine Learning
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
  • Inductive Logic Programming
  • Fuzzy Logic
  • Rough Sets
  • Bayesian Belief Networks


Data Mining Techniques

• More Techniques
  • Clustering
  • Self Organizing Maps
  • Hidden Markov Models
  • Maximum Likelihood Estimators
  • Association Rules
  • …


Kinds of Techniques

• Unsupervised
  • Technique makes no assumption about a priori knowledge
  • Useful when not much is known
• Supervised
  • Attach class labels to data items
  • Identify (or learn about) properties that distinguish classes


Kinds of Techniques

• Unsupervised
  • Clustering
  • SOMs
• Supervised
  • Support Vector Machines
  • Neural Networks
  • Bayesian Belief Networks
  • HMMs


Kinds of Techniques

• Supervised techniques require training
  • Data split into training and test sets
  • Many kinds of validation
    • N-way cross validation
    • Leave one out testing
    • Etc…
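The training/test split behind N-way cross validation can be sketched as follows. The helper make_folds and the sample names are assumptions for this example. Each fold serves once as the test set while the remaining folds form the training set; setting the number of folds equal to the number of items gives leave-one-out testing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Partition items into $k folds of near-equal size after shuffling.
# Returns a list of array references, one per fold.
sub make_folds {
    my ($k, @items) = @_;
    my @shuffled = shuffle(@items);
    my @folds    = map { [] } 1 .. $k;
    push @{ $folds[$_ % $k] }, $shuffled[$_] for 0 .. $#shuffled;
    return @folds;
}

# Hypothetical labelled items, for illustration only.
my @items = map { "sample$_" } 1 .. 10;
my @folds = make_folds(5, @items);

for my $i (0 .. $#folds) {
    my @test  = @{ $folds[$i] };
    my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $#folds;
    printf "fold %d: train on %d items, test on %d items\n",
        $i + 1, scalar @train, scalar @test;
    # A real study would fit the model on @train and score it on @test here.
}
```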


Visualization of Results

• Graphs/Charts
• Rules
  • If expression of X < 1035, then tissue is cancerous
• Largely dependent on the technique used
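A mined rule of this kind is easy to apply to new data, which is part of its appeal as an output format. The sketch below encodes the example rule from the slide; the function name classify_tissue and the measurements are assumptions. Since the rule says nothing about values at or above the threshold, the code reports only that the rule does not fire in that case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Threshold taken from the example rule: expression of X < 1035 => cancerous.
my $THRESHOLD = 1035;

# Apply the rule to a single expression measurement of gene X.
sub classify_tissue {
    my ($expression_of_x) = @_;
    return $expression_of_x < $THRESHOLD ? 'cancerous' : 'rule does not fire';
}

# Hypothetical measurements, for illustration only.
my %samples = (tissue_a => 870, tissue_b => 1035, tissue_c => 2210);
printf "%s: %s\n", $_, classify_tissue($samples{$_}) for sort keys %samples;
```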


Case Study: miRNA Project

• Started January 2002
• Participants
  • Dr. Craig Struble
  • Dr. Stephen Munroe
  • Dr. John Simms
  • Parthav Jailwala
  • Peigang Li
• http://bistro.mscs.mu.edu/miRNA


Case Study: miRNA Project

• Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864 (2001).
• Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853-858 (2001).
• Hutvágner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838 (2001).
• Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-862 (2001).


Research Questions

• Can we identify features of existing miRNAs that can be used to predict the existence of other miRNA genes?
• Which mRNAs (messenger RNAs) are targeted by miRNAs?
• What other family-wide behavioral and structural questions can be answered about miRNAs?


Current Implementation

[Figure: current pipeline. GenBank → Perl script → miRNA library → Perl script → BLAST reports → Perl script → homolog library → multiple sequence alignment. Stage labels on the figure: data warehouse; data selection/cleansing; initial mining and cleansing.]


Perl

• Practical Extraction and Report Language
• Language of choice for many bioinformaticians
• Excellent support for parsing/transforming data
• http://www.perl.com


Data Collection with Perl

E.g., using Entrez


Data Collection with Perl

Construct a URL to search and access information in Entrez


Data Collection with Perl

• Use LWP module
  • Makes network connections easy
• Use BioPerl (http://www.bioperl.org)
  • Perl modules/objects for handling bioinformatics data
  • Handles connections to databases


Sample Perl Script

#!/usr/local/bin/perl
#
# Simple Entrez Query in Perl
# Craig A. Struble
#
use strict;
use warnings;

# For internet requests and protocols
use LWP;

# A user agent for testing
my $ua = LWP::UserAgent->new;
$ua->agent('miRNA/0.1 ');

# URL base for Entrez search
my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';


Script (cont.)

# Building up the URL for the Entrez Search
my $search_URL = $NCBI_ENTREZ          # URL Base
               . 'cmd=Search'          # Command
               . '&db=nucleotide'      # Database
               . '&dispmax=100'        # Max results
               . '&term=miRNA'         # Search term
               . '&doptcmdl=FASTA';    # Result format

# Make an HTTP GET request for an Entrez search
my $req = HTTP::Request->new(GET => $search_URL);
$req->push_header(Connection => 'Keep-Alive');

# Get the response
my $res = $ua->request($req);


Script (cont.)

# Check the response. If it's OK, print out the content
if ($res->is_success) {
    print $res->content;
} else {
    print $res->error_as_HTML;
    exit 1;
}


Sample Result

<input name="showndispmax" type="hidden" value="100"><input name="page" type="hi

dden" value="0"></table></td></tr>

</table><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><inp

ut name="uid" type="checkbox" value="17646034"><b>1: </b>AJ421749. Homo sapiens

micr...[gi:17646034]</td>

<td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cm

d=Display&dopt=nucleotide_pubmed&from_uid=17646034">PubMed, </a></SPAN>

<SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cmd=Display&dopt

=nucleotide_taxonomy&from_uid=17646034">Taxonomy</a></SPAN>

</td>

</tr></table></dt></dl><pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens m

icroRNA miR-27

TTCACAGTGGCTAAGTTCCGCT

</pre><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><input

name="uid" type="checkbox" value="17646061"><b>2: </b>AJ421776. Drosophila mela

no...[gi:17646061]</td>


Parsing Result

• Result is a big, ugly HTML file
• Need to take out data in <pre> tags
• Fortunately, Perl can come to the rescue!


Parsing Result with Perl

#!/usr/local/bin/perl
use strict;
use warnings;

# Use an HTML parser
use HTML::TreeBuilder;

# Extract out FASTA entries for each file on the command line
foreach my $file_name (@ARGV) {
    # Build an HTML Parse Tree
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);

    # FASTA entries are in PRE tags
    my @entries = $tree->find_by_tag_name('pre');

    # Print out each entry
    foreach my $entry (@entries) {
        my @children = $entry->content_list;
        print $children[0] . "\n";    # first child is text content
    }

    # Free the parse tree, which contains circular references
    $tree->delete;
}


Processed Results

>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27

TTCACAGTGGCTAAGTTCCGCT

>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14

TCAGTCTTTTTCTCTCTCCTA

>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2

TATCACAGCCATTTTGACGAGT

>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1

TATCACAGCCATTTTGACGAGT

>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a

TATCACAGCCATTTTGATGAGT
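Once the entries are in plain FASTA form, they are straightforward to process further. As a sketch of the "remove redundant data" cleansing step, the code below parses FASTA text and groups headers by identical sequence, using three of the Drosophila entries listed above. The helper parse_fasta is an assumption written for this example (BioPerl's Bio::SeqIO would be the more usual choice), and whether identical sequences from different entries should really be merged is a judgment left to the analyst.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into a list of [header, sequence] pairs.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        $line =~ s/\s+$//;                  # drop trailing whitespace
        next if $line eq '';
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];        # start a new record
        } elsif (@records) {
            $records[-1][1] .= uc $line;    # append sequence data
        }
    }
    return @records;
}

# Three entries copied from the processed results.
my $fasta = <<'END';
>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
TATCACAGCCATTTTGACGAGT
>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
TATCACAGCCATTTTGACGAGT
>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
TATCACAGCCATTTTGATGAGT
END

# Group headers by sequence so that identical sequences become visible.
my %by_sequence;
push @{ $by_sequence{ $_->[1] } }, $_->[0] for parse_fasta($fasta);

for my $seq (sort keys %by_sequence) {
    my @headers = @{ $by_sequence{$seq} };
    printf "%s occurs in %d record(s)\n", $seq, scalar @headers;
}
```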


Getting BLAST Reports

• Can automate getting BLAST reports with Perl
• URL format documentation is available at http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
• Perl code not displayed


Parsing BLAST Reports

• Use BioPerl Bio::Tools::BPlite
• Find high scoring pairs that contain surrounding sequence
• BLAST also reports original sequence hits
• Extract out matching sequence with up and downstream surrounding sequence
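The idea of extracting a hit together with its surrounding sequence can be illustrated on plain strings. The function flank_window below is a hypothetical helper written for this example and is not the project's own code. It assumes 1-based inclusive coordinates on the forward strand and clamps the window to the ends of the sequence; minus-strand hits, where the result would also need to be reverse complemented, are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the hit at 1-based inclusive coordinates [$start, $end] together with
# up to $upstream bases before it and $downstream bases after it, clamped to
# the ends of the sequence.
sub flank_window {
    my ($sequence, $start, $end, $upstream, $downstream) = @_;
    my $from = $start - $upstream;
    $from = 1 if $from < 1;
    my $to = $end + $downstream;
    $to = length($sequence) if $to > length($sequence);
    return substr($sequence, $from - 1, $to - $from + 1);
}

# Hypothetical subject sequence and hit coordinates, for illustration only.
my $subject = 'AAAACCCCGGGGTTTT';
print flank_window($subject, 5, 8, 2, 3),  "\n";    # hit CCCC plus both flanks
print flank_window($subject, 1, 4, 10, 0), "\n";    # upstream clamped at the start
```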


Perl Script

#!/usr/local/bin/perl
#
# Create homolog database from BLAST reports
# Author: Craig A. Struble

# Various BioPerl modules to use
use Bio::Tools::BPlite;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;


Script (cont.)

###############################################################################
# Function: rev_comp
# Description: Calculates the reverse complement of a DNA sequence.
###############################################################################
sub rev_comp {
    my @seqs;
    # Work on a copy: the elements of @_ are aliases of the caller's arguments
    foreach my $seq (@_) {
        my $rc = reverse $seq;
        $rc =~ tr/AaCcTtGg/TtGgAaCc/;
        push @seqs, $rc;
    }
    # wantarray checks whether we were called in list context
    return wantarray ? @seqs : $seqs[0];
}


Script (cont.)

###############################################################################
# Function: around_seq
# Description: Returns the upstream and downstream sequence around an HSP
# Parameters: hsp - the high scoring pair
#             seq - the sequence of reference
#             upstream - number of basepairs upstream
#             downstream - number of basepairs downstream
###############################################################################
sub around_seq {
    my ($hsp, $seq, $upstream, $downstream) = @_;
    # Code deleted due to space
    return $subseq;
}


Script (cont.)

# Open the BLAST report
open(my $blast_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my $report = Bio::Tools::BPlite->new(-fh => $blast_fh);
my $gb = Bio::DB::GenBank->new;

# Open output file
my $out = Bio::SeqIO->new('-file' => ">$ARGV[1]", '-format' => 'fasta');

# Amount up and downstream to get
my $upstream   = $ARGV[2];
my $downstream = $ARGV[3];


Script (cont.)

while (my $sbjct = $report->nextSbjct) {
    my ($db, $accv, $acc, $rest) = split /\|| /, $sbjct->name;
    my $seq = $gb->get_Seq_by_acc($acc);
    print $seq->accession_number . "\n";

    while (my $hsp = $sbjct->nextHSP) {
        my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
        my $subseq = Bio::Seq->new(
            '-seq'              => $seqstr,
            '-accession_number' => $seq->accession_number,
            '-display_id'       => $seq->accession_number . "_"
                                 . $hsp->subject->start . ".."
                                 . $hsp->subject->end . "_"
                                 . $hsp->subject->strand,
        );
        $out->write_seq($subseq);
    }
}


Results

>AC084471_10966..10987_-1

TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT

TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG

TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC

AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA

>AF274345_1763..1784_1

CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG

CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG

TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT

CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT

>Z70203_12425..12446_-1

CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG

CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG

TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT

CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT


Multiple Sequence Alignment

• Currently using clustalw/clustalx
• Eventually generate web pages with sequence alignments
• Investigate conserved regions of the surrounding sequence


Multiple Sequence Alignment


Future Work

• Process homolog library with RNA fold prediction software (mFold)
• Collect together fold structure information and other information
• Transform into logical representation for ILP analysis
• Store data in a database (Postgres)


Next Time

• Applications of
  • Clustering
  • Neural Networks
  • Support Vector Machines
  • Etc.
• Available tools to use, etc.
