hmm applications in computational biologyepxing/class/10701/slides/compbio15.pdf · hmm...

36
HMM applications in computational biology 10-701 Machine Learning

Upload: others

Post on 08-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

HMM applications in computational biology

10-701

Machine Learning

Page 2: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

2

Central dogma

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

Page 3: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Biological data is rapidly accumulating

DNA

RNA

transcription

translation

Proteins

Transcription factors Next generation sequencing

Page 4: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA
Page 5: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA
Page 6: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Biological data is rapidly accumulating

DNA

RNA

transcription

translation

Proteins

Transcription factors Array / sequencing technology

Page 7: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Biological data is rapidly accumulating

DNA

RNA

transcription

translation

Proteins

Transcription factors Protein interactions

• 38,000 identified interactions • Hundreds of thousands of predictions

Page 8: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

8

Page 9: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

FDA Approves Gene-Based Breast Cancer Test*

“ MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”

*Washington Post, 2/06/2007

Page 10: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

10

Page 11: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Active Learning

11

Page 12: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Sequencing DNA

Due to accumulated errors, we could only reliably read at most 300-500 nucleotides.

First human genome draft in 2001

Page 13: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Shotgun Sequencing

Wikipedia

Page 14: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Caveats

• Errors in reading • Non-trivial assembly task: repeats in the genome

MacCallum et al., GB 2009

Page 15: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Error Correction in DNA sequencing • The fragmentation happens at random locations of the molecules. We expect all positions in the genome to have the same # number of reads K-mers = substrings of length K of the reads. Errors create error k-mers.

Kellly et al., GB 2010

Page 16: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Transcriptome Shotgun Sequencing (RNA-Seq)

Sequencing RNA molecule transcripts. Reminder: • (mRNA) Transcripts are “expression products” of genes. • Different genes having different expression levels so some

transcripts are more or less abundant than others.

@Friedrich Miescher Laboratory

Page 17: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Challenges

• Large datasets: 10-100 millions reads of 75-150 bps. • Memory efficiency: Too time consuming to perform out-

memory processing of data. DNA Sequencing + others : alternative splicing, RNA editing, post-transcription modification.

Page 18: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

• Some transcripts are more prone to errors • Errors are harder to correct in reads from lowly expressed transcripts

Errors are non uniformly distributed

Page 19: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

SEECER Error Correction + Consensus sequence estimation for RNA-Seq data

Page 20: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Key idea: HMM model

The way sequencers work: • Read letter by letter sequentially • Possible errors: Insertion , Deletion or Misread of a nucleotide

Salmela et al., Bioinformatics 2011

Page 21: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA
Page 22: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Building (Learning) the HMMs and Making Corrections (Inference)

Learning = Expectation-Maximization Inference = Viterbi algorithm

Seeding: Guessing possible reads using k-mer overlaps. Constructing the HMM from these reads. Speed up: The k-mer overlaps yield approximate multiple alignments of reads. We can learn HMM parameters from this directly.

Page 23: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Clustering to improve seeding

Real biological differences should be supported by a set of reads with similar mismatches to the consensus

Page 24: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

1. Clustering positions with mismatches to identify clusters of correlated positions.

2. Build a similarity matrix between these positions.

3. Use Spectral clustering to find clusters of correlated positions.

4. Filter reads have mismatches in these clusters.

Page 25: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Comparison to other methods

Page 26: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Using the corrected reads, the assembler can

recover more transcripts

Page 27: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Analysis of sea cucumber data

B

Page 28: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Data integration in biology

Page 29: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Key problem: Most high-throughput data is static

Sequencing

motif CHIP-chip

PPI microarray

Static data sources Time-series measurements

Time

Page 30: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

DREM: Dynamic Regulatory Events Miner

Page 31: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

TF C

time

Expression Level

Model Structure

time

1

0.1

0.9

1

0.95

0.05

Expression Level

Time Series Expression Data Static TF-DNA Binding Data

IOHMM Model

TF A

TF B

TF D

?

?

a b

c d

Page 32: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Things are a bit more complicated: Real data

Page 33: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

A Hidden Markov Model

T

t

tt

n

i

T

t

tt iHiHpiHiOpOHL2

1

1 1

))(|)(())(|)(();,(

Hidden States

Observed outputs (expression levels)

t=0 t=1 t=2 t=3

H0 H1 H2 H3

O0 O1 O2 O3

Schliep et al Bioinformatics 2003

1

Page 34: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

Sum over all genes

Sum over all paths Q

Product over all Gaussian emission density values on path

Product over all transition probabilities on path

Input – Output Hidden Markov Model

Input (Static TF-gene interactions)

Hidden States (transitions between states form a tree structure)

Emissions (Distribution of expression values)

Ig

t=0 t=1 t=2 t=3

H0 H1 H2 H3

O0 O1 O2 O3

Log Likelihood

Page 35: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

1 2

3 4

5

6 7 8

9

E. coli. response

PLoS Comp. Bio. 2008

Nature MSB 2011

IRF7

Fly development Science 2010

Genome Research 2010, PLoS ONE 2011

Mouse Immune response

Stem cells differentiation

Page 36: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA

• Approximate learning to speed up on large datasets.

• In real world, one technique is not enough. A solution involves using

many techniques.

• Precision and Recall are trade-offs.

Things that work