Modelling genomes
Gil McVean, Department of Statistics, Oxford
Why would we want to model a genome?
• To identify genes
– Protein-coding
– RNA
– Small RNAs
• To identify regulatory elements
– Transcription factor binding sites
– Enhancers
• To classify genome content
– Repeat DNA
– Unique sequence
• To understand the processes that shape genomes
– Mutation
– Recombination
– Duplication
– Rearrangement
– Natural selection
A rather simple model for a protein-coding gene
[Figure: a state diagram with hidden states Non-coding DNA → Start codon → Codons → Termination → Non-coding DNA, each labelled with its emission probabilities. Non-coding DNA emits T: 30%, C: 20%, A: 30%, G: 20%; the start codon emits ATG: 100%; each coding position emits one of the 61 sense codons (AAA, AAC, AAG, …, TTT) with probability 1/61 each; the termination state emits TAA: 30%, TAG: 40%, TGA: 30%.]
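To make the toy model concrete, here is a minimal sketch in Python of its states and emission tables. The emission values come from the figure; the transition probabilities are my own invented placeholders, since the slide does not give them.

```python
from itertools import product

# Hidden states of the toy gene model, as in the figure above.
STATES = ["noncoding", "start", "codon", "stop", "noncoding2"]

# Emission distributions (values taken from the figure).
BASES = {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}
STOP_CODONS = {"TAA": 0.3, "TAG": 0.4, "TGA": 0.3}
SENSE_CODONS = {"".join(c): 1.0 / 61
                for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS}

EMISSIONS = {
    "noncoding": BASES,        # emits single bases
    "start": {"ATG": 1.0},     # emits the start codon
    "codon": SENSE_CODONS,     # emits one of the 61 sense codons
    "stop": STOP_CODONS,       # emits a termination codon
    "noncoding2": BASES,
}

# Transition probabilities are NOT given on the slide; the values below are
# invented purely to complete the HMM (e.g. a 1% per-base chance of starting
# a gene and a 2% per-codon chance of terminating).
TRANSITIONS = {
    "noncoding": {"noncoding": 0.99, "start": 0.01},
    "start": {"codon": 1.0},
    "codon": {"codon": 0.98, "stop": 0.02},
    "stop": {"noncoding2": 1.0},
    "noncoding2": {"noncoding2": 1.0},
}
```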
A ‘genome’ model is like any other statistical model
[Figure: the modelling cycle: define model → explore properties → estimate parameters from data → test goodness-of-fit → refine model, and back to the start.]
Hidden Markov Models in bioinformatics
• The model of a gene just described can be thought of as a hidden Markov model (HMM)
– The underlying states evolve in a Markov fashion, but we observe features (the DNA sequence) emitted by those states
• You will remember that there are lots of nice computational properties of hidden Markov models that we can use for inference
– Finding a most likely sequence of states
– Calculating posterior probabilities of a given state at a given position
• There are also various algorithms we can use to estimate parameters of HMMs (e.g. ML estimation by EM)
• How would you use the model of a gene to find new genes?
– How well do you think it would do?
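One answer, sketched below, is to run the Viterbi algorithm along a sequence and read off the stretches assigned to the coding state. The code uses a stripped-down two-state version of the model (non-coding vs coding, one base per step) with made-up transition and coding-composition values, so it only illustrates the mechanics rather than a realistic gene finder.

```python
import math

# Stripped-down two-state gene finder: 'N' = non-coding, 'C' = coding.
# Only the non-coding base composition comes from the toy model above;
# the other numbers are invented for illustration.
STATES = ["N", "C"]
START = {"N": 0.95, "C": 0.05}
TRANS = {"N": {"N": 0.95, "C": 0.05},
         "C": {"N": 0.05, "C": 0.95}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}  # coding assumed GC-rich

def viterbi(seq):
    """Most probable state path for seq under the toy model (log space)."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            score, prev = max((v[-1][p] + math.log(TRANS[p][s]), p) for p in STATES)
            col[s] = score + math.log(EMIT[s][x])
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

# With these made-up parameters the GC-rich middle is labelled as coding:
print(viterbi("ATATATGCGCGCGGCCATAT"))  # -> NNNNNNCCCCCCCCCCNNNN
```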
Making useful HMMs in bioinformatics
• To be useful, HMMs for genes have to incorporate many features
– Regulatory sequences
– Intron-splicing features
– Correlations and biases in amino acid and base composition
• A REALLY important feature to capture is their evolution
– Important parts of genes and genomes evolve more slowly because of constraint
Searching for homology
• If we compare human and chimpanzee sequences they are approximately 98.8% identical at the DNA level. It is ‘easy’ to identify which parts of the genome in humans correspond to which parts in chimps
• If we compare human with, say, mouse, we can see some parts that are similar, and other parts where there is only vague or even no obvious similarity.
• When measuring evolution, we need to identify regions that are homologous
– Homology means similarity by descent
• Traditionally, the problem of identifying homology has been intrinsically linked to the problem of alignment
Alignment of PFEMP1 proteins from P. falciparum
The simplest problem: aligning two sequences
• Suppose we have just two protein sequences that we want to align
• In evolution, three types of event can happen
– Mutation to new amino acids
– Insertion of new amino acids
– Deletion of amino acids
• We want to work out which amino acids in the two sequences are homologous – i.e. related to each other through shared ancestry
The two sequences: WAKIS and WEEKS

W-AKIS
WEEK-S

What do the ‘-’s really mean?
How can we construct an alignment algorithm?
• What we want to do is to look at every possible alignment and choose the one that is ‘best’
• What we have to do is to find an efficient algorithm that can search every possible alignment and that has an objective measure as to what ‘best’ means
• A natural approach is to make a model of alignments, parameterise it and find the alignment that maximises the likelihood
• Although the problem sounds hard we can solve it using a hidden Markov model structure
How does it work?
• Suppose residues Xi and Yj are aligned to each other
• Three things could happen next
– The next two residues in each sequence could also align (A)
– A gap could be introduced in sequence X (B)
– A gap could be introduced in sequence Y (C)
• We can parameterise the probabilities of each event
[Figure: the three continuations. (A) Xi+1 aligned to Yj+1; (B) a gap in sequence X aligned to Yj+1; (C) Xi+1 aligned to a gap in sequence Y.]
The full algorithm
• We need to consider similar transitions for the cases when residue Xi is aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi
• We need to specify various probabilities
– The probability of opening a gap
– The probability of extending a gap
– The probability of finishing the alignment
– The probability of observing an aligned pair of residues (a 20x20 table)
– The probability of observing a residue aligned to a gap (a table of 20)
• Once specified, we can use the Viterbi and Forward/Backward algorithms to identify ML alignments, sample from the posterior or calculate posterior probabilities
[Figure: the two gap states: a run of residues Xi-a … Xi aligned to gaps after Yj, and a run of residues Yj-a … Yj aligned to gaps after Xi.]
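To make the parameter list concrete, here is an illustrative sketch (my own, with made-up values, not taken from the lecture) of how the three-state pair HMM could be specified: a transition matrix over the states M (aligned pair), X (residue of X against a gap) and Y (residue of Y against a gap), built from a gap-opening probability delta, a gap-extension probability eps and a termination probability tau, plus the 20x20 and length-20 emission tables.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 residues

# Illustrative parameter values (not the ones used in the later example).
delta = 0.05   # probability of opening a gap
eps = 0.4      # probability of extending a gap
tau = 0.01     # probability of finishing the alignment

# Transition matrix over the states M, X, Y; each row sums to 1 - tau,
# the remainder being the transition to the End state.
STATES = ["M", "X", "Y"]
TRANS = np.array([
    [1 - 2 * delta - tau, delta, delta],   # from M
    [1 - eps - tau,       eps,   0.0  ],   # from X
    [1 - eps - tau,       0.0,   eps  ],   # from Y
])

# Emissions: a 20x20 table for an aligned pair (state M) and a length-20
# table for a residue aligned to a gap (states X and Y). The pair table
# here is a crude placeholder: 0.03 on the diagonal, uniform elsewhere.
pair_emit = np.full((20, 20), (1.0 - 20 * 0.03) / (20 * 19))
np.fill_diagonal(pair_emit, 0.03)
gap_emit = np.full(20, 1.0 / 20)

assert np.isclose(pair_emit.sum(), 1.0) and np.isclose(gap_emit.sum(), 1.0)
```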
The forward algorithm
[Figure: states H and D at position i feeding into state H at position i+1, which emits Xi+1.]
fH(i+1) = eH(Xi+1) [ fH(i) qHH + fD(i) qDH ]
Transition probabilities = qij
Emission probabilities = ek(Xi+1)
In alignment the state space is two-dimensional (residue i aligned to residue j)
The Viterbi algorithm
[Figure: states H and D at positions i-1, i and i+1, showing the transitions over which the maximum is taken.]
vk(i+1) = ek(Xi+1) maxj [ vj(i) qjk ]
A traceback matrix is used to keep track of the best partial alignments
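To show how the recursion and the traceback matrix fit together, here is a compact, self-contained sketch (my own code, not from the lecture) of the Viterbi algorithm for the three-state pair HMM. Termination is ignored, as in the example on the next slide, and the default parameters are the ones used there.

```python
import math

NEG_INF = float("-inf")

def pair_viterbi(x, y, delta=0.2, eps=0.5, p_match=0.8, p_mis=0.2 / 19, q=0.05):
    """ML alignment of x and y under the pair HMM (termination ignored)."""
    log = math.log
    n, m = len(x), len(y)
    # Transition log probabilities between the states M, X and Y.
    t = {("M", "M"): log(1 - 2 * delta), ("M", "X"): log(delta), ("M", "Y"): log(delta),
         ("X", "M"): log(1 - eps), ("X", "X"): log(eps), ("X", "Y"): NEG_INF,
         ("Y", "M"): log(1 - eps), ("Y", "X"): NEG_INF, ("Y", "Y"): log(eps)}
    # V[s][i][j]: best log probability of aligning x[:i] with y[:j], ending in s.
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    V["M"][0][0] = 0.0  # start in the match state

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # M: x[i-1] aligned to y[j-1]
                e = log(p_match if x[i - 1] == y[j - 1] else p_mis)
                best, prev = max((V[k][i - 1][j - 1] + t[(k, "M")], k) for k in "MXY")
                V["M"][i][j], ptr["M"][i][j] = best + e, prev
            if i > 0:             # X: x[i-1] aligned to a gap
                best, prev = max((V[k][i - 1][j] + t[(k, "X")], k) for k in "MXY")
                V["X"][i][j], ptr["X"][i][j] = best + log(q), prev
            if j > 0:             # Y: y[j-1] aligned to a gap
                best, prev = max((V[k][i][j - 1] + t[(k, "Y")], k) for k in "MXY")
                V["Y"][i][j], ptr["Y"][i][j] = best + log(q), prev

    # Traceback through the pointer (traceback) matrix.
    state = max("MXY", key=lambda s: V[s][n][m])
    score, ax, ay, i, j = V[state][n][m], [], [], n, m
    while (i, j) != (0, 0):
        prev = ptr[state][i][j]
        if state == "M":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif state == "X":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        state = prev
    return "".join(reversed(ax)), "".join(reversed(ay)), score

ax, ay, loglik = pair_viterbi("WAKIS", "WEEKS")
print(ax); print(ay); print(round(loglik, 2))
```

Run on the example sequences, this should return one of the tied maximum-likelihood alignments shown on the next slide, with a log-likelihood of about -16.84 (which of the tied alignments comes out depends on how ties are broken).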
An example
• Suppose the gap opening and extension parameters are 0.2 and 0.5 respectively. There is an 80% chance of observing a match, a 20/19% (about 1.05%) chance of observing any given mismatch, and a 5% chance of observing each unaligned amino acid. (We can ignore termination for the moment.)
• The best alignments are given below, each of which has a log-likelihood of -16.84, or 31% of the total likelihood (total log-likelihood = -15.67; exp(-16.84 + 15.67) ≈ 0.31).
• In many real situations, the best alignment represents only a fraction of the total likelihood
W-AKIS
WEEK-S

WA-KIS
WEEK-S
Posterior decoding
• Using the forward-backward algorithm we can calculate the posterior probability that any residue is aligned to any other, or that a given residue is in a gap state
[Figure: two heat maps of the posterior probability that residue Xi is aligned to residue Yj, for X1 to X5 against Y1 to Y5; the second panel shows the same probabilities conditional on X2 being aligned to Y3.]
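The forward-backward computation itself is easiest to see on a simple linear HMM. The sketch below (my own illustration, reusing the made-up two-state gene model from earlier) computes per-position posterior state probabilities; in the pair HMM the same calculation runs over the two-dimensional (i, j) grid, giving alignment posteriors like those pictured above.

```python
import math

# Same made-up two-state model as in the earlier Viterbi sketch.
STATES = ["N", "C"]
START = {"N": 0.95, "C": 0.05}
TRANS = {"N": {"N": 0.95, "C": 0.05}, "C": {"N": 0.05, "C": 0.95}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def posteriors(seq):
    """P(state at each position | whole sequence), by forward-backward."""
    n = len(seq)
    fwd = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        fwd.append({s: EMIT[s][x] * sum(fwd[-1][p] * TRANS[p][s] for p in STATES)
                    for s in STATES})
    bwd = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        bwd.insert(0, {s: sum(TRANS[s][r] * EMIT[r][x] * bwd[0][r] for r in STATES)
                       for s in STATES})
    total = sum(fwd[-1][s] for s in STATES)   # the full likelihood of the data
    return [{s: fwd[i][s] * bwd[i][s] / total for s in STATES} for i in range(n)]

for i, p in enumerate(posteriors("ATATGCGCGCATAT")):
    print(i, round(p["C"], 3))   # posterior probability of 'coding' at each site
```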
Extending the method
• Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh 1982) were not explicitly defined as hidden Markov models
– Finite-state automata (FSA)
• There have been many extensions to the original idea
– Local alignment
– Repeat alignment
– Protein family identification
– Gene finding
– Multiple alignment
• The alignment algorithm is very much a workhorse of bioinformatics, as an alignment is needed for almost all subsequent analyses (e.g. phylogenetic tree reconstruction, population genetic inference)
– However, relying on a single alignment is not always a great idea
Doing away with alignment
• For most problems, the alignment is not of primary interest
• The natural thing to do is to integrate over alignments (as in the FB algorithm) to estimate parameters of interest
• The key problem is that there is no computationally efficient algorithm for statistical multiple alignment. All widely-used methods use heuristic approaches
Gene conversion and var gene diversity in P. falciparum
• Multiple alignment methods typically assume the sequences are related to each other through an evolutionary tree
• For the case of multi-gene families, this may not be the case, because gene conversion between copies can lead to mosaic structures
• If we wish to learn about the processes of conversion, a natural approach is to model the mosaicism
– In the case of var genes, the sequences are so diverged that we also need to consider the problem of alignment
Mosaic alignment
• We could model the (n+1)th sequence as a mosaic of the previous n
• We can calculate the likelihood of observing a given sequence by summing over all possible mosaic structures and their alignment
• We can also identify the most likely mosaic structure and calculate the expected number of recombination events
– Repeating the procedure for all sequences provides a way of assessing the importance of mosaicism within the family
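As an indication of what such a model can look like, here is a much-simplified sketch (my own, and one that ignores the alignment problem by assuming all sequences have the same length): the hidden state at each position is the donor sequence being copied, a per-site switch probability stands in for recombination/conversion, and a per-site mismatch rate allows point differences. The forward recursion sums over all mosaic structures.

```python
import math

def mosaic_loglik(target, donors, rho=0.01, theta=0.02):
    """Forward log-likelihood of `target` as a mosaic of the `donors`.

    Hidden state at position i = which donor is being copied there.
    rho   = per-site probability of switching donor (recombination/conversion),
    theta = per-site copying-error (mismatch) rate. Sequences are assumed to
    be the same length, i.e. the alignment problem is ignored here.
    """
    k = len(donors)
    emit = lambda d, i: (1 - theta) if d[i] == target[i] else theta / 3  # ACGT alphabet
    # Forward recursion over donors.
    f = [emit(d, 0) / k for d in donors]
    for i in range(1, len(target)):
        total = sum(f)
        f = [(f[j] * (1 - rho) + total * rho / k) * emit(d, i)
             for j, d in enumerate(donors)]
    return math.log(sum(f))

donors = ["ACGTACGTAC", "TTGTACGAAC", "ACGGGGGTAC"]
target = "ACGTACGAAC"   # looks like donor 0 up to a switch to donor 1 near the end
print(mosaic_loglik(target, donors))
```

The most likely mosaic structure and the expected number of switch events could be obtained from the same recursion using its Viterbi and forward-backward counterparts.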
Extensive mosaicism within the var family