5 open problems in bioinformatics pedigrees from genomes comparative genomics of alternative...

Post on 13-Dec-2015


5 Open Problems in Bioinformatics

•Pedigrees from Genomes

•Comparative Genomics of Alternative Splicing

•Viral Annotation

•Evolving Turing Patterns

•Protein Structure Evolution

Three Processes

1. Recombination

2. Choosing Parents

3. The Mutational Process

[Figure: pedigree process; coalescent recombination process; sequence/individual boundary. From Yun Song.]

From genomes to pedigrees

Probability of Data given a pedigree.

Elston-Stewart (1971) - Temporal Peeling Algorithm:

[Figure: Mother, Father]

Condition on parental states

Recombination and mutation are Markovian

Lander-Green (1987) - Genotype Scanning Algorithm:

[Figure: Mother, Father]

Condition on paternal/maternal inheritance

Recombination and mutation are Markovian
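The Lander-Green idea of conditioning on paternal/maternal inheritance can be sketched as a forward pass of a hidden Markov model along the chromosome. This is a minimal sketch, not the published implementation: the hidden state is an inheritance vector with one bit per meiosis, each bit flips between adjacent loci with the recombination fraction, and the per-locus emission probabilities (probability of the observed genotypes given the inheritance vector) are assumed to be supplied by the caller. The function names are hypothetical.

```python
from itertools import product


def inheritance_vectors(n_meioses):
    """All 2^n inheritance vectors: one bit per meiosis (grand-paternal or grand-maternal)."""
    return list(product((0, 1), repeat=n_meioses))


def transition_prob(v, w, theta):
    """Probability of moving from vector v to w when each bit flips independently with prob. theta."""
    flips = sum(a != b for a, b in zip(v, w))
    return theta ** flips * (1 - theta) ** (len(v) - flips)


def lander_green_likelihood(emissions, thetas, n_meioses):
    """Forward algorithm along the chromosome.

    emissions: one dict per locus, mapping inheritance vector -> P(data at locus | vector)
               (assumed to be precomputed elsewhere).
    thetas:    recombination fractions between consecutive loci.
    """
    vectors = inheritance_vectors(n_meioses)
    prior = 1.0 / len(vectors)  # uniform prior over inheritance vectors
    forward = {v: prior * emissions[0][v] for v in vectors}
    for locus in range(1, len(emissions)):
        theta = thetas[locus - 1]
        forward = {
            w: emissions[locus][w]
            * sum(forward[v] * transition_prob(v, w, theta) for v in vectors)
            for w in vectors
        }
    return sum(forward.values())
```

The state space grows as 2^n in the number of meioses but the cost is linear in the number of loci, which is the usual trade-off against peeling over the pedigree.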

Comment: Obvious parallel to Wiuf-Hein99 reformulation of Hudson’s 1983 algorithm

Genomes with recombination rate and mutation rate --> infinity

•Counting mutations within a small interval would reveal the length of the path connecting the two segments.

•Siblings are readily revealed, since they will have segments with 2 density of mutations

•The distribution of path lengths is readily observable between two sequences

•All embedded phylogenies are observable

Benevolent Mutation and Recombination Process

From Phylogenies to Pedigrees

Mike’s counterexample, linkage and individuals

[Figure: Pedigree 1 and Pedigree 2, each showing grandparents, Individual 1 and Individual 2]

Different Pedigrees, Same Phylogenies

Gluing Phylogenies together

?

Sibling Sequences come from different parents.

A recombinant’s parents are sister sequences.

Comparative Genomics of Alternative Splicing

From Transcripts to the AS-Graph

EES S

1. How well known is the AS-graph as a function of the number of transcripts?

2. Given a family and distribution of transcripts, can they be explained by an AS-graph with probabilities at donor sites, or do we need probabilities for (donor, acceptor) pairs, or possibly even more complicated situations? And is sampling transcripts good enough to distinguish these situations?

Mini-project: reliability of AS-detection.

Choose Idealized AS-Graph:

1. Genome

2. Choose donor and acceptor sites in random pairs.

3. For each possible splice pair assign probability for choosing it.

This should define a probability for all transcripts.

• Generate a set of transcripts.

• Reconstruct AS-Graph.

Key questions:

1. How many transcripts must be sampled to detect AS?

2. How well will the AS-Graph be recovered?
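The mini-project can be prototyped in a few lines. The following is a minimal sketch under stated assumptions: the idealized AS-graph below is a made-up toy (nodes `S`, `E1` to `E4`, `T` and the edge probabilities are hypothetical), each edge stands for one (donor, acceptor) splice pair, and "reconstruction" simply means collecting every splice pair seen in at least one sampled transcript.

```python
import random

# Hypothetical idealized AS-graph: node -> list of (next node, probability of that splice pair).
TRUE_AS_GRAPH = {
    "S": [("E1", 1.0)],
    "E1": [("E2", 0.7), ("E3", 0.3)],
    "E2": [("E3", 0.8), ("E4", 0.2)],
    "E3": [("E4", 1.0)],
    "E4": [("T", 1.0)],
}


def true_edges(graph):
    """All splice pairs present in the idealized graph."""
    return {(donor, acc) for donor, succ in graph.items() for acc, _ in succ}


def sample_transcript(graph, rng, start="S", end="T"):
    """Random walk from start to end; the visited nodes form one transcript."""
    path = [start]
    while path[-1] != end:
        nodes = [node for node, _ in graph[path[-1]]]
        weights = [prob for _, prob in graph[path[-1]]]
        path.append(rng.choices(nodes, weights=weights)[0])
    return path


def reconstruct_edges(transcripts):
    """Observed AS-graph: every consecutive pair seen in some transcript."""
    return {(a, b) for t in transcripts for a, b in zip(t, t[1:])}


def recovery_curve(graph, sample_sizes, seed=0):
    """Fraction of true splice pairs recovered for each number of sampled transcripts."""
    rng = random.Random(seed)
    total = true_edges(graph)
    curve = {}
    for n in sample_sizes:
        transcripts = [sample_transcript(graph, rng) for _ in range(n)]
        curve[n] = len(reconstruct_edges(transcripts) & total) / len(total)
    return curve
```

Calling `recovery_curve(TRUE_AS_GRAPH, [1, 10, 100, 1000])` gives one answer to both key questions for this toy graph; AS is detected as soon as some node has two observed outgoing edges.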

Optimal DAG (directed acyclic graph) under restrictions

Optimal Paths:

Sub-optimal Paths:

Finding a set of annotations:

1. Find set of paths, maximizing sum of scores.

2. The score of minimal path must be above threshold.

Two paths must differ significantly: within the area they enclose, the maximal height must be d higher than the boundary defining it. Height(i,j) = di,j + di,j

1. Do known AS genes have more CTO structure than non-AS genes?

2. Does the AS correspond to the CTO structure?

3. Is the CTO structure evolutionary conserved?

Phylogenetically related ASGs

1. Is ASG conserved?

2. What is conserved?

3. How is selection along position dependent on splicing status?

EESS EESS EESS

http://www.tulane.edu/~dmsander/WWW/335/Diarrhoea.html

http://www.tulane.edu/~dmsander/WWW/335/Papovaviruses.html

http://www.tulane.edu/~dmsander/WWW/335/Retroviruses.html

Virus Annotation

Classes of Gene Structures

Retroviridae Arrangements Papovaviridae Arrangement

Diarrhoea Causing Arrangements

Illustrating the 3 main classes of gene structures: Unidirectional, Convergent and Divergent.

The Problems of Viral Annotation

•HMM gene structure generator (McCauley)

•Gene Structure Evolution (de Groot)

•Alignment (Caldeira, Lunter, Rocco)

•Recombination (Lyngsø, Song)

•Multiple constraints: RNA secondary structure, gene conservation, binding/transcriptional instructional sites.

Our 8 State HMM which allows for Unidirectional overlapping gene structures

HMM States

• Non-coding

• Coding RF1

• Coding RF2

• Coding RF3

• Coding RF1,2

• Coding RF1,3

• Coding RF2,3

• Coding RF1,2,3
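The eight states are exactly the subsets of the three reading frames: the empty subset is non-coding, and every non-empty subset is a set of frames that are coding at the same position. A minimal sketch that enumerates them (the function name is hypothetical, and nothing here models transitions or emissions):

```python
from itertools import combinations


def overlapping_gene_states(frames=(1, 2, 3)):
    """One state per subset of reading frames that are coding at a position."""
    states = []
    for size in range(len(frames) + 1):
        for subset in combinations(frames, size):
            if subset:
                states.append("Coding RF" + ",".join(str(f) for f in subset))
            else:
                states.append("Non-coding")
    return states
```

With three frames this gives 2^3 = 8 states, in the same order as the list above.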

Combining Levels of Selection.

Protein-Protein

Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001

Contagious Dependence

Assume multiplicativity: fA,B = fA*fB

Protein-RNA

Doublets

Singlet

Contagious Dependence

HIV2 Genome  SS HMM Sensitivity  PHMM Sensitivity  Del Sensitivity  SS HMM Specificity  PHMM Specificity  Del Specificity
M30502       0.8913              0.9765            0.0852           0.9878              0.9753            -0.0125
J04542       0.8458              0.9173            0.0714           1.0000              0.9956            -0.0044
D00835       0.8796              0.9432            0.0636           0.9920              0.9733            -0.0187
M15390       0.9310              0.9971            0.0661           1.0000              0.9869            -0.0131
J03654       0.8261              0.9971            0.1709           1.0000              0.9865            -0.0135
AY509259     0.8697              0.9256            0.0559           1.0000              0.9886            -0.0114
AY509260     0.8257              0.9101            0.0844           0.9928              0.9792            -0.0136
J04498       0.8961              0.9737            0.0776           1.0000              0.9911            -0.0089
AF082339     0.9074              0.9650            0.0577           0.9842              0.9773            -0.0069
U22047       0.9028              0.9874            0.0847           0.9865              0.9744            -0.0121
U27200       0.8769              0.9453            0.0684           0.9928              0.9748            -0.0180
LO7625       0.8340              0.9680            0.1340           1.0000              0.9607            -0.0393
L36874       0.8653              0.9957            0.1303           0.9980              0.9766            -0.0214
MEANS        0.8732              0.9617            0.0885           0.9949              0.9800            -0.0149

Table illustrating the performance benefit in Sensitivity we obtain utilizing a Phylogenetic HMM. We extend the HMM model to include evolutionary information from 13 aligned HIV2 sequences. Del is the PHMM value minus the SS HMM value.
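As a quick consistency check, the MEANS row for sensitivity can be re-derived from the thirteen per-genome values in the table. This sketch only copies the two sensitivity columns; the dictionary and function names are hypothetical.

```python
from statistics import mean

# (SS HMM sensitivity, PHMM sensitivity) per HIV2 genome, copied from the table.
SENSITIVITY = {
    "M30502": (0.8913, 0.9765),
    "J04542": (0.8458, 0.9173),
    "D00835": (0.8796, 0.9432),
    "M15390": (0.9310, 0.9971),
    "J03654": (0.8261, 0.9971),
    "AY509259": (0.8697, 0.9256),
    "AY509260": (0.8257, 0.9101),
    "J04498": (0.8961, 0.9737),
    "AF082339": (0.9074, 0.9650),
    "U22047": (0.9028, 0.9874),
    "U27200": (0.8769, 0.9453),
    "LO7625": (0.8340, 0.9680),
    "L36874": (0.8653, 0.9957),
}


def sensitivity_means(table):
    """Mean SS HMM sensitivity, mean PHMM sensitivity, and their difference."""
    ss_mean = mean(ss for ss, _ in table.values())
    phmm_mean = mean(phmm for _, phmm in table.values())
    return ss_mean, phmm_mean, phmm_mean - ss_mean
```

Rounded to four decimals, the three returned values agree with the MEANS row (0.8732, 0.9617, 0.0885).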

http://www.ncbi.nlm.nih.gov/Genbank/

http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html

Entrez Genomes currently contains 2120 Reference Sequences for 1510 viral genomes and 36 Reference Sequences for viroids.

Properties of overlapping genes are conserved across microbial genomes. Genome Res. 2004 Nov;14(11):2268-72.

GenBank: Centralized resource for publicly available viral sequence data.

Within microbial genomes, one third of annotated genes contain some degree of overlap, and one third of these are either Convergent or Divergent.

Krakauer, D.C. Stability and evolution of overlapping genes. Evolution 54: 731-739 (2000)

Genome Res. 2004 Nov;14(11):2268-72.

General preponderance of overlapping gene structures is roughly a 90:9:1 ratio split across Unidirectional, Convergent and Divergent arrangements.
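The three arrangements can be told apart from gene coordinates and strands alone. A minimal sketch, assuming each gene is given as a hypothetical `(start, end, strand)` tuple with `start <= end` and strand `"+"` or `"-"`, and ordering the two genes by start coordinate:

```python
def classify_overlap(gene_a, gene_b):
    """Classify two genes as Unidirectional, Convergent or Divergent.

    Returns None when the genes do not overlap.
    """
    first, second = sorted([gene_a, gene_b], key=lambda gene: gene[0])
    if second[0] > first[1]:
        return None  # the second gene starts after the first one ends
    if first[2] == second[2]:
        return "Unidirectional"  # both genes on the same strand
    # Opposite strands: "+" then "-" point towards each other (3' ends overlap),
    # "-" then "+" point away from each other (5' ends overlap).
    return "Convergent" if first[2] == "+" else "Divergent"
```

Nested genes on opposite strands are forced into one of the two classes by this ordering rule, which is a simplification.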

Turing Patterns

From Maini’s Home Page: http://www.maths.ox.ac.uk/~maini

Mathematical models to understand biological patterns

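$$
\begin{cases}
\dfrac{\partial u(x,t)}{\partial t} = D_u \nabla^2 u + f(u, v, \{p_u\}) \\
\dfrac{\partial v(x,t)}{\partial t} = D_v \nabla^2 v + g(u, v, \{p_v\})
\end{cases}
$$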

Turing Model

[From: Leppanen et al. Dimensionality effects in Turing pattern formation, Int. J. Mod. Phys. B 17, 5541-5553 (2003)]

Different parameters lead to different patterns

$$
\begin{cases}
u_t = D_u \nabla^2 u + \gamma (u + a v - u v^2 - p u v) \\
v_t = D_v \nabla^2 v + \gamma (b v + h u + u v^2 + p u v)
\end{cases}
$$

Stripes: p small Spots: p large
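Whether a two-species reaction-diffusion system forms a pattern at all can be checked with a standard linear stability analysis. The sketch below is a generic illustration, not taken from the slides: it assumes the system has been linearised around a homogeneous steady state with reaction Jacobian `jac`, and it scans squared wave numbers for a mode that grows although the state is stable without diffusion. All names, and the numerical values in the usage note, are hypothetical.

```python
import math


def growth_rate(k2, jac, d_u, d_v):
    """Largest real part of the eigenvalues of J - k^2 * diag(d_u, d_v)."""
    (f_u, f_v), (g_u, g_v) = jac
    a11 = f_u - d_u * k2
    a22 = g_v - d_v * k2
    trace = a11 + a22
    det = a11 * a22 - f_v * g_u
    disc = trace * trace - 4 * det
    if disc >= 0:
        return (trace + math.sqrt(disc)) / 2
    return trace / 2  # complex pair: real part is half the trace


def has_turing_instability(jac, d_u, d_v, k2_max=10.0, steps=2000):
    """True if the state is stable without diffusion but some spatial mode grows."""
    stable_without_diffusion = growth_rate(0.0, jac, d_u, d_v) < 0
    some_mode_grows = any(
        growth_rate(i * k2_max / steps, jac, d_u, d_v) > 0
        for i in range(1, steps + 1)
    )
    return stable_without_diffusion and some_mode_grows
```

For the example Jacobian `((1, 3), (-1, -2))` with `d_u = 1`, the check succeeds for `d_v = 10` and fails for `d_v = 5`, illustrating that the inhibitor must diffuse sufficiently faster than the activator. Which pattern (stripes or spots) is selected depends on the nonlinear terms and is not captured by this linear check.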

3 suggestions

1. Networks and Turing Patterns

2. Stochastic Partial Differential Equations

3. Phylogenetically related Turing Patterns

Evolutionary Models of Protein Structure Evolution

Known Known Unknown

-globin Myoglobin

300 amino acid changes

800 nucleotide changes

1 structural change

1.4 Gyr

?

?

?

?

1. Given Structure what are the possible events that could happen?

2. What are their probabilities? Old fashioned substitution + indel process with bias.

Bias: Folding (Sequence → Structure) & Fitness of Structure

3. Summation over all paths.

2 suggestions

Folding (Sequence → Structure)

As a first approximation similar structures should be compared and the problem could be solved by comparative modelling.

Fitness of Structure – such functions are commonplace in guiding prediction programs.

Fast Homology Modelling

B. MCMC

A. Structure (Homology Modelling, Topology)

Using Protein Topology as Hidden Variable

Questions to be asked

Protein Structure Analysis is much harder than Sequence Analysis. Much of the first hand impression will remain: “Structures are either trivially similar or highly dissimilar” – the middle ground is empty.

At Gyr scale other rearrangements occur.

Test of smooth/catastrophic structure evolution

Separation of analogous/homologous similarities

Protein Evolution in General

How closely linked are homologous and structurally equivalent sites?

Negative Note:

Positive Note: If it works

http://www.biochem.ucl.ac.uk/bsm/cath/

http://scop.mrc-lmb.cam.ac.uk/scop/

Summary

Pedigrees from Genomes

Do infinite genomes determine pedigrees?

How many pedigrees are there?

Comparative Genomics of Alternative Splicing

How well do you know the ASG?

How do you measure selection on the ASG?

Viral Annotation

How well can you annotate viruses from observed evolution?

Evolving Turing Patterns

Turing Patterns and Networks

Stochastic Turing Patterns

Phylogenetically Related Turing Patterns

Protein Structure Evolution

Full Model of Structure Evolution

Model of Protein Topology Evolution
