5 Open Problems in Bioinformatics Pedigrees from Genomes Comparative Genomics of Alternative Splicing Viral Annotation Evolving Turing Patterns

Upload: mildred-beverly-carroll

Post on 13-Dec-2015


TRANSCRIPT

Page 1:

5 Open Problems in Bioinformatics

•Pedigrees from Genomes

•Comparative Genomics of Alternative Splicing

•Viral Annotation

•Evolving Turing Patterns

•Protein Structure Evolution

Page 2:

Three Processes

1. Recombination

2. Choosing Parents

3. The Mutational Process

[Figure labels: Pedigree process; Coalescent recombination process; Sequence/Individual boundary. From Yun Song]

From genomes to pedigrees

Page 3:

Probability of Data given a pedigree.

Elston-Stewart (1971) - Temporal Peeling Algorithm: condition on parental states (mother, father). Recombination and mutation are Markovian.

Lander-Green (1987) - Genotype Scanning Algorithm: condition on paternal/maternal inheritance. Recombination and mutation are Markovian.

Comment: Obvious parallel to the Wiuf-Hein (1999) reformulation of Hudson’s 1983 algorithm
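The idea of conditioning on parental states can be illustrated with a minimal sketch. It is not the Elston-Stewart algorithm itself: it assumes a single biallelic locus, no mutation or recombination, unobserved parents with Hardy-Weinberg priors, and a hypothetical allele frequency `q`. All function names are illustrative only.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # copies of the alternative allele at one biallelic locus


def founder_prior(g, q):
    """Hardy-Weinberg prior for a founder genotype; q is the alt-allele frequency."""
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[g]


def transmit(g):
    """Probability that a parent with genotype g passes on the alt allele."""
    return g / 2


def child_given_parents(c, gm, gf):
    """Mendelian probability P(child = c | mother = gm, father = gf)."""
    tm, tf = transmit(gm), transmit(gf)
    return ((1 - tm) * (1 - tf), tm * (1 - tf) + (1 - tm) * tf, tm * tf)[c]


def sibship_likelihood(children, q):
    """P(children's genotypes) with both parents unobserved.

    Children are independent once the parental states are fixed, so each
    term of the sum over parents is a product over children.
    """
    total = 0.0
    for gm, gf in product(GENOTYPES, repeat=2):
        term = founder_prior(gm, q) * founder_prior(gf, q)
        for c in children:
            term *= child_given_parents(c, gm, gf)
        total += term
    return total
```

For example, `sibship_likelihood([2, 2], 0.3)` is larger than `sibship_likelihood([2], 0.3) ** 2`, which reflects the correlation that shared parents induce between siblings.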

Page 4:

Genomes with recombination rate and mutation rate --> infinity

•Counting mutations within a small interval would reveal the length of the path connecting the two segments.

•Siblings are readily revealed, since they will have segments with 2 density of mutations

•The distribution of path lengths is readily observable between two sequences

•All embedded phylogenies are observable
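The first bullet can be made concrete with a small arithmetic sketch. It assumes, as a simplification not stated on the slide, that mutations accumulate independently at a rate `mu` per site per meiosis, so the expected count in an interval is proportional to the number of meioses on the connecting path. The function names and the numbers in the example are hypothetical.

```python
def expected_mutations(path_length, interval_length, mu):
    """Expected number of differences between two segments joined by a path
    of `path_length` meioses, over `interval_length` sites."""
    return path_length * interval_length * mu


def estimate_path_length(n_mutations, interval_length, mu):
    """Moment estimate of the path length from an observed mutation count."""
    return n_mutations / (interval_length * mu)


# Hypothetical numbers: 2000 differences over 1,000,000 sites at mu = 1e-3.
estimate = estimate_path_length(2000, 1_000_000, 1e-3)
print(round(estimate))
```

As the rates grow, the count in even a small interval becomes large, so the estimate concentrates around the true path length. This is the sense in which the process is "benevolent".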

Benevolent Mutation and Recombination Process

Page 5:

From Phylogenies to Pedigrees

Mike’s counter example, linkage and individuals

[Figure: Pedigree 1 and Pedigree 2, each showing the grandparents of Individual 1 and Individual 2, with sequences labelled 1 and 2]

Different Pedigrees, Same Phylogenies

Gluing Phylogenies together

?

Sibling Sequences come from different parents.

A recombinant’s parents are sister sequences.

Page 6:

Comparative Genomics of Alternative Splicing

Page 7:

From Transcripts to the AS-Graph

[Figure: AS-graph, nodes labelled E and S]

1. How well known is the AS-graph as a function of the number of transcripts?

2. Given a family and distribution of transcripts, can they be explained by an AS-graph with probabilities at donor sites, or do we need probabilities for (donor, acceptor) pairs, or possibly even more complicated situations? And is sampling transcripts good enough to distinguish these situations?

Page 8:

Mini-project: reliability of AS-detection.

Choose Idealized AS-Graph:

1. Genome

2. Choose donor and acceptor sites in random pairs.

3. For each possible splice pair assign probability for choosing it.

This should define a probability for all transcripts.

• Generate a set of transcripts.

• Reconstruct AS-Graph.

Key questions:

1. How many transcripts must be sampled to detect AS?

2. How well will the AS-Graph be recovered?
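The mini-project can be prototyped along the following lines. This is a sketch under stated assumptions, not a prescribed design: the toy graph `AS_GRAPH`, its exon names and its probabilities are invented for illustration, transcripts are modelled as random walks from `start` to `end`, and "reconstruction" simply collects every splice edge seen in the sample.

```python
import random

# Hypothetical idealized AS-graph: node -> list of (next node, probability).
AS_GRAPH = {
    "start": [("exon1", 1.0)],
    "exon1": [("exon2", 0.9), ("exon3", 0.1)],  # exon2 is skipped 10% of the time
    "exon2": [("exon3", 1.0)],
    "exon3": [("exon4a", 0.7), ("exon4b", 0.3)],  # two alternative last exons
    "exon4a": [("end", 1.0)],
    "exon4b": [("end", 1.0)],
}


def sample_transcript(graph, rng):
    """Draw one transcript as a path from 'start' to 'end'."""
    node, path = "start", ["start"]
    while node != "end":
        targets, weights = zip(*graph[node])
        node = rng.choices(targets, weights=weights)[0]
        path.append(node)
    return tuple(path)


def reconstruct_edges(transcripts):
    """Reconstructed AS-graph: the set of splice edges seen in any transcript."""
    edges = set()
    for path in transcripts:
        edges.update(zip(path, path[1:]))
    return edges


def true_edges(graph):
    """All edges of the generating graph."""
    return {(src, dst) for src, outs in graph.items() for dst, _ in outs}


def recovery_rate(graph, n_transcripts, n_replicates, seed=0):
    """Fraction of replicates in which every edge of the graph is observed."""
    rng = random.Random(seed)
    target = true_edges(graph)
    hits = 0
    for _ in range(n_replicates):
        sample = [sample_transcript(graph, rng) for _ in range(n_transcripts)]
        hits += reconstruct_edges(sample) == target
    return hits / n_replicates
```

Plotting `recovery_rate` against `n_transcripts` gives one possible answer to the first key question for a given graph. Rare splice edges, such as the 10% exon skip here, dominate the number of transcripts required.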

Page 9:

Optimal DAG (directed acyclic graph) under restrictions

[Figure panels: Optimal Paths; Sub-optimal Paths]

Finding a set of annotations:

1. Find set of paths, maximizing sum of scores.

2. The score of minimal path must be above threshold.

Two paths must differ significantly: in an enclosed area, the maximal height must be d higher than the boundary defining it. Height(i,j) = di,j + di,j
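As a starting point for step 1, the single best path through a DAG can be found by dynamic programming. The sketch below covers only that building block: it assumes nodes are numbered in topological order and that edge scores are given in a hypothetical dictionary. The threshold on the minimal path and the requirement that paths differ significantly are not implemented.

```python
def best_path(n_nodes, edges):
    """Highest-scoring path from node 0 to node n_nodes - 1.

    `edges` maps (i, j) with i < j to a score, so the node numbering is
    already a topological order of the DAG.
    """
    best = [float("-inf")] * n_nodes
    back = [None] * n_nodes
    best[0] = 0.0
    for j in range(1, n_nodes):
        for i in range(j):
            if (i, j) in edges and best[i] + edges[(i, j)] > best[j]:
                best[j] = best[i] + edges[(i, j)]
                back[j] = i
    if best[-1] == float("-inf"):
        return best[-1], []  # the last node is unreachable
    path = [n_nodes - 1]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return best[-1], path[::-1]


# Hypothetical scores on a four-node DAG.
scores = {(0, 1): 2.0, (1, 3): 2.0, (0, 2): 1.0, (2, 3): 5.0, (0, 3): 3.0}
total, path = best_path(4, scores)
```

Sub-optimal paths could then be obtained by rerunning the recursion with constraints, for instance by forcing or forbidding particular edges, although that is only one of several possible designs.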

1. Do known AS genes have more CTO structure than non-AS genes?

2. Do the AS correspond to the CTO structure?

3. Is the CTO structure evolutionary conserved?

Page 10:

Phylogenetically related ASGs

1. Is ASG conserved?

2. What is conserved?

3. How is selection along position dependent on splicing status?

[Figure: three phylogenetically related AS-graphs, nodes labelled E and S]

Page 11:

http://www.tulane.edu/~dmsander/WWW/335/Diarrhoea.html

http://www.tulane.edu/~dmsander/WWW/335/Papovaviruses.html

http://www.tulane.edu/~dmsander/WWW/335/Retroviruses.html

Virus Annotation

Classes of Gene Structures

Retroviridae Arrangements

Papovaviridae Arrangement

Diarrhoea Causing Arrangements

Illustrating the 3 main classes of gene structures: Unidirectional, Convergent and Divergent.

Page 12:

The Problems of Viral Annotation

•HMM gene structure generator (McCauley)

•Gene Structure Evolution (de Groot)

•Alignment (Caldeira, Lunter, Rocco)

•Recombination (Lyngsø, Song)

•Multiple constraints: RNA secondary structure, gene conservation, binding/transcriptional instructional sites.

Page 13:

Our 8 State HMM which allows for Unidirectional overlapping gene structures

HMM States

•Non-coding

•Coding RF1

•Coding RF2

•Coding RF3

•Coding RF1,2

•Coding RF1,3

•Coding RF2,3

•Coding RF1,2,3
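The eight states are exactly the subsets of the three reading frames, which suggests a compact way to represent them. The sketch below only enumerates and labels the states. The helper `frame_changes` is a hypothetical illustration of how transitions might be described, and no emission or transition probabilities from the actual model are implied.

```python
from itertools import combinations

FRAMES = (1, 2, 3)  # the three reading frames on one strand


def hmm_states():
    """All subsets of reading frames that may be coding at a position."""
    states = []
    for size in range(len(FRAMES) + 1):
        for subset in combinations(FRAMES, size):
            states.append(frozenset(subset))
    return states


def label(state):
    """Human-readable state name in the style used on the slide."""
    if not state:
        return "Non-coding"
    return "Coding RF" + ",".join(str(frame) for frame in sorted(state))


def frame_changes(state_a, state_b):
    """Hypothetical helper: number of frames that open or close between states."""
    return len(state_a ^ state_b)


for state in hmm_states():
    print(label(state))
```

Representing states as frame sets makes overlapping genes explicit: a state with two or more frames is a position where genes overlap in the same direction.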

Page 14:

Combining Levels of Selection.

Protein-Protein

Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001

Contagious Dependence

Assume multiplicativity: f_{A,B} = f_A * f_B

Protein-RNA

Doublets

Singlet

Contagious Dependence

Page 15:

HIV2 Genomes

Genome    SS HMM       PHMM         Delta        SS HMM       PHMM         Delta
          Sensitivity  Sensitivity  Sensitivity  Specificity  Specificity  Specificity

M30502    0.8913       0.9765       0.0852       0.9878       0.9753       -0.0125

J04542    0.8458       0.9173       0.0714       1.0000       0.9956       -0.0044

D00835    0.8796       0.9432       0.0636       0.9920       0.9733       -0.0187

M15390    0.9310       0.9971       0.0661       1.0000       0.9869       -0.0131

J03654    0.8261       0.9971       0.1709       1.0000       0.9865       -0.0135

AY509259  0.8697       0.9256       0.0559       1.0000       0.9886       -0.0114

AY509260  0.8257       0.9101       0.0844       0.9928       0.9792       -0.0136

J04498    0.8961       0.9737       0.0776       1.0000       0.9911       -0.0089

AF082339  0.9074       0.9650       0.0577       0.9842       0.9773       -0.0069

U22047    0.9028       0.9874       0.0847       0.9865       0.9744       -0.0121

U27200    0.8769       0.9453       0.0684       0.9928       0.9748       -0.0180

L07625    0.8340       0.9680       0.1340       1.0000       0.9607       -0.0393

L36874    0.8653       0.9957       0.1303       0.9980       0.9766       -0.0214

MEANS     0.8732       0.9617       0.0885       0.9949       0.9800       -0.0149

Table illustrating the performance benefit in Sensitivity we obtain by utilizing a Phylogenetic HMM (PHMM). We extend the HMM model to include evolutionary information from 13 aligned HIV2 sequences. Delta columns are PHMM minus SS HMM.
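The slides do not say how Sensitivity and Specificity were computed. The sketch below assumes the common per-nucleotide definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), with each position marked coding or non-coding. Gene-finding papers sometimes use TP / (TP + FP) for specificity instead, so this choice is an assumption, and the function name is hypothetical.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Per-nucleotide sensitivity and specificity of a coding annotation.

    Both arguments are equal-length sequences of truthy/falsy flags, one
    per nucleotide. Assumes at least one coding and one non-coding site.
    """
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


# Toy annotation of eight nucleotides: one missed and one spurious position.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
prediction = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, prediction)

# The Delta columns are plain differences, e.g. for M30502 sensitivity:
delta_m30502 = 0.9765 - 0.8913
```

Under these definitions the table reads as a trade-off: the PHMM labels more truly coding positions as coding, at the cost of slightly more false positives.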

Page 16:

http://www.ncbi.nlm.nih.gov/Genbank/

http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html

Entrez Genomes currently contains 2120 Reference Sequences for 1510 viral genomes and 36 Reference Sequences for viroids.

Properties of overlapping genes are conserved across microbial genomes. Genome Res. 2004 Nov;14(11):2268-72.

GenBank: Centralized resource for publicly available viral sequence data.

Within microbial genomes, one third of annotated genes contain some degree of overlap, and one third of these are either Convergent or Divergent.

Krakauer, D.C. Stability and evolution of overlapping genes. Evolution 54: 731-739 (2000)

Genome Res. 2004 Nov;14(11):2268-72.

Overlapping gene structures are split roughly 90:9:1 across Unidirectional, Convergent and Divergent arrangements.

Page 17:

Turing Patterns

Page 18:

From Maini’s Home Page: http://www.maths.ox.ac.uk/~maini

Mathematical models to understand biological patterns

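\[
\begin{cases}
\dfrac{\partial u(x,t)}{\partial t} = D_u \nabla^2 u + f(u, v, \{p_u\}) \\[6pt]
\dfrac{\partial v(x,t)}{\partial t} = D_v \nabla^2 v + g(u, v, \{p_v\})
\end{cases}
\]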

Turing Model
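A standard way to ask whether a reaction-diffusion system of this form can produce patterns is linear stability analysis around a homogeneous steady state. The sketch below assumes the Jacobian of the reaction terms (f, g) at that steady state is supplied directly. The example Jacobian and diffusion constants are hypothetical and not taken from the slides.

```python
import math


def growth_rate(jac, d_u, d_v, k):
    """Largest real part of the eigenvalues of J - k^2 * diag(d_u, d_v)."""
    (fu, fv), (gu, gv) = jac
    a = fu - d_u * k * k
    d = gv - d_v * k * k
    trace = a + d
    det = a * d - fv * gu
    disc = trace * trace - 4 * det
    if disc >= 0:
        return (trace + math.sqrt(disc)) / 2
    return trace / 2  # complex pair: the real part is trace / 2


def is_turing_unstable(jac, d_u, d_v, k_max=10.0, n_k=2000):
    """True if the state is stable without diffusion (k = 0) but some
    spatial mode with 0 < k <= k_max grows once diffusion is included."""
    stable_without_diffusion = growth_rate(jac, d_u, d_v, 0.0) < 0
    ks = [k_max * i / n_k for i in range(1, n_k + 1)]
    grows_with_diffusion = any(growth_rate(jac, d_u, d_v, k) > 0 for k in ks)
    return stable_without_diffusion and grows_with_diffusion


# Hypothetical activator-inhibitor Jacobian ((f_u, f_v), (g_u, g_v)).
JACOBIAN = ((1.0, -2.0), (3.0, -4.0))
print(is_turing_unstable(JACOBIAN, d_u=1.0, d_v=20.0))
print(is_turing_unstable(JACOBIAN, d_u=1.0, d_v=1.0))
```

With equal diffusion constants the steady state stays stable, while a much faster-diffusing second species destabilises a band of wavenumbers. That band sets the characteristic wavelength of the pattern.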

Page 19:

[From: Leppanen et al. Dimensionality effects in Turing pattern formation, Int. J. Mod. Phys. B 17, 5541-5553 (2003)]

Different parameters lead to different patterns

\[
\begin{cases}
u_t = D_u \nabla^2 u + \gamma\,(u + a v - u v^2 - p u v) \\
v_t = D_v \nabla^2 v + \gamma\,(b v + h u + u v^2 + p u v)
\end{cases}
\]

Stripes: p small Spots: p large

Page 20:

3 suggestions

1. Networks and Turing Patterns

2. Stochastic Partial Differential Equations

3. Phylogenetically related Turing Patterns

Page 21:

Evolutionary Models of Protein Structure Evolution

[Figure labels: Known, Known, Unknown; -globin, Myoglobin; unknowns marked "?"]

300 amino acid changes

800 nucleotide changes

1 structural change

1.4 Gyr

1. Given Structure what are the possible events that could happen?

2. What are their probabilities? Old-fashioned substitution + indel process with bias.

Bias: Folding(Sequence → Structure) & Fitness of Structure

3. Summation over all paths.

Page 22:

2 suggestions

Folding(Sequence → Structure)

As a first approximation, similar structures should be compared, and the problem could be solved by comparative modelling.

Fitness of Structure – such functions are commonplace in guiding prediction programs.

Fast Homology Modelling

B. MCMC

A. Structure (Homology Modelling, Topology)

Using Protein Topology as Hidden Variable

Page 23:

Questions to be asked

Protein Structure Analysis is much harder than Sequence Analysis. Much of the first hand impression will remain: “Structures are either trivially similar or highly dissimilar” – the middle ground is empty.

At Gyr scale other rearrangements occur.

Test of smooth/catastrophic structure evolution

Separation of analogous/homologous similarities

Protein Evolution in General

How closely linked are homologous and structurally equivalent sites?

Negative Note:

Positive Note: If it works

http://www.biochem.ucl.ac.uk/bsm/cath/

http://scop.mrc-lmb.cam.ac.uk/scop/

Page 24:

Summary

Pedigrees from Genomes

Do infinite genomes determine pedigrees?

How many pedigrees are there?

Comparative Genomics of Alternative Splicing

How well do you know the ASG?

How do you measure selection on the ASG?

Viral Annotation

How well can you annotate viruses from observed evolution?

Evolving Turing Patterns

Turing Patterns and Networks

Stochastic Turing Patterns

Phylogenetically Related Turing Patterns

Protein Structure Evolution

Full Model of Structure Evolution

Model of Protein Topology Evolution