Post on 21-Dec-2015


• With the astonishing advance of the Human Genome Project, essentially all human genomic sequences are available in public databases. The major task for the entire scientific community is to identify medically important genes and determine their functions. The discovery and characterization of corin, the first transmembrane serine protease identified from the heart, exemplifies such a challenge. The bioinformatic and biochemical approaches used in our studies can be applied to study many other genes.

• Serine proteases are important for a variety of biological processes including food digestion, blood coagulation, host defense and embryonic development. These proteases are also protein targets of pharmaceutical drugs. For example, inhibitors of blood clotting enzymes such as thrombin and factor X are developed to prevent and treat thrombotic diseases.

• To identify novel serine proteases in the cardiovascular system, we used the BLAST program to search genomic databases for new genes that share significant homology with serine protease family members, such as trypsin. A partial cDNA sequence (EST) was identified from a human heart library and subsequently used to clone the full-length cDNA of a novel gene, designated corin for its abundant expression in the heart.

• Sequence analysis indicates that human corin cDNA encodes a polypeptide of 1042 amino acids. Near the amino terminus of corin, there is a transmembrane domain identified by hydropathy plots using the GCG program. In the extracellular region, corin contains two frizzled-like cysteine-rich motifs, seven low density lipoprotein receptor repeats, a macrophage scavenger receptor-like domain, and a trypsin-like protease domain. Such a unique mosaic domain structure had not been found in any other member of the trypsin superfamily.

• To study the function of corin, we performed a series of biochemical experiments. Using combined bioinformatic and biochemical approaches, we have solved a long-standing puzzle in cardiovascular biology.

4951 Projects-Characterize a protein. Carefully select a protein. Suggestions:

select a protein that is in a 3D database

select a protein that has been studied in many organisms

select a protein that has a known activity

select a protein for which there are known mutant versions.

Use the library to research your chosen protein.

Demonstrate how its function is related to its structure.

Comment on the evolution of the protein.

Comment on the difference between the normal and mutant versions.

Find and analyze the DNA of the protein (determine its size, its intron #, chromosomal location, etc.)

During your presentation, indicate the names of the software used and databases analyzed to give you your information.

Introduction to Molecular Phylogeny*

*Phylogeny- the evolutionary history of a group

Requirement:

• Basic understanding of evolutionary principles.

• Basic understanding of mutation at the molecular level

Genetic variation exists. Evolution depends on it.

• Genetic variation:

DNA segments (large or small) can be altered or duplicated or deleted.

Point mutations or other small changes (e.g., A → G) generate a new version of a gene (i.e., a new allele)

New loci are generated by gene duplication events.

Basis of Molecular Phylogenetics

• To a first approximation, the evolution of species or genes can be modeled as a bifurcating process.  Two populations become reproductively isolated and diverge due to random mutational processes. Over time, this process may repeat itself, so that at any time, each population can be said to be most closely-related to some other population with which it shares a direct common ancestor.

• If genomes evolve by the gradual accumulation of mutations, then the amount of nucleotide sequence difference between a pair of genomes should indicate how recently those two genomes shared a common ancestor.
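As a rough illustration of "amount of nucleotide sequence difference", the sketch below computes the proportion of aligned sites at which two sequences differ (often called the p-distance). The sequences and their names are invented for the example, and skipping gapped columns is an assumption of this sketch, not something the slides specify.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns containing a gap are ignored
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


# Hypothetical aligned sequences, invented for illustration only.
human = "ACGTACGTAC"
chimp = "ACGTACGTAT"
mouse = "ACGAACCTTT"

print(p_distance(human, chimp))  # smaller value: more recent common ancestor
print(p_distance(human, mouse))  # larger value: older common ancestor
```

Under the reasoning above, the pair with the smaller value would be taken to share the more recent common ancestor.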

• Divergence consists of changes in characters, such as amino acids in a protein, or nucleotides in DNA. The longer two populations remain reproductively isolated, the more divergence will occur. Given the existence of homologous characters across a set of populations, it should be possible to work backwards in time, ascending the tree, until a common ancestor of all populations in the set is reached.

Word of Caution

• Phylogenetic analysis is one of the most controversial areas in bioinformatics. There is a wide variety of methods for analyzing the data, and even the experts often disagree on which method is best.

Phylogenetic Data Analysis requires 4 steps (text- starting on page 327)

• 1) Alignment

• 2) Determine the substitution model

• 3) Tree Building

• 4) Tree Evaluation

Alignment

• Phylogenetic analysis is very dependent on a good multiple alignment. The alignment of sequences can often have more of an impact on the final tree than the choice of phylogenetic software or phylogenetic parameters.

Homology

It is critical to phylogenetic analysis that homologous characters be compared across species. For DNA and proteins, this means that gaps must be placed correctly in multiple alignments to ensure that the same position is being compared for each species. Consequently, if a multiple alignment is poor, phylogeny construction will also be poor.

What to align?

• Phylogenetic trees are generated by comparing DNA, RNA, or protein. The molecule of choice depends on the question you are attempting to answer.

DNA/RNA

• contains more evolutionary information than protein

• high rate of base substitution makes DNA best for very short-term studies, e.g., closely related species

Protein

• more reliable alignment than DNA (for DNA, 25% identity is expected at random)

• fewer homoplasies* than DNA

• lower rate of substitution than DNA; better for wide species comparisons

*Homoplasy

• Return of a character to its original state, thus masking intervening mutational events. Homoplasies are most important in DNA sequences, because there are only 4 nucleotides. Every fourth mutation should result in a homoplasy.

rRNA = ribosomal RNA

• Best for very long term evolutionary studies spanning biological kingdoms

• Most consistent with an evolutionary clock.

• Selective processes constraining sequence evolution should be roughly the same across species boundaries

Determine the substitution model-DNA:

• May be a nucleotide substitution rate matrix:

     A  C  G  T
A    -  2  1  2
C    2  -  2  1
G    1  2  -  2
T    2  1  2  -

Mutation Rates Vary:

• Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).

• In general, DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance, and each identity subtracts from the distance. Scoring matrices include values for all possible substitutions.  
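The substitution matrix shown earlier can be read as a cost table: a transition costs 1 and a transversion costs 2. The sketch below applies those costs to a pair of aligned sequences. It is a simplification under stated assumptions: identities contribute 0 rather than subtracting from the distance, gapped columns are skipped, and the example sequences are invented.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(a: str, b: str) -> int:
    """Cost of one aligned pair: 0 identity, 1 transition, 2 transversion."""
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if a == b:
        return 0
    # A transition stays within one chemical class (A<->G or C<->T).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 1 if same_class else 2


def weighted_distance(seq_a: str, seq_b: str) -> int:
    """Sum of substitution costs over all ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gaps are not scored in this sketch
        total += substitution_cost(a, b)
    return total


# Hypothetical sequences: columns are A/G, A/A, G/C, C/T, T/A.
print(weighted_distance("AAGCT", "GACTA"))
```

Weighting transversions more heavily than transitions reflects the point above that transitions occur more frequently and are therefore less informative per event.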

Determine the substitution model-Protein:

• May be an amino acid substitution rate matrix such as PAM or BLOSUM.

Tree Building

• There are four main tree drawing methods.

• - pairwise distance

• - neighbor joining

• - maximum parsimony

• - maximum likelihood

Basic tree terminology:

Nodes: branching points

Branches: lines

Topology: branching pattern

Branches can be rotated at a node, without changing the relationships.

Phylogenetic trees based on pairwise distance.

Simplest to visualize with DNA data:

1) Align each pair of sequences under consideration

2) The two sequences that are closest together are connected at a node. The branch lengths reflect the degree of similarity (and theoretically reflect evolutionary time).

3) The process is repeated until all sequences are joined.

4) Addition of the last sequence defines the root of the tree.
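The four steps above can be sketched as a simple clustering loop. This version uses average linkage between clusters (the UPGMA rule), which is an assumption of the sketch; the slides do not say how the distance to an already joined group is computed. Taxon names and distances are invented.

```python
from itertools import combinations


def build_distance_tree(names, dist):
    """Join the closest clusters repeatedly; return the tree as nested tuples.

    dist maps frozenset({a, b}) -> distance for every pair of leaf names.
    """
    # Each cluster is (subtree, tuple of the leaf names it contains).
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(c1, c2):
        # Average of all leaf-to-leaf distances between the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise distances for four taxa.
raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}

tree = build_distance_tree(["A", "B", "C", "D"], dist)
print(tree)
```

In this example A and B are joined first, then C, and D is added last, so D's attachment point plays the role of the root described in step 4.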

• Relatively simple.

• Problem:

–May not be accurate!!

Phylogenetic trees based on neighbor joining.

• Also utilizes a ‘distance matrix’

• Neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree.

• Can produce reasonable trees, especially when evolutionary distances are short.
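A minimal sketch of a single neighbor-joining step follows. It uses the standard Q criterion, Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, and joins the pair with the lowest Q. Only the first join and its two branch lengths are computed here; a full implementation would replace the pair with a new node and repeat. The distance matrix is a small illustrative example, not data from the slides.

```python
from itertools import combinations


def nj_first_join(names, d):
    """Pick the first pair neighbor joining would connect.

    d is a symmetric matrix (list of lists) with zeros on the diagonal.
    Returns (name_i, name_j, branch_length_i, branch_length_j).
    """
    n = len(names)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    totals = [sum(row) for row in d]  # total distance from each taxon

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - totals[i] - totals[j]

    i, j = min(combinations(range(n), 2), key=q)
    # Branch lengths from i and j to the new internal node.
    length_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return names[i], names[j], length_i, length_j


# Illustrative five-taxon distance matrix.
names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_first_join(names, d))
```

Note that d and e have the smallest raw distance (3), yet the Q criterion joins a and b first. This is the main difference from simply connecting the closest pair.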

Pairwise distance and neighbor joining are distance methods.

• There are two main categories of phylogeny methods, distance methods and character methods. In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences. Next, the tree is constructed to minimize the total distance when all branch lengths are added together.

Maximum parsimony and maximum likelihood are

character methods

• Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time.

Phylogenetic trees based on maximum parsimony

First step in maximum parsimony analysis: Identify all of the informative sites.
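A short sketch of this first step, using the usual definition of a parsimony-informative site: a column with at least two different character states, each present in at least two sequences. The four sequences are a made-up example, and ignoring gap characters is an assumption of the sketch.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 0-based column indexes that are parsimony-informative."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        # States carried by at least two sequences.
        shared = [state for state, count in counts.items() if count >= 2]
        if len(shared) >= 2:
            sites.append(col)
    return sites


# Hypothetical alignment of four sequences.
alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]

print(informative_sites(alignment))
```

Columns where every sequence agrees, or where only one sequence differs, require the same number of changes on every tree and so cannot help choose between trees.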

Parsimony Analysis 2nd step: Calculate the minimum number of substitutions at each informative site

(Figure: candidate trees for one informative site, requiring 1 step, 2 steps, and 2 steps.)

Final step in parsimony analysis:

After sequences are aligned, algorithms model each tree: Sum the number of changes over all informative sites for each possible tree.

Parsimony: General scientific criterion for choosing among competing hypotheses states that we should accept the hypothesis that explains the data most simply and efficiently.

• The tree requiring the minimum number of nucleic acid or amino acid substitutions is selected.
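To make the scoring concrete, the sketch below counts the minimum number of changes each candidate tree needs, using Fitch's small-parsimony algorithm. Naming Fitch is an assumption here, since the slides do not say which counting algorithm is used. Trees are written as nested tuples of sequence names, and the four sequences are invented. All sites are summed rather than only the informative ones; uninformative sites add the same amount to every tree, so the ranking is unchanged.

```python
def fitch(tree, states):
    """Return (possible ancestral states, changes needed) for one site."""
    if isinstance(tree, str):  # a leaf: its state is observed
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Total number of changes a tree requires, summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, site)[1]
    return total


# Hypothetical alignment of four sequences.
alignment = {
    "s1": "AAGAGTGCA",
    "s2": "AGCCGTGCG",
    "s3": "AGATATCCA",
    "s4": "AGAGATCCG",
}

# The three possible unrooted topologies for four sequences.
candidates = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]

lengths = [tree_length(tree, alignment) for tree in candidates]
best = candidates[lengths.index(min(lengths))]
print(lengths, best)
```

The tree with the smallest total is the one parsimony selects; here that is the tree grouping s1 with s2 and s3 with s4.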

Problem- As the # of sequences increases, the # of possible trees increases dramatically

# of sequences    # of unrooted trees
3                 1
4                 3
5                 15
6                 105
7                 945
8                 10,395
9                 135,135
10                2,027,025
50                2.8 x 10^74
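The counts in this table follow the standard formula for the number of unrooted bifurcating trees on n sequences, (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). A small sketch (the function name is chosen for this example):

```python
def count_unrooted_trees(n: int) -> int:
    """Number of distinct unrooted bifurcating trees for n sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10):
    print(n, f"{count_unrooted_trees(n):,}")
print(50, f"{count_unrooted_trees(50):.1e}")
```

Each added sequence multiplies the count by the next odd number, which is why exhaustive evaluation quickly becomes impractical.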

Programs take shortcuts.

• When a large number of trees is being compared, it is impractical to score each tree. A shortcut algorithm establishes an upper limit. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.

Phylogenetic trees based on maximum likelihood

Also evaluates every possible tree topology. ML methods are probabilistic. They assign probabilities to every possible evolutionary change at informative sites.

The aim is to find the tree (among all possible trees) with the highest L (likelihood) value.
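The simplest possible case illustrates the idea: two sequences joined by a single branch, where the only thing to estimate is the branch length. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, and picks the branch length with the highest log-likelihood from a grid of candidates. Real ML programs do this over whole trees with numerical optimization. The sequences are invented.

```python
import math


def jc69_log_likelihood(seq_a: str, seq_b: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by branch length d.

    d is the expected number of substitutions per site (must be > 0).
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # site keeps the same base
    p_diff = 0.25 - 0.25 * decay   # site changes to one specific other base
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the assumed equilibrium frequency of the starting base.
        log_l += math.log(0.25) + math.log(p_same if a == b else p_diff)
    return log_l


# Hypothetical aligned sequences that differ at 2 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTGT"

candidates = [i / 1000 for i in range(1, 2001)]
best_d = max(candidates, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# Closed-form JC69 estimate for comparison: d = -3/4 * ln(1 - 4p/3).
p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
closed_form = -0.75 * math.log(1 - 4 * p / 3)

print(best_d, closed_form)
```

The grid search and the closed-form estimate should agree to within the grid spacing, which is a useful sanity check on the likelihood function.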

Tree Evaluation

Bootstrap method of assessing tree reliability:

Inferred tree is constructed from data set.

Characters are resampled from the data set with replacement.

Resampling is repeated several (100-1000) times.

Bootstrap method

Bootstrap trees are constructed from the resampled data sets.

Bootstrap tree is compared to original inferred tree.

The percentage of bootstrap trees supporting a node is determined for each node in the tree.
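The resampling procedure can be sketched as follows. To keep the example short, a full tree-building method is replaced by a stand-in that only finds the closest pair of sequences, so the "node" being tested is that single grouping. The alignment, the function names, and the fixed random seed are all assumptions made for illustration.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair with the fewest mismatches."""
    def mismatches(pair):
        a, b = alignment[pair[0]], alignment[pair[1]]
        return sum(x != y for x, y in zip(a, b))
    # Ties are broken by sorted name order, an arbitrary choice.
    return frozenset(min(combinations(sorted(alignment), 2), key=mismatches))


def bootstrap_support(alignment, replicates=1000, seed=0):
    """Percentage of resampled data sets that recover the original grouping."""
    rng = random.Random(seed)
    original = closest_pair(alignment)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == original
        for _ in range(replicates)
    )
    return original, 100.0 * hits / replicates


# Hypothetical alignment of four sequences.
alignment = {
    "s1": "AAGAGTGCA",
    "s2": "AGCCGTGCG",
    "s3": "AGATATCCA",
    "s4": "AGAGATCCG",
}

grouping, support = bootstrap_support(alignment)
print(sorted(grouping), support)
```

A grouping that reappears in most replicates is well supported by the data; one that appears in only a small fraction depends on a few columns and should be treated with caution.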

Why the controversy??

• Molecular vs. Classical

• Different Methods Same Tree??

• Molecular Clock

The End
