with astonishing advance of the human genome project, essentially all human genomic sequences are...

Post on 21-Dec-2015

Page 1: With astonishing advance of the Human Genome Project, essentially all human genomic sequences are available in public databases. The major task for the

• With the astonishing advance of the Human Genome Project, essentially all human genomic sequences are available in public databases. The major task for the entire scientific community is now to identify medically important genes and determine their functions. The discovery and characterization of corin, the first transmembrane serine protease identified from the heart, exemplifies such a challenge. The bioinformatic and biochemical approaches used in our studies can be applied to study many other genes.

• Serine proteases are important for a variety of biological processes including food digestion, blood coagulation, host defense and embryonic development. These proteases are also protein targets of pharmaceutical drugs. For example, inhibitors of blood clotting enzymes such as thrombin and factor X are developed to prevent and treat thrombotic diseases.

• To identify novel serine proteases in the cardiovascular system, we used the BLAST program to search genomic databases for new genes that share significant homology with serine protease family members, such as trypsin. A partial cDNA sequence (EST) was identified from a human heart library and subsequently used to clone the full-length cDNA of a novel gene, designated corin for its abundant expression in the heart.

• Sequence analysis indicates that human corin cDNA encodes a polypeptide of 1042 amino acids. Near the amino terminus of corin, there is a transmembrane domain identified by hydropathy plots using the GCG program. In the extracellular region, corin contains two frizzled-like cysteine-rich motifs, seven low density lipoprotein receptor repeats, a macrophage scavenger receptor-like domain, and a trypsin-like protease domain. Such a unique mosaic domain structure had not been found in any other member of the trypsin superfamily.
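
A hydropathy plot of the kind mentioned above can be sketched as a sliding-window average of per-residue hydrophobicity. The sketch below assumes the Kyte-Doolittle scale, a 19-residue window, and a cutoff of 1.6; the slides only name the GCG program, so these choices and the toy sequence (which is not corin) are illustrative assumptions.

```python
# Kyte-Doolittle hydropathy scale (an assumed choice; the slides only name the GCG program).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Return start positions of windows hydrophobic enough to suggest a membrane span."""
    profile = hydropathy_profile(sequence, window)
    return [start for start, mean in enumerate(profile) if mean >= threshold]


# Hypothetical toy sequence (not corin): charged flanks around a hydrophobic core.
toy_sequence = "DKRSEDNQKE" + "LLIVALLAVIGLLVAIALL" + "KRDESNQDKE"
print(candidate_tm_windows(toy_sequence))
```

Windows that clear the cutoff are only candidates; a real analysis would confirm them with a dedicated transmembrane prediction tool.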

• To study the function of corin, we performed a series of biochemical experiments. Using combined bioinformatic and biochemical approaches, we have solved a long-standing puzzle in cardiovascular biology.

4951 Projects: Characterize a protein. Carefully select a protein. Suggestions:

select a protein that is in a 3D database

select a protein that has been studied in many organisms

select a protein that has a known activity

select a protein for which there are known mutant versions.

Use the library to research your chosen protein.

Demonstrate how its function is related to its structure.

Comment on the evolution of the protein.

Comment on the difference between the normal and mutant versions.

Find and analyze the DNA encoding the protein (determine its size, number of introns, chromosomal location, etc.)

During your presentation, indicate the names of the software used and databases analyzed to give you your information.

Introduction to Molecular Phylogeny*

*Phylogeny- the evolutionary history of a group

Requirement:

• Basic understanding of evolutionary principles.

• Basic understanding of mutation at the molecular level

Genetic variation exists. Evolution depends on it.

• Genetic variation:

DNA segments (large or small) can be altered or duplicated or deleted.

Point mutations or other small changes (e.g. A → G) generate a new version of a gene (i.e. a new allele).

New loci are generated by gene duplication events.

Basis of Molecular Phylogenetics

• To a first approximation, the evolution of species or genes can be modeled as a bifurcating process.  Two populations become reproductively isolated and diverge due to random mutational processes. Over time, this process may repeat itself, so that at any time, each population can be said to be most closely-related to some other population with which it shares a direct common ancestor.

Basis of Molecular Phylogenetics

• If genomes evolve by the gradual accumulation of mutations, then the amount of nucleotide sequence difference between a pair of genomes should indicate how recently those two genomes shared a common ancestor.
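
The amount of sequence difference between two genomes can be sketched as the proportion of aligned sites that differ (often called the p-distance). The function name and the toy sequences below are hypothetical, and the sketch assumes the sequences are already aligned to equal length.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has a gap.
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical aligned sequences: the second is a close relative, the third a distant one.
reference = "ACGTACGTAC"
close_relative = "ACGTACGTAT"
distant_relative = "ACGAACTTAT"

print(p_distance(reference, close_relative))
print(p_distance(reference, distant_relative))
```

Under the reasoning on this slide, the pair with the smaller value would be taken to share the more recent common ancestor.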

Basis of Molecular Phylogenetics

• Divergence consists of changes in characters, such as amino acids in a protein, or nucleotides in DNA. The longer two populations remain reproductively isolated, the more divergence will occur. Given the existence of homologous characters across a set of populations, it should be possible to work backwards in time, ascending the tree, until a common ancestor of all populations in the set is reached.

Word of Caution

• Phylogenetic analysis is one of the most controversial areas in bioinformatics. There are a wide variety of different methods for analyzing the data, and even the experts often disagree on the best method for analyzing the data.

Phylogenetic Data Analysis requires 4 steps (text- starting on page 327)

• 1) Alignment

• 2) Determine the substitution model

• 3) Tree Building

• 4) Tree Evaluation

Alignment

• Phylogenetic analysis is highly dependent on a good multiple alignment. The alignment of sequences can often have more of an impact on the final tree than the choice of phylogenetic software or phylogenetic parameters.

Homology

It is critical to phylogenetic analysis that homologous characters be compared across species. For DNA and proteins, this means that gaps must be correctly placed in multiple alignments to ensure that the same position is being compared for each species. Consequently, if a multiple alignment is poor, phylogeny construction will also be poor.

What to align?

• Phylogenetic trees are generated by comparing DNA, RNA, or protein. The molecule of choice depends on the question you are attempting to answer.

DNA/RNA

• contains more evolutionary information than protein

• high rate of base substitution makes DNA best for very short-term studies, e.g. closely related species

Protein

• more reliable alignment than DNA (for DNA, 25% identity is expected at random)

• fewer homoplasies* than DNA

• lower rate of substitution than DNA; better for wide species comparisons

*Homoplasy

• Return of a character to its original state, thus masking intervening mutational events. Homoplasies are most important in DNA sequences, because there are only 4 nucleotides. Every fourth mutation should result in a homoplasy.

rRNA = ribosomal RNA

• Best for very long-term evolutionary studies spanning biological kingdoms

• Most consistent with an evolutionary clock.

• Selective processes constraining sequence evolution should be roughly the same across species boundaries

Determine the substitution model-DNA:

• May be a nucleotide substitution rate matrix:

     A  C  G  T
  A  -  2  1  2
  C  2  -  2  1
  G  1  2  -  2
  T  2  1  2  -
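
In the matrix above, A/G and C/T changes cost 1 and all other changes cost 2. A minimal sketch of how such a matrix could be used to score the difference between two aligned DNA sequences follows; the function name and the toy sequences are hypothetical, and gaps are not handled.

```python
# Substitution costs copied from the matrix on this slide.
SUBSTITUTION_COST = {
    "A": {"C": 2, "G": 1, "T": 2},
    "C": {"A": 2, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "T": 2},
    "T": {"A": 2, "C": 1, "G": 2},
}


def weighted_difference(seq_a, seq_b):
    """Sum the substitution costs over all mismatched positions of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        SUBSTITUTION_COST[a][b]
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b
    )


# Hypothetical examples: two cost-1 changes versus a single cost-2 change.
print(weighted_difference("ACGT", "GCGC"))
print(weighted_difference("ACGT", "CCGT"))
```

Both toy comparisons come out at the same total, which shows how a weighted matrix can rank sequence pairs differently from a plain mismatch count.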

Mutation Rates Vary:

• Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).
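
The distinction can be made concrete by classifying each mismatch in an alignment. This is a small sketch with hypothetical function names and toy sequences; it assumes ungapped, aligned DNA using only A, C, G and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(base_a, base_b):
    """True when both bases are purines or both are pyrimidines (and they differ)."""
    if base_a == base_b:
        return False
    return {base_a, base_b} <= PURINES or {base_a, base_b} <= PYRIMIDINES


def count_transitions_transversions(seq_a, seq_b):
    """Return (transitions, transversions) over the mismatched positions."""
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        if is_transition(a, b):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Hypothetical aligned pair containing two changes of each kind.
print(count_transitions_transversions("AACCGGTT", "GACTGTTA"))
```

The ratio of the two counts is one of the quantities a substitution model may need to estimate.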

• In general, DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance, and each identity subtracts from the distance. Scoring matrices include values for all possible substitutions.  

Determine the substitution model

• May be an amino acid substitution rate matrix such as PAM or BLOSUM.

Tree Building

• There are four main tree drawing methods.

• - pairwise distance

• - neighbor joining

• - maximum parsimony

• - maximum likelihood

Basic tree terminology:

Nodes: branching points

Branches: lines

Topology: branching pattern

Branches can be rotated at a node, without changing the relationships.

Phylogenetic trees based on pairwise distance.

Simplest to visualize with DNA data:

1) Align each pair of sequences under consideration.

2) The two sequences that are closest together are connected at a node. The branch lengths reflect the degree of similarity (and theoretically reflect evolutionary time).

3) The process is repeated until all sequences are joined.

4) Addition of the last sequence defines the root of the tree.
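
The four steps above can be sketched as a simple clustering loop. The slides do not name a specific algorithm, so this sketch assumes average-linkage (UPGMA-style) updating of distances after each join. The function name, the toy distance table, and the nested-parenthesis output are illustrative choices, and branch lengths are omitted.

```python
from itertools import combinations


def cluster_by_distance(names, distances):
    """Repeatedly join the two closest clusters and return the nested grouping.

    `distances` maps frozenset({name_a, name_b}) to a pairwise distance.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of sequences
    dist = dict(distances)
    while len(sizes) > 1:
        # Step 2: find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Distance from the new cluster to each remaining one: size-weighted average.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b  # Step 3: repeat until one cluster remains.
    return next(iter(sizes)) + ";"


# Hypothetical distances: A and B are closest, C is next, D is the most distant.
toy_distances = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 4, frozenset(("B", "C")): 4,
    frozenset(("A", "D")): 8, frozenset(("B", "D")): 8, frozenset(("C", "D")): 8,
}
print(cluster_by_distance(["A", "B", "C", "D"], toy_distances))
```

In this toy case the last sequence to join is D, which places the root between D and the other three, as step 4 describes.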

Phylogenetic trees based on pairwise distance.

• Relatively simple.

• Problem:

–May not be accurate!!

Phylogenetic trees based on neighbor joining.

• Also utilizes a ‘distance matrix’

• Neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree.

• Can produce reasonable trees, especially when evolutionary distances are short.
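
As a rough illustration of how neighbors are chosen, the sketch below computes the commonly used neighbor-joining selection criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of a taxon's distances to all others, and picks the pair with the smallest Q. Only the first joining step is shown. The five-taxon distance table is a hypothetical example, and the slides do not specify which formulation the course software uses.

```python
from itertools import combinations


def pick_neighbors(names, distances):
    """Return the pair with the smallest Q value, plus the full Q table.

    `distances` maps frozenset({name_a, name_b}) to a pairwise distance.
    """
    n = len(names)
    # r(i): total distance from taxon i to every other taxon.
    row_sum = {
        i: sum(distances[frozenset((i, j))] for j in names if j != i)
        for i in names
    }
    q = {
        (i, j): (n - 2) * distances[frozenset((i, j))] - row_sum[i] - row_sum[j]
        for i, j in combinations(names, 2)
    }
    return min(q, key=q.get), q


# Hypothetical five-taxon distance table.
taxa = ["a", "b", "c", "d", "e"]
toy = {
    frozenset(("a", "b")): 5, frozenset(("a", "c")): 9, frozenset(("a", "d")): 9,
    frozenset(("a", "e")): 8, frozenset(("b", "c")): 10, frozenset(("b", "d")): 10,
    frozenset(("b", "e")): 9, frozenset(("c", "d")): 8, frozenset(("c", "e")): 7,
    frozenset(("d", "e")): 3,
}
best_pair, q_table = pick_neighbors(taxa, toy)
print(best_pair, q_table[best_pair])
```

Note that d and e have the smallest raw distance in this table, yet a and b are joined first: the criterion corrects for taxa that are far from everything else, which is the main difference from the plain pairwise-distance method.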

Pairwise distance and neighbor joining are distance methods.

• There are two main categories of phylogeny methods: distance methods and character methods. In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences. Next, the tree is constructed to minimize the distance when all branches are added together.

Maximum parsimony and maximum likelihood are character methods.

• Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time.

Phylogenetic trees based on maximum parsimony

First step in maximum parsimony analysis: Identify all of the informative sites.
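
A sketch of this first step, using the usual definition of a parsimony-informative site: a column with at least two different character states, each present in at least two sequences. The function name and the four-sequence toy alignment are hypothetical.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based column indices that are parsimony-informative."""
    sites = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(base for base in column if base != "-")
        # Informative: at least two states that each occur in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(index)
    return sites


# Hypothetical alignment of four sequences.
toy_alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]
print(informative_sites(toy_alignment))
```

Invariant columns and columns where only one sequence differs are skipped because they cost the same number of changes on every tree.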

Parsimony Analysis 2nd step: Calculate the minimum number of substitutions at each informative site

[Figure: three examples, requiring 1 step, 2 steps, and 2 steps]

Final step in parsimony analysis:

After sequences are aligned, algorithms model each tree: sum the number of changes over all informative sites for each possible tree.
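
One common way to count the minimum number of changes a tree needs at a site is Fitch's algorithm, used here as an assumed stand-in for whatever the course software does. The sketch scores the three possible groupings of four sequences. The sequence names, the toy alignment, and the list of informative columns (chosen by inspection for this toy data) are all hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a leaf name or a pair of subtrees; `states` maps leaf name to base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment, sites):
    """Sum the minimum number of changes over the chosen sites."""
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in sites
    )


# Hypothetical alignment; columns 4, 6 and 8 are its informative sites.
toy_alignment = {
    "s1": "AAGAGTGCA",
    "s2": "AGCCGTGCG",
    "s3": "AGATATCCA",
    "s4": "AGAGATCCG",
}
candidate_trees = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
scores = {tree: tree_length(tree, toy_alignment, [4, 6, 8]) for tree in candidate_trees}
best_tree = min(scores, key=scores.get)
print(scores)
print(best_tree)
```

The tree with the smallest total is the one maximum parsimony would select for this toy data set.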

Parsimony: The general scientific criterion for choosing among competing hypotheses, which states that we should accept the hypothesis that explains the data most simply and efficiently.

• The tree requiring the _______ number of nucleic acid or amino acid substitutions is selected.

Problem- As the # of sequences increases, the # of possible trees increases dramatically

# of sequences    # of trees
3                 1
4                 3
5                 15
6                 105
7                 945
8                 10,395
9                 135,135
10                2,027,025
50                2.8 x 10^74
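
This growth follows the standard count of unrooted bifurcating trees for n sequences, (2n - 5)!!, the product of the odd numbers from 3 up to 2n - 5. The sketch below assumes that the table refers to unrooted trees; the function name is hypothetical.

```python
def unrooted_tree_count(n_sequences):
    """Number of distinct unrooted bifurcating trees for n labeled sequences."""
    if n_sequences < 3:
        raise ValueError("at least three sequences are needed")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for factor in range(3, 2 * n_sequences - 4, 2):
        count *= factor
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 50):
    print(n, unrooted_tree_count(n))
```

Each added sequence can attach to any existing branch, which is why the count is multiplied by the next odd number at every step.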

Programs take shortcuts.

• When a large number of trees is being compared, it is impractical to score each tree. A shortcut algorithm establishes an upper limit. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.

Phylogenetic trees based on maximum likelihood

Also evaluates every possible tree topology. ML methods are probabilistic. They assign probabilities to every possible evolutionary change at informative sites.

Phylogenetic trees based on maximum likelihood

The aim is to find the tree (among all possible trees) with the highest L (likelihood) value.
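
Searching tree space is beyond a short example, but the idea of picking the value with the highest L can be sketched in the simplest case: estimating the distance between two sequences. The sketch assumes the Jukes-Cantor (JC69) model, which the slides do not mention, and hypothetical counts of 90 identical and 10 different sites. It scans candidate distances and compares the best one with the known closed-form JC69 estimate.

```python
import math


def jc69_log_likelihood(distance, n_same, n_diff):
    """Log-likelihood of the observed site counts at a given distance under JC69.

    Constant terms that do not depend on the distance are omitted.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def best_distance(n_same, n_diff, step=0.001, upper=2.0):
    """Scan a grid of candidate distances and keep the one with the highest L."""
    candidates = [k * step for k in range(1, int(upper / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


# Hypothetical data: 100 aligned sites, 10 of which differ.
n_same, n_diff = 90, 10
grid_estimate = best_distance(n_same, n_diff)
p = n_diff / (n_same + n_diff)
closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(grid_estimate, closed_form)
```

A full maximum likelihood analysis applies the same principle to whole trees, optimizing branch lengths for each candidate topology, which is why it is computationally expensive.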

Tree Evaluation

Bootstrap method of assessing tree reliability:

Inferred tree is constructed from data set.

Characters are resampled from the data set with replacement.

Resampling is repeated several (100-1000) times.

Bootstrap method

Bootstrap trees are constructed from the resampled data sets.

Bootstrap tree is compared to original inferred tree.

The percentage of bootstrap trees supporting a node is determined for each node in the tree.
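
The resampling procedure can be sketched as follows. To keep the example short, the "tree" is reduced to a single question, namely which two sequences are closest, so the support value is the fraction of bootstrap replicates that agree with the original answer. The function names, the replicate count, the seed, and the toy alignment are all hypothetical.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one bootstrap data set."""
    n_sites = len(alignment[0])
    chosen = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in chosen) for seq in alignment]


def closest_pair(alignment):
    """Indices of the two sequences with the fewest mismatches."""
    def mismatches(i, j):
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))
    return min(combinations(range(len(alignment)), 2), key=lambda p: mismatches(*p))


def bootstrap_support(alignment, replicates=200, seed=0):
    """Fraction of replicates in which the original closest pair is recovered."""
    rng = random.Random(seed)
    reference = closest_pair(alignment)
    agree = sum(
        closest_pair(resample_columns(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return reference, agree / replicates


# Hypothetical alignment: the first two sequences differ at one site only.
toy_alignment = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "TGCATGCATGCATGCATGCA",
]
print(bootstrap_support(toy_alignment))
```

In a real analysis each replicate would be fed to the full tree-building method, and support would be tallied separately for every node of the inferred tree.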

Why the controversy??

• Molecular vs. Classical

• Different Methods Same Tree??

• Molecular Clock

The End