comparative genomics - isoft.postech.ac.kr

Comparative Genomics

Russ B. Altman
BMI 214 / CS 274

REMINDER: BMI 214 Industry Night

Location: Here (Thornton 102), on TV too.
Time: 7:30-9:00 PM (May 21, 2002)
Speakers:
– Francisco De La Vega, Applied Biosystems
– Gary Peltz, Roche
– Darren Platt, Exelixis
– Liping Wei, Nexus Genomics
– Tom Wu, Genentech

Loose Definition

Comparative genomics can be defined as the large scale comparison of genomes in order to understand the biology of individual genomes as well as to extract general principles applying to groups of genomes.

Comparative genomics is based on the assumption that many biological sequences, structures, and functions are shared across organisms, and the signal from these organisms can be increased by combining them in analyses.

We’ve already done some comparative genomics…

Multiple alignment: teaches us more about individual molecules in the context of their homologs

Gibbs Sampling: improved detection of sequence motifs using multiple sequences

Structure prediction: PSI-Blast and other methods to increase signal for predicting structure

Coming: Phylogenetics (creation of trees describing evolution of genes/organisms)

Human vs. Mouse

Synteny literally means “same thread” and was originally used to denote genes on the same chromosome.

It has since come to denote two regions of two genomes that show rough conservation of the order of genes in those regions, and thus are related by common descent.

http://www.nih.gov/science/models/

Important issues for Comparative Genomics

1. Aligning very large sequences
2. Comparative approaches to gene finding
3. Comparative approaches to assigning function
4. Comparative approaches to identifying key regulatory regions
5. Many others are being invented daily…

Aligning very long sequences

What’s the problem?

Given: two genomes (very long sequences)
Produce: high quality alignment between the two genomes
• Standard Dynamic Programming = too expensive
• Made difficult by rearrangements of large homologous segments (cf. human/mouse comparison slide)
• Most progress in aligning similar genomes
• Still an open problem
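A back-of-the-envelope count shows why standard dynamic programming is too expensive at this scale. The sketch below assumes the textbook full-matrix formulation, round genome sizes of about 3 billion bases each, and 4 bytes per cell; all three are illustrative assumptions, not figures from the lecture.

```python
def dp_cells(n, m):
    """Number of cells in a full (n+1) x (m+1) dynamic programming matrix."""
    return (n + 1) * (m + 1)


def dp_memory_gib(n, m, bytes_per_cell=4):
    """Memory in GiB to hold the full matrix (bytes_per_cell is an assumption)."""
    return dp_cells(n, m) * bytes_per_cell / 2**30


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed round genome length, in bases
    print(f"cells:  {dp_cells(genome, genome):.2e}")
    print(f"memory: {dp_memory_gib(genome, genome):.2e} GiB")
```

Under these assumptions the matrix has on the order of 10^19 cells, which is why the methods that follow first narrow the search with exact word matches.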

Genome Research 10, p 950 (2000)

GLASS Method for Alignment

(First, FIND region that is syntenic, no transpositions allowed…)

1. Choose an integer K that is “large” (K = 20)

2. Find all matching K-mers in two sequences (not all possible ones are found…)

3. Create new alphabet with each “letter” = a K-mer found in step 2. Create two new strings from sequences.

4. Align the strings with DP scoring based on exact match plus match of flanking regions

GLASS (continued)

5. Find areas with high local alignment of the K-mer strings.

6. Prune high scoring local alignments that have overlapping K-mers

7. Remaining regions represent aligned portions of the two sequences.

8. Recursively apply steps 2-7 to the regions between aligned portions of the sequence, employing smaller and smaller K (15, 12, 9, 8, 7, 6, 5).

9. Align whatever remains with regular DP.
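The K-mer matching idea in steps 2 to 7 can be sketched roughly as follows. This is a simplified illustration, not the GLASS implementation: the function names are hypothetical, and the DP alignment of K-mer strings with flanking-region scores is replaced by a plain longest chain of non-overlapping, co-linear exact matches.

```python
from collections import defaultdict


def shared_kmers(seq_a, seq_b, k):
    """Return sorted (pos_a, pos_b) pairs where the same k-mer occurs in both."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return sorted(hits)


def colinear_chain(hits, k):
    """Longest chain of hits that do not overlap and increase in both sequences.

    Stand-in for GLASS steps 4-6 (an assumption); O(n^2) in the number of hits.
    """
    if not hits:
        return []
    best = [1] * len(hits)   # best chain length ending at each hit
    prev = [-1] * len(hits)  # back-pointer to the previous hit in that chain
    for n, (i, j) in enumerate(hits):
        for m in range(n):
            pi, pj = hits[m]
            if pi + k <= i and pj + k <= j and best[m] + 1 > best[n]:
                best[n] = best[m] + 1
                prev[n] = m
    n = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while n != -1:
        chain.append(hits[n])
        n = prev[n]
    return chain[::-1]


if __name__ == "__main__":
    hits = shared_kmers("ACGTACGGTTCA", "TTACGTAGGTTC", 4)
    print(colinear_chain(hits, 4))
```

The anchors in the chain play the role of the aligned portions in step 7; the stretches between consecutive anchors are what step 8 would revisit with a smaller K.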

ATGCTCCATAGAATACTAT = α

CCCATATAC = ρ

WABA Method (Kent et al)

PRE: Divide one genome into 2000 base queries that overlap by 1000 bases

1. Use a BLAST-like word hashing technique to store all hits of 8-mers of form XX*XX*X from the 2000 base query sequence to the other genome.

2. Choose highest scoring region for 2000 base query, and do slow DP alignment of query with 5000 bases around highest scoring region.

3. Find overlapping alignments, and merge.
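Step 1 can be illustrated with a small spaced-seed index. This is a sketch under assumptions, not the WABA code: it treats `*` as a position that is ignored when matching, takes the pattern as a parameter (defaulting to the string exactly as printed above), and uses hypothetical function names. Choosing the region for step 2 is approximated by picking the diagonal with the most seed hits.

```python
from collections import defaultdict


def spaced_key(window, pattern):
    """Keep only the bases at 'X' positions; '*' positions are ignored (assumption)."""
    return "".join(base for base, p in zip(window, pattern) if p == "X")


def seed_hits(query, target, pattern="XX*XX*X"):
    """Return (query_pos, target_pos) pairs whose spaced keys are identical."""
    span = len(pattern)
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[spaced_key(target[t:t + span], pattern)].append(t)
    hits = []
    for q in range(len(query) - span + 1):
        for t in index.get(spaced_key(query[q:q + span], pattern), ()):
            hits.append((q, t))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal t - q with the most hits."""
    counts = defaultdict(int)
    for q, t in hits:
        counts[t - q] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # The two sequences differ only at the ignored positions of the seed.
    print(seed_hits("ACGTTGC", "GGACATTACGG"))
```

A slow DP alignment would then be run only in a window around the best-scoring region, as in step 2.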

Comparative approaches to gene finding

New Gene in conserved region…

ROSETTA Method for Comparative Gene Calling

0. Use alignment generated by GLASS

1. Build simple gene model (single strand, initial exon, intron, internal exons, internal introns, terminal exon) similar to GENSCAN.

2. Find highest scoring “parse” on both genomes using DP with scoring based on: splice site locations, codon usage, amino acid similarity, exon length.

http://theory.lcs.mit.edu/crossspecies/

ftp://theory.lcs.mit.edu/pub/cb/rosetta/table117.xls

ROSETTA Performance

Trained on 117 orthologous genes.

95% sensitivity at nucleotide level
97% specificity at nucleotide level

(!!!!!!)

Could incorporate more sophisticated gene finding (à la GENSCAN) to improve performance.
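The slide does not define the two measures. Assuming the usual gene-finding convention, nucleotide-level sensitivity is the fraction of true coding bases that were predicted as coding, and specificity is the fraction of predicted coding bases that are truly coding. A minimal sketch with a hypothetical function name and made-up coordinates:

```python
def nucleotide_accuracy(predicted, actual):
    """Return (sensitivity, specificity) over sets of coding nucleotide positions.

    Assumes the gene-finding convention: specificity = TP / (TP + FP).
    """
    true_pos = len(predicted & actual)
    sensitivity = true_pos / len(actual) if actual else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    actual = set(range(100, 200))     # made-up annotated exon
    predicted = set(range(110, 205))  # made-up predicted exon
    print(nucleotide_accuracy(predicted, actual))
```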

Comparative approaches to assigning function

Phylogenetic Profiles

Problem: Assign function to proteins of unknown function.

Insight: Proteins that work together in pathways should be present or absent together across genomes (correlated occurrence).

Resulting Algorithm: Look for patterns of presence/absence in genomes. Matching patterns should *correlate* with function…

Steps for analysis

1. For each gene, use BLAST to assess presence of homologs in other species.

2. Create a binary matrix indicating which homologs are present in which organisms.

3. Cluster “rows” of matrix (for genes) to find genes with similar pattern of occurrence.

4. Find clusters whose profiles differ by 1, 2, 3, etc. bit flips.

5. Clustered rows are hypothesized to have common function…probably.
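Steps 2 to 4 can be sketched as a comparison of binary rows by Hamming distance (the number of bit flips separating two profiles). The gene names and profiles below are made up, the function names are hypothetical, and the BLAST search of step 1 is assumed to have already produced the presence/absence calls.

```python
from itertools import combinations


def hamming(p, q):
    """Number of organisms in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def profile_neighbors(profiles, max_flips=0):
    """Return (gene1, gene2, distance) for gene pairs within max_flips bit flips."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        d = hamming(p1, p2)
        if d <= max_flips:
            pairs.append((g1, g2, d))
    return pairs


if __name__ == "__main__":
    # Rows: genes; columns: four hypothetical organisms (1 = homolog present).
    profiles = {
        "geneA": (1, 0, 1, 1),
        "geneB": (1, 0, 1, 1),
        "geneC": (1, 1, 1, 1),
        "geneD": (0, 1, 0, 0),
    }
    print(profile_neighbors(profiles, max_flips=1))
```

Pairs at distance 0 share an identical profile; raising `max_flips` widens the neighborhood in the sense of step 4, at the cost of more chance matches.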

Evaluation

1. Evaluate by comparing keywords in SwissProt for genes with their clustering.

2. Evaluate by assessing metabolic function (as stored in EcoCYC) of genes that are clustered.

Result: Clustered genes much more likely than random to have shared function. But not 100%… (For (1): 18% vs. 4% overlap)

Protein-Protein Interactions

Problem: Assign function to proteins of unknown function.

Insight: Over evolution, some proteins that interact with one another fuse their genes to create a single protein.

Resulting Algorithm: Compare genomes to find genes that have fused in some organisms, but not in others. Use this to predict that they are involved in same function…probably.

Method for Detecting Fusion Proteins

1. Prepare a list of all known protein domains (using pfam or prodom databases).

2. For each protein in SwissProt, compute all the domains it contains.

3. Find triplets of SwissProt entries in which two (A, B) have nonhomologous domains, and a third (C) has both of these domains together.

4. For a given genome (e.g. E. coli) look for evidence of A & B homologs, and predict an interaction (physical or functional) based on C.
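Step 3 can be sketched as a search over per-protein domain sets. This is a simplified illustration with hypothetical protein and domain identifiers: "nonhomologous" is approximated as "shares no domain", and the domain assignments of steps 1 and 2 are assumed to be given.

```python
from itertools import combinations


def fusion_triplets(domains):
    """Return (A, B, C) where A and B share no domain and C contains all of both.

    domains: dict mapping protein id -> set of domain ids.
    """
    triplets = []
    for c, c_dom in sorted(domains.items()):
        for a, b in combinations(sorted(domains), 2):
            if c in (a, b):
                continue
            a_dom, b_dom = domains[a], domains[b]
            if not a_dom or not b_dom:
                continue  # proteins without annotated domains tell us nothing
            if a_dom & b_dom:
                continue  # shared domain: treated as homologous (assumption)
            if a_dom <= c_dom and b_dom <= c_dom:
                triplets.append((a, b, c))
    return triplets


if __name__ == "__main__":
    domains = {
        "P1": {"D1"},
        "P2": {"D2"},
        "P3": {"D1", "D2"},  # candidate fusion of P1-like and P2-like proteins
        "P4": {"D3"},
    }
    print(fusion_triplets(domains))
```

Each reported triplet would then be carried into step 4: homologs of A and B in the genome of interest become a predicted interacting pair.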

Find SwissProt Triplets (A B C)

Enright et al, Nature. 1999 Nov 4;402(6757):23, 25-6.

Evaluation

1. Compare SwissProt annotations for genes of known function with their predicted functions. (32% vs. 14% for random)

2. Use DIP (Database of Interacting Proteins) as gold standard (6.4% of entries discovered by this method).

3. Correlate results of phylogenetic profile work. (8X more likely to occur in both than random interactions).

Putting it all together…

Nature 402 (1999), p 83

Comparative approaches to finding key regulatory regions

Non-coding regulatory regions

Problem: How do we identify areas of non-coding DNA that are important for regulating gene expression?

Insight: Related species may conserve important non-coding DNA in much the same way they conserve exons (vs. introns and other non-regulatory non-coding DNA, which evolve more rapidly).

Resulting Algorithm: Look for conserved non-coding regions as guide for search for regulatory regions.
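One simple way to look for conserved non-coding regions is to slide a window along a pairwise alignment and keep windows that avoid annotated coding columns and exceed an identity threshold. The sketch below is an illustration under assumptions: the window size, the threshold, the function name, and the toy alignment are all hypothetical choices, not values from the lecture or the cited paper.

```python
def conserved_noncoding(aln_a, aln_b, coding, window=10, min_identity=0.8):
    """Return merged half-open (start, end) alignment intervals that are
    non-coding and at least min_identity identical within some window.

    aln_a, aln_b: equal-length aligned sequences, '-' marks a gap.
    coding: set of alignment columns annotated as coding.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    match = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    regions = []
    for s in range(len(match) - window + 1):
        cols = range(s, s + window)
        if any(c in coding for c in cols):
            continue  # skip windows that touch coding sequence
        if sum(match[c] for c in cols) / window >= min_identity:
            if regions and s <= regions[-1][1]:
                regions[-1] = (regions[-1][0], s + window)  # extend last region
            else:
                regions.append((s, s + window))
    return regions


if __name__ == "__main__":
    a = "ACGTACGTAC" + "TTTTTTTTTT" + "GGGGCCCCGG"
    b = "ACGTACGTAC" + "AAAAAAAAAA" + "GGGGCCCCGG"
    print(conserved_noncoding(a, b, coding=set(range(10))))
```

The intervals returned are only candidates; they guide the search for regulatory regions rather than establish function.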

Genome Research 12, p 832 (2002)

Conclusions

1. Genome-scale alignment is still difficult, and there is room for improvement.

2. Comparative analysis of genomes can yield insight into location of genes and location of regulatory regions.

3. Comparative analysis of genomes can yield insight into evolutionary history of protein function, which is useful for function prediction/annotation.
