comparative genomics - fordham universitycschweikert/cisc4020/lecture 8.pdf · comparative genomics...

Page 1: Comparative genomics - Fordham Universitycschweikert/cisc4020/Lecture 8.pdf · Comparative genomics Gene order Different organisms show different rates of chromosome rearrangements

Comparative genomics (Chapter 15)

Features that are investigated include:
1. Genome size variation
2. Base composition biases
3. Gene order
   1. Colinearity
   2. Identification of orthologs
   3. Functional predictions
4. Whole genome alignments
   1. Identify genes
   2. Improve gene annotation
   3. Identify regulatory regions (phylogenetic footprinting)
5. Evolution of genomes

Page 2:

Comparative genomics: Genome size

• There are huge differences in genome size among related organisms.
• More complex organisms do not necessarily have larger genomes ("C-value paradox").
• Repetitive elements are responsible for a large proportion of the differences.
• The amount of DNA affects cell size, speed of the cell cycle, and other traits.

Lungfish: 140 Gb!

Page 3:

Comparative genomics: Aligning genomic sequences

Dotplots to compare orthologous BACs: Barley (horizontal) vs. rice (vertical)

[Figure: dotplot of Barley BAC 635P2 (horizontal) vs. Rice BAC 36I05 (vertical). Axis labels include G1 to G5 (with G4a and G4b on one axis) and Ta to Ti. Dubcovsky et al. Plant Physiology 125:1342-1353]
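A dotplot like the barley vs. rice comparison can be sketched in a few lines. The version below is a minimal illustration, not the method used for the published figure: it marks every position where an exact word of `word` bases is shared between the two sequences. The function names, the word size, and the toy sequences are assumptions made for the example.

```python
def dotplot(seq_x, seq_y, word=3):
    """Return sorted (x, y) pairs where a word of length `word` matches exactly."""
    # Index every word of the horizontal sequence by its start position.
    index = {}
    for i in range(len(seq_x) - word + 1):
        index.setdefault(seq_x[i:i + word], []).append(i)
    # Look up every word of the vertical sequence in that index.
    hits = []
    for j in range(len(seq_y) - word + 1):
        for i in index.get(seq_y[j:j + word], []):
            hits.append((i, j))
    return sorted(hits)


def render(hits, len_x, len_y, word):
    """Draw the hits as text: '*' for a shared word, '.' otherwise."""
    grid = [["." for _ in range(len_x - word + 1)]
            for _ in range(len_y - word + 1)]
    for x, y in hits:
        grid[y][x] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    horizontal = "ACGTACGTTT"  # toy stand-in for the horizontal sequence
    vertical = "TTACGTACGA"    # toy stand-in for the vertical sequence
    hits = dotplot(horizontal, vertical, word=4)
    print(render(hits, len(horizontal), len(vertical), 4))
```

Conserved, colinear regions show up as diagonal runs of `*`; repeats produce additional parallel diagonals. A real BAC comparison would also need the reverse strand and a larger word size.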

Page 4:

What have we learned?
Dubcovsky and Dvorak 2007, Science 316: 1862-1866

• Small groups of genes are dispersed within a sea of repetitive elements.
• Rapid divergence in intergenic regions: comparison of orthologous A and Am VRN2 regions shows 1.1 MYA divergence and 31% similarity.
• Fast decay of % similarity is also affecting genes.
• Fast rate of gene deletions and inactivation by retro insertions.
• Small phenotypic changes in polyploid wheat. Valuable source of diversity.

[Figure, panels A to E: comparison of the Am genome (T. monococcum) and A genome (durum) regions, with a 20 kb label; plot "Decay of synteny in intergenic regions", y axis C(t) from 0.0 to 1.0, x axis million years from 0 to 12, with points labeled A-Am, A-B-D wheat, Wheat-barley, and H vs Ch (Human vs. Chimp, 7 mya, 3% divergence).]

Page 5:

Comparative genomics: Composition bias

Effect of repetitive elements on GC content

• Grass genes are 51% GC; Arabidopsis genes are 44% GC.
• The overall GC content in wheat (46%) is 2% higher than in Arabidopsis or rice.
• The GC content of barley and wheat is very similar to the GC content of the most abundant retroelements.
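The GC percentages quoted above are simple base counts. A minimal sketch of that calculation is shown here; the function name and the choice to skip ambiguous bases such as N are assumptions for the example, not something the slides specify.

```python
def gc_content(seq):
    """Return the percentage of G and C among the unambiguous bases of seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))    # half of the bases are G or C
    print(gc_content("ggccNNaatt"))  # N is ignored, case does not matter
```

Running the same function separately on gene sequences and on retroelement sequences would let the gene-level and genome-level values be compared directly.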

Page 6:

Comparative genomics: Gene order

Different organisms show different rates of chromosome rearrangements.

Comparative maps of the wheat genome described in terms of the rice genome (A) and the Aegilops umbellulata genome (B). (A) Data from Kurata et al. (1994), Van Deynze et al. (1995). (B) Data from Zhang et al. (1998).

A. Wheat-rice: 50 million years
B. Wheat-Aegilops: 5 million years

Human-Mouse (80 million years): many rearrangements (180 blocks) but some large conserved regions!
Human-Chimpanzee (5 million years): almost completely colinear.
• Allows transfer of genetic mapping data between organisms.
• Allows functional analysis of human genes via precise deletion of their colinear regions in the mouse genome.
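Counting conserved blocks, as in the human vs. mouse comparison, can be illustrated with a toy sketch. It assumes each genome is given as a list of unique ortholog identifiers in chromosomal order (a hypothetical input format), and it groups genes that stay adjacent in both genomes, in the same or in the inverted direction. Published block counts come from far more elaborate methods.

```python
def colinear_blocks(order_a, order_b):
    """Split the genes shared by two gene orders into colinear blocks."""
    in_a = set(order_a)
    # Rank genes of genome B among shared genes only, so gene loss is tolerated.
    shared_b = [gene for gene in order_b if gene in in_a]
    pos_b = {gene: rank for rank, gene in enumerate(shared_b)}
    shared_a = [gene for gene in order_a if gene in pos_b]

    blocks, current, step = [], [], 0
    for gene in shared_a:
        if current:
            delta = pos_b[gene] - pos_b[current[-1]]
            # Extend the block while neighbours stay adjacent in one direction.
            if abs(delta) == 1 and step in (0, delta):
                current.append(gene)
                step = delta
                continue
            blocks.append(current)
        current, step = [gene], 0
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
    genome_b = ["g1", "g2", "g5", "g4", "g3", "g6"]  # g3-g5 inverted
    for block in colinear_blocks(genome_a, genome_b):
        print(block)
```

Two almost completely colinear genomes give a single block, while every inversion or translocation adds breakpoints and therefore blocks.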

Page 7:

Homology search for the mouse genome

• Homology search of all genes in the mouse genome
• 27% in other metazoans
• 29% in other eukaryotes
• 6% in other chordates
• 14% in other mammals
• Less than 1% rodent specific

[Figure: chart of mouse gene homologs by category. Labels: Other, Eukaryote, Metazoan, Chordate, Mammal specific, Rodent specific.]

Page 8:

Comparative genomics: Aligning genomic sequences

Dotplots to discover duplications in rice

• A large genome duplication occurred approximately 70 million years ago, before the divergence of most grass subfamilies.
• Diploid grasses are actually ancient polyploids.
• The figure shows the arrangement of duplicated protein-encoding genes in rice in the order found in the current sequence assembly. Both X and Y axes represent 45,174 genes in their chromosomal order.
• Colors indicate same (red) or opposite (green) transcriptional orientations.
• Differential gene loss contributes to apparent incongruities in comparative grass genomics.
• Paterson et al. 2004 PNAS
• A similar result has been described for Arabidopsis, suggesting that this species is also an ancient polyploid.

Page 9:

Comparative genomics: Gene annotation

Though most human and mouse orthologs have similar structure, some differences are sometimes observed:
• Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homolog.

[Figure: gene structures of the human and mouse genes; label 5,500 bp.]

Page 10:

Algorithms for aligning genomic sequences

Finding orthologous regions between two genomes is nontrivial.

Existing methods based on dynamic programming algorithms (e.g. Needleman and Wunsch / Smith and Waterman) or hashing (e.g. BLAST / FASTA) rapidly run out of memory.

Even Megablast and BLAT cannot handle large genome comparisons.

Specific programs
• BLASTZ
• LAGAN
• AVID

BLASTZ (http://zpicture.dcode.org/, see zpicture.pdf)
• Local alignment program. Aligns 2 Mb in <1 min.
• First masks regions in seq. 1 repeated in seq. 2.
• Searches for stretches of 19 bp with 12 matches (1 transition OK).
• After the initial match, a gap-free extension is performed until a cumulative score reaches a threshold (3000).
• If the threshold is met, the region is realigned, now allowing gaps.
• Alignments with scores >5000 move to the next phase.
• Scores are calculated using refined substitution matrices based on aligned human-mouse non-coding sequence.
• Connects individual alignments separated by <50 kb.

[Figure: BLASTN and BLASTZ substitution matrices.] Substitution matrices used by BLAST and BLASTZ. The BLASTZ matrix is based on observed substitutions in aligned conserved regions of mouse and human. BLASTZ aligned mouse sequences to 40% of the human genome.
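The seed, gap-free extension, and threshold steps listed for BLASTZ can be sketched as follows. This is a simplified illustration under stated assumptions, not BLASTZ itself: it uses contiguous exact k-mer seeds instead of the 12-of-19 pattern, a flat +1/-1 score instead of the human-mouse substitution matrix, and small toy thresholds. All function names and the `xdrop` stopping rule are choices made for the example.

```python
def find_seeds(seq1, seq2, k):
    """Yield (i, j) start positions of exact k-mers shared by seq1 and seq2."""
    index = {}
    for i in range(len(seq1) - k + 1):
        index.setdefault(seq1[i:i + k], []).append(i)
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], ()):
            yield i, j


def _extend(seq1, seq2, a, b, step, match, mismatch, xdrop):
    """Gap-free extension in one direction; returns (best gain, its length)."""
    gain = best_gain = best_len = length = 0
    while 0 <= a < len(seq1) and 0 <= b < len(seq2):
        gain += match if seq1[a] == seq2[b] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > xdrop:
            break  # score dropped too far below the best seen so far
        a += step
        b += step
    return best_gain, best_len


def ungapped_hsp(seq1, seq2, i, j, k, match=1, mismatch=-1, xdrop=3):
    """Extend a seed left and right; return (score, start1, start2, length)."""
    right_gain, right_len = _extend(seq1, seq2, i + k, j + k, 1,
                                    match, mismatch, xdrop)
    left_gain, left_len = _extend(seq1, seq2, i - 1, j - 1, -1,
                                  match, mismatch, xdrop)
    score = k * match + left_gain + right_gain
    return score, i - left_len, j - left_len, left_len + k + right_len


def candidate_regions(seq1, seq2, k=8, threshold=12):
    """Keep gap-free extensions whose score reaches the threshold."""
    kept = set()
    for i, j in find_seeds(seq1, seq2, k):
        hsp = ungapped_hsp(seq1, seq2, i, j, k)
        if hsp[0] >= threshold:
            kept.add(hsp)
    return sorted(kept)


if __name__ == "__main__":
    core = "ACGTTGCAAGCTTAGC"
    print(candidate_regions("TTTT" + core + "AAAA", "GGGG" + core + "CCCC"))
```

In the real program the regions that pass would next be realigned with gaps allowed; that step is left out here to keep the sketch short.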

Page 11:

zPicture http://zpicture.dcode.org/

zPicture is a pairwise alignment and visualization tool that compares two sequences using the local alignment program BlastZ.
Multiple sequences: http://zpicture.dcode.org/multiz.php

[Figure: zPicture alignment, Rice (AY013245) vs. Barley (AY013246) [no gene annotation provided so all in red]. Labeled features: Gene 1 4548-7287; Gene 2 8663-13534 (R); Gene 3 14166-22601; Gene 4 24136-29057; Repeat.]

Page 12:

Comparative genomics: LAGAN (Limited Area Global Alignment of Nucleotides)
http://lagan.stanford.edu/lagan_web/index.shtml

LAGAN: global pairwise and multiple alignment of finished sequences.

• Detects closely and distantly related sequences.
• If some of the sequences are in a draft format, your query will be redirected to AVID.
• Multiple alignments will be visualized by VISTA.
• This is the only alignment program available through the VISTA server that produces true multiple alignments.
• LAGAN performs better than BLASTZ for distantly related organisms.
• Shuffle-LAGAN: detects rearrangements and inversions.

Page 13:

AVID

AVID: global pairwise alignment.

• Fast alignment of large sequences.
• Detects weak homologies.
• One of the sequences should be finished, but all others can be either finished or in a draft format.
• For all finished sequences in the set, AVID generates all-against-all pairwise alignments.
• Draft sequences are aligned to the finished sequence.
• View using VISTA.
• Similar to LAGAN in that it uses anchors.

[Figure: AVID anchoring steps. Clean matches (first) and repeat matches; set of non-overlapping, non-crossing matches (shown in red); recursion: repeat for each inter-anchor region from the previous step.]
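Selecting a set of non-overlapping, non-crossing matches to serve as anchors can be illustrated with a small dynamic-programming chain. This is a generic sketch and not AVID's actual procedure: matches are assumed to be gap-free `(start1, start2, length)` tuples, the chain maximizes total matched length, and the recursion into inter-anchor regions is left out. The function name and the input format are hypothetical.

```python
def chain_anchors(matches):
    """Pick the non-overlapping, non-crossing chain with the largest total length.

    Each match is (start1, start2, length), a gap-free match between the
    two sequences.
    """
    if not matches:
        return []
    ordered = sorted(matches)          # by position in sequence 1
    best = [m[2] for m in ordered]     # best chain length ending at each match
    prev = [-1] * len(ordered)         # back-pointers for the traceback
    for cur, (s1, s2, length) in enumerate(ordered):
        for p in range(cur):
            p1, p2, p_len = ordered[p]
            # The previous match must end before this one starts in BOTH sequences.
            fits = p1 + p_len <= s1 and p2 + p_len <= s2
            if fits and best[p] + length > best[cur]:
                best[cur] = best[p] + length
                prev[cur] = p
    k = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    return chain[::-1]


if __name__ == "__main__":
    matches = [(0, 0, 10), (12, 50, 8), (15, 14, 20), (40, 40, 5)]
    print(chain_anchors(matches))  # (12, 50, 8) crosses the others and is dropped
```

The quadratic loop is fine for a handful of matches; for thousands of matches a sparse chaining algorithm would be the usual replacement.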

Page 14:

VISTA http://genome.lbl.gov/vista/index.shtml

• Option to use AVID or LAGAN.
• If one of the sequences is not finished, uses AVID.
• Can align a large sequence to genome assemblies.
• Can perform multiple alignments.
• Can enter GenBank accessions.
• Accepts annotation files.
• Window size (default 100 bp) and minimum conservation to show (default 50%) can be adjusted.
• Aligns in forward orientation.

[Figure: Wheat BAC (AF459639) against the rice genome annotation.]
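The window size and minimum conservation settings mentioned for VISTA correspond to a sliding-window percent identity over a pairwise alignment. The sketch below shows that idea only; it is not VISTA's code, and the function names, the gapped-string input format, and the rule that gap columns never count as identical are assumptions.

```python
def window_identity(aln_a, aln_b, window=100):
    """Percent identity of each full window along two aligned (gapped) strings."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    same = [1 if a == b and a != "-" else 0
            for a, b in zip(aln_a.upper(), aln_b.upper())]
    if len(same) < window:
        return []
    total = sum(same[:window])
    values = [100.0 * total / window]
    for i in range(window, len(same)):
        total += same[i] - same[i - window]  # slide the window by one column
        values.append(100.0 * total / window)
    return values


def windows_to_show(aln_a, aln_b, window=100, min_conservation=50.0):
    """Return (window start, percent identity) for windows above the cutoff."""
    return [(start, pct)
            for start, pct in enumerate(window_identity(aln_a, aln_b, window))
            if pct >= min_conservation]


if __name__ == "__main__":
    # Tiny toy alignment, so a 4-column window replaces the 100 bp default.
    print(windows_to_show("ACGTACGT", "ACGTTTTT", window=4))
```

A larger window smooths the curve and hides short conserved elements, while a smaller one makes the plot noisier.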

Page 15:

MULTI-LAGAN http://lagan.stanford.edu/lagan_web/index.shtml

MULTI-LAGAN: similar to ClustalW.

It produces dynamic alignments and also a PDF of the alignments presented here.

Any of the genomes can be used as reference.

In the alignment between rice, barley, and wheat it fails to detect the inverted Gene 2 in barley.

It produces an evolutionary tree for the submitted sequences.

[Figure: MULTI-LAGAN alignments in barley view and rice view, with tracks for rice, wheat, and barley. Gene labels in the panels: Gene 1 2563-6313 (6 exons); Gene 1 4548-7287 (6 exons); Gene 2 8663-13534 (13 exons, missed barley); Gene 2 30279-34393 (13 exons, missed, reverse); Gene 3 14166-22601 (18 exons); Gene 3 64887-73061 (18 exons); Gene 4 24136-29057 (18 exons); Gene 4a 74857-80916 (18 exons); Gene 4b 84895-88925 (18 exons).]

Page 16:

Shuffle-LAGAN

http://genome.lbl.gov/cgi-bin/VistaInput?align_pgm=slagan&num_seqs=2

[Figure: Shuffle-LAGAN alignment, barley view. Gene 1 (6 exons); Gene 2 (13 exons, reverse); Gene 3 (18 exons); Gene 4a (18 exons); Gene 4b (18 exons).]

Page 17:

Different levels of conservation

Page 18:

Precomputed genomic alignments

Major sources of precomputed whole-genome alignments:
• Ensembl http://www.ensembl.org/index.html [human-chimp-mouse-rat-chicken-fugu-zebrafish-Drosophila-C. elegans]
• VISTA browser http://genome.lbl.gov/vista/index.shtml [human-chimp-mouse-rat-chicken] MULTI-LAGAN
• UCSC Genome Browser http://genome.ucsc.edu/ [human-chimp-mouse-rat-chicken-fugu] BLASTZ

[Figure: VISTA tracks (chimp, mouse, rat, chicken) and UCSC tracks (conservation, chimp, mouse, rat, chicken, RepeatMasker).]

Page 19:

Using multiple species for phylogenetic footprinting

• The location of regulatory sequences can also be found by comparing several related sequences.
• Multiple alignments are performed.
• Better able to home in on important regions.
• Conservation alone is not enough; need to validate the importance of elements.
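Comparing several related sequences to locate candidate regulatory elements can be illustrated with a very small sketch: report every motif of a fixed size that occurs, with no mismatches, in all of the sequences. This is a deliberate simplification and not the algorithm of any particular footprinting program, which would typically also take the phylogeny into account. The function name, the default motif size, and the toy promoter fragments are assumptions.

```python
def conserved_motifs(sequences, motif_size=12):
    """Return the set of motifs of length motif_size found in every sequence."""
    if not sequences:
        return set()

    def kmers(seq):
        seq = seq.upper()
        return {seq[i:i + motif_size]
                for i in range(len(seq) - motif_size + 1)}

    shared = kmers(sequences[0])
    for seq in sequences[1:]:
        shared &= kmers(seq)  # keep only motifs also present in this sequence
    return shared


if __name__ == "__main__":
    upstream = ["GGTATAAAGG", "CCTATAAACC", "AATATAAATT"]  # toy fragments
    print(conserved_motifs(upstream, motif_size=6))
```

As the last bullet above says, a motif found this way is only a candidate: conservation alone does not show that the element is functional.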

Page 20:

Phylogenetic footprinting

A large portion of the genome contains regulatory information.

[Figure: 5' TATA 3']

Cis-regulatory elements:
• Basal transcription machinery binding sites (TATA)
• Enhancers

Programs to detect conserved regulatory elements:
• FootPrinter
• rVISTA
• Enhancers: http://enhancer.lbl.gov/

Identification of conserved regions: phylogenetic footprint

FootPrinter http://genome.cs.mcgill.ca/cgi-bin/FootPrinter3.0/FootPrinterInput2.pl

• Useful to identify putative transcription factor binding sites

• Fig: Upstream regions of 4 Beta-globin genes, prealigned with ClustalW (Motif size=12, Max # of mutations =0