Page 1: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Biology 224
Instructor: Tom Peavy

Oct 12 & 14, 2009

<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Gene Structure & Genomes

Page 2: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Similarities & Differences Prokaryotic vs. Eukaryotic Genomic DNA

-- Size of genome?
-- Complexity of genes?
-- Open reading frames (1 gene per stretch)?
-- Regulatory sequences for transcription?
-- Density of genes?
-- One gene = 1 transcript?

Page 3: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Finding genes in eukaryotic DNA

Types of genes include:
• protein-coding genes
• pseudogenes
• functional RNA genes: tRNA, rRNA and others
-- snoRNA: small nucleolar RNA
-- snRNA: small nuclear RNA
-- miRNA: microRNA

There are several kinds of exons:
-- noncoding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless

Page 4: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Eukaryotic gene prediction algorithms distinguish several kinds of exons

Page 5: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Gene-finding algorithms

Homology-based searches (“extrinsic”): rely on previously identified genes

Algorithm-based searches (“intrinsic”): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA

(Refer to Chapter 16, Eukaryotic Chromosome, Figure 16-9 for a list of extrinsic vs. intrinsic algorithms.)

Page 6: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Extrinsic, homology-based searching: compare genomic DNA to expressed genes (ESTs)

[Figure labels: DNA, RNA, RNA, protein; intron]

Page 7: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[Figure labels: DNA, RNA]

Intrinsic, algorithm-based searching: Identify open reading frames (ORFs). Compare DNA in exons (unique codon usage) to DNA in introns (unique splice sites) and to noncoding DNA.
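
The ORF-identification step can be illustrated with a minimal sketch. It scans only the three forward reading frames for an ATG followed in-frame by a stop codon; the function name, the `min_codons` threshold, and the example sequence are hypothetical, and this is not the method used by any particular gene finder (real programs also model splice sites, codon usage, and the reverse strand).

```python
# Minimal forward-strand ORF scan (illustrative sketch only).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF; end is exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons including the stop codon itself.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Hypothetical example sequence: ATG GCT TGG TAA embedded in frame 2.
print(find_orfs("CCATGGCTTGGTAACC"))
```

In eukaryotic DNA an ORF scan like this is only a starting point, since coding sequence is interrupted by introns.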

Page 8: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009
Page 9: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Comparative genomics: Compare gene models between species. (For annotation of the chimpanzee genome reported in 2005, BLAT and BLASTZ searches were used to align the two genomes.)

[Figure labels: chimpanzee DNA, human DNA]

Page 10: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Finding genes in eukaryotic DNA

Cautionary Notes:

-- The quality of EST sequence is sometimes low

-- Highly expressed genes are disproportionately represented in many cDNA libraries

-- ESTs provide no information on genomic location

Page 11: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Finding genes in eukaryotic DNA

Both intrinsic and extrinsic algorithms vary in their rates of false-positive and false-negative gene identification.

Programs such as GENSCAN and GRAIL account for features such as the nucleotide composition of coding regions and the presence of signals such as promoter elements.
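
False-positive and false-negative rates can be made concrete by comparing predicted exons against a reference annotation. The sketch below assumes exons are represented as (start, end) tuples and that a prediction counts as correct only if both boundaries match exactly; the function name and the coordinates are hypothetical.

```python
def evaluate_exons(predicted, annotated):
    """Compare predicted exons to annotated exons by exact coordinates."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}

# Hypothetical coordinates: one exon has a wrong 3' boundary, one is spurious.
annotated = [(100, 200), (300, 450), (600, 700)]
predicted = [(100, 200), (300, 440), (600, 700), (900, 950)]
print(evaluate_exons(predicted, annotated))
```

Note that a single mispredicted exon-intron boundary produces both a false positive and a false negative under this exact-match criterion.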

Page 12: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Finding genes in eukaryotic DNA

In a study using 100,000 base pairs of human DNA, intrinsic algorithms correctly identified several exons of RBP4, but failed to generate a complete gene model.

As another example, initial annotation of the rice genome yielded over 75,000 gene predictions, only 53,000 of which were complete (having initial and terminal exons). Also, it is very difficult to accurately identify exon-intron boundaries.

Estimates of gene content improve dramatically when finished (rather than draft) sequence is analyzed.

Page 561

Page 13: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

There are three main resources for genomes:

EBI: European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/

NCBI: National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov

TIGR: The Institute for Genomic Research
http://www.tigr.org

Genome sequencing projects

Page 14: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

C value paradox: why eukaryotic genome sizes vary

The haploid genome size of eukaryotes, called the C value, varies enormously.

Small genomes include:
-- Encephalitozoon cuniculi (2.9 Mb)
-- A variety of fungi (10-40 Mb)
-- Takifugu rubripes (pufferfish) (365 Mb) (same number of genes as other fish or as the human genome, but 1/10th the size)

Large genomes include:
-- Pinus resinosa (Canadian red pine) (68 Gb)
-- Protopterus aethiopicus (marbled lungfish) (140 Gb)
-- Amoeba dubia (amoeba) (690 Gb)
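
To get a feel for the spread, the sizes listed above can be expressed relative to the human genome. This is a small arithmetic sketch; the human value of ~3 x 10^9 bp is the figure quoted elsewhere in these slides, and the dictionary and function names are hypothetical.

```python
# Genome sizes in base pairs, taken from the slide text.
GENOME_SIZES_BP = {
    "Encephalitozoon cuniculi": 2.9e6,
    "Takifugu rubripes": 365e6,
    "Homo sapiens": 3e9,            # ~3 x 10^9 bp, quoted elsewhere in the slides
    "Pinus resinosa": 68e9,
    "Protopterus aethiopicus": 140e9,
    "Amoeba dubia": 690e9,
}

def fold_vs_human(species):
    """Genome size of a species as a multiple of the human genome size."""
    return GENOME_SIZES_BP[species] / GENOME_SIZES_BP["Homo sapiens"]

for name in GENOME_SIZES_BP:
    print(f"{name}: {fold_vs_human(name):.3g} x human")
```

With these numbers the amoeba genome comes out at roughly 230 times the human genome, while the pufferfish genome is a little over a tenth of it.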

Page 15: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[Figure: Genome sizes in nucleotide base pairs. Axis from 10^4 to 10^11 bp; groups shown: viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, mammals.]

The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.

The human genome is thought to contain ~30,000-40,000 genes.

http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

Page 16: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

C value paradox: why eukaryotic genome sizes vary

The range in C values does not correlate well with the complexity of the organism. This phenomenon is called the C value paradox.

Why?

Page 17: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Britten and Kohne (1968) identified repetitive DNA classes

Reassociation kinetics: isolate genomic DNA, shear it, denature (melt) it, and measure the rate of DNA reassociation.
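
Reassociation data of this kind are commonly summarized with an ideal second-order rate law, in which the fraction of DNA still single-stranded is 1 / (1 + k * C0t). The sketch below assumes that idealized model; the rate constants are hypothetical values chosen only to contrast a fast (repetitive) and a slow (single-copy) component.

```python
def fraction_single_stranded(c0t, k):
    """Fraction of DNA still single-stranded under ideal second-order kinetics."""
    return 1.0 / (1.0 + k * c0t)

def cot_half(k):
    """C0t value at which half of the DNA has reassociated."""
    return 1.0 / k

# Hypothetical rate constants: repetitive DNA reassociates faster (larger k).
K_REPETITIVE = 100.0
K_SINGLE_COPY = 0.001

for label, k in (("repetitive", K_REPETITIVE), ("single-copy", K_SINGLE_COPY)):
    print(label, "C0t(1/2) =", cot_half(k))
```

Under this model, sequence classes that are present in many copies reach half-reassociation at much lower C0t values, which is how distinct repetitive classes show up as separate components of the curve.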

Page 18: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Protein-coding genes in eukaryotic DNA: a new paradox

Why is the number of protein-coding genes about the same for worms, flies, plants, and humans?

This has been called the N value paradox or the G value paradox (both refer to the number of genes).

Page 19: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Five main classes of repetitive DNA

1. Interspersed repeats (RNA/DNA transposon-derived) -- approx. 45% of human genome (e.g. LINEs, SINEs, Alu)

2. Processed pseudogenes (gene loss)

3. Simple sequence repeats -- microsatellites (1-12 bp); minisatellites (12-500 bp)

4. Segmental duplications -- blocks of about 1 kb to 300 kb that are copied intra- or interchromosomally (5% of human genome)

5. Blocks of tandem repeats -- includes telomeric and centromeric repeats and can span millions of bp (often species-specific)

Page 20: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

The spectrum of variation

Category of variation          Size            Type
Single base pair changes       1 bp            SNPs, point mutations
Small insertions/deletions     1 – 50 bp
Short tandem repeats           1 – 500 bp      microsatellites
Fine-scale structural var.     50 bp – 5 kb    del, dup, inv, tandem repeats
Retroelement insertions        0.3 – 10 kb     SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.     5 kb – 50 kb    del, dup, inv, tandem repeats
Large-scale structural var.    50 kb – 5 Mb    del, dup, inv, large tandem repeats
Chromosomal variation          >>5 Mb          aneuploidy

Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42

Page 21: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009
Page 22: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Human chromosome 21 at www.ensembl.org

[Figure labels: nucleolar organizing center, centromere]

Page 23: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Human chromosome 21 at UCSC Genome Browser

[Figure label: centromere]

Page 24: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Chromosomes can be highly dynamic, in several ways.

• Whole genome duplication (autopolyploidy) can occur, as in yeast (Chapter 15) and some plants.

• The genomes of two distinct species can merge, as in the mule (male donkey, 2n = 62 and female horse, 2n = 64)

• An individual can acquire an extra copy of a chromosome (e.g. trisomy 21 [Down syndrome], trisomy 13 or 18)

• Chromosomes can fuse; e.g. human chromosome 2 derives from a fusion of two ancestral primate chromosomes

• Chromosomal regions can be inverted or deleted

• Segmental and other duplications occur

Page 565

Page 25: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Conservative nature of chromosome evolution

Among placental mammals, the number of diploid chromosomes is:

84 in black rhinoceros
46 in Homo sapiens
17 in two rodent species

The process of chromosome evolution tends to remain conservative. Heterozygous carriers of most types of chromosomal rearrangements are semisterile. Thus many chromosomal changes cannot be fixed.

Ohno (1970) p. 41

Page 26: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Diploidization of the tetraploid

Ohno (1970) pp. 98-101

A species can become tetraploid. All loci are duplicated, and what was formerly the diploid chromosome complement is now the haploid set of the genome.

Polyploid evolution occurs commonly in plants. For example, in the cereal plant Sorghum:

S. versicolor (diploid) 2n = 2 x 5; 10 chromosomes
S. sudanense (tetraploid) 4n = 4 x 5; 20 chromosomes
S. halepense (octoploid) 8n = 8 x 5; 40 chromosomes

Page 27: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

“Retrotransposons constitute over 40% of the human genome and consist of several millions of family members. They play important roles in shaping the structure and evolution of the genome and in participating in gene functioning and regulation. Since L1, Alu, and SVA retrotransposons are currently active in the human genome, their recent and ongoing retrotranspositional insertions generate a unique and important class of genetic polymorphisms (for the presence or absence of an insertion) among and within human populations. As such, they are useful genetic markers in population genetics studies due to their identical-by-descent and essentially homoplasy-free nature. Additionally, some polymorphic insertions are known to be responsible for a variety of human genetic diseases. dbRIP is a database of human Retrotransposon Insertion Polymorphisms (RIPs). dbRIP contains all currently known Alu, L1, and SVA polymorphic insertion loci in the human genome.”

--dbRIP

Homoplasy: having some states arise more than once on a tree.

Page 28: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

http://falcon.roswellpark.org:9090/index.html

Page 29: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Five main classes of repetitive DNA

Page 547

2. Processed pseudogenes

These genes have a stop codon or frameshift mutation and do not encode a functional protein. They commonly arise from retrotransposition, or following gene duplication and subsequent gene loss.

For a superb on-line resource, visit Mark Gerstein’s website, http://www.pseudogene.org. Gerstein and colleagues (2006) suggest that there are ~19,000 pseudogenes in the human genome, slightly fewer than the number of functional protein-coding genes. (11,000 non-processed, 8,000 processed [lack introns].)
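
The two disabling features named above (a stop codon or a frameshift) can be screened for in a candidate coding sequence with a very rough check. This is only a sketch under simplifying assumptions: the input is assumed to start at the first codon, a length that is not a multiple of three is treated as a possible frameshift, and the function name and test sequences are hypothetical. Real pseudogene annotation compares the candidate to its functional parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disabling_features(cds):
    """List simple features suggesting a coding sequence is disrupted."""
    cds = cds.upper()
    features = []
    if len(cds) % 3 != 0:
        features.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # A stop codon anywhere before the final codon is premature.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            features.append(f"premature stop codon at codon {index + 1}")
            break
    return features

print(disabling_features("ATGGCTTGGTAA"))   # intact toy sequence
print(disabling_features("ATGTAGGCTTAA"))   # toy sequence with an internal stop
```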

Page 30: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Five main classes of repetitive DNA

Page 546

3. Simple sequence repeats

Microsatellites: from one to a dozen base pairs
Examples: (A)n, (CA)n, (CGG)n
These may be formed by replication slippage.

Minisatellites: a dozen to 500 base pairs

Simple sequence repeats of a particular length and composition occur preferentially in different species. In humans, an expansion of triplet repeats such as CAG is associated with at least 14 disorders (including Huntington’s disease).
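
Perfect microsatellites such as (CA)n or (CGG)n can be located with a regular expression that captures a short unit and requires it to repeat in tandem. The sketch below is illustrative only: the function name, the `min_repeats` threshold, and the example sequences are assumptions, and it does not handle imperfect (interrupted) repeats.

```python
import re

def find_microsatellites(dna, min_unit=1, max_unit=12, min_repeats=4):
    """Return (start, unit, repeat_count) for perfect tandem repeats."""
    # Lazy unit match so the shortest repeating unit is reported,
    # e.g. "CA" rather than "CACA".
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(dna.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

print(find_microsatellites("TTCACACACACAGG"))   # a (CA)5 tract
print(find_microsatellites("ACGGCGGCGGCGGT"))   # a (CGG)4 tract
```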

Page 31: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Fig. 16.3, Page 548

Successive tandem gene duplications (after Lacazette et al., 2000)

observed today

Page 35: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

Transcription factor databases

In addition to identifying repetitive elements and genes, it is also of interest to predict the presence of genomic DNA features such as promoter elements and GC content.

Websites that predict transcription factor binding sites and related sequences:

AliBaba2 (http://www.gene-regulation.de/)
Eukaryotic Promoter Database (http://www.epd.isb-sib.ch)
PlantProm (http://mendel.cs.rhul.ac.uk)

Page 36: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

http://www.sanger.ac.uk/Users/td2/eponine

Eponine predicts transcription start sites in promoter regions. The algorithm uses a set of DNA weight matrices recognizing sequence motifs that are associated with a position distribution relative to the transcription start site. The model is as follows:

The specificity is good (~70%), and the positional accuracy is excellent. The program identifies ~50% of TSSs—although it does not always know the direction of transcription.
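
For readers unfamiliar with DNA weight matrices, the generic idea can be sketched as follows: build a log-odds matrix from aligned example sites, then slide it along a sequence and score each window. This is not Eponine's actual model (which also uses positional distributions relative to the transcription start site); the example sites, the pseudocount, the uniform background, and the function names are all hypothetical.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a per-position log2-odds matrix from equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return matrix

def score_window(matrix, dna, start):
    """Sum the log-odds scores of the window beginning at `start`."""
    return sum(column[dna[start + j]] for j, column in enumerate(matrix))

def best_hit(matrix, dna):
    """Return (position, score) of the highest-scoring window."""
    positions = range(len(dna) - len(matrix) + 1)
    best = max(positions, key=lambda i: score_window(matrix, dna, i))
    return best, score_window(matrix, dna, best)

# Hypothetical TATA-box-like training sites and query sequence.
sites = ["TATAAA", "TATATA", "TATAAA", "TATAAT"]
matrix = build_log_odds(sites)
print(best_hit(matrix, "GCGCTATAAAGGC"))
```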

Page 37: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

The ENCODE project

Goal of ENCODE: build a list of all sequence-based functional elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene transcription
► DNA sequences that mediate chromosomal structure and dynamics

Page 38: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

VISTA output for an alignment of human and mouse genomic DNA (including RBP4)

Page 39: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

1977 First viral genome (Sanger et al., bacteriophage φX174; 11 genes)

1981 Human mitochondrial genome: 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs). Today, over 400 mitochondrial genomes have been sequenced.

1986 Chloroplast genome: 156,000 base pairs (most are 120 kb to 200 kb)

1995 Haemophilus influenzae genome sequenced

1996 Saccharomyces cerevisiae (first eukaryotic genome) and an archaeal genome, Methanococcus jannaschii

Chronology of genome sequencing projects

Page 40: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

1997 More bacteria and archaea. Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)

1998 Nematode Caenorhabditis elegans (first multicellular organism): 97 Mb; 19,000 genes

1999 First human chromosome: chromosome 22 (49 Mb, 673 genes)

2000 Drosophila melanogaster (13,000 genes); plant Arabidopsis thaliana and human chromosome 21

2001 Draft sequence of the human genome (public consortium and Celera Genomics)

Chronology of genome sequencing projects

Page 41: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009
Page 42: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[1] Selection of genomes for sequencing

[2] Sequence one individual genome, or several?

[3] How big are genomes?

[4] Genome sequencing centers

[5] Sequencing genomes: strategies

[6] When has a genome been fully sequenced?

[7] Repository for genome sequence data

[8] Genome annotation

Overview of genome analysis

Page 43: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[1] Selection of genomes for sequencing is based on criteria such as:

• genome size (some plants are >>> human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture

Overview of genome analysis

Page 44: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[2] Sequence one individual genome, or several?

-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations

For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.

Overview of genome analysis

Page 45: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[3] How big are genomes?

Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)

Bacterial genomes: 0.5 Mb to 13 Mb

Eukaryotic genomes: 8 Mb to 686,000 Mb

Overview of genome analysis

Page 46: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[4] 20 genome sequencing centers contributed to the public sequencing of the human genome.

Many of these are listed at the Entrez genomes site.

Overview of genome analysis

Page 47: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[5] There are two main strategies for sequencing genomes

a) Whole genome shotgun (WGS) method -- applied to the entire genome all at once (sequenced fragments ordered by alignment of overlaps)

VERSUS

b) Hierarchical shotgun method -- applied to large overlapping DNA fragments of known location in the genome. (Assemble contigs from chromosomes, then systematically sequence them and reassemble the complete sequence.)
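
Ordering sequenced fragments by their overlaps, the step common to both strategies, can be sketched with a toy greedy assembler: repeatedly merge the pair of reads with the longest suffix-prefix overlap. This is a teaching sketch with hypothetical function names and reads; real assemblers must also cope with sequencing errors, repeats, and reads from both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

In the hierarchical approach the same kind of overlap assembly is applied within each mapped clone, which limits the damage repeats can do compared with assembling the whole genome at once.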

Overview of genome analysis

Page 48: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[6] When has a genome been fully sequenced?

A typical goal is to obtain five- to ten-fold coverage.

Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.

Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
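
Fold coverage is the total number of sequenced bases divided by the genome size. Under the idealized assumption that reads land uniformly at random (the Lander-Waterman model), the expected fraction of bases left unsequenced at coverage c is about e^(-c). The sketch below works through that arithmetic; the read length and read count are hypothetical.

```python
import math

def fold_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def expected_fraction_uncovered(coverage):
    """Expected fraction of bases with no read, assuming random read placement."""
    return math.exp(-coverage)

# Hypothetical project: 30 million reads of 500 bp on a 3 x 10^9 bp genome.
c = fold_coverage(30_000_000, 500, 3_000_000_000)
print(c, expected_fraction_uncovered(c))
print(10.0, expected_fraction_uncovered(10.0))
```

Under this model, five-fold coverage leaves well under 1% of bases unsequenced, and ten-fold coverage leaves far less, which is one reason the five- to ten-fold range is a common target.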

Overview of genome analysis

Page 49: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[7] Repository for genome sequence data

Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right)

Overview of genome analysis

Page 50: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009
Page 51: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

[8] Genome annotation

Information content in genomic DNA includes:

-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes
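
Nucleotide composition is the simplest of these to compute. A minimal sketch of GC content, overall and in sliding windows, is shown below; the function names, window size, and example sequence are hypothetical.

```python
def gc_content(dna):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)

def windowed_gc(dna, window, step):
    """GC content of each full window, as (start, fraction) pairs."""
    return [(i, gc_content(dna[i:i + window]))
            for i in range(0, len(dna) - window + 1, step)]

print(gc_content("GGCCAATT"))
print(windowed_gc("GGCCAATT", window=4, step=4))
```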

Overview of genome analysis

Page 52: Biology 224 Instructor:  Tom Peavy Oct 12 & 14, 2009

How can whole genomes be compared?

-- molecular phylogeny

-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another

-- TaxPlot and COG for bacterial (and for some eukaryotic) genomes

-- PipMaker, MUMmer and other programs align large stretches of genomic DNA from multiple species