gene and genome: from gene structure to genome sequencing, annotation and gene ontology bmi/ibgp 730...

Gene and Genome: From Gene Structure to Genome Sequencing, Annotation and Gene Ontology

BMI/IBGP 730 Victor Jin

Department of Biomedical Informatics, Ohio State University

Gene Structure

Genome Sequencing

Genome Annotation (ENCODE Project)

Gene Ontology

Genetic Code

All genomes, from viruses to humans, are built from linear sequences of nucleotides and share a nearly universal genetic code.

An mRNA specifies an amino acid sequence through the genetic code.

A single nucleotide could specify only 4 amino acids, and two-nucleotide combinations could specify only 16. Three nucleotides (64 possibilities), called a codon, are enough to specify each of the 20 amino acids.

Each group of 3 nucleotides codes for one amino acid.

• The first codon is the start codon, and usually codes for the amino acid methionine (M, which has the codon ‘ATG’).

• The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by ‘*’ to indicate the ‘STOP’ codon.

• A coding region (abbreviation CDS) starts at the START codon and ends at the STOP codon.
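The rules above (triplet codons, ATG start, stop codon written as ‘*’) can be sketched in a few lines of Python. This is a minimal illustration, not a production translator: it assumes the standard genetic code written in the compact T, C, A, G ordering, and the function name translate_cds and the example sequence are made up for this sketch.

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order
# (first codon position varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}  # 4 bases ** 3 positions = 64 codons


def translate_cds(cds):
    """Translate a coding region (hypothetical helper for this sketch).

    The CDS must start with ATG, end with a stop codon, and have a
    length that is a multiple of three.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG":
        raise ValueError("CDS must start with the start codon ATG")
    if CODON_TABLE[codons[-1]] != "*":
        raise ValueError("CDS must end with a stop codon")
    return "".join(CODON_TABLE[c] for c in codons)


print(translate_cds("ATGGCCTGGTAA"))  # ATG GCC TGG TAA
```

The example CDS has four codons (ATG, GCC, TGG, TAA), so the result is three amino acids followed by the stop symbol.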

Codon table

Each amino acid might have up to six codons that specify it.

A handful of species deviate from the standard codon assignments described above and use some codons for different amino acids.
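The "up to six codons" statement can be checked by counting how many codons map to each amino acid. The sketch below assumes the standard genetic code, written as the usual 64-letter string in T, C, A, G codon order; species with variant codes would need a different string.

```python
from collections import Counter

# Standard genetic code: one letter per codon, codons ordered TTT, TTC, TTA, ...
# '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons_per_aa = Counter(AMINO_ACIDS)
stop_codons = codons_per_aa.pop("*")          # stop codons are not amino acids

max_codons = max(codons_per_aa.values())
most_degenerate = sorted(aa for aa, n in codons_per_aa.items() if n == max_codons)
single_codon = sorted(aa for aa, n in codons_per_aa.items() if n == 1)

print("amino acids:", len(codons_per_aa))
print("stop codons:", stop_codons)
print("largest codon family:", max_codons, most_degenerate)
print("amino acids with a single codon:", single_codon)
```

Leucine, arginine and serine are the amino acids with six codons each; methionine and tryptophan have only one.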

RNA

RNA consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of each sugar.

The differences between DNA and RNA are:

• RNA has a hydroxyl group on the 2' carbon of the sugar.

• Where DNA uses thymine (T), RNA uses uracil (U).

• Because of the extra hydroxyl group on the sugar, RNA cannot adopt the stable B-form double helix of DNA, so RNA usually exists as a single-stranded molecule.

However, regions of double helix can form where there is some base pair complementarity (U and A, G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure.

An RNA molecule can form many different stable three-dimensional tertiary structures, because it is not restricted to a rigid double helix.
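A hairpin forms when a stretch of RNA is followed, after a short loop, by its reverse complement. The sketch below illustrates that idea under simplifying assumptions: only U-A and G-C pairs are considered, the stem must match perfectly, and the function names, stem length and loop length are arbitrary choices for this example.

```python
# Base pairs allowed in this sketch (U-A and G-C only).
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def transcribe(dna_coding_strand):
    """RNA has the same sequence as the DNA coding strand, with U instead of T."""
    return dna_coding_strand.upper().replace("T", "U")


def reverse_complement(rna):
    """Sequence that pairs with `rna` when read in the opposite direction."""
    return "".join(PAIRS[base] for base in reversed(rna))


def find_hairpin(rna, stem=4, min_loop=3):
    """Return (start of 5' arm, start of 3' arm) of the first perfect hairpin,
    or None if no stem of the requested length can form."""
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        arm5 = rna[i:i + stem]
        # The 3' arm must begin at least `min_loop` bases after the 5' arm ends.
        j = rna.find(reverse_complement(arm5), i + stem + min_loop)
        if j != -1:
            return i, j
    return None


rna = transcribe("GGGCTTTTGCCC")
print(rna, find_hairpin(rna))
```

In the example, GGGC at the 5' end pairs with GCCC at the 3' end, leaving UUUU as the loop.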

Open Reading Frames (ORF)

On a given piece of DNA, there can be 6 possible reading frames. An ORF can be on either the + or the - strand, in any of 3 possible frames per strand.

Frame 1: 1st base of the start codon can be at base 1, 4, 7, 10, ...

Frame 2: 1st base of the start codon can be at base 2, 5, 8, 11, ...

Frame 3: 1st base of the start codon can be at base 3, 6, 9, 12, ...

(frames -1, -2, -3 are on the minus strand)

An open reading frame starts with ATG in most species and ends with a stop codon (TAA, TAG or TGA).

The SIXFRAME program translates a sequence in all six frames; you can visit the site directly: http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html
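The six-frame idea can be sketched as follows. This is not the SIXFRAME program itself, only a minimal illustration that scans both strands in three offsets each and reports stretches running from an ATG to the first in-frame stop codon; the function names and the example sequence are invented for this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Minus strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Return a list of (frame, start, end, sequence) for ATG...stop ORFs.

    frame is +1, +2, +3 for the plus strand and -1, -2, -3 for the minus
    strand. start and end are 0-based, end-exclusive positions on the
    strand that was read.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((+1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((sign * (offset + 1), start, i + 3,
                                     strand[start:i + 3]))
                    start = None                    # look for the next ORF
    return orfs


print(find_orfs("AAATGCCCTAAGG"))
```

In the example the only ORF is ATG CCC TAA, starting at the third base of the plus strand, so it is reported in frame 3. Running the same function on the reverse complement of the sequence reports the ORF in frame -3.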

Eukaryotic Nuclear Gene Structure

Gene prediction for Pol II transcribed genes.

Upstream Enhancer elements.

Upstream Promoter elements.

GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)

TATA promoter (-30 nt) (70%; 15 nt consensus, Bucher et al. (1990))

14-20 nt spacer DNA

CAP site (8 bp)

Transcription Initiation.

Transcript region, interrupted by introns. Translation Initiation (Kozak signal 12 bp consensus) 6 bp prior to initiation codon.

polyA signal (AATAAA 99%,other)

Introns

Transcript region, interrupted by introns. Each intron:

• starts with a donor site consensus (G100 T100 A62 A68 G84 T63 ..; the numbers give the percent occurrence of each base at that position)

• has a branch site near the 3’ end of the intron (one not very conserved consensus: UACUAAC)

• ends with an acceptor site consensus (12 Py .. N C65 A100 G100)
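The donor (GT), branch site and acceptor (AG) rules above can be turned into a naive intron scan. The sketch below is only an illustration of the rules, not a real splice-site predictor: it works on the DNA version of the transcript, uses only the invariant GT and AG dinucleotides, and the toy sequence, minimum length and function names are assumptions made for this example.

```python
import re

BRANCH_SITE = "TACTAAC"  # DNA form of the UACUAAC branch consensus


def candidate_introns(seq, min_len=20):
    """Return (start, end) spans (0-based, end-exclusive) that begin with GT
    and end with AG and are at least min_len long."""
    seq = seq.upper()
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


def has_branch_site(seq, span):
    """True if the branch consensus occurs inside the candidate intron."""
    start, end = span
    return BRANCH_SITE in seq.upper()[start:end]


# Toy transcript: exon 1 + intron + exon 2 (all made up for this sketch).
exon1 = "ATGGCC"
intron = "GTAAGT" + "CCCCCCCC" + BRANCH_SITE + "CCCCCCCCCC" + "AG"
exon2 = "CCCTGA"
transcript = exon1 + intron + exon2

spans = candidate_introns(transcript)
print(spans)
start, end = spans[0]
print(transcript[:start] + transcript[end:])  # spliced mRNA (DNA alphabet)
```

Even in this toy sequence the GT...AG rule alone returns two candidate introns, because a second GT occurs inside the donor consensus. This is why the rules are combined with consensus scores and compositional features in practice.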

Exons

The exons of the transcript region are composed of:

• 5’ UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome)

• AUG (or other start codon)

• Remainder of coding region

• Stop codon

• 3’ UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)

Non-Coding Eukaryotic DNA

• Untranslated regions (UTRs)

• Introns (there can be genes within the introns of another gene!)

• Intergenic regions:

- repetitive elements

- pseudogenes (dead genes that may or may not have been retroposed back into the genome as a single-exon “gene”)

Repeats

Each repeat family has many subfamilies.

ALU: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site.

Retroposons (can get copied back into the genome)

LINEs (Long INterspersed Elements)

L1: 1-7 kb long, 50,000 copies

SINEs (Short INterspersed Elements)

Low-Complexity Elements

When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation!

Sequences like ATATATACTTATATA, which are mostly two letters, are called low-complexity.

Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified.

The low-complexity sequence can also be hidden at the translated protein level.

Masking

To avoid finding spurious matches in alignment programs, you should always mask out low-complexity regions of the query sequence.

Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs).

Before running blastn against a genomic record, you must mask out the repeats.

Most used programs:

GenScan: http://genes.mit.edu/GENSCAN.html

Repeat Masker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
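To make the masking idea concrete, here is a minimal sketch of a low-complexity filter. It is not the algorithm used by RepeatMasker or by the BLAST filters; it simply replaces with N any window in which the two most common letters make up almost all of the window. The window size, the 0.9 threshold and the function name are arbitrary assumptions for this illustration.

```python
from collections import Counter


def mask_low_complexity(seq, window=12, max_top2_fraction=0.9, mask_char="N"):
    """Mask every window dominated by its two most common letters."""
    seq = seq.upper()
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i + window])
        top2 = sum(n for _, n in counts.most_common(2))
        if top2 / window >= max_top2_fraction:
            masked[i:i + window] = mask_char * window
    return "".join(masked)


print(mask_low_complexity("ATATATACTTATATA"))           # low-complexity example
print(mask_low_complexity("ACGTTGCAGTCAGCTAGGCATCGA"))  # mixed composition
```

The first sequence is the low-complexity example from the slide and is masked completely; the second uses all four letters throughout and is left unchanged.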

Structure of the Eukaryotic Genome

~6-12% of human DNA encodes proteins.

~10% of human DNA codes for UTR

~90% of human DNA is non-coding.

Gene Number

• Walter Gilbert [1980s]: 100k

• Antequera & Bird [1993]: 70-80k

• John Quackenbush et al. (TIGR) [2000]: 120k

• Ewing & Green [2000]: 30k

• Tetraodon analysis [2001]: 35k

• Human Genome Project (public) [2001]: ~31k

• Human Genome Project (Celera) [2001]: 24-40k

• Mouse Genome Project (public) [2002]: 25k-30k

• Lee Rowen [2003]: 25,947

Gene finding

Rules:

• Start codon: ATG

• Stop codons: TAA, TGA, TAG

• Introns: GT…..AG

Compositional features:

• Exon lengths

• Intron lengths

• Codon bias

• General genomic properties

Homology
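Codon bias, one of the compositional features listed above, is usually measured by counting how often each codon occurs in known coding sequence. The sketch below shows that counting step only; the example CDS is invented, and a real gene finder would estimate these frequencies from many genes and compare them with non-coding sequence.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in a coding sequence.

    Trailing bases that do not fill a whole codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


usage = codon_usage("ATGCTGCTGCTCTAA")   # ATG CTG CTG CTC TAA
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

In the toy sequence, leucine is encoded twice by CTG and once by CTC, so CTG accounts for 2 of the 5 codons.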


Gene Structure

Genome Sequencing


Gene Ontology

Human Genome Project: The Beginning (1988)

Cold Spring Harbor Laboratory, Long Island, New York

History of the Human Genome Project

Strachan and Read, HMG3 p213

1956 Physical map. 24 types and total set of 46 chromosomes

1977 Sanger publishes dideoxy sequencing method

1980 Botstein proposes human genetic map using RFLPs

1987 US DOE publishes report discussing HGP

1988 HUGO is established

1990 Official start of HGP with $3 billion and a 15-year horizon.

1991 Genome Database (GDB) is established

1992 Genethon publishes map based on microsatellites.

1995 Lander et al. detailed map based on sequence tagged sites.

1998 Comprehensive map based on gene markers.

1999 Sanger Centre publishes chromosome 22

2001 Draft Genome published: Celera & Public

2003 Completion (almost) of Human Genome

The Human Genome I

[Figure: the human genome at successive scales. Levels shown: the whole genome (3.2*10^9 bp, on chromosomes 1-22, X and Y, plus the mitochondrial genome of 0.016 Mb); a 6*10^4 bp region; a 3*10^3 bp globin gene on chromosome 11 (5’ flanking region, exons 1-3, 3’ flanking region); and a 30 bp stretch of DNA (ATTGCCATGTCGATAATTGGACTATTTGGA) drawn above the 10 amino acids (aa) of the corresponding protein. Magnification factors marked between levels: *5,000, *20, *10^3. Other labels in the figure: myoglobin, globin.]

http://www.sanger.ac.uk/HGP/ & R.Harding & HMG (2004) p 245

The Human Genome II

Strachan and Read (2004) Chapter 9 + Lander et al.(2001), http://www.sanger.ac.uk/HGP/

Gene families

Clustered

α-globins (7), growth hormone (5), Class I HLA heavy chain (20), …

Dispersed

Pyruvate dehydrogenase (2), Aldolase (5), PAX (>12),..

Clustered and Dispersed

HOX (38 – 4), Histones (61 – 2), Olfactory receptors (>900 – 25),…

Human Genes and Gene Structures I

Presently estimated gene number: 24,000 (reference: )

Average gene size: 27 kb

The largest gene: Dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe.

The shortest gene: tRNA-Tyr, 100% coding

Largest exon: ApoB exon 26 is 7.6 kb. Smallest: <10 bp

Average exon number: 9

Largest exon number: Titin, 363. Smallest: 1

Largest intron: WWOX intron 8 is 800 kb. Smallest: tens of bp

Largest polypeptide: Titin, 38,138 amino acids. Smallest: tens of amino acids (small hormones).

Intronless genes: mitochondrial genes, many RNA genes, interferons, histones, ...

Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9
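The dystrophin figures above imply two quantities that are easy to work out: how many base pairs are actually coding, and how fast the gene is transcribed. The short calculation below uses only the numbers quoted on the slide (2.4 Mb, 0.6% coding, 16 hours); the variable names are just for this sketch, and the results are rough estimates.

```python
gene_length_bp = 2.4e6        # dystrophin gene, 2.4 Mb
coding_fraction = 0.006       # 0.6% coding
transcription_hours = 16      # time to transcribe the whole gene

coding_bp = gene_length_bp * coding_fraction
rate_nt_per_second = gene_length_bp / (transcription_hours * 3600)

print(f"coding sequence: about {coding_bp:,.0f} bp")
print(f"transcription rate: about {rate_nt_per_second:.1f} nt per second")
```

So only about 14 kb of the 2.4 Mb gene is coding, and the quoted transcription time corresponds to roughly 40 nucleotides per second.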

How do we differ? – Let me count the ways

• Single nucleotide polymorphisms: 1 every few hundred bp, mutation rate* ≈ 10^-9

TGCATTGCGTAGGC
TGCATTCCGTAGGC

• Short indels (= insertion/deletion): 1 every few kb, mutation rate very variable

TGCATT---TAGGC
TGCATTCCGTAGGC

• Microsatellite (STR) repeat number: 1 every few kb, mutation rate ≤ 10^-3

TGCTCATCATCATCAGC
TGCTCATCA------GC

• Minisatellites: 1 every few kb, mutation rate ≤ 10^-1

• Repeated genes: rRNA, histones

• Large inversions, deletions: rare, e.g. Y chromosome

Size labels shown in the figure: ≤100 bp; 1-5 kb

*per generation
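The aligned example pairs for SNPs, indels and microsatellites can be compared column by column. The sketch below assumes the two sequences are already aligned (equal length, with ‘-’ for gaps) and reports each substitution and each run of gap columns; the function name is hypothetical.

```python
def classify_differences(seq_a, seq_b):
    """Return (kind, start, end) for every difference in a gapped alignment.

    kind is 'substitution' or 'indel'; positions are 0-based alignment
    columns, end-exclusive. Consecutive gap columns form one indel.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    events = []
    i = 0
    while i < len(seq_a):
        if seq_a[i] == "-" or seq_b[i] == "-":
            j = i
            while j < len(seq_a) and (seq_a[j] == "-" or seq_b[j] == "-"):
                j += 1
            events.append(("indel", i, j))
            i = j
        else:
            if seq_a[i] != seq_b[i]:
                events.append(("substitution", i, i + 1))
            i += 1
    return events


print(classify_differences("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"))        # SNP
print(classify_differences("TGCATT---TAGGC", "TGCATTCCGTAGGC"))        # short indel
print(classify_differences("TGCTCATCATCATCAGC", "TGCTCATCA------GC"))  # STR
```

The microsatellite pair differs by a 6-column gap, which corresponds to two copies of the TCA repeat unit.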

http://www.sanger.ac.uk/HGP/draft2000/gfx/fig2.gif

Genome Mapping

STS – sequence-tagged sites (short segments of unique DNA on every chromosome, each defined by a pair of PCR primers that amplify only one segment of the genome)

BAC – Bacterial artificial chromosome, 100-400 kb

YAC – Yeast artificial chromosome, 150 kb-1.5 Mb

Contig – assembled contiguous overlapping segments of DNA from BACs and YACs

ESTs – Expressed Sequence Tags

UniGene Database – a database for ESTs

Shotgun Sequencing

• Segments are short, ~2 kb

• Problem with repeated segments or genes

Concepts in Biochemistry, 2nd Ed., R. Boyer
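Both shotgun bullets can be illustrated with a toy assembler. The sketch below uses a simple greedy strategy (repeatedly merge the two reads with the longest suffix-prefix overlap); it is an assumption chosen for illustration, not the method used by real assemblers, and the reads are a few bases long instead of ~2 kb.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest overlap; return the remaining contigs."""
    reads = list(dict.fromkeys(reads))        # drop exact duplicates
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads                      # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]


# Unique sequence: the reads assemble back into ATGCGTACGTTAGC.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))

# Tandem repeat: the reads come from TTACGACGACGCC (ACG x 3),
# but the assembler collapses one repeat copy.
print(greedy_assemble(["TTACGACG", "ACGACGCC"]))
```

In the second case the two reads truly overlap by only 3 bases, but the repeat makes a 6-base overlap look better, so the contig comes out 3 bases shorter than the original sequence. This is the repeat problem in miniature.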

$1000 genome project

Solexa, SOLiD, 454

Re-sequencing using massively parallel sequencers

Visualization

• Ensembl (http://www.ensembl.org/index.html)

• Genome Browser (http://genome.ucsc.edu/)

• Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)

• VEGA – VErtebrate Genome Annotation database (http://vega.sanger.ac.uk/index.html)

Gene Structure

Genome Sequencing


Gene Ontology

http://www.sanger.ac.uk/HGP/havana/

The value of a sequenced genome lies in its annotation:

• Gene discovery

• Polymorphism

• TSS

• CpG region

• ncRNA

• TF binding sites

Annotation projects:

• HAVANA (Sanger Inst.)

• ENCODE

• CCDS

ENCODE Project

• ENCODE: Encyclopedia Of DNA Elements

• The project started in September 2003, funded by the National Human Genome Research Institute (NHGRI)

• Task: identify all functional elements in the human genome sequence.

• Initial phases: a pilot phase and a technology development phase.

• Production phase: large scale.

Initial Phases

• The pilot phase: analyze 1% of the human genome, comprising 44 genomic regions with a total of 30 million DNA base pairs.

• The conclusions from this pilot project were published in June 2007 in Nature and Genome Research.

• The findings highlighted the success of the project to identify and characterize functional elements in the human genome.

• The technology development phase also has been a success with the promotion of several new technologies to generate high throughput data on functional elements.

Major findings

• The ENCODE consortium generated more than 200 datasets and analyzed more than 600 million data points.

• The ENCODE consortium's major findings include:

1) The majority of DNA in the human genome is transcribed into functional RNA molecules, and these transcripts extensively overlap one another. This broad pattern of transcription challenges the long-standing view that the human genome consists of a relatively small set of discrete genes, along with a vast amount of so-called junk DNA that is not biologically active.

2) The genome contains very little unused sequence and is a complex, interwoven network. In this network, genes are just one of many types of DNA sequences that have a functional impact.

3) Half of the functional elements in the human genome do not appear to have been constrained during evolution. This may indicate that many species' genomes contain a pool of functional elements that provide no specific benefits in terms of survival or reproduction. As this pool turns over during evolutionary time, it may serve as a "warehouse for natural selection" by acting as a source of functional elements unique to each species and of elements that perform similar functions among species despite having sequences that appear dissimilar.

4) Identification of numerous previously unrecognized start sites for DNA transcription.

5) Evidence that, contrary to traditional views, regulatory sequences are just as likely to be located downstream of a transcription start site on a DNA strand as upstream.

6) Identification of specific signatures of change in histones, and correlation of these signatures with different genomic functions.

7) Deeper understanding of how DNA replication is coordinated by modifications in histones.

Production Phase

• NHGRI funded new awards in September 2007 to scale the ENCODE Project up to a production phase covering the entire genome.

• The ENCODE team is an open consortium and includes investigators with diverse backgrounds and expertise in the production and analysis of data.

Gene Structure

Genome Sequencing


Gene Ontology

Tools: http://www.geneontology.org/GO.tools.shtml

Examples:

• DAVID - http://david.abcc.ncifcrf.gov

• BiNGO
