bioinformatics

Page 1: Bioinformatics

BIOINFORMATICS

Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly

2010-2011

Lecture 4: Gene Prediction

Aleppo University, Faculty of Technical Engineering, Department of Biotechnology

Page 2: Bioinformatics

GENE PREDICTION: COMPUTATIONAL CHALLENGE

Gene: A sequence of nucleotides coding for protein

Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Page 3: Bioinformatics

GENE PREDICTION: COMPUTATIONAL CHALLENGE

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Page 5: Bioinformatics

GENE PREDICTION: COMPUTATIONAL CHALLENGE

[Same nucleotide sequence as on Page 3, with one stretch marked: Gene!]

Page 6: Bioinformatics

CENTRAL DOGMA: DNA -> RNA -> PROTEIN

DNA: CCTGAGCCAACTATTGATGAA

transcription

RNA: CCUGAGCCAACUAUUGAUGAA

translation

Protein: PEPTIDE
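
The two steps on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the compact way of building the codon table are assumptions of this sketch, not part of the lecture. The table itself is the standard genetic code.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
# '*' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")  # the table above is keyed on DNA letters
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The DNA string from the slide was evidently chosen so that its translation spells PEPTIDE in one-letter amino acid codes.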

Page 7: Bioinformatics

CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations

Systematically deleted nucleotides from DNA

Single and double deletions dramatically altered protein product

Effects of triple deletions were minor

Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein

Page 8: Bioinformatics

THE SLY FOX

In the following string: THE SLY FOX AND THE SHY DOG

Delete 1, 2, and 3 nucleotides after the first ‘S’:

THE SYF OXA NDT HES HYD OG

THE SFO XAN DTH ESH YDO G

THE SOX AND THE SHY DOG

Which of the above makes the most sense?
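
The three deletions can be reproduced with a short Python sketch. The helper name `delete_and_reframe` is hypothetical; it removes letters after the first ‘S’ and regroups what is left into triplets, which is the text analogue of reading a sequence in a fixed frame.

```python
def delete_and_reframe(text, start, count, frame=3):
    """Delete `count` letters beginning at index `start`, then regroup into triplets."""
    letters = text.replace(" ", "")
    shifted = letters[:start] + letters[start + count:]
    return " ".join(shifted[i:i + frame] for i in range(0, len(shifted), frame))

sentence = "THE SLY FOX AND THE SHY DOG"
first_s = sentence.replace(" ", "").index("S")  # position of the first 'S'
for n in (1, 2, 3):
    print(n, delete_and_reframe(sentence, first_s + 1, n))
# 1 THE SYF OXA NDT HES HYD OG
# 2 THE SFO XAN DTH ESH YDO G
# 3 THE SOX AND THE SHY DOG
```

Only the deletion of a whole triplet leaves the rest of the sentence readable, which mirrors the minor effect of triple deletions on the protein product.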

Page 9: Bioinformatics

TRANSLATING NUCLEOTIDES INTO AMINO ACIDS

Codon: 3 consecutive nucleotides

4^3 = 64 possible codons

Genetic code is degenerate and redundant

Includes start and stop codons

An amino acid may be coded by more than one codon

Page 10: Bioinformatics

Genetic Code and Stop Codons

UAA, UAG and UGA are the 3 Stop codons that (together with the Start codon AUG, written ATG in DNA) delineate Open Reading Frames

Page 11: Bioinformatics

TERMINOLOGY

ORF: A series of DNA codons, including a 5’ initiation codon and a termination codon, that encodes a putative or known gene.

Exons: Portions of the ORF that are transcribed and, when combined, form the coding sequence (CDS) for the gene

Introns: Portions of the ORF that are transcribed and are spliced out of the mRNA before translation.

Untranslated regions (UTRs): Non-coding regions that are transcribed and flank the ORF (for DNA) and CDS (for mRNA)

5’ end (relative to mRNA) UTRs (leader, regulatory sites)

3’ end UTRs (terminator sites, trailer)

Page 12: Bioinformatics

TRANSCRIPTION IN PROKARYOTES

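[Diagram, 5’ (upstream) to 3’ (downstream): promoter, transcription start site, untranslated region, start codon, coding region, stop codon, untranslated region, transcription stop site. The transcribed region spans from the transcription start site to the transcription stop site.]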

Page 13: Bioinformatics

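GENE STRUCTURE IN EUKARYOTES

[Diagram, 5’ to 3’: promoter, transcription start site, untranslated region, start codon, exons separated by introns, stop codon, untranslated region, transcription stop site. Each intron has a donor site (GT) and an acceptor site (AG). The transcribed region spans from the transcription start site to the transcription stop site.]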

Page 14: Bioinformatics

BIOLOGICAL BACKGROUND: GENE STRUCTURE

[Diagram. Gene: Ex1 In1 Ex2 In2 Ex3 In3 Ex4 In4 Ex5, with the 5’ UTR, ATG, stop codon and 3’ UTR marked, and GT ... AG marked at the intron boundaries. Mature mRNA: CAP, Ex1 Ex2 Ex3 Ex4 Ex5, Poly A.]

Page 15: Bioinformatics
Page 16: Bioinformatics

DISCOVERY OF SPLIT GENES

In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a viral protein.

They mapped hexon mRNA in the viral genome by hybridization to adenovirus DNA and electron microscopy

mRNA-DNA hybrids formed three curious loop structures instead of contiguous duplex segments

Page 17: Bioinformatics

EXONS AND INTRONS

In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

This makes computational gene prediction in eukaryotes even MORE DIFFICULT

Prokaryotes generally don’t have introns: genes in prokaryotes are continuous

Page 18: Bioinformatics

CENTRAL DOGMA AND SPLICING

[Diagram (credit: Batzoglou): a gene with exon1, intron1, exon2, intron2, exon3 is transcribed, the introns are removed by splicing, and the joined exons are translated. exon = coding, intron = non-coding]

Page 19: Bioinformatics

SPLICING SIGNALS

Exons are interspersed with introns; an intron typically begins with GT (donor site) and ends with AG (acceptor site)
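
A minimal sketch of how the GT ... AG signal could be used is given below. The function names, the `min_len` cut-off and the toy sequence are all hypothetical choices made for illustration. The GT/AG rule alone flags far too many candidates in real DNA, so it is a filter, not a predictor.

```python
def candidate_introns(seq, min_len=6):
    """Return (start, end) pairs where seq[start:end] begins with GT and ends with AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

def splice(seq, introns):
    """Remove the given non-overlapping (start, end) intervals and join what remains."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(seq[pos:start])
        pos = end
    pieces.append(seq[pos:])
    return "".join(pieces)

# Toy pre-mRNA (hypothetical): exon "ATGCC", intron "GTAAACCCAG", exon "TTCTGA"
pre_mrna = "ATGCC" + "GTAAACCCAG" + "TTCTGA"
print(candidate_introns(pre_mrna))   # [(5, 15)]
print(splice(pre_mrna, [(5, 15)]))   # ATGCCTTCTGA
```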

Page 20: Bioinformatics

CONSENSUS SPLICE SITES

Page 21: Bioinformatics

TWO APPROACHES TO GENE PREDICTION

Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).

Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already-known mouse, chicken, and bacterial genes may help to find human genes.

Page 22: Bioinformatics

STATISTICAL APPROACH: METAPHOR IN UNKNOWN LANGUAGE

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols, could you distinguish between a story and the stock report in a foreign newspaper?

Page 23: Bioinformatics

SIMILARITY-BASED APPROACH: METAPHOR IN DIFFERENT LANGUAGES

If you could compare the day’s news in English, side-by-side to the same news in a foreign language, some similarities may become apparent

Page 24: Bioinformatics

WHICH ISSUES HELP US TO DETECT GENES?

1. ORF construction

2. ORF length

3. ORF codon usage

4. Regulatory motifs

5. Ribosomal Binding Sites

6. CG islands

7. Similarities with other genes

Page 25: Bioinformatics

OPEN READING FRAMES (ORFS)

Detect potential coding regions by looking at ORFs

A genome of length n comprises about n/3 codons in each reading frame

Stop codons break the genome into segments between consecutive Stop codons

The subsegments of these that start from the Start codon (ATG) are ORFs

ORFs in different frames may overlap

[Diagram: genomic sequence with an open reading frame running from ATG to TGA]
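
The ORF definition above can be turned into a small six-frame scanner. This is a sketch under stated assumptions: the function names, the `min_codons` parameter and the coordinate convention (minus-strand positions refer to the reverse complement) are choices of this example, and each ORF is taken to start at the first ATG after the previous in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons=1):
    """Yield (frame, start, end) for ATG...stop stretches in the three forward frames."""
    for frame in range(3):
        start = None  # index of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:  # codons before the stop
                    yield frame, start, i + 3
                start = None

def find_orfs(seq, min_codons=1):
    """Scan all six reading frames of a DNA string."""
    seq = seq.upper()
    found = [("+",) + orf for orf in orfs_in_strand(seq, min_codons)]
    found += [("-",) + orf for orf in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found

print(find_orfs("CCATGAAATGACC"))   # [('+', 2, 2, 11)]
```

In the toy input the only ORF is ATG AAA TGA in forward frame 2. Raising `min_codons` implements the length threshold discussed on the next slide.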

Page 26: Bioinformatics

LONG VS. SHORT ORFS

Long open reading frames may be genes

At random, we should expect one stop codon every (64/3) ~= 21 codons

However, genes are usually much longer than this

A basic approach is to scan for ORFs whose length exceeds a certain threshold

This is naive because some genes (e.g. some neural and immune system genes) are relatively short
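
The 64/3 figure, and the reason long ORFs are unlikely to arise by chance, can be checked numerically. The sketch assumes independent, uniformly distributed nucleotides, which real genomes do not satisfy exactly; the function name is hypothetical.

```python
P_STOP = 3 / 64              # 3 of the 64 codons are stop codons
expected_gap = 1 / P_STOP    # mean spacing between stop codons, in codons

def prob_no_stop(codons):
    """Probability that `codons` consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** codons

print(round(expected_gap, 1))   # 21.3
for k in (21, 50, 100, 300):
    print(k, f"{prob_no_stop(k):.2e}")
```

Under this model fewer than 1% of random stretches of 100 codons are free of stop codons, which is why a length threshold is a useful first filter.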

Page 27: Bioinformatics

Scanning a 9-kb fungal plasmid sequence for ORFs (potential genes).

Six possible reading frames, three in each direction.

Two large ORFs, 1 and 2, are the most likely candidates as potential genes.

The yellow ORFs are too short to be genes

Page 28: Bioinformatics

TESTING ORFS: CODON USAGE

Create a 64-element hash table and count the frequencies of codons in an ORF

Amino acids typically have more than one codon, but in nature certain codons are used more often than others

Uneven use of the codons may characterize a real gene

This compensates for the pitfalls of the ORF length test
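
The hash table mentioned above maps naturally onto a Python `Counter`. The function names and the toy ORF are illustrative assumptions; only codons that actually occur get an entry, so the table holds at most 64 keys.

```python
from collections import Counter

def codon_counts(orf):
    """Count each in-frame codon of an ORF."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

def codon_frequencies(orf):
    """Relative frequency of each codon observed in the ORF."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

counts = codon_counts("ATGCTGCTGCTAGCCTGA")
print(counts["CTG"], counts["CTA"])   # 2 1
```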

Page 29: Bioinformatics

CODON USAGE IN HUMAN GENOME

Page 30: Bioinformatics

CODON USAGE IN MOUSE GENOME

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24

Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31

Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20

Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38

Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25

Page 31: Bioinformatics

CODON USAGE AND LIKELIHOOD RATIO

An ORF is more “believable” than another if it has more “likely” codons

Do sliding window calculations to find ORFs that have the “likely” codon usage

Allows for higher precision in identifying true ORFs; much better than merely testing for length.

However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio

Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
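
One way to realise the sliding-window idea is a log-likelihood ratio of a coding model against a background model. Everything specific here is an assumption of the sketch: the uniform 1/64 background, the `floor` used for unseen codons, the window size, and the three-codon toy model (its numbers are the CTG, CAG and TCG rows of the mouse table, divided by 1000).

```python
import math

def log_likelihood_ratio(codons, coding_freq, background=1 / 64, floor=1e-4):
    """Sum of log(P(codon | coding) / P(codon | background)) over the given codons."""
    return sum(math.log(max(coding_freq.get(c, 0.0), floor) / background) for c in codons)

def sliding_scores(seq, coding_freq, window=10, frame=0):
    """Score every window of `window` consecutive codons in one reading frame."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [log_likelihood_ratio(codons[k:k + window], coding_freq)
            for k in range(len(codons) - window + 1)]

# Hypothetical coding model built from three rows of a '/1000' codon usage table.
toy_model = {"CTG": 39.95 / 1000, "CAG": 34.18 / 1000, "TCG": 4.31 / 1000}
scores = sliding_scores("CTGCAGCTGCAGTCGTCGTCGTCG", toy_model, window=2)
print([round(s, 2) for s in scores])
```

Windows rich in frequently used codons (CTG, CAG) score above zero and windows of the rare TCG codon score below zero. A hexamer model would replace single codons with pairs of consecutive codons in the same scheme.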

Page 32: Bioinformatics

GENE PREDICTION AND MOTIFS

Upstream regions of genes often contain motifs that can be used for gene prediction

[Diagram: position axis marked -35, -10, 0, 10. Labelled elements: TTCCAA, TATACT (Pribnow Box), transcription start site, GGAGG (ribosomal binding site), ATG, STOP]
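
As a rough illustration of using such a motif, the sketch below keeps only those ATG positions that have the GGAGG ribosomal binding site a short distance upstream. The function name, the exact-match search and the 4 to 10 base gap are assumptions made for this example; real sites vary in both sequence and spacing.

```python
def atgs_with_upstream_motif(seq, motif="GGAGG", min_gap=4, max_gap=10):
    """Return indices of ATGs preceded by `motif` with min_gap..max_gap bases in between."""
    seq = seq.upper()
    hits = []
    pos = seq.find("ATG")
    while pos != -1:
        lo = max(0, pos - max_gap - len(motif))   # leftmost allowed motif start
        hi = pos - min_gap                        # motif must end at or before this index
        upstream = seq[lo:hi] if hi > lo else ""
        if motif in upstream:
            hits.append(pos)
        pos = seq.find("ATG", pos + 1)
    return hits

# Toy sequence (hypothetical): GGAGG, a 4-base spacer, an ATG, and a second ATG further on
toy = "TTGGAGGAACTATGAAACCCGGGATGTAA"
print(atgs_with_upstream_motif(toy))   # [11]
```

The second ATG at index 23 is rejected because the motif lies too far upstream of it.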

Page 33: Bioinformatics

RIBOSOMAL BINDING SITE

Page 34: Bioinformatics

CG-ISLANDS

Given 4 nucleotides: probability of occurrence of each is ~ 1/4. Thus, probability of occurrence of a dinucleotide is ~ 1/16.

However, the frequencies of dinucleotides in DNA sequences vary widely.

In particular, CG is typically underrepresented (frequency of CG is typically < 1/16)

Page 35: Bioinformatics

WHY CG-ISLANDS?

CG is the least frequent dinucleotide because C in CG is easily methylated and has the tendency to mutate into T afterwards

However, the methylation is suppressed around genes in a genome. So, CG appears at relatively high frequency within these CG islands

So, finding the CG islands in a genome is an important problem
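
A simple window-based detector gives a feel for the problem. It is a sketch, not the method of the lecture: the function names are hypothetical, and the default thresholds (200-base window, GC content at least 0.5, observed/expected CG ratio at least 0.6) are commonly quoted rule-of-thumb values assumed here.

```python
def cpg_obs_exp(window):
    """Observed/expected CG ratio: count(CG) * length / (count(C) * count(G))."""
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)

def gc_content(window):
    return (window.count("C") + window.count("G")) / len(window)

def cpg_island_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that pass both thresholds."""
    seq = seq.upper()
    return [i for i in range(0, len(seq) - window + 1, step)
            if gc_content(seq[i:i + window]) >= min_gc
            and cpg_obs_exp(seq[i:i + window]) >= min_ratio]

# Toy sequence (hypothetical): a CG-rich block flanked by AT repeats
toy = "AT" * 20 + "CG" * 20 + "AT" * 20
print(cpg_island_windows(toy, window=20, step=10))   # [30, 40, 50, 60, 70]
```

Overlapping windows that pass would normally be merged into one island; that step is left out to keep the sketch short.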

Page 36: Bioinformatics

THANK YOU