Finding genes in the genome


Upload: easter

Post on 08-Feb-2016



Page 1: Finding genes in the genome

Global Sequence 1

Finding genes in the genome

Lecture 7


Introduction

• The open reading frame (ORF).

• Finding genes in prokaryotes.

• Finding genes in eukaryotes.

• ESTs (cDNA) and their role.

• Finding promoters.


Introduction

• Homologous approach:

– Sequence similarity [discussed in the next lecture].

– Has proven to be useful, but only if similar sequences already exist in the database.

– However, if there are no similar sequences, we must apply the general properties of genes (start codon … stop codon) to analyse our sequence.


Open Reading Frames (ORF)

• If the homologous approach is not successful then you look for ORFs.

• An ORF is a region of the DNA which could be the coding sequence (CDS) of a gene [not the promoter, untranslated region (UTR)…]

• It has a start codon (ATG) and a stop codon (one of three: TAA, TAG, TGA).

• If you have a novel sequence you would look for all ORFs in all 6 reading frames (3 reading frames per strand), as a


Finding potential ORFs

• Translate each reading frame beginning at:

– Base 1: 5’→3’ frame 1

– Base 2: 5’→3’ frame 2

– Base 3: 5’→3’ frame 3

Why no need for frame 4?

• Get the reverse complement of the given strand and repeat the process: 3’→5’ frame 1, …

• Look for start and stop codons (amino acids).

– Note: in a FASTA file the gene will be on the given sequence (strand), so there is no need to get the reverse complement.
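The six-frame search described on this slide can be sketched in a few lines of Python. This is a minimal illustration only, assuming an uppercase A/C/G/T sequence and the simple "ATG … stop" definition of an ORF used in this lecture; the function names (reverse_complement, orfs_in_frame, six_frame_orfs) are hypothetical and do not come from any particular library.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop ORF in one frame; end is exclusive."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3             # stop codon closes it
            start = None

def six_frame_orfs(seq):
    """Return (strand, frame, start, end) for ORFs in all six reading frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):            # frames 1, 2, 3 on each strand
            for start, end in orfs_in_frame(s, offset):
                results.append((strand, offset + 1, start, end))
    return results

print(six_frame_orfs("CCATGAAATAGGG"))
```

Starting at base 4 would read the same codons as frame 1 minus the first one, which is why only three offsets are scanned per strand.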


Is the ORF a gene?

• First check the length of the ORF (consider that the smallest protein is about 20 aa in length).

• Check for the presence of promoter sequences upstream of the ORF (e.g. TATAAT)…

• Search for genes with similar aa sequences to the candidate gene.

• Prokaryotes and eukaryotes require different approaches, which take into account the differences in their gene structure.
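The first two checks can be sketched as simple filters. The 20 aa threshold and the TATAAT motif come from the slide; the 50-base upstream window and the function names are assumptions made purely for illustration, and an exact string match is a much cruder test than real promoter prediction.

```python
# Two quick plausibility checks for a candidate ORF (illustrative sketch).
MIN_PROTEIN_LENGTH = 20   # amino acids, threshold quoted in the lecture
PRIBNOW_BOX = "TATAAT"    # prokaryotic -10 consensus
UPSTREAM_WINDOW = 50      # assumption: how many bases upstream to inspect

def passes_length_check(orf_start, orf_end, min_aa=MIN_PROTEIN_LENGTH):
    """ORF coordinates include the start and stop codons; end is exclusive.

    The stop codon does not code for an amino acid, so it is not counted.
    """
    coding_codons = (orf_end - orf_start) // 3 - 1
    return coding_codons >= min_aa

def has_upstream_promoter(seq, orf_start, window=UPSTREAM_WINDOW, motif=PRIBNOW_BOX):
    """Return True if the motif occurs within `window` bases upstream of the ORF."""
    upstream = seq[max(0, orf_start - window):orf_start]
    return motif in upstream

candidate = "GGTATAATGGGGATGAAATAG"   # toy sequence, ORF starts at index 12
print(passes_length_check(12, 21))            # a 2-codon ORF is far too short
print(has_upstream_promoter(candidate, 12))
```

The third check, similarity of the translated ORF to known proteins, would normally be done with a database search tool rather than hand-written code.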


ORFs in prokaryotic genes

• In prokaryotic genes the ORF, or protein coding sequence, begins with a start codon and ends with a stop codon.

• Gene density is about 1 per kilobase, i.e. one ORF every 1000 bases. In some cases the gene density can cause the stop codon of one gene to overlap with the promoter of another [Zvelebil chapter 9].

• E.g. within the lac operon there are 3 genes (CDS), all in close proximity, so the ATG of lacY is close to the TAG of lacZ…


Review of eukaryotic gene expression

[Figure: eukaryotic gene expression showing exons/introns…, adapted from Zhang 2002]


ORFs in eukaryotes

• Gene density is much lower; genes are further apart, and the density can vary significantly between chromosomes (~1.5% of human DNA is CDS).

• ORFs contain introns between the coding sequences (CDS) of exons. Further detail can be found in Klug 2010.

• An added problem in interpreting the data: if an intron contains a stop codon sequence (e.g. “taa”), it is only a three-base sequence and not a functional stop codon, because introns are spliced out before translation.

• Further details on finding and predicting exons can be found in Baxevanis 2005.
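The intron problem can be shown with a toy example. In the sketch below the exon and intron sequences are invented for illustration (the intron merely starts with GT and ends with AG), and the function names are hypothetical. Scanning the genomic sequence directly hits an in-frame TAA inside the intron, while the spliced sequence reads through to the real stop codon.

```python
# Why a naive ORF scan fails on unspliced eukaryotic DNA (toy example).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_in_frame_stop(seq):
    """Return the index of the first stop codon in frame 1, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def splice(genomic, exons):
    """Join exon intervals (start, end-exclusive) to mimic intron removal."""
    return "".join(genomic[start:end] for start, end in exons)

# Invented sequences: two exons separated by one GT...AG intron.
exon1, intron, exon2 = "ATGGCC", "GTGTAAGTTCAG", "AAATGCTGA"
genomic = exon1 + intron + exon2
exons = [(0, 6), (18, 27)]

mrna = splice(genomic, exons)
print(first_in_frame_stop(genomic))  # stop found inside the intron
print(first_in_frame_stop(mrna))     # stop found at the true end of the CDS
```

In practice the exon coordinates are exactly what is unknown, which is why eukaryotic gene finders rely on splice-site signals and EST/cDNA evidence rather than a plain ORF scan.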


Finding coding regions in eukaryotes

• Identify the TSS (transcription start site) and the untranslated regions:

– Like the coding region, they also contain exons and introns.

– There are untranslated regions (UTRs) on both sides of the CDS (at both the 5’ and 3’ ends of the mRNA), and they play a part in regulating translation via degradation, attachment to the ribosome, and promoting or inhibiting translation.

• Identify start and stop signals (Zhang 2002; Chasin 2007):

– Initial exon (start codon and 5’ splice site)

– Internal exon (3’ and 5’ splice sites)

– Terminal exon (3’ splice site and stop codon)

• There is compositional bias both in the coding regions and at splice sites.

• Database pattern searches can also be used, where it is assumed that coding regions have a higher degree of conservation than non-coding regions.

• It is important to be aware that the lengths of exons and introns may not be multiples of 3 [Zvelebil chapter 9; Baxevanis chapter 5].
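The last point means that a codon can be split across an exon boundary: only the total spliced CDS has to be a multiple of 3. A small sketch, assuming the exon lengths count coding bases only; the helper name codon_offsets and the example lengths are made up for illustration.

```python
# Track where in a codon each exon starts (illustrative sketch).
def codon_offsets(exon_lengths):
    """Return, for each exon, the codon position (0, 1 or 2) of its first base."""
    offsets = []
    consumed = 0                       # coding bases seen in earlier exons
    for length in exon_lengths:
        offsets.append(consumed % 3)
        consumed += length
    return offsets

lengths = [7, 5, 6]                    # hypothetical coding lengths of three exons
print(codon_offsets(lengths))          # exon 2 starts in the middle of a codon
print(sum(lengths) % 3 == 0)           # the spliced CDS is still a whole number of codons
```

A gene predictor therefore has to carry the reading frame from one exon to the next instead of translating each exon independently.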


Promoter Analysis

• The existence of a “potential” ORF indicates the presence of a nearby promoter.

• Promoters are elements upstream of the protein coding sequence that are essential in the transcription process and exist in both eukaryotic and prokaryotic organisms. The figure (Klug, 7th ed.) illustrates a number of eukaryotic promoters and their variability. However, it also illustrates the common features: TATA box…


Promoter Analysis

• In prokaryotes:

– The TATAAT region (Pribnow box) lies just upstream of the TSS (transcription start site), at about -10 bp.

– A further marker, TTGACA, may also be found 25 bp from this position (at about -35 bp).

• In eukaryotes there are 3 subsections of the promoter:

– The core/basal promoter (~80 bp from the TSS) (Klug p. 321). In most cases it contains a TATA box (25 bp upstream of the TSS); many contain a CAAT box and are rich in GC elements.

– The proximal/upstream promoter (~250 bp from the TSS). There is wide variation in this region from one gene to another.

– The distal promoter (much further upstream).
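The prokaryotic case lends itself to a simple pattern search. The sketch below looks for the two consensus hexamers named above with a regular expression; the allowed spacer of 15 to 19 bases between them is an assumption chosen for illustration, and find_promoters is a hypothetical helper name. Real promoters rarely match both consensus sequences exactly, so this is a teaching aid rather than a usable predictor.

```python
import re

# -35 box, a variable spacer, then the -10 (Pribnow) box.
# The 15-19 base spacer range is an assumption made for this sketch.
PROMOTER_PATTERN = re.compile(r"TTGACA([ACGT]{15,19})TATAAT")

def find_promoters(seq):
    """Return (start_of_minus35_box, spacer_length) for each exact consensus match."""
    return [
        (match.start(), len(match.group(1)))
        for match in PROMOTER_PATTERN.finditer(seq.upper())
    ]

toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"   # invented sequence
print(find_promoters(toy))
```

Allowing mismatches, for example by scoring each window against a weight matrix, is the usual next step once exact matching proves too strict.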


Promoter Analysis

• The identification of a core promoter indicates the presence of a gene and vice versa, so the predictions of both complement each other to an extent.

• Promoter characterisation (discovering transcription factor binding patterns) takes two basic approaches (Baxevanis 2005, chapter 5):

– Pattern-driven algorithms: depend on existing annotated data in bioinformatics databases that relate to binding sites.

– Sequence-driven algorithms: assume that common promoter functionality can be inferred from underlying conserved sequences. Genes that are co-regulated or co-expressed provide good candidates for obtaining data for this approach.
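A pattern-driven search is often implemented with a position weight matrix (PWM) built from annotated binding sites. The sketch below is one simple way to do this, not the method of any specific tool: the four "known" sites are invented, the uniform background of 0.25 and the pseudocount of 1 are assumptions, and build_pwm and best_hit are hypothetical names.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix from equal-length, aligned binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_hit(seq, pwm):
    """Score every window of the sequence and return (start, score) of the best one."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    score, start = max(scored)
    return start, score

known_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]   # invented training sites
pwm = build_pwm(known_sites)
print(best_hit("GGCGCTATAATGCCG", pwm))
```

A sequence-driven method works the other way round: it starts from the upstream regions of co-regulated genes, with no known sites, and searches for a shared motif from which such a matrix could then be built.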


Potential exam questions

• Open reading frames (ORFs) are an essential part of finding genes in genomes. Discuss how you would attempt to find ORFs and why such ORFs are a more accurate prediction of protein structure in bacterial cells than in animal cells.

• A critical part of finding the protein coding regions of DNA sequences is the discovery of open reading frames (ORFs). Discuss the difficulties associated with finding such sequences in eukaryotic cells.


References

• Baxevanis, A.D. 2005. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley; Chapter 5. [book is in the library]

• Kel, A.E. et al. 2003. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31(13): 3576–3579.

• Klug, W.A. et al. 2010. Concepts of Genetics. Pearson Education; p. 596–597.

• Zhang, M.Q. 2002. Computational prediction of eukaryotic coding genes. Nat Rev Genet. 3: 698–709.

• Chasin, L.A. 2007. Searching for splicing motifs. Adv Exp Med Biol. 623: 85–106.

• Zvelebil, M. Understanding Bioinformatics; Chapter 9. [book is in the library]