genome annotation 2013

Genome Annotation Karan Veer Singh, Scientist. NBAGR, Karnal, India 1


Post on 26-Jan-2015


DESCRIPTION

Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.

TRANSCRIPT

Page 1: Genome annotation 2013

Genome Annotation

Karan Veer Singh,

Scientist.

NBAGR, Karnal,

India


Page 2: Genome annotation 2013

The Genome

• The genome contains all the biological information required to build and maintain any given living organism

• The genome contains the organism's molecular history

• Decoding the biological information encoded in these molecules will have an enormous impact on our understanding of biology

Page 3: Genome annotation 2013

Genomics

1. Structural genomics: genetic and physical mapping of genomes.

2. Functional genomics: analysis of gene function (and of non-genes).

3. Comparative genomics: comparison of genomes across species.
   • Includes structural and functional genomics.
   • Evolutionary genomics.

Page 4: Genome annotation 2013

Human Genome Project

The Human Genome Project promised to revolutionise medicine and explain every base of our DNA.

Large MEDICAL GENETICS focus

• Identify variation in the genome that is disease causing

• Determine how individual genes play a role in health and disease

Page 5: Genome annotation 2013

Human Genome Project & Functional Genome

The project cost about 3 billion dollars. A working draft was announced after 10 years (2000) and the project was declared complete in 2003, two years ahead of the original 15-year plan.

• Approx. 200 Mb still in progress
– Heterochromatin
– Repetitive

Page 6: Genome annotation 2013

Genomics & Genome annotation

The first genome annotation software system was designed in 1995 by Dr. Owen White at The Institute for Genomic Research, the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae

Genome analysis involves assembling the reads into contigs, then either aligning them to a reference genome (reference assembly) or assembling them de novo to obtain the complete genome

Variations such as mutations, SNPs and InDels can be identified

The genome is then annotated by structural and functional annotation

The whole genome can then be mapped to an image that is easy to interpret.

Page 7: Genome annotation 2013

Sequence to Annotation

Page 8: Genome annotation 2013

Input1 to Genome Viewer- Variant Annotation

Page 9: Genome annotation 2013

Input2 to Genome Viewer- Structural Annotation

Structural Annotation- AUGUSTUS (version 2.5.5)

Page 10: Genome annotation 2013

Input3 to Genome Viewer-Functional Annotation

Page 11: Genome annotation 2013

Genome Annotation

The process of identifying the locations of genes and coding regions in a genome and determining what those genes do

Finding the structural elements at each genome location and attaching their related functions

Page 12: Genome annotation 2013

Genome Annotation

Gene structure prediction: identifying elements (introns/exons, CDS, start, stop) in the genome

Gene function prediction: attaching biological information to these elements, e.g. which protein an exon codes for

Page 13: Genome annotation 2013

Structural annotation

Structural annotation: identification of genomic elements

• Open reading frames and their localisation
• Gene structure
• Coding regions
• Location of regulatory motifs

Page 14: Genome annotation 2013

Functional annotation

Functional annotation: attaching biological information to genomic elements

• Biochemical function
• Biological function
• Regulation involved

Page 15: Genome annotation 2013

Genome annotation - workflow

Genome sequence

Repeats

Masked or un-masked genome sequence

Structural annotation - gene finding

Protein-coding genes, ncRNAs (tRNA, rRNA), introns

Functional annotation

View in genome viewer

Page 16: Genome annotation 2013

Genome Repeats & features

Percentage of repetitive sequences in different organisms

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

Repeat types: microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR

Polymorphic between individuals/populations
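As a rough illustration of what a microsatellite/SSR looks like in sequence, the sketch below scans for perfect tandem repeats with a regular expression. It is not how RepeatMasker or any dedicated repeat finder works; the function name `find_ssrs` and its thresholds (unit length 1 to 6 bp, at least 4 copies) are assumptions chosen only for the example.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats (a naive SSR scan).

    Returns a list of (start, end, unit, copies) tuples using
    0-based, half-open coordinates. Thresholds are illustrative only.
    """
    seq = seq.upper()
    # Lazy unit match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits
```

Because such repeats vary in copy number between individuals, the `copies` value is the quantity that would differ when the same locus is compared across samples.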

Page 17: Genome annotation 2013

Finding repeats as a preliminary to gene prediction


Repeat discovery

Homology based approaches

Use RepeatMasker to search the genome and mask the sequence

Page 18: Genome annotation 2013

Masked sequence

A repeat-masked sequence is an artificial construct in which the regions thought to be repetitive are replaced with X's

Masking is widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) on the final annotation set

>my sequence

atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct

>my sequence (repeatmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

Positions/locations are not affected by masking

Page 19: Genome annotation 2013

Types of Masking- Hard or Soft?

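Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved with soft-masking: repeats are written in lowercase and the rest of the sequence in uppercase, whereas hard-masking replaces repeats with X's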

>my sequence

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (softmasked)

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (hardmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
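The difference between the two masking styles can be sketched in a few lines of Python. This is a minimal illustration, assuming the repeat coordinates are already known as 0-based, half-open `(start, end)` intervals; the function names are hypothetical and do not correspond to any RepeatMasker interface.

```python
def _apply(seq, intervals, transform):
    """Rewrite each interval of seq with transform(), keeping the length."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = transform(seq[start:end])
    return "".join(chars)


def hard_mask(seq, intervals, mask_char="X"):
    """Replace repeat regions with a mask character (information is lost)."""
    return _apply(seq, intervals, lambda s: mask_char * len(s))


def soft_mask(seq, intervals):
    """Uppercase everything, then lowercase the repeat regions."""
    return _apply(seq.upper(), intervals, str.lower)
```

Both functions return a string of the same length as the input, which is why positions and locations are not affected by masking.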

Page 20: Genome annotation 2013

Genome annotation - workflow

Genome sequence

Map repeats

Masked or un-masked

Gene finding - structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

View in genome viewer

Page 21: Genome annotation 2013

Structural annotation

Identification of genomic elements

• Open reading frames and their localization
• Coding regions
• Location of regulatory motifs
• Start/stop
• Splice sites
• Non-coding regions/RNAs
• Introns

Page 22: Genome annotation 2013

Methods

Similarity
• Similarity between sequences, which does not necessarily imply any evolutionary linkage

Ab initio prediction
• Prediction of gene structure from first principles using only the genome sequence

Page 23: Genome annotation 2013

Genefinding

Two approaches: ab initio and similarity

Page 24: Genome annotation 2013

ab initio prediction

Figure: signals read from the genome in ab initio prediction (coding potential, ATG and stop codons, splice sites)

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
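The simplest of these signals, ATG and stop codons, can be illustrated with a naive open reading frame scan. The sketch below is not what Augustus, Glimmer or the other tools listed do (they combine such signals with statistical models of coding potential and splice sites); the function names and the `min_codons` threshold are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement so the other strand can be scanned too."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Scan the three forward frames for ATG...stop open reading frames.

    Returns (start, end, frame) tuples, 0-based and half-open, where the
    span includes the stop codon. min_codons counts the start and stop
    codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

To cover both strands, the same scan can be run on `reverse_complement(seq)`; the coordinates returned then refer to the reversed sequence and would need converting back.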

Page 25: Genome annotation 2013

Genefinding - similarity


Use known coding sequence to define coding regions

EST sequences

Peptide sequences

Problem: alignments tend to be fuzzy around splice sites, which makes exact exon boundaries hard to place

Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal

Gene-finding - comparative

Use two or more genomic sequences to predict genes based on conservation of exon sequences

Examples: Twinscan and SLAM

Page 26: Genome annotation 2013

Genome annotation - workflow

Genome sequence

Map repeats

Masked or un-masked

Gene finding - structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

View in genome viewer

Page 27: Genome annotation 2013

Genefinding - non-coding RNA genes


Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples

tRNAscan-SE - uses HMM and covariance models to predict tRNA genes

Rfam - a database of RNA families, each described by a covariance model (an HMM-like profile that also captures secondary structure), covering a large number of different RNA genes

Page 28: Genome annotation 2013

Gene-finding omissions

Alternative isoforms
• Currently there is no good method for predicting alternative isoforms
• Only created where supporting transcript evidence is present

Pseudogenes
• Each genome project has a fuzzy definition of pseudogenes
• Badly curated/described across the board

Promoters
• Rarely a priority for a genome project
• Some algorithms exist but are usually not integrated into an annotation set

Page 29: Genome annotation 2013

Practical- structural annotation


Eukaryotes- AUGUSTUS (gene model)

~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff

Prokaryotes – PRODIGAL (Codon Usage table)

~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt

Page 30: Genome annotation 2013

Structural annotation - output

Structural annotation conducted using AUGUSTUS (version 2.5.5), with magnaporthe_grisea as the species model
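A quick sanity check on a GFF output file is to count how many features of each type it contains. The sketch below assumes the file follows the standard GFF3 layout of nine tab-separated columns with the feature type in the third column; the function name `count_gff3_features` is hypothetical, and which feature types appear depends on the gene finder and its options.

```python
from collections import Counter


def count_gff3_features(lines):
    """Count features by type (GFF3 column 3) from an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature line
        counts[columns[2]] += 1
    return counts


# Usage sketch (file name taken from the AUGUSTUS command on the slide):
#   with open("structural_annotation.gff") as handle:
#       for feature_type, n in count_gff3_features(handle).most_common():
#           print(feature_type, n)
```

Passing the open file handle rather than reading the whole file keeps memory use flat for large annotation sets.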

Page 31: Genome annotation 2013

Functional annotation

Page 32: Genome annotation 2013

Genome annotation - workflow

Genome sequence

Map repeats

Masked or un-masked

Gene finding - structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

View in genome viewer

Page 33: Genome annotation 2013

Functional annotation

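Figure: from genome to function. Genome (gene with ATG and STOP) → transcription → primary transcript → RNA processing → processed mRNA (m7G cap, AAAn tail) → translation → polypeptide → protein folding → folded protein → enzyme activity → functional activity. The aim of functional annotation is to find the function.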

Page 34: Genome annotation 2013

Functional annotation

Attaching biological information to genomic elements

• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression

• Use the existing structural annotation to derive the predicted protein sequences

Page 35: Genome annotation 2013

Functional annotation – Homology Based

Predicted exons/CDS/ORFs are searched against non-redundant protein databases (NCBI, SwissProt) for similar sequences

Visually assess the top 5-10 hits to identify whether these have been assigned a function

Functions are then assigned to the predicted genes

Page 36: Genome annotation 2013

Functional annotation - Other features

Other features which can be determined

• Signal peptides
• Transmembrane domains
• Low complexity regions
• Various binding sites, glycosylation sites etc.
• Protein domains
• Secretome

See http://expasy.org/tools/ for a good list of possible prediction algorithms

Page 37: Genome annotation 2013

Functional annotation - Other features (Ontologies)

Use of ontologies to annotate gene products

Gene Ontology (GO)
• Cellular component
• Molecular function
• Biological process

Page 38: Genome annotation 2013

Practical - FUNCTIONAL ANNOTATION

Homology Based Method

• Set up a BLAST database (nucleotide/protein)

• BLAST genome.fasta against it for annotations (nucleotide/protein)

• Filter the BLAST hits by E-value (keep hits with E-value <= 0.01) for nucleotide/protein

• Assign functions
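The filtering and assignment steps can be sketched in Python. This is a minimal example under stated assumptions: the BLAST results are in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12), which the slides do not specify, and the function name `best_hits` and the 0.01 cutoff default are chosen for illustration.

```python
def best_hits(lines, max_evalue=0.01):
    """Keep, per query, the highest-scoring hit passing the E-value cutoff.

    Expects 12-column tab-separated BLAST output (an assumption here).
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        columns = line.split("\t")
        if len(columns) < 12:
            continue  # not a complete tabular hit line
        query, subject = columns[0], columns[1]
        evalue, bitscore = float(columns[10]), float(columns[11])
        if evalue > max_evalue:
            continue  # weaker than the chosen significance cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The subject ID of each retained hit is what would then be looked up to transfer a functional description, ideally after the visual check of the top hits described earlier.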


Page 39: Genome annotation 2013

Functional annotation- output


Page 40: Genome annotation 2013

Conclusion

Annotation accuracy depends on the supporting data available at the time of annotation; the annotation needs to be updated as new information appears

Gene predictions will change over time as new data become available (e.g. at NCBI) that are more similar to the genome than earlier entries

Functional assignments will change over time as new data become available (e.g. characterization of hypothetical proteins)

Page 41: Genome annotation 2013

Genome annotation - workflow

Genome sequence

Map repeats

Masked or un-masked

Gene finding - structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

View in genome viewer

Page 42: Genome annotation 2013

Genome Viewer

The Files that can be visualised

Annotation files

Indel files

Consensus sequence

Comparative genomics

Page 43: Genome annotation 2013

Genome View


Page 44: Genome annotation 2013


Page 45: Genome annotation 2013


Page 46: Genome annotation 2013


Page 47: Genome annotation 2013

Short Read track


Page 48: Genome annotation 2013

Thank You
