Genome Bioinformatics Tyler Alioto Center for Genomic Regulation Barcelona, Spain

Post on 18-Dec-2015


Genome Bioinformatics

Tyler Alioto

Center for Genomic Regulation
Barcelona, Spain


04/18/23 INB Roadshow - Pamplona

Node 1 of the INB

GN1 Bioinformática y Genómica (Bioinformatics and Genomics)
Genome Bioinformatics Lab, CRG

Roderic Guigó (PI)


Themes

Gene prediction
    ab initio => GeneID
    dual-genome => SGP2
    U12 introns => GeneID v1.3 and U12DB
    combiner => GenePC

Genome feature visualization
    gff2ps

Alternative splicing
    ASTALAVISTA

Gene expression regulatory elements
    meta and mmeta alignment


Eukaryotic gene structure


[Figure: eukaryotic gene structure, with labels: exons, introns, upstream regulator, downstream regulator, promoter, acceptor site, donor site]


The Splicing Code


Gene Prediction Strategies

Expressed sequence (cDNA) or protein sequence available?
    Yes: Spliced alignment
        BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
    No: Integrated gene prediction
        Informant genome(s) available?
            Yes: Dual or n-genome de novo predictors
                SGP2, Twinscan, NSCAN, (Genomescan - same or cross genome protein blastx)
            No: ab initio predictors
                geneid, genscan, augustus, fgenesh, genemark, etc.

Many newer gene predictors can run in multiple modes depending on the evidence available.
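
The decision tree on this slide can be written as a small lookup. This is only an illustrative sketch: the function name `choose_strategy` and its two boolean arguments are hypothetical, and the tool lists are copied from the slide rather than from any tool's documentation.

```python
def choose_strategy(has_expressed_or_protein_seq, has_informant_genome):
    """Return (strategy, example_tools) following the slide's decision tree.

    Both arguments are hypothetical flags describing the evidence at hand.
    """
    if has_expressed_or_protein_seq:
        # cDNA/EST or protein evidence exists: align it to the genome.
        return ("spliced alignment",
                ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"])
    if has_informant_genome:
        # No expressed evidence, but a second genome can inform the prediction.
        return ("dual or n-genome de novo prediction",
                ["SGP2", "Twinscan", "NSCAN", "Genomescan"])
    # Nothing but the target genome sequence itself.
    return ("ab initio prediction",
            ["geneid", "genscan", "augustus", "fgenesh", "genemark"])


strategy, tools = choose_strategy(False, True)
print(strategy, tools)
```

As the slide notes, many newer predictors blur these categories by accepting whatever evidence is available, so the three branches are a simplification.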


Frameworks for gene prediction

Hierarchical exon-building and chaining
Hidden Markov Models (many flavors)
    HMM, GHMM, GPHMM, Phylo-HMM
Conditional Random Fields (new!)
    Conrad, Contrast...
and, no doubt, more to come

All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)
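
As a concrete instance of the dynamic programming mentioned above, here is a minimal Viterbi sketch for a toy two-state HMM. The states ("E" for exon-like, "I" for intron-like) and all probabilities are invented for illustration; they are not parameters of any real gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    # scores[t][s] = best log-probability of any path ending in state s at time t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at time t + 1
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][symbol]))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy parameters (assumptions): GC-rich "exon" state, AT-rich "intron" state.
STATES = ["E", "I"]
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print("".join(viterbi("GGCCAATT", STATES, START, TRANS, EMIT)))  # EEEEIIII
```

Real HMM gene finders use many more states (frames, splice sites, strands) and explicit length distributions, but the recurrence has the same shape.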


How does GeneID approach gene prediction?


The gene prediction problem

[Figure: the gene prediction problem. Candidate sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8), which are assembled into genes (e.g. e1, e4, e8).]


GeneID

Geneid follows a hierarchical structure: signal -> exon -> gene

Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)

Dynamic programming algorithm: maximize score of assembled exons -> assembled gene
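
The exon-assembly step can be sketched as a dynamic program over scored candidate exons. This is a deliberately simplified illustration, not geneid's actual algorithm: it only requires that chained exons do not overlap, and ignores reading frame, exon type compatibility, and intron length limits. The `Exon` type and the example coordinates and scores are made up.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int
    end: int
    score: float


def best_chain(exons):
    """Return (total_score, chain) for the best-scoring non-overlapping exon chain."""
    if not exons:
        return 0.0, []
    ordered = sorted(exons, key=lambda e: e.end)
    best = []  # best[i] = best score of a chain that ends with ordered[i]
    prev = []  # prev[i] = index of the previous exon in that chain, or None
    for i, exon in enumerate(ordered):
        best_i, prev_i = exon.score, None
        for j in range(i):
            # ordered[j] may precede exon only if it ends before exon starts.
            if ordered[j].end < exon.start and best[j] + exon.score > best_i:
                best_i, prev_i = best[j] + exon.score, j
        best.append(best_i)
        prev.append(prev_i)
    i = max(range(len(ordered)), key=best.__getitem__)
    total = best[i]
    chain = []
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return total, chain[::-1]


candidates = [
    Exon(10, 50, 2.0),
    Exon(40, 90, 3.0),
    Exon(60, 100, 1.5),
    Exon(120, 200, 4.0),
    Exon(150, 180, -1.0),
]
total, chain = best_chain(candidates)
print(total, [(e.start, e.end) for e in chain])
```

This version is quadratic in the number of exons, which is enough to show the recurrence; a production gene finder would need something more efficient on long sequences.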


Training GeneID

Base frequencies at each position of the aligned donor sites:

Position    1    2    3    4    5    6    7    8    9
A         0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C         0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G         0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T         0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

Aligned donor sites:

GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT

Example sequence:

ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
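
The frequency matrix on this slide can be reproduced directly from the ten aligned donor sites. The sketch below does that, then converts counts into log-likelihood ratio scores. The pseudocount of 0.5 and the uniform background of 0.25 are assumptions chosen for illustration, not geneid's actual training settings.

```python
import math
from collections import Counter

DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def frequency_matrix(sites):
    """Per-position base frequencies, one dict per alignment column."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix


def log_odds_matrix(sites, pseudocount=0.5, background=0.25):
    """Per-position log-likelihood ratios against a uniform background.

    The pseudocount keeps unseen bases from producing log(0).
    """
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        matrix.append({
            base: math.log(((counts[base] + pseudocount) / (n + 4 * pseudocount))
                           / background)
            for base in "ACGT"
        })
    return matrix


def score_site(site, lod):
    """Sum the per-position log-odds for one candidate site."""
    return sum(lod[i][base] for i, base in enumerate(site))


freqs = frequency_matrix(DONOR_SITES)
lod = log_odds_matrix(DONOR_SITES)
print(freqs[0])  # first column of the slide's table
print(score_site("AAGGTAAGT", lod) > score_site("AAGCTAAGT", lod))
```

A candidate with the invariant GT at positions 4-5 scores well above one where that dinucleotide is disrupted, which is what lets such a matrix separate real donor sites from background.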


Running GeneID: command line or on geneid server

NAME
    geneid - a program to annotate genomic sequences

SYNOPSIS
    geneid [-bdaefitnxszr]
           [-DA] [-Z]
           [-p gene_prefix]
           [-G] [-3] [-X] [-M] [-m]
           [-WCF] [-o]
           [-j lower_bound_coord]
           [-k upper_bound_coord]
           [-O <gff_exons_file>]
           [-R <gff_annotation-file>]
           [-S <gff_homology_file>]
           [-P <parameter_file>]
           [-E exonweight]
           [-V evidence_exonweight]
           [-Bv] [-h]
           <locus_seq_in_fasta_format>

RELEASE
    geneid v 1.3

OPTIONS
    -b: Output Start codons
    -d: Output Donor splice sites
    -a: Output Acceptor splice sites
    -e: Output Stop codons
    -f: Output Initial exons
    -i: Output Internal exons
    -t: Output Terminal exons
    -n: Output introns
    -s: Output Single genes
    -x: Output all predicted exons
    -z: Output Open Reading Frames

    -D: Output genomic sequence of exons in predicted genes
    -A: Output amino acid sequence derived from predicted CDS

    -p: Prefix this value to the names of predicted genes, peptides and CDS

    -G: Use GFF format to print predictions
    -3: Use GFF3 format to print predictions
    -X: Use extended-format to print gene predictions
    -M: Use XML format to print gene predictions
    -m: Show DTD for XML-format output

    -j: Begin prediction at this coordinate
    -k: End prediction at this coordinate
    -W: Only Forward sense prediction (Watson)
    -C: Only Reverse sense prediction (Crick)
    -U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
    -r: Use recursive splicing
    -F: Force the prediction of one gene structure
    -o: Only running exon prediction (disable gene prediction)
    -O <exons_filename>: Only running gene prediction (not exon prediction)
    -Z: Activate Open Reading Frames searching

    -R <exons_filename>: Provide annotations to improve predictions
    -S <HSP_filename>: Using information from protein sequence alignments to improve predictions

    -E: Add this value to the exon weight parameter (see parameter file)
    -V: Add this value to the score of evidence exons
    -P <parameter_file>: Use other than default parameter file (human)

    -B: Display memory required to execute geneid given a sequence
    -v: Verbose. Display info messages
    -h: Show this help

AUTHORS
    geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo.
    Parameter files have been created by Genis Parra and Tyler Alioto.
    Any bug or suggestion can be reported to [email protected]
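
A typical invocation can be assembled from the options listed above. The helper below is a hypothetical convenience wrapper, not part of geneid; it only uses flags shown in the help text (`-G`, `-W`, `-C`, `-P`), and the file names `HS307871.fa` and `human.param` are placeholders.

```python
import shlex


def build_geneid_command(fasta_path, parameter_file=None, gff=True, strand=None):
    """Build an argument list for geneid using flags from its help text.

    strand may be None (both strands), "forward" (-W) or "reverse" (-C).
    """
    cmd = ["geneid"]
    if gff:
        cmd.append("-G")  # print predictions in GFF format
    if strand == "forward":
        cmd.append("-W")
    elif strand == "reverse":
        cmd.append("-C")
    elif strand is not None:
        raise ValueError("strand must be None, 'forward' or 'reverse'")
    if parameter_file is not None:
        cmd += ["-P", parameter_file]  # otherwise the default (human) is used
    cmd.append(fasta_path)  # the sequence file comes last, as in the synopsis
    return cmd


cmd = build_geneid_command("HS307871.fa", parameter_file="human.param")
print(shlex.join(cmd))
# To actually run it (requires geneid on PATH and real input files):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True, check=True)
#   print(result.stdout)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.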


GeneID output

## gff-version 2

## date Mon Nov 26 14:37:15 2007

## source-version: geneid v 1.2 -- [email protected]

# Sequence HS307871 - Length = 4514 bps

# Optimal Gene Structure. 1 genes. Score = 16.20

# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20

HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1

HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1

HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1

HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
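
The exon records above are easy to check programmatically. The sketch below parses them and confirms that the exon lengths add up to the 391 codons reported in the gene header. It splits on any whitespace because the tabs were lost in this transcript; for a real GFF file, splitting on tab characters would be the safer choice.

```python
GENEID_OUTPUT = """\
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
"""


def parse_exons(text):
    """Parse geneid GFF2 exon lines into dicts, skipping comments and blanks."""
    exons = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seq, source, feature, start, end, score, strand, frame, group = line.split()
        exons.append({
            "seq": seq, "source": source, "feature": feature,
            "start": int(start), "end": int(end), "score": float(score),
            "strand": strand, "frame": int(frame), "group": group,
        })
    return exons


exons = parse_exons(GENEID_OUTPUT)
# GFF coordinates are 1-based and inclusive, hence the +1.
cds_length = sum(e["end"] - e["start"] + 1 for e in exons)
print(len(exons), cds_length, cds_length // 3)  # 9 1173 391
```

The per-exon scores sum to about 16.21, in line with the reported gene score of 16.20 once rounding of the printed values is taken into account.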


GFF: a standard annotation format

Stands for: Gene Finding Format -or- General Feature Format
Designed as a single-line record for describing features on DNA sequence -- originally used for gene prediction output
9 tab-delimited fields common to all versions:
    seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
    GFF2: group is a unique description, usually the gene name.
        NCOA1
    GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced; start_codon and stop_codon are required features for CDS
        transcript_id "NM_056789" ; gene_id "NCOA1"
    GFF3: Capitalized tags follow Sequence Ontology (SO) relationships; FASTA seqs can be embedded
        ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
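
The three flavors of the group field can be handled by one small parser. This is a sketch under simplifying assumptions: the caller states which version the line uses (the `version` argument is a convention invented here), and escaping rules such as URL-encoding in GFF3 values are ignored.

```python
import re


def parse_group(group, version):
    """Parse the ninth GFF field according to the given format version.

    version is one of "gff2", "gtf" or "gff3" (labels chosen for this sketch).
    """
    group = group.strip()
    if version == "gff2":
        # A single free-text description, usually the gene name.
        return {"group": group}
    if version == "gtf":
        # Tag-value pairs of the form: tag "value" ; tag "value"
        return {m.group(1): m.group(2)
                for m in re.finditer(r'(\w+)\s+"([^"]*)"', group)}
    if version == "gff3":
        # Semicolon-separated tag=value pairs.
        attrs = {}
        for part in group.split(";"):
            part = part.strip()
            if part:
                key, _, value = part.partition("=")
                attrs[key.strip()] = value.strip().strip('"')
        return attrs
    raise ValueError("unknown version: %r" % version)


print(parse_group("NCOA1", "gff2"))
print(parse_group('transcript_id "NM_056789" ; gene_id "NCOA1"', "gtf"))
print(parse_group('ID=NM_056789_exon1; Parent=NM_056789; note="5\' UTR exon"', "gff3"))
```

Keeping the version explicit avoids guessing from the content, since a GFF2 description could itself contain an equals sign or quotes.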


Visualizing features with gff2ps

generated by Josep Abril


Visualizing features on UCSC genome browser (custom tracks)

If "your" genome is served by UCSC, this is a good option because:
    browsing is dynamic
    access to other annotations
    can view DNA sequence
    can do complex intersections and filtering

gff2ps is good when:
    your genome is not on UCSC
    you want more flexible layout options
    you want to run it 'offline'
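
A UCSC custom track can be built from GFF lines by prepending a `track` line and making sure the first field is a chromosome name the browser knows. The sketch below assumes the predicted locus has already been placed on a chromosome; the chromosome name `chr22`, the coordinate offset, and the function name are all hypothetical.

```python
def to_custom_track(gff_lines, track_name, description, chrom, offset=0):
    """Return custom-track text: a track line followed by tab-delimited GFF.

    chrom replaces the original sequence name; offset shifts local
    coordinates to chromosome coordinates (both are assumptions here).
    """
    out = ['track name="%s" description="%s" visibility=2' % (track_name, description)]
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split into at most 9 fields so a group with spaces stays intact.
        fields = line.split(None, 8)
        fields[0] = chrom
        fields[3] = str(int(fields[3]) + offset)
        fields[4] = str(int(fields[4]) + offset)
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


track = to_custom_track(
    ["# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20",
     "HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1"],
    track_name="geneid",
    description="geneid predictions",
    chrom="chr22",
    offset=1000,
)
print(track)
```

The resulting text can be pasted or uploaded on the browser's custom track page; features sharing the same group value are drawn together as one item.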


Extensions to GeneID

Syntenic Gene Prediction (dual-genome)
Evidence-based (constrained) gene prediction
U12 intron detection
Combining gene predictions
Selenoprotein gene prediction


Syntenic Gene Prediction: SGP2


Minor splicing and U12 introns

U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
But they can be found in 2-3% of genes
Normally ignored, but this causes annotation problems
Easy to predict due to highly conserved donor and branch sites


Splice Signal Profiles: major and minor


Gathering U12 Introns

[Figure: flowchart for building U12DB. Labels: genome (Human, Fruit Fly); all annotated introns; score; predict; aln to EST/mRNA; published; merge; ENSEMBL?; ortholog search (17 species) + spliced alignment; U12 DB. Counts shown: 563568385, 658, 2084, 597.]


Coming Soon: GenePC, a Gene Prediction Combiner


Tutorial Homepage

http://genome.imim.es/courses/Pamplona07/

GBL Homepage

http://genome.imim.es/