Jul-01-08 06/16/08 Bioinformatics Workshop - Malaga
Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation
Barcelona, Spain
Node 1 of the INB
GN1 Bioinformática y Genómica
Genome Bioinformatics Lab, CRG
Roderic Guigó (PI)
Themes
Gene prediction
  ab initio => GeneID
  dual-genome => SGP2
  U12 introns => GeneID v1.3 and U12DB
  combiner => GenePC
Genome feature visualization
  gff2ps
Alternative splicing
  ASTALAVISTA
Gene expression regulatory elements
  meta and mmeta alignment
Eukaryotic gene structure
[Figure: gene diagram labeling the exons, introns, upstream regulator, downstream regulator, promoter, and the acceptor and donor splice sites]
The Splicing Code
Gene Prediction Strategies
Expressed sequence (cDNA) or protein sequence available?
  Yes: spliced alignment
    BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
  No: integrated gene prediction. Informant genome(s) available?
    Yes: dual or n-genome de novo predictors
      SGP2, Twinscan, NSCAN (Genomescan: same or cross genome protein blastx)
    No: ab initio predictors
      geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
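
The decision tree above can be written down as a tiny function. This is only an illustrative sketch: the function name and the returned labels are made up here, and real projects often combine several kinds of evidence rather than picking a single branch.

```python
def choose_strategy(has_transcript_or_protein: bool, has_informant_genome: bool) -> str:
    """Return the family of methods suggested by the slide's decision tree."""
    if has_transcript_or_protein:
        # cDNA/EST or protein evidence: align it to the genome
        return "spliced alignment"
    if has_informant_genome:
        # a related genome is available: use conservation as extra signal
        return "dual or n-genome de novo prediction"
    # nothing but the target genome itself
    return "ab initio prediction"


print(choose_strategy(False, True))
```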
Gene Prediction Strategies
Frameworks for gene prediction
Hierarchical exon-building and chaining
Hidden Markov Models (many flavors)
  HMM, GHMM, GPHMM, Phylo-HMM
Conditional Random Fields (new!)
  Conrad, Contrast... and, no doubt, more to come
All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)
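
As a rough illustration of the dynamic programming idea, here is a minimal Viterbi sketch on a toy two-state HMM. Everything in it is an assumption made for the example: the state labels (N for noncoding, C for coding), the emission and transition probabilities, and the input sequence. Real gene finders use far richer models (generalized states, frames, signals).

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path for the observations (log-space)."""
    # best[i][s]: best log-probability of any path that ends in state s at position i
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(probabilities):
    return {key: math.log(value) for key, value in probabilities.items()}


# Toy parameters (hypothetical): the "coding" state prefers G/C, "noncoding" prefers A/T
STATES = ("N", "C")
LOG_START = logs({"N": 0.5, "C": 0.5})
LOG_TRANS = {"N": logs({"N": 0.9, "C": 0.1}), "C": logs({"N": 0.1, "C": 0.9})}
LOG_EMIT = {
    "N": logs({"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}),
    "C": logs({"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}),
}

path = viterbi("AAAAGGGGGGGGAAAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join(path))
```

With these toy numbers the G-rich stretch is long enough to pay for the two state switches, so the path labels it as C and the flanks as N.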
How does GeneID approach gene prediction?
The gene prediction problem
[Figure: candidate sites (acceptors a1-a4, donors d1-d5) define candidate exons (e1-e8), which are assembled into genes (example shown: e1, e4, e8)]
GeneID
Geneid follows a hierarchical structure: signal → exon → gene
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm: maximize score of assembled exons → assembled gene
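
The assembly step can be sketched as a small dynamic program that picks the highest-scoring chain of non-overlapping exons. This is a simplified stand-in, not geneid's actual algorithm: it ignores reading frame and exon-type compatibility, and the exon names, coordinates and scores below are invented for the example.

```python
def best_gene(exons):
    """exons: list of (name, start, end, score). Returns (total_score, chain)."""
    exons = sorted(exons, key=lambda e: e[2])  # order by end coordinate
    best = []  # best[i] = (score, chain) of the best chain ending with exons[i]
    for i, (name, start, end, score) in enumerate(exons):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            # exon j may precede exon i only if it ends before exon i starts
            if exons[j][2] < start and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + score, prev_chain + [exons[i]]))
    return max(best, key=lambda b: b[0]) if best else (0.0, [])


# Hypothetical candidate exons
CANDIDATES = [
    ("e1", 100, 200, 3.0), ("e2", 150, 260, 1.0), ("e3", 180, 300, 2.5),
    ("e4", 350, 420, 2.0), ("e5", 400, 500, 1.5), ("e6", 410, 520, 0.5),
    ("e7", 480, 600, -0.2), ("e8", 650, 800, 4.0),
]

score, chain = best_gene(CANDIDATES)
print(score, [e[0] for e in chain])
```

The quadratic inner loop keeps the sketch short; sorting by end coordinate is what makes each exon's best predecessor available by the time it is needed.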
Training GeneID

GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT

Position  1    2    3    4    5    6    7    8    9
A         0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C         0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G         0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T         0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
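
The frequency matrix on this slide can be recomputed from the ten aligned 9-mers, which look like donor splice sites given the invariant GT at positions 4-5. The sketch below counts bases per column and then scores a candidate site with a simple log-odds sum. The uniform background of 0.25 and the small pseudo-frequency are assumptions added for the example, and geneid's real training procedure is more elaborate than this.

```python
import math
from collections import Counter

SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def frequency_matrix(sites, alphabet="ACGT"):
    """One dict of base frequencies per alignment column."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in alphabet})
    return matrix


def log_odds_score(seq, matrix, background=0.25, pseudo=0.01):
    """Sum over positions of log2(smoothed frequency / background)."""
    total = 0.0
    for pos, base in enumerate(seq):
        # pseudo-frequency avoids log(0) for bases never seen in a column
        smoothed = (matrix[pos][base] + pseudo) / (1.0 + 4 * pseudo)
        total += math.log2(smoothed / background)
    return total


freqs = frequency_matrix(SITES)
print(freqs[3]["G"], freqs[4]["T"])          # the invariant GT columns
print(log_odds_score("AAGGTAAGT", freqs) > log_odds_score("AAGCCAAGT", freqs))
```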
Running GeneID
command line or on geneid server

NAME
  geneid - a program to annotate genomic sequences

SYNOPSIS
  geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m] [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord] [-O <gff_exons_file>] [-R <gff_annotation-file>] [-S <gff_homology_file>] [-P <parameter_file>] [-E exonweight] [-V evidence_exonweight] [-Bv] [-h] <locus_seq_in_fasta_format>

RELEASE
  geneid v 1.3

OPTIONS
  -b: Output Start codons
  -d: Output Donor splice sites
  -a: Output Acceptor splice sites
  -e: Output Stop codons
  -f: Output Initial exons
  -i: Output Internal exons
  -t: Output Terminal exons
  -n: Output introns
  -s: Output Single genes
  -x: Output all predicted exons
  -z: Output Open Reading Frames

  -D: Output genomic sequence of exons in predicted genes
  -A: Output amino acid sequence derived from predicted CDS

  -p: Prefix this value to the names of predicted genes, peptides and CDS

  -G: Use GFF format to print predictions
  -3: Use GFF3 format to print predictions
  -X: Use extended-format to print gene predictions
  -M: Use XML format to print gene predictions
  -m: Show DTD for XML-format output

  -j: Begin prediction at this coordinate
  -k: End prediction at this coordinate
  -W: Only Forward sense prediction (Watson)
  -C: Only Reverse sense prediction (Crick)
  -U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
  -r: Use recursive splicing
  -F: Force the prediction of one gene structure
  -o: Only running exon prediction (disable gene prediction)
  -O <exons_filename>: Only running gene prediction (not exon prediction)
  -Z: Activate Open Reading Frames searching

  -R <exons_filename>: Provide annotations to improve predictions
  -S <HSP_filename>: Using information from protein sequence alignments to improve predictions

  -E: Add this value to the exon weight parameter (see parameter file)
  -V: Add this value to the score of evidence exons
  -P <parameter_file>: Use other than default parameter file (human)

  -B: Display memory required to execute geneid given a sequence
  -v: Verbose. Display info messages
  -h: Show this help

AUTHORS
  geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo. Parameter files have been created by Genis Parra and Tyler Alioto. Any bug or suggestion can be reported to [email protected]
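
For scripted runs, the command line can be assembled in Python. The helper below is a hypothetical wrapper written for this example: it only builds the argument list from flags listed in the help text above (-G, -W, -C, -U, -P) and does not run anything. The file names are placeholders.

```python
import shlex


def geneid_command(fasta_path, parameter_file=None, gff=True, strand=None, allow_u12=False):
    """Build an argument list for geneid from a few of the documented options."""
    cmd = ["geneid"]
    if gff:
        cmd.append("-G")            # GFF output
    if strand == "+":
        cmd.append("-W")            # forward (Watson) strand only
    elif strand == "-":
        cmd.append("-C")            # reverse (Crick) strand only
    if allow_u12:
        cmd.append("-U")            # needs U12 parameters in the parameter file
    if parameter_file:
        cmd += ["-P", parameter_file]
    cmd.append(fasta_path)
    return cmd


# Placeholder file names
cmd = geneid_command("HS307871.fa", parameter_file="human.param", strand="+")
print(shlex.join(cmd))
```

Assuming a geneid binary is on the PATH, the list could then be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` and the predictions read from `stdout`.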
GeneID output
## gff-version 2
## date Mon Nov 26 14:37:15 2007
## source-version: geneid v 1.2 -- [email protected]
# Sequence HS307871 - Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score = 16.20
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
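
A quick way to sanity-check output like this is to parse the exon lines and compare them with the header. The sketch below copies the records from the slide and splits on whitespace, because the tabs were lost in the transcript; real GFF is tab-delimited. The function and key names are illustrative.

```python
GENEID_OUTPUT = """\
## gff-version 2
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
"""


def parse_exons(text):
    """Parse the nine GFF fields of every non-comment line."""
    exons = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seq, source, feature, start, end, score, strand, frame, group = line.split(None, 8)
        exons.append({
            "seq": seq, "source": source, "feature": feature,
            "start": int(start), "end": int(end), "score": float(score),
            "strand": strand, "frame": int(frame), "group": group,
        })
    return exons


exons = parse_exons(GENEID_OUTPUT)
coding_length = sum(e["end"] - e["start"] + 1 for e in exons)  # GFF coordinates are inclusive
total_score = sum(e["score"] for e in exons)
print(len(exons), coding_length, coding_length // 3)
```

The exon lengths add up to 1173 bp, which is 391 codons, in line with the "391 aa" in the header; the exon scores add up to the gene score to within rounding.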
GFF: a standard annotation format
Stands for: Gene Finding Format -or- General Feature Format
Designed as a single line record for describing features on DNA sequence -- originally used for gene prediction output
9 tab-delimited fields common to all versions:
  seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
GFF2: group is a unique description, usually the gene name.
  NCOA1
GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced, start_codon and stop_codon are required features for CDS
  transcript_id "NM_056789" ; gene_id "NCOA1"
GFF3: Capitalized tags follow Sequence Ontology (SO) relationships, FASTA seqs can be embedded
  ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
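
The difference between the three dialects is easiest to see by parsing the ninth field. The function below is a minimal sketch using the slide's own examples. The dialect labels ("gff2", "gtf", "gff3") are names chosen here, and the parsing is deliberately loose: for instance it does not handle the URL-escaping that strict GFF3 uses.

```python
import re


def parse_group(group, dialect):
    """Turn the ninth GFF field into a dict, depending on the dialect."""
    if dialect == "gff2":
        # free-text description, usually the gene name
        return {"group": group.strip()}
    if dialect == "gtf":
        # tag "value" pairs separated by semicolons
        return dict(re.findall(r'(\S+)\s+"([^"]*)"', group))
    if dialect == "gff3":
        # tag=value pairs separated by semicolons
        pairs = (item.split("=", 1) for item in group.split(";") if item.strip())
        return {tag.strip(): value.strip().strip('"') for tag, value in pairs}
    raise ValueError(f"unknown dialect: {dialect}")


print(parse_group("NCOA1", "gff2"))
print(parse_group('transcript_id "NM_056789" ; gene_id "NCOA1"', "gtf"))
print(parse_group('ID=NM_056789_exon1; Parent=NM_056789; note="5\' UTR exon"', "gff3"))
```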
Visualizing features with gff2ps
generated by Josep Abril
Visualizing features on UCSC genome browser (custom tracks)
If “your” genome is served by UCSC, this is a good option because:
  browsing is dynamic
  access to other annotations
  can view DNA sequence
  can do complex intersections and filtering
gff2ps is good when:
  your genome is not on UCSC
  you want more flexible layout options
  you want to run it ‘offline’
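
For the UCSC route, predictions in GFF can be uploaded as a custom track by putting a track line in front of them. The sketch below is an assumption-laden example: the helper name, the track name and description, and the chromosome name are all hypothetical, and it only rewrites the sequence column. Coordinates would also have to be expressed relative to the chromosome in the UCSC assembly, which this sketch does not do.

```python
def to_custom_track(gff_lines, name, description, chrom=None):
    """Prefix GFF records with a UCSC-style track line; re-join fields with tabs."""
    header = f'track name="{name}" description="{description}"'
    body = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments
        fields = line.split(None, 8)
        if chrom is not None:
            fields[0] = chrom  # UCSC expects its own chromosome names
        body.append("\t".join(fields))
    return "\n".join([header] + body) + "\n"


track = to_custom_track(
    ["## gff-version 2",
     "HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1"],
    name="geneid", description="geneid predictions", chrom="chr22",
)
print(track)
```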
Extensions to GeneID
Syntenic Gene Prediction (dual-genome)
Evidence-based (constrained) gene prediction
U12 intron detection
Combining gene predictions
Selenoprotein gene prediction
Syntenic Gene Prediction: SGP2
Minor splicing and U12 introns
U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
But they can be found in 2-3% of genes
Normally ignored, but this causes annotation problems
Easy to predict due to highly conserved donor and branch sites
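
The two percentages fit together with a back-of-the-envelope calculation. If a fraction p of introns is U12-type and a gene has n introns, then under the simplifying assumption that introns are independent, the chance that a gene has at least one U12 intron is 1 - (1 - p)^n. The value n = 8 introns per gene is an assumed round figure for the example, not a number from the slides.

```python
def fraction_of_genes_with_u12(p_intron, introns_per_gene):
    """P(at least one U12 intron) assuming independent introns."""
    return 1.0 - (1.0 - p_intron) ** introns_per_gene


# 0.33% of introns (from the slide), 8 introns per gene (assumed)
estimate = fraction_of_genes_with_u12(0.0033, 8)
print(f"{estimate:.3f}")
```

The estimate falls inside the 2-3% range quoted above.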
Splice Signal Profiles: major and minor
Gathering U12 Introns
[Flowchart of the U12DB pipeline. Labels: genome (Human, Fruit Fly); all annotated introns; score; predict; aln to EST/mRNA; merge; published; ENSEMBL?; ortholog search (17 species) + spliced alignment; U12 DB. Counts shown: 563568385, 658, 2084, 597]
Coming Soon: GenePC, a Gene Prediction Combiner
Tutorial Homepage
http://genome.imim.es/courses/Malaga08/
GBL Homepage
http://genome.imim.es/