Ensembl
Steve Searle
Joint project leader, Ensembl Genebuild team
Outline
• Ensembl project overview
• Core database schema and API
• Pipeline
• Genomic annotation
• Comparative genomics
• Variation data
• Ensembl BioMart datamining db
• Making the data available
What is Ensembl? Project aims
• Funded to provide vertebrate genomes to the world
• Aims to provide high-quality automated genome annotation
• Aims to be a leading group in genome analysis
• All software, data and results freely available
What is Ensembl? Project background
• Group split between EBI and Sanger
• Mainly Wellcome Trust funded (recently received a new five year grant for 2006-2011)
• Largest dedicated compute in biology in Europe
• Developer community > 300 people, including companies
Ensembl - Technical overview
• Data storage
  – MySQL databases (~160Gb in current release)
    • Core databases - annotation for each species
    • Variation databases - variation data for some species
    • Compara - single database containing all comparative genomic data for species in Ensembl
    • Mart - set of denormalised databases for datamining
• Data production
  – Pipeline systems running automatic annotation on a compute farm of 800 CPUs
• Interfaces
  – Website
  – Mart (datamining tool)
  – Apollo
  – SQL
  – APIs (both Perl and Java)
Currently 21 organisms in Ensembl
Open source
• Object model
  – standard interface makes it easy for others to build custom applications on top of Ensembl data
• Open discussion of design ([email protected])
• Most major pharmaceutical companies and many academics on mailing list
• Ensembl installs worldwide
  – Both public and commercial, e.g.
    Gramene (CSHL)
    Ciona-sg (Temasek)
    Arabidopsis (NASC)
    Fugu (IMCB)
The Ensembl Core Database
• Relational database (MySQL) containing the genomic sequence and annotations on it (genes, alignments, ab initio predictions etc)
• Data stored in it throughout analysis process and the website displays features from it
• Current schema has 68 tables
• Ensembl core API team control changes
Requirements for the schema
• Store data for human genome
• … and all the other genomes we have
• … and all the genomes we might get
• Flexible to add more data
• Easy to adapt to new genome
• Responds fast enough for web site display and pipelined genebuild
System Context
[Diagram labels: Ensembl DBs; Perl API; Java API (EnsJ); www; Pipeline; Apollo; Other Scripts & Applications; Mart DB; MartShell; MartView]
Sequence Tables
assembly
asm_seq_region_id int
cmp_seq_region_id int
asm_start int
asm_end int
cmp_start int
cmp_end int
ori int
seq_region
seq_region_id int
name varchar
coord_system_id int
length int
seq_region_attrib
seq_region_id int
attrib_type_id int
value varchar
attrib_type
attrib_type_id int
code varchar
name varchar
description text
dna
seq_region_id int
sequence mediumtext
dnac
seq_region_id int
sequence mediumblob
n_line text
coord_system
coord_system_id int
name varchar
version varchar
attrib set ("default_version", "sequence_level")
rank int
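The assembly table above relates a component sequence region (for example a contig) to an assembled one (for example a chromosome). The following is a minimal sketch of how one row could be used to map a position, assuming 1-based inclusive coordinates and that `ori` is 1 or -1. The `AssemblyRow` class, the function name and the example rows are hypothetical and are not part of the Ensembl API.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRow:
    """One row of the assembly table (column names as on the slide)."""
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int  # assumed: 1 = same orientation, -1 = component is reversed


def component_to_assembled(row: AssemblyRow, cmp_pos: int) -> int:
    """Map a 1-based position on the component onto the assembled region."""
    if not row.cmp_start <= cmp_pos <= row.cmp_end:
        raise ValueError("position lies outside the mapped part of the component")
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    # Reversed component: count backwards from the end of the assembled span.
    return row.asm_end - offset


# Hypothetical rows: a 1000 bp contig placed at 1001-2000 on a chromosome.
forward = AssemblyRow(asm_start=1001, asm_end=2000, cmp_start=1, cmp_end=1000, ori=1)
reverse = AssemblyRow(asm_start=1001, asm_end=2000, cmp_start=1, cmp_end=1000, ori=-1)

print(component_to_assembled(forward, 10))  # 1010
print(component_to_assembled(reverse, 10))  # 1991
```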
Feature Tables
• Feature tables describe annotations with positions in sequence.
• Each feature is associated with a seq_region and has a start, end, and orientation on the seq_region.
• There is no central feature table. There are tables specific to each feature type (DNA/DNA alignments, DNA/Protein alignments, Repeats, Simple features).
• Different feature tables have different attributes, but always have a seq_region position.
analysis
analysis_id int
created datetime
logic_name varchar
db varchar
db_version varchar
db_file varchar
program varchar
program_version varchar
program_file varchar
parameters varchar
module varchar
module_version varchar
gff_source varchar
gff_feature varchar
simple_feature
simple_feature_id int
seq_region_id int
seq_region_start int
seq_region_end int
seq_region_strand tinyint
display_label varchar
analysis_id int
score double
any feature
• usually has an any_feature_id (int)
• contains a sequence position, with or without strand, on a sequence region
• usually contains a string to display (varchar)
• usually links to the analysis responsible for calculating it (int)
• contains any number of other attributes
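Because every feature table carries a seq_region position, fetching the features in a region comes down to an interval-overlap query. The sketch below illustrates this with an in-memory SQLite table that copies the simple_feature columns from the slide. The rows are invented, SQLite stands in for MySQL, and `features_overlapping` is a hypothetical helper rather than an Ensembl function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE simple_feature (
           simple_feature_id INTEGER PRIMARY KEY,
           seq_region_id     INTEGER,
           seq_region_start  INTEGER,
           seq_region_end    INTEGER,
           seq_region_strand INTEGER,
           display_label     TEXT,
           analysis_id       INTEGER,
           score             REAL)"""
)
# Invented example rows.
conn.executemany(
    "INSERT INTO simple_feature VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 100, 200, 1, "cpg_a", 1, 10.0),
        (2, 1, 500, 600, -1, "cpg_b", 1, 20.0),
        (3, 2, 150, 250, 1, "cpg_c", 1, 30.0),
    ],
)


def features_overlapping(conn, seq_region_id, start, end):
    """Return labels of features overlapping [start, end] on one seq_region."""
    rows = conn.execute(
        "SELECT display_label FROM simple_feature "
        "WHERE seq_region_id = ? "
        "AND seq_region_start <= ? AND seq_region_end >= ? "
        "ORDER BY seq_region_start",
        (seq_region_id, end, start),
    )
    return [row[0] for row in rows]


print(features_overlapping(conn, 1, 150, 550))  # ['cpg_a', 'cpg_b']
```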
Other features
protein_align_feature
protein_align_feature_id int
Sequence position
hit_start int
hit_end int
hit_name varchar
analysis_id int
score double
evalue double
perc_ident float
cigar_line text
dna_align_feature
dna_align_feature_id int
Sequence position
hit_start int
hit_end int
hit_strand tinyint
hit_name varchar
analysis_id int
score double
evalue double
perc_ident float
cigar_line text
repeat_feature
repeat_feature_id int
Sequence position
repeat_start int
repeat_end int
repeat_consensus_id int
analysis_id int
score double
prediction_exon
prediction_exon_id int
prediction_transcript_id int
rank int
Sequence position
start_phase tinyint
score double
p_value double
prediction_transcript
prediction_transcript_id int
Sequence position
analysis_id int
repeat_consensus
repeat_consensus_id int
repeat_name varchar
repeat_class varchar
consensus text
Tables for Genes
exon
exon_id int
seq_region_id int
seq_region_start int
seq_region_end int
seq_region_strand tinyint
phase tinyint
end_phase tinyint
exon_stable_id
exon_id int
stable_id varchar
version int
exon_transcript
exon_id int
transcript_id int
rank int
gene
gene_id int
status enum
description
source varchar
biotype varchar
analysis_id int
seq_region_id int
seq_region_start int
seq_region_end int
seq_region_strand tinyint
display_xref_id int
gene_stable_id
gene_id int
stable_id varchar
version int
transcript
transcript_id int
gene_id int
seq_region_id int
biotype varchar
status enum
seq_region_start int
seq_region_end int
seq_region_strand tinyint
display_xref_id int
transcript_stable_id
transcript_id int
stable_id varchar
version int
translation_stable_id
translation_id int
stable_id varchar
version int
translation
translation_id int
transcript_id int
seq_start int
start_exon_id int
seq_end int
end_exon_id int
transcript_attrib
transcript_id int
attrib_type_id int
value varchar
translation_attrib
translation_id int
attrib_type_id int
value varchar
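The translation table locates the coding region by exon: start_exon_id with seq_start, and end_exon_id with seq_end. The sketch below computes the coding length of a transcript, assuming that seq_start and seq_end are 1-based offsets within their exons and that exon ranks come from exon_transcript. The function and its arguments are hypothetical and only illustrate the arithmetic.

```python
def coding_length(exon_lengths, start_rank, seq_start, end_rank, seq_end):
    """Length in bases of the coding region of one transcript.

    exon_lengths: exon lengths in rank order (rank 1 first)
    start_rank, end_rank: ranks of the exons named by start_exon_id / end_exon_id
    seq_start, seq_end: assumed 1-based offsets within those exons
    """
    if start_rank == end_rank:
        return seq_end - seq_start + 1
    first = exon_lengths[start_rank - 1] - seq_start + 1   # rest of the start exon
    middle = sum(exon_lengths[start_rank:end_rank - 1])    # exons fully inside
    last = seq_end                                         # beginning of the end exon
    return first + middle + last


# Invented transcript: three exons of 100, 50 and 80 bp.
print(coding_length([100, 50, 80], start_rank=1, seq_start=41, end_rank=3, seq_end=30))  # 140
```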
Other tables
• Sets of tables to handle:
  – Cross references of Ensembl features to external databases
  – Markers
  – QTLs
  – Regulatory regions and factors
  – Stable ID archive
  – Affymetrix probe data
  – Misc features
  – Density features
• Tables containing meta information about the database
• Karyotype bands
• Protein annotation
• Supporting evidence
• Assembly exceptions (haplotypes and PARs)
Ensembl APIs
• Programmatic access to Ensembl databases is via three main APIs:
  – ensembl core API: access to genome database
  – ensembl compara API: access to compara database
  – ensembl variation API: access to variation database
• All three have the same basic structure
  – Data objects to represent biological entities, e.g. Gene, Homology, Variation
  – DataAdaptor objects to store and retrieve data objects from the database
• Data production APIs
  – ensembl-pipeline: genebuild pipeline
  – ensembl-analysis: analysis wrapper objects
  – ensembl-hive: compara pipeline
The Perl Core API
• The Perl core API provides a layer of abstraction over the Ensembl core databases.
• Written in Object-Oriented Perl.
• Can be used to get information into or out of Ensembl databases.
• Insulates programmers to some extent from changes to the database schema.
• Insulates programmer from coordinate transformations
Data Objects
• Information is obtained from the API in the form of Data Objects.
• Each object represents some data which is stored in the database.
• A Gene object represents a gene, a Transcript object represents a transcript, a Marker Object represents a Marker, etc.
Data Objects – Code Example
# print out the start, end and strand of a transcript
print $transcript->start(), '-', $transcript->end(), '(', $transcript->strand(), ")\n";

# print out the stable identifier for an exon
print $exon->stable_id(), "\n";

# print out the name of a marker and its primer sequences
print $marker->display_marker_synonym()->name, "\n";
print "left primer: ", $marker->left_primer(), "\n";
print "right primer: ", $marker->right_primer(), "\n";

# set the start and end of a simple feature
$simple_feature->start(10);
$simple_feature->end(100);
Object Adaptors
• Object Adaptors are factories for Data Objects.
• Data Objects are retrieved from and stored in databases using Object Adaptors.
• Each Object Adaptor is responsible for creating objects of only one particular type.
• Data Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database.
• All the SQL is in the Object Adaptors
Object Adaptors – Code Example
# fetch a gene by its internal identifier
$gene = $gene_adaptor->fetch_by_dbID(1234);

# fetch a gene by its stable identifier
$gene = $gene_adaptor->fetch_by_stable_id('ENSG0000005038');

# store a transcript in the database
$transcript_adaptor->store($transcript);

# remove an exon from the database
$exon_adaptor->remove($exon);

# get all transcripts having a specific interpro domain
@transcripts = @{$transcript_adaptor->fetch_all_by_domain('IPR000980')};
The DBAdaptor and the Registry
[Diagram: DB → DBAdaptor → Object Adaptors (GeneAdaptor, MarkerAdaptor, …) → Data Objects (Gene, Marker, …)]
• The Database Adaptor is a factory for Object Adaptors
• It is used to connect to the database and to obtain Object Adaptors
• Registry enables access to multiple databases using information from a config file (important for compara work)
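The layering described above (a database adaptor hands out object adaptors, and all SQL lives in the object adaptors) can be sketched independently of the Perl API. The following Python toy uses SQLite and a deliberately simplified one-table schema, with the stable id stored directly on the gene row. The class and method names echo the slides, but this is an illustration of the pattern under those assumptions, not the Ensembl implementation.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    """Data object: holds data only, knows nothing about SQL."""
    stable_id: str
    biotype: str
    db_id: Optional[int] = None


class GeneAdaptor:
    """Object adaptor: the only place that contains SQL for genes."""

    def __init__(self, conn):
        self.conn = conn

    def store(self, gene):
        cur = self.conn.execute(
            "INSERT INTO gene (stable_id, biotype) VALUES (?, ?)",
            (gene.stable_id, gene.biotype),
        )
        gene.db_id = cur.lastrowid
        return gene.db_id

    def fetch_by_dbID(self, db_id):
        row = self.conn.execute(
            "SELECT gene_id, stable_id, biotype FROM gene WHERE gene_id = ?",
            (db_id,),
        ).fetchone()
        return Gene(row[1], row[2], row[0]) if row else None

    def remove(self, gene):
        self.conn.execute("DELETE FROM gene WHERE gene_id = ?", (gene.db_id,))
        gene.db_id = None


class DBAdaptor:
    """Database adaptor: owns the connection and is a factory for object adaptors."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS gene ("
            "gene_id INTEGER PRIMARY KEY, stable_id TEXT, biotype TEXT)"
        )

    def get_GeneAdaptor(self):
        return GeneAdaptor(self.conn)


db = DBAdaptor()
gene_adaptor = db.get_GeneAdaptor()
new_id = gene_adaptor.store(Gene("EXAMPLE_GENE_1", "protein_coding"))
print(gene_adaptor.fetch_by_dbID(new_id).stable_id)  # EXAMPLE_GENE_1
```

Keeping SQL out of the data objects is what lets a schema change be absorbed in one adaptor class rather than in every script that uses a Gene.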
Slices
• A Slice Data Object represents an arbitrary region of a genome.
• Slices are not directly stored in the database.
• A Slice is used to request sequence or features from a specific region in a specific coordinate system.
[Diagram labels: chr20; Clone AC022035]
Slices – Code Example
# get the slice adaptor
$slice_adaptor = $db->get_SliceAdaptor();

# fetch a slice on a region of chromosome 12
$slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6);

# print out the sequence from this region
print $slice->seq();

# get all clones in the database and print out their names
@slices = @{$slice_adaptor->fetch_all('clone')};
foreach $slice (@slices) {
  print $slice->seq_region_name(), "\n";
}
Features
• Features are Data Objects with associated genomic locations.
• All Features have start, end, strand and slice attributes.
• Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices).
• Gene
• Transcript
• Exon
• PredictionTranscript
• PredictionExon
• DnaAlignFeature
• ProteinAlignFeature
• SimpleFeature
• MarkerFeature
• QtlFeature
• MiscFeature
• KaryotypeBand
• RepeatFeature
• AssemblyExceptionFeature
• DensityFeature
A Complete Code Example
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  -host   => 'ensembldb.ensembl.org',
  -dbname => 'homo_sapiens_core_35_35h',
  -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);
foreach my $sf (@{$slice->get_all_SimpleFeatures()}) {
  my $start  = $sf->start();
  my $end    = $sf->end();
  my $strand = $sf->strand();
  my $score  = $sf->score();
  print "$start-$end($strand) $score\n";
}
A Gene Object Code Example
#!/usr/bin/perl -w
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use strict;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  -host   => 'ensembldb.ensembl.org',
  -dbname => 'homo_sapiens_core_35_35h',
  -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);
foreach my $gene (@{$slice->get_all_Genes_by_type('ensembl')}) {
  print "Gene ", $gene->stable_id, " ", $gene->start, " ", $gene->end, "\n";
  foreach my $trans (@{$gene->get_all_Transcripts}) {
    print " Trans ", $trans->stable_id, "\n";
    # wrap the translated protein sequence at 60 residues per line
    my $tlnseq = $trans->translate->seq;
    $tlnseq =~ s/(.{1,60})/$1\n/g;
    print " ", $tlnseq;
  }
}
Coordinate Transformations
• The API provides the means to convert between any related coordinate systems in the database.
• Feature methods transfer, transform, project can be used to move features between coordinate systems.
• Slice method project can be used to move features between coordinate systems.
Feature::transfer
• The Feature method transfer moves a feature from one Slice to another.
• The Slice may be in the same coordinate system or a different coordinate system.
[Diagram labels: Chr20; Chr17; AC099811; Chr17; ChrX]
Feature::transfer – Code Example
# fetch an exon from the database
$exon = $exon_adaptor->fetch_by_stable_id('ENSE00001180238');
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

# transfer the exon to a small slice just covering it
$exon_slice = $slice_adaptor->fetch_by_Feature($exon);
$exon = $exon->transfer($exon_slice);
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

Sample output:
Exon is on slice: chromosome:NCBI34:12:1:132078379:1
Exon coords: 56452706-56452951
Exon is on slice: chromosome:NCBI34:12:56452706:56452951:1
Exon coords: 1-246
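The coordinates in the sample output follow from simple arithmetic when both slices are on the forward strand of the same sequence region: the new coordinate is the old one minus the new slice's start, plus one. A minimal Python sketch of that special case follows. The function name is hypothetical, and the real transfer method also handles strand changes and other coordinate systems, which this does not.

```python
def to_slice_coords(start, end, slice_start):
    """Re-express 1-based coordinates relative to a forward-strand slice
    that begins at slice_start on the same sequence region."""
    return start - slice_start + 1, end - slice_start + 1


# Exon and slice coordinates taken from the sample output.
print(to_slice_coords(56452706, 56452951, 56452706))  # (1, 246)
```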
Stability of API
• Ensembl API changes to meet our needs
• Request for greater stability from users
• Some methods are now labelled as stable and we guarantee that they will not change for at least 2 years.
Runnables and RunnableDBs
• Runnables are Perl objects which wrap analysis programs, e.g. the Blast runnable wraps blast. Methods:
  – run
  – parse_results
    • Generates Ensembl data objects
  – output
    • Returns generated data objects
• RunnableDBs are Perl objects which wrap Runnables, allowing them to retrieve input data from and store output data into Ensembl databases
  – fetch_input
  – write_output
Runnable example
my $seq = Bio::SeqIO->new(-file => "<test.fa", -format => 'Fasta')->next_seq;

my $slice = Bio::EnsEMBL::Slice->new(
  -seq             => $seq->seq,
  -coord_system    => Bio::EnsEMBL::CoordSystem->new(-name => 'contig', -rank => 1),
  -seq_region_name => $seq->display_id,
  -start           => 1,
  -end             => $seq->length);

my $genscan_runnable = Bio::EnsEMBL::Analysis::Runnable::Genscan->new(
  -query    => $slice,
  -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'genscan'));

$genscan_runnable->run;

# blast the translation of every Genscan prediction against vertebrate RNAs
my @output;
foreach my $prediction (@{$genscan_runnable->output}) {
  my $blast_run = Bio::EnsEMBL::Analysis::Runnable::Blast->new(
    -query    => $prediction->translate,
    -parser   => Bio::EnsEMBL::Analysis::Tools::BPliteWrapper->new(),
    -database => 'embl_vertrna',
    -program  => 'wutblastn',
    -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'vertrna'));
  $blast_run->run;
  push(@output, @{$blast_run->output});
}
RunnableDB example
#!/usr/local/ensembl/bin/perl -w

use strict;
use Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Pipeline::Analysis;

my $db = new Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor(
  -host   => 'localhost',
  -user   => 'root',
  -dbname => 'test_db');

my $anal = $db->get_AnalysisAdaptor->fetch_by_logic_name('Uniprot');

# build the RunnableDB class name from the analysis and load that module
my $rdbstr = "Bio::EnsEMBL::Analysis::RunnableDB::" . $anal->module;
eval "require $rdbstr" or die "Cannot load $rdbstr: $@";

my $runobj = $rdbstr->new(
  -db       => $db,
  -input_id => 'contig::AL1347153.1.3517:1:3571:1',
  -analysis => $anal);

$runobj->fetch_input;
$runobj->run;
$runobj->write_output;
Writing Runnables and RunnableDBs
• A lot of functionality is implemented in the base classes
• At its simplest just requires implementing:
  – parse_results in the Runnable
  – get_adaptor in the RunnableDB
  – fetch_input in the RunnableDB
• Other methods which may need overriding:
  – write_output in the RunnableDB
  – run_analysis in the Runnable
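The division of labour between the two layers can be sketched outside Perl. In the Python toy below, the method names follow the slides (run_analysis, parse_results, output, fetch_input, write_output), while the GC-content "analysis" and the dictionary standing in for a database are invented for illustration. It assumes that run simply chains run_analysis and parse_results, which the slides do not state.

```python
class Runnable:
    """Wraps one analysis; knows nothing about databases."""

    def __init__(self, query):
        self.query = query
        self.output = []

    def run_analysis(self):
        raise NotImplementedError

    def parse_results(self, raw):
        raise NotImplementedError

    def run(self):
        # Assumed flow: run the program, then turn its raw output into objects.
        self.output = self.parse_results(self.run_analysis())


class GCContentRunnable(Runnable):
    """Invented toy analysis standing in for an external program."""

    def run_analysis(self):
        seq = self.query.upper()
        return f"{seq.count('G') + seq.count('C')}/{len(seq)}"

    def parse_results(self, raw):
        gc, total = (int(x) for x in raw.split("/"))
        return [{"label": "gc_fraction", "score": gc / total}]


class RunnableDB:
    """Wraps a Runnable with input fetching and output storage."""

    def __init__(self, db, input_id, runnable_class):
        self.db = db  # a dict standing in for a database
        self.input_id = input_id
        self.runnable_class = runnable_class
        self.runnable = None

    def fetch_input(self):
        self.runnable = self.runnable_class(self.db["sequences"][self.input_id])

    def run(self):
        self.runnable.run()

    def write_output(self):
        self.db["features"].setdefault(self.input_id, []).extend(self.runnable.output)


db = {"sequences": {"contig1": "GGCCAATT"}, "features": {}}
job = RunnableDB(db, "contig1", GCContentRunnable)
job.fetch_input()
job.run()
job.write_output()
print(db["features"]["contig1"])  # [{'label': 'gc_fraction', 'score': 0.5}]
```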
Pipeline Summary
Example pipeline
Current hardware
• 8x ES40 Alpha (667 MHz) with 2Tb fibre channel storage
• 10x ES45 Alpha (1 GHz) with 5Tb fibre channel storage
• 3x Itanium 4 CPU with 1.6Tb storage
• 400 HS20 IBM Blades (2x 2.8 or 3.2 GHz PIV + 4 Gig memory + 2TB clustered SAN filesystem or 600GB clustered IDE filesystem, both IBM GPFS)
• Tru64 UNIX/Linux
• 21 MySQL (v 4.1) instances
• Most binaries and all sequence databases stored locally (avoids using NFS)
Genome annotation overview
Raw compute - Alignments against protein and DNA dbs, and other basic analyses
Automatic gene annotation
Protein coding gene models
Pseudogenes (some)
RNA genes
Alignment of species ESTs and cDNAs
Affymetrix probe mapping
Protein domain annotation
Cross reference generation
The Raw Computes
• Repeat Features
  – RepeatMasker
  – Dust
  – TRF
• Ab Initio Genes
  – Genscan (sometimes other programs)
• Blast alignments
  – Blastp against Uniprot
  – Blastn against EMBL vertebrate RNAs and UniGene Clusters
• Other Features
  – CpG islands
  – tRNA genes
  – Transcription start sites using Eponine
Gene Annotation
[Flow diagram. Labels recovered from the slide:]
• Sequence inputs: species-specific proteins, other proteins, species-specific cDNAs, species-specific ESTs
• Programs: Genewise, Exonerate, ClusterMerge, Genebuilder, GeneCombiner
• Gene sets: Genewise genes, Genewise genes with UTRs, aligned cDNAs, cDNA genes, aligned ESTs, Ensembl EST genes, supported ab initio (optional), blessed gene set (optional), preliminary gene set, core Ensembl genes, pseudogenes, final set + pseudogenes
ncRNAs
• Functional RNAs
• Families share conserved secondary structure
• Low sequence identity
• Ribosome
• Spliceosome
• tRNAs
• miRNA
Difficulties in annotating ncRNAs
Ab initio gene predicting programs such as GENSCAN cannot predict non-coding genes.
BLAST performs poorly at detecting non coding genes where structure is conserved but sequence identity is low.
Cannot use repeat masked DNA as some ncRNAs look very much like repeats (ALU related to SRP RNA)
Cannot use ESTs as ncRNAs lack poly-A
RFAM
• Hand made alignments
• Use Infernal to make Covariance Models
• Scan models over subset of EMBL to build family alignments
Problems 2
• Infernal does not scale well:
• “Covariance model searches are extremely compute intensive… The compute time scales roughly to the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources”
• How long would it take to run the human genome?
• Rough estimate > a week on the farm
• Need to limit the amount of sequence we run Infernal on
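To make the quoted scaling concrete: if compute time grows roughly with the 4th power of RNA length, doubling the length costs about 16 times as much and tripling it about 81 times. A small sketch follows, in which the function name is hypothetical and the exponent is the rough figure from the quote, not a measured value.

```python
def relative_cm_cost(length, reference_length, exponent=4):
    """Approximate cost of a covariance-model search for an RNA of `length`,
    relative to one of `reference_length`, assuming time ~ length ** exponent."""
    return (length / reference_length) ** exponent


print(relative_cm_cost(200, 100))  # 16.0
print(relative_cm_cost(300, 100))  # 81.0
```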
Rfam Scan
• Rfam procedure to speed up Infernal on large eukaryotes
  – Uses Blast to narrow search:
    • BLAST is poor at finding ncRNAs with low sequence ID
    • RFAM families contain sequences from all organisms
    • More sequence variation = more chance of Blast making alignment
• In Ensembl:
  – Separated Blast and Infernal steps (using Runnables)
  – Determined filtering for Blast results to limit time without significant reduction in sensitivity
  – Now runs in less than 24hrs.
miRNA
• Highly conserved across species
• Precursor stem loop sequence ~ 70nt
• Mature miRNA ~ 21nt
• Identified using BLAST genomic vs miRBase precursors
• RNAfold used to test for stem loop
• Mature sequence identified (only 2 nt changes tolerated)
• Start with ~ 290,000 blast hits
• End with 222 miRNA
• 96% of SE miRNAs + additional 60
• Novel c.f. miRBase:
• 1 – chicken, 36 – mouse, 5 – rat
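The mature-sequence step above tolerates at most 2 nucleotide changes. One simple way to express such a check is an ungapped scan of the precursor that counts mismatches at each offset, sketched below. This is an assumption about how the tolerance could be applied, not the pipeline's actual code; the function names and the example sequences are illustrative only.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_mature(precursor, mature, max_changes=2):
    """Best ungapped placement of `mature` in `precursor`.

    Returns (offset, mismatch_count) for the placement with the fewest
    mismatches, or None if every placement exceeds max_changes.
    """
    best = None
    for offset in range(len(precursor) - len(mature) + 1):
        n = mismatches(precursor[offset:offset + len(mature)], mature)
        if n <= max_changes and (best is None or n < best[1]):
            best = (offset, n)
    return best


# Illustrative sequences: a 22 nt mature sequence embedded in a short precursor.
mature = "UAGCUUAUCAGACUGAUGUUGA"
precursor = "GGGG" + mature + "CCCC"

print(find_mature(precursor, mature))                    # (4, 0)
print(find_mature(precursor, "UAGCUUAUCAGACUGAUGUUGC"))  # (4, 1)
```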
Examples
Structures
Structures identified by Infernal / RNAfold are stored as transcript attributes

::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,,
1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60
A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA
181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240

<<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>>
61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111
:UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+:
241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298

>>->>>,,,,,,,,,,,,<<<<____>>>>
112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141
:: G:CAA UU +AAG ::C+AA::
299 ACAGACAACUU---AAAGGUAACAAACCUA 325

Displayable on website as markup on transcript sequence
Low Coverage Genomes
• 16 mammalian genomes sequenced to 2x coverage are expected over the next 2-3 years.
• Ensembl is aiming to provide gene sets for these based on alignments to human, building predictions on scaffolds which align to genomic locations of human genes
• Test case
  – Cow preliminary 3x assembly: 449727 scaffolds, 795212 contigs
  – Good test case because 6x assembly was recently made available so we can assess accuracy of method
Method overview
- Details
  - Raw alignments grouped using UCSC chain and net method
  - Use human as source for cow - human has best annotation
  - 'Gene scaffolds' (new coord system) are stored in the database
  - Allow scaffolds to be broken at contig gaps
  - Retain 'gap' exons
[Diagram labels: NNNNNNNN; Cow BLUE; Cow RED]
Dealing with duplication: Iterative Human Net
[Diagram labels: Cow GREEN; Human]
Cow gene with ‘gap’ exons
multicontigview comparison of gene structures of Cow, Mouse (and human)
QC
• Internal QC
  – Comparison against Uniprot or Refseq
  – Comparison to previous build
  – Comparisons to homologs
• External comparisons - the CCDS set
Increase in quality
[Stacked bar chart, 0% to 100%. Categories: Missing, Matching, Edge perfect, Identical. Bars: Human-UniSw-33, Human-UniSw-34, Human-UniSw-35, Human-RefSeq-33, Human-RefSeq-34, Human-RefSeq-35, Mouse-UniSw-30, Mouse-UniSw-32, Mouse-UniSw-33, Mouse-UniSw-34, Mouse-RefSeq-30, Mouse-RefSeq-32, Mouse-RefSeq-33, Mouse-RefSeq-34]
Using homology to find partial predictions
Comparing builds using homology pipeline
Improving the human build
• CCDS
  – Collaboration with Havana, NCBI and UCSC to produce a stable, reliable set of complete (ATG->Stop) CDS structures for human
  – NCBI and Ensembl guarantee to retain the set in builds
  – Generation:
    • Comparison of merged Ensembl/Vega set with the NCBI Refseq set to find the set of complete CDSs both groups predict identically.
    • UCSC (and the other groups) analyse the complete sets and the CDS intersection set for possible errors.
    • Assign stable ids and release
  – This process has been very valuable to both NCBI and us in highlighting problems in our build processes.
  – Two rounds of comparisons have taken place
CCDS display
Affymetrix probe mapping
• Exonerate to map probe sequences
• Assign xrefs to transcripts for matched probe sets
• API currently being modified to be less Affymetrix array format (probeset) specific (by Zebrafish annotation team)
• Dog, chicken, fruitfly, zebrafish, rat, mouse, human (worm next month)
Other developments
• Cross reference data
  – New system for generating this data, leading to:
    • More reliable generation of xrefs
    • New types of xref, e.g. Unigene
• Monthly cDNA alignment set updates for human
  – Genomic alignments of cDNAs using an up to date cDNA set.
  – Displayed on the website in the 'cDNAs' track
What is Ensembl Compara?
A single database which links all the Ensembl species databases together through precalculated comparative genomics data analysis.
A Perl object API to access and create that database
A production system for generating that database
Comparing different species
[Phylogenetic tree of the species below, drawn against a time axis in million years (100, 200, 300, 400, 500, 1000). The divergence times on the tree nodes cannot be matched to branches from this transcript.]
H. sapiens (human) NCBI35 3Gb *
M. musculus (house mouse) NCBIm33 2.6Gb *
R. norvegicus (Norway rat) RGSC3.1 2.6Gb *
T. rubripes (tiger pufferfish) Fugu v2.0 400Mb *
T. nigroviridis (freshwater pufferfish) 400Mb *
D. rerio (zebrafish) WTSI Zv4 1.7Gb *
C. savignyi (sea squirt) 180Mb
D. melanogaster (fruitfly) BDGP3.1 125Mb *
A. gambiae (African malaria mosquito) 230Mb *
C. elegans (nematode) WS116 100Mb *
O. latipes (Japanese medaka) 800Mb
M. mulatta (rhesus macaque) *
P. troglodytes (common chimpanzee) 3Gb *
C. familiaris (domestic dog) BROAD1 2.5Gb *
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle) Btau 1.0 +
O. aries (domestic sheep)
G. gallus (domestic fowl) WASHUC1 1.2Gb *
X. laevis (African clawed frog) JGI3 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb *
C. intestinalis (sea squirt) 180Mb +
A. aegypti (yellow fever mosquito)
A. mellifera (honey bee) Amel1.1 200Mb *
S. cerevisiae (yeast, SGD) S228C 12Mb *
M. domestica (opossum) +
I. scapularis (tick)
Red: whole genome assembly available
Green: whole genome assembly due in the next 2 years
* 17 species currently in Ensembl
+ 3 to be added soon
Compara database
Protein/Protein
• Gene orthology / paralogy predictions
• Protein family clusters
• Raw protein alignments (wublastp)
Dna/Dna
• Synteny regions
• Whole genome alignments (BLAT, BlastZ, chain/net)
• Whole genome multiple alignments (Mercator, MLagan)
Compara database schema
Gene orthology prediction
[Pipeline diagram: wublastp + sw run in both directions (qy/db, db/qy) for each species pair (species1:species2, species1:species3, species2:species3) → Best Reciprocal Hits → extra orthologous pairs found based on gene order conservation → Protein->DNA alignments, dN, dS calculation]
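The Best Reciprocal Hits step can be sketched as follows: a pair is kept when each protein is the other's top-scoring hit. This is a minimal illustration that assumes a single score per hit and ignores ties. The gene names and scores are invented, and the sketch does not reproduce the UBRH/MBRH categories or the synteny step described on the slides.

```python
def best_hits(scores):
    """scores maps (query, target) -> alignment score; returns query -> best target."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _score) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Invented scores for three human-like and two mouse-like genes.
human_vs_mouse = {("hA", "mA"): 900, ("hA", "mB"): 300,
                  ("hB", "mB"): 800, ("hC", "mB"): 500}
mouse_vs_human = {("mA", "hA"): 880, ("mB", "hB"): 790, ("mB", "hC"): 450}

print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))
# [('hA', 'mA'), ('hB', 'mB')]
```

Here hC's best hit is mB, but mB's best hit is hB, so hC is left without a reciprocal partner.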
RHS, Orphans and Others
Using UBRH and MBRH-DUPs as anchors and comparing genomic coordinates in both species, we identify additional orthologues labelled RHS, for Reciprocal Hit supported by Synteny.
[Diagram: Human and Mouse chromosomes. Labels: UBRH; MBRH DUP1.2; RHS; MBRH COMPLEX; MBRH SYN; Orphan; Others; matches to some other chromosome]
MultiContigView
Pairwise whole genome alignments pipeline
[Pipeline diagram: Species1 dna chunking (qy) and Species2 dna chunking (db) → PairAligner superclass (Blastz, BLAT, Exonerate) → Filtering (UCSC chain and net code)]
Dna chunks defined by size, overlap, masking options, chunk grouping size, dump location
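Two of the chunk parameters named above, size and overlap, are enough to show how a sequence is cut into pieces. The sketch below is an assumption about one reasonable scheme (1-based inclusive coordinates, each chunk starting `overlap` bases before the previous one ends); the function is hypothetical, and masking, grouping and dumping are left out.

```python
def chunk_regions(length, chunk_size, overlap):
    """Split a sequence of `length` bases into overlapping chunks.

    Returns a list of 1-based inclusive (start, end) pairs.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 1
    while True:
        end = min(start + chunk_size - 1, length)
        chunks.append((start, end))
        if end == length:
            break
        # Step forward, re-covering the last `overlap` bases of this chunk.
        start = end - overlap + 1
    return chunks


print(chunk_regions(25, 10, 2))  # [(1, 10), (9, 18), (17, 25)]
```

The overlap gives alignments that span a chunk boundary a chance of being found whole in at least one chunk.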
DNA/DNA matches web display
SyntenyView
Multiple whole genome alignments pipeline
[Pipeline diagram: coding exons of Species1, Species2 and Species3 → wublastp all vs all (orthology anchors) → Orthology Map Builder (Mercator) → MultipleAligner superclass (MLAGAN, MAVID, PECAN)]
AlignSlice API
Using whole genome pairwise/multiple alignment data to generate a reference coordinate system common to the aligned species in the genomic region of interest.
• Able to project features (including transcripts) from one species to another through the alignment.
• Gives annotation context information across species.
AlignSliceView
Variation database
• Refactored ensembl-variation database replaced the ensembl SNP (and lite) database
• New API to access DB from Perl and Java
• Variation databases for 7 species:
  – Homo_Sapiens
  – Mus_Musculus
  – Anopheles_Gambiae
  – Rattus_Norvegicus
  – Gallus_Gallus
  – Danio_Rerio
  – Canis_Familiaris
Ensembl variation db schema
Ensembl-variation API
• Similar in design to ensembl core and compara APIs:
  – Adaptor objects onto the database
  – Objects to represent biological entities such as:
    • Variation and VariationFeature
    • TranscriptVariation
    • Allele
    • Genotype
    • Population
    • Individual
    • AlleleGroup
Generating SNP gene consequence
• SNPs occurring within transcripts are identified and their consequence for that transcript determined
• Classified into
  – Coding
    • Synonymous
    • Non synonymous
    • Frameshift
    • Stop gain / loss
  – Non coding UTR exonic
  – Intronic
  – Upstream or downstream
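For a coding SNP, the classification reduces to comparing the amino acid encoded by the reference codon with the one encoded by the variant codon. The sketch below uses the standard genetic code and covers the synonymous, non-synonymous and stop gain / loss cases. Frameshifts need insertion or deletion information and are not handled. The function name and category strings are illustrative and are not the Ensembl variation API.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon, alt_codon):
    """Classify a single-codon substitution within a coding sequence."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "non-synonymous"


print(coding_consequence("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(coding_consequence("GAA", "GTA"))  # non-synonymous (Glu -> Val)
print(coding_consequence("TGG", "TGA"))  # stop gained
print(coding_consequence("TAA", "CAA"))  # stop lost
```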
LD calculation
• Calculate pairwise LD in different populations
• Calculate, in each population, how many individuals have genotype AA, Aa and aa
• For a defined window size (100,000), for each pair of variations
• Including 7 populations (involving HapMap and Perlegen) and 309M rows in the individual_genotype table
• 86M rows in the pairwise_ld table (with r2 > 0.05 and population sample_size >= 40)
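The r2 statistic stored in pairwise_ld can be illustrated with the textbook definition, r2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB. The sketch below assumes phased haplotypes are available, which is a simplification; the slides do not say how the pipeline estimates haplotype frequencies from unphased genotypes. The function names and the data are illustrative.

```python
from collections import Counter


def genotype_counts(genotypes):
    """Count AA / Aa / aa genotypes, treating 'aA' and 'Aa' as the same."""
    return Counter("".join(sorted(g)) for g in genotypes)


def r_squared(haplotypes):
    """r^2 between two biallelic sites from phased two-site haplotypes.

    Each haplotype is a 2-character string: 'A' or 'a' at the first site,
    'B' or 'b' at the second. Returns None if either site is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n
    p_b = sum(h[1] == "B" for h in haplotypes) / n
    p_ab = sum(h == "AB" for h in haplotypes) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return None
    d = p_ab - p_a * p_b
    return d * d / denominator


print(genotype_counts(["AA", "Aa", "aA", "aa"]))  # AA: 1, Aa: 2, aa: 1
print(r_squared(["AB", "AB", "ab", "ab"]))        # 1.0 (complete LD)
print(r_squared(["AB", "Ab", "aB", "ab"]))        # 0.0 (no LD)
```

With the thresholds quoted above, a pair would be kept only if its r2 exceeded 0.05 and the population sample contained at least 40 individuals.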
Variation Data on the website
BioMart and EnsMart
• Large-scale data retrieval tool
• Query builder interface
• Databases: Ensembl, SNP, Vega, (MSD, UniProt)
• Associated features or sequences
• Flexible output formats
• http://www.biomart.org
• http://www.ensembl.org/Multi/martview
Web code
• Encapsulates
  – Input
  – Output
  – Ensembl API
  – Rendering
• Improves
  – Maintainability
  – Flexibility
  – Code re-use
[Diagram labels: Client browser; View script; Input; Data; Output; Renderer; Ensembl API; databases: core, est, snp, var]
Ensembl web site
• Web site is the main access method
• Hardware recently upgraded
  – website now runs on blades (2 CPU Intel boxes) like the compute farm.
    • Scale by adding more
  – Site speed is important to users
• Code and interface updated during the summer
  – Plugins
    • Customisation of the site
  – Side bar
    • Quick access / discovery of pages relating to current page
• DAS can be used to put up user features as a track on the site
Entry points
GeneView
Data retrieval
BioMart
Data sets on ftp site
MySQL queries of databases
Perl API access to databases
Export View
BioMart - Features
Database access via MySQL
mysql -h ensembldb.ensembl.org -u anonymous

mysql> show databases like 'h%';
+------------------------------+
| Database (h%)                |
+------------------------------+
| homo_sapiens_core_14_31      |
| homo_sapiens_core_15_33      |
| homo_sapiens_core_16_33      |
| homo_sapiens_disease_14_31   |

mysql> use homo_sapiens_core_29_35b;
Database changed
mysql> show tables;
+------------------------------------+
| Tables_in_homo_sapiens_core_29_35b |
+------------------------------------+
| analysis                           |
| assembly                           |
| chromosome                         |
| clone                              |
Archive site
Archive site details
• Each new release is archived onto a web blade
• Plan is to keep each archive site up for 2 years
• Stable links (for 2 years): http://nov2004.archive.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000139618
• Will allow for better handling of retired stable ids
Database Schema and Core API
Arne Stabenau
Yuan Chen
Ian Longden
Craig Melsopp
Glenn Proctor
Daniel Ríos
Guy Slater
ENCODE
Damian Keefe
Paul Flicek
Distributed Annotation System
Andreas Kähäri
Stefan Gräf
Project Leader
Ewan Birney (EBI)
Tim Hubbard (Sanger)
Ensembl Web Team
James Smith
Fiona Cunningham
Eugene Kulesha
Anne Parker
VectorBase
Martin Hammond
Karyn Megy
Dan Lawson
Analysis and Annotation Pipeline
Val Curwen
Steve Searle
Mario Caccamo
Laura Clarke
Jan Hinnerck-Vogel
Kevin Howe
Kerstin Jekosch
Felix Kokocinski
Simon White
Bronwen Aken
Julio Banet
EnsMart & BioMart
Arek Kasprzyk
Darin London
Damian Smedley
User Support
Xosé Mª Fernández
Michael Schuster
Bert Overduin
Comparative Genomics
Abel Ureta-Vidal
Javier Herrero Sánchez
Jessica Severin
Vega Web Team
Patrick Meidl
Steve Trevianon
Ensembl Team
October 2005