Ensembl

Steve Searle

Joint project leader, Ensembl Genebuild team

http://www.ensembl.org/java/core-api

Outline

• Ensembl project overview

• Core database schema and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available


What is Ensembl? Project aims

• funded to provide vertebrate genomes to the world

• aims to provide high-quality automated genome annotation

• aims to be a leading group in genome analysis

• all software, data and results freely available


What is Ensembl? Project background

• Group split between EBI and Sanger

• Mainly Wellcome Trust funded (recently received a new five year grant for 2006-2011)

• Largest dedicated compute in biology in Europe

• Developer community > 300 people, including companies


Ensembl - Technical overview

• Data storage
– MySQL databases (~160Gb in current release)
  • Core databases: annotation for each species
  • Variation databases: variation data for some species
  • Compara: single database containing all comparative genomic data for species in Ensembl
  • Mart: set of denormalised databases for datamining

• Data production
– Pipeline systems running automatic annotation on a compute farm of 800 CPUs

• Interfaces
– Website
– Mart (datamining tool)
– Apollo
– SQL
– APIs (both Perl and Java)


Currently 21 organisms in Ensembl


Open source

• Object model
– standard interface makes it easy for others to build custom applications on top of Ensembl data

• Open discussion of design ([email protected])

• Most major pharmaceutical companies and many academics on mailing list

• Ensembl installs worldwide
– Both public and commercial, e.g.
  • Gramene (CSHL): http://www.gramene.org/
  • Ciona-sg (Temasek): http://www2.bioinformatics.tll.org.sg/Ciona_savignyi/
  • Arabidopsis (NASC): http://atensembl.arabidopsis.info/
  • Fugu (IMCB)


The Ensembl Core Database

• Relational database (MySQL) containing the genomic sequence and annotations on it (genes, alignments, ab initio predictions etc)

• Data stored in it throughout analysis process and the website displays features from it

• Current schema has 68 tables

• Ensembl core API team control changes


Requirements for the schema

• Store data for human genome

• … and all the other genomes we have

• … and all the genomes we might get

• Flexible to add more data

• Easy to adapt to new genome

• Responds fast enough for web site display and pipelined genebuild


System Context

[Diagram: the Ensembl DBs are reached through the Perl API and the Java API (EnsJ) by the website (www), the Pipeline, Apollo, and other scripts and applications; a Mart DB serves MartShell and MartView]


Sequence Tables

assembly
  asm_seq_region_id  int
  cmp_seq_region_id  int
  asm_start          int
  asm_end            int
  cmp_start          int
  cmp_end            int
  ori                int

seq_region
  seq_region_id      int
  name               varchar
  coord_system_id    int
  length             int

seq_region_attrib
  seq_region_id      int
  attrib_type_id     int
  value              varchar

attrib_type
  attrib_type_id     int
  code               varchar
  name               varchar
  description        text

dna
  seq_region_id      int
  sequence           mediumtext

dnac
  seq_region_id      int
  sequence           mediumblob
  n_line             text

coord_system
  coord_system_id    int
  name               varchar
  version            varchar
  attrib             "default_version", "sequence_level"
  rank               int
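Each assembly row can be read as a mapping between a component region (for example a contig) and an assembled region (for example a chromosome). The following is a minimal standalone Python sketch of that idea, not Ensembl code. It assumes that ori is 1 for the same orientation and -1 for the reverse, and that coordinates are 1-based and inclusive; the class, the function name and the sample row are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRow:
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int  # assumed: 1 = same orientation, -1 = reversed


def cmp_to_asm(row: AssemblyRow, cmp_pos: int) -> int:
    """Map a position on the component onto the assembled region."""
    if not row.cmp_start <= cmp_pos <= row.cmp_end:
        raise ValueError("position outside the component segment")
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    # reversed component: count back from the end of the assembled segment
    return row.asm_end - offset


row = AssemblyRow(asm_start=1001, asm_end=1100, cmp_start=1, cmp_end=100, ori=-1)
print(cmp_to_asm(row, 1))    # 1100
print(cmp_to_asm(row, 100))  # 1001
```

In the Perl API this arithmetic is hidden behind the coordinate transformation methods, so client code normally never touches the assembly table directly.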


Feature Tables

• Feature tables describe annotations with positions in sequence.
• Each feature is associated with a seq_region and has a start, end, and orientation on the seq_region.
• There is no central feature table. There are tables specific to each feature type (DNA/DNA alignments, DNA/Protein alignments, Repeats, Simple features).
• Different feature tables have different attributes, but always have a seq_region position.

analysis
  analysis_id        int
  created            datetime
  logic_name         varchar
  db                 varchar
  db_version         varchar
  db_file            varchar
  program            varchar
  program_version    varchar
  program_file       varchar
  parameters         varchar
  module             varchar
  module_version     varchar
  gff_source         varchar
  gff_feature        varchar

simple_feature
  simple_feature_id  int
  seq_region_id      int
  seq_region_start   int
  seq_region_end     int
  seq_region_strand  tinyint
  display_label      varchar
  analysis_id        int
  score              double

any feature
  • usually has an any_feature_id (int)
  • contains a sequence position, with or without strand, on a sequence region
  • usually contains a string to display (varchar)
  • usually links to the analysis responsible for calculating it (int)
  • contains any number of other attributes
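Because every feature table carries the same seq_region position columns, a region query looks the same for each of them. The sketch below illustrates the overlap test against a cut-down simple_feature table. It is only an illustration: it uses Python's built-in SQLite rather than MySQL, the rows are invented, and the helper name features_overlapping is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE simple_feature (
           simple_feature_id INTEGER PRIMARY KEY,
           seq_region_id     INTEGER,
           seq_region_start  INTEGER,
           seq_region_end    INTEGER,
           seq_region_strand INTEGER,
           display_label     TEXT,
           analysis_id       INTEGER,
           score             REAL)"""
)
rows = [
    (1, 7, 100, 200, 1, "cpg_a", 3, 0.9),
    (2, 7, 500, 650, -1, "cpg_b", 3, 0.7),
    (3, 8, 150, 250, 1, "cpg_c", 3, 0.8),
]
conn.executemany("INSERT INTO simple_feature VALUES (?,?,?,?,?,?,?,?)", rows)


def features_overlapping(conn, seq_region_id, start, end):
    """Return labels of features overlapping [start, end] on one seq_region."""
    cur = conn.execute(
        "SELECT display_label FROM simple_feature "
        "WHERE seq_region_id = ? "
        "AND seq_region_start <= ? AND seq_region_end >= ? "
        "ORDER BY seq_region_start",
        (seq_region_id, end, start),
    )
    return [r[0] for r in cur]


print(features_overlapping(conn, 7, 180, 520))  # ['cpg_a', 'cpg_b']
```

Two intervals overlap exactly when each one starts before the other ends, which is why the query compares the feature start with the region end and the feature end with the region start.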


Other features

protein_align_feature
  protein_align_feature_id  int
  (sequence position)
  hit_start                 int
  hit_end                   int
  hit_name                  varchar
  analysis_id               int
  score                     double
  evalue                    double
  perc_ident                float
  cigar_line                text

dna_align_feature
  dna_align_feature_id      int
  (sequence position)
  hit_start                 int
  hit_end                   int
  hit_strand                tinyint
  hit_name                  varchar
  analysis_id               int
  score                     double
  evalue                    double
  perc_ident                float
  cigar_line                text

repeat_feature
  repeat_feature_id         int
  (sequence position)
  repeat_start              int
  repeat_end                int
  repeat_consensus_id       int
  analysis_id               int
  score                     double

repeat_consensus
  repeat_consensus_id       int
  repeat_name               varchar
  repeat_class              varchar
  consensus                 text

prediction_transcript
  prediction_transcript_id  int
  (sequence position)
  analysis_id               int

prediction_exon
  prediction_exon_id        int
  prediction_transcript_id  int
  rank                      int
  (sequence position)
  start_phase               tinyint
  score                     double
  p_value                   double


Tables for Genes

exon
  exon_id          int
  seq_region_id    int
  seq_region_end   int
  phase            tinyint
  end_phase        tinyint

exon_stable_id
  exon_id          int
  stable_id        varchar
  version          int

exon_transcript
  exon_id          int
  transcript_id    int
  rank             int

gene
  gene_id          int
  status           enum
  description
  source           varchar
  biotype          varchar
  analysis_id      int
  seq_region_id    int
  seq_region_end   int
  display_xref_id  int

gene_stable_id
  gene_id          int
  stable_id        varchar
  version          int

transcript
  transcript_id    int
  gene_id          int
  seq_region_id    int
  biotype          varchar
  status           enum
  seq_region_end   int
  display_xref_id  int

transcript_stable_id
  transcript_id    int
  stable_id        varchar
  version          int

transcript_attrib
  transcript_id    int
  attrib_type_id   int
  value            varchar

translation
  translation_id   int
  transcript_id    int
  seq_start        int
  start_exon_id    int
  seq_end          int
  end_exon_id      int

translation_stable_id
  translation_id   int
  stable_id        varchar
  version          int

translation_attrib
  translation_id   int
  attrib_type_id   int
  value            varchar
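The translation table does not store the coding sequence itself; it stores where the coding region starts and ends relative to the transcript's exons. The Python sketch below shows how a coding length could be derived from those four columns. It assumes that seq_start and seq_end are 1-based offsets within the start and end exon and that the exons are supplied in exon_transcript rank order; the function name and the exon lengths are invented for the illustration.

```python
def cds_length(exons, start_exon_id, seq_start, end_exon_id, seq_end):
    """exons: list of (exon_id, length) in rank order along the transcript."""
    ids = [exon_id for exon_id, _ in exons]
    first, last = ids.index(start_exon_id), ids.index(end_exon_id)
    if first == last:
        # translation starts and ends inside the same exon
        return seq_end - seq_start + 1
    total = exons[first][1] - seq_start + 1                  # coding part of the first exon
    total += sum(length for _, length in exons[first + 1:last])  # fully coding exons
    total += seq_end                                         # coding part of the last exon
    return total


exons = [(11, 120), (12, 90), (13, 200)]
print(cds_length(exons, 11, 31, 13, 60))  # 240
```

Storing only the two exon ids and offsets keeps the translation consistent with its exons: if an exon's coordinates change, the coding region moves with it.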


Other tables

• Sets of tables to handle:
– Cross references of Ensembl features to external databases
– Markers
– QTLs
– Regulatory regions and factors
– Stable ID archive
– Affymetrix probe data
– Misc features
– Density features

• Tables containing meta information about the database
• Karyotype bands
• Protein annotation
• Supporting evidence
• Assembly exceptions (haplotypes and PARs)


Ensembl APIs

• Programmatic access to Ensembl databases is via three main APIs:
– ensembl core API: access to genome database
– ensembl compara API: access to compara database
– ensembl variation API: access to variation database

• All three have the same basic structure
– Data objects to represent biological entities, e.g. Gene, Homology, Variation
– DataAdaptor objects to store and retrieve data objects from the database

• Data production APIs
– ensembl-pipeline: genebuild pipeline
– ensembl-analysis: analysis wrapper objects
– ensembl-hive: compara pipeline


The Perl Core API

• The Perl core API provides a layer of abstraction over the Ensembl core databases.

• Written in Object-Oriented Perl.

• Can be used to get information into or out of Ensembl databases.

• Insulates programmers to some extent from changes to the database schema.

• Insulates programmer from coordinate transformations


Data Objects

• Information is obtained from the API in the form of Data Objects.

• Each object represents some data which is stored in the database.

• A Gene object represents a gene, a Transcript object represents a transcript, a Marker Object represents a Marker, etc.


Data Objects – Code Example

# print out the start, end and strand of a transcript
print $transcript->start(), '-', $transcript->end(), '(', $transcript->strand(), ")\n";

# print out the stable identifier for an exon
print $exon->stable_id(), "\n";

# print out the name of a marker and its primer sequences
print $marker->display_marker_synonym()->name, "\n";
print "left primer: ", $marker->left_primer(), "\n";
print "right primer: ", $marker->right_primer(), "\n";

# set the start and end of a simple feature
$simple_feature->start(10);
$simple_feature->end(100);


Object Adaptors

• Object Adaptors are factories for Data Objects.

• Data Objects are retrieved from and stored in databases using Object Adaptors.

• Each Object Adaptor is responsible for creating objects of only one particular type.

• Data Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database.

• All the SQL is in the Object Adaptors
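The adaptor idea is independent of Perl. As a rough illustration, here is a standalone Python analogue in which all SQL lives in one adaptor class and the data object stays free of database code. It is a sketch only: it uses SQLite, it collapses gene and gene_stable_id into a single hypothetical table, and the Gene class and its fields are invented for the example. Only the method names fetch_by_dbID, store and remove are taken from the slides.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    stable_id: str
    biotype: str
    db_id: Optional[int] = None  # internal identifier, set once stored


class GeneAdaptor:
    """Creates, stores and removes Gene objects; the only place with SQL."""

    def __init__(self, conn):
        self.conn = conn

    def store(self, gene):
        cur = self.conn.execute(
            "INSERT INTO gene (stable_id, biotype) VALUES (?, ?)",
            (gene.stable_id, gene.biotype),
        )
        gene.db_id = cur.lastrowid
        return gene.db_id

    def fetch_by_dbID(self, db_id):
        row = self.conn.execute(
            "SELECT gene_id, stable_id, biotype FROM gene WHERE gene_id = ?",
            (db_id,),
        ).fetchone()
        return Gene(row[1], row[2], row[0]) if row else None

    def remove(self, gene):
        self.conn.execute("DELETE FROM gene WHERE gene_id = ?", (gene.db_id,))
        gene.db_id = None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, biotype TEXT)"
)
adaptor = GeneAdaptor(conn)
gene_id = adaptor.store(Gene("ENSG_EXAMPLE", "protein_coding"))
print(adaptor.fetch_by_dbID(gene_id).stable_id)  # ENSG_EXAMPLE
```

Keeping the SQL in the adaptor is what lets the schema change without breaking scripts that only use the data objects.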


Object Adaptors – Code Example

# fetch a gene by its internal identifier
$gene = $gene_adaptor->fetch_by_dbID(1234);

# fetch a gene by its stable identifier
$gene = $gene_adaptor->fetch_by_stable_id('ENSG0000005038');

# store a transcript in the database
$transcript_adaptor->store($transcript);

# remove an exon from the database
$exon_adaptor->remove($exon);

# get all transcripts having a specific interpro domain
@transcripts = @{$transcript_adaptor->fetch_all_by_domain('IPR000980')};


The DBAdaptor and the Registry

[Diagram: DB -> DBAdaptor -> Object Adaptors (GeneAdaptor, MarkerAdaptor, …) -> Data Objects (Gene, Marker, …)]

• The Database Adaptor is a factory for Object Adaptors
• It is used to connect to the database and to obtain Object Adaptors
• Registry enables access to multiple databases using information from a config file (important for compara work)


Slices

• A Slice Data Object represents an arbitrary region of a genome.

• Slices are not directly stored in the database.

• A Slice is used to request sequence or features from a specific region in a specific coordinate system.

[Figure: slices on chr20 and on clone AC022035]


Slices – Code Example

# get the slice adaptor
$slice_adaptor = $db->get_SliceAdaptor();

# fetch a slice on a region of chromosome 12
$slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6);

# print out the sequence from this region
print $slice->seq();

# get all clones in the database and print out their names
@slices = @{$slice_adaptor->fetch_all('clone')};
foreach $slice (@slices) {
    print $slice->seq_region_name(), "\n";
}


Features

• Features are Data Objects with associated genomic locations.

• All Features have start, end, strand and slice attributes.

• Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices).

• Gene

• Transcript

• Exon

• PredictionTranscript

• PredictionExon

• DnaAlignFeature

• ProteinAlignFeature

• SimpleFeature

• MarkerFeature

• QtlFeature

• MiscFeature

• KaryotypeBand

• RepeatFeature

• AssemblyExceptionFeature

• DensityFeature


A Complete Code Example

use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to the public human core database
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_35_35h',
    -user   => 'anonymous');

# print position, strand and score of every simple feature in the region
my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);
foreach my $sf (@{$slice->get_all_SimpleFeatures()}) {
    my $start  = $sf->start();
    my $end    = $sf->end();
    my $strand = $sf->strand();
    my $score  = $sf->score();
    print "$start-$end($strand) $score\n";
}


A Gene Object Code Example

#!/usr/bin/perl -w
use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_35_35h',
    -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);
foreach my $gene (@{$slice->get_all_Genes_by_type('ensembl')}) {
    print "Gene ", $gene->stable_id, " ", $gene->start, " ", $gene->end, "\n";
    foreach my $trans (@{$gene->get_all_Transcripts}) {
        print " Trans ", $trans->stable_id, "\n";
        my $tlnseq = $trans->translate->seq;
        # wrap the peptide sequence at 60 residues per line
        $tlnseq =~ s/(.{1,60})/$1\n/g;
        print " ", $tlnseq;
    }
}


Coordinate Transformations

• The API provides the means to convert between any related coordinate systems in the database.

• Feature methods transfer, transform, project can be used to move features between coordinate systems.

• Slice method project can be used to move features between coordinate systems.


Feature::transfer

• The Feature method transfer moves a feature from one Slice to another.

• The Slice may be in the same coordinate system or a different coordinate system.

[Figure: a feature transferred between slices on Chr20, Chr17, ChrX and clone AC099811]


Feature::transfer – Code Example

# fetch an exon from the database
$exon = $exon_adaptor->fetch_by_stable_id('ENSE00001180238');
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

# transfer the exon to a small slice just covering it
$exon_slice = $slice_adaptor->fetch_by_Feature($exon);
$exon = $exon->transfer($exon_slice);

print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

Sample output:
Exon is on slice: chromosome:NCBI34:12:1:132078379:1
Exon coords: 56452706-56452951
Exon is on slice: chromosome:NCBI34:12:56452706:56452951:1
Exon coords: 1-246
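The sample output follows from a simple shift of origin: coordinates on a slice are counted from the slice start. A minimal Python sketch of that arithmetic, assuming a forward-strand target slice on the same sequence region (the function name is hypothetical and this is not how the API is implemented internally):

```python
def transfer_to_subslice(start, end, slice_start, slice_end):
    """Re-express seq_region coordinates relative to a forward-strand slice."""
    if start < slice_start or end > slice_end:
        raise ValueError("feature does not lie within the target slice")
    return start - slice_start + 1, end - slice_start + 1


# the exon from the sample output, moved onto a slice that just covers it
print(transfer_to_subslice(56452706, 56452951, 56452706, 56452951))  # (1, 246)
```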


Stability of API

• Ensembl API changes to meet our needs

• Request for greater stability from users

• Some methods are now labelled as stable and we guarantee that they will not change for at least 2 years.


Runnables and RunnableDBs

• Runnables are Perl objects which wrap analysis programs. Methods:
– run
– parse_results
  • Generates Ensembl data objects
– output
  • Returns generated data objects

e.g. the Blast runnable wraps blast

• RunnableDBs are Perl objects which wrap Runnables, allowing them to retrieve input data from and store output data into Ensembl databases
– fetch_input
– write_output


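Runnable example

use strict;
use warnings;
use Bio::SeqIO;
use Bio::EnsEMBL::Slice;
use Bio::EnsEMBL::CoordSystem;
use Bio::EnsEMBL::Analysis;
use Bio::EnsEMBL::Analysis::Runnable::Genscan;
use Bio::EnsEMBL::Analysis::Runnable::Blast;
use Bio::EnsEMBL::Analysis::Tools::BPliteWrapper;

# read the first sequence from a FASTA file
my $seq = Bio::SeqIO->new(-file => "<test.fa", -format => 'Fasta')->next_seq;

# wrap it in a Slice so that Runnables can use it as a query
my $slice = Bio::EnsEMBL::Slice->new(
    -seq             => $seq->seq,
    -coord_system    => Bio::EnsEMBL::CoordSystem->new(-name => 'contig', -rank => 1),
    -seq_region_name => $seq->display_id,
    -start           => 1,
    -end             => $seq->length);

# run Genscan on the slice
my $genscan_runnable = Bio::EnsEMBL::Analysis::Runnable::Genscan->new(
    -query    => $slice,
    -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'genscan'));

$genscan_runnable->run;

# blast the translation of each Genscan prediction against vertebrate RNAs
my @output;
foreach my $prediction (@{$genscan_runnable->output}) {
    my $blast_run = Bio::EnsEMBL::Analysis::Runnable::Blast->new(
        -query    => $prediction->translate,
        -parser   => Bio::EnsEMBL::Analysis::Tools::BPliteWrapper->new(),
        -database => 'embl_vertrna',
        -program  => 'wutblastn',
        -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'vertrna'));

    $blast_run->run;
    push(@output, @{$blast_run->output});
}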


RunnableDB example

#!/usr/local/ensembl/bin/perl -w

use strict;
use Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Pipeline::Analysis;

my $db = Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor->new(
    -host   => 'localhost',
    -user   => 'root',
    -dbname => 'test_db');

my $anal = $db->get_AnalysisAdaptor->fetch_by_logic_name('Uniprot');

# build the RunnableDB class name from the analysis and load the module
my $rdbstr = "Bio::EnsEMBL::Analysis::RunnableDB::" . $anal->module;
eval "require $rdbstr" or die $@;

my $runobj = $rdbstr->new(
    -db       => $db,
    -input_id => 'contig::AL1347153.1.3517:1:3571:1',
    -analysis => $anal);

$runobj->fetch_input;

$runobj->run;

$runobj->write_output;


Writing Runnables and RunnableDBs

• A lot of functionality is implemented in the base classes

• At its simplest just requires implementing:
– parse_results in the Runnable
– get_adaptor in the RunnableDB
– fetch_input in the RunnableDB

• Other methods which may need overriding:
– write_output in the RunnableDB
– run_analysis in the Runnable
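The division of labour between the two classes can be sketched outside Perl. The Python below is a toy analogue, not the ensembl-analysis code: the base class supplies run and output, a subclass supplies run_analysis and parse_results, and a RunnableDB handles fetch_input and write_output. The "analysis" here is a made-up GC-content calculation and the "database" is a plain dictionary; both are assumptions for illustration.

```python
class Runnable:
    """Base class: wraps an analysis and collects the parsed results."""

    def __init__(self, query):
        self.query = query
        self._output = []

    def run_analysis(self):
        raise NotImplementedError

    def parse_results(self, raw):
        raise NotImplementedError

    def run(self):
        self._output = self.parse_results(self.run_analysis())

    def output(self):
        return self._output


class GcRunnable(Runnable):
    """Toy analysis: report the GC fraction of the query sequence."""

    def run_analysis(self):
        gc = sum(self.query.count(base) for base in "GC")
        return f"{gc}\t{len(self.query)}"  # stands in for a program's text output

    def parse_results(self, raw):
        gc, length = (int(field) for field in raw.split("\t"))
        return [{"start": 1, "end": length, "score": gc / length}]


class RunnableDB:
    """Wraps a Runnable with input fetching and output storage."""

    def __init__(self, db, input_id, runnable_class):
        self.db = db
        self.input_id = input_id
        self.runnable_class = runnable_class
        self.runnable = None

    def fetch_input(self):
        self.runnable = self.runnable_class(self.db["sequences"][self.input_id])

    def run(self):
        self.runnable.run()

    def write_output(self):
        stored = self.db["features"].setdefault(self.input_id, [])
        stored.extend(self.runnable.output())


db = {"sequences": {"contig_1": "ACGTGGCC"}, "features": {}}
job = RunnableDB(db, "contig_1", GcRunnable)
job.fetch_input()
job.run()
job.write_output()
print(db["features"]["contig_1"])  # [{'start': 1, 'end': 8, 'score': 0.75}]
```

Because the Runnable knows nothing about the database, the same wrapper can be run on a sequence read from a file, as in the Runnable example, or driven by a RunnableDB inside the pipeline.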


Pipeline Summary


Example pipeline


Current hardware

• 8x ES40 Alpha (667 MHz) with 2Tb fibre channel storage

• 10x ES45 Alpha (1 GHz) with 5Tb fibre channel storage

• 3x Itanium 4 CPU with 1.6Tb storage

• 400 HS20 IBM Blades (2x 2.8 or 3.2 GHz PIV + 4 GB memory + 2TB clustered SAN filesystem or 600GB clustered IDE filesystem, both IBM GPFS)

• Tru64 UNIX/Linux

• 21 MySQL (v 4.1) instances

• Most binaries and all sequence databases stored locally (avoids using NFS)


Genome annotation overview

Raw compute - Alignments against protein and DNA dbs, and other basic analyses

Automatic gene annotation

Protein coding gene models

Pseudogenes (some)

RNA genes

Alignment of species ESTs and cDNAs

Affymetrix probe mapping

Protein domain annotation

Cross reference generation


The Raw Computes

• Repeat Features
– RepeatMasker
– Dust
– TRF

• Ab Initio Genes
– Genscan (sometimes other programs)

• Blast alignments
– Blastp against Uniprot
– Blastn against EMBL vertebrate RNAs and UniGene Clusters

• Other Features
– CpG islands
– tRNA genes
– Transcription start sites using Eponine


Gene Annotation

[Flow diagram. Boxes shown:
– Inputs: species-specific proteins, other proteins, species-specific cDNAs, species-specific ESTs
– Programs: Genewise, Exonerate (twice), ClusterMerge (twice), Genebuilder, GeneCombiner, Pseudogenes
– Intermediate and final sets: Genewise genes, aligned cDNAs, aligned ESTs, cDNA genes, Ensembl EST genes, Genewise genes with UTRs, supported ab initio (optional), blessed gene set (optional), preliminary gene set, core Ensembl genes, final set + pseudogenes]


ncRNAs

• Functional RNAs

• Families share conserved secondary structure

• Low sequence identity

• Ribosome

• Spliceosome

• tRNAs

• miRNA


Difficulties in annotating ncRNAs

Ab initio gene predicting programs such as GENSCAN cannot predict non-coding genes.

BLAST performs poorly at detecting non coding genes where structure is conserved but sequence identity is low.

Cannot use repeat masked DNA as some ncRNAs look very much like repeats (ALU related to SRP RNA)

Cannot use ESTs as ncRNAs lack poly-A


RFAM

• Hand made alignments

• Use Infernal to make Covariance Models

• Scan models over subset of EMBL to build family alignments


Problems 2

• Infernal does not scale well:

• “Covariance model searches are extremely compute intensive… The compute time scales roughly to the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources”

• How long would it take to run the human genome?

• Rough estimate > a week on the farm

• Need to limit the amount of sequence we run Infernal on
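To get a feel for what fourth-power scaling means, the relative cost of two model lengths can be computed directly. This is only worked arithmetic based on the quoted statement; the exponent is taken from the quote and the lengths are arbitrary examples.

```python
def relative_cm_cost(length_a, length_b, exponent=4):
    """How many times more compute a model of length_b needs than length_a,
    assuming cost grows as length ** exponent."""
    return (length_b / length_a) ** exponent


# doubling the RNA length multiplies the compute time by 2 ** 4
print(relative_cm_cost(70, 140))  # 16.0
```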


Rfam Scan

• Rfam procedure to speed up Infernal on large eukaryotes
– Uses Blast to narrow search:
  • BLAST is poor at finding ncRNAs with low sequence ID
  • RFAM families contain sequences from all organisms
  • More sequence variation = more chance of Blast making alignment

• In Ensembl:
– Separated Blast and Infernal steps (using Runnables)
– Determined filtering for Blast results to limit time without significant reduction in sensitivity
– Now runs in less than 24hrs.
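The general shape of such a pre-filter is to turn the cheap search's hits into a small set of regions and run the expensive search only there. The Python sketch below shows one way to do that by padding and merging hit intervals. The padding value, the function name and the coordinates are assumptions for illustration, not the filtering Ensembl actually determined.

```python
def regions_to_scan(hits, padding, seq_length):
    """Pad each (start, end) hit and merge overlapping or adjacent regions."""
    padded = sorted(
        (max(1, start - padding), min(seq_length, end + padding))
        for start, end in hits
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1] + 1:
            # extends the previous region
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


hits = [(1000, 1080), (1050, 1120), (5000, 5070)]
print(regions_to_scan(hits, padding=100, seq_length=10000))
# [(900, 1220), (4900, 5170)]
```

In this toy case the expensive search would cover 592 bases instead of 10,000.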


miRNA

• Highly conserved across species

• Precursor stem loop sequence ~ 70nt

• Mature miRNA ~ 21nt

• Identified using BLAST genomic vs miRBase precursors

• RNAfold used to test for stem loop

• Mature sequence identified (only 2 nt changes tolerated)

• Start with ~ 290,000 blast hits

• End with 222 miRNA

• 96% of SE miRNAs + additional 60

• Novel c.f. miRBase:

• 1 – chicken, 36 – mouse, 5 – rat
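The "only 2 nt changes tolerated" step amounts to a mismatch count between the known mature sequence and each window of the candidate precursor. A small Python sketch of that check follows; the sequences are hypothetical toy data, the function names are invented, and gaps are not considered.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_mature(precursor, mature, max_mismatches=2):
    """Return (1-based start, mismatches) of the best window, or None."""
    best = None
    for i in range(len(precursor) - len(mature) + 1):
        d = mismatches(precursor[i:i + len(mature)], mature)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (i + 1, d)
    return best


mature = "UGAGGUAGUAGGUUGUAUAGUU"              # toy 22 nt mature sequence
variant = mature[:18] + "G" + mature[19:]     # same sequence with one change
precursor = "GGCA" + variant + "ACCU"         # toy candidate hit
print(find_mature(precursor, mature))         # (5, 1)
```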


Examples


Structures

Structures identified by Infernal / RNAfold are stored as transcript attributes

        ::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,,
      1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60
        A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA
    181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240

        <<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>>
     61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111
        :UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+:
    241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298

        >>->>>,,,,,,,,,,,,<<<<____>>>>
    112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141
        :: G:CAA UU +AAG ::C+AA::
    299 ACAGACAACUU---AAAGGUAACAAACCUA 325

Displayable on website as markup on transcript sequence


Low Coverage Genomes

• 16 mammalian genomes sequenced to 2x coverage are expected over the next 2-3 years.

• Ensembl is aiming to provide gene sets for these based on alignments to human, building predictions on scaffolds which align to genomic locations of human genes

• Test case
– Cow preliminary 3x assembly: 449727 scaffolds, 795212 contigs
– Good test case because the 6x assembly was recently made available, so we can assess the accuracy of the method


Method overview

- Details
  - Raw alignments grouped using UCSC chain and net method
  - Use human as source for cow: human has the best annotation
  - ‘Gene scaffolds’ (new coord system) are stored in the database
  - Allow scaffolds to be broken at contig gaps
  - Retain ‘gap’ exons (NNNNNNNN)


Dealing with duplication: Iterative Human Net

[Figure: human sequence with cow alignments shown in blue, red and green]
Cow gene with ‘gap’ exons


multicontigview comparison of gene structures of Cow, Mouse (and human)


QC

• Internal QC
– Comparison against Uniprot or Refseq
– Comparison to previous build
– Comparisons to homologs

• External comparisons - the CCDS set


Increase in quality

[Chart: percentage (0% to 100%) of entries that are Missing, Matching, Edge perfect or Identical, for the series Human-UniSw-33, -34, -35; Human-RefSeq-33, -34, -35; Mouse-UniSw-30, -32, -33, -34; Mouse-RefSeq-30, -32, -33, -34]


Using homology to find partial predictions


Comparing builds using homology pipeline


Improving the human build

• CCDS
– Collaboration with Havana, NCBI and UCSC to produce a stable, reliable set of complete (ATG->Stop) CDS structures for human
– NCBI and Ensembl guarantee to retain the set in builds
– Generation:
  • Comparison of merged Ensembl/Vega set with the NCBI Refseq set to find the set of complete CDSs both groups predict identically.
  • UCSC (and the other groups) analyse the complete sets and the CDS intersection set for possible errors.
  • Assign stable ids and release
– This process has been very valuable to both NCBI and us in highlighting problems in our build processes.
– Two rounds of comparisons have taken place


CCDS display


Affymetrix probe mapping

• Exonerate to map probe sequences

• Assign xrefs to transcripts for matched probe sets

• API currently being modified to be less Affymetrix array format (probeset) specific (by Zebrafish annotation team)

• Dog, chicken, fruitfly, zebrafish, rat, mouse, human (worm next month)


Other developments

• Cross reference data
– New system for generating this data, leading to:
  • More reliable generation of xrefs
  • New types of xref, e.g. Unigene

• Monthly cDNA alignment set updates for human
– Genomic alignments of cDNAs using an up to date cDNA set.
– Displayed on the website in the ‘cDNAs’ track


What is Ensembl Compara?

A single database which links all the Ensembl species databases together through precalculated comparative genomics data analysis.

A Perl object API to access and create that database

A production system for generating that database


Comparing different species

[Figure: phylogenetic tree of the species listed below, with divergence times on a scale marked at 100, 200, 300, 400, 500 and 1000 million years. Red: whole genome assembly available. Green: whole genome assembly due in the next 2 years.]

* 17 species currently in Ensembl
+ 3 to be added soon

H. sapiens (human) NCBI35 3Gb *
P. troglodytes (common chimpanzee) 3Gb *
M. mulatta (rhesus macaque) *
M. musculus (house mouse) NCBIm33 2.6Gb *
R. norvegicus (Norway rat) RGSC3.1 2.6Gb *
C. familiaris (domestic dog) BROAD1 2.5Gb *
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle) Btau 1.0 +
O. aries (domestic sheep)
M. domestica (opossum) +
G. gallus (domestic fowl) WASHUC1 1.2Gb *
X. laevis (African clawed frog) JGI3 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb *
D. rerio (zebrafish) WTSI Zv4 1.7Gb *
T. rubripes (tiger pufferfish) Fugu v2.0 400Mb *
T. nigroviridis (freshwater pufferfish) 400Mb *
O. latipes (Japanese medaka) 800Mb
C. intestinalis (sea squirt) 180Mb +
C. savignyi (sea squirt) 180Mb
D. melanogaster (fruitfly) BDGP3.1 125Mb *
A. gambiae (African malaria mosquito) 230Mb *
A. aegypti (yellow fever mosquito)
A. mellifera (honey bee) Amel1.1 200Mb *
I. scapularis (tick)
C. elegans (nematode) WS116 100Mb *
S. cerevisiae (yeast, SGD) S228C 12Mb *


Compara database

Protein/Protein
• Gene orthology / paralogy predictions
• Protein Family clusters
• Raw protein alignments (wublastp)

Dna/Dna
• Synteny regions
• Whole genome alignments (BLAT, BlastZ, chain/net)
• Whole genome multiple alignments (Mercator, MLagan)


Compara database schema


Gene orthology prediction

[Pipeline diagram, run for each species pair (species1:species2, species1:species3, species2:species3)]

• wublastp + SW, each species used in turn as query (qy) and database (db)
• Best Reciprocal Hits
• Extra orthologous pairs found based on gene order conservation
• Protein->DNA alignments, dN, dS calculation
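The Best Reciprocal Hits step can be illustrated with a few lines of Python. This is a generic sketch of the technique under simple assumptions (one score per protein pair, highest score wins, no tie handling); the identifiers and scores are made up and it is not the compara production code.

```python
def best_hits(scores):
    """scores maps (query, target) -> score; return the best target per query."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


a_vs_b = {("h1", "m1"): 950, ("h1", "m2"): 400, ("h2", "m2"): 700, ("h3", "m2"): 720}
b_vs_a = {("m1", "h1"): 940, ("m2", "h3"): 715, ("m2", "h2"): 690}
print(best_reciprocal_hits(a_vs_b, b_vs_a))  # [('h1', 'm1'), ('h3', 'm2')]
```

Here h2 is left without a partner because its best hit m2 prefers h3; genes like this are the ones that later steps, such as the gene order conservation check, may still pair up.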


RHS, Orphans and Others

Using UBRH and MBRH-DUPs as anchors and comparing genomic coordinates in both species, we identify additional orthologues labelled RHS, for Reciprocal Hit supported by Synteny.

[Figure: human and mouse gene pairs labelled UBRH, MBRH, DUP1.2, RHS, MBRH COMPLEX and MBRH SYN, plus the labels "Orphan", "Others" and "Matches to some other chromosome"]


MultiContigView


Pairwise whole genome alignments pipeline

[Pipeline diagram]

• Species1 DNA chunking and Species2 DNA chunking (query and database)
• DNA chunks defined by size, overlap, masking options, chunk grouping size, dump location
• PairAligner superclass: Blastz, BLAT, Exonerate
• Filtering (UCSC chain and net code)


DNA/DNA matches web display


SyntenyView


Multiple whole genome alignments pipeline

[Pipeline diagram]

• Species1 coding exons
• wublastp all vs all (orthology anchors)
• Orthology Map Builder: Mercator
• MultipleAligner superclass: MLAGAN, MAVID, PECAN


AlignSlice API

Using whole genome pairwise/multiple alignment data to generate a reference coordinate system common to the aligned species in the genomic region of interest.

• Able to project features (including transcripts) from one species to another through the alignment.

• Gives annotation context information across species.


AlignSliceView


Variation database

• Refactored ensembl-variation database replaced ensembl SNP (and lite) database

• New API to access DB from Perl and Java

• Variation databases for 7 species:
– Homo_Sapiens
– Mus_Musculus
– Anopheles_Gambiae
– Rattus_Norvegicus
– Gallus_Gallus
– Danio_Rerio
– Canis_Familiaris


Ensembl variation db schema


Ensembl-variation API

• Similar in design to ensembl core and compara APIs:
– Adaptor objects onto the database
– Objects to represent biological entities such as:
  • Variation and VariationFeature
  • TranscriptVariation
  • Allele
  • Genotype
  • Population
  • Individual
  • AlleleGroup


Generating SNP gene consequence

• SNPs occurring within transcripts are identified and their consequence for that transcript determined

• Classified into
– Coding
  • Synonymous
  • Non synonymous
  • Frameshift
  • Stop gain / loss
– Non coding UTR exonic
– Intronic
– Upstream or downstream
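For a single-base substitution inside a coding sequence, the coding classes above come down to comparing the reference and variant codons. The Python sketch below illustrates that comparison with the standard genetic code. It is an assumption-laden toy, not the ensembl-variation implementation: positions are 1-based within the CDS, only substitutions are handled (so frameshifts are out of scope), and the function name and example sequence are invented.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def coding_consequence(cds, pos, alt):
    """Classify a substitution at 1-based CDS position pos to base alt."""
    codon_start = (pos - 1) // 3 * 3
    offset = (pos - 1) % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "non synonymous"


cds = "ATGGCTTGGTAA"  # Met Ala Trp stop
print(coding_consequence(cds, 6, "C"))  # GCT -> GCC: synonymous
print(coding_consequence(cds, 9, "A"))  # TGG -> TGA: stop gain
print(coding_consequence(cds, 4, "A"))  # GCT -> ACT: non synonymous
```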


LD calculation

• Calculate pairwise LD in different populations
• Calculate, in each population, how many individuals have genotype AA, Aa and aa
• For a defined window size (100,000), for each pair of variations
• Includes 7 populations (involving HapMap and Perlegen) and 309M rows in the individual_genotype table
• 86M rows in the pairwise_ld table (with r2 > 0.05 and population sample_size >= 40)
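The r2 statistic used as the storage threshold can be written down compactly when haplotype counts are known. The Python sketch below uses the usual definition, with D = p(AB) - p(A)p(B) and r2 = D^2 / (p(A)(1 - p(A))p(B)(1 - p(B))). It assumes phased haplotype counts for two biallelic sites; deriving haplotype frequencies from the unphased AA, Aa and aa genotype counts mentioned above needs an extra estimation step that is not shown, and the function name is hypothetical.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic sites from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    d = p_AB - p_A * p_B
    return d * d / denom


print(r_squared(50, 0, 0, 50))    # 1.0 (alleles always travel together)
print(r_squared(25, 25, 25, 25))  # 0.0 (alleles are independent)
```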


Variation Data on the website


BioMart and EnsMart

• Large-scale data retrieval tool

• Query builder interface

• Databases: Ensembl, SNP, Vega, (MSD, UniProt)

• Associated features or sequences

• Flexible output formats

• http://www.biomart.org/

• http://www.ensembl.org/Multi/martview

• http://www.ensembl.org/EnsMart/





Pre!


Web code

• Encapsulates
– Input
– Output
– Ensembl API
– Rendering

• Improves
– Maintainability
– Flexibility
– Code re-use

[Diagram: Client browser, View script, Input, Data, Output, Renderer, Ensembl API, and the databases (core, var, est, snp)]


Ensembl web site

• Web site is the main access method

• Hardware recently upgraded
– Website now runs on blades (2 CPU Intel boxes) like the compute farm.
  • Scale by adding more
– Site speed is important to users

• Code and interface updated during the summer
– Plugins
  • Customisation of the site
– Side bar
  • Quick access / discovery of pages relating to current page

• DAS can be used to put up user features as a track on the site


Entry points


ContigView

Overview

Detailed View

Basepair View


GeneView


Data retrieval

BioMart

Data sets on ftp site

MySQL queries of databases

Perl API access to databases

Export View


BioMart


BioMart - Features


Database access via MySQL

mysql -h ensembldb.ensembl.org -u anonymous

mysql> show databases like 'h%';
+------------------------------+
| Database (h%)                |
+------------------------------+
| homo_sapiens_core_14_31      |
| homo_sapiens_core_15_33      |
| homo_sapiens_core_16_33      |
| homo_sapiens_disease_14_31   |

mysql> use homo_sapiens_core_29_35b;
Database changed
mysql> show tables;
+-----------------------------------+
| Tables_in_homo_sapiens_core_29_35b|
+-----------------------------------+
| analysis                          |
| assembly                          |
| chromosome                        |
| clone                             |


Archive site


Archive site details

• Each new release is archived onto a web blade

• Plan is to keep each archive site up for 2 years

• Stable links (for 2 years): http://nov2004.archive.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000139618

• Will allow for better handling of retired stable ids


Database Schema and Core API

Arne Stabenau

Yuan Chen

Ian Longden

Craig Melsopp

Glenn Proctor

Daniel Ríos

Guy Slater

ENCODE

Damian Keefe

Paul Flicek

Distributed Annotation System

Andreas Kähäri

Stefan Gräf

Project Leader

Ewan Birney (EBI)

Tim Hubbard (Sanger)

Ensembl Web Team

James Smith

Fiona Cunningham

Eugene Kulesha

Anne Parker

VectorBase

Martin Hammond

Karyn Megy

Dan Lawson

Analysis and Annotation Pipeline

Val Curwen

Steve Searle

Mario Caccamo

Laura Clarke

Jan Hinnerck-Vogel

Kevin Howe

Kerstin Jekosch

Felix Kokocinski

Simon White

Bronwen Aken

Julio Banet

EnsMart & BioMart

Arek Kasprzyk

Darin London

Damian Smedley

User Support

Xosé Mª Fernández

Michael Schuster

Bert Overduin

Comparative Genomics

Abel Ureta-Vidal

Javier Herrero Sánchez

Jessica Severin

Vega Web Team

Patrick Meidl

Steve Trevianon

Ensembl Team

October 2005
