ebi is an outstation of the european molecular biology laboratory. bert overduin edinburgh, 24...

41
EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Upload: patience-daniels

Post on 28-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

EBI is an Outstation of the European Molecular Biology Laboratory.

Bert Overduin

Edinburgh, 24 February 2009

EnsemblDevelopers WorkshopCore API

Page 2: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Outline

• The Ensembl Core databases and Perl API• Documentation & Help

(1)Data Objects, Object Adaptors, Database Adaptors & The Registry

(2)Coordinate Systems & Slices

(3)Features

(4)Genes, Transcripts, Exons & Translations

(5)External References

(6)Coordinate Mappings

Page 3: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

The Ensembl Core databases

• The Ensembl Core databases store:

genomic sequence

assembly information

gene, transcript and protein models

cDNA and protein alignments

cytogenetic bands, markers, repeats, CpG islands etc.

external references

• homo_sapiens_core_52_36n

species group data version

assembly version

software version

Page 4: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

The Ensembl Core Perl API

• Used to retrieve data from and store data in the Ensembl Core databases

• Written in Object-Oriented Perl

• Partly based on and compatible with BioPerl objects (http://www.bioperl.org)

• Used by the Ensembl analysis and annotation pipeline and the Ensembl web code

• Robust and well-supported

• Forms the basis for the other Ensembl APIs

Page 5: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Documentation & Help

• Installation instructions, web-browsable version of the POD (Perldoc) and tutorial:http://www.ensembl.org/info/docs/api/core/index.html

• Inline Perl POD (Plain Old Documentation)

• ensembl-dev mailing list:

http://www.ensembl.org/info/about/contact/mailing.html

• Ensembl helpdesk:

[email protected]

Page 6: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Data Objects

• Data Objects model biological entities, e.g. Genes, Transcripts, Translations, …

• Each Data Object encapsulates information from one or a few specific MySQL tables

• Data Objects are retrieved from and stored in the database using Objects Adaptors

Page 7: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Object Adaptors

• Object Adaptors are Data Object factories

• Each Object Adaptor is responsible for creating Data Objects of only one particular type

Page 8: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Database Adaptors

• Database Adaptors are Object Adaptor factories

• Database Adaptors are used to connect to a single database

Page 9: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

The Registry

• The Registry is a container for all Database Adaptors

• The Registry handles all database connections

• The Registry is an Object Adaptor factory

• The Registry can be initialised via a configuration file or by automatically discovering databases on a RDBMS instance

Page 10: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

System Architecture

ObjectGene

GeneAdaptorGeneAdaptor MarkerAdaptorMarkerAdaptor

Gene MarkerMarkerMarkerGene

ObjectObject

ObjectAdaptorObjectAdaptor

Core DBAdaptorCore DBAdaptor

Human Core DB

Gene

VariationAdaptorVariationAdaptor GenotypeAdaptorGenotypeAdaptor

Gene MarkerMarkerGenotypeVariation

Variation DBAdaptorVariation DBAdaptor

Human Variation

DB

Mouse Core DB

Mouse Variation

DB

Ensembl RegistryEnsembl Registry

Page 11: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

# Obtain the Ensembl Gene IDs for all human genes

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(

-host => 'ensembldb.ensembl.org',

-user => 'anonymous'

);

my $gene_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Gene’ );

my $genes = $gene_adaptor->fetch_all;

while ( my $gene = shift @{$genes} ){

print $gene->stable_id, “\n”;

}

Page 12: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

OUTPUT:

ENSG00000208234

ENSG00000199674

ENSG00000221622

ENSG00000207604

ENSG00000207431

ENSG00000221312

ENSG00000223135

ENSG00000223136

ENSG00000200159

ENSG00000200131

ENSG00000206672

ENSG00000212552

ENSG00000201452

ENSG00000202016

ENSG00000200455

ENSG00000201916

ENSG00000212228

ENSG00000202261

ENSG00000207742

ENSG00000223137

ENSG00000212550

ENSG00000223138

ENSG00000200827

ENSG00000221638

ENSG00000201937

ENSG00000212205

ENSG00000221428

ENSG00000202470

ENSG00000200236

ENSG00000223139

ENSG00000207932

ENSG00000223140

ENSG00000221791

ENSG00000199102

ENSG00000199960

ENSG00000208013

ENSG00000223141

ENSG00000223142

ENSG00000221363

ENSG00000213177

ENSG00000216774

ENSG00000213194

ENSG00000207492

ENSG00000219252

ENSG00000222962

ENSG00000206963

ENSG00000207934

ENSG00000199814

ENSG00000199796

ENSG00000212450

ENSG00000207603

ENSG00000202474

ENSG00000206831

ENSG00000207331

ENSG00000221740

ENSG00000222963

ENSG00000201923

ENSG00000198928

ENSG00000208225

ENSG00000208227

ENSG00000208243

ENSG00000177117

ENSG00000208245

ENSG00000209219

ENSG00000213184

ENSG00000211401

ENSG00000208228

ENSG00000208229

ENSG00000208233

ENSG00000208235

Page 13: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 1

(a) Load all databases and print their names.

(b) What is the name of the human core database?

There are several solutions possible!

Use Perldoc!

(http://www.ensembl.org/info/docs/api/core/index.html)

Page 14: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Coordinate Systems

• Sequences stored in Ensembl are associated with Coordinate Systems

• Coordinate Systems vary from species to species: human: chromosome, supercontig, clone, contig zebrafish: chromosome, scaffold, contig

• Sequence information is directly stored in the database for the ‘sequence level’ Coordinate System

• The Coordinate System of the highest level in a given region is the ‘top level’ Coordinate System

• Features are stored in a single Coordinate System

Page 15: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Coordinate Systems

Chromosome

Contigs

Clones(Tiling path)

Top level

Sequence level

Page 16: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Slices

• A Slice Data Object represents an arbitrary region of a genome

• Slices are not directly stored in the database

• Slices are used to obtain sequences or features from a specific region in a specific coordinate system

Page 17: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

# Obtain a slice covering the entire human Y chromosome

my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ );

my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘Y’ );

printf( “Slice: %s %s %s-%s (%s)\n”,

$slice->coord_system_name

$slice->seq_region_name

$slice->start

$slice->end

$slice->strand );

OUTPUT:

Slice: chromosome Y 1-57772954 (1)

Page 18: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 2

(a) Obtain the names of the coordinate systems for rat.

(b) Obtain a slice covering the first 10 MB of chromosome 20 of human and print its sequence.

(c) Obtain a slice covering the human gene with Ensembl Gene ID ‘ENSG00000101266’ with 2 kb of flanking sequence and print its sequence.

(d) Print the name, start, end and strand of the obtained slices as well as their coordinate system.

If you want to output your sequences to a file, have a look at BioSeq:IO at http://doc.bioperl.org/releases/bioperl-1.2.3/

Page 19: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Features

• Features are Data Objects with a defined location on the genome

• All Features have a start, end, strand and slice

• The start coordinate of a Feature is always less than its end coordinate, irrespective of the strand on which it is located (exception: insertion features)

Page 20: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Features

Some examples of Features:

• Gene, Transcript and Exon• ProteinFeature• PredictionTranscript and PredictionExon• DNAAlignFeature and ProteinAlignFeature• RepeatFeature• MarkerFeature• OligoFeature• SimpleFeature• MiscFeature

Page 21: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

# Obtain all markers on human chromosome 1

my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ );

my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘1’ );

my $markers = $slice->get_all_MarkerFeatures;

while ( my $marker = shift @{$markers} ){

printf( “%s\t%s\n”,

$marker->slice->name, $marker->feature_Slice->name );

}

OUTPUT:

chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:1237:1488:1

chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:2585:2812:1

chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:4284:5085:1

Page 22: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 3

(a) Obtain all the CpG islands on the first 5 Mb of dog chromosome 20. Print the total number of CpG islands and the position and sequence of each CpG island.

(b) Obtain all the protein alignment features on the first 5 Mb of dog chromosome 20. Print for each alignment the name of the aligned protein, the start and end coordinates of the matching region on the protein and on the genome and the name of the analysis resulting in the alignment.

Hint: CpG islands are stored as SimpleFeatures with logic_name ‘cpg’.

Page 23: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Genes, Transcripts & Exons

• Genes, Transcript and Exons are Feature Data Objects

• A Gene is a grouping of Transcripts which share any (partially) overlapping Exons

• A Transcript is a set of Exons

• Introns are not explicitly defined in the database

Page 24: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Translations

• Translations are not Feature Data Objects

• Translations define the Untranslated Region (UTR) and Coding Sequence (CDS) composition of Transcripts

• Protein sequences are not stored in the database, but computed on the fly using Transcript(!) objects

Page 25: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 4

(a) Obtain the gene with Ensembl Gene ID ‘ENSG00000101266’ and its transcripts. Print the total number of exons in the gene and the number of exons in each individual transcript. Why do the found numbers disagree with each other?

(b) Print for each transcript of the above gene the coding sequence and the protein sequence.

Page 26: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

External References

• External References (Xrefs) are cross references of Ensembl Genes, Transcripts or Translations with identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, MIM etc. etc.

Page 27: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example# Obtain external references for Ensembl gene ENSG00000139618

my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000139618' );

my $gene_xrefs = $gene->get_all_DBEntries;

print "Xrefs on the gene: \n\n";

while ( my $gene_xref = shift @{$gene_xrefs} ){

printf( "%s: %s\n”,

$gene_xref->dbname, $gene_xref->display_id );

}

my $all_xrefs = $gene->get_all_DBLinks;

print "\nXrefs on the gene, transcript and protein: \n\n";

while ( my $all_xref = shift @{$all_xrefs} ){

printf( "%s: %s\n”,

$all_xref->dbname, $all_xref->display_id );

}

Page 28: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

Output:

Xrefs on the gene:

HGNC: BRCA2

DBASS3: BRCA2

UCSC: uc001uub.1

HGNC_curated_gene: BRCA2

Xrefs on the gene, transcript and protein:

shares_CDS_with_OTTT: OTTHUMT00000046000

AFFY_HC_G110: 1503_at

AFFY_HG_U95A: 1503_at

AFFY_HG_U95Av2: 1503_at

AFFY_HC_G110: 1990_g_at

AFFY_HG_U95A: 1990_g_at

AFFY_HG_U95Av2: 1990_g_at

AFFY_HuGeneFL: X95152_rna1_at

AFFY_HG_Focus: 214727_at

AFFY_HG_U133A: 214727_at

AFFY_HG_U133A_2: 214727_at

AFFY_HG_U133_Plus_2: 214727_at

AFFY_HG_U133A: 208368_s_at

AFFY_HG_U133A_2: 208368_s_at

AFFY_HG_U133_Plus_2: 208368_s_at

AFFY_U133_X3P: g4502450_3p_a_at

AFFY_U133_X3P: 208368_3p_s_at

AFFY_U133_X3P: Hs.34012.1.S1_3p_at

AFFY_HC_G110: 1989_at

AFFY_HG_U95A: 1989_at

AFFY_HG_U95Av2: 1989_at

RefSeq_dna: NM_000059

HGNC: BRCA2

UniGene: Hs.34012

AgilentCGH: A_14_P131744

AgilentCGH: A_14_P109686

AgilentProbe: A_23_P99452

Codelink: GE60169

Illumina_V1: GI_4502450-S

Illumina_V2: ILMN_139227

HGNC_curated_transcript: BRCA2-001

CCDS: CCDS9344.1

EntrezGene: BRCA2

MIM_MORBID: 114480

MIM_MORBID: 155720

MIM_MORBID: 227650

MIM_GENE: 600185

MIM_MORBID: 600185

MIM_MORBID: 605724

RefSeq_peptide: NP_000050.2

Uniprot/SPTREMBL: A1YBP1_HUMAN

EMBL: DQ897648

protein_id: ABI74674.1

Uniprot/SPTREMBL: B2ZAH0_HUMAN

EMBL: EU625579

protein_id: ACD01217.1

EMBL: AL445212

Uniprot/SPTREMBL: Q5TBJ7_HUMAN

EMBL: AL137247

protein_id: CAI13195.1

protein_id: CAI40479.1

Uniprot/SPTREMBL: Q8IU64_HUMAN

EMBL: AY151039

protein_id: AAN28944.1

EMBL: AF489725

protein_id: AAN61409.1

EMBL: AF489726

protein_id: AAN61410.1

EMBL: AF489727

protein_id: AAN61411.1

EMBL: AF489728

protein_id: AAN61412.1

EMBL: AF489729

protein_id: AAN61413.1

EMBL: AF489730

protein_id: AAN61414.1

EMBL: AF489731

protein_id: AAN61415.1

EMBL: AF489732

protein_id: AAN61416.1

EMBL: AF489733

protein_id: AAN61417.1

EMBL: AF489734

protein_id: AAN61418.1

EMBL: AF489735

protein_id: AAN61419.1

EMBL: AF489736

protein_id: AAN61420.1

Page 29: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 5

(a) Obtain the Ensembl gene(s) that correspond(s) to UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its Ensembl Gene ID, name and description.

(b) Obtain all external references for the above gene. Print their names and databases.

Page 30: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Coordinate Mappings

• The API provides the means to convert between any related coordinate systems in the database

• The Feature methods transfer, transform and project and the Slice method project are used to map features between coordinate systems

Page 31: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Transfer

• Transfer moves a feature on a slice in a given coordinate system to another slice in the same or another coordinate system

• Transfer needs the feature to be defined in the requested coordinate system, i.e. it cannot overlap an undefined region

Page 32: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Transfer

Chr 1

Chr 1

Chr 1

Chr Y

Page 33: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Transform

• Like transfer, but transform places the feature on a slice that spans the entire sequence that the feature is on in the requested coordinate system

Page 34: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Transform

Page 35: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Project

• Project doesn’t move a feature, but it provides a definition of where a feature or slice lies in another coordinate system

Page 36: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Project

Page 37: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

# Project gene ENSG00000155657 to the clone coordinate system

my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000155657' );

my $projection = $gene->project( 'clone’ );

foreach my $segment ( @{$projection} ) {

my $to_slice = $segment->to_Slice;

printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n",

$gene->stable_id,

$segment->from_start,

$segment->from_end,

$to_slice->coord_system_name,

$to_slice->seq_region_name,

$to_slice->start,

$to_slice->end,

$to_slice->strand );

}

Page 38: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Code Example

Output:

ENSG00000155657 1-65908 projects to clone AC023270.7:1-65908(-1)

ENSG00000155657 65909-241384 projects to clone AC010680.10:1-175476(-1)

ENSG00000155657 241385-281434 projects to clone AC009948.3:132579-172628(-1)

Page 39: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Exercise 6

(a) Obtain a gene located on clone ‘AL049761.11’ and print out its coordinate system and gene coordinates. Then transform the gene to ‘toplevel’ and again print out the coordinate system and gene coordinates.

Page 40: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Other Ensembl Core APIs

• Ruby (by Jan Aerts):

http://bioruby-annex.rubyforge.org/

• Python (by Jenny Qing Qian):

http://code.google.com/p/pygr/wiki/PygrOnEnsembl

Page 41: EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API

Acknowledgements

The Ensembl Core Team

Glenn Proctor

Andreas Kahari Daniel Rios

Ian Longden