Browsing Genes and Genomes with Ensembl
Bert Overduin, Ensembl User Support, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Post on 27-Dec-2015

Browsing Genes and Genomes with Ensembl

Bert Overduin

Ensembl User Support

EMBL Outstation

European Bioinformatics Institute

Wellcome Trust Genome Campus

Hinxton, Cambridge, UK

Course Schedule

Introduction
Website walk-through

Coffee

Exercises
BioMart

Lunch

Exercises
GeneBuild

Tea

Variations / Compara
Exercises

Ensembl Workshops

EMBL-EBI

Hinxton, Cambridge

Wellcome Trust Genome Campus

Hinxton, Cambridge

© John Freebrey (www.thedigitaldarkcloth.com)

© Sean T. McHugh (www.cambridgeincolour.com)

Cambridge

A Bit of History

• 1995 Haemophilus influenzae 1.8 Mb
• 1996 Yeast 12 Mb
• 1998 C. elegans 100 Mb
• 1999 Fruit fly 125 Mb
• 2000 Arabidopsis 115 Mb
• 2001 Human (draft)
• 2002 Mouse 2.6 Gb
• 2004 Human (“finished”) 3 Gb

Sequenced genomes

A Bit of History

http://www.genomesonline.org/

Annotation

Wikipedia: Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:

1. identifying elements on the genome, a process called Gene Finding, and
2. attaching biological information to these elements.

Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

Ensembl - Goals

• Provide automatic annotation of genomic sequence

• Integrate other biological data

• Make data available to all via the web

Ensembl - Organisation

• Joint project between European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute

• Started in 1999 for the Human Genome Project

• Funded primarily by the Wellcome Trust, additional funding by EMBL, EU, NIH-NIAID, BBSRC and MRC

• Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)

• Uses the largest dedicated computer system in biology in Europe

Genome Browsers

• Ensembl Genome browser
  http://www.ensembl.org

• NCBI Map Viewer
  http://www.ncbi.nlm.nih.gov/mapview/

• UCSC Genome Browser
  http://genome.ucsc.edu

NCBI Map Viewer

UCSC Genome Browser

Ensembl Genome Browser

What Distinguishes Ensembl from the UCSC and NCBI Browsers?

• Automatic annotation for those species for which no manually curated gene set exists

• Direct database access and programmatic access via the Perl API

• Not only the data, but also the software source code is open source

Caveats

• While genome browsers can be very useful tools they do not provide the definitive answer to every question!

• Data is fluid

Which Species Are Available?

• 36 chordates, ranging from mammals to ‘primitive’ chordates (Ciona intestinalis and Ciona savignyi)

• 3 key eukaryote model organisms:
  fruitfly (Drosophila melanogaster)
  nematode (Caenorhabditis elegans)
  yeast (Saccharomyces cerevisiae)

• 2 insect pathogen vectors:
  malaria mosquito (Anopheles gambiae)
  yellow fever / dengue mosquito (Aedes aegypti)

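Species in Ensembl

[Figure: phylogenetic tree of the species groups in Ensembl, drawn against a geological time axis in MY BP (million years before present) that runs from the Cambrian through the Ordovician, Silurian, Devonian, Carboniferous, Permian, Triassic, Jurassic and Cretaceous to the Tertiary. Branch labels: fishes, birds, reptiles, mammals, placentals, monotremes, marsupials, other birds, paleognaths, passerines, crocodiles, turtles, lizards, amphibians, teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans, non-vertebrates.]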

More Species to Come …

Oikopleura, Gorilla, Zebra finch, Orangutan, Marmoset, Amphioxus, Acorn worm, Hyrax

Megabat, Dolphin, Tarsier, Kangaroo rat, Chinese pangolin, Two-toed sloth, Llama, Flying lemur

Which Data Are Available?

• Genomic sequence
• Gene/transcript/peptide models
• External references
• Mapped cDNAs, peptides, microarray probes, BAC clones etc.
• Other features of the genome: cytogenetic bands, markers, repeats etc.
• Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
• Variation data: SNPs
• Regulatory data: “best guess” set of regulatory elements
• Data from external sources (DAS)

Gene/Transcript/Peptide Models

• Manual annotation
  For parts of genomes: human, dog, mouse, zebrafish (“Vega genes”)
  For complete genomes: fruitfly (FlyBase), C. elegans (WormBase), yeast (SGD)

• Automatic predictions (“Ensembl genes”)

• EST predictions

• Ab initio predictions (GENSCAN, SNAP)

Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

• UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy

• NCBI RefSeq: a partially manually curated database

• UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

• EMBL / GenBank / DDBJ: primary nucleotide sequence repository

The Ensembl Genebuild

[Diagram: Genome assembly + Computer programs + Experimental evidence, combined to give Ensembl Genes]

Ensembl Identifiers

• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• ENSF### Ensembl Family ID
• ENSR### Ensembl Regulatory Feature ID

• For species other than human a suffix is added to the ENS prefix:
  MUS for mouse (Mus musculus): ENSMUSG###
  DAR for zebrafish (Danio rerio): ENSDARG###
  etc.

• For imported genes Ensembl uses the original identifiers
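
The naming scheme on this slide can be sketched as a small parser. This is an illustration only and not part of Ensembl's own software: the function name `parse_ensembl_id` and both lookup tables are assumptions made for the example, only the two species codes named on the slide are listed, and the sample identifiers serve purely to show the pattern.

```python
import re

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "peptide",
    "E": "exon",
    "F": "family",
    "R": "regulatory feature",
}

# Only the species codes named on the slide; an empty code means human.
SPECIES_CODES = {
    "": "human",
    "MUS": "mouse (Mus musculus)",
    "DAR": "zebrafish (Danio rerio)",
}

# "ENS", an optional species code, one feature letter, then digits.
ID_PATTERN = re.compile(r"^ENS([A-Z]*)([GTPEFR])(\d+)$")


def parse_ensembl_id(identifier):
    """Split an Ensembl-style ID into species, feature type and number.

    Returns None when the text does not follow the ENS... pattern,
    for example an imported gene that keeps its original identifier.
    """
    match = ID_PATTERN.match(identifier)
    if match is None:
        return None
    species_code, feature_letter, number = match.groups()
    return {
        "species": SPECIES_CODES.get(species_code, "unknown code " + species_code),
        "feature": FEATURE_TYPES[feature_letter],
        "number": number,
    }


if __name__ == "__main__":
    for example in ("ENSG00000139618", "ENSMUSG00000017167", "FBgn0000001"):
        print(example, parse_ensembl_id(example))
```

Species codes that are not in the table are reported as unknown rather than rejected, since the slide's "etc." implies many more codes than the two it names.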

Access to Genome Annotation

• Release web site http://www.ensembl.org/

• Pre-Release http://pre.ensembl.org/

• Archive http://archive.ensembl.org

• BioMart http://www.ensembl.org/Multi/martview

• Downloads ftp://ftp.ensembl.org/

• MySQL interface ensembldb.ensembl.org

• Perl API http://www.ensembl.org/info/software/

Pre! and Archive! Sites

BioMart Data Mining Tool

Downloads

ftp://ftp.ensembl.org/pub
http://www.ensembl.org/info/data/download.html

FASTA files: plain sequence
• DNA (assembly masked and unmasked)
• cDNA (Ensembl and ab initio predictions)
• Peptides (Ensembl and ab initio predictions)
• RNA (non-coding RNA predictions)

Flatfiles: annotated 1Mb slices
• EMBL format
• GenBank format

MySQL: database table dumps
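
A FASTA file is plain text in which each record starts with a header line beginning with `>`, followed by one or more lines of sequence. As a minimal sketch of how a downloaded file could be read, the Python below collects records into a dictionary. The function name `read_fasta` and the two sample records are made up for this example and are not taken from any Ensembl download.

```python
def read_fasta(lines):
    """Return {record name: sequence} from an iterable of FASTA lines.

    The record name is the first word after '>' on the header line;
    sequence lines belonging to one record are joined together.
    """
    records = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Invented sample data, used only to show the format.
SAMPLE = """>seq1 hypothetical description
ACGT
TTGA
>seq2
GGCC
"""

if __name__ == "__main__":
    for record_name, sequence in read_fasta(SAMPLE.splitlines()).items():
        print(record_name, len(sequence), sequence)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `SAMPLE.splitlines()`.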

MySQL

SQL = Structured Query Language

Needed:

• MySQL client program
  http://www.mysql.com

• Ability to write MySQL queries

• Knowledge of database schema

Perl API

API = Application Programming Interface

Needed:

• BioPerl modules

• Ensembl modules

• Ability to code in Perl

For more information (installation instructions, tutorials, documentation etc.):
http://www.ensembl.org/info/software/index.html

Ensembl BLAST

WU-BLAST 2.0:
• search against assemblies, Ensembl predictions or ab initio predictions

BLAT and SSAHA2:
• BLAST-like Alignment Tool
• Sequence Search and Alignment by Hashing Algorithm
• very fast
• search against assemblies for (almost) exact DNA-DNA matches

Search against one or multiple species
Search max. 30 sequences simultaneously
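
With a limit of 30 sequences per search, a larger set of query sequences has to be split into several submissions. The Python sketch below shows only that splitting step; it does not talk to the Ensembl BLAST service. The function name `batches` and the constant name are assumptions made for this example.

```python
# Limit stated on the slide: at most 30 sequences per search.
MAX_SEQUENCES_PER_SEARCH = 30


def batches(items, size=MAX_SEQUENCES_PER_SEARCH):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[start:start + size] for start in range(0, len(items), size)]


if __name__ == "__main__":
    # 65 placeholder query names stand in for real sequences.
    queries = ["query_%d" % number for number in range(1, 66)]
    for index, group in enumerate(batches(queries), start=1):
        print("submission", index, "holds", len(group), "sequences")
```

For 65 queries this gives three submissions, two full ones and a final one with the remaining five sequences.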

Ensembl Accounts

• Personalise Ensembl by saving bookmarks, view configurations and homepage preferences in a user account

• Share bookmarks and configurations by setting up groups

Please note that all Ensembl data remain freely accessible. It is not necessary to register in order to gain access to Ensembl data!

Website Statistics

On average 1,000,000 page impressions / week

Top 3 species:

Top 3 countries:

Ensembl – Open Source

• Data and software freely available

• More than 50 installs worldwide

• Academia and industry

• Local or available via the web

• Mirrors with Ensembl data, e.g. http://ensembl.genome.tugraz.at/index.html, or user projects with own data

Powered by Ensembl

What If I Need Help?

• Helpdesk:

[email protected]

• Workshops on use of the browser or the API

• Mailing lists:

[email protected] [email protected]

• ‘Geek for a week’ program

• Animated tutorials

http://www.ensembl.org/common/Workshops_Online

Ensembl Team

Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard

Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios

Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)

Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Dace Ruklisa, Daniel Zerbino

Vectorbase Annotation: Martin Hammond, Dan Lawson, Karyn Megy

Zebrafish Annotation: Kerstin Howe, Tina Eyre, Ian Sealy

Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Felix Kokocinski, Jan-Hinnerck Vogel, Simon White

Comparative Genomics: Javier Herrero, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Albert Vilella

Web Team: James Smith, Fiona Cunningham, Anne Parker, Bethan Pritchard, Stephen Rice, Steve Trevanion

Outreach & QC: Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich

Distributed Annotation System (DAS): Eugene Kulesha, Andy Jenkinson

BioMart: Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley

Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl

Ensembl Team on the river Cam, 2006

Ewan Birney

Q&A

Questions

Answers