Page 1

EBI is an Outstation of the European Molecular Biology Laboratory.

Ensembl – An Overview

Twitter: #Ensembl

Dr. Giulietta M. Spudich

Ensembl Outreach

EMBL-EBI

Page 2

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Page 3

Beginnings …

1995: First genome of a free-living organism sequenced: the bacterium Haemophilus influenzae (1.8 million bp)

2001: First draft of the human sequence (3 Gb)

2004: ‘Finished’ human sequence

2014: Polished human sequence with haplotypes (GRCh38)

Page 4

Today’s genomics – human

1000 Genomes Project

ENCODE

Image credits: Thomas Porostocky; source: MEETINGZONE; courtesy of NIH

Page 5

Today’s genomics – other species

Page 6

Ensembl – Access to …

Page 7

Sister project …

Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa

Page 8

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC

CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG

TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG

CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA

TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA

GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG

AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA

AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG

ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC

TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA

TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC

CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT

AGACTAAAAGTCTTCGCACAGTGAAAT

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC

CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG

TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT

GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG

AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT

CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Raw sequence

Page 9

Ensembl – unlocking the code

06 March 2014

Figure labels: Regulation, Gene, Allele, Conserved sequence

Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/

• Splice variants, proteins, non-coding RNA

• Small and large scale sequence variation, phenotype associations

• Whole genome alignments, protein trees

• Potential promoters and enhancers, DNA methylation

• User upload, custom data

Page 10

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Page 11

Challenge: number of gene/protein sequences increases

• UniProtKB/Swiss-Prot (e.g. Q8IU82) 542,258

• UniProtKB/TrEMBL 51,616,950

• NCBI RefSeq (e.g. NP_006570) 37,371,278

Page 12

Is there a consensus?

• Reaching a consensus coding sequence set for human and mouse.

• Human: 29,045 CCDS IDs, 18,683 Ensembl gene IDs (e74)

• Mouse: 23,093 CCDS IDs, 19,988 Ensembl gene IDs (e72)

Page 13

The GENCODE set

www.gencodegenes.org

• Ensembl has long been respected for its high-quality gene sets

• GENCODE genes = Ensembl Automatic Pipeline + Havana Manual Annotation (+ Yale pseudogenes)

• GENCODE is used by ENCODE, 1000 Genomes, and other projects.

Page 14

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Page 15

Ensembl Variation

Aims:

• Collect, integrate and annotate all known variants

• Provide tools for comparison to other genomic data

• Provide a framework for access and to improve understanding

Page 16

Practical applications of variation

Agriculture, livestock breeding

• Disease-, insect-, and drought-resistant crops

• Healthier, disease-resistant animals

• Marker-assisted breeding

• More nutritious produce

• Reducing the costs of agriculture

Anthropology, evolution, and human migration

Molecular and clinical medicine

• Diagnosis, detection and treatment:

– e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer

• Pharmacogenomics "custom drugs"

DNA forensics

• Identification of suspects

• Catastrophe victims

• Endangered species

Page 17

Variation Sources

www.ensembl.org/info/genome/variation/sources_documentation

• dbSNP (1000 Genomes, ClinVar, etc.)

• ESP (Exome Sequencing Project)

• UniProt

• COSMIC

• HGMD_Public

• NHGRI-GWAS

• & more …

Page 18

Variation in the Browser

Page 19

Ensembl Variant Effect Predictor

Uses an Ensembl gene set to annotate:

• SNPs

• Indels

• Variants in regulatory regions

• Structural variants

Publication: McLaren et al. 2010 (Bioinformatics)

Perl script, web interface, REST API

XML

New interface!
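
A minimal Python sketch of the REST API route to the Variant Effect Predictor named on this slide. The slide gives no URL or response format, so the server address (rest.ensembl.org), the vep/:species/id/:id endpoint, and the response field names are assumptions based on the current public Ensembl REST service. The identifier rs699 is only an example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: current public Ensembl REST server; the slide gives no URL.
REST_SERVER = "https://rest.ensembl.org"


def vep_url(species, variant_id):
    """Build the URL for a VEP query by variant identifier (e.g. an rsID)."""
    path = "/vep/{}/id/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(variant_id, safe=""),
    )
    # The content type is requested through a query parameter.
    return REST_SERVER + path + "?content-type=application/json"


def fetch_json(url, timeout=30.0):
    """Download a URL and decode its JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def most_severe(vep_records):
    """Pair each input variant with its most severe predicted consequence.

    Assumption: each record is a dict with 'input' and
    'most_severe_consequence' keys; a missing key yields None.
    """
    return [
        (record.get("input"), record.get("most_severe_consequence"))
        for record in vep_records
    ]
```

With network access, `most_severe(fetch_json(vep_url("human", "rs699")))` should return one (variant, consequence) pair per submitted variant. The Perl script and web interface are the better fit for large batches; this sketch covers the single-variant case only.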

Page 20

Ensembl Comparative Genomics

[Figure: species tree. Tip labels: Homo_sapiens, Pan_troglodytes, Gorilla_gorilla, Pongo_abelii, Nomascus_leucogenys, Macaca_mulatta, Callithrix_jacchus, Tarsius_syrichta, Microcebus_murinus, Otolemur_garnettii, Tupaia_belangeri, Mus_musculus, Rattus_norvegicus, Dipodomys_ordii, Cavia_porcellus, Ictidomys_tridecemlineatus, Oryctolagus_cuniculus, Ochotona_princeps, Vicugna_pacos, Tursiops_truncatus, Bos_taurus, Sus_scrofa, Equus_caballus, Felis_catus, Ailuropoda_melanoleuca, Mustela_putorius_furo, Canis_familiaris, Myotis_lucifugus, Pteropus_vampyrus, Erinaceus_europaeus, Sorex_araneus, Loxodonta_africana, Procavia_capensis, Echinops_telfairi, Dasypus_novemcinctus, Choloepus_hoffmanni, Monodelphis_domestica, Macropus_eugenii, Sarcophilus_harrisii, Ornithorhynchus_anatinus, Gallus_gallus, Meleagris_gallopavo, Anas_platyrhynchos, Taeniopygia_guttata, Anolis_carolinensis, Pelodiscus_sinensis, Xenopus_tropicalis, Latimeria_chalumnae, Oreochromis_niloticus, Tetraodon_nigroviridis, Takifugu_rubripes, Xiphophorus_maculatus, Oryzias_latipes, Gasterosteus_aculeatus, Gadus_morhua, Danio_rerio, Petromyzon_marinus, Ciona_savignyi, Ciona_intestinalis, Drosophila_melanogaster, Caenorhabditis_elegans, Saccharomyces_cerevisiae]

Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)

Page 21

Whole genome alignments

Homo sapiens ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Pan troglodytes ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Gorilla gorilla gorilla ........................................................................................................................

Ancestral sequences ........................................................................................................................

Pongo abelii ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Macaca mulatta ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG

Callithrix jacchus ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC--------------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Mus musculus ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G----------AGGCGTTGCCGTCAGT-CAGCT-----------------ACCGCTGC-------------------

Ancestral sequences ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G----------AGGCGGTGCCGTCCGT-CAGCT-----------------ACCGCAAC-------------------

Rattus norvegicus ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G----------AGGCGGCGCCGTCCGT-CAGCG-----------------GCCGCAAC-------------------

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Oryctolagus cuniculus ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTGGGGTCTTGCCTAGGGCA

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Bos taurus ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG----------------C--TGTCTCGGGCGG-T---------

Ancestral sequences ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC----------------C--TGTCTCGGGTGG-TTCTGTGGCA

Canis lupus familiaris ........................................................................................................................

Ancestral sequences ........................................................................................................................

Equus caballus ACGT-GCTCAGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAAGAAGGCAAGCGGAGGCGGAGT--TGCTG-GGGCTCC----------------C--TGACTGGGGTGG-TTGTGTGGCA

Ancestral clades labelled in the figure: Great apes, Old World monkeys, Primates, Glires, Rodents, Laurasiatheria, Boreoeutherian
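
One simple way to read two rows of an alignment like this is to compute their pairwise identity. The Python sketch below is an illustration only, not how Ensembl builds or scores its alignments. It assumes that both "-" and "." mark columns with no aligned base, and it uses the Homo sapiens and Pan troglodytes rows copied from the slide.

```python
# Assumption: "-" is an alignment gap and "." marks a species with no
# aligned sequence in this block; both are skipped when comparing rows.
GAP_CHARS = frozenset("-.")

# Rows copied from the alignment shown on the slide.
HUMAN = (
    "ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG"
    "-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG"
)
CHIMP = (
    "ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG"
    "-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG"
)


def pairwise_identity(row_a, row_b):
    """Fraction of identical bases over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(row_a.upper(), row_b.upper()):
        if base_a in GAP_CHARS or base_b in GAP_CHARS:
            continue  # skip columns where either row lacks a base
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


if __name__ == "__main__":
    print("human vs chimp identity: {:.3f}".format(pairwise_identity(HUMAN, CHIMP)))
```

Leaving gapped columns out of the denominator is a design choice. Counting a gap as a mismatch would give lower values for the more distant species in the figure.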

Page 22

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Page 23

Gene expression: The basic model

Figure labels: Transcription factor binding sites, Promoter, Gene, mRNA, Transcription factors, Activation, Repression, RNA polymerase complex (scale bar: 2 nm)

Page 24

Available data

Page 25

Regulation (ENCODE + …)

Page 26

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Page 27

Open source: access our data!

• Ensembl Views (website, FTP)

• Ensembl Database (Perl API, REST API, MySQL)

• BioMart – Quick Data Retrieval (web interface, Bioconductor, Galaxy, biomaRt)
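
Of the access routes listed above, the REST API is the easiest to show without installing anything. The Python sketch below looks up a gene by symbol. The server address (rest.ensembl.org), the lookup/symbol/:species/:symbol endpoint, and the response keys are assumptions based on the current public Ensembl REST service; none of them appear on the slide.

```python
import json
import urllib.parse
import urllib.request

# Assumption: current public Ensembl REST server; the slide gives no URL.
REST_SERVER = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """Build the URL that looks up a gene by its symbol in one species."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(symbol, safe=""),
    )
    return REST_SERVER + path + "?content-type=application/json"


def fetch_json(url, timeout=30.0):
    """Download a URL and decode its JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def gene_location(record):
    """Format a lookup record as 'chromosome:start-end'.

    Assumption: the record carries 'seq_region_name', 'start' and 'end'.
    """
    return "{}:{}-{}".format(
        record["seq_region_name"], record["start"], record["end"]
    )
```

With network access, `gene_location(fetch_json(lookup_symbol_url("homo_sapiens", "BRCA2")))` should give the gene's coordinates on the current assembly. BRCA2 is only an example symbol. For bulk retrieval, BioMart or the MySQL databases are likely a better fit than one request per gene.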

Page 28

Ensembl is used worldwide

Top users:

UK

US

Canada

China

France

Germany

Italy

Japan

Spain

Page 29

Workshops Worldwide (2013)

Page 30

What’s coming? (2014)

New Assemblies:

• GRCh38 (and all the updated annotation): www.ensembl.info/blog (category GRCh38)

• Baboon

• Vervet monkey

• Amazon molly

• Crab-eating macaque (pre.ensembl.org)

• Hedgehog (pre.ensembl.org)

New BLAST

New Regulatory Build: www.ensembl.info/blog/2013/12/26/the-new-ensembl-regulatory-annotation

Page 31

Learn more

• Comments and questions? [email protected]

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

• Mailing lists [email protected], [email protected]

• Courses online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

Page 32

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Page 33

Acknowledgements

Funding: European Commission Framework Programme 7