EBI is an Outstation of the European Molecular Biology Laboratory.

Ensembl – An Overview

Twitter: #Ensembl

Dr. Giulietta M. Spudich

Ensembl Outreach

EMBL-EBI

This talk …

Genome Sequencing and Browsers

Ensembl Data

Genes

Variation

Comparative Genomics

Regulation

Access

Beginnings …

1995: 1st free-living organism: bacterium

Haemophilus influenzae (1.8 million bp)

2001: First draft of the human sequence (3 Gb)

2004: ‘Finished’ human sequence

2014: Polished human sequence with haplotypes (GRCh38)

THOMAS POROSTOCKY; SOURCE:

MEETINGZONE

1000 Genomes Project

ENCODE

Today’s genomics - human

COURTESY OF NIH

Today’s genomics – other species

Ensembl – Access to …

Sister project …

Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC

CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG

TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG

CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA

TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA

GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG

AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA

AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG

ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC

TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA

TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC

CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT

AGACTAAAAGTCTTCGCACAGTGAAAT

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC

CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG

TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT

GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG

AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT

CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Raw sequence

Ensembl – unlocking the code

06 March 2014

Regulation

Gene

Allele

Conserved sequence

Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/

• Splice variants, proteins, non-coding RNA

• Small and large scale sequence variation, phenotype associations

• Whole genome alignments, protein trees

• Potential promoters and enhancers, DNA methylation

• User upload, custom data

Challenge: number of gene/protein sequences increases

• UniProtKB/Swiss-Prot (e.g. Q8IU82): 542,258

• UniProtKB/TrEMBL: 51,616,950

• NCBI RefSeq (e.g. NP_006570): 37,371,278

Is there a consensus?

• Reaching a consensus coding sequence set for human and mouse.

• Human: 29,045 CCDS IDs - 18,683 Ensembl Gene IDs (e74)

• Mouse: 23,093 CCDS IDs - 19,988 Ensembl Gene IDs (e72)

The GENCODE set

www.gencodegenes.org

• Ensembl has long been respected for its high-quality gene sets

• GENCODE genes = Ensembl Automatic Pipeline + Havana Manual Annotation (+ Yale pseudogenes)

• GENCODE is used by ENCODE, 1000 Genomes, and other projects.

Ensembl Variation

Aims:

• Collect, integrate and annotate all known variants

• Provide tools for comparison to other genomic data

• Provide a framework for access and to improve understanding

Practical applications of variation

Agriculture, livestock breeding

• Disease-, insect-, and drought-resistant crops

• Healthier, disease-resistant animals

• Marker-assisted breeding

• More nutritious produce

• Reducing the costs of agriculture

Anthropology, evolution, and human migration

Molecular and clinical medicine

• Diagnosis, detection and treatment:

– e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer

• Pharmacogenomics "custom drugs"

DNA forensics

• Identification of suspects

• Catastrophe victims

• Endangered species

Variation Sources

www.ensembl.org/info/genome/variation/sources_documentation

• dbSNP (1000 Genomes, ClinVar, etc.)

• ESP (Exome Sequencing Project)

• UniProt

• COSMIC

• HGMD_Public

• NHGRI-GWAS

• & more …

Variation in the Browser

Uses an Ensembl gene set to annotate:

• SNPs

• Indels

• Variants in regulatory regions

• Structural variants

Publication: McLaren et al. 2010 (Bioinformatics)

Ensembl Variant Effect Predictor

• Perl script

• Web interface

• REST API

• XML

New interface!
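
A minimal sketch of reaching the Variant Effect Predictor through the REST API, in Python. The server address https://rest.ensembl.org and the vep/:species/id/:id endpoint are assumptions based on the current public REST service, not something shown on the slides. The function names are illustrative and rs699 is only an example identifier.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the current public Ensembl REST server (not given in the slides).
REST_SERVER = "https://rest.ensembl.org"


def vep_url(variant_id, species="human"):
    """Build the URL for the REST endpoint vep/:species/id/:id."""
    return "{}/vep/{}/id/{}".format(
        REST_SERVER,
        urllib.parse.quote(species),
        urllib.parse.quote(variant_id),
    )


def most_severe(records):
    """Pull the most severe consequence term out of each VEP record."""
    return [record.get("most_severe_consequence") for record in records]


def fetch_consequences(variant_id, species="human"):
    """Query the VEP for one known variant; needs network access."""
    request = urllib.request.Request(
        vep_url(variant_id, species),
        headers={"Content-Type": "application/json"},  # ask for JSON output
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return most_severe(json.load(response))
```

Calling fetch_consequences("rs699") would return one consequence term per record the service sends back. The URL building and the parsing are kept as separate functions so they can be checked without a network connection.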

Ensembl Comparative Genomics

[Figure: species tree. Leaf labels, in order: Homo_sapiens, Pan_troglodytes, Gorilla_gorilla, Pongo_abelii, Nomascus_leucogenys, Macaca_mulatta, Callithrix_jacchus, Tarsius_syrichta, Microcebus_murinus, Otolemur_garnettii, Tupaia_belangeri, Mus_musculus, Rattus_norvegicus, Dipodomys_ordii, Cavia_porcellus, Ictidomys_tridecemlineatus, Oryctolagus_cuniculus, Ochotona_princeps, Vicugna_pacos, Tursiops_truncatus, Bos_taurus, Sus_scrofa, Equus_caballus, Felis_catus, Ailuropoda_melanoleuca, Mustela_putorius_furo, Canis_familiaris, Myotis_lucifugus, Pteropus_vampyrus, Erinaceus_europaeus, Sorex_araneus, Loxodonta_africana, Procavia_capensis, Echinops_telfairi, Dasypus_novemcinctus, Choloepus_hoffmanni, Monodelphis_domestica, Macropus_eugenii, Sarcophilus_harrisii, Ornithorhynchus_anatinus, Gallus_gallus, Meleagris_gallopavo, Anas_platyrhynchos, Taeniopygia_guttata, Anolis_carolinensis, Pelodiscus_sinensis, Xenopus_tropicalis, Latimeria_chalumnae, Oreochromis_niloticus, Tetraodon_nigroviridis, Takifugu_rubripes, Xiphophorus_maculatus, Oryzias_latipes, Gasterosteus_aculeatus, Gadus_morhua, Danio_rerio, Petromyzon_marinus, Ciona_savignyi, Ciona_intestinalis, Drosophila_melanogaster, Caenorhabditis_elegans, Saccharomyces_cerevisiae]

Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)

Whole genome alignments

Homo sapiens ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Pan troglodytes ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Gorilla gorilla gorilla ........................................................................................................................

Ancestral sequences ........................................................................................................................

Pongo abelii ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Macaca mulatta ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG

Ancestral sequences ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG

Callithrix jacchus ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC--------------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Mus musculus ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G----------AGGCGTTGCCGTCAGT-CAGCT-----------------ACCGCTGC-------------------

Ancestral sequences ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G----------AGGCGGTGCCGTCCGT-CAGCT-----------------ACCGCAAC-------------------

Rattus norvegicus ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G----------AGGCGGCGCCGTCCGT-CAGCG-----------------GCCGCAAC-------------------

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Oryctolagus cuniculus ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTGGGGTCTTGCCTAGGGCA

Ancestral sequences ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA

Bos taurus ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG----------------C--TGTCTCGGGCGG-T---------

Ancestral sequences ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC----------------C--TGTCTCGGGTGG-TTCTGTGGCA

Canis lupus familiaris ........................................................................................................................

Ancestral sequences ........................................................................................................................

Equus caballus ACGT-GCTCAGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAAGAAGGCAAGCGGAGGCGGAGT--TGCTG-GGGCTCC----------------C--TGACTGGGGTGG-TTGTGTGGCA

Great apes

Old World monkeys

Primates

Glires

Rodents

Laurasiatheria

Boroeutherian
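
The alignment rows above can be compared column by column. The following Python sketch is one simple way to do that and is not how Ensembl computes its alignments or conservation scores. It assumes that "-" marks a gap and "." marks missing sequence, as the rows above suggest; the function name is illustrative.

```python
def percent_identity(row_a, row_b):
    """Percentage of comparable columns in which two aligned rows agree.

    Columns with a gap ('-') or missing data ('.') in either row are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(row_a.upper(), row_b.upper()):
        if base_a in "-." or base_b in "-.":
            continue  # nothing to compare in this column
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# First 20 columns of the Homo sapiens and Mus musculus rows shown above.
human = "ACGT-GG--CCAGCGCGGGC"
mouse = "ACGG-GC--AGAGCGCGGGC"
identity = percent_identity(human, mouse)
```

For these 20 columns, 17 are comparable and 13 of them match, which gives about 76% identity.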

Gene expression: The basic model

Transcription Factor Binding Sites Promoter Gene

mRNA

Transcription Factors Activation

Repression RNA polymerase complex

2 nm

Available data

Regulation (ENCODE + …)

Open source - access our data!

• Ensembl Views (Website, ftp)

• Ensembl Database (Perl API, REST API, MySQL)

• BioMart – Quick Data Retrieval (Web interface, Bioconductor, Galaxy, biomaRt)
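
As a rough illustration of programmatic access, here is a Python sketch that looks up a gene by symbol over the REST API. It assumes the current public server https://rest.ensembl.org and its lookup/symbol/:species/:symbol endpoint, neither of which is given on the slides. BRCA2 is only an example symbol and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the current public Ensembl REST server (not given in the slides).
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the URL for the REST endpoint lookup/symbol/:species/:symbol."""
    return "{}/lookup/symbol/{}/{}".format(
        REST_SERVER,
        urllib.parse.quote(species),
        urllib.parse.quote(symbol),
    )


def summarise_gene(record):
    """Reduce a lookup record to 'stable ID chromosome:start-end'."""
    return "{id} {seq_region_name}:{start}-{end}".format(**record)


def lookup_gene(symbol, species="homo_sapiens"):
    """Fetch and summarise one gene; needs network access."""
    request = urllib.request.Request(
        lookup_url(symbol, species),
        headers={"Content-Type": "application/json"},  # ask for JSON output
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarise_gene(json.load(response))
```

Calling lookup_gene("BRCA2") would return the gene's stable ID and location on whichever assembly the server currently provides. The same data can also be reached through the Perl API, MySQL or BioMart, as listed above.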

Ensembl is used worldwide

Top users:

UK

US

Canada

China

France

Germany

Italy

Japan

Spain

Workshops Worldwide (2013)

What’s coming? (2014)

New Assemblies:

• GRCh38 (and all the updated annotation): www.ensembl.info/blog (category GRCh38)

• Baboon

• Vervet monkey

• Amazon molly

• Crab-eating macaque (pre.ensembl.org)

• Hedgehog (pre.ensembl.org)

New BLAST

New Regulatory Build: www.ensembl.info/blog/2013/12/26/the-new-ensembl-regulatory-annotation

Learn more

• Comments and questions? helpdesk@ensembl.org

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

• Mailing lists announce@ensembl.org, dev@ensembl.org

• Courses online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Acknowledgements

Funding: European Commission Framework Programme 7
