
Post on 03-Jan-2016


Page 1: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

EnsEMBL

Opening up the whole Genome

Philip Lijnzaad

lijnzaad@ebi.ac.uk

Page 2: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Overview

• what

• how (science, hardware, software)

• results

• families and descriptions

• tour

• people

Page 3: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

What is EnsEMBL

• Automatic annotation of the complete human genome

  – genes

  – other: markers, SNPs, homologies, etc.

• completely open

  – data, software, discussions

• portable, downloadable

• ‘the Linux of the Human Genome’

Page 4: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

From ...

TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTTTTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAACAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCCAGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAAAGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATGGGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTATTTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGTGGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACAAGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTGTTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTTCTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAATAACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACCCAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAGCAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCACCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG

Page 5: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

… to:

MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFSSPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTGTRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRRSDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITRDVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGVVKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYEAVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITESPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCESGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPDGPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC

Page 6: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Take:

• Draft human genome

  – clones and contigs from public databases

  – not finished

    • errors

    • gaps

  – Golden Path

    • assembly of all contigs into (nearly) complete chromosomes

Page 7: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Then:

• Get rid of repeats

• Targeted searches

  – pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq

  – GeneWise and EST2Genome to build the genes

    • fill in coding sequences and UTRs

Page 8: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

And then:

• Similarity searches

  – GenScan on raw contigs

  – its peptides are searched against protein, mRNA and EST databases

  – genes are built using GeneWise on promising regions

  – additionally, exons can be used

• All predictions supported by evidence!

Page 9: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Add cross references:

• HUGO (HGNC)

• SwissProt/Trembl, RefSeq

• EMBL

• OMIM

• LocusLink

• InterPro

Page 10: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Add yet more

• GeneTribe families

• Gene descriptions

• Markers

• SNPs

• external annotations (EMBL)

• mouse traces

• ...

Page 11: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Hardware

• 360 Alpha’s: DS10, dual EV6 processors, 1GByte memory

• 200 other nodes

• 10 days to do a complete blast + gene build

• ~ 30 million jobs

• ~ 30 GB

Page 12: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Software

• Digital Unix

• Apache

• relational database (MySQL)

• mostly Perl, some C and Java

  – BioPerl, BioJava, BioCORBA

• LSF

• AltaVista

Page 13: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Software (2)

• Wiki Web

• CVS (~100 Mb)

• Code review, data review

• Testing conventions

• Interfaces

• VirtualContigs

• CORBA/Java

Page 14: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ID’s

• for genes, transcripts, exons, peptides, families

• ENSXnnnnnnnnnnn (e.g. ENSG00000067369)

  – X denotes which type:

• G = gene

• T = transcript

• E = exon

• P = peptide (translation)

• F = family
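
The ID scheme above can be checked mechanically. The following is a minimal sketch, assuming the format is exactly "ENS", one type letter and 11 digits, as in the example ENSG00000067369; the function and class names are hypothetical and not part of the EnsEMBL code base.

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# Assumed layout: "ENS", one type letter, then 11 digits.
_ID_PATTERN = re.compile(r"^ENS([GTEPF])(\d{11})$")


class EnsemblId(NamedTuple):
    kind: str    # e.g. "gene"
    number: int  # numeric part without leading zeros


def parse_ensembl_id(identifier: str) -> EnsemblId:
    """Split an ID into its type and number, or raise ValueError."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised EnsEMBL ID: {identifier!r}")
    letter, digits = match.groups()
    return EnsemblId(ID_TYPES[letter], int(digits))
```

For instance, parse_ensembl_id("ENSG00000067369") yields kind "gene" and number 67369.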

Page 15: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ID’s (2)

• ID’s should be stable

• difficult, because underlying data keeps being refined!

• ID mapping

• version numbers

Page 16: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Results

• Latest release: 1.1 (17 July)

  – Web code version: 1.1.1 (1 Aug.)

• April 2001 dataset

• 4,318,661,441 basepairs

• 143,479 exons

• 23,931 transcripts

• 21,921 genes (‘confirmed’)

Page 17: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Errors

• Missing data

• Misassembly

• Misidentification (pseudo-gene, paralog)

• Sequencing errors

  – in Human Genome Data

  – in supporting databases

• Bugs

• GenScan tuning

• GeneWise tuning

Page 18: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Gene Families

• Cluster EnsEMBL peptides together with SwissProt and SPTrembl

  – vertebrate

• GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)
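
The slide names Markov clustering but gives no details of how GeneTRIBE applies it. As a generic illustration only (not GeneTRIBE's code), here is a toy pure-Python sketch of the usual Markov clustering loop: a similarity matrix is made column-stochastic, then repeatedly expanded (squared) and inflated (raised element-wise to a power and renormalised). The function names, the inflation value, the iteration count and the way clusters are read off the final matrix are all assumptions chosen for this sketch.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: element-wise power, then renormalise columns."""
    return normalise_columns([[x ** power for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops so that no column sums to zero.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that still carries weight lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters
```

In a real family build the similarity matrix would come from all-against-all protein comparisons and would be sparse and far larger; this dense version is only meant to show the two alternating steps.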

Page 19: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Family descriptions

• distill consensus descriptions

  – using SwissProt DE-lines

  – may not work => unknown

• Transfer peptide’s family assignment to gene

  – resolve conflicts: choose family that has best description

  – unknown < hypothetical < fragment < cDNA
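
The conflict rule above (unknown < hypothetical < fragment < cDNA) can be sketched as a ranking function. This is an illustration under stated assumptions, not the actual EnsEMBL implementation: descriptions are ranked by simple keyword matching, a description containing none of the four keywords is assumed to be better than all of them, and ties are broken by family ID. All names are hypothetical.

```python
from typing import Mapping

# Poorly informative keywords, worst first, following the slide's ordering.
LOW_QUALITY_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description: str) -> int:
    """Return a score for a family description; higher is better."""
    text = description.lower()
    # The worst keyword present decides the rank.
    for rank, keyword in enumerate(LOW_QUALITY_KEYWORDS):
        if keyword in text:
            return rank
    # Assumption: anything without these keywords beats all of them.
    return len(LOW_QUALITY_KEYWORDS)


def best_family(families: Mapping[str, str]) -> str:
    """Pick the family ID whose description ranks highest.

    `families` maps family ID -> description for all families that the
    peptides of one gene were assigned to.
    """
    # Sorting first makes ties resolve to the smallest family ID.
    return max(sorted(families), key=lambda fid: description_rank(families[fid]))
```

A gene whose peptides fall into one family described as "UNKNOWN" and another described as "ZINC FINGER PROTEIN" would be assigned to the second one.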

Page 20: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Family statistics:

• 13,811 families

  – 7284 ‘unknown’ description

• 128,828 members

  – 21,894 ENS genes

  – 23,867 ENS peptides

Page 21: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Family statistics (2)

• 6759 with 1 member

• 3457 with 2-10 members

• 215 with 10-100

• 4 with > 100

• max is 483 (zinc finger)

Page 22: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Gene descriptions

• Use SwissProt DE-line if known

• use Family if not

• Statistics:

  – 18053 descriptions

    • 13202 from SwissProt

    • 4851 from family description

  – 3868 still UNKNOWNs
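
The fallback rule and the counts on this slide can be written down in a few lines. The sketch below assumes that a missing description is represented by None or by the literal string "UNKNOWN"; the function name is hypothetical. The final assertions only re-check the slide's own arithmetic: the two sources add up to the number of descriptions, and descriptions plus unknowns add up to the 21,921 genes reported on the Results slide.

```python
from typing import Optional


def describe_gene(swissprot_de: Optional[str],
                  family_description: Optional[str]) -> str:
    """Use the SwissProt DE-line if known, else the family description."""
    if swissprot_de:
        return swissprot_de
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description
    return "UNKNOWN"


# Counts taken from the slides.
FROM_SWISSPROT = 13202
FROM_FAMILY = 4851
DESCRIPTIONS = 18053
STILL_UNKNOWN = 3868
GENES = 21921

assert FROM_SWISSPROT + FROM_FAMILY == DESCRIPTIONS
assert DESCRIPTIONS + STILL_UNKNOWN == GENES
```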

Page 23: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Entry points

• http://www.ensembl.org

• ID search

• text search

• OMIM disease

• Browse chromosomes

• BLAST

Page 24: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk
Page 25: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

TextSearch

Page 26: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

DiseaseView

Page 27: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

BLAST/SSAHA

Page 28: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

MapView

Page 29: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

MarkerView

Page 30: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ContigView

Page 31: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ContigView (2)

Page 32: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ContigView(3)

Page 33: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

DAS annotations

Page 34: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Apollo

Page 35: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ExportView

Page 36: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

GeneView

Page 37: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

GeneView (2)

Page 38: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ProteinView

Page 39: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ProteinView (2)

Page 40: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

ExpressionView

Page 41: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

DomainView

Page 42: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

FamilyView

Page 43: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Recent developments

• HelpDesk

• DAS– Adding annotations from anywhere

• Apollo– Genome viewer

• Expression data– SAGE

Page 44: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Future

• Better genes!

• Alignments

• Other genomes

  – Comparative Genomics

• CORBA/Java

• More protein-structural links

  – Scop profiles

• IGI/IPI

• Entity infra-structure

Page 45: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Links

• http://www.ensembl.org

  – dev.ensembl.org

• http://www.ensembl.org/genome/central

• http://genome.ucsc.edu

• http://compbio.ornl.gov/channel

• http://ncbi.nlm.gov/genome/guide/human

• http://www.biodas.org

• http://www.bio{perl,xml,corba,python,java}.org

Page 46: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Acknowledgements

• Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more

Page 47: EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

Join!

• http://www.ensembl.org

• mailing lists

  – [email protected]

  – [email protected]

  – (see http://www.ensembl.org/Dev/Lists)