ensembl opening up the whole genome philip lijnzaad lijnzaad@ebi.ac.uk

Post on 03-Jan-2016

EnsEMBL

Opening up the whole Genome

Philip Lijnzaad

lijnzaad@ebi.ac.uk

Overview

• what

• how (science, hardware, software)

• results

• families and descriptions

• tour

• people

What is EnsEMBL

• Automatic Annotation of complete Human Genome
  – genes
  – other: markers, SNPs, homologies, etc.

• completely open
  – data, software, discussions

• portable, downloadable

• ‘the Linux of the Human Genome’

From ...

TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTTTTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAACAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCCAGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAAAGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATGGGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTATTTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGTGGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACAAGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTGTTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTTCTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAATAACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACCCAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAGCAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCACCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG

… to:

MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFSSPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTGTRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRRSDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITRDVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGVVKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYEAVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITESPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCESGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPDGPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC

Take:

• Draft human genome
  – clones and contigs from public databases
  – not finished
    • errors
    • gaps
  – Golden Path
    • assembly of all contigs into (nearly) complete chromosomes

Then:

• Get rid of repeats

• Targeted searches
  – pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq
  – GeneWise and EST2Genome to build the genes
    • fill in coding sequences and UTRs
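
The slides do not say how repeats are removed. One common approach is to mask them, which would also explain the run of N's in the example sequence above. The following is a minimal sketch under that assumption; the function name `hard_mask` and the repeat coordinates are hypothetical, and in practice the intervals would come from a repeat-finding program.

```python
def hard_mask(sequence, repeats):
    """Return `sequence` with every repeat interval replaced by 'N'.

    `repeats` is a list of half-open (start, end) intervals, 0-based.
    Both the interval format and this function are illustrative only.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp the interval so out-of-range coordinates cannot raise.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


contig = "TCTTCTCCTTCAAGGCATCCAGG"      # first bases of the example above
masked = hard_mask(contig, [(4, 10)])   # made-up repeat at positions 4..9
print(masked)  # TCTTNNNNNNCAAGGCATCCAGG
```

Masking keeps the coordinates of the contig unchanged, so later predictions can still be placed on the original sequence.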

And then:

• Similarity searches
  – GenScan on raw contigs
  – its peptides are searched against protein, mRNA and EST databases
  – genes are built using GeneWise on promising regions
  – additionally, exons can be used

• All predictions supported by evidence!

Add cross references:

• HUGO (HGNC)

• SwissProt/TrEMBL, RefSeq

• EMBL

• OMIM

• LocusLink

• InterPro

Add yet more

• GeneTribe families

• Gene descriptions

• Markers

• SNPs

• external annotations (EMBL)

• mouse traces

• ...

Hardware

• 360 Alphas: DS10, dual EV6 processors, 1 GB memory

• 200 other nodes

• 10 days to do a complete blast + gene build

• ~ 30 million jobs

• ~ 30 GB
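
To give a feel for these numbers, here is a rough back-of-the-envelope calculation. It assumes that the ~30 million jobs belong to the same 10-day run and are spread evenly over all 560 nodes; neither is stated on the slide, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the slide.
jobs = 30_000_000          # ~30 million jobs
days = 10                  # complete blast + gene build
nodes = 360 + 200          # Alphas plus other nodes

seconds = days * 24 * 60 * 60

# Assumption: jobs are evenly distributed and all nodes are busy throughout.
jobs_per_second = jobs / seconds           # farm-wide throughput
jobs_per_node = jobs / nodes               # jobs handled by one node
seconds_per_job = seconds / jobs_per_node  # average wall time per job on a node

print(f"{jobs_per_second:.1f} jobs/s across the farm")
print(f"{jobs_per_node:.0f} jobs per node")
print(f"{seconds_per_job:.1f} s per job on one node")
```

Under these assumptions a job lasts on the order of seconds, so the scheduling overhead per job matters as much as raw CPU speed.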

Software

• Digital Unix

• Apache

• relational database (MySQL)

• mostly Perl, some C and Java
  – BioPerl, BioJava, BioCORBA

• LSF

• AltaVista

Software (2)

• Wiki Web

• CVS (~100 Mb)

• Code review, data review

• Testing conventions

• Interfaces

• VirtualContigs

• CORBA/Java

IDs

• for genes, transcripts, exons, peptides, families

• ENSXnnnnnnnnnnn (e.g. ENSG00000067369)
  – X denotes which type:
    • G = gene
    • T = transcript
    • E = exon
    • P = peptide (translation)
    • F = family
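
The ID scheme above is regular enough to check mechanically. The sketch below parses an ID into its type and number. It assumes exactly 11 digits, as in the pattern and the example on the slide; the function name `parse_id` is made up for illustration and is not part of the EnsEMBL code.

```python
import re

# 'ENS', one type letter, then 11 digits (as in ENSG00000067369).
ID_PATTERN = re.compile(r"^ENS([GTEPF])(\d{11})$")

TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}


def parse_id(identifier):
    """Split an ID into (type name, number); raise ValueError if malformed."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a valid ID: {identifier!r}")
    letter, digits = match.groups()
    return TYPES[letter], int(digits)


print(parse_id("ENSG00000067369"))  # ('gene', 67369)
```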

IDs (2)

• IDs should be stable

• difficult, because underlying data keeps being refined!

• ID mapping

• version numbers

Results

• Latest release: 1.1 (17 July)
  – Web code version: 1.1.1 (1 Aug.)

• April 2001 dataset

• 4,318,661,441 base pairs

• 143,479 exons

• 23,931 transcripts

• 21,921 genes (‘confirmed’)
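
A few ratios follow directly from these counts. The calculation below uses only the numbers on this slide. Note one assumption: if exons are counted once even when shared between transcripts, the first ratio is distinct exons per transcript, not the average length of a transcript in exons.

```python
# Counts from the slide.
exons = 143_479
transcripts = 23_931
genes = 21_921

exons_per_transcript = exons / transcripts
transcripts_per_gene = transcripts / genes

print(f"{exons_per_transcript:.2f} exons per transcript")
print(f"{transcripts_per_gene:.2f} transcripts per gene")
```

With roughly one transcript per gene, this dataset contains very little alternative splicing.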

Errors

• Missing data

• Misassembly

• Misidentification (pseudo-gene, paralog)

• Sequencing errors
  – in Human Genome Data
  – in supporting databases

• Bugs

• GenScan tuning

• GeneWise tuning

Gene Families

• Cluster EnsEMBL peptides together with SwissProt and SP-TrEMBL
  – vertebrate

• GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)

Family descriptions

• distill consensus descriptions
  – using SwissProt DE-lines
  – may not work => unknown

• Transfer peptide’s family assignment to gene
  – resolve conflicts: choose family that has best description
  – unknown < hypothetical < fragment < cDNA
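
The conflict rule above can be sketched as a simple ranking. This is an illustration, not the actual EnsEMBL code. It assumes that a description is classified by keyword, that the lowest-ranking keyword found decides, and that any description containing none of the four keywords ranks above ‘cDNA’. The family IDs and descriptions in the example are invented.

```python
# Ordering from the slide: unknown < hypothetical < fragment < cDNA.
RANK = {"unknown": 0, "hypothetical": 1, "fragment": 2, "cdna": 3}
INFORMATIVE = 4  # assumption: anything else counts as a better description


def description_rank(description):
    """Rank a family description; higher means more informative."""
    lowered = description.lower()
    ranks = [rank for keyword, rank in RANK.items() if keyword in lowered]
    # Assumption: if several keywords occur, the weakest one decides.
    return min(ranks) if ranks else INFORMATIVE


def choose_family(candidates):
    """Pick the (family_id, description) pair with the best description."""
    return max(candidates, key=lambda pair: description_rank(pair[1]))


# A gene whose peptides were assigned to two different (made-up) families.
candidates = [
    ("ENSF00000000001", "UNKNOWN"),
    ("ENSF00000000002", "ZINC FINGER PROTEIN"),
]
print(choose_family(candidates))  # ('ENSF00000000002', 'ZINC FINGER PROTEIN')
```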

Family statistics:

• 13,811 families
  – 7284 ‘unknown’ description

• 128,828 members
  – 21,894 ENS genes
  – 23,867 ENS peptides

Family statistics (2)

• 6759 with 1 member

• 3457 with 2-10 members

• 215 with 10-100

• 4 with > 100

• max is 483 (zinc finger)

Gene descriptions

• Use SwissProt DE-line if known

• use Family if not

• Statistics:
  – 18053 descriptions
    • 13202 from SwissProt
    • 4851 from family description
  – 3868 still UNKNOWNs
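
The fallback rule on this slide is short enough to write out. The sketch below is illustrative only: the function name `describe_gene` and the way a missing description is represented (`None` or the string "UNKNOWN") are assumptions. The second part checks that the counts on the slide add up.

```python
def describe_gene(swissprot_de, family_description):
    """Return (description, source) following the rule on the slide."""
    if swissprot_de:
        return swissprot_de, "SwissProt"
    # Assumption: an unusable family description is None or 'UNKNOWN'.
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description, "family"
    return "UNKNOWN", None


print(describe_gene(None, "ZINC FINGER PROTEIN"))  # ('ZINC FINGER PROTEIN', 'family')

# Counts from the slide.
from_swissprot, from_family, unknown = 13202, 4851, 3868
described = from_swissprot + from_family
print(described, described + unknown)  # 18053 21921
```

The total of 21,921 matches the number of ‘confirmed’ genes given in the Results.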

Entry points

• http://www.ensembl.org

• ID search

• text search

• OMIM disease

• Browse chromosomes

• BLAST

TextSearch

DiseaseView

BLAST/SSAHA

MapView

MarkerView

ContigView

ContigView (2)

ContigView(3)

DAS annotations

Apollo

ExportView

GeneView

GeneView (2)

ProteinView

ProteinView (2)

ExpressionView

DomainView

FamilyView

Recent developments

• HelpDesk

• DAS
  – Adding annotations from anywhere

• Apollo
  – Genome viewer

• Expression data
  – SAGE

Future

• Better genes!

• Alignments

• Other genomes
  – Comparative Genomics

• CORBA/Java

• More protein-structural links
  – SCOP profiles

• IGI/IPI

• Entity infrastructure

Links

• http://www.ensembl.org
  – dev.ensembl.org

• http://www.ensembl.org/genome/central

• http://genome.ucsc.edu

• http://compbio.ornl.gov/channel

• http://ncbi.nlm.gov/genome/guide/human

• http://www.biodas.org

• http://www.bio{perl,xml,corba,python,java}.org

Acknowledgements

• Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more

Join!

• http://www.ensembl.org

• mailing lists
  – ensembl-dev@ebi.ac.uk
  – ensembl-announce@ebi.ac.uk
  – (see http://www.ensembl.org/Dev/Lists )
