Millions of Genes with Python and Jython
Clint Howarth
Janet Dewar, Maia Hansen, Jeffrey Larimer, Matthew Pearson, Andrew Roberts
Shailaja Gargeya, Jennifer Wortman, Cheryl Murphy, Bruce Birren
Analysis and Annotation Engineering
Genome Sequencing and Analysis Program
Broad Institute
Annotation and Analysis
Malaria, Tuberculosis, HIV, West Nile, Dengue fever, E. coli, Streptococcus, Staphylococcus aureus
Human Microbiome Project
Table of Contents: Applications
Internal
● java/jython: analysis and publication platform
● olive: web publication
Open Source
● toothpick: data abstraction layer
● genepidgin: gene names
● accordion: genetics over time
Annotation sample problem
CCCCAAGCTCACTGATTGACGGTGCTCTGATTGCGCAACCAGACAACGACGACAATGAGGGTGCTACTGTCTTTTCTCATTATAGGGCTGCTAGGCATTGCCAGTGCCCTCAGCACCAGTGGAAACAGATTACTGGTGGTACTGGAGGAGCTTGCGGAGAAGGACAAGTACTCCAAGTTCTTTGGGGATCTGAAAGGCAAGCGCACGGCTGGGAGATAGAACTGGCAGGGACAACGGGATTCTAGCTAACTTTTTGAGGATTGTAGGTCGGGGCTTTGATATCACATTTGAATCTCCAAAGAGTGATAGTTTGGCGCTGTTCGAGCTTGGCGAGAGAGCTTATGATCACCTTCTCATCCTCCCATCTAAGTCGAAAGGTCAGCATCCATTGAACGACATTGGACTGTCGTGTGCTAATATAATAGGCCTGGGACCAAACCTCACGCCCCAAACCTTATTGAAGTTCATCAATACCGAGGGCAACATCCTGCTCACCCTATCTTCATCCAACCCGATACCATCAGCTCTCGTATCAGTTCTGCTTGAGCTTGACATCCATCTCCCCACCGACCGCAACTCGATAGTGGTCGACCACTTCAACTACGACAGCCTCTCGGCCCCCGAATCCCACGATGTCGTTCTTGTTCCCCGCCCAAGCGCTGTGCGCCCCGGTGTTCGCAACTTCTTCGGCGGCATCCTCAAGAACGAGGTTATCGCGTTCCCCCACGGCGTGGGCCAGACTCTAGGCAACGATAGCCCGTACTTGACACCGGTCCTTCGCGCCCCCGGCACGGCGTACTCATACAACCCCAAGGAGGAGGCCGAGGCTGTGGAGGACCCGTTCGCGGTTGGCCAGCAACTGTCCCTCGTGACCGCCATGCAGGCTCGCAACTCAGCGAGGTTCACTGTCTTGGGCGCAGCGGAAATGCTTGAGGATAAGTGGTTCAAGGGGAAGGTCCAAGTTGCTGGCGGCAAGGTTGCTGCGGCTGCGAATGAGGCGTTTGCGAAGGAGATCTCCGGATGGACTTTCAATGAGGCTGGAGTCCTGAAGGTGAAGTCTGTTACGCATTTCCTCAACGAAGAGGGGTTGAAAACACCCAATGCTTCATTGACGAACCCCAAGATCTACCGTGTCAAGAACACTGTTGTAAGTGGATTTTCTGTGCCAAATGTGAGAATCATATGCTAACTATGTCTAGACTTACTCGATTGAGCTATCTGAATGGTCGTGGAAGGAGTATGTACCCTTCGTACCCGCCACCGGTGATGATGTGCAGCTTGAGTTCTCTATGCTCTCGCCCTTCCACCGACTGAATCTGGAGCGCACTCAGACGAATCCTTCATCTAGCGTCTTCAGCACCACATTCAAGCTTCCAGATCAGCATGGAATCTTCAACTTCCTGGTCGAGTTCCGCCGCCCCTTCCTCTCGAACATCGAGGACAAGAAAACGGTCACCGTACGCCACTTTGCACACGATGAATGGCCACGCAGTTGGGTCATCAGCGCCGCGTGGCCATGGATCTCTGGCGTTGCGGTCACTGTTGTTGGATGGATTATATTCGTGGGATTGTGGTTGTACAGTGCCCCACCGACAGTGAAGGGAAAGAAGTCGCGATGAGAAGAGCTAGATGTTGCATTTGAGACGTAAACGGGACTGTATGAATACCAAAATCGTGTATAGATGATATAGTAGGCAGGAACACATGGCATGTCTGATTCCGAATAAATCGCTCGTATTGCCTTGCGCGCTTGTTGATTGTACACGGTTGTG
Annotation sample solution
[Slide repeats the sequence from the sample problem; the annotation overlay marking the solution is not recoverable from the transcript.]
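The slides do not say how the gene in this sample was located. As a rough illustration of what the problem involves (not the Broad pipeline), the sketch below scans the forward strand for open reading frames, taking each ATG up to the first in-frame stop codon. The names `find_orfs` and `min_codons` and the demo sequence are made up for this example. Real eukaryotic genes contain introns and may lie on either strand, so this understates the difficulty.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon (stop included).
    Illustration only: the reverse strand and introns are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)

# Tiny made-up sequence, not taken from the slide.
demo = "CCATGAAATTTTAGGG"
demo_orfs = find_orfs(demo)
```

Calling `find_orfs` on the slide's sequence would list candidate reading frames; deciding which candidate is the real gene is the part that needs evidence and curation.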
>>> transcript1 = s.get("from Transcript t where t.locus='EUKG_05092'")
>>> transcript1.length
798 # not just a field, but a live object
>>> transcript1.containsInFrameStop()
1
>>> overlaps(transcript1, transcript2)
0
Jython Interpreter Access
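The "live object" comment suggests that query results are full domain objects with behaviour, not bare rows. A minimal Python sketch of that idea follows. The `Transcript` class, its fields, `contains_in_frame_stop`, and this `overlaps` are hypothetical stand-ins written for illustration; they are not the platform's Java classes.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}

@dataclass
class Transcript:
    locus: str
    start: int      # 0-based, inclusive
    end: int        # exclusive
    sequence: str   # coding sequence, 5' to 3'

    @property
    def length(self):
        # Computed from coordinates rather than stored as a field.
        return self.end - self.start

    def contains_in_frame_stop(self):
        """True if a stop codon appears in frame before the final codon."""
        codons = [self.sequence[i:i + 3]
                  for i in range(0, len(self.sequence) - 2, 3)]
        return any(c.upper() in STOP_CODONS for c in codons[:-1])

def overlaps(a, b):
    """True if two transcripts share at least one base (same contig assumed)."""
    return a.start < b.end and b.start < a.end

# Made-up loci and sequences.
t1 = Transcript("HYPO_00001", 100, 112, "ATGAAATTTTAG")
t2 = Transcript("HYPO_00002", 200, 209, "ATGTAGTAA")
```

Here `t1.length` is derived on access, and `t2.contains_in_frame_stop()` flags the premature TAG, which is the kind of check the interpreter session above runs interactively.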
Analysis and Annotation scale
2004
● manual annotation and publication
● four genomes / year
2012
● high-throughput process, manually iterated
● thirty-six genomes in one day
Long-term Success
Over the past eight years, this Java/Jython analysis platform has handled:
● 10K+ genomes annotated
● 2M+ genes published
● 2M+ jobs distributed across 1000+ nodes
● 5TB+ of genomic data and analyses
previous publication platform
● deployed individual / small groups of genomes
● each one very customizable
● common data duplicated via cut / paste
● two hundred settings per genome
● five hundred settings per publication
web publishing changing scale
annual use
● 250k researchers
● 3M pageviews
our Java/Tapestry stack couldn't keep up
● slow response, render time :(
● restart Tomcat a lot
genomes en masse
olive.broadinstitute.org
● Oracle-to-Java data model via Hibernate
● RESTful Java data service
● Python data model layer (Toothpick)
● Python/Flask web service (olive)
olive navigation
toothpick: data abstraction layer
author: Andrew Roberts
● Modeling data from separate data sources
● Single models with live references to multiple sources
● open source coming 2012Q3-Q4
An analysis project is composed of
● genomes: annotations, analyses, etc.
● initiatives: grant info, sample tracking, status, etc.
toothpick: multiple sources
toothpick: models with [email protected]_model(ttl=86400)
@TopspinAdapter.collection("all", path="projects.json")
@TopspinAdapter.resource("id", path="projects/%s.json")
class Project(toothpick.Base):
    genomes = toothpick.has_many("Genome", data_field="genome_edition_ids")
    initiative = toothpick.belongs_to("Initiative", "squid_id", soft=True)

    def _display_title(self):
        return self.short_name
    ...
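The `ttl=86400` argument in the first decorator suggests that fetched models are cached and expire after a day (86400 seconds). How toothpick does this is not shown on the slide, so the following is only a sketch of a time-to-live cache decorator. `ttl_cached`, its `clock` parameter, and `fetch_project` are hypothetical names introduced for the example; the injectable clock exists so expiry can be exercised without waiting.

```python
import functools
import time

def ttl_cached(ttl, clock=time.monotonic):
    """Cache a function's results per argument tuple for `ttl` seconds."""
    def decorator(fetch):
        cache = {}  # args -> (expiry_time, value)

        @functools.wraps(fetch)
        def wrapper(*args):
            now = clock()
            hit = cache.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]                  # still fresh: skip the fetch
            value = fetch(*args)
            cache[args] = (now + ttl, value)   # store with a new expiry
            return value
        return wrapper
    return decorator

# Demo with a fake clock and a fake remote fetch (both made up).
calls = []
fake_now = [0.0]

@ttl_cached(ttl=86400, clock=lambda: fake_now[0])
def fetch_project(project_id):
    calls.append(project_id)   # record each real fetch
    return {"id": project_id}
```

Repeated calls to `fetch_project(1)` within the window hit the cache; once the fake clock passes the expiry, the next call fetches again.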
A genome is aware of what project it's part of:

views/[email protected]("/genomes/<project_url>.<int:version>")
def show_genome(project_url, version):
    genome = toothpick.fetch_model(
        genome_url_and_version=(project_url, version))

models/genomes/show.html.jinja
...
{{ funding_via_initiative(genome.project.initiative) }}
olive and toothpick: simple use
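The template expression `genome.project.initiative` walks from one model to another, even though the slides say the underlying records come from separate sources. One way such lazy associations can work is with Python descriptors that resolve a key on access. The sketch below is an assumption about the general technique, not toothpick's implementation: `Model`, `STORES`, `MODELS`, the record fields, and these versions of `belongs_to` and `has_many` are all stand-ins, and `soft=True` is imitated as "return None when the target is missing".

```python
# In-memory stand-ins for separate data sources, keyed by model name.
STORES = {
    "Initiative": {7: {"id": 7, "name": "Example initiative"}},
    "Project": {1: {"id": 1, "short_name": "demo", "squid_id": 7,
                    "genome_edition_ids": [10, 11]}},
    "Genome": {10: {"id": 10, "project_id": 1},
               11: {"id": 11, "project_id": 1}},
}
MODELS = {}

class Model:
    def __init__(self, data):
        self.data = data

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        MODELS[cls.__name__] = cls   # register so associations can find it

    def __getattr__(self, name):
        # Only called when normal lookup fails: fall back to record fields.
        try:
            return self.__dict__["data"][name]
        except KeyError:
            raise AttributeError(name)

    @classmethod
    def fetch(cls, key):
        record = STORES[cls.__name__].get(key)
        return cls(record) if record is not None else None

class belongs_to:
    """Resolve one related model lazily from a foreign-key field."""
    def __init__(self, model_name, key_field, soft=False):
        self.model_name, self.key_field, self.soft = model_name, key_field, soft

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        found = MODELS[self.model_name].fetch(obj.data.get(self.key_field))
        if found is None and not self.soft:
            raise LookupError(self.model_name)
        return found

class has_many:
    """Resolve a list of related models lazily from a list-of-ids field."""
    def __init__(self, model_name, data_field):
        self.model_name, self.data_field = model_name, data_field

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        ids = obj.data.get(self.data_field, [])
        return [MODELS[self.model_name].fetch(i) for i in ids]

class Initiative(Model):
    pass

class Project(Model):
    genomes = has_many("Genome", data_field="genome_edition_ids")
    initiative = belongs_to("Initiative", "squid_id", soft=True)

class Genome(Model):
    project = belongs_to("Project", "project_id")

genome = Genome.fetch(10)
initiative_name = genome.project.initiative.name
```

Because each hop is resolved only when the attribute is read, a view can fetch one genome and let the template pull in the project and initiative as needed.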
naming genes
Naming genes is hard.
Naming genes based on what people attach to the description field of other genes is harder.
naming example
"BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
genepidgin example
>>> orig_name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
>>> (cleaned_name, etymology) = gpidg.cleanup(orig_name)
>>> cleaned_name
"glycine/betaine/L-proline ABC transporter"
>>> etymology
filtered name in 4 steps:
...
4) reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
   pattern: [-,;]\s+(?!family)(?!superfamily).*
   filtered: glycine/betaine/L-proline ABC transporter
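The pattern reported in step 4 can be tried on its own with Python's `re` module. This sketch applies only that one filter, not genepidgin's full cleanup; `strip_trailing_note` and the input strings are made up for illustration, and the `before` value is a guess at what a name might look like just before step 4.

```python
import re

# Pattern copied from step 4 of the etymology output above.
TRAILING_NOTE = re.compile(r"[-,;]\s+(?!family)(?!superfamily).*")

def strip_trailing_note(name):
    """Delete from the first qualifying delimiter to the end of the name.

    A delimiter (dash, comma, or semicolon plus whitespace) qualifies
    unless 'family' or 'superfamily' directly follows it.
    """
    return TRAILING_NOTE.sub("", name)

# Hypothetical intermediate name, as it might look before step 4.
before = "glycine/betaine/L-proline ABC transporter, periplasmic-binding protein"
after = strip_trailing_note(before)

# The negative lookaheads leave family annotations alone.
kept = strip_trailing_note("sugar transporter, family 2")
```

The hyphen in "L-proline" is untouched because the pattern requires whitespace after the delimiter.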
genepidgin.sf.net
● open source and freely available since 2010
● named millions of genes
● being used to rename TIGRfam, one of the core hand-curated protein libraries
accordion: annotations over time
Navigating genomic data over time is challenging
● Scientists refer to individual genes (loci) in studies
● MCBG_00123.1
● Loci are database identifiers, but nobody owns the primary key index
Genes can be removed, added, split, merged:
● 1st: MCBG_00123.1
● 2nd: MCBG_00123.2
● 3rd: MCBG_00123.3 MCBG_02786.3
Loci are kind of the wild west
accordion: cooperative identifiers
accordion: annotations over time
● Can't fix loci, so fix the mechanical stuff that follows
● Match sequence to sequence, gene to gene, let people walk over genomes, over time
● Nobody's reference to shiga toxin will be lost or confused
● open source coming 2012Q4ish
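Accordion had not been released when this talk was given, so its method is not known from the slides. As one possible reading of "gene to gene" matching, the sketch below links loci between annotation releases by coordinate overlap, then walks an old locus forward through the links. `Gene`, `link_releases`, `walk_forward`, and the coordinates are invented for the example, and the third release is read as a split of the original gene. Real matching would also have to cope with sequences that change between assemblies.

```python
from collections import namedtuple

# end is exclusive; all genes are assumed to sit on one contig.
Gene = namedtuple("Gene", "locus start end")

def link_releases(old_genes, new_genes):
    """Map each old locus to the new loci whose coordinates overlap it."""
    links = {}
    for old in old_genes:
        links[old.locus] = [new.locus for new in new_genes
                            if old.start < new.end and new.start < old.end]
    return links

def walk_forward(locus, link_chain):
    """Follow a locus through successive release-to-release link maps."""
    current = [locus]
    for links in link_chain:
        following = []
        for name in current:
            for successor in links.get(name, []):
                if successor not in following:   # keep order, drop repeats
                    following.append(successor)
        current = following
    return current

# Made-up coordinates mirroring the slide's identifiers.
release1 = [Gene("MCBG_00123.1", 100, 400)]
release2 = [Gene("MCBG_00123.2", 100, 400)]
release3 = [Gene("MCBG_00123.3", 100, 220), Gene("MCBG_02786.3", 250, 400)]

chain = [link_releases(release1, release2), link_releases(release2, release3)]
descendants = walk_forward("MCBG_00123.1", chain)
```

With links like these, a paper that cites the first-release locus can still be resolved to both genes it became.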
Thanks!