Page 1: Genome Sequencing and Analysis Program Millions …files.meetup.com/469457/Millions of Genes with Python and...analysis platform has: 10K+ genomes annotated 2M+ genes published 2M+

Millions of Genes with Python and Jython

Clint Howarth
Janet Dewar, Maia Hansen, Jeffrey Larimer, Matthew Pearson, Andrew Roberts
Shailaja Gargeya, Jennifer Wortman, Cheryl Murphy, Bruce Birren

Analysis and Annotation Engineering
Genome Sequencing and Analysis Program
Broad Institute

Annotation and Analysis

Malaria
Tuberculosis
HIV
West Nile
Dengue fever
E. coli
Streptococcus
Staphylococcus aureus

Human Microbiome Project

Table of Contents: Applications

Internal
● java/jython
  analysis and publication platform
● olive
  web publication

Open Source
● toothpick
  data abstraction layer
● genepidgin
  gene names
● accordion
  genetics over time

Annotation sample problem

CCCCAAGCTCACTGATTGACGGTGCTCTGATTGCGCAACCAGACAACGACGACAATGAGGGTGCTACTGTCTTTTCTCATTATAGGGCTGCTAGGCATTGCCAGTGCCCTCAGCACCAGTGGAAACAGATTACTGGTGGTACTGGAGGAGCTTGCGGAGAAGGACAAGTACTCCAAGTTCTTTGGGGATCTGAAAGGCAAGCGCACGGCTGGGAGATAGAACTGGCAGGGACAACGGGATTCTAGCTAACTTTTTGAGGATTGTAGGTCGGGGCTTTGATATCACATTTGAATCTCCAAAGAGTGATAGTTTGGCGCTGTTCGAGCTTGGCGAGAGAGCTTATGATCACCTTCTCATCCTCCCATCTAAGTCGAAAGGTCAGCATCCATTGAACGACATTGGACTGTCGTGTGCTAATATAATAGGCCTGGGACCAAACCTCACGCCCCAAACCTTATTGAAGTTCATCAATACCGAGGGCAACATCCTGCTCACCCTATCTTCATCCAACCCGATACCATCAGCTCTCGTATCAGTTCTGCTTGAGCTTGACATCCATCTCCCCACCGACCGCAACTCGATAGTGGTCGACCACTTCAACTACGACAGCCTCTCGGCCCCCGAATCCCACGATGTCGTTCTTGTTCCCCGCCCAAGCGCTGTGCGCCCCGGTGTTCGCAACTTCTTCGGCGGCATCCTCAAGAACGAGGTTATCGCGTTCCCCCACGGCGTGGGCCAGACTCTAGGCAACGATAGCCCGTACTTGACACCGGTCCTTCGCGCCCCCGGCACGGCGTACTCATACAACCCCAAGGAGGAGGCCGAGGCTGTGGAGGACCCGTTCGCGGTTGGCCAGCAACTGTCCCTCGTGACCGCCATGCAGGCTCGCAACTCAGCGAGGTTCACTGTCTTGGGCGCAGCGGAAATGCTTGAGGATAAGTGGTTCAAGGGGAAGGTCCAAGTTGCTGGCGGCAAGGTTGCTGCGGCTGCGAATGAGGCGTTTGCGAAGGAGATCTCCGGATGGACTTTCAATGAGGCTGGAGTCCTGAAGGTGAAGTCTGTTACGCATTTCCTCAACGAAGAGGGGTTGAAAACACCCAATGCTTCATTGACGAACCCCAAGATCTACCGTGTCAAGAACACTGTTGTAAGTGGATTTTCTGTGCCAAATGTGAGAATCATATGCTAACTATGTCTAGACTTACTCGATTGAGCTATCTGAATGGTCGTGGAAGGAGTATGTACCCTTCGTACCCGCCACCGGTGATGATGTGCAGCTTGAGTTCTCTATGCTCTCGCCCTTCCACCGACTGAATCTGGAGCGCACTCAGACGAATCCTTCATCTAGCGTCTTCAGCACCACATTCAAGCTTCCAGATCAGCATGGAATCTTCAACTTCCTGGTCGAGTTCCGCCGCCCCTTCCTCTCGAACATCGAGGACAAGAAAACGGTCACCGTACGCCACTTTGCACACGATGAATGGCCACGCAGTTGGGTCATCAGCGCCGCGTGGCCATGGATCTCTGGCGTTGCGGTCACTGTTGTTGGATGGATTATATTCGTGGGATTGTGGTTGTACAGTGCCCCACCGACAGTGAAGGGAAAGAAGTCGCGATGAGAAGAGCTAGATGTTGCATTTGAGACGTAAACGGGACTGTATGAATACCAAAATCGTGTATAGATGATATAGTAGGCAGGAACACATGGCATGTCTGATTCCGAATAAATCGCTCGTATTGCCTTGCGCGCTTGTTGATTGTACACGGTTGTG

Annotation sample solution

[The slide repeats the sequence shown in the sample problem; any highlighting of the annotated features was lost in extraction.]
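
The slides do not say how the annotation was produced. As a rough illustration of the simplest version of the problem, the sketch below scans the forward strand for open reading frames (an ATG followed, in frame, by a stop codon). The function name `find_orfs`, the `min_codons` parameter, and the toy sequence are assumptions made for this example; a real pipeline would also have to handle the reverse strand, introns, and supporting evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop, including the ATG.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Toy input (hypothetical, not taken from the slide's sequence).
toy = "GGATGAAACCCTAGTT"
orfs = find_orfs(toy)
print(orfs)                          # [(2, 14)]
print([toy[s:e] for s, e in orfs])   # ['ATGAAACCCTAG']
```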

Jython Interpreter Access

>>> transcript1 = s.get("from Transcript t where t.locus='EUKG_05092'")
>>> transcript1.length
798  # not just a field, but live object
>>> transcript1.containsInFrameStop()
1
>>> overlaps(transcript1, transcript2)
0
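
The platform's Java classes are not shown in the slides, so the following is only a pure-Python sketch of what checks like `containsInFrameStop()` and `overlaps()` might compute. The `Transcript` dataclass, its fields, the coordinate convention, and the toy loci are all assumptions made for illustration.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class Transcript:
    locus: str
    start: int      # genomic start, 0-based inclusive (assumed convention)
    end: int        # genomic end, exclusive
    sequence: str   # coding sequence, read 5' to 3'

    @property
    def length(self):
        return len(self.sequence)

    def contains_in_frame_stop(self):
        """True if a stop codon appears before the final codon."""
        seq = self.sequence.upper()
        usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        return any(codon in STOP_CODONS for codon in codons[:-1])


def overlaps(a, b):
    """True if the two half-open genomic intervals share at least one base."""
    return a.start < b.end and b.start < a.end


# Hypothetical toy transcripts.
t1 = Transcript("toy_0001", 100, 112, "ATGTAAAAATAG")
t2 = Transcript("toy_0002", 200, 209, "ATGAAATAG")
print(t1.length)                    # 12
print(t1.contains_in_frame_stop())  # True
print(t2.contains_in_frame_stop())  # False
print(overlaps(t1, t2))             # False
```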

Analysis and Annotation scale

2004
manual annotation and publication
four genomes / year

2012
high-throughput process, manually iterated
thirty-six genomes in one day

Long-term Success

Over the past eight years, this Java/Jython analysis platform has:
● 10K+ genomes annotated
● 2M+ genes published
● 2M+ jobs distributed across 1000+ nodes
● 5TB+ of genomic data and analyses

web publishing changing scale

previous publication platform
deployed individual / small groups of genomes
each one very customizable
common data duplicated via cut / paste
two hundred settings per genome
five hundred settings per publication

genomes en masse

annual use
● 250k researchers
● 3M pageviews

our Java/Tapestry stack couldn't keep up
● slow response, render time :(
● restart tomcat a lot

olive.broadinstitute.org

● Oracle-to-Java data model via Hibernate

● RESTful Java data service

● Python data model layer (Toothpick)

● Python.Flask web service (olive)

olive navigation

toothpick: data abstraction layer

author: Andrew Roberts
Modeling data from separate data sources
Single models with live references to multiple sources
open source coming 2012Q3-Q4

toothpick: multiple sources

An analysis project is composed of
● genomes
  annotations, analyses, etc
● initiatives
  grant info, sample tracking, status, etc

toothpick: models with [email protected]_model(ttl=86400)
@TopspinAdapter.collection("all", path="projects.json")
@TopspinAdapter.resource("id", path="projects/%s.json")
class Project(toothpick.Base):
    genomes = toothpick.has_many("Genome", data_field="genome_edition_ids")
    initiative = toothpick.belongs_to("Initiative", "squid_id", soft=True)

    def _display_title(self):
        return self.short_name

    ...
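
toothpick had not been released when these slides were given, so its internals are not known here. The sketch below only illustrates one way that `has_many` and `belongs_to` style references could be resolved lazily across separate sources, using Python descriptors and in-memory dictionaries in place of remote adapters. Every class, the `store` and `registry` attributes, the sample records, and the reading of `soft=True` as "tolerate a missing target" are assumptions.

```python
class Base:
    """Tiny stand-in for a model base class; not toothpick's implementation."""
    registry = {}   # model name -> model class

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Base.registry[cls.__name__] = cls
        cls.store = {}   # key -> record; stands in for one remote source

    def __init__(self, **fields):
        self.__dict__.update(fields)

    @classmethod
    def fetch(cls, key):
        record = cls.store.get(key)
        return cls(**record) if record is not None else None


class has_many:
    def __init__(self, model_name, data_field):
        self.model_name, self.data_field = model_name, data_field

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        model = Base.registry[self.model_name]   # resolved by name, on access
        return [model.fetch(key) for key in getattr(obj, self.data_field)]


class belongs_to:
    def __init__(self, model_name, data_field, soft=False):
        self.model_name, self.data_field, self.soft = model_name, data_field, soft

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        model = Base.registry[self.model_name]
        found = model.fetch(getattr(obj, self.data_field, None))
        if found is None and not self.soft:
            raise LookupError(f"{self.model_name} not found")
        return found


class Genome(Base):
    pass


class Initiative(Base):
    pass


class Project(Base):
    genomes = has_many("Genome", data_field="genome_edition_ids")
    initiative = belongs_to("Initiative", "squid_id", soft=True)


# Hypothetical records, one dictionary per "source".
Genome.store.update({1: {"name": "genome A"}, 2: {"name": "genome B"}})
Initiative.store.update({"S1": {"title": "example initiative"}})
Project.store.update({
    "p1": {"short_name": "demo", "genome_edition_ids": [1, 2], "squid_id": "S1"},
    "p2": {"short_name": "orphan", "genome_edition_ids": [], "squid_id": "S9"},
})

project = Project.fetch("p1")
print([g.name for g in project.genomes])   # ['genome A', 'genome B']
print(project.initiative.title)            # example initiative
print(Project.fetch("p2").initiative)      # None
```

Resolving the target class by name at access time lets models refer to each other regardless of definition order, which is presumably why the slide passes "Genome" and "Initiative" as strings.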

olive and toothpick: simple use

A genome is aware of what project it's part of:

views/[email protected]("/genomes/<project_url>.<int:version>")
def show_genome(project_url, version):
    genome = toothpick.fetch_model(
        genome_url_and_version=(project_url, version))

models/genomes/show.html.jinja
...
{{ funding_via_initiative(genome.project.initiative) }}

naming genes

Naming genes is hard
Naming genes based on what people attach to the description field of other genes is harder

naming example

"BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"

genepidgin example

>>> orig_name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
>>> (cleaned_name, etymology) = gpidg.cleanup(orig_name)
>>> cleaned_name
"glycine/betaine/L-proline ABC transporter"
>>> etymology
filtered name in 4 steps:
...
4) reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
   pattern: [-,;]\s+(?!family)(?!superfamily).*
   filtered: glycine/betaine/L-proline ABC transporter
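
The slide shows only the fourth filtering step. To make the idea of an ordered rule list with a recorded "etymology" concrete, here is a minimal sketch using Python's `re` module. Only the last pattern is taken from the slide; the first three rules, the `RULES` table, and this `cleanup` function are hypothetical stand-ins chosen so that the example reaches the same cleaned name, and they are not genepidgin's actual rules or code.

```python
import re

# (reason, pattern, replacement). Only the final pattern comes from the slide;
# the first three rules are assumptions made for this illustration.
RULES = [
    ("drop leading accession-like token", r"^[A-Z]{2}\d+\s+", ""),
    ("drop trailing organism in brackets", r"\s*\[[^\]]*\]\s*$", ""),
    ("shorten 'transport protein'", r"\btransport protein\b", "transporter"),
    ("delete notes after commas, dashes, semicolons",
     r"[-,;]\s+(?!family)(?!superfamily).*", ""),
]


def cleanup(name):
    """Apply each rule in order; return (cleaned_name, etymology)."""
    etymology = []
    for reason, pattern, replacement in RULES:
        filtered = re.sub(pattern, replacement, name)
        if filtered != name:   # record only the rules that changed something
            etymology.append(
                {"reason": reason, "pattern": pattern, "filtered": filtered})
            name = filtered
    return name, etymology


orig_name = ("BT002689 glycine/betaine/L-proline ABC transport protein, "
             "periplasmic-binding protein "
             "[Desulfovibrio desulfuricans subsp. desulfuricans str. G20]")
cleaned_name, etymology = cleanup(orig_name)
print(cleaned_name)     # glycine/betaine/L-proline ABC transporter
print(len(etymology))   # 4
```

The hyphen in "L-proline" survives the last rule because the pattern requires whitespace after the separator, and the two negative lookaheads leave ", family ..." and ", superfamily ..." suffixes in place.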

genepidgin.sf.net

open source and freely available since 2010
named millions of genes
being used to rename TIGRfam, one of the core hand-curated protein libraries

accordion: annotations over time

Navigating genomic data over time is challenging
● Scientists refer to individual genes (loci) in studies
● MCBG_00123.1
● Loci are database identifiers, but nobody owns the primary key index

accordion: cooperative identifiers

Genes can be removed, added, split, merged:
● 1st: MCBG_00123.1
● 2nd: MCBG_00123.2
● 3rd: MCBG_00123.3 MCBG_02786.3

Loci are kind of the wild west
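
As a small illustration of the identifiers involved, the sketch below splits a locus such as MCBG_00123.3 into its parts. It assumes the layout prefix, underscore, number, and an optional ".version" suffix, which is inferred from the examples on these slides rather than from any published specification; `Locus`, `parse_locus`, and `history` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    prefix: str
    number: int
    version: Optional[int]   # None when the identifier has no ".N" suffix


# Assumed layout: PREFIX_NUMBER[.VERSION], e.g. MCBG_00123.3 or EUKG_05092.
LOCUS_PATTERN = re.compile(r"^([A-Za-z0-9]+)_(\d+)(?:\.(\d+))?$")


def parse_locus(text):
    match = LOCUS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised locus: {text!r}")
    prefix, number, version = match.groups()
    return Locus(prefix, int(number), int(version) if version else None)


# The example from the slide: one locus in the first two versions,
# two loci in the third.
history = {
    1: ["MCBG_00123.1"],
    2: ["MCBG_00123.2"],
    3: ["MCBG_00123.3", "MCBG_02786.3"],
}
for version, loci in history.items():
    print(version, [parse_locus(locus).number for locus in loci])
# 1 [123]
# 2 [123]
# 3 [123, 2786]
```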

accordion: annotations over time

Can't fix loci, so fix the mechanical stuff that follows
Match sequence to sequence, gene to gene, let people walk over genomes, over time
Nobody's reference to shiga toxin will be lost or confused
open source coming 2012Q4ish
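
accordion's matching method is not described in the slides. The sketch below shows only the simplest possible form of "match sequence to sequence": linking loci in two annotation releases whose sequences are exactly identical, so that a gene can be followed even when its identifier changes. The function name, the release dictionaries, and the toy sequences are assumptions; a real tool would need alignment to tolerate edits and to detect splits and merges.

```python
from collections import defaultdict


def match_by_sequence(old_release, new_release):
    """Map each old locus to the new loci that carry an identical sequence.

    Both arguments are dicts of locus -> sequence. Exact matching only.
    """
    by_sequence = defaultdict(list)
    for locus, sequence in new_release.items():
        by_sequence[sequence].append(locus)
    return {
        locus: sorted(by_sequence.get(sequence, []))
        for locus, sequence in old_release.items()
    }


# Hypothetical toy releases: one gene keeps its number, one is renumbered,
# and one disappears.
release_1 = {
    "MCBG_00123.1": "ATGAAATAG",
    "MCBG_00456.1": "ATGCCCTGA",
    "MCBG_00789.1": "ATGTTTTAA",
}
release_2 = {
    "MCBG_00123.2": "ATGAAATAG",
    "MCBG_02786.2": "ATGCCCTGA",
}

links = match_by_sequence(release_1, release_2)
print(links["MCBG_00123.1"])   # ['MCBG_00123.2']
print(links["MCBG_00456.1"])   # ['MCBG_02786.2']
print(links["MCBG_00789.1"])   # []
```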

Thanks!