Accelerate Pharmaceutical R&D with Big Data and MongoDB
DESCRIPTION
Introduction of disruptive technologies, including use of unstructured data, is critical to Pharmaceutical R&D. We will explore how MongoDB can be used to accelerate this. We will also have an open discussion with panel members who are using MongoDB in this space.
TRANSCRIPT
Mongo Boston 2013
Jason Tetrault Architect - AstraZeneca
Accelerate Pharmaceutical R&D with Big Data and MongoDB
AstraZeneca at a glance
We are a global, innovation-led biopharmaceutical company with a mission to make a meaningful difference to patient health through great medicines and a belief that health connects us all
Global Targeted Collaborative
Committed to driving business success responsibly
• 57,000 people
• Sales in 100 countries
• Manufacturing in 16
• R&D across 3 continents
• $4 bn invested in R&D
• $33 bn sales in 2011
Constantly anticipating and adapting to the needs of a changing world.
• Cancer
• Cardiovascular
• Gastrointestinal
• Infection
• Neuroscience
• Respiratory & inflammation
Driving continued innovation where we can make the most difference.
• HCPs
• Patients
• Payers
• Regulators
• Partners
• Local communities
Connecting with others to achieve common goals in improving healthcare.
Architect: R&D Information
What does this mean?
• Support the Researchers
• AstraZeneca has multiple iMeds that are focused on different areas of R&D
• Specifically, I work with the Oncology and Infection iMeds here in Waltham
• Support different software and system builds and / or purchases
• Looking to apply new technologies to enable Researchers
• Core Focus:
  • Next Generation Sequencing Scaling
  • IAAS
  • Big Data Pilots and Exploration
Introduction of Disruptive Technology:
Step 1: Introduce Concepts
• What
  • Unstructured Data
  • NoSQL
    • Categories (Document, Key Value, Graph)
  • Hadoop
  • Map Reduce
  • Horizontal Scalability
  • Cloud (IAAS and SAAS)
• How
  • Lunch and Learns
  • Examples (Craigslist uses this)
  • “Big Cookies for Big Data”
  • Demonstrations
Introduction of Disruptive Technology:
Step 2: Pilots
• Goals:
  • We needed to show what “Unstructured Data” actually means.
  • We needed to prove what these technologies can and cannot do for us.
  • Find something difficult and make it easy!
  • We needed to find the best way to enable researchers.
How quickly can I make indirect associations between gene sequence features and structural fingerprints?
Iterative Agile Analytics
Gather, Aggregate, Analyze
Data sources: Compound, Assay Results, Target mappings, Gene Catalog
Decorate
JSON:
• Compound with Fingerprints
• Gene sequence
• Target mappings
• Assay results
Pivot (Map Reduce): Fingerprint with compounds
Matrix: Tanimoto matrix, Gene matrix
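The pivot step above (from "Compound with Fingerprints" to "Fingerprint with compounds") can be sketched in plain JavaScript. This is only an illustration of the map and reduce idea, not the pilot's actual job: the document shape, the `fingerprints` field, and the function names are all assumptions.

```javascript
// Hypothetical compound documents; the field names are illustrative only.
const compounds = [
  { _id: "C1", fingerprints: ["fp1", "fp2"] },
  { _id: "C2", fingerprints: ["fp2", "fp3"] },
  { _id: "C3", fingerprints: ["fp1", "fp2", "fp3"] },
];

// Map step: emit one [fingerprint, compound id] pair per fingerprint.
function mapCompound(doc) {
  return doc.fingerprints.map((fp) => [fp, doc._id]);
}

// Reduce step: collect every compound id emitted for the same fingerprint.
function pivotByFingerprint(docs) {
  const byFingerprint = new Map();
  for (const doc of docs) {
    for (const [fp, compoundId] of mapCompound(doc)) {
      if (!byFingerprint.has(fp)) byFingerprint.set(fp, []);
      byFingerprint.get(fp).push(compoundId);
    }
  }
  // One document per fingerprint, listing the compounds that carry it.
  return [...byFingerprint].map(([fp, ids]) => ({ _id: fp, compounds: ids }));
}

const pivoted = pivotByFingerprint(compounds);
console.log(JSON.stringify(pivoted, null, 2));
```

The same two functions map fairly directly onto a map/reduce job, where the reduce output becomes a second collection keyed by fingerprint.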
• Easily convert to JSON and import an initial cut of data from different sources (e.g. spreadsheets, RDBMS, …)
• Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
• Ability to decorate data, adding fields and additional datastores quickly
• (300K compounds) – 200 GB, (1.4M fingerprints) – 1 GB, (500M pairs) – 81 GB
• Now scale up to 4M compounds, 20K assays … and more decoration – 5 to 50 TB
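For readers unfamiliar with the Tanimoto matrix named in the pipeline, here is a minimal sketch of the Tanimoto coefficient (shared features divided by total distinct features) and a pairwise matrix. It assumes each compound is represented as a set of fingerprint feature identifiers; the slides do not say how the pilot stored or computed this, and the sample data is made up.

```javascript
// Tanimoto coefficient of two feature lists: |A ∩ B| / |A ∪ B|.
function tanimoto(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  for (const feature of setA) {
    if (setB.has(feature)) shared += 1;
  }
  const union = setA.size + setB.size - shared;
  // Two empty fingerprints have no features to compare; treat as 0.
  return union === 0 ? 0 : shared / union;
}

// Pairwise similarity matrix for a list of fingerprints.
function tanimotoMatrix(fingerprints) {
  return fingerprints.map((row) => fingerprints.map((col) => tanimoto(row, col)));
}

// Hypothetical fingerprints for three compounds.
const fingerprints = [
  [1, 2, 3],
  [2, 3, 4],
  [7, 8],
];
const matrix = tanimotoMatrix(fingerprints);
console.log(matrix);
```

The matrix grows with the square of the number of compounds, which is consistent with the pair count dominating the compound count in the sizes listed above.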
Introduction of Disruptive Technology:
Pilot Findings
• Tech Findings:
  • GSON can help with weird character conversions.
  • Per-node write limits (500 per second), but you can save a bunch of documents at once (change to bulk insert).
• Users think that even though they could do it relationally, this was way quicker.
• Using arrays for multiple results in a doc can be interesting.
• JSON and JavaScript are fairly natural to technical researchers (Python).
• We are not alone…
  • Davy Suvee
  • tranSMART
  • Seven Bridges
  • …
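The bulk insert finding above can be sketched as follows: group documents into batches and send each batch in one call instead of one call per document. The batch size, the `toBatches` helper, and the `fakeCollection` stand-in are assumptions for illustration; in the mongo shell, `db.collection.insert()` accepts an array of documents, so each batch could be passed to it directly.

```javascript
// Split an array of documents into batches of at most batchSize.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Stand-in for a collection so the sketch runs without a database.
// Against MongoDB, this is where the batch would be inserted in one call.
const fakeCollection = {
  calls: 0,
  stored: 0,
  insert(batch) {
    this.calls += 1;
    this.stored += batch.length;
  },
};

// Hypothetical documents to load.
const docsToLoad = Array.from({ length: 2500 }, (_, i) => ({ compoundId: i }));

for (const batch of toBatches(docsToLoad, 1000)) {
  fakeCollection.insert(batch);
}
console.log(fakeCollection.calls, fakeCollection.stored);
```

With a batch size of 1000, the 2500 documents go out in three calls rather than 2500.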
Next Generation Sequencing:
Driving Question:
How many other cancer types that I have processed have the same variation as the cancer type I am working on?
Can we predict which drug is most effective against specific tumors?
Fairly Inaccurate Overview of Genetics Processing
A 2-Minute Oversimplification of a Really Hard Problem
• Sequencing
• Alignment (HG19)
• Downstream Processing (Variant) (HG19)
Can I Process 88 Whole Human Genomes?
Researcher: I would like to process 88 public genomic samples from cancer patients. They are whole human genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.
Amazon StarCluster: Elastic HPC Infrastructure
• Shared Storage: scripts, programs, reference
• Elastic Node Expansion: Compute
• Local Storage Processing
• Result offload to S3
• Transition to Glacier
Tech:
• 200 GB raw uncompressed fastq per experiment
• 176 Genome Pipelines to process
• Each “pipeline” runs on an m1.xlarge
• We ran 4 runs of ~3.5 days on 50 nodes
• Total processed data in the pipeline may be 5X per experiment
• Could expand to 10X or more for more complex pipelines
• ~86 GB result average to save
• Stored in S3 / Glacier
• Totals:
  • ~171 TB Total Processed Storage
  • ~14,784 hours of processing
  • ~15 TB of results
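The totals above can be reproduced from the per-pipeline figures. The slides do not show the derivation, so this worked arithmetic rests on two assumptions that happen to match the stated numbers: every pipeline takes about 3.5 days, and 1 TB is counted as 1024 GB.

```javascript
// Figures taken from the slide.
const patients = 88;
const sequencesPerPatient = 2;   // one tumor, one normal
const rawGbPerPipeline = 200;    // raw uncompressed fastq
const processedMultiplier = 5;   // "5X per experiment"
const resultGbPerPipeline = 86;  // average result size
const daysPerPipeline = 3.5;     // assumption: each pipeline runs ~3.5 days

// Assumption: 1 TB = 1024 GB.
const GB_PER_TB = 1024;

const pipelines = patients * sequencesPerPatient;
const processedTb = (pipelines * rawGbPerPipeline * processedMultiplier) / GB_PER_TB;
const processingHours = pipelines * daysPerPipeline * 24;
const resultTb = (pipelines * resultGbPerPipeline) / GB_PER_TB;

console.log({ pipelines, processedTb, processingHours, resultTb });
```

This gives 176 pipelines, 171.875 TB processed, 14,784 hours, and about 14.8 TB of results, in line with the rounded totals on the slide. At the 10X multiplier mentioned for more complex pipelines, processed storage would double.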
A Possible Vision for Experiment Management
• Partners
• Storage
• GenePattern
• Big Data Store
• Inbound
• Seven Bridges
• Genome Upload / Curation
• Pipeline Engines
• Long Term Storage
• Experiment Management / Metadata Management
• Partner Integration
• Big Data Storage and Analytics
• Services
NGS Data
Explants
Tumors – FFPE
Tumors – fresh frozen
Cell lines
RNA-Seq
Expression
Variants
DNA-Seq
Amplicon
Coding and non-coding variants
Whole exome
Coding variants
Whole genome
New Target ID
Patient stratification
Biomarkers for prognosis, drug response, safety
Mechanism of drug action
Mechanism of disease
Let's look at a Variant …
Another Area Mongo May Help
VCF Format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
VCF as JSON
Header and Variant Information
{ "_id" : ObjectId("52617b613004b77f64efed62"), "ALT" : [ "A" ], "QUAL" : "29", "NA00001" : "0|0:48:1:51,51", "POS" : 14370, "NA00002" : "1|0:48:8:51,51", "FILTER" : "PASS", "CHROM" : "20", "NA00003" : "1/1:43:5:.,.", "FORMAT" : "GT:GQ:DP:HQ", "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27", "ID" : "rs6054257", "INFO" : { "DP" : "14", "AF" : "0.5", "NS" : "3" }, "REF" : "G"}
{ "_id" : ObjectId("52617b613004b77f64efed67"), "phasing" : "partial", "fileformat" : "VCFv4.1", "fileDate" : "20090805", "source" : "myImputationProgramV3.1", "FORMAT" : { "Description" : "\"Haplotype Quality\"", "Type" : "Integer", "Number" : "2", "ID" : "HQ" }, "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27", "contig" : { "species" : "\"Homo sapiens\"", "assembly" : "B36", "md5" : "f126cdf8a6e0c7f379d618ff66beb2da", "length" : "62435964", "ID" : "20", "taxonomy" : "x" }, "INFO" : { "Description" : "\"HapMap2 membership\"", "Type" : "Flag", "Number" : "0", "ID" : "H2" }, "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta", "FILTER" : { "Description" : "\"Less than 50% of samples have data\"", "ID" : "s50" }}
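To show how a VCF data line could become a document shaped like the variant above, here is a minimal sketch in plain JavaScript. It is not the converter used in the talk: `parseVcfLine` is a hypothetical helper, it splits on whitespace rather than strictly on tabs, and it keeps only `key=value` INFO entries (dropping flags such as DB and H2) because that is what the example document appears to contain. The `_id` and `__vcfid` fields are left out.

```javascript
// Turn one VCF data line into a plain object keyed by the #CHROM header columns.
function parseVcfLine(headerLine, dataLine) {
  const columns = headerLine.replace(/^#/, "").trim().split(/\s+/);
  const values = dataLine.trim().split(/\s+/);

  const doc = {};
  columns.forEach((name, i) => {
    doc[name] = values[i];
  });

  // POS becomes a number so range queries work; ALT becomes an array
  // because a site can list several alternate alleles.
  doc.POS = parseInt(doc.POS, 10);
  doc.ALT = doc.ALT.split(",");

  // INFO "NS=3;DP=14;AF=0.5;DB;H2" becomes { NS: "3", DP: "14", AF: "0.5" }.
  // Entries without "=" (flags) are skipped in this sketch.
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const eq = entry.indexOf("=");
    if (eq > 0) info[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  doc.INFO = info;
  return doc;
}

const header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003";
const line = "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.";
const variantDoc = parseVcfLine(header, line);
console.log(JSON.stringify(variantDoc, null, 2));
```

Storing POS as a number and ALT as an array is what makes the range and variant queries in the next section straightforward.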
Query
Search Variant Ranges
// Here is our range definition
var begin = 10000;
var end = 10200;

// The chromosome field is fuzzy in format, so we use a regex
var chromosome = ".*17$";
var variant = "A";

// Query for range and chromosome.
db.publicvariants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome} })
db.variants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome} })

// Query for a specific variant in a range
db.publicvariants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome}, "ALT":variant})
db.variants.find( {"POS":{$gte: begin, $lt: end}, "CHROM":{$regex : chromosome}, "ALT":variant})
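To make the matching rules of these queries concrete, the sketch below applies the same three conditions to an in-memory array: POS in the half-open range from begin (inclusive) to end (exclusive), CHROM matching the regex, and ALT containing the variant. The sample documents and the `matchesVariantQuery` helper are hypothetical; this only mirrors the query semantics and does not talk to MongoDB.

```javascript
// Returns true when a document would satisfy the range/chromosome/variant query.
function matchesVariantQuery(doc, { begin, end, chromosome, variant }) {
  const inRange = doc.POS >= begin && doc.POS < end;            // $gte / $lt
  const onChromosome = new RegExp(chromosome).test(doc.CHROM);  // $regex
  // Matching a scalar against an array field succeeds if any element is equal.
  const hasVariant = variant === undefined || doc.ALT.includes(variant);
  return inRange && onChromosome && hasVariant;
}

// Hypothetical variant documents with differently formatted CHROM values.
const sampleVariants = [
  { CHROM: "17", POS: 10050, ALT: ["A"] },
  { CHROM: "chr17", POS: 10199, ALT: ["G", "A"] },
  { CHROM: "chr17", POS: 10200, ALT: ["A"] }, // excluded: end is exclusive
  { CHROM: "7", POS: 10100, ALT: ["A"] },     // excluded: wrong chromosome
  { CHROM: "chr17", POS: 10100, ALT: ["T"] }, // excluded only when a variant is given
];

const rangeQuery = { begin: 10000, end: 10200, chromosome: ".*17$" };
const inRangeOn17 = sampleVariants.filter((d) => matchesVariantQuery(d, rangeQuery));
const withVariantA = sampleVariants.filter((d) =>
  matchesVariantQuery(d, { ...rangeQuery, variant: "A" })
);
console.log(inRangeOn17.length, withVariantA.length);
```

The regex is what lets both "17" and "chr17" match, which is the fuzziness the comment in the query refers to.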
Wrap Up and Panel
• Thanks
  • Todd Nelson, Rajan Desai
  • Sebastien Lefebvre, Robin Brouwer
  • Sara Dempster
• Panel
  • Deniz Kural: Founder and CEO – SevenBridges
• Code:
  • https://github.com/jjtetrault/bio-mongo
The Panel
…