Accelerate Pharmaceutical R&D with Big Data and MongoDB

Mongo Boston 2013

Jason Tetrault, Architect, AstraZeneca

AstraZeneca at a glance

We are a global, innovation-led biopharmaceutical company with a mission to make a meaningful difference to patient health through great medicines and a belief that health connects us all

Global Targeted Collaborative

Committed to driving business success responsibly

• 57,000 people
• Sales in 100 countries
• Manufacturing in 16
• R&D across 3 continents
• $4 bn invested in R&D
• $33 bn sales in 2011

Constantly anticipating and adapting to the needs of a changing world.

• Cancer
• Cardiovascular
• Gastrointestinal
• Infection
• Neuroscience
• Respiratory & inflammation

Driving continued innovation where we can make the most difference.

• HCPs
• Patients
• Payers
• Regulators
• Partners
• Local communities

Connecting with others to achieve common goals in improving healthcare.

Architect: R&D Information

What does this mean?

• Support the Researchers
• AstraZeneca has multiple iMeds that are focused on different areas of R&D
• Specifically, I work with the Oncology and Infection iMeds here in Waltham
• Support different software and system builds and/or purchases
• Looking to apply new technologies to enable Researchers
• Core Focus:
  • Next Generation Sequencing Scaling
  • IAAS
  • Big Data Pilots and Exploration

Introduction of Disruptive Technology:

Step 1: Introduce Concepts

• What
  • Unstructured Data
  • NoSQL
    • Categories (Document, Key Value, Graph)
  • Hadoop
  • Map Reduce
  • Horizontal Scalability
  • Cloud (IAAS and SAAS)
• How
  • Lunch and Learns
  • Examples (Craigslist uses this)
  • “Big Cookies for Big Data”
  • Demonstrations


Step 2: Pilots

• Goals:
  • We needed to show what “Unstructured Data” actually means.
  • We needed to prove what these technologies can and cannot do for us.
  • Find something difficult and make it easy!
  • We needed to find the best way to enable researchers.

How quickly can I make indirect associations between gene sequence features and structural fingerprints?


Iterative Agile Analytics

[Diagram: Gather, Aggregate, Analyze]

• Data sources: Compound, Assay Results, Gene Catalog, Target mappings
• Decorate (JSON):
  • Compound with Fingerprints
  • Gene sequence
  • Target mappings
  • Assay results
• Pivot (Map Reduce): Fingerprint with compounds
• Matrix: Tanimoto matrix, Gene matrix
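The pivot and Tanimoto steps named in the diagram can be sketched in plain JavaScript. This is a minimal illustration, not the pilot's actual Map Reduce job: the compound IDs, the fingerprint feature names, and the function names `pivotByFingerprint` and `tanimoto` are hypothetical. It assumes each compound document carries an array of fingerprint features and uses the standard Tanimoto coefficient (shared features divided by total distinct features).

```javascript
// Hypothetical compound documents: each lists the fingerprint features it contains.
const compounds = [
  { id: "C1", fingerprints: ["fp1", "fp2", "fp3"] },
  { id: "C2", fingerprints: ["fp2", "fp3", "fp4"] },
  { id: "C3", fingerprints: ["fp5"] },
];

// Pivot: compound -> fingerprints becomes fingerprint -> compounds.
// This mirrors a map step (emit fingerprint, compound id)
// followed by a reduce step (collect the ids per fingerprint).
function pivotByFingerprint(docs) {
  const byFingerprint = {};
  for (const doc of docs) {
    for (const fp of doc.fingerprints) {
      if (!byFingerprint[fp]) byFingerprint[fp] = [];
      byFingerprint[fp].push(doc.id);
    }
  }
  return byFingerprint;
}

// Tanimoto coefficient: |A intersect B| / |A union B|.
function tanimoto(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  for (const x of setA) if (setB.has(x)) shared++;
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

const pivoted = pivotByFingerprint(compounds);
console.log(pivoted.fp2); // [ 'C1', 'C2' ]
console.log(tanimoto(compounds[0].fingerprints, compounds[1].fingerprints)); // 0.5
```

The pivoted structure is what makes pairwise comparison cheap to set up: only compounds that share at least one fingerprint need a Tanimoto score at all.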

• Easily convert to JSON and import an initial cut of data from different sources (e.g. spreadsheets, RDBMS, …)
• Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
• Ability to decorate data, adding fields and additional datastores quickly

• 300K compounds: 200 GB
• 1.4M fingerprints: 1 GB
• 500M pairs: 81 GB

Now scale up to 4M compounds, 20K assays … and more decoration: 5 to 50 TB
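The convert-then-decorate loop described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the pilot's import code: the column names, the sample rows, the target mapping, and the helpers `csvToDocs` and `decorate` are all hypothetical, and the CSV split is deliberately naive (no quoted fields or embedded commas).

```javascript
// Convert a small CSV export (header row + data rows) into JSON documents.
// Naive split: assumes no quoted fields or embedded commas.
function csvToDocs(csvText) {
  const lines = csvText.trim().split(/\r?\n/);
  const headers = lines[0].split(",").map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const values = line.split(",").map((v) => v.trim());
    const doc = {};
    headers.forEach((h, i) => { doc[h] = values[i]; });
    return doc;
  });
}

// Decorate: add a field from another source without any schema change.
function decorate(docs, targetsByCompound) {
  return docs.map((doc) => ({
    ...doc,
    targets: targetsByCompound[doc.compound_id] || [],
  }));
}

const csv = "compound_id,assay,result\nC1,A7,0.42\nC2,A7,1.30";
const targetMappings = { C1: ["EGFR"] }; // hypothetical target mapping
const docs = decorate(csvToDocs(csv), targetMappings);
console.log(JSON.stringify(docs[0]));
// {"compound_id":"C1","assay":"A7","result":"0.42","targets":["EGFR"]}
```

Values stay as strings here; converting them to numbers would be one more "rinse, wash, repeat" pass once the shape of the data is understood.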

http://www.mongodb.org/




Pilot Findings

• Tech Findings:
  • GSON can help with weird character conversions.
  • Per-node write limits (500 per second), but you can save a bunch of documents at once (change to bulk insert).
• Users think that even though they could do it relationally, this was way quicker.
• Using arrays for multiple results in a doc can be interesting.
• JSON and JavaScript are fairly natural to technical researchers (Python).
• We are not alone…
  • Davy Suvee
  • tranSMART
  • Seven Bridges
  • …
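The bulk-insert finding can be illustrated with a small batching helper. This is a sketch only: the helper name `chunk`, the batch size of 1,000, and the sample documents are assumptions, not values from the pilot. The helper just groups documents; the actual write would go through a call that accepts a list of documents (for example, the mongo shell's `db.collection.insert()` accepts an array).

```javascript
// Split a list of documents into fixed-size batches so that each
// round trip to the database carries many documents instead of one.
function chunk(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// 2,500 hypothetical documents in batches of 1,000:
// 3 write calls instead of 2,500.
const sampleDocs = Array.from({ length: 2500 }, (_, i) => ({ n: i }));
const batches = chunk(sampleDocs, 1000);
console.log(batches.map((b) => b.length)); // [ 1000, 1000, 500 ]
```

If the per-node limit counts write operations rather than documents, batching like this is what turns a cap of hundreds of writes per second into many thousands of documents per second; the right batch size would need to be measured.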

Next Generation Sequencing:

Driving Question:

How many other cancer types that I have processed have the same variation as the cancer type I am working on?

Can we predict which drug is most effective against specific tumors?

Fairly Inaccurate Overview of Genetics Processing

A 2-Minute Oversimplification of a Really Hard Problem

Fairly Inaccurate Overview of Genetics Processing: Sequencing

Fairly Inaccurate Overview of Genetics Processing: Alignment

HG19

Fairly Inaccurate Overview of Genetics Processing: Downstream Processing (Variant)

HG19

Can I Process 88 Whole Human Genomes?

Researcher: I would like to process 88 public genomic samples from cancer patients. They are whole human genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.

Amazon, StarCluster: Elastic HPC Infrastructure

• Shared Storage: scripts, programs, reference
• Elastic Node Expansion: compute
• Local Storage: processing
• Result offload to S3, transition to Glacier

Tech:
• 200 GB raw uncompressed fastq per experiment
• 176 genome pipelines to process
• Each “pipeline” runs on an m1.xlarge
• We ran 4 runs of ~3.5 days on 50 nodes
• Total processed data in the pipeline may be 5X per experiment
  • Could expand to 10X or more for more complex pipelines
• ~86 GB result average to save
  • Stored in S3 / Glacier
• Totals:
  • ~171 TB total processed storage
  • ~14,784 hours of processing
  • ~15 TB of results
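The totals on this slide can be reproduced from the per-pipeline figures. The sketch below is a back-of-the-envelope check, not the original calculation: it assumes that one "experiment" corresponds to one genome pipeline and that 1 TB = 1024 GB, which is the conversion that yields the ~171 TB figure.

```javascript
// Inputs taken from the slide.
const patients = 88;
const genomesPerPatient = 2;   // tumor + normal
const rawGBPerGenome = 200;    // raw uncompressed fastq
const expansionFactor = 5;     // processed data relative to raw
const resultGBPerGenome = 86;  // average result size
const daysPerPipeline = 3.5;

const pipelines = patients * genomesPerPatient;            // 176
const processingHours = pipelines * daysPerPipeline * 24;  // 14784

// Assumption: 1 TB = 1024 GB.
const processedTB = (pipelines * rawGBPerGenome * expansionFactor) / 1024;
const resultsTB = (pipelines * resultGBPerGenome) / 1024;

console.log(pipelines, processingHours); // 176 14784
console.log(processedTB.toFixed(1));     // 171.9
console.log(resultsTB.toFixed(1));       // 14.8
```

The hours figure counts one m1.xlarge per pipeline for the full ~3.5 days; four runs on 50 nodes provide 200 slots, enough for the 176 pipelines.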

A Possible Vision for Experiment Management

[Diagram]
• Services: Genome Upload / Curation; Pipeline Engines; Long Term Storage; Experiment Management / Metadata Management; Partner Integration; Big Data Storage and Analytics
• Components: Inbound, Partners, Storage, GenePattern, Seven Bridges, Big Data Store

NGS Data

[Diagram]
• NGS data from: explants, tumors (FFPE), tumors (fresh frozen), cell lines
• RNA-Seq: expression, variants
• DNA-Seq: amplicon, whole exome, whole genome (coding variants; coding and non-coding variants)
• New Target ID
• Patient stratification
• Biomarkers for prognosis, drug response, safety
• Mechanism of drug action
• Mechanism of disease


Let's look at a Variant …

Another Area Mongo May Help


VCF Format


##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

VCF as JSON

Header and Variant Information


{
  "_id" : ObjectId("52617b613004b77f64efed62"),
  "ALT" : [ "A" ],
  "QUAL" : "29",
  "NA00001" : "0|0:48:1:51,51",
  "POS" : 14370,
  "NA00002" : "1|0:48:8:51,51",
  "FILTER" : "PASS",
  "CHROM" : "20",
  "NA00003" : "1/1:43:5:.,.",
  "FORMAT" : "GT:GQ:DP:HQ",
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "ID" : "rs6054257",
  "INFO" : { "DP" : "14", "AF" : "0.5", "NS" : "3" },
  "REF" : "G"
}

{
  "_id" : ObjectId("52617b613004b77f64efed67"),
  "phasing" : "partial",
  "fileformat" : "VCFv4.1",
  "fileDate" : "20090805",
  "source" : "myImputationProgramV3.1",
  "FORMAT" : { "Description" : "\"Haplotype Quality\"", "Type" : "Integer", "Number" : "2", "ID" : "HQ" },
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "contig" : { "species" : "\"Homo sapiens\"", "assembly" : "B36", "md5" : "f126cdf8a6e0c7f379d618ff66beb2da", "length" : "62435964", "ID" : "20", "taxonomy" : "x" },
  "INFO" : { "Description" : "\"HapMap2 membership\"", "Type" : "Flag", "Number" : "0", "ID" : "H2" },
  "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta",
  "FILTER" : { "Description" : "\"Less than 50% of samples have data\"", "ID" : "s50" }
}
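One way a VCF data line could be turned into a variant document of the shape shown above is sketched here. This is not the converter from the talk: the function name `vcfLineToDoc`, the `FIXED` column list, and the placeholder `"example-id"` are hypothetical. The sketch assumes a FORMAT column is present, splits on whitespace (the slide text uses spaces, while real VCF files are tab-delimited), and skips INFO flags that have no `=` (such as `DB` and `H2`), which matches the variant document shown above.

```javascript
// Column names for the eight fixed VCF fields plus FORMAT.
const FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"];

// Parse one VCF data line into a document shaped like the variant above.
// sampleNames come from the #CHROM header line of the file.
function vcfLineToDoc(line, sampleNames, vcfId) {
  const cols = line.trim().split(/\s+/);
  const doc = { __vcfid: vcfId };
  FIXED.forEach((name, i) => { doc[name] = cols[i]; });

  doc.POS = parseInt(doc.POS, 10); // numeric, so range queries work
  doc.ALT = doc.ALT.split(",");    // one entry per alternate allele

  // INFO: "NS=3;DP=14;AF=0.5;DB;H2" -> { NS: "3", DP: "14", AF: "0.5" }
  // Assumption: flag entries without "=" are skipped.
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const eq = entry.indexOf("=");
    if (eq > 0) info[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  doc.INFO = info;

  // The remaining columns are per-sample genotype strings.
  sampleNames.forEach((name, i) => { doc[name] = cols[FIXED.length + i]; });
  return doc;
}

const samples = ["NA00001", "NA00002", "NA00003"];
const line =
  "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 " +
  "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.";
const variantDoc = vcfLineToDoc(line, samples, "example-id");
console.log(variantDoc.POS, variantDoc.ALT, variantDoc.INFO);
```

Storing `POS` as a number and `ALT` as an array is what lets a range condition and a single-allele match be expressed directly in a query.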

Query

Search Variant Ranges


// Here is our range definition
var begin = 10000;
var end = 10200;

// The chromosome field is fuzzy in format, so we use a regex
var chromosome = ".*17$";
var variant = "A";

// Query for range and chromosome position.
db.publicvariants.find({ "POS": { $gte: begin, $lt: end }, "CHROM": { $regex: chromosome } })
db.variants.find({ "POS": { $gte: begin, $lt: end }, "CHROM": { $regex: chromosome } })

// Query for a specific variant in a range
db.publicvariants.find({ "POS": { $gte: begin, $lt: end }, "CHROM": { $regex: chromosome }, "ALT": variant })
db.variants.find({ "POS": { $gte: begin, $lt: end }, "CHROM": { $regex: chromosome }, "ALT": variant })
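To see which documents the range queries above would select, the same conditions can be written as a plain JavaScript filter over an in-memory array. This is a sketch for reasoning about the matching rules, not a replacement for the database query; the sample documents and the helper name `matchesVariant` are hypothetical. It reflects two MongoDB behaviors: an equality condition on an array field matches when any element equals the value, and a `$regex` pattern is only anchored where the pattern itself says so.

```javascript
const begin = 10000;
const end = 10200;
const chromosome = new RegExp(".*17$");
const variant = "A";

// Mirrors:
// {"POS": {$gte: begin, $lt: end}, "CHROM": {$regex: chromosome}, "ALT": variant}
function matchesVariant(doc) {
  return (
    doc.POS >= begin &&
    doc.POS < end &&
    chromosome.test(doc.CHROM) &&
    doc.ALT.includes(variant) // array field: any element may match
  );
}

// Hypothetical documents for illustration only.
const sample = [
  { CHROM: "chr17", POS: 10050, ALT: ["A"] },
  { CHROM: "17", POS: 10199, ALT: ["G", "A"] },
  { CHROM: "17", POS: 10200, ALT: ["A"] }, // excluded: POS is not < end
  { CHROM: "20", POS: 10100, ALT: ["A"] }, // excluded: wrong chromosome
  { CHROM: "17", POS: 10100, ALT: ["T"] }, // excluded: different allele
];

console.log(sample.filter(matchesVariant).length); // 2
```

Note that `.*17$` accepts any value ending in "17", so both "17" and "chr17" match; normalizing `CHROM` at import time would allow an exact match instead of a regex.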

Wrap Up and Panel


• Thanks
  • Todd Nelson, Rajan Desai
  • Sebastien Lefebvre, Robin Brouwer
  • Sara Dempster
• Panel
  • Deniz Kural: Founder and CEO, Seven Bridges
• Code:
  • https://github.com/jjtetrault/bio-mongo

The Panel

…


