BioCompute & SciDB: a pipeline-in-a-database

© Paradigm4 2

Topics

• Why a pipeline-in-database? Copy NCBI
• SciDB: a scientific data storage and computing platform
• BioCompute Example: Group-based somatic mutation calling

3.2 billion columns, millions of rows, analysis-ready, not in files

Typical Research Workflow

[Figure: workflow diagram with stages Pipeline Data Generation, LOAD, COMPUTE and Data Exploration]

Typical Research Workflow as BioCompute objects

[Figure: the same workflow diagram (LOAD, COMPUTE, Data Exploration)]

Group based variants – big pileups

• Workflow
  – merge multiple BAMs
  – sort reads by genomic coordinate

• 8.8 TB (2 x 4.4 TB flash drives) needed for the sort, in addition to the distributed file system

• San Diego Supercomputer Center

What did they find in Human Genomes

Optimum was ~ 15 whole genomes at a time

Known variants (dark blue and dark green)

Novel variants (light blue and light green)

Attribute                Single file x 100s   Grouped Files   Concordance
Total variants           30,790,918           29,915,861      81.4%
Unique variants          2,668,331            3,543,283       MHC, Y, Tele
Minor Allele Frequency   1-5%                 <1%
Ti/Tv (2.19 is ideal)    1.2                  1.6
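The Ti/Tv row compares transitions (A<->G, C<->T) with transversions (all other single-base changes). A minimal Python sketch of that ratio follows; the function name and the toy variant list are hypothetical and are not taken from the study.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def ti_tv_ratio(snvs):
    """snvs: list of (ref_base, alt_base) pairs with ref_base != alt_base."""
    ti = sum(1 for ref, alt in snvs if frozenset((ref, alt)) in TRANSITIONS)
    tv = len(snvs) - ti  # every non-transition SNV is a transversion
    return ti / tv if tv else float("inf")

# Toy input (made up for illustration): 4 transitions, 2 transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G"), ("T", "C")]
ratio = ti_tv_ratio(toy_snvs)
print(ratio)
```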

SciDB: a scalable Scientific, Computational DBMS

SciDB blurs the line between storage and computation

In-situ, massively scalable analytics

Scientific data are stored natively as multi-dimensional arrays

[Figure: multi-dimensional array with axes Genomic coordinate, Chromosome and Patient]

Pipeline data generation

For reproducibility and re-analysis:
• Minimize data movement and copying
• Data stored in analysis-ready form
• Metadata stored with data (BioCompute Requirement)
• Rapid selection of specific data of interest

[Figure: queries QueryF(x, y, z), QueryG(x, y, z) and QueryT(x, y, z) run inside SciDB against arrays like the one below, indexed by a SAMPLES dimension and a GENES dimension]

        AKT1   EGFR   TP53   ZNF11
TCGA1   3.2    0.3    0.1    11
TCGA2   3.4    0.0    0.8    10
TCGA3   1.1    1.0    1.2    14
TCGA4   0.9    1.0    1.2    13
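To make "rapid selection of specific data of interest" concrete, here is a minimal in-memory Python sketch of selecting a sub-array by dimension coordinates. It uses the 4 x 4 example values from this slide and assumes rows are samples and columns are genes. It is a plain-Python stand-in, not SciDB's query language, and the `select` helper is hypothetical.

```python
# Dimension coordinates and cell values from the slide's example array.
# Assumption: rows follow the SAMPLES dimension, columns the GENES dimension.
samples = ["TCGA1", "TCGA2", "TCGA3", "TCGA4"]
genes = ["AKT1", "EGFR", "TP53", "ZNF11"]
values = [
    [3.2, 0.3, 0.1, 11],
    [3.4, 0.0, 0.8, 10],
    [1.1, 1.0, 1.2, 14],
    [0.9, 1.0, 1.2, 13],
]

def select(sample_subset, gene_subset):
    """Return the sub-array addressed by the requested coordinates."""
    rows = [samples.index(s) for s in sample_subset]
    cols = [genes.index(g) for g in gene_subset]
    return [[values[r][c] for c in cols] for r in rows]

subset = select(["TCGA2", "TCGA4"], ["EGFR", "TP53"])
print(subset)
```

Because the data are addressed by coordinates rather than by file, the same selection can be repeated with different samples or genes without reloading anything.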

Reproducible research

• Curate once
• Explore many times
• By many concurrent users
• Enforces data integrity
• Versions data
• Track and trace data and queries

[Figure: Users 1 to 4 each run Data Exploration against a single shared SciDB instance]

SciDB & BioCompute Objects

SciDB

• Data loaders enforce type & field constraints
• Arrays are versioned and time-stamped
• Database log tracks all parameters, data changes, user actions
• Utility could represent queries as JSON objects

Group-based Somatic Point Mutation Calling

• Reveals statistically significant minority somatic mutations

• Technique provides the FDA with an explanation and proof of the repeatability, traceability and accuracy of the variant calling pipeline

Simultaneous large group pile-ups provide more accurate identification and accommodation of sequencing errors*

* Standish, et al. BMC Bioinformatics (2015) 16:304

Somatic Point Variant Caller

[Figure: data flow of the caller]

• Inputs: BAM 1, BAM 2, BAM…
• Per BAM: count A, C, T, G, ? per position
• Aggregate total across BAMs: pile up over ALL BAM files in a single large array
• Per-position stats: join with reference arrays and compute stats
• Reference arrays (some examples): Reference Genome, Known Variants, Quality Regions
• Filter based on: Base Quality, Read Mapping Quality, Coverage
• Output: filtered calls per position
• Test using spike-in
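The per-BAM counting and cross-BAM aggregation steps can be sketched in a few lines of Python. This is an illustrative stand-in under simplifying assumptions: each read is a hypothetical (start position, sequence) pair with no indels, and the function names are invented here. In the deck, the pileup lives in a SciDB array rather than in Python dictionaries.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def count_bases(reads):
    """Count A, C, G, T and '?' per position for one BAM's reads.

    reads: iterable of (start_position, sequence); indels are ignored (assumption).
    """
    counts = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            key = base if base in BASES else "?"  # anything else is an unknown call
            counts[start + offset][key] += 1
    return counts

def aggregate(per_bam_counts):
    """Sum per-position counts across all BAMs into one pileup."""
    total = defaultdict(Counter)
    for counts in per_bam_counts:
        for pos, base_counts in counts.items():
            total[pos].update(base_counts)
    return total

# Toy reads, made up for illustration.
bam1 = [(100, "ACGT"), (101, "CGTA")]
bam2 = [(100, "ACNT"), (102, "TT")]
pileup = aggregate([count_bases(bam1), count_bases(bam2)])
coverage_102 = sum(pileup[102].values())
print(dict(pileup[102]), coverage_102)
```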

Methodology

• Formulate the caller as a simple signal/noise filter

• Tunable and explainable ROC (receiver operating characteristic) curves

– Control false positive and false negative rates by experimenting with settings for noise and filter thresholds without having to reload data

– Can use ‘spiked in’ data to provide a known answer to guide parameter setting for variant calling QA
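A minimal Python sketch of the threshold sweep described above: given a per-position noise statistic and a set of spiked-in positions that serve as the known answer, each threshold yields one point on the ROC curve. The function name, the toy ratios and the thresholds are assumptions for illustration, not values from the study.

```python
def roc_points(ratios, spiked, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    ratios: {position: per-position statistic, e.g. second call ratio}
    spiked: set of positions known to carry a spiked-in variant
    """
    negatives = set(ratios) - spiked
    points = []
    for t in thresholds:
        called = {pos for pos, r in ratios.items() if r >= t}
        tpr = len(called & spiked) / len(spiked)
        fpr = len(called & negatives) / len(negatives)
        points.append((t, fpr, tpr))
    return points

# Toy data, made up for illustration.
ratios = {1: 0.001, 2: 0.004, 3: 0.02, 4: 0.05, 5: 0.003, 6: 0.012}
spiked = {3, 4, 6}
points = roc_points(ratios, spiked, [0.002, 0.01, 0.03])
for t, fpr, tpr in points:
    print(f"threshold={t}: FPR={fpr:.2f} TPR={tpr:.2f}")
```

Only the threshold changes between iterations, which is why the sweep needs no data reload.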

Base-level PHRED Score Distribution

Phred Score   Probability of incorrect call   Call accuracy
10            1 in 10                         90%
20            1 in 100                        99%
30            1 in 1000                       99.9%
40            1 in 10,000                     99.99%
50            1 in 100,000                    99.999%
60            1 in 1,000,000                  99.9999%
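The table follows from the standard Phred definition, Q = -10 * log10(P), where P is the probability of an incorrect call. A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} incorrect, accuracy {100 * (1 - p):.4f}%")
```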

MAPQ Scores and Coverage

• Histograms show % of bases excluded at specific thresholds
• PHRED and MAPQ thresholds affect coverage distribution

Effect of PHRED & MAPQ thresholds on noise

Thresholding PHRED and MAPQ scores has the effect of "shifting" the noise to the lower allele frequency band

[Figure: Noise Histogram. x-axis: Second Call Ratio (count of the 2nd most common base / total coverage at that position); y-axis: Density]
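The second call ratio plotted here is the count of the second most common base divided by the total coverage at that position. The Python sketch below computes it before and after applying PHRED and MAPQ thresholds. The observation tuples, threshold values and function name are hypothetical, chosen only to show how thresholding can move a position toward the lower band.

```python
from collections import Counter

def second_call_ratio(observations, min_phred=0, min_mapq=0):
    """observations: list of (base, phred, mapq) seen at one position."""
    kept = Counter(
        base for base, phred, mapq in observations
        if phred >= min_phred and mapq >= min_mapq
    )
    total = sum(kept.values())
    if total == 0:
        return 0.0
    ranked = kept.most_common()  # bases sorted by descending count
    second = ranked[1][1] if len(ranked) > 1 else 0
    return second / total

# Toy position: 96 confident A calls, 3 low-quality C calls, 1 confident C call.
obs = [("A", 60, 60)] * 96 + [("C", 12, 60)] * 3 + [("C", 55, 60)]
raw = second_call_ratio(obs)
filtered = second_call_ratio(obs, min_phred=50)
print(raw, filtered)
```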

Low-complexity filter or low-confidence filter does not reduce noise further

[Figure: two density histograms of Second Call Ratio, one after the low complexity filter and one after the low confidence region filter]

BioCompute PiD Concept

"id": "obj.1243",
"type": "biocompute",
"name": "SciDB GIAB minority clone variant call SciDB PiD",
#author,created,…
"parametric_domain": {
    "phred_threshold": "50",
    "coverage_threshold": "450",
    …
},
"scidb_domain": {
    "hostname": "clust_scidb_01",
    "db_user": "apoliakov",
    …
    "arrays": [{
        "id": "obj.1243",
        "name": "GIAB_CLONALITY.BAM_DATA",
        "schema_type": "Multi-sample BAM Pileup Array",
        "size": "78.5TB",
        "created": "Jan 10 2017 11:57:34",
        …

JSON can be stored directly in the database
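As a rough illustration of treating the PiD as a JSON document, the Python sketch below builds an object from only the fields visible on this slide, serializes it to text (the form that could be stored) and reads a parameter back. The elided fields are left out rather than guessed, and nothing here is SciDB's actual storage API.

```python
import json

# Only fields shown on the slide; elided ("…") fields are deliberately omitted.
pid = {
    "id": "obj.1243",
    "type": "biocompute",
    "name": "SciDB GIAB minority clone variant call SciDB PiD",
    "parametric_domain": {
        "phred_threshold": "50",
        "coverage_threshold": "450",
    },
    "scidb_domain": {
        "hostname": "clust_scidb_01",
        "arrays": [
            {
                "id": "obj.1243",
                "name": "GIAB_CLONALITY.BAM_DATA",
                "schema_type": "Multi-sample BAM Pileup Array",
                "size": "78.5TB",
            }
        ],
    },
}

text = json.dumps(pid, indent=2, sort_keys=True)  # text form suitable for storage
restored = json.loads(text)                       # round trip back to a dict
phred_threshold = int(restored["parametric_domain"]["phred_threshold"])
print(phred_threshold)
```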

BioCompute PiD Concept

"scidb_domain": {
    …
    "arrays": [{
        "id": "obj.1243",
        "name": "GIAB_CLONALITY.BAM_DATA",
        "schema_type": "Multi-sample BAM Array",
        "size": "78.5TB",
        "created": "Jan 10 2017 11:57:34",
        "modified": …
    }, {
        "id": "obj.1244",
        "name": "GIAB_CLONALITY.BAM_PILEUP",
        "schema_type": "BAM per-sample filtered BAM pileup",
        "size": "350GB",
        "created": "Jan 10 2017 18:67:34",
        …
    },
    ...
}

BioCompute PiD Concept

"execution_domain": [{
    "id": "obj.1237",
    "env_parameters": "example",
    "location": "/workflows/giab_clonality.R",
    "platform": "SciDB-R",
    ...
}]

• Execution points to SciDB query scripts as before

• Optional Array Version History

"arrays": [{
    "name": "GIAB_CLONALITY.BAM_DATA",
    "versions": [{
        "version_id": "3",
        "created": "Jan 10 2017 11:59:34",
    }, ...

BioCompute & SciDB: a pipeline-in-a-database

Zachary Pitluk, Ph.D., V.P. Life Sciences & [email protected]
