BioCompute & SciDB a pipeline-in-a-database


Post on 09-Aug-2020


Page 1: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database

BioCompute & SciDB: a pipeline-in-a-database


© Paradigm4 2

Topics

• Why a pipeline-in-database? Copy NCBI
• SciDB: a scientific data storage and computing platform
• BioCompute Example: Group-based somatic mutation calling

3.2 billion columns, millions of rows, analysis-ready, not in files


Typical Research Workflow

[Figure: workflow diagram with the stages Data Exploration, Pipeline Data Generation, COMPUTE, and LOAD]


Typical Research Workflow as BioCompute objects

[Figure: workflow diagram with the stages Data Exploration, Pipeline Data Generation, COMPUTE, and LOAD]


Group-based variants – big pileups

• Workflow
– merge multiple BAMs
– sort reads by genomic coordinate

• 8.8 TB (2 x 4.4 TB flash drives) needed for the sort, in addition to the distributed file system

• San Diego Supercomputer Center

Page 7: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database

© Paradigm4 7

What did they find in human genomes?

Optimum was ~ 15 whole genomes at a time

Known variants (dark blue and dark green)

Novel variants (light blue and light green)

Attribute               Single file x 100s   Grouped files   Concordance
Total variants          30,790,918           29,915,861      81.4%
Unique variants         2,668,331            3,543,283       MHC, Y, Tele
Minor allele frequency  1-5%                 <1%
Ti/Tv (2.19 is ideal)   1.2                  1.6
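The Ti/Tv figures above are ratios of transitions (A<->G, C<->T) to transversions (all other single-base substitutions). A minimal sketch of that calculation follows; the function name and the example calls are hypothetical and are not data from the slide.

```python
# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, ignore
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ti / tv


# Hypothetical calls for illustration only: 3 transitions, 2 transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
example_ratio = ti_tv_ratio(example_snvs)
```

Computing the ratio separately for the single-file and grouped call sets would give the two numbers compared in the last row.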


SciDB: a scalable scientific, computational DBMS

SciDB blurs the line between storage and computation

In-situ, massively scalable analytics

Scientific data are stored natively as multi-dimensional arrays

[Figure: multi-dimensional array with the dimensions Genomic coordinate, Chromosome, and Patient]


Pipeline data generation

For reproducibility and re-analysis
• Minimize data movement and copying
• Data stored in analysis-ready form
• Metadata stored with data (BioCompute Requirement)
• Rapid selection of specific data of interest

SciDB

[Figure: Query F(x, y, z), Query G(x, y, z), and Query T(x, y, z) run against arrays stored in SciDB. Each array has a SAMPLES dimension (rows) and a GENES dimension (columns):]

         AKT1   EGFR   TP53   ZNF11
TCGA1    3.2    0.3    0.1    11
TCGA2    3.4    0.0    0.8    10
TCGA3    1.1    1.0    1.2    14
TCGA4    0.9    1.0    1.2    13
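The idea of storing data in analysis-ready form and selecting rapidly by dimension can be sketched with the small SAMPLES x GENES array from this slide. This is plain Python standing in for an array database, not SciDB's query language. The `select` helper is hypothetical, and assigning row i to sample i and column j to gene j is an assumption about the figure's layout.

```python
GENES = ["AKT1", "EGFR", "TP53", "ZNF11"]
SAMPLES = ["TCGA1", "TCGA2", "TCGA3", "TCGA4"]

# ASSUMPTION: rows follow SAMPLES order and columns follow GENES order.
VALUES = [
    [3.2, 0.3, 0.1, 11],
    [3.4, 0.0, 0.8, 10],
    [1.1, 1.0, 1.2, 14],
    [0.9, 1.0, 1.2, 13],
]

# Each cell is addressed by its coordinates along the two dimensions.
array = {
    (sample, gene): VALUES[i][j]
    for i, sample in enumerate(SAMPLES)
    for j, gene in enumerate(GENES)
}


def select(cells, samples=None, genes=None):
    """Return the cells whose coordinates fall inside the requested sets.

    Passing None for a dimension means "no restriction on that dimension".
    """
    return {
        (s, g): value
        for (s, g), value in cells.items()
        if (samples is None or s in samples) and (genes is None or g in genes)
    }


tp53_all_samples = select(array, genes={"TP53"})
corner = select(array, samples={"TCGA1", "TCGA2"}, genes={"AKT1", "EGFR"})
```

Because every cell carries its coordinates, a selection is a filter on dimension values rather than a scan over files.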


Reproducible research

• Curate once
• Explore many times
• By many concurrent users
• Enforces data integrity
• Versions data
• Track and trace data and queries

[Figure: User 1, User 2, User 3, and User 4 each run Data Exploration against a single SciDB instance]


SciDB & BioCompute Objects

SciDB

• Data loaders enforce type & field constraints
• Arrays are versioned and time-stamped
• Database log tracks all parameters, data changes, user actions
• Utility could represent queries as JSON objects
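To make "versioned and time-stamped" concrete, here is a toy sketch of an array store in which every write creates a new version with a creation time. The `VersionedArray` class and its methods are hypothetical and say nothing about how SciDB implements versioning. The array name is borrowed from the BioCompute example later in the deck.

```python
from datetime import datetime, timezone


class VersionedArray:
    """Toy stand-in for a versioned, time-stamped array (not SciDB's implementation)."""

    def __init__(self, name):
        self.name = name
        self._versions = []  # list of (version_id, created, data) tuples

    def store(self, data):
        """Save a snapshot as a new version and return its 1-based version id."""
        version_id = len(self._versions) + 1
        created = datetime.now(timezone.utc)
        self._versions.append((version_id, created, dict(data)))
        return version_id

    def read(self, version_id=None):
        """Return a copy of the requested version (latest when version_id is None)."""
        if version_id is None:
            version_id = len(self._versions)
        if not 1 <= version_id <= len(self._versions):
            raise LookupError(f"{self.name} has no version {version_id}")
        return dict(self._versions[version_id - 1][2])

    def history(self):
        """Describe every version as a JSON-friendly dictionary."""
        return [
            {"version_id": str(vid), "created": created.isoformat()}
            for vid, created, _ in self._versions
        ]


bam_data = VersionedArray("GIAB_CLONALITY.BAM_DATA")
first = bam_data.store({100: "A"})
second = bam_data.store({100: "A", 101: "C"})
```

Earlier versions stay readable after later writes, which is what lets an analysis be re-run against exactly the data it originally saw.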


Group-based Somatic Point Mutation Calling

• Reveals statistically significant minority somatic mutations

• Technique provides the FDA with an explanation and proof of the repeatability, traceability, and accuracy of the variant-calling pipeline

Simultaneous large group pile-ups provide more accurate identification and accommodation of sequencing errors*

* Standish, et al. BMC Bioinformatics (2015) 16:304


Somatic Point Variant Caller

Input: BAM 1, BAM 2, BAM…

Pile up over ALL BAM files in a single large array. Count A, C, T, G, ? per position:
1: Per BAM
2: Aggregate total across BAMs

Join + compute stats against reference arrays (some examples):
• Reference genome
• Known variants
• Quality regions

Per-position stats are filtered to give filtered calls per position

Filter based on:
• Base quality
• Read mapping quality
• Coverage

Test using spike-in
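The two counting steps of the caller (count A, C, T, G, ? per position for each BAM, then aggregate across BAMs) can be sketched in plain Python. Reads are modelled as hypothetical `(start_position, sequence)` pairs with ungapped alignment, and "?" is assumed to mean any base other than A, C, G, or T. Real BAM parsing is left out.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def count_bases(reads):
    """Step 1: per-BAM counts of A, C, G, T and '?' at each position.

    ASSUMPTION: each read is (start_position, sequence) and aligns without gaps.
    """
    counts = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence.upper()):
            symbol = base if base in BASES else "?"
            counts[start + offset][symbol] += 1
    return counts


def aggregate(per_bam_counts):
    """Step 2: sum the per-position counts across all BAMs."""
    total = defaultdict(Counter)
    for counts in per_bam_counts:
        for position, counter in counts.items():
            total[position].update(counter)
    return total


# Hypothetical reads for two BAMs, for illustration only.
bam1 = [(100, "ACGT"), (101, "CGTA")]
bam2 = [(100, "ACNT"), (102, "TT")]
pileup = aggregate([count_bases(bam1), count_bases(bam2)])
coverage = {position: sum(c.values()) for position, c in pileup.items()}
```

The per-position totals in `pileup` are the quantities that later steps would join against reference arrays and filter.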


Methodology

• Formulate the caller as a simple signal/noise filter

• Tunable and explainable receiver operating characteristic (ROC) curves

– Control false-positive and false-negative rates by experimenting with settings for noise and filter thresholds without having to reload data

– Can use ‘spiked in’ data to provide a known answer that guides parameter setting for variant-calling QA
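A minimal sketch of this methodology, under stated assumptions: a position is called when its alternate-allele fraction clears a noise threshold and its coverage clears a minimum, and a spiked-in truth set turns each threshold setting into one point on a ROC curve. All names and numbers are hypothetical, except that the coverage threshold of 450 echoes the value in the BioCompute example later in the deck.

```python
def call_variants(sites, min_fraction, min_coverage):
    """Return positions whose alt-allele fraction and coverage pass both thresholds."""
    return {
        position
        for position, (alt_count, coverage) in sites.items()
        if coverage >= min_coverage and alt_count / coverage >= min_fraction
    }


def rates(called, truth, all_positions):
    """Return (true-positive rate, false-positive rate) against a known answer."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(all_positions) - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Hypothetical per-position data: position -> (alt-allele count, coverage).
sites = {
    1: (50, 1000),  # spiked-in variant at 5%
    2: (20, 1000),  # spiked-in variant at 2%
    3: (8, 1000),   # sequencing noise
    4: (2, 1000),   # sequencing noise
    5: (30, 100),   # noise at a low-coverage position
}
spiked_in = {1, 2}

# Sweep the noise threshold over data already in memory; nothing is reloaded.
roc_points = []
for min_fraction in (0.001, 0.01, 0.03):
    called = call_variants(sites, min_fraction, min_coverage=450)
    tpr, fpr = rates(called, spiked_in, set(sites))
    roc_points.append((min_fraction, tpr, fpr))
```

In this toy data the loosest threshold admits noise as false positives and the strictest one misses a spiked-in variant, which is the trade-off a ROC curve makes visible.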


Base-level PHRED Score Distribution

Phred score   Probability of incorrect call   Call accuracy
10            1 in 10                         90%
20            1 in 100                        99%
30            1 in 1000                       99.9%
40            1 in 10,000                     99.99%
50            1 in 100,000                    99.999%
60            1 in 1,000,000                  99.9999%
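The rows above follow from the standard Phred definition, Q = -10 log10(P), where P is the probability of an incorrect call. A short sketch of the conversion in both directions, with illustrative function names:

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# Rebuild the table: (Phred score, error probability, call accuracy).
phred_table = [
    (q, phred_to_error_probability(q), 1 - phred_to_error_probability(q))
    for q in (10, 20, 30, 40, 50, 60)
]
```

A threshold such as a Phred score of 50 therefore keeps only bases whose estimated chance of being wrong is about 1 in 100,000.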


MAPQ Scores and Coverage

• Histograms show % of bases excluded at specific thresholds
• PHRED and MAPQ thresholds affect coverage distribution


Effect of PHRED & MAPQ thresholds on noise

Thresholding PHRED and MAPQ scores has the effect of "shifting" the noise to the lower allele frequency band

[Figure: Noise Histogram. x-axis: Second Call Ratio (count of 2nd most common base / total coverage at that position); y-axis: Density]
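The second call ratio is the count of the second most common base divided by total coverage at a position. The sketch below computes it before and after PHRED and MAPQ thresholds are applied. The observation tuples, threshold values, and helper names are hypothetical and chosen only to illustrate the shift toward lower ratios.

```python
def filtered_counts(observations, min_phred, min_mapq):
    """Count bases at one position, keeping observations that pass both thresholds."""
    counts = {}
    for base, phred, mapq in observations:
        if phred >= min_phred and mapq >= min_mapq:
            counts[base] = counts.get(base, 0) + 1
    return counts


def second_call_ratio(base_counts):
    """Count of the 2nd most common base / total coverage at that position."""
    coverage = sum(base_counts.values())
    if coverage == 0:
        return 0.0
    ranked = sorted(base_counts.values(), reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0
    return second / coverage


# Hypothetical (base, PHRED, MAPQ) observations at a single position.
observations = (
    [("A", 60, 60)] * 95   # confident reference calls
    + [("G", 12, 60)] * 4  # low-quality calls, likely sequencing errors
    + [("G", 55, 60)] * 1  # one high-quality alternate call
)

unfiltered = second_call_ratio(filtered_counts(observations, 0, 0))
thresholded = second_call_ratio(filtered_counts(observations, 50, 30))
```

Here the thresholds drop the four low-quality G calls, so the ratio falls from 5/100 to 1/96.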


Low-complexity filter or low-confidence filter does not reduce noise further

[Figure: two panels, "Low complexity filter" and "Low confidence region filter". x-axis: Second Call Ratio; y-axis: Density]


BioCompute PiD Concept

"id": "obj.1243",
"type": "biocompute",
"name": "SciDB GIAB minority clone variant call SciDB PiD",
#author,created,…
"parametric_domain": {
    "phred_threshold": "50",
    "coverage_threshold": "450",
    …
},
"scidb_domain": {
    "hostname": "clust_scidb_01",
    "db_user": "apoliakov",
    …
    "arrays": [{
        "id": "obj.1243",
        "name": "GIAB_CLONALITY.BAM_DATA",
        "schema_type": "Multi-sample BAM Pileup Array",
        "size": "78.5TB",
        "created": "Jan 10 2017 11:57:34",
        …

JSON can be stored directly in the database
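As a rough illustration of treating the object as ordinary JSON, the sketch below builds the fields shown on this slide as a Python dictionary, serializes it, and reads the thresholds back as numbers. Elided fields ("…") are left out rather than guessed, and `numeric_parameters` is a hypothetical helper. No database call is shown.

```python
import json

# Field names and values copied from the slide; elided fields are omitted.
biocompute_object = {
    "id": "obj.1243",
    "type": "biocompute",
    "name": "SciDB GIAB minority clone variant call SciDB PiD",
    "parametric_domain": {
        "phred_threshold": "50",
        "coverage_threshold": "450",
    },
    "scidb_domain": {
        "hostname": "clust_scidb_01",
        "db_user": "apoliakov",
        "arrays": [
            {
                "id": "obj.1243",
                "name": "GIAB_CLONALITY.BAM_DATA",
                "schema_type": "Multi-sample BAM Pileup Array",
                "size": "78.5TB",
                "created": "Jan 10 2017 11:57:34",
            }
        ],
    },
}

# Serialize to text (the form that would be stored) and parse it back.
serialized = json.dumps(biocompute_object, indent=2, sort_keys=True)
restored = json.loads(serialized)


def numeric_parameters(obj):
    """Convert the string-valued thresholds to integers (hypothetical helper)."""
    return {key: int(value) for key, value in obj["parametric_domain"].items()}


parameters = numeric_parameters(restored)
```

Keeping the parameters next to the array identifiers means a re-run can recover both the settings and the data they were applied to from one record.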


BioCompute PiD Concept

"scidb_domain": {
    …
    "arrays": [{
        "id": "obj.1243",
        "name": "GIAB_CLONALITY.BAM_DATA",
        "schema_type": "Multi-sample BAM Array",
        "size": "78.5TB",
        "created": "Jan 10 2017 11:57:34",
        "modified": …
    }, {
        "id": "obj.1244",
        "name": "GIAB_CLONALITY.BAM_PILEUP",
        "schema_type": "BAM per-sample filtered BAM pileup",
        "size": "350GB",
        "created": "Jan 10 2017 18:67:34",
        …
    }, ...
}


BioCompute PiD Concept

"execution_domain": [{
    "id": "obj.1237",
    "env_parameters": "example",
    "location": "/workflows/giab_clonality.R",
    "platform": "SciDB-R",
    ...
}]

• Execution points to SciDB query scripts as before

• Optional Array Version History

"arrays": [{
    "name": "GIAB_CLONALITY.BAM_DATA",
    "versions": [{
        "version_id": "3",
        "created": "Jan 10 2017 11:59:34",
    }, ...


BioCompute & SciDB: a pipeline-in-a-database

Zachary Pitluk, Ph.D., V.P. Life Sciences & [email protected]