![Page 1: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/1.jpg)
BioCompute & SciDB
a pipeline-in-a-database
![Page 2: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/2.jpg)
© Paradigm4 2
Topics
• Why a pipeline-in-database? Copy NCBI
• SciDB: a scientific data storage and computing platform
• BioCompute Example: Group-based somatic mutation calling
3.2 billion columns, millions of rows, analysis-ready, not in files
![Page 3: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/3.jpg)
Typical Research Workflow
Data Exploration
Pipeline Data Generation
![Page 4: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/4.jpg)
Typical Research Workflow
Data Exploration
Pipeline Data Generation
COMPUTE
LOAD
![Page 5: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/5.jpg)
Typical Research Workflow as BioCompute objects
Data Exploration
Pipeline Data Generation
COMPUTE
LOAD
![Page 6: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/6.jpg)
Group based variants – big pileups
• Workflow
  – merge multiple BAMs
  – sort reads by genomic coordinate
• 8.8 TB (2 x 4.4 TB flash drives) needed for the sort, in addition to the distributed file system
• San Diego Supercomputer Center
![Page 7: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/7.jpg)
What did they find in human genomes?
Optimum was ~ 15 whole genomes at a time
Known variants (dark blue and dark green)
Novel variants (light blue and light green)
| Attribute | Single file x 100s | Grouped files | Concordance |
| --- | --- | --- | --- |
| Total variants | 30,790,918 | 29,915,861 | 81.4% |
| Unique variants | 2,668,331 | 3,543,283 | MHC, Y, Tele |
| Minor allele frequency | 1-5% | <1% | |
| Ti/Tv (2.19 is ideal) | 1.2 | 1.6 | |
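The Ti/Tv row is the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes). A minimal sketch of that arithmetic, assuming variants are available as (ref, alt) single-base pairs; the function name `ti_tv_ratio` and the example list are illustrative and not from the slides.

```python
# Transitions swap a purine for a purine (A<->G) or a pyrimidine for a
# pyrimidine (C<->T). Every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = 0
    tv = 0
    for ref, alt in snvs:
        if ref == alt:
            continue  # not a variant, ignore
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Illustrative input only: three transitions and two transversions.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(example)
```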
![Page 8: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/8.jpg)
SciDB
a scalable Scientific, Computational DBMS
SciDB blurs the line between storage and computation
In-situ, massively scalable analytics
Scientific data are stored natively as multi-dimensional arrays
[Figure: multi-dimensional array with axes Genomic coordinate, Chromosome, and Patient]
![Page 9: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/9.jpg)
Pipeline data generation
For reproducibility and re-analysis
• Minimize data movement and copying
• Data stored in analysis-ready form
• Metadata stored with data (BioCompute Requirement)
• Rapid selection of specific data of interest
[Figure: SciDB with QueryF(x, y, z), QueryG(x, y, z), and QueryT(x, y, z), each shown with an ARRAY that has a SAMPLES dimension (TCGA1, TCGA2, TCGA3, TCGA4) and a GENES dimension (AKT1, EGFR, TP53, ZNF11). Example cell values, row by row: 3.2 0.3 0.1 11 / 3.4 0.0 0.8 10 / 1.1 1.0 1.2 14 / 0.9 1.0 1.2 13]
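The "rapid selection of specific data of interest" idea can be sketched with the small samples-by-genes array from the figure. This is plain Python, not SciDB's query language; the `select` helper is hypothetical, and the assumption that rows follow TCGA1 to TCGA4 and columns follow AKT1, EGFR, TP53, ZNF11 is a guess about the flattened figure.

```python
# Dimension labels and cell values as they appear on the slide.
# Assumption: rows are samples and columns are genes, in the order listed.
SAMPLES = ["TCGA1", "TCGA2", "TCGA3", "TCGA4"]
GENES = ["AKT1", "EGFR", "TP53", "ZNF11"]
VALUES = [
    [3.2, 0.3, 0.1, 11],
    [3.4, 0.0, 0.8, 10],
    [1.1, 1.0, 1.2, 14],
    [0.9, 1.0, 1.2, 13],
]


def select(samples, genes):
    """Return the cells at the requested (sample, gene) coordinates."""
    rows = [SAMPLES.index(s) for s in samples]
    cols = [GENES.index(g) for g in genes]
    return {
        (SAMPLES[r], GENES[c]): VALUES[r][c]
        for r in rows
        for c in cols
    }


# Pull one gene for two samples without touching the rest of the array.
subset = select(["TCGA1", "TCGA3"], ["TP53"])
```

Addressing cells by dimension labels rather than by file offsets is the point of the sketch: the selection names coordinates, and the storage layer decides what to read.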
![Page 10: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/10.jpg)
Reproducible research
• Curate once
• Explore many times
• By many concurrent users
• Enforces data integrity
• Versions data
• Track and trace data and queries
[Figure: User 1, User 2, User 3, and User 4 each run Data Exploration against a single SciDB instance]
![Page 11: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/11.jpg)
SciDB & BioCompute Objects
SciDB
• Data loaders enforce type & field constraints
• Arrays are versioned and time-stamped
• Database log tracks all parameters, data changes, user actions
• Utility could represent queries as JSON objects
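One way a utility might represent a query as a JSON object is sketched below. Everything here is an assumption for illustration: the `query_record` helper, the field names, the array name `PILEUP`, and the query string are hypothetical, and the SHA-256 digest is simply one possible way to make a record traceable.

```python
import hashlib
import json


def query_record(query_text, parameters, user, array_version):
    """Bundle a query with its parameters, user, and array version."""
    record = {
        "query": query_text,
        "parameters": parameters,
        "user": user,
        "array_version": array_version,
    }
    # Hash a canonical serialization so identical queries get identical digests.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Hypothetical query text and array name, used only as an example payload.
rec = query_record(
    "filter(PILEUP, coverage >= 450)",
    {"coverage_threshold": 450},
    "apoliakov",
    3,
)
serialized = json.dumps(rec, sort_keys=True)
```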
![Page 12: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/12.jpg)
![Page 13: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/13.jpg)
Group-based Somatic Point Mutation Calling
• Reveals statistically significant minority somatic mutations
• Technique gives the FDA an explanation and proof of the repeatability, traceability, and accuracy of the variant calling pipeline
Simultaneous large group pile-ups provide more accurate identification and accommodation of sequencing errors*
* Standish, et al. BMC Bioinformatics (2015) 16:304
![Page 14: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/14.jpg)
Somatic Point Variant Caller
[Diagram: somatic point variant caller data flow]
• Inputs: BAM 1, BAM 2, BAM…
• Per BAM: count A, C, T, G, ? per position
• Aggregate total across BAMs: pile up over ALL BAM files in a single large array
• Join + compute stats against reference arrays (some examples): Reference Genome, Known Variants, Quality Regions
• Per-position stats
• Filter based on: base quality, read mapping quality, coverage
• Filtered calls per position
• Test using spike-in
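The per-BAM counting and cross-BAM aggregation steps can be sketched as follows. This is a plain-Python illustration, not the SciDB implementation; it assumes each BAM has already been reduced to (position, base) observations, and the helper names `per_bam_counts` and `aggregate` are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def per_bam_counts(observations):
    """Count A, C, G, T, and '?' (anything else) per position for one BAM."""
    counts = defaultdict(Counter)
    for pos, base in observations:
        counts[pos][base if base in BASES else "?"] += 1
    return counts


def aggregate(all_bam_counts):
    """Sum the per-BAM counts into one pileup covering every BAM."""
    total = defaultdict(Counter)
    for counts in all_bam_counts:
        for pos, base_counts in counts.items():
            total[pos].update(base_counts)
    return total


# Toy observations standing in for aligned reads from two BAM files.
bam1 = [(100, "A"), (100, "A"), (100, "G"), (101, "C")]
bam2 = [(100, "A"), (100, "N"), (101, "C"), (101, "T")]
pileup = aggregate([per_bam_counts(bam1), per_bam_counts(bam2)])
```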
![Page 15: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/15.jpg)
Methodology
• Formulate the caller as a simple signal/noise filter
• Tunable and explainable ROC curves (receiver operating characteristics)
– Control false positive and false negative rates by experimenting with settings for noise and filter thresholds without having to reload data
– Can use ‘spiked in’ data to provide known answer to guide parameter setting for variant calling QA
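A signal/noise filter tuned against spike-in positions can be sketched like this. It is an illustration under stated assumptions, not the authors' caller: the per-position second-call ratios, coverages, spike-in positions, and the helper names `call_variants` and `roc_point` are all made up for the example.

```python
def call_variants(ratios, coverage, noise_threshold, coverage_threshold):
    """Call positions whose ratio exceeds the noise threshold at adequate coverage."""
    return {
        pos
        for pos, ratio in ratios.items()
        if coverage[pos] >= coverage_threshold and ratio > noise_threshold
    }


def roc_point(called, spiked_in, all_positions):
    """Return (true positive rate, false positive rate) for one threshold setting."""
    true_positives = len(called & spiked_in)
    false_positives = len(called - spiked_in)
    negatives = len(all_positions - spiked_in)
    return true_positives / len(spiked_in), false_positives / negatives


# Hypothetical per-position data; positions 2 and 4 carry the spiked-in variants.
ratios = {1: 0.001, 2: 0.020, 3: 0.004, 4: 0.015, 5: 0.008}
coverage = {1: 500, 2: 600, 3: 300, 4: 460, 5: 700}
spiked_in = {2, 4}

# Sweep the noise threshold over data that is already loaded; nothing is reloaded.
curve = []
for threshold in (0.002, 0.010, 0.025):
    called = call_variants(ratios, coverage, threshold, coverage_threshold=450)
    curve.append(roc_point(called, spiked_in, set(ratios)))
```

Each entry in `curve` is one point on an ROC curve, so moving the threshold trades false positives against false negatives in a way that can be read off directly.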
![Page 16: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/16.jpg)
Base-level PHRED Score Distribution
| Phred score | Probability of incorrect call | Call accuracy |
| --- | --- | --- |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |
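The table follows the standard Phred relationship, Q = -10 log10(P), where P is the probability of an incorrect call. A short sketch of the conversion in both directions; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Probability that a call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


# Reproduce the accuracy column for the scores listed in the table.
accuracy = {q: 1 - phred_to_error_probability(q) for q in (10, 20, 30, 40, 50, 60)}
```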
![Page 17: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/17.jpg)
MAPQ Scores and Coverage
• Histograms show % of bases excluded at specific thresholds
• PHRED and MAPQ thresholds affect coverage distribution
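The "% of bases excluded" figure for a given pair of thresholds is a simple fraction. A minimal sketch, assuming each aligned base is available as a (PHRED, MAPQ) pair; the helper `fraction_excluded` and the sample values are hypothetical.

```python
def fraction_excluded(bases, phred_threshold, mapq_threshold):
    """Fraction of bases dropped because PHRED or MAPQ falls below its threshold."""
    if not bases:
        return 0.0
    excluded = sum(
        1
        for phred, mapq in bases
        if phred < phred_threshold or mapq < mapq_threshold
    )
    return excluded / len(bases)


# Illustrative (PHRED, MAPQ) pairs for five aligned bases.
bases = [(40, 60), (12, 60), (35, 10), (50, 60), (8, 5)]
strict = fraction_excluded(bases, phred_threshold=30, mapq_threshold=20)
lenient = fraction_excluded(bases, phred_threshold=10, mapq_threshold=0)
```

Every excluded base also lowers the coverage at its position, which is why the thresholds shift the coverage distribution.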
![Page 18: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/18.jpg)
Effect of PHRED & MAPQ thresholds on noise
Thresholding PHRED and MAPQ scores has an effect of "shifting" the noise to the lower allele frequency band
[Figure: Noise Histogram. x-axis: Second Call Ratio (2nd most common base count / total coverage at that position); y-axis: Density]
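The second call ratio plotted in the noise histogram is the count of the second most common base divided by the total coverage at that position. A minimal sketch, assuming per-position base counts are available as a dictionary; the function name is illustrative.

```python
def second_call_ratio(base_counts):
    """Second most common base count divided by total coverage at one position."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    ordered = sorted(base_counts.values(), reverse=True)
    second = ordered[1] if len(ordered) > 1 else 0
    return second / total


# Illustrative position: 500 reads, 480 of which agree on the majority base.
ratio_at_position = second_call_ratio({"A": 480, "G": 15, "C": 3, "T": 2})
```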
![Page 19: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/19.jpg)
Low-complexity filter or low-confidence filter does not reduce noise further
[Figure: two noise histograms, "Low complexity filter" and "Low confidence region filter". x-axis: Second Call Ratio; y-axis: Density]
![Page 20: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/20.jpg)
BioCompute PiD Concept
"id": "obj.1243",
"type": "biocompute",
"name": "SciDB GIAB minority clone variant call SciDB PiD",
#author,created,…
"parametric_domain": {
  "phred_threshold": "50",
  "coverage_threshold": "450",
  …
},
"scidb_domain": {
  "hostname": "clust_scidb_01",
  "db_user": "apoliakov",
  …
  "arrays": [{
    "id": "obj.1243",
    "name": "GIAB_CLONALITY.BAM_DATA",
    "schema_type": "Multi-sample BAM Pileup Array",
    "size": "78.5TB",
    "created": "Jan 10 2017 11:57:34",
    …
JSON can be stored directly in the database
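Storing the object as JSON text in a database and reading it back can be sketched as below. SQLite (from the Python standard library) is used purely as a stand-in because the slides do not show how SciDB stores the JSON; the table name `biocompute_objects` is hypothetical, and only fields visible on the slide are included.

```python
import json
import sqlite3

# Fields copied from the slide; elided fields are left out rather than guessed.
pid_object = {
    "id": "obj.1243",
    "type": "biocompute",
    "name": "SciDB GIAB minority clone variant call SciDB PiD",
    "parametric_domain": {
        "phred_threshold": "50",
        "coverage_threshold": "450",
    },
}

# In-memory SQLite database as a stand-in for the real store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biocompute_objects (id TEXT PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO biocompute_objects VALUES (?, ?)",
    (pid_object["id"], json.dumps(pid_object)),
)

# Read the object back and turn the string thresholds into integers for use.
row = conn.execute(
    "SELECT body FROM biocompute_objects WHERE id = ?", ("obj.1243",)
).fetchone()
restored = json.loads(row[0])
params = {key: int(value) for key, value in restored["parametric_domain"].items()}
```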
![Page 21: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/21.jpg)
BioCompute PiD Concept
"scidb_domain": {
  …
  "arrays": [{
    "id": "obj.1243",
    "name": "GIAB_CLONALITY.BAM_DATA",
    "schema_type": "Multi-sample BAM Array",
    "size": "78.5TB",
    "created": "Jan 10 2017 11:57:34",
    "modified": …
  }, {
    "id": "obj.1244",
    "name": "GIAB_CLONALITY.BAM_PILEUP",
    "schema_type": "BAM per-sample filtered BAM pileup",
    "size": "350GB",
    "created": "Jan 10 2017 18:67:34",
    …
  }, ...
}
![Page 22: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/22.jpg)
BioCompute PiD Concept
"execution_domain": [{
  "id": "obj.1237",
  "env_parameters": "example",
  "location": "/workflows/giab_clonality.R",
  "platform": "SciDB-R",
  ...
}]
• Execution points to SciDB query scripts as before
• Optional Array Version History
"arrays": [{
  "name": "GIAB_CLONALITY.BAM_DATA",
  "versions": [{
    "version_id": "3",
    "created": "Jan 10 2017 11:59:34"
  }, ...
![Page 23: BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database](https://reader034.vdocument.in/reader034/viewer/2022050519/5fa2bdbcecf2d64a2c57ba6f/html5/thumbnails/23.jpg)
BioCompute & SciDB
a pipeline-in-a-database
Zachary Pitluk, Ph.D., V.P. Life Sciences & [email protected]