Herding Ponies: How big data methods facilitate collaborative analytics
Post on 03-Jan-2016
Changes in Outcomes Research
New monikers…
Patient Centered Outcomes Research
Health Services Research
Comparative Effectiveness Research
Safety and Surveillance
Changes in funding agencies
PCORI - AHRQ
FDA – CMS
NIH
Changes in research models
More multi-site studies
Larger “center-based” studies
Greater interest in Patient Generated Data
Greater interest in EHR-based data
Less interest in claims
Collaboration Frameworks from Other Disciplines
Open Science Grid (OSG): physics, nanotechnology, structural biology
1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
LIGO: physics/astrophysics
Established practices and metadata standards
1 PB data in last science run, distributed worldwide
ESGF
1.2 PB climate data delivered to 23,000 users; 600+ pubs
Collage (executable papers): computer science
“Why hasn’t Outcomes Research adopted collaborative methods used in physics,
climate science, and genomics?”
- Everyone in data-driven research
1. Healthcare data are not collected for research
Not standardized
Not complete
2. Privacy protection has legal and ethical implications
3. Data is an asset
4. Data sharing is not incentivized or supported by journals, funding agencies, or the business of healthcare
Obtaining consent is expensive
Data hoarding is rewarded and is the conservative choice
Adapting to Collaborative Science
Are Federated Research Networks the solution?
In federated models, data are not centralized. AHRQ and PCORI have invested heavily in this approach.
5. Each data holder independently assumes responsibility for “data wrangling” and standardization
6. Requires distributed analysis as opposed to traditional central data pooling and analysis.
If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis
7. Requires greater levels of coordination of governance, standards, software, and policies.
8. High barriers to entry – what is the ROI?
Federated Meta-Analysis vs. Distributed Analysis
Parallel Meta-analysis (Independently Estimated Results)
Data Site 1: 100 patients
Data Site 2: 50 patients
Query Portal
Analysis Program
Results, Site 1: model fit to 100 patients
Results, Site 2: model fit to 50 patients
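The independently estimated case can be sketched in a few lines. This is a minimal illustration, not the method of any particular network: each site fits its own estimate (here simply a mean outcome with its standard error), and the results are combined afterwards by fixed-effect inverse-variance weighting, a standard meta-analysis technique chosen here as an assumption. The function names and the simulated data are hypothetical.

```python
import math
import random

def site_estimate(outcomes):
    """Run locally at one site: estimate the mean outcome and its standard error."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    var = sum((y - mean) ** 2 for y in outcomes) / (n - 1)
    return mean, math.sqrt(var / n)

def pool_fixed_effect(results):
    """Combine independent site estimates by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for _, se in results]
    pooled = sum(w * est for w, (est, _) in zip(weights, results)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical simulated outcomes; site sizes mirror the diagram (100 and 50 patients).
rng = random.Random(0)
site_1 = [rng.gauss(10.0, 2.0) for _ in range(100)]
site_2 = [rng.gauss(11.0, 2.0) for _ in range(50)]

# Each site reports one estimate; nothing is iterated.
results = [site_estimate(site_1), site_estimate(site_2)]
pooled, pooled_se = pool_fixed_effect(results)
print(results, pooled, pooled_se)
```

Each site is visited once and the models never inform one another, which is why this design adds little beyond a conventional meta-analysis.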
Parallel Distributed Analytics (Jointly Estimated Results)
Data Site 1: 100 patients
Data Site 2: 50 patients
Query Portal & Aggregator
Iterative Analysis Program
Intermediate Statistics, Site 1
Intermediate Statistics, Site 2
Model Fit: 150 Patients
Converged Estimate
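One way the jointly estimated case could work is sketched below for a logistic regression with an intercept and a single covariate. The assumption here is that the "intermediate statistics" are each site's gradient and information-matrix contributions, and that the aggregator takes Newton-Raphson steps on their sums; the slides do not specify the algorithm. All names and the simulated data are hypothetical.

```python
import math
import random

def site_statistics(rows, beta):
    """Run locally at one site: return gradient and information-matrix sums
    for the logistic log-likelihood. No patient-level rows leave the site."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def fit_distributed(sites, tol=1e-10, max_iter=50):
    """Aggregator: sum the site statistics and iterate Newton-Raphson steps."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        stats = [site_statistics(rows, beta) for rows in sites]
        g0 = sum(s[0][0] for s in stats)
        g1 = sum(s[0][1] for s in stats)
        h00 = sum(s[1][0] for s in stats)
        h01 = sum(s[1][1] for s in stats)
        h11 = sum(s[1][2] for s in stats)
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if abs(step0) + abs(step1) < tol:
            break
    return beta

def simulate(rng, n, x_mean):
    """Hypothetical data generator; x_mean differs by site (non-random anchoring)."""
    rows = []
    for _ in range(n):
        x = rng.gauss(x_mean, 1.0)
        p = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows

rng = random.Random(1)
site_a = simulate(rng, 100, 0.0)
site_b = simulate(rng, 50, 0.5)

joint_fit = fit_distributed([site_a, site_b])    # two sites, statistics only
pooled_fit = fit_distributed([site_a + site_b])  # same data pooled centrally
print(joint_fit, pooled_fit)
```

Because the log-likelihood is a sum over patients, summing per-site statistics gives the same model as central pooling of all 150 patients, up to floating-point rounding.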
Meta-analysis
• One independently estimated model for each node in the network
• Not iterative
Distributed Analysis
• One jointly estimated model using data from all sites
• Typically iterative
• Leverages computational power of the entire network
What does this have to do with “big data?”
Two (of 8) barriers to collaborative data science solved with “Big Data” methods
Privacy protection has legal and ethical implications
If data are simply used to independently estimate one model per site, the value added for causal inference is similar to that of a meta-analysis
Bonus – specialized software or hardware like SAS and CMS repositories can be replaced with parallelized systems
Parallel Evolution of Distributed Computing and Federated Research Networks
[Timeline figure: 1993, 1998, 2003, 2008, 2013; labeled entry: CaGRID]
“Big Data” Analytics vs. Outcomes Research Analytics
| | “Big Data” in Distributed Environments | Outcomes Research in Federated Research Networks |
| Analysis Questions | Patterns, Predictions, Classification | Causal Inference, Predictions, Hypothesis testing |
| Data Distribution | Data can be randomly distributed across processors by a master | Data are non-randomly anchored to sites |
| # Nodes on network | 100s or more | 10s |
| Data Governance constraints between network nodes | Typically none or low | Typically very high |
| Data set size | Very large | Relatively small |
| Query Distribution Platforms | Apache Spark, Hadoop Map-Reduce, Apache Pig | SHRINE, PopMedNet, TRIAD |
| Common Analytic Platforms | R-Volution/R-Hadoop, Apache Mahout, Spark Machine Learning Lib, Spark Graph X Lib | R, SAS, Stata |
| Size of developer community | 1000s | Dozens |
“Big-Data” Methods are Incidentally “Privacy Preserving”
| Feature | Clinical Research Rationale | “Big Data” Rationale |
| Federation in the form of multiple networked nodes or processing cores | Multiple independently operating data partners | Inefficient to rely on a single very powerful processor or specialized hardware |
| Distributed computation across networked nodes (instead of central pooling of data) | Transferring patient-level data incurs re-identification risks | Inefficient to transfer large data sets across the network |
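The overlap between the two rationales can be illustrated with ordinary least squares on one covariate. In this minimal sketch each site reduces its patient-level rows to five sums, and only those sums cross the network; the function names and the tiny data set are hypothetical, and real networks would apply their own governance rules to what may be released.

```python
def site_summary(rows):
    """Run locally: reduce patient-level (x, y) rows to five aggregate sums."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return n, sx, sy, sxx, sxy

def fit_from_summaries(summaries):
    """Aggregator: add the site sums and solve the normal equations."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*summaries))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical patient-level rows held at two separate sites.
rows_site_1 = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
rows_site_2 = [(4.0, 8.1), (5.0, 9.8)]

# Only the five-number summaries are transmitted.
intercept, slope = fit_from_summaries(
    [site_summary(rows_site_1), site_summary(rows_site_2)]
)
print(intercept, slope)
```

The transmitted payload is five numbers per site regardless of how many patients the site holds, which serves both the re-identification concern and the bandwidth concern in the table.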
Distributed Computing Frameworks
Grid Computing Architectures
Statistical Query Oracle: mostly an academic effort
Hadoop: open-source implementation of the MapReduce model published by Google
Hundreds of developers
591 active projects and organizations
Apache Spark: Berkeley Computer Science's answer to Hadoop
Most rapidly growing user base
99 active projects and organizations
Collaboration Frameworks in Outcomes Research
SHRINE for I2B2
PopMedNet – for MiniSentinel, PCORnet
TRIAD for CAGrid, SAFTINet DRN
What distributed methods in the standard biostats toolbox are already supported in “Big Data” vs. Clinical Frameworks?
Algorithm/Method | Apache Spark Libraries | Map-Reduce Multi-Core or RHadoop | Federated Clinical Research Networks
Linear regression (weighted) X X
Logistic regression X X X
Cox Proportional Hazard X
Generalized Linear Models X
Naïve Bayes X X
Gaussian Discriminative Analysis X
K-means X X
Neural Network Backpropagation X
Matrix Factorization X
PCA * X
ICA * X
Support Vector Machine X X X
Expectation Maximization X
Random Forest Classifier X
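Many of the methods listed can be distributed because their update steps are sums over records. As one hedged illustration, the sketch below runs k-means on one-dimensional data across two sites: each site returns only per-cluster sums and counts, and the aggregator recomputes the centroids. The function names, data, and starting centroids are hypothetical, and this is not the implementation used by any of the platforms in the table.

```python
def site_partial_sums(points, centroids):
    """Run locally: assign each point to its nearest centroid and return
    per-cluster sums and counts only."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        k = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
        sums[k] += p
        counts[k] += 1
    return sums, counts

def kmeans_distributed(sites, centroids, n_iter=20):
    """Aggregator: combine site sums and counts into new centroids each round."""
    for _ in range(n_iter):
        partials = [site_partial_sums(pts, centroids) for pts in sites]
        updated = []
        for k in range(len(centroids)):
            total = sum(s[k] for s, _ in partials)
            count = sum(c[k] for _, c in partials)
            # Keep the old centroid if no site assigned any point to it.
            updated.append(total / count if count else centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Hypothetical measurements held at two sites.
points_site_1 = [1.0, 1.2, 0.8, 5.0]
points_site_2 = [5.2, 4.8, 1.1]

final_centroids = kmeans_distributed([points_site_1, points_site_2], [0.0, 6.0])
print(final_centroids)
```

The same pattern of local sums followed by central aggregation applies to several other rows in the table, such as Naïve Bayes counts.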
No Longer a Technical Challenge
We have the tools we need to overcome privacy and liability concerns. Now we “only” need to change culture.
Moving Collaborative Outcomes Science Forward
Policies (aka incentives)
Payer-driven incentives for better data hygiene and standardization
Payer incentives for sharing
Funding agency incentives for collaborative data management vs. data hoarding
Journal incentives
HIPAA Clarification
Infrastructure
As a community, adopt existing easy-to-use, flexible platforms for sharing code and data
Link clinical data and patient device infrastructure to research infrastructure
Culture
Clinician demand
Patient demand
Tenure and promotion transformation
Replace “not invented here syndrome” with collective credit and shared efficiencies