Page 1: The Durkheim Project: Social Media Risk & Bayesian Counters Hadoop Summit: June 27, 2013 Chris Poulin: PATTERNS AND PREDICTIONS Alex Kozlov : Cloudera

The Durkheim Project: Social Media Risk & Bayesian Counters

Hadoop Summit: June 27, 2013

Chris Poulin: PATTERNS AND PREDICTIONS

Alex Kozlov: Cloudera

Disclaimers:

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. It was also supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number N10PC20221. The opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA), the Space and Naval Warfare Systems Center Pacific, IARPA, DOI/NBC, or the U.S. Government.

© 2013 Patterns and Predictions

Page 2:

Speakers

Chris
• Principal Investigator, DARPA DCAPS Poulin-Dartmouth Suicide Prediction Team
• Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
• Artificial Intelligence Instructor, US Naval War College
• Principal, Patterns and Predictions (linguistics and prediction of financial events)
• … and have now read many suicide notes.

Alex
• Principal Solutions Architect at Cloudera
• Ph.D. from Stanford University
• Data mining and statistical analysis at SGI, Hewlett-Packard

Page 3:

Suicide is a hard societal problem, but why?

• Stigma: Victims are socially outcast (i.e. disconnected).
• Negative Topic: Intense negative emotion. And not a 'sexy' research topic by any means.
• Freedom of Choice: Ultimately you can't stop someone from risky behaviors, or many other activities that risk self-harm. And suicide is the ultimate act of personal risk.
• Logistics: Even if you know what to look for, there are not enough clinicians to help the number of people suffering. Data privacy issues are as intense as, or more so than, in banking.
• Prediction: Accuracy (proper identification), false positives (stigmatization), false negatives (malpractice).
• Deeper issues?: Recent growth in suicide may be related to something more systemically wrong. Suicide may be the symptom of something else going on. (e.g. Tony Blair quote on terrorism)…

Page 4:

Durkheim

The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.

The team is multidisciplinary, combining artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).

www.durkheimproject.org

Page 5:

Our Approach

Social Problem: Opt-In is critical

o Clear explanations for consent, no tricky EULAs

Technical Problem: How to build a system that collects, stores, analyzes, and allows clinicians to react at Internet scale?

Architecture:

1) Opt-In Interface Layer
2) Data Collection Layer
3) Storage Layer
4) Machine Learning, Phase I
5) Machine Learning, Phase II
6) Automated Intervention

Page 6:

1) Opt-In Interface Layer

We can't overemphasize the role of simplified user participation for consent, and privacy control, in our interface/interaction design.

Page 7:

2) Data Collection Layer

The social media component is handled by a content aggregator (Gigya), which populates a Cassandra database.

Page 8:

Data Collection Layer, Continued

The Cassandra instances were built and maintained (by Scale Unlimited) to handle high-throughput storage. However, this is not the final destination of the data.

Page 9:

3) Storage Layer

Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall at Dartmouth). Here it persists for ongoing research.

Page 10:

4) Machine Learning, Phase I

In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) of three cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).

• We developed linguistics-driven prediction models to estimate the risk of suicide. These models were generated from unstructured clinical notes.
• From the clinical notes, we generated datasets of single keywords and multi-word phrases.
• We were able to predict suicide with 65% accuracy on a small dataset.
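
The keyword and phrase datasets described on this slide can be illustrated with a small n-gram counter. This is only a sketch: the talk does not say how the features were actually extracted, so the tokenizer, the `ngram_features` name and the two-word phrase limit are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_features(note, max_n=2):
    """Count single keywords and multi-word phrases (up to max_n words) in one note.

    Hypothetical helper: the tokenizer is a plain lowercase word split, and
    phrases are allowed to run across sentence boundaries for simplicity.
    """
    tokens = re.findall(r"[a-z']+", note.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up example text, not taken from any clinical record.
features = ngram_features("Patient reports poor sleep. Patient denies pain.")
```

Each note becomes a bag of keyword and phrase counts, which is the kind of input a count-based classifier can consume directly.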

Page 11:

5) Machine Learning, Phase II

In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine learning framework for detecting real-time risk at scale.

• We wanted a clean statistical model for distributed inference (prediction).
• We needed a more lightweight framework than Mahout.
• We wanted to be able to trade off runtime vs. accuracy.
• We wanted the prediction library to be eventually open sourced (Apache license) for the community.

'Alpha' Build @ http://durkheimproject.org/bcount/

By Alex Kozlov <[email protected]>

Page 12:

What is B-counts today? And Why?

• Distributed aggregation of user events and correlations to fit into RAM of multiple machines
• Smart client: Moves a substantial amount of logic to clients
• Time: An explicit time dimension to support ‘recency analysis’
• Based on HBase
• Previous analysis (Poulin) had indicated that words and correlations are good predictors of the target variable
• Need a faster processing/response time (response time beats accuracy of the model)

http://www.slideshare.net/Hadoop_Summit/bayesian-counters

Page 13:

Time to Answer

Examples

• Advertising: if you don’t figure out what the user wants within 5 minutes, you have lost them
• Intrusion detection: the damage may be significantly greater a few minutes after the break-in
• Mental health risk: you need to screen before negative actions occur

Value vs. time

http://cetas.net

http://www.woopra.com

http://www.wibidata.com/

Page 14:

Solution: Time Stamped Hadoop

• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)

[Figure: an index pointing to Key 1 through Key 4, each stored with its Value]

Pr(A|B, last 20 minutes)

Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)

What if we want to access more recent data more often?
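
The key/value layout and the windowed query Pr(A|B, last 20 minutes) on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the B-counts implementation: the `TimedCounters` class, its method names and the in-memory dictionary are hypothetical stand-ins for the HBase table, raw per-second timestamps replace the column-family time horizons, and recording every subset of an event on the client side is a guess at what the "smart client" does.

```python
import struct
from collections import defaultdict
from itertools import combinations


def make_key(assignment):
    """Encode a subset of variables with their values, e.g. 'class=0;sepal_width=2'."""
    return ";".join(f"{name}={value}" for name, value in sorted(assignment.items()))


class TimedCounters:
    """Hypothetical in-memory stand-in for a timestamped counter table."""

    def __init__(self):
        # key -> {timestamp in seconds -> count packed as 8 bytes}
        self.cells = defaultdict(dict)

    def increment(self, assignment, ts, by=1):
        key = make_key(assignment)
        packed = self.cells[key].get(ts)
        current = struct.unpack(">q", packed)[0] if packed is not None else 0
        self.cells[key][ts] = struct.pack(">q", current + by)

    def observe(self, event, ts):
        """Assumption: the client bumps the counter of every non-empty subset."""
        items = sorted(event.items())
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                self.increment(dict(subset), ts)

    def count(self, assignment, now, window):
        """Sum the counts for one key over timestamps in (now - window, now]."""
        cells = self.cells.get(make_key(assignment), {})
        return sum(struct.unpack(">q", packed)[0]
                   for ts, packed in cells.items()
                   if now - window < ts <= now)

    def conditional(self, a, b, now, window):
        """Estimate Pr(A | B, last `window` seconds) as count(A and B) / count(B)."""
        denominator = self.count(b, now, window)
        if denominator == 0:
            return None  # no evidence for B inside the window
        return self.count({**a, **b}, now, window) / denominator


counters = TimedCounters()
counters.observe({"sepal_width": 2, "class": 0}, ts=1000)
counters.observe({"sepal_width": 2, "class": 1}, ts=1500)
counters.observe({"sepal_width": 2, "class": 0}, ts=100)  # an older event

recent = counters.conditional({"class": 0}, {"sepal_width": 2}, now=2000, window=1200)
longer = counters.conditional({"class": 0}, {"sepal_width": 2}, now=2000, window=2000)
```

Shrinking the window drops the older event from both the numerator and the denominator, so the same two counters answer the query differently for different recency horizons.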

Page 15:

A Bayesian Counter, in detail

[Figure: HBase layout of one Bayesian counter. Labels: Counter/Table, Region (divide between), Column family, Column qualifier, File, Version, Value (data). Example contents shown: Iris, [sepal_width=2;class=0], 15, 1321038671, 1321038998, 30 mins, 2 hours, …]

Page 16:

Command Line Implementation

Page 17:

Syntax

nb iris class=2 sepal_length=5\;petal_length=1.4 300

• Target Variable: class=2
• Predictors: sepal_length=5\;petal_length=1.4
• Time (seconds from now): 300
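
To make the argument order concrete, here is a small parser for the command shown on this slide. It is a sketch, not part of the B-counts tool: the `parse_query` function and the dictionary field names are hypothetical, `iris` is assumed to name the counter table, and the final argument is assumed to be a look-back window in seconds.

```python
import shlex


def parse_query(command):
    """Split an `nb` command line into table, target, predictors and time window."""
    # shlex applies shell-style unescaping, so the escaped `\;` becomes a plain `;`.
    program, table, target, predictors, seconds = shlex.split(command)
    if program != "nb":
        raise ValueError(f"unexpected command: {program}")
    target_var, target_value = target.split("=", 1)
    return {
        "table": table,
        "target": (target_var, target_value),
        "predictors": dict(pair.split("=", 1) for pair in predictors.split(";")),
        "window_seconds": int(seconds),
    }


query = parse_query(r"nb iris class=2 sepal_length=5\;petal_length=1.4 300")
```

The backslash before the semicolon only matters to the shell, which would otherwise treat `;` as a command separator; the program itself sees a single predictor argument.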

Page 18:

Current Classifier Support (alpha release)

Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/Z) Pr(C) Πi Pr(Fi|C), where Z is the normalizing constant

Association rules: Confidence (A -> B): count(A and B)/count(A); Lift (A -> B): n x count(A and B)/(count(A) x count(B)), where n is the total number of transactions

Nearest Neighbor: P(C) for k nearest neighbors; count(C|X) = ΣXi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X

Clique ranking: I(X;Y) = Σx Σy p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y; using random projection, this can be generalized to two abstract subsets of Z
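
Three of the measures on this slide can be computed from nothing but counts, which is what makes a counter table a workable backend. The sketch below assumes plain dictionaries of counts as input; all function names are hypothetical, there is no smoothing for zero counts, and lift uses the standard definition that includes the total number of transactions n.

```python
import math


def naive_bayes_posterior(class_counts, feature_counts, features):
    """Pr(C | F1..FN) = (1/Z) Pr(C) * prod_i Pr(Fi | C), estimated from raw counts.

    class_counts:   {class: count(C)}
    feature_counts: {(feature, class): count(Fi and C)}
    """
    total = sum(class_counts.values())
    scores = {}
    for label, class_count in class_counts.items():
        score = class_count / total
        for feature in features:
            score *= feature_counts.get((feature, label), 0) / class_count
        scores[label] = score
    z = sum(scores.values())
    return {label: score / z for label, score in scores.items()} if z else scores


def confidence(count_ab, count_a):
    """Confidence (A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, n):
    """Lift (A -> B) = n * count(A and B) / (count(A) * count(B))."""
    return n * count_ab / (count_a * count_b)


def mutual_information(joint_counts):
    """I(X;Y) in nats from a dict {(x, y): count}."""
    n = sum(joint_counts.values())
    count_x, count_y = {}, {}
    for (x, y), c in joint_counts.items():
        count_x[x] = count_x.get(x, 0) + c
        count_y[y] = count_y.get(y, 0) + c
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum((c / n) * math.log(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in joint_counts.items() if c > 0)


# Made-up counts for illustration only.
posterior = naive_bayes_posterior(
    class_counts={0: 50, 1: 50},
    feature_counts={("sepal_width=2", 0): 10, ("sepal_width=2", 1): 30},
    features=["sepal_width=2"],
)
```

A lift of 1 means A and B co-occur exactly as often as independence would predict, and mutual information is 0 in the same situation.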

Page 19:

Performance

retail.dat example – 88K transactions over 14,246 items

o Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)

o 10 ms per pattern on a 5 node cluster
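
For a sense of scale, the per-pattern figures on this slide can be multiplied out over all 58,623 patterns. The slide does not say which system the 10 ms figure belongs to or whether per-pattern cost stays constant; the sketch below assumes it is the Bayesian Counters result and that cost scales linearly, and the variable names are illustrative only.

```python
PATTERNS = 58_623
MAHOUT_MS_PER_PATTERN = 500    # 0.5 sec per pattern, from the slide
COUNTERS_MS_PER_PATTERN = 10   # 10 ms per pattern, from the slide

# Assumption: total time is simply patterns times per-pattern time.
mahout_total_ms = PATTERNS * MAHOUT_MS_PER_PATTERN
counters_total_ms = PATTERNS * COUNTERS_MS_PER_PATTERN
speedup = MAHOUT_MS_PER_PATTERN // COUNTERS_MS_PER_PATTERN

print(f"0.5 sec per pattern: {mahout_total_ms / 3_600_000:.1f} hours in total")
print(f"10 ms per pattern: {counters_total_ms / 60_000:.1f} minutes in total")
print(f"Per-pattern speedup: {speedup}x")
```

Under these assumptions the difference is roughly eight hours against under ten minutes for the full pattern set.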

Page 20:

6) Intervention

Automated systems are coming online for potential patients and families seeking treatment, as well as passive intervention strategies (‘safety plans’).

Page 21:

What's next?

In 2013, we plan a variety of initiatives, including launching our clinical observation study, deploying Bayesian Counters on live data, and seeking approval for an automated intervention study.

• Launch Data Collection Study (CPHS #23781)… very soon
• Deployment of B-Counts on live data for live monitoring
• Intervention Research (Clinical Study Approval)

Page 22:

Conclusion

What is Durkheim? And what is the Bayesian Counters library?

A near real-time classification library that, while still under development, you’re free to use.

Hope that some help is coming to those in need…

Page 23:

Team

Chris Poulin, Director & Principal Investigator

Paul Thompson, Study Co-Principal Investigator

Thomas W. McAllister, M.D., Key Personnel

Ben Goertzel, Ph.D., Key Personnel

Brian Shiner, MD, Key Personnel

Craig J. Bryan, PsyD, Advisor

Linas Vepstas – Lead Machine Learning Programmer

Brian Nauheimer – Technical Project Manager

Chhean Saur – Lead Web/API Programmer

Kevin Watters – Principal Programmer, Middleware

Ken Krugler – Lead Distributed Systems Expert

Ann Marion – User Experience (UX) Design

Jane Nisselson – User Interface (UI) Design

Andrew Chen – Social Media Applications Developer

Alex Kozlov – Real-time/Distributed Classifier Development

Vivek Magotra – Cassandra Database Developer

Page 24:

THANK YOU

Chris Poulin, Managing Partner, Patterns and Predictions

[email protected]

Alex Kozlov, Principal Solutions Architect, Cloudera

[email protected]

Note: We hope that you have found this talk useful and encouraging. However, if you are having thoughts of harming yourself, please call the Veterans Crisis Line at 1-800-273-8255 or 911.

© 2013 Patterns and Predictions
