C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler
DESCRIPTION
In this presentation, Ken will describe a portion of an early-phase project that uses social media data (tweets, Facebook posts, etc.) from service personnel to predict suicide risk. There's a lot of motivation to provide better data for military psychologists, since more service members wind up taking their own lives than are killed in the line of duty. By combining social media data that is voluntarily provided by personnel with a predictive analytics system, we can provide assessments that help mental health workers focus their time and energy on the most at-risk individuals. This project uses Cassandra as the scalable storage system for this social media data, which is then analyzed in a distributed environment using Hadoop. The project also uses the Solr search support from DataStax Enterprise to provide ways for users to dig into the underlying data, which is critical when understanding the assigned risk levels.
TRANSCRIPT
![Page 1: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/1.jpg)
#CASSANDRA13
Ken Krugler | President, Scale Unlimited
Suicide Prevention Using Social Media and Cassandra
![Page 2: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/2.jpg)
What we will discuss today...
*Using Cassandra to store social media content
*Combining Hadoop workflows with Cassandra
*Leveraging Solr search support in DataStax Enterprise
*Doing good with big data
Fine Print!
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific.
![Page 3: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/3.jpg)
Obligatory Background
*Ken Krugler, Scale Unlimited - Nevada City, CA
*Consulting on big data workflows, machine learning & search
*Training for Hadoop, Cascading, Solr & Cassandra
![Page 4: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/4.jpg)
Durkheim Project Overview
Including things we didn't work on...
![Page 5: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/5.jpg)
What's the problem?
*More soldiers die from suicide than combat
*Suicide rate has gone up 80% since 2002
*Civilian suicide rates are also climbing
*More suicides than homicides
*Intervention after an "event" is often too late
![Page 6: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/6.jpg)
What is The Durkheim Project?
*DARPA-funded initiative to help military physicians
*Uses predictive analytics to estimate suicide risk from what people write online
*Each user is assigned a suicidality risk rating of red, yellow or green.
Émile Durkheim
![Page 7: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/7.jpg)
Current Status of Durkheim
*Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook
*Details at http://www.durkheimproject.org/
*Finished phase I, now being rolled out to a wider audience
![Page 8: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/8.jpg)
Predictive Analytics
*Guessing at state of mind from text
-"There are very few people in this world that know the REAL me."
-"I lay down to go to sleep, but all I can do is cry"
*Uses labeled training data from clinical notes
*Phase I results promising, for small sample set
-"ensemble" of predictors is a powerful ML technique
![Page 9: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/9.jpg)
Clinician Dashboard
*Multiple views on patient
*Prediction & confidence
*Backing data (key phrases, etc)
![Page 10: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/10.jpg)
Data Collection
Where _do_ you put a billion text snippets?
![Page 11: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/11.jpg)
Saving Social Media Activity
*System to continuously save new activity
-Scalable data store
*Also needs a scalable, reliable way to access data
-Processed in bulk (workflows)
-Accessed at individual level
-Searched at activity level
![Page 12: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/12.jpg)
Data Collection
*Pink is what we wrote
*Green is in Cassandra
*Key data path in red
[Diagram labels: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim App, Durkheim Social API, Users Table, Activity Table]
![Page 13: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/13.jpg)
Designing the Column Families
*What queries do we need to handle?
-Always by user id (what we assign)
*We want all the data for a user
-Both for Users table, and Activities table
-Sometimes we want a date range of activities
*So one row per user
-And ordered by date in the Activities table
![Page 14: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/14.jpg)
Users Table (Column Family)
*One row per user - row key is a UUID we assign
*Standard "static" columns
-First name, last name, opt_in status, etc.
*Easy to add more xxx_id columns for new services
row key | first_name | last_name | facebook_id | twitter_id | opt_in
![Page 15: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/15.jpg)
Activities Table (Column Family)
*One row per user - row key is a UUID we assign
*One composite column per social media event
-Timestamp (long value)
-Source (FB, TW, GP, etc)
-Type of column (data, activity id, user id, type of activity)
row key | ts_src_data | ts_src_id | ts_src_providerUid | ts_src_type
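The layout above can be illustrated with a minimal sketch in plain Python. It does not use Astyanax or any Cassandra API. It assumes each user row is a sorted list of (composite name, value) pairs, where the composite name is a (timestamp, source, kind) tuple. The names `add_event` and `slice_by_time` are hypothetical and exist only for this illustration.

```python
import bisect

# The four column kinds per event, matching the column name suffixes on the slide.
KINDS = ("data", "id", "providerUid", "type")

def add_event(row, ts, source, data, activity_id, provider_uid, activity_type):
    """Insert one social media event as four composite columns, keeping the row sorted."""
    values = dict(zip(KINDS, (data, activity_id, provider_uid, activity_type)))
    for kind in KINDS:
        bisect.insort(row, ((ts, source, kind), values[kind]))

def slice_by_time(row, start_ts, end_ts):
    """Return all columns with start_ts <= timestamp <= end_ts (integer timestamps)."""
    # A one-element tuple (ts,) sorts before every (ts, source, kind) with the same ts.
    lo = bisect.bisect_left(row, ((start_ts,),))
    hi = bisect.bisect_left(row, ((end_ts + 1,),))
    return row[lo:hi]

# One row for one user; events arrive out of order but end up sorted by timestamp.
row = []
add_event(row, 307, "TW", "Where am I?", "Tweet #17", "TW user #109", "Tweet")
add_event(row, 213, "FB", "I feel tired", "FB post #32", "FB user #66", "Status update")
recent = slice_by_time(row, 300, 400)
```

Because the timestamp is the first component of the column name, a date range maps to one contiguous slice of the row, which is why the "date range of activities" query stays cheap in this design.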
![Page 16: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/16.jpg)
Two Views of Composite Columns
*As a row/column view
row key | 213_FB_data | 213_FB_id | 213_FB_providerUid | 213_FB_type
"uuid1" | "I feel tired" | "FB post #32" | "FB user #66" | "Status update"
*As a key-value map
"uuid1":
213_FB_data = "I feel tired"
213_FB_id = "FB post #32"
213_FB_providerUid = "FB user #66"
213_FB_type = "Status update"
![Page 17: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/17.jpg)
Implementation Details
*API access protected via signature
*Gigya Daemon on both t1.micro servers
-But only active on one of them
*Astyanax client talks to Cassandra
*Cluster uses 3 m1.large servers
[Diagram labels: Durkheim App, AWS Load Balancer, Durkheim Social API (shown twice), EC2 t1.micro servers, EC2 m1.large servers]
![Page 18: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/18.jpg)
Predictive Analytics at Scale
Running workflows against Cassandra data
![Page 19: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/19.jpg)
How to process all this social media goodness?
*Models are defined elsewhere
*These are "black boxes" to us
213_FB_data | 213_FB_id | 213_FB_providerUid | 213_FB_type
"I feel tired" | "FB post #32" | "FB user #66" | "Status update"
307_TW_data | 307_TW_id | 307_TW_providerUid | 307_TW_type
"Where am I?" | "Tweet #17" | "TW user #109" | "Tweet"
Feature Extraction Model
model | rating | probability | keywords
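Before a black-box model can look at a user's text, the flat composite columns have to be regrouped into one record per event. The sketch below shows one way to do that in plain Python. It assumes the column names really are underscore-joined strings as drawn on the slides (the stored form may be a true composite type), and `group_columns` is a hypothetical helper name.

```python
from collections import defaultdict

def group_columns(columns):
    """Group flat 'ts_src_kind' column names into one record per (timestamp, source)."""
    events = defaultdict(dict)
    for name, value in columns.items():
        # Split into at most three parts: timestamp, source, kind.
        ts, source, kind = name.split("_", 2)
        events[(int(ts), source)][kind] = value
    # Return a plain dict ordered by timestamp, then source.
    return dict(sorted(events.items()))

columns = {
    "307_TW_data": "Where am I?",
    "307_TW_type": "Tweet",
    "213_FB_data": "I feel tired",
    "213_FB_id": "FB post #32",
    "213_FB_providerUid": "FB user #66",
    "213_FB_type": "Status update",
}
events = group_columns(columns)
```

Each value in `events` then carries the `data` text and its metadata together, which is the shape a feature extraction step would typically want.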
![Page 20: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/20.jpg)
Why do we need Hadoop?
*Running one model on one user is easy
-And n models on one user is still OK
*But when a model changes...
-all users with the model need processing
![Page 21: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/21.jpg)
Batch processing is OK
*No strict low-latency requirements
*So we use Hadoop, for scalability and reliability
![Page 22: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/22.jpg)
Hadoop Workflow Details
*Implemented using Cascading
*Read Activities Table using Cassandra Tap
*Read models from MySQL via JDBC
![Page 23: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/23.jpg)
Hadoop Bulk Classification Workflow
[Workflow diagram, in processing order:]
-Read Social Media Activity Table, then Convert from Cassandra
-Read User Profiles Table, then Convert from Cassandra
-CoGroup by user profile ID
-Run Classifier models
-Convert to Cassandra
-Write Classification Result Table
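The shape of this workflow can be sketched in a few lines of plain Python. This is not Cascading code and not the project's classifier: `cogroup_by_user` imitates a CoGroup as an inner join on user ID (an assumption about the join type), and `toy_classifier` is a placeholder standing in for the black-box models, with made-up watched phrases and no probability output.

```python
from collections import defaultdict

def cogroup_by_user(users, activities):
    """Pair each user profile with that user's activity texts (inner join on user ID)."""
    texts = defaultdict(list)
    for user_id, text in activities:
        texts[user_id].append(text)
    return [(uid, profile, texts[uid]) for uid, profile in users.items() if uid in texts]

def toy_classifier(texts, watched=("feel tired", "feel angry")):
    """Placeholder model: reports which watched phrases occur in the texts."""
    hits = sorted({w for t in texts for w in watched if w in t.lower()})
    rating = "yellow" if hits else "green"
    return {"model": "toy", "rating": rating, "keywords": hits}

def classify_all(users, activities, model=toy_classifier):
    """Run the model once per co-grouped user and collect result rows by user ID."""
    return {uid: model(texts) for uid, _profile, texts in cogroup_by_user(users, activities)}

users = {"uuid1": {"first_name": "A"}, "uuid2": {"first_name": "B"}}
activities = [("uuid1", "I feel tired"), ("uuid1", "Where am I?"), ("uuid3", "Hello")]
results = classify_all(users, activities)
```

Since every user is processed independently after the grouping step, the work splits cleanly across a cluster, which is what makes re-running a changed model over all users a good fit for Hadoop.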
![Page 24: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/24.jpg)
Workflow Issues
*Currently manual operation
-Ultimately needs a daemon to trigger (time, users, models)
*Runs in separate cluster
-Lots of network activity to pull data from Cassandra cluster
-With DSE we could run on same cluster
*Fun with AWS security groups
![Page 25: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/25.jpg)
Solr Search
Poking at the data
![Page 26: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/26.jpg)
Solr Search
*Model results include key terms for classification result
-"feel angry" (0.732)
*Now you want to check actual usage of these terms
![Page 27: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/27.jpg)
Poking at the Data
*Hadoop turns petabytes into pie-charts
*How do you verify results?
*Search works really well here
![Page 28: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/28.jpg)
Solr Search
*Want "narrow" table for search
-Solr dynamic fields are usually not a great idea
-Limit to 1024 dynamic fields per document
*So we'll replicate some of our Activity CF data into a new CF
*Don't be afraid of making copies of data
![Page 29: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/29.jpg)
The "Search" Column Family
*Row key is derived from Activity CF UUID + target column name
*One column ("data") has content from that row + column in Activity CF
Activity Column Family
row key | 213_FB_data | 213_FB_id
"uuid1" | "I feel tired" | "FB post #32"
Search Column Family
row key | "data"
"uuid1_213_FB" | "I feel tired"
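The mapping from one wide Activity row to many narrow Search rows can be sketched as follows. This is a plain-Python illustration, not the project's code: `to_search_rows` is a hypothetical name, and the underscore-joined key format is taken from how the slide draws it.

```python
def to_search_rows(user_id, activity_columns, target_kind="data"):
    """Build Search CF rows (row key -> {'data': text}) from one Activity CF row."""
    rows = {}
    for name, value in activity_columns.items():
        # "213_FB_data" -> prefix "213_FB", kind "data"
        prefix, _, kind = name.rpartition("_")
        if kind == target_kind:
            rows[f"{user_id}_{prefix}"] = {"data": value}
    return rows

activity_row = {
    "213_FB_data": "I feel tired",
    "213_FB_id": "FB post #32",
    "307_TW_data": "Where am I?",
    "307_TW_id": "Tweet #17",
}
search_rows = to_search_rows("uuid1", activity_row)
```

Only the `data` columns are copied, so each Search row has a single fixed column and the Solr schema never needs dynamic fields.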
![Page 30: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/30.jpg)
Solr Schema
*Very simple (which is how we like it)
*Direct one-to-one mapping with Cassandra columns
*Hits have key field, which contains UUID/Timestamp/Service
<fields>
  <field name="key" type="string" indexed="true" stored="true" />
  <field name="data" type="text" indexed="true" stored="true" />
</fields>
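With this schema, checking how a model's key term is actually used comes down to a phrase query on the `data` field, followed by splitting each hit's `key` back into its parts. The sketch below only builds the query string and parses a key; it does not contact a server. The parameters `q`, `fl`, `rows` and `wt` are standard Solr query parameters, while the function names and the underscore separator in the key are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_phrase_query(phrase, rows=10):
    """Build URL query-string parameters for a phrase search on the 'data' field."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'data:"{escaped}"', "fl": "key,data", "rows": rows, "wt": "json"}
    return urlencode(params)

def parse_hit_key(key):
    """Split a hit's key into (user UUID, timestamp, service)."""
    # Split from the right, since the UUID part may itself contain separators.
    user_id, ts, service = key.rsplit("_", 2)
    return user_id, int(ts), service

query_string = build_phrase_query("feel angry")
hit = parse_hit_key("uuid1_213_FB")
```

The parsed UUID and timestamp are enough to go back to the Activity Column Family and pull the surrounding activity for that user.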
![Page 31: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/31.jpg)
Combined Cluster
*One Cassandra Cluster can allocate nodes for Hadoop & Search
![Page 32: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/32.jpg)
Security
Locking things down
![Page 33: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/33.jpg)
The Most Important Detail
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
![Page 34: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/34.jpg)
Three Aspects of Security
*Server-level
-ssh via restricted private key
*API-level
-validate requests using signature
-secure SHA1 hash
*Services-level
-Restrict open ports using security groups
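The slides do not say how the API signature is built, only that requests are validated with a signature and a SHA1 hash. One common way to do this is an HMAC-SHA1 over the request fields with a shared secret, sketched below with Python's standard `hmac` and `hashlib` modules. The message layout (method, path, timestamp) and the function names are assumptions, not the Durkheim API's actual scheme.

```python
import hashlib
import hmac

def sign_request(secret, method, path, timestamp):
    """Compute a hex HMAC-SHA1 signature over the request fields (assumed layout)."""
    message = f"{method}\n{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha1).hexdigest()

def is_valid(secret, method, path, timestamp, signature):
    """Recompute the signature server-side and compare in constant time."""
    expected = sign_request(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)

secret = b"shared-secret"  # hypothetical key known to both client and API
signature = sign_request(secret, "GET", "/users/uuid1/activities", 1371000000)
ok = is_valid(secret, "GET", "/users/uuid1/activities", 1371000000, signature)
tampered = is_valid(secret, "GET", "/users/uuid2/activities", 1371000000, signature)
```

Including a timestamp in the signed message lets the server reject old requests, and `hmac.compare_digest` avoids leaking information through comparison timing.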
![Page 35: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/35.jpg)
Summary
Bringing it all home
![Page 36: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/36.jpg)
Key Points...
*You can effectively use Cassandra as:
-A repository for social media data
-The data source for workflows
-A search index, via Solr integration
![Page 37: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/37.jpg)
And the Meta-Point
*It is possible to do more with big data than optimize ad yields
![Page 38: C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler](https://reader034.vdocument.in/reader034/viewer/2022051412/548314d9b4af9f640d8b4943/html5/thumbnails/38.jpg)
THANK YOU