Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
DESCRIPTION
Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. To investigate the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 to provide a solid foundation for Project EPIC's data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events that must be stored, processed, analyzed, and visualized. This talk will cover how Project EPIC makes use of Cassandra and discuss some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.
TRANSCRIPT
Using Cassandra to Support Crisis Informatics Research
Kenneth M. Anderson, Associate Professor
Department of Computer Science
Co-Director of The Center for Software and Society
Co-Director of Project EPIC
Director of CU’s Big Data Initiative
Happy Ada Lovelace Day!
Ken Anderson, Associate Professor, Department of Computer Science
‣ Research Interests • Software Architecture and Software Design
• Data-Intensive Systems and Crisis Informatics
‣ Teaching Interests • Software Engineering; OO A&D; Data Engineering
‣ Active in Broadening Participation in Computer Science • Led the creation of the BA in CS degree at CU
- 450 new CS majors in two years; 900 CS majors on campus
Project EPIC
‣ Empowering the Public with Information in Crisis • Largest NSF-Funded Project on Crisis Informatics
- ~$4M in funding since Fall 2009
‣ Results • ~60 research publications, 2 PostDocs, 5 PhD graduates, 4
MS graduates, 13 current PhD students
• Tweak the Tweet; 100+ data sets (~1.5B tweets)
• Software: Data collection, analytics, NLP, GIS
Crisis Informatics The study of how technology is changing the way the world responds to mass emergency events
70K Geotagged Tweets prior/during/after
Hurricane Sandy Landfall
[Chart: Tweets Per Minute, 9/12/13 12:00 AM through 9/20/13 12:00 PM]
2013 Colorado Floods — First Nine Days
Average Tweets Per Minute, by day: 51, 31, 15, 17, 11, 7, 7, 5, 3
Project EPIC Software Infrastructure
‣ EPIC Collect • Twitter data collection infrastructure capable of collecting
24/7 with 99.9% uptime (since 2010)
- Built on top of Cassandra and designed for scalability, availability, and flexibility
‣ EPIC Analyze • A scalable and flexible data analytics environment that
allows Project EPIC analysts to browse, search, filter, annotate, and process EPIC Collect data sets
- Built on top of DataStax Enterprise, Redis, Rails, & Postgres
Project EPIC Software Architecture
Logical arrangement of components, deployed across seven servers in a CU data center
[Diagram: Application Layer (EPIC Event Editor, EPIC Analyze, Splunk); Service and Storage Layers (Twitter, EPIC Collect, Redis, PostgreSQL, Cassandra, Solr, Pig, Hadoop), with Cassandra, Solr, Pig, and Hadoop provided by DataStax Enterprise]
EPIC Collect
[Diagram: Twitter streams into a Collection Service running in a data center; the service writes to a log and to a four-node Cassandra cluster; the Project EPIC Event Editor manages collection]
Why Cassandra?
Flexibility. Immune to changes in Tweet metadata; each tweet is stored as its full JSON object ({ “id”: … }).
Why Cassandra?
Availability. Tweets can be written to any node in the cluster.
Why Cassandra?
Scalability. Need more disk space? Add more nodes!
Why Cassandra?
Robustness. Data on nodes is automatically replicated.
However…
Data Modeling is Wicked Hard
Getting Row Keys Right
Cassandra Data Model
It’s hash tables all the way down…
Row Key 1 → Column Name A: Value ••• Column Name X: Value
•••
Row Key N → Column Name B: Value ••• Column Name Y: Value
The design of row keys is critical.
Why?
‣ Row keys determine what you can retrieve • They are your primary means to make a query and retrieve
relevant data; their structure determines query expressivity
• It should be easy to generate them from elements of your problem domain
‣ Row keys determine how “wide” your rows are • This is important because Cassandra replicates rows
‣ Row keys are partitioned across your cluster’s nodes • A “bad” row key design can negatively impact performance
Row Keys Should Reflect Problem Domain
‣ You need to be able to easily generate row keys based on information in your problem domain
<region_name>:<entity_name>:<time_collected>
vs 751e8446ede178f10fd44e3a37affb6b15ed30ce
‣ The former: easily generated from domain objects • easily reconstructed at query time
‣ The latter might be easily generated • but not easily reconstructed
The Reason?
‣ No easy way to ask Cassandra for all row keys in a column family
• If you want to get this information, you have to query Cassandra for it, in batches, until all row keys have been retrieved
- This is not an O(1) operation!
‣ Instead, it’s better if you can skip this step and reconstruct from your problem domain
• US_EastCoast:Invoices:0000_01012014 to US_EastCoast:Invoices:2359_12312014
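One way to see the benefit: both endpoints of the range above can be regenerated purely from domain values, with no "list all row keys" query against Cassandra. A minimal Python sketch (the function name and key layout are illustrative, inferred from the example above, not Project EPIC code):

```python
from datetime import datetime

def invoice_row_key(region: str, entity: str, when: datetime) -> str:
    # Build a key like "US_EastCoast:Invoices:0000_01012014" from domain values.
    return f"{region}:{entity}:{when.strftime('%H%M_%m%d%Y')}"

# At query time, the same function reconstructs the endpoints of a range scan.
start = invoice_row_key("US_EastCoast", "Invoices", datetime(2014, 1, 1, 0, 0))
end = invoice_row_key("US_EastCoast", "Invoices", datetime(2014, 12, 31, 23, 59))
print(start)  # US_EastCoast:Invoices:0000_01012014
print(end)    # US_EastCoast:Invoices:2359_12312014
```

An opaque hash key, by contrast, cannot be rebuilt this way: you would need the original object in hand just to ask for it.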
Wide vs. Narrow
‣ You can design “wide” rows or “narrow” rows • This corresponds to returning a LOT of information for a
given key or a limited amount of information
• Wide rows can be useful, for instance, if your domain has lots of “events” on a given day or within a given hour
fb_users_dk → user 1; user 2; …; user 100,000; …   (wide)
ken_age_ht → age; height   (narrow)
The Rub? Rows Get Replicated
As previously mentioned, rows get replicated across the nodes of the cluster; for wide rows, this can be a performance concern.
How wide is too wide? It depends on the size of your cluster and your network bandwidth.
Row Keys Get Partitioned
‣ The nodes in your cluster divide up the key space between them
• The value of a row key determines where it will get stored
‣ You have to be cognizant of this partition because often Cassandra is being used in situations where a LOT of data is being written to it
• You need to make sure your row key design does not overburden any one node in your cluster
Imagine your row_key is a monotonically increasing integer: say, tweet ids, written by the Twitter Collection Service.
Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!
Instead, you want enough variation (row_key_1, row_key_a, row_key_$, row_key_2, …) that keys get evenly distributed across the cluster, for both writers and readers.
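The hotspot can be illustrated with a small simulation. This is a toy model, not Cassandra's actual partitioner: it assumes a hypothetical four-node cluster and contrasts an order-preserving split of the key space with a hash-based one:

```python
import hashlib
from collections import Counter

NODES = 4

def ordered_node(key: str) -> int:
    # Toy order-preserving partitioner: each node owns a contiguous slice
    # of the key space, decided here by the key's first byte.
    return min(key.encode("utf-8")[0] * NODES // 256, NODES - 1)

def hashed_node(key: str) -> int:
    # Toy random partitioner: hash the key first, so nearby keys scatter.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NODES

tweet_ids = [str(440_000_000 + i) for i in range(1_000)]  # monotonically increasing

print(Counter(ordered_node(k) for k in tweet_ids))  # one node takes all 1,000 writes
print(Counter(hashed_node(k) for k in tweet_ids))   # writes spread across all 4 nodes
```

The same intuition carries over to a real cluster: sequential keys under an order-preserving scheme overburden one node, while hashed or deliberately varied keys spread the write load.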
Design of Row Key for EPIC Collect
‣ For Project EPIC, we make use of a “hybrid” row key
• The first part of the row_key is a keyword used to collect tweets for a given event
- earthquake, flood, cowx, obama, …
• The second part of the row_key is the Julian day that a tweet was collected on
- January 1, 2014 equals “2014001”; February 1, 2014 equals “2014032”; etc.
• The third part of the row_key is the last digit of an MD5 hash of the entire Tweet JSON object
- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
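Assembled in code, the hybrid key might look like the following Python sketch. This is an illustrative stand-in, not Project EPIC's actual implementation:

```python
import hashlib
import json
from datetime import date

def epic_row_key(keyword: str, collected_on: date, tweet: dict) -> str:
    # Hybrid key: keyword + Julian day + last hex digit of an MD5 hash
    # of the full tweet JSON (the tag spreads a day's tweets across nodes).
    julian_day = collected_on.strftime("%Y%j")  # Feb 1, 2014 -> "2014032"
    raw = json.dumps(tweet, sort_keys=True).encode("utf-8")
    tag = hashlib.md5(raw).hexdigest()[-1]      # one of 0-9, a-f
    return f"{keyword}:{julian_day}:{tag}"

key = epic_row_key("flood", date(2014, 1, 2), {"id": 1, "text": "This flood is ..."})
print(key)  # "flood:2014002:" followed by a single hex tag character
```

Because the tag comes from a hash of the whole tweet, each day's tweets for a keyword fan out over sixteen rows instead of one.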
Tweets Column Family
keyword:day:tag → Tweet Id 1: JSON ••• Tweet Id N: JSON
•••
keyword2:day:tag → Tweet Id 1: JSON ••• Tweet Id M: JSON
‣ keyword: a word of interest for an event; e.g. “flood”
‣ julian_day: the day of the year a tweet was collected
‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
Row Key Distribution
[Diagram: the sixteen row keys flood:002:0 through flood:002:f are partitioned across the four-node Cassandra cluster, and each row is also replicated to other nodes]
EPIC Analyze
A data analytics environment for large Twitter data sets
‣ Provides a scalable and extensible analysis environment • Aims to partially automate Project EPIC’s analysis work
- Automatically calculate common metrics on all data sets
- Apply new analysis algorithms to entire data sets at once
- Support filtering/sampling on large data sets
- Support shared data set annotation by a team of analysts
• Provide these features while
- supporting data sets of millions of tweets
- with fast performance so as not to interrupt analysis work
[Diagram: Project EPIC Web Apps and 3rd Party Analytics Apps built on DataStax Enterprise (Hadoop, Cassandra, Solr), along with Redis, Pig, and Facebook]
Challenges
‣ Recall: goal of EPIC Collect is to store events in a reliable, scalable fashion
‣ Data not necessarily structured to support analysis • Implication: Need for Migration/Duplication to enable
features such as searching, filtering, analysis, etc.
Data Migration and Duplication
‣ With EPIC Collect, we chose to have fairly “wide” rows • Each row stores the tweets that contain a given keyword for
a given day
- “All tweets that contain the word ‘flood’ collected on 01/01/14”
- We use the “tag” to keep the row from growing too large, but there can still be 100s of 1000s of tweets in each row
‣ To support searching/filtering, we want to use Solr • however, Solr requires “narrow” structured rows
- one tweet per row, each column defined by a schema
We go from…
flood:2014002:a → tweet_1, tweet_2, tweet_3, …, tweet_999999, …, tweet_N
(each tweet stored as its JSON, e.g. {“text” : “This flood is …” …})
To this…
row_key_for_tweet_1 → tweet_1_attributes
row_key_for_tweet_2 → tweet_2_attributes
row_key_for_tweet_3 → tweet_3_attributes
…
row_key_for_tweet_N → tweet_N_attributes
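The reshaping step of the import can be sketched as follows. This is a simplified stand-in, not EPIC Analyze's actual script, and the attributes kept for Solr are illustrative:

```python
def to_narrow_rows(wide_row_key: str, tweets: dict) -> dict:
    # Flatten one wide row (row key -> many tweet JSON objects) into
    # one narrow row per tweet, keeping only searchable attributes.
    narrow = {}
    for tweet_id, tweet in tweets.items():
        narrow[f"{wide_row_key}:{tweet_id}"] = {
            "id": tweet.get("id"),
            "text": tweet.get("text"),
            "created_at": tweet.get("created_at"),
            "screen_name": tweet.get("user", {}).get("screen_name"),
        }
    return narrow

wide = {
    "tweet_1": {"id": 1, "text": "This flood is ...",
                "created_at": "Thu Jan 02 2014",
                "user": {"screen_name": "example_user"}},
}
rows = to_narrow_rows("flood:2014002:a", wide)
print(rows["flood:2014002:a:tweet_1"]["text"])  # This flood is ...
```

Dropping unsearched attributes keeps the duplication partial, which matches the trade-off described below.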
Implications
‣ Each time a data set is “imported” into EPIC Analyze • we must launch a script that reformats each tweet into the
“narrow row” format required by Solr
- In the future, we’ll modify collection to write tweets both ways
‣ It’s not a complete duplication • we only store those attributes that we want to search on
- but it’s still significant
‣ The benefit is that we can then apply all of Solr’s powerful search capabilities to our data sets
Conclusions
Cassandra: Strong Foundation for Project EPIC
‣ With migration to Cassandra in 2012, EPIC Collect has been running 24/7 with minimal downtime
• Downtime usually related to network outages
• Cassandra keeps right on ticking!
‣ Has provided Project EPIC with a reliable environment to perform a wide range of crisis informatics research
• leading to new understanding of how people use Twitter to coordinate and collaborate during times of disaster
Cassandra: Strong Foundation for Project EPIC
‣ An excellent NoSQL technology, but you must take time to understand Cassandra’s advantages and its data model
• Provides flexibility, availability, scalability, and robustness
• Row keys
- difficult to get right (but that’s true of all data modeling tasks!)
- design to reflect your problem domain
- to determine width of rows (and speed of replication)
- and to partition data across your cluster
Thank You
Ken Anderson <[email protected]>
Project EPIC: <http://epic.cs.colorado.edu>
Department of Computer Science University of Colorado
@epiccolorado