Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
DESCRIPTION
Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. To investigate the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 to provide a solid foundation for Project EPIC's data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events that must be stored, processed, analyzed, and visualized. This talk will cover how Project EPIC makes use of Cassandra and discuss some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.
TRANSCRIPT
Using Cassandra to Support Crisis Informatics Research
Kenneth M. Anderson, Associate Professor
Department of Computer Science
Co-Director of The Center for Software and Society
Co-Director of Project EPIC
Director of CU’s Big Data Initiative
Happy Ada Lovelace Day!
Ken Anderson, Associate Professor, Department of Computer Science
‣ Research Interests • Software Architecture and Software Design
• Data-Intensive Systems and Crisis Informatics
‣ Teaching Interests • Software Engineering; OO A&D; Data Engineering
‣ Active in Broadening Participation in Computer Science • Led the creation of the BA in CS degree at CU
- 450 new CS majors in two years; 900 CS majors on campus
Project EPIC
‣ Empowering the Public with Information in Crisis • Largest NSF-Funded Project on Crisis Informatics
- ~$4M in funding since Fall 2009
‣ Results • ~60 research publications, 2 PostDocs, 5 PhD graduates, 4
MS graduates, 13 current PhD students
• Tweak the Tweet; 100+ data sets (~1.5B tweets)
• Software: Data collection, analytics, NLP, GIS
Crisis Informatics The study of how technology is changing the way the world responds to mass emergency events
70K Geotagged Tweets prior/during/after
Hurricane Sandy Landfall
[Chart: Tweets Per Minute, 9/12/13 12:00 AM through 9/20/13 12:00 PM]
2013 Colorado Floods — First Nine Days
Average Tweets Per Minute, by day: 51, 31, 15, 17, 11, 7, 7, 5, 3
Project EPIC Software Infrastructure
‣ EPIC Collect • Twitter data collection infrastructure capable of collecting
24/7 with 99.9% uptime (since 2010)
- Built on top of Cassandra and designed for scalability, availability, and flexibility
‣ EPIC Analyze • A scalable and flexible data analytics environment that
allows Project EPIC analysts to browse, search, filter, annotate, and process EPIC Collect data sets
- Built on top of DataStax Enterprise, Redis, Rails, & Postgres
Project EPIC Software Architecture
Logical arrangement of components, deployed across seven servers in a CU data center
[Diagram: Application Layer (EPIC Event Editor, EPIC Analyze, Splunk); Service and Storage Layers (Twitter, EPIC Collect, Redis, PostgreSQL, Cassandra, Solr, Pig, Hadoop), with Cassandra, Solr, Pig, and Hadoop provided by DataStax Enterprise]
EPIC Collect
[Diagram: Twitter streams into a Collection Service running in a data center; the service writes to a log and to a four-node Cassandra cluster; the Project EPIC Event Editor manages collection]
Why Cassandra?
Flexibility. Immune to changes in Tweet metadata; each tweet is stored as its full JSON object ({ “id”: … }).
Why Cassandra?
Availability. Tweets can be written to any node in the cluster.
Why Cassandra?
Scalability. Need more disk space? Add more nodes!
Why Cassandra?
Robustness. Data on nodes is automatically replicated.
However…
Data Modeling is Wicked Hard
Getting Row Keys Right
Cassandra Data Model
It’s hash tables all the way down…
Row Key 1 → Column Name A: Value ••• Column Name X: Value
•••
Row Key N → Column Name B: Value ••• Column Name Y: Value
The design of row keys is critical.
Why?
‣ Row keys determine what you can retrieve • They are your primary means to make a query and retrieve
relevant data; their structure determines query expressivity
• It should be easy to generate them from elements of your problem domain
‣ Row keys determine how “wide” your rows are • This is important because Cassandra replicates rows
‣ Row keys are partitioned across your cluster’s nodes • A “bad” row key design can negatively impact performance
Row Keys Should Reflect Problem Domain
‣ You need to be able to easily generate row keys based on information in your problem domain
<region_name>:<entity_name>:<time_collected>
vs 751e8446ede178f10fd44e3a37affb6b15ed30ce
‣ The former: easily generated from domain objects • easily reconstructed at query time
‣ The latter might be easily generated • but not easily reconstructed
The Reason?
‣ No easy way to ask Cassandra for all row keys in a column family
• If you want to get this information, you have to query Cassandra for it, in batches, until all row keys have been retrieved
- This is not an O(1) operation!
‣ Instead, it’s better if you can skip this step and reconstruct from your problem domain
• US_EastCoast:Invoices:0000_01012014 to US_EastCoast:Invoices:2359_12312014
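One way to see the benefit: both endpoints of the range above can be regenerated purely from domain values, with no "list all row keys" query against Cassandra. A minimal Python sketch (the function name and key layout are illustrative, inferred from the example above, not Project EPIC code):

```python
from datetime import datetime

def invoice_row_key(region: str, entity: str, when: datetime) -> str:
    # Build a key like "US_EastCoast:Invoices:0000_01012014" from domain values.
    return f"{region}:{entity}:{when.strftime('%H%M_%m%d%Y')}"

# At query time, the same function reconstructs the endpoints of a range scan.
start = invoice_row_key("US_EastCoast", "Invoices", datetime(2014, 1, 1, 0, 0))
end = invoice_row_key("US_EastCoast", "Invoices", datetime(2014, 12, 31, 23, 59))
print(start)  # US_EastCoast:Invoices:0000_01012014
print(end)    # US_EastCoast:Invoices:2359_12312014
```

An opaque hash key, by contrast, cannot be rebuilt this way: you would need the original object in hand just to ask for it.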
Wide vs. Narrow
‣ You can design “wide” rows or “narrow” rows • This corresponds to returning a LOT of information for a
given key or a limited amount of information
• Wide rows can be useful, for instance, if your domain has lots of “events” on a given day or within a given hour
fb_users_dk → user 1; user 2; …; user 100,000; …   (wide)
ken_age_ht → age; height   (narrow)
The Rub? Rows Get Replicated
As previously mentioned, rows get replicated across the nodes of the cluster; for wide rows, this can be a performance concern.
How wide is too wide? It depends on the size of your cluster and your network bandwidth.
Row Keys Get Partitioned
‣ The nodes in your cluster divide up the key space between them
• The value of a row key determines where it will get stored
‣ You have to be cognizant of this partition because often Cassandra is being used in situations where a LOT of data is being written to it
• You need to make sure your row key design does not overburden any one node in your cluster
Imagine your row_key is a monotonically increasing integer: say, tweet ids, written by the Twitter Collection Service.
Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!
Instead, you want enough variation (row_key_1, row_key_a, row_key_$, row_key_2, …) that keys get evenly distributed across the cluster, for both writers and readers.
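The hotspot can be illustrated with a small simulation. This is a toy model, not Cassandra's actual partitioner: it assumes a hypothetical four-node cluster and contrasts an order-preserving split of the key space with a hash-based one:

```python
import hashlib
from collections import Counter

NODES = 4

def ordered_node(key: str) -> int:
    # Toy order-preserving partitioner: each node owns a contiguous slice
    # of the key space, decided here by the key's first byte.
    return min(key.encode("utf-8")[0] * NODES // 256, NODES - 1)

def hashed_node(key: str) -> int:
    # Toy random partitioner: hash the key first, so nearby keys scatter.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NODES

tweet_ids = [str(440_000_000 + i) for i in range(1_000)]  # monotonically increasing

print(Counter(ordered_node(k) for k in tweet_ids))  # one node takes all 1,000 writes
print(Counter(hashed_node(k) for k in tweet_ids))   # writes spread across all 4 nodes
```

The same intuition carries over to a real cluster: sequential keys under an order-preserving scheme overburden one node, while hashed or deliberately varied keys spread the write load.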
Design of Row Key for EPIC Collect
‣ For Project EPIC, we make use of a “hybrid” row key
• The first part of the row_key is a keyword used to collect tweets for a given event
- earthquake, flood, cowx, obama, …
• The second part of the row_key is the Julian day that a tweet was collected on
- January 1, 2014 equals “2014001”; February 1, 2014 equals “2014032”; etc.
• The third part of the row_key is the last digit of an MD5 hash of the entire Tweet JSON object
- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
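Assembled in code, the hybrid key might look like the following Python sketch. This is an illustrative stand-in, not Project EPIC's actual implementation:

```python
import hashlib
import json
from datetime import date

def epic_row_key(keyword: str, collected_on: date, tweet: dict) -> str:
    # Hybrid key: keyword + Julian day + last hex digit of an MD5 hash
    # of the full tweet JSON (the tag spreads a day's tweets across nodes).
    julian_day = collected_on.strftime("%Y%j")  # Feb 1, 2014 -> "2014032"
    raw = json.dumps(tweet, sort_keys=True).encode("utf-8")
    tag = hashlib.md5(raw).hexdigest()[-1]      # one of 0-9, a-f
    return f"{keyword}:{julian_day}:{tag}"

key = epic_row_key("flood", date(2014, 1, 2), {"id": 1, "text": "This flood is ..."})
print(key)  # "flood:2014002:" followed by a single hex tag character
```

Because the tag comes from a hash of the whole tweet, each day's tweets for a keyword fan out over sixteen rows instead of one.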
Tweets Column Family
keyword:day:tag → Tweet Id 1: JSON ••• Tweet Id N: JSON
•••
keyword2:day:tag → Tweet Id 1: JSON ••• Tweet Id M: JSON
‣ keyword: a word of interest for an event; e.g. “flood”
‣ julian_day: the day of the year a tweet was collected
‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
Row Key Distribution
[Diagram: the sixteen row keys flood:002:0 through flood:002:f are partitioned across the four-node Cassandra cluster, and each row is also replicated to other nodes]
EPIC Analyze
A data analytics environment for large Twitter data sets
‣ Provides a scalable and extensible analysis environment • Aims to partially automate Project EPIC’s analysis work
- Automatically calculate common metrics on all data sets
- Apply new analysis algorithms to entire data sets at once
- Support filtering/sampling on large data sets
- Support shared data set annotation by a team of analysts
• Provide these features while
- supporting data sets of millions of tweets
- with fast performance so as not to interrupt analysis work
[Diagram: Project EPIC Web Apps and 3rd Party Analytics Apps built on DataStax Enterprise (Hadoop, Cassandra, Solr), along with Redis, Pig, and Facebook]
Challenges
‣ Recall: goal of EPIC Collect is to store events in a reliable, scalable fashion
‣ Data not necessarily structured to support analysis • Implication: Need for Migration/Duplication to enable
features such as searching, filtering, analysis, etc.
Data Migration and Duplication
‣ With EPIC Collect, we chose to have fairly “wide” rows • Each row stores the tweets that contain a given keyword for
a given day
- “All tweets that contain the word ‘flood’ collected on 01/01/14”
- We use the “tag” to keep the row from growing too large, but there can still be 100s of 1000s of tweets in each row
‣ To support searching/filtering, we want to use Solr • however, Solr requires “narrow” structured rows
- one tweet per row, each column defined by a schema
We go from…
flood:2014002:a → tweet_1, tweet_2, tweet_3, …, tweet_999999, …, tweet_N
(each tweet stored as its JSON, e.g. {“text” : “This flood is …” …})
To this…
row_key_for_tweet_1 → tweet_1_attributes
row_key_for_tweet_2 → tweet_2_attributes
row_key_for_tweet_3 → tweet_3_attributes
…
row_key_for_tweet_N → tweet_N_attributes
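The reshaping step of the import can be sketched as follows. This is a simplified stand-in, not EPIC Analyze's actual script, and the attributes kept for Solr are illustrative:

```python
def to_narrow_rows(wide_row_key: str, tweets: dict) -> dict:
    # Flatten one wide row (row key -> many tweet JSON objects) into
    # one narrow row per tweet, keeping only searchable attributes.
    narrow = {}
    for tweet_id, tweet in tweets.items():
        narrow[f"{wide_row_key}:{tweet_id}"] = {
            "id": tweet.get("id"),
            "text": tweet.get("text"),
            "created_at": tweet.get("created_at"),
            "screen_name": tweet.get("user", {}).get("screen_name"),
        }
    return narrow

wide = {
    "tweet_1": {"id": 1, "text": "This flood is ...",
                "created_at": "Thu Jan 02 2014",
                "user": {"screen_name": "example_user"}},
}
rows = to_narrow_rows("flood:2014002:a", wide)
print(rows["flood:2014002:a:tweet_1"]["text"])  # This flood is ...
```

Dropping unsearched attributes keeps the duplication partial, which matches the trade-off described below.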
Implications
‣ Each time a data set is “imported” into EPIC Analyze • we must launch a script that reformats each tweet into the
“narrow row” format required by Solr
- In the future, we’ll modify collection to write tweets both ways
‣ It’s not a complete duplication • we only store those attributes that we want to search on
- but it’s still significant
‣ The benefit is that we can then apply all of Solr’s powerful search capabilities to our data sets
Conclusions
Cassandra: Strong Foundation for Project EPIC
‣ With migration to Cassandra in 2012, EPIC Collect has been running 24/7 with minimal downtime
• Downtime usually related to network outages
• Cassandra keeps right on ticking!
‣ Has provided Project EPIC with a reliable environment to perform a wide range of crisis informatics research
• leading to new understanding of how people use Twitter to coordinate and collaborate during times of disaster
Cassandra: Strong Foundation for Project EPIC
‣ An excellent NoSQL technology, but you must take time to understand Cassandra’s advantages and its data model
• Provides flexibility, availability, scalability, and robustness
• Row keys
- difficult to get right (but that’s true of all data modeling tasks!)
- design to reflect your problem domain
- to determine width of rows (and speed of replication)
- and to partition data across your cluster
Thank You
Ken Anderson <[email protected]>
Project EPIC: <http://epic.cs.colorado.edu>
Department of Computer Science University of Colorado
@epiccolorado