An efficient data mining solution by integrating Spark and Cassandra

AN EFFICIENT DATA MINING SOLUTION

http://www.stratio.com/

Hadoop?

Cassandra?

Spark?

Stratio Deep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Goals

• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?

Cassandra

• Based on Amazon's Dynamo…
  • Replication, key/value, P2P
• And based on Google's Bigtable…
  • Column oriented

ROBUST
FAST
EFFICIENT
NO BOTTLENECK
REPLICATED
DECENTRALIZED

Another Database?

One user – lots of data

Case A

Many users – little data

Case B

Many users – lots of data

Case C

Crawler app

Cassandra, I choose you

100M indexed pages

3k reads

Query time < 1s

But…

Marketing walks in

New query

“I need to find all the references to the domain ACME. I need the answer by Friday.”

Problem

Cassandra is not well suited to resolving this type of query.

You need to design the schema with the query in mind.
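
To make the problem concrete, here is a minimal sketch in plain Python (no Cassandra driver involved). It assumes a crawler table keyed by page URL. The table contents, the URLs and every function name are hypothetical and exist only for illustration. A lookup by key is a single read, while "find every page that references a domain" has no key to use and must inspect every row, unless a second table keyed by the referenced domain was designed up front.

```python
from collections import defaultdict

# Hypothetical crawler table: partition key (page URL) -> domains referenced on that page.
PAGES = {
    "http://a.example/1": ["acme.example", "other.example"],
    "http://b.example/2": ["other.example"],
    "http://c.example/3": ["acme.example"],
}


def lookup_by_url(url):
    """Fast path: one read addressed directly by the key."""
    return PAGES[url]


def pages_referencing_scan(domain):
    """Slow path: no key on `domain`, so every row has to be inspected."""
    return sorted(url for url, refs in PAGES.items() if domain in refs)


def build_reference_table(pages):
    """Query-first alternative: a second table keyed by the referenced domain."""
    by_domain = defaultdict(list)
    for url, refs in pages.items():
        for ref in refs:
            by_domain[ref].append(url)
    return {domain: sorted(urls) for domain, urls in by_domain.items()}
```

With 100M indexed pages the full scan is the expensive part. The second table answers the marketing query with one keyed read, but only if someone anticipated that query when the schema was designed.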

Challenge accepted

What options do we have?

• Run a Hive query on top of C*
• Write an ETL script and load the data into another DB
• Clone the cluster

And now… what can we do?

“We can't solve problems by using the same kind of thinking we used when we created them”

Albert Einstein

Spark

• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created at UC Berkeley's AMPLab; open sourced in 2010
• Can be up to 100 times faster than MapReduce for:
  • Iterative algorithms
  • Interactive data mining

Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/
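
Logistic regression is a typical iterative workload: every training step makes a full pass over the same dataset, so a system that keeps that dataset in memory between passes avoids re-reading it each time. The sketch below is a single-machine illustration in plain Python, not Spark code. The data points, the learning rate and the iteration count are made up for the example.

```python
import math

# Made-up, linearly separable 1-D points: (feature, label).
POINTS = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent; returns the weight, the bias and the number of data passes."""
    w, b = 0.0, 0.0
    passes = 0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in points:  # one full pass over the same data
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        passes += 1
        w -= learning_rate * grad_w / len(points)
        b -= learning_rate * grad_b / len(points)
    return w, b, passes
```

The number of passes equals the number of iterations, which is why the cost of re-reading the input on every pass dominates this kind of job.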

WHO USES SPARK?

Spark and Cassandra

Integration points

Cassandra’s HDFS abstraction layer

Advantages:
• Easily integrates with legacy systems.

Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)

Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: HDFS OVER CASSANDRA

CQL3 Integration

• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

CQL3 Integration (II)

Provides a Java friendly API:

• Developers map Column Families to custom serializable POJOs.
• Stratio Deep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
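
The idea of mapping rows to typed entities and then computing over those entities can be sketched as follows. This is plain Python standing in for the Java POJO approach and does not use or reproduce the Stratio Deep API. The `Page` class, its fields, the sample rows and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Page:
    """Hypothetical entity standing in for a serializable POJO."""
    url: str
    domain: str
    outbound_links: int


def to_entity(row):
    """Map one raw row (column name -> value) to a typed entity."""
    return Page(
        url=row["url"],
        domain=row["domain"],
        outbound_links=int(row["outbound_links"]),
    )


# Made-up rows, shaped like what a column family read might return.
ROWS = [
    {"url": "http://acme.example/a", "domain": "acme.example", "outbound_links": "3"},
    {"url": "http://acme.example/b", "domain": "acme.example", "outbound_links": "5"},
    {"url": "http://other.example/", "domain": "other.example", "outbound_links": "7"},
]


def total_links_for(rows, domain):
    """Map rows to entities, filter by domain, then reduce to a single total."""
    pages = map(to_entity, rows)
    matching = filter(lambda page: page.domain == domain, pages)
    return reduce(lambda total, page: total + page.outbound_links, matching, 0)
```

The calculation is written against `Page` objects rather than raw columns, which is the convenience the slide describes. In a distributed setting the same map, filter and reduce steps would run in parallel across the cluster.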

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

CQL3 Integration (III)

Drawbacks:
• Still not performing as well as we’d like:
  • Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
  • No SQL-like query features

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Future extensions

What are we currently working on?

Bring the integration to another level:
• Drop Cassandra’s Hadoop Interface.
• Access Cassandra’s SSTable files directly.
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power.

Conclusion

THANKS
