stratio big data spain

AN EFFICIENT DATA MINING SOLUTION

Upload: alvaro-agea-herradon

Post on 02-Jul-2015

TRANSCRIPT

Page 1: Stratio   big data spain

AN EFFICIENT DATA MINING SOLUTION

Page 2: Stratio   big data spain

Hadoop?

Page 3: Stratio   big data spain

Cassandra?

Page 4: Stratio   big data spain

Spark?

Page 5: Stratio   big data spain

Stratio Deep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Page 6: Stratio   big data spain
Page 7: Stratio   big data spain

Goals

• Why do you need Cassandra?

• What is the problem?

• Why do you need Spark?

• How do they work together?

Page 8: Stratio   big data spain

Cassandra

• Based on Dynamo…

• Replication, Key/Value, P2P

• And based on Bigtable…

• Column oriented

Page 9: Stratio   big data spain

ROBUST FAST EFFICIENT

Page 10: Stratio   big data spain

NO BOTTLENECK REPLICATED DECENTRALIZED

Page 11: Stratio   big data spain

Another Database?

Page 12: Stratio   big data spain

Why?

Page 13: Stratio   big data spain

One user – Lots of data

Case A

Page 14: Stratio   big data spain

Many users – Little data

Case B

Page 15: Stratio   big data spain

Many users – Lots of data

Case C

Page 16: Stratio   big data spain

Crawler app

Cassandra, I choose you

100M indexed pages

3k reads

Query time

< 1s

Page 17: Stratio   big data spain

But…

Page 18: Stratio   big data spain

Marketing walks in

Page 19: Stratio   big data spain

New query

“I need to find all the references to the domain ACME.

I need the answer by Friday.”

Page 20: Stratio   big data spain

Problem

Cassandra is not well suited to resolving this type of query

You need to design the schema with the query in mind
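To make the point concrete, here is a minimal Python sketch. Plain dictionaries stand in for tables; the table layout, URLs, and function names are hypothetical and are not the crawler's actual schema. A table keyed by page URL answers "give me this page" with one lookup, but "which pages reference domain X?" has to touch every row unless a second table was designed for that query up front.

```python
from urllib.parse import urlparse

# Hypothetical table keyed by page URL: efficient lookups by key only.
pages_by_url = {
    "http://a.example/1": ["http://acme.example/home", "http://b.example/x"],
    "http://b.example/x": ["http://a.example/1"],
    "http://c.example/z": ["http://acme.example/jobs"],
}

def referrers_full_scan(domain):
    """Answer the new query against the existing table: visits every row."""
    return sorted(
        url for url, links in pages_by_url.items()
        if any(urlparse(link).hostname == domain for link in links)
    )

def build_referrers_by_domain(pages):
    """A second table designed with the query in mind: domain -> referring pages."""
    index = {}
    for url, links in pages.items():
        for link in links:
            index.setdefault(urlparse(link).hostname, set()).add(url)
    return index

referrers_by_domain = build_referrers_by_domain(pages_by_url)
```

With the second table in place, the marketing query becomes a single key lookup (`referrers_by_domain["acme.example"]`), at the cost of maintaining that table on every write.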

Page 21: Stratio   big data spain

Challenge Accepted

Page 22: Stratio   big data spain

What options do we have?

• Run Hive Query on top of C*

• Write an ETL script and load data into another DB

• Clone the cluster

Page 23: Stratio   big data spain

What options do we have?

Run Hive Query on top of C*

Write ETL scripts and load into another DB

Clone the cluster

Page 24: Stratio   big data spain

And now… what can we do?

“We can't solve problems by using the same kind

of thinking we used when we created them”

Albert Einstein

Page 25: Stratio   big data spain

• Alternative to MapReduce

• A low-latency cluster computing system

• For very large datasets

• Created by UC Berkeley AMP Lab in 2010

• May be up to 100 times faster than MapReduce for:

Iterative algorithms

Interactive data mining

Spark

Page 26: Stratio   big data spain

Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/
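Logistic regression is a typical iterative workload: every gradient step rescans the whole dataset. The sketch below shows that access pattern in plain Python. It does not use the Spark API, and the toy data, learning rate, and iteration count are made up for illustration. When each pass has to re-read its input from disk, that cost repeats on every iteration; keeping the dataset in memory between passes is where an engine like Spark can gain.

```python
import math

# Toy 1-D dataset (made up for illustration): label is 1 when x > 0.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent: every iteration rescans the full dataset."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Both sums walk over all points, once per iteration.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data)
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
    return w, b

w, b = train(points)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x, _ in points]
```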

Page 27: Stratio   big data spain

WHO USES SPARK?

Page 28: Stratio   big data spain

Spark and Cassandra

Integration points

Page 29: Stratio   big data spain

Cassandra’s HDFS abstraction layer

Advantages:

• Easily integrates with legacy systems.

Drawbacks:

• Very high-level: no access to Cassandra’s low-level features.

• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Page 30: Stratio   big data spain

Cassandra’s Hadoop Interface

• Thrift protocol

• CQL3 (our implementation)

Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: HDFS OVER CASSANDRA

Page 31: Stratio   big data spain

• Supports CQL3 features

• Respects data locality

• Good compromise between

performance / implementation complexity

CQL3 Integration

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Page 32: Stratio   big data spain

CQL3 Integration (II)

Provides a Java friendly API:

• Developers map Column Families to custom serializable POJOs

• StratioDeep wraps the complexity of performing Spark calculations

directly over the user provided POJOs.

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
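The slides do not show the StratioDeep API itself, which is Java. The sketch below illustrates only the general idea of mapping rows to typed objects, and it is written in Python for brevity. The `Page` class, its fields, and the sample rows are all hypothetical. Each row of a column family becomes a plain object, and the computation is then written against those objects rather than against raw columns.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    """Hypothetical entity mirroring one row of a 'pages' column family."""
    url: str
    domain: str
    num_links: int

def to_entity(row):
    """Map a raw row (column name -> value) to a typed object."""
    return Page(url=row["url"], domain=row["domain"], num_links=int(row["num_links"]))

# Made-up rows standing in for what a table scan would return.
rows = [
    {"url": "http://a.example/1", "domain": "a.example", "num_links": "3"},
    {"url": "http://a.example/2", "domain": "a.example", "num_links": "5"},
    {"url": "http://b.example/x", "domain": "b.example", "num_links": "1"},
]

pages = [to_entity(r) for r in rows]

# The aggregation works on object attributes, not on column lookups.
links_per_domain = Counter()
for page in pages:
    links_per_domain[page.domain] += page.num_links
```

In a distributed setting the entities must also be serializable so they can be shipped between nodes, which is presumably why the slide asks for serializable POJOs.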

Page 33: Stratio   big data spain

Demo

Page 34: Stratio   big data spain

Drawbacks:

• Still not performing as well as we’d like

Uses Cassandra’s Hadoop Interface

• No analyst-friendly interface:

No SQL-like query features

CQL3 Integration (III)

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Page 35: Stratio   big data spain

Bring the integration to another level:

• Dump Cassandra’s Hadoop Interface

• Direct access to Cassandra’s SSTable(s) files.

• Extend Cassandra’s CQL3 to make use of Spark’s distributed

data processing power

Future extensions

What are we currently working on?

Page 36: Stratio   big data spain

Conclusion

Page 37: Stratio   big data spain

THANKS