Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Post on 29-Jun-2015

DESCRIPTION

Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of traditional data analysis. Users aiming to combine both worlds, batch processing and streaming, have had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. It includes an efficient query planner designed from scratch, which determines the optimal path to execute a query and which components should be involved.

TRANSCRIPT

Stratio Meta
An efficient distributed datahub with batch and streaming query capabilities

Daniel Higuero dhiguero@stratio.com
Alvaro Agea alvaro@stratio.com

#CassandraSummit-2014

Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities

Daniel Higuero dhiguero@stratio.com
Alvaro Agea alvaro@stratio.com

STRATIO: Who are we?
•  Stratio is a Big Data company
•  Founded in 2013
•  Commercially launched in 2014
•  50+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

Cassandra: We love…
•  P2P architecture
•  Read/write performance
•  Fault tolerance
•  Easy to deploy
•  CQL

Agenda
•  Introduction
•  Crossdata architecture
•  Metadata management
•  Streaming sources
•  Full text search
•  Spark and Crossdata
•  ODBC
•  The future

Introduction
o  Big Data analysis is commonly associated with batch processing
   •  Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o  Users buy Big Data platforms, but
   •  How do I start?
   •  What is my entry point to the platform?

What do our clients demand?
o  Easy deployment
o  Easy administration
o  Read/write performance
o  Easy-to-learn query language
o  Integration with BI tools
o  Join operations
o  Support for streaming sources
o  Integration with other data stores
o  Ability to query data without thinking about the schema (non-indexed data)

Crossdata
o  A new technology that:
   •  Is not limited by the underlying datastore capabilities
   •  Leverages Spark to perform non-natively supported operations
   •  Supports batch and streaming queries
   •  Supports multiple clusters and technologies

Our architecture


Connecting to the outside world
o  Crossdata defines an IConnector extension interface
o  Users can easily add new connectors to support
   •  Different datastores
   •  Different processing engines
   •  Different versions
o  Each connector defines its capabilities

Query execution
[Diagram: query pipeline stages Parsing, Validation, Planning, Execution; components shown: C*, Connector1, Connector2, Connector3]
Our planner will choose the best connector for each query
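The idea of connectors declaring capabilities and a planner picking one per query can be sketched roughly as follows. This is a minimal Python illustration, not Crossdata's actual IConnector API: the `Connector` class, the capability names, and the `cost` field are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector that declares its capabilities."""
    name: str
    capabilities: frozenset
    cost: int  # assumed relative cost; lower means preferred


def choose_connector(required, connectors):
    """Return the cheapest connector whose capabilities cover `required`."""
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative connectors only; names and capability sets are assumptions.
CONNECTORS = [
    Connector("native-cassandra", frozenset({"SELECT", "FILTER_PK"}), cost=1),
    Connector("spark", frozenset({"SELECT", "FILTER_PK", "JOIN", "GROUP_BY"}), cost=10),
]
```

With these assumed values, `choose_connector({"SELECT"}, CONNECTORS)` picks the native connector, while a query that also needs `"JOIN"` falls through to the Spark-backed one.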

Multi-cluster support
o  Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
   •  Multiple clusters can coexist to optimize platform performance
      -  E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
   •  Each table is stored in exactly one datastore

Logical and physical mapping
[Diagram: an App catalog holding the Users, Test and old_users tables; datastores shown: C* production, C* development, Other datastores]
SELECT * FROM app.users;
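A logical catalog of this kind amounts to a lookup from a table name to the one datastore that holds it. The sketch below is only an illustration: the cluster names and the assignment of each table to a cluster are assumptions, since the transcript does not preserve the arrows of the diagram.

```python
# Hypothetical mapping from logical table names to the single cluster
# that stores each table (the pairing shown here is assumed).
CATALOG = {
    "app.users": "cassandra_production",
    "app.test": "cassandra_development",
    "app.old_users": "other_datastore",
}


def resolve(table):
    """Return the cluster that physically stores a logical table."""
    try:
        return CATALOG[table]
    except KeyError:
        raise LookupError(f"unknown table: {table}") from None
```

A query such as `SELECT * FROM app.users;` would then be routed to whatever `resolve("app.users")` returns, without the user naming a cluster.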

Metadata Management


Metadata in the era of Schemaless NoSQL datastores
o  Some datastores are schemaless but our applications are not!
   •  Flexible schemas vs Schemaless
   •  Crossdata provides a Metadata manager that stores schemas for any datasource
      -  Remember ODBC and those BI tools

Metadata management
[Diagram: Metadata Manager, Metadata Store, Infinispan, Connector, C* production; numbered steps 1 and 2]
o  Updated metadata information is maintained among Crossdata servers using Infinispan
o  If the connector does not support metadata operations, those are skipped

Streaming sources


Managing streaming sources
o  Today's use cases usually involve some type of streaming datasource
   •  Streaming data has an ephemeral nature
   •  In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables
[Diagram: a streaming source with columns col1:text, col2:int, col3:int, col4:text, described as {schema:{col1:…},…}, queried by Streaming_query0 … Streaming_queryn]

Streaming queries
o  Streaming queries are infinite by definition
   •  A time window is defined to create a batch-like view of the rows ingested by the system in that period
   •  The user launches queries specifying a processing time window
      -  Crossdata provides methods to list and stop running streaming queries

Streaming queries: window syntax
SELECT fieldGroup, avg(field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1=100 AND field2>100 GROUP BY fieldGroup;
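What a windowed query of this shape computes can be illustrated in plain Python. This is a sketch under assumptions, not Crossdata's implementation: the slides do not say whether windows are fixed or sliding, so fixed (tumbling) windows are assumed, and the `ts` timestamp field is hypothetical.

```python
from collections import defaultdict


def windowed_avg(rows, window_seconds):
    """Group rows into fixed windows, then apply the query's filter and average.

    Each row is a dict with 'ts' (seconds), 'field1', 'field2', 'fieldGroup'.
    Returns {window_index: {fieldGroup: avg(field2)}}.
    """
    # window index -> group -> [running sum, row count]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for row in rows:
        if row["field1"] == 100 and row["field2"] > 100:  # the WHERE clause
            acc = sums[int(row["ts"] // window_seconds)][row["fieldGroup"]]
            acc[0] += row["field2"]
            acc[1] += 1
    return {
        window: {group: total / count for group, (total, count) in groups.items()}
        for window, groups in sums.items()
    }
```

Calling `windowed_avg(rows, 300)` corresponds to `WITH WINDOW 5 minutes`: each 300-second slice of the stream is treated as a small batch table and aggregated on its own.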


Joining batch and streaming
SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;
[Diagram: the query split into three parts]
   SELECT * FROM demo.temporal WITH WINDOW 10 secs
   SELECT * FROM demo.users
   INNER JOIN ON users.name = temporal.name
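The decomposition above suggests that each window of the streaming table is joined with the batch table as if both were ordinary tables. A minimal sketch of that final join step, assuming both sides arrive as lists of dicts (the function name and the column-prefixing scheme are illustrative, not taken from Crossdata):

```python
def join_window_with_batch(window_rows, users):
    """Inner-join one window of streaming rows with a batch table on 'name'."""
    # Index the batch side once so each streaming row is a dictionary lookup.
    users_by_name = {}
    for user in users:
        users_by_name.setdefault(user["name"], []).append(user)

    joined = []
    for event in window_rows:
        for user in users_by_name.get(event["name"], []):
            # Prefix column names so both sides stay distinguishable.
            merged = {f"temporal.{k}": v for k, v in event.items()}
            merged.update({f"users.{k}": v for k, v in user.items()})
            joined.append(merged)
    return joined
```

Streaming rows with no matching user are dropped, as expected for an inner join.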


Full text search


Full text search with
o  Clients request the ability to perform full text searches
o  We have developed an integration between Lucene and Cassandra
o  C* users can now enjoy all Lucene features:
   •  Full text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra

Stratio Lucene 2i
[Diagram: five C* nodes and five Lucene indexes, one per node]

Full text search queries
o  With Crossdata, we simplify:
   •  The creation syntax
   •  The query syntax using the match operator
CREATE FULLTEXT INDEX ON app.users(name,email);
SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
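To show which rows a leading-wildcard pattern such as '*@stratio.com' selects, here is a rough Python approximation. It uses the standard-library `fnmatch` module over a list of rows, which is an assumption made for illustration only: a real Lucene index matches analyzed terms and does not scan rows one by one.

```python
from fnmatch import fnmatchcase


def match(rows, column, pattern):
    """Return rows whose `column` value matches a shell-style wildcard pattern."""
    return [row for row in rows if fnmatchcase(row[column], pattern)]
```

For example, `match(users, "email", "*@stratio.com")` keeps only the rows whose email ends in `@stratio.com`.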

Spark & Stratio Crossdata

Why Spark?
o  Stratio Crossdata uses Spark to perform non-natively supported operations
o  Spark brings several benefits over Hadoop
   •  In-memory processing
   •  RDD abstraction
   •  Simpler API
   •  Increased flexibility (e.g., no need for identity mapping)

What about Spark SQL?
o  Different approach to query execution
   •  We only use Spark when it speeds up queries
      -  Native drivers are faster for simple queries
      -  Spark SQL has limited RDD sources
   •  Avoid some Spark limitations
   •  Several batch and streaming contexts in a single JVM (SPARK-2243)

Query approach
[Diagram: SparkSQL approach (Cassandra, Spark, SparkSQL) compared with the Crossdata approach (Cassandra, Spark, Native driver, Stratio Crossdata)]

Our Cassandra-Spark integration
o  Project started in June 2013
   -  With the objective of providing a method to interact with Cassandra from Spark
   -  Initial approach based on the HadoopInputFormat interface
   -  Current version uses the native DataStax Java driver
https://github.com/Stratio/stratio-deep

Our Cassandra-Spark integration
o  Benchmark in progress comparing our solution with the DataStax Spark driver
   •  Results highly influenced by the split size
   •  Initial results are promising for Stratio Spark Integration using DataStax default values
      -  Group by: up to 40% faster
      -  Join: up to 17% faster
   •  Stay tuned for the benchmark publication!

Spark vs Lucene 2i
[Plot: Time versus Records returned, with series for Spark and Lucene 2i]

ODBC


Stratio Crossdata ODBC
o  Well-known interface standard (for BI tools, external apps, …)
o  We have implemented it using Simba SDK
o  ODBC opens the full potential of Stratio Crossdata to the external world
o  Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!

The future


o  Security
o  Query optimizer and smart query planner
o  Leverage system statistics
o  Support for UDFs
o  Become an Apache project
https://github.com/Stratio/stratio-meta

We are looking for an Apache Champion
Can you help us?

A wish list for Cassandra
o  Ability to stop running queries
   •  Interactive users are unpredictable
o  Some exception paths are not clear or defined (e.g., secondary indexes)
o  Distribute some of the operations currently performed on the coordinator
   •  E.g., aggregations like count(*)
