crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Meta
An efficient distributed datahub with batch and streaming query capabilities

Daniel Higuero, dhiguero@stratio.com
Alvaro Agea, alvaro@stratio.com

#CassandraSummit 2014

Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero, dhiguero@stratio.com

Stratio: Who are we?

•  Stratio is a Big Data company
•  Founded in 2013
•  Commercially launched in 2014
•  50+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution


Cassandra: we love…

•  P2P architecture
•  Read/write performance
•  Fault tolerance
•  Easy to deploy
•  CQL

Agenda

•  Introduction
•  Crossdata architecture
•  Metadata management
•  Streaming sources
•  Full text search
•  Spark and Crossdata
•  ODBC
•  The future


Introduction

o  Big Data analysis is commonly associated with batch processing
   •  Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o  Users buy Big Data platforms, but
   •  How do I start?
   •  What is my entry point to the platform?

What do our clients demand?

o  Easy deployment
o  Easy administration
o  Read/write performance
o  Easy-to-learn query language
o  Integration with BI tools
o  Join operations
o  Support for streaming sources
o  Integration with other data stores
o  Ability to query data without thinking about the schema (non-indexed data)


Crossdata

o  A new technology that:
   •  Is not limited by the underlying datastore capabilities
   •  Leverages Spark to perform non-natively supported operations
   •  Supports batch and streaming queries
   •  Supports multiple clusters and technologies

Our architecture

Connecting to the outside world

o  Crossdata defines an IConnector extension interface
o  Users can easily add new connectors to support
   •  Different datastores
   •  Different processing engines
   •  Different versions
o  Each connector defines its capabilities
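A minimal sketch of the idea that each connector declares its capabilities. The slides do not show the IConnector interface itself, so the class names, the method names and the `Operation` values below are all hypothetical; Python is used only for illustration.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class Operation(Enum):
    """Hypothetical operation names; the real capability list is not in the slides."""
    SELECT = auto()
    FILTER = auto()
    JOIN = auto()
    GROUP_BY = auto()


class Connector(ABC):
    """Illustrative stand-in for a connector extension point (not the real IConnector)."""

    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def capabilities(self) -> frozenset:
        ...

    def supports(self, required) -> bool:
        # A connector can run a query only if it declares every required operation.
        return set(required) <= self.capabilities()


class NativeCassandraConnector(Connector):
    """Assumed example: a native connector limited to simple reads."""

    def name(self) -> str:
        return "cassandra-native"

    def capabilities(self) -> frozenset:
        return frozenset({Operation.SELECT, Operation.FILTER})


class SparkConnector(Connector):
    """Assumed example: a processing-engine connector that declares every operation."""

    def name(self) -> str:
        return "spark"

    def capabilities(self) -> frozenset:
        return frozenset(Operation)


registry = [NativeCassandraConnector(), SparkConnector()]
```

Supporting a new datastore, engine or version then amounts to adding one more class to the registry.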

Query execution

Parsing → Validation → Planning → Execution

Connectors: Connector1, Connector2, Connector3

Our planner will choose the best connector for each query
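How the planner ranks connectors is not described in the slides. The sketch below assumes one plausible rule: keep the connectors whose declared capabilities cover the query, then take the one with the lowest priority number. Parsing and validation are reduced to keyword spotting, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorInfo:
    name: str
    capabilities: frozenset
    priority: int  # lower is preferred; an assumed ranking, not from the slides


def required_operations(query: str) -> set:
    """Rough stand-in for parsing and validation: spot keywords in the query."""
    q = query.upper()
    ops = {"SELECT"}
    if " WHERE " in q:
        ops.add("FILTER")
    if " JOIN " in q:
        ops.add("JOIN")
    if " GROUP BY " in q:
        ops.add("GROUP_BY")
    return ops


def plan(query: str, connectors) -> ConnectorInfo:
    """Planning step: pick the preferred connector able to run the whole query."""
    needed = required_operations(query)
    candidates = [c for c in connectors if needed <= c.capabilities]
    if not candidates:
        raise ValueError(f"no connector supports {sorted(needed)}")
    return min(candidates, key=lambda c: c.priority)


connectors = [
    ConnectorInfo("cassandra-native", frozenset({"SELECT", "FILTER"}), priority=0),
    ConnectorInfo("spark", frozenset({"SELECT", "FILTER", "JOIN", "GROUP_BY"}), priority=1),
]

simple = plan("SELECT * FROM app.users WHERE name = 'a';", connectors)
joined = plan("SELECT * FROM a INNER JOIN b ON a.x = b.x;", connectors)
```

Under these assumptions the simple filter goes to the native connector and the join falls back to Spark.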

Multi-cluster support

o  Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
   •  Multiple clusters can coexist to optimize platform performance
      -  E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
   •  A table is saved in a unique datastore

Logical and physical mapping

[Diagram: the app catalog and its tables (users, test, old_users) mapped onto physical clusters (C* production, C* development, other datastores)]

SELECT * FROM app.users;
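The logical-to-physical mapping can be pictured as a lookup from a qualified table name to the single cluster that stores it. This is only a sketch: the cluster identifiers are made up, and the placement of each table is an assumption, since the diagram does not survive in the text.

```python
# Hypothetical placement of the app catalog's tables; each table lives in one cluster.
CATALOG = {
    ("app", "users"): "cassandra_production",
    ("app", "test"): "cassandra_development",
    ("app", "old_users"): "other_datastore",
}


def resolve_cluster(qualified_table: str) -> str:
    """Return the cluster that stores a table written as catalog.table."""
    catalog_name, _, table = qualified_table.partition(".")
    try:
        return CATALOG[(catalog_name, table)]
    except KeyError:
        raise LookupError(f"unknown table: {qualified_table}") from None


target = resolve_cluster("app.users")
```

The user only writes `app.users`; the lookup decides which cluster receives the query.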

Metadata Management

Metadata in the era of Schemaless NoSQL datastores

o  Some datastores are schemaless but our applications are not!
   •  Flexible schemas vs Schemaless
   •  Crossdata provides a Metadata manager that stores schemas for any datasource
      -  Remember ODBC and those BI tools


Metadata management

[Diagram components: C* production, Connector, Infinispan, Metadata Store, Metadata Manager]

•  Updated metadata information is maintained among Crossdata servers using Infinispan
•  If the connector does not support metadata operations those are skipped
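A rough sketch of that behaviour: the schema is always recorded in the metadata store, and the operation is forwarded to the connector only when the connector implements it. A plain dict stands in for the Infinispan-backed store, and all class and method names are hypothetical.

```python
class MetadataManager:
    """Keeps schemas centrally and forwards metadata operations when possible."""

    def __init__(self, store: dict, connector):
        self.store = store          # stand-in for the replicated metadata store
        self.connector = connector

    def create_table(self, name: str, schema: dict) -> bool:
        # The schema is always recorded, even for schemaless datastores.
        self.store[name] = dict(schema)
        create = getattr(self.connector, "create_table", None)
        if create is None:
            return False            # connector has no metadata support: skipped
        create(name, schema)
        return True


class SchemaAwareConnector:
    """Assumed connector that can create tables in its datastore."""

    def __init__(self):
        self.created = []

    def create_table(self, name, schema):
        self.created.append(name)


class SchemalessConnector:
    """Assumed connector without metadata operations."""


schema = {"name": "text", "email": "text"}
aware = MetadataManager({}, SchemaAwareConnector())
schemaless = MetadataManager({}, SchemalessConnector())
forwarded = aware.create_table("app.users", schema)
skipped = schemaless.create_table("app.users", schema)
```

In both cases the schema stays available to clients that need one, such as ODBC-based BI tools.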

Streaming sources

Managing streaming sources

o  Today's use cases expect some type of streaming datasource
   •  Streaming data has an ephemeral nature
   •  In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables

[Diagram: a streaming source with columns col1:text, col2:int, col3:int, col4:text, described by {schema:{col1:…},…}, feeds Streaming_query0 … Streaming_queryn]

Streaming queries

o  Streaming queries are infinite by definition
   •  A time window is defined to create a batch-like view of the rows ingested by the system in that period
   •  The user launches queries specifying a processing time window
      -  Crossdata provides methods to list and stop running streaming queries
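The batch-like view can be illustrated by grouping timestamped rows into fixed windows and running an ordinary aggregation on each window. The sketch assumes non-overlapping (tumbling) windows, which the slides do not specify; the function and field names are hypothetical.

```python
from collections import defaultdict


def window_batches(events, window_seconds):
    """Group (timestamp_seconds, row) pairs into fixed, non-overlapping windows."""
    batches = defaultdict(list)
    for timestamp, row in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        batches[window_start].append(row)
    return dict(sorted(batches.items()))


def avg_by_group(rows, group_key, value_key):
    """Batch-style aggregation over the rows of one window."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for row in rows:
        totals[row[group_key]][0] += row[value_key]
        totals[row[group_key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


events = [
    (0, {"group": "a", "value": 10}),
    (120, {"group": "a", "value": 20}),
    (310, {"group": "b", "value": 5}),
]
batches = window_batches(events, window_seconds=300)
first_window = avg_by_group(batches[0], "group", "value")
```

Each window is a finite batch, so the infinite stream is processed one window at a time.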

Streaming queries: window syntax

SELECT fieldGroup, avg(field2)
FROM eph_table WITH WINDOW 5 minutes
WHERE field1 = 100 AND field2 > 100
GROUP BY fieldGroup;

Joining batch and streaming

SELECT * FROM demo.temporal WITH WINDOW 10 secs
INNER JOIN demo.users ON users.name = temporal.name;

The query combines three parts:

•  Streaming: SELECT * FROM demo.temporal WITH WINDOW 10 secs
•  Batch: SELECT * FROM demo.users
•  Join: INNER JOIN ON users.name = temporal.name
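One way to picture the join: once a window closes, its rows form a small batch that can be hash-joined against the stored table. This is an illustrative sketch in plain Python, not how Crossdata executes the query; the sample rows and the `inner_join` helper are made up.

```python
from collections import defaultdict


def inner_join(window_rows, table_rows, key):
    """Hash join: index the stored table by key, then probe it with each window row."""
    index = defaultdict(list)
    for row in table_rows:
        index[row[key]].append(row)
    return [
        (incoming, stored)
        for incoming in window_rows
        for stored in index.get(incoming[key], [])
    ]


# Stored table (batch side), standing in for demo.users.
users = [
    {"name": "ana", "email": "ana@example.com"},
    {"name": "bob", "email": "bob@example.com"},
]

# Rows ingested during one 10-second window, standing in for demo.temporal.
temporal_window = [
    {"name": "ana", "clicks": 3},
    {"name": "eve", "clicks": 1},
]

matches = inner_join(temporal_window, users, key="name")
```

Only names present on both sides survive, as expected from an inner join.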

Full text search

Full text search with Lucene

o  Clients request the ability to perform full text searches
o  We have developed an integration between Lucene and Cassandra
o  C* users can now enjoy all Lucene features:
   •  Full text searches, range queries, fuzzy queries…

https://github.com/Stratio/stratio-cassandra

Stratio Lucene 2i

[Diagram: four C* nodes, each with its own Lucene index]

Full text search queries

o  With Crossdata, we simplify:
   •  The creation syntax
   •  The query syntax using the match operator

CREATE FULLTEXT INDEX ON app.users(name, email);

SELECT * FROM app.users WHERE email MATCH '*@stratio.com';

Spark and Stratio Crossdata

Why Spark?

o  Stratio Crossdata uses Spark to perform non-natively supported operations
o  Spark brings several benefits over Hadoop
   •  In-memory processing
   •  RDD abstraction
   •  Simpler API
   •  Increased flexibility (e.g., no need for identity mapping)

What about Spark SQL?

o  Different approach to query execution
   •  We only use Spark when it speeds up queries
      -  Native drivers are faster for simple queries
      -  Spark SQL has limited RDD sources
   •  Avoid some Spark limitations
   •  Several batch and streaming contexts in a single JVM (SPARK-2243)

Query approach

[Diagram: SparkSQL approach: SparkSQL on top of Spark on top of Cassandra. Crossdata approach: Stratio Crossdata on top of Spark and the native driver, both on top of Cassandra]

Our Cassandra-Spark integration

o  Project started in June 2013
   -  With the objective of providing a method to interact with Cassandra from Spark
   -  Initial approach based on the HadoopInputFormat interface
   -  Current version uses the native DataStax Java driver

https://github.com/Stratio/stratio-deep

Our Cassandra-Spark integration

o  Benchmark in progress comparing our solution with the DataStax Spark driver
   •  Results highly influenced by the split size
   •  Initial results are promising for Stratio Spark Integration using DataStax default values
   •  Group by – up to 40% faster
   •  Join – up to 17% faster
   •  Stay tuned for the benchmark publication!

Spark vs Lucene 2i

[Chart comparing Spark and Lucene 2i; axis label: Records returned]

Stratio Crossdata ODBC

o  Well-known interface standard (for BI tools, external apps, …)
o  We have implemented it using Simba SDK
o  ODBC opens the full potential of Stratio Crossdata to the external world
o  Currently tested with Tableau, Qlikview and MS Excel

One ODBC for all datastores!

The future

o  Security
o  Query optimizer and smart query planner
o  Leverage system statistics
o  Support for UDFs
o  Become an Apache project

https://github.com/Stratio/stratio-meta

We are looking for an Apache Champion

Can you help us?

A wish list for Cassandra

o  Ability to stop running queries
o  Interactive users are unpredictable
o  Some exception paths are not clear or defined (e.g., secondary indexes)
o  Distribute some of the operations currently performed on the coordinator
   •  E.g., aggregations like count(*)
