Stratio Crossdata: An efficient distributed datahub with batch and streaming query capabilities
DESCRIPTION
Big Data analysis is commonly associated with batch processing. Users aiming to combine batch and stream processing have to rely on tailor-made architectures. Users buy Big Data platforms, but: How do I start? What is my entry point to the platform? #CassandraSummit 2014, San Francisco
TRANSCRIPT
Stratio Meta
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero [email protected]
Alvaro Agea [email protected]
#CassandraSummit 2014
Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero [email protected]
Alvaro Agea [email protected]
STRATIO Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
Cassandra: We love…
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
Agenda
• Introduction
• Crossdata architecture
• Metadata management
• Streaming sources
• Full text search
• Spark and Crossdata
• ODBC
• The future
Introduction
o Big Data analysis is commonly associated with batch processing
  • Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o Users buy Big Data platforms, but
  • How do I start?
  • What is my entry point to the platform?
What do our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
! Easy deployment
! Easy administration
! Read/write performance
! Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
! Easy deployment
! Easy administration
! Read/write performance
! Easy-to-learn query language
! Integration with BI tools
! Join operations
! Support for streaming sources
! Integration with other data stores
! Ability to query data without thinking about the schema (non-indexed data)
Crossdata
o A new technology that:
  • Is not limited by the underlying datastore capabilities
  • Leverages Spark to perform non-natively supported operations
  • Supports batch and streaming queries
  • Supports multiple clusters and technologies
Our architecture
Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector defines its capabilities
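The slides name the IConnector extension interface but do not show it. As a minimal sketch in Python, assuming a connector simply declares the set of operations it supports, the idea could look like the following. The class names, method names and capability names are hypothetical and are not taken from Crossdata.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class Capability(Enum):
    # Hypothetical capability names, chosen only for illustration.
    SELECT = auto()
    FILTER = auto()
    JOIN = auto()
    STREAMING = auto()


class Connector(ABC):
    """Illustrative stand-in for an IConnector-style extension point."""

    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def capabilities(self) -> frozenset:
        ...

    def supports(self, required) -> bool:
        # A connector can serve a query only if it covers every required operation.
        return set(required) <= self.capabilities()


class NativeCassandraConnector(Connector):
    def name(self) -> str:
        return "cassandra-native"

    def capabilities(self) -> frozenset:
        return frozenset({Capability.SELECT, Capability.FILTER})


class SparkConnector(Connector):
    def name(self) -> str:
        return "spark"

    def capabilities(self) -> frozenset:
        return frozenset({Capability.SELECT, Capability.FILTER, Capability.JOIN})


native = NativeCassandraConnector()
spark = SparkConnector()
print(native.supports({Capability.SELECT}))  # True
print(native.supports({Capability.JOIN}))    # False
print(spark.supports({Capability.JOIN}))     # True
```

Adding support for a new datastore, engine or version then amounts to adding one more subclass that reports its own capability set.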
Query execution
Our planner will choose the best connector for each query
[Diagram: Parsing → Validation → Planning → Execution; C*; Connector1, Connector2, Connector3]
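The slides do not say how "best" is decided. One way to picture the planning step is sketched below, assuming each connector advertises the operations it supports plus a relative cost, and the planner picks the cheapest connector that covers the query. The `ConnectorInfo` type, the operation names and the cost values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorInfo:
    name: str
    operations: frozenset  # operations the connector says it supports
    cost: int              # assumed relative cost; lower is preferred


def plan(required_ops, connectors):
    """Pick the cheapest connector that supports every required operation."""
    candidates = [c for c in connectors if set(required_ops) <= c.operations]
    if not candidates:
        raise ValueError(f"no connector supports {sorted(required_ops)}")
    return min(candidates, key=lambda c: c.cost)


connectors = [
    ConnectorInfo("Connector1", frozenset({"select", "filter"}), cost=1),
    ConnectorInfo("Connector2", frozenset({"select", "filter", "join"}), cost=5),
    ConnectorInfo("Connector3", frozenset({"select", "filter", "window"}), cost=3),
]

print(plan({"select", "filter"}, connectors).name)  # Connector1
print(plan({"select", "join"}, connectors).name)    # Connector2
```

In this sketch a simple query goes to the cheapest connector, while a join is routed to the only connector that declares join support.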
Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • Each table is stored in a single datastore
Logical and physical mapping
[Diagram: App catalog with the Users, Test and old_users tables; clusters C* production, C* development and Other datastores]
SELECT * FROM app.users;
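The mapping from a logical catalog to physical clusters can be sketched as a lookup table. The table names follow the slide. The placement of each table on a particular cluster and the cluster identifiers are assumptions made for illustration.

```python
# Hypothetical catalog: which cluster holds each table is an assumption.
CATALOG = {
    "app": {
        "users": "cassandra_production",
        "test": "cassandra_development",
        "old_users": "other_datastore",
    }
}


def resolve(qualified_table, catalog=CATALOG):
    """Return the cluster that physically stores 'catalog.table'."""
    catalog_name, table = qualified_table.split(".", 1)
    try:
        return catalog[catalog_name][table]
    except KeyError:
        raise LookupError(f"unknown table: {qualified_table}") from None


# The query only names app.users; the lookup decides where it runs.
print(resolve("app.users"))  # cassandra_production
```

Because each table lives in exactly one datastore, the lookup returns a single cluster and the query text never has to mention it.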
Metadata Management
Metadata in the era of Schemaless NoSQL datastores
o Some datastores are schemaless but our applications are not!
  • Flexible schemas vs. schemaless
  • Crossdata provides a Metadata manager that stores schemas for any datasource
    - Remember ODBC and those BI tools
Metadata management
[Diagram: C* production, Connector, Infinispan, Metadata Store, Metadata Manager]
o Updated metadata information is maintained among Crossdata servers using Infinispan
o If the connector does not support metadata operations, those are skipped
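A minimal sketch of that behavior, assuming the manager always records the schema on the Crossdata side and forwards the operation to the connector only when the connector supports it. A plain dict stands in for the Infinispan-backed store, and every class, attribute and method name here is hypothetical.

```python
class MetadataManager:
    """Keeps table schemas; a plain dict stands in for the replicated store."""

    def __init__(self):
        self.store = {}

    def create_table(self, table, schema, connector):
        # Forward the operation only if the connector can handle it.
        forwarded = False
        if getattr(connector, "supports_metadata", False):
            connector.create_table(table, schema)
            forwarded = True
        # Assumption: the schema is recorded either way, so it stays known
        # even for schemaless datastores.
        self.store[table] = dict(schema)
        return forwarded


class SchemaAwareConnector:
    supports_metadata = True

    def __init__(self):
        self.tables = {}

    def create_table(self, table, schema):
        self.tables[table] = dict(schema)


class SchemalessConnector:
    supports_metadata = False


manager = MetadataManager()
aware = SchemaAwareConnector()
schema = {"name": "text", "email": "text"}

print(manager.create_table("app.users", schema, aware))                   # True
print(manager.create_table("app.events", schema, SchemalessConnector()))  # False
print(sorted(manager.store))  # ['app.events', 'app.users']
```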
Streaming sources
Managing streaming sources
o Nowadays, use cases expect some type of streaming datasource
  • Streaming data has an ephemeral nature
  • In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables
[Diagram: a streaming source with columns col1:text, col2:int, col3:int, col4:text, described as {schema:{col1:…},…}, feeding Streaming_query0 … Streaming_queryn]
Streaming queries
o Streaming queries are infinite by definition
  • A time window is defined to create a batch-like view of the rows ingested by the system in that period
  • The user launches queries specifying a processing time window
    - Crossdata provides methods to list and stop running streaming queries
Streaming queries: windows syntax
SELECT fieldGroup,avg(Field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1=100 AND field2>100 GROUP BY fieldGroup;
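To make the window semantics concrete, here is a small Python sketch of what the query above computes, assuming fixed, non-overlapping (tumbling) five-minute windows aligned to timestamp zero. The slides do not say whether Crossdata windows are tumbling or sliding, so that choice, the row layout and the function name are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # WITH WINDOW 5 minutes


def windowed_avg(rows, window_seconds=WINDOW_SECONDS):
    """rows: iterable of (timestamp_seconds, field_group, field1, field2)."""
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, group) -> [total, count]
    for ts, group, field1, field2 in rows:
        # WHERE field1 = 100 AND field2 > 100
        if field1 == 100 and field2 > 100:
            window_start = int(ts // window_seconds) * window_seconds
            acc = sums[(window_start, group)]
            acc[0] += field2
            acc[1] += 1
    # GROUP BY fieldGroup, computed separately for each window
    return {key: total / count for key, (total, count) in sums.items()}


rows = [
    (10, "a", 100, 150),
    (200, "a", 100, 250),
    (250, "b", 100, 120),
    (290, "a", 99, 500),   # dropped: field1 != 100
    (310, "a", 100, 300),  # falls into the next window
]
print(windowed_avg(rows))
# {(0, 'a'): 200.0, (0, 'b'): 120.0, (300, 'a'): 300.0}
```

Each window yields a finite, batch-like result even though the underlying stream never ends.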
Joining batch and streaming
SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;
[Diagram: the query split into a streaming part, a batch part and the join]
  SELECT * FROM demo.temporal WITH WINDOW 10 secs
  SELECT * FROM demo.users
  INNER JOIN ON users.name = temporal.name
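The decomposition can be sketched as follows: the rows collected during one window are joined against the batch table. This sketch uses a simple in-memory hash join, which is a stand-in chosen for illustration and not necessarily how Crossdata or Spark executes it. The sample rows and the function name are hypothetical.

```python
def join_window_with_batch(window_rows, batch_rows, key="name"):
    """Inner join: index the batch table by key, then probe with window rows."""
    index = {}
    for row in batch_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for event in window_rows:
        for user in index.get(event[key], []):
            # Prefix columns with their table name, as in users.name / temporal.name.
            merged = {f"users.{k}": v for k, v in user.items()}
            merged.update({f"temporal.{k}": v for k, v in event.items()})
            joined.append(merged)
    return joined


# Batch side: SELECT * FROM demo.users
users = [
    {"name": "ana", "city": "Madrid"},
    {"name": "bob", "city": "San Francisco"},
]
# Streaming side: rows seen during one 10-second window of demo.temporal
window = [
    {"name": "ana", "clicks": 3},
    {"name": "eve", "clicks": 1},  # no matching user, dropped by the inner join
]

print(join_window_with_batch(window, users))
```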
Full text search
Full text search with Lucene
o Clients request the ability to perform full text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
Stratio Lucene 2i
[Diagram: five C* nodes, each with a Lucene index]
Full text search queries
o With Crossdata, we simplify:
  • The creation syntax
  • The query syntax using the match operator
CREATE FULLTEXT INDEX ON app.users(name,email);
SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
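As a rough illustration of what the wildcard in the MATCH query selects, the sketch below filters in-memory rows with Python's standard `fnmatch` module. This only approximates the idea: a Lucene index matches against analyzed terms and has its own wildcard rules, so real results may differ. The sample rows and the `match` helper are hypothetical.

```python
from fnmatch import fnmatchcase

users = [
    {"name": "Daniel", "email": "daniel@stratio.com"},
    {"name": "Alvaro", "email": "alvaro@stratio.com"},
    {"name": "Guest", "email": "guest@example.org"},
]


def match(rows, column, pattern):
    """Keep the rows whose column value matches a '*' wildcard pattern."""
    return [row for row in rows if fnmatchcase(row[column], pattern)]


# Rough equivalent of: SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
for row in match(users, "email", "*@stratio.com"):
    print(row["name"], row["email"])
```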
Spark and Stratio Crossdata
Why Spark?
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
  • In-memory processing
  • RDD abstraction
  • Simpler API
  • Increased flexibility (e.g., no need for identity mapping)
What about Spark SQL?
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
Query approach
[Diagram comparing two stacks. SparkSQL approach: Cassandra, Spark, SparkSQL. Crossdata approach: Cassandra, Spark and Native driver, Stratio Crossdata]
Our Cassandra-Spark integration
o Project started in June 2013
  - With the objective of providing a method to interact with Cassandra from Spark
  - Initial approach based on the HadoopInputFormat interface
  - Current version uses the native DataStax Java driver
https://github.com/Stratio/stratio-deep
Our Cassandra-Spark integration
o Benchmark in progress comparing our solution with the DataStax Spark driver
  • Results highly influenced by the split size
  • Initial results are promising for Stratio Spark Integration using DataStax default values
  • Group by: up to 40% faster
  • Join: up to 17% faster
  • Stay tuned for the benchmark publication!
Spark vs Lucene 2i
[Chart: time versus records returned, for Spark and Lucene 2i]
ODBC
Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it using Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
The future
The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
We are looking for an Apache Champion
Can you help us?
A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)