crossdata: an efficient distributed datahub with batch and streaming query capabilities
Post on 29-Jun-2015
Stratio Meta: An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero (dhiguero@stratio.com)
Alvaro Agea (alvaro@stratio.com)
#CassandraSummit 2014
Stratio Crossdata: An efficient distributed datahub with batch and streaming query capabilities
STRATIO: Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
Cassandra: We love…
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
Agenda
• Introduction
• Crossdata architecture
• Metadata management
• Streaming sources
• Full text search
• Spark and Crossdata
• ODBC
• The future
Introduction
o Big Data analysis is commonly associated with batch processing
  • Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o Users buy Big Data platforms, but
  • How do I start?
  • What is my entry point to the platform?
What do our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
✓ Integration with BI tools
✓ Join operations
✓ Support for streaming sources
✓ Integration with other data stores
✓ Ability to query data without thinking about the schema (non-indexed data)
Crossdata
o A new technology that:
  • Is not limited by the underlying datastore capabilities
  • Leverages Spark to perform non-natively supported operations
  • Supports batch and streaming queries
  • Supports multiple clusters and technologies
Our architecture
Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector declares its own capabilities
Query execution
Our planner will choose the best connector for each query
[Figure: query pipeline with the stages Parsing, Validation, Planning and Execution; the diagram also shows C*, Connector1, Connector2 and Connector3]
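
One way to picture a planner that picks a connector from declared capabilities is the short Python sketch below. It is an illustration only, not Crossdata's IConnector API or its real planner: the `Connector` class, the capability names, the `cost` field and the `choose_connector` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical connector description; not the real IConnector interface."""
    name: str
    capabilities: frozenset  # operations the connector claims to support
    cost: int                # assumed relative cost; lower is preferred


def choose_connector(connectors, required):
    """Return the cheapest connector that supports every required operation."""
    required = frozenset(required)
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Assumed setup: a native driver for simple queries, Spark for everything else.
CONNECTORS = [
    Connector("native", frozenset({"select", "filter_pk"}), cost=1),
    Connector("spark", frozenset({"select", "filter_pk", "join", "group_by"}), cost=10),
]
```

With this setup a plain select is routed to the cheaper native connector, while a query that needs a join falls through to the Spark connector.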
Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • A table is saved in a unique datastore
Logical and physical mapping
[Figure: an App catalog with a Users table, a Test table and an old_users table, alongside C* production, C* development and other datastores]
SELECT * FROM app.users;
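
The mapping from a logical catalog to physical clusters can be sketched in a few lines of Python. This is a hypothetical illustration, not Crossdata code: the `Catalog` class and its methods are invented here, and the pairing of each table with a cluster is an assumption, since the slide does not state which table lives where.

```python
class Catalog:
    """Maps logical table names to the single cluster that stores each one."""

    def __init__(self):
        self._location = {}

    def attach(self, table, cluster):
        # A table is saved in exactly one datastore, so attaching it twice is an error.
        if table in self._location:
            raise ValueError(f"{table} is already stored in {self._location[table]}")
        self._location[table] = cluster

    def resolve(self, table):
        """Return the cluster that must serve queries on this table."""
        return self._location[table]


catalog = Catalog()
# Assumed pairing of tables and clusters, for illustration only.
catalog.attach("app.users", "C* production")
catalog.attach("app.test", "C* development")
catalog.attach("app.old_users", "other datastore")
```

A query such as `SELECT * FROM app.users;` names only the logical table; in this sketch `catalog.resolve("app.users")` is the step that decides which cluster receives it.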
Metadata Management
Metadata in the era of schemaless NoSQL datastores
o Some datastores are schemaless, but our applications are not!
  • Flexible schemas vs. schemaless
  • Crossdata provides a Metadata Manager that stores schemas for any datasource
    - Remember ODBC and those BI tools
Metadata management
[Figure: C* production, a Connector, the Metadata Manager, the Metadata Store and Infinispan, with two numbered annotations]
• If the connector does not support metadata operations, those are skipped
• Updated metadata information is maintained among Crossdata servers using Infinispan
Streaming sources
Managing streaming sources
o Today's use cases expect some type of streaming datasource
  • Streaming data has an ephemeral nature
  • In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as if they were classical RDBMS tables
[Figure: a streaming source with columns col1:text, col2:int, col3:int, col4:text, described as {schema:{col1:…},…}, together with Streaming_query0 … Streaming_queryn]
Streaming queries
o Streaming queries are infinite by definition
  • A time window is defined to create a batch-like view of the rows ingested by the system in that period
  • The user launches queries specifying a processing time window
    - Crossdata provides methods to list and stop running streaming queries
Streaming queries: window syntax
SELECT fieldGroup, avg(field2)
FROM eph_table WITH WINDOW 5 minutes
WHERE field1 = 100 AND field2 > 100
GROUP BY fieldGroup;
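
To make the "batch-like view" concrete, the Python sketch below evaluates the same filter, grouping and average over fixed five-minute windows. It assumes non-overlapping (tumbling) windows aligned to multiples of the window length, which the slides do not specify; the function name and the event layout are hypothetical and do not reflect how Crossdata is implemented.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # mirrors "WITH WINDOW 5 minutes"


def windowed_average(events):
    """Group events into fixed windows, then filter, group and average.

    Each event is a (timestamp_seconds, row) pair, where row is a dict with
    the keys field1, field2 and fieldGroup used by the query.
    Returns {window_start: {fieldGroup: avg(field2)}}.
    """
    values = defaultdict(lambda: defaultdict(list))
    for ts, row in events:
        # WHERE field1 = 100 AND field2 > 100
        if row["field1"] == 100 and row["field2"] > 100:
            # Start of the window this row falls into (assumed tumbling windows).
            window_start = ts - ts % WINDOW_SECONDS
            values[window_start][row["fieldGroup"]].append(row["field2"])
    # GROUP BY fieldGroup with avg(field2), computed once per window.
    return {
        start: {group: sum(vals) / len(vals) for group, vals in groups.items()}
        for start, groups in values.items()
    }
```

Each window yields an independent, finite result set, which is what lets an otherwise infinite stream be queried with ordinary aggregate semantics.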
Joining batch and streaming
SELECT * FROM demo.temporal WITH WINDOW 10 secs
  INNER JOIN demo.users ON users.name = temporal.name;
[Figure: the query split into three parts]
• Streaming part: SELECT * FROM demo.temporal WITH WINDOW 10 secs
• Batch part: SELECT * FROM demo.users
• Join: INNER JOIN ON users.name = temporal.name
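
The decomposition into a streaming part, a batch part and a join can be sketched in Python as a hash join between the rows of one window and the batch table. This is a hypothetical illustration rather than Crossdata's execution strategy: the function name, the column prefixes and the sample columns other than `name` are assumptions.

```python
from collections import defaultdict


def join_window_with_table(window_rows, users):
    """Inner-join one window of streamed rows with a batch table on `name`."""
    # Index the batch side once so each streamed row is matched by lookup.
    users_by_name = defaultdict(list)
    for user in users:
        users_by_name[user["name"]].append(user)

    joined = []
    for event in window_rows:
        for user in users_by_name.get(event["name"], []):
            # Prefix columns with their table name to avoid clashes.
            row = {f"temporal.{k}": v for k, v in event.items()}
            row.update({f"users.{k}": v for k, v in user.items()})
            joined.append(row)
    return joined
```

The function would be called once per window, for example every 10 seconds, with the rows ingested during that window and the current contents of the users table.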
Full text search
Full text search with Lucene
o Clients request the ability to perform full-text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full-text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
Stratio Lucene 2i
[Figure: five C* nodes, each paired with a Lucene index]
Full text search queries
o With Crossdata, we simplify:
  • The creation syntax
  • The query syntax, using the MATCH operator
CREATE FULLTEXT INDEX ON app.users(name, email);
SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
Stratio Crossdata
Why Spark?
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
o In-memory processing
o RDD abstraction
o Simpler API
o Increased flexibility (e.g., no need for identity mapping)
What about Spark SQL?
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
Query approach
[Figure: two stacks compared. SparkSQL approach: Cassandra, Spark, SparkSQL. Crossdata approach: Cassandra, Spark, Native driver, Stratio Crossdata.]
Our Cassandra-Spark integration
o Project started in June 2013
  - With the objective of providing a method to interact with Cassandra from Spark
  - Initial approach based on the HadoopInputFormat interface
  - Current version uses the native DataStax Java driver
https://github.com/Stratio/stratio-deep
Our Cassandra-Spark integration
o Benchmark in progress comparing our solution with the DataStax Spark driver
  • Results are highly influenced by the split size
  • Initial results are promising for the Stratio Spark integration using DataStax default values
  • Group by: up to 40% faster
  • Join: up to 17% faster
  • Stay tuned for the benchmark publication!
Spark vs Lucene 2i
[Figure: chart of Time against Records returned, with one series for Spark and one for Lucene 2i]
ODBC
Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it using the Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
The future
The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
We are looking for an Apache Champion
Can you help us?
A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)