HUG France, 2016-01-14 – Spark & Riak
Posted on 22-Jan-2018
SPARK & RIAK – INTRODUCTION TO THE SPARK-RIAK-CONNECTOR
LATERALTHOUGHTS
Me, Myself & I
Associate at LateralThoughts.com
Scala, Java, Python Developer
Data Engineer @ Axa & Carrefour
Apache Spark Trainer with Databricks
And the Other One …
Director Sales @ Basho Technologies
(Basho make Riak)
Ex-MySQL France
Co-Founder MariaDB
Funny Accent
Quick Introduction …
2011: Creators of Riak
Riak KV: NoSQL key-value database
Riak S2: large object storage
2015: New products
Basho Data Platform: integrated NoSQL databases, caching, in-memory analytics, and search
Riak TS: NoSQL time series database
120+ employees
Global Offices Seattle (HQ), Washington DC, London, Paris, Tokyo
300+ Enterprise customers, 1/3 of the Fortune 50
PRIORITIZED NEEDS
High Availability - Critical Data
High Scale – Heavy Reads & Writes
Geo Locality – Multiple Data Centers
Operational Simplicity – Resources Don’t Scale as Clusters Grow
Data Accuracy – Write Conflict Options
RIAK S2 USE CASES
Large Object Store, Content Distribution, Web & Cloud Services, Active Archives
RIAK KV USE CASES
User Data, Session Data, Profile Data, Real-time Data, Log Data
RIAK TS USE CASES
IoT/Devices, Financial/Economic, Scientific Observations, Log Data
The Evolution of NoSQL
Unstructured Data Platforms
Multi-Model Solutions
Point Solutions
Basho Data Platform …
ABOUT SPARK & RIAK
Spark & Riak
Disclaimer: the following presentation uses:
Spark v1.5.2
Spark-Riak-Connector v1.1.0
Pre-Requisites
To use the Spark-Riak-Connector, as of now, you need to build it yourself:
`git clone https://github.com/basho/spark-riak-connector`
`git checkout v1.1.0`
`mvn clean install`
Bootstrapped project
Reading from Riak
Connect to a Riak KV cluster from Spark
Query it:
Full scan
Using keys
Using secondary indexes (2i)
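A minimal Scala sketch of these three query styles with connector v1.1.0. The host/port, bucket name, keys, and index name are assumptions for illustration, not values from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.basho.riak.spark._ // adds the riak* methods to SparkContext

// Assumes a Riak KV node listening on 127.0.0.1:8087 (protobuf port)
val conf = new SparkConf()
  .setAppName("riak-read-example")
  .set("spark.riak.connection.host", "127.0.0.1:8087")
val sc = new SparkContext(conf)

// Full scan over the whole bucket
val all = sc.riakBucket[String]("test-bucket").queryAll()

// Lookup of specific keys
val someKeys = sc.riakBucket[String]("test-bucket")
  .queryBucketKeys("key-1", "key-2")

// Range query over a secondary index (2i); index name is hypothetical
val byIndex = sc.riakBucket[String]("test-bucket")
  .query2iRange("creationNo", 1L, 100L)
```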
Connecting to Riak
Loading data from Riak
riakBucket[V](bucketName: String): RiakRDD[V]
riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]
riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]
…
On your SparkContext, you can use:
Implicits that give you the riak* methods
Add a query, otherwise…
Find all:
Find by key(s):
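The converter overload shown above can build an RDD of (key, value) pairs. A hedged sketch, assuming the v1.1.0 signature listed earlier and a hypothetical bucket name; `getKeyAsString` and `toStringUtf8` come from the Riak Java client types:

```scala
import com.basho.riak.spark._
import com.basho.riak.client.core.query.{Location, RiakObject}

// Build an RDD[(String, String)] of (key, UTF-8 value) pairs,
// reusing the SparkContext `sc` from the connection step
val pairs = sc.riakBucket[String, String](
  "test-bucket",
  (loc: Location, obj: RiakObject) => (loc.getKeyAsString, obj.getValue.toStringUtf8)
).queryAll()
```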
Reading from Riak
Using case classes
Using Secondary Indexes
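For example, a sketch combining both: JSON values deserialized into a case class, filtered by a 2i range. The class, bucket, index name, and timestamps are all made up for illustration:

```scala
import com.basho.riak.spark._

// Hypothetical domain class; the connector maps JSON values onto it
case class UserEvent(user_id: String, timestamp: Long, event: String)

// Range query on a hypothetical "timestamp" secondary index,
// reusing the SparkContext `sc` from the connection step
val events = sc.riakBucket[UserEvent]("user-events")
  .query2iRange("timestamp", 1451606400L, 1454284800L)
```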
Basic I/O
Mapping Objects - Buckets
Adding fields during save
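Writing back is symmetrical, via `saveToRiak` from the connector's implicits. A minimal sketch with a hypothetical bucket name, assuming the pair form uses the first element as the object key:

```scala
import com.basho.riak.spark._

// Save an RDD of (key, value) pairs back to a Riak KV bucket,
// reusing the SparkContext `sc` from the connection step
val rdd = sc.parallelize(Seq(("key-1", "value-1"), ("key-2", "value-2")))
rdd.saveToRiak("test-bucket")
```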
Spark Riak Connector - Roadmap
Better integration with Riak TS
Enhanced DataFrames - based on Riak TS Schema APIs
Server-side aggregations and grouping - using TS SQL commands
Speed
Data Locality (partition RDDs according to replication in the cluster) - launch Spark executors on the same nodes where the data resides.
Better mapping from vnodes to Spark workers using coverage plan
Better support for Riak data types (CRDT) and Search queries
Today this requires using the Java Riak client APIs
Spark Streaming
Provide example and sample integration with Apache Kafka
Improve reliability using Riak for checkpoints and WAL
Add examples and documentation for Python support
Thank you
@ogirardot
o.girardot@lateral-thoughts.com
https://github.com/ogirardot/spark-riak-example
https://speakerdeck.com/ogirardot/spark-and-riak-introduction-to-the-spark-riak-connector
@mcarney23
michael.carney@basho.com
fr.basho.com