
Page 1: Implementing use cases on a unified data engine (biconsulting.hu/letoltes/2016budapestdata/balassi_marton...)

© Cloudera, Inc. All rights reserved.

Marton Balassi | Solutions Architect | @MartonBalassi | [email protected]

BigPetStore on Spark: Implementing use cases on a unified data engine

Page 2:


Outline

• BigPetStore model

• Data generator with the RDD API

• ETL with RDDs & Spark SQL

• Matrix factorization with MLlib

• Recommendation with the Spark Streaming API

• Summary

Page 3:


BigPetStore

• Blueprints for Big Data applications

• Consists of:

  • Data generators

  • Examples using tools in the Big Data ecosystem to process data

  • A build system and tests for integrating tools and multiple JVM languages

• Part of the Apache BigTop project

Page 4:


BigPetStore model

• Customers visit pet stores and generate transactions; the model is location-based

Nowling, R. J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
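To make the model concrete, the generated records can be pictured as plain case classes. This is a hypothetical sketch with assumed field names; the real generator classes ship with Apache BigTop and carry more detail:

```scala
// Hypothetical sketch of a BigPetStore transaction record.
// The actual generator classes in Apache BigTop are richer.
case class Store(id: Long, zipcode: String)
case class Customer(id: Long, name: String, zipcode: String)
case class Product(category: String, attributes: Map[String, String])
case class Transaction(
  customer: Customer,      // who bought
  store: Store,            // where, picked based on the customer's location
  products: Seq[Product],  // what was bought
  dateTime: Double         // when, relative to the simulation start
)
```

Because each customer carries a location, store choice and purchase patterns can be derived from it, which is what makes the generated data realistic enough for downstream experiments.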

Page 5:


Data generation

• Use RJ Nowling’s Java generator classes

• Write transactions to JSON

val sc = new SparkContext(conf)
val (stores, productCategories, customers) = getData()
val startTime = getCurrentMillis()

// Broadcast the static lookup data to every executor
val (storesBC, productsBC) = (sc.broadcast(stores), sc.broadcast(productCategories))

// Generate transactions per partition, then shift their timestamps to "now"
val transactions = sc.parallelize(customers)
  .mapPartitionsWithIndex(generateTransactions)
  .map { t => t.setDateTime(t.getDateTime + startTime); t }

transactions.saveAsTextFile(output)

Page 6:


ETL with the RDD API

• Read the dirty JSON

• Output (customer, product) pairs for the recommender

val sc = new SparkContext(conf)
val transactions = sc.textFile(json).map(new Transaction(_))

// Assign a unique numeric index to every distinct product
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId

// Emit distinct (customerId, productIndex) pairs for the recommender
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .map(pair => (pair._2, pair._1.toLong))
  .join(productsWithIndex)
  .map(res => (res._2._1, res._2._2))
  .distinct

customerAndProductPairs.map(mkstring).saveAsTextFile(output)

Page 7:


ETL with Spark SQL

• Read the dirty JSON

• SQL queries: select the stores with the most transactions

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val transactions = sqlContext.read.json(input)
transactions.registerTempTable("transactions")

// Count transactions per store
val countByStore = sqlContext.sql(
  "SELECT store.id id, store.name name, COUNT(store.id) count " +
  "FROM transactions GROUP BY store.id, store.name")
countByStore.registerTempTable("countByStore")

// Keep the store(s) whose count equals the maximum
val bestStores = sqlContext.sql(
  "SELECT ct.id, ct.name, max.count " +
  "FROM countByStore ct, (SELECT MAX(count) as count FROM countByStore) max " +
  "WHERE ct.count = max.count")

Page 8:


ETL with Spark SQL (2)

• Read the dirty JSON

• SQL queries: list transaction count by month

// Convert the dateTime field (fractional days) to epoch milliseconds
def month = (dateTime: Double) => {
  val millis = (dateTime * 24 * 3600 * 1000).toLong
  new Date(millis).getMonth
}

sqlContext.udf.register("month", month)

val byMonthCount = sqlContext.sql(
  "SELECT month(dateTime), COUNT(*) " +
  "FROM transactions " +
  "GROUP BY month(dateTime)")

Page 9:


A little recommender theory

[Diagram: the user-item matrix R (U users × I items) is approximated by the user factors P and the item factors Q; user and item side information sit alongside the factors]

• R is potentially huge; approximate it with P ∗ Q

• Prediction is TopK(user’s row ∗ Q)
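The prediction step above can be sketched in a few lines of plain Scala. This is illustrative only: the name topK and the dense factor representation are assumptions, and in practice MLlib's MatrixFactorizationModel does the equivalent work on the real factors:

```scala
// Score every item by the dot product of the user's factor row with
// that item's factor column, then keep the k best-scoring items.
def topK(k: Int, userFactors: Array[Double],
         itemFactors: Map[Int, Array[Double]]): Seq[(Int, Double)] = {
  itemFactors.toSeq
    .map { case (itemId, q) =>
      val score = userFactors.zip(q).map { case (p, qi) => p * qi }.sum
      (itemId, score)
    }
    .sortBy(-_._2)  // highest score first
    .take(k)
}
```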

Page 10:


Matrix factorization with MLlib

• Read the (customer, product) pairs

• Write P and Q to file

val sc = new SparkContext(conf)

// Parse the (customer, product) pairs into implicit ratings
val transactions = sc.textFile(input)
  .map(s => Rating(s))

// Factor the rating matrix with implicit-feedback ALS
val model = ALS.trainImplicit(transactions, numFactors, iterations, lambda, alpha)

model.save(sc, out)

Page 11:


Recommendation with DStreams

• Get the ratings for the row of the user

• (Could be optimized)

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))

// Load the factor model produced by the ALS job
val model = MatrixFactorizationModel.load(sc, input)

// Each line on the socket is a user; look up the items to score
val query = ssc.socketTextStream("localhost", 9999)
  .map(user => getItems(user))

query.foreachRDD { userVector =>
  val ratings = model.predict(userVector)
  print(topk(ratings))
}

ssc.start(); ssc.awaitTermination()

Page 12:


Summary

• Go beyond WordCount with BigPetStore

• Feel free to mix the RDD, DataFrame, SQL, MLlib and streaming APIs (and all the others) in your Spark workflows

• Data generation, cleaning, ETL, machine learning and streaming prediction on top of one engine, in under 500 lines of code

• A Spark pet project is always fun. No pun intended.

Page 13:


Big thanks to

• The BigPetStore folks:

Suneel Marthi

Ronald J. Nowling

Jay Vyas

• Clouderans supporting the pet project:

Sean Owen

Alexander Bartfeld

Alexander Bibighaus

Page 14:


Thank you! | @MartonBalassi | [email protected]