![Page 1: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/1.jpg)
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
1) Makoto YUI @myui, Research Engineer, Treasure Data
2) Takeshi Yamamuro @maropu, Research Engineer, NTT
2016/10/26 Hadoop Summit '16, Tokyo
#hivemall
![Page 2: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/2.jpg)
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
![Page 3: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/3.jpg)
Hivemall entered the Apache Incubator on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall
![Page 4: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/4.jpg)
Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT> (Hivemall on Apache Spark)
• Daniel Dai <Hortonworks> (Hivemall on Apache Pig, Apache Pig PMC member)
• Tsuyoshi Ozawa <NTT> (Apache Hadoop PMC member)
• Kai Sasaki <Treasure Data>
![Page 5: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/5.jpg)
Project mentors (Champion and Nominated Mentors)
• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop/Incubator PMC member
![Page 6: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/6.jpg)
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
github.com/myui/hivemall
hivemall.incubator.apache.org
![Page 7: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/7.jpg)
Hivemall’s Vision: ML on SQL
Classification with Mahout: 100+ lines of code

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
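The query above trains in the map phase and averages the resulting weights per feature in the reduce phase. The following is a minimal single-process Python sketch of that pattern, assuming plain SGD logistic regression on each partition. The function names, learning rate, and toy data are hypothetical and are not Hivemall's implementation.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Hypothetical stand-in for the map-side trainer.

    rows: list of (features, label) where features maps feature -> value
    and label is 0 or 1. Returns a list of (feature, weight) pairs.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            p = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # SGD step on the logistic loss
    return list(w.items())


def average_models(partial_models):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    total, count = defaultdict(float), defaultdict(int)
    for partial in partial_models:
        for feature, weight in partial:
            total[feature] += weight
            count[feature] += 1
    return {f: total[f] / count[f] for f in total}


# Two toy partitions (made-up data); each is trained independently.
partitions = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0, "c": 1.0}, 1), ({"b": 1.0}, 0)],
]
model = average_models(train_partition(p) for p in partitions)
```

Each partition produces its own weight for every feature it saw, and the final weight of a feature is the mean over the partitions that emitted it.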
![Page 8: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/8.jpg)
Hivemall’s Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez DAG processing, Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
![Page 9: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/9.jpg)
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines work well when features are sparse and categorical.
![Page 10: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/10.jpg)
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

Hivemall's each_top_k function is useful for recommending top-k items
![Page 11: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/11.jpg)
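Top-k query processing
List top-2 students for each class

Input:

| student | class | score |
|---------|-------|-------|
| 1       | b     | 70    |
| 2       | a     | 80    |
| 3       | a     | 90    |
| 4       | b     | 50    |
| 5       | a     | 70    |
| 6       | b     | 60    |

Output:

| student | class | score |
|---------|-------|-------|
| 3       | a     | 90    |
| 2       | a     | 80    |
| 1       | b     | 70    |
| 6       | b     | 60    |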
![Page 12: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/12.jpg)
Top-k query processing
List top-2 students for each class

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
![Page 13: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/13.jpg)
Top-k query processing
List top-2 students for each class

SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
![Page 14: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/14.jpg)
Top-k query processing by RANK OVER()
Node 1: partition by class → sort by class, score → rank over() → filter rank <= 2
![Page 15: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/15.jpg)
Top-k query processing by EACH_TOP_K
Node 1: distribute by class → sort by class → each_top_k → OUTPUT only K items
![Page 16: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/16.jpg)
Comparison between RANK and EACH_TOP_K
• RANK: sort by class, score → rank over() → filter rank <= 2
  SORTING IS HEAVY, NEED TO PROCESS ALL
• EACH_TOP_K: distribute by class → sort by class → each_top_k
  OUTPUT only K items; a Bounded Priority Queue is utilized
each_top_k is very efficient when the number of classes is large
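To illustrate why a bounded priority queue avoids a full sort, here is a minimal Python sketch using the student table from the earlier slides. It is not Hivemall's code: the helper below is a hypothetical Python function that only borrows the name each_top_k, and it keeps one size-k min-heap per group so that it also works on unsorted input, whereas the query shown earlier sorts rows by class before calling the UDTF.

```python
import heapq
from collections import defaultdict


def each_top_k(k, rows):
    """Return (rank, score, group, payload) for the top-k scores per group.

    rows: iterable of (group, score, payload). Each group holds at most k
    items in a min-heap, so memory per group is bounded by k.
    """
    heaps = defaultdict(list)
    for seq, (group, score, payload) in enumerate(rows):
        heap = heaps[group]
        item = (score, seq, payload)  # seq breaks ties between equal scores
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the current minimum
    result = []
    for group in sorted(heaps):
        ranked = sorted(heaps[group], key=lambda t: -t[0])
        for rank, (score, _, payload) in enumerate(ranked, start=1):
            result.append((rank, score, group, payload))
    return result


# (student, class, score) rows from the slide
students = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
            (4, "b", 50), (5, "a", 70), (6, "b", 60)]
top2 = each_top_k(2, ((cls, score, student) for student, cls, score in students))
# top2 == [(1, 90, 'a', 3), (2, 80, 'a', 2), (1, 70, 'b', 1), (2, 60, 'b', 6)]
```

Each incoming row costs at most O(log k) work, and rows that cannot enter the top k are discarded immediately instead of being sorted and ranked.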
![Page 17: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/17.jpg)
Performance reported by TD customer
• 1,000 students in each class
• 20 million classes
The RANK over() query does not finish in 24 hours
EACH_TOP_K finishes in 2 hours
For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
![Page 18: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/18.jpg)
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
![Page 19: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/19.jpg)
Industry use cases of Hivemall
CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more

Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
http://www.slideshare.net/eventdotsjp/hivemall
![Page 20: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/20.jpg)
Industry use cases of Hivemall
Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo

Problem: Recommendation based on hot items is hard in a hand-crafted product market because each creator sells only a few one-off items (which soon go out of stock)
minne.com
![Page 21: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/21.jpg)
Industry use cases of Hivemall
Value prediction of Real estates
• Algorithm: Regression
• Livesense
![Page 22: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/22.jpg)
Industry use cases of Hivemall
User score calculation
• Algorithm: Regression
• Klout
bit.ly/klout-hivemall
Influencer marketing
klout.com
![Page 23: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/23.jpg)
![Page 24: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/24.jpg)
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
![Page 25: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/25.jpg)
![Page 26: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/26.jpg)
Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
![Page 27: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/27.jpg)
Evaluation Metrics
![Page 28: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/28.jpg)
Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Map Ages into 3 bins
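As a rough illustration of quantile-based binning, the Python sketch below derives two boundaries from a list of ages and maps each age to one of three bins. The ages, the helper names, and the nearest-rank quantile rule are assumptions made for the example; Hivemall's own functions may compute quantiles differently.

```python
import bisect


def quantile_edges(values, num_bins):
    """Inner bin boundaries taken at the i/num_bins quantiles.

    Uses a simple nearest-rank rule, which is an assumption of this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // num_bins)] for i in range(1, num_bins)]


def to_bin(value, edges):
    """0-based bin id; a value equal to a boundary falls in the upper bin."""
    return bisect.bisect_right(edges, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]  # made-up sample
edges = quantile_edges(ages, 3)              # [31, 52]
bins = [to_bin(a, edges) for a in ages]      # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because the boundaries come from the data's own quantiles, each bin receives roughly the same number of rows even when the raw values are skewed.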
![Page 29: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/29.jpg)
Feature Selection – Signal Noise Ratio
![Page 30: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/30.jpg)
Feature Selection – Chi-Square
![Page 31: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/31.jpg)
Feature Transformation – Onehot encoding
Maps each value of a categorical variable to a unique number starting from 1
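A small Python sketch of that mapping follows. The helper names, the sorted ordering of categories, and the `index:1` sparse output form are assumptions made for illustration, not Hivemall's API.

```python
def build_index(values):
    """Assign each distinct category a unique number starting from 1.

    Categories are numbered in sorted order, an arbitrary choice for this sketch.
    """
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def encode(value, index):
    """Sparse one-hot feature written as 'index:1'."""
    return f"{index[value]}:1"


colors = ["red", "green", "blue", "green"]   # made-up categorical column
index = build_index(colors)                  # {'blue': 1, 'green': 2, 'red': 3}
encoded = [encode(c, index) for c in colors] # ['3:1', '2:1', '1:1', '2:1']
```

Only the index of the active category is stored, which keeps the representation sparse when a column has many distinct values.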
![Page 32: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/32.jpg)
Other new features to come
• Spark 2.0 support
• XGBoost Integration
• Field-aware Factorization Machines
• Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
![Page 33: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/33.jpg)
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016, Tokyo, Japan
![Page 34: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/34.jpg)
Who am I?
![Page 35: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/35.jpg)
What’s Spark?
• 1. Unified Engine
  • supports end-to-end applications, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy-to-use, rich optimization
• 3. Integrate broadly
  • storages, libraries, ...
![Page 36: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/36.jpg)
What’s Hivemall on Spark?
• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark
![Page 37: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/37.jpg)
Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• There are high barriers to adding newer algorithms to MLlib
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
![Page 38: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/38.jpg)
Current Status
• Most Hivemall functions are supported in Spark v1.6 and v2.0

- For Spark v2.0
$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
![Page 39: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/39.jpg)
Current Status
- For Spark v1.6, build with -Pspark-1.6 instead of -Pspark-2.0
![Page 40: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/40.jpg)
Running an Example
• 1. Download a Spark binary
• 2. Fetch training and test data
• 3. Load these data in Spark
• 4. Build a model
• 5. Do predictions
![Page 41: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/41.jpg)
Running an Example
1. Download a Spark binary
http://spark.apache.org/downloads.html
![Page 42: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/42.jpg)
Running an Example
2. Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
![Page 43: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/43.jpg)
Running an Example
3. Load data in Spark

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar

// Create DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm")
         .load("E2006.train.bz2")

scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
![Page 44: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/44.jpg)
Running an Example
3. Load data in Spark

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf: Partition1, Partition2, Partition3, …, PartitionN
Loaded in parallel because bzip2 is splittable
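For readers unfamiliar with the libsvm text format loaded above, each line holds a label followed by sparse `index:value` pairs. The Python sketch below parses one such line. The sample line is hypothetical (the line shown on the slide is truncated), and the function name is made up for this example.

```python
def parse_libsvm_line(line):
    """Return (label, {index: value}) for one 'label idx:val idx:val ...' line."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        idx, val = token.split(":", 1)
        features[int(idx)] = float(val)
    return label, features


# Hypothetical sample line, not taken from the E2006 files
sample = "-3.5 6066:0.00079 6069:0.00031 6070:0.00030"
label, features = parse_libsvm_line(sample)
```

Because every line is self-contained, the lines of a file can be parsed independently of one another.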
![Page 45: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/45.jpg)
Running an Example
4. Build a model - DataFrame

scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
![Page 46: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/46.jpg)
Running an Example
4. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  |   FROM (
  |     SELECT train_logregr(features, label)
  |       AS (feature, weight)
  |     FROM TrainTable
  |   )
  |   GROUP BY feature
  """.stripMargin)
![Page 47: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/47.jpg)
Running an Example
5. Do predictions - DataFrame

scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
![Page 48: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/48.jpg)
Running an Example
5. Do predictions - SQL

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
  |   FROM TestTable t
  |   LEFT OUTER JOIN ModelTable m
  |     ON t.feature = m.feature
  |   GROUP BY rowid
  """.stripMargin)
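The prediction step joins each exploded test feature with its model weight, sums weight × value per row, and applies the sigmoid. A minimal in-memory Python sketch of that join-and-aggregate logic follows; the function name and toy data are hypothetical, and it illustrates the query's logic rather than how Spark executes it.

```python
import math
from collections import defaultdict


def predict(test_rows, model):
    """Score rows from exploded (rowid, feature, value) triples.

    model maps feature -> weight. A feature missing from the model adds
    nothing to the sum, mirroring a LEFT OUTER JOIN followed by SUM.
    Caveat: a row with no known feature scores 0.5 here, whereas the SQL
    aggregate would produce NULL for it.
    """
    totals = defaultdict(float)
    for rowid, feature, value in test_rows:
        weight = model.get(feature)
        totals[rowid] += 0.0 if weight is None else weight * value
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in totals.items()}


model = {1: 2.0, 2: -1.0}                      # made-up weights
test_rows = [("r1", 1, 1.0), ("r1", 2, 2.0),   # 2*1 + (-1)*2 = 0
             ("r2", 1, 1.0), ("r2", 3, 5.0)]   # feature 3 is not in the model
scores = predict(test_rows, model)
```

Row r1 sums to 0 and therefore scores 0.5, while r2 sums to 2 because its unseen feature is ignored.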
![Page 49: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/49.jpg)
Improve Some Functions in Spark
• Spark incurs overhead when calling Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
![Page 50: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/50.jpg)
Improve Some Functions in Spark
• ex.1) Compute a sigmoid function

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
![Page 51: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/51.jpg)
![Page 52: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/52.jpg)
Improve Some Functions in Spark
• ex.2) Compute top-k for each key group

scala> :paste
df.withColumn(
    "rank",
    rank().over(Window.partitionBy($"key").orderBy($"score".desc)))
  .where($"rank" <= topK)
![Page 53: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/53.jpg)
Improve Some Functions in Spark
• ex.2) Compute top-k for each key group
• Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

scala> df.each_top_k(topK, "key", "score", "value")
![Page 54: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/54.jpg)
Improve Some Functions in Spark
• ex.2) Compute top-k for each key group
each_top_k is ~4 times faster than rank!!
![Page 55: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/55.jpg)
3rd-Party Library Integration
• XGBoost support is under development
  • fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation
![Page 56: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/56.jpg)
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
We welcome your contributions to Apache Hivemall
![Page 57: Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig](https://reader036.vdocument.in/reader036/viewer/2022062522/586fde2d1a28ab18428b6b03/html5/thumbnails/57.jpg)
Any feature requests or questions?