![Page 1: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/1.jpg)
Building Machine Learning Applications with Sparkling Water
NYC Big Data Science Meetup
Michal Malohlava and Alex Tellez and H2O.ai
![Page 2: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/2.jpg)
Who am I?
Background
PhD in CS from Charles University in Prague, 2012
1 year PostDoc at Purdue University experimenting with algos for large-scale computation
2 years at H2O.ai helping to develop H2O engine for big data computation and analysis
Experience with domain-specific languages, distributed systems, software engineering, and big data.
![Page 4: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/4.jpg)
Scalable Machine Learning
For Smarter Applications
![Page 5: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/5.jpg)
Smarter Applications
![Page 6: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/6.jpg)
Scalable Applications
Distributed
Easy to experiment
Able to process huge data from different sources
Powerful machine learning engine inside
![Page 7: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/7.jpg)
BUT how to build them?
![Page 8: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/8.jpg)
Build an application with …
?
![Page 9: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/9.jpg)
…with Spark and H2O
![Page 10: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/10.jpg)
Open-source distributed execution platform
User-friendly API for data transformation based on RDD
Platform components - SQL, MLlib, text mining
Multitenancy
Large and active community
![Page 11: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/11.jpg)
Open-source scalable machine learning platform
Tuned for efficient computation and memory use
Mature machine learning algorithms
R, Python, Java, Scala APIs
Interactive UI
![Page 12: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/12.jpg)
Supervised Learning
Statistical Analysis
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Cox Proportional Hazards Models
• Naïve Bayes
Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size
Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to independent components
Anomaly Detection
• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning
![Page 13: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/13.jpg)
![Page 14: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/14.jpg)
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform to build Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
![Page 15: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/15.jpg)
Sparkling Water Design
[Diagram: spark-submit, Spark Master JVM, three Spark Worker JVMs, and the Sparkling Water Cluster made of three Spark Executor JVMs, each with H2O inside]
Sparkling App: contains application and Sparkling Water classes
![Page 16: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/16.jpg)
Data Distribution
[Diagram: Sparkling Water Cluster of three Spark Executor JVMs, each with H2O inside; data from a data source (e.g. HDFS) appears as an H2O RDD and a Spark RDD]
RDDs and DataFrames share the same memory space
![Page 17: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/17.jpg)
Development Internals
Sparkling Water Assembly
H2O Core
H2O Algos
H2O Scala API
H2O Flow
Sparkling Water Core
Spark Platform
Spark Core
Spark SQL
Application Code +
Assembly is deployed to Spark cluster as regular Spark application
![Page 18: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/18.jpg)
Let's build an application!
![Page 19: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/19.jpg)
OR
Detect spam text messages
![Page 20: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/20.jpg)
Data example
case class SMS(target: String, fv: Vector)
![Page 21: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/21.jpg)
ML Workflow
1. Extract data
2. Transform, tokenize messages
3. Build TF-IDF
4. Create and evaluate Deep Learning model
5. Use the model
Goal: For a given text message identify if it is spam or not
![Page 22: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/22.jpg)
Application environment
![Page 23: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/23.jpg)
Lego #1: Data load
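
    // Data load
    // Assumes a SparkContext `sc` is in scope (e.g. in the Spark shell)
    import org.apache.spark.rdd.RDD

    def load(dataFile: String): RDD[Array[String]] = {
      // Split each line on tabs and drop rows with an empty first field
      sc.textFile(dataFile).map(l => l.split("\t"))
        .filter(r => !r(0).isEmpty)
    }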
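To see what `load` does to each line without a Spark cluster, the same split-and-filter logic can be sketched on a plain Scala collection. The `SmsLoadSketch` object and the sample lines are made up for illustration; the version on the slide runs on an RDD obtained from `sc.textFile`.

```scala
object SmsLoadSketch {
  // Same per-line logic as load(): split on tabs, drop rows whose first field is empty.
  def parse(lines: Seq[String]): Seq[Array[String]] =
    lines.map(l => l.split("\t")).filter(r => !r(0).isEmpty)

  // Hypothetical sample input: label, tab, message text.
  val sample: Seq[String] = Seq(
    "ham\tSee you at lunch",
    "\tline with an empty label",
    "spam\tWIN a free handset now"
  )
}
```

Calling `SmsLoadSketch.parse(SmsLoadSketch.sample)` keeps the two labelled rows and drops the one whose label field is empty.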
![Page 24: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/24.jpg)
Lego #2: Ad-hoc Tokenization

    def tokenize(data: RDD[String]): RDD[Seq[String]] = {
      val ignoredWords = Seq("the", "a", …)
      val ignoredChars = Seq(',', ':', …)
      val texts = data.map(r => {
        var smsText = r.toLowerCase
        // Replace every ignored character with a space
        for (c <- ignoredChars) {
          smsText = smsText.replace(c, ' ')
        }
        // Keep distinct words that are not ignored and are longer than 2 characters
        val words = smsText.split(" ")
          .filter(w => !ignoredWords.contains(w) && w.length > 2)
          .distinct
        words.toSeq
      })
      texts
    }
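The effect of this ad-hoc tokenizer on a single message can be sketched without Spark. The slide elides the full lists of ignored words and characters, so the two short lists below are stand-ins chosen only for illustration, and `TokenizeSketch` is a hypothetical name.

```scala
object TokenizeSketch {
  // Assumed stand-ins for the lists the slide elides with "…".
  val ignoredWords: Seq[String] = Seq("the", "a")
  val ignoredChars: Seq[Char] = Seq(',', ':')

  def tokenizeOne(text: String): Seq[String] = {
    // Lower-case the text, then replace every ignored character with a space.
    val cleaned = ignoredChars.foldLeft(text.toLowerCase)((s, c) => s.replace(c, ' '))
    cleaned.split(" ")
      .filter(w => !ignoredWords.contains(w) && w.length > 2) // drop stop words and short tokens
      .distinct                                              // each word counted once per message
      .toSeq
  }
}
```

For example, `TokenizeSketch.tokenizeOne("The offer: a FREE handset, free minutes")` yields `offer`, `free`, `handset`, `minutes`. Because of `.distinct`, repeated words within one message do not raise the term frequency later on.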
![Page 25: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/25.jpg)
Lego #3: TF-IDF

    import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def buildIDFModel(tokens: RDD[Seq[String]],
                      minDocFreq: Int = 4,
                      hashSpaceSize: Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
      // Hash strings into the given space
      val hashingTF = new HashingTF(hashSpaceSize)
      val tf = hashingTF.transform(tokens)
      // Build term frequency-inverse document frequency
      val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
      val expandedText = idfModel.transform(tf)
      (hashingTF, idfModel, expandedText)
    }
Hash words into large space
Term freq scale
“Thank for the order…” → […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]
(callouts on the vector: Thank, Order)
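The two steps on this slide, hashing words into a fixed-size space and scaling term frequencies by inverse document frequency, can be sketched in plain Scala. This is an illustration of the idea under stated assumptions, not MLlib's implementation: the bucket function uses `hashCode`, the weight uses the smoothed form log((m + 1) / (df + 1)), and all names in `TfIdfSketch` are hypothetical.

```scala
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures); hashCode may be negative, so fix the sign.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Term frequencies of one document as a sparse map: bucket -> count.
  def tf(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
    doc.groupBy(bucket(_, numFeatures)).map { case (b, ts) => b -> ts.size.toDouble }

  // Smoothed inverse document frequency, log((m + 1) / (df + 1));
  // buckets seen in fewer than minDocFreq documents get weight 0.
  def idf(tfs: Seq[Map[Int, Double]], minDocFreq: Int): Map[Int, Double] = {
    val m = tfs.size
    val df = tfs.flatMap(_.keys).groupBy(identity).map { case (b, bs) => b -> bs.size }
    df.map { case (b, d) =>
      b -> (if (d >= minDocFreq) math.log((m + 1.0) / (d + 1.0)) else 0.0)
    }
  }

  // Scale each term frequency by the weight of its bucket.
  def tfIdf(termFreq: Map[Int, Double], weights: Map[Int, Double]): Map[Int, Double] =
    termFreq.map { case (b, f) => b -> f * weights.getOrElse(b, 0.0) }
}
```

A word that occurs in every document gets weight log(1) = 0, so it contributes nothing to the vector, while rarer words keep a positive weight. Two different words can land in the same bucket; a larger hash space makes such collisions less likely.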
![Page 26: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/26.jpg)
Lego #4: Build a model

    def buildDLModel(train: Frame, valid: Frame,
                     epochs: Int = 10,
                     l1: Double = 0.001,
                     l2: Double = 0.0,
                     hidden: Array[Int] = Array[Int](200, 200))
                    (implicit h2oContext: H2OContext): DeepLearningModel = {
      import h2oContext._
      // Build a model
      val dlParams = new DeepLearningParameters()
      dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
      dlParams._train = train
      dlParams._valid = valid
      // The Symbol relies on the implicit conversions imported from h2oContext
      dlParams._response_column = 'target
      dlParams._epochs = epochs
      dlParams._l1 = l1
      dlParams._l2 = l2 // pass the l2 argument through as well
      dlParams._hidden = hidden
      // Create a job
      val dl = new DeepLearning(dlParams)
      val dlModel = dl.trainModel.get
      // Compute metrics on both datasets
      dlModel.score(train).delete()
      dlModel.score(valid).delete()
      dlModel
    }
Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
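What "an input layer followed by multiple layers of nonlinear transformations" means can be shown with a tiny forward pass in plain Scala. This is only a conceptual sketch and not how H2O implements Deep Learning: the `FeedForwardSketch` names are hypothetical, and tanh is picked as the nonlinearity purely for illustration.

```scala
object FeedForwardSketch {
  // One dense layer: weights(i) holds the incoming weights of output unit i.
  final case class Layer(weights: Vector[Vector[Double]], bias: Vector[Double])

  // Affine map followed by a nonlinearity (tanh, an assumption for this sketch).
  def forwardLayer(layer: Layer, input: Vector[Double]): Vector[Double] =
    layer.weights.zip(layer.bias).map { case (row, b) =>
      math.tanh(row.zip(input).map { case (w, x) => w * x }.sum + b)
    }

  // Feed the input through each layer in turn.
  def forward(layers: Seq[Layer], input: Vector[Double]): Vector[Double] =
    layers.foldLeft(input)((activation, layer) => forwardLayer(layer, activation))
}
```

In `buildDLModel`, `hidden = Array(200, 200)` would correspond to two such hidden layers of 200 units each, with the weights learned during training rather than supplied by hand.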
![Page 27: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/27.jpg)
Assembly application

    // Data load
    val data = load(DATAFILE)
    // Extract response spam or ham
    val hamSpam = data.map(r => r(0))
    val message = data.map(r => r(1))
    // Tokenize message content
    val tokens = tokenize(message)
    // Build IDF model
    val (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)
    // Merge response with extracted vectors
    val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
    val table: DataFrame = resultRDD
    // Split table
    val keys = Array[String]("train.hex", "valid.hex")
    val ratios = Array[Double](0.8)
    val frs = split(table, keys, ratios)
    val (train, valid) = (frs(0), frs(1))
    table.delete()
    // Build a model
    val dlModel = buildDLModel(train, valid)
Split dataset
Build model
Data munging
![Page 28: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/28.jpg)
Data exploration
![Page 29: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/29.jpg)
Model evaluation

    val trainMetrics = binomialMM(dlModel, train)
    val validMetrics = binomialMM(dlModel, valid)
Collect model metrics
![Page 30: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/30.jpg)
Spam predictor

    def isSpam(msg: String,
               dlModel: DeepLearningModel,
               hashingTF: HashingTF,
               idfModel: IDFModel,
               hamThreshold: Double = 0.5): Boolean = {
      val msgRdd = sc.parallelize(Seq(msg))
      val msgVector: SchemaRDD = idfModel.transform(
          hashingTF.transform(tokenize(msgRdd)))
        .map(v => SMS("?", v))
      val msgTable: DataFrame = msgVector
      msgTable.remove(0) // remove first column
      val prediction = dlModel.score(msgTable)
      // Spam when the value in prediction column 1 is below the threshold
      prediction.vecs()(1).at(0) < hamThreshold
    }
Prepared models
Default decision threshold
Scoring
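The decision rule at the end of `isSpam` can be isolated as a one-line function. The sketch assumes that the column read by `prediction.vecs()(1)` holds the predicted probability of the `ham` class, as the parameter name `hamThreshold` suggests; `SpamDecisionSketch` and `isSpamGiven` are hypothetical names.

```scala
object SpamDecisionSketch {
  // A message is flagged as spam when the predicted probability of "ham"
  // falls strictly below the threshold.
  def isSpamGiven(hamProbability: Double, hamThreshold: Double = 0.5): Boolean =
    hamProbability < hamThreshold
}
```

Raising `hamThreshold` flags more messages as spam and lowering it flags fewer, so the default of 0.5 is a starting point that can be tuned against the validation metrics.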
![Page 31: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/31.jpg)
Predict spam

    isSpam("Michal, beer tonight in MV?", dlModel, hashingTF, idfModel)
    isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?", dlModel, hashingTF, idfModel)
![Page 32: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/32.jpg)
Interactions with application from R
![Page 33: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/33.jpg)
Where is the code?
https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/
![Page 34: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/34.jpg)
Sparkling Water Download
http://h2o.ai/download/
http://h2o-release.s3.amazonaws.com/sparkling-water/master/91/index.html
![Page 35: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/35.jpg)
More info
Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
![Page 36: Building Machine Learning Applications with Sparkling Water](https://reader030.vdocument.in/reader030/viewer/2022032714/55aacf331a28abf37a8b4708/html5/thumbnails/36.jpg)
Thank you!
Sparkling Water is an open-source ML application platform combining the power of Spark and H2O
Learn more at h2o.ai
Follow us at @h2oai