Sparkling Water, ASK CRAIG
THE RED PILL (SPARK + ML)
Finally, ONE TO RULE THEM ALL!
1. Scrape & Collect Data
2. Cleanse Data + Feature Extraction / Engineering
3. Build Machine Learning Models + Iterate
4. Throw More Data to Improve Model
5. Deploy Model(s) in Real-Time
THE BLUE PILL (H2O.AI)
What is H2O? (water, duh!)
It is ALSO an open-source, distributed and parallel predictive engine for machine learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts
WHY NOT BOTH PILLS?!
Build smarter applications USING BOTH in harmony within the Spark Ecosystem !!!
Convert Spark RDDs → H2O RDDs for Machine Learning
LET’S BUILD AN APP!
Task: Predict the job category from a Craigslist Ad Title
ML WORKFLOW
1. Perform Feature Extraction on Words + Munging
2. Run Word2Vec algo (MLlib) on Job Title words
3. Create “title vectors” from individual word vectors for each job title
4. Pass the Spark RDD → H2O RDD for ML in Flow
5. Run H2O GBM algorithm on H2O RDD
6. Create Spark Streaming Application + Score on new data
1. TEXT MUNGING
Example: “Site Supervisor and Pre K Teachers Needed Now!!!”
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
val tokens = jobTitles.map(line => token(line))
Next: Apply Spark’s Word2Vec model to each word
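The slide does not show the body of `token`, so the following is only a minimal sketch of what such a function might look like, written against plain Scala collections rather than an RDD. The stop-word list and the rule that drops single-letter tokens are assumptions, chosen so that the example title above yields the tokens shown on the slide.

```scala
object TokenizeSketch {
  // Hypothetical stop-word list: the slides do not show the real one.
  val stopWords: Set[String] =
    Set("a", "an", "and", "for", "in", "now", "of", "the", "to")

  // Lower-case the title, split on anything that is not a letter,
  // then drop stop words and tokens shorter than two characters.
  def token(line: String): Seq[String] =
    line.toLowerCase
      .split("[^a-z]+")
      .filter(w => w.length > 1 && !stopWords.contains(w))
      .toList

  def main(args: Array[String]): Unit =
    // prints List(site, supervisor, pre, teachers, needed)
    println(token("Site Supervisor and Pre K Teachers Needed Now!!!"))
}
```

With a Spark RDD of titles, the same function could be applied exactly as on the slide: `jobTitles.map(line => token(line))`.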
2. WORD2VEC
Simply: A mathematical way to represent a single word as a vector of numbers. These vector ‘representations’ encode information about a given word (i.e. its meaning).
Post Tokenization: Seq(site, supervisor, pre, teachers, needed)
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
BUT THAT’S ON WORDS!
Post Word2Vec Results:
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
WE NEED TITLE VECTORS BASED ON ALL THE WORDS!
HOW?
Averaging word vectors to make ‘Title Vectors’
v(King) - v(Man) + v(Woman) ~ v(Queen)
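The analogy above says that word vectors can be added and subtracted meaningfully, which is what makes averaging them plausible. A toy sketch of that arithmetic follows. The two-dimensional vectors are made up for illustration (think of the axes as "royalty" and "gender") and are not output from a real Word2Vec model.

```scala
object AnalogySketch {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Cosine similarity: dot product divided by the product of the norms.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norms
  }

  // Toy vectors, invented for this example: (royalty, gender).
  val toy: Map[String, Vec] = Map(
    "king"  -> Array(1.0, 1.0),
    "man"   -> Array(0.0, 1.0),
    "woman" -> Array(0.0, -1.0),
    "queen" -> Array(1.0, -1.0)
  )

  // Word in the toy vocabulary whose vector is most similar to the query.
  def nearest(query: Vec): String =
    toy.maxBy { case (_, v) => cosine(query, v) }._1

  def main(args: Array[String]): Unit = {
    val query = add(sub(toy("king"), toy("man")), toy("woman"))
    println(nearest(query)) // prints queen
  }
}
```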
3. TITLE VECTORS
In Steps:
1. Sum the word2vec vectors in a given title
2. Divide this sum by # of words in a given title
Result: ~ Average vector for a given title of N words
needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
+ site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
+ supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]
Divide by Total Words (post tokenization)
~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
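The two steps above can be sketched in plain Scala as follows. This is an illustration, not the presenter's code: a `Map` stands in for the trained Word2Vec model, the numbers are toy values, and the sketch divides by the number of words that actually have a vector, which matches the slide whenever every token is in the vocabulary.

```scala
object TitleVectorSketch {
  type Vec = Array[Double]

  // Average the vectors of the words in a title.
  // Words without a vector are skipped; returns None if none are known.
  def titleVector(words: Seq[String], wordVectors: Map[String, Vec]): Option[Vec] = {
    val known = words.flatMap(wordVectors.get)
    if (known.isEmpty) None
    else {
      // Step 1: element-wise sum of the word vectors.
      val sum = known.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      // Step 2: divide by the number of words that contributed.
      Some(sum.map(_ / known.size))
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy 2-d vectors standing in for real Word2Vec output.
    val wordVectors = Map(
      "site"       -> Array(1.0, 2.0),
      "supervisor" -> Array(3.0, 4.0),
      "needed"     -> Array(5.0, 6.0)
    )
    val title = Seq("site", "supervisor", "needed")
    // prints 3.0, 4.0
    println(titleVector(title, wordVectors).map(_.mkString(", ")).getOrElse("no known words"))
  }
}
```

Returning an `Option` makes the caller decide what to do with titles that contain no known words, instead of silently producing a zero vector.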
4. PASS SPARK RDD TO H2O
OPEN H2O FLOW!
5. BUILD A MODEL!
80% ACCURACY - DEFAULT!
Algo: Gradient Boosting Machine
# Trees: 50
# Bins: 20
Depth: 5
(ALL DEFAULT VALUES)
~ 20% Error Rate
6. SPARK STREAMING + DEPLOYMENT
Create Spark Streaming App to read in new Job Titles
a) Create a Spark Streaming Producer - Reads data from a file & generates events in real time, for which we will predict the category.
APP ARCHITECTURE
[Diagram] Training: Craigslist jobs → Word2Vec → Train a model → Word2Vec Model + GBM Model; Re-train the model
[Diagram] Scoring: Posting job title (“HIRING Painting CONTRACTORS NOW!!!”) → Stream → Categorize a job title → Prediction = “Labor”
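The scoring path in this architecture (tokenize the incoming title, look up word vectors, average them, ask the model for a category) can be sketched as one function. Everything passed in here is a hypothetical stand-in: `tokenize` for the munging step, `wordVectors` for the trained Word2Vec model, and `predict` for the trained GBM model. The toy rule in `main` is not a GBM, and no Spark Streaming or H2O API is used.

```scala
object ScoringSketch {
  type Vec = Array[Double]

  // Score one incoming job title with stand-in models.
  // Returns None when no word of the title has a known vector.
  def classify(
      title: String,
      tokenize: String => Seq[String],
      wordVectors: Map[String, Vec],
      predict: Vec => String
  ): Option[String] = {
    val vectors = tokenize(title).flatMap(wordVectors.get)
    if (vectors.isEmpty) None
    else {
      // Title vector = average of the known word vectors.
      val sum = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      Some(predict(sum.map(_ / vectors.size)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed tokenizer: lower-case and split on non-letters.
    val tokenize: String => Seq[String] =
      _.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toList
    // Toy vectors and a toy decision rule, invented for this example.
    val wordVectors = Map("painting" -> Array(1.0, 0.0), "contractors" -> Array(0.8, 0.2))
    val predict: Vec => String = v => if (v(0) > 0.5) "Labor" else "Other"
    // prints Some(Labor)
    println(classify("HIRING Painting CONTRACTORS NOW!!!", tokenize, wordVectors, predict))
  }
}
```

In the streaming application, a function of this shape would be applied to each event as it arrives.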
“ASK CRAIG” LIVE DEMO!
END-TO-END
In JUST 25 minutes…we:
1. Performed sophisticated feature extraction + engineering
2. Passed a Spark RDD → H2O RDD for ML
3. Created a Spark Stream to read in new data
4. Built a GBM to classify titles w/ 80% accuracy
5. “Productionalized” H2O + Spark MLlib model to score on new data
So happy I took both pills!
TRY SPARKLING WATER!!
Download @ h2o.ai
Coming Soon: Release 1.4 for Spark 1.4!
NEW GUI! H2O FLOW
Meetup: Silicon Valley Big Data Science