looking into the future: using google's prediction api
Post on 16-Apr-2017
Looking into the Future: Using Google’s Prediction API
Justin Grammens Recursive Awesome & IoT Weekly
What is Prediction?
• Defined by Wikipedia as: “A statement about an uncertain event.”
• Continues on to read… “It is often, but not always, based upon experience or knowledge.”
• In statistics, prediction is a part of Statistical Inference.
Statistical Inference
• Statistical inference is the process of deducing properties of an underlying distribution by analysis of data.
• Two major paradigms used for statistical inference
• Frequentist Inference
• Bayesian Inference
Frequentist Inference
• Data is a repeatable random sample with a specific probability
• Parameters and probabilities remain constant during the test
• Results are independent of results from prior tests
• Q: Will the sun rise tomorrow? What’s the probability of a sun dying, based on all the suns in the universe?
Bayesian Inference
• Take into account prior results and subjective beliefs
• Update probabilities of occurrence based on new data
• Tests are NOT run in isolation and affect one another
• Q: Will the sun rise tomorrow? Depends on how many times we have seen it rise in the past
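The Bayesian answer to the sunrise question can be made concrete with Laplace's rule of succession. This is an illustration chosen here, not something taken from the slides: it assumes a uniform prior over the unknown probability, so after seeing s successes in n trials the estimate is (s + 1) / (n + 2). The function names are made up for this sketch.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Bayesian estimate of the success probability under a uniform prior."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


def observed_frequency(successes: int, trials: int) -> Fraction:
    """Plain relative frequency, for comparison; undefined with no trials."""
    if trials == 0:
        raise ValueError("no trials observed yet")
    return Fraction(successes, trials)


# Each observed sunrise moves the Bayesian estimate closer to 1,
# but it never reaches certainty.
for days in (0, 1, 10, 10_000):
    print(days, float(rule_of_succession(days, days)))
```

With no observations the estimate starts at 1/2, and every new sunrise updates it, which is the "tests affect one another" behavior described above.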
Predictions by Machines
• Could therefore define prediction as an “informed guess or opinion.”
• Software systems have to be trained before they can be effective.
source: reading.pppst.com
What is Prediction API?
• Announced at Google I/O in 2011
• Provides pattern-matching and machine learning capabilities.
• Handles both numeric and text input
• Handles both classification and regression output
• Access from App Engine, client libs and command line
• Able to retrain the model on the fly - Bayesian?
What Are Some Usages?
What Do You Need?
• Google Account
• Google Platform Console project
• Google Prediction API Activated
• Google Cloud Storage API Activated
Steps Involved
• Define what you are trying to accomplish
• Find the training data and format to support your goal (hardest part)
• Upload training data to Google Cloud Storage
• Train the system against the data you provide
• Send queries to your model
• Upload additional data with new information gained.
Hosted Model
• The Prediction API hosts a gallery of user-submitted models
• Owners can charge for the use of the model
• Hosted models are versioned so they can be updated easily
• Models are submitted in PMML format
• XML-based language to define statistical & data models
• Appears to currently be a waitlist
How To Train
• 3 ways to create and train the correct type of model
• CSV File - Lives on Google Cloud Storage
• Training data embedded in request
• Limited to the size of an HTTP Request < 2MB
• Empty model created and trained with update calls
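The three training routes differ mainly in the request body sent when the model is created. The sketch below only builds those bodies as plain dictionaries and makes no network calls. The field names (`id`, `storageDataLocation`, `trainingInstances`, `csvInstance`, `output`, `modelType`) follow my recollection of the v1.6 `trainedmodels` resource and should be treated as assumptions to check against the API reference; the helper names are hypothetical.

```python
import json

MAX_EMBEDDED_BYTES = 2 * 1024 * 1024  # the "< 2MB" request limit from the slide


def body_from_storage(model_id, bucket_path):
    """Way 1: point the model at a CSV file already in Cloud Storage."""
    return {"id": model_id, "storageDataLocation": bucket_path}


def body_with_embedded_data(model_id, examples):
    """Way 2: embed (answer, features) pairs directly in the request."""
    body = {
        "id": model_id,
        "trainingInstances": [
            {"output": answer, "csvInstance": list(features)}
            for answer, features in examples
        ],
    }
    size = len(json.dumps(body).encode("utf-8"))
    if size >= MAX_EMBEDDED_BYTES:
        raise ValueError(f"request body is {size} bytes; use a CSV file instead")
    return body


def body_for_empty_model(model_id, model_type):
    """Way 3, step 1: create an empty model of a stated type."""
    return {"id": model_id, "modelType": model_type}


def body_for_update(answer, features):
    """Way 3, step 2: one update call per new training example."""
    return {"output": answer, "csvInstance": list(features)}


print(json.dumps(body_with_embedded_data(
    "spam-filter", [("spam", ["Lose weight now"]), ("ham", ["See you at lunch"])]
), indent=2))
```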
CSV File Rules
• Maximum file size 2.5 GB
• No header row. Yes, to the system it’s irrelevant
• One example per line
• The first column indicates to the system the type of model.
• Ideally remove punctuation (other than apostrophes) from your data.
CSV File Rules
• Text Strings
• Double quotes around all text strings
• Text matching is case-sensitive
• Numeric Values
• Integer and decimals are supported
• Numbers: "1", "23", "999"
• Strings: "6 12", "colt 45"
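A minimal sketch of writing a training file that follows these rules, using Python's standard `csv` module: no header row, one example per line, the answer in the first column, and double quotes around every text field. The helper name and the sample rows are made up for illustration.

```python
import csv
import io


def to_training_csv(rows):
    """Serialize rows as header-less CSV with all text fields double-quoted."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes every str field and leaves int/float bare.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Classification: the first column is a text label.
print(to_training_csv([("spam", "Lose weight now"), ("ham", "See you at lunch")]))

# Regression: the first column is a number.
print(to_training_csv([(62, "Seattle", 288, "sunny")]))
```

Because quoting is decided by the Python type of each value, a label that should be text has to be passed as a `str`, not as an `int`.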
Structuring Data
• Example Value
• “The Answer”
• Features
• No limit on the number of features
• The more features & examples, the better
• Training on 16 MB takes ~1 hour
What’s The Answer?
Regression Model
Example Data
• Define your data to support numbers and strings
• Query of “Seattle, 288, sunny”, might get back value of 62
• Don’t need to match any values in the dataset
• Fill model with all columns then query with first column missing
Classification Model
Example Data
• For a query of “Lose weight now!” you would get a result of “spam”
• Returns the category from the dataset
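The two model types can be queried the same way: send the feature columns without the first (answer) column, then read back either a category or a number. The sketch below builds the request body and reads a reply without calling the service. The field names `input`, `csvInstance`, `outputLabel` and `outputValue` are my recollection of the v1.6 predict call and are assumptions, and the two replies are hand-written for illustration rather than captured from the API.

```python
def predict_body(features):
    """Request body for a query: the features only, with no answer column."""
    return {"input": {"csvInstance": list(features)}}


def read_prediction(reply):
    """Return a label for a classification reply, a float for a regression reply."""
    if "outputLabel" in reply:
        return reply["outputLabel"]
    if "outputValue" in reply:
        return float(reply["outputValue"])
    raise KeyError("reply has neither outputLabel nor outputValue")


# Hand-written replies, for illustration only.
classification_reply = {"outputLabel": "spam"}
regression_reply = {"outputValue": "62"}

print(predict_body(["Lose weight now!"]), "->", read_prediction(classification_reply))
print(predict_body(["Seattle", 288, "sunny"]), "->", read_prediction(regression_reply))
```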
Authorization
• You must use OAuth 2.0 to authorize requests
• Can share your model with others
• View: User can call Analyze, Get, List and Predict on the project and/or any model owned by the project.
• Edit: User has all the permissions of Can view, but can also Delete, Insert, and Update any models owned by the project.
• Is Owner: User has all the permissions of Can edit, but can also grant permissions to other users to access the project.
Tips & Tricks
• The more examples & features, the better the results
• However - Adding more features doesn’t always give better predictions
is_comedy is_drama is_action is_horror
Y N N N
VS
genre
Comedy
Tips & Tricks
• Need to add a numeric aspect to the genre?
• Add additional genre columns and weight it based on count
genre genre genre genre genre
Drama Drama Drama Comedy Comedy
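One way to produce weighted genre columns like the row above is to repeat each genre once per count. This is a minimal sketch; the helper name, the fixed `width`, and the `"none"` padding value are assumptions made here so that every row ends up with the same number of non-empty columns.

```python
def weighted_genre_columns(genre_counts, width=None, pad="none"):
    """Expand {"Drama": 3, "Comedy": 2} into one genre column per count."""
    # Most frequent genre first; ties are broken alphabetically.
    ordered = sorted(genre_counts.items(), key=lambda item: (-item[1], item[0]))
    columns = [genre for genre, count in ordered for _ in range(count)]
    if width is None:
        return columns
    if len(columns) > width:
        raise ValueError(f"{len(columns)} genre slots do not fit in {width} columns")
    # Pad so that every feature column still holds a value.
    return columns + [pad] * (width - len(columns))


print(weighted_genre_columns({"Drama": 3, "Comedy": 2}))
print(weighted_genre_columns({"Horror": 1}, width=5))
```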
Tips & Tricks
• Always put something into each feature
• Include all the features that you know about
• For Regression:
• Make sure you will have the time to ensure the values are correct
• Conversely, if you have exact numbers use them
• Try to have at least a few hundred examples for each category
Tips & Tricks
• Can only compare against known relationships
• Can’t feed an untrained title and user to get rating
• Solution is to break the title into genre, director, actors
Rating user_name movie_title
9.5 Justin Star Wars
2.2 Justin Disaster Movie
5.0 Justin Billy Madison
Let’s Talk Data!
• Nice Ride
• Based on the starting station, predict the ending station
• New York Cab Rides
• Given a starting GPS coordinate, predict where the cab ride will end
• Sentiment Analysis
• Based on a State of the Union speech, determine the sentiment
Based on the starting station, can we predict the ending station?
Nice Ride Location Rides
• https://www.niceridemn.org/data/
• Offers a live XML stream to update along the way
Nice Ride Location Rides
Started with this:
Next: Ended with this:
Nice Ride Insert Data
ID & Location
Nice Ride Running Prediction
Status
Lessons Learned
• I forgot to put the values in quotes, so the API treated the data as numerical regression.
• Verify how it’s interpreting your data with “get” call.
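A quick local check can catch this mistake before uploading. The sketch below flags lines whose first column is a bare, unquoted number, assuming (as the lesson above suggests) that such answers lead the API to build a regression model. It does not replace the “get” call, which remains the way to see how the model was actually interpreted. The function name and sample station IDs are made up.

```python
def unquoted_numeric_answers(csv_text):
    """Return 1-based line numbers whose first column is an unquoted number."""
    flagged = []
    for line_number, line in enumerate(csv_text.splitlines(), start=1):
        first_field = line.split(",", 1)[0].strip()
        if not first_field or first_field.startswith('"'):
            continue  # empty line or quoted answer: nothing to flag
        try:
            float(first_field)
        except ValueError:
            continue  # unquoted but not a number
        flagged.append(line_number)
    return flagged


sample = '30005,"30010"\n"30005","30010"\n'
print(unquoted_numeric_answers(sample))  # only the first line is flagged
```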
Type
Nice Ride Location Rides
Show Scripts, API & Results
Can we predict the movement of NYC cabs?
NYC Cab Ride Data
Data Dictionary
Data Website
Sample Data
Contains pickup & drop off latitude and longitude
There’s A Problem
• Asking for 2 inputs and 2 outputs!
• Not possible with Prediction API as it only supports one dependent variable. :(
• Change of plan…
Let’s predict the cost of a NYC cab ride instead!
Prediction Demo
• Features are distances (B)
• Examples are prices (A)
• Is this accurate?
• Different fares based on areas of the city
Ok, not really… Let's use location-based data instead
Prediction Demo
• Latitude / Longitude are the features (B, C, D, E)
• Price is the example (A)
• Examples
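The column layout above can be sketched as a small row builder: price first (column A), then the pickup and drop-off coordinates (columns B to E). The dictionary keys used here (`fare_amount`, `pickup_latitude`, and so on) and the sample ride are hypothetical; the real column names should be taken from the data dictionary.

```python
FEATURE_KEYS = (
    "pickup_latitude",    # column B
    "pickup_longitude",   # column C
    "dropoff_latitude",   # column D
    "dropoff_longitude",  # column E
)


def cab_training_row(ride, price_key="fare_amount"):
    """Return [price, B, C, D, E] as floats, or None if any value is missing."""
    keys = (price_key,) + FEATURE_KEYS
    if any(ride.get(key) in (None, "") for key in keys):
        return None  # skip incomplete rides rather than leave a feature empty
    return [float(ride[key]) for key in keys]


# Hand-made ride, for illustration only.
ride = {
    "fare_amount": "12.5",
    "pickup_latitude": "40.7580",
    "pickup_longitude": "-73.9855",
    "dropoff_latitude": "40.7484",
    "dropoff_longitude": "-73.9857",
}
print(cab_training_row(ride))
```

Keeping the price as the single first column matches the one-dependent-variable limit described earlier.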
NYC Cab Ride Location
Show Scripts, API & Results
Sentiment Analysis of a Speech
Speech Sentiment
• Always Check Your Data!
• Website incorrectly claimed positive(4), negative(0) and neutral(2) sentiment.
• Data had groups of sentiment values.
• Source
Speech Sentiment
Feature
Example Value
Training Examples
Sentiment Training
Sentiment Example
Show Scripts, API & Results
Obama State of the Union Speech - 1/16
Donald Trump Speech Des Moines, IA - 1/24
Smart Spreadsheets
Install Smart Autofill Add-on
Smart Spreadsheets
Prediction API used to fill in missing values
Smart Spreadsheets
Select columns to use for data training
Smart Spreadsheets
“Example Values” are populated
Final Thoughts - Overfitting
• Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study.
• Therefore, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
• Exact query should not return EXACT examples
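A common way to notice overfitting is to hold some examples back from training and query the model only with those. This is a general technique, not something shown in the slides; the sketch below uses only the standard library, and the function name and the 20% default are choices made here.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows and return (training_rows, held_out_rows)."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    holdout_size = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]


training, held_out = split_holdout(range(10))
print(len(training), "rows to upload for training")
print(len(held_out), "rows kept back for test queries")
```

If predictions are accurate on the training rows but poor on the held-out rows, the model is likely fitting idiosyncrasies of the training data.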
Thank You
Justin Grammens
justin@recursiveawesome.com http://recursiveawesome.com
Check out my IoT Weekly Newsletter http://iotweeklynews.com