Creating AnswerBot with Keras and TensorFlow (TensorBeat)
TRANSCRIPT
AnswerBot
Introduction
• Avkash Chauhan, H2O.ai
  o Head of enterprise products and customers
  o @avkashchauhan | https://www.linkedin.com/in/avkashchauhan
• Products
  o H2O
  o Sparkling Water
  o Deep Water
    • NN – Tensorflow, mxnet, Caffe
    • GPU
    • xgboost - Distributed
What is an AnswerBot?
• An AnswerBot is a standalone intelligent application
• AnswerBot uses machine learning to respond to user input
• Provides relevant knowledge base articles as answers
• Self-service customer base
• Raises awareness of knowledge base offerings
• Generates product feedback silently
AnswerBot – Client Interface
AnswerBot – Result Interface
[Figure: two result panels, each headed "Possible Answers" and listing candidate answers with match scores (60%, 42%) and "More.." links]
AnswerBot – Administrator Interface
[Figure: administrator view of a scored question, with the fields Question, Tags, Sentiment (Positive / Negative), Priority (Low / Medium / High / Critical), Sex (Male / Female), Ratings, Top(n) Answers (answer IDs with match percentages) and NSFW count]
AnswerBot in production - Teaser
[Architecture diagram with the labels: Community, Stackoverflow, Quora, SlackBot, Support Portal, AWS API Gateway, AWS Lambda (Question Scoring), S3, DynamoDB, AWS SQS, AWS SNS, "AML pipeline prototype to get top N matching answers", Scoring Pipeline, Model Preparation Process, Model Production]
Problems to solve
• Finding proper tags
• Finding & removing NSFW words
• Sentiment in the question (positive or negative)
• Priority to find the answer (low, medium, high, critical)
• Can we figure out if the questioner is male or female?
• Question rating (how was the question written?)
• Finding the best available answers
• Duplicate questions
Problems to solve – Solutions (Part 1)
1. Finding proper tags:
   1. Word embeddings
   2. Matching words
2. Finding & removing NSFW words
   1. Brute-force search
   2. NLTK stop words
3. Sentiment in the question (positive or negative)
   1. Binomial (2 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
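The brute-force NSFW search and stop-word removal listed above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the talk: the two word sets are tiny hypothetical placeholders (a real system would load a maintained NSFW list and could take stop words from NLTK), and `tokenize` and `clean_question` are names invented here.

```python
import re

# Hypothetical word lists, for illustration only. A real system would load a
# maintained NSFW list and could take stop words from NLTK.
NSFW_WORDS = {"darn", "heck"}
STOP_WORDS = {"the", "is", "a", "an", "to", "this", "my"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def clean_question(text):
    """Return (kept_tokens, nsfw_count) after a brute-force scan of every token."""
    kept, nsfw_count = [], 0
    for token in tokenize(text):
        if token in NSFW_WORDS:
            nsfw_count += 1          # counted for the admin view, then dropped
        elif token not in STOP_WORDS:
            kept.append(token)       # candidate keyword / tag
    return kept, nsfw_count


tokens, flagged = clean_question("This darn cluster is failing to start")
print(tokens, flagged)
```

The remaining tokens are the kind of input that tag matching or a sentiment classifier could then work on.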
Problems to solve – Solutions (Part 2)
1. Priority to find the answer (Low/Medium/High/Critical)
   1. Multinomial (4 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
2. Can we figure out if the questioner is male or female?
   1. Binomial (2 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
3. Question rating (how was the question written?)
   1. Multinomial (N classes, 1-5 stars) classification
      1. Tree-based algorithms or NN
Problems to solve – Solutions (Part 3)
1. Finding the best available answers
   1. Looking for the tags and keywords – Clustering / Reduction
   2. Creating tag & keyword weights for each question
   3. Matching tags, keywords and their weights to find top probabilities
2. Duplicate questions
   1. Quora has the same problem to solve on Kaggle
      1. https://www.kaggle.com/c/quora-question-pairs/data
      2. https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb
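One simple way to read "matching tags, keywords and their weights" is a weighted keyword overlap between the question and each knowledge-base article. The sketch below assumes that reading; the scoring rule (sum of the smaller weight for every shared keyword), the function names and the article ids are all hypothetical and not taken from the talk.

```python
from collections import Counter


def keyword_weights(tokens):
    """Weight each keyword by its relative frequency in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def match_score(question_w, answer_w):
    """Sum of the smaller weight for every keyword both sides share (0..1)."""
    shared = question_w.keys() & answer_w.keys()
    return sum(min(question_w[w], answer_w[w]) for w in shared)


def top_n_answers(question_tokens, answers, n=2):
    """Rank knowledge-base articles by match score, best first."""
    q_w = keyword_weights(question_tokens)
    scored = [(match_score(q_w, keyword_weights(toks)), art_id)
              for art_id, toks in answers.items()]
    scored.sort(reverse=True)
    return [(art_id, round(score, 2)) for score, art_id in scored[:n]]


# Hypothetical article ids and keyword lists.
kb = {
    728: ["cluster", "start", "memory"],
    718: ["gpu", "driver", "install"],
    800: ["cluster", "memory", "memory", "limit"],
}
print(top_n_answers(["cluster", "memory", "error"], kb))
```

With these made-up inputs, article 728 ranks first and article 800 second, because both share "cluster" and "memory" with the question.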
Data Preparation
• Real Data
  o Real Question/Answers
    • StackOverflow, Community, Quora, Support System
• Experimental Data
  o Yelp – 41M reviews in 1-5 stars category - Supervised
    • Ratings: 1-5
  o Twitter Sentiment – Search it OR Mine It - Supervised
    • Positive/Negative
    • Male/Female
Our Experimentation Today
• Classifying sentences to predict
  o Ratings: Stars (1-5)
    • Multinomial classification example
  o Sentiments: Positive or Negative
    • Binomial classification example
Demo
• Binomial & Multinomial Classification
$ python PredictNow.py
Why Keras?
• High-level API (Python) that runs on top of Tensorflow & Theano
• Great for quick experimentation
• Supports both CNN and RNN, and combinations of the two
• Runs on CPU & GPU
• Visit: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
Word2vec
• Word2vec is a neural-network-based word embedding method.
• A neural network with only 1 linear hidden layer
  o The hidden layer is used to transform inputs into something that the output layer can use.
  o Each hidden unit has a linear activation
• Represents words in a continuous, low-dimensional vector space (i.e., the embedding space)
  o Semantically similar words are mapped to nearby points.
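To make "semantically similar words are mapped to nearby points" concrete, the sketch below measures closeness with cosine similarity. The three 3-dimensional vectors are invented for illustration and are not real word2vec output; real embeddings typically have tens to hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
embeddings = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "slow":  np.array([0.0, 0.9, 0.4]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    others = [w for w in embeddings if w != word]
    return max(others,
               key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(nearest("good"))
```

In this toy space "good" lands next to "great" rather than "slow", which is the behaviour a trained embedding is expected to show for related words.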
Understanding Dataset
• Ratings Analysis
  o review,stars
  o The food is WAAAAY overpriced and totally not worth it, they charged for the salsa and the service was ridiculously slow....The guacamole was good though., 2
  o Decent food at a great price. Unfortunately, the place is so jam packed it's almost an inconvenience to head back to the buffet lines., 2
  o Love getting my haircut here! It's only $25 for a women's haircut. I'm pretty picky about how much my hair is layered and I've never had a problem here. Make sure to call in to schedule your appointment ahead of time during the school year because she's usually booked two days in advance., 5
• Sentiment Analysis
  o Text, Sentiment
  o I lost $80 today I know I shouldn't put things in my back pocket but I was about to put in my bag when I realized it was gone., 0
  o Just got back from Seattle. Lots of crowds. Nordstrom was nuts. But Taphouse Grill was practically empty. Found hardcover of Mad Love!, 1
  o Crunch week! This Friday, I'll be heading to Oddmall, my first major craft fair, in Hudson, Ohio! I'm tricking out the website., 1
  o Another beautiful day out today!! Going to build some models first then go for running!!, 1
  o Tired. Just tired. Home time!! I'm weaksauce, I know , 0
Components & Experimentation
• Keras
• Tensorflow
  o GPU
• NLTK
  o Using Stop Words
• Glove
  o Pre-trained word vector datasets
  o Small (400K words)
• Python
• Jupyter notebook
Experimentation – Part 1
1. Data Preparation
2. Creating word collection
   1. Removing stop words
   2. Collecting all words into a big list
3. Tokenization and uniform data collection
   1. Using full words collection
   2. Get unique words in our collection
   3. Tokenize at sentence level
   4. Final Dataset
      1. Sentences [sentences_per_record, length] - X
      2. Labels [label_per_record, length] – Y
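Steps 2 and 3 above can be sketched without any library: collect the words, assign each unique word an integer id, then encode every sentence as a fixed-length row of ids. The stop-word set is a small stand-in for NLTK's list, and the helper names are hypothetical. Padding and truncating on the left is a choice made here, matching the defaults of Keras's `pad_sequences`.

```python
STOP_WORDS = {"the", "is", "and", "was", "a"}  # stand-in for NLTK's list


def words_of(sentence):
    """Lowercase, replace punctuation with spaces, drop stop words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in sentence.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def build_word_index(sentences):
    """Map each unique word to an integer id; 0 is reserved for padding."""
    index = {}
    for sentence in sentences:
        for word in words_of(sentence):
            index.setdefault(word, len(index) + 1)
    return index


def to_padded_sequences(sentences, word_index, length):
    """Encode sentences as id lists, truncated or left-padded to `length`."""
    rows = []
    for sentence in sentences:
        ids = [word_index[w] for w in words_of(sentence) if w in word_index]
        ids = ids[-length:]                       # keep the last `length` ids
        rows.append([0] * (length - len(ids)) + ids)
    return rows


reviews = ["The food was good", "Service was slow and the food cold"]
word_index = build_word_index(reviews)
X = to_padded_sequences(reviews, word_index, length=4)   # sentences
Y = [5, 2]                                               # star labels
print(word_index)
print(X)
```

Every row of `X` now has the same length, which is what a neural network input layer needs.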
Experimentation– Part 2
4. Splitting dataset to training and validation
5. Creating Embedding Matrix
   o Loading predefined word vector
   o Finding matching words from our collection and creating embedding word matrix
6. Creating Embedding Layer/Configuration
7. Training
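Steps 4 and 5 might look roughly like the NumPy sketch below. It assumes the pre-trained vectors come as text lines of the form `word v1 v2 ...` (the layout GloVe files use) and substitutes three made-up 2-dimensional lines for the real file; the function names and the 20% validation fraction are choices made here, not details from the talk.

```python
import numpy as np


def split_train_validation(X, Y, validation_fraction=0.2, seed=0):
    """Shuffle the records once, then hold out the last fraction for validation."""
    order = np.random.default_rng(seed).permutation(len(X))
    n_val = max(1, int(len(X) * validation_fraction))
    train, val = order[:-n_val], order[-n_val:]
    return X[train], Y[train], X[val], Y[val]


def load_word_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = np.asarray(values, dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; unmatched words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix


# Made-up 2-dimensional vectors standing in for a real pre-trained file.
sample_lines = ["food 0.1 0.2", "good 0.3 0.4", "slow 0.5 0.6"]
word_index = {"food": 1, "good": 2, "cold": 3}   # "cold" has no vector
matrix = build_embedding_matrix(word_index, load_word_vectors(sample_lines), dim=2)

X = np.arange(20).reshape(10, 2)   # 10 dummy encoded sentences
Y = np.arange(10)                  # 10 dummy labels
X_train, Y_train, X_val, Y_val = split_train_validation(X, Y)
print(matrix.shape, X_train.shape, X_val.shape)
```

A matrix built this way is commonly handed to a Keras `Embedding` layer as its initial weights, often with training of that layer disabled so the pre-trained vectors stay fixed.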
Experimentation– Part 3
8. Understanding results
   o Layers connection
   o Model configuration
   o Model weights
9. Saving model configuration, weights, data-model
   o HDF5 is a data model, library, and file format for storing and managing data
Experimentation– Part 4
10. Model Metrics and Performance
   o Getting Model Metrics
   o Model Performance Graph
   o Model Accuracy
     • Training
     • Validation
11. Prediction
   o Validation Data
   o User Input
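For the prediction step, the raw network outputs still have to be turned into labels. The sketch below assumes a single sigmoid output read as P(positive) for the binomial model, and five softmax outputs where index 0 means 1 star for the multinomial model. Both layouts are assumptions, and the probability values are hypothetical rather than real model output.

```python
def predict_sentiment(probability, threshold=0.5):
    """Binomial case: one sigmoid output, read here as P(positive)."""
    return "Positive" if probability >= threshold else "Negative"


def predict_stars(probabilities):
    """Multinomial case: one softmax value per class; index 0 means 1 star."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best + 1


def accuracy(predicted, actual):
    """Fraction of predictions that equal the true label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical model outputs for two validation records.
star_outputs = [
    [0.05, 0.60, 0.20, 0.10, 0.05],
    [0.02, 0.03, 0.05, 0.30, 0.60],
]
predicted = [predict_stars(p) for p in star_outputs]
print(predict_sentiment(0.83), predicted, accuracy(predicted, [2, 4]))
```

The same `accuracy` helper can be applied to training and validation records separately to compare the two curves.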
What if you hit the exact same prediction
• Bad model: it could simply be a bad model. Retrain it.
• Rebalance your dataset:
  o Either upsample the less frequent class
  o Or downsample the more frequent one.
• Adjust class weights: with a higher class weight set for the less frequent class, the network pays more attention to that under-represented class during training.
• Increase the training time: after a long training time the network starts concentrating more on the less frequent classes.
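Two of the remedies above, class weights and upsampling, can be sketched with the standard library alone. The weighting formula `n_samples / (n_classes * class_count)` is one common "balanced" heuristic chosen here for illustration, not necessarily what the talk used, and the function names and sample labels are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def upsample(records, labels, seed=0):
    """Repeat randomly chosen minority records until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for record, label in zip(records, labels):
        by_label.setdefault(label, []).append(record)
    target = max(len(group) for group in by_label.values())
    out_records, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_records += group + extra
        out_labels += [label] * target
    return out_records, out_labels


labels = [1, 1, 1, 1, 1, 1, 0, 0]            # six positive, two negative
weights = balanced_class_weights(labels)
records, balanced = upsample(list("abcdefgh"), labels)
print(weights, Counter(balanced))
```

A dictionary like `weights` has the shape that the `class_weight` argument of Keras's `fit` expects, so the rare class contributes more to the loss; the upsampled lists should be shuffled before training.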
Advanced Processing
• Engine:
  o Doc2vec - https://radimrehurek.com/gensim/models/doc2vec.html
  o Seq2seq - https://github.com/farizrahman4u/seq2seq
  o Lda2vec - http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
  o RNN & LSTM - https://arxiv.org/pdf/1502.06922.pdf
• Training
  o CPU vs GPU
  o Checkpoints with training
AnswerBot production pipeline in cloud (AWS)
[Architecture diagram with the labels: Community, Stackoverflow, Quora, SlackBot, Support Portal, AWS API Gateway, AWS Lambda (Question Scoring), S3, DynamoDB, AWS SQS, AWS SNS, "AML pipeline prototype to get top N matching answers", Scoring Pipeline, Model Preparation Process, Model Production]
Content
• Github - https://github.com/Avkash/mldl/tree/master/tensorbeat-answerbot
• Dataset
  o Sentiment: Search it or Mine it
  o 5Star - https://www.yelp.com/dataset_challenge/dataset
• Python/Jupyter Notebook
  o Sentiment:
    • make-sentiment-model.py
    • PositiveNegative.ipynb
  o 5Star:
    • make-5star-model.py
    • 5StarReviews.ipynb
  o Prediction – PredictNow.py
Thank you so much