Daniel Shank, Data Scientist, Talla at MLconf SF 2017
![Page 1: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/1.jpg)
Getting Value Out of Chat Data
What to Do When Your Data Is Noisy, Sparse, and Short
![Page 3: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/3.jpg)
Talla
NLP for internal business use cases
Smart knowledge management
Hiring!
![Page 4: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/4.jpg)
What is “chat data”?
USER2: USER3 do you have new new cal on your Talla account already? Looks like it’s not available for me yet. Would be nice if we could also get inbox support enabled since it’s so much better than gmail. cc USER1
USER3: USER2 I realized that after I typed this that I was using my personal gmail when I updated to the new changes. I looked on Talla and I didn’t see the same option to update to new calendar yet.
USER4: USER2 I just enabled Inbox for our domain
USER4: new calendar is set to letting google decide when to roll it out, but it looks like we can also enable it as an option now
USER4: I've now set that to be available as well. These may take some time to show up
USER1: USER2 its been enabled for awhile.
USER1: (inbox)
USER1: and the new calendar is enabled, soon as google decides you are allowed to have it.
USER2: Thanks USER1 USER4
![Page 5: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/5.jpg)
Things similar to chat data
Sequential interactions
Forum posts
Some email
IT ticketing system interactions
Short text
Associated with a user
Possibly directed at another user
Highly context dependent
![Page 6: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/6.jpg)
Problems with chat
Increasing number of data sources
In theory contains lots of valuable information
In practice data is unlabeled
“Water, water, everywhere, but not a drop to drink.”
![Page 7: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/7.jpg)
Goal: Issue detection and matching
People get help through chat platforms
Extract that data and automate the process
USER1’s interaction should help USER3!
USER1: Hi, does anyone know if we have patriot’s day off?
USER2: Yeah USER1, we do.
USER1: Thanks!
…
USER3: Hey, do we get patriot’s day off?
![Page 8: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/8.jpg)
Automating knowledge delivery
Find issues or questions that people have
Match new issues to pre-existing ones
Serve the appropriate response or answer
Extracting answers is very hard
Focus on matching and search
![Page 9: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/9.jpg)
Overview
Jumpstart ML: Active Learning
Topic modeling
Dimensionality Reduction and Representations
![Page 10: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/10.jpg)
Find questions and analyze
Use patterns to find questions
Has ‘?’ token
Has a question word
Not too hard
Good start for finding past issues
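The two patterns on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the `QUESTION_WORDS` list, the `looks_like_question` name, and the choice to check only the first token for a question word are all assumptions.

```python
import re

# Hypothetical, deliberately small list of question openers (an assumption).
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "can", "could", "do", "does", "did", "is", "are", "will", "would", "should",
}

def looks_like_question(message):
    """Return True if the message has a '?' token or opens with a question word."""
    tokens = re.findall(r"[a-z']+|\?", message.lower())
    if "?" in tokens:
        return True
    # Only the first token is checked, to limit false positives such as "we do".
    return bool(tokens) and tokens[0] in QUESTION_WORDS

messages = [
    "Hi, does anyone know if we have patriot's day off?",
    "Yeah USER1, we do.",
    "how do i record a vidyo meeting",
]
questions = [m for m in messages if looks_like_question(m)]
```

Here `questions` keeps the first and third messages. Matching a question word anywhere in the message would catch more questions but would also flag replies such as "Yeah USER1, we do."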
![Page 11: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/11.jpg)
Problems with extracted questions
Most questions need context to understand. e.g.:
“What is it?”
“Can I use her personal email?”
Intent varies:
Want information
Do this thing for me
Huh?
![Page 12: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/12.jpg)
Only some questions make sense out of context
“Who is she?” “What is that?” “Will that fix my computer?”
Anaphora—it, that
Pronouns—He, she, etc
“What day is it?”, “Where am I?”
Answer depends on time, person asking
Requires more involved data model
![Page 13: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/13.jpg)
Questions have different intents
“Performative” – Please help me? ex:
hi can you please help me reset my 2 factor authentication on salesforce?
“Informational” – What is it?
what's the pl code?
“Navigational” – How do I do this?
how do i record a vidyo meeting?
![Page 14: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/14.jpg)
Can we write special case rules?
Borderline cases
“is there a way to find out the size of an hbase table?” – User asks “Is there (a way…)” to get directions
“can anyone tell me where i find the out of stock request report?” – User asks someone to give them information
Many variants
Alternative is to label data and use supervised learning
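To make the difficulty concrete, here is a rough sketch of what hand-written intent rules might look like. The rules and the `rule_based_intent` function are hypothetical and written only for illustration. The label names come from the earlier slide on question intents.

```python
def rule_based_intent(question):
    """Guess an intent label from hand-written surface patterns (hypothetical rules)."""
    q = question.lower().strip()
    if "help me" in q or q.startswith(("can you", "could you", "please")):
        return "performative"
    if q.startswith(("how do", "how can", "how to")):
        return "navigational"
    # Everything else falls through to the default label.
    return "informational"

examples = [
    "hi can you please help me reset my 2 factor authentication on salesforce?",
    "what's the pl code?",
    "how do i record a vidyo meeting?",
    "is there a way to find out the size of an hbase table?",
]
labels = [rule_based_intent(q) for q in examples]
```

The first three examples get the labels shown on the intents slide, but the hbase question falls through to "informational" even though the user is asking for directions. Each such variant would need another rule, which is the argument for labeling data instead.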
![Page 15: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/15.jpg)
We want to label data, but…
Managing crowdworkers:
Expensive
Time consuming
Can’t be used unless data is safely anonymous
Will the model work afterwards?
![Page 16: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/16.jpg)
Active Learning makes labeling more efficient
More value for your time
Can use with crowd workers or without
Good for chat:
Models train fast
Quick to annotate
Supervised learning with little labeled data
[Diagram: a cycle linking Get data, Annotate, and Train/Predict]
![Page 17: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/17.jpg)
How it works (roughly)
Annotate an initial subset $D_0 \subset D$
Train your model on $D_0$
Predict labels on the remaining data ($D \setminus D_0$)
Choose more data, $D_1 \subset D \setminus D_0$
The choice of $D_1$ is based on the label predictions
Repeat
???
Profit!
[Diagram: a cycle linking Get data, Annotate, and Train/Predict]
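The loop above can be sketched in plain Python. The slide only says that the next batch is chosen from the label predictions. Uncertainty sampling (picking the items whose predicted probability is closest to 0.5) is one common way to do that and is an assumption here, as are the tiny logistic regression, the synthetic 2-D pool, and the `oracle` function that stands in for a human annotator.

```python
import math
import random

def predict_proba(w, x):
    """Probability of the positive class; w[-1] is the bias term."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    """Fit a small logistic regression with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def active_learning(pool, oracle, seed_size=4, batch_size=2, rounds=3, seed=0):
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    # Annotate the initial subset D0.
    labeled = {i: oracle(pool[i]) for i in unlabeled[:seed_size]}
    unlabeled = unlabeled[seed_size:]
    for _ in range(rounds):
        idx = list(labeled)
        # Train on everything annotated so far.
        w = train([pool[i] for i in idx], [labeled[i] for i in idx])
        # Predict on the rest and rank by uncertainty (closest to 0.5 first).
        unlabeled.sort(key=lambda i: abs(predict_proba(w, pool[i]) - 0.5))
        chosen, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Annotate the next batch and repeat.
        for i in chosen:
            labeled[i] = oracle(pool[i])
    return labeled, unlabeled

# Synthetic pool and a stand-in annotator (both hypothetical).
rng = random.Random(1)
pool = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(30)]
oracle = lambda x: 1 if x[0] + x[1] > 0 else 0
labeled, unlabeled = active_learning(pool, oracle)
```

With these defaults the annotator labels 10 of the 30 items: the 4 seed items plus 2 per round over 3 rounds. In practice the oracle call is where a person reads and labels a message.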
![Page 18: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/18.jpg)
Where we are
Jumpstart ML: Active Learning
Topic modeling
Dimensionality Reduction and Representations
![Page 19: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/19.jpg)
More to data than questions or intent
What do people talk about?
What kind of issues are common?
Are there clear lines defining topics?
Finding problem areas
Strategic thinking about what to tackle
![Page 20: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/20.jpg)
Know Your Data
Read some of it (if you can)
Learn the context
Cluster and overview
![Page 21: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/21.jpg)
Clustering or modeling chat topics
LDA, LSA, NMF, others
Human supervision necessary for interpretation (boo!)
Messages are short, so chat is hard to model
Larger documents have broader topic distributions
We expect messages to be about fewer topics
![Page 22: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/22.jpg)
Using LDA with Chat
| α = 0.5 | α = 0.1 | α = 0.05 | α = 0.03 |
|---|---|---|---|
| know; does; link | database; jermaine; running | file; area; bank | free; jermaine; database |
| did; try; work | online; palace; sorry | mean; try; screen | user; hi; email |
| send; test; agent | try; user; free | did; ok; want | client; server; user |
| look; able; mean | user; client; error | error; server; user | ok; did; update |
| online; help; screen | mean; app; does | whats; agent; end | mean; user; file |
| hi; palace; property | shall; working; process | client; property; user | online; user; change |
| email; error; just | emails; kelly; time | online; user; update | mandy; wrong; chance |
| user; issue; want | did; ok; property | palace; live; test | owner; end; invoice |
| client; need; check | ticket; whats; right | run; right; check | want; error; agent |
| owner; report; password | check; chloe; duncan | emails; know; link | live; palace; try |
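The effect of the α values compared on this slide can be seen without running LDA at all. The sketch below is not LDA itself: it only samples document-topic mixtures from a symmetric Dirichlet prior, which is the role α plays in LDA, and measures how much weight the single largest topic receives. The function names, the choice of 10 topics, and the sample count are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
        total = sum(draws)
        if total > 0:  # guard against underflow for very small alpha
            return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, samples=2000, seed=0):
    """Average weight of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(samples))
    return sum(largest) / samples

# Same alpha values as the columns on the slide.
shares = {alpha: mean_largest_share(alpha) for alpha in (0.5, 0.1, 0.05, 0.03)}
```

As α shrinks, the dominant topic takes a larger share of each sampled mixture, which matches the earlier expectation that a short message is about fewer topics.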
![Page 23: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/23.jpg)
Where we are
Jumpstart ML: Active Learning
Topic modeling
Dimensionality Reduction and Representations
![Page 24: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/24.jpg)
Why do dimensionality reduction?
We want to improve our supervised learning techniques
Chat data is even more sparse than many NL datasets
Good representations can help search and similarity models
Off the shelf representations are good
Off the shelf + custom representations are better
![Page 25: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/25.jpg)
Setting up methods for learning
Word2vec, NMF, even LDA
Most methods equivalent*
Chat has no clear document barriers
Methods assume either continuous context or separate documents
Using individual messages as contexts is too sparse
![Page 26: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/26.jpg)
Choosing a context
Representations are influenced by context choice
Figure out your goal
Choose context where words are associated in a way helpful for your goal
For our purposes: Words should be similar if they occur together in issues people have
![Page 27: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/27.jpg)
Using a time-based context window
Window before each question
Problem statement and questions should be related
USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…
![Page 28: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/28.jpg)
Keywords are extracted from recent history
USER2: Can I email this form, or do I have to print it out?
USER1: You need to drop the form off in person
USER2: OK, sure.
USER1: Great.
USER2: Where can I get access to the printers?
…
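One way to build these contexts is sketched below, using the conversation from this slide. The timestamps, the 300-second window, the small stopword list, and the function names are all assumptions, since the talk does not give the window length or the keyword extraction method. Messages are assumed to be sorted by time.

```python
import re

WINDOW_SECONDS = 300  # hypothetical window length
# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {
    "a", "i", "the", "to", "do", "or", "it", "in", "you", "can", "this",
    "have", "off", "out", "get", "where", "ok", "sure", "great", "need",
}

def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def question_contexts(messages, window=WINDOW_SECONDS):
    """For each question, pool keywords from messages in the preceding window."""
    contexts = []
    for i, (ts, _user, text) in enumerate(messages):
        if "?" not in text:
            continue
        tokens = []
        # Include the question itself and everything sent up to `window` seconds earlier.
        for prev_ts, _prev_user, prev_text in messages[: i + 1]:
            if ts - prev_ts <= window:
                tokens.extend(tokenize(prev_text))
        contexts.append((text, tokens))
    return contexts

# (seconds, user, text); the timestamps are invented for this example.
chat = [
    (0, "USER2", "Can I email this form, or do I have to print it out?"),
    (40, "USER1", "You need to drop the form off in person"),
    (55, "USER2", "OK, sure."),
    (60, "USER1", "Great."),
    (90, "USER2", "Where can I get access to the printers?"),
]
contexts = question_contexts(chat)
```

The context for the printer question then contains "form", "print", and "person" alongside "printers", so a representation trained on these contexts sees the words of one issue together.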
![Page 29: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/29.jpg)
Similarity from resulting representations
'printer'
['printer', 'choice', 'fuji', 'xerox', 'settings', 'sequence', 'default', 'rollover', 'driver', 'takes', 'smaller', 'main']
'issue'
['issue', 'resolved', 'helping', 'experiencing', 'companies', 'related', 'assuming', 'reported', 'double', 'site', 'saw', 'causing', 'understand', 'sorted', 'logging', 'heard']
'ssh'
['ssh', 'config', 'dhcp', 'ping', 'reconnect', 'jpg', 'webconsole', 'coats', 'lab', 'browsers', 'instances', 'bypass']
![Page 30: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/30.jpg)
Final Thoughts...
Tip of the iceberg
Understand how people interact
What information can we extract?
Can we escape our corpus?
![Page 31: Daniel Shank, Data Scientist, Talla at MLconf SF 2017](https://reader031.vdocument.in/reader031/viewer/2022030318/5a65ee6c7f8b9ad02f8b4c77/html5/thumbnails/31.jpg)
Thank you everyone!
'thanks'
['heaps', 'great', 'perfect', 'fantastic']