nlp_compromise
![Page 1: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/1.jpg)
The pope is catholic.
language as an interface
language as data
introduction
![Page 2: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/2.jpg)
(@spencermountain)
introduction
![Page 3: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/3.jpg)
problem
![Page 4: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/4.jpg)
problem
![Page 5: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/5.jpg)
problem
![Page 6: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/6.jpg)
4-gram: london in the rain
3-gram: london in the, in the rain
2-gram: london in, in the, the rain
1-gram: london, in, the, rain
4-words : 10 requests per keystroke
5-words : 15
6-words : 21
7-words : 28
8-words : 36
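The counts on this slide follow n(n+1)/2 for a phrase of n words. A minimal sketch of the enumeration, assuming whitespace tokenization; `allNgrams` and `ngramCount` are illustrative names, not part of nlp_compromise:

```javascript
// Return every contiguous n-gram of a phrase, longest first.
function allNgrams(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(" "));
    }
  }
  return grams;
}

// A phrase of n words has n * (n + 1) / 2 contiguous n-grams.
function ngramCount(n) {
  return (n * (n + 1)) / 2;
}

console.log(allNgrams("london in the rain").length); // 10
console.log(ngramCount(8)); // 36
```

If each n-gram becomes one lookup, the request count grows quadratically with the length of the phrase.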
problem
![Page 7: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/7.jpg)
london in the rain
london in the, in the rain
london in, in the, the rain
london, in, the, rain
#1 Stopwords blacklist: in, the
#2 Edge gram filter: london in the, in the rain, london in, the rain
#3 Redundancy check: in the
“rain”, “london”, “london in the rain”
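A rough sketch of the stopword blacklist and the edge gram filter, assuming an edge gram is any n-gram that starts or ends with a stopword. The two-word stopword list and the name `filterCandidates` are illustrative. The slide does not spell out the redundancy check, so it is left out here:

```javascript
// Illustrative stopword list; a real blacklist would be much longer.
const STOPWORDS = new Set(["in", "the"]);

// Keep only n-grams whose first and last words are not stopwords.
// This also removes single stopwords and all-stopword grams like "in the".
function filterCandidates(grams) {
  return grams.filter((gram) => {
    const words = gram.toLowerCase().split(" ");
    const first = words[0];
    const last = words[words.length - 1];
    return !STOPWORDS.has(first) && !STOPWORDS.has(last);
  });
}

const candidates = [
  "london in the rain",
  "london in the", "in the rain",
  "london in", "in the", "the rain",
  "london", "in", "the", "rain",
];
console.log(filterCandidates(candidates));
// [ 'london in the rain', 'london', 'rain' ]
```

Under these assumptions the ten candidates shrink to three lookups.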
problem
![Page 8: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/8.jpg)
● NLTK - excellent, huge, python
● Stanford parser - excellent, huge, java
● Freeling - excellent, huge, C++
● Illinois tagger - excellent, huge, java
Or an offsite API? Alchemy, TextRazor, OpenCalais, Embedly, Zemanta
When all you’ve got is a jackhammer..
niche
![Page 9: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/9.jpg)
Can it be hacked?
tldr: yes.
niche
![Page 10: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/10.jpg)
How big is a language?
Shakespeare - 35,000
niche
![Page 11: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/11.jpg)
Zipf’s law
The top 10 words account for 25% of language.
The top 100 words account for 50% of language.
The top 50,000 words account for 95% of language.
niche
![Page 12: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/12.jpg)
An average person will ever hear 50,000 different words
602 kb uncompressed
~4 lookups in binary search
niche
![Page 13: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/13.jpg)
first, let’s kill the nouns
70%
process
180 kb uncompressed
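One way to read this step: if every unknown word falls back to Noun, the nouns do not need to be stored at all. A minimal sketch under that assumption; the four-entry `LEXICON` and the name `lookupTag` are illustrative, not the library's data or API:

```javascript
// Tiny illustrative lexicon with no nouns in it.
const LEXICON = new Map([
  ["speak", "Verb"],
  ["nice", "Adjective"],
  ["quickly", "Adverb"],
  ["the", "Det"],
]);

// Look the word up; anything missing is assumed to be a noun.
function lookupTag(word) {
  return LEXICON.get(word.toLowerCase()) || "Noun";
}

console.log(lookupTag("Speak"));  // Verb
console.log(lookupTag("tomato")); // Noun (not stored, falls back)
```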
![Page 14: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/14.jpg)
Noun: Tomato, Tomatoes, Toronto, Torontonian
Verb: Speak, Spoke, Speaking, will speak, have spoken, had spoken, ...
Adjective: nice, nicer, nicest
Adverb: quickly, quicklier, quickliest
improveify your vocabularies
“awesome” → “awesomeify”
“quickly” → “quick”
“tomatoey” → “tomato”
“speak” → “speaker”
“aggressive” → “aggressiveness”
“civil” → “civilize”
Each word: n/2.3
*not handsome, *not truly, *not is, *not economics
niche
![Page 15: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/15.jpg)
then, let’s conjugate our verbs
process
110 kb uncompressed
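The saving here presumably comes from storing only the infinitive and generating the other forms on demand. A naive sketch of that idea, with a small table for irregular verbs; `IRREGULAR` and `conjugate` are hypothetical names, and the rules ignore cases such as consonant doubling ("stop" → "stopped") or verbs ending in "ee":

```javascript
// Irregular forms have to be stored; everything else is generated by rule.
const IRREGULAR = new Map([
  ["speak", { past: "spoke", participle: "spoken" }],
]);

function conjugate(infinitive) {
  const irregular = IRREGULAR.get(infinitive) || {};
  // Drop a trailing "e" before adding "-ed" / "-ing" (civilize -> civilized).
  const stem = infinitive.endsWith("e") ? infinitive.slice(0, -1) : infinitive;
  const past = irregular.past || stem + "ed";
  const participle = irregular.participle || past;
  return {
    infinitive: infinitive,
    past: past,
    gerund: stem + "ing",
    future: "will " + infinitive,
    perfect: "have " + participle,
    pluperfect: "had " + participle,
  };
}

console.log(conjugate("speak").pluperfect); // had spoken
console.log(conjugate("walk").past);        // walked
```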
![Page 16: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/16.jpg)
process
jQuery 256kb
d3js 330kb
react 653kb
lodash 503kb
the whole english language: 110 kb uncompressed
![Page 17: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/17.jpg)
Ok, let’s roll our own POS tagger..
(what could go rong?)
process
1) ☑ Lexicon
2) ◻ Suffix regexes
3) ◻ Sentence markov
![Page 18: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/18.jpg)
Suffix rules
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ◻ Sentence markov
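A small sketch of what suffix rules can look like: an ordered list of regexes, first match wins, with Noun as the fallback. The specific patterns and the name `guessTag` are assumptions made for illustration; they are not the library's actual rule set:

```javascript
// Ordered suffix rules: the first pattern that matches decides the tag.
const SUFFIX_RULES = [
  [/ly$/, "Adverb"],
  [/(ing|ed)$/, "Verb"],
  [/(ous|ful|able|ive)$/, "Adjective"],
  [/(tion|ness|ment)$/, "Noun"],
];

function guessTag(word) {
  const lower = word.toLowerCase();
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(lower)) return tag;
  }
  return "Noun"; // fallback when no suffix rule fires
}

console.log(guessTag("quickly"));        // Adverb
console.log(guessTag("aggressive"));     // Adjective
console.log(guessTag("aggressiveness")); // Noun
```

Rules this blunt misfire on words like "family" or "red", which is why a lexicon of exceptions is checked first.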
![Page 19: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/19.jpg)
Grammar rules - markov
She could walk the walk .
before: Verb - Det - Verb
after: Verb - Det - Noun
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ☑ Sentence markov
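The "walk the walk" correction can be sketched as a pass over neighbouring tags. Note this is a single hand-written contextual rule (a Verb directly after a Det becomes a Noun), not a trained Markov model; `applyGrammarRules` and the tag names for "She" and "could" are illustrative assumptions:

```javascript
// Re-tag using the previous word's tag as context.
function applyGrammarRules(tags) {
  const fixed = tags.slice(); // do not mutate the caller's array
  for (let i = 1; i < fixed.length; i++) {
    // A determiner is followed by a noun phrase, so a "Verb" here is suspect.
    if (fixed[i - 1] === "Det" && fixed[i] === "Verb") {
      fixed[i] = "Noun";
    }
  }
  return fixed;
}

// "She could walk the walk"
const before = ["Pronoun", "Verb", "Verb", "Det", "Verb"];
console.log(applyGrammarRules(before));
// [ 'Pronoun', 'Verb', 'Verb', 'Det', 'Noun' ]
```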
![Page 20: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/20.jpg)
“Unreasonable effectiveness” of rule-based taggers:
● a 1,000 word lexicon - 45% precision
● fallback to [Noun] - 70% precision
● a little regex - 74% precision
● a little grammar in it - 81% precision
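Percentages like these are typically produced by comparing the tagger's output with hand-labelled tags, token by token. A minimal sketch of that comparison, assuming both tag lists are aligned; `tagAccuracy` is a hypothetical helper and the example tags are made up for illustration:

```javascript
// Share of tokens whose predicted tag equals the hand-labelled tag.
function tagAccuracy(predicted, gold) {
  if (predicted.length !== gold.length) {
    throw new Error("tag lists must have the same length");
  }
  if (gold.length === 0) return 0;
  let correct = 0;
  for (let i = 0; i < gold.length; i++) {
    if (predicted[i] === gold[i]) correct++;
  }
  return correct / gold.length;
}

// "She could walk the walk", before and after a grammar pass.
const gold = ["Pronoun", "Verb", "Verb", "Det", "Noun"];
console.log(tagAccuracy(["Pronoun", "Verb", "Verb", "Det", "Verb"], gold)); // 0.8
console.log(tagAccuracy(["Pronoun", "Verb", "Verb", "Det", "Noun"], gold)); // 1
```

Running the same measurement after each added layer is how a step-by-step table like the one above can be built.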
process
![Page 21: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/21.jpg)
t.text("keep on rocking in the free world")
t.negate()
// "don’t keep on rocking in the free world."
outcome
![Page 22: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/22.jpg)
t.text("it is a cool library")
t.toValleyGirl()
// "so, it is like, a cool library."
outcome
![Page 23: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/23.jpg)
We gave the monkeys the bananas,
..because they were ripe.
..because they were hungry.
outcome
![Page 24: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/24.jpg)
list of letters: We gave the monkeys the bananas
POS-tagging: [Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
Dependency parser: We give [Noun] [Noun]
Knowledge engine: [act / transfer / voluntary] [genus / monkey] [plant / banana]
outcome
![Page 25: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/25.jpg)
#TODOFML
● Mutable/Immutable API
● Speed, performance testing
● Romance-language verb conjugations
● ‘bl.ocks.org’ of demos and docs
outcome
![Page 26: nlp_compromise](https://reader031.vdocument.in/reader031/viewer/2022030315/5882b38f1a28abd75a8b6b1d/html5/thumbnails/26.jpg)
npm install --wooyeah
@spencermountain
Slack group, mailing list, github, Toronto/coffee