NLP and Entropy
TRANSCRIPT
-
ENTROPY IN NLP
Presented by
Avishek Dan (113050011)
Lahari Poddar (113050029)
Aishwarya Ganesan (113050042)
Guided by
Dr. Pushpak Bhattacharyya
-
MOTIVATION
-
40% REMOVED
-
20% REMOVED
-
10% REMOVED
-
0% REMOVED
-
ENTROPY
-
ENTROPY
Entropy, or self-information, is the average uncertainty of a single random variable X:
H(X) = -\sum_x p(x) \log_2 p(x)
(i) H(X) >= 0
(ii) H(X) = 0 only when the value of X is determinate, hence providing no new information
From a language perspective, it is the information produced on average for each letter of text in the language
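To make the definition concrete, here is a minimal Python sketch (an addition to this transcript, not from the slides) that computes H(X) for a discrete distribution:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: H = 1 bit.
print(entropy([0.5, 0.5]))   # 1.0
# A determinate variable carries no information: H = 0.
print(entropy([1.0]))        # 0.0, illustrating property (ii)
```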
-
EXAMPLE: SIMPLE POLYNESIAN
Random sequence of letters with probabilities (from Manning and Schütze):
P(p) = 1/8, P(t) = 1/4, P(k) = 1/8, P(a) = 1/4, P(i) = 1/8, P(u) = 1/8
Per-letter entropy is:
H = -\sum_l P(l) \log_2 P(l) = 2.5 bits
Code to represent the language:
p = 100, t = 00, k = 101, a = 01, i = 110, u = 111
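A quick sketch (added here for illustration) confirms that the per-letter entropy matches the expected length of the code above:

```python
import math

probs = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}
code  = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}

# Per-letter entropy of Simple Polynesian.
H = -sum(p * math.log2(p) for p in probs.values())
# Expected code length: sum over letters of P(letter) * codeword length.
avg_len = sum(probs[l] * len(code[l]) for l in probs)

print(H, avg_len)  # both 2.5 bits: the code achieves the entropy bound
```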
-
SHANNON'S EXPERIMENT TO DETERMINE ENTROPY
-
SHANNON'S ENTROPY EXPERIMENT
Source: http://www.math.ucsd.edu/~crypto/java/ENTROPY/
-
CALCULATION OF ENTROPY
Use the human subject as a language model
Encode the number of guesses required for each letter
Apply an entropy encoding algorithm (lossless compression) to the guess sequence
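A minimal sketch of the idea (my own illustration, not the presenters' code): replace each letter by the rank at which a guesser would have guessed it, then measure the entropy of the rank stream.

```python
import math
from collections import Counter

def rank_sequence(text, guesser):
    """Shannon's game: at each position, take the guesser's preferred
    letter ordering and record how many guesses the true letter took."""
    ranks = []
    for i, ch in enumerate(text):
        order = guesser(text[:i])         # the guesser may use the history
        ranks.append(order.index(ch) + 1)
    return ranks

def entropy_of_ranks(ranks):
    """Per-symbol entropy (bits) of the guess-count distribution; an
    entropy coder could compress the rank stream to about this rate."""
    counts = Counter(ranks)
    n = len(ranks)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A crude history-free guesser (fixed frequency order) stands in for the
# human subject; a real subject exploits context, concentrating the ranks
# near 1 and driving the rank entropy far below the raw letter entropy.
text = "the entropy of printed english"
freq_order = sorted(set(text), key=text.count, reverse=True)
print(entropy_of_ranks(rank_sequence(text, lambda history: freq_order)))
```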
-
ENTROPY OF A LANGUAGE
Estimated by a series of approximations F_0, F_1, F_2, ..., F_N, where F_N is the entropy of a letter conditioned on the N-1 preceding letters:
F_N = -\sum_{i,j} p(b_i, j) \log_2 p(j \mid b_i) = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_i p(b_i) \log_2 p(b_i)
Here b_i ranges over blocks of N-1 letters and j is the letter that follows. The entropy of the language is the limit:
H = \lim_{N \to \infty} F_N
-
ENTROPY OF ENGLISH
F_0 = \log_2 26 \approx 4.7 bits per letter (zeroth order: 26 equiprobable letters)
-
RELATIVE FREQUENCY OF OCCURRENCE OF ENGLISH LETTERS
-
WORD ENTROPY OF ENGLISH
Source: Shannon, Prediction and Entropy of Printed English
-
ZIPF'S LAW
Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table:
f(r) \propto 1/r
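A small sketch (illustrative, not from the slides) that checks the rank-frequency relationship on any text: under Zipf's law, rank times frequency should stay roughly constant.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, frequency, rank * frequency) for each word type."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.values(), reverse=True)
    return [(r, f, r * f) for r, f in enumerate(ranked, start=1)]

# On a large corpus the third column is approximately constant;
# this toy string is only a stand-in for real data.
for row in rank_frequency("the cat saw the dog and the dog saw the cat"):
    print(row)
```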
-
LANGUAGE MODELING
-
LANGUAGE MODELING
A language model computes either:
the probability of a sequence of words:
P(W) = P(w_1, w_2, w_3, w_4, w_5, \ldots, w_n)
or the probability of an upcoming word:
P(w_n \mid w_1, w_2, \ldots, w_{n-1})
-
APPLICATIONS OF LANGUAGE MODELS
POS tagging
Machine translation: P(heavy rains tonight) > P(weighty rains tonight)
Spell correction: P(about fifteen minutes from) > P(about fifteen minuets from)
Speech recognition: P(I saw a van) >> P(eyes awe of an)
-
N-GRAM MODEL
Chain rule:
P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})
Markov assumption:
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})
Maximum likelihood estimate (for k = 1):
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
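A minimal bigram model using these MLE estimates might look like this (an illustrative sketch, not code from the presentation):

```python
from collections import Counter

def train_bigram(words):
    """MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = "i saw a van and i saw a cat".split()
p = train_bigram(corpus)
print(p("i", "saw"))   # 1.0: "saw" always follows "i" in this toy corpus
print(p("a", "van"))   # 0.5: "a" is followed by "van" or "cat"
```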
-
EVALUATION OF LANGUAGE MODELS
A good language model assigns high probability to real English
Extrinsic evaluation:
For comparing models A and B, run applications such as POS tagging or translation with each model, obtain an accuracy for A and for B, and compare the two
Intrinsic evaluation: use cross entropy and perplexity
The true model for the data has the lowest possible entropy / perplexity
-
RELATIVE ENTROPY
For two probability mass functions p(x) and q(x), their relative entropy is:
D(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}
Also known as KL divergence, a measure of how different two distributions are.
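A direct translation into code (a sketch added here; the convention 0 log 0 = 0 is standard):

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; p and q are aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # > 0: p differs from the uniform model q
print(kl_divergence(p, p))   # 0.0: identical distributions
```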
-
CROSS ENTROPY
Entropy as a measure of how surprised we are, measured by the pointwise entropy of model m:
H(w \mid h) = -\log_2 m(w \mid h)
Choose a model q of the real distribution p so as to minimize D(p \| q)
The cross entropy between a random variable X with distribution p(x) and another distribution q (a model of p) is:
H(X, q) = -\sum_x p(x) \log_2 q(x) = H(X) + D(p \| q)
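The decomposition into true entropy plus divergence can be checked numerically (a sketch I am adding for illustration):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(X, q) = -sum_x p(x) log2 q(x); p is the true distribution, q the model."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
# Cross entropy decomposes as true entropy plus the model's divergence.
print(cross_entropy(p, q), entropy(p) + kl)  # equal (1.75), and >= entropy(p)
```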
-
PERPLEXITY
Perplexity is defined as the probability of the test set assigned by the language model, normalized by the number of words:
PP(x_{1n}, m) = 2^{H(x_{1n}, m)} = m(x_{1n})^{-1/n}
It is the most common measure used to evaluate language models.
-
EXAMPLE
-
EXAMPLE (CONTD.)
Cross Entropy:
Perplexity:
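As an illustrative stand-in (the probabilities below are assumed, not the slide's values), the two quantities are computed from a model's per-word probabilities like this:

```python
import math

# Assumed toy data: a model assigns these probabilities to the
# n = 4 words of a test sentence.
word_probs = [1/4, 1/8, 1/4, 1/8]
n = len(word_probs)

# Cross entropy: average negative log-probability per word.
H = -sum(math.log2(p) for p in word_probs) / n
# Perplexity: 2 raised to the cross entropy.
PP = 2 ** H

print(H)   # 2.5 bits/word
print(PP)  # about 5.66: the model is as uncertain as a uniform
           # choice among ~5.66 words at each step
```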
-
SMOOTHING
Bigrams with zero probability mean we cannot compute perplexity
When we have sparse statistics, steal probability mass from seen events to generalize to unseen ones
Many techniques are available
Add-one estimation: add one to all the counts (V is the vocabulary size)
P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
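Extending the earlier bigram sketch with add-one counts (again an illustration, not the presenters' code):

```python
from collections import Counter

def train_addone_bigram(words):
    """Add-one smoothed bigram: (c(prev, w) + 1) / (c(prev) + V)."""
    vocab = set(words)
    V = len(vocab)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return lambda prev, w: (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

corpus = "i saw a van and i saw a cat".split()
p = train_addone_bigram(corpus)
print(p("a", "van"))  # seen bigram: relatively high probability
print(p("a", "and"))  # unseen bigram: small but nonzero, so
                      # perplexity stays computable
```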
-
PERPLEXITY FOR N-GRAM MODELS
Perplexity values yielded by n-gram models on English text range from about 50 to almost 1000 (corresponding to cross entropies from about 6 to 10 bits/word)
Training: 38 million words from the WSJ (Jurafsky and Martin)
Vocabulary: 19,979 words
Test: 1.5 million words from the WSJ
N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109
-
MAXIMUM ENTROPY MODEL
-
STATISTICAL MODELING
Constructs a stochastic model to predict the behavior of a random process
Given a sample, represent the process
Use the representation to make predictions about the future behavior of the process
E.g., team selectors employ batting averages, compiled from the history of players, to gauge the likelihood that a player will succeed in the next match. Thus informed, they adjust their lineups accordingly.
-
STAGES OF STATISTICAL MODELING
1. Feature selection: determine a set of statistics that captures the behavior of the random process.
2. Model selection: design an accurate model of the process, i.e., a model capable of predicting its future output.
-
MOTIVATING EXAMPLE
Model an expert translator's decisions concerning the proper French rendering of the English word "on"
A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of "on"
Our goal is to:
Extract a set of facts about the decision-making process from the sample
Construct a model of this process
-
MOTIVATING EXAMPLE (CONTD.)
A clue from the sample is the list of allowed translations: on -> {sur, dans, par, au bord de}
With this information in hand, we can impose our first constraint on p:
p(sur) + p(dans) + p(par) + p(au bord de) = 1
The most uniform model divides the probability equally among the four options
Suppose we notice that the expert chose either dans or sur 30% of the time; a second constraint can then be added:
p(dans) + p(sur) = 3/10
The most uniform model obeying both constraints assigns p(dans) = p(sur) = 3/20 and p(par) = p(au bord de) = 7/20
Intuitive principle: model all that is known and assume nothing about that which is unknown
-
MAXIMUM ENTROPY MODEL
A random process produces an output value y, a member of a finite set Y:
y ∈ {sur, dans, par, au bord de}
The process may be influenced by some contextual information x, a member of a finite set X:
x could include the words in the English sentence surrounding "on"
A stochastic model accurately represents the behavior of the random process: given a context x, the process outputs y
-
MAXIMUM ENTROPY MODEL (CONTD.)
Empirical probability distribution:
\tilde{p}(x, y) = \frac{1}{N} \times (number of times (x, y) occurs in the sample)
Feature function:
f(x, y) = 1 if y = sur and "hold" follows "on"; 0 otherwise
Expected value of the feature function:
For the training data:
\tilde{p}(f) = \sum_{x,y} \tilde{p}(x, y) f(x, y)    (1)
For the model:
p(f) = \sum_{x,y} \tilde{p}(x) p(y \mid x) f(x, y)    (2)
-
MAXIMUM ENTROPY MODEL (CONTD.)
To accord the model with the statistic, the expected values are constrained to be equal:
p(f) = \tilde{p}(f)    (3)
Given n feature functions f_i, the model p should lie in the subset C of P defined by:
C = \{ p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, 2, \ldots, n \}    (4)
Choose the model p* with maximum entropy H(p):
H(p) = -\sum_{x,y} \tilde{p}(x) p(y \mid x) \log p(y \mid x)    (5)
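For the on-translation example, the constrained entropy maximization can be verified numerically. This sketch (my own, using scipy, which the slides do not mention) recovers the 3/20, 3/20, 7/20, 7/20 solution stated earlier:

```python
import numpy as np
from scipy.optimize import minimize

# Outputs in order: sur, dans, par, au bord de.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log2(p))  # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(sur) + p(dans) = 3/10
]
res = minimize(neg_entropy, x0=[0.25] * 4, bounds=[(0, 1)] * 4,
               constraints=constraints)
print(res.x.round(3))  # [0.15 0.15 0.35 0.35]
```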
-
APPLICATIONS OF MEM
-
STATISTICAL MACHINE TRANSLATION
A French sentence F is translated to an English sentence E as:
\hat{E} = \arg\max_E p(F \mid E) \, P(E)
The addition of a MEM can introduce context dependency:
p_e(f \mid x): the probability of choosing e (English) as the rendering of f (French) given the context x
-
PART OF SPEECH TAGGING
The probability model is defined over H x T
H: set of possible word and tag contexts (histories)
T: set of allowable tags
Entropy of the distribution:
H(p) = -\sum_{h \in H, t \in T} p(h, t) \log p(h, t)
Sample feature set:
Precision: 96.6%
-
PREPOSITIONAL PHRASE ATTACHMENT
MEM produces a probability distribution for the PP-attachment decision using only information from the verb phrase in which the attachment occurs
The conditional probability of an attachment is p(d | h), where h is the history and d ∈ {0, 1} corresponds to a noun or verb attachment (respectively)
Features: tests for features should only involve the
Head Verb (V)
Head Noun (N1)
Head Preposition (P)
Head Noun of the Object of the Preposition (N2)
Performance: Decision tree: 79.5%; MEM: 82.2%
-
ENTROPY OF OTHER LANGUAGES
-
ENTROPY OF EIGHT LANGUAGES BELONGING TO FIVE LINGUISTIC FAMILIES
Indo-European: English, French, and German
Finno-Ugric: Finnish
Austronesian: Tagalog
Isolate: Sumerian
Afroasiatic: Old Egyptian
Sino-Tibetan: Chinese
Hs: entropy when the words are randomly shuffled; H: entropy when the words are in their original order; Ds = Hs - H
Source: Universal Entropy of Word Ordering Across Linguistic Families
-
ENTROPY OF HINDI
Zero-order: 5.61 bits/symbol
First-order: 4.901 bits/symbol
Second-order: 3.79 bits/symbol
Third-order: 2.89 bits/symbol
Fourth-order: 2.31 bits/symbol
Fifth-order: 1.89 bits/symbol
-
ENTROPY TO PROVE THAT A SCRIPT REPRESENTS LANGUAGE
Pictish (a Scottish, Iron Age culture) symbols were revealed as a written language through the application of Shannon entropy
In "Entropy, the Indus Script, and Language", the authors argue for the linguistic status of the Indus script by showing that the block entropies of the Indus texts remain close to those of a variety of natural languages and far from the entropies of unordered and rigidly ordered sequences
-
ENTROPY OF LINGUISTIC AND NON-LINGUISTIC SYSTEMS
Source: Entropy, the Indus Script, and Language
-
CONCLUSION
The concept of entropy has a long history, and it has a wide range of applications in modern NLP
NLP is heavily supported by entropy models at various stages. We have tried to touch upon a few aspects of this.
-
REFERENCES
C. E. Shannon, Prediction and Entropy of Printed English, Bell System Technical Journal, Volume 30, Issue 1, January 1951
http://www.math.ucsd.edu/~crypto/java/ENTROPY/
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA: May 1999
Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition, January 6, 2009
Marcelo A. Montemurro, Damian H. Zanette, Universal Entropy of Word Ordering Across Linguistic Families, PLoS ONE 6(5): e19875. doi:10.1371/journal.pone.0019875
-
Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, Ronojoy Adhikari, Iravatham Mahadevan, Entropy, the Indus Script, and Language: A Reply to R. Sproat
Adwait Ratnaparkhi, A Maximum Entropy Model for Prepositional Phrase Attachment, HLT '94
Berger, A Maximum Entropy Approach to Natural Language Processing, 1996
Adwait Ratnaparkhi, A Maximum Entropy Model for Part-Of-Speech Tagging, 1996
-
THANK YOU