NLP and Entropy


  • Slide 1/54

    ENTROPY IN NLP

    Presented by

    Avishek Dan (113050011)

    Lahari Poddar (113050029)

    Aishwarya Ganesan (113050042)

    Guided by

    Dr. Pushpak Bhattacharyya

  • Slide 2/54

    MOTIVATION

  • Slide 3/54

  • Slide 4/54

  • Slide 5/54

    40% REMOVED

  • Slide 6/54

  • Slide 7/54

    20% REMOVED

  • Slide 8/54

    10% REMOVED

  • Slide 9/54

    0% REMOVED

  • Slide 10/54

  • Slide 11/54

    ENTROPY

  • Slide 12/54

    ENTROPY

    Entropy or self-information is the average uncertainty of a single
    random variable:

    H(X) = -\sum_x p(x) \log_2 p(x)

    (i) H(X) >= 0
    (ii) H(X) = 0 only when the value of X is determinate, hence
    providing no new information

    From a language perspective, it is the information that is produced
    on average for each letter of text in the language

  • Slide 13/54

    EXAMPLE: SIMPLE POLYNESIAN

    Random sequence of letters with probabilities:

      p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8

    Per-letter entropy is:

      H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i) \log_2 P(i) = 2.5 bits

    Code to represent the language: t: 00, a: 01, p: 100, k: 101, i: 110, u: 111
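    A minimal sketch, assuming the standard Simple Polynesian probabilities above
    (from Manning and Schütze), that reproduces the 2.5 bits/letter figure:

      import math

      # Simple Polynesian letter probabilities (Manning & Schutze, FSNLP, ch. 2)
      probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

      def entropy(dist):
          """H(X) = -sum_x p(x) * log2 p(x), in bits."""
          return -sum(p * math.log2(p) for p in dist.values() if p > 0)

      print(entropy(probs))  # 2.5 bits per letter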

  • Slide 14/54

    SHANNON'S EXPERIMENT TO DETERMINE ENTROPY

  • Slide 15/54

    SHANNON'S ENTROPY EXPERIMENT

    Source: http://www.math.ucsd.edu/~crypto/java/ENTROPY/
  • Slide 16/54

    CALCULATION OF ENTROPY

    The user acts as a language model

    Encode the number of guesses required for each letter

    Apply an entropy encoding algorithm (lossless compression)
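    A minimal sketch of the guess-encoding idea, assuming a fixed frequency-ordered
    guesser stands in for the human subject (a real subject exploits context and
    needs far fewer guesses); the sample text is a placeholder:

      from collections import Counter

      def guess_counts(text):
          """Replace each character by the number of guesses a guesser who
          always proposes characters in overall frequency order would need.
          The resulting sequence of small numbers can then be fed to any
          lossless entropy coder."""
          freq = Counter(text)
          order = [ch for ch, _ in freq.most_common()]
          rank = {ch: i + 1 for i, ch in enumerate(order)}
          return [rank[ch] for ch in text]

      print(guess_counts("the cat sat on the mat"))  # mostly small numbers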

  • Slide 17/54

    ENTROPY OF A LANGUAGE

    Series of approximations F_0, F_1, F_2, ..., F_N

    F_N = -\sum_{i,j} p(b_i, j) \log_2 p(j \mid b_i)
        = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_i p(b_i) \log_2 p(b_i)

    where b_i ranges over blocks of N-1 letters and j over the letter
    that follows the block

    H = \lim_{N \to \infty} F_N

  • Slide 18/54

    ENTROPY OF ENGLISH

    F_0 = \log_2 26 = 4.7 bits per letter

  • Slide 19/54

    RELATIVE FREQUENCY OF OCCURRENCE OF ENGLISH ALPHABETS

  • Slide 20/54

    WORD ENTROPY OF ENGLISH

    Source: Shannon, Prediction and Entropy of Printed English

  • Slide 21/54

    ZIPF'S LAW

    Zipf's law states that, given some corpus of natural language
    utterances, the frequency of any word is inversely proportional to
    its rank in the frequency table
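    A minimal sketch (not from the slides) that checks Zipf's law on a corpus:
    if the law holds, rank times frequency stays roughly constant. The corpus
    file path is a placeholder.

      from collections import Counter

      def zipf_table(tokens, top=10):
          """Return (rank, word, frequency, rank*frequency) for the most common words."""
          counts = Counter(tokens)
          return [(r, w, f, r * f)
                  for r, (w, f) in enumerate(counts.most_common(top), start=1)]

      # Placeholder corpus; substitute any large text file.
      tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
      for rank, word, freq, product in zipf_table(tokens):
          print(rank, word, freq, product)  # product roughly constant under Zipf's law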

  • Slide 22/54

    LANGUAGE MODELING

  • Slide 23/54

    LANGUAGE MODELING

    A language model computes either:

    the probability of a sequence of words:
    P(W) = P(w_1, w_2, w_3, w_4, w_5, ..., w_n)

    or the probability of an upcoming word:
    P(w_n | w_1, w_2, ..., w_{n-1})

  • Slide 24/54

    APPLICATIONS OF LANGUAGE MODELS

    POS Tagging

    Machine Translation: P(heavy rains tonight) > P(weighty rains tonight)

    Spell Correction: P(about fifteen minutes from) > P(about fifteen minuets from)

    Speech Recognition: P(I saw a van) >> P(eyes awe of an)

  • Slide 25/54

    N-GRAM MODEL

    Chain rule:
    P(w_1 w_2 ... w_n) = \prod_i P(w_i \mid w_1 w_2 ... w_{i-1})

    Markov assumption:
    P(w_i \mid w_1 ... w_{i-1}) \approx P(w_i \mid w_{i-k} ... w_{i-1})

    Maximum likelihood estimate (for k = 1):
    P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
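    As a concrete illustration of the MLE bigram estimate above, a minimal sketch
    (not from the slides); the toy corpus is a placeholder:

      from collections import Counter

      def train_bigram_mle(sentences):
          """MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
          unigrams, bigrams = Counter(), Counter()
          for sent in sentences:
              tokens = ["<s>"] + sent.lower().split() + ["</s>"]
              unigrams.update(tokens)
              bigrams.update(zip(tokens, tokens[1:]))
          return lambda w, prev: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

      corpus = ["I saw a van", "I saw a cat"]   # placeholder corpus
      p = train_bigram_mle(corpus)
      print(p("saw", "i"))   # 1.0
      print(p("van", "a"))   # 0.5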

  • Slide 26/54

    EVALUATION OF LANGUAGE MODELS

    A good language model gives a high probability to real English

    Extrinsic evaluation: for comparing models A and B, run applications
    such as POS tagging or translation with each model, get an accuracy
    for A and for B, and compare the accuracies

    Intrinsic evaluation: use cross entropy and perplexity; the true
    model for the data has the lowest possible entropy / perplexity

  • Slide 27/54

    RELATIVE ENTROPY

    For two probability mass functions p(x) and q(x), their relative
    entropy is:

    D(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}

    Also known as KL divergence, a measure of how different two
    distributions are.

  • Slide 28/54

    CROSS ENTROPY

    Entropy as a measure of how surprised we are, measured by the
    pointwise entropy for a model m: H(w \mid h) = -\log_2 m(w \mid h)

    Produce a model q of the real distribution p that minimizes D(p \| q)

    The cross entropy between a random variable X with distribution p(x)
    and another distribution q (a model of p) is:

    H(X, q) = -\sum_x p(x) \log_2 q(x)
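    A minimal sketch (not from the slides) relating the two quantities above on a
    toy pair of distributions; p and q are placeholders, and q is assumed to be
    non-zero wherever p is non-zero:

      import math

      def entropy(p):
          return -sum(pi * math.log2(pi) for pi in p if pi > 0)

      def cross_entropy(p, q):
          """H(X, q) = -sum_x p(x) * log2 q(x)."""
          return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

      def kl_divergence(p, q):
          """D(p || q) = H(X, q) - H(X); non-negative, and 0 only when q = p."""
          return cross_entropy(p, q) - entropy(p)

      p = [0.5, 0.25, 0.25]   # "true" distribution (placeholder)
      q = [0.4, 0.4, 0.2]     # model of p (placeholder)
      print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))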

  • Slide 29/54

    PERPLEXITY

    Perplexity is defined as the probability of the test set assigned by
    the language model, normalized by the number of words:

    PP(W) = P(w_1 w_2 ... w_n)^{-1/n}

    It is the most common measure used to evaluate language models

    In terms of cross entropy:

    PP(x_{1n}, m) = 2^{H(x_{1n}, m)}
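    A minimal sketch (not from the slides) that computes per-word cross entropy
    and perplexity for a test sentence under a bigram model such as the earlier
    sketch; the eps floor is a placeholder guard against zero-probability bigrams,
    which the smoothing slide below addresses properly:

      import math

      def perplexity(sentence, prob, eps=1e-12):
          """Perplexity = 2**(cross entropy), where cross entropy is the
          average negative log2 probability per word."""
          tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
          log_prob = sum(math.log2(max(prob(w, prev), eps))
                         for prev, w in zip(tokens, tokens[1:]))
          cross_entropy = -log_prob / (len(tokens) - 1)
          return 2 ** cross_entropy

      # 'p' is the bigram model from the earlier sketch (placeholder here).
      # print(perplexity("I saw a van", p))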

  • Slide 30/54

    EXAMPLE

  • Slide 31/54

    EXAMPLE (CONTD.)

    Cross Entropy:

    Perplexity:

  • Slide 32/54

    SMOOTHING

    Bigrams with zero probability mean we cannot compute perplexity

    When we have sparse statistics, steal probability mass to generalize
    better

    Many techniques are available

    Add-one estimation: add one to all the counts

    P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
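    A minimal sketch (not from the slides) of the add-one estimate above, reusing
    the counting scheme from the earlier bigram sketch; the corpus is a placeholder:

      from collections import Counter

      def train_bigram_add_one(sentences):
          """Add-one (Laplace) bigram estimate:
          P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)."""
          unigrams, bigrams = Counter(), Counter()
          for sent in sentences:
              tokens = ["<s>"] + sent.lower().split() + ["</s>"]
              unigrams.update(tokens)
              bigrams.update(zip(tokens, tokens[1:]))
          vocab_size = len(unigrams)
          return lambda w, prev: (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

      p_smooth = train_bigram_add_one(["I saw a van", "I saw a cat"])
      print(p_smooth("dog", "a"))  # non-zero even for an unseen bigram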

  • Slide 33/54

    PERPLEXITY FOR N-GRAM MODELS

    Perplexity values yielded by n-gram models on English text range
    from about 50 to almost 1000 (corresponding to cross entropies from
    about 6 to 10 bits/word)

    Training: 38 million words of WSJ text (from Jurafsky)
    Vocabulary: 19,979 words
    Test: 1.5 million words from WSJ

    N-gram order:  Unigram  Bigram  Trigram
    Perplexity:    962      170     109

  • Slide 34/54

    MAXIMUM ENTROPY MODEL

  • Slide 35/54

    STATISTICAL MODELING

    Constructs a stochastic model to predict the behavior of a random
    process

    Given a sample, represent the process

    Use the representation to make predictions about the future behavior
    of the process

    E.g., team selectors employ batting averages, compiled from the
    history of players, to gauge the likelihood that a player will
    succeed in the next match. Thus informed, they adjust their lineups
    accordingly

  • Slide 36/54

    STAGES OF STATISTICAL MODELING

    1. Feature selection: determine a set of statistics that captures
    the behavior of the random process.

    2. Model selection: design an accurate model of the process, a model
    capable of predicting its future output.

  • Slide 37/54

    MOTIVATING EXAMPLE

    Model an expert translator's decisions concerning the proper French
    rendering of the English word "on"

    A model p of the expert's decisions assigns to each French word or
    phrase f an estimate, p(f), of the probability that the expert would
    choose f as a translation of "on"

    Our goal is to extract a set of facts about the decision-making
    process from the sample and construct a model of this process

  • Slide 38/54

    MOTIVATING EXAMPLE

    A clue from the sample is the list of allowed translations of "on":
    {sur, dans, par, au bord de}

    With this information in hand, we can impose our first constraint on p:

    p(sur) + p(dans) + p(par) + p(au bord de) = 1

    The most uniform model will divide the probability values equally

    Suppose we notice that the expert chose either dans or sur 30% of
    the time; then a second constraint can be added:

    p(dans) + p(sur) = 3/10

    Intuitive principle: model all that is known and assume nothing
    about that which is unknown
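    As a worked illustration (not on the slide), the most uniform model that
    satisfies both constraints splits each constrained group of outcomes evenly:

      % worked example, assuming only the two constraints above
      p(\mathit{dans}) = p(\mathit{sur}) = \tfrac{1}{2} \cdot \tfrac{3}{10} = 0.15, \qquad
      p(\mathit{par}) = p(\mathit{au\ bord\ de}) = \tfrac{1}{2} \cdot \tfrac{7}{10} = 0.35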

  • Slide 39/54

    MAXIMUM ENTROPY MODEL

    A random process produces an output value y, a member of a finite
    set Y:

    y ∈ {sur, dans, par, au bord de}

    The process may be influenced by some contextual information x, a
    member of a finite set X; x could include the words in the English
    sentence surrounding "on"

    A stochastic model accurately represents the behavior of the random
    process: given a context x, the process will output y

  • Slide 40/54

    MAXIMUM ENTROPY MODEL

    Empirical probability distribution:

    \tilde{p}(x, y) = \frac{1}{N} \times (number of times (x, y) occurs in the sample)

    Feature function (indicator function), e.g.:

    f(x, y) = 1 if y = sur and the triggering word follows "on"; 0 otherwise

    Expected value of the feature function

    For the training data:
    \tilde{p}(f) = \sum_{x,y} \tilde{p}(x, y) f(x, y)    (1)

    For the model:
    p(f) = \sum_{x,y} \tilde{p}(x) p(y \mid x) f(x, y)    (2)

  • Slide 41/54

    MAXIMUM ENTROPY MODEL

    To accord the model with the statistic, the expected values are
    constrained to be equal:

    p(f) = \tilde{p}(f)    (3)

    Given n feature functions f_i, the model p should lie in the subset
    C of P defined by:

    C = \{ p \in P \mid p(f_i) = \tilde{p}(f_i) \ \text{for} \ i = 1, 2, ..., n \}    (4)

    Choose the model p* with maximum entropy H(p):

    H(p) = -\sum_{x,y} \tilde{p}(x) p(y \mid x) \log p(y \mid x)    (5)
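    A minimal sketch (not from the slides) that finds the maximum entropy
    distribution for the translation example numerically, under the two
    constraints from the motivating example; scipy's SLSQP optimizer is an
    implementation choice, not something the slides prescribe:

      import numpy as np
      from scipy.optimize import minimize

      # Outcomes, in order: sur, dans, par, au bord de
      def neg_entropy(p):
          p = np.clip(p, 1e-12, 1.0)
          return float(np.sum(p * np.log2(p)))  # minimizing this maximizes entropy

      constraints = [
          {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
          {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(sur) + p(dans) = 3/10
      ]
      result = minimize(neg_entropy, x0=np.full(4, 0.25), method="SLSQP",
                        bounds=[(0.0, 1.0)] * 4, constraints=constraints)
      print(result.x)  # approx. [0.15, 0.15, 0.35, 0.35]

    The optimizer recovers the same split as the worked example after slide 38.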

  • Slide 42/54

    APPLICATIONS OF MEM

  • Slide 43/54

    STATISTICAL MACHINE TRANSLATION

    A French sentence F is translated to an English sentence E as:

    \hat{E} = \arg\max_E P(F \mid E) P(E)

    Addition of a MEM can introduce context dependency:

    P_e(f \mid x): probability of choosing e (English) as the rendering
    of f (French) given the context x

  • Slide 44/54

    PART OF SPEECH TAGGING

    The probability model is defined over H x T

    H: set of possible word and tag contexts (histories)
    T: set of allowable tags

    Entropy of the distribution:

    H(p) = -\sum_{h \in H, t \in T} p(h, t) \log p(h, t)

    Sample feature set:

    Precision: 96.6%

  • Slide 45/54

    PREPOSITION PHRASE ATTACHMENT

    MEM produces a probability distribution for the PP-attachment
    decision using only information from the verb phrase in which the
    attachment occurs

    The conditional probability of an attachment is p(d | h), where h is
    the history and d ∈ {0, 1} corresponds to a noun or verb attachment
    (respectively)

    Features: testing for features should only involve
    Head Verb (V)
    Head Noun (N1)
    Head Preposition (P)
    Head Noun of the Object of the Preposition (N2)

    Performance: Decision Tree: 79.5%, MEM: 82.2%

  • Slide 46/54

    ENTROPY OF OTHER LANGUAGES

  • Slide 47/54

    ENTROPY OF EIGHT LANGUAGES BELONGING TO FIVE LINGUISTIC FAMILIES

    Indo-European: English, French, and German
    Finno-Ugric: Finnish
    Austronesian: Tagalog
    Isolate: Sumerian
    Afroasiatic: Old Egyptian
    Sino-Tibetan: Chinese

    H_s: entropy when the words are randomly ordered; H: entropy when
    the words are ordered; D_s = H_s - H

    Source: Universal Entropy of Word Ordering Across Linguistic Families

  • Slide 48/54

    ENTROPY OF HINDI

    Zero-order: 5.61 bits/symbol
    First-order: 4.901 bits/symbol
    Second-order: 3.79 bits/symbol
    Third-order: 2.89 bits/symbol
    Fourth-order: 2.31 bits/symbol
    Fifth-order: 1.89 bits/symbol

  • Slide 49/54

    ENTROPY TO PROVE THAT A SCRIPT REPRESENTS LANGUAGE

    Pictish (a Scottish, Iron Age culture) symbols were revealed as a
    written language through application of Shannon entropy

    In "Entropy, the Indus Script, and Language", the case is made by
    showing that the block entropies of the Indus texts remain close to
    those of a variety of natural languages and far from the entropies
    of unordered and rigidly ordered sequences

  • Slide 50/54

    ENTROPY OF LINGUISTIC AND NON-LINGUISTIC SYSTEMS

    Source: Entropy, the Indus Script, and Language

  • Slide 51/54

    CONCLUSION

    The concept of entropy has a long history, and it has a wide range
    of applications in modern NLP

    NLP is heavily supported by entropy models at various stages. We
    have tried to touch upon a few aspects of this.

  • Slide 52/54

    REFERENCES

    C. E. Shannon, Prediction and Entropy of Printed English, Bell
    System Technical Journal, Volume 30, Issue 1, January 1951

    http://www.math.ucsd.edu/~crypto/java/ENTROPY/

    Chris Manning and Hinrich Schütze, Foundations of Statistical
    Natural Language Processing, MIT Press, Cambridge, MA, May 1999

    Daniel Jurafsky and James H. Martin, Speech and Language Processing:
    An Introduction to Natural Language Processing, Computational
    Linguistics, and Speech Recognition, Second Edition, January 6, 2009

    Marcelo A. Montemurro, Damian H. Zanette, Universal Entropy of Word
    Ordering Across Linguistic Families, PLoS ONE 6(5): e19875,
    doi:10.1371/journal.pone.0019875

  • Slide 53/54

    Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar,
    Ronojoy Adhikari, Iravatham Mahadevan, Entropy, the Indus Script,
    and Language: A Reply to R. Sproat

    Adwait Ratnaparkhi, A Maximum Entropy Model for Prepositional Phrase
    Attachment, HLT '94

    Berger, A Maximum Entropy Approach to Natural Language Processing,
    1996

    Adwait Ratnaparkhi, A Maximum Entropy Model for Part-Of-Speech
    Tagging, 1996

  • Slide 54/54

    THANK YOU