Probabilistic Language Processing
Chapter 23
Probabilistic Language Models
• Goal -- define a probability distribution over a set of strings
• Unigram, bigram, n-gram models
• Count using a corpus, but counts need smoothing:
– Add-one
– Linear interpolation
• Evaluate with the perplexity measure
• E.g. segment text written without spaces ("segmentwordswithoutspaces") into words with Viterbi
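As a concrete illustration of these bullets, here is a minimal bigram model with add-one smoothing and a perplexity evaluation; the three-sentence corpus is a made-up toy, not real training data.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); a real model would be trained on millions of sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def p_bigram(a, b):
    """Add-one (Laplace) smoothed estimate of P(b | a)."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + V)

def perplexity(sent):
    """Perplexity of a sentence under the bigram model (lower is better)."""
    logp = sum(math.log2(p_bigram(a, b)) for a, b in zip(sent, sent[1:]))
    return 2 ** (-logp / (len(sent) - 1))
```

A sentence made of seen bigrams gets lower perplexity than one made of unseen bigrams, which is exactly what the measure is for.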
PCFGs
• Rewrite rules have probabilities.
• Prob of a string is sum of probs of its parse trees.
• Context-freedom means no lexical constraints.
• Prefers short sentences.
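The "sum of probs of its parse trees" bullet can be made concrete with the CKY inside algorithm on a toy grammar in Chomsky normal form; the grammar and its probabilities are invented for illustration.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (assumed grammar, not from the text).
binary = {("S", ("NP", "VP")): 1.0, ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}

def inside_prob(words, root="S"):
    """P(string) = sum over all parse trees, via the CKY inside algorithm."""
    n = len(words)
    chart = defaultdict(float)  # (i, j, nonterminal) -> inside probability
    for i, w in enumerate(words):
        for (head, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, head)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (head, (l, r)), p in binary.items():
                    chart[(i, j, head)] += p * chart[(i, k, l)] * chart[(k, j, r)]
    return chart[(0, n, root)]
```

With this grammar, "she eats fish" has one tree of probability 1.0 · 0.5 · 1.0 · 1.0 · 0.5 = 0.25.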
Learning PCFGs
• Parsed corpus -- count trees.
• Unparsed corpus:
– Rule structure known -- use EM (inside-outside algorithm)
– Rules unknown -- Chomsky normal form… problems.
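"Parsed corpus -- count trees" amounts to maximum-likelihood rule estimation: divide each rule's count by the count of its left-hand side. A sketch, with a hypothetical handful of observed rule instances standing in for a treebank:

```python
from collections import Counter

# One entry per tree node observed in a (toy, hypothetical) parsed corpus: (lhs, rhs).
observed = [("S", ("NP", "VP")), ("S", ("NP", "VP")),
            ("NP", ("Det", "N")), ("NP", ("Pronoun",)),
            ("NP", ("Det", "N"))]

rule_count = Counter(observed)
lhs_count = Counter(lhs for lhs, _ in observed)

# Maximum-likelihood rule probabilities: count(A -> beta) / count(A).
probs = {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}
```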
Information Retrieval
• Goal: Google. Find docs relevant to user’s needs.
• An IR system has a document collection, a query in some language, a set of results, and a presentation of results.
• Ideally, parse docs into knowledge base… too hard.
IR 2
• Boolean Keyword Model -- in or out?
• Problem -- single bit of “relevance”
• Boolean combinations a bit mysterious
• How compute P(R=true | D,Q)?
• Estimate a language model for each doc; compute prob of the query given the model.
• Can rank documents by the odds P(R=true|D,Q)/P(R=false|D,Q)
IR3
• For this, need a model of how queries are related to docs. Bag of words: frequency of words in the doc; naïve Bayes.
• Good example pp 842-843.
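A minimal sketch of the bag-of-words idea: a smoothed unigram language model per document, ranking docs by the probability they assign to the query. The two documents and the add-alpha smoothing choice are assumptions for illustration.

```python
import math
from collections import Counter

# Toy document collection (hypothetical).
docs = {"d1": "the quick brown fox".split(),
        "d2": "the lazy dog sleeps".split()}

def query_logprob(query, doc, alpha=1.0):
    """Log P(query | doc language model), add-alpha smoothed bag of words."""
    counts = Counter(doc)
    vocab = {w for d in docs.values() for w in d} | set(query)
    return sum(math.log((counts[w] + alpha) / (len(doc) + alpha * len(vocab)))
               for w in query)

def rank(query):
    """Docs in decreasing order of the probability they assign to the query."""
    return sorted(docs, key=lambda d: query_logprob(query, docs[d]), reverse=True)
```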
Evaluating IR
• Precision is the proportion of results that are relevant.
• Recall is the proportion of relevant docs that are in the results.
• ROC curve (there are several varieties): standard is to plot false negatives vs. false positives.
• More “practical” for web: reciprocal rank of first relevant result, or just “time to answer”
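The three measures above in code (the doc IDs in the test data are placeholders):

```python
def precision_recall(results, relevant):
    """Precision: fraction of results that are relevant.
    Recall: fraction of relevant docs that appear in the results."""
    hits = len(set(results) & set(relevant))
    return hits / len(results), hits / len(relevant)

def reciprocal_rank(results, relevant):
    """1/rank of the first relevant result; 0 if nothing relevant was returned."""
    for i, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0
```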
IR Refinements
• Case
• Stems
• Synonyms
• Spelling correction
• Metadata -- keywords
IR Presentation
• Give list in order of relevance, deal with duplicates
• Cluster results into classes:
– Agglomerative
– K-means
• How describe automatically-generated clusters? Word list? Title of centroid doc?
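A plain k-means sketch; the toy 2-D points stand in for the word-frequency vectors a real system would cluster.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on dense vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of (toy) frequency vectors.
centers, clusters = kmeans([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
```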
IR Implementation
• CSC172!
• Lexicon with “stop list”
• “Inverted” index: where words occur
• Match with vectors: vector of word frequencies dotted with query terms.
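These three bullets fit in a few lines: a stop list, an inverted index from words to the docs containing them, and dot-product scoring over only the candidate docs. The stop list and documents are made up for illustration.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "in"}  # toy stop list

docs = {"d1": "the cat sat on the mat", "d2": "the dog chased the cat"}

index = defaultdict(set)  # word -> set of docs where it occurs
freqs = {}                # doc -> word-frequency vector
for name, text in docs.items():
    words = [w for w in text.split() if w not in STOP]
    freqs[name] = Counter(words)
    for w in words:
        index[w].add(name)

def score(query):
    """Dot product of each candidate doc's frequency vector with the query terms;
    the inverted index restricts scoring to docs that can match at all."""
    terms = [w for w in query.split() if w not in STOP]
    candidates = set().union(*(index[w] for w in terms if w in index))
    return {d: sum(freqs[d][w] for w in terms) for d in candidates}
```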
Information Extraction
• Goal: create database entries from docs.
• Emphasis on massive data, speed, stylized expressions
• Regular expression grammars OK if stylized enough
• Cascaded Finite State Transducers: stages of grouping and structure-finding
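A sketch of regular-expression extraction from stylized text; the sentence and the pattern are invented examples of the "stylized enough" case.

```python
import re

# Hypothetical stylized text: extract (company, closing price) database entries.
text = "Shares of Acme Corp. closed at $12.50; Widget Inc. closed at $7.25."

pattern = re.compile(r"([A-Z][A-Za-z]+ (?:Corp|Inc)\.) closed at \$(\d+\.\d+)")
records = [(name, float(price)) for name, price in pattern.findall(text)]
```

Each match becomes one database entry; this is fast and works as long as the source text really does follow the stylized pattern.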
Machine Translation Goals
• Rough Translation (E.g. p. 851)
• Restricted Domain (mergers, weather)
• Pre-edited (Caterpillar or Xerox English)
• Literary Translation -- not yet!
• Interlingua -- or canonical semantic representation like Conceptual Dependency
• Basic problem: different languages, different categories
MT in Practice
• Transfer -- uses a database of rules for translating small units of language
• Memory-based -- memorize sentence pairs
• Good diagram p. 853
Statistical MT
• Bilingual corpus
• Find most likely translation given corpus.
• Argmax_F P(F|E) = argmax_F P(E|F)P(F)
• P(F) is the language model
• P(E|F) is the translation model
• Lots of interesting problems: fertility (home vs. à la maison).
• Horrible drastic simplifications and hacks work pretty well!
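The noisy-channel argmax in miniature; the language-model and translation-model scores below are made-up numbers, not trained probabilities.

```python
# Toy noisy-channel tables (assumed scores for illustration).
lm = {"la maison": 0.04, "maison la": 0.001, "la maison bleue": 0.002}  # P(F)
tm = {("the house", "la maison"): 0.3,
      ("the house", "maison la"): 0.3,
      ("the house", "la maison bleue"): 0.1}                            # P(E|F)

def best_translation(e, candidates):
    """argmax_F P(E|F) P(F) -- the noisy-channel decomposition."""
    return max(candidates, key=lambda f: tm.get((e, f), 0.0) * lm.get(f, 0.0))
```

Note how the language model P(F) does the word-ordering work: the translation model scores "la maison" and "maison la" equally, but P(F) prefers the fluent order.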
Learning and MT
• Stat. MT needs: language model, fertility model, word choice model, offset model.
• Millions of parameters
• Counting, estimation, EM.
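A compact sketch of "counting, estimation, EM" for the word-choice model, in the style of IBM Model 1: repeatedly collect expected alignment counts and renormalize. The two-sentence bilingual corpus is a toy.

```python
from collections import defaultdict

# Tiny bilingual corpus (hypothetical): (foreign sentence, english sentence).
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Uniform initialization of the word-choice (translation) table t(e|f).
t = {(e, f): 1 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:
            z = sum(t[(e, f)] for f in fs)  # E-step: expected alignment counts
            for f in fs:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f) in t:                        # M-step: renormalize the counts
        t[(e, f)] = count[(e, f)] / total[f] if total[f] else t[(e, f)]
```

Even though "la" and "maison" both co-occur with "the" and "house", EM uses the second sentence pair to pull t(the|la) up and t(the|maison) down, so "maison" ends up preferring "house".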