Probabilistic Language Processing
Chapter 23
Probabilistic Language Models
• Goal -- define a probability distribution over a set of strings
• Unigram, bigram, n-gram models
• Count using a corpus, but counts need smoothing:
– Add-one
– Linear interpolation
• Evaluate with the perplexity measure
• E.g. segment text written without spaces ("segmentwordswithoutspaces") into words with Viterbi
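As a concrete illustration of these bullets, here is a minimal bigram model with add-one smoothing and a perplexity evaluation; the three-sentence corpus is a made-up toy, not real training data.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); a real model would be trained on millions of sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def p_bigram(a, b):
    """Add-one (Laplace) smoothed estimate of P(b | a)."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + V)

def perplexity(sent):
    """Perplexity of a sentence under the bigram model (lower is better)."""
    logp = sum(math.log2(p_bigram(a, b)) for a, b in zip(sent, sent[1:]))
    return 2 ** (-logp / (len(sent) - 1))
```

A sentence made of seen bigrams gets lower perplexity than one made of unseen bigrams, which is exactly what the measure is for.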
PCFGs
• Rewrite rules have probabilities.
• Prob of a string is sum of probs of its parse trees.
• Context-freedom means no lexical constraints.
• Prefers short sentences.
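The "sum of probs of its parse trees" bullet can be made concrete with the CKY inside algorithm on a toy grammar in Chomsky normal form; the grammar and its probabilities are invented for illustration.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (assumed grammar, not from the text).
binary = {("S", ("NP", "VP")): 1.0, ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}

def inside_prob(words, root="S"):
    """P(string) = sum over all parse trees, via the CKY inside algorithm."""
    n = len(words)
    chart = defaultdict(float)  # (i, j, nonterminal) -> inside probability
    for i, w in enumerate(words):
        for (head, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, head)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (head, (l, r)), p in binary.items():
                    chart[(i, j, head)] += p * chart[(i, k, l)] * chart[(k, j, r)]
    return chart[(0, n, root)]
```

With this grammar, "she eats fish" has one tree of probability 1.0 · 0.5 · 1.0 · 1.0 · 0.5 = 0.25.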
Learning PCFGs
• Parsed corpus -- count trees.
• Unparsed corpus:
– Rule structure known -- use EM (inside-outside algorithm)
– Rules unknown -- Chomsky normal form… problems.
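"Parsed corpus -- count trees" amounts to maximum-likelihood rule estimation: divide each rule's count by the count of its left-hand side. A sketch, with a hypothetical handful of observed rule instances standing in for a treebank:

```python
from collections import Counter

# One entry per tree node observed in a (toy, hypothetical) parsed corpus: (lhs, rhs).
observed = [("S", ("NP", "VP")), ("S", ("NP", "VP")),
            ("NP", ("Det", "N")), ("NP", ("Pronoun",)),
            ("NP", ("Det", "N"))]

rule_count = Counter(observed)
lhs_count = Counter(lhs for lhs, _ in observed)

# Maximum-likelihood rule probabilities: count(A -> beta) / count(A).
probs = {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}
```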
Information Retrieval
• Goal: Google. Find docs relevant to user’s needs.
• An IR system has a document collection, a query in some language, a set of results, and a presentation of results.
• Ideally, parse docs into knowledge base… too hard.
IR 2
• Boolean Keyword Model -- in or out?
• Problem -- single bit of “relevance”
• Boolean combinations a bit mysterious
• How compute P(R=true | D,Q)?
• Estimate a language model for each doc; compute prob of the query given the model.
• Can rank documents by the odds P(R=true|D,Q)/P(R=false|D,Q)
IR3
• For this, need a model of how queries are related to docs. Bag of words: frequency of words in the doc; naïve Bayes.
• Good example pp 842-843.
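A minimal sketch of the bag-of-words idea: a smoothed unigram language model per document, ranking docs by the probability they assign to the query. The two documents and the add-alpha smoothing choice are assumptions for illustration.

```python
import math
from collections import Counter

# Toy document collection (hypothetical).
docs = {"d1": "the quick brown fox".split(),
        "d2": "the lazy dog sleeps".split()}

def query_logprob(query, doc, alpha=1.0):
    """Log P(query | doc language model), add-alpha smoothed bag of words."""
    counts = Counter(doc)
    vocab = {w for d in docs.values() for w in d} | set(query)
    return sum(math.log((counts[w] + alpha) / (len(doc) + alpha * len(vocab)))
               for w in query)

def rank(query):
    """Docs in decreasing order of the probability they assign to the query."""
    return sorted(docs, key=lambda d: query_logprob(query, docs[d]), reverse=True)
```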
Evaluating IR
• Precision is the proportion of results that are relevant.
• Recall is the proportion of relevant docs that are in the results.
• ROC curve (there are several varieties): standard is to plot false negatives vs. false positives.
• More “practical” for web: reciprocal rank of first relevant result, or just “time to answer”
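The three measures above in code (the doc IDs in the test data are placeholders):

```python
def precision_recall(results, relevant):
    """Precision: fraction of results that are relevant.
    Recall: fraction of relevant docs that appear in the results."""
    hits = len(set(results) & set(relevant))
    return hits / len(results), hits / len(relevant)

def reciprocal_rank(results, relevant):
    """1/rank of the first relevant result; 0 if nothing relevant was returned."""
    for i, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0
```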
IR Refinements
• Case
• Stems
• Synonyms
• Spelling correction
• Metadata -- keywords
IR Presentation
• Give list in order of relevance, deal with duplicates
• Cluster results into classes:
– Agglomerative
– K-means
• How describe automatically-generated clusters? Word list? Title of centroid doc?
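A plain k-means sketch; the toy 2-D points stand in for the word-frequency vectors a real system would cluster.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on dense vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of (toy) frequency vectors.
centers, clusters = kmeans([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
```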
IR Implementation
• CSC172!
• Lexicon with “stop list”
• “Inverted” index: where words occur
• Match with vectors: vector of word frequencies dotted with query terms.
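These three bullets fit in a few lines: a stop list, an inverted index from words to the docs containing them, and dot-product scoring over only the candidate docs. The stop list and documents are made up for illustration.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "in"}  # toy stop list

docs = {"d1": "the cat sat on the mat", "d2": "the dog chased the cat"}

index = defaultdict(set)  # word -> set of docs where it occurs
freqs = {}                # doc -> word-frequency vector
for name, text in docs.items():
    words = [w for w in text.split() if w not in STOP]
    freqs[name] = Counter(words)
    for w in words:
        index[w].add(name)

def score(query):
    """Dot product of each candidate doc's frequency vector with the query terms;
    the inverted index restricts scoring to docs that can match at all."""
    terms = [w for w in query.split() if w not in STOP]
    candidates = set().union(*(index[w] for w in terms if w in index))
    return {d: sum(freqs[d][w] for w in terms) for d in candidates}
```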
Information Extraction
• Goal: create database entries from docs.
• Emphasis on massive data, speed, stylized expressions
• Regular expression grammars OK if stylized enough
• Cascaded Finite State Transducers: stages of grouping and structure-finding
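A sketch of regular-expression extraction from stylized text; the sentence and the pattern are invented examples of the "stylized enough" case.

```python
import re

# Hypothetical stylized text: extract (company, closing price) database entries.
text = "Shares of Acme Corp. closed at $12.50; Widget Inc. closed at $7.25."

pattern = re.compile(r"([A-Z][A-Za-z]+ (?:Corp|Inc)\.) closed at \$(\d+\.\d+)")
records = [(name, float(price)) for name, price in pattern.findall(text)]
```

Each match becomes one database entry; this is fast and works as long as the source text really does follow the stylized pattern.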
Machine Translation Goals
• Rough Translation (E.g. p. 851)
• Restricted Domain (mergers, weather)
• Pre-edited (Caterpillar or Xerox English)
• Literary Translation -- not yet!
• Interlingua -- or canonical semantic representation like Conceptual Dependency
• Basic problem: different languages, different categories
MT in Practice
• Transfer -- uses a database of rules for translating small units of language
• Memory-based -- memorize sentence pairs
• Good diagram p. 853
Statistical MT
• Bilingual corpus
• Find most likely translation given corpus.
• Argmax_F P(F|E) = argmax_F P(E|F)P(F)
• P(F) is the language model
• P(E|F) is the translation model
• Lots of interesting problems: fertility (home vs. à la maison).
• Horrible drastic simplifications and hacks work pretty well!
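The noisy-channel argmax in miniature; the language-model and translation-model scores below are made-up numbers, not trained probabilities.

```python
# Toy noisy-channel tables (assumed scores for illustration).
lm = {"la maison": 0.04, "maison la": 0.001, "la maison bleue": 0.002}  # P(F)
tm = {("the house", "la maison"): 0.3,
      ("the house", "maison la"): 0.3,
      ("the house", "la maison bleue"): 0.1}                            # P(E|F)

def best_translation(e, candidates):
    """argmax_F P(E|F) P(F) -- the noisy-channel decomposition."""
    return max(candidates, key=lambda f: tm.get((e, f), 0.0) * lm.get(f, 0.0))
```

Note how the language model P(F) does the word-ordering work: the translation model scores "la maison" and "maison la" equally, but P(F) prefers the fluent order.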
Learning and MT
• Stat. MT needs: language model, fertility model, word choice model, offset model.
• Millions of parameters
• Counting, estimation, EM.
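A compact sketch of "counting, estimation, EM" for the word-choice model, in the style of IBM Model 1: repeatedly collect expected alignment counts and renormalize. The two-sentence bilingual corpus is a toy.

```python
from collections import defaultdict

# Tiny bilingual corpus (hypothetical): (foreign sentence, english sentence).
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Uniform initialization of the word-choice (translation) table t(e|f).
t = {(e, f): 1 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:
            z = sum(t[(e, f)] for f in fs)  # E-step: expected alignment counts
            for f in fs:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f) in t:                        # M-step: renormalize the counts
        t[(e, f)] = count[(e, f)] / total[f] if total[f] else t[(e, f)]
```

Even though "la" and "maison" both co-occur with "the" and "house", EM uses the second sentence pair to pull t(the|la) up and t(the|maison) down, so "maison" ends up preferring "house".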