Methods in Computational Linguistics II
Queens College
Lecture 3: Counting More Things
Overview
• Basics of Probability
  – Example Bayes’ Rule question
• Implementing
  – FreqDist
  – ConditionalFreqDist
• Using the Command Line
Definitions
• Joint Probability
• Marginal Probability
• Conditional Probability
Bayes’ Rule
Bayes’ Rule relates conditional probability distributions:
P(h | e) = P(e | h) * P(h) / P(e)
or with additional conditioning information:
P(h | e, k) = P(e | h, k) * P(h | k) / P(e | k)
Bayes’ Rule Problem
• The probability that I think my cup of coffee tastes good is 0.80. P(G) = 0.80
• I add Equal to my coffee 60% of the time. P(E) = 0.60
• When coffee has Equal in it, I think it tastes good 50% of the time. P(G|E) = 0.50
• If I sip my coffee and it tastes good, what is the probability that it has Equal in it?
P(E|G) = P(G|E) * P(E) / P(G)
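The coffee problem can be worked through numerically. The sketch below plugs the three values from the slide into Bayes' Rule; the function name `bayes` is only an illustrative choice.

```python
# Values taken from the problem statement.
p_good = 0.80              # P(G)
p_equal = 0.60             # P(E)
p_good_given_equal = 0.50  # P(G|E)


def bayes(likelihood, prior, evidence):
    """Return P(h|e) = P(e|h) * P(h) / P(e)."""
    return likelihood * prior / evidence


p_equal_given_good = bayes(p_good_given_equal, p_equal, p_good)
print(round(p_equal_given_good, 3))  # 0.375
```

So P(E|G) = 0.50 * 0.60 / 0.80 = 0.375: coffee that tastes good has Equal in it a little over a third of the time.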
Bayes’ Rule
• P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)
• Assess diagnostic probability from causal probability:
  – P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
• Prior, Likelihood, Posterior
Bayes Example
• Imagine
  – disease = BirdFlu, symptom = coughing
  – P(disease | symptom) is different in a BirdFlu-indicated country vs. the USA
  – P(symptom | disease) should be the same
• It is more useful to learn P(symptom | disease)
  – What about the denominator, P(symptom)? How do we determine this? Use conditioning (next slide).
Conditioning
• Idea: Use conditional probabilities instead of joint probabilities
• P(A) = P(A ∧ B) + P(A ∧ ¬B) = P(A | B) P(B) + P(A | ¬B) P(¬B)
Example:
P(symptom) = P(symptom | disease) P(disease) + P(symptom | ¬disease) P(¬disease)
• More generally: P(Y) = Σ_z P(Y | z) P(z)
• Marginalization and conditioning are useful rules for derivations involving probability expressions.
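Conditioning supplies the denominator that Bayes' Rule needs. The sketch below computes P(symptom) by conditioning on disease and then the posterior P(disease | symptom). All three input numbers are made up for illustration; they do not come from the lecture.

```python
def marginal(p_e_given_h, p_e_given_not_h, p_h):
    """P(e) = P(e|h) P(h) + P(e|not h) P(not h)."""
    return p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)


def posterior(p_e_given_h, p_e_given_not_h, p_h):
    """P(h|e) by Bayes' Rule, with P(e) obtained by conditioning on h."""
    return p_e_given_h * p_h / marginal(p_e_given_h, p_e_given_not_h, p_h)


# Hypothetical values (assumptions, not from the slides):
p_disease = 0.01                # P(disease)
p_symptom_given_disease = 0.90  # P(symptom | disease)
p_symptom_given_healthy = 0.10  # P(symptom | not disease)

p_symptom = marginal(p_symptom_given_disease, p_symptom_given_healthy, p_disease)
p_disease_given_symptom = posterior(
    p_symptom_given_disease, p_symptom_given_healthy, p_disease)
print(p_symptom, p_disease_given_symptom)
```

With these assumed numbers, P(symptom) = 0.009 + 0.099 = 0.108, and the posterior is 0.009 / 0.108, about 0.083. Changing only the prior P(disease) changes the posterior, which is why P(disease | symptom) differs between countries while P(symptom | disease) does not.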
Independence
• A and B are independent iff
  – P(A ∧ B) = P(A) P(B)
  – P(A | B) = P(A)
  – P(B | A) = P(B)
• Independence is essential for efficient probabilistic reasoning
• 32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n)
• Absolute independence is powerful but rare
• Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
[Figure: the variables Cavity, Toothache, Xray, and Weather decompose into two independent groups: {Cavity, Toothache, Xray} and {Weather}.]
P(T, X, C, W) = P(T, X, C) P(W)
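The "32 entries reduced to 12" figure can be checked by counting table entries. The sketch below assumes Toothache, Xray, and Cavity are binary and that Weather takes four values; the four-valued Weather is an assumption that makes the arithmetic match, since the slide does not state the domain sizes.

```python
def table_size(domain_sizes):
    """Number of entries in a full joint table over variables with these domain sizes."""
    size = 1
    for n in domain_sizes:
        size *= n
    return size


# Assumed domain sizes: three binary variables and a four-valued Weather.
full_joint = table_size([2, 2, 2, 4])               # P(T, X, C, W)
factored = table_size([2, 2, 2]) + table_size([4])  # P(T, X, C) and P(W)
print(full_joint, factored)  # 32 12


def coin_table_sizes(n):
    """Entries for n coins: full joint table vs. one number per independent coin."""
    return 2 ** n, n


print(coin_table_sizes(10))  # (1024, 10)
```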
Conditional Independence
• A and B are conditionally independent given C iff
  – P(A | B, C) = P(A | C)
  – P(B | A, C) = P(B | C)
  – P(A ∧ B | C) = P(A | C) P(B | C)
• Toothache (T), Spot in Xray (X), Cavity (C)
  – None of these propositions are independent of one another
  – But: T and X are conditionally independent given C
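A small numeric check can make the difference between independence and conditional independence concrete. The sketch below builds a joint distribution over T, X, and C from made-up numbers (all probabilities here are assumptions, not values from the lecture) and tests both properties directly from the joint table.

```python
from itertools import product

# Hypothetical parameters (assumptions for illustration only).
p_c = {True: 0.2, False: 0.8}          # P(C)
p_t_given_c = {True: 0.6, False: 0.1}  # P(T = true | C)
p_x_given_c = {True: 0.9, False: 0.2}  # P(X = true | C)


def bern(p_true, value):
    """Probability of a boolean value given P(value is True)."""
    return p_true if value else 1.0 - p_true


# Full joint table P(T, X, C), built so that T and X depend only on C.
joint = {}
for t, x, c in product([True, False], repeat=3):
    joint[(t, x, c)] = p_c[c] * bern(p_t_given_c[c], t) * bern(p_x_given_c[c], x)


def prob(pred):
    """Sum the joint probability of every outcome satisfying pred(t, x, c)."""
    return sum(p for (t, x, c), p in joint.items() if pred(t, x, c))


# Marginal test: does P(T, X) equal P(T) P(X)?
p_t = prob(lambda t, x, c: t)
p_x = prob(lambda t, x, c: x)
p_tx = prob(lambda t, x, c: t and x)
marginally_independent = abs(p_tx - p_t * p_x) < 1e-9


def conditionally_independent_given(c_value):
    """Does P(T, X | C = c_value) equal P(T | C = c_value) P(X | C = c_value)?"""
    p_cv = prob(lambda t, x, c: c == c_value)
    p_tx_c = prob(lambda t, x, c: t and x and c == c_value) / p_cv
    p_t_c = prob(lambda t, x, c: t and c == c_value) / p_cv
    p_x_c = prob(lambda t, x, c: x and c == c_value) / p_cv
    return abs(p_tx_c - p_t_c * p_x_c) < 1e-9


print(marginally_independent)                  # False
print(conditionally_independent_given(True))   # True
print(conditionally_independent_given(False))  # True
```

With these numbers P(T, X) = 0.124 while P(T) P(X) = 0.20 * 0.34 = 0.068, so T and X are dependent; once C is known, the product rule holds for either value of C.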
Boxes and Balls
• 2 boxes, one red and one blue.
• Each contains colored balls.
Frequency Distribution
• Count up the number of occurrences of each member of a set of items.
• This counting can be used to calculate the probability of seeing any word.
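As a minimal sketch of this idea, the example below counts words with `collections.Counter` from the Python standard library and turns a count into a relative frequency, which serves as the estimated probability of seeing that word. The sample sentence and the function name `probability` are made up for illustration.

```python
from collections import Counter

# A made-up sample text (assumption for illustration).
text = "the cat sat on the mat the end"
tokens = text.split()

counts = Counter(tokens)      # word -> number of occurrences
total = sum(counts.values())  # total number of tokens


def probability(word):
    """Estimate P(word) as its count divided by the total number of tokens."""
    return counts[word] / float(total)


print(counts["the"])        # 3
print(probability("the"))   # 0.375
print(probability("dog"))   # 0.0
```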
13
nltk.FreqDist
• Let’s look at some code.
• Feel free to code along.
14
How would you implement a Frequency Distribution
• First, conceptually.
  – What needs to happen?
• Implementation
  – Dictionary objects.
  – Let’s see some examples.
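One possible dictionary-based implementation is sketched below. The class name `SimpleFreqDist` and its methods are hypothetical; this is not NLTK's implementation, only an illustration of how a plain dictionary can hold the counts.

```python
class SimpleFreqDist(object):
    """A minimal frequency distribution backed by a dictionary (illustrative only)."""

    def __init__(self, samples=()):
        self.counts = {}  # sample -> number of times seen
        self.total = 0    # total number of samples seen
        for sample in samples:
            self.inc(sample)

    def inc(self, sample):
        """Record one more occurrence of sample."""
        self.counts[sample] = self.counts.get(sample, 0) + 1
        self.total += 1

    def count(self, sample):
        """Return how many times sample was seen (0 if never)."""
        return self.counts.get(sample, 0)

    def freq(self, sample):
        """Return the relative frequency of sample."""
        if self.total == 0:
            return 0.0
        return self.count(sample) / float(self.total)

    def most_common(self, n=None):
        """Return (sample, count) pairs, most frequent first; ties sorted by sample."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked if n is None else ranked[:n]


fd = SimpleFreqDist("a b a c a b".split())
print(fd.count("a"), fd.freq("b"), fd.most_common(2))
```

Using `dict.get(sample, 0)` avoids a separate check for whether the key is already present.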
15
Conditional Frequency Distribution
• Construct Frequency Distributions based on “conditions”.
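A conditional frequency distribution can be sketched as one counter per condition. The example below uses only the standard library; the function name `conditional_freq_dist` and the (condition, sample) pairs are made up for illustration and are not NLTK's own code.

```python
from collections import Counter, defaultdict


def conditional_freq_dist(pairs):
    """Build condition -> Counter of samples from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Hypothetical data: the condition is a genre, the sample is a word.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("fiction", "the"), ("fiction", "dragon"),
]
cfd = conditional_freq_dist(pairs)
print(cfd["news"]["the"])        # 2
print(cfd["fiction"]["dragon"])  # 1
print(cfd["news"]["dragon"])     # 0
```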
16
Using the command line
• You can write and run code in IDLE or the interpreter.
• This is called ‘interactive’ mode.
• However, a more useful way to build tools is to write a file that
  – contains your code
  – can be run from the command line
17
Why do we do this?
18
Command Line Arguments
python my_code.py
python my_code.py readthisfile.txt writethisfile.txt
python my_code.py readthisfile.txt 4 writethisfile.txt
19
How do we get at this information?
sys.argv
We’ll code some examples.
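A minimal sketch of reading `sys.argv` follows. `sys.argv` is a list of strings: the first element is the script name and the rest are the arguments as typed. The helper name `describe_args` is hypothetical.

```python
import sys


def describe_args(argv):
    """Split an argv-style list into the script name and the remaining arguments."""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments


if __name__ == "__main__":
    name, arguments = describe_args(sys.argv)
    print("script:", name)
    print("arguments:", arguments)
```

Running `python my_code.py readthisfile.txt 4 writethisfile.txt` would give the arguments `['readthisfile.txt', '4', 'writethisfile.txt']`. Every element is a string, so the `4` has to be converted with `int()` before it can be used as a number.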
20
Argparse
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile')
parser.add_argument('outfile')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Help Information
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments (the help text is shown by: python my_code.py --help)
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Optional Arguments
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')
# Optional Argument (None if not given)
parser.add_argument('--num_lines', help='number of lines to read')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Presence Arguments
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')
# Presence Argument: True if the flag is given, False otherwise
parser.add_argument('--verbose', help='how wordy',
                    action='store_true')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Multiple descriptors
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile', help='file to read')
parser.add_argument('outfile', help='file to write')
# Short and long names for the same flag
parser.add_argument('-v', '--verbose', help='how wordy',
                    action='store_true')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Typed Arguments
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile', type=str, help='file to read')
parser.add_argument('outfile', type=str, help='file to write')
# Typed Argument: the value is converted to int
parser.add_argument('-n', '--num_lines', type=int,
                    help='number of lines to read')
args = parser.parse_args()
print(args.infile)
print(args.outfile)
Default values
import argparse
parser = argparse.ArgumentParser(description='fun with files')
# Positional Arguments
parser.add_argument('infile', type=str, help='file to read')
parser.add_argument('outfile', type=str, help='file to write')
# Default value: used when -n / --num_lines is not given
parser.add_argument('-n', '--num_lines', type=int,
                    help='number of lines to read',
                    default=10)
args = parser.parse_args()
print(args.infile)
print(args.outfile)
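To see how types, defaults, and presence flags behave without leaving the interpreter, `parse_args` can be given an explicit list of strings instead of reading the real command line. The sketch below combines the arguments from the preceding slides; the wrapper function `build_parser` and the file names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser combining the arguments shown on the slides."""
    parser = argparse.ArgumentParser(description='fun with files')
    parser.add_argument('infile', type=str, help='file to read')
    parser.add_argument('outfile', type=str, help='file to write')
    parser.add_argument('-n', '--num_lines', type=int, default=10,
                        help='number of lines to read')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='how wordy')
    return parser


parser = build_parser()

# No optional arguments given: the defaults apply.
defaults = parser.parse_args(['in.txt', 'out.txt'])
print(defaults.num_lines, defaults.verbose)  # 10 False

# Optional arguments given: '4' is converted to the int 4.
custom = parser.parse_args(['in.txt', 'out.txt', '-n', '4', '--verbose'])
print(custom.num_lines, custom.verbose)      # 4 True
```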
Documentation and Comments
Why bother?
Good Documentation
• Every file gets a header describing what it does.
• Every function includes a triple-quoted string (a docstring) describing what it does.
  – This allows help() to work
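A short sketch of both conventions follows: a file header and a function docstring. The file name `word_tools.py` and the function `count_words` are made-up examples.

```python
"""word_tools.py: small helpers for counting words (hypothetical file header)."""


def count_words(text):
    """Return the number of whitespace-separated words in text."""
    # split() with no arguments splits on any run of whitespace
    return len(text.split())


print(count_words("counting more things"))  # 3
print(count_words.__doc__)
```

Calling `help(count_words)` in the interpreter displays the docstring; the same text is available as `count_words.__doc__`.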
Documentation vs. Comments
• There are differing philosophies here.
• Documentation is for ‘what is done’
• Comments are for ‘how it’s done’
Effective Variable & Function Names
x = a + b
x = x / m
y = x * x
With descriptive names:
num_things = thing_count + thang_count
avg_things = num_things / document_count
sq_avg_things = avg_things * avg_things
A simple, but good piece of code
• Let’s write code that
  – reads a file
  – counts the number of words
  – writes a file containing the frequency of the N most-frequent words (or all of the words if N isn’t specified)
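One way the pieces from this lecture could fit together is sketched below. It is not the code written in class; the file name, function names, the `--num_words` option, lowercasing, and the tab-separated output format are all assumptions made for illustration.

```python
"""count_words.py: write the N most frequent words of a file (hypothetical name)."""
import argparse
from collections import Counter


def count_words(lines):
    """Return a Counter of lowercased, whitespace-separated words."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def format_counts(counts, n=None):
    """Return 'word<TAB>count' strings for the n most frequent words (all if n is None)."""
    return ["%s\t%d" % (word, count) for word, count in counts.most_common(n)]


def main(argv=None):
    """Parse arguments, count the words in infile, and write the result to outfile."""
    parser = argparse.ArgumentParser(description='count word frequencies')
    parser.add_argument('infile', help='file to read')
    parser.add_argument('outfile', help='file to write')
    parser.add_argument('-n', '--num_words', type=int, default=None,
                        help='number of most-frequent words to write (default: all)')
    args = parser.parse_args(argv)

    with open(args.infile) as infile:
        counts = count_words(infile)
    with open(args.outfile, 'w') as outfile:
        outfile.write("\n".join(format_counts(counts, args.num_words)) + "\n")


# When saved as a script, end the file with:
#     if __name__ == "__main__":
#         main()
```

Keeping the counting and formatting in separate functions means they can be tested in the interpreter without touching any files, while `main` handles only the command line and file I/O.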
Next Time
• Matching Things
  – Regular Expressions and Finite State Machines