Machine Learning II: Ensembles & Cotraining
CSE 573
Representing Uncertainty
© Daniel S. Weld 2
Logistics
• Reading: Ch. 13 and Ch. 14 through 14.3
© Daniel S. Weld 3
DT Learning as Search
• Nodes? Decision trees
• Operators? Tree refinement: sprouting the tree
• Initial node? Smallest tree possible: a single leaf
• Heuristic? Information gain
• Goal? Best tree possible (???)
• Type of search? Hill climbing
Bias
• Restriction
• Preference
© Daniel S. Weld 5
Decision Tree Representation
Good day for tennis? Leaves = classification; arcs = choice of value for the parent attribute.
[Tree: Outlook at the root. Sunny → test Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → test Wind (Strong → No, Weak → Yes)]
The decision tree is equivalent to logic in disjunctive normal form:
G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
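As a minimal sketch (not from the slides, and assuming the standard "play tennis" attribute values), the tree can be written as nested conditionals and the DNF formula as a flat boolean test; the two agree on every input:

```python
# Sketch only: the tennis decision tree above as nested conditionals,
# and the equivalent DNF test. Attribute values are assumed, not from the deck.
def tree_good_day(outlook, humidity, wind):
    if outlook == "Sunny":
        return humidity == "Normal"   # Sunny branch tests Humidity
    if outlook == "Overcast":
        return True                   # Overcast is always a Yes leaf
    if outlook == "Rain":
        return wind == "Weak"         # Rain branch tests Wind
    raise ValueError("unknown outlook")

def dnf_good_day(outlook, humidity, wind):
    # G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))
```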
© Daniel S. Weld 6
Overfitting
[Plot: accuracy (roughly 0.6–0.9) vs. model complexity, e.g. the number of nodes in the decision tree; accuracy on training data keeps increasing while accuracy on test data peaks and then falls]
© Daniel S. Weld 7
Machine Learning Outline
• Supervised Learning Review
• Ensembles of Classifiers
  - Bagging
  - Cross-validated committees
  - Boosting
  - Stacking
• Co-training
© Daniel S. Weld 8
Voting
© Daniel S. Weld 9
Ensembles of Classifiers
• Assume errors are independent (suppose 30% error) and the ensemble decides by majority vote
• Probability that the majority is wrong = area under the binomial distribution for 11 or more of the 21 classifiers in error
• If the individual error rate is 0.3, the area under the curve for 11 or more wrong is 0.026
• An order of magnitude improvement!
[Plot: binomial distribution over the number of classifiers in error, for an ensemble of 21 classifiers; probability axis roughly 0–0.2]
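A quick way to check the 0.026 figure is to sum the binomial tail directly; this sketch (not from the slides) uses only the standard library:

```python
# Probability that a majority of 21 independent classifiers, each with a 30%
# error rate, is wrong -- i.e. that 11 or more of them err on a given example.
from math import comb

n, p = 21, 0.3
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
print(round(p_majority_wrong, 3))   # ~0.026, versus 0.3 for a single classifier
```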
© Daniel S. Weld 10
Constructing Ensembles
• Partition examples into k disjoint equivalence classes
• Now create k training sets
  - Each set is the union of all equivalence classes except one
  - So each set has (k-1)/k of the original training data
• Now train a classifier on each set
Cross-validated committees
[Diagram: each training set holds out one of the k partitions]
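A possible sketch of the construction (not from the slides; decision trees are just one choice of base learner):

```python
# Cross-validated committees: k disjoint folds, one classifier per
# "all folds but one" training set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cv_committee(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # k disjoint equivalence classes
    members = []
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])  # (k-1)/k of the data
        members.append(DecisionTreeClassifier().fit(X[train], y[train]))
    return members
```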
© Daniel S. Weld 11
Ensemble Construction II
• Generate k sets of training examples
• For each set, draw m examples randomly (with replacement) from the original set of m examples
• Each training set covers about 63.2% of the original examples (plus duplicates)
• Now train a classifier on each set
Bagging
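A possible sketch of bagging with majority voting (not from the slides; the 63.2% coverage is the expected fraction 1 − 1/e of distinct examples in a bootstrap sample of size m):

```python
# Bagging: k bootstrap samples of size m, one classifier per sample,
# predictions combined by majority vote (labels assumed to be 0/1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, m, size=m) for _ in range(k))]  # draw with replacement

def vote(ensemble, X):
    preds = np.array([clf.predict(X) for clf in ensemble])
    return (preds.mean(axis=0) > 0.5).astype(int)   # majority vote
```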
© Daniel S. Weld 12
Ensemble Creation III
• Maintain a probability distribution over the set of training examples
• Create k sets of training data iteratively; on iteration i:
  - Draw m examples randomly (like bagging), but use the probability distribution to bias selection
  - Train classifier number i on this training set
  - Test the partial ensemble (of i classifiers) on all training examples
  - Modify the distribution: increase P of each misclassified example
• Create harder and harder learning problems...
• "Bagging with optimized choice of examples"
Boosting
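A possible sketch of the loop described above (not from the slides, and deliberately simpler than e.g. AdaBoost: misclassified examples just have their probability multiplied by a constant):

```python
# Boosting-style reweighting: sample according to the current distribution,
# train classifier i, test the partial ensemble, and boost the probability of
# every example the partial ensemble still gets wrong (labels assumed 0/1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, k=10, bump=2.0, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    p = np.full(m, 1.0 / m)                              # distribution over training examples
    ensemble = []
    for _ in range(k):
        idx = rng.choice(m, size=m, replace=True, p=p)   # biased draw with replacement
        ensemble.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
        partial = np.array([c.predict(X) for c in ensemble]).mean(axis=0) > 0.5
        wrong = partial.astype(int) != y
        p[wrong] *= bump                                 # harder examples become more likely
        p /= p.sum()
    return ensemble
```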
© Daniel S. Weld 13
Ensemble Creation IV: Stacking
• Train several base learners
• Next train a meta-learner
  - Learns when the base learners are right / wrong
  - The meta-learner then arbitrates
• Train using cross-validated committees
  - Meta-learner inputs = base-learner predictions
  - Training examples = the 'test set' from cross-validation
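A possible sketch (not from the slides; the particular base learners and meta-learner are arbitrary choices) in which the meta-learner is trained only on cross-validated base predictions, i.e. the 'test set' side of each fold:

```python
# Stacking: meta-learner inputs are base-learner predictions produced by
# cross-validation, so the meta-learner never sees predictions made on data
# a base learner was trained on.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def stack(X, y, k=5):
    bases = [DecisionTreeClassifier(), GaussianNB()]
    meta_X = np.column_stack([cross_val_predict(b, X, y, cv=k) for b in bases])
    meta = LogisticRegression().fit(meta_X, y)   # arbitrates among the base learners
    for b in bases:
        b.fit(X, y)                              # refit bases on all data for deployment
    return bases, meta

def stacked_predict(bases, meta, X):
    return meta.predict(np.column_stack([b.predict(X) for b in bases]))
```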
© Daniel S. Weld 14
Machine Learning Outline
• Supervised Learning Review
• Ensembles of Classifiers
  - Bagging
  - Cross-validated committees
  - Boosting
  - Stacking
• Co-training
© Daniel S. Weld 15
Co-Training Motivation
• Learning methods need labeled data
  - Lots of <x, f(x)> pairs
  - Hard to get... (who wants to label data?)
• But unlabeled data is usually plentiful...
  - Could we use this instead??????
© Daniel S. Weld 16
Co-training
Suppose:
• We have a little labeled data plus lots of unlabeled data
• Each instance has two parts: x = [x1, x2], where x1 and x2 are conditionally independent given f(x)
• Each half can be used to classify the instance: there exist f1, f2 such that f1(x1) ~ f2(x2) ~ f(x)
• Both f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, with learning algorithms A1, A2
© Daniel S. Weld 17
Without Co-training: f1(x1) ~ f2(x2) ~ f(x)
[Diagram: from a few labeled instances <[x1, x2], f()>, A1 learns f1 from x1 and A2 learns f2 from x2; the unlabeled instances [x1, x2] sit unused]
Combine f1 and f2 into an ensemble f'? Bad!! Not using the unlabeled instances!
© Daniel S. Weld 18
Co-training: f1(x1) ~ f2(x2) ~ f(x)
[Diagram: A1 learns f1 from x1 on the few labeled instances <[x1, x2], f()>; f1 then labels the unlabeled instances [x1, x2], producing lots of labeled instances <[x1, x2], f1(x1)>; A2 learns its hypothesis f2 from x2 on that larger training set]
© Daniel S. Weld 19
Observations
• Can apply A1 to generate as much training data as one wants
  - If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!!!
• Thus there is no limit to the quality of the hypothesis A2 can make
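A possible sketch of the co-training loop (not from the slides; Naïve Bayes on each view and the "most confident" selection rule are assumptions):

```python
# Co-training: f1 and f2 are trained on their own views of the labeled pool;
# each round, each hypothesis labels the unlabeled examples it is most
# confident about and adds them to the shared labeled pool.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labeled, unlabeled, rounds=10, per_round=5):
    # y: label array; entries at unlabeled indices are placeholders, overwritten below.
    labeled, unlabeled = list(labeled), list(unlabeled)
    f1 = f2 = None
    for _ in range(rounds):
        f1 = MultinomialNB().fit(X1[labeled], y[labeled])
        f2 = MultinomialNB().fit(X2[labeled], y[labeled])
        for f, X in ((f1, X1), (f2, X2)):
            if not unlabeled:
                break
            conf = f.predict_proba(X[unlabeled]).max(axis=1)
            picks = np.argsort(conf)[-per_round:]          # most confident unlabeled examples
            for i in sorted(picks, reverse=True):
                idx = unlabeled.pop(i)
                y[idx] = f.predict(X[[idx]])[0]            # self-labeled training example
                labeled.append(idx)
    return f1, f2
```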
© Daniel S. Weld 20
Co-training: f1(x1) ~ f2(x2) ~ f(x)
[Diagram: the process runs in both directions and iterates. A1 learns f1 from x1 and A2 learns f2 from x2 on the few labeled instances; each hypothesis labels lots of the unlabeled instances, e.g. <[x1, x2], f1(x1)>, for the other, and both learners are retrained on the growing labeled pool]
© Daniel S. Weld 21
It really works!
• Learning to classify web pages as course pages
  - x1 = bag of words on the page
  - x2 = bag of words from all anchors pointing to the page
• Naïve Bayes classifiers; 12 labeled pages, 1039 unlabeled
[Plot: percentage error vs. amount of unlabeled data used]
Representing Uncertainty
© Daniel S. Weld 23
Many Techniques Developed
• Fuzzy Logic
• Certainty Factors
• Non-monotonic logic
• Probability
• Only one has stood the test of time!
© Daniel S. Weld 24
Aspects of Uncertainty
• Suppose you have a flight at 12 noon
• When should you leave for SEATAC?
  - What are traffic conditions?
  - How crowded is security?
• Leaving 18 hours early may get you there, but ... ?
© Daniel S. Weld 25
Decision Theory = Probability + Utility Theory
Min before noon    P(arrive-in-time)
  20 min            0.05
  30 min            0.25
  45 min            0.50
  60 min            0.75
  120 min           0.98
  1080 min          0.99
Depends on your preferences. Utility theory: representing & reasoning about preferences.
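The slide's point, as a tiny sketch: the probabilities above only become a decision once combined with utilities. The utility numbers below are made-up assumptions, not from the slides.

```python
# Expected utility of leaving t minutes before noon:
#   EU(t) = P(arrive in time | t) * U(make flight)
#         + (1 - P) * U(miss flight) - waiting_cost * t
P_arrive = {20: 0.05, 30: 0.25, 45: 0.50, 60: 0.75, 120: 0.98, 1080: 0.99}
U_MAKE, U_MISS, WAIT_COST_PER_MIN = 100.0, -500.0, 0.2   # assumed preferences

def expected_utility(t):
    p = P_arrive[t]
    return p * U_MAKE + (1 - p) * U_MISS - WAIT_COST_PER_MIN * t

best = max(P_arrive, key=expected_utility)
print(best)   # with these particular preferences, leaving 120 min early wins
```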
© Daniel S. Weld 26
What Is Probability?
• Probability: the calculus for dealing with nondeterminism and uncertainty (cf. logic)
• Probabilistic model: says how often we expect different things to occur
© Daniel S. Weld 27
What Is Statistics?
• Statistics 1: Describing data
• Statistics 2: Inferring probabilistic models from data
  - Structure
  - Parameters
© Daniel S. Weld 28
Why Should You Care?
• The world is full of uncertainty
  - Logic is not enough
  - Computers need to be able to handle uncertainty
• Probability: a new foundation for AI (& CS!)
• Massive amounts of data around today
  - Statistics and CS are both about data
  - Statistics lets us summarize and understand it
  - Statistics is the basis for most learning
• Statistics lets data do our work for us
© Daniel S. Weld 29
Outline
• Basic notions
  - Atomic events, probabilities, joint distribution
  - Inference by enumeration
  - Independence & conditional independence
  - Bayes' rule
• Bayesian Networks
• Statistical Learning
• Dynamic Bayesian networks (DBNs)
• Markov decision processes (MDPs)
© Daniel S. Weld 30
Prop. Logic vs. Probability
• Symbol: Q, R, ...  ↔  Random variable: Q, ...
• Boolean values: T, F  ↔  Domain: you specify, e.g. {heads, tails} or [1, 6]
• State of the world: assignment to Q, R, ..., Z  ↔  Atomic event: complete specification of the world, Q ... Z (atomic events are mutually exclusive and exhaustive)
• Prior probability (aka unconditional probability): P(Q)
• Joint distribution: probability of every atomic event
Types of Random Variables
Axioms of Probability Theory
• Just 3 are enough to build the entire theory!
  1. All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
  2. P(true) = 1 and P(false) = 0
  3. Probability of a disjunction of events: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
[Venn diagram: A and B as overlapping regions inside the event space True]
Prior and Joint Probability
We will see later how any question can be answered by the joint distribution
Conditional (or Posterior) Probability
• Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8
  - i.e., given that Toothache is true (and that is all I know)
• Notation for conditional distributions: P(Cavity | Toothache) = a 2-element vector of 2-element vectors (2 P values when Toothache is true and 2 when it is false)
• If we know more, e.g., cavity is also given, then we have P(cavity | toothache, cavity) = 1
• New evidence may be irrelevant, allowing simplification: P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
Conditional Probability
• P(A | B) is the probability of A given B
• Assumes that B is the only information known
• Defined as: P(A | B) = P(A ∧ B) / P(B)
[Venn diagram: A ∧ B is the overlap of A and B inside the event space True]
Dilemma at the Dentist’s
What is the probability of a cavity given a toothache?
What is the probability of a cavity given that the probe catches?
© Daniel S. Weld 37
Inference by Enumeration
P(toothache) = .108 + .012 + .016 + .064 = .20, or 20%
This process is called "marginalization".
© Daniel S. Weld 38
Inference by Enumeration
P(toothache ∨ cavity) = .20 + .072 + .008 = .28
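A small sketch of inference by enumeration (the joint values below are the standard dental-example numbers; the four ¬toothache entries are assumed, since only the toothache entries appear in the transcript):

```python
# Full joint over (toothache, catch, cavity); queries are answered by summing
# the atomic events that satisfy them (marginalization).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint over all atomic events for which `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda t, c, cav: t))              # P(toothache)           = 0.20
print(prob(lambda t, c, cav: t or cav))       # P(toothache or cavity) = 0.28
print(prob(lambda t, c, cav: t and cav)
      / prob(lambda t, c, cav: t))            # P(cavity | toothache)  = 0.60
```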
© Daniel S. Weld 39
Inference by Enumeration
Problems with Enumeration
• Worst-case time: O(d^n)
  - where d = max arity of the random variables (e.g., d = 2 for Boolean T/F)
  - and n = number of random variables
• Space complexity is also O(d^n)
  - the size of the joint distribution
• Problem: hard/impossible to estimate all O(d^n) entries for large problems
© Daniel S. Weld 41
Independence
• A and B are independent iff:
    P(A | B) = P(A)
    P(B | A) = P(B)
  (these two constraints are logically equivalent)
• Therefore, if A and B are independent:
    P(A | B) = P(A ∧ B) / P(B) = P(A)
    P(A ∧ B) = P(A) P(B)
© Daniel S. Weld 42
Independence
[Venn diagram: independent events A and B inside the event space True]
© Daniel S. Weld 43
Independence
Complete independence is powerful but rare. What to do if it doesn't hold?
© Daniel S. Weld 44
Conditional Independence
[Venn diagram: events A and B inside the event space True]
A & B are not independent, since P(A|B) < P(A)
© Daniel S. Weld 45
Conditional Independence
[Venn diagram: events A and B inside the event space True, with a third event C overlapping both; the regions A ∧ C and B ∧ C are marked]
But: A & B are made independent by C
P(A|C) = P(A|B,C)
© Daniel S. Weld 46
Conditional Independence
Instead of 7 entries, only need 5
© Daniel S. Weld 47
Conditional Independence II
P(catch | toothache, cavity) = P(catch | cavity)
P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Why only 5 entries in the table?
© Daniel S. Weld 48
Power of Cond. Independence
• Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!!
• Conditional independence is the most basic & robust form of knowledge about uncertain environments.
Next Up…
• Bayes’ Rule• Bayesian Inference• Bayesian Networks
Bayes rules!
© Daniel S. Weld 50
Bayes Rule
Bayes' rule:
    P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:
 1. P(H | E) = P(H ∧ E) / P(E)          (def. cond. prob.)
 2. P(E | H) = P(H ∧ E) / P(H)          (def. cond. prob.)
 3. P(H ∧ E) = P(E | H) P(H)            (multiply #2 by P(H))
 4. P(H | E) = P(E | H) P(H) / P(E)     (substitute #3 into #1)
QED
© Daniel S. Weld 51
Use to Compute Diagnostic Probability from Causal Probability
E.g., let M be meningitis and S be stiff neck.
P(M) = 0.0001, P(S) = 0.1, P(S|M) = 0.8
P(M|S) = P(S|M) P(M) / P(S) = 0.8 × 0.0001 / 0.1 = 0.0008
© Daniel S. Weld 52
Bayes’ Rule & Cond. Independence
© Daniel S. Weld 53
Bayes Nets
• In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation & inference
• BNs provide a graphical representation of the conditional independence relations in P
  - usually quite compact
  - requires assessment of fewer parameters, and those parameters are quite natural (e.g., causal)
  - efficient (usually) inference: query answering and belief update
© Daniel S. Weld 55
An Example Bayes Net
[Network: Earthquake and Burglary are parents of Alarm; Alarm is the parent of Nbr1Calls and Nbr2Calls; Radio is a child of Earthquake]
Pr(B=t) = 0.05, Pr(B=f) = 0.95
Pr(A | E, B)  (value in parentheses is Pr(¬A | E, B)):
  e,  b  : 0.9  (0.1)
  e,  ¬b : 0.2  (0.8)
  ¬e, b  : 0.85 (0.15)
  ¬e, ¬b : 0.01 (0.99)
© Daniel S. Weld 56
Earthquake Example (con’t)
• If I know whether Alarm went off, no other evidence influences my degree of belief in Nbr1Calls
  - P(N1 | N2, A, E, B) = P(N1 | A)
  - also: P(N2 | N1, A, E, B) = P(N2 | A), and P(E | B) = P(E)
• By the chain rule we have
  P(N1, N2, A, E, B) = P(N1 | N2, A, E, B) · P(N2 | A, E, B) · P(A | E, B) · P(E | B) · P(B)
                     = P(N1 | A) · P(N2 | A) · P(A | B, E) · P(E) · P(B)
• The full joint requires only 10 parameters (cf. 32)
© Daniel S. Weld 57
BNs: Qualitative Structure
• The graphical structure of a BN reflects conditional independence among its variables
• Each variable X is a node in the DAG
• Edges denote direct probabilistic influence
  - usually interpreted causally
  - the parents of X are denoted Par(X)
• X is conditionally independent of all its non-descendants given its parents
  - a graphical test exists for more general independence: the "Markov Blanket"
© Daniel S. Weld 58
Given Parents, X is Independent of Non-Descendants
© Daniel S. Weld 59
For Example
[Diagram: the Earthquake / Burglary / Alarm / Nbr1Calls / Nbr2Calls / Radio network again]
© Daniel S. Weld 60
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
© Daniel S. Weld 61
Conditional Probability Tables
[Diagram: the same network, annotated with its conditional probability tables]
Pr(B=t) = 0.05, Pr(B=f) = 0.95
Pr(A | E, B): e, b: 0.9 (0.1); e, ¬b: 0.2 (0.8); ¬e, b: 0.85 (0.15); ¬e, ¬b: 0.01 (0.99)
© Daniel S. Weld 62
Conditional Probability Tables
• For a complete specification of the joint distribution, quantify the BN
• For each variable X, specify the CPT: P(X | Par(X))
  - the number of parameters is locally exponential in |Par(X)|
• If X1, X2, ..., Xn is any topological sort of the network, then we are assured:
  P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) · ... · P(X2 | X1) · P(X1)
                       = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) · ... · P(X1)
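As a small sketch of this factorization (the B and A|E,B numbers are from the slides; the Earthquake prior and the neighbor-call CPTs are made-up placeholders, not from the deck):

```python
# Joint probability of one atomic event in the Earthquake/Burglary network,
# computed from the CPTs via P(n1,n2,a,e,b) = P(n1|a) P(n2|a) P(a|e,b) P(e) P(b).
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.002, False: 0.998}                       # assumed prior
P_A = {(True, True): 0.9, (True, False): 0.2,
       (False, True): 0.85, (False, False): 0.01}       # P(A=t | E, B)
P_N1 = {True: 0.9, False: 0.05}                         # P(N1=t | A), placeholder
P_N2 = {True: 0.7, False: 0.01}                         # P(N2=t | A), placeholder

def joint(n1, n2, a, e, b):
    p_a = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    p_n1 = P_N1[a] if n1 else 1 - P_N1[a]
    p_n2 = P_N2[a] if n2 else 1 - P_N2[a]
    return p_n1 * p_n2 * p_a * P_E[e] * P_B[b]

# e.g. both neighbors call, the alarm rang, a burglary occurred, no earthquake:
print(joint(True, True, True, False, True))
```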