1
Machine Learning for Information Retrieval
Rong Jin, Michigan State University
Yi Zhang, University of California Santa Cruz
2
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
3
Roadmap of Information Retrieval
(Diagram: starting from the data, retrieval applications -- search, filtering, categorization, summarization -- support information access, while mining/learning applications -- clustering, data analysis, extraction, mining, visualization -- support knowledge acquisition.)
Why Machine Learning is Important?
4
Text Categorization
5
Text Categorization
- Open Directory Project: the largest human-edited directory of the Web
- Manual classification: over 4 million sites and 590K categories
- Need to automate the process
6
Document Clustering
7
Question Answering
Classify question; identify answers; match questions and answers
8
Image Retrieval
Image segmentation by data clustering
9
Image Retrieval by Key Points
Key features -> visual words, obtained by data clustering
(Figure: key points quantized into visual-word bins b1, b2, ..., b8; an image is then represented by the visual words it contains, e.g., b1 b2 b3 b4.)
10
Image Retrieval by Text Query
- Automatically annotate images with textual words
- Retrieve images with textual queries
- Key technique: classification -- each keyword is a different category
11
Information Extraction
Example: a job posting on a web page (free-style text) mapped to a relational DB record:
  Title: J2EE Developer
  Length: 4 months
  Salary: ....
  Location:
  Reference:
Structure prediction by Hidden Markov Models and Markov Random Fields
12
Citation/Link Analysis
13
Recommender Systems
14
Recommender Systems
  User 1:  ?  5  3  4  2
  User 2:  4  1  5  ?  5
  User 3:  5  ?  4  2  5
  User 4:  1  5  3  5  ?
Sparse data problem: many missing values
15
Recommender System
(Figure: users and movies are grouped into classes -- User Class I, User Class II; Movie Type I, II, III -- and each (user class, movie type) pair has a rating distribution, e.g., p(4)=1/4, p(5)=3/4; p(1)=1/2, p(2)=1/2; p(4)=1/2, p(5)=1/2.)
Fill in the sparse rating matrix by data clustering
16
One More Reason for ML
$1,000,000 award (the Netflix Prize)
17
Review of Basic Probability Concepts
- Probability Pr(A): "the fraction of possible worlds in which A is true"
- Examples:
  A = Your paper will be accepted by SIGIR 2008
  A = It rains in Singapore
  A = A document contains the word "IR"
(Figure: the event space of all possible worlds has area 1; the region where A is true is a subset of it.)
18
Conditional Probability
- SIGIR2008 = "a document contains the phrase SIGIR 2008"
- SINGAPORE = "a document contains the word Singapore"
- P(SINGAPORE) = 0.000001, P(SIGIR2008) = 0.00000001, P(SINGAPORE | SIGIR2008) = 1/2
- "Singapore" is rare and "SIGIR 2008" is rarer, but if a document contains "SIGIR 2008", there is a 50-50 chance it also contains the word "Singapore"
19
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
20
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
- Independent variables: Pr(A|B) = Pr(A), so Pr(A,B) = Pr(B) Pr(A)
21
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
- Independence: Pr(A|B) = Pr(A), so Pr(A,B) = Pr(B) Pr(A)
- Marginal probability: Pr(B) = \sum_{j=1}^{k} Pr(B, A = a_j)
22
Bayes’ Rule
Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- H: hypothesis, E: evidence
- Information available: Pr(E|H)
- Inference: Pr(H|E)
23
Bayes’ Rule
Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- R: it rains; W: the grass is wet
- Information available: Pr(W|R); inference: Pr(R|W)
- Conditional probability table Pr(W|R):
               R      not R
    W         0.7      0.4
    not W     0.3      0.6
24
Statistical Inference
- Learning stage: estimate a parametric model for Pr(E|H)
- Inference stage: for a given observation E, compute Pr(H|E) for each hypothesis H and choose the hypothesis with the largest Pr(H|E)
- Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
25
Example: Language Model (LM) for IR
- Documents d1, ..., d1000; query q: "Singapore SIGIR"
- Hypothesis H: a document (its language model θ_d); evidence E: the query q
- Learning: estimate a statistical language model θ_d for each document
- Inference: compute the likelihood Pr(q | θ_d) and rank documents by Pr(θ_d | q) ∝ Pr(θ_d) Pr(q | θ_d)
A small scoring sketch follows.
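The following is a minimal sketch (not from the slides) of query-likelihood scoring; it assumes Jelinek-Mercer smoothing with an arbitrary interpolation weight, and the toy documents and query are invented for illustration.

# Minimal sketch: query-likelihood retrieval with Jelinek-Mercer smoothing.
import math
from collections import Counter

def score(query_terms, doc_terms, collection_counts, collection_len, lam=0.5):
    """log Pr(q | theta_d), with Pr(w | theta_d) smoothed by the collection model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = collection_counts[w] / collection_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

docs = {"d1": "sigir is held in singapore".split(),
        "d2": "it rains in singapore every day".split()}
coll = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll.values())
query = "singapore sigir".split()
ranking = sorted(docs, key=lambda d: score(query, docs[d], coll, coll_len), reverse=True)
print(ranking)  # d1 should rank above d2 for this query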
26
Probability Distributions
- Binomial distribution; Beta distribution
- Multinomial distribution (language models); Dirichlet distribution (conjugate prior, used for smoothing LMs)
- Gaussian distribution; Laplacian distribution (a Laplacian prior yields sparse solutions, i.e., an L1 regularizer)
27
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
28
Supervised Learning: Basic Setting
- Given training data: {(x1, y1), (x2, y2), ..., (xN, yN)}
- Learning: infer a function f(x) from the training data
- Inference: predict future outcomes y = f(x) given x
Regression: continuous y, e.g., f(x) = ax - b
(Figure: a line fitted to scattered (x, y) points.)
29
Supervised Learning: Basic Setting
- Given training data: {(x1, y1), (x2, y2), ..., (xN, yN)}
- Learning: infer a function f(x) from the training data
- Inference: predict future outcomes y = f(x) given x
Classification: discrete y, e.g., x = (x1, x2), decision boundary w^T x - b = 0, and f(x) = sign(w^T x - b) separating y = +1 from y = -1
30
Examples
- Text categorization: input x is a word histogram; output y is the document category (e.g., 1 for "domestic economics", 2 for "politics", 3 for "sports", and 4 for "others")
- Question answering (classify question types): input x is the parse tree of a question; output y is the question type (e.g., when, where, ...)
31
K Nearest-Neighbor (KNN) Classifiers
- Compute the distance from the unknown record to the training documents
- Identify the k nearest neighbors
- Determine the class of the unknown point from the class labels of its closest neighbors
(Based on Tan, Steinbach, Kumar)
32
K Nearest-Neighbor (KNN) Classifiers
- Compute the distance between two points: Euclidean distance, cosine distance, Kullback-Leibler divergence, Bregman distance (generated by a convex function), ...
- The distance function can also be learned from data (distance metric learning)
- Determine the class: majority vote or weighted majority vote
A minimal sketch follows.
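A minimal sketch (not from the slides): KNN classification with cosine similarity and a majority vote over the k nearest training documents; the toy vectors and labels are invented for illustration.

# KNN with cosine similarity and majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # cosine similarity between x and every training vector
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(-sims)[:k]           # indices of the k most similar points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority vote

X_train = np.array([[2., 0., 1.], [1., 0., 2.], [0., 3., 1.], [0., 2., 2.]])
y_train = np.array([+1, +1, -1, -1])
print(knn_predict(X_train, y_train, np.array([1., 0., 1.]), k=3))  # -> +1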
33
K Nearest-Neighbor (KNN) Classifiers
- Deciding k (the number of nearest neighbors): bias-variance tradeoff
- Choose k by cross validation (or leave-one-out) on a training/validation split
(Figure: decision regions for k = 1 vs. k = 4.)
34
K Nearest-Neighbor (KNN) Classifiers
- Curse of dimensionality: many attributes are irrelevant; in high dimensions, distances become less informative
(Figure: distribution of squared distances for 1000 random data points in 1000 dimensions.)
35
KNN for Collaborative Filtering
- Collaborative filtering: will user u like item b?
- Assumption: users with similar tastes are likely to have similar preferences on items
- Make filtering decisions for one user based on the feedback from other, similar users
36
KNN for Collaborative Filtering
  User 1:  1  5  3  4  3
  User 2:  4  1  5  2  5
  User 3:  2  ?  3  5  4
(Predict the missing rating "?" for user 3.)
37
KNN for Collaborative Filtering
  User 1:  1  5  3  4  3
  User 2:  4  1  5  2  5
  User 3:  2  ?  3  5  4    (predicted rating for "?": 5)
The similarity measure of user interests can be learned; a small sketch follows.
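An illustrative sketch (not from the slides) of user-based collaborative filtering. The similarity measure (cosine over commonly rated items) and the weighted-average prediction are assumptions; the prediction therefore depends on these choices and need not match the slide's value.

# User-based CF: similarity-weighted average of other users' ratings.
import numpy as np

R = np.array([[1., 5., 3., 4., 3.],
              [4., 1., 5., 2., 5.],
              [2., 0., 3., 5., 4.]])   # ratings from the slide; 0 = unknown

def predict(R, user, item):
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        mask = (R[user] > 0) & (R[other] > 0)      # items rated by both users
        sim = np.dot(R[user, mask], R[other, mask]) / (
            np.linalg.norm(R[user, mask]) * np.linalg.norm(R[other, mask]))
        num += sim * R[other, item]
        den += abs(sim)
    return num / den if den else None

print(round(predict(R, user=2, item=1), 2))   # predicted rating for the missing entry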
38
Paradigm for Supervised Learning
- Gather training data
- Determine the input features (i.e., what is x?), e.g., bags of words for text categorization; feature engineering is extremely important
- Determine the functional form f(x): linear or nonlinear (what is the functional form for KNN?)
- Determine the learning algorithm: learn the optimal parameters (optimization, cross validation); probabilistic or non-probabilistic
- Test on a test set
39
Bayesian Learning
- Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- Hypothesis space: H = {Y1, Y2, ...}
- MAP learning (Maximum A Posteriori):
  Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
40
Bayesian Learning
- Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- Hypothesis space: H = {Y1, Y2, ...}
- Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
- MLE learning (Maximum Likelihood Estimation): ignore the prior Pr(Y) and maximize the likelihood Pr(X|Y) alone
41
Bayesian Learning: Conjugate Prior
- The posterior Pr(Y|X) has the same form as the prior Pr(Y); e.g., the Dirichlet distribution is the conjugate prior of the multinomial distribution (widely used in language models)
- Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y), with hypothesis space H = {Y1, Y2, ...}
42
Example: Text Categorization
- Task: is a web page for a professor or for a student?
- What is Y? The category (Prof. vs. Student). What is the feature X? The words on the page.
- Y* = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
- How to estimate Pr(Y = Student) or Pr(Y = Prof.)? By counting.
- How to estimate Pr(w|Y)? 1. Counting = MLE; 2. Counting + pseudo counts = MAP
A small sketch follows.
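A minimal sketch (not from the slides) of multinomial Naive Bayes trained by counting; the pseudo-count (Laplace smoothing) plays the role of the MAP prior. The toy "professor" vs. "student" pages are invented for illustration.

# Naive Bayes by counting, with pseudo counts for smoothing.
import math
from collections import Counter, defaultdict

def train(docs, labels, pseudo=1.0):
    vocab = {w for d in docs for w in d}
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    log_prior = {y: math.log(prior[y] / len(labels)) for y in prior}
    log_cond = {}
    for y in prior:
        total = sum(word_counts[y].values()) + pseudo * len(vocab)
        log_cond[y] = {w: math.log((word_counts[y][w] + pseudo) / total) for w in vocab}
    return log_prior, log_cond

def predict(doc, log_prior, log_cond):
    # words outside the training vocabulary are simply ignored
    scores = {y: log_prior[y] + sum(log_cond[y].get(w, 0.0) for w in doc)
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["advisor", "students", "grant"], ["phd", "student", "advisor"],
        ["course", "grant", "publications"]]
labels = ["prof", "student", "prof"]
model = train(docs, labels)
print(predict(["grant", "publications"], *model))  # -> 'prof'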
43
Naïve Bayes
- From Pr(w|Y) to Pr(X|Y)? Vocabulary [w1, w2, ..., wV]; document X = (x1, x2, ..., xV) (word counts)
- Naïve Bayes assumption: Pr(X|Y) ≈ [Pr(w1|Y)]^{x1} ⋯ [Pr(wV|Y)]^{xV}
- f(X) = log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)]
       = log Pr(Y=P)/Pr(Y=S)                                          (threshold/constant)
         + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)   (weights for words)
44
Naïve Bayes: A Linear Classifier
- f(X) = log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)]
       = log Pr(Y=P)/Pr(Y=S) + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)
- This is a linear decision boundary: f(x) = sign(w^T x - b), separating y = +1 from y = -1
- Logistic Regression: directly model f(x), i.e., Pr(Y|X)
45
Logistic Regression (LR)
- Naïve Bayes gives: log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)] = log Pr(Y=P)/Pr(Y=S) + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)
- LR models this log-odds directly as a linear function: log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)] = b + t1 x1 + ... + tV xV
- t1, ..., tV are unknown weights learned from data by maximum likelihood estimation (MLE)
- Pr(y = ±1 | X) = 1 / (1 + exp[-y (t1 x1 + ... + tV xV + b)])
46
Logistic Regression (LR)
- Learning parameters b, t1, ..., tV by Maximum Likelihood Estimation (MLE):
  (t*, b*) = argmax_{t, b} \sum_{i=1}^{N} log Pr(y_i | X_i; t, b)
A minimal training sketch follows.
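A minimal sketch (not from the slides): logistic regression trained by gradient ascent on the log-likelihood above. The learning rate, iteration count, and toy data are arbitrary choices for illustration.

# Logistic regression by gradient ascent on the log-likelihood.
import numpy as np

def train_lr(X, y, lr=0.1, iters=500):
    n, d = X.shape
    t, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ t + b)))   # Pr(y = +1 | x)
        grad_t = X.T @ ((y + 1) / 2 - p)         # gradient of the log-likelihood
        grad_b = np.sum((y + 1) / 2 - p)
        t += lr * grad_t / n
        b += lr * grad_b / n
    return t, b

X = np.array([[2., 0.], [1.5, 0.5], [0., 2.], [0.5, 1.5]])
y = np.array([+1, +1, -1, -1])
t, b = train_lr(X, y)
print(np.sign(X @ t + b))   # should recover the training labels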
47
Logistic Regression (LR)
- Learning parameters b, t1, ..., tV by Maximum Likelihood Estimation (MLE):
  (t*, b*) = argmax_{t, b} \sum_{i=1}^{N} log Pr(y_i | X_i; t, b)
- MLE can overfit, leading to worse performance on new data
- Maximum A Posteriori (MAP): add a prior Pr(t) to the objective
- (Why only weights for words?)
48
Learning Logistic Regression
- Pr(y = ±1 | X) = 1 / (1 + exp[-y f(X)])
- (t*, b*) = argmin_{t, b} \sum_{i=1}^{N} -log Pr(y_i | X_i; t, b)
- The negative log-likelihood is a loss function measuring the mismatch between y and f(X); other loss functions can be used as well
49
Logistic Regression (LR)
- Closely related to Maximum Entropy (ME): logistic regression and maximum entropy are duals of each other
- Advantages of LR: a Bayesian approach; convenient for incorporating prior knowledge; useful for semi-supervised learning, transfer learning, ...
50
Comparison of Classifiers (from Li and Yang, SIGIR '03)
                       Macro F1   Micro F1
  KNN                  0.8557     0.5975
  Naïve Bayes          0.8009     0.4737
  Logistic Regression  0.8748     0.6084
51
Comparison of Classifiers
- Logistic Regression: models Pr(Y|X), i.e., the decision boundary directly; NB is a special case of LR; requires a numerical solution; needs a large number of training examples and converges slowly
- Naïve Bayes: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; has a simple (counting) solution; works with a small number of training examples and converges fast
52
Comparison of Classifiers
- Discriminative model: models Pr(Y|X), i.e., the decision boundary directly; broader model assumptions; requires a numerical solution; needs a large number of training examples and converges slowly
- Generative model: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; simple solution; works with a small number of training examples and converges fast
Rule of thumb:
- Use a discriminative model if you have enough training examples, enough computational power, and classification accuracy is important
- Use a generative model if training examples or computational power are scarce, training time matters more, or you want a quick test
53
Comparison of Classifiers
- Discriminative model: models Pr(Y|X), i.e., the decision boundary directly; broader model assumptions; requires a numerical solution; needs a large number of training examples and converges slowly
- Generative model: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; simple solution; works with a small number of training examples and converges fast
- What about KNN?
54
Other Discriminative Classifiers
- Decision tree: aggregation of decision rules via a tree; easy to interpret
55
Other Discriminative Classifiers
- Decision tree: aggregation of decision rules via a tree; easy to interpret
- Support vector machine: a maximum-margin classifier; among the best-performing text classifiers
(Figure: maximum-margin boundary separating y = +1 from y = -1 in the (x1, x2) plane.)
56
Comparison of Classifiers (from Li and Yang, SIGIR '03)
                         Macro F1   Micro F1
  KNN                    0.8557     0.5975
  Naïve Bayes            0.8009     0.4737
  Logistic Regression    0.8748     0.6084
  Support Vector Machine 0.8857     0.5975
57
Ensemble Learning
- Generate multiple classifiers; classify by (weighted) majority vote
- Bagging & boosting: train each classifier h1, ..., hk on a different sample D1, ..., Dk of the training data D
A small bagging sketch follows.
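A minimal sketch (not from the slides) of bagging: train one classifier per bootstrap sample and combine them by majority vote. The base learner here is a one-feature decision stump, chosen only to keep the example self-contained; the toy data are invented.

# Bagging with decision stumps and majority vote.
import numpy as np

def fit_stump(X, y):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (+1, -1):
                pred = np.where(X[:, j] > thr, sgn, -sgn)
                acc = np.mean(pred == y)
                if best is None or acc > best[0]:
                    best = (acc, j, thr, sgn)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sgn = stump
    return np.where(X[:, j] > thr, sgn, -sgn)

def bagging(X, y, n_rounds=25, rng=np.random.default_rng(0)):
    stumps = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

X = np.array([[1., 2.], [2., 1.], [3., 2.5], [6., 5.], [7., 6.], [8., 5.5]])
y = np.array([-1, -1, -1, +1, +1, +1])
stumps = bagging(X, y)
votes = np.sign(sum(stump_predict(s, X) for s in stumps))
print(votes)   # majority vote of the bagged stumps on the training set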
58
Ensemble Learning
- Bias-variance tradeoff: bagging reduces variance, boosting reduces bias
(Figure: error caused by variance vs. error caused by bias; 50 decision trees combined by majority vote.)
59
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example, e.g.:
        c1  c2  ...  cK
  X1     0   1  ...   0
  X2     1   0  ...   0
  ...
  XN     1   0  ...   1
- Approaches: one against all (one binary classifier f_1(X), ..., f_K(X) per class); ECOC coding
A one-against-all sketch follows.
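A minimal sketch (not from the slides) of one-against-all multi-class classification built from binary classifiers; scikit-learn's LogisticRegression is used only as a convenient binary learner, and the toy data are invented.

# One-against-all: one binary classifier per class, predict by the highest score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, classes):
    models = {}
    for c in classes:
        binary_y = np.where(y == c, 1, 0)        # class c vs. the rest
        models[c] = LogisticRegression().fit(X, binary_y)
    return models

def predict_one_vs_all(models, X):
    classes = list(models)
    scores = np.vstack([models[c].decision_function(X) for c in classes])
    return [classes[i] for i in np.argmax(scores, axis=0)]

X = np.array([[0., 1.], [1., 0.], [5., 5.], [6., 5.], [0., 6.], [1., 7.]])
y = np.array(["c1", "c1", "c2", "c2", "c3", "c3"])
models = train_one_vs_all(X, y, ["c1", "c2", "c3"])
print(predict_one_vs_all(models, X))   # should recover the training labels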
60
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example (label matrix as above)
- Approaches: one against all; ECOC coding
- ECOC coding: assign each class c1, c2, c3, ... a row of M coding bits (e.g., 0 1 ... 0; 1 0 ... 1; ...; 1 1 ... 0) and train one binary classifier f_1(X), ..., f_M(X) per coding bit (M = # of coding bits)
61
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example (label matrix as above)
- Approaches: one against all (binary classifiers f_1(X), ..., f_K(X)); ECOC coding; transfer learning
62
Beyond Vector Inputs
- Gene sequence classification (sequences)
- Question type classification (trees)
- Character recognition (graphs)
63
Beyond Vector Inputs: Kernel
- Kernel function k(x1, x2): assesses the similarity between two objects x1 and x2
- Objects do not have to be represented by vectors
64
Beyond Vector Inputs: Kernel
- Kernel function k(x1, x2): assesses the similarity between two objects x1 and x2; objects do not have to be represented by vectors
- Vector representation via the kernel function: given training examples x1, ..., xN, represent any example x by the vector [k(x1, x), k(x2, x), ..., k(xN, x)] (related to the representer theorem)
A small sketch follows.
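A minimal sketch (not from the slides) of this "empirical kernel map": represent an arbitrary object by its kernel values against the training objects. The kernel below (shared-word count between two strings) is just a toy stand-in for a real string, tree, or graph kernel.

# Represent non-vector objects by kernel values against the training set.
def kernel(a, b):
    return len(set(a.split()) & set(b.split()))

train = ["machine learning for retrieval",
         "semi supervised learning",
         "graph based label propagation"]

def to_vector(x, train_objects):
    # phi(x) = [k(x1, x), k(x2, x), ..., k(xN, x)]
    return [kernel(xi, x) for xi in train_objects]

print(to_vector("supervised learning for text retrieval", train))  # -> [3, 2, 0]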
65
Beyond Vector Inputs
- String kernel (sequences), tree kernel (trees), graph kernel (graphs)
66
Kernel for Nonlinear Classifiers
67
Keywords Associated with Kernels
- Reproducing Kernel Hilbert Space (RKHS)
- Vector representation; Mercer's conditions
- Good kernels; representer theorem
- Kernel learning (e.g., multiple kernel learning)
68
Sequence Prediction
- Part-of-speech tagging: [He] [reckons] [the] [current] [account] [deficit] -> [PRP] [VBZ] [DT] [JJ] [NN] [NN]
- All the tags in a sentence are related: Pr(NN | "account") -> Pr(NN | "account", tag-for-"current")
- Models: Hidden Markov Model (HMM), Conditional Random Field (CRF), Maximum Margin Markov Network (M3)
69
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
70
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
71
Spectrum of Learning Problems
72
What is Semi-supervised Learning
- Learning a function f(x): X -> Y from a mixture of labeled and unlabeled examples
- Labeled data: L = {(x1, y1), ..., (x_{n_l}, y_{n_l})}
- Unlabeled data: U = {x1, ..., x_{n_u}}
- Total number of examples: N = n_l + n_u
73
Why Semi-supervised Learning?
- Labeling is expensive and difficult
- Labeling is unreliable (e.g., segmentation applications; may need multiple experts)
- Unlabeled examples are easy to obtain in large numbers (e.g., web pages, text documents)
74
Semi-supervised Learning Problems
- Classification: transductive (predict labels of the unlabeled data) vs. inductive (learn a classification function)
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart
75
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
76
Why Unlabeled Data Could Be Helpful
- Clustering assumption: unlabeled data help decide the decision boundary f(x) = 0
- Manifold assumption: unlabeled data help decide the decision function f(x)
77
Clustering Assumption
?
78
Clustering Assumption
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
- Does this suggest a simple algorithm for semi-supervised learning?
79
Manifold Assumption
- Regularize the classification function f(x)
- Graph representation: vertices are training examples (labeled and unlabeled); edges connect similar examples
- x1 and x2 are connected -> |f(x1) - f(x2)| is small
80
Manifold Assumption
- Data lie on a low-dimensional manifold; the classification function f(x) should "follow" the data manifold
- Graph representation: vertices are training examples (labeled and unlabeled); edges connect similar examples
81
Statistical View
- Generative model for classification: Pr(X, Y | θ, η) = Pr(X | Y; θ) Pr(Y | η)
(Graphical model: Y -> X, with parameters θ and η.)
82
Statistical View
- Generative model for classification: Pr(X, Y | θ, η) = Pr(X | Y; θ) Pr(Y | η)
- Unlabeled data help estimate Pr(X | Y; θ)  (clustering assumption)
83
Statistical View
- Discriminative model for classification: Pr(X, Y | θ, μ) = Pr(X | μ) Pr(Y | X; θ)
84
Statistical View
- Discriminative model for classification: Pr(X, Y | θ, μ) = Pr(X | μ) Pr(Y | X; θ)
- Unlabeled data help regularize θ via a prior Pr(θ | X)  (manifold assumption)
85
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
86
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
87
Label Propagation: Key Idea
A decision boundary based on the labeled examples is unable to take into account the layout of the data points
How to incorporate the data distribution into the prediction of class labels?
88
Label Propagation: Key Idea
Connect the data points that are close to each other
89
Label Propagation: Key Idea
Connect the data points that are close to each other
Propagate the class labels over the connected graph
90
Label Propagation: Key Idea Connect the data
points that are close to each other
Propagate the class labels over the connected graph
Different from the K Nearest Neighbor
91
Label Propagation: Representation
- Adjacency matrix: W ∈ {0,1}^{N×N}, W_{i,j} = 1 if x_i and x_j are connected, 0 otherwise
- Similarity matrix: W ∈ R_+^{N×N}, W_{i,j} = similarity between x_i and x_j
- Degree matrix: D = diag(d_1, ..., d_N), d_i = \sum_{j≠i} W_{i,j}
92
Label Propagation: Representation
- Adjacency matrix: W ∈ {0,1}^{N×N}, W_{i,j} = 1 if x_i and x_j are connected, 0 otherwise
- Similarity matrix: W ∈ R_+^{N×N}, W_{i,j} = similarity between x_i and x_j
- Degree matrix: D = diag(d_1, ..., d_N), d_i = \sum_{j≠i} W_{i,j}
93
Label Propagation: Representation
- Given: similarity matrix W ∈ R_+^{N×N}
- Label information: y_l = (y_1, y_2, ..., y_{n_l}) ∈ {-1,+1}^{n_l} for the labeled examples; y_u = (y_1, y_2, ..., y_{n_u}) ∈ {-1,+1}^{n_u} for the unlabeled examples (unknown)
94
Label Propagation: Representation
- Given: similarity matrix W ∈ R_+^{N×N}; labels y_l = (y_1, y_2, ..., y_{n_l}) ∈ {-1,+1}^{n_l}
- y = (y_l, y_u)
95
Label Propagation
- Initial class assignments ŷ ∈ {-1, 0, +1}^N: ŷ_i = ±1 if x_i is labeled, ŷ_i = 0 if x_i is unlabeled
- Predicted class assignments y ∈ {-1,+1}^N: first predict confidence scores f ∈ R^N, then assign y_i = +1 if f_i > 0 and y_i = -1 if f_i ≤ 0
96
Label Propagation
- Initial class assignments ŷ ∈ {-1, 0, +1}^N: ŷ_i = ±1 if x_i is labeled, ŷ_i = 0 if x_i is unlabeled
- Predicted class assignments y ∈ {-1,+1}^N: first predict confidence scores f = (f_1, ..., f_N) ∈ R^N, then assign y_i = +1 if f_i > 0 and y_i = -1 if f_i ≤ 0
97
Label Propagation (II)
- One round of propagation: f_i = ŷ_i if x_i is labeled, and f_i = α \sum_{j=1}^{N} W_{i,j} ŷ_j otherwise (like weighted KNN, with α the weight for each propagation step)
- In matrix form: f^(1) = ŷ + α W ŷ
98
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- How to generalize to any number of iterations k? f^(k) = ŷ + \sum_{i=1}^{k} α^i W^i ŷ
99
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- Result for any number of iterations k: f^(k) = ŷ + \sum_{i=1}^{k} α^i W^i ŷ
100
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- Result for an infinite number of iterations: f^(∞) = ŷ + \sum_{i=1}^{∞} α^i W^i ŷ
101
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- For an infinite number of iterations the series sums to a matrix inverse: f^(∞) = (I - α W)^{-1} ŷ
- Normalized similarity matrix: \bar{W} = D^{-1/2} W D^{-1/2}
102
Local and Global Consistency [Zhou et al., NIPS 03]
- Local consistency: like KNN
- Global consistency: beyond KNN
103
Summary
- Construct a graph using pairwise similarities; propagate class labels along the graph: f = (I - α W)^{-1} ŷ
- Key parameters: α (the decay of propagation) and W (the similarity matrix)
- Computational complexity: the matrix inverse is O(n³); speed-ups via Cholesky decomposition or clustering
A small sketch of the closed-form propagation follows.
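A minimal sketch (not from the slides) of label propagation in closed form, f = (I - α \bar{W})^{-1} ŷ, with \bar{W} the normalized similarity matrix D^{-1/2} W D^{-1/2}. The graph and α are toy choices.

# Closed-form label propagation on a small graph.
import numpy as np

# symmetric similarity matrix over 5 points (two clusters: {0,1,2} and {3,4})
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y_hat = np.array([+1., 0., 0., -1., 0.])     # only points 0 and 3 are labeled

D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
W_bar = D_inv_sqrt @ W @ D_inv_sqrt
alpha = 0.5
f = np.linalg.solve(np.eye(len(y_hat)) - alpha * W_bar, y_hat)
print(np.sign(f))   # labels propagate within each connected cluster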
104
Questions
- Does this method rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
105
Application: Text Classification [Zhou et al., NIPS 03]
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, removal of stopwords and rare words, and skipping headers
- #Docs: 3970, #words: 8014
(Figure: results comparing SVM, KNN, and label propagation.)
106
Application: Image Retrieval [Wang et al., ACM MM 2004]
- 5,000 images; relevance feedback on the top 20 ranked images
- Classification problem: relevant or not? f(x) = degree of relevance
- Learning the relevance function f(x): supervised learning (SVM) vs. label propagation
(Figure: retrieval results comparing label propagation and SVM.)
107
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
108
Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary that is consistent with the labeled examples and has a small graph cut
(Figure: two candidate partitions, with graph cut = 1 vs. graph cut = 2.)
109
Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary that is consistent with the labeled examples and has a small graph cut (graph cut = 1 in the figure)
110
Min-cuts for Semi-supervised Learning [Blum and Chawla, ICML 2001]
- Additional nodes V+ (source) and V- (sink); infinite-weight edges connect them to the labeled examples
- Find the minimum cut (graph cut = 1 in the figure)
- High computational cost
111
Harmonic Function [Zhu et al., ICML 2003]
- Weight matrix W: w_{i,j} ≥ 0 is the similarity between x_i and x_j
- Membership vector f = (f_1, ..., f_N), with f_i = +1 if x_i ∈ A and f_i = -1 if x_i ∈ B
(Figure: a graph partitioned into regions A (+1) and B (-1).)
112
Harmonic Function (cont'd)
- Graph cut: C(f) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} (f_i - f_j)² / 4 = (1/4) f^T (D - W) f = (1/4) f^T L f
- Degree matrix: D = diag(d_1, ..., d_N), with diagonal elements d_i = \sum_{j≠i} W_{i,j}
113
Harmonic Function (cont'd)
- Graph cut: C(f) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} (f_i - f_j)² / 4 = (1/4) f^T (D - W) f = (1/4) f^T L f
- Graph Laplacian L = D - W: captures the pairwise relationships among data points and the manifold geometry of the data
114
Harmonic Function
- min_{f ∈ {-1,+1}^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- The objective enforces consistency with the graph structure; the constraints enforce consistency with the labeled data
- Challenge: the discrete space makes this a combinatorial optimization problem
115
Harmonic Function
- Relaxation: replace {-1, +1} with continuous real numbers, then convert the continuous f back to binary labels
- Discrete problem: min_{f ∈ {-1,+1}^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Relaxed problem: min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
116
Harmonic Function
- min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Partition L into blocks L_{l,l}, L_{l,u}, L_{u,l}, L_{u,u} according to labeled/unlabeled points, and f = (f_l, f_u)
- Closed-form solution for the unlabeled points: f_u = -L_{u,u}^{-1} L_{u,l} y_l
A small sketch follows.
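A minimal sketch (not from the slides) of the harmonic-function solution: order the nodes as (labeled, unlabeled), build the graph Laplacian L = D - W, and compute f_u = -L_{u,u}^{-1} L_{u,l} y_l. The graph and labels are toy data.

# Harmonic function solution on a small graph.
import numpy as np

W = np.array([[0, 1, 1, 0],     # node 0: labeled +1
              [1, 0, 0, 1],     # node 1: labeled -1
              [1, 0, 0, 0],     # node 2: unlabeled
              [0, 1, 0, 0]],    # node 3: unlabeled
             dtype=float)
y_l = np.array([+1., -1.])
n_l = 2

L = np.diag(W.sum(axis=1)) - W
L_uu = L[n_l:, n_l:]
L_ul = L[n_l:, :n_l]
f_u = -np.linalg.solve(L_uu, L_ul @ y_l)
print(f_u)   # node 2 follows its labeled neighbor (+1), node 3 follows (-1)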
117
Harmonic Function
- f_u = -L_{u,u}^{-1} L_{u,l} y_l
- This can be viewed as local propagation iterated until a global propagation is reached (sound familiar?)
119
Spectral Graph Transducer [Joachims, 2003]
- Start from the harmonic-function problem: min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Soften the hard constraints into a penalty: + α \sum_{i=1}^{n_l} (f_i - y_i)²
120
Spectral Graph Transducer [Joachims, 2003]
- min_{f ∈ R^N} C(f) = (1/4) f^T L f + α \sum_{i=1}^{n_l} (f_i - y_i)²   s.t. \sum_{i=1}^{N} f_i² = N
- Solved as a constrained eigenvector problem
121
Manifold Regularization [Belkin, 2006]
- min_{f ∈ R^N} C(f) = (1/4) f^T L f + α \sum_{i=1}^{n_l} (f_i - y_i)²   s.t. \sum_{i=1}^{N} f_i² = N
- The second term is a loss function for misclassification; the constraint regularizes the norm of the classifier
122
Manifold Regularization [Belkin, 2006]
- Replace the squared error with a general loss function l(f(x_i), y_i) and an explicit norm regularizer:
  min_{f ∈ R^N} f^T L f + α \sum_{i=1}^{n_l} l(f(x_i), y_i) + γ ||f||²_{H_K}
123
Summary
- Construct a graph using pairwise similarity; the key quantity is the graph Laplacian, which captures the geometry of the graph
- The decision boundary is consistent with both the graph structure and the labeled examples
- Parameters: α, γ, and the similarity measure
124
Questions
- Does this approach rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
125
Application: Text Classification
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, removal of stopwords and rare words, and skipping headers
- #Docs: 3970, #words: 8014
(Figure: results comparing label propagation, the harmonic function, SVM, and KNN.)
126
Application: Text Classification
PRBEP: precision-recall break-even point.
127
Application: Text Classification
Improvement in PRBEP by SGT
128
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
129
Transductive SVM
- Support vector machine: maximize the classification margin
- Decision boundary given a small number of labeled examples
130
Transductive SVM
- Decision boundary given a small number of labeled examples
- How should the decision boundary change given both labeled and unlabeled examples?
131
Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density
132
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and the unlabeled labels y_u
133
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and the unlabeled labels y_u
134
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and y_u:
  f* = argmax_{f ∈ H_K, y_u ∈ {-1,+1}^{n_u}} ω(X, y_l, y_u, f)
135
Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density, consistent with the resulting classification of the unlabeled data
- How to formulate this idea?
136
Transductive SVM: Formulation

Original SVM:
  {w*, b*} = argmin_{w, b} w^T w
  s.t. y_i (w^T x_i - b) ≥ 1,  i = 1, ..., n        (labeled examples)

Transductive SVM:
  {w*, b*} = argmin_{y_{n+1}, ..., y_{n+m}} argmin_{w, b} w^T w
  s.t. y_i (w^T x_i - b) ≥ 1,  i = 1, ..., n        (labeled examples)
       y_j (w^T x_j - b) ≥ 1,  j = n+1, ..., n+m    (unlabeled examples)

- Constraints for the unlabeled data
- A binary variable for the label of each unlabeled example
137
Computational Issue
- No longer a convex optimization problem
- Alternating optimization: fix the labels y_{n+1}, ..., y_{n+m} of the unlabeled examples and solve the resulting SVM for (w, b); then fix (w, b) and re-assign the labels; repeat
138
Summary
- Based on the maximum margin principle
- The classification margin is decided by the labeled examples and by the class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
139
Questions
- Does this approach rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
140
Text Classification by TSVM
- 10 categories from the Reuters collection
- 3299 test documents
- 1000 informative words selected by the MI (mutual information) criterion
141
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
142
Co-training [Blum & Mitchell, 1998]
- Classify web pages into a category for students and a category for professors
- Two views of a web page:
  - Content: "I am currently the second year Ph.D. student ..."
  - Hyperlinks: "My advisor is ...", "Students: ..."
143
Co-training for Semi-Supervised Learning
144
Co-training for Semi-Supervised Learning
- For some pages it is easy to classify the type of the web page based on its content
- For other pages it is easier to classify the web page using its hyperlinks
145
Co-training
- Two representations for each web page:
  - Content representation: (doctoral, student, computer, university, ...)
  - Hyperlink representation: inlinks (e.g., Prof. Cheng), outlinks (e.g., Prof. Cheng)
146
Co-training
- Train a content-based classifier
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
148
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
149
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
150
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
151
Co-training
- Assume two views of the objects, i.e., two sufficient representations
- Key idea: augment the training examples of one view by exploiting the classifier of the other view
- Extension to multiple views; problem: how to find equivalent views
A small sketch of the co-training loop follows.
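A minimal sketch (not from the slides) of the co-training loop: two views of the data, one classifier per view, and in each round the most confidently classified unlabeled examples are added to the labeled pool. scikit-learn's MultinomialNB stands in for the per-view classifiers; the toy data and the per-round parameters are assumptions.

# Co-training with two views and confidence-based pseudo-labeling.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1, X2, y, labeled, unlabeled, rounds=5, per_round=2):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        for X in (X1, X2):                          # content view, hyperlink view
            clf = MultinomialNB().fit(X[labeled], y[labeled])
            probs = clf.predict_proba(X[unlabeled])
            conf = probs.max(axis=1)
            picks = np.argsort(-conf)[:per_round]    # most confident examples
            for p in picks:
                y[unlabeled[p]] = clf.classes_[np.argmax(probs[p])]
            for p in sorted(picks, reverse=True):
                labeled.append(unlabeled.pop(p))
    return y

# toy usage: 2 labeled + 4 unlabeled examples, two bag-of-words views
X1 = np.array([[3, 0], [0, 3], [2, 0], [0, 2], [1, 0], [0, 1]])
X2 = np.array([[0, 2], [2, 0], [0, 1], [1, 0], [0, 2], [2, 0]])
y = np.array([1, 0, -1, -1, -1, -1])   # -1 = not yet labeled
print(co_training(X1, X2, y, labeled=[0, 1], unlabeled=[2, 3, 4, 5]))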
152
A Few Words about Active Learning
- Active learning: select the most informative examples (in contrast to passive learning)
- Key question: which examples are informative?
- Uncertainty principle: the most informative example is the one that is most uncertain to classify
- Need a measure of classification uncertainty
153
A Few Words about Active Learning
- Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty = degree of disagreement among committee members
- SVM-based approach: classification uncertainty measured by the distance to the decision boundary
- Simple but very effective approaches
154
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised clustering algorithms
155
Semi-supervised Clustering
Clustering data into two clusters
156
Semi-supervised Clustering
- Clustering data into two clusters
- Side information: must-links vs. cannot-links
157
Semi-supervised Clustering
- Also called constrained clustering
- Two types of approaches: restricted data partitions; distance metric learning
158
Restricted Data Partition
- Require data partitions to be consistent with the given links
- Links as hard constraints, e.g., constrained K-means (Wagstaff et al., 2001)
- Links as soft constraints, e.g., Metric Pairwise Constraints K-means (Basu et al., 2004)
A small soft-constraint sketch follows.
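A minimal illustrative sketch (not the exact algorithm from the slides): k-means where the assignment step adds a penalty for violating must-link / cannot-link constraints, in the spirit of soft-constrained clustering. The penalty weight, toy data, and constraints are assumptions.

# Soft-constrained k-means: penalize assignments that violate links.
import numpy as np

def constrained_kmeans(X, k, must, cannot, penalty=10.0, iters=20,
                       rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i in range(len(X)):
            costs = np.linalg.norm(centers - X[i], axis=1) ** 2
            for (a, b) in must:                  # penalize splitting a must-link
                if b == i:
                    a, b = b, a
                if a == i:
                    costs += penalty * (np.arange(k) != assign[b])
            for (a, b) in cannot:                # penalize merging a cannot-link
                if b == i:
                    a, b = b, a
                if a == i:
                    costs += penalty * (np.arange(k) == assign[b])
            assign[i] = np.argmin(costs)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

X = np.array([[0., 0.], [0.5, 0.], [5., 5.], [5.5, 5.], [2.5, 2.5]])
print(constrained_kmeans(X, 2, must=[(4, 0)], cannot=[(4, 2)]))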
159
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: a partition that satisfies the must-link and cannot-link -- "Yes")
160
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: another partition that satisfies the must-link and cannot-link -- "Yes")
161
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: a partition that violates the constraints -- "No")
162
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: a partition that satisfies the must-link and cannot-link -- penalty = 0)
163
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: another partition that satisfies the constraints -- penalty = 0)
164
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: a partition that violates one constraint -- penalty = 1)
165
Distance Metric Learning
- Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link, shorten the distance for a must-link
- Apply K-means with pairwise distances measured by the learned distance metric
(Figure: data transformed by the learned distance metric.)
166
Example of Distance Metric Learning
- Solid lines: must-links; dotted lines: cannot-links
(Figure: 2D data projection using the Euclidean distance metric vs. the learned distance metric.)
167
BoostCluster [Liu, Jin & Jain, 2007]
- A general framework for semi-supervised clustering: improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges:
  - How to influence an arbitrary clustering algorithm with side information? Encode the constraints into the data representation
  - How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance
168
BoostCluster
- Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
(Flow diagram: data + pairwise constraints -> new data representation -> clustering algorithm -> clustering results -> kernel matrix -> ... -> final results)
169
BoostCluster
- Step 1: find the best data representation that encodes the unsatisfied pairwise constraints
170
BoostCluster
- Step 2: obtain the clustering results given the new data representation
171
BoostCluster
- Step 3: update the kernel matrix with the clustering results
172
BoostCluster
- Run the procedure iteratively
173
BoostCluster
- Finally, compute the final clustering result
174
Summary
- Clustering data under given pairwise constraints: must-links vs. cannot-links
- Two types of approaches: restricted data partitions (either soft or hard); distance metric learning
- How to acquire links/constraints? Manual assignment, or derived from side information (hyperlinks, citations, user logs, etc.); they may be noisy and unreliable
175
Application: Document Clustering [Basu et al., 2004]
- 300 docs from three topics (atheism, baseball, space) of 20-newsgroups
- 3251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x: different variants of constrained clustering algorithms
176
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to text classification, adaptive filtering, collaborative filtering and ranking
- Semi-supervised learning and its application to text classification
- Emerging research directions
177
Efficient Learning
- In IR we have a massive amount of data, but most learning algorithms are relatively slow: it is difficult to handle millions of documents
- How to improve scalability?
  - Sampling: use only part of the data
  - Stochastic optimization: update the model one example at a time (related to online learning)
  - More interestingly, more examples may mean more efficient training (Srebro, ICML 2008)
178
Kernel Learning
- Kernels play a central role in machine learning
- Kernel functions can be learned from data: kernel alignment, multiple kernel learning, non-parametric kernel learning, ...
- Kernel learning is well suited to IR: similarity measures are key to IR, and kernel learning allows us to identify the optimal similarity measure automatically
179
Transfer Learning
- Different document categories are correlated, so we should be able to borrow information from one class when training another class
- Key question: what to transfer between classes? Representations, model priors, similarity measures, ...
180
Active Learning: IR Applications
- Relevance feedback (text retrieval or image retrieval)
- Text classification
- Adaptive information filtering
- Collaborative filtering
- Query rewriting
181
Discriminative Language Models
- Language models have been shown to be effective for information retrieval
- But most language models are generative and thus lack discriminative power
- Key difficulty for discriminative language models: there are no output labels! Possible directions: side information; mixtures of generative and discriminative models
182
References
- A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
- T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Journal of Information Retrieval, 2001.
- F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. ICML 2003.
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Systems, 2004.
- A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT 1998.
- D. Blei and M. Jordan. Variational methods for the Dirichlet process. ICML 2004.
- T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2), 2001.
- D. Blei, A. Ng and M. Jordan. Latent Dirichlet allocation. NIPS 2002.
- R. Jin, C. Ding, and F. Kang. A probabilistic approach for optimizing spectral clustering. NIPS 2005.
- D. Zhou, B. Scholkopf, and T. Hofmann. Semi-supervised learning on directed graphs. NIPS 2005.
- X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.
- T. Joachims. Transductive learning via spectral graph partitioning. ICML 2003.
183
References
- A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. ICML 1998.
- D. A. Cohn, Z. Ghahramani and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.
- S. Tong and E. Chang. Support vector machine active learning for image retrieval. ACM Multimedia, 2001.
- X. Shen and C. Zhai. Active feedback in ad hoc information retrieval. SIGIR 2005.
- J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
- X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li. Multi-model similarity propagation and its application for web image retrieval. ACM Multimedia, 2004.
- M. Belkin, P. Niyogi and V. Sindhwani. Manifold regularization. Technical report, Univ. of Chicago, 2006.
- K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. ICML 2001.
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. SIGKDD 2004.
184
References
- X. He, B. Rey, W. V. Zhang, R. Jones. Query rewriting using active learning for sponsored search. SIGIR 2007.
- Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on Bayesian active learning. ICML 2003.
- Z. Xu and R. Akella. A Bayesian logistic regression model for active relevance feedback. SIGIR 2008.
- G. Schohn and D. Cohn. Less is more: active learning with support vector machines. ICML 2000.
- M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Machine Learning, 2004.
- J. Rocchio. Relevance feedback in information retrieval. In The SMART System: Experiments in Automatic Document Processing. Prentice Hall, 1971.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT 1992.
- Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
- R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. KDD Explorations, 2008.
- T.-Y. Liu. Tutorial: Learning to rank.
- S. Chakrabarti. Learning to rank in vector spaces and social networks. WWW 2007.
185
Thank You
God, it is finally over!