1
Machine Learning for Information Retrieval
Rong Jin, Michigan State University
Yi Zhang, University of California Santa Cruz
2
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
3
Roadmap of Information Retrieval
(Diagram: starting from the data, retrieval applications -- search, filtering, categorization, summarization -- support information access, while mining/learning applications -- clustering, data analysis, extraction, mining, visualization -- support knowledge acquisition.)
Why Machine Learning is Important?
4
Text Categorization
5
Text Categorization
- Open Directory Project: the largest human-edited directory of the Web
- Manual classification: over 4 million sites and 590K categories
- Need to automate the process
6
Document Clustering
7
Question Answering
Classify question; identify answers; match questions and answers
8
Image Retrieval
Image segmentation by data clustering
9
Image Retrieval by Key Points
Key features -> visual words, obtained by data clustering
(Figure: key points quantized into visual-word bins b1, b2, ..., b8; an image is then represented by the visual words it contains, e.g., b1 b2 b3 b4.)
10
Image Retrieval by Text Query
- Automatically annotate images with textual words
- Retrieve images with textual queries
- Key technique: classification -- each keyword is a different category
11
Information Extraction
Example: a job posting on a web page (free-style text) mapped to a relational DB record:
  Title: J2EE Developer
  Length: 4 months
  Salary: ....
  Location:
  Reference:
Structure prediction by Hidden Markov Models and Markov Random Fields
12
Citation/Link Analysis
13
Recommender Systems
14
Recommender Systems
  User 1:  ?  5  3  4  2
  User 2:  4  1  5  ?  5
  User 3:  5  ?  4  2  5
  User 4:  1  5  3  5  ?
Sparse data problem: many missing values
15
Recommender System
(Figure: users and movies are grouped into classes -- User Class I, User Class II; Movie Type I, II, III -- and each (user class, movie type) pair has a rating distribution, e.g., p(4)=1/4, p(5)=3/4; p(1)=1/2, p(2)=1/2; p(4)=1/2, p(5)=1/2.)
Fill in the sparse rating matrix by data clustering
16
One More Reason for ML
$1,000,000 award (the Netflix Prize)
17
Review of Basic Probability Concepts
- Probability Pr(A): "the fraction of possible worlds in which A is true"
- Examples:
  A = Your paper will be accepted by SIGIR 2008
  A = It rains in Singapore
  A = A document contains the word "IR"
(Figure: the event space of all possible worlds has area 1; the region where A is true is a subset of it.)
18
Conditional Probability
- SIGIR2008 = "a document contains the phrase SIGIR 2008"
- SINGAPORE = "a document contains the word Singapore"
- P(SINGAPORE) = 0.000001, P(SIGIR2008) = 0.00000001, P(SINGAPORE | SIGIR2008) = 1/2
- "Singapore" is rare and "SIGIR 2008" is rarer, but if a document contains "SIGIR 2008", there is a 50-50 chance it also contains the word "Singapore"
19
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
20
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
- Independent variables: Pr(A|B) = Pr(A), so Pr(A,B) = Pr(B) Pr(A)
21
Conditional Probability
- Definition: Pr(A|B) = Pr(A,B) / Pr(B)
- Chain rule: Pr(A,B) = Pr(B) Pr(A|B)
- Independence: Pr(A|B) = Pr(A), so Pr(A,B) = Pr(B) Pr(A)
- Marginal probability: Pr(B) = \sum_{j=1}^{k} Pr(B, A = a_j)
22
Bayes’ Rule
Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- H: hypothesis, E: evidence
- Information available: Pr(E|H)
- Inference: Pr(H|E)
23
Bayes’ Rule
Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- R: it rains; W: the grass is wet
- Information available: Pr(W|R); inference: Pr(R|W)
- Conditional probability table Pr(W|R):
               R      not R
    W         0.7      0.4
    not W     0.3      0.6
24
Statistical Inference
- Learning stage: estimate a parametric model for Pr(E|H)
- Inference stage: for a given observation E, compute Pr(H|E) for each hypothesis H and choose the hypothesis with the largest Pr(H|E)
- Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
25
Example: Language Model (LM) for IR
- Documents d1, ..., d1000; query q: "Singapore SIGIR"
- Hypothesis H: a document (its language model θ_d); evidence E: the query q
- Learning: estimate a statistical language model θ_d for each document
- Inference: compute the likelihood Pr(q | θ_d) and rank documents by Pr(θ_d | q) ∝ Pr(θ_d) Pr(q | θ_d)
A small scoring sketch follows.
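The following is a minimal sketch (not from the slides) of query-likelihood scoring; it assumes Jelinek-Mercer smoothing with an arbitrary interpolation weight, and the toy documents and query are invented for illustration.

# Minimal sketch: query-likelihood retrieval with Jelinek-Mercer smoothing.
import math
from collections import Counter

def score(query_terms, doc_terms, collection_counts, collection_len, lam=0.5):
    """log Pr(q | theta_d), with Pr(w | theta_d) smoothed by the collection model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = collection_counts[w] / collection_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

docs = {"d1": "sigir is held in singapore".split(),
        "d2": "it rains in singapore every day".split()}
coll = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll.values())
query = "singapore sigir".split()
ranking = sorted(docs, key=lambda d: score(query, docs[d], coll, coll_len), reverse=True)
print(ranking)  # d1 should rank above d2 for this query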
26
Probability Distributions
- Binomial distribution; Beta distribution
- Multinomial distribution (language models); Dirichlet distribution (conjugate prior, used for smoothing LMs)
- Gaussian distribution; Laplacian distribution (a Laplacian prior yields sparse solutions, i.e., an L1 regularizer)
27
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
28
Supervised Learning: Basic Setting
- Given training data: {(x1, y1), (x2, y2), ..., (xN, yN)}
- Learning: infer a function f(x) from the training data
- Inference: predict future outcomes y = f(x) given x
Regression: continuous y, e.g., f(x) = ax - b
(Figure: a line fitted to scattered (x, y) points.)
29
Supervised Learning: Basic Setting
- Given training data: {(x1, y1), (x2, y2), ..., (xN, yN)}
- Learning: infer a function f(x) from the training data
- Inference: predict future outcomes y = f(x) given x
Classification: discrete y, e.g., x = (x1, x2), decision boundary w^T x - b = 0, and f(x) = sign(w^T x - b) separating y = +1 from y = -1
30
Examples
- Text categorization: input x is a word histogram; output y is the document category (e.g., 1 for "domestic economics", 2 for "politics", 3 for "sports", and 4 for "others")
- Question answering (classify question types): input x is the parse tree of a question; output y is the question type (e.g., when, where, ...)
31
K Nearest-Neighbor (KNN) Classifiers
- Compute the distance from the unknown record to the training documents
- Identify the k nearest neighbors
- Determine the class of the unknown point from the class labels of its closest neighbors
(Based on Tan, Steinbach, Kumar)
32
K Nearest-Neighbor (KNN) Classifiers
- Compute the distance between two points: Euclidean distance, cosine distance, Kullback-Leibler divergence, Bregman distance (generated by a convex function), ...
- The distance function can also be learned from data (distance metric learning)
- Determine the class: majority vote or weighted majority vote
A minimal sketch follows.
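A minimal sketch (not from the slides): KNN classification with cosine similarity and a majority vote over the k nearest training documents; the toy vectors and labels are invented for illustration.

# KNN with cosine similarity and majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # cosine similarity between x and every training vector
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(-sims)[:k]           # indices of the k most similar points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority vote

X_train = np.array([[2., 0., 1.], [1., 0., 2.], [0., 3., 1.], [0., 2., 2.]])
y_train = np.array([+1, +1, -1, -1])
print(knn_predict(X_train, y_train, np.array([1., 0., 1.]), k=3))  # -> +1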
33
K Nearest-Neighbor (KNN) Classifiers
- Deciding k (the number of nearest neighbors): bias-variance tradeoff
- Choose k by cross validation (or leave-one-out) on a training/validation split
(Figure: decision regions for k = 1 vs. k = 4.)
34
K Nearest-Neighbor (KNN) Classifiers
- Curse of dimensionality: many attributes are irrelevant; in high dimensions, distances become less informative
(Figure: distribution of squared distances for 1000 random data points in 1000 dimensions.)
35
KNN for Collaborative Filtering
- Collaborative filtering: will user u like item b?
- Assumption: users with similar tastes are likely to have similar preferences on items
- Make filtering decisions for one user based on the feedback from other, similar users
36
KNN for Collaborative Filtering
  User 1:  1  5  3  4  3
  User 2:  4  1  5  2  5
  User 3:  2  ?  3  5  4
(Predict the missing rating "?" for user 3.)
37
KNN for Collaborative Filtering
  User 1:  1  5  3  4  3
  User 2:  4  1  5  2  5
  User 3:  2  ?  3  5  4    (predicted rating for "?": 5)
The similarity measure of user interests can be learned; a small sketch follows.
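An illustrative sketch (not from the slides) of user-based collaborative filtering. The similarity measure (cosine over commonly rated items) and the weighted-average prediction are assumptions; the prediction therefore depends on these choices and need not match the slide's value.

# User-based CF: similarity-weighted average of other users' ratings.
import numpy as np

R = np.array([[1., 5., 3., 4., 3.],
              [4., 1., 5., 2., 5.],
              [2., 0., 3., 5., 4.]])   # ratings from the slide; 0 = unknown

def predict(R, user, item):
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        mask = (R[user] > 0) & (R[other] > 0)      # items rated by both users
        sim = np.dot(R[user, mask], R[other, mask]) / (
            np.linalg.norm(R[user, mask]) * np.linalg.norm(R[other, mask]))
        num += sim * R[other, item]
        den += abs(sim)
    return num / den if den else None

print(round(predict(R, user=2, item=1), 2))   # predicted rating for the missing entry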
38
Paradigm for Supervised Learning
- Gather training data
- Determine the input features (i.e., what is x?), e.g., bags of words for text categorization; feature engineering is extremely important
- Determine the functional form f(x): linear or nonlinear (what is the functional form for KNN?)
- Determine the learning algorithm: learn the optimal parameters (optimization, cross validation); probabilistic or non-probabilistic
- Test on a test set
39
Bayesian Learning
- Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- Hypothesis space: H = {Y1, Y2, ...}
- MAP learning (Maximum A Posteriori):
  Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
40
Bayesian Learning
- Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)    (Posterior ∝ Prior × Likelihood)
- Hypothesis space: H = {Y1, Y2, ...}
- Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
- MLE learning (Maximum Likelihood Estimation): ignore the prior Pr(Y) and maximize the likelihood Pr(X|Y) alone
41
Bayesian Learning: Conjugate Prior
- The posterior Pr(Y|X) has the same form as the prior Pr(Y); e.g., the Dirichlet distribution is the conjugate prior of the multinomial distribution (widely used in language models)
- Y* = argmax_{Y ∈ H} Pr(Y|X) = argmax_{Y ∈ H} Pr(Y) Pr(X|Y), with hypothesis space H = {Y1, Y2, ...}
42
Example: Text Categorization
- Task: is a web page for a professor or for a student?
- What is Y? The category (Prof. vs. Student). What is the feature X? The words on the page.
- Y* = argmax_{Y ∈ H} Pr(Y) Pr(X|Y)
- How to estimate Pr(Y = Student) or Pr(Y = Prof.)? By counting.
- How to estimate Pr(w|Y)? 1. Counting = MLE; 2. Counting + pseudo counts = MAP
A small sketch follows.
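A minimal sketch (not from the slides) of multinomial Naive Bayes trained by counting; the pseudo-count (Laplace smoothing) plays the role of the MAP prior. The toy "professor" vs. "student" pages are invented for illustration.

# Naive Bayes by counting, with pseudo counts for smoothing.
import math
from collections import Counter, defaultdict

def train(docs, labels, pseudo=1.0):
    vocab = {w for d in docs for w in d}
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    log_prior = {y: math.log(prior[y] / len(labels)) for y in prior}
    log_cond = {}
    for y in prior:
        total = sum(word_counts[y].values()) + pseudo * len(vocab)
        log_cond[y] = {w: math.log((word_counts[y][w] + pseudo) / total) for w in vocab}
    return log_prior, log_cond

def predict(doc, log_prior, log_cond):
    # words outside the training vocabulary are simply ignored
    scores = {y: log_prior[y] + sum(log_cond[y].get(w, 0.0) for w in doc)
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["advisor", "students", "grant"], ["phd", "student", "advisor"],
        ["course", "grant", "publications"]]
labels = ["prof", "student", "prof"]
model = train(docs, labels)
print(predict(["grant", "publications"], *model))  # -> 'prof'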
43
Naïve Bayes
- From Pr(w|Y) to Pr(X|Y)? Vocabulary [w1, w2, ..., wV]; document X = (x1, x2, ..., xV) (word counts)
- Naïve Bayes assumption: Pr(X|Y) ≈ [Pr(w1|Y)]^{x1} ⋯ [Pr(wV|Y)]^{xV}
- f(X) = log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)]
       = log Pr(Y=P)/Pr(Y=S)                                          (threshold/constant)
         + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)   (weights for words)
44
Naïve Bayes: A Linear Classifier
- f(X) = log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)]
       = log Pr(Y=P)/Pr(Y=S) + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)
- This is a linear decision boundary: f(x) = sign(w^T x - b), separating y = +1 from y = -1
- Logistic Regression: directly model f(x), i.e., Pr(Y|X)
45
Logistic Regression (LR)
- Naïve Bayes gives: log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)] = log Pr(Y=P)/Pr(Y=S) + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)
- LR models this log-odds directly as a linear function: log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)] = b + t1 x1 + ... + tV xV
- t1, ..., tV are unknown weights learned from data by maximum likelihood estimation (MLE)
- Pr(y = ±1 | X) = 1 / (1 + exp[-y (t1 x1 + ... + tV xV + b)])
46
Logistic Regression (LR)
- Learning parameters b, t1, ..., tV by Maximum Likelihood Estimation (MLE):
  (t*, b*) = argmax_{t, b} \sum_{i=1}^{N} log Pr(y_i | X_i; t, b)
A minimal training sketch follows.
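A minimal sketch (not from the slides): logistic regression trained by gradient ascent on the log-likelihood above. The learning rate, iteration count, and toy data are arbitrary choices for illustration.

# Logistic regression by gradient ascent on the log-likelihood.
import numpy as np

def train_lr(X, y, lr=0.1, iters=500):
    n, d = X.shape
    t, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ t + b)))   # Pr(y = +1 | x)
        grad_t = X.T @ ((y + 1) / 2 - p)         # gradient of the log-likelihood
        grad_b = np.sum((y + 1) / 2 - p)
        t += lr * grad_t / n
        b += lr * grad_b / n
    return t, b

X = np.array([[2., 0.], [1.5, 0.5], [0., 2.], [0.5, 1.5]])
y = np.array([+1, +1, -1, -1])
t, b = train_lr(X, y)
print(np.sign(X @ t + b))   # should recover the training labels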
47
Logistic Regression (LR)
- Learning parameters b, t1, ..., tV by Maximum Likelihood Estimation (MLE):
  (t*, b*) = argmax_{t, b} \sum_{i=1}^{N} log Pr(y_i | X_i; t, b)
- MLE can overfit, leading to worse performance on new data
- Maximum A Posteriori (MAP): add a prior Pr(t) to the objective
- (Why only weights for words?)
48
Learning Logistic Regression
- Pr(y = ±1 | X) = 1 / (1 + exp[-y f(X)])
- (t*, b*) = argmin_{t, b} \sum_{i=1}^{N} -log Pr(y_i | X_i; t, b)
- The negative log-likelihood is a loss function measuring the mismatch between y and f(X); other loss functions can be used as well
49
Logistic Regression (LR)
- Closely related to Maximum Entropy (ME): logistic regression and maximum entropy are duals of each other
- Advantages of LR: a Bayesian approach; convenient for incorporating prior knowledge; useful for semi-supervised learning, transfer learning, ...
50
Comparison of Classifiers (from Li and Yang, SIGIR '03)
                       Macro F1   Micro F1
  KNN                  0.8557     0.5975
  Naïve Bayes          0.8009     0.4737
  Logistic Regression  0.8748     0.6084
51
Comparison of Classifiers
- Logistic Regression: models Pr(Y|X), i.e., the decision boundary directly; NB is a special case of LR; requires a numerical solution; needs a large number of training examples and converges slowly
- Naïve Bayes: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; has a simple (counting) solution; works with a small number of training examples and converges fast
52
Comparison of Classifiers
- Discriminative model: models Pr(Y|X), i.e., the decision boundary directly; broader model assumptions; requires a numerical solution; needs a large number of training examples and converges slowly
- Generative model: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; simple solution; works with a small number of training examples and converges fast
Rule of thumb:
- Use a discriminative model if you have enough training examples, enough computational power, and classification accuracy is important
- Use a generative model if training examples or computational power are scarce, training time matters more, or you want a quick test
53
Comparison of Classifiers
- Discriminative model: models Pr(Y|X), i.e., the decision boundary directly; broader model assumptions; requires a numerical solution; needs a large number of training examples and converges slowly
- Generative model: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; simple solution; works with a small number of training examples and converges fast
- What about KNN?
54
Other Discriminative Classifiers
- Decision tree: aggregation of decision rules via a tree; easy to interpret
55
Other Discriminative Classifiers
- Decision tree: aggregation of decision rules via a tree; easy to interpret
- Support vector machine: a maximum-margin classifier; among the best-performing text classifiers
(Figure: maximum-margin boundary separating y = +1 from y = -1 in the (x1, x2) plane.)
56
Comparison of Classifiers (from Li and Yang, SIGIR '03)
                         Macro F1   Micro F1
  KNN                    0.8557     0.5975
  Naïve Bayes            0.8009     0.4737
  Logistic Regression    0.8748     0.6084
  Support Vector Machine 0.8857     0.5975
57
Ensemble Learning
- Generate multiple classifiers; classify by (weighted) majority vote
- Bagging & boosting: train each classifier h1, ..., hk on a different sample D1, ..., Dk of the training data D
A small bagging sketch follows.
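A minimal sketch (not from the slides) of bagging: train one classifier per bootstrap sample and combine them by majority vote. The base learner here is a one-feature decision stump, chosen only to keep the example self-contained; the toy data are invented.

# Bagging with decision stumps and majority vote.
import numpy as np

def fit_stump(X, y):
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (+1, -1):
                pred = np.where(X[:, j] > thr, sgn, -sgn)
                acc = np.mean(pred == y)
                if best is None or acc > best[0]:
                    best = (acc, j, thr, sgn)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sgn = stump
    return np.where(X[:, j] > thr, sgn, -sgn)

def bagging(X, y, n_rounds=25, rng=np.random.default_rng(0)):
    stumps = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

X = np.array([[1., 2.], [2., 1.], [3., 2.5], [6., 5.], [7., 6.], [8., 5.5]])
y = np.array([-1, -1, -1, +1, +1, +1])
stumps = bagging(X, y)
votes = np.sign(sum(stump_predict(s, X) for s in stumps))
print(votes)   # majority vote of the bagged stumps on the training set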
58
Ensemble Learning
- Bias-variance tradeoff: bagging reduces variance, boosting reduces bias
(Figure: error caused by variance vs. error caused by bias; 50 decision trees combined by majority vote.)
59
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example, e.g.:
        c1  c2  ...  cK
  X1     0   1  ...   0
  X2     1   0  ...   0
  ...
  XN     1   0  ...   1
- Approaches: one against all (one binary classifier f_1(X), ..., f_K(X) per class); ECOC coding
A one-against-all sketch follows.
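A minimal sketch (not from the slides) of one-against-all multi-class classification built from binary classifiers; scikit-learn's LogisticRegression is used only as a convenient binary learner, and the toy data are invented.

# One-against-all: one binary classifier per class, predict by the highest score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, classes):
    models = {}
    for c in classes:
        binary_y = np.where(y == c, 1, 0)        # class c vs. the rest
        models[c] = LogisticRegression().fit(X, binary_y)
    return models

def predict_one_vs_all(models, X):
    classes = list(models)
    scores = np.vstack([models[c].decision_function(X) for c in classes])
    return [classes[i] for i in np.argmax(scores, axis=0)]

X = np.array([[0., 1.], [1., 0.], [5., 5.], [6., 5.], [0., 6.], [1., 7.]])
y = np.array(["c1", "c1", "c2", "c2", "c3", "c3"])
models = train_one_vs_all(X, y, ["c1", "c2", "c3"])
print(predict_one_vs_all(models, X))   # should recover the training labels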
60
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example (label matrix as above)
- Approaches: one against all; ECOC coding
- ECOC coding: assign each class c1, c2, c3, ... a row of M coding bits (e.g., 0 1 ... 0; 1 0 ... 1; ...; 1 1 ... 0) and train one binary classifier f_1(X), ..., f_M(X) per coding bit (M = # of coding bits)
61
Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example (label matrix as above)
- Approaches: one against all (binary classifiers f_1(X), ..., f_K(X)); ECOC coding; transfer learning
62
Beyond Vector Inputs
- Gene sequence classification (sequences)
- Question type classification (trees)
- Character recognition (graphs)
63
Beyond Vector Inputs: Kernel
- Kernel function k(x1, x2): assesses the similarity between two objects x1 and x2
- Objects do not have to be represented by vectors
64
Beyond Vector Inputs: Kernel
- Kernel function k(x1, x2): assesses the similarity between two objects x1 and x2; objects do not have to be represented by vectors
- Vector representation via the kernel function: given training examples x1, ..., xN, represent any example x by the vector [k(x1, x), k(x2, x), ..., k(xN, x)] (related to the representer theorem)
A small sketch follows.
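A minimal sketch (not from the slides) of this "empirical kernel map": represent an arbitrary object by its kernel values against the training objects. The kernel below (shared-word count between two strings) is just a toy stand-in for a real string, tree, or graph kernel.

# Represent non-vector objects by kernel values against the training set.
def kernel(a, b):
    return len(set(a.split()) & set(b.split()))

train = ["machine learning for retrieval",
         "semi supervised learning",
         "graph based label propagation"]

def to_vector(x, train_objects):
    # phi(x) = [k(x1, x), k(x2, x), ..., k(xN, x)]
    return [kernel(xi, x) for xi in train_objects]

print(to_vector("supervised learning for text retrieval", train))  # -> [3, 2, 0]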
65
Beyond Vector Inputs
- String kernel (sequences), tree kernel (trees), graph kernel (graphs)
66
Kernel for Nonlinear Classifiers
67
Keywords Associated with Kernels
- Reproducing Kernel Hilbert Space (RKHS)
- Vector representation; Mercer's conditions
- Good kernels; representer theorem
- Kernel learning (e.g., multiple kernel learning)
68
Sequence Prediction
- Part-of-speech tagging: [He] [reckons] [the] [current] [account] [deficit] -> [PRP] [VBZ] [DT] [JJ] [NN] [NN]
- All the tags in a sentence are related: Pr(NN | "account") -> Pr(NN | "account", tag-for-"current")
- Models: Hidden Markov Model (HMM), Conditional Random Field (CRF), Maximum Margin Markov Network (M3)
69
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions
70
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
71
Spectrum of Learning Problems
72
What is Semi-supervised Learning
- Learning a function f(x): X -> Y from a mixture of labeled and unlabeled examples
- Labeled data: L = {(x1, y1), ..., (x_{n_l}, y_{n_l})}
- Unlabeled data: U = {x1, ..., x_{n_u}}
- Total number of examples: N = n_l + n_u
73
Why Semi-supervised Learning?
- Labeling is expensive and difficult
- Labeling is unreliable (e.g., segmentation applications; may need multiple experts)
- Unlabeled examples are easy to obtain in large numbers (e.g., web pages, text documents)
74
Semi-supervised Learning Problems
- Classification: transductive (predict labels of the unlabeled data) vs. inductive (learn a classification function)
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart
75
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
76
Why Unlabeled Data Could Be Helpful
- Clustering assumption: unlabeled data help decide the decision boundary f(x) = 0
- Manifold assumption: unlabeled data help decide the decision function f(x)
77
Clustering Assumption
?
78
Clustering Assumption
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
- Does this suggest a simple algorithm for semi-supervised learning?
79
Manifold Assumption
- Regularize the classification function f(x)
- Graph representation: vertices are training examples (labeled and unlabeled); edges connect similar examples
- x1 and x2 are connected -> |f(x1) - f(x2)| is small
80
Manifold Assumption
- Data lie on a low-dimensional manifold; the classification function f(x) should "follow" the data manifold
- Graph representation: vertices are training examples (labeled and unlabeled); edges connect similar examples
81
Statistical View
- Generative model for classification: Pr(X, Y | θ, η) = Pr(X | Y; θ) Pr(Y | η)
(Graphical model: Y -> X, with parameters θ and η.)
82
Statistical View
- Generative model for classification: Pr(X, Y | θ, η) = Pr(X | Y; θ) Pr(Y | η)
- Unlabeled data help estimate Pr(X | Y; θ)  (clustering assumption)
83
Statistical View
- Discriminative model for classification: Pr(X, Y | θ, μ) = Pr(X | μ) Pr(Y | X; θ)
84
Statistical View
- Discriminative model for classification: Pr(X, Y | θ, μ) = Pr(X | μ) Pr(Y | X; θ)
- Unlabeled data help regularize θ via a prior Pr(θ | X)  (manifold assumption)
85
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
86
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
87
Label Propagation: Key Idea
A decision boundary based on the labeled examples is unable to take into account the layout of the data points
How to incorporate the data distribution into the prediction of class labels?
88
Label Propagation: Key Idea
Connect the data points that are close to each other
89
Label Propagation: Key Idea
Connect the data points that are close to each other
Propagate the class labels over the connected graph
90
Label Propagation: Key Idea Connect the data
points that are close to each other
Propagate the class labels over the connected graph
Different from the K Nearest Neighbor
91
Label Propagation: Representation
- Adjacency matrix: W ∈ {0,1}^{N×N}, W_{i,j} = 1 if x_i and x_j are connected, 0 otherwise
- Similarity matrix: W ∈ R_+^{N×N}, W_{i,j} = similarity between x_i and x_j
- Degree matrix: D = diag(d_1, ..., d_N), d_i = \sum_{j≠i} W_{i,j}
92
Label Propagation: Representation
- Adjacency matrix: W ∈ {0,1}^{N×N}, W_{i,j} = 1 if x_i and x_j are connected, 0 otherwise
- Similarity matrix: W ∈ R_+^{N×N}, W_{i,j} = similarity between x_i and x_j
- Degree matrix: D = diag(d_1, ..., d_N), d_i = \sum_{j≠i} W_{i,j}
93
Label Propagation: Representation
- Given: similarity matrix W ∈ R_+^{N×N}
- Label information: y_l = (y_1, y_2, ..., y_{n_l}) ∈ {-1,+1}^{n_l} for the labeled examples; y_u = (y_1, y_2, ..., y_{n_u}) ∈ {-1,+1}^{n_u} for the unlabeled examples (unknown)
94
Label Propagation: Representation
- Given: similarity matrix W ∈ R_+^{N×N}; labels y_l = (y_1, y_2, ..., y_{n_l}) ∈ {-1,+1}^{n_l}
- y = (y_l, y_u)
95
Label Propagation
- Initial class assignments ŷ ∈ {-1, 0, +1}^N: ŷ_i = ±1 if x_i is labeled, ŷ_i = 0 if x_i is unlabeled
- Predicted class assignments y ∈ {-1,+1}^N: first predict confidence scores f ∈ R^N, then assign y_i = +1 if f_i > 0 and y_i = -1 if f_i ≤ 0
96
Label Propagation
- Initial class assignments ŷ ∈ {-1, 0, +1}^N: ŷ_i = ±1 if x_i is labeled, ŷ_i = 0 if x_i is unlabeled
- Predicted class assignments y ∈ {-1,+1}^N: first predict confidence scores f = (f_1, ..., f_N) ∈ R^N, then assign y_i = +1 if f_i > 0 and y_i = -1 if f_i ≤ 0
97
Label Propagation (II)
- One round of propagation: f_i = ŷ_i if x_i is labeled, and f_i = α \sum_{j=1}^{N} W_{i,j} ŷ_j otherwise (like weighted KNN, with α the weight for each propagation step)
- In matrix form: f^(1) = ŷ + α W ŷ
98
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- How to generalize to any number of iterations k? f^(k) = ŷ + \sum_{i=1}^{k} α^i W^i ŷ
99
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- Result for any number of iterations k: f^(k) = ŷ + \sum_{i=1}^{k} α^i W^i ŷ
100
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- Result for an infinite number of iterations: f^(∞) = ŷ + \sum_{i=1}^{∞} α^i W^i ŷ
101
Label Propagation (II)
- Two rounds of propagation: f^(2) = f^(1) + α W f^(1) = ŷ + α W ŷ + α² W² ŷ
- For an infinite number of iterations the series sums to a matrix inverse: f^(∞) = (I - α W)^{-1} ŷ
- Normalized similarity matrix: \bar{W} = D^{-1/2} W D^{-1/2}
102
Local and Global Consistency [Zhou et al., NIPS 03]
- Local consistency: like KNN
- Global consistency: beyond KNN
103
Summary
- Construct a graph using pairwise similarities; propagate class labels along the graph: f = (I - α W)^{-1} ŷ
- Key parameters: α (the decay of propagation) and W (the similarity matrix)
- Computational complexity: the matrix inverse is O(n³); speed-ups via Cholesky decomposition or clustering
A small sketch of the closed-form propagation follows.
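A minimal sketch (not from the slides) of label propagation in closed form, f = (I - α \bar{W})^{-1} ŷ, with \bar{W} the normalized similarity matrix D^{-1/2} W D^{-1/2}. The graph and α are toy choices.

# Closed-form label propagation on a small graph.
import numpy as np

# symmetric similarity matrix over 5 points (two clusters: {0,1,2} and {3,4})
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y_hat = np.array([+1., 0., 0., -1., 0.])     # only points 0 and 3 are labeled

D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
W_bar = D_inv_sqrt @ W @ D_inv_sqrt
alpha = 0.5
f = np.linalg.solve(np.eye(len(y_hat)) - alpha * W_bar, y_hat)
print(np.sign(f))   # labels propagate within each connected cluster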
104
Questions
- Does this method rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
105
Application: Text Classification [Zhou et al., NIPS 03]
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, removal of stopwords and rare words, and skipping headers
- #Docs: 3970, #words: 8014
(Figure: results comparing SVM, KNN, and label propagation.)
106
Application: Image Retrieval [Wang et al., ACM MM 2004]
- 5,000 images; relevance feedback on the top 20 ranked images
- Classification problem: relevant or not? f(x) = degree of relevance
- Learning the relevance function f(x): supervised learning (SVM) vs. label propagation
(Figure: retrieval results comparing label propagation and SVM.)
107
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
108
Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary that is consistent with the labeled examples and has a small graph cut
(Figure: two candidate partitions, with graph cut = 1 vs. graph cut = 2.)
109
Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary that is consistent with the labeled examples and has a small graph cut (graph cut = 1 in the figure)
110
Min-cuts for Semi-supervised Learning [Blum and Chawla, ICML 2001]
- Additional nodes V+ (source) and V- (sink); infinite-weight edges connect them to the labeled examples
- Find the minimum cut (graph cut = 1 in the figure)
- High computational cost
111
Harmonic Function [Zhu et al., ICML 2003]
- Weight matrix W: w_{i,j} ≥ 0 is the similarity between x_i and x_j
- Membership vector f = (f_1, ..., f_N), with f_i = +1 if x_i ∈ A and f_i = -1 if x_i ∈ B
(Figure: a graph partitioned into regions A (+1) and B (-1).)
112
Harmonic Function (cont'd)
- Graph cut: C(f) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} (f_i - f_j)² / 4 = (1/4) f^T (D - W) f = (1/4) f^T L f
- Degree matrix: D = diag(d_1, ..., d_N), with diagonal elements d_i = \sum_{j≠i} W_{i,j}
113
Harmonic Function (cont'd)
- Graph cut: C(f) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} (f_i - f_j)² / 4 = (1/4) f^T (D - W) f = (1/4) f^T L f
- Graph Laplacian L = D - W: captures the pairwise relationships among data points and the manifold geometry of the data
114
Harmonic Function
- min_{f ∈ {-1,+1}^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- The objective enforces consistency with the graph structure; the constraints enforce consistency with the labeled data
- Challenge: the discrete space makes this a combinatorial optimization problem
115
Harmonic Function
- Relaxation: replace {-1, +1} with continuous real numbers, then convert the continuous f back to binary labels
- Discrete problem: min_{f ∈ {-1,+1}^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Relaxed problem: min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
116
Harmonic Function
- min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Partition L into blocks L_{l,l}, L_{l,u}, L_{u,l}, L_{u,u} according to labeled/unlabeled points, and f = (f_l, f_u)
- Closed-form solution for the unlabeled points: f_u = -L_{u,u}^{-1} L_{u,l} y_l
A small sketch follows.
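A minimal sketch (not from the slides) of the harmonic-function solution: order the nodes as (labeled, unlabeled), build the graph Laplacian L = D - W, and compute f_u = -L_{u,u}^{-1} L_{u,l} y_l. The graph and labels are toy data.

# Harmonic function solution on a small graph.
import numpy as np

W = np.array([[0, 1, 1, 0],     # node 0: labeled +1
              [1, 0, 0, 1],     # node 1: labeled -1
              [1, 0, 0, 0],     # node 2: unlabeled
              [0, 1, 0, 0]],    # node 3: unlabeled
             dtype=float)
y_l = np.array([+1., -1.])
n_l = 2

L = np.diag(W.sum(axis=1)) - W
L_uu = L[n_l:, n_l:]
L_ul = L[n_l:, :n_l]
f_u = -np.linalg.solve(L_uu, L_ul @ y_l)
print(f_u)   # node 2 follows its labeled neighbor (+1), node 3 follows (-1)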
117
Harmonic Function
- f_u = -L_{u,u}^{-1} L_{u,l} y_l
- This can be viewed as local propagation iterated until a global propagation is reached (sound familiar?)
119
Spectral Graph Transducer [Joachims, 2003]
- Start from the harmonic-function problem: min_{f ∈ R^N} C(f) = (1/4) f^T L f   s.t. f_i = y_i, 1 ≤ i ≤ n_l
- Soften the hard constraints into a penalty: + α \sum_{i=1}^{n_l} (f_i - y_i)²
120
Spectral Graph Transducer [Joachims, 2003]
- min_{f ∈ R^N} C(f) = (1/4) f^T L f + α \sum_{i=1}^{n_l} (f_i - y_i)²   s.t. \sum_{i=1}^{N} f_i² = N
- Solved as a constrained eigenvector problem
121
Manifold Regularization [Belkin, 2006]
- min_{f ∈ R^N} C(f) = (1/4) f^T L f + α \sum_{i=1}^{n_l} (f_i - y_i)²   s.t. \sum_{i=1}^{N} f_i² = N
- The second term is a loss function for misclassification; the constraint regularizes the norm of the classifier
122
Manifold Regularization [Belkin, 2006]
- Replace the squared error with a general loss function l(f(x_i), y_i) and an explicit norm regularizer:
  min_{f ∈ R^N} f^T L f + α \sum_{i=1}^{n_l} l(f(x_i), y_i) + γ ||f||²_{H_K}
123
Summary
- Construct a graph using pairwise similarity; the key quantity is the graph Laplacian, which captures the geometry of the graph
- The decision boundary is consistent with both the graph structure and the labeled examples
- Parameters: α, γ, and the similarity measure
124
Questions
- Does this approach rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
125
Application: Text Classification
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, removal of stopwords and rare words, and skipping headers
- #Docs: 3970, #words: 8014
(Figure: results comparing label propagation, the harmonic function, SVM, and KNN.)
126
Application: Text Classification
PRBEP: precision-recall break-even point.
127
Application: Text Classification
Improvement in PRBEP by SGT
128
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering
129
Transductive SVM
- Support vector machine: maximize the classification margin
- Decision boundary given a small number of labeled examples
130
Transductive SVM
- Decision boundary given a small number of labeled examples
- How should the decision boundary change given both labeled and unlabeled examples?
131
Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density
132
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and the unlabeled labels y_u
133
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and the unlabeled labels y_u
134
Transductive SVM
- Classification margin ω(X, y, f), where f(x) is the classification function
- Supervised learning: f* = argmax_{f ∈ H_K} ω(X, y, f)
- Semi-supervised learning: optimize over both f(x) and y_u:
  f* = argmax_{f ∈ H_K, y_u ∈ {-1,+1}^{n_u}} ω(X, y_l, y_u, f)
135
Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density, consistent with the resulting classification of the unlabeled data
- How to formulate this idea?
136
Transductive SVM: Formulation

Original SVM:
  {w*, b*} = argmin_{w, b} w^T w
  s.t. y_i (w^T x_i - b) ≥ 1,  i = 1, ..., n        (labeled examples)

Transductive SVM:
  {w*, b*} = argmin_{y_{n+1}, ..., y_{n+m}} argmin_{w, b} w^T w
  s.t. y_i (w^T x_i - b) ≥ 1,  i = 1, ..., n        (labeled examples)
       y_j (w^T x_j - b) ≥ 1,  j = n+1, ..., n+m    (unlabeled examples)

- Constraints for the unlabeled data
- A binary variable for the label of each unlabeled example
137
Computational Issue
- No longer a convex optimization problem
- Alternating optimization: fix the labels y_{n+1}, ..., y_{n+m} of the unlabeled examples and solve the resulting SVM for (w, b); then fix (w, b) and re-assign the labels; repeat
138
Summary
- Based on the maximum margin principle
- The classification margin is decided by the labeled examples and by the class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
139
Questions
- Does this approach rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?
140
Text Classification by TSVM
- 10 categories from the Reuters collection
- 3299 test documents
- 1000 informative words selected by the MI (mutual information) criterion
141
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
142
Co-training [Blum & Mitchell, 1998]
- Classify web pages into a category for students and a category for professors
- Two views of a web page:
  - Content: "I am currently the second year Ph.D. student ..."
  - Hyperlinks: "My advisor is ...", "Students: ..."
143
Co-training for Semi-Supervised Learning
144
Co-training for Semi-Supervised Learning
- For some pages it is easy to classify the type of the web page based on its content
- For other pages it is easier to classify the web page using its hyperlinks
145
Co-training
- Two representations for each web page:
  - Content representation: (doctoral, student, computer, university, ...)
  - Hyperlink representation: inlinks (e.g., Prof. Cheng), outlinks (e.g., Prof. Cheng)
146
Co-training
- Train a content-based classifier
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
148
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
149
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
150
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
151
Co-training
- Assume two views of the objects, i.e., two sufficient representations
- Key idea: augment the training examples of one view by exploiting the classifier of the other view
- Extension to multiple views; problem: how to find equivalent views
A small sketch of the co-training loop follows.
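A minimal sketch (not from the slides) of the co-training loop: two views of the data, one classifier per view, and in each round the most confidently classified unlabeled examples are added to the labeled pool. scikit-learn's MultinomialNB stands in for the per-view classifiers; the toy data and the per-round parameters are assumptions.

# Co-training with two views and confidence-based pseudo-labeling.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1, X2, y, labeled, unlabeled, rounds=5, per_round=2):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        for X in (X1, X2):                          # content view, hyperlink view
            clf = MultinomialNB().fit(X[labeled], y[labeled])
            probs = clf.predict_proba(X[unlabeled])
            conf = probs.max(axis=1)
            picks = np.argsort(-conf)[:per_round]    # most confident examples
            for p in picks:
                y[unlabeled[p]] = clf.classes_[np.argmax(probs[p])]
            for p in sorted(picks, reverse=True):
                labeled.append(unlabeled.pop(p))
    return y

# toy usage: 2 labeled + 4 unlabeled examples, two bag-of-words views
X1 = np.array([[3, 0], [0, 3], [2, 0], [0, 2], [1, 0], [0, 1]])
X2 = np.array([[0, 2], [2, 0], [0, 1], [1, 0], [0, 2], [2, 0]])
y = np.array([1, 0, -1, -1, -1, -1])   # -1 = not yet labeled
print(co_training(X1, X2, y, labeled=[0, 1], unlabeled=[2, 3, 4, 5]))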
152
A Few Words about Active Learning
- Active learning: select the most informative examples (in contrast to passive learning)
- Key question: which examples are informative?
- Uncertainty principle: the most informative example is the one that is most uncertain to classify
- Need a measure of classification uncertainty
153
A Few Words about Active Learning
- Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty = degree of disagreement among committee members
- SVM-based approach: classification uncertainty measured by the distance to the decision boundary
- Simple but very effective approaches
154
Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised clustering algorithms
155
Semi-supervised Clustering
Clustering data into two clusters
156
Semi-supervised Clustering
- Clustering data into two clusters
- Side information: must-links vs. cannot-links
157
Semi-supervised Clustering
- Also called constrained clustering
- Two types of approaches: restricted data partitions; distance metric learning
158
Restricted Data Partition
- Require data partitions to be consistent with the given links
- Links as hard constraints, e.g., constrained K-means (Wagstaff et al., 2001)
- Links as soft constraints, e.g., Metric Pairwise Constraints K-means (Basu et al., 2004)
A small soft-constraint sketch follows.
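A minimal illustrative sketch (not the exact algorithm from the slides): k-means where the assignment step adds a penalty for violating must-link / cannot-link constraints, in the spirit of soft-constrained clustering. The penalty weight, toy data, and constraints are assumptions.

# Soft-constrained k-means: penalize assignments that violate links.
import numpy as np

def constrained_kmeans(X, k, must, cannot, penalty=10.0, iters=20,
                       rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i in range(len(X)):
            costs = np.linalg.norm(centers - X[i], axis=1) ** 2
            for (a, b) in must:                  # penalize splitting a must-link
                if b == i:
                    a, b = b, a
                if a == i:
                    costs += penalty * (np.arange(k) != assign[b])
            for (a, b) in cannot:                # penalize merging a cannot-link
                if b == i:
                    a, b = b, a
                if a == i:
                    costs += penalty * (np.arange(k) == assign[b])
            assign[i] = np.argmin(costs)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

X = np.array([[0., 0.], [0.5, 0.], [5., 5.], [5.5, 5.], [2.5, 2.5]])
print(constrained_kmeans(X, 2, must=[(4, 0)], cannot=[(4, 2)]))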
159
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: a partition that satisfies the must-link and cannot-link -- "Yes")
160
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: another partition that satisfies the must-link and cannot-link -- "Yes")
161
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
(Figure: a partition that violates the constraints -- "No")
162
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: a partition that satisfies the must-link and cannot-link -- penalty = 0)
163
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: another partition that satisfies the constraints -- penalty = 0)
164
Restricted Data Partition
- Soft constraints: penalize the clustering if it violates some links
(Figure: a partition that violates one constraint -- penalty = 1)
165
Distance Metric Learning
- Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link, shorten the distance for a must-link
- Apply K-means with pairwise distances measured by the learned distance metric
(Figure: data transformed by the learned distance metric.)
166
Example of Distance Metric Learning
- Solid lines: must-links; dotted lines: cannot-links
(Figure: 2D data projection using the Euclidean distance metric vs. the learned distance metric.)
167
BoostCluster [Liu, Jin & Jain, 2007]
- A general framework for semi-supervised clustering: improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges:
  - How to influence an arbitrary clustering algorithm with side information? Encode the constraints into the data representation
  - How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance
168
BoostCluster
- Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
(Flow diagram: data + pairwise constraints -> new data representation -> clustering algorithm -> clustering results -> kernel matrix -> ... -> final results)
169
BoostCluster
- Step 1: find the best data representation that encodes the unsatisfied pairwise constraints
170
BoostCluster
- Step 2: obtain the clustering results given the new data representation
171
BoostCluster
- Step 3: update the kernel matrix with the clustering results
172
BoostCluster
- Run the procedure iteratively
173
BoostCluster
- Finally, compute the final clustering result
174
Summary
- Clustering data under given pairwise constraints: must-links vs. cannot-links
- Two types of approaches: restricted data partitions (either soft or hard); distance metric learning
- How to acquire links/constraints? Manual assignment, or derived from side information (hyperlinks, citations, user logs, etc.); they may be noisy and unreliable
175
Application: Document Clustering [Basu et al., 2004]
- 300 docs from three topics (atheism, baseball, space) of 20-newsgroups
- 3251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x: different variants of constrained clustering algorithms
176
Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to text classification, adaptive filtering, collaborative filtering and ranking
- Semi-supervised learning and its application to text classification
- Emerging research directions
177
Efficient Learning
- In IR we have a massive amount of data, but most learning algorithms are relatively slow: it is difficult to handle millions of documents
- How to improve scalability?
  - Sampling: use only part of the data
  - Stochastic optimization: update the model one example at a time (related to online learning)
  - More interestingly, more examples may mean more efficient training (Srebro, ICML 2008)
178
Kernel Learning
- Kernels play a central role in machine learning
- Kernel functions can be learned from data: kernel alignment, multiple kernel learning, non-parametric kernel learning, ...
- Kernel learning is well suited to IR: similarity measures are key to IR, and kernel learning allows us to identify the optimal similarity measure automatically
179
Transfer Learning
- Different document categories are correlated, so we should be able to borrow information from one class when training another class
- Key question: what to transfer between classes? Representations, model priors, similarity measures, ...
180
Active Learning: IR Applications
- Relevance feedback (text retrieval or image retrieval)
- Text classification
- Adaptive information filtering
- Collaborative filtering
- Query rewriting
181
Discriminative Language Models
- Language models have been shown to be effective for information retrieval
- But most language models are generative and thus lack discriminative power
- Key difficulty for discriminative language models: there are no output labels! Possible directions: side information; mixtures of generative and discriminative models
182
References
- A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
- T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Journal of Information Retrieval, 2001.
- F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. ICML 2003.
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Systems, 2004.
- A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT 1998.
- D. Blei and M. Jordan. Variational methods for the Dirichlet process. ICML 2004.
- T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2), 2001.
- D. Blei, A. Ng and M. Jordan. Latent Dirichlet allocation. NIPS 2002.
- R. Jin, C. Ding, and F. Kang. A probabilistic approach for optimizing spectral clustering. NIPS 2005.
- D. Zhou, B. Scholkopf, and T. Hofmann. Semi-supervised learning on directed graphs. NIPS 2005.
- X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.
- T. Joachims. Transductive learning via spectral graph partitioning. ICML 2003.
183
References
- A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. ICML 1998.
- D. A. Cohn, Z. Ghahramani and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.
- S. Tong and E. Chang. Support vector machine active learning for image retrieval. ACM Multimedia, 2001.
- X. Shen and C. Zhai. Active feedback in ad hoc information retrieval. SIGIR 2005.
- J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
- X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li. Multi-model similarity propagation and its application for web image retrieval. ACM Multimedia, 2004.
- M. Belkin, P. Niyogi and V. Sindhwani. Manifold regularization. Technical report, Univ. of Chicago, 2006.
- K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. ICML 2001.
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. SIGKDD 2004.
184
References
- X. He, B. Rey, W. V. Zhang, R. Jones. Query rewriting using active learning for sponsored search. SIGIR 2007.
- Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on Bayesian active learning. ICML 2003.
- Z. Xu and R. Akella. A Bayesian logistic regression model for active relevance feedback. SIGIR 2008.
- G. Schohn and D. Cohn. Less is more: active learning with support vector machines. ICML 2000.
- M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Machine Learning, 2004.
- J. Rocchio. Relevance feedback in information retrieval. In The SMART System: Experiments in Automatic Document Processing. Prentice Hall, 1971.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT 1992.
- Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
- R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. KDD Explorations, 2008.
- T.-Y. Liu. Tutorial: Learning to rank.
- S. Chakrabarti. Learning to rank in vector spaces and social networks. WWW 2007.
185
Thank You
God, it is finally over!