
Page 1: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

1

Machine Learning for Information Retrieval
Rong Jin, Michigan State University

Yi Zhang, University of California Santa Cruz

Page 2: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

2

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Page 3: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

3

Roadmap of Information Retrieval

[Roadmap diagram] Data → retrieval applications (search, filtering, categorization, summarization, clustering) → information access; Data → mining/learning applications (data analysis, extraction, mining, visualization) → knowledge acquisition

Why Machine Learning is Important?

Page 4: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

4

Text Categorization

Page 5: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

5

Text Categorization
Open Directory Project: the largest human-edited directory of the Web
Manual classification: over 4 million sites and 590K categories
Need to automate the process

Page 6: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

6

Document Clustering

Page 7: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

7

Question Answering

Classify question; identify answers; match questions and answers

Page 8: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

8

Image Retrieval

Image segmentation by data clustering

Page 9: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

9

Image Retrieval by Key Points

Key features → visual words: data clustering
[figure: key points quantized into visual words b1–b8; an image represented by the words b1 b2 b3 b4]

Page 10: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

10

Image Retrieval by Text Query
Automatically annotate images with textual words; retrieve images with textual queries
Key technique: classification, where each keyword is a different category

Page 11: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

11

Information Extraction

Title: J2EE Developer
Length: 4 months
Salary: ….
Location:
Reference:

Web page (free-style text) → relational DB
Structure prediction by Hidden Markov Model and Markov Random Field

Page 12: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

12

Citation/Link Analysis

Page 13: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

13

Recommender Systems

Page 14: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

14

Recommender Systems

User 1 ? 5 3 4 2

User 2 4 1 5 ? 5

User 3 5 ? 4 2 5

User 4 1 5 3 5 ?

Sparse data problem: a lot of missing values

Page 15: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

15

Recommender System

[figure: users grouped into User Class I and User Class II, movies grouped into Movie Type I, II, and III; each (user class, movie type) pair has a rating distribution, e.g., p(4)=1/4, p(5)=3/4 or p(1)=1/2, p(2)=1/2]

Fill in the sparse data by data clustering

Page 16: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

16

One More Reason for ML

$ 1,000,000 award

Page 17: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

17

Review of Basic Prob. Concepts
Probability Pr(A): "the fraction of possible worlds in which A is true"
Examples:
A = Your paper will be accepted by SIGIR 2008
A = It rains in Singapore
A = A document contains the word "IR"
[figure: event space of all possible worlds (total area 1), with the region where A is true]

Page 18: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

18

Conditional Probability
SIGIR2008 = "a document contains the phrase SIGIR 2008"
SINGAPORE = "a document contains the word singapore"

P(SINGAPORE) = 0.000001 P(SIGIR2008) = 0.00000001 P(SINGAPORE|SIGIR2008) = 1/2

“Singapore” is rare and “SIGIR 2008” is rarer, but if you have a document with SIGIR 2008, there’s a 50-50 chance you’ll find the word “Singapore” in it

Page 19: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

19

Conditional Prob.

[Venn diagram: regions where A is true and B is true]

Definition: Pr(A|B) = Pr(A,B) / Pr(B)
Chain rule: Pr(A,B) = Pr(B) · Pr(A|B)

Page 20: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

20

Conditional Prob.

[Venn diagram: regions where A is true and B is true]

Definition: Pr(A|B) = Pr(A,B) / Pr(B)
Chain rule: Pr(A,B) = Pr(B) · Pr(A|B)

Independent variables: Pr(A|B) = Pr(A), so Pr(A,B) = Pr(B) · Pr(A)

Page 21: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

21

Conditional Prob.

Definition: Pr(A|B) = Pr(A,B) / Pr(B)
Chain rule: Pr(A,B) = Pr(B) · Pr(A|B)
Independence: Pr(A|B) = Pr(A), Pr(A,B) = Pr(B) · Pr(A)
Marginal probability: Pr(B) = Σ_{j=1..k} Pr(B, A = a_j)

Page 22: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

22

Bayes’ Rule

Pr(H|E) ∝ Pr(H) × Pr(E|H)
Posterior ∝ Prior × Likelihood
Hypothesis H, Evidence E
Inference: Pr(H|E);  Information: Pr(E|H)

Page 23: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

23

Bayes’ Rule

Pr(H|E) ∝ Pr(H) × Pr(E|H)
Posterior ∝ Prior × Likelihood

R: it rains;  W: the grass is wet
Inference: Pr(R|W);  Information: Pr(W|R)

Pr(W|R):       R      ¬R
      W       0.7     0.4
      ¬W      0.3     0.6
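As a worked illustration of this rain/wet-grass table: the prior Pr(R) is not given on the slide, so the value 0.3 below is only an assumed number, not part of the tutorial.

Pr(R|W) ∝ Pr(R) × Pr(W|R) = 0.3 × 0.7 = 0.21
Pr(¬R|W) ∝ Pr(¬R) × Pr(W|¬R) = 0.7 × 0.4 = 0.28
Pr(R|W) = 0.21 / (0.21 + 0.28) ≈ 0.43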

Page 24: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

24

Statistical Inference

Learning stage: estimate a parametric model for Pr(E|H)
Inference stage: for a given observation E, compute Pr(H|E) for each hypothesis H and choose the hypothesis with the largest Pr(H|E)

Pr(H|E) ∝ Pr(H) × Pr(E|H)   (Posterior ∝ Prior × Likelihood)

Page 25: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

25

Example: Language Model (LM) for IR

Documents d1 … d1000; query q: 'Singapore SIGIR'. Which documents should be returned?
Estimate a statistical model θ_d for each document (Hypothesis H)
Estimate the likelihood p(q|θ_d) of the query under each document model (Evidence E)
Rank documents by Pr(H|E) ∝ Pr(H) × Pr(E|H)
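A minimal sketch of the query-likelihood ranking just described, assuming a unigram document model with Jelinek-Mercer smoothing; the smoothing weight 0.5 and the toy documents are illustrative only, not from the tutorial.

```python
import math
from collections import Counter

def lm_score(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """log p(q | theta_d) under a unigram model with Jelinek-Mercer smoothing."""
    doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = {"d1": "sigir 2008 will be held in singapore".split(),
        "d2": "machine learning for information retrieval".split()}
collection = [t for toks in docs.values() for t in toks]
coll_counts, coll_len = Counter(collection), len(collection)
query = "singapore sigir".split()
print(sorted(docs, key=lambda d: lm_score(query, docs[d], coll_counts, coll_len),
             reverse=True))   # -> ['d1', 'd2']
```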

Page 26: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

26

Probability Distributions
- Binomial distribution; Beta distribution
- Multinomial distribution; Dirichlet distribution → language models, smoothing LM
- Gaussian distribution; Laplacian distribution → sparse solutions, L1 regularizer

Page 27: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

27

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Page 28: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

28

Supervised Learning: Basic Setting
Given training data: {(x1,y1), (x2,y2), …, (xN,yN)}
Learning: infer a function f(x) from the training data
Inference: predict future outcomes y = f(x) given x

Regression: continuous y, e.g., f(x) = ax − b   [figure: a line fitted to (x, y) points]

Page 29: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

29

Supervised Learning: Basic Setting
Given training data: {(x1,y1), (x2,y2), …, (xN,yN)}
Learning: infer a function f(x) from the training data
Inference: predict future outcomes y = f(x) given x

Classification: discrete y, e.g., x = (x1, x2), decision boundary wᵀx − b = 0, f(x) = sign(wᵀx − b)   [figure: classes y = +1 and y = −1 separated by a line]

Page 30: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

30

Examples
Text categorization
  Input x: word histogram
  Output y: document category (e.g., 1 for "domestic economics", 2 for "politics", 3 for "sports", and 4 for "others")
Question answering: classify question types
  Input x: a parsing tree of a question
  Output y: question type (e.g., when, where, …)

Page 31: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

31

K Nearest-Neighbor (KNN) Classifiers

– Compute distance to other training documents

– Identify the k nearest neighbors

– Determine the class of the unknown record by the class labels of its closest neighbors

Unknown record

Based on Tan, Steinbach, and Kumar

Page 32: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

32

K Nearest-Neighbor (KNN) Classifiers
Compute the distance between two points: Euclidean distance, cosine distance, Kullback-Leibler distance, Bregman distance, …
The distance function can also be learned from data (distance metric learning)
Determine the class: majority vote, or weighted majority vote

Bregman distance: generated by a convex function
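A minimal KNN sketch of the steps above (Euclidean distance, majority vote); the toy data and k = 3 are illustrative only.

```python
import math
from collections import Counter

def knn_predict(x, train, k=3):
    """train: list of (vector, label); predict by majority vote of the k nearest points."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.1), -1), ((0.2, 0.0), -1),
         ((0.9, 1.0), +1), ((1.0, 0.8), +1), ((0.8, 0.9), +1)]
print(knn_predict((0.7, 0.7), train, k=3))   # -> 1
```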

Page 33: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

33

K Nearest-Neighbor (KNN) Classifiers
Decide K (# of nearest neighbors): a bias-variance tradeoff
Choose K by cross validation (or leave-one-out) on a training/validation split
[figure: decision regions for k = 1 vs. k = 4]

Page 34: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

34

K Nearest-Neighbor (KNN) Classifiers
Curse of dimensionality: many attributes are irrelevant; in high dimensions, distances become less informative
[figure: distribution of squared distances for 1000 random data points in 1000 dimensions]

Page 35: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

35

KNN for Collaborative Filtering
Collaborative filtering: will user u like item b?
Assumption: users with similar tastes are likely to have similar preferences on items
Make filtering decisions for one user based on the feedback from other users that are similar to this user
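A minimal user-based KNN sketch of this idea, mirroring the toy rating table on the next slides: cosine similarity over co-rated items and a 1-nearest-neighbor prediction (a weighted average over the k most similar users is the more common variant); the data and similarity choice are illustrative only.

```python
import math

ratings = {  # user -> {item: rating}
    "u1": {"m1": 1, "m2": 5, "m3": 3, "m4": 4, "m5": 3},
    "u2": {"m1": 4, "m2": 1, "m3": 5, "m4": 2, "m5": 5},
    "u3": {"m1": 2, "m3": 3, "m4": 5, "m5": 4},   # rating for m2 is missing
}

def cosine(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item):
    """1-NN: copy the rating of the most similar user who has rated `item`."""
    candidates = [(cosine(ratings[user], r), r[item])
                  for u, r in ratings.items() if u != user and item in r]
    best_sim, best_rating = max(candidates)
    return best_rating

print(predict("u3", "m2"))   # -> 5 (user 3's ratings are closest to user 1's)
```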

Page 36: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

36

KNN for Collaborative Filtering

User 1:  1  5  3  4  3
User 2:  4  1  5  2  5
User 3:  2  ?  3  5  4

Page 37: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

37

KNN for Collaborative Filtering

User 1:  1  5  3  4  3
User 2:  4  1  5  2  5
User 3:  2  ?  3  5  4

User 3 is most similar to User 1, so the missing rating is predicted as 5
The similarity measure of user interests can be learned

Page 38: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

38

Paradigm for Supervised Learning
- Gather training data
- Determine the input features (i.e., what is x?), e.g., bags of words for text categorization; feature engineering is very, very important
- Determine the functional form f(x): linear or nonlinear (what is the functional form for KNN?)
- Determine the learning algorithm: learn optimal parameters (optimization, cross validation); probabilistic or non-probabilistic
- Test on a test set

Page 39: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

39

Bayesian Learning
Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)   (Posterior ∝ Prior × Likelihood)

MAP Learning: Maximum A Posteriori
Hypothesis space: H = {Y1, Y2, …}
Y* = argmax_{Y∈H} Pr(Y|X) = argmax_{Y∈H} Pr(Y) Pr(X|Y)

Page 40: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

40

Bayesian Learning
Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)   (Posterior ∝ Prior × Likelihood)

MLE Learning: Maximum Likelihood Estimation
Hypothesis space: H = {Y1, Y2, …}
Y* = argmax_{Y∈H} Pr(Y|X) = argmax_{Y∈H} Pr(Y) Pr(X|Y); MLE keeps only the likelihood term Pr(X|Y)

Page 41: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

41

Bayesian Learning: Conjugate Prior

The posterior Pr(Y|X) has the same form as the prior Pr(Y); e.g., the Dirichlet distribution is the conjugate prior for the multinomial distribution (widely used in language models)

Hypothesis space: H = {Y1, Y2, …}
Y* = argmax_{Y∈H} Pr(Y|X) = argmax_{Y∈H} Pr(Y) Pr(X|Y)

Page 42: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

42

Example: Text Categorization

Web page for a professor or a student?
What is Y? What is feature X?

Y* = argmax_{Y∈H} Pr(Y) Pr(X|Y)

How to estimate Pr(Y = Student) or Pr(Y = Prof.)? How to estimate Pr(w|Y)? Counting!
1. Counting = MLE    2. Counting + pseudo-counts = MAP

Page 43: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

43

Naïve Bayes

From word probabilities Pr(w|Y) to document probability Pr(X|Y):
Pr(X|Y) ≈ [Pr(w1|Y)]^x1 ··· [Pr(wV|Y)]^xV
Vocabulary [w1, w2, …, wV]; word-count vector X = (x1, x2, …, xV)

f(X) = log [ Pr(X|Y=P) Pr(Y=P) / (Pr(X|Y=S) Pr(Y=S)) ]
     = log [ Pr(Y=P) / Pr(Y=S) ]                          (threshold constant)
       + x1 log [ Pr(w1|Y=P) / Pr(w1|Y=S) ]
       + … + xV log [ Pr(wV|Y=P) / Pr(wV|Y=S) ]           (weights for words)
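A minimal multinomial Naïve Bayes sketch of the estimates above, using counting with add-one pseudo-counts (the MAP variant mentioned two slides back); the tiny professor/student corpus is illustrative only.

```python
import math
from collections import Counter

docs = [("professor teaches course research grant", "P"),
        ("student takes course homework advisor", "S"),
        ("research publications students teaching", "P"),
        ("phd student thesis advisor", "S")]

vocab = {w for text, _ in docs for w in text.split()}
word_counts = {"P": Counter(), "S": Counter()}
class_counts = Counter()
for text, y in docs:
    class_counts[y] += 1
    word_counts[y].update(text.split())

def log_score(text, y):
    """log Pr(Y=y) + sum_w x_w * log Pr(w|Y=y), with add-one smoothing."""
    total = sum(word_counts[y].values())
    score = math.log(class_counts[y] / sum(class_counts.values()))
    for w in text.split():
        score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
    return score

test = "student homework advisor"
print(max(["P", "S"], key=lambda y: log_score(test, y)))   # -> S
```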

Page 44: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

44

Naïve Bayes: A Linear Classifier

f(X) = log [ Pr(X|Y=P) Pr(Y=P) / (Pr(X|Y=S) Pr(Y=S)) ]
     = log [ Pr(Y=P) / Pr(Y=S) ] + x1 log [ Pr(w1|Y=P) / Pr(w1|Y=S) ] + … + xV log [ Pr(wV|Y=P) / Pr(wV|Y=S) ]

i.e., a linear decision function f(x) = sign(wᵀx − b)   [figure: linear boundary between y = +1 and y = −1]

Logistic Regression: directly model f(x) or Pr(Y|X)

Page 45: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

45

Logistic Regression (LR)

log [ Pr(X|Y=P) Pr(Y=P) / (Pr(X|Y=S) Pr(Y=S)) ]
  = log [ Pr(Y=P) / Pr(Y=S) ] + x1 log [ Pr(w1|Y=P) / Pr(w1|Y=S) ] + … + xV log [ Pr(wV|Y=P) / Pr(wV|Y=S) ]
  = b + t1 x1 + … + tV xV

Pr(y = ±1 | X) = 1 / (1 + exp[−y(t1 x1 + … + tV xV + b)])

t1 … tV are unknown weights that are learned from data by maximum likelihood estimation (MLE)

Page 46: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

46

Logistic Regression (LR)
Learning parameters: b, t1 … tV
Maximum Likelihood Estimation (MLE):
(t*, b*) = argmax_{t,b} Σ_{i=1..N} log Pr(y_i | X_i; t, b)

Page 47: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

47

Logistic Regression (LR)
Learning parameters: b, t1 … tV
Maximum Likelihood Estimation (MLE):
(t*, b*) = argmax_{t,b} Σ_{i=1..N} log Pr(y_i | X_i; t, b)

MLE can overfit → worse performance on unseen data
Maximum A Posteriori (MAP): add a prior on the weights,
(t*, b*) = argmax_{t,b} Σ_{i=1..N} log Pr(y_i | X_i; t, b) + log Pr(t)

Why only word weights?
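A minimal sketch of MLE vs. MAP training for this logistic model by gradient ascent; the `l2` term plays the role of a zero-mean Gaussian prior Pr(t) on the weights, and the toy data, learning rate, and iteration count are illustrative only.

```python
import math

def train_lr(data, l2=0.0, lr=0.1, epochs=200):
    """data: list of (feature vector, y in {-1,+1}); returns (weights t, bias b)."""
    dim = len(data[0][0])
    t, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(ti * xi for ti, xi in zip(t, x)) + b
            # gradient of log Pr(y|x) w.r.t. z is (1 - sigmoid(y*z)) * y;
            # the l2 term is the gradient of the Gaussian log-prior on t
            g = (1.0 - 1.0 / (1.0 + math.exp(-y * z))) * y
            t = [ti + lr * (g * xi - l2 * ti) for ti, xi in zip(t, x)]
            b += lr * g
    return t, b

data = [([1.0, 0.0], +1), ([0.9, 0.2], +1), ([0.1, 1.0], -1), ([0.0, 0.8], -1)]
t_mle, b_mle = train_lr(data, l2=0.0)   # MLE
t_map, b_map = train_lr(data, l2=0.1)   # MAP with a Gaussian prior on t
print([round(v, 2) for v in t_mle], [round(v, 2) for v in t_map])
```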

Page 48: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

48

Learning Logistic Regression

Pr(y = ±1 | X) = 1 / (1 + exp[−y f(X)])

(t*, b*) = argmin_{t,b} Σ_{i=1..N} −log Pr(y_i | X_i; t, b)

−log Pr(y|X) is a loss function: it measures the mismatch between y and f(X); other loss functions are possible
[figure: loss as a function of y·f(X) for several loss functions]

Page 49: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

49

Logistic Regression (LR)
Closely related to Maximum Entropy (ME): logistic regression and maximum entropy are duals of each other
Advantages of LR: a Bayesian approach; convenient for incorporating prior knowledge; useful for semi-supervised learning, transfer learning, …

Page 50: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

50

Comparison of Classifiers

From Li and Yang, SIGIR 2003

                      Macro F1   Micro F1
KNN                   0.8557     0.5975
Naïve Bayes           0.8009     0.4737
Logistic Regression   0.8748     0.6084

Page 51: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

Comparison of Classifiers

Logistic Regression:
1. Models Pr(Y|X)
2. Models the decision boundary
3. NB is a special case of LR
4. Requires a numerical solution; needs a large number of training examples (slow convergence)

Naïve Bayes:
1. Models Pr(X|Y) & Pr(Y)
2. Models the input patterns (X)
3. Simple solution; needs only a small number of training examples (fast convergence)

Page 52: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

Comparison of Classifiers

Discriminative model:
1. Models Pr(Y|X)
2. Models the decision boundary
3. Broader model assumptions
4. Requires a numerical solution; needs a large number of training examples (slow convergence)

Generative model:
1. Models Pr(X|Y) & Pr(Y)
2. Models the input patterns (X)
3. Simple solution; needs only a small number of training examples (fast convergence)

Rule of Thumb
Use a discriminative model if: 1. enough training examples, 2. enough computational power, 3. classification accuracy is important
Use a generative model if: 1. lack of training examples, 2. lack of computational power, 3. training time is more important, 4. you need a quick test

Page 53: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

53

Comparison of Classifiers

Discriminative model: models Pr(Y|X) and the decision boundary; broader model assumptions; requires a numerical solution; needs many training examples (slow convergence)
Generative model: models Pr(X|Y) & Pr(Y) and the input patterns (X); simple solution; needs few training examples (fast convergence)

What about KNN?

Page 54: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

54

Other Discriminative Classifiers Decision tree

Aggregation of decision rules via a tree

Easy interpretation

Page 55: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

55

Other Discriminative Classifiers
Decision tree: aggregation of decision rules via a tree; easy interpretation
Support vector machine: a maximum-margin classifier; among the best text classifiers
[figure: maximum-margin boundary between y = +1 and y = −1 in the (x1, x2) plane]

Page 56: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

56

Comparison of Classifiers

From Li and Yang, SIGIR 2003

                        Macro F1   Micro F1
KNN                     0.8557     0.5975
Naïve Bayes             0.8009     0.4737
Logistic Regression     0.8748     0.6084
Support vector machine  0.8857     0.5975

Page 57: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

57

Ensemble Learning
Generate multiple classifiers; classify by (weighted) majority vote
Bagging & boosting: train each classifier on a different sampling of the training data
[figure: training set D sampled into D1, D2, …, Dk, producing classifiers h1, h2, …, hk]
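A minimal bagging sketch of this picture: bootstrap samples plus a majority vote. The base learner below is a toy 1-nearest-neighbor rule just to keep the sketch dependency-free; the sample count and data are illustrative only.

```python
import random
from collections import Counter

def nearest_neighbor_classifier(sample):
    """Base learner: 1-nearest-neighbor on a bootstrap sample (squared Euclidean)."""
    def predict(x):
        return min(sample, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]
    return predict

def bagging(train, n_models=11):
    models = []
    for _ in range(n_models):
        sample = [random.choice(train) for _ in train]   # bootstrap sample of D
        models.append(nearest_neighbor_classifier(sample))
    def ensemble(x):   # (unweighted) majority vote over h1 ... hk
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return ensemble

random.seed(1)
train = [((0.1, 0.2), -1), ((0.2, 0.1), -1), ((0.9, 0.8), +1), ((0.8, 0.95), +1)]
h = bagging(train)
print(h((0.85, 0.9)), h((0.15, 0.1)))   # -> 1 -1
```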

Page 58: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

58

Ensemble Learning
Bias-variance tradeoff: bagging reduces variance, boosting reduces bias
[figure: error caused by variance vs. error caused by bias; example with 50 decision trees combined by majority vote]

Page 59: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

59

Multi-Class Classification

More than 2 classes; multiple labels can be assigned to each example
[table: examples X1 … XN with 0/1 label indicators for classes c1, c2, …, cK]
Approaches:
- One against all: train K binary classifiers f1(X), …, fK(X)
- ECOC coding

Page 60: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

60

Multi-Class Classification

More than 2 classes; multiple labels can be assigned to each example
Approaches: one against all, ECOC coding
ECOC coding: assign each class c1, c2, c3, … a binary codeword; the number of coding bits equals the number of binary classifiers f1(X), f2(X), f3(X), …; a test example is assigned to the class whose codeword best matches the classifiers' outputs
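A minimal one-against-all sketch of the first approach (the ECOC variant would replace the per-class classifiers with one classifier per coding bit). The centroid-similarity scorer is only a stand-in for a real binary classifier, and the data are illustrative.

```python
def train_one_vs_all(data, classes):
    """data: list of (vector, class). For class c, score_c(x) is a toy centroid similarity."""
    def make_scorer(c):
        pos = [x for x, y in data if y == c]
        centroid = [sum(col) / len(pos) for col in zip(*pos)]
        return lambda x: sum(a * b for a, b in zip(centroid, x))   # stand-in for f_c(X)
    return {c: make_scorer(c) for c in classes}

def predict(scorers, x):
    return max(scorers, key=lambda c: scorers[c](x))

data = [((1.0, 0.0, 0.0), "c1"), ((0.9, 0.1, 0.0), "c1"),
        ((0.0, 1.0, 0.1), "c2"), ((0.1, 0.9, 0.0), "c2"),
        ((0.0, 0.1, 1.0), "c3"), ((0.1, 0.0, 0.9), "c3")]
scorers = train_one_vs_all(data, ["c1", "c2", "c3"])
print(predict(scorers, (0.05, 0.05, 0.95)))   # -> c3
```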

Page 61: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

61

Multi-Class Classification

More than 2 classes; multiple labels can be assigned to each example
[table: examples X1 … XN with 0/1 label indicators for classes c1, c2, …, cK]
Approaches:
- One against all: binary classifiers f1(X), …, fK(X)
- ECOC coding
- Transfer learning

Page 62: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

62

Beyond Vector Inputs

gene sequence classification

question type classification

Character Recognition

sequences trees graphs

Page 63: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

63

Beyond Vector Inputs: Kernel Kernel function k(x1, x2)

Assess the similarity between two objects x1, x2

Don’t have to represent objects by vectors

Page 64: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

64

Beyond Vector Inputs: Kernel
Kernel function k(x1, x2): assesses the similarity between two objects x1, x2; we don't have to represent objects by vectors
Vector representation by the kernel function: given training examples x1, …, xN, represent any example x by the vector [k(x1, x), k(x2, x), …, k(xN, x)]
Related to the representer theorem
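A minimal sketch of this kernel-based vector representation (an "empirical kernel map"). The string kernel below simply counts shared character 3-grams and is only an illustrative choice, not the tutorial's.

```python
def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def string_kernel(a, b):
    """Toy kernel: number of shared character 3-grams between two strings."""
    return len(ngrams(a) & ngrams(b))

train = ["information retrieval", "machine learning", "image retrieval"]

def to_vector(x):
    # represent any example x by [k(x1, x), k(x2, x), ..., k(xN, x)]
    return [string_kernel(xi, x) for xi in train]

print(to_vector("text retrieval"))
```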

Page 65: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

65

Beyond Vector Inputs

String Kernel, Tree Kernel, Graph Kernel

sequences trees graphs

Page 66: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

66

Kernel for Nonlinear Classifiers

Page 67: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

67

Words Associated with Kernels
- Reproducing Kernel Hilbert Space (RKHS)
- Vector representation; Mercer's conditions
- Good kernels; representer theorem
- Kernel learning (e.g., multiple kernel learning)

Page 68: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

68

Sequence Prediction

Part-of-speech tagging: but all the tags in a sentence are related
Hidden Markov Model (HMM), Conditional Random Field (CRF), and Maximum Margin Markov Network (M3)

[He] [reckons] [the] [current] [account] [deficit]
[PRP] [VBZ] [DT] [JJ] [NN] [NN]

Pr(NN | "account") → Pr(NN | "account", tag-for-"current")

Page 69: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

69

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Page 70: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

70

Topics of Semi-supervised Learning

- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph partitioning based approaches, Transductive Support Vector Machine (TSVM), co-training
- Semi-supervised data clustering

Page 71: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

71

Spectrum of Learning Problems

Page 72: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

72

What is Semi-supervised Learning?
Learning a function f(x): X → Y from a mixture of labeled and unlabeled examples

Labeled data:   L = {(x1, y1), …, (x_nl, y_nl)}
Unlabeled data: U = {x1, …, x_nu}
Total number of examples: N = nl + nu

Page 73: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

73

Why Semi-supervised Learning?
Labeling is expensive and difficult
Labeling is unreliable: e.g., segmentation applications may need multiple experts
Unlabeled examples are easy to obtain in large numbers: e.g., web pages, text documents, etc.

Page 74: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

74

Semi-supervised Learning Problems
Classification: transductive (predict labels of the unlabeled data) vs. inductive (learn a classification function)
Clustering (constrained clustering)
Ranking (semi-supervised ranking)
Almost every learning problem has a semi-supervised counterpart.

Page 75: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

75

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised data clustering

Page 76: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

76

Why Unlabeled Data Could Be Helpful
Clustering assumption: unlabeled data help decide the decision boundary f(X) = 0
Manifold assumption: unlabeled data help decide the decision function f(X)

Page 77: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

77

Clustering Assumption

?

Page 78: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

78

Clustering Assumption

?

Points with the same label are connected through high-density regions, thereby defining a cluster
Clusters are separated by low-density regions

Does this suggest a simple algorithm for semi-supervised learning?

Page 79: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

79

Manifold Assumption

Regularize the classification function f(x)
Graph representation: vertex = training example (labeled or unlabeled); edge = similar examples
x1 and x2 are connected → |f(x1) − f(x2)| is small
[figure: graph over labeled and unlabeled examples]

Page 80: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

80

Manifold Assumption
Data lie on a low-dimensional manifold; the classification function f(x) should "follow" the data manifold
Graph representation: vertex = training example (labeled or unlabeled); edge = similar examples

Page 81: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

81

Statistical View

Generative model for classification:
Pr(X, Y | θ, η) = Pr(X | Y; θ) · Pr(Y | η)
[graphical model: Y → X, with parameters θ on X and η on Y]

Page 82: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

82

Statistical View

Generative model for classification:
Pr(X, Y | θ, η) = Pr(X | Y; θ) · Pr(Y | η)
Unlabeled data help estimate Pr(X | Y; θ)  (clustering assumption)

Page 83: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

83

Statistical View
Discriminative model for classification:
Pr(X, Y | θ, μ) = Pr(X | μ) · Pr(Y | X; θ)
[graphical model: X → Y, with parameters μ on X and θ on Y given X]

Page 84: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

84

Statistical View
Discriminative model for classification:
Pr(X, Y | θ, μ) = Pr(X | μ) · Pr(Y | X; θ)
Unlabeled data help regularize θ via a prior Pr(θ | X)  (manifold assumption)

Page 85: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

85

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised data clustering

Page 86: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

86

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised data clustering

Page 87: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

87

Label Propagation: Key Idea

A decision boundary based on the labeled examples is unable to take into account the layout of the data points

How to incorporate the data distribution into the prediction of class labels?

Page 88: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

88

Label Propagation: Key Idea

Connect the data points that are close to each other

Page 89: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

89

Label Propagation: Key Idea

Connect the data points that are close to each other

Propagate the class labels over the connected graph

Page 90: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

90

Label Propagation: Key Idea Connect the data

points that are close to each other

Propagate the class labels over the connected graph

Different from the K Nearest Neighbor

Page 91: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

91

Label Propagation: Representation
Adjacency matrix: W ∈ {0,1}^{N×N}, where W_{i,j} = 1 if xi and xj are connected, 0 otherwise
Similarity matrix: W ∈ R_+^{N×N}, where W_{i,j} is the similarity between xi and xj
Degree matrix: D = diag(d1, …, dN), where di = Σ_{j≠i} W_{i,j}

Page 92: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

92

Label Propagation: Representation
Adjacency matrix: W ∈ {0,1}^{N×N}, where W_{i,j} = 1 if xi and xj are connected, 0 otherwise
Similarity matrix: W ∈ R_+^{N×N}, where W_{i,j} is the similarity between xi and xj
Degree matrix: D = diag(d1, …, dN), where di = Σ_{j≠i} W_{i,j}

Page 93: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

93

Label Propagation: Representation
Given: similarity matrix W ∈ R_+^{N×N}
Label information:
y_l = (y1, y2, …, y_nl) ∈ {−1, +1}^nl
y_u = (y1, y2, …, y_nu) ∈ {−1, +1}^nu

Page 94: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

94

Label Propagation: Representation
Given: similarity matrix W ∈ R_+^{N×N}
Label information: y_l = (y1, y2, …, y_nl) ∈ {−1, +1}^nl;  y = (y_l, y_u)

Page 95: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

95

Label Propagation
Initial class assignments ŷ ∈ {−1, 0, +1}^N:  ŷ_i = ±1 if xi is labeled, 0 if xi is unlabeled
Predicted class assignments y ∈ {−1, +1}^N:
first predict confidence scores f ∈ R^N,
then predict the class assignments:  y_i = +1 if f_i > 0,  −1 if f_i ≤ 0

Page 96: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

96

Label Propagation
Initial class assignments ŷ ∈ {−1, 0, +1}^N:  ŷ_i = ±1 if xi is labeled, 0 if xi is unlabeled
Predicted class assignments y ∈ {−1, +1}^N:
first predict confidence scores f = (f1, …, fN) ∈ R^N,
then predict the class assignments:  y_i = +1 if f_i > 0,  −1 if f_i ≤ 0

Page 97: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

97

Label Propagation (II)

One round of propagation:
f_i = ŷ_i if xi is labeled;  f_i = α Σ_{j=1..N} W_{i,j} ŷ_j otherwise

f^(1) = ŷ + αWŷ

This is essentially weighted KNN; α is the weight for each propagation step

Page 98: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

98

Label Propagation (II)

Two rounds of propagation:
f^(2) = ŷ + αW f^(1) = ŷ + αWŷ + α²W²ŷ

How to generalize to any number of iterations?
f^(k) = ŷ + Σ_{i=1..k} αⁱ Wⁱ ŷ

Page 99: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

99

Label Propagation (II)

Two rounds of propagation:
f^(2) = ŷ + αW f^(1) = ŷ + αWŷ + α²W²ŷ

Result for any number of iterations:
f^(k) = ŷ + Σ_{i=1..k} αⁱ Wⁱ ŷ

Page 100: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

100

Label Propagation (II)

Two rounds of propagation:
f^(2) = ŷ + αW f^(1) = ŷ + αWŷ + α²W²ŷ

Result for an infinite number of iterations:
f^(∞) = ŷ + Σ_{i=1..∞} αⁱ Wⁱ ŷ

Page 101: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

101

Label Propagation (II)

Two rounds of propagation:
f^(2) = ŷ + αW f^(1) = ŷ + αWŷ + α²W²ŷ

Result for an infinite number of iterations (a matrix inverse):
f^(∞) = (I − αW̄)^(-1) ŷ,  where W̄ = D^(-1/2) W D^(-1/2) is the normalized similarity matrix

Page 102: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

102

Local and Global Consistency [Zhou et al., NIPS 03]

Local consistency:

Like KNN

Global consistency:

Beyond KNN

Page 103: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

103

Summary: Construct a graph using pairwise similarities; propagate class labels along the graph:  f = (I − αW̄)^(-1) ŷ
Key parameters: α (the decay of propagation) and W (the similarity matrix)
Computational complexity: the matrix inverse is O(n³); speed-ups via Cholesky decomposition or clustering
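A minimal numpy sketch of the propagation just summarized (normalized similarity matrix, closed form f = (I − αW̄)⁻¹ŷ); the toy 5-point graph and α = 0.9 are illustrative only.

```python
import numpy as np

# toy symmetric similarity matrix over 5 points (two clusters: {0,1,2} and {3,4})
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y_hat = np.array([+1, 0, 0, -1, 0], dtype=float)   # only points 0 and 3 are labeled

D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
W_bar = D_inv_sqrt @ W @ D_inv_sqrt                 # normalized similarity matrix
alpha = 0.9
f = np.linalg.solve(np.eye(len(W)) - alpha * W_bar, y_hat)   # f = (I - alpha*W_bar)^(-1) y_hat
print(np.sign(f))   # -> [ 1.  1.  1. -1. -1.]
```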

Page 104: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

104

Questions

?

Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?

Page 105: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

105

Application: Text Classification [Zhou et al., NIPS 03]

20-newsgroups: autos, motorcycles, baseball, and hockey under rec
Pre-processing: stemming, removing stopwords & rare words, and skipping headers
#Docs: 3970, #words: 8014

[figure: accuracy vs. number of labeled examples for SVM, KNN, and label propagation]

Page 106: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

106

Application: Image Retrieval [Wang et al., ACM MM 2004]

5,000 images; relevance feedback for the top 20 ranked images
Classification problem: relevant or not? f(x): degree of relevance
Learning the relevance function f(x): supervised learning (SVM) vs. label propagation
[figure: retrieval accuracy for label propagation vs. SVM]

Page 107: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

107

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partition based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised data clustering

Page 108: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

108

Graph Partitioning
Classification as graph partitioning: search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut
[figure: two candidate partitions, Graph Cut = 2 vs. Graph Cut = 1]

Page 109: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

109

Graph Partitioning
Classification as graph partitioning: search for a classification boundary that is consistent with the labeled examples and gives a partition with a small graph cut
[figure: the partition with Graph Cut = 1]

Page 110: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

110

Min-cuts for semi-supervised learning [Blum and Chawla, ICML 2001]

Additional nodes: V+ (source) and V− (sink)
Infinite-weight edges connect the labeled examples to the source/sink
High computational cost
[figure: graph with source and sink; Graph Cut = 1]

Page 111: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

111

Harmonic Function [Zhu et al., ICML 2003]

Weight matrix W: w_{i,j} ≥ 0 is the similarity between xi and xj
Membership vector f = (f1, …, fN):  f_i = +1 if xi ∈ A,  −1 if xi ∈ B
[figure: graph partitioned into clusters A (+1) and B (−1)]

Page 112: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

112

Harmonic Function (cont'd)
Graph cut:
C(f) = Σ_{i=1..N} Σ_{j=1..N} w_{i,j} (f_i − f_j)² / 4 = (1/4) fᵀ(D − W) f = (1/4) fᵀ L f
Degree matrix: D = diag(d1, …, dN), with diagonal elements di = Σ_{j≠i} W_{i,j}
[figure: partition into A (+1) and B (−1)]

Page 113: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

113

Harmonic Function (cont'd)
Graph cut:
C(f) = Σ_{i=1..N} Σ_{j=1..N} w_{i,j} (f_i − f_j)² / 4 = (1/4) fᵀ(D − W) f = (1/4) fᵀ L f
Graph Laplacian L = D − W: encodes the pairwise relationships among the data points and the manifold geometry of the data
[figure: partition into A (+1) and B (−1)]

Page 114: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

114

Harmonic Function

min_{f ∈ {−1,+1}^N}  C(f) = (1/4) fᵀ L f
s.t.  f_i = y_i,  1 ≤ i ≤ nl

Objective: consistency with the graph structure
Constraints: consistency with the labeled data
Challenge: a discrete space, hence combinatorial optimization
[figure: partition into A (+1) and B (−1)]

Page 115: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

115

Harmonic Function

Relaxation: {−1, +1} → continuous real numbers; afterwards convert the continuous f back to binary labels

min_{f ∈ {−1,+1}^N}  C(f) = (1/4) fᵀ L f,  s.t.  f_i = y_i, 1 ≤ i ≤ nl
becomes
min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f,  s.t.  f_i = y_i, 1 ≤ i ≤ nl

Page 116: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

116

Harmonic Function

min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f,  s.t.  f_i = y_i, 1 ≤ i ≤ nl

Partition L and f into labeled and unlabeled blocks:
L = [ L_{l,l}  L_{l,u} ; L_{u,l}  L_{u,u} ],   f = (f_l, f_u)

Closed-form solution:  f_u = −L_{u,u}^(-1) L_{u,l} y_l
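A minimal numpy sketch of this closed-form harmonic solution, reusing the same toy 5-point graph as in the label-propagation sketch above; the graph and labels are illustrative only.

```python
import numpy as np

W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W

labeled = [0, 3]                        # indices of labeled points
unlabeled = [1, 2, 4]
y_l = np.array([+1.0, -1.0])            # labels of the labeled points

L_uu = L[np.ix_(unlabeled, unlabeled)]
L_ul = L[np.ix_(unlabeled, labeled)]
f_u = -np.linalg.solve(L_uu, L_ul @ y_l)   # f_u = -L_uu^{-1} L_ul y_l
print(np.sign(f_u))   # -> [ 1.  1. -1.]
```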

Page 117: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

117

Harmonic Function

f_u = −L_{u,u}^(-1) L_{u,l} y_l

Local propagation

Page 118: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

118

Harmonic Function

Local propagation vs. global propagation:
f_u = −L_{u,u}^(-1) L_{u,l} y_l

Sound familiar?

Page 119: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

119

Spectral Graph Transducer [Joachims, 2003]

Start from  min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f,  s.t.  f_i = y_i, 1 ≤ i ≤ nl
Soften the hard constraints into a penalty term:  + α Σ_{i=1..nl} (f_i − y_i)²

Page 120: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

120

Spectral Graph Transducer [Joachims, 2003]

Soften the hard constraints of
min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f,  s.t.  f_i = y_i, 1 ≤ i ≤ nl
into
min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f + α Σ_{i=1..nl} (f_i − y_i)²,   s.t.  Σ_{i=1..N} f_i² = N

Solved as a constrained eigenvector problem

Page 121: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

121

Manifold Regularization [Belkin, 2006]

min_{f ∈ R^N}  C(f) = (1/4) fᵀ L f + α Σ_{i=1..nl} (f_i − y_i)²,   s.t.  Σ_{i=1..N} f_i² = N

(f_i − y_i)² is a loss function for misclassification; the constraint regularizes the norm of the classifier

Page 122: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

122

Manifold Regularization [Belkin, 2006]

min_{f ∈ R^N}  (1/4) fᵀ L f + α Σ_{i=1..nl} (f_i − y_i)²,   s.t.  Σ_{i=1..N} f_i² = N

Generalize the squared error to an arbitrary loss function l(f(x_i), y_i):

Manifold regularization:  min_{f ∈ R^N}  fᵀ L f + α Σ_{i=1..nl} l(f(x_i), y_i) + γ ||f||²_{H_K}

Page 123: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

123

Summary
Construct a graph using pairwise similarities
Key quantity: the graph Laplacian, which captures the geometry of the graph
The decision boundary is consistent with both the graph structure and the labeled examples
Parameters: α, γ, and the similarity measure
[figure: partition into A (+1) and B (−1)]

Page 124: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

124

Questions

?

Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?

Page 125: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

125

Application: Text Classification
20-newsgroups: autos, motorcycles, baseball, and hockey under rec
Pre-processing: stemming, removing stopwords & rare words, and skipping headers
#Docs: 3970, #words: 8014
[figure: accuracy for label propagation, the harmonic function, SVM, and KNN]

Page 126: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

126

Application: Text Classification

PRBEP: precision/recall break-even point.

Page 127: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

127

Application: Text Classification

Improvement in PRBEP by SGT

Page 128: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

128

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised data clustering

Page 129: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

129

Transductive SVM
Support vector machine: maximize the classification margin
Decision boundary given a small number of labeled examples

Page 130: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

130

Transductive SVM
Decision boundary given a small number of labeled examples
How should the decision boundary change given both labeled and unlabeled examples?

Page 131: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

131

Transductive SVM
Decision boundary given a small number of labeled examples
Move the decision boundary to a region of low local density

Page 132: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

132

Transductive SVM
Classification margin ω(X, y, f), where f(x) is the classification function
Supervised learning:  f* = argmax_{f ∈ H_K} ω(X, y, f)
Semi-supervised learning: optimize over both f(x) and y_u

Page 133: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

133

Transductive SVM
Classification margin ω(X, y, f), where f(x) is the classification function
Supervised learning:  f* = argmax_{f ∈ H_K} ω(X, y, f)
Semi-supervised learning: optimize over both f(x) and y_u

Page 134: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

134

Transductive SVM
Classification margin ω(X, y, f), where f(x) is the classification function
Supervised learning:  f* = argmax_{f ∈ H_K} ω(X, y, f)
Semi-supervised learning: optimize over both f(x) and y_u:
f* = argmax_{f ∈ H_K, y_u ∈ {−1,+1}^nu} ω(X, y_l, y_u, f)

Page 135: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

135

Transductive SVM
Decision boundary given a small number of labeled examples
Move the decision boundary to a place with low local density → classification results for the unlabeled data
How to formulate this idea?

Page 136: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

136

Transductive SVM: Formulation

Original SVM (n labeled examples):
{w*, b*} = argmin_{w,b} wᵀw
s.t.  y_i (wᵀ x_i − b) ≥ 1,  i = 1, …, n

Transductive SVM (n labeled and m unlabeled examples):
{w*, b*} = argmin_{y_{n+1}, …, y_{n+m}} argmin_{w,b} wᵀw
s.t.  y_i (wᵀ x_i − b) ≥ 1,  i = 1, …, n          (labeled examples)
      y_i (wᵀ x_i − b) ≥ 1,  i = n+1, …, n+m      (constraints for unlabeled data)

A binary variable y_i for the label of each unlabeled example

Page 137: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

137

Computational Issue

No longer a convex optimization problem; solved by alternating optimization:
fix the labels y_{n+1}, …, y_{n+m} of the unlabeled examples and optimize (w, b); then fix (w, b) and re-assign the unlabeled labels; repeat

{w*, b*} = argmin_{y_{n+1}, …, y_{n+m}} argmin_{w,b} wᵀw
s.t.  y_i (wᵀ x_i − b) ≥ 1 for the labeled examples (i = 1, …, n) and the unlabeled examples (i = n+1, …, n+m)

Page 138: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

138

Summary

Based on the maximum margin principle
The classification margin is decided by both the labeled examples and the class labels assigned to the unlabeled data
High computational cost
Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
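A minimal sketch of the alternating idea behind TSVM training, with scikit-learn's LinearSVC as the inner supervised solver. This is a self-training-style simplification of the real algorithm (which also anneals the weight on the unlabeled part); the data and iteration count are illustrative only.

```python
import numpy as np
from sklearn.svm import LinearSVC   # inner supervised SVM solver

def tsvm_like(X_l, y_l, X_u, n_iters=10):
    """Alternate between (1) training an SVM and (2) re-assigning labels to unlabeled data."""
    svm = LinearSVC(C=1.0).fit(X_l, y_l)
    y_u = svm.predict(X_u)                       # initial guess for the unlabeled labels
    for _ in range(n_iters):
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u])
        svm = LinearSVC(C=1.0).fit(X, y)         # fix the labels, optimize (w, b)
        y_new = svm.predict(X_u)                 # fix (w, b), re-assign the unlabeled labels
        if np.array_equal(y_new, y_u):
            break
        y_u = y_new
    return svm, y_u

X_l = np.array([[0.0, 1.0], [1.0, 0.0]])
y_l = np.array([-1, 1])
X_u = np.array([[0.1, 0.9], [0.2, 1.1], [0.9, 0.1], [1.1, 0.2]])
svm, y_u = tsvm_like(X_l, y_l, X_u)
print(y_u)
```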

Page 139: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

139

Questions

?

Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?

Page 140: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

140

Text Classification by TSVM

10 categories from the Reuters collection
3,299 test documents; 1,000 informative words selected by the mutual information (MI) criterion

Page 141: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

141

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Page 142: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

142

Co-training [Blum & Mitchell, 1998]

Classify web pages into a category for students and a category for professors
Two views of a web page:
Content: "I am currently the second year Ph.D. student …"
Hyperlinks: "My advisor is …", "Students: …"

Page 143: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

143

Co-training for Semi-Supervised Learning

Page 144: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

144

Co-training for Semi-Supervised Learning

It is easy to classify the type of this web page based on its content
It is easier to classify this web page using hyperlinks

Page 145: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

145

Co-training
Two representations for each web page
Content representation: (doctoral, student, computer, university, …)
Hyperlink representation:
Inlinks: Prof. Cheng
Outlinks: Prof. Cheng

Page 146: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

146

Co-training Train a content-based classifier

Page 147: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

147

Co-training Train a content-based classifier using

labeled examples Label the unlabeled examples that are

confidently classified

Page 148: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

148

Co-training Train a content-based classifier using

labeled examples Label the unlabeled examples that are

confidently classified Train a hyperlink-based classifier

Page 149: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

149

Co-training Train a content-based classifier using

labeled examples Label the unlabeled examples that are

confidently classified Train a hyperlink-based classifier Label the unlabeled examples that are

confidently classified

Page 150: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

150

Co-training
Train a content-based classifier using the labeled examples
Label the unlabeled examples that are confidently classified
Train a hyperlink-based classifier on the enlarged labeled set
Label the unlabeled examples that are confidently classified
Repeat

Page 151: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

151

Co-training
Assumes two views of objects, i.e., two sufficient representations
Key idea: augment the training examples of one view by exploiting the classifier of the other view
Extension to multiple views; problem: how to find equivalent views (a sketch of the loop follows)
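A minimal sketch of the co-training loop just described. Scikit-learn's GaussianNB stands in for the content and hyperlink classifiers, the two one-dimensional views, the confidence threshold, and the round count are all illustrative assumptions, not the tutorial's setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=5, threshold=0.9):
    """X1_*, X2_*: the two views (e.g., content vs. hyperlink features) of the same pages."""
    X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
    unlabeled = list(range(len(X1_u)))
    for _ in range(rounds):
        c1 = GaussianNB().fit(np.array(X1_l), y_l)      # content-view classifier
        c2 = GaussianNB().fit(np.array(X2_l), y_l)      # hyperlink-view classifier
        newly = []
        for i in unlabeled:
            for clf in (c1, c2):
                view = X1_u if clf is c1 else X2_u
                proba = clf.predict_proba([view[i]])[0]
                if proba.max() >= threshold:             # confidently classified by one view
                    X1_l.append(X1_u[i]); X2_l.append(X2_u[i])
                    y_l.append(clf.classes_[proba.argmax()])
                    newly.append(i)
                    break
        unlabeled = [i for i in unlabeled if i not in newly]
        if not newly:
            break
    return GaussianNB().fit(np.array(X1_l), y_l), GaussianNB().fit(np.array(X2_l), y_l)

X1_l = [[0.0], [0.2], [1.0], [0.8]]; X2_l = [[0.1], [0.0], [0.9], [1.0]]; y_l = [0, 0, 1, 1]
X1_u = [[0.1], [0.9]]; X2_u = [[0.05], [0.95]]
c_content, c_links = co_training(X1_l, X2_l, y_l, X1_u, X2_u)
print(c_content.predict([[0.15], [0.85]]))   # -> [0 1]
```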

Page 152: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

152

A Few Words about Active Learning
Active learning: select the most informative examples (in contrast to passive learning)
Key question: which examples are informative?
Uncertainty principle: the most informative example is the one that is most uncertain to classify → measure classification uncertainty

Page 153: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

153

A Few Words about Active Learning
Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty ≈ the degree of disagreement among them
SVM-based approach: classification uncertainty ≈ distance to the decision boundary
Simple but very effective approaches

Page 154: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

154

Topics of Semi-supervised Learning

Introduction to semi-supervised learning Basics of semi-supervised learning Semi-supervised classification algorithms

Label propagation Graph partitioning based approaches Transductive Support Vector Machine (TSVM) Co-training

Semi-supervised clustering algorithms

Page 155: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

155

Semi-supervised Clustering

Clustering data into two clusters

Page 156: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

156

Semi-supervised Clustering

Clustering data into two clusters
Side information: must-links vs. cannot-links
[figure: a must-link and a cannot-link between data points]

Page 157: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

157

Semi-supervised Clustering Also called constrained clustering Two types of approaches

Restricted data partitions Distance metric learning approaches

Page 158: 1 Machine Learning for Information Retrieval Rong Jin Michigan State University Yi Zhang University of California Santa Cruz

158

Restricted Data Partition
Require data partitions to be consistent with the given links
Links as hard constraints: e.g., constrained K-means (Wagstaff et al., 2001)
Links as soft constraints: e.g., Metric Pairwise Constrained K-means (Basu et al., 2004)

Restricted Data Partition
Hard constraints: cluster memberships must obey the link constraints, so a partition is allowed only if every must-link pair shares a cluster and every cannot-link pair is separated (figure sequence: candidate partitions marked Yes or No).
Soft constraints: penalize a data clustering if it violates some links; the penalty counts the violated must-links and cannot-links (figure sequence: penalty = 0 for a consistent partition, penalty = 1 when one link is violated).
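To make the soft-constraint idea concrete, here is a minimal sketch in the spirit of (but not identical to) the approach of Basu et al. (2004): the assignment step adds a penalty w for every must-link or cannot-link that the assignment would violate, and a very large w approximates the hard-constraint behaviour; the function name and parameters are illustrative.

```python
# Minimal sketch of soft-constrained K-means: distance to a centroid plus a
# penalty for each pairwise link the tentative assignment would violate.
import numpy as np

def soft_constrained_kmeans(X, k, must, cannot, w=1.0, iters=20, seed=0):
    """must / cannot: lists of index pairs (i, j); w: penalty per violated link."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            cost = np.sum((centers - x) ** 2, axis=1)
            for j in (p for pair in must for p in pair if i in pair and p != i):
                cost += w * (np.arange(k) != labels[j])   # penalty if split from must-link partner
            for j in (p for pair in cannot for p in pair if i in pair and p != i):
                cost += w * (np.arange(k) == labels[j])   # penalty if merged with cannot-link partner
            labels[i] = int(np.argmin(cost))
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)  # standard centroid update
    return labels, centers
```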

Distance Metric Learning
Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link, shorten the distance for a must-link.
Then apply K-means with pairwise distances measured by the learned metric.
(Figure: a must-link and a cannot-link shown before and after the data are transformed by the learned distance metric.)
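As a hedged illustration of this idea (not the particular method behind the figure on the next slide), the sketch below learns a crude diagonal metric: features on which must-linked pairs agree and cannot-linked pairs differ receive larger weights, and K-means is then run in the rescaled space. Full metric-learning formulations solve an optimization problem over the metric instead.

```python
# Crude diagonal distance-metric sketch: shrink must-link distances and enlarge
# cannot-link distances by re-weighting features, then cluster in the new space.
import numpy as np
from sklearn.cluster import KMeans

def diagonal_metric(X, must, cannot, eps=1e-8):
    d_must = np.mean([(X[i] - X[j]) ** 2 for i, j in must], axis=0)
    d_cannot = np.mean([(X[i] - X[j]) ** 2 for i, j in cannot], axis=0)
    # Large weight = the feature agrees on must-links and separates cannot-links.
    return d_cannot / (d_must + eps)

def metric_kmeans(X, k, must, cannot):
    weights = diagonal_metric(X, must, cannot)
    X_scaled = X * np.sqrt(weights)     # Mahalanobis distance with a diagonal matrix
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_scaled)
```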

Example of Distance Metric Learning
(Figure: solid lines mark must-links, dotted lines mark cannot-links; left, the 2D data projection using the Euclidean distance metric; right, the 2D data projection using the learned distance metric.)

BoostCluster [Liu, Jin & Jain, 2007]
A general framework for semi-supervised clustering: it improves any given unsupervised clustering algorithm using pairwise constraints.
Key challenges:
How to influence an arbitrary clustering algorithm with side information? Encode the constraints into the data representation.
How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance.

BoostCluster
Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm.
(Block diagram: the data and the pairwise constraints feed a kernel matrix, which produces a new data representation; the clustering algorithm is run on that representation, its clustering results update the kernel, and the final results are read out after the last iteration.)
1. Find the best data representation that encodes the unsatisfied pairwise constraints.
2. Obtain the clustering results given the new data representation.
3. Update the kernel with the clustering results.
4. Run the procedure iteratively.
5. Compute the final clustering result.
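The sketch below is a deliberately simplified illustration of this iterative loop, not the published BoostCluster algorithm (which encodes the constraints through kernel-matrix updates): unsatisfied must-links pull points together and unsatisfied cannot-links push them apart in the data representation before the underlying clustering algorithm is run again; all names and step sizes are illustrative.

```python
# Simplified BoostCluster-style loop: re-encode the currently violated pairwise
# constraints into the data representation, then re-run the base clusterer.
import numpy as np
from sklearn.cluster import KMeans

def boost_like_cluster(X, must, cannot, n_clusters=2, iters=5, step=0.5):
    X_rep = X.astype(float).copy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_rep)
    for _ in range(iters):
        bad_must = [(i, j) for i, j in must if labels[i] != labels[j]]
        bad_cannot = [(i, j) for i, j in cannot if labels[i] == labels[j]]
        if not bad_must and not bad_cannot:
            break                                   # all constraints satisfied
        delta = np.zeros_like(X_rep)
        for i, j in bad_must:                       # pull violated must-links together
            delta[i] += step * (X_rep[j] - X_rep[i])
            delta[j] += step * (X_rep[i] - X_rep[j])
        for i, j in bad_cannot:                     # push violated cannot-links apart
            delta[i] -= step * (X_rep[j] - X_rep[i])
            delta[j] -= step * (X_rep[i] - X_rep[j])
        X_rep += delta
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_rep)
    return labels
```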

Summary
Clustering data under given pairwise constraints: must-links vs. cannot-links.
Two types of approaches: restricted data partitions (either soft or hard) and distance metric learning.
Question: how are the links/constraints acquired? By manual assignment, or derived from side information such as hyperlinks, citations, and user logs, which may be noisy and unreliable.

Application: Document Clustering [Basu et al., 2004]
300 docs from three topics (atheism, baseball, space) of 20-newsgroups.
3251 unique words after removal of stopwords and rare words, and stemming.
Evaluation metric: Normalized Mutual Information (NMI).
KMeans-x-x: different variants of constrained clustering algorithms.
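For reference, NMI can be computed directly from the cluster assignments and the true topic labels; the toy arrays below are purely illustrative.

```python
# NMI compares discovered clusters against the true topics, ignoring cluster ids.
from sklearn.metrics import normalized_mutual_info_score

true_topics = [0, 0, 1, 1, 2, 2]      # e.g., atheism / baseball / space
found_clusters = [0, 0, 1, 2, 2, 2]   # output of a (constrained) clustering run
print(normalized_mutual_info_score(true_topics, found_clusters))  # 1.0 would mean perfect agreement
```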

Outline
Introduction to information retrieval, statistical inference and machine learning
Supervised learning and its application to text classification, adaptive filtering, collaborative filtering and ranking
Semi-supervised learning and its application to text classification
Emerging research directions

Efficient Learning
In IR we have massive amounts of data, but most learning algorithms are relatively slow and have difficulty handling millions of documents.
How to improve scalability?
Sampling: use only part of the data.
Stochastic optimization: update the model one example at a time (related to online learning).
More interestingly, more examples may mean more efficient training (Srebro, ICML 2008).
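As a minimal sketch of the stochastic-optimization idea, the snippet below updates a logistic-regression model one example at a time, so a corpus of millions of documents can be streamed rather than held in memory; the learning rate and regularization are illustrative, not settings from the tutorial.

```python
# One-pass stochastic gradient descent for regularized logistic regression.
import numpy as np

def sgd_logistic(stream, dim, lr=0.1, reg=1e-4):
    """stream: iterable of (x, y) with x a numpy feature vector and y in {0, 1}."""
    w = np.zeros(dim)
    for x, y in stream:
        p = 1.0 / (1.0 + np.exp(-w @ x))       # predicted probability
        w -= lr * ((p - y) * x + reg * w)      # one gradient step per example
    return w
```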

Kernel Learning
Kernels play a central role in machine learning, and kernel functions can be learned from data: kernel alignment, multiple kernel learning, non-parametric kernel learning, …
Kernel learning is well suited to IR: the similarity measure is key to IR, and kernel learning allows us to identify the optimal similarity measure automatically.
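As one hedged example of learning a similarity measure from data, the sketch below scores candidate kernel matrices by their alignment with the label kernel y y^T and combines them with alignment-proportional weights; proper multiple kernel learning would optimize the combination weights directly, so this is only a heuristic illustration.

```python
# Kernel-target alignment: score each candidate kernel against the ideal label
# kernel and build a weighted combination favoring the better-aligned kernels.
import numpy as np

def alignment(K1, K2):
    """Normalized Frobenius inner product of two kernel matrices."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

def combine_kernels(kernels, y):
    y = np.where(np.asarray(y) > 0, 1.0, -1.0)          # labels as +1 / -1
    K_ideal = np.outer(y, y)                             # ideal kernel from labels
    weights = np.array([max(alignment(K, K_ideal), 1e-12) for K in kernels])
    weights /= weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))
```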

Transfer Learning
Different document categories are correlated, so we should be able to borrow information from one class when training another class.
Key question: what to transfer between classes? Representation, model priors, or the similarity measure.
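As a hedged sketch of the "model priors" option, the code below trains one logistic-regression weight vector per category but regularizes each toward the categories' shared mean rather than toward zero, so correlated classes borrow strength from one another; the setup and hyperparameters are illustrative assumptions, not the tutorial's method.

```python
# Transfer via a shared prior: per-class weights are pulled toward a common mean.
import numpy as np

def fit_with_shared_prior(class_data, dim, lr=0.1, lam=1.0, epochs=50):
    """class_data: dict mapping class name -> (X, y), X an (n, dim) array, y an array in {0, 1}."""
    w = {c: np.zeros(dim) for c in class_data}
    for _ in range(epochs):
        w_bar = np.mean(list(w.values()), axis=0)        # shared prior mean across classes
        for c, (X, y) in class_data.items():
            p = 1.0 / (1.0 + np.exp(-X @ w[c]))
            grad = X.T @ (p - y) / len(y) + lam * (w[c] - w_bar)
            w[c] -= lr * grad                            # pulled toward data fit and shared mean
    return w
```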

Active Learning IR Applications
Relevance feedback (text retrieval or image retrieval), text classification, adaptive information filtering, collaborative filtering, query rewriting.

Discriminative Language Models
Language models have been shown to be effective for information retrieval, but most language models are generative and thus miss discriminative power.
Key difficulty for discriminative language models: no outputs! Possible directions: side information, or a mixture of generative and discriminative models.

References
A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Journal of Information Retrieval, 2001.
F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. ICML 2003.
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 2004.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT 1998.
D. Blei and M. Jordan. Variational methods for the Dirichlet process. ICML 2004.
T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2), 2001.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. NIPS 2002.
R. Jin, C. Ding, and F. Kang. A probabilistic approach for optimizing spectral clustering. NIPS 2005.
D. Zhou, B. Scholkopf, and T. Hofmann. Semi-supervised learning on directed graphs. NIPS 2005.
X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.
T. Joachims. Transductive learning via spectral graph partitioning. ICML 2003.

References
A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. ICML 1998.
D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.
S. Tong and E. Chang. Support vector machine active learning for image retrieval. ACM Multimedia, 2001.
X. Shen and C. Zhai. Active feedback in ad hoc information retrieval. SIGIR 2005.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
X.-J. Wang, W.-Y. Ma, G.-R. Xue, and X. Li. Multi-model similarity propagation and its application for web image retrieval. ACM Multimedia, 2004.
M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization. Technical Report, Univ. of Chicago, 2006.
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. ICML 2001.
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. SIGKDD 2004.

References
X. He, B. Rey, W. V. Zhang, and R. Jones. Query rewriting using active learning for sponsored search. SIGIR 2007.
Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on Bayesian active learning. ICML 2003.
Z. Xu and R. Akella. A Bayesian logistic regression model for active relevance feedback. SIGIR 2008.
G. Schohn and D. Cohn. Less is more: active learning with support vector machines. ICML 2000.
M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Machine Learning, 2004.
J. Rocchio. Relevance feedback in information retrieval. In The SMART System: Experiments in Automatic Document Processing. Prentice Hall, 1971.
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT 1992.
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
R. M. Bell and Y. Koren. Lessons from the Netflix Prize Challenge. KDD Explorations, 2008.
T.-Y. Liu. Tutorial: learning to rank.
S. Chakrabarti. Learning to rank in vector spaces and social networks. WWW 2007.

Thank You

God, it is finally over!