Information Retrieval. Lecture 7: Introduction to Information Retrieval (Manning et al. 2007), Chapter 17. For the MSc Computer Science Programme. Dell Zhang, Birkbeck, University of London.


Page 1: Information Retrieval

Information Retrieval

Lecture 7: Introduction to Information Retrieval (Manning et al. 2007)

Chapter 17

For the MSc Computer Science Programme

Dell Zhang, Birkbeck, University of London

Page 2: Information Retrieval

Yahoo! Hierarchy

http://dir.yahoo.com/science

[Figure: an excerpt of the Yahoo! directory tree. Top-level categories shown: agriculture, biology, physics, CS, space. Sample subcategories: dairy, crops, agronomy, forestry (under agriculture); botany, evolution, cell (under biology); magnetism, relativity (under physics); AI, HCI, courses (under CS); craft, missions (under space).]

Page 3: Information Retrieval

Hierarchical Clustering

Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.

Divisive (top-down): start with all documents in a single cluster; eventually each document forms a cluster on its own. This can be done by recursive application of a (flat) partitional clustering algorithm, e.g., k-means with k=2 (bisecting k-means).

Agglomerative (bottom-up): start with each document as a single cluster; eventually all documents belong to the same cluster.
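The divisive route can be sketched in a few lines. This is a toy illustration on 1-D points with deterministic centroid initialisation, not the lecture's implementation; the names `two_means` and `bisecting_kmeans` and the data are made up.

```python
def two_means(points, iters=20):
    """Plain 2-means on a list of floats; returns two sub-lists."""
    c1, c2 = min(points), max(points)   # deterministic initial centroids
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if a:
            c1 = sum(a) / len(a)
        if b:
            c2 = sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Repeatedly split the largest cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)  # pick the largest cluster
        a, b = two_means(clusters.pop(0))     # bisect it with 2-means
        clusters += [a, b]
    return clusters

print(bisecting_kmeans([1.0, 1.2, 1.1, 8.0, 8.1, 20.0], 3))
```

Each split is a flat 2-way partition, so the sequence of splits traces out a top-down binary hierarchy.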

Page 4: Information Retrieval

Dendrogram

Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

The number of clusters k is not required in advance.
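Cutting the dendrogram at a level can be sketched with a union-find over the merge history: only merges at least as similar as the threshold are applied, and the surviving connected components are the clusters. The merge list below is invented toy data.

```python
def cut(n_docs, merges, threshold):
    """Apply only merges whose similarity is >= threshold; return clusters."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for sim, i, j in merges:
        if sim >= threshold:                # the cut: ignore weak merges
            parent[find(i)] = find(j)
    groups = {}
    for d in range(n_docs):
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())

# Toy merge history: (similarity, representative doc i, representative doc j)
merges = [(0.9, 0, 1), (0.8, 3, 4), (0.5, 2, 3), (0.2, 0, 2)]
print(cut(5, merges, 0.4))
print(cut(5, merges, 0.7))
```

Raising the threshold yields more, tighter clusters; no k has to be fixed in advance.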

Page 5: Information Retrieval

Dendrogram – Example

Clusters of News Stories: Reuters RCV1

Page 6: Information Retrieval

Dendrogram – Example

Clusters of Things that People Want: ZEBO

Page 7: Information Retrieval

HAC

Hierarchical Agglomerative Clustering starts with each document in a separate cluster, then repeats until only one cluster remains:

Among the current clusters, determine the pair ci and cj that are most similar (Single-Link, Complete-Link, etc.), then merge ci and cj into a single cluster.

The history of merging forms a binary tree, or hierarchy.
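The loop just described can be sketched directly. This is a naive O(n^3) version for clarity, using the single-link rule (maximum pairwise similarity) to compare clusters; the similarity matrix is toy data.

```python
def hac_single_link(sim):
    """sim: symmetric doc-doc similarity matrix. Returns the merge history."""
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link: strongest link between the two clusters
                s = max(sim[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((s, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
for s, ci, cj in hac_single_link(sim):
    print(round(s, 2), ci, cj)
```

The returned history is exactly the binary tree of merges: read top to bottom, it lists the dendrogram's internal nodes from the leaves upward.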

Page 8: Information Retrieval

Single-Link

The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine-similarity) between their members:

After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )

Page 11: Information Retrieval

HAC – Example

[Figure: dendrogram built over five documents d1, d2, d3, d4, d5; the merge order is {d1,d2}, then {d4,d5}, then {d3} with {d4,d5} giving {d3,d4,d5}.]

As clusters agglomerate, the sequence of merges forms a dendrogram over the documents.

Page 12: Information Retrieval

HAC – Example Single-Link

Page 13: Information Retrieval

Take Home Message

Single-Link HAC Dendrogram

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )