Web Taxonomy Integration through Co-Bootstrapping Dell Zhang National University of Singapore Wee Sun Lee National University of Singapore SIGIR’04

Page 1: Web Taxonomy Integration through Co-Bootstrapping

Web Taxonomy Integration through Co-Bootstrapping

Dell Zhang, National University of Singapore

Wee Sun Lee, National University of Singapore

SIGIR’04

Page 2: Web Taxonomy Integration through Co-Bootstrapping

Introduction

Page 3: Web Taxonomy Integration through Co-Bootstrapping

Problem Statement

Games > Roleplaying

•Final Fantasy Fan

•Dragon Quest Home

Games > Strategy

•Shogun: Total War

Games > Online

•EverQuest Addict

•Warcraft III Clan

Games > Single-Player

•Warcraft III Clan

Games > Roleplaying

•Final Fantasy Fan

•Dragon Quest Home

•EverQuest Addict

•Warcraft III Clan

Games > Strategy

•Shogun: Total War

•Warcraft III Clan

Page 4: Web Taxonomy Integration through Co-Bootstrapping

Possible Approach

Games > Roleplaying

•Final Fantasy Fan

•Dragon Quest Home

Games > Strategy

•Shogun: Total War

Train

•EverQuest Addict

•Warcraft III Clan

Classify

ignores original Yahoo! categories

Page 5: Web Taxonomy Integration through Co-Bootstrapping

Another Approach (1/2)

Use Yahoo! categories

Advantage: similar categories

Potential problem: different structure, so categories do not match exactly

Page 6: Web Taxonomy Integration through Co-Bootstrapping

Another Approach (2/2)

Example: Crayon Shin-chan

Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan

Arts > Animation > Anime > Titles > C > Crayon Shin-chan

Page 7: Web Taxonomy Integration through Co-Bootstrapping

This Paper’s Approach

1. Weak Learner (as opposed to Naïve Bayes)

2. Boosting to combine Weak Hypotheses

3. New Idea: Co-Bootstrapping to exploit source categories

Page 8: Web Taxonomy Integration through Co-Bootstrapping

Assumptions

Multi-category data are reduced to binary data

Totoro Fan: Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro

is converted into

Totoro Fan: Cartoon > My Neighbor Totoro

Totoro Fan: Toys > My Neighbor Totoro

Hierarchies are ignored

Console > Sega and Console > Sega > Dreamcast are not related
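
The two assumptions above can be sketched as a small data transformation. This is only an illustration: the function name `to_binary_pairs` and the +1/-1 label encoding are assumptions made here, not something shown on the slides. Category paths are treated as flat strings, so no parent/child relation is used.

```python
from typing import Dict, List, Set, Tuple

def to_binary_pairs(doc_categories: Dict[str, Set[str]],
                    all_categories: List[str]) -> List[Tuple[str, str, int]]:
    """Expand multi-category documents into (document, category, label) triples.

    label is +1 when the document is in the category and -1 otherwise.
    Category paths are compared as whole strings, so the hierarchy is ignored.
    """
    pairs = []
    for doc in sorted(doc_categories):
        for cat in all_categories:
            label = 1 if cat in doc_categories[doc] else -1
            pairs.append((doc, cat, label))
    return pairs

docs = {"Totoro Fan": {"Cartoon > My Neighbor Totoro", "Toys > My Neighbor Totoro"}}
cats = ["Cartoon > My Neighbor Totoro", "Toys > My Neighbor Totoro", "Console > Sega"]
pairs = to_binary_pairs(docs, cats)
for doc, cat, label in pairs:
    print(doc, "|", cat, "|", label)
```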

Page 9: Web Taxonomy Integration through Co-Bootstrapping

Weak Learner

1. Weak Learner

2. Boosting

3. Co-Bootstrapping

Page 10: Web Taxonomy Integration through Co-Bootstrapping

Weak Learner

A type of classifier similar to Naïve Bayes

+ = accept, - = reject

Term may be a word or n-gram or …

Weak Learner → (after training) → Weak Hypothesis (term-based classifier)

Page 11: Web Taxonomy Integration through Co-Bootstrapping

Weak Hypothesis Example

If the document contains “Crayon Shin-chan”: in “Comics > Crayon Shin-chan”, not in “Education > Early Childhood”

If the document does not contain “Crayon Shin-chan”: not in “Comics > Crayon Shin-chan”, in “Education > Early Childhood”

Page 12: Web Taxonomy Integration through Co-Bootstrapping

Weak Learner Inputs (1/2)

Training data are in the form [x1, y1], [x2, y2], …, [xm, ym]

xi is a document

yi is a category

[xi, yi] means document xi is in category yi

D(x, y) is a distribution over all combinations of xi and yi

D(xi, yj) indicates the “importance” of (xi, yj)

w is the term (automatically found)

Page 13: Web Taxonomy Integration through Co-Bootstrapping

Weak Learner Algorithm

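For each possible category y, compute four values:

$W_{+}^{1y} = \sum_{i=1}^{m} D(x_i, y)\,\mathbf{1}[\,x_i \text{ contains } w \text{ and } x_i \text{ is in category } y\,]$

$W_{-}^{1y} = \sum_{i=1}^{m} D(x_i, y)\,\mathbf{1}[\,x_i \text{ contains } w \text{ and } x_i \text{ is not in category } y\,]$

$W_{+}^{0y} = \sum_{i=1}^{m} D(x_i, y)\,\mathbf{1}[\,x_i \text{ does not contain } w \text{ and } x_i \text{ is in category } y\,]$

$W_{-}^{0y} = \sum_{i=1}^{m} D(x_i, y)\,\mathbf{1}[\,x_i \text{ does not contain } w \text{ and } x_i \text{ is not in category } y\,]$

Here $\mathbf{1}[\cdot]$ is 1 when the condition holds and 0 otherwise.

Note: (xi, y) with greater D(xi, y) has more influence.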

Page 14: Web Taxonomy Integration through Co-Bootstrapping

Weak Hypothesis h(x, y)

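Given unclassified document x and category y

If x contains w, then

$h(x, y) = \frac{1}{2} \ln \frac{W_{+}^{1y}}{W_{-}^{1y}} = \frac{1}{2} \ln \frac{\text{Chance}(\text{doc contains } w,\ \text{doc is in } y)}{\text{Chance}(\text{doc contains } w,\ \text{doc is not in } y)}$

Else if x does not contain w, then

$h(x, y) = \frac{1}{2} \ln \frac{W_{+}^{0y}}{W_{-}^{0y}} = \frac{1}{2} \ln \frac{\text{Chance}(\text{doc does not contain } w,\ \text{doc is in } y)}{\text{Chance}(\text{doc does not contain } w,\ \text{doc is not in } y)}$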

Page 15: Web Taxonomy Integration through Co-Bootstrapping

Weak Learner Comments

If sign[ h(x,y) ] = +, then x is in y

| h(x,y) | is the confidence

The term w is found as follows:

Repeatedly run the weak learner for all possible w

Choose the run with the smallest value of

$\sum_{y} \sqrt{W_{+}^{1y} W_{-}^{1y}} + \sum_{y} \sqrt{W_{+}^{0y} W_{-}^{0y}}$

as the model

Boosting: minimizes the probability of h(x,y) having the wrong sign
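
A minimal sketch of the term-based weak learner described on the last few slides, under stated assumptions: documents are plain sets of terms, `train_stump` is a name made up here, and the smoothing constant `eps` (which avoids taking the log of zero) is an addition that does not appear on the slides. The toy terms are invented for illustration.

```python
import math
from typing import Dict, List, Set, Tuple

def train_stump(docs: List[Set[str]], labels: List[Set[str]],
                categories: List[str], D: Dict[Tuple[int, str], float],
                eps: float):
    """Try every term w and keep the one with the smallest selection value Z.

    Returns (w, c1, c0, Z), where c1[y] is h(x, y) when x contains w
    and c0[y] is h(x, y) when x does not contain w.
    """
    best = None
    for w in sorted(set().union(*docs)):
        c1, c0, z = {}, {}, 0.0
        for y in categories:
            # W[j][b]: j = 1 if the doc contains w, b = 1 if the doc is in y
            W = [[0.0, 0.0], [0.0, 0.0]]
            for i, terms in enumerate(docs):
                W[int(w in terms)][int(y in labels[i])] += D[(i, y)]
            c1[y] = 0.5 * math.log((W[1][1] + eps) / (W[1][0] + eps))
            c0[y] = 0.5 * math.log((W[0][1] + eps) / (W[0][0] + eps))
            z += math.sqrt(W[1][1] * W[1][0]) + math.sqrt(W[0][1] * W[0][0])
        if best is None or z < best[3]:
            best = (w, c1, c0, z)
    return best

# Toy data (hypothetical terms for the example sites)
docs = [{"final", "fantasy", "rpg"}, {"dragon", "quest", "rpg"}, {"shogun", "war", "strategy"}]
labels = [{"Roleplaying"}, {"Roleplaying"}, {"Strategy"}]
categories = ["Roleplaying", "Strategy"]
m, k = len(docs), len(categories)
D = {(i, y): 1.0 / (m * k) for i in range(m) for y in categories}  # uniform weights

w, c1, c0, z = train_stump(docs, labels, categories, D, eps=1.0 / (m * k))
print(w, c1, c0, z)
```

With uniform weights, a term that separates the categories perfectly makes every product inside the square roots zero, so it wins the selection; ties go to the first term in sorted order.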

Page 16: Web Taxonomy Integration through Co-Bootstrapping

Boosting (AdaBoost.MH)

1. Weak Learner

2. Boosting

3. Co-Bootstrapping

Page 17: Web Taxonomy Integration through Co-Bootstrapping

Boosting Idea

1. Train the weak learner on different Dt(x, y) distributions

2. After each run, adjust Dt(x, y) by putting more weight on the most often misclassified training data

3. Output the final hypothesis as a linear combination of weak hypotheses

Page 18: Web Taxonomy Integration through Co-Bootstrapping

Boosting Algorithm

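Given: [x1, y1], [x2, y2], …, [xm, ym], where xi ∈ X and yi ∈ Y

Initialize D1(x,y) = 1/(mk)

for t = 1,…,T do

Pass distribution Dt to weak learner

Get weak hypothesis ht(x, y)

Choose $\alpha_t \in \mathbb{R}$

Update $D_{t+1}(x, y) = \frac{D_t(x, y) \exp(-\alpha_t\, Y_x[y]\, h_t(x, y))}{Z_t}$

end for

Output the final hypothesis $H(x, y) = \sum_{t=1}^{T} \alpha_t h_t(x, y)$

Here $Y_x[y]$ is +1 if x is in category y and -1 otherwise, and $Z_t$ normalizes $D_{t+1}$ so that it remains a distribution.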

Page 19: Web Taxonomy Integration through Co-Bootstrapping

Boosting Algorithm Initialization

Given: [x1, y1], [x2, y2], …, [xm, ym]

Initialize D(x, y) = 1/(mk)

k = total number of categories

uniform distribution

Page 20: Web Taxonomy Integration through Co-Bootstrapping

Boosting Algorithm Loop

for t = 1,…,T do

Run weak learner using distribution D

Get weak hypothesis ht(x, y)

For each possible pair (x,y) in training data

If ht(x,y) guesses incorrectly, increase D(x,y)

end for

return $H(x, y) = \sum_{t=1}^{T} h_t(x, y)$
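
The loop above can be sketched end to end as follows. This is an illustrative sketch, not the authors' implementation: it fixes the weight of every round to 1 (matching the simplified return formula on this slide), represents documents as sets of terms, and adds a smoothing constant `eps` to the weak learner. All function names and the toy data are assumptions made here.

```python
import math
from typing import Dict, List, Set, Tuple

Stump = Tuple[str, Dict[str, float], Dict[str, float]]

def weak_learner(docs, labels, cats, D, eps) -> Stump:
    """Term-based stump: choose the term with the smallest selection value."""
    best, best_z = None, float("inf")
    for w in sorted(set().union(*docs)):
        c1, c0, z = {}, {}, 0.0
        for y in cats:
            W = [[0.0, 0.0], [0.0, 0.0]]  # W[contains w][is in y]
            for i, terms in enumerate(docs):
                W[int(w in terms)][int(y in labels[i])] += D[(i, y)]
            c1[y] = 0.5 * math.log((W[1][1] + eps) / (W[1][0] + eps))
            c0[y] = 0.5 * math.log((W[0][1] + eps) / (W[0][0] + eps))
            z += math.sqrt(W[1][1] * W[1][0]) + math.sqrt(W[0][1] * W[0][0])
        if z < best_z:
            best, best_z = (w, c1, c0), z
    return best

def h(stump: Stump, terms: Set[str], y: str) -> float:
    w, c1, c0 = stump
    return c1[y] if w in terms else c0[y]

def adaboost_mh(docs, labels, cats, T) -> List[Stump]:
    m, k = len(docs), len(cats)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in cats}  # uniform start
    stumps = []
    for _ in range(T):
        s = weak_learner(docs, labels, cats, D, eps=1.0 / (m * k))
        stumps.append(s)
        # Pairs with a wrong-signed h gain weight, correct ones lose weight
        for i in range(m):
            for y in cats:
                sign = 1.0 if y in labels[i] else -1.0
                D[(i, y)] *= math.exp(-sign * h(s, docs[i], y))
        total = sum(D.values())
        for key in D:
            D[key] /= total  # renormalize so D stays a distribution
    return stumps

def H(stumps: List[Stump], terms: Set[str], y: str) -> float:
    """Final hypothesis: sum of the weak hypotheses."""
    return sum(h(s, terms, y) for s in stumps)

docs = [{"final", "fantasy", "rpg"}, {"dragon", "quest", "rpg"}, {"shogun", "war", "strategy"}]
labels = [{"Roleplaying"}, {"Roleplaying"}, {"Strategy"}]
cats = ["Roleplaying", "Strategy"]
stumps = adaboost_mh(docs, labels, cats, T=3)
print(H(stumps, {"rpg", "elf"}, "Roleplaying"), H(stumps, {"rpg", "elf"}, "Strategy"))
```

A positive H(x, y) is read as "x is in y", and its magnitude as the confidence.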

Page 21: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping

1. Weak Learner

2. Boosting

3. Co-Bootstrapping

Page 22: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping Idea

We want to use Yahoo! categories to increase classification accuracy

Page 23: Web Taxonomy Integration through Co-Bootstrapping

Recall Example Problem

Games > Online

•EverQuest Addict

•Warcraft III Clan

Games > Single-Player

•Warcraft III Clan

Games > Roleplaying

•Final Fantasy Fan

•Dragon Quest Home

Games > Strategy

•Shogun: Total War

Page 24: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping Algorithm (1/4)

1. Run AdaBoost on Yahoo! sites

• Get classifier Y1

2. Run AdaBoost on Google sites

• Get classifier G1

3. Run Y1 on Google sites

• Get predicted Yahoo! categories for Google sites

4. Run G1 on Yahoo! sites

• Get predicted Google categories for Yahoo! sites

Page 25: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping Algorithm (2/4)

5. Run AdaBoost on Yahoo! sites

• Include Google category as a feature

• Get classifier Y2

6. Run AdaBoost on Google sites

• Include Yahoo! category as a feature

• Get classifier G2

7. Run Y2 on original Google sites

• Get more accurate Yahoo! categories for Google sites

8. Run G2 on original Yahoo! sites

• Get more accurate Google categories for Yahoo! sites

Page 26: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping Algorithm (3/4)

9. Run AdaBoost on Yahoo! sites

• Include Google category as a feature

• Get classifier Y3

10. Run AdaBoost on Google sites

• Include Yahoo! category as a feature

• Get classifier G3

11. Run Y3 on original Google sites

• Get even more accurate Yahoo! categories for Google sites

12. Run G3 on original Yahoo! sites

• Get even more accurate Google categories for Yahoo! sites

Page 27: Web Taxonomy Integration through Co-Bootstrapping

Co-Bootstrapping Algorithm (4/4)

Repeat, repeat, and repeat…

Hopefully, the classification will become more accurate after each iteration…
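
The alternating procedure on the four slides above can be sketched as a loop. Several things here are assumptions for illustration only: `overlap_trainer` is a deliberately simple stand-in learner (it is NOT AdaBoost), categories from the other taxonomy are encoded as pseudo-terms with a "Y:" or "G:" prefix, each classifier is applied to the other side's sites together with their known own categories, and the toy terms are invented.

```python
from typing import Callable, List, Set

Doc = Set[str]
Labels = Set[str]
Predict = Callable[[Doc], Labels]

def overlap_trainer(docs: List[Doc], labels: List[Labels]) -> Predict:
    """Stand-in learner (not AdaBoost): score categories by shared features."""
    cats = sorted(set().union(*labels))
    def predict(features: Doc) -> Labels:
        score = {c: 0 for c in cats}
        for d, ys in zip(docs, labels):
            for c in ys:
                score[c] += len(features & d)
        top = max(score.values())
        return {c for c in cats if top > 0 and score[c] == top}
    return predict

def augment(docs, cats_per_doc, prefix):
    """Add each doc's categories as extra features, e.g. 'G:Games > Strategy'."""
    return [d | {prefix + c for c in cs} for d, cs in zip(docs, cats_per_doc)]

def co_bootstrap(y_docs, y_labels, g_docs, g_labels, train, rounds):
    y_known = augment(y_docs, y_labels, "Y:")  # Yahoo! sites + their Yahoo! categories
    g_known = augment(g_docs, g_labels, "G:")  # Google sites + their Google categories
    y_train, g_train = y_docs, g_docs          # first round: terms only
    yahoo_for_g, google_for_y = [], []
    for _ in range(rounds):
        Y = train(y_train, y_labels)           # predicts Yahoo! categories
        G = train(g_train, g_labels)           # predicts Google categories
        yahoo_for_g = [Y(d) for d in g_known]
        google_for_y = [G(d) for d in y_known]
        # Next round: the other taxonomy's predicted categories become features
        y_train = augment(y_docs, google_for_y, "G:")
        g_train = augment(g_docs, yahoo_for_g, "Y:")
    return yahoo_for_g, google_for_y

y_docs = [{"everquest", "online", "rpg"}, {"warcraft", "online", "strategy", "rpg"}]
y_labels = [{"Games > Online"}, {"Games > Online", "Games > Single-Player"}]
g_docs = [{"final", "fantasy", "rpg"}, {"dragon", "quest", "rpg"}, {"shogun", "war", "strategy"}]
g_labels = [{"Games > Roleplaying"}, {"Games > Roleplaying"}, {"Games > Strategy"}]

yahoo_for_g, google_for_y = co_bootstrap(y_docs, y_labels, g_docs, g_labels,
                                         overlap_trainer, rounds=2)
print(google_for_y)
```

Passing the trainer as a parameter keeps the loop independent of the learner, so a boosting-based trainer could be dropped in where the stand-in is used.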

Page 28: Web Taxonomy Integration through Co-Bootstrapping

Enhanced Naïve Bayes (Benchmark)

Page 29: Web Taxonomy Integration through Co-Bootstrapping

Enhanced Naïve Bayes (1/2)

Given document x and source category S of x

Predict master category C

In NB, $\Pr[C \mid x] \propto \Pr[C] \prod_{w \in x} \Pr[w \mid C]^{n(x,w)}$

w: word

n(x,w): number of occurrences of w in x

In ENB, $\Pr[C \mid x, S] \propto \Pr[C \mid S] \prod_{w \in x} \Pr[w \mid C]^{n(x,w)}$
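
In practice the product over words is usually evaluated in log space to avoid underflow. The sketch below shows that computation for a single category. The function name, the fallback probability `unseen` for words missing from the table, and the toy numbers are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

def nb_log_score(tokens: List[str], prior: float,
                 word_prob: Dict[str, float], unseen: float) -> float:
    """log(prior) + sum over words of n(x, w) * log Pr[w | C].

    Pass Pr[C] as the prior for plain NB, or Pr[C | S] for the enhanced variant.
    """
    counts = Counter(tokens)  # n(x, w) for every word w in the document
    return math.log(prior) + sum(
        n * math.log(word_prob.get(w, unseen)) for w, n in counts.items()
    )

tokens = ["rpg", "rpg", "quest"]
score = nb_log_score(tokens, prior=0.5,
                     word_prob={"rpg": 0.5, "quest": 0.25}, unseen=1e-6)
print(score)
```

The predicted category is the one with the highest score; only the prior term differs between NB and ENB.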

Page 30: Web Taxonomy Integration through Co-Bootstrapping

Enhanced Naïve Bayes (2/2)

$\Pr[C] = \frac{|C|}{\sum_{i} |C_i|}$

Estimate $\Pr[C \mid S] = \frac{|C| \cdot |C \cap S|^{\omega}}{\sum_{i} |C_i| \cdot |C_i \cap S|^{\omega}}$

|C ∩ S|: number of docs in S that are classified into C by the NB classifier
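
A small sketch of the Pr[C | S] estimate, assuming the enhanced Naïve Bayes form in which the count of source documents assigned to C is raised to an exponent (called `omega` here). The function name, the fallback to Pr[C] when no document of S lands in any category, and the toy counts are assumptions for illustration.

```python
from typing import Dict

def enb_prior(master_sizes: Dict[str, int], hits_in_s: Dict[str, int],
              omega: float) -> Dict[str, float]:
    """Pr[C | S] proportional to |C| * |C ∩ S| ** omega.

    master_sizes[C] is |C|; hits_in_s[C] is the number of docs of S that
    the plain NB classifier puts into C. omega = 0 gives back Pr[C].
    """
    raw = {c: size * (hits_in_s.get(c, 0) ** omega)
           for c, size in master_sizes.items()}
    total = sum(raw.values())
    if total == 0:  # no doc of S was assigned anywhere: fall back to Pr[C]
        raw = dict(master_sizes)
        total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}

sizes = {"Roleplaying": 2, "Strategy": 1}
hits = {"Roleplaying": 2, "Strategy": 1}
plain = enb_prior(sizes, hits, omega=0)
boosted = enb_prior(sizes, hits, omega=1)
print(plain, boosted)
```

Larger exponents shift more probability toward the master categories that already attract the documents of S.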

Page 31: Web Taxonomy Integration through Co-Bootstrapping

Experiment

Page 32: Web Taxonomy Integration through Co-Bootstrapping

Datasets

Book
Google: /Top/Shopping/Publications/Books
Yahoo!: /Business and Economy/Shopping and Services/Books/Bookstores

Disease
Google: /Top/Health/Conditions and Diseases
Yahoo!: /Health/Diseases and Conditions

Movie
Google: /Top/Arts/Movies/Genres
Yahoo!: /Entertainment/Movies and Film/Genres/

Music
Google: /Top/Arts/Music/Styles
Yahoo!: /Entertainment/Music/Genres

News
Google: /Top/News/By Subject
Yahoo!: /News and Media

Page 33: Web Taxonomy Integration through Co-Bootstrapping

Number of Categories*/Dataset (1/2)

Dataset   Google   Yahoo!
Book      49       41
Disease   30       51
Movie     34       25
Music     47       24
News      27       34

*Top level categories only

Page 34: Web Taxonomy Integration through Co-Bootstrapping

Number of Categories*/Dataset (2/2)

Book: Horror, Science Fiction, Non-fiction (Biography, History)

Subcategories such as Biography and History are merged into Non-fiction

Page 35: Web Taxonomy Integration through Co-Bootstrapping

Number of Websites

Dataset   Google   Yahoo!   G∪Y      G∩Y
Book      10,842   11,268   21,111   999
Disease   34,047   9,785    41,439   2,393
Movie     36,787   14,366   49,744   1,409
Music     76,420   24,518   95,971   4,967
News      31,504   19,419   49,303   1,620

Page 36: Web Taxonomy Integration through Co-Bootstrapping

Method (1/2)

Classify Yahoo! Book websites into Google Book categories (G←Y)

1. Find G∩Y for Book

2. Hide the Google categories of the sites in G∩Y

3. Use G∩Y as the Yahoo! Book sites

4. Randomly take |G∩Y| sites from G-Y as the Google Book sites

Page 37: Web Taxonomy Integration through Co-Bootstrapping

Method (2/2)

For each dataset, do G←Y five times and Y←G five times

macro F-score: calculate the F-score for each category, then average over all categories

micro F-score: calculate the F-score on the entire dataset (recall = 100%?)

Doesn’t say anything about multi-category ENB
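
The difference between the two averages can be made concrete with a short sketch. The function names and the toy labels are assumptions for illustration; the F-score used is the usual harmonic mean of precision and recall, written in terms of true positives, false positives, and false negatives.

```python
from typing import List, Set, Tuple

def f1(tp: int, fp: int, fn: int) -> float:
    """F-score from counts; defined as 0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_micro_f(truth: List[Set[str]], pred: List[Set[str]],
                  cats: List[str]) -> Tuple[float, float]:
    counts = {c: [0, 0, 0] for c in cats}  # [tp, fp, fn] per category
    for t, p in zip(truth, pred):
        for c in cats:
            if c in p and c in t:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in t:
                counts[c][2] += 1
    # Macro: F-score per category, then the plain average over categories
    macro = sum(f1(*counts[c]) for c in cats) / len(cats)
    # Micro: pool the counts over the whole dataset, then one F-score
    tp, fp, fn = (sum(counts[c][j] for c in cats) for j in range(3))
    return macro, f1(tp, fp, fn)

truth = [{"A"}, {"A"}, {"A"}, {"B"}]
pred = [{"A"}, {"A"}, {"B"}, {"B"}]
macro, micro = macro_micro_f(truth, pred, ["A", "B"])
print(macro, micro)
```

Macro averaging gives every category equal weight, so small categories matter as much as large ones; micro averaging is dominated by the large categories.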

Page 38: Web Taxonomy Integration through Co-Bootstrapping

Results (1/3)

[Figure: two bar charts, one of macro-averaged F scores and one of micro-averaged F scores, comparing AB and CB-AB on Book, Disease, Movie, Music, and News for G←Y and Y←G; vertical axis from 0 to 0.9]

Co-Bootstrapping-AdaBoost > AdaBoost

Page 39: Web Taxonomy Integration through Co-Bootstrapping

Results (2/3)

[Figure: F score (vertical axis, 0 to 0.8) against co-bootstrapping iteration (0 to 8) on the Book dataset; series maF(G←Y), miF(G←Y), maF(Y←G), miF(Y←G)]

Co-Bootstrapping-AdaBoost iteratively improves AdaBoost

Page 40: Web Taxonomy Integration through Co-Bootstrapping

Results (3/3)

[Figure: two bar charts, one of macro-averaged F scores and one of micro-averaged F scores, comparing ENB and CB-AB on Book, Disease, Movie, Music, and News for G←Y and Y←G; vertical axis from 0 to 0.9]

Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes

Page 41: Web Taxonomy Integration through Co-Bootstrapping

Contribution

Co-Bootstrapping improves Boosting performance

Does not require ω as in ENB