Catalog Integration R. Agrawal, R. Srikant: WWW-10

Post on 17-Jan-2016

TRANSCRIPT

Page 1: Catalog Integration R. Agrawal, R. Srikant:

Catalog Integration

R. Agrawal, R. Srikant: WWW-10

Page 2: Catalog Integration R. Agrawal, R. Srikant:

Catalog Integration Problem

Integrate products from new catalog into master catalog.

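[Figure: two category trees. Master Catalog: ICs with subcategories DSP, Mem., and Logic, holding products a, b, c, d, e, f. New Catalog: ICs with subcategories Cat 1 and Cat 2, holding products x, y, z.]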

Page 3: Catalog Integration R. Agrawal, R. Srikant:

The Problem (cont.)

After integration:

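[Figure: the master catalog tree after integration. ICs with subcategories DSP, Mem., and Logic, now holding products a, b, c, d, e, f together with x, y, z.]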

Page 4: Catalog Integration R. Agrawal, R. Srikant:

Desired Solution

Automatically integrate products:
– Little or no effort on the part of the user.
– Domain independent.

Problem size:
– On the order of a million products.
– Thousands of categories.

Page 5: Catalog Integration R. Agrawal, R. Srikant:

Model

Product descriptions consist of words.

Products live in the leaf-level categories.

Page 6: Catalog Integration R. Agrawal, R. Srikant:

Basic Algorithm

Build classification model using product descriptions in master catalog.

Use classification model to predict categories for products in the new catalog.

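[Figure: product x from the new catalog classified between the master categories Logic and DSP, with predicted probabilities 5% and 95%.]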

Page 7: Catalog Integration R. Agrawal, R. Srikant:

National Semiconductor Files

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.

Part: LM3940 1A Low Dropout Regulator
...

Part: Wide Adjustable Range PNP Voltage Regulator
...

Part: LM2940/LM2940C 1A Low Dropout Regulator
...

Page 8: Catalog Integration R. Agrawal, R. Srikant:

National Semiconductor Files with Categories

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
– Choice 1: Transceiver
– Choice 2: Line Receiver
– Choice 3: Line Driver
– Choice 4: General-Purpose Silicon Rectifier
– Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout Regulator
Pangea Category:
– Choice 1: Positive Fixed Voltage Regulator
– Choice 2: Voltage-Feedback Operational Amplifier
– Choice 3: Voltage Reference
– Choice 4: Voltage-Mode SMPS Controller
– Choice 5: Positive Adjustable Voltage Regulator

...

...

Page 9: Catalog Integration R. Agrawal, R. Srikant:

Accuracy on Pangea Data

B2B portal for electronic components:
– 1200 categories, 40K training documents.
– 500 categories with < 5 documents.

Accuracy:
– 72% for top choice.
– 99.7% for top 5 choices.
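
The two accuracy figures above correspond to what is usually called top-1 and top-k accuracy. A minimal sketch of how such a measure can be computed, assuming each product comes with a ranked list of predicted categories; the function name and the toy data are hypothetical and not taken from the slides:

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label appears among the first k ranked choices."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical toy data: two products, each with a ranked list of choices.
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Line Driver"],
]
truth = ["Transceiver", "Positive Fixed Voltage Regulator"]

print(top_k_accuracy(ranked, truth, k=1))  # 0.5
print(top_k_accuracy(ranked, truth, k=2))  # 1.0
```

A large gap between the two numbers, as reported on the slide, suggests that the correct category is usually ranked near the top even when it is not the first choice.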

Page 10: Catalog Integration R. Agrawal, R. Srikant:

Enhanced Algorithm: Intuition

Use affinity information in the catalog to be integrated (new catalog):
– Products in the same category are similar.
– Bias the classifier to incorporate this information.

Accuracy boost depends on the quality of the new catalog:
– Use a tuning set to determine the amount of bias.

Page 11: Catalog Integration R. Agrawal, R. Srikant:

Algorithm

Extension of the Naive-Bayes classification to incorporate affinity information

Page 12: Catalog Integration R. Agrawal, R. Srikant:

Naive Bayes Classifier

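$\Pr(C_i \mid d) = \Pr(C_i)\,\Pr(d \mid C_i) / \Pr(d)$ (Bayes' rule)
– $\Pr(d)$: same for all categories (ignore).
– $\Pr(C_i)$ = #docs in $C_i$ / #total docs.

$\Pr(d \mid C_i) = \prod_{w \in d} \Pr(w \mid C_i)$
– Words occur independently (unigram model).

$\Pr(w \mid C_i) = (n(C_i, w) + \lambda) / (n(C_i) + \lambda |V|)$
– Maximum likelihood estimate smoothed with Lidstone's law of succession.
– $n(C_i, w)$ is the number of occurrences of $w$ in $C_i$, $n(C_i)$ the total number of words in $C_i$, $|V|$ the vocabulary size, and $\lambda$ the smoothing parameter.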
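
The formulas on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the default `lam=1.0`, and the toy master catalog are assumptions, and log-probabilities are used instead of raw products to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, lam=1.0):
    """docs: list of (category, list_of_words) pairs from the master catalog."""
    doc_counts = Counter()              # number of docs per category
    word_counts = defaultdict(Counter)  # n(Ci, w)
    vocab = set()                       # V
    for cat, words in docs:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return {"doc_counts": doc_counts, "word_counts": word_counts,
            "vocab": vocab, "lam": lam}


def log_likelihood(model, cat, words):
    """log Pr(d|Ci): sum over words of the Lidstone-smoothed log Pr(w|Ci)."""
    counts = model["word_counts"][cat]
    n_cat = sum(counts.values())        # n(Ci)
    lam, v = model["lam"], len(model["vocab"])
    return sum(math.log((counts[w] + lam) / (n_cat + lam * v)) for w in words)


def classify(model, words):
    """Return the category maximizing log Pr(Ci) + log Pr(d|Ci)."""
    total_docs = sum(model["doc_counts"].values())
    scores = {
        cat: math.log(n / total_docs) + log_likelihood(model, cat, words)
        for cat, n in model["doc_counts"].items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy master catalog.
master = [
    ("Logic", ["nand", "gate", "logic"]),
    ("Logic", ["logic", "inverter"]),
    ("DSP", ["signal", "processor", "dsp"]),
    ("DSP", ["dsp", "filter", "signal"]),
]
model = train_naive_bayes(master)
print(classify(model, ["dsp", "signal"]))  # DSP
print(classify(model, ["logic", "gate"]))  # Logic
```

Pr(d) is dropped, as the slide notes, because it is the same for every category and does not change the argmax.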

Page 13: Catalog Integration R. Agrawal, R. Srikant:

Enhanced Algorithm

$\Pr(C_i \mid d, S)$ // d existed in category S

$= \Pr(C_i, d, S) / \Pr(d, S)$
– $\Pr(C_i, d, S) = \Pr(d, S)\,\Pr(C_i \mid d, S)$

$= \Pr(C_i)\,\Pr(S, d \mid C_i) / \Pr(d, S)$

$= \Pr(C_i)\,\Pr(S \mid C_i)\,\Pr(d \mid C_i) / \Pr(S, d)$
– Assuming d, S independent given $C_i$

$= \Pr(S)\,\Pr(C_i \mid S)\,\Pr(d \mid C_i) / \Pr(S, d)$
– $\Pr(S \mid C_i)\,\Pr(C_i) = \Pr(C_i \mid S)\,\Pr(S)$

$= \Pr(C_i \mid S)\,\Pr(d \mid C_i) / \Pr(d \mid S)$
– $\Pr(S, d) = \Pr(S)\,\Pr(d \mid S)$

Same as NB except $\Pr(C_i \mid S)$ instead of $\Pr(C_i)$
– Ignore $\Pr(d \mid S)$ as it is the same for all classes
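
The last line of the derivation says that only the prior changes. A minimal sketch of what that means for scoring, assuming the per-category log-likelihoods log Pr(d|Ci) and the values of Pr(Ci|S) have already been computed elsewhere; the function name and all numbers are hypothetical:

```python
import math


def enhanced_classify(log_likelihoods, prior_given_s):
    """Return argmax over Ci of log Pr(Ci|S) + log Pr(d|Ci).

    Categories with Pr(Ci|S) == 0 are skipped, since their score is -infinity.
    """
    scores = {
        cat: math.log(prior_given_s[cat]) + ll
        for cat, ll in log_likelihoods.items()
        if prior_given_s.get(cat, 0) > 0
    }
    return max(scores, key=scores.get)


# Hypothetical numbers: the text alone slightly favours Logic.
log_likelihoods = {"Logic": math.log(0.012), "DSP": math.log(0.010)}

uniform_prior = {"Logic": 0.5, "DSP": 0.5}   # plays the role of Pr(Ci)
biased_prior = {"Logic": 0.2, "DSP": 0.8}    # plays the role of Pr(Ci|S)

print(enhanced_classify(log_likelihoods, uniform_prior))  # Logic
print(enhanced_classify(log_likelihoods, biased_prior))   # DSP
```

In this toy case the affinity information carried by S is enough to overturn a narrow decision made on the words alone.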

Page 14: Catalog Integration R. Agrawal, R. Srikant:

Computing Pr(Ci|S)

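$\Pr(C_i \mid S) = \dfrac{|C_i| \cdot (\text{\#docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_{j \in [1,n]} |C_j| \cdot (\text{\#docs in } S \text{ predicted to be in } C_j)^{w}}$

$|C_i|$ = #docs in $C_i$ in the master catalog.

$w$ determines the weight of the new catalog: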

– Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known

– Choose one weight for the entire new catalog or different weights for different sections
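
The estimate of Pr(Ci|S) can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that at least one document of S has been given a predicted category, so that the normalizing sum is nonzero when w > 0.

```python
def prior_given_source_category(master_sizes, predicted_counts, w):
    """Estimate Pr(Ci|S) for every master category Ci.

    master_sizes:     {Ci: |Ci|}, number of docs in Ci in the master catalog
    predicted_counts: {Ci: number of docs in S predicted to be in Ci}
    w:                weight given to the new catalog (w >= 0)
    """
    raw = {
        cat: size * (predicted_counts.get(cat, 0) ** w)
        for cat, size in master_sizes.items()
    }
    total = sum(raw.values())  # assumed nonzero, see the note above
    return {cat: value / total for cat, value in raw.items()}


# Hypothetical example: 10 docs in S, 9 of them predicted to be DSP.
master_sizes = {"Logic": 100, "DSP": 100}
predicted_counts = {"Logic": 1, "DSP": 9}

for w in (0, 1, 2):
    print(w, prior_given_source_category(master_sizes, predicted_counts, w))
```

With w = 0 every count is raised to the power zero, so the result is just |Ci| / Σ|Cj|, the ordinary Naive Bayes prior. Larger values of w push the prior towards the categories that most of S was predicted to belong to.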

Page 15: Catalog Integration R. Agrawal, R. Srikant:

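Superiority of the Enhanced Algorithm

Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm. (With w = 0 the estimate of Pr(Ci|S) reduces to |Ci| / Σj |Cj| = Pr(Ci), so the basic algorithm is a special case of the enhanced one.)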

Catch: The optimum value of the weight, at which the enhanced algorithm achieves its highest accuracy, is data dependent.

The tune set method attempts to select a good value for weight, but there is no guarantee of success.
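
One way to picture the tune set method is as a simple search over candidate weights. The sketch below is an assumption about how such a search could look, not the authors' procedure: `accuracy_on_tune_set` is a hypothetical callback that runs the enhanced classifier with a given weight on the tune set and returns its accuracy, and the accuracy values are made up purely to exercise the code.

```python
def choose_weight(candidate_weights, accuracy_on_tune_set):
    """Return (best_weight, best_accuracy) over the candidate weights.

    Ties are resolved in favour of the earliest candidate.
    """
    best_w, best_acc = None, -1.0
    for w in candidate_weights:
        acc = accuracy_on_tune_set(w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Made-up tune-set accuracies, for illustration only.
made_up_accuracy = {0: 0.70, 1: 0.74, 5: 0.81, 25: 0.79}

best_w, best_acc = choose_weight(sorted(made_up_accuracy), made_up_accuracy.get)
print(best_w, best_acc)  # 5 0.81
```

Including w = 0 among the candidates means the chosen weight is never worse than the basic algorithm on the tune set itself. As the slide notes, that still gives no guarantee on the rest of the new catalog.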

Page 16: Catalog Integration R. Agrawal, R. Srikant:

Empirical Evaluation

Start with a real catalog M.

Remove n products from M to form the new catalog N.

In the new catalog N:
– Assign f*n products to the same category as M.
– Assign the rest to other categories as per some distribution (but remember their true category).

Accuracy: fraction of products in N assigned to their true categories.
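
The construction of the synthetic new catalog can be sketched as below. The function name, the tuple layout, and the use of a uniform choice among the other categories are assumptions; the slide only says "some distribution". The sketch also assumes the master catalog has at least two categories.

```python
import random


def make_new_catalog(master, n, f, rng):
    """Split a master catalog into (remaining master, synthetic new catalog).

    master: list of (product_id, category) pairs
    n:      number of products removed from the master to form the new catalog
    f:      fraction of those products that keep their true category
    rng:    a random.Random instance, for reproducibility
    New-catalog entries are (product_id, assigned_category, true_category).
    """
    items = list(master)
    rng.shuffle(items)
    removed, remaining = items[:n], items[n:]
    categories = sorted({cat for _, cat in master})
    n_same = round(f * n)
    new_catalog = []
    for i, (pid, true_cat) in enumerate(removed):
        if i < n_same:
            assigned = true_cat
        else:
            # Assumption: uniform choice among the other categories.
            assigned = rng.choice([c for c in categories if c != true_cat])
        new_catalog.append((pid, assigned, true_cat))
    return remaining, new_catalog


# Hypothetical master catalog with 20 products in two categories.
master = [(i, "Logic" if i % 2 == 0 else "DSP") for i in range(20)]
remaining, new_catalog = make_new_catalog(master, n=10, f=0.8, rng=random.Random(0))

kept = sum(1 for _, assigned, true_cat in new_catalog if assigned == true_cat)
print(len(remaining), len(new_catalog), kept)  # 10 10 8
```

Keeping the true category alongside the assigned one is what later allows accuracy to be measured as the fraction of products in N that the integration returns to their true categories.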

Page 17: Catalog Integration R. Agrawal, R. Srikant:

Improvement in Accuracy (Pangea)

[Figure: Accuracy (y-axis, 65 to 100) against Weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Page 18: Catalog Integration R. Agrawal, R. Srikant:

Improvement in Accuracy (Reuters)

[Figure: Accuracy (y-axis, 82 to 100) against Weight (x-axis: 1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Page 19: Catalog Integration R. Agrawal, R. Srikant:

Improvement in Accuracy (Google.Outdoors)

[Figure: Accuracy (y-axis, 50 to 100) against Weight (x-axis: 1, 5, 25, 100, 400, 1000). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Page 20: Catalog Integration R. Agrawal, R. Srikant:

Tune Set Size (Pangea)

[Figure: Accuracy (y-axis, 70 to 95) against Tune Set Size (x-axis: 0, 5, 10, 20, 35, 50). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]

Similar results for Reuters and Google.

Page 21: Catalog Integration R. Agrawal, R. Srikant:

Empirical Results

[Figure: % Errors (y-axis, 0 to 20) against Purity, i.e. number of classes and their distribution (x-axis: 71-22-6, 79-21, 100). Series: Standard, Enhanced.]

Page 22: Catalog Integration R. Agrawal, R. Srikant:

Summary

Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.

How to apply these ideas to other types of classifiers?