
Catalog Integration

R. Agrawal, R. Srikant: WWW-10

Catalog Integration Problem

Integrate products from a new catalog into a master catalog.

[Figure: Master Catalog — ICs with subcategories Logic, Mem., DSP containing products a, b, c, d, e, f. New Catalog — ICs with subcategories Cat 1, Cat 2 containing products x, y, z.]

The Problem (cont.)

After integration:

[Figure: after integration, the master catalog's ICs hierarchy (Logic, Mem., DSP) contains the original products a–f together with the new products x, y, z.]

Desired Solution

Automatically integrate products:
- Little or no effort on the part of the user.
- Domain independent.

Problem size:
- Millions of products.
- Thousands of categories.

Model

- Product descriptions consist of words.
- Products live in the leaf-level categories.

Basic Algorithm

Build classification model using product descriptions in master catalog.

Use classification model to predict categories for products in the new catalog.

[Figure: product x from the new catalog is assigned predicted probabilities of 5% for Logic and 95% for DSP.]
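To make the two steps concrete, here is a minimal sketch of the basic algorithm using scikit-learn — a library choice assumed for illustration, not prescribed by the slides; all data and variable names are hypothetical:

```python
# Minimal sketch of the basic algorithm (illustrative, not the authors' code):
# train a Naive-Bayes text classifier on master-catalog descriptions,
# then predict master categories for new-catalog products.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy master catalog: product descriptions and their leaf-level categories.
master_docs = [
    "1A low dropout voltage regulator",
    "three driver five receiver rs-232 interface device",
]
master_cats = ["Voltage Regulator", "Transceiver"]

vectorizer = CountVectorizer()
X_master = vectorizer.fit_transform(master_docs)

model = MultinomialNB(alpha=0.1)  # alpha acts as the Lidstone smoothing parameter
model.fit(X_master, master_cats)

# Predict categories for products in the new catalog.
new_docs = ["wide adjustable range voltage regulator"]
X_new = vectorizer.transform(new_docs)
print(model.predict(X_new))        # top choice per product
print(model.predict_proba(X_new))  # full posterior Pr(Ci|d), cf. the 5%/95% figure
```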

National Semiconductor Files

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.

Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...

National Semiconductor Files with Categories

Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
- Choice 1: Transceiver
- Choice 2: Line Receiver
- Choice 3: Line Driver
- Choice 4: General-Purpose Silicon Rectifier
- Choice 5: Tapped Delay Line

Part: LM3940 1A Low Dropout Regulator
Pangea Category:
- Choice 1: Positive Fixed Voltage Regulator
- Choice 2: Voltage-Feedback Operational Amplifier
- Choice 3: Voltage Reference
- Choice 4: Voltage-Mode SMPS Controller
- Choice 5: Positive Adjustable Voltage Regulator

...

Accuracy on Pangea Data

- B2B portal for electronic components: 1200 categories, 40K training documents.
- 500 categories have fewer than 5 documents.
- Accuracy: 72% for the top choice; 99.7% for the top 5 choices.

Enhanced Algorithm: Intuition

Use the affinity information in the catalog to be integrated (the new catalog):

- Products in the same category are similar.
- Bias the classifier to incorporate this information.
- The accuracy boost depends on the quality of the new catalog: use a tuning set to determine the amount of bias.

Algorithm

Extension of the Naive-Bayes classifier to incorporate affinity information.

Naive Bayes Classifier

Pr(Ci|d) = Pr(Ci) Pr(d|Ci) / Pr(d)   // Bayes' rule

- Pr(d): same for all categories, so it can be ignored.
- Pr(Ci) = (#docs in Ci) / (#docs total).
- Pr(d|Ci) = ∏_{w∈d} Pr(w|Ci) — words are assumed to occur independently (unigram model).
- Pr(w|Ci) = (n(Ci, w) + λ) / (n(Ci) + λ|V|) — maximum-likelihood estimate smoothed with Lidstone's law of succession, where n(Ci, w) is the number of occurrences of word w in Ci, n(Ci) is the total word count of Ci, |V| is the vocabulary size, and λ is the smoothing parameter.
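These formulas map directly onto a short from-scratch implementation. The sketch below is illustrative; function names and the default λ = 0.5 are assumptions, not from the slides:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, cats, lam=0.5):
    """Collect the counts behind the formulas above:
    priors Pr(Ci) and Lidstone-smoothed likelihoods Pr(w|Ci)."""
    doc_counts = Counter(cats)            # #docs per category Ci
    word_counts = defaultdict(Counter)    # n(Ci, w)
    vocab = set()
    for doc, c in zip(docs, cats):
        words = doc.split()
        word_counts[c].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab, len(docs), lam

def log_posterior(doc, c, model):
    """log Pr(Ci) + sum over words of log Pr(w|Ci); Pr(d) is dropped
    because it is the same for every category."""
    doc_counts, word_counts, vocab, n_docs, lam = model
    n_ci = sum(word_counts[c].values())        # n(Ci): total words in Ci
    score = math.log(doc_counts[c] / n_docs)   # Pr(Ci) = #docs Ci / #docs total
    for w in doc.split():
        # Pr(w|Ci) = (n(Ci, w) + lam) / (n(Ci) + lam * |V|)
        score += math.log((word_counts[c][w] + lam) / (n_ci + lam * len(vocab)))
    return score

def predict(doc, model):
    return max(model[0], key=lambda c: log_posterior(doc, c, model))
```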

Enhanced Algorithm

Pr(Ci|d, S)   // d existed in category S of the new catalog

= Pr(Ci, d, S) / Pr(d, S)                  // Pr(Ci, d, S) = Pr(d, S) Pr(Ci|d, S)
= Pr(Ci) Pr(S, d|Ci) / Pr(d, S)
= Pr(Ci) Pr(S|Ci) Pr(d|Ci) / Pr(S, d)      // assuming d, S independent given Ci
= Pr(S) Pr(Ci|S) Pr(d|Ci) / Pr(S, d)       // Pr(S|Ci) Pr(Ci) = Pr(Ci|S) Pr(S)
= Pr(Ci|S) Pr(d|Ci) / Pr(d|S)              // Pr(S, d) = Pr(S) Pr(d|S)

Same as Naive Bayes except that Pr(Ci|S) replaces Pr(Ci); Pr(d|S) is ignored as it is the same for all classes.
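In code, the enhancement is a one-line change to the Naive-Bayes score sketched earlier: replace the prior Pr(Ci) with Pr(Ci|S). The sketch below assumes a mapping pr_ci_given_s, computed on the next slide; names are illustrative:

```python
import math  # continuing the train_nb / log_posterior sketch above

def log_posterior_enhanced(doc, c, s, model, pr_ci_given_s):
    """Enhanced score: identical to log_posterior above except that
    Pr(Ci|S) replaces the prior Pr(Ci); Pr(d|S) is dropped because it
    is the same for all classes. s is the new-catalog category of doc,
    and pr_ci_given_s maps (S, Ci) -> Pr(Ci|S)."""
    doc_counts, word_counts, vocab, n_docs, lam = model
    n_ci = sum(word_counts[c].values())
    score = math.log(pr_ci_given_s[(s, c)])   # Pr(Ci|S) instead of Pr(Ci)
    for w in doc.split():
        score += math.log((word_counts[c][w] + lam) / (n_ci + lam * len(vocab)))
    return score
```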

Computing Pr(Ci|S)

Pr(Ci|S) = |Ci| · (#docs in S predicted to be in Ci)^w / Σ_{j∈[1,n]} |Cj| · (#docs in S predicted to be in Cj)^w

- |Ci| = #docs in Ci in the master catalog.
- w determines the weight of the new catalog:
  - Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known.
  - Choose one weight for the entire new catalog, or different weights for different sections.
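A sketch of this estimate for a single new-catalog category S follows; the names and the note on zero counts are illustrative assumptions:

```python
from collections import Counter

def pr_ci_given_s(base_predictions, master_sizes, w):
    """Estimate Pr(Ci|S) for one new-catalog category S, per the
    formula above: proportional to |Ci| * (#docs in S predicted Ci)^w.
    base_predictions: basic-classifier prediction for each doc in S.
    master_sizes: |Ci|, the #docs per category in the master catalog."""
    counts = Counter(base_predictions)    # #docs in S predicted to be in Ci
    # Note: with w > 0, categories receiving no predictions get score 0;
    # in practice one would smooth these to avoid log(0) downstream.
    scores = {c: size * (counts[c] ** w) for c, size in master_sizes.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```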

Superiority of the Enhanced Algorithm

Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm. (Intuition: with weight w = 0, Pr(Ci|S) reduces to the prior Pr(Ci), so the enhanced algorithm degenerates to the basic one.)

Catch: The optimum value of the weight for which the enhanced algorithm achieves its highest accuracy is data dependent.

The tune-set method attempts to select a good value for the weight, but there is no guarantee of success.

Empirical Evaluation

- Start with a real catalog M.
- Remove n products from M to form the new catalog N.
- In the new catalog N:
  - Assign f·n products to the same category as in M.
  - Assign the rest to other categories as per some distribution (but remember their true category).
- Accuracy: the fraction of products in N assigned to their true categories (protocol sketched below).
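A sketch of this evaluation protocol; function and variable names are illustrative:

```python
import random

def make_new_catalog(master, n, f, other_cats):
    """Synthesize a new catalog N from master, a list of (doc, category)
    pairs: sample n products; f*n keep their master category, the rest
    are reassigned to other categories, while the true category is
    remembered. Returns (doc, assigned_cat, true_cat) triples."""
    sample = random.sample(master, n)
    catalog = []
    for i, (doc, true_cat) in enumerate(sample):
        if i < int(f * n):
            assigned = true_cat                   # stays in its true category
        else:
            assigned = random.choice(other_cats)  # perturbed per some distribution
        catalog.append((doc, assigned, true_cat))
    return catalog

def accuracy(predictions, catalog):
    """Fraction of products in N assigned to their true categories."""
    return sum(p == t for p, (_, _, t) in zip(predictions, catalog)) / len(catalog)
```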

Improvement in Accuracy (Pangea)

[Figure: accuracy (y-axis, 65–100%) vs. weight (x-axis, 1–200) for new catalogs Perfect, 90-10, 80-20, GaussianA, and GaussianB, compared against the Base accuracy of the basic algorithm.]

Improvement in Accuracy (Reuters)

[Figure: accuracy (y-axis, 82–100%) vs. weight (x-axis, 1–200) for Perfect, 90-10, 80-20, GaussianA, and GaussianB, compared against the Base accuracy.]

Improvement in Accuracy (Google.Outdoors)

[Figure: accuracy (y-axis, 50–100%) vs. weight (x-axis, 1–1000) for Perfect, 90-10, 80-20, GaussianA, and GaussianB, compared against the Base accuracy.]

Tune Set Size (Pangea)

[Figure: accuracy (y-axis, 70–95%) vs. tune set size (x-axis, 0–50) for Perfect, 90-10, 80-20, GaussianA, and GaussianB, compared against the Base accuracy.]

Similar results for Reuters and Google.

Empirical Results

[Figure: % errors (y-axis, 0–20) vs. purity of the new catalog (x-axis: 71-22-6, 79-21, 100 — the number of classes and their distribution) for the Standard and Enhanced algorithms.]

Summary

Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.

How to apply these ideas to other types of classifiers?
