
Page 1: Automatic Classification of Text Databases Through Query Probing

Automatic Classification of Text Databases Through Query Probing

Panagiotis G. Ipeirotis Luis Gravano

Columbia University

Mehran Sahami
E.piphany Inc.

Page 2: Automatic Classification of Text Databases Through Query Probing

Search-only Text Databases

Sources of valuable information
Hidden behind search interfaces
Non-crawlable

Example: Microsoft Support KB

Page 3: Automatic Classification of Text Databases Through Query Probing

Interacting With Searchable Text Databases

1. Searching: Metasearchers

2. Browsing: Use Yahoo-like directories

3. Browse & search: “Category-enabled” metasearchers

Page 4: Automatic Classification of Text Databases Through Query Probing

Searching Text Databases: Metasearchers

1. Select the good databases for a query
2. Evaluate the query at these databases
3. Combine the query results from the databases

Examples: MetaCrawler, SavvySearch, Profusion

Page 5: Automatic Classification of Text Databases Through Query Probing

Browsing Through Text Databases

Yahoo-like web directories:
  InvisibleWeb.com
  SearchEngineGuide.com
  TheBigHub.com

Example from InvisibleWeb.com:
  Computers > Publications > ACM DL

Category-enabled metasearchers:
  User-defined category (e.g. Recipes)

Page 6: Automatic Classification of Text Databases Through Query Probing

Problem With Current Classification Approach

Classification of databases is done manually
This requires a lot of human effort!

Page 7: Automatic Classification of Text Databases Through Query Probing

How to Classify Text Databases Automatically: Outline

Definition of classification
Strategies for classifying searchable databases through query probing
Initial experiments

Page 8: Automatic Classification of Text Databases Through Query Probing

Database Classification: Two Definitions

Coverage-based classification:
  The database contains many documents about the category (e.g. Basketball)
  Coverage: #docs about this category

Specificity-based classification:
  The database contains mainly documents about this category
  Specificity: (#docs about this category) / |DB|, where |DB| is the total number of documents in the database
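The two definitions can be written compactly. The symbols below are notation introduced here for illustration (DB for the database, C for a category); the slides only give the informal "#docs" form.

```latex
\mathrm{Coverage}(DB, C) = \bigl|\{\, d \in DB : d \text{ is about } C \,\}\bigr|
\qquad
\mathrm{Specificity}(DB, C) = \frac{\mathrm{Coverage}(DB, C)}{|DB|}
```

Coverage is an absolute count, so large general databases score high on many categories; specificity is a fraction between 0 and 1, so it favors databases focused on one category.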

Page 9: Automatic Classification of Text Databases Through Query Probing

Database Classification: An Example

Category: Basketball

Coverage-based classification: ESPN.com, NBA.com

Specificity-based classification: NBA.com, but not ESPN.com

Page 10: Automatic Classification of Text Databases Through Query Probing

Categorizing a Text Database: Two Problems

Find the category of a given document
Find the category of all the documents inside the database

Page 11: Automatic Classification of Text Databases Through Query Probing

Categorizing Documents

Several text classifiers available
  RIPPER (AT&T Research, William Cohen 1995)

Input: A set of pre-classified, labeled documents
Output: A set of classification rules

Page 12: Automatic Classification of Text Databases Through Query Probing

Categorizing Documents: RIPPER

Training set: Preclassified documents
  “Linux as a web server”: Computers
  “Linux vs. Windows: …”: Computers
  “Jordan was the leader of Chicago Bulls”: Sports
  “Smoking causes lung cancer”: Health

Output: Rule-based classifier
  IF linux THEN Computers
  IF jordan AND bulls THEN Sports
  IF lung AND cancer THEN Health
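To make the rule format concrete, here is a minimal sketch of applying such rules to a document. It only applies rules that are already given; it does not implement RIPPER's rule learning. The `RULES` encoding and the `classify` helper are hypothetical names chosen for this illustration.

```python
import re

# Hypothetical encoding of the three example rules: every word in the set
# must occur in the document for the rule to fire.
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]


def classify(text):
    """Return the categories of all rules whose words all appear in text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = []
    for required, category in RULES:
        # A rule fires when its required words are a subset of the document's words.
        if required <= words and category not in matched:
            matched.append(category)
    return matched


print(classify("Jordan was the leader of Chicago Bulls"))
```

A document that satisfies no rule gets an empty list, which is one reason the recall of such a classifier is below 1.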

Page 13: Automatic Classification of Text Databases Through Query Probing

Precision and Recall of Document Classifier

During the training phase:
  100 documents about computers
  “Computer” rules matched 50 docs
  Of these 50 docs, 40 were about computers

Precision = 40/50 = 0.8
Recall = 40/100 = 0.4
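The arithmetic on this slide can be reproduced with a few lines. The function and parameter names are made up for this sketch; the three numbers are the ones from the slide.

```python
def precision_recall(relevant_total, matched_total, matched_relevant):
    """Compute precision and recall for one category's rules.

    relevant_total:   documents that truly belong to the category
    matched_total:    documents matched by the category's rules
    matched_relevant: matched documents that truly belong to the category
    """
    precision = matched_relevant / matched_total
    recall = matched_relevant / relevant_total
    return precision, recall


# 100 computer documents, 50 matched by the rules, 40 of those correct.
precision, recall = precision_recall(100, 50, 40)
print(precision, recall)
```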

Page 14: Automatic Classification of Text Databases Through Query Probing

From Document to Database Classification

If we know the categories of all the documents, we are done!
But databases do not export such data!

How can we extract this information?

Page 15: Automatic Classification of Text Databases Through Query Probing

Our Approach: Query Probing

Design a small set of queries to probe the databases
Categorize the database based on the probing results

Page 16: Automatic Classification of Text Databases Through Query Probing

Designing and Implementing Query Probes

The probes should extract information about the categories of the documents in the database

Start with a document classifier (RIPPER)
Transform each rule into a query
  IF lung AND cancer THEN health -> +lung +cancer
  IF linux THEN computers -> +linux

Get number of matches for each query
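The rule-to-query transformation can be sketched as a small string conversion. This assumes rules are written exactly in the "IF a AND b THEN category" form used on the slides and that the target search interface accepts "+term" to require a term; `rule_to_query` is a hypothetical helper, not part of RIPPER.

```python
def rule_to_query(rule):
    """Convert 'IF a AND b THEN category' into ('+a +b', 'category')."""
    condition, category = rule.split(" THEN ")
    if not condition.startswith("IF "):
        raise ValueError("rule must start with 'IF ': " + rule)
    terms = condition[len("IF "):].split(" AND ")
    # Prefix every term with '+' so that all terms are required to match.
    query = " ".join("+" + term.strip() for term in terms)
    return query, category.strip()


for rule in ["IF lung AND cancer THEN health", "IF linux THEN computers"]:
    print(rule_to_query(rule))
```

Each query is then sent to the database, and only the reported number of matches is recorded for the rule's category.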

Page 17: Automatic Classification of Text Databases Through Query Probing

Three Categories and Three Databases

Databases: ACM DL, NBA.com, PubMED

Probes:
  lung AND cancer -> health
  jordan AND bulls -> sports
  linux -> computers

Number of matches per category and database:

          ACM    NBA    PubM
comp      336      0      16
sports      0   6674       0
health     18    103   81164

Page 18: Automatic Classification of Text Databases Through Query Probing

Using the Results for Classification

COV (number of matches)

          ACM    NBA    PubM
comp      336      0      16
sports      0   6674       0
health     18    103   81164

SPEC (specificity)

          ACM    NBA    PubM
comp     0.95      0       0
sports      0  0.985       0
health   0.05  0.015     1.0

[Figure: bar chart of specificity (0 to 1) by category (comp, sports, health) for ACM, NBA, and PubMed]

We use the results to estimate coverage and specificity values
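As a rough check, the specificity values can be recomputed from the match counts of the three-database example. This sketch assumes that |DB| is approximated by the sum of a database's matches over the three categories, which is an assumption made here because it reproduces the slide's numbers; the names `MATCHES` and `specificity` are illustrative.

```python
# Match counts per database and category, taken from the example.
MATCHES = {
    "ACM": {"comp": 336, "sports": 0, "health": 18},
    "NBA": {"comp": 0, "sports": 6674, "health": 103},
    "PubM": {"comp": 16, "sports": 0, "health": 81164},
}


def specificity(counts):
    """Divide each category's count by the total over all categories."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


SPEC = {db: specificity(counts) for db, counts in MATCHES.items()}
for db, values in SPEC.items():
    print(db, {category: round(v, 3) for category, v in values.items()})
```

The coverage estimate is the raw count itself, so no extra computation is needed for it at this stage.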

Page 19: Automatic Classification of Text Databases Through Query Probing

Adjusting Query Results

Classifiers are not perfect!
  Queries do not “retrieve” all the documents that belong to a category
  Queries for one category “match” documents that do not belong to this category

From the training phase of the classifier we use precision and recall

Page 20: Automatic Classification of Text Databases Through Query Probing

Precision & Recall Adjustment

Computer-category:
  Rule: “linux”, Precision = 0.7
  Rule: “cpu”, Precision = 0.9
  Recall (for all the rules) = 0.4

Probing with queries for “Computers”:
  Query: +linux -> X1 matches -> 0.7 X1 correct matches
  Query: +cpu -> X2 matches -> 0.9 X2 correct matches

From X1 + X2 documents found:
  Expect 0.7 X1 + 0.9 X2 to be correct
  Expect (0.7 X1 + 0.9 X2) / 0.4 total computer docs
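The adjustment above is a weighted sum followed by a division. A minimal sketch, with hypothetical match counts X1 = 100 and X2 = 200 chosen only for illustration (the slides leave X1 and X2 symbolic):

```python
def estimate_category_docs(matches_and_precisions, recall):
    """Return (expected correct matches, estimated total docs in the category).

    matches_and_precisions: list of (match count, rule precision) pairs,
    one pair per probe query of the category.
    recall: recall of all the category's rules taken together.
    """
    # Scale each query's match count by the precision of its rule.
    expected_correct = sum(count * precision
                           for count, precision in matches_and_precisions)
    # The rules only find a 'recall' fraction of the category's documents.
    return expected_correct, expected_correct / recall


# Hypothetical counts: +linux matched 100 docs, +cpu matched 200 docs.
correct, total = estimate_category_docs([(100, 0.7), (200, 0.9)], recall=0.4)
print(correct, total)
```

With these made-up counts, about 250 of the 300 matches are expected to be correct, and the category is estimated to hold about 625 documents.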

Page 21: Automatic Classification of Text Databases Through Query Probing

Initial Experiments

Used a collection of 20,000 newsgroup articles
Formed 5 categories:
  Computers (comp.*)
  Science (sci.*)
  Hobbies (rec.*)
  Society (soc.* + alt.atheism)
  Misc (misc.sale)

RIPPER trained with 10,000 newsgroup articles
Classifier: 29 rules, 32 words used
  IF windows AND pc THEN Computers (precision ~0.75)
  IF satellite AND space THEN Science (precision ~0.9)
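The grouping of newsgroups into the five categories can be expressed as a lookup. This is only a sketch of the labeling step: the group names and patterns are copied from the slide as written, the helper name is made up, and any newsgroup not covered by the slide is left unlabeled.

```python
# Prefix and exact-name mappings, copied from the category list on the slide.
PREFIX_TO_CATEGORY = {
    "comp.": "Computers",
    "sci.": "Science",
    "rec.": "Hobbies",
    "soc.": "Society",
}
EXACT_TO_CATEGORY = {
    "alt.atheism": "Society",
    "misc.sale": "Misc",
}


def newsgroup_category(name):
    """Return the category for a newsgroup name, or None if it is not covered."""
    if name in EXACT_TO_CATEGORY:
        return EXACT_TO_CATEGORY[name]
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if name.startswith(prefix):
            return category
    return None


print(newsgroup_category("sci.space"))
```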

Page 22: Automatic Classification of Text Databases Through Query Probing

Web-databases Probed

Using the newsgroup classifier we probed four web databases:
  Cora (www.cora.jprc.com): CS papers archive (Computers)
  American Scientist (www.amsci.org): Science and technology magazine (Science)
  All Outdoors (www.alloutdoors.com): Articles about outdoor activities (Hobbies)
  Religion Today (www.religiontoday.com): News and discussion about religions (Society)

Page 23: Automatic Classification of Text Databases Through Query Probing

Results

[Figure: bar chart of specificity (0 to 1) by category (Computers, Science, Hobbies, Society, Misc) for Cora, American Scientist, AllOutdoors, and ReligionToday; the bar labels are not recoverable from the transcript]

Only 29 queries per web site
No need for document retrieval!

Page 24: Automatic Classification of Text Databases Through Query Probing

Conclusions

Easy classification using only a small number of queries
No need for document retrieval
  Only need a result like: “X matches found”
Not limited to search-only databases
  Every searchable database can be classified this way
Not limited to topical classification
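Since only the match count is needed, a probe can stop at reading one number from the result page. The sketch below assumes the page text contains a phrase of the form "X matches found"; real search interfaces word this differently, so the pattern and the helper name are illustrative assumptions.

```python
import re

# Assumed result phrasing: a number (optionally with thousands separators)
# followed by "match found" or "matches found".
MATCH_COUNT = re.compile(r"(\d[\d,]*)\s+match(?:es)?\s+found", re.IGNORECASE)


def parse_match_count(page_text):
    """Return the reported number of matches, or None if no count is found."""
    found = MATCH_COUNT.search(page_text)
    if found is None:
        return None
    return int(found.group(1).replace(",", ""))


print(parse_match_count("Your search: +lung +cancer. 81,164 matches found."))
```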

Page 25: Automatic Classification of Text Databases Through Query Probing

Current Issues

Comprehensive classification scheme
Representative training data

Page 26: Automatic Classification of Text Databases Through Query Probing

Future Work

Use a hierarchical classification scheme
Test different search interfaces
  Boolean model
  Vector-space model
  Different capabilities
Compare with document sampling (Callan et al.’s work, SIGMOD99, adapted for the classification task)
Study classification efficiency when documents are accessible

Page 27: Automatic Classification of Text Databases Through Query Probing

Related Work

Gauch (JUCS 1996)
Etzioni et al. (JIIS 1997)
Hawking & Thistlewaite (TOIS 1999)
Callan et al. (SIGMOD 1999)
Meng et al. (CoopIS 1999)