Automatic Discovery and Classification of search interface to the Hidden Web
Dean Lee and Richard Sia
Dec 2nd 2003
Goals and Motivation
- The Hidden Web is informative
- No current search engine can index it (not even Google)
Next-Generation Search Engine
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages
Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled in November 2003
- 20G/4G before/after compression
- Root-level web pages only, e.g. http://www.ucla.edu

Why root-level only?
- 80% of search interfaces are contained in root-level pages (from UIUC)
- Efficient and cost effective: 3B web pages vs. 8M web sites
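Restricting the crawl to root pages amounts to collapsing every URL to its scheme and host. A minimal sketch of that reduction (the URL list here is illustrative, not from the actual dmoz crawl):

```python
from urllib.parse import urlparse

def root_url(url):
    # Reduce a URL to its root-level page,
    # e.g. http://www.ucla.edu/admissions -> http://www.ucla.edu
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

urls = [
    "http://www.ucla.edu/admissions/apply",
    "http://www.ucla.edu",
    "http://www.amazon.com/books/fiction",
]
roots = sorted({root_url(u) for u in urls})
print(roots)  # ['http://www.amazon.com', 'http://www.ucla.edu']
```

Deduplicating on the root URL is what shrinks 3B pages to roughly 8M sites.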
Search Interface Classification
- Most search interfaces are inside <Form> </Form> tags
- Identify specific features (e.g. keywords, special tags) that are common to all search interfaces

Training Sets for C4.5
- Initially only a positive training set
- Several classification iterations using real web data
- For each iteration, add correct classifications to the positive and negative training sets
- Do the same for misclassified web pages
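Before a C4.5 tree can be trained, each <Form> block has to be turned into a feature vector. A rough sketch of that extraction step, using an invented feature set and keyword list (the talk does not specify the exact features):

```python
import re

# Illustrative keyword list -- the actual feature set used with C4.5 is not given
SEARCH_KEYWORDS = {"search", "find", "query", "keyword"}

def form_features(html):
    """Extract a simple, hypothetical feature dict from each <form>...</form> block."""
    features = []
    for form in re.findall(r"<form.*?</form>", html, re.S | re.I):
        text = form.lower()
        features.append({
            # a search box is usually a single text input
            "num_text_inputs": len(re.findall(r'type\s*=\s*["\']?text', text)),
            # search-related vocabulary in attributes or button labels
            "has_search_keyword": any(k in text for k in SEARCH_KEYWORDS),
            # password fields suggest a login form, not a search interface
            "has_password_field": bool(re.search(r'type\s*=\s*["\']?password', text)),
        })
    return features

page = '<form action="/s"><input type="text" name="q"><input type="submit" value="Search"></form>'
print(form_features(page))
```

Vectors like these, labeled positive or negative across iterations, are what the decision-tree learner would consume.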
Results
- Checked via random sampling: select 100 random web pages and manually check the correctness of the classification
- 91.5% accuracy in correctly identifying search interfaces (precision)
- 87.5% accuracy in correctly identifying non-search interfaces
Results
- Random sampling estimate: 124,311 search interfaces exist in our data set
- OCLC estimated about 8.7M unique websites in 2003
- Total number of search interfaces on the web (upper bound):
  124,311 × (8.7M / 1.7M) × (1 / 0.8) ≈ 800K
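The extrapolation scales the 124,311 interfaces found in the 1.7M-site crawl up to OCLC's 8.7M-site estimate, then corrects for the 80% of interfaces reachable from root pages. A quick check of the arithmetic:

```python
found = 124_311        # search interfaces found in the crawled data set
crawled_sites = 1.7e6  # sites actually crawled
total_sites = 8.7e6    # OCLC's 2003 estimate of unique websites
root_coverage = 0.8    # fraction of search interfaces reachable from root pages

estimate = found * (total_sites / crawled_sites) / root_coverage
print(round(estimate / 1000))  # ~795 thousand, i.e. roughly 800K
```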
Domain Classification
- Manually extract domain-specific keywords
  - Cars: odometer, mileage, airbag, acura, …
  - Books: ISBN, author, title, publication, …
- 240 keywords used
- 4 target categories {Books, Cars, Entertainment, Travel} + "Others"
Domain Classification: Naive Bayes Classifier
- Bad result
  - Keywords used are not specific enough to distinguish between domains
  - Websites span different topics
- Probabilistic
- Trap of analysis based on content only
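The keyword-driven approach can be sketched as a naive-Bayes-style scorer over per-domain keyword counts. The keyword lists and smoothing below are illustrative stand-ins, not the 240 keywords or exact model from the talk:

```python
from collections import Counter
import math
import re

# Toy keyword lists (the talk used ~240 manually chosen keywords)
DOMAIN_KEYWORDS = {
    "Books": ["isbn", "author", "title", "publication"],
    "Cars": ["odometer", "mileage", "airbag", "acura"],
}

def classify(text, prior=0.5):
    """Score each domain by smoothed keyword frequencies; return the best domain."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        total = sum(words.values()) + len(keywords)  # add-one smoothing denominator
        score = math.log(prior)
        for k in keywords:
            score += math.log((words[k] + 1) / total)
        scores[domain] = score
    return max(scores, key=scores.get)

print(classify("Search by ISBN or author to find a title"))  # -> Books
```

With keywords this generic, many pages match several domains weakly, which is exactly the "not specific enough" failure mode noted above.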
Domain Classification: C4.5 Classification Tree
- "Better" result: more are classified as "Others"
- Deterministic
- Improvement needed: more keywords, link structure, analysis of search results
Conclusion
- A tool for automatic search interface detection
- Rough estimate of the total number of search interfaces / size of the Hidden Web
- Domain classification still needs improvement
Some Statistics: Precision
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some Examples
- http://www.barnesandnoble.com – Books
- http://www.amazon.com – Entertainment
- http://www.travelocity.com – Travel
- http://www.cnn.com – Others
- http://www.latimes.com – Cars
- http://www.nih.gov – Travel
- http://www.healthfinder.gov – Others