Automatic Discovery and Classification of search interface to the Hidden Web
Dean Lee and Richard Sia
Dec 2nd 2003
Goals and Motivation
- The Hidden Web is informative
- No current search engine can index it (not even Google)
Next-Generation Search Engine
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages
Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled in November 2003
- 20G/4G before/after compression
- Root-level web pages only, e.g. http://www.ucla.edu

Why root-level only?
- 80% of search interfaces are contained in root-level pages (from UIUC)
- Efficient and cost effective: 3B web pages vs. 8M web sites
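Restricting the crawl to root pages amounts to collapsing every URL to its scheme and host. A minimal sketch of that reduction (the URL list here is illustrative, not from the actual dmoz crawl):

```python
from urllib.parse import urlparse

def root_url(url):
    # Reduce a URL to its root-level page,
    # e.g. http://www.ucla.edu/admissions -> http://www.ucla.edu
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

urls = [
    "http://www.ucla.edu/admissions/apply",
    "http://www.ucla.edu",
    "http://www.amazon.com/books/fiction",
]
roots = sorted({root_url(u) for u in urls})
print(roots)  # ['http://www.amazon.com', 'http://www.ucla.edu']
```

Deduplicating on the root URL is what shrinks 3B pages to roughly 8M sites.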
Search Interface Classification
- Most search interfaces are inside <Form> </Form> tags
- Identify specific features (e.g. keywords, special tags) that are common to all search interfaces

Training Sets for C4.5
- Initially only a positive training set
- Several classification iterations using real web data
- For each iteration, add correct classifications to the positive and negative training sets
- Do the same for misclassified web pages
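Before a C4.5 tree can be trained, each <Form> block has to be turned into a feature vector. A rough sketch of that extraction step, using an invented feature set and keyword list (the talk does not specify the exact features):

```python
import re

# Illustrative keyword list -- the actual feature set used with C4.5 is not given
SEARCH_KEYWORDS = {"search", "find", "query", "keyword"}

def form_features(html):
    """Extract a simple, hypothetical feature dict from each <form>...</form> block."""
    features = []
    for form in re.findall(r"<form.*?</form>", html, re.S | re.I):
        text = form.lower()
        features.append({
            # a search box is usually a single text input
            "num_text_inputs": len(re.findall(r'type\s*=\s*["\']?text', text)),
            # search-related vocabulary in attributes or button labels
            "has_search_keyword": any(k in text for k in SEARCH_KEYWORDS),
            # password fields suggest a login form, not a search interface
            "has_password_field": bool(re.search(r'type\s*=\s*["\']?password', text)),
        })
    return features

page = '<form action="/s"><input type="text" name="q"><input type="submit" value="Search"></form>'
print(form_features(page))
```

Vectors like these, labeled positive or negative across iterations, are what the decision-tree learner would consume.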
Results
- Checked via random sampling: select 100 random web pages and manually check the correctness of the classification
- 91.5% accuracy in correctly identifying search interfaces (precision)
- 87.5% accuracy in correctly identifying non-search interfaces
Results
- Random sampling estimate: 124,311 search interfaces exist in our data set
- OCLC estimated about 8.7M unique websites in 2003
- Total number of search interfaces on the web (upper bound):
  124,311 × (8.7M / 1.7M) × (1 / 0.8) ≈ 800K
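The extrapolation scales the 124,311 interfaces found in the 1.7M-site crawl up to OCLC's 8.7M-site estimate, then corrects for the 80% of interfaces reachable from root pages. A quick check of the arithmetic:

```python
found = 124_311        # search interfaces found in the crawled data set
crawled_sites = 1.7e6  # sites actually crawled
total_sites = 8.7e6    # OCLC's 2003 estimate of unique websites
root_coverage = 0.8    # fraction of search interfaces reachable from root pages

estimate = found * (total_sites / crawled_sites) / root_coverage
print(round(estimate / 1000))  # ~795 thousand, i.e. roughly 800K
```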
Domain Classification
- Manually extract domain-specific keywords
  - Cars: odometer, mileage, airbag, acura, …
  - Books: ISBN, author, title, publication, …
- 240 keywords used
- 4 target categories {Books, Cars, Entertainment, Travel} + "Others"
Domain Classification: Naive Bayes Classifier
- Bad result
  - Keywords used are not specific enough to distinguish between domains
  - Websites span different topics
- Probabilistic
- Trap of analysis based on content only
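The keyword-driven approach can be sketched as a naive-Bayes-style scorer over per-domain keyword counts. The keyword lists and smoothing below are illustrative stand-ins, not the 240 keywords or exact model from the talk:

```python
from collections import Counter
import math
import re

# Toy keyword lists (the talk used ~240 manually chosen keywords)
DOMAIN_KEYWORDS = {
    "Books": ["isbn", "author", "title", "publication"],
    "Cars": ["odometer", "mileage", "airbag", "acura"],
}

def classify(text, prior=0.5):
    """Score each domain by smoothed keyword frequencies; return the best domain."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        total = sum(words.values()) + len(keywords)  # add-one smoothing denominator
        score = math.log(prior)
        for k in keywords:
            score += math.log((words[k] + 1) / total)
        scores[domain] = score
    return max(scores, key=scores.get)

print(classify("Search by ISBN or author to find a title"))  # -> Books
```

With keywords this generic, many pages match several domains weakly, which is exactly the "not specific enough" failure mode noted above.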
Domain Classification: C4.5 Classification Tree
- "Better" result: more are classified as "Others"
- Deterministic
- Improvement needed: more keywords, link structure, analysis of search results
Conclusion
- A tool for automatic search interface detection
- Rough estimate of the total number of search interfaces / size of the Hidden Web
- Domain classification still needs improvement
Some Statistics: Precision
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some Examples
- http://www.barnesandnoble.com – Books
- http://www.amazon.com – Entertainment
- http://www.travelocity.com – Travel
- http://www.cnn.com – Others
- http://www.latimes.com – Cars
- http://www.nih.gov – Travel
- http://www.healthfinder.gov – Others