Query Rewriting for Extracting Data Behind HTML Forms

Xueqi Chen

Department of Computer Science

Brigham Young University

March, 2003

Funded by National Science Foundation

Motivation

• Web information is stored in databases
• Databases are accessed through forms
• Automated agents are of great value
• Process is difficult because of nature of forms

System Flowchart

[Figure: system flowchart. Labeled components: User Query, Site Form, Application Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]

User Query Acquisition

Our system provides a query form generated from an application-specific ontology

Site Form Analysis

Understand type, name, and/or values for each field
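
As a rough illustration of what collecting the type, name, and values of each field can look like, here is a minimal sketch built on Python's standard `html.parser`. The class name `FormFieldCollector`, the helper `analyze_form`, and the sample form are all hypothetical and are not taken from the system described in these slides; `<option>` elements without a `value` attribute are ignored for brevity.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects the type, name, and listed values of each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []     # one dict per field, in document order
        self._select = None  # the <select> currently being read, if any
        self._groups = {}    # (type, name) -> shared dict for radio/checkbox groups

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            kind = (a.get("type") or "text").lower()
            name, value = a.get("name"), a.get("value")
            if kind in ("radio", "checkbox"):
                # Inputs sharing a name form one logical field with several values.
                key = (kind, name)
                if key not in self._groups:
                    self._groups[key] = {"type": kind, "name": name, "values": []}
                    self.fields.append(self._groups[key])
                if value is not None:
                    self._groups[key]["values"].append(value)
            else:
                values = [value] if value is not None else []
                self.fields.append({"type": kind, "name": name, "values": values})
        elif tag == "select":
            self._select = {"type": "select", "name": a.get("name"), "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            if a.get("value") is not None:
                self._select["values"].append(a["value"])
        elif tag == "textarea":
            self.fields.append({"type": "textarea", "name": a.get("name"), "values": []})

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def analyze_form(html: str) -> list:
    """Return a list of {'type', 'name', 'values'} dicts for the given HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical site form used only to exercise the sketch.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
  <input type="radio" name="cond" value="new">
  <input type="radio" name="cond" value="used">
  <input type="submit" value="Search">
</form>
"""

if __name__ == "__main__":
    for field in analyze_form(SAMPLE_FORM):
        print(field)
```

Fields that list their values (select lists, radio groups) can be matched on those values; free-text fields expose only a type and a name.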

Form Filling

Name matching
• Regular Expressions – for fields with values provided
• Stemming
• Levenshtein Edit Distance
• Longest Common Subsequences
• Soundex
• WordNet

Value matching
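
Two of the name-matching techniques listed on this slide, Levenshtein edit distance and longest common subsequence, can be sketched as follows. How the original system combines or weights its techniques is not stated in the slides, so the equal-weight average in `name_similarity`, the `normalize` step, the 0.6 threshold, and the sample names are assumptions made purely for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase a label and keep only letters and digits (assumed preprocessing)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_similarity(field_name: str, ontology_name: str) -> float:
    """Score in [0, 1]; the equal-weight average is an illustrative assumption."""
    a, b = normalize(field_name), normalize(ontology_name)
    if not a or not b:
        return 0.0
    longest = max(len(a), len(b))
    edit_score = 1 - levenshtein(a, b) / longest
    lcs_score = lcs_length(a, b) / longest
    return (edit_score + lcs_score) / 2


def best_match(field_name, ontology_names, threshold=0.6):
    """Return the most similar ontology name, or None if nothing clears the threshold."""
    scored = [(name_similarity(field_name, n), n) for n in ontology_names]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical ontology object-set names.
    print(best_match("colour", ["Make", "Model", "Color"]))
```

A string-similarity score of this kind handles spelling variants; it does not capture synonyms, which is presumably where a resource such as WordNet comes in.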

Value Matching: Case 1

Value Matching: Case 2

Value Matching: Case 3

Color?

Value Matching: Case 4

Value Matching: Case 5

Value Matching: Case 6

Value Matching: Case 7

Measurements

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

Measurements (cont’)

• Matching Efficiency

\[
\text{recall} = \frac{\text{No. of correctly matched fields}}{\text{No. of fields that could have been matched}}
\]

\[
\text{precision} = \frac{\text{No. of correctly matched fields}}{\text{No. of matched fields}}
\]
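
To make the two ratios concrete: recall divides the number of correctly matched fields by the number of fields that could have been matched, and precision divides it by the number of fields actually matched. The small helper below is a sketch; the function name `precision_recall` and the counts in the example are hypothetical, not results from the system.

```python
def precision_recall(correct: int, produced: int, possible: int):
    """Return (precision, recall) from three counts.

    correct  -- items the system got right (e.g. correctly matched fields)
    produced -- items the system produced (e.g. all matched fields)
    possible -- items it could have produced (e.g. all matchable fields)
    A zero denominator yields 0.0 rather than raising an error.
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up counts: 8 correct matches out of 10 made, with 16 fields matchable.
    print(precision_recall(correct=8, produced=10, possible=16))
```

The same helper applies unchanged to submitted queries and returned records, since each measurement uses the same pair of ratios over different counts.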

Measurements (cont’)

• Matching Efficiency
• Submission Efficiency

\[
\text{recall} = \frac{\text{No. of correct queries submitted}}{\text{No. of queries that could have been submitted}}
\]

\[
\text{precision} = \frac{\text{No. of correct queries submitted}}{\text{No. of submitted queries}}
\]

Measurements (cont’)

• Matching Efficiency
• Submission Efficiency
• Post-processing Efficiency

\[
\text{recall} = \frac{\text{No. of correct records our system returned}}{\text{No. of records that could have been returned}}
\]

\[
\text{precision} = \frac{\text{No. of correct records our system returned}}{\text{No. of records our system returned}}
\]

Contributions

It enhances the effectiveness of the data-extraction process

It presents another technique, in addition to [RGa01], to access data behind HTML forms.