Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen,¹ David W. Embley,¹ and Stephen W. Liddle²
¹ Department of Computer Science
² Rollins Center for eBusiness
Brigham Young University
November 9, 2004
Funded by the National Science Foundation under grant IIS-0083127
Motivation
• Web information is stored in databases
• Databases are accessed through forms
• Forms are designed in various ways
• Automated agents are of great value
Prototype System Flowchart
[Flowchart. Components: Input Analyzer, Output Analyzer. Data items: User Query, Site Form, Application Extraction Ontology, Retrieved Page(s), Extracted Information.]
Input Analyzer – User Query Acquisition
System creates a form based on application-specific ontology
Input Analyzer – User Query Acquisition (cont.)
Input Analyzer – Site Form Analysis
Understand name, type, and/or values for each field
Input Analyzer – Form Query Generation
Form field name recognition
– For all fields
Form field value recognition
– For range fields only
Form field matching (Cases 0–5)
– For all fields
Form Field Name Recognition
Match by value
– Application extraction ontology
Match by name
– WordNet-based C4.5 decision tree learning algorithm
– Levenshtein edit distance, SoundEx, and longest common subsequence (LCS)
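Two of the string measures named above can be sketched in a few lines. This is a minimal illustration, not the prototype's code: `name_similarity` and its scoring rule are hypothetical, no threshold is implied, and the WordNet/C4.5 and SoundEx parts are left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters in order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_similarity(concept: str, field_name: str) -> float:
    """Hypothetical score in [0, 1]: share of the concept name found, in order, in the field name."""
    concept, field_name = concept.lower(), field_name.lower()
    if not concept:
        return 0.0
    return lcs_length(concept, field_name) / len(concept)
```

Under this illustrative score, a field named `min_price` contains all five letters of `price` in order, while a heavily abbreviated name such as `lo_p` shares only one, which is the kind of label that string measures alone struggle with.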
Form Field Value Recognition
For range fields only
Form Field Value Recognition: Type 1
Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000];
Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999];
Paired = false.
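With Paired = false, the two lists presumably come from independent minimum and maximum fields. One plausible way to cover a user's range is to take the largest lower value not above the requested minimum and the smallest upper value not below the requested maximum. The helper `cover_unpaired` is a hypothetical name used only for this sketch.

```python
from typing import List, Tuple


def cover_unpaired(lower: List[int], upper: List[int], lo: int, hi: int) -> Tuple[int, int]:
    """Pick the tightest (lower, upper) form values that still enclose [lo, hi]."""
    candidates_lo = [v for v in lower if v <= lo]
    candidates_hi = [v for v in upper if v >= hi]
    if not candidates_lo or not candidates_hi:
        raise ValueError("form values cannot enclose the requested range")
    return max(candidates_lo), min(candidates_hi)


# Value lists copied from the Type 1 slide; the query range is made up.
lower = [0, 1, 5000, 10000, 15000, 20000, 30000]
upper = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
print(cover_unpaired(lower, upper, 6000, 12000))
```

The form then returns a superset of what the user asked for, and the excess has to be filtered out after extraction.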
Form Field Value Recognition: Type 2
Lower value list: [0, 0, 5001, 10001, 15001, 20001];
Upper value list: [999999, 5000, 10000, 15000, 20000, 999999];
Paired = true.
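With Paired = true, the i-th lower value and the i-th upper value belong to the same form option. The sketch below shows one possible selection strategy, which is an assumption and not necessarily what the prototype does: use the narrowest single option that encloses the requested range, and otherwise fall back to every option that overlaps it. `choose_paired` is a hypothetical name.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def choose_paired(lower: List[int], upper: List[int], lo: int, hi: int) -> List[Range]:
    """Return the form options (as ranges) to submit for the requested range [lo, hi]."""
    options = list(zip(lower, upper))  # i-th lower pairs with i-th upper
    enclosing = [(a, b) for a, b in options if a <= lo and hi <= b]
    if enclosing:
        # Narrowest option that contains the whole requested range.
        return [min(enclosing, key=lambda r: r[1] - r[0])]
    # No single option is wide enough: submit every overlapping option.
    return [(a, b) for a, b in options if a <= hi and lo <= b]


# Value lists copied from the Type 2 slide.
lower = [0, 0, 5001, 10001, 15001, 20001]
upper = [999999, 5000, 10000, 15000, 20000, 999999]
print(choose_paired(lower, upper, 6000, 9000))
```

A range such as 6000 to 12000 is enclosed only by the catch-all option (0, 999999), so this strategy would submit that single option and rely on post-processing to narrow the results.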
Form Field Value Recognition: Type 3
Lower value list: [25, 25, 25, 25, 25, 25, 25];
Upper value list: [25, 50, 100, 300, 500, 500, 500];
Paired = true.
Form Field Matching: Case 0
Field specified in user query (Q) is the same as in a site form (F)
Form Field Matching: Case 1
Field in Q is not contained in F, but is in the returned information
Form Field Matching: Case 2
Field in Q is not contained in F, and is not in the returned information
[Figure callout: Color?]
Form Field Matching: Case 3
Field required by F is not provided in Q, but a general default value, such as “All” or “Any”, is provided by F
Form Field Matching: Case 4
Field required by F is not provided in Q, and the default value provided by the site form is specific, not “All” or “Any”
Form Field Matching: Case 5
Values specified in Q do not match values provided in F
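The six cases can be read as a decision procedure over a single field. The sketch below is one interpretation of the slide wording, not the prototype's logic; the function name, its parameters, and the way Case 0 is separated from Case 5 are all assumptions.

```python
from typing import Optional, Sequence


def matching_case(in_query: bool, in_form: bool, *,
                  in_results: bool = False,
                  query_value: Optional[str] = None,
                  form_values: Sequence[str] = (),
                  form_default: Optional[str] = None) -> Optional[int]:
    """Return the matching case (0-5) for one field, or None if the field is irrelevant."""
    if in_query and in_form:
        # Assumption: Case 5 applies when the form lists values and the user's value is not among them.
        if form_values and query_value not in form_values:
            return 5
        return 0
    if in_query:
        # Field asked for by the user but missing from the form.
        return 1 if in_results else 2
    if in_form:
        # Field required by the form but not given by the user.
        if form_default is not None and form_default.lower() in {"all", "any"}:
            return 3
        return 4  # assumption: a missing default is treated like a specific one
    return None
```

For example, a user constraint on color that the form does not offer and the result pages do not report falls under Case 2, while a required field whose default is "Any" falls under Case 3.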
Output Analyzer
Form results processor
– Record separator
– BYU Ontos
Final results generator
– Database manipulation
  Single table
  Multiple tables
A Car-ads Search Example
A Car-ads Search Example (cont.)
Measurements
Field-matching efficiency
$$R_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of fields that should have been matched}}$$
$$P_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of matched fields}}$$
Measurements (cont.)
Field-matching efficiency
Query-submission efficiency
$$R_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries that should have been submitted}}$$
$$P_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries submitted}}$$
Measurements (cont.)
Field-matching efficiency
Query-submission efficiency
Overall efficiency
$$R_{overall} = R_{fm} \times R_{qs}$$
$$P_{overall} = P_{fm} \times P_{qs}$$
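The measures can be checked against the counts reported on the Experimental Results slides. The sketch assumes that overall efficiency is the product of the field-matching and query-submission figures, which is consistent with the reported tables; the variable names are illustrative.

```python
def ratio(correct: int, total: int) -> float:
    """Recall or precision as a fraction: correct items over the relevant total."""
    return correct / total


# Car-ads counts from the results slide.
car_r_overall = ratio(21, 21) * ratio(249, 249)   # R_fm * R_qs
car_p_overall = ratio(21, 21) * ratio(249, 301)   # P_fm * P_qs

# Query-submission precision when next-link queries are also counted.
car_p_qs_next = ratio(249 + 1847, 301 + 1858)

# Digital-camera counts from the results slide.
cam_r_overall = ratio(21, 23) * ratio(31, 31)
cam_p_overall = ratio(21, 21) * ratio(31, 31)

for label, value in [("car recall", car_r_overall),
                     ("car precision", car_p_overall),
                     ("car precision incl. next links", car_p_qs_next),
                     ("camera recall", cam_r_overall),
                     ("camera precision", cam_p_overall)]:
    print(f"{label}: {value:.1%}")
```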
Experimental Results
Car-ads search
Number of Forms: 7
Number of Fields in Forms: 31
Number of Fields Applicable to Ontology: 21 (67.7%)
| | Field Matching | Query Submission | Overall |
|---|---|---|---|
| Recall | 100% (21/21) | 100% (249/249) | 100% |
| Precision | 100% (21/21) | 82.7% (249/301) [97.1% (249+1847)/(301+1858)]* | 82.7% [97.1%]* |

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
Experimental Results (cont.)
Digital-camera search
Number of Forms: 7
Number of Fields in Forms: 41
Number of Fields Applicable to Ontology: 23 (56.1%)
| | Field Matching | Query Submission | Overall |
|---|---|---|---|
| Recall | 91.3% (21/23) | 100% (31/31) | 91.3% |
| Precision | 100% (21/21) | 100% (31/31) [100% (31+85)/(31+85)]* | 100% [100%]* |

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
Results Discussion
Field matching
– By value
  Successful: 100%
– By name
  Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
  Failed: price vs. lo_p, hi_p
Results Discussion (cont.)
Query submission
Conclusion
Our system’s performance
– Fields applicable to extraction ontologies: 61.9%
– Fields system matched: 95.7%
– Queries submitted that are necessary: 91.4%
To improve the performance
– Field labels
– The quality of the extraction ontologies
Forms our system does not handle
– Multiple forms
– Forms whose actions are coded inside scripts
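The three performance percentages above appear to be unweighted means of the car-ads and digital-camera figures. That reading is an inference from the reported counts, not something the slides state; the quick check below uses hypothetical dictionary keys and the counts from the Experimental Results slides.

```python
# Per-domain (numerator, denominator) pairs copied from the Experimental Results slides.
car = {"applicable": (21, 31), "matched": (21, 21), "necessary": (249, 301)}
camera = {"applicable": (23, 41), "matched": (21, 23), "necessary": (31, 31)}


def mean_percentage(key: str) -> float:
    """Unweighted mean of the two domains' ratios, as a percentage with one decimal."""
    ratios = [num / den for num, den in (car[key], camera[key])]
    return round(100 * sum(ratios) / len(ratios), 1)


for key in ("applicable", "matched", "necessary"):
    print(key, mean_percentage(key))
```

Under this assumption the script reproduces 61.9, 95.7, and 91.4.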
Contributions
Enables directed hidden Web crawling
– Accurate field matching
– Efficient form filling and submission
– Post processing for precise results
Ontology based
– Extensible to multiple domains
– Resilient to page changes