Extracting Structured Data from Web Pages
By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003
Instructor: Prof. Taflan Gündem
Presentation Outline
• Motivation
• Example Pages
• Model & Problem Formulation
• Approach in Detail
– General
– Underlying Terminology
– Modules and their operations
• Experimental Results
• Conclusion
Motivation
• Extracting structured data from web pages is useful because it enables us to pose complex queries over the data.
• This paper focuses on the problem of automatically extracting structured data from a collection of pages.
• There are many web sites that contain a large collection of “structured” pages.
Example Pages
• In the real world there are many examples of structured web pages.
– Amazon web site, eBay web site, etc.
• Two examples from www.amazon.com
– My System
– An Eternal Golden Braid
Example Pages (My System: 21st Century Edition)
Example Pages (An Eternal Golden Braid)
Underlying Problems
• Complex Schema: The “schema” of the information encoded in the web pages can be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.
• Template vs. Data: Syntactically, nothing distinguishes the text that is part of the template from the text that is part of the data.
How is a page created with a template?
x: data extracted from the database
Basic Type, Tuples and Sets
• Basic Type: Basic unit of text
• Tuple: Ordered list of types, <T1,T2,…,Tn>
• Set: {T1}
< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
Schema and Instance
< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
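The relation between a schema and an instance can be made concrete with a small conformance check. This is only an illustrative sketch: the marker `BASIC`, the function `conforms`, and the encoding of a set type as a one-element Python list are assumptions, and the book schema is inferred from the instance above rather than taken from the slides.

```python
BASIC = "B"  # hypothetical marker for the basic type (a unit of text)

def conforms(value, schema):
    """Return True if `value` is an instance of `schema`.

    Assumed encoding: BASIC is a string, a tuple type <T1,...,Tn> is a
    Python tuple of schemas, and a set type {T} is a one-element list [T].
    """
    if schema == BASIC:
        return isinstance(value, str)
    if isinstance(schema, tuple):
        return (
            isinstance(value, tuple)
            and len(value) == len(schema)
            and all(conforms(v, s) for v, s in zip(value, schema))
        )
    if isinstance(schema, list):
        return isinstance(value, list) and all(conforms(v, schema[0]) for v in value)
    return False

# Schema inferred from the book instance: <B, {<B, B>}, B>
book_schema = (BASIC, [(BASIC, BASIC)], BASIC)
book = (
    "C Programming Language",
    [("Brian", "Kernighan"), ("Dennis", "Ritchie")],
    "$30.00",
)
print(conforms(book, book_schema))
```

A Python list stands in for the set so that nested values do not need to be hashable; element order carries no meaning here.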
Template Definition
• Our own example:
• Schema: S = <B, {B}, B>, where B denotes the basic type
• Template: TS = <A * B {*}E C * D>
• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’
• Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
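The way the template turns a value into a page can be sketched in a few lines. This is a minimal illustration, not EXALG itself: the function `encode` is hypothetical, tokens are joined with single spaces, and D is treated as an empty string.

```python
# Template strings from the example; D is assumed to be empty.
A, B, C, D, E = "Title:", "Presented by:", "Cost:", "", "and"

def encode(value):
    """Encode a value <title, {presenters}, cost> with the template TS."""
    title, presenters, cost = value
    # The set slot {*}E repeats its element with E as the separator.
    presenters_text = f" {E} ".join(presenters)
    parts = [A, title, B, presenters_text, C, cost, D]
    return " ".join(p for p in parts if p)

page = encode(("Extracting Structured Data", ["Arsun", "Özgün"], "1hr"))
print(page)
# Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
```

The extraction problem is the inverse: given only pages such as `page`, recover both the template strings and the encoded values.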
Template Encoding (T1,x1)
General Description of EXALG
Multiple Pages
Set of Reviewers
Correct Solution for those pages
Some Terminology (1)
• The occurrence vector of a token t is defined as the vector <f1,f2,…,fn>, where fi is the number of occurrences of t in the i-th page.
• An equivalence class is a maximal set of tokens having the same occurrence vector.
• A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template token.
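The first two definitions can be sketched directly in code. This is a simplified illustration under stated assumptions: pages are plain strings split on whitespace, and the function names `occurrence_vectors` and `equivalence_classes` are hypothetical.

```python
from collections import Counter, defaultdict

def occurrence_vectors(pages):
    """Map each token to <f1,...,fn>, its count in each page."""
    counts = [Counter(page.split()) for page in pages]
    tokens = set().union(*counts)
    return {t: tuple(c[t] for c in counts) for t in tokens}

def equivalence_classes(vectors):
    """Group tokens that share the same occurrence vector."""
    classes = defaultdict(set)
    for token, vector in vectors.items():
        classes[vector].add(token)
    return dict(classes)

pages = ["Title: A Cost: 1", "Title: B B Cost: 2"]
classes = equivalence_classes(occurrence_vectors(pages))
for vector, tokens in sorted(classes.items()):
    print(vector, sorted(tokens))
```

Here the template tokens `Title:` and `Cost:` share the vector (1, 1), while the data tokens fall into smaller classes with lower support.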
Some Terminology (2)
<1,1,1,1>
<1,2,1,0>
No unique role
Some Terminology (3)
• For real pages, an equivalence class of large size and support is usually valid, where support of a token is defined as the number of pages in which the token occurs.
• Example of an invalid equivalence class:
– {Data, Mining, Jeff, 2, Jane, 6} has occurrence vector <0, 1, 0, 0>
Some Terminology (4)
• The equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence class). LFEQs are rarely formed by “chance”.
• The thresholds for size and support are set by the user (SizeThres, SupThres).
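Selecting LFEQs from a set of equivalence classes can be sketched as a simple filter. The function name `find_lfeqs` and the use of "greater than or equal to" for both thresholds are assumptions; support is computed as the number of pages with a non-zero count in the occurrence vector.

```python
def find_lfeqs(classes, size_thres, sup_thres):
    """Keep equivalence classes that are large and frequent.

    `classes` maps an occurrence vector (tuple of per-page counts)
    to the set of tokens sharing that vector.
    """
    lfeqs = {}
    for vector, tokens in classes.items():
        support = sum(1 for f in vector if f > 0)  # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs

classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Book", "Reviews"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},
}
print(find_lfeqs(classes, size_thres=3, sup_thres=3))
```

The class with vector (0, 1, 0, 0) is large but occurs in only one page, so it is filtered out by the support threshold.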
Some Terminology (5)
• Valid equivalence class properties: Ordering and Nesting
• Back to our own example:
• Template: TS = <A * B {*}E C * D>
• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’
• Ordered: A > B > C > D
• Nesting: B > E > C
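The ordering property can be checked with a short sketch. It is deliberately simplified: it assumes each token of the class occurs exactly once per page, pages are given as token lists, and the function name `same_order` is hypothetical. It does not cover the nesting property.

```python
def same_order(pages, tokens):
    """True if the given tokens appear in the same relative order on every page."""
    orders = [[t for t in page if t in tokens] for page in pages]
    return all(order == orders[0] for order in orders)

pages = [
    ["Title:", "X", "Presented", "by:", "P", "Cost:", "1hr"],
    ["Title:", "Y", "Z", "Presented", "by:", "Q", "and", "R", "Cost:", "2hr"],
]
print(same_order(pages, {"Title:", "by:", "Cost:"}))
```

A class whose tokens change their relative order from page to page would fail this check and could not come from a single tuple in the template.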
Important Observations
• In practice, two page-tokens with different occurrence paths (their paths in the parse tree produced by an HTML parser) have different roles.
• Two page-tokens with the same occurrence path but different neighbours also have different roles.
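The first observation can be illustrated by attaching a parse-tree path to every text token. This sketch uses Python's standard `html.parser` module; the class `PathTokenizer`, the helper `tokens_with_paths`, and the "tag/tag/..." path format are assumptions, and only text words (not the tags themselves) are emitted as tokens.

```python
from html.parser import HTMLParser

class PathTokenizer(HTMLParser):
    """Collect (token, occurrence_path) pairs for the text words of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.tokens = []  # (word, path) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))

def tokens_with_paths(html):
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

html = "<html><body><b>Name</b><p>Name</p></body></html>"
print(tokens_with_paths(html))
```

The two occurrences of `Name` get different paths (`html/body/b` and `html/body/p`), so they would be treated as tokens with different roles.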
Explanation of observations
Modules and their operations
• Input: pages
• Module ECGM (Equivalence Class Generation Module)
– DiffForm: Differentiate Roles Using Format
– FindEq: Find Equivalence Classes
– HandInv: Handle Invalid Equivalence Classes
– DiffEq: Differentiate Roles Using Eq Class
• Analysis Module
– ConstTemp: Construct Template
– ExVal: Extract Value
• Output: Template, Schema, Values
Constructing Template (1)
• The extraction algorithm determines the positions between consecutive tokens of an equivalence class that are non-empty.
• A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty, otherwise.
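The empty versus non-empty distinction can be sketched as follows. This is an illustration under simplifying assumptions: pages are token lists, every token of the equivalence class occurs exactly once per page, the class tokens are already given in page order, and the function name `position_kinds` is hypothetical.

```python
def position_kinds(pages, ordered_tokens):
    """Classify each position between consecutive class tokens.

    A position is "empty" if the two tokens are adjacent on every page,
    and "non-empty" otherwise.
    """
    kinds = []
    for left, right in zip(ordered_tokens, ordered_tokens[1:]):
        contiguous = all(
            page.index(right) == page.index(left) + 1 for page in pages
        )
        kinds.append("empty" if contiguous else "non-empty")
    return kinds

pages = [
    ["Title:", "X", "Cost:", "$", "5"],
    ["Title:", "Y", "Z", "Cost:", "$", "7"],
]
print(position_kinds(pages, ["Title:", "Cost:", "$"]))
# ['non-empty', 'empty']
```

The position between `Title:` and `Cost:` holds data, while `Cost:` and `$` always occur together and so form one template string.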
Constructing Template (2)
• The tokens connected by empty positions belong to the template.
• In the non-empty positions, there are either basic types (strings extracted from the database) or a more complex type.
• This unknown type can be determined by inspecting the input pages.
Constructing Template (3)
Experimental Results (1)
• EXALG is compared mainly with RoadRunner; however, RoadRunner makes simplifying assumptions.
• The first 6 web pages are obtained from the RoadRunner site.
• The last three web pages have a more complex structure.
Experimental Results (2)
Concluding Remarks
• EXALG first discovers the unknown template that generated the pages, then uses the discovered template to extract the data from the input pages.
• Besides achieving very good results, EXALG still extracts some of the data even when some of its assumptions are not met by the input collection.
• No human intervention: the template and the data are obtained automatically.
Future Work
• Automatically locate collections of structured pages.
• Check whether it is feasible to generate a large database from these pages.
Questions & Answers