Extracting Structured Data from Web Pages

By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003

Instructor: Prof. Taflan Gündem

Presentation Outline

• Motivation
• Example Pages
• Model & Problem Formulation
• Approach in Detail
  – General
  – Underlying Terminology
  – Modules and their operations
• Experimental Results
• Conclusion

Motivation

• Extracting structured data from web pages is useful, since it enables us to pose complex queries over the data.

• This paper focuses on the problem of automatically extracting structured data from a collection of pages.

• There are many web sites that contain a large collection of “structured” pages.

Example Pages

• In the real world there are many examples of structured web pages.
  – Amazon web site, eBay web site, etc.

• Two examples from www.amazon.com
  – My System
  – An Eternal Golden Braid

Example Pages (My System: 21st Century Edition)

Example Pages (An Eternal Golden Braid)

Underlying Problems

• Complex Schema: The “schema” of the information encoded in the web pages could be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.

• Template vs. Data: Syntactically, there is nothing that distinguishes the text that is part of the template and the text that is part of the data.

How is a page created with a template?

(Figure: a value x extracted from the database is encoded with the template to produce the page.)

Basic Type, Tuples and Sets

• Basic Type: Basic unit of text

• Tuple: Ordered list of types, <T1,T2,…,Tn>

• Set: {T1}

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >

Schema and Instance

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
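The relation between a schema and its instance can be made concrete with a small check. This is only a sketch under stated assumptions: the marker `BASIC`, the function `conforms`, and the reading of the book schema as <B, {<B, B>}, B> (a title, a set of first-name/last-name pairs, and a price) are illustrative choices, not notation taken from the slides.

```python
BASIC = "B"  # hypothetical marker standing for the basic type


def conforms(value, schema):
    """Return True if a nested Python value is an instance of the schema."""
    if schema == BASIC:
        # Basic type: a plain unit of text.
        return isinstance(value, str)
    if isinstance(schema, tuple):
        # Tuple constructor <T1, ..., Tn>: same length, element-wise match.
        return (isinstance(value, tuple)
                and len(value) == len(schema)
                and all(conforms(v, t) for v, t in zip(value, schema)))
    if isinstance(schema, frozenset):
        # Set constructor {T}: every element must be an instance of T.
        (element_type,) = schema
        return (isinstance(value, (set, frozenset))
                and all(conforms(v, element_type) for v in value))
    return False


# Assumed schema for the book example: <B, {<B, B>}, B>
book_schema = (BASIC, frozenset({(BASIC, BASIC)}), BASIC)

book = ("C Programming Language",
        frozenset({("Brian", "Kernighan"), ("Dennis", "Ritchie")}),
        "$30.00")

print(conforms(book, book_schema))
```

A value whose second component is a plain string instead of a set of name pairs would be rejected by the same check.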

Template Definition

• Own example:

• Schema: S = <B, {B}, B>, where B denotes the basic type

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
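How the instance above arises from the template can be sketched in a few lines. The function name `encode` and the choice to separate tokens with single spaces are assumptions made for illustration; D is treated here as a single space, as the slide suggests, and E is used as the separator between set elements.

```python
# Template strings from the example: A, B, C, D and the set separator E.
A, B, C, D, E = "Title:", "Presented by:", "Cost:", " ", "and"


def encode(title, presenters, cost):
    """Fill the three '*' slots of the template with one value.

    title and cost are basic values; presenters is a list standing in
    for the set {*}, whose elements are joined with the separator E.
    """
    presenters_text = f" {E} ".join(presenters)
    return f"{A} {title} {B} {presenters_text} {C} {cost}{D}"


page = encode("Extracting Structured Data", ["Arsun", "Özgün"], "1hr")
print(page)
```

Extraction is the inverse problem: given only pages like `page`, recover both the template strings and the values that filled the slots.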

Template Encoding (T1,x1)

General Description of EXALG

Multiple Pages

Set of Reviewers

Correct Solution for those pages

Some Terminology (1)

• The occurrence-vector of a token t is defined as the vector <f1,f2,…,fn>, where fi is the number of occurrences of t in the i-th page.

• An equivalence class is a maximal set of tokens having the same occurrence-vector.

• A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
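The first two definitions translate almost directly into code. The sketch below is a simplification: it assumes tokens are whitespace-separated words, whereas real pages would also be tokenized into HTML tags, and the function names are made up for this example.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to its occurrence-vector <f1, ..., fn> over the pages."""
    counts = [Counter(page.split()) for page in pages]
    tokens = set().union(*counts)
    return {t: tuple(c[t] for c in counts) for t in tokens}


def equivalence_classes(pages):
    """Group tokens that share the same occurrence-vector."""
    groups = defaultdict(set)
    for token, vector in occurrence_vectors(pages).items():
        groups[vector].add(token)
    return dict(groups)


pages = ["Title: A Cost: 1",
         "Title: B and C Cost: 2"]
classes = equivalence_classes(pages)
print(classes[(1, 1)])
```

In this toy input, the tokens that occur exactly once in every page form one class, which is the kind of class expected to come from the template.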

Some Terminology (2)

<1,1,1,1>

<1,2,1,0>

No unique role


Some Terminology (3)

• For real pages, an equivalence class of large size and support is usually valid, where the support of a token is defined as the number of pages in which the token occurs.

• Example of an invalid equivalence class:
  – {Data, Mining, Jeff, 2, Jane, 6} has occurrence-vector <0, 1, 0, 0>


Some Terminology (4)

• The equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence classes). LFEQs are rarely formed by “chance”.

• Thresholds for size and support are set by the user (SizeThres, SupThres).
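Selecting LFEQs amounts to a filter over the equivalence classes. The sketch below assumes that classes are given as a mapping from occurrence-vector to token set, that the support of a class is the number of non-zero entries in its vector, and that the thresholds are inclusive; the function name `find_lfeqs` and the sample tokens in the first class are hypothetical.

```python
def find_lfeqs(classes, size_thres, sup_thres):
    """Keep only equivalence classes that are large and frequent."""
    lfeqs = {}
    for vector, tokens in classes.items():
        size = len(tokens)
        # Support: number of pages in which the tokens of the class occur.
        support = sum(1 for f in vector if f > 0)
        if size >= size_thres and support >= sup_thres:
            lfeqs[vector] = tokens
    return lfeqs


classes = {
    (1, 1, 1, 1): {"<html>", "<body>", "Reviewer", "Rating", "</body>", "</html>"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},
}
lfeqs = find_lfeqs(classes, size_thres=3, sup_thres=3)
print(sorted(lfeqs))
```

The second class is large but occurs in only one page, so it is discarded, which matches the invalid example given earlier.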

Some Terminology (5)

• Valid equivalence class properties: Ordering and Nesting

• Back to own example:

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Ordered: A > B > C > D

• Nesting: B > E > C

Important Observations

• In practice, two page-tokens with different occurrence-paths (their paths in the tree produced by an HTML parser) have different roles

• Two page-tokens having same occurrence paths, but with different neighbours also have different roles
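The first observation can be illustrated with Python's standard `html.parser` module. This is a rough sketch, not the parser used by EXALG: the class `PathCollector`, the "/"-joined path format, and the handling of unclosed tags are assumptions made for the example.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record each text token together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.tokens = []   # (word, occurrence_path) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, skipping tags left unclosed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))


def occurrence_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


doc = "<html><body><b>Name</b><i>Name</i></body></html>"
print(occurrence_paths(doc))
```

The two occurrences of the same word end up with different paths, so they would be treated as two tokens with different roles.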

Explanation of observations

Modules and their operations

(Figure: EXALG module diagram)

• Input: input pages

• Module ECGM (Equivalence Class Generation Module)
  – DiffForm: Differentiate Roles Using Format
  – FindEq: Find Equivalence Classes
  – HandInv: Handle Invalid Equivalence Classes
  – DiffEq: Differentiate Roles Using Eq Class

• Analysis Module
  – ConstTemp: Construct Template
  – ExVal: Extract Value

• Output: Template, Schema, Values

Constructing Template (1)

• The extraction algorithm determines the positions between consecutive tokens of an equivalence class that are non-empty.

• A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty, otherwise.
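The empty/non-empty test can be sketched as follows. The sketch assumes whitespace-separated tokens and an equivalence class whose tokens each occur exactly once per page (occurrence-vector <1,…,1>), already listed in page order; the function names are hypothetical.

```python
def position_is_empty(pages, first, second):
    """True if `second` immediately follows `first` in every page."""
    for page in pages:
        tokens = page.split()
        i = tokens.index(first)
        if i + 1 >= len(tokens) or tokens[i + 1] != second:
            return False
    return True


def classify_positions(pages, ordered_class):
    """Label the position between each pair of consecutive class tokens."""
    return {
        (a, b): "empty" if position_is_empty(pages, a, b) else "non-empty"
        for a, b in zip(ordered_class, ordered_class[1:])
    }


pages = ["<b> Title: </b> Foo <i> Cost: </i> 10",
         "<b> Title: </b> Bar Baz <i> Cost: </i> 20"]
ordered_class = ["<b>", "Title:", "</b>", "<i>", "Cost:", "</i>"]
positions = classify_positions(pages, ordered_class)
print(positions)
```

Only the position between `</b>` and `<i>` is non-empty here, because the text found there differs in content and length from page to page.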

Constructing Template (2)

• The tokens connected by empty positions belong to the template.

• In the non-empty positions there is either a basic type (a string extracted from the database) or a more complex type

• This unknown type can be determined by inspecting input pages
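For the simplest case, where every non-empty position holds a basic type, the values can be read off directly. This sketch assumes whitespace-separated tokens and a class whose tokens occur once per page in the listed order; it does not handle nested or set-valued positions, and `extract_values` is a made-up name.

```python
def extract_values(page, ordered_class):
    """Return the strings found in the non-empty positions of one page."""
    tokens = page.split()
    positions = [tokens.index(t) for t in ordered_class]
    values = []
    # Pair each template token with the next one (or with the end of the page).
    for start, end in zip(positions, positions[1:] + [len(tokens)]):
        gap = tokens[start + 1:end]
        if gap:
            values.append(" ".join(gap))
    return values


ordered_class = ["<b>", "Title:", "</b>", "<i>", "Cost:", "</i>"]
page = "<b> Title: </b> Bar Baz <i> Cost: </i> 20"
print(extract_values(page, ordered_class))
```

When a position turns out to hold a more complex type, the same inspection would have to be repeated inside that position rather than treating its content as one string.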

Constructing Template (3)

Experimental Results (1)

• EXALG is compared mainly with RoadRunner; however, RoadRunner makes simplifying assumptions.

• The first 6 web pages are obtained from the RoadRunner site.

• The last three web pages have more complex structure.

Experimental Results (2)

Concluding Remarks

• EXALG first discovers the unknown template that generated the pages and uses the discovered template to extract the data from the input pages.

• Besides achieving very good results, EXALG does not fail completely: it still extracts some data even when some of its assumptions are not met by the input collection.

• No human intervention – automatically getting template and data

Future Work

• Automatically locate collections of pages that are structured

• Check whether it is feasible to generate a large database from these pages

Questions & Answers
