Page 1

Functional Semantic Analysis of Web Pages on the Visual Layer

Presentation of the Master's Thesis
by Bernhard Pollak - 9326613

Date: 2008-01-22

Institute of Information Systems
Database and Artificial Intelligence Group

Supervision: Prof. Georg Gottlob, Dr. Wolfgang Gatterbauer

Page 2

Web Information Extraction

[Figure: Web, Semistructured Data, Structured Data]

Semistructured Data:
<html><body><h1>DBAI</h1></body></html>

Structured Data:
Institute: DBAI
Members: Prof. Gottlob, Dr. Gatterbauer, Dr. Musliu ...

Introduction Motivation Solution Results Outlook

Page 3

Wrapper: Lixto Concept [1]

[Figure: Lixto pipeline with the components Example Page(s), Visual Wrapper Generator, Extraction Program, Similar Structured Pages, Extraction Module and XML Result; steps are marked Manual or Auto.]

Page 5

The Problem

What about (visually) similar pages?

Similar Structured Pages

"Similar structured" means similarly structured with regard to HTML.

Page 6

Multiple Wrappers?

Needs multiple manual wrapper definitions and higher maintenance effort

Page 7

Visual Approach?

Try to use general visual rules to reduce the dependence on special-purpose wrappers

Page 8

What could be deduced?

Newspaper [2]

1. Header
1.1 Subtext
1.2 Normal Text

Semantics are present even without knowing the content

Page 9

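Functional Semantics [3]

[Figure: perception layers VISUAL, SEMANTIC and FUNCTIONAL, with the aspects LAYOUT, TYPOGRAPHY, LOGICAL, LINGUISTIC, RECORDS and HIERARCHY; the example text "WWW08 This is a Text" is annotated with Bold, Italic and the numbering 1.1.]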

Page 10

The Solution

VISUAL, SEMANTIC, FUNCTIONAL

The REDEVILA approach: REcord DEtection on the VIsual LAyer

1. Box Identification
2. Segmentation
3. Classification
4. Ordering
5. Hierarchy

Page 11

X-Tagging

Without X-Tagging (wrapping errors):
<html><body><b>John</b> is running</body></html>

With X-Tagging:
<html><body><b><x>John</x></b><x>is</x><x>running</x></body></html>

[Figure: the rendered text "John is running" with its detected boxes, without and with X-Tagging]

Step 1: Box Identification
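A minimal sketch of the X-Tagging idea, assuming it means wrapping every whitespace-separated word of each text node in its own <x> element so that the rendering engine can report one box per word instead of one box per (possibly line-wrapped) text run. The names XTagger and x_tag are hypothetical; this is not the REDEVILA implementation, and it ignores special cases such as script or style content and void elements.

```python
import re
from html import escape
from html.parser import HTMLParser


class XTagger(HTMLParser):
    """Re-emits HTML, wrapping each word of every text node in <x>...</x>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the start tag with its attributes (values re-escaped).
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "")) for name, value in attrs
        )
        self.out.append("<{}{}>".format(tag, attr_text))

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Split on whitespace but keep the whitespace pieces, so spacing
        # between words is preserved in the output.
        for piece in re.split(r"(\s+)", data):
            if piece and not piece.isspace():
                self.out.append("<x>{}</x>".format(escape(piece)))
            else:
                self.out.append(piece)


def x_tag(html_text):
    """Return html_text with every word wrapped in an <x> element."""
    tagger = XTagger()
    tagger.feed(html_text)
    tagger.close()
    return "".join(tagger.out)


tagged = x_tag("<html><body><b>John</b> is running</body></html>")
print(tagged)
```

With one element per word, a layout engine can be queried for the bounding box of each <x> element, which avoids a single text node spanning several lines.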

Page 12

VIPS Algorithm [4]

Basic Operations: Containing, Crossing, Covering

Inversion

Step 2: Segmentation
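The three basic operations can be illustrated with the separator-detection step of VIPS as described by Cai et al. [4]: a separator starts as the whole page, and each block either splits it (block contained in the separator), shrinks it (block crosses the separator) or removes it (block covers the separator). The sketch below is a one-dimensional simplification for horizontal separators only; the function name and the interval representation are assumptions, and the inversion mentioned on the slide is not reproduced here.

```python
def detect_separators(page_top, page_bottom, blocks):
    """Return the horizontal gaps (start, end) left between blocks.

    blocks is a list of (top, bottom) intervals in screen coordinates.
    """
    separators = [(page_top, page_bottom)]
    for top, bottom in blocks:
        updated = []
        for start, end in separators:
            if bottom <= start or top >= end:
                # Block and separator do not overlap: keep the separator.
                updated.append((start, end))
            elif start < top and bottom < end:
                # Containing: the block lies inside the separator, split it.
                updated.append((start, top))
                updated.append((bottom, end))
            elif top <= start and end <= bottom:
                # Covering: the block covers the separator, remove it.
                continue
            elif top <= start:
                # Crossing at the upper edge: shrink the separator.
                updated.append((bottom, end))
            else:
                # Crossing at the lower edge: shrink the separator.
                updated.append((start, top))
        separators = updated
    return separators


gaps = detect_separators(0, 100, [(10, 30), (40, 90)])
print(gaps)
```

The remaining intervals are candidate separators between visual segments; vertical separators would be found the same way on the x-axis.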

Page 13

Segmentation Example [5]

Step 2: Segmentation

Page 14

WEKA Toolkit [6]

Important vs. Noisy segments

• 370 segments from web pages
• WEKA machine learning toolkit
• Feature reduction
• PART algorithm
  – C4.5 decision tree algorithm

Step 3: Classification

Page 15

Final Feature Set

fontHeight, leftPos, topPos, widthRatio, charRatio, importance

Step 3: Classification

Page 16

X-Ordering, Y-Ordering, Diagonal Ordering

[Figure: example layouts of boxes A to F, each annotated with a visiting order 1, 2, 3 for the respective ordering]

Step 4: Ordering

Page 17

Diagonal Ordering Limit

Step 4: Ordering

Limit for the arctan between the two box corners:

bmax = maximum width of the two boxes
wmax = maximum width of parent structure
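The ordering step can be sketched as follows, assuming X-ordering means reading boxes column by column (left edge first) and Y-ordering means reading them row by row (top edge first). The Box type and all function names are hypothetical. The limit formula in terms of bmax and wmax is not given in this transcript, so the sketch only computes the angle between two top-left corners and leaves the threshold to the caller.

```python
import math
from typing import NamedTuple


class Box(NamedTuple):
    name: str
    left: float  # x of the top-left corner, in pixels
    top: float   # y of the top-left corner, in pixels


def x_order(boxes):
    """Order boxes by left edge, breaking ties by top edge."""
    return sorted(boxes, key=lambda b: (b.left, b.top))


def y_order(boxes):
    """Order boxes by top edge, breaking ties by left edge."""
    return sorted(boxes, key=lambda b: (b.top, b.left))


def corner_angle_deg(a, b):
    """Angle of the line from a's top-left corner to b's, in degrees.

    0 means b is exactly to the right of a, 90 means exactly below.
    """
    return math.degrees(math.atan2(b.top - a.top, b.left - a.left))


boxes = [Box("A", 0, 0), Box("B", 100, 0), Box("C", 0, 50)]
print([b.name for b in y_order(boxes)])
print([b.name for b in x_order(boxes)])
print(corner_angle_deg(Box("A", 0, 0), Box("D", 100, 100)))
```

A diagonal-ordering rule would then compare corner_angle_deg against a limit derived from bmax and wmax.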

Page 18

Hierarchy Detection

• Monohierarchical structures
• Multitopological Grid
• Hierarchy model: b.x.x
  b = record start flag {true, false}, x = hierarchy depth
• Record start

Step 5: Hierarchy

Page 19

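Multitopological Grid Concept

Point types: cornerpoint, borderpoint, outerpoint, innerpoint, multipoint

Minimal Grid: Screen Coordinates, Logical Coordinates

[Figure: boxes A and B on a grid with numbered points, shown in screen coordinates and in logical coordinates]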
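One plausible reading of the step from screen coordinates to logical coordinates on a minimal grid is rank compression: every distinct box edge becomes one grid line, and pixel positions are replaced by the index of their grid line. The sketch below illustrates that reading only; the function name to_logical and the (left, top, right, bottom) tuple layout are assumptions, and the construction used in the thesis may differ.

```python
def to_logical(boxes):
    """Map boxes from pixel coordinates to grid-line indices.

    boxes is a list of (left, top, right, bottom) tuples in pixels.
    """
    # Every distinct edge position becomes one line of the minimal grid.
    xs = sorted({x for left, _, right, _ in boxes for x in (left, right)})
    ys = sorted({y for _, top, _, bottom in boxes for y in (top, bottom)})
    x_rank = {x: i for i, x in enumerate(xs)}
    y_rank = {y: i for i, y in enumerate(ys)}
    return [
        (x_rank[left], y_rank[top], x_rank[right], y_rank[bottom])
        for left, top, right, bottom in boxes
    ]


screen = [(10, 20, 110, 60), (10, 80, 60, 120), (60, 80, 110, 120)]
logical = to_logical(screen)
print(logical)
```

In logical coordinates, neighbourhood relations such as "directly below" or "directly to the right" no longer depend on pixel distances.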

Page 20

Multitopological Grid Example [7]

[Figure: boxes A, B and C, shown with the Bottom Beam and the Right Beam]

Page 21

Experimental Results

Web Pages: 85
Record Count: 1086

Correct: 836
False Positives: 351
False Negatives: 241

Recall: 77%
Precision: 70%
F-Measure: 73%

Four different domains
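The reported percentages can be checked from the counts. The slide does not state the definitions, so the sketch assumes recall = Correct / Record Count and precision = Correct / (Correct + False Positives), which reproduce 77% and 70%. Under these assumptions the F-measure of 73% matches the harmonic mean of the rounded percentages; the unrounded values give about 73.6%.

```python
correct = 836
false_positives = 351
record_count = 1086

# Assumed definitions (not stated on the slide).
recall = correct / record_count
precision = correct / (correct + false_positives)

# Harmonic mean of precision and recall.
f_unrounded = 2 * precision * recall / (precision + recall)

# Same formula applied to the rounded percentages from the slide.
f_from_rounded = 2 * 0.70 * 0.77 / (0.70 + 0.77)

print(f"recall    = {recall:.4f}")
print(f"precision = {precision:.4f}")
print(f"F (unrounded inputs) = {f_unrounded:.4f}")
print(f"F (rounded inputs)   = {f_from_rounded:.4f}")
```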

Page 22

Semantic (Domain) Dependence [8]

[Figure: Webpage and REDEVILA Result]

Page 23

REDEVILA Example [9]

[Figure: Webpage and REDEVILA Result]

Page 24

Conclusion

• Domain independence not satisfactory
• Definition of distance difficult
• Would make current wrapper approaches more robust
• Potential for single record detection
• Clear separation between "tag" and "visual" approaches

Page 25

Future Work

• Introducing domain dependence
• Automatic rule generation for the MT Grid
• Considering colored headers
• Considering the layout (column) structure
• Integration with tag information
• Integration of table models with substructured lists

Page 26

References

1. R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB ’01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 119–128, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
2. http://www.bosai.go.jp/e/international
3. D. S. Doermann, A. Rosenfeld, and E. Rivlin. The function of documents. In ICDAR ’97: Proceedings of the 4th International Conference on Document Analysis and Recognition, pages 1077–1081, Washington, DC, USA, 1997. IEEE Computer Society.
4. D. Cai, S. Yu, J. Wen, and W. Ma. Extracting content structure for web pages based on visual representation. In Proc. 5th Asian-Pacific Web Conference (Web Technologies and Applications), pages 406–417. Springer, April 2003.
5. http://bluerobot.com/web/layouts/layout3.html
6. Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
7. http://www.google.at
8. http://the1review.com
9. http://www.google.com

Thank you for your attention

Page 27

BACKUP

Page 28

Domain Dependent Functional Semantics

NEWSPAPER: 1. Header, 1.1 Summary, 1.2 Newstext

LETTER: Address 1, Address 2, Text, Salutation, Signature

Page 29

Segmentation Example II

Page 30

Problem: "Small Line Above" Rule

[Figure: Webpage and REDEVILA Result]

Page 31

REDEVILA Example I

[Figure: Webpage and REDEVILA Result]