
Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

Andrew Hogue
Google, MIT CSAIL

May 11, 2005, WWW 2005 -- Chiba, Japan


Acknowledgments

• David Karger

([email protected])

• Haystack Group

(http://haystack.csail.mit.edu)


Agenda

• Overview

• Demo

• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics


Unwrapping the Web

• Majority of semantic content in “deep web”

• Transformed into human-readable HTML by scripts

• HTML is difficult for automated agents to understand

• Little incentive for content providers to provide RDF markup

• How to “unwrap” this content?


Thresher

• Simple UI for wrapper induction on structured web content

• “Demonstrate” examples of objects

• Induce wrapper, or pattern, based on DOM

• User may also label properties with RDF


Thresher

• Built on Haystack Semantic Web client

• Everything is RDF

• Everything has context menus

• Thresher brings RDF into the web browser

• Wrappers reify web objects for full interaction


Thresher

• Underlying wrapper algorithm based on tree edit distance

• Align user’s examples

• Keep aligned nodes (layout elements)

• Wildcard non-aligned nodes (content)

• Pattern matching is also alignment


Wrapper Induction

• Wrapper: pattern created from examples

• User provides positive examples

• Generalize examples into reusable pattern

• Existing techniques:
  – Head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support vector machines
  – Other machine learning techniques


Wrapper Induction

• Our approach: take advantage of hierarchical structure of HTML

• Each example picks out a subtree of DOM

• Calculate tree edit distance between examples

• Least-cost edit distance gives best mapping

• Remove unmapped nodes to make pattern
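
The generalization step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not Thresher's actual code: the `Node` class, the `align`, `generalize`, and `show` helpers, and the convention of storing text as a leaf node's tag are all hypothetical. Two example subtrees are aligned with a simple top-down edit distance, mapped nodes with equal tags are kept, and everything else becomes a wildcard.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                      # element name, or the text of a text leaf (assumption)
    children: list = field(default_factory=list)

WILDCARD = "*"

def size(t):
    return 1 + sum(size(c) for c in t.children)

def distance(a, b):
    # Relabelling a node costs 1 here (assumption); children are aligned below.
    return int(a.tag != b.tag) + align(a.children, b.children)[0]

def align(xs, ys):
    """Cheapest ordered alignment of two child lists: returns (cost, pairs)."""
    n, m = len(xs), len(ys)
    sub = [[distance(x, y) for y in ys] for x in xs]   # cost of mapping x onto y
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])        # delete a whole subtree
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])        # insert a whole subtree
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + size(xs[i - 1]),
                          d[i][j - 1] + size(ys[j - 1]),
                          d[i - 1][j - 1] + sub[i - 1][j - 1])
    # Walk back through the table to recover which children were mapped.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub[i - 1][j - 1]:
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + size(xs[i - 1]):
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    return d[n][m], pairs[::-1]

def generalize(a, b):
    """Keep mapped nodes with equal tags; turn everything else into a wildcard."""
    if a.tag != b.tag:
        return Node(WILDCARD)
    out = []
    for x, y in align(a.children, b.children)[1]:
        mapped = x is not None and y is not None
        child = generalize(x, y) if mapped else Node(WILDCARD)
        if child.tag == WILDCARD and out and out[-1].tag == WILDCARD:
            continue                                   # merge adjacent wildcards
        out.append(child)
    return Node(a.tag, out)

def show(t):
    if not t.children:
        return t.tag
    return f"{t.tag}({' '.join(show(c) for c in t.children)})"

ex1 = Node("tr", [Node("td", [Node("Alice")]), Node("td", [Node("3:30 PM")])])
ex2 = Node("tr", [Node("td", [Node("Bob")]), Node("td", [Node("4:00 PM")])])
print(show(generalize(ex1, ex2)))   # tr(td(*) td(*))
```

The layout elements (`tr`, `td`) survive because both examples share them, while the differing text becomes wildcards.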


Tree Edit Distance

• Calculate the cost of a sequence of operations that transforms one tree into the other

• Operations: insert, delete, change a node

• Cost of an operation = size of subtree it affects

• Least-cost set of operations gives best mapping between elements
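
As a rough illustration of the cost model on this slide, here is a small top-down edit distance over ordered trees. It is a simplified sketch, not necessarily the algorithm Thresher uses. Trees are nested tuples built by a hypothetical helper `t`, inserting or deleting a subtree costs its size, and changing a node's label is assumed to cost 1 because it affects a single node.

```python
from functools import lru_cache

def t(tag, *children):
    """Build a tree as (tag, children); hypothetical helper for this sketch."""
    return (tag, tuple(children))

def size(tree):
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def forest(xs, ys):
    """Least cost of turning the child sequence xs into ys."""
    if not xs:
        return sum(size(y) for y in ys)              # insert what remains
    if not ys:
        return sum(size(x) for x in xs)              # delete what remains
    return min(
        size(xs[0]) + forest(xs[1:], ys),            # delete first subtree of xs
        size(ys[0]) + forest(xs, ys[1:]),            # insert first subtree of ys
        distance(xs[0], ys[0]) + forest(xs[1:], ys[1:]),  # map one onto the other
    )

def distance(a, b):
    """Cost of a label change (0 or 1) plus the cost of aligning the children."""
    return int(a[0] != b[0]) + forest(a[1], b[1])

row = t("tr", t("td", t("b")), t("td"))
longer = t("tr", t("td", t("b")), t("td"), t("td"))
print(distance(row, row))      # 0
print(distance(row, longer))   # 1: one extra empty td is inserted
```

The choice that wins the `min` at each step also tells us which nodes are mapped to each other, which is the "best mapping" the slide refers to.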


Mapping Examples


Pattern Matching

• Look for document subtrees with similar structure

• Find alignments of wrapper in tree

• Require every node in wrapper be mapped to some node in document subtree

• Wildcards match zero or more times

• Each valid alignment is a match
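
A minimal sketch of this matching step, under simplifying assumptions: every non-wildcard wrapper node must map to a document node with the same tag, in order, and a wildcard may absorb zero or more consecutive sibling nodes. The `Node` class and function names are hypothetical, and the real system matches by alignment rather than by this strict recursive check.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)

WILDCARD = "*"

def matches(pattern, node):
    """True if the pattern tree can be laid over the document node."""
    if pattern.tag == WILDCARD:
        return True
    return pattern.tag == node.tag and seq_matches(pattern.children, node.children)

def seq_matches(ps, ns):
    """Match a list of pattern children against a list of document children."""
    if not ps:
        return not ns
    head, rest = ps[0], ps[1:]
    if head.tag == WILDCARD:
        # The wildcard absorbs the first k document nodes, for any k >= 0.
        return any(seq_matches(rest, ns[k:]) for k in range(len(ns) + 1))
    return bool(ns) and matches(head, ns[0]) and seq_matches(rest, ns[1:])

def find_matches(pattern, doc):
    """Collect every subtree of doc that the pattern matches."""
    found = [doc] if matches(pattern, doc) else []
    for child in doc.children:
        found.extend(find_matches(pattern, child))
    return found

wrapper = Node("tr", [Node("td", [Node(WILDCARD)]), Node("td", [Node(WILDCARD)])])
doc = Node("table", [
    Node("tr", [Node("td", [Node("A")]), Node("td", [Node("B")])]),
    Node("tr", [Node("td", [Node("C")])]),                      # only one cell
    Node("tr", [Node("td"), Node("td", [Node("D"), Node("E")])]),
])
print(len(find_matches(wrapper, doc)))   # 2
```

The second row is rejected because the wrapper's second `td` has nothing to map to; the third row matches because the wildcards absorb zero and two nodes respectively.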


Matching Example


Adding Semantics

• How to tie wrappers to semantic content?

• Assert RDF statements about unwrapped objects

• Tied to wrapper structure

• Classes bound to wrappers

• Properties bound to wildcards


Semantic Labels


Semantic Matching

[
  <rdf:type> <TalkAnnouncement> ;
  <series> "Dertouzos Lect…" ;
  <dc:title> "Distributed Hash…" ;
  <time> "3:30 PM"
]
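
One way to picture how a match turns into statements like these: the wrapper carries a class, each labelled wildcard carries a property, and a match binds each property to the text found at that position. The sketch below is illustrative only. The `Node` fields, the `bind` and `to_triples` helpers, the table-row layout, and the blank-node name `_:talk` are assumptions, and each wildcard is assumed to match exactly one text node. The literal values are the truncated strings shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""                # text content of a leaf node
    prop: Optional[str] = None    # RDF property bound to a wildcard

WILDCARD = "*"

def bind(pattern, node, out):
    """Record property -> text for each labelled wildcard.

    Returns False when the node does not have the pattern's structure.
    """
    if pattern.tag == WILDCARD:
        if pattern.prop is not None:
            out[pattern.prop] = node.text
        return True
    if pattern.tag != node.tag or len(pattern.children) != len(node.children):
        return False
    return all(bind(p, n, out) for p, n in zip(pattern.children, node.children))

def to_triples(subject, rdf_class, pattern, node):
    """The class comes from the wrapper, the properties from its wildcards."""
    bindings = {}
    if not bind(pattern, node, bindings):
        return []
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, prop, value) for prop, value in bindings.items()]
    return triples

def cell(**kw):
    return Node("td", [Node(**kw)])

wrapper = Node("tr", [cell(tag=WILDCARD, prop="series"),
                      cell(tag=WILDCARD, prop="dc:title"),
                      cell(tag=WILDCARD, prop="time")])
row = Node("tr", [cell(tag="#text", text="Dertouzos Lect…"),
                  cell(tag="#text", text="Distributed Hash…"),
                  cell(tag="#text", text="3:30 PM")])

for triple in to_triples("_:talk", "TalkAnnouncement", wrapper, row):
    print(triple)
```

Plain tuples stand in for RDF statements here so the sketch has no dependencies; a real client would hand them to its RDF store.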


Automatically Adding Examples

• Find additional examples automatically

• Consider nodes neighboring the example

• Require low normalized cost:

• Often allows us to create wrappers with a single example
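
The formula for the normalized cost did not survive in this transcript, so the sketch below has to assume one: edit distance divided by the combined size of the two subtrees. The threshold of 0.3, the helper names, and the restriction to sibling nodes are likewise assumptions made only for illustration.

```python
from functools import lru_cache

def t(tag, *children):
    """Build a tree as (tag, children); hypothetical helper for this sketch."""
    return (tag, tuple(children))

def size(tree):
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def forest(xs, ys):
    """Least cost of turning child sequence xs into ys (subtree-size costs)."""
    if not xs:
        return sum(size(y) for y in ys)
    if not ys:
        return sum(size(x) for x in xs)
    return min(size(xs[0]) + forest(xs[1:], ys),
               size(ys[0]) + forest(xs, ys[1:]),
               distance(xs[0], ys[0]) + forest(xs[1:], ys[1:]))

def distance(a, b):
    return int(a[0] != b[0]) + forest(a[1], b[1])

def normalized_cost(a, b):
    # Assumed normalization: distance relative to the combined tree size.
    return distance(a, b) / (size(a) + size(b))

def similar_siblings(siblings, i, threshold=0.3):
    """Indices of siblings that look enough like siblings[i] to be examples too."""
    example = siblings[i]
    return [j for j, s in enumerate(siblings)
            if j != i and normalized_cost(example, s) <= threshold]

rows = (
    t("tr", t("td", t("A")), t("td", t("B"))),   # the user's single example
    t("tr", t("td", t("C")), t("td", t("D"))),   # same layout, different text
    t("tr", t("th", t("Header"))),               # different layout
)
print(similar_siblings(rows, 0))   # [1]
```

With these numbers the second row has normalized cost 2 / 10 = 0.2 and is accepted, while the header row has 4 / 8 = 0.5 and is rejected.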


List Collapse

• Current wrappers generalize well for single elements

• Will not recognize variable length lists

• Collapse neighboring nodes with low normalized cost

• For matching, allow nodes to match more than once
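
A rough sketch of both halves of this idea, with hypothetical names throughout. `collapse` merges runs of similar neighboring pattern nodes into one node flagged as repeatable, and the matcher lets a repeatable node consume one or more consecutive document nodes. The `similar` predicate is supplied by the caller and stands in for the "low normalized cost" test; the example uses a plain tag comparison only to stay short.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    repeat: bool = False   # True: may match one or more consecutive nodes

def collapse(children, similar):
    """Merge runs of similar neighboring pattern nodes into one repeatable node."""
    out = []
    for child in children:
        if out and similar(out[-1], child):
            out[-1].repeat = True
        else:
            out.append(Node(child.tag, child.children))
    return out

def matches(pattern, node):
    return pattern.tag == node.tag and seq_matches(pattern.children, node.children)

def seq_matches(ps, ns):
    if not ps:
        return not ns
    if not ns or not matches(ps[0], ns[0]):
        return False
    if seq_matches(ps[1:], ns[1:]):
        return True
    # A repeatable node stays in play and may consume the next node as well.
    return ps[0].repeat and seq_matches(ps, ns[1:])

def same_tag(a, b):
    # Stand-in for "normalized cost below a threshold" (assumption).
    return a.tag == b.tag

items = [Node("li"), Node("li"), Node("li")]
pattern = Node("ul", collapse(items, same_tag))

print(len(pattern.children))                                  # 1
print(matches(pattern, Node("ul", [Node("li")] * 5)))          # True
print(matches(pattern, Node("ul", [Node("li"), Node("p")])))   # False
```

A wrapper built from a three-item list now also matches lists of one or five items, which the uncollapsed pattern would not.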


Wrapper Wrap-up

• Gather user example(s)

• Automatically find additional examples

• Generalize examples using best mapping

• Add semantic labels

• Match by finding alignments

• Overlay objects on the page for interaction


Additional Tools

• Wrapper Sharing

• RSS

• Web Operations


Our Contributions

• End-user wrapper induction

• Few examples required

• Bring object interaction into the browser

• Wrappers bridge syntactic-semantic gap


Future Work and Applications

• Document-level classes

• Page reformatting

• Autonomous agent interaction

• Negative examples

• Automatic wrapper induction


[email protected]

http://haystack.csail.mit.edu