
May 11, 2005 -- WWW 2005 -- Chiba, Japan

Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

Andrew Hogue (Google, MIT CSAIL)


Acknowledgments

• David Karger

(karger@csail.mit.edu)

• Haystack Group

(http://haystack.csail.mit.edu)


Agenda

• Overview

• Demo

• Details
  – Induction
  – Matching
  – Semantics
  – Heuristics


Unwrapping the Web

• Majority of semantic content in “deep web”

• Transformed into human-readable HTML by scripts

• HTML is difficult for automated agents to understand

• Little incentive for content providers to provide RDF markup

• How to “unwrap” this content?


Thresher

• Simple UI for wrapper induction on structured web content

• “Demonstrate” examples of objects

• Induce wrapper, or pattern, based on DOM

• User may also label properties with RDF


Thresher

• Built on Haystack Semantic Web client

• Everything is RDF

• Everything has context menus

• Thresher brings RDF into the web browser

• Wrappers reify web objects for full interaction


Thresher

• Underlying wrapper algorithm based on tree edit distance

• Align user’s examples

• Keep aligned nodes (layout elements)

• Wildcard non-aligned nodes (content)

• Pattern matching is also alignment


Wrapper Induction

• Wrapper: pattern created from examples

• User provides positive examples

• Generalize examples into reusable pattern

• Existing techniques:
  – head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support Vector Machines
  – Other Machine Learning


Wrapper Induction

• Our approach: take advantage of hierarchical structure of HTML

• Each example picks out a subtree of DOM

• Calculate tree edit distance between examples

• Least-cost edit distance gives best mapping

• Remove unmapped nodes to make pattern
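The generalization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Thresher's implementation: trees are hypothetical `(tag, [children])` tuples with plain strings for text, and the standard library's `difflib.SequenceMatcher` stands in for the least-cost edit-distance mapping. Children that align are kept, and anything left unaligned becomes a wildcard. The example rows and their values are made up.

```python
from difflib import SequenceMatcher

WILDCARD = "*"  # placeholder for content that differs between examples


def key(node):
    """Alignment key: the tag for elements, '#text' for text strings."""
    return "#text" if isinstance(node, str) else node[0]


def generalize(a, b):
    """Merge two example subtrees into one pattern."""
    if key(a) != key(b):
        return WILDCARD
    if isinstance(a, str):
        return a if a == b else WILDCARD  # differing text is content
    kids_a, kids_b = a[1], b[1]
    # Stand-in for the edit-distance mapping: align children by tag.
    sm = SequenceMatcher(None, [key(c) for c in kids_a],
                         [key(c) for c in kids_b], autojunk=False)
    out, i, j = [], 0, 0
    for block in sm.get_matching_blocks():
        if block.a > i or block.b > j:
            out.append(WILDCARD)  # unmapped children on either side
        for k in range(block.size):
            out.append(generalize(kids_a[block.a + k], kids_b[block.b + k]))
        i, j = block.a + block.size, block.b + block.size
    return (a[0], out)


# Two hypothetical table rows picked out by the user.
ex1 = ("tr", [("td", ["Talk A"]), ("td", ["1:00 PM"])])
ex2 = ("tr", [("td", ["Talk B"]), ("td", ["2:00 PM"]), ("td", [("img", [])])])

pattern = generalize(ex1, ex2)
print(pattern)
```

The layout elements (`tr`, `td`) survive in the pattern, while the differing text and the extra trailing cell are replaced by wildcards.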


Tree Edit Distance

• Calculate the cost of the sequence of operations that transforms one tree into the other

• Operations: insert, delete, change a node

• Cost of an operation = size of subtree it affects

• Least-cost set of operations gives best mapping between elements
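As a rough illustration of the cost model above, the following Python sketch computes a simplified, top-down tree edit distance. It is not necessarily the algorithm Thresher uses. Inserting or deleting a subtree costs its size, and, as an assumption made here to model "change", two nodes with different tags are treated as a delete plus an insert. The `Node` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def size(n):
    """Number of nodes in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def distance(a, b):
    """Least cost of turning subtree a into subtree b (simplified variant)."""
    if a.tag != b.tag:
        return size(a) + size(b)  # assumed: delete a, insert b
    m, n = len(a.children), len(b.children)
    # d[i][j] = cost of aligning the first i children of a with the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1] + distance(a.children[i - 1],
                                           b.children[j - 1]),  # map children
            )
    return d[m][n]


a = Node("tr", [Node("td", [Node("b")]), Node("td")])
b = Node("tr", [Node("td"), Node("td")])
print(distance(a, b))  # 1: only the <b> node has to be deleted
```

The pairs of children chosen by the third branch of the `min` are the mapping between the two examples; nodes that fall into the first two branches are the unmapped ones.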


Mapping Examples

[Three figure slides; images not preserved in this transcript]


Pattern Matching

• Look for document subtrees with similar structure

• Find alignments of wrapper in tree

• Require every node in wrapper be mapped to some node in document subtree

• Wildcards match zero or more times

• Each valid alignment is a match
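The matching rules above can be sketched as a small recursive matcher. This Python example is a minimal illustration, assuming hypothetical `(tag, [children])` tuples for elements, plain strings for text, and `"*"` as the wildcard marker; the sample document is made up. Every non-wildcard pattern node must map to a document node, and a wildcard may absorb zero or more sibling nodes.

```python
WILDCARD = "*"


def matches(pattern, node):
    """True if the pattern element aligns with this document node."""
    if isinstance(pattern, str) or isinstance(node, str):
        return pattern == node  # literal text must be identical
    return pattern[0] == node[0] and match_seq(pattern[1], node[1])


def match_seq(pats, nodes):
    """Align a list of pattern children with a list of document children."""
    if not pats:
        return not nodes  # no pattern left: no document nodes may remain
    head, rest = pats[0], pats[1:]
    if head == WILDCARD:
        # The wildcard absorbs the first k nodes, for any k >= 0.
        return any(match_seq(rest, nodes[k:]) for k in range(len(nodes) + 1))
    return bool(nodes) and matches(head, nodes[0]) and match_seq(rest, nodes[1:])


def find_matches(pattern, node):
    """Collect every subtree of the document that the wrapper matches."""
    found = [node] if matches(pattern, node) else []
    if not isinstance(node, str):
        for child in node[1]:
            found.extend(find_matches(pattern, child))
    return found


wrapper = ("tr", [("td", ["*"]), ("td", ["*"]), "*"])
doc = ("table", [
    ("tr", [("th", ["Header"])]),
    ("tr", [("td", ["Talk A"]), ("td", ["1:00 PM"])]),
    ("tr", [("td", ["Talk B"]), ("td", ["2:00 PM"]), ("td", [])]),
])

print(len(find_matches(wrapper, doc)))  # 2: the header row is not matched
```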


Matching Example

[Figure slide; image not preserved in this transcript]


Adding Semantics

• How to tie wrappers to semantic content?

• Assert RDF statements about unwrapped objects

• Tied to wrapper structure

• Classes bound to wrappers

• Properties bound to wildcards
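One way to picture these bindings is a matcher that records what each labelled wildcard absorbed and then emits statements about the matched object. The Python below is a sketch under assumptions, not Haystack's or Thresher's API: the `Wild` class, `bind`, `to_triples`, the tuple tree format, and the blank-node name `_:talk1` are all hypothetical, and triples are shown as plain tuples rather than through an RDF library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Wild:
    prop: Optional[str] = None  # RDF property bound to this wildcard, if any


def text_of(node):
    """Concatenated text content of a node."""
    return node if isinstance(node, str) else "".join(text_of(c) for c in node[1])


def bind(pattern, node):
    """Return {property: text} if the node matches the pattern, else None."""
    if isinstance(pattern, str) or isinstance(node, str):
        return {} if pattern == node else None
    if pattern[0] != node[0]:
        return None
    return bind_seq(pattern[1], node[1])


def bind_seq(pats, nodes):
    if not pats:
        return {} if not nodes else None
    head, rest = pats[0], pats[1:]
    if isinstance(head, Wild):
        for k in range(len(nodes) + 1):  # wildcard absorbs the first k nodes
            tail = bind_seq(rest, nodes[k:])
            if tail is not None:
                if head.prop:
                    value = "".join(text_of(n) for n in nodes[:k])
                    tail = {head.prop: value, **tail}
                return tail
        return None
    if not nodes:
        return None
    first = bind(head, nodes[0])
    tail = bind_seq(rest, nodes[1:]) if first is not None else None
    return None if tail is None else {**first, **tail}


def to_triples(subject, rdf_class, bindings):
    """The class comes from the wrapper, the properties from its wildcards."""
    return [(subject, "rdf:type", rdf_class)] + [
        (subject, prop, value) for prop, value in bindings.items()]


wrapper = ("tr", [("td", [Wild("dc:title")]), ("td", [Wild("time")])])
row = ("tr", [("td", ["Talk A"]), ("td", ["1:00 PM"])])  # made-up page content

for triple in to_triples("_:talk1", "TalkAnnouncement", bind(wrapper, row)):
    print(triple)
```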


Semantic Labels

[Figure slide; image not preserved in this transcript]


Semantic Matching

[
  <rdf:type> <TalkAnnouncement> ;
  <series> "Dertouzos Lect…" ;
  <dc:title> "Distributed Hash…" ;
  <time> "3:30 PM"
]


Automatically Adding Examples

• Find additional examples automatically

• Consider nodes neighboring the example

• Require low normalized cost: [formula not preserved in this transcript]

• Often allows us to create wrappers with a single example
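The neighbor search can be sketched as follows. Because the normalized-cost formula did not survive in this transcript, the normalization used here (edit distance divided by the combined size of the two subtrees) and the `0.3` threshold are assumptions chosen for illustration, as are the function names and the `(tag, [children])` tuple format. The distance is the same simplified top-down variant as elsewhere in these notes, not necessarily Thresher's.

```python
def size(t):
    """Number of nodes in the subtree t = (tag, [children])."""
    return 1 + sum(size(c) for c in t[1])


def distance(a, b):
    """Simplified top-down edit distance; subtree operations cost their size."""
    if a[0] != b[0]:
        return size(a) + size(b)
    xs, ys = a[1], b[1]
    row = [0]
    for y in ys:
        row.append(row[-1] + size(y))
    for x in xs:
        new = [row[0] + size(x)]
        for j, y in enumerate(ys, 1):
            new.append(min(row[j] + size(x),              # delete x
                           new[j - 1] + size(y),          # insert y
                           row[j - 1] + distance(x, y)))  # map x to y
        row = new
    return row[-1]


def normalized_cost(a, b):
    """Assumed normalization: distance relative to the combined subtree size."""
    return distance(a, b) / (size(a) + size(b))


def similar_siblings(parent, example_index, threshold=0.3):
    """Indices of siblings close enough to the example to count as examples."""
    example = parent[1][example_index]
    return [i for i, sib in enumerate(parent[1])
            if i != example_index and normalized_cost(example, sib) <= threshold]


table = ("table", [
    ("tr", [("th", []), ("th", [])]),                # header row
    ("tr", [("td", []), ("td", [])]),                # the user's one example
    ("tr", [("td", []), ("td", [])]),
    ("tr", [("td", []), ("td", [("img", [])])]),
])

print(similar_siblings(table, 1))  # [2, 3]
```

With the example at index 1, the two structurally similar rows are picked up as extra examples and the header row is rejected.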


List Collapse

• Current wrappers generalize well for single elements

• Will not recognize variable length lists

• Collapse neighboring nodes with low normalized cost

• For matching, allow nodes to match more than once
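A minimal Python sketch of both halves of this idea follows, assuming hypothetical helper names and a pluggable `cost` function. For brevity the nodes here are bare tag strings and the cost is a 0/1 equality test standing in for the normalized edit cost; the `0.3` threshold is likewise an assumption.

```python
from operator import eq


def collapse(children, cost, threshold=0.3):
    """Merge runs of adjacent similar siblings into (node, repeatable) pairs."""
    out = []
    for child in children:
        if out and cost(out[-1][0], child) <= threshold:
            out[-1] = (out[-1][0], True)  # fold into the previous entry
        else:
            out.append((child, False))
    return out


def match_children(entries, nodes, same):
    """Match collapsed entries; a repeatable entry may match one or more nodes."""
    if not entries:
        return not nodes
    (pat, repeat), rest = entries[0], entries[1:]
    if not nodes or not same(pat, nodes[0]):
        return False
    if match_children(rest, nodes[1:], same):
        return True  # this entry matched exactly one node
    # Otherwise let a repeatable entry try to match the next node as well.
    return repeat and match_children(entries, nodes[1:], same)


def toy_cost(a, b):
    """Stand-in for the normalized edit cost: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


entries = collapse(["li", "li", "li", "p"], toy_cost)
print(entries)  # [('li', True), ('p', False)]
print(match_children(entries, ["li"] * 5 + ["p"], eq))  # True
print(match_children(entries, ["li", "li"], eq))        # False: no closing <p>
```

A pattern induced from a three-item list then matches lists of any length of one or more items.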


Wrapper Wrap-up

• Gather user example(s)

• Automatically find additional examples

• Generalize examples using best mapping

• Add semantic labels

• Match by finding alignments

• Overlay objects on the page for interaction


Additional Tools

• Wrapper Sharing

• RSS

• Web Operations


Our Contributions

• End-user wrapper induction

• Few examples required

• Bring object interaction into the browser

• Wrappers bridge syntactic-semantic gap


Future Work and Applications

• Document-level classes

• Page reformatting

• Autonomous agent interaction

• Negative examples

• Automatic wrapper induction


ahogue@google.com

http://haystack.csail.mit.edu
