structural web search using a graph-based discovery system


Upload: jayme-whitley

Post on 01-Jan-2016

Page 1: Structural Web Search Using a   Graph-Based Discovery System

Structural Web Search Using a Graph-Based Discovery System

Nitish Manocha, Diane J. Cook, and Lawrence B. Holder

University of Texas at Arlington

[email protected]

http://www-cse.uta.edu/~cook

Page 2: Structural Web Search Using a   Graph-Based Discovery System

Structured Web Search

• Existing search engines use linear feature match
• Web contains structural information as well
  • Hyperlink information
  • Web viewed as a graph [Kleinberg]
• Subdue searches based on structure
• Use as foundation of a structural search engine
• Incorporation of WordNet allows for synonym match

Page 3: Structural Web Search Using a   Graph-Based Discovery System

• Discovers structural patterns in input graphs
• A substructure is a connected subgraph
• An instance of a substructure is a subgraph that is isomorphic to the substructure definition
• Pattern discovery, classification, clustering

[Figure: three panels. Input Database (objects R1, C1, T1 to T4, S1 to S4); Substructure S1 (graph form), with labels object, triangle, square, shape, on; Compressed Database (R1, C1, and instances of S1)]

Page 4: Structural Web Search Using a   Graph-Based Discovery System

Subdue Algorithm

• Start with individual vertices
• Keep only best substructures on queue
• Expand substructure by adding edge/vertex
• Compress graph and repeat to generate hierarchical description
• Optional use of background knowledge
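The loop on this slide can be sketched as a toy beam search. This is a minimal sketch under stated assumptions, not Subdue itself: it starts from single edges rather than single vertices, groups instances by a sorted tuple of label triples instead of a true isomorphism test, scores a pattern as size times instance count instead of Subdue's own evaluation measure, and omits the compress-and-repeat step. The names `discover`, `beam_width`, and `max_edges` are hypothetical, and the links from S1 and C1 to R1 in the sample data are invented for illustration.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=3, max_edges=3):
    """Toy beam search for the most repeated connected edge pattern.

    vertices: {vertex_id: label}; edges: list of (src, dst, label).
    An instance is a frozenset of edge indices that form a connected subgraph.
    """
    def signature(instance):
        # Assumption: a sorted tuple of label triples stands in for a real
        # isomorphism test between instances.
        return tuple(sorted((vertices[edges[i][0]], edges[i][2], vertices[edges[i][1]])
                            for i in instance))

    def grouped(instances):
        groups = defaultdict(set)
        for inst in instances:
            groups[signature(inst)].add(inst)
        return groups

    def value(item):
        # Assumption: pattern size times instance count stands in for
        # Subdue's evaluation; patterns seen only once score zero.
        sig, insts = item
        return len(sig) * len(insts) if len(insts) > 1 else 0

    candidates = grouped(frozenset([i]) for i in range(len(edges)))
    best = max(candidates.items(), key=value)
    for _ in range(max_edges - 1):
        # Keep only the best substructures on the queue.
        beam = sorted(candidates.items(), key=value, reverse=True)[:beam_width]
        expanded = set()
        for _sig, insts in beam:
            for inst in insts:
                touched = {v for i in inst for v in edges[i][:2]}
                # Expand each instance by one adjacent edge.
                for j, (src, dst, _label) in enumerate(edges):
                    if j not in inst and (src in touched or dst in touched):
                        expanded.add(inst | {j})
        if not expanded:
            break
        candidates = grouped(expanded)
        top = max(candidates.items(), key=value)
        if value(top) > value(best):
            best = top
    return best

# Sample data loosely modeled on the triangle-on-square figure.
vertices = {"T1": "triangle", "S1": "square", "T2": "triangle", "S2": "square",
            "T3": "triangle", "S3": "square", "T4": "triangle", "S4": "square",
            "R1": "rectangle", "C1": "circle"}
edges = [("T1", "S1", "on"), ("T2", "S2", "on"), ("T3", "S3", "on"),
         ("T4", "S4", "on"), ("S1", "R1", "on"), ("C1", "R1", "on")]
pattern, instances = discover(vertices, edges)
print(pattern, len(instances))
```

On this sample the triangle-on-square edge is the only pattern that repeats, so it is returned with its four instances.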

Page 5: Structural Web Search Using a   Graph-Based Discovery System

Inexact Graph Match

• Some variations may occur between instances
• Want to abstract over minor differences
• Difference = cost of transforming one graph to make it isomorphic to another
• Match if cost/size < threshold
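The match rule above can be illustrated with a brute-force sketch for very small graphs. The assumptions are mine, not the slide's: every transformation (vertex insertion or deletion, label change, edge insertion or deletion) costs 1, "size" is taken as vertices plus edges of the larger graph, and the function names `match_cost` and `inexact_match` are hypothetical. Subdue's own matcher is not shown in the slides and is certainly more efficient than trying every vertex mapping.

```python
from itertools import permutations

def match_cost(g1, g2):
    """Smallest number of unit-cost edits that makes g1 isomorphic to g2.

    A graph is (labels, edges): labels is a list of vertex labels and
    edges is a set of (src_index, dst_index, edge_label) triples.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with None; mapping a vertex to None costs 1.
    a = list(labels1) + [None] * (n - len(labels1))
    b = list(labels2) + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):
        # Vertex i of g1 is mapped to vertex perm[i] of g2.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        mapped = {(perm[s], perm[d], lab) for s, d, lab in edges1}
        # Edges present on one side only must be added or deleted.
        cost += len(mapped ^ set(edges2))
        if best is None or cost < best:
            best = cost
    return best

def inexact_match(g1, g2, threshold):
    # Assumption: size = vertices + edges of the larger graph.
    size = max(len(g[0]) + len(g[1]) for g in (g1, g2))
    return match_cost(g1, g2) / size < threshold

shape = {(0, 1, "hyperlink"), (1, 2, "word")}
query = (["page", "page", "subdue"], shape)
candidate = (["page", "page", "robotics"], shape)
print(match_cost(query, candidate), inexact_match(query, candidate, 0.25))
```

Here the two graphs differ by one vertex label, so the cost is 1 and the size is 5; the pair matches at a threshold of 0.25 but not at 0.1.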

Page 6: Structural Web Search Using a   Graph-Based Discovery System

Application Domains

• Protein data
• Human Genome DNA data
• Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
• Telecommunications data
• Program source code
• Web data

Page 7: Structural Web Search Using a   Graph-Based Discovery System

Represent Web as Graph

• Breadth-first search of domain to generate graph
• Nodes represent pages / documents
• Edges represent hyperlinks
• Additional nodes represent document keywords

[Figure: example Web graph with page vertices joined by hyperlink edges, and word edges to keyword vertices: university, texas, learning group, projects, subdue, robotics, parallel, work, planning]
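The graph construction described on this slide can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "domain" is a hypothetical in-memory dictionary (`site`) mapping each URL to its outgoing links and keywords; the function name `build_web_graph`, the URLs, and the 1-based vertex numbering are assumptions made for illustration, not details taken from WebSubdue.

```python
from collections import deque

def build_web_graph(site, start):
    """Breadth-first walk of `site`, producing a labeled graph.

    site: {url: {"links": [url, ...], "words": [keyword, ...]}}
    Returns (vertices, edges, page_id); vertex ids are 1-based positions
    in `vertices`, and edges are (src_id, dst_id, label) triples.
    """
    vertices, edges, page_id = [], [], {}

    def add_vertex(label):
        vertices.append(label)
        return len(vertices)

    page_id[start] = add_vertex("page")
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for word in site[url]["words"]:
            # Each keyword becomes its own vertex attached by a "word" edge.
            edges.append((page_id[url], add_vertex(word), "word"))
        for target in site[url]["links"]:
            if target not in site:
                continue  # link leaves the domain being searched
            if target not in page_id:
                page_id[target] = add_vertex("page")
                queue.append(target)
            edges.append((page_id[url], page_id[target], "hyperlink"))
    return vertices, edges, page_id

# Hypothetical two-page domain.
site = {
    "/index.html": {"links": ["/projects.html", "http://elsewhere.example/"],
                    "words": ["university", "texas"]},
    "/projects.html": {"links": ["/index.html"],
                       "words": ["subdue", "robotics"]},
}
vertices, edges, page_id = build_web_graph(site, "/index.html")
for edge in edges:
    print(edge)
```

Giving every keyword occurrence its own vertex keeps the sketch short; whether WebSubdue shares keyword vertices between pages is not stated in the slides.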

Page 8: Structural Web Search Using a   Graph-Based Discovery System

WebSubdue’s Structural Search

• Formulate query as graph
• Use Subdue’s predefined substructure option to search for instances of query

[Figure: example query graph with labels Instructor, Teaching, Research, Publication, Robotics, http, Postscript | PDF]

Page 9: Structural Web Search Using a   Graph-Based Discovery System

Query: Find all pages which link to a page containing term ‘Subdue’

[Figure: query graph, page -hyperlink-> page -word-> Subdue]

/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word

Subgraph vertices:
1 page URL: http://cygnus.uta.edu
7 page URL: http://cygnus.uta.edu/projects.html
8 Subdue
[1->7] hyperlink
[7->8] word
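A query in the vertex/edge text format shown on this slide can be read and matched against a graph with a short backtracking search. This is a hedged sketch of exact matching only: the parser handles just the `v` and `d` lines and skips everything else, labels are assumed to be single tokens, and `parse_graph` and `find_instances` are hypothetical names rather than Subdue functions. The sample graph reuses the vertex numbers from the result above and adds one invented page (vertex 9).

```python
def parse_graph(text):
    """Read 'v <id> <label>' and 'd <src> <dst> <label>' lines."""
    labels, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in ("v", "d"):
            continue  # comments, the 's' marker, blank lines
        if parts[0] == "v":
            labels[int(parts[1])] = parts[2]
        else:
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return labels, edges

def find_instances(query, graph):
    """Return every mapping {query_vertex: graph_vertex} that preserves
    vertex labels and all directed, labeled query edges."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    g_edge_set = set(g_edges)
    q_ids = sorted(q_labels)
    results = []

    def extend(mapping):
        if len(mapping) == len(q_ids):
            if all((mapping[s], mapping[d], lab) in g_edge_set
                   for s, d, lab in q_edges):
                results.append(dict(mapping))
            return
        q = q_ids[len(mapping)]
        for g, label in g_labels.items():
            if label == q_labels[q] and g not in mapping.values():
                mapping[q] = g
                extend(mapping)
                del mapping[q]

    extend({})
    return results

query = parse_graph("""
/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word
""")
graph = parse_graph("""
v 1 page
v 7 page
v 8 Subdue
v 9 page
d 1 7 hyperlink
d 7 8 word
d 7 9 hyperlink
""")
matches = find_instances(query, graph)
print(matches)
```

The single mapping found sends query vertices 1, 2, 3 to graph vertices 1, 7, 8, which corresponds to the instance listed on the slide.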

Page 10: Structural Web Search Using a   Graph-Based Discovery System

Search for Presentation Pages

• WebSubdue: 22 instances
• AltaVista query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”: 12 instances

[Figure: query graph of page vertices connected by hyperlink edges]

Page 11: Structural Web Search Using a   Graph-Based Discovery System

Search for Reference Pages

• Search for page with at least 35 in-links
• WebSubdue found 5 pages in www-cse
• AltaVista cannot perform this type of search

[Figure: query graph of page vertices, each with a hyperlink edge to a single page vertex]
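For this particular star-shaped query, a plain in-link count gives the same answer as a structural match, which makes the idea easy to illustrate. The sketch below is not how WebSubdue runs the search (the slides describe it as a graph query); `reference_pages` is a hypothetical helper, and the sample uses a threshold of 3 instead of 35 to stay small.

```python
from collections import Counter

def reference_pages(hyperlinks, min_in_links):
    """Pages pointed to by at least `min_in_links` distinct other pages.

    hyperlinks: iterable of (source_page, target_page) pairs.
    """
    in_links = Counter()
    for src, dst in set(hyperlinks):  # count each distinct link once
        if src != dst:                # ignore self-links
            in_links[dst] += 1
    return sorted(page for page, n in in_links.items() if n >= min_in_links)

links = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "b"), ("a", "hub")]
print(reference_pages(links, 3))
```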

Page 12: Structural Web Search Using a   Graph-Based Discovery System

Inclusion of WordNet

• When generating graph
  • Use common stopword list
• When searching for subgraph instances
  • Morphology functions: October = Oct, teaching = teach
  • Synsets
  • Optional allowance of synonyms
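The three steps on this slide (stopword removal, morphological normalization, optional synonym matching) can be illustrated with small hand-written tables. These tables are stand-ins invented for the example and are not WordNet data or the WordNet API; `STOPWORDS`, `BASE_FORMS`, `SYNSETS`, and the function names are all hypothetical. The synonym group borrows the words listed on the "jobs in computer science" slide.

```python
STOPWORDS = {"the", "a", "of", "in", "and"}            # tiny illustrative list
BASE_FORMS = {"oct": "october", "teaching": "teach",   # stand-in for
              "jobs": "job"}                           # morphology functions
SYNSETS = [{"job", "employment", "work", "task", "problem"}]  # stand-in synset

def normalize(word):
    """Lowercase, strip punctuation, and map to a base form if one is known."""
    w = word.lower().strip(".,;:!?")
    return BASE_FORMS.get(w, w)

def keywords(text):
    """Keyword labels for a page: normalized words minus stopwords."""
    words = (normalize(w) for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def labels_match(a, b, allow_synonyms=False):
    """Compare two word labels, optionally accepting synonyms."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    return allow_synonyms and any(a in s and b in s for s in SYNSETS)

print(keywords("Jobs in the Teaching of Robotics"))
print(labels_match("Oct", "October"))
print(labels_match("jobs", "employment", allow_synonyms=True))
```

Keeping synonym matching behind a flag mirrors the "optional allowance" wording on the slide: exact and morphological matches are always accepted, synonyms only on request.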

Page 13: Structural Web Search Using a   Graph-Based Discovery System

Search for pages on ‘jobs in computer science’

• Inexact match: allow one level of synonyms
• WebSubdue found 33 matches
  • Words include employment, work, job, problem, task
• AltaVista found 2 matches

[Figure: query graph with a page vertex and word edges to jobs, computer, science]

Page 14: Structural Web Search Using a   Graph-Based Discovery System

Search for ‘authority’ hub and authority pages

• WebSubdue found 3 hub (and 3 authority) pages
• AltaVista cannot perform this type of search
• Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
• WebSubdue found 13 matches

[Figure: query graph of page vertices grouped as HUBS and AUTHORITIES, joined by hyperlink edges, with three word edges to “algorithms”]
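The threshold and the transformation allowance quoted on this slide are linked by the match criterion from the Inexact Graph Match slide (cost/size < threshold). Reading "size" as the size of the query graph is an assumption; under it, the two numbers imply the following:

```latex
\[
\frac{\text{cost}}{\text{size}} < 0.2
\;\Longleftrightarrow\;
\text{cost} < 0.2 \times \text{size},
\qquad
0.2 \times \text{size} = 4.2
\;\Longrightarrow\;
\text{size} = \frac{4.2}{0.2} = 21 .
\]
```

A size of 21 would be consistent with, for example, a query of 9 vertices and 12 edges if size counts both, but the slide does not give that breakdown.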

Page 15: Structural Web Search Using a   Graph-Based Discovery System

Subdue Learning from Web Data

• Distinguish professors’ and students’ web pages
  • Learned concept (professors have “box” in address field)
• Distinguish online stores and professors’ web pages
  • Learned concept (stores have more levels in graph)

[Figure: learned substructures, one with a page vertex and a word edge to “box”, the other a multi-level graph of page vertices]

Page 16: Structural Web Search Using a   Graph-Based Discovery System

Conclusions

• WebSubdue can be used to search for structural web data
• Could be enhanced with additional WordNet features such as synset path length
• Efficient structural search necessary for future of web search tools

Page 17: Structural Web Search Using a   Graph-Based Discovery System

To Learn More

cygnus.uta.edu/subdue

[email protected]

http://www-cse.uta.edu/~cook