Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
University of Texas at Arlington
[email protected]
http://www-cse.uta.edu/~cook

Structured Web Search
• Existing search engines use linear feature match
• Web contains structural information as well
  • Hyperlink information
  • Web viewed as a graph [Kleinberg]
• Subdue searches based on structure
  • Use as foundation of a structural search engine
• Incorporation of WordNet allows for synonym match

• Subdue discovers structural patterns in input graphs
• A substructure is a connected subgraph
• An instance of a substructure is a subgraph that is isomorphic to the substructure definition
• Pattern discovery, classification, clustering
[Figure: Input Database, Substructure S1 (graph form), and Compressed Database; vertex and edge labels include object, triangle, square, on, shape]

Subdue Algorithm
• Start with individual vertices
• Keep only best substructures on queue
• Expand substructure by adding edge/vertex
• Compress graph and repeat to generate hierarchical description
• Optional use of background knowledge
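
The first three steps above can be sketched as a small beam search. This is a minimal illustration, not Subdue's implementation: the `signature` key is a crude stand-in for a real isomorphism test, the `score` heuristic (instances times size) is an assumption, and the compression and background-knowledge steps are left out. The toy graph and all function names are hypothetical.

```python
from collections import defaultdict

def signature(vertices, edges, inst):
    """Crude canonical key for an instance (ASSUMPTION: stands in for a
    real graph-isomorphism test between instances)."""
    vs, es = inst
    vertex_labels = tuple(sorted(vertices[v] for v in vs))
    edge_triples = tuple(sorted(
        (vertices[edges[e][0]], edges[e][2], vertices[edges[e][1]]) for e in es))
    return vertex_labels, edge_triples

def score(instances):
    """ASSUMPTION: value = number of instances x (vertices + edges)."""
    vs, es = next(iter(instances))
    return len(instances) * (len(vs) + len(es))

def discover(vertices, edges, beam_width=3, max_edges=2):
    # Start with individual vertices: one single-vertex instance each.
    queue = defaultdict(set)
    for v in vertices:
        inst = (frozenset([v]), frozenset())
        queue[signature(vertices, edges, inst)].add(inst)
    best = None
    for _ in range(max_edges):
        # Expand every instance by one adjacent edge (and its far vertex).
        expanded = defaultdict(set)
        for instances in queue.values():
            for vs, es in instances:
                for i, (src, dst, _label) in enumerate(edges):
                    if i not in es and (src in vs or dst in vs):
                        inst = (vs | {src, dst}, es | {i})
                        expanded[signature(vertices, edges, inst)].add(inst)
        if not expanded:
            break
        # Keep only the best substructures on the queue.
        ranked = sorted(expanded.items(), key=lambda kv: (-score(kv[1]), kv[0]))
        queue = dict(ranked[:beam_width])
        top_sig, top_instances = ranked[0]
        if best is None or score(top_instances) > score(best[1]):
            best = (top_sig, top_instances)
    return best

# Hypothetical toy graph: four triangles, each "on" a square, plus two extras.
vertices = {"T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
            "S1": "square", "S2": "square", "S3": "square", "S4": "square",
            "R1": "rectangle", "C1": "circle"}
edges = [("T1", "S1", "on"), ("T2", "S2", "on"), ("T3", "S3", "on"),
         ("T4", "S4", "on"), ("C1", "R1", "on"), ("R1", "T1", "on")]
best_sig, best_instances = discover(vertices, edges)
```

On this toy graph the triangle-on-square pattern wins because it has the most instances for its size; a fuller version would then replace each instance with a single vertex and repeat.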

Inexact Graph Match
• Some variations may occur between instances
• Want to abstract over minor differences
• Difference = cost of transforming one graph to make it isomorphic to another
• Match if cost/size < threshold
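
The match rule above reduces to a one-line test. The sketch below assumes the transformation cost has already been computed elsewhere and that size is counted as vertices plus edges; both the function names and that size definition are assumptions, not taken from the slides.

```python
def graph_size(num_vertices, num_edges):
    # ASSUMPTION: size is the number of vertices plus the number of edges.
    return num_vertices + num_edges

def is_inexact_match(cost, size, threshold):
    """Apply the slide's rule: match if cost / size < threshold."""
    if size <= 0:
        raise ValueError("size must be positive")
    return cost / size < threshold

def max_allowed_cost(size, threshold):
    """Largest total transformation cost the threshold tolerates (exclusive)."""
    return threshold * size

size = graph_size(9, 12)                  # hypothetical graph: 9 vertices, 12 edges
budget = max_allowed_cost(size, 0.2)      # threshold of 0.2
ok = is_inexact_match(4, size, 0.2)       # 4 / 21 is below 0.2
too_far = is_inexact_match(5, size, 0.2)  # 5 / 21 is above 0.2
```

Under this reading, the later hub and authority slide's figures (threshold = 0.2, 4.2 transformations allowed) would be consistent with a query graph of size 21.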

Application Domains
• Protein data
• Human Genome DNA data
• Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
• Telecommunications data
• Program source code
• Web data

Represent Web as Graph
• Breadth-first search of domain to generate graph
• Nodes represent pages / documents
• Edges represent hyperlinks
• Additional nodes represent document keywords
[Figure: example web graph; page vertices joined by hyperlink edges, with word edges to keyword vertices such as university, texas, learning, group, projects, subdue, robotics, parallel, work, planning]
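
A breadth-first construction of such a graph might look like the sketch below. To stay self-contained it walks a hypothetical in-memory `pages` dictionary instead of fetching anything, and it gives every keyword occurrence its own vertex; the URLs, the data layout, and the function name are all assumptions.

```python
from collections import deque

def build_web_graph(start_url, pages):
    """Breadth-first walk over `pages`, a hypothetical stand-in for a crawler:
    url -> (list of linked urls, list of keywords).
    Returns (vertices, edges, ids): vertex id -> label, (src, dst, label) triples,
    and url -> page vertex id."""
    ids = {start_url: 1}
    vertices = {1: "page"}
    edges = []
    next_id = 2
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        links, keywords = pages[url]
        for word in keywords:
            # Additional nodes represent document keywords.
            vertices[next_id] = word
            edges.append((ids[url], next_id, "word"))
            next_id += 1
        for target in links:
            if target not in pages:
                continue  # outside the crawled domain
            if target not in ids:
                ids[target] = next_id
                vertices[next_id] = "page"
                next_id += 1
                queue.append(target)
            edges.append((ids[url], ids[target], "hyperlink"))
    return vertices, edges, ids

# Hypothetical two-page site.
pages = {
    "http://example.edu/": (
        ["http://example.edu/projects.html", "http://other.org/"], ["university"]),
    "http://example.edu/projects.html": (["http://example.edu/"], ["subdue"]),
}
vertices, edges, ids = build_web_graph("http://example.edu/", pages)
```

The link to `http://other.org/` is dropped because that page lies outside the domain being searched.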

WebSubdue’s Structural Search
• Formulate query as graph
• Use Subdue’s predefined substructure option to search for instances of query
[Figure: example query graph; labels include Instructor, Teaching, Research, Robotics, Publication, http, Postscript | PDF]

Query: Find all pages which link to a page containing term ‘Subdue’
  Subgraph vertices:
  1 page
  URL: http://cygnus.uta.edu
  7 page
  URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word
[Figure: query graph; page -hyperlink-> page -word-> Subdue]
  /* Vertex ID Label */
  s
  v 1 page
  v 2 page
  v 3 Subdue
  /* Edge Vertex 1 Vertex 2 Label */
  d 1 2 hyperlink
  d 2 3 word
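
To make the query concrete, the sketch below reads the `v` and `d` lines of the query file and looks for exact instances in a small graph by backtracking. It is an illustration only: the parser handles just those two line types with single-token labels, the matcher is a plain exact search rather than Subdue's own routine, and the data graph is hypothetical apart from vertices 1, 7 and 8, which mirror the instance shown above.

```python
def parse_subdue_graph(text):
    """Read 'v <id> <label>' and 'd <src> <dst> <label>' lines; skip the rest."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges

def find_instances(q_vertices, q_edges, g_vertices, g_edges):
    """Return every mapping query id -> graph id that preserves vertex labels
    and directed, labelled edges (exact match, no inexact tolerance)."""
    g_edge_set = set(g_edges)
    q_ids = list(q_vertices)
    results = []

    def extend(mapping):
        if len(mapping) == len(q_ids):
            results.append(dict(mapping))
            return
        q = q_ids[len(mapping)]
        for g, label in g_vertices.items():
            if label != q_vertices[q] or g in mapping.values():
                continue
            mapping[q] = g
            # Check every query edge whose endpoints are both mapped so far.
            if all((mapping[a], mapping[b], lab) in g_edge_set
                   for a, b, lab in q_edges if a in mapping and b in mapping):
                extend(mapping)
            del mapping[q]

    extend({})
    return results

QUERY = """/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word
"""
q_vertices, q_edges = parse_subdue_graph(QUERY)

# Hypothetical data graph; vertices 2 and 3 are distractors.
g_vertices = {1: "page", 2: "page", 3: "robotics", 7: "page", 8: "Subdue"}
g_edges = [(1, 2, "hyperlink"), (2, 3, "word"), (1, 7, "hyperlink"), (7, 8, "word")]
matches = find_instances(q_vertices, q_edges, g_vertices, g_edges)
```

The single match maps query vertices 1, 2, 3 to graph vertices 1, 7, 8, the same shape as the instance listed on the slide.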

Search for Presentation Pages
• WebSubdue: 22 instances
• AltaVista: 12 instances
  • Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
[Figure: query graph of page vertices connected by hyperlink edges]

Search for Reference Pages
• Search for page with at least 35 in-links
• WebSubdue found 5 pages in www-cse
• AltaVista cannot perform this type of search
[Figure: query graph; one page vertex with hyperlink edges arriving from many page vertices (…)]
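
Because this query is a star (one page with many incoming hyperlinks), the same answer can be approximated by counting distinct linking pages. The sketch below uses that shortcut rather than a subgraph match, so it is not how WebSubdue itself runs the query; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def pages_with_min_inlinks(vertices, edges, min_inlinks=35):
    """Return ids of page vertices linked to by at least `min_inlinks`
    distinct other pages."""
    linkers = defaultdict(set)
    for src, dst, label in edges:
        if (label == "hyperlink" and src != dst
                and vertices.get(src) == "page" and vertices.get(dst) == "page"):
            linkers[dst].add(src)
    return sorted(p for p, sources in linkers.items() if len(sources) >= min_inlinks)

# Hypothetical graph: 35 pages link to page 0, two pages link to page 36.
vertices = {i: "page" for i in range(38)}
edges = [(i, 0, "hyperlink") for i in range(1, 36)]
edges += [(1, 36, "hyperlink"), (2, 36, "hyperlink")]
reference_pages = pages_with_min_inlinks(vertices, edges)
```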

Inclusion of WordNet
• When generating graph
  • Use common stopword list
• When searching for subgraph instances
  • Morphology functions
    • October = Oct
    • teaching = teach
  • Synsets
    • Optional allowance of synonyms
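
The effect of these steps on label comparison can be illustrated with hand-written tables. The sketch below does not call WordNet: the stopword set, abbreviation map, suffix rules, and function names are toy assumptions, and the synonym set simply reuses the words the deck lists for its jobs query.

```python
# All three tables are hypothetical stand-ins for WordNet lookups.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}
ABBREVIATIONS = {"oct": "october"}
SYNONYMS = {"job": {"employment", "work", "problem", "task"}}

def base_form(word):
    """Very crude morphology: lower-case, expand abbreviations, strip a suffix."""
    w = word.lower()
    w = ABBREVIATIONS.get(w, w)
    for suffix in ("ing", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

def keywords(text):
    """Keyword labels for a page: drop stopwords, reduce the rest to base form."""
    return [base_form(w) for w in text.split() if w.lower() not in STOPWORDS]

def words_match(a, b, allow_synonyms=False):
    """Compare two labels after morphology; optionally allow one synonym step."""
    a, b = base_form(a), base_form(b)
    if a == b:
        return True
    if allow_synonyms:
        return b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())
    return False

query_words = keywords("jobs in computer science")
```

The suffix rule is deliberately naive (it would mangle a word such as "class"), which is one reason a real system would defer to WordNet's morphology functions.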

Search for pages on ‘jobs in computer science’
• Inexact match: allow one level of synonyms
• WebSubdue found 33 matches
  • Words include employment, work, job, problem, task
• AltaVista found 2 matches
[Figure: query graph; a page vertex with word edges to jobs, computer, and science]

Search for ‘authority’ hub and authority pages
• WebSubdue found 3 hub (and 3 authority) pages
• AltaVista cannot perform this type of search
• Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
  • WebSubdue found 13 matches
[Figure: query graph; HUBS page vertices with hyperlink edges to AUTHORITIES page vertices, and word edges to ‘algorithms’]

Subdue Learning from Web Data
• Distinguish professors’ and students’ web pages
  • Learned concept (professors have “box” in address field)
• Distinguish online stores and professors’ web pages
  • Learned concept (stores have more levels in graph)
[Figure: learned substructures; a page vertex with a word edge to ‘box’, and a multi-level graph of page vertices]

Conclusions
• WebSubdue can be used to search for structural web data
• Could be enhanced with additional WordNet features such as synset path length
• Efficient structural search necessary for future of web search tools