Download - Gülfem IŞIKLAR M.Mirac KOCATÜRK
Complex Queries over Web Repositories
Sriram Raghavan and Hector Garcia-MolinaComputer Science Department
Stanford University
Gülfem IŞIKLARGülfem IŞIKLARM.Mirac KOCATÜRKM.Mirac KOCATÜRK
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 22
OutlineOutline
IntroductionIntroduction
Challenges and Solution ApproachChallenges and Solution Approach
Model of a Web RepositoryModel of a Web Repository
Query OperatorsQuery Operators
Examples of Complex QueriesExamples of Complex Queries
Optimizing and Executing Complex Web QueriesOptimizing and Executing Complex Web Queries
ConclusionConclusion
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 33
IntroductionIntroduction
Web repositories manage large heterogeneous collections of Web pages and associated indexes.
For effective analysis and mining, these repositories must provide a declarative query interface that supports complex expressive Web queries.
In this paper, we model a Web repository in terms of “Web relations” and describe an algebra for expressing complex Web queries.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 44
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 55
Example 1: Let S be a weighted set consisting of all the pages
in the stanford.edu domain that contain the phrase ’Mobile
networking’. Compute R, the set of all the “.edu” domains
(except stanford.edu) that pages in S point to (we say a page p
points to domain D if it points to any page in D). List the top-10
domains in R in descending order of their weights.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 66
Example 2: With each comic strip C, he associates a website CS, and a set CW containing the name of the strip and the names of the characters featured in that strip. For example,
DilbertW = {Dilbert, Dogbert, The Boss} and
DilbertS = dilbert.com.
Extract a set of at most 10000 pages from the stanford.edu domain, preferring pages whose URLs either include the “~” character or include the path fragment “/people/”.
For each comic strip C, compute f1 (C), the number of pages in S that contain the words in CW, and f2 (C), the number of pages in CS that pages in S point to.
f1 (C) + f2 (C) is a measure of popularity for comic strip C.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 77
Challenges and Solution ApproachChallenges and Solution Approach
Query models used in relational or text retrieval systems provide some, but not all of the features required to support Web queries.
Thus, treating a Web repository as an application of a text retrieval system will support the “document collection” view. However, queries involving navigation or relational operators will be extremely hard to formulate and execute.
On the other hand, the relational model provides a rich and well-tested suite of operators for expressing complex predicates over Web page attributes. However, ranks and orders are not intrinsic to the the basic relational model.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 88
Model of a Web RepositoryModel of a Web Repository
Page: We use the term “page” to refer to any Web resource
that is referenced by a URL, crawled, and stored in the
repository.
Link: We use the term “link” to refer to any hypertext link that
is embedded in the pages in the repository. Each link is
associated with a source page (the page in which the
hypertext link occurs) and a destination page (the page that
the link refers to), and a unique identifier linkID.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 99
Ordered relation: Given a relation R and a strict partial ordering >R on the tuples of R, we refer to the pair [ R; >R ] as an ordered
relation on R.
For instance, we define an ordered relation [R; >R] = [ R; {a >R d;
a >R e; b >R d; b >R e; c >R d; c >R e } ], where each tuple whose
domain attribute is stanford.edu is >R -related to any tuple outside
the stanford.edu domain.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1010
Ranked relation: Given a relation R and a function w that assigns weights ( normalized to the range [0,1] ) to the tuples of R, we can define a new relation [ R ; w ] that is simply R with an additional implicit real-valued attribute w.
We refer to [ R ; w ] as a ranked relation on R and to R as the “base relation” of [ R ; w ].
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1111
We model a Web repository as a 6-tuple
W = ( IP; IL; WR ; P ; L ; F ):
IP (resp. IL) is an identifier space from which the pageID (resp.
linkID) for every page (resp. link) is chosen.
WR is a set of plain, ranked, or ordered relations called Web
relations. A relation R is said to be a Web relation if it contains at least one attribute whose domain is IP, IL 2Ip, or 2IL .
P WR is a universal page relation. P contains one tuple for
each page in the repository and one column for each page attribute.
P = (pageID, ....)
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1212
L WR is a universal link relation. L contains one tuple for
each hyperlink in the repository and one column for every available link attribute.
L = (linkID, srcID, destID, ...)
F is a set of predefined page and link ranking functions that have been registered in the repository.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1313
Query OperatorsQuery Operators
Unary Relational OperatorsUnary Relational Operators
Select (σ): For an ordered relation, we define ( [ R ; >R ] ) = [ S ; >S ], where S = (R) and >S is defined as a >S b iff a >R b and a; b S.For instance, referring to relation R, given the ordered relation
[ R ; >R ] = [ R; {a > d; a > e; b > d; b > e; c > d; c > e } ], representing “preference to stanford.edu pages over berkeley.edu pages”, σpInDegree≥4 ([R;>R]) will yield [S;>S] where S = σpInDegree≥4 (R) = {b; d; e}, and >S includes the two ordering conditions b >S d and b >S e.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1515
Projection (Projection (ΠΠ):): The semantics of projection also carry over unchanged from traditional relational algebra, except for the folllowing changes:
The result of the projection must include at least one attribute whose domain is IP, IL, 2Ip, or 2IL , to ensure that the result is a
Web relation.
Projection on a ranked relation [ R ; w ] will retain the ranking attribute R.w even if it is not listed in the projection list.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1616
Group-by ( γ ) on plain Web relation : The γ operator is used
to group the incoming links to WS based on the language of the
page in which the links occur.
Group-by ()
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1717
Group-by ( γ ) on ranked Web relation: The value of t.w’ for a
tuple t [S ; w’] is computed by averaging the ranks of all the
tuples of R belonging to the corresponding group.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1818
Group-by ( γ ) on ordered Web relation: The partial ordering >R is used to express the following preference: “prefer pages
with depths ≤ 3”. Thus, b >R c, g >R e, etc., as shown in the
diagram.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 1919
Binary Relational OperatorsBinary Relational Operators
Union ( U ): We define [ X; >X ] [ [ Y; >Y ] = [ Z ; >Z ], where
• Z = X U Y
• If a >X b and a >Y b, then a >Z b
• If either a >X b and b ∉ Y or a >Y b and b ∉ X, then a >Z b
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2020
Set-difference (−): Note that the result has a partial order which is
simply >X restricted to the elements present in the result.
[ X ; >X ] − [ Y ; >Y ] = [ Z ; >Z ],
where Z = X − Y and a >Z b iff a >X b and a; b Z
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2121
Intersection ( ⋂ ):
[ X ; >X ] ⋂ [ Y ; >Y ] = [ Z ; >Z ], where
Z = X ⋂ Y and a >Z b iff a; b Z, a >X b, and a >Y b
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2222
Cross-product ( X ): Cross-product operations can involve any
pair of plain, ranked, or ordered relations. The challenge is to
define the ordering or ranking of the result for each possible
combination of operands.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2323
Ranking and ordering operators
Rank (Ψ): Operator Ψ simply formalizes the act of applying a
ranking function to a base relation. Thus, given a relation R and
ranking function f : R x {R} → [ 0 ; 1 ], we define Ψ( f ; R ) = [ R ; f ].
Compose ( Θh ,op ): The compose operator Θ is used to merge
two ranked relations to produce another ranked relation.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2424
Order ( Φ ): The operator Φ constructs an ordered relation,
given either a ranked relation or a plain base relation.
When applied on a ranked relation, Φ ([R; f]) returns the
corresponding ordered relation [ R ; >f ].
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2525
Prune ( Ωk ): The prune operator provides a mechanism for
retrieving a fixed-size subset of tuples from a relation. In
particular, given a relation R, Ωk (R) selects a subset of size
min(k; |R|).
For example, consider the ordered relation [ R; {a > b ; a > c ; a > e ; f > b ; f > c ; f > e} ] shown in previous figure, corresponding to the preference for “.com” domains over “.org” domains.
Ω4 on this relation can yield any set of four tuples as long as at least a and f are part of the result (thus, 6 possible results). Thus, one possible result of applying Ω4 is [{a; f; e; d}; {a > e; f > e} ].
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2626
QUERY OPERATORSQUERY OPERATORS
NAVIGATION OPERATORSNAVIGATION OPERATORS→→Λ is represented as forward navigationΛ is represented as forward navigation←←Λ is represented as backward navigationΛ is represented as backward navigation
These operators are expressed in terms of These operators are expressed in terms of cross product and group by operations.cross product and group by operations.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2727
QUERY OPERATORSQUERY OPERATORS
Navigation Operators differ in 2 ways:Navigation Operators differ in 2 ways:
1.1. Binary Navigation operatorBinary Navigation operator
2.2. Unary Navigation operatorUnary Navigation operator
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2828
QUERY OPERATORSQUERY OPERATORS
Binary Navigation OperatorBinary Navigation Operator1.1. Navigation with RankingNavigation with Ranking2.2. Navigation with OrderingNavigation with Ordering
a. a. Ordering only on pagesOrdering only on pagesb.b. Ordering both pages and links Ordering both pages and links
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 2929
QUERY OPERATORSQUERY OPERATORS(navigation with ordering)(navigation with ordering)
[R,>[R,>RR]=]=ΦΦpLanguage=English>pLanguage≠English(R)pLanguage=English>pLanguage≠English(R)
[S,>[S,>SS]=]=ΦΦIntraDomain=yes>IntraDomain=No(S)IntraDomain=yes>IntraDomain=No(S)
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3030
QUERY OPERATORSQUERY OPERATORS(navigation with ranking)(navigation with ranking)
In this example we have the terms of;In this example we have the terms of;[R,f] and [S,g][R,f] and [S,g]
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3131
QUERY OPERATORSQUERY OPERATORS
UNARY NAVIGATION OPERATORSUNARY NAVIGATION OPERATORS
Instead of choosing from a set of tuples these operators Instead of choosing from a set of tuples these operators
permit navigation using all available data links in the permit navigation using all available data links in the
repository.repository.
So if R is ordered and ranked then each neighbour will also So if R is ordered and ranked then each neighbour will also
be correspondingly ordered and ranked.be correspondingly ordered and ranked.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3232
EXAMPLES OF COMPLEX QUERIESEXAMPLES OF COMPLEX QUERIES
Example 1Example 1
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3333
EXAMPLES OF COMPLEX QUERIESEXAMPLES OF COMPLEX QUERIES
Example 2Example 2
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3434
EXAMPLES OF COMPLEX QUERIESEXAMPLES OF COMPLEX QUERIES
Example 3Example 3
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3535
EXAMPLES OF COMPLEX QUERIESEXAMPLES OF COMPLEX QUERIES
Example 4Example 4
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3636
OPTIMIZING AND EXECUTING OPTIMIZING AND EXECUTING COMPLEX WEB QUERIESCOMPLEX WEB QUERIES
An optimizer and execution engine is developed to An optimizer and execution engine is developed to
efficiently executing the complex queries.efficiently executing the complex queries.
The challenges of the system are:The challenges of the system are:
1.1. Certain unique features of Web data setCertain unique features of Web data set
2.2. The storage structures used in Web repositoriesThe storage structures used in Web repositories
3.3. Characteristics of complex web queriesCharacteristics of complex web queries
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3737
OPTIMIZING AND EXECUTING OPTIMIZING AND EXECUTING COMPLEX WEB QUERIESCOMPLEX WEB QUERIES
As with join operations in relational queries, optimization of As with join operations in relational queries, optimization of
navigation operations is crucial for web queries.navigation operations is crucial for web queries.
There are two techniques to optimize navigation operation:There are two techniques to optimize navigation operation:
1.1. Exploit Query LocalityExploit Query Locality
2.2. Exploit PruneExploit Prune
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3838
OPTIMIZING AND EXECUTING OPTIMIZING AND EXECUTING COMPLEX WEB QUERIESCOMPLEX WEB QUERIES
PAGE CLUSTERSPAGE CLUSTERS
To identify and exploit locality during query execution, we To identify and exploit locality during query execution, we
partition the entire set in the repository into page clusters.partition the entire set in the repository into page clusters.
We attempt to group together “related” pages so that all the We attempt to group together “related” pages so that all the
pages relevant to a complex query as distributed among a pages relevant to a complex query as distributed among a
relatively small number of clusters.relatively small number of clusters.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 3939
OPTIMIZING AND EXECUTING OPTIMIZING AND EXECUTING COMPLEX WEB QUERIESCOMPLEX WEB QUERIES
S-NODE REPRESENTATIONS-NODE REPRESENTATION
Supernode graphs resides in memory.Supernode graphs resides in memory.
Graph chunks are loaded from disk on demand.Graph chunks are loaded from disk on demand.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 4040
EXPERIMENTAL RESULTSEXPERIMENTAL RESULTS
35-million page data set 35-million page data set (approximately; 600 million links with 300 GB of HTML)(approximately; 600 million links with 300 GB of HTML)
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 4141
EXPERIMENTAL RESULTSEXPERIMENTAL RESULTS
30 Web queries over 5 different 20-million data set30 Web queries over 5 different 20-million data set
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 4242
RELATED WORKSRELATED WORKS
Drawing inspiration graph and hyper-text query systems, a Drawing inspiration graph and hyper-text query systems, a
number of web query languages have been developed in the number of web query languages have been developed in the
past such as WebSQL, W3QL, StruQL etc.past such as WebSQL, W3QL, StruQL etc.
These models are not incorporate with the notions ordering These models are not incorporate with the notions ordering
and ranking.and ranking.
At implementation level, these systems are intended for At implementation level, these systems are intended for
“online” queries for Web-Site Management as opposed to our “online” queries for Web-Site Management as opposed to our
“warehouse” model.“warehouse” model.
19.11.200319.11.2003 Complex Queries over Web RepositoriesComplex Queries over Web Repositories 4343
CONCLUSIONCONCLUSION
We addressed the problem of formulating and executing We addressed the problem of formulating and executing
complex queries over Web repositories.complex queries over Web repositories.
We showed that the key characteristics of Web queries are We showed that the key characteristics of Web queries are
the combination of navigation, text search and relational the combination of navigation, text search and relational
operators which can manipulate ordering and ranking.operators which can manipulate ordering and ranking.
Finally we discussed some of the optimization techniques to Finally we discussed some of the optimization techniques to
execute such queries more efficiently.execute such queries more efficiently.
THANK YOUTHANK YOU