niagaracq: a scalable continuous query system for internet databases
DESCRIPTION
NiagaraCQ: A Scalable Continuous Query System for Internet Databases. CS561 Presentation Xiaoning Wang. Presentation Outline. Why NiagaraCQ? What’s NiagaraCQ? Details Experiments. We need Continuous Queries. - PowerPoint PPT PresentationTRANSCRIPT
1
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
CS561 Presentation
Xiaoning Wang
2
Presentation Outline
Why NiagaraCQ?
What’s NiagaraCQ?
Details
Experiments
3
We need Continuous Queries
Allows users to obtain new results from a database without having to issue the same query repeatedly.
Especially useful in an environment like the Internet comprised of large amounts of frequently changing information.
Need to be able to support millions of queries due to the scale of the Internet.
4
What’s NiagaraCQ?
A distributed database system for querying distributed XML data sets using a query language like XML-QL.
It allows a very large number of users to be able to register continuous queries in a high-level query language such as XML-QL.
5
Anything else?
Novelty:
query optimization strategy
6
What’s the Approach
Group!!! Assumption
?
7
Advantages of Grouping
Group optimization has the following benefits: Grouped queries can share computations. Common execution plans of grouped queries
can reside in memory, significantly saving on I/O costs compared to executing each query separately.
Grouping makes it possible to test the “firing” conditions of many continuous queries together, avoiding unnecessary invocations.
8
Can this traditional grouping tech be applied
into Continuous Query directly?
NO!!!
9
Why not?
Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries.
Not applicable to a continuous query system for the following reasons:
Computationally too expensive to handle a large number of queries.
Not designed for an environment like the web.
10
Incremental group optimization approach
Uniqueness of NiagaraCQ
11
NiagaraCQ
Supports scalable continuous query processing over multiple, distributed XML files by deploying incremental group optimization.
12
It’s a novel approach in which queries are grouped according to their signatures.
Groups are created for existing queries according to their signatures, which represent similar structures among the queries.
Groups allow the common parts of two or more queries to be shared.
Each individual query in a query group shares the results from the execution of the group plan.
Incremental group optimization
13
Cont.
When a new query arrives, existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system.
New query is merged into existing groups whose signatures match that of the query.
Existing queries are not re-grouped
14
query-split scheme
Incremental group optimization scheme employs a query-split scheme.
After the signature of a new query is matched, the sub-plan corresponding to the signature is replaced with a scan of the output file produced by the matching group.
Optimization process then continues with the remainder of the query tree is a bottom-up fashion until the entire query has been analyzed.
If no group “matches” a signature of the new query, a new query group for this signature is created in the system.
Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries.
15
Cont.
Advantages for query-split scheme: Main advantage is that it can be
implemented using a general query engine with only minor modifications.
Anther advantage is that the approach is very scalable.
16
Cont.
Since queries are continuous being added and removed from groups, over time the quality of the group can deteriorate, leading to a reduction in the overall performance of the system.
One or more groups may require “dynamic regrouping” to re-establish their effectiveness.
17
Types of Continuous Queries
Change-base continuous queries Fired as soon as new relevant data becomes
available. i.e. In stocks, day-traders would probably
want to know the desired price information immediately.
Provide better response time, but waste system resources when instantaneous answers are not really required.
18
Types of Continuous Queries
Timer-based continuous queries Executed only as time intervals specified by the
submitting user. i.e. In stocks, long-term investors may be satisfied
being notified every hour. Can be supported more efficiently, thus query
systems that support these continuous queries should be more scalable.
However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change-based queries.
19
NiagaraCQ (cont.)
Considering only the changed portion of each updated XML file and not the entire file.
Monitor and detect data source changes using both push and pull models on heterogeneous sources.
Caching mechanism is used to obtain good performance with limited amounts of memory.
20
Using Expression Signature
Represent the same syntax structure, but possibly different constant values, in different queries.
Specific implementation of the signature concept.
21
Group Group signature
Quotes.Quote.Symbolin quotes.xml
constant
=
22
Group Group constant table
…. ….INTC Dest.i
MSFT Dest.j
…. ….
Constant_value Destination_buffer
23
Group plan Previous query plan
Quotes.xml Quotes.xml
File Scan File Scan
Select Symbol=“INT
C”
Select Symbol=“INT
C”
Trigger Action I Trigger Action J
24
Group plan New plan
Quotes.xml Constant Table
File Scan File Scan
Join
Trigger Action I Trigger Action J
Symbol=Constant_value
Split
……
25
Buffer design
The destination buffer for the split operator is needed.
Pipelined scheme
Intermediate scheme
26
Pipelined scheme
Split
Operator
…….
buffer
Operator
buffer
27
Disadvantage of pipeline
Such a scheme doesn’t work for grouping timer-based CQ’s. It’s difficult for a split operator to determine which tuple should be stored and how long they should be stored for.
The query structure is a directed graph, not a tree and hence the plan may be too complicated for a general XML-QL query engine to execute
The combine plan may be very large requires resources beyond the limits of system.
A large portion of the query plan may not need to be executed at each query invocation.
One query may block many other queries.
28
Materialized Intermediate Files
Query Split with Materialized Intermediate Files
Split
Trig.Act.J
File Scan
File Scan
Trig.Act.J
….
File_i File_j
29
Advantages for this design
Every query is scheduled independently. Only the necessary queries are executed.
Queries after the Split operator will be in standard, tree-structured query format and thus can be scheduled and executed by a general query engine.
Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources.
This approach handles intermediate files and original data files uniformly.
The potential bottleneck of pipelined approach is avoided.
30
General Selection Predicates
Format: “Attribute op Constant” Attribute is a path expression without wildcards in it. Op includes “-”, “<”, “>”, because these formats
dominate in the selection queries.
31
Problem for range-query groups
One potential problem for range-query groups is that the intermediate files may contain a large number of duplicated tuples because range predicated of the different queries might overlap.
32
Solution: Virtual Intermediate Files
Split
Trig.Act.J
File Scan
File Scan
Trig.Act.J
….
Virtual Intermediate File_i VI File_i
Real Intermediate File
……
Only one
33
Join Operators Join operators are usually expensive, sharing
common join operations can significantly reduce the amount of computation.
34
Join first? Or selection first? Advantage
Disadvantage
Solution: choose a better one based on a cost model
35
Timer-based Continuous Queries
Grouped in the same way as change-based queries except that the time information needs to be recorded at installation time.
Challenges Hard to monitor the timer events of those queries. Sharing the common computation becomes
difficult due to the various time intervals. Timer-based continuous queries fires at
specific times, but only if the corresponding input files have not been modified.
36
Incremental Evaluation
Incremental evaluation allows queries to be invoked only on the changed data.
For each file, on which CQ’s are defined, NiagaraCQ keeps a “delta file” that contains recent changes.
Queries are run over the delta files whenever possible instead of their original files.
A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query’s last firing time.
37
Memory Caching
Caching is used to obtain good performance with a limited amount of memory.
Caches query plans, system data structures, and data files for better performance.
38
What should be cached? Grouped query plans. We assume
that the number of query groups is relatively small.
Recently accessed file. The event list for monitoring the
timer-based events. But it can be large, so we keep only a “time window” of this list
39
Experimental Results
Hardware Settings Conducted on a Sun Ultra 6000 with 1GB of
RAM, running JDK1.2 on Solaris 2.6. Data Sets
Database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”.
“quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of the file is about 2 MB.
“companies.xml” contains related company information. The size of the file is about 1 MB.
40
Comparison
41
Comparison
42
System Status and Future Work
A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager.
Incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability.
Intend to extent incremental group optimization to queries containing operators other than selection and join (i.e. aggregation).