niagaracq: a scalable continuous query system for internet databases

1

NiagaraCQ: A Scalable Continuous Query System for Internet Databases

CS561 Presentation

Xiaoning Wang

2

Presentation Outline

Why NiagaraCQ?

What’s NiagaraCQ?

Details

Experiments

3

We need Continuous Queries

Allows users to obtain new results from a database without having to issue the same query repeatedly.

Especially useful in an environment like the Internet comprised of large amounts of frequently changing information.

Need to be able to support millions of queries due to the scale of the Internet.

4

What’s NiagaraCQ?

A distributed database system for querying distributed XML data sets using a query language like XML-QL.

It allows a very large number of users to be able to register continuous queries in a high-level query language such as XML-QL.

5

Anything else?

Novelty:

query optimization strategy

6

What’s the Approach

Group!!! Assumption

?

7

Advantages of Grouping

Group optimization has the following benefits: Grouped queries can share computations. Common execution plans of grouped queries

can reside in memory, significantly saving on I/O costs compared to executing each query separately.

Grouping makes it possible to test the “firing” conditions of many continuous queries together, avoiding unnecessary invocations.

8

Can this traditional grouping tech be applied

into Continuous Query directly?

NO!!!

9

Why not?

Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries.

Not applicable to a continuous query system for the following reasons:

Computationally too expensive to handle a large number of queries.

Not designed for an environment like the web.

10

Incremental group optimization approach

Uniqueness of NiagaraCQ

11

NiagaraCQ

Supports scalable continuous query processing over multiple, distributed XML files by deploying incremental group optimization.

12

It’s a novel approach in which queries are grouped according to their signatures.

Groups are created for existing queries according to their signatures, which represent similar structures among the queries.

Groups allow the common parts of two or more queries to be shared.

Each individual query in a query group shares the results from the execution of the group plan.

Incremental group optimization

13

Cont.

When a new query arrives, existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system.

New query is merged into existing groups whose signatures match that of the query.

Existing queries are not re-grouped

14

query-split scheme

Incremental group optimization scheme employs a query-split scheme.

After the signature of a new query is matched, the sub-plan corresponding to the signature is replaced with a scan of the output file produced by the matching group.

Optimization process then continues with the remainder of the query tree is a bottom-up fashion until the entire query has been analyzed.

If no group “matches” a signature of the new query, a new query group for this signature is created in the system.

Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries.

15

Cont.

Advantages for query-split scheme: Main advantage is that it can be

implemented using a general query engine with only minor modifications.

Anther advantage is that the approach is very scalable.

16

Cont.

Since queries are continuous being added and removed from groups, over time the quality of the group can deteriorate, leading to a reduction in the overall performance of the system.

One or more groups may require “dynamic regrouping” to re-establish their effectiveness.

17

Types of Continuous Queries

Change-base continuous queries Fired as soon as new relevant data becomes

available. i.e. In stocks, day-traders would probably

want to know the desired price information immediately.

Provide better response time, but waste system resources when instantaneous answers are not really required.

18

Types of Continuous Queries

Timer-based continuous queries Executed only as time intervals specified by the

submitting user. i.e. In stocks, long-term investors may be satisfied

being notified every hour. Can be supported more efficiently, thus query

systems that support these continuous queries should be more scalable.

However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change-based queries.

19

NiagaraCQ (cont.)

Considering only the changed portion of each updated XML file and not the entire file.

Monitor and detect data source changes using both push and pull models on heterogeneous sources.

Caching mechanism is used to obtain good performance with limited amounts of memory.

20

Using Expression Signature

Represent the same syntax structure, but possibly different constant values, in different queries.

Specific implementation of the signature concept.

21

Group Group signature

Quotes.Quote.Symbolin quotes.xml

constant

=

22

Group Group constant table

…. ….INTC Dest.i

MSFT Dest.j

…. ….

Constant_value Destination_buffer

23

Group plan Previous query plan

Quotes.xml Quotes.xml

File Scan File Scan

Select Symbol=“INT

C”

Select Symbol=“INT

C”

Trigger Action I Trigger Action J

24

Group plan New plan

Quotes.xml Constant Table

File Scan File Scan

Join

Trigger Action I Trigger Action J

Symbol=Constant_value

Split

……

25

Buffer design

The destination buffer for the split operator is needed.

Pipelined scheme

Intermediate scheme

26

Pipelined scheme

Split

Operator

…….

buffer

Operator

buffer

27

Disadvantage of pipeline

Such a scheme doesn’t work for grouping timer-based CQ’s. It’s difficult for a split operator to determine which tuple should be stored and how long they should be stored for.

The query structure is a directed graph, not a tree and hence the plan may be too complicated for a general XML-QL query engine to execute

The combine plan may be very large requires resources beyond the limits of system.

A large portion of the query plan may not need to be executed at each query invocation.

One query may block many other queries.

28

Materialized Intermediate Files

Query Split with Materialized Intermediate Files

Split

Trig.Act.J

File Scan

File Scan

Trig.Act.J

….

File_i File_j

29

Advantages for this design

Every query is scheduled independently. Only the necessary queries are executed.

Queries after the Split operator will be in standard, tree-structured query format and thus can be scheduled and executed by a general query engine.

Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources.

This approach handles intermediate files and original data files uniformly.

The potential bottleneck of pipelined approach is avoided.

30

General Selection Predicates

Format: “Attribute op Constant” Attribute is a path expression without wildcards in it. Op includes “-”, “<”, “>”, because these formats

dominate in the selection queries.

31

Problem for range-query groups

One potential problem for range-query groups is that the intermediate files may contain a large number of duplicated tuples because range predicated of the different queries might overlap.

32

Solution: Virtual Intermediate Files

Split

Trig.Act.J

File Scan

File Scan

Trig.Act.J

….

Virtual Intermediate File_i VI File_i

Real Intermediate File

……

Only one

33

Join Operators Join operators are usually expensive, sharing

common join operations can significantly reduce the amount of computation.

34

Join first? Or selection first? Advantage

Disadvantage

Solution: choose a better one based on a cost model

35

Timer-based Continuous Queries

Grouped in the same way as change-based queries except that the time information needs to be recorded at installation time.

Challenges Hard to monitor the timer events of those queries. Sharing the common computation becomes

difficult due to the various time intervals. Timer-based continuous queries fires at

specific times, but only if the corresponding input files have not been modified.

36

Incremental Evaluation

Incremental evaluation allows queries to be invoked only on the changed data.

For each file, on which CQ’s are defined, NiagaraCQ keeps a “delta file” that contains recent changes.

Queries are run over the delta files whenever possible instead of their original files.

A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query’s last firing time.

37

Memory Caching

Caching is used to obtain good performance with a limited amount of memory.

Caches query plans, system data structures, and data files for better performance.

38

What should be cached? Grouped query plans. We assume

that the number of query groups is relatively small.

Recently accessed file. The event list for monitoring the

timer-based events. But it can be large, so we keep only a “time window” of this list

39

Experimental Results

Hardware Settings Conducted on a Sun Ultra 6000 with 1GB of

RAM, running JDK1.2 on Solaris 2.6. Data Sets

Database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”.

“quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of the file is about 2 MB.

“companies.xml” contains related company information. The size of the file is about 1 MB.

40

Comparison

41

Comparison

42

System Status and Future Work

A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager.

Incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability.

Intend to extent incremental group optimization to queries containing operators other than selection and join (i.e. aggregation).

niagaracq: a scalable continuous query system for internet databases

Documents

new query group

individual query

query tree

entire query

existing queries

smaller queries

millions of queries

highlevel query language