TIME 2002, Manchester, UK Index Based Processing of Semi-Restrictive Temporal Joins Donghui Zhang, Vassilis J. Tsotras University of California, Riverside


Index Based Processing of Semi-Restrictive Temporal Joins

Donghui Zhang, Vassilis J. Tsotras University of California, Riverside


Contents

• Background
• Join problem definition
• Straightforward approaches
• Proposed join algorithms
• Performance study
• Conclusions


Background

• Temporal record: (key, time interval) and some attributes.
• TE-Join: two records qualify for the join if
  – their time intervals intersect; and
  – their keys are equal.
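The TE-Join condition can be written down as a small predicate. The following is only a minimal sketch: the `Record` type, the helper names, and the choice of closed integer intervals are assumptions made for illustration and do not come from the slides.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical temporal record: a key plus a closed interval [start, end]."""
    key: int
    start: int
    end: int


def intersects(a: Record, b: Record) -> bool:
    # Two closed intervals intersect when each one starts before the other ends.
    return a.start <= b.end and b.start <= a.end


def te_join(xs: List[Record], ys: List[Record]) -> List[Tuple[Record, Record]]:
    # Nested-loop TE-Join: keys must be equal AND intervals must intersect.
    return [(x, y) for x in xs for y in ys if x.key == y.key and intersects(x, y)]
```

A nested loop is used only to state the condition; it says nothing about how an index-based algorithm would evaluate it.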


Background

• Our earlier work [ICDE02] solved a general TE-Join (GTE-Join), where portions from each relation are joined:
  – the portion is selected via a range-interval selection: record keys should be in range r and time intervals should intersect interval i.
  – interesting because (1) temporal relations are large; (2) TE-Join is a special case, when r and i are (-∞, +∞).
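A range-interval selection followed by the key-and-time join can be sketched as follows. This is an illustrative sketch, not the [ICDE02] algorithm: it assumes closed intervals, assumes the same r and i are applied to both relations, and uses hypothetical names (`Record`, `select`, `gte_join`).

```python
import math
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical temporal record: a key plus a closed interval [start, end]."""
    key: float
    start: float
    end: float


Range = Tuple[float, float]  # closed (low, high) pair, used for both r and i


def select(records: List[Record], r: Range, i: Range) -> List[Record]:
    # Range-interval selection: key inside r, time interval intersecting i.
    return [rec for rec in records
            if r[0] <= rec.key <= r[1] and rec.start <= i[1] and i[0] <= rec.end]


def gte_join(xs: List[Record], ys: List[Record], r: Range, i: Range):
    # Join only the selected portions: equal keys and intersecting intervals.
    sx, sy = select(xs, r, i), select(ys, r, i)
    return [(x, y) for x in sx for y in sy
            if x.key == y.key and x.start <= y.end and y.start <= x.end]


# With r and i unbounded, the selection keeps everything, so this is a plain TE-Join.
EVERYTHING: Range = (-math.inf, math.inf)
```

Calling `gte_join(xs, ys, EVERYTHING, EVERYTHING)` illustrates the special case mentioned above.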


Problem Definition

Semi-restrictive joins: the join condition is either key equality (GE-Join) or interval intersection (GT-Join), but not both.

GE-Join: select a subset from X, a subset from Y, and join records from the subsets if their keys are equal.

GT-Join: select a subset from X, a subset from Y, and join records from the subsets if their intervals intersect.
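The two join conditions differ only in the predicate applied to the selected subsets. The sketch below assumes the subsets have already been selected (how is left open here), uses closed intervals, and introduces hypothetical names (`Record`, `ge_join`, `gt_join`).

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical temporal record: a key plus a closed interval [start, end]."""
    key: int
    start: int
    end: int


def ge_join(sx: List[Record], sy: List[Record]) -> List[Tuple[Record, Record]]:
    # GE-Join: pair records with equal keys; time intervals are ignored.
    by_key = defaultdict(list)
    for y in sy:
        by_key[y.key].append(y)
    return [(x, y) for x in sx for y in by_key.get(x.key, [])]


def gt_join(sx: List[Record], sy: List[Record]) -> List[Tuple[Record, Record]]:
    # GT-Join: pair records with intersecting intervals; keys are ignored.
    return [(x, y) for x in sx for y in sy
            if x.start <= y.end and y.start <= x.end]
```

In a GE-Join the two subsets may come from disjoint time periods and still join, which is why the interval test is absent from `ge_join`.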


Problem Definition

GT-Join example: find employees whose last names start with ‘B’ and who co-worked during 1995 with the employees whose last names start with ‘S’.

GE-Join example: find the 1998 IBM employees who were UC Riverside students in 1995.


GT-Join Solutions...


Straightforward Solutions for GT-Join

1. Unsynchronized join.

2. Synchronized join using B+-trees.

3. Synchronized join using R-trees.


Straightforward Solutions for GT-Join

1. Unsynchronized join: separate the selection and join phases; not efficient because:
• the intermediate result that must be stored can be large;
• selection in one relation ignores the data distribution of the other relation.


Straightforward Solutions for GT-Join

2. Synchronized using B+-trees.

If cluster on start:

[Figure: a record y drawn on a time axis running from tmin to tmax]

Not efficient: y needs to be checked against every record whose start is before end of y.

Cluster on end is similar.
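The cost of clustering on start time can be seen with a sorted list standing in for the B+-tree leaf level. This is a rough sketch under stated assumptions: closed intervals, an in-memory list instead of disk pages, and hypothetical names (`Record`, `probe`).

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical temporal record: a key plus a closed interval [start, end]."""
    key: int
    start: int
    end: int


def probe(xs_by_start: List[Record], y: Record) -> Tuple[int, List[Record]]:
    """Return (number of records examined, records that really intersect y).

    xs_by_start must be sorted by start time, as in an index clustered on start.
    """
    starts = [x.start for x in xs_by_start]
    # The sort order only bounds one side: every x with x.start <= y.end is a candidate.
    n_candidates = bisect_right(starts, y.end)
    # The other side (x.end >= y.start) has to be tested record by record.
    matches = [x for x in xs_by_start[:n_candidates] if x.end >= y.start]
    return n_candidates, matches
```

With many short intervals and a probe record near the end of the time axis, almost all records are examined while very few match.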


Straightforward Solutions for GT-Join

3. Synchronized using R-trees.

• Store each record as a two-dimensional interval in the R-tree;
• Use existing R-tree join algorithms [BKS93, HJR97];
• Modifications: (1) integrate the selection condition; (2) join index records as long as they intersect in the time dimension, ignoring the key dimension.
• However, not efficient since R-trees do not handle long intervals well.


Our Solutions

• Synchronized join using temporal indices.
• Multi-version B+-tree (MVBT) [BGO+96]: asymptotically optimal space, update, query.
• We propose three synchronized, MVBT-based join algorithms (apply to other temporal indices as well).


Review of MVBT

• A “forest” of trees: different trees may overlap.
• Root nodes correspond to contiguous, non-intersecting time intervals.
• A record may be stored in multiple pages.
• Efficient range-interval selection algorithms.


Top-down GT-Join

Idea: for each pair of trees, one from each MVBT forest, perform a synchronized tree traversal (STT).

STT for two trees:
• initially, join root nodes;
• to join two nodes, join their children;
• eventually, join elements in leaf pages.

Note that special care is needed to avoid duplicates, since a record has multiple copies.
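The traversal steps can be sketched on a generic tree whose nodes carry a time extent. This is not an MVBT and not the authors' implementation: the `Node` layout, the pruning by time extent, and the use of a result set to drop duplicate pairs are simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Rec = Tuple[int, int, int]  # hypothetical record layout: (key, start, end), closed interval


@dataclass
class Node:
    span: Tuple[int, int]                              # time extent covered by this subtree
    children: List["Node"] = field(default_factory=list)
    records: List[Rec] = field(default_factory=list)   # filled only in leaf nodes


def overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def stt(a: Node, b: Node, out: Set[Tuple[Rec, Rec]]) -> None:
    """Synchronized traversal of two trees, joining on time intersection only."""
    if not overlap(a.span, b.span):
        return                                         # prune subtrees that cannot join
    if not a.children and not b.children:              # two leaves: join their records
        for x in a.records:
            for y in b.records:
                if overlap(x[1:], y[1:]):
                    out.add((x, y))                    # the set absorbs duplicate copies
    elif a.children and b.children:                    # two inner nodes: join children
        for ca in a.children:
            for cb in b.children:
                stt(ca, cb, out)
    elif a.children:                                   # trees of different height
        for ca in a.children:
            stt(ca, b, out)
    else:
        for cb in b.children:
            stt(a, cb, out)
```

Collecting pairs in a set is the simplest way to make the sketch duplicate-free; it keeps the whole result in memory, so a real algorithm would need a cheaper rule.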


Link-based GT-Join

In each leaf page, store a pointer to its predecessor.

[Figure: leaf pages labeled A, B, C, D]

For GT-Join:
• find pairs of data pages that (1) intersect with the right border of the query rectangle; and (2) intersect with each other in time dimension;
• keep such pairs in priority queue;
• sweep left synchronously.


Plane Sweep GT-Join

• Similar to link-based.
• Maintain two priority queues, one for each MVBT.
• At each step, access the leaf page with the largest end time and add records to buffer.
• To add records to buffer, join with existing records from the other MVBT.
• Throw away useless records.
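The sweep can be illustrated in memory at record granularity. This is a simplified sketch, not the MVBT-based algorithm: it sweeps over individual records instead of leaf pages, assumes closed intervals, omits the selection step, and uses hypothetical names (`Record`, `plane_sweep_gt_join`).

```python
import heapq
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical temporal record: a key plus a closed interval [start, end]."""
    key: int
    start: int
    end: int


def plane_sweep_gt_join(xs: List[Record], ys: List[Record]) -> List[Tuple[Record, Record]]:
    # One priority queue per relation, ordered by largest end time first
    # (heapq is a min-heap, so end times are negated).
    heaps = [[(-r.end, n, r) for n, r in enumerate(rel)] for rel in (xs, ys)]
    for h in heaps:
        heapq.heapify(h)
    buffers: List[List[Record]] = [[], []]
    out: List[Tuple[Record, Record]] = []
    while heaps[0] or heaps[1]:
        # Take the record with the largest end time across both queues.
        if not heaps[1] or (heaps[0] and heaps[0][0][0] <= heaps[1][0][0]):
            side = 0
        else:
            side = 1
        rec = heapq.heappop(heaps[side])[2]
        other = 1 - side
        # Buffered records that start after the sweep line cannot intersect this
        # record or any later one, so they are thrown away.
        buffers[other] = [b for b in buffers[other] if b.start <= rec.end]
        # Every surviving buffered record ends at or after rec.end, so it intersects rec.
        for b in buffers[other]:
            out.append((rec, b) if side == 0 else (b, rec))
        buffers[side].append(rec)
    return out
```

Because records arrive in decreasing end-time order, a single comparison against the sweep line both prunes the buffer and decides the join.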


GE-Join Solutions...


GE-Join Solutions...

Similarly, we have:
• unsynchronized
• synchronized using B+-trees
• synchronized using R-trees
• top-down using MVBT
• link-based using MVBT

Note: some of them, especially the link-based algorithm, are quite different due to the different join condition.


Implemented Algorithms

Common to both GT-Join and GE-Join:

Notation     Meaning
mvbt_df      Synchronized MVBT, depth-first
mvbt_bf      Synchronized MVBT, breadth-first
mvbt_link    Synchronized MVBT, link-based
r*_df        Synchronized R*-tree, depth-first
r*_bf        Synchronized R*-tree, breadth-first


Implemented Algorithms

Specific to GT-Join:

Notation     Meaning
mvbt_ps      Synchronized MVBT, plane-sweep
spj          Spatially partitioned join [LOT94]

Specific to GE-Join:

Notation     Meaning
b+           Synchronized B+-tree, index on key
mvbt_sm      Unsynchronized, sort-merge after selection


Experimental Setup

• Implemented in GNU C++.
• Sun Enterprise 250 Server machine with two UltraSPARC-II processors using Solaris 2.8.
• Page size = 8KB.
• Buffer size = 10MB; LRU buffer.
• Each data set: 10 million records.
• R/I ratio: length of query key range divided by length of query time interval. It describes the shape of the query rectangle.


GT-Join Performance

R/I ratio = 10.

[Figure: chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_ps, r*_df, r*_bf, and spj; vertical axis from 0 to 7000.]


GT-Join Performance

R/I ratio = 0.1.

[Figure: chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_ps, r*_df, r*_bf, and spj; vertical axis from 0 to 1500.]


GE-Join Performance

R/I ratio = 10.

[Figure: chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_sm, b+, r*_df, and r*_bf; vertical axis from 0 to 900.]


GE-Join Performance

R/I ratio = 0.1.

[Figure: chart of IO and CPU cost for mvbt_df, mvbt_bf, mvbt_link, mvbt_sm, b+, r*_df, and r*_bf; vertical axis from 0 to 2500.]


Conclusions

• We addressed index-based GT-Join and GE-Join.
• Joins using traditional indices (B+-tree, R-tree) are not efficient.
• We proposed various synchronized approaches based on temporal indices (MVBT).
• Experiments:
  – for GT-Join, link-based and plane-sweep are the best;
  – for GE-Join, link-based and sort-merge are the best;
  – overall, link-based is the best: multi-fold improvement over B+-tree/R-tree joins.
