
Constraint Processing Techniques for Improving Join Computation:

A Proof of Concept

Anagh Lal & Berthe Y. Choueiry

Constraint Systems Laboratory

Department of Computer Science & Engineering

University of Nebraska-Lincoln

An illustrative example

R2

A B C

1 12 23

1 13 23

1 14 23

1 15 23

2 10 25

3 17 20

3 18 22

4 10 25

5 12 23

5 13 23

5 14 23

5 15 23

6 13 27

6 14 27

8 14 28

Join query

SELECT R1.A, R1.B, R1.C
FROM R1, R2
WHERE R1.A = R2.A
AND R1.B = R2.B
AND R1.C = R2.C

Result: 10 tuples in 3 nested tuples

R1 Join R2 (Compacted)

A B C

{1, 5} {12, 13, 14} {23}

{2, 4} {10} {25}

{6} {13, 14} {27}

R1

A B C

1 12 23

1 13 23

1 14 23

2 10 25

3 16 30

3 16 24

4 10 25

5 12 23

5 13 23

5 14 23

6 13 27

6 14 27

7 14 28

7 19 20
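The example can be checked mechanically. The following is a minimal Python sketch, assuming each relation is held as a list of (A, B, C) tuples; the names `flat_join`, `compacted` and `expand` are illustrative and do not come from the slides. It computes the flat join and confirms that expanding the three nested tuples yields the same ten tuples.

```python
from itertools import product

R1 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (2, 10, 25), (3, 16, 30),
      (3, 16, 24), (4, 10, 25), (5, 12, 23), (5, 13, 23), (5, 14, 23),
      (6, 13, 27), (6, 14, 27), (7, 14, 28), (7, 19, 20)]
R2 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (1, 15, 23), (2, 10, 25),
      (3, 17, 20), (3, 18, 22), (4, 10, 25), (5, 12, 23), (5, 13, 23),
      (5, 14, 23), (5, 15, 23), (6, 13, 27), (6, 14, 27), (8, 14, 28)]

# With equality conditions on A, B and C, the join result is exactly
# the set of tuples that appear in both relations.
flat_join = sorted(set(R1) & set(R2))

# The compacted result from the slide: one set of values per attribute.
compacted = [({1, 5}, {12, 13, 14}, {23}),
             ({2, 4}, {10}, {25}),
             ({6}, {13, 14}, {27})]

def expand(nested_tuples):
    """Expand nested tuples into flat tuples (Cartesian product per nested tuple)."""
    return sorted(t for nested in nested_tuples for t in product(*nested))

print(len(flat_join), "flat tuples,", len(compacted), "nested tuples")
```

Each nested tuple stands for the Cartesian product of its value sets, which is why three nested tuples can carry ten flat tuples.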

Advantages

Direct
• Savings in the number of tuple comparisons
• Savings in I/O for the next operator
• Space reduction for materialized join queries

Future applications
• Use for query size estimation
• Assist in high-level analysis of data & in data mining

Our contributions

A new representation of a join query as a Constraint Satisfaction Problem (CSP)

A new sorting-based bundling algorithm
• Suitable for CSPs with fewer and larger constraints (i.e., join)
• Improves memory usage

A new sort-merge join algorithm for producing (dynamically) bundled tuples
• Yields compact representation, saves memory space

Identification of possible applications (suggested, not yet demonstrated)
• Data analysis
• Materialized views
• Assisting query-size estimation

Constraint Satisfaction Problem

Given P = (V, D, C)
• V = {V_i}, a set of variables
• D = {D_Vi}, the set of their respective domains
• C, a set of constraints restricting the acceptable combinations of values for variables
• Solution: a consistent assignment of values to variables

Query: find 1 solution, all solutions, etc.

[Figure: constraint graph over variables V1, V2, V3, V4, with domains {d}, {a, b, d}, {a, b, c} and {c, d, e, f}]

Solving CSPs

Typically, DFS & backtracking

Improvement
• Static bundling [Freuder 91]
• Dynamic bundling [our group]
– Based on dynamically identifying symmetries
– Guaranteed never less efficient than non-bundling or static bundling
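As a point of reference for the plain (non-bundling) case, here is a minimal sketch of depth-first search with backtracking that enumerates all solutions. The toy instance at the bottom, including the not-equal constraint between V1 and V2, is an assumption made for illustration and is not read off the slide's figure; `solve_all` is a hypothetical name.

```python
def solve_all(variables, domains, constraints):
    """Depth-first search with chronological backtracking; yields every solution.

    constraints: list of (scope, predicate) pairs, where scope is a tuple of
    variable names and predicate receives their values in the same order.
    """
    def consistent(assignment):
        # Only check constraints whose variables are all assigned.
        for scope, predicate in constraints:
            if all(v in assignment for v in scope):
                if not predicate(*(assignment[v] for v in scope)):
                    return False
        return True

    def dfs(depth, assignment):
        if depth == len(variables):
            yield dict(assignment)
            return
        var = variables[depth]
        for value in domains[var]:
            assignment[var] = value
            if consistent(assignment):
                yield from dfs(depth + 1, assignment)
            del assignment[var]  # undo the assignment (backtrack)

    yield from dfs(0, {})

# Assumed toy instance (illustrative only).
domains = {"V1": ["d"], "V2": ["c", "d", "e", "f"]}
constraints = [(("V1", "V2"), lambda x, y: x != y)]
solutions = list(solve_all(["V1", "V2"], domains, constraints))
print(solutions)
```

Without bundling, every value of V2 is explored on its own branch; bundling aims to group values that lead to equivalent subtrees.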

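[Figure: three search trees rooted at S over V1 and V2, with V1 = d in each. Without bundling: V2 branches on c, e, f and d separately. Static bundling: V2 branches on c and on {d, e, f}. Dynamic bundling: V2 branches on c, on {e, f} and on d. The constraint graph of the previous slide is repeated alongside.]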

Modeling Join as a CSP

Attributes of relations → CSP variables
Attribute values → variable domains
Relations → relational constraints
Join conditions → join-condition constraints

SELECT R1.A, R1.B, R1.C
FROM R1, R2
WHERE R1.A = R2.A
AND R1.B = R2.B
AND R1.C = R2.C
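The mapping can be made concrete with a small sketch. The dictionary layout and the name `model_join` are assumptions chosen for illustration; the slides do not prescribe a data structure. The relations used here are short excerpts of R1 and R2.

```python
def model_join(relations, attributes, join_pairs):
    """Build a CSP description of an equi-join.

    relations:  dict mapping relation name -> list of tuples
    attributes: attribute names shared by all relations, in column order
    join_pairs: pairs of variables that the join conditions force to be equal
    """
    csp = {"variables": [], "domains": {}, "relational": {},
           "join_conditions": list(join_pairs)}
    for name, tuples in relations.items():
        scope = tuple(f"{name}.{a}" for a in attributes)
        csp["variables"].extend(scope)            # attributes -> CSP variables
        for i, var in enumerate(scope):           # attribute values -> domains
            csp["domains"][var] = sorted({t[i] for t in tuples})
        # The relation itself is a constraint: the set of allowed tuples.
        csp["relational"][name] = (scope, set(tuples))
    return csp

relations = {"R1": [(1, 12, 23), (2, 10, 25)],
             "R2": [(1, 12, 23), (3, 17, 20)]}
join_pairs = [("R1.A", "R2.A"), ("R1.B", "R2.B"), ("R1.C", "R2.C")]
csp = model_join(relations, ["A", "B", "C"], join_pairs)
print(csp["variables"])
```

The resulting CSP has few constraints (two relational, three join-condition), each of which can be large, which is the setting the sorting-based bundling targets.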

Sorting-based bundling

Heuristic for variable ordering: place variables linked by join conditions as close to each other as possible

Resulting ordering: R1.A, R2.A, R1.B, R2.B, R1.C, R2.C

Sort relations R1 and R2 using the above ordering

Next: compute bundles of the first variable in the ordering (R1.A)
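A minimal sketch of the ordering heuristic and the sort, under the assumption that variables are named `Relation.Attribute` and that join conditions are given as pairs; `variable_ordering` and `sort_relation` are hypothetical helper names.

```python
def variable_ordering(join_pairs):
    """Place the two variables of each join condition next to each other."""
    order = []
    for left, right in join_pairs:
        for var in (left, right):
            if var not in order:
                order.append(var)
    return order

def sort_relation(tuples, attributes, relation, order):
    """Sort a relation lexicographically, following the variable ordering."""
    columns = [attributes.index(var.split(".")[1])
               for var in order if var.startswith(relation + ".")]
    return sorted(tuples, key=lambda t: tuple(t[c] for c in columns))

attributes = ["A", "B", "C"]
join_pairs = [("R1.A", "R2.A"), ("R1.B", "R2.B"), ("R1.C", "R2.C")]
order = variable_ordering(join_pairs)
r1_sorted = sort_relation([(3, 16, 30), (1, 13, 23), (3, 16, 24), (1, 12, 23)],
                          attributes, "R1", order)
print(order)
print(r1_sorted)
```

Once a relation is sorted this way, all tuples sharing a value of the first variable are contiguous, so partitions can be read off in a single pass.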

Bundling an attribute

Partition of a constraint: tuples of the relation having the same value of R1.A

Compare projected tuples of the first partition with those of another partition

Compare with every other partition to get the complete bundle

[Figure: R1 with its partitions on A marked. The partitions for A = 1 and A = 5 are symmetric; the partition for A = 2 is unequal to them. Resulting bundle: {1, 5}]

R1

A B C

1 12 23

1 13 23

1 14 23

2 10 25

5 12 23

5 13 23

5 14 23
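The comparison of projected partitions can be sketched as follows. This is an illustrative simplification that groups tuples with a dictionary rather than scanning a sorted relation; `partitions` and `bundle_of` are hypothetical names.

```python
from collections import defaultdict

def partitions(tuples):
    """Group tuples by their first attribute, keeping only the projection
    on the remaining attributes."""
    parts = defaultdict(list)
    for first, *rest in tuples:
        parts[first].append(tuple(rest))
    return parts

def bundle_of(value, tuples):
    """Return all values whose partition projects to the same tuples
    as the partition of `value`."""
    parts = partitions(tuples)
    reference = sorted(parts[value])
    return {v for v, projection in parts.items()
            if sorted(projection) == reference}

R1 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (2, 10, 25),
      (5, 12, 23), (5, 13, 23), (5, 14, 23)]
print(bundle_of(1, R1))
```

Values 1 and 5 end up in the same bundle because their partitions are identical once column A is projected away, while the partition for 2 differs.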

Join using dynamic bundling

[Flowchart. Steps: Start; Select next variable; Compute next valid bundle; Assign bundle; Output one tuple; Move to previous variable; Undo previous assignment; Stop. Decisions (Yes/No): Found bundle?; Last variable?; 1st in ordering?]
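The search in the flowchart can be approximated by a short recursive sketch. Note the substitutions: it uses recursion in place of the explicit "move to previous variable / undo assignment" steps, and dictionary grouping in place of the sorted, partition-by-partition comparison, so it illustrates the idea rather than the authors' algorithm. It also assumes both relations share the same attributes in the same order and that every attribute is an equality join attribute. The names `bundles` and `bundled_join` are hypothetical.

```python
from collections import defaultdict

def bundles(tuples):
    """Bundle the values of the first attribute: values whose partitions
    project identically on the remaining attributes share a bundle.
    Returns a list of (set_of_values, shared_projection)."""
    parts = defaultdict(set)
    for first, *rest in tuples:
        parts[first].add(tuple(rest))
    grouped = defaultdict(set)
    for value, projection in parts.items():
        grouped[frozenset(projection)].add(value)
    return [(values, sorted(projection)) for projection, values in grouped.items()]

def bundled_join(r1, r2):
    """Equi-join on all attributes; yields nested tuples (one value set per attribute)."""
    arity = len(r1[0]) if r1 else 0
    for values1, rest1 in bundles(r1):
        for values2, rest2 in bundles(r2):
            common = values1 & values2      # surviving values of the bundle
            if not common:
                continue                    # no valid bundle for this pair
            if arity == 1:                  # last variable: output one nested tuple
                yield (common,)
            else:                           # move on to the next variable
                for tail in bundled_join(rest1, rest2):
                    yield (common,) + tail

R1 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (2, 10, 25), (3, 16, 30),
      (3, 16, 24), (4, 10, 25), (5, 12, 23), (5, 13, 23), (5, 14, 23),
      (6, 13, 27), (6, 14, 27), (7, 14, 28), (7, 19, 20)]
R2 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (1, 15, 23), (2, 10, 25),
      (3, 17, 20), (3, 18, 22), (4, 10, 25), (5, 12, 23), (5, 13, 23),
      (5, 14, 23), (5, 15, 23), (6, 13, 27), (6, 14, 27), (8, 14, 28)]
result = list(bundled_join(R1, R2))
for nested in result:
    print(nested)
```

On the illustrative example, the sketch yields three nested tuples, one for each of the bundles {1, 5}, {2, 4} and {6} of attribute A.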

Finding the valid bundle

R1

A B C

1 12 23

1 13 23

1 14 23

2 10 25

3 16 30

3 16 24

R2

A B C

1 12 23

1 13 23

1 14 23

1 15 23

2 10 25

3 17 20

[Figure: two bundles, {1, 5, x} and {1, 5, y, z}; Common: {1, 5}]

1. Compute a bundle for the attribute
2. Check bundle validity with future constraints
3. If no common value found, GOTO 1

Assign variable with the surviving values in the bundle
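The three steps can be sketched as a loop over candidate bundles. This assumes equality join conditions, so that the validity check reduces to set intersection; `next_valid_bundle` is a hypothetical name, and the strings "x", "y" and "z" stand in for the unspecified values on the slide.

```python
def next_valid_bundle(candidate_bundles, future_bundles):
    """Return the surviving values of the first candidate bundle that shares
    at least one value with a bundle of the future (joined) variable,
    or None when no candidate is valid."""
    for bundle in candidate_bundles:          # step 1: take the next bundle
        for other in future_bundles:          # step 2: check against the other relation
            surviving = bundle & other
            if surviving:
                return surviving              # assign the surviving values
        # step 3: no common value found, try the next candidate bundle
    return None

r1_bundles = [{7}, {1, 5, "x"}]
r2_bundles = [{1, 5, "y", "z"}, {8}]
print(next_valid_bundle(r1_bundles, r2_bundles))
```

Only the values common to both bundles are assigned, so x, y and z are dropped and the variable receives {1, 5}.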

Analysis of overheads

For bundling
• Additional data structures: 2 arrays, 1 pointer
• Only 1 array may become cumbersome

Array size is largest
• when all the values of a variable are in one bundle
• But this case also leads to the best savings!

Improved implementation
• Use of bitmaps?

Progressive Merge Join

PMJ: a sort-merge algorithm by [Dittrich et al. 03]

Provides early results
• Assists in query size estimation

Two main phases
• Sorting: starts producing results in this phase
• Merging: merges sorted runs

We use the framework of the PMJ for our external join.

Implemented & evaluated with the XXL library
• We use the same library for our implementation

Preliminary experiments

Data sets
• Random: 2 relations R1, R2 with same schema as example
– Each relation: 10’000 tuples
– Memory size: 4’000 tuples
– Page size: 200 tuples
• Real-world problem: 3 relations, 4 attributes

Compaction rate achieved
• Random problem: 1.48
– Savings compensate for even worst case (of the current experimental implementation)
• Real-world problem: 2.26 (69 tuples in 32 nested tuples)

Related work

Join algorithms
• Well-established algorithms
• Do not focus on exploiting symmetry

Database compression
• Output results are not compressed
• Compression at value level, not tuple level

Related work (contd)

[Mamoulis & Papadias 1998]
• Join using FC for spatial DB
• Restricted to binary constraints
• No compaction of solution space

[Bayardo et al. 1996]
• Reduce the number of intermediate tuples of a sequence of joins

[Rich et al. 1993]
• Do not compact join attribute values
• Do not detect redundancy present in the grouped sub-relations

Future work

Refine implementation
• Use of lighter data structures

Test usefulness in the context of Constraint DBs
• Values are continuous intervals, e.g., spatial databases

Conduct thorough evaluations of overall performance & overhead (memory & CPU) on different data distributions

Investigate benefit of using bundling for
• query size estimation
• materialized views

Research supported by CAREER Award #0133568 from NSF

DB vs. CSP terminology

Bundling relations: Data structures

Considering the portion of the relation in memory

Current-Inst: stores the current instantiations of past variables Vp of R1

Current-Constraint: selection of R’ such that
• Past variable values equal Current-Inst
• Current variable Vc > all previous instantiations of Vc

Bundling relations: Computing bundles (Algorithm 1)

NEXT-PARTITION(p) returns the first unchecked partition in Current-Constraint following the partition p.

Sorted constraints → checking equality of tuples is efficient

Bundling relations: Data structures

Processed-Values: cumulatively stores non-representative values of bundles
• When computing bundles of Vc, values of Vc already in it are ignored

Partition p is marked as checked when:
• Value(p) is in an instantiation bundle
• p is selected for comparing with other partitions to check for bundles
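A possible reading of how these structures interact is sketched below. It is a simplified, in-memory illustration and not a transcription of Algorithm 1: the relation is assumed to be fully sorted and in memory, all bundles are computed eagerly, and `compute_bundles` and `processed_values` are hypothetical names. The first value of each bundle plays the role of the representative.

```python
from itertools import groupby

def compute_bundles(sorted_tuples):
    """Enumerate the bundles of the first attribute of a sorted relation."""
    # Partitions: contiguous runs of tuples sharing the first attribute,
    # projected on the remaining attributes.
    parts = [(value, [t[1:] for t in group])
             for value, group in groupby(sorted_tuples, key=lambda t: t[0])]
    processed_values = set()  # non-representative values already placed in a bundle
    result = []
    for i, (value, projection) in enumerate(parts):
        if value in processed_values:
            continue  # partition was already checked as part of an earlier bundle
        bundle = [value]  # this partition is selected for comparison
        for other_value, other_projection in parts[i + 1:]:
            # Only unchecked partitions that follow the current one are compared.
            if other_value not in processed_values and other_projection == projection:
                bundle.append(other_value)
                processed_values.add(other_value)
        result.append(bundle)
    return result

R1 = [(1, 12, 23), (1, 13, 23), (1, 14, 23), (2, 10, 25), (3, 16, 30),
      (3, 16, 24), (4, 10, 25), (5, 12, 23), (5, 13, 23), (5, 14, 23),
      (6, 13, 27), (6, 14, 27), (7, 14, 28), (7, 19, 20)]
all_bundles = compute_bundles(sorted(R1))
print(all_bundles)
```

Because the relation is sorted, two partitions can be compared with a plain list equality test, and a value recorded in `processed_values` is never reconsidered as the start of a new bundle.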

Join computation: In memory

Two subsets of relations (some pages) in memory:
• Algorithm to find the result of joining the two
• Join computed as a search
– Finding all solutions
• After finding one solution, search resumes from same depth
– Algorithm shown can be entered at any “depth” in the search
• Uses Algorithm 1 to find bundles for assigning to variables

Join computation: In memory

Join as a search (Algo. 2)

BACKTRACK
• Variable[depth] in Current-Inst reset
• Processed-Values for the variable emptied
• Value in Current-Solution reset
• Current-Constraint re-computed

Undoes the effects of the previous instantiation.

[Figure label: expanded on next slide]

Join computation: In memory

COMMON(bi, bundles)
• Returns the subset of bi that is consistent with the join-condition constraints
• For equality, COMMON ≡ intersection
• Empty result of COMMON → inconsistency → BACKTRACK
