XJoin: A Reactively-Scheduled Pipelined Join Operator


CS561 - XJoin 1

XJoin: A Reactively-Scheduled Pipelined Join Operator

IEEE Data Engineering Bulletin, 2000

by Tolga Urhan and Michael J. Franklin

Based on a talk prepared by Asima Silva & Leena Razzaq

CS561 - XJoin 2

Goal of XJoin

Efficiently evaluate equi-join in online query processing over distributed data sources

Optimization objectives:
- Small memory footprint
- Fast initial result delivery
- Hiding intermittent delays in data arrival

CS561 - XJoin 3

Outline

- Hash Join History
- Motivation of XJoin
- Challenges in Developing XJoin
- Three Stages of XJoin
- Preventing Duplicates
- Experimental Results
- Conclusion

CS561 - XJoin 4

Classic Hash Join

[Diagram: an in-memory hash table built on R (buckets key1 through key5, each holding R tuples); S tuples 1 through 5 arrive one at a time and probe the table.]

Two phases: build and probe.
Only one table (the build input) is hashed in memory.
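As a concrete illustration, here is a minimal Python sketch of the two phases; the relation names R and S and the key attribute "k" are illustrative, not from the paper.

from collections import defaultdict

def classic_hash_join(build_side, probe_side, key):
    # Build phase: hash every tuple of the (smaller) build input in memory.
    table = defaultdict(list)
    for r in build_side:
        table[r[key]].append(r)
    # Probe phase: stream the probe input past the table, emitting matches.
    for s in probe_side:
        for r in table.get(s[key], []):
            yield (r, s)

# Example: equi-join two tiny relations on attribute "k".
R = [{"k": 1, "a": "r1"}, {"k": 2, "a": "r2"}]
S = [{"k": 2, "b": "s1"}, {"k": 3, "b": "s2"}]
print(list(classic_hash_join(R, S, "k")))  # one match, on k = 2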

CS561 - XJoin 5

Hybrid Hash Join

One table is hashed into partitions kept partly in memory and partly on disk.

G. Graefe, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys, 1993.

[Diagram: R tuples are hashed into buckets; some buckets stay memory-resident while the rest are spilled to disk. Arriving S tuples probe the memory-resident buckets directly and are spilled for the disk-resident ones.]
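A hedged Python sketch of the idea, with lists standing in for the on-disk partition files and an illustrative choice of which partitions stay memory-resident:

from collections import defaultdict

def hybrid_hash_join(R, S, key, n_partitions=4, mem_partitions=frozenset({0})):
    in_mem = defaultdict(list)     # hash table for memory-resident R partitions
    spilled_R = defaultdict(list)  # lists standing in for on-disk files
    spilled_S = defaultdict(list)
    for r in R:
        p = hash(r[key]) % n_partitions
        if p in mem_partitions:
            in_mem[r[key]].append(r)      # build in memory
        else:
            spilled_R[p].append(r)        # spill to "disk"
    for s in S:
        p = hash(s[key]) % n_partitions
        if p in mem_partitions:
            for r in in_mem.get(s[key], []):  # probe immediately
                yield (r, s)
        else:
            spilled_S[p].append(s)        # deferred to the second pass
    # Second pass: join each spilled R partition with its S counterpart.
    for p, rs in spilled_R.items():
        table = defaultdict(list)
        for r in rs:
            table[r[key]].append(r)
        for s in spilled_S.get(p, []):
            for r in table.get(s[key], []):
                yield (r, s)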

CS561 - XJoin 6

Symmetric Hash Join (Pipelined)

Both tables are hashed (both kept in main memory only).

A. Wilschut and P. M. G. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment", DPD 1991.

[Diagram: each arriving tuple is inserted (BUILD) into its own source's in-memory hash table and probes (PROBE) the other source's table; matches go to OUTPUT.]
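A minimal Python sketch of the pipelined behavior, assuming an interleaved arrival stream tagged with its source ("R" or "S"); the names are illustrative.

from collections import defaultdict

def symmetric_hash_join(stream, key):
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, t in stream:
        other = "S" if side == "R" else "R"
        tables[side][t[key]].append(t)           # BUILD into own hash table
        for m in tables[other].get(t[key], []):  # PROBE the other table
            yield (t, m) if side == "R" else (m, t)

# Results are emitted as soon as a matching pair exists in memory.
arrivals = [("R", {"k": 1}), ("S", {"k": 1}), ("R", {"k": 1})]
print(list(symmetric_hash_join(arrivals, "k")))  # two (R, S) matches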

CS561 - XJoin 7

Problem of SHJ:

Memory intensive:
- Won't work for large input streams.
- Won't allow many joins to be processed in a pipeline (or even in parallel).

CS561 - XJoin 8

New Problem in Online Query Processing over Distributed Data Sources

Unpredictable data access due to link congestion, load imbalance, etc.

Three classes of delays:

Initial Delay: first tuple arrives from remote source more slowly than usual

Slow Delivery: data arrives at a constant, but slower than expected rate

Bursty Arrival: data arrives in a fluctuating manner

CS561 - XJoin 9

Question:

Why are delays undesirable?
- They prolong the time to first output.
- They slow processing if we wait for data to arrive before acting.
- If data arrives too fast, we want to avoid losing any of it.
- Time is wasted sitting idle while no data is coming.
- Delays are unpredictable, so a single static strategy won't work.

CS561 - XJoin 10

Motivation of XJoin

Produce results incrementally as they become available; tuples are returned as soon as they are produced.

Allow progress to be made when one or more sources experience delays: background processing is performed on previously received tuples, so results are produced even when both inputs are stalled.

CS561 - XJoin 11

XJoin Design

Tuples are stored in partitions (as in hash join), each with:
- A memory-resident (m-r) portion
- A disk-resident (d-r) portion
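As a data-structure sketch in Python: one partition with its two portions, where a plain list stands in for the on-disk file (the class and field names are illustrative).

class Partition:
    def __init__(self):
        self.memory = {}   # m-r portion: join key -> list of tuples
        self.disk = []     # d-r portion: a list standing in for an on-disk file
        self.size = 0      # number of m-r tuples, for the memory-full test

    def insert(self, key, tup):
        self.memory.setdefault(key, []).append(tup)
        self.size += 1

    def flush(self):
        # Move the whole m-r portion to the end of the d-r portion.
        for tuples in self.memory.values():
            self.disk.extend(tuples)
        self.memory.clear()
        self.size = 0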

CS561 - XJoin 12

[Diagram: tuples from source A hash into memory-resident partitions 1..n of A (here hash(tuple A) = 1), tuples from source B into memory-resident partitions 1..n of B (here hash(tuple B) = n); when memory fills, a partition is flushed and appended to the corresponding disk-resident partition on disk.]

CS561 - XJoin 13

Challenges in Developing XJoin

- Manage the flow of tuples between memory and secondary storage (when and how to do it)
- Control background processing when inputs are delayed (the reactive scheduling idea)
- Ensure the full answer is produced
- Ensure duplicate tuples are not produced
- Provide both a quick initial result and good overall throughput

CS561 - XJoin 14

XJoin Stages

XJoin proceeds in 3 stages (run as separate threads):

- Memory-to-memory (M:M)
- Memory-to-disk (M:D)
- Disk-to-disk (D:D)

CS561 - XJoin 15

1st Stage: Memory-to-Memory Join

[Diagram: an arriving tuple A with hash(tuple A) = i is inserted into memory-resident partition i of source A and probes partition i of source B; a tuple B with hash(tuple B) = j is handled symmetrically; matches go to the output.]

CS561 - XJoin 16

1st Stage: Memory-to-Memory Join

Join processing continues as long as:
- Memory permits, and
- One of the inputs is producing tuples

If memory is full, one partition is picked, flushed to disk, and appended to the end of its disk-resident portion.

If no new input arrives, stage 1 blocks and stage 2 starts (see the sketch below).
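A rough Python sketch of the stage-1 loop, under simplifying assumptions: partitions are reduced to one bucket per join key, lists stand in for disk files, a None arrival signals that both inputs have stalled, and the flush victim is simply the largest in-memory bucket (the paper's actual partitioning and flush policy are more refined).

from collections import defaultdict

def stage1(arrivals, key, mem_limit):
    # mem/disk hold the m-r and d-r portions for sources A and B.
    mem = {"A": defaultdict(list), "B": defaultdict(list)}
    disk = {"A": defaultdict(list), "B": defaultdict(list)}
    in_memory = 0
    for item in arrivals:
        if item is None:                  # both inputs stalled: stage 1 blocks
            yield ("BLOCKED", None)       # (the scheduler would run stage 2)
            continue
        side, t = item
        other = "B" if side == "A" else "A"
        mem[side][t[key]].append(t)       # insert into own m-r portion
        in_memory += 1
        for m in mem[other].get(t[key], []):  # probe the other m-r portion
            yield ("MATCH", (t, m) if side == "A" else (m, t))
        if in_memory >= mem_limit:        # memory full: flush one partition
            victim = max(mem[side], key=lambda k: len(mem[side][k]))
            flushed = mem[side].pop(victim)
            disk[side][victim].extend(flushed)  # append to its d-r portion
            in_memory -= len(flushed)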

CS561 - XJoin 17

Why Stage 1?

In-memory operations are much faster and cheaper than on-disk operations, thus guaranteeing that results are produced as soon as possible.

CS561 - XJoin 18

Question:

What does the 2nd stage do? When does the 2nd stage start?

Hint: what happens when the input data (tuples) is too large for memory?

Answer: the 2nd stage joins memory-to-disk. It starts when both inputs are blocked.

CS561 - XJoin 19

Stage 2

[Diagram: the disk-resident portion of partition i of source A (DP_i^A) is read from disk and used to probe the memory-resident portion of partition i of source B (MP_i^B); matches go to the output.]

CS561 - XJoin 20

2nd Stage: Memory-to-Disk Join

Activated when the 1st stage is blocked. Performs 3 steps (step 2 is sketched below):

1. Choose a partition from one source, according to its throughput and size.
2. Use tuples from its d-r portion to probe the m-r portion of the other source, outputting matches, until the d-r portion is completely processed.
3. Check whether either input has resumed producing tuples. If yes, resume the 1st stage; if no, choose another d-r portion and continue the 2nd stage.
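A hedged Python sketch of the probing step, assuming dict/list stand-ins for the m-r and d-r portions; the timestamp checks described on the timestamping slides later in the deck, which skip already-joined pairs, are omitted here.

def stage2_step(disk_portion, mem_other, key):
    # Probe the other source's m-r portion with every tuple of the chosen
    # d-r portion; the whole portion is processed before polling the inputs.
    matches = []
    for t in disk_portion:
        for m in mem_other.get(t[key], []):
            matches.append((t, m))
    # The caller now checks whether either input resumed (resume stage 1)
    # or picks another d-r portion and repeats this step.
    return matches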

CS561 - XJoin 21

Controlling the 2nd Stage

The cost of the 2nd stage is hidden when both inputs experience delays.

Tradeoff? What are the benefits of using the second stage?
- It produces results while the input sources are stalled.
- It tolerates varying input rates.

What is the disadvantage?
- The second stage must finish processing a d-r portion before checking for new input (overhead).

To address the tradeoff, XJoin uses an activation threshold: pick a partition that is likely to produce many result tuples right now.

CS561 - XJoin 22

3rd Stage: Disk-to-Disk Join

The clean-up stage:
- Assumes that all data for both inputs has arrived
- Assumes that the 1st and 2nd stages have completed

Why is this step necessary?
- Completeness of the answer: make sure that all result tuples are produced.
- Reason: some tuples in the disk-resident portions may never have had a chance to join each other (see the sketch below).
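For illustration, a simplified clean-up pass in Python: each partition's disk-resident tuples from A are hash-joined with the corresponding disk-resident tuples from B. This is a sketch only; it omits the timestamp checks that suppress duplicates and the remaining d-r/m-r combinations the full algorithm must also cover.

from collections import defaultdict

def stage3(disk_A, disk_B, key):
    # disk_A/disk_B: partition id -> list of disk-resident tuples.
    for pid, a_tuples in disk_A.items():
        table = defaultdict(list)
        for a in a_tuples:               # build on A's d-r portion
            table[a[key]].append(a)
        for b in disk_B.get(pid, []):    # probe with B's d-r portion
            for a in table.get(b[key], []):
                yield (a, b)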

CS561 - XJoin 23

Preventing Duplicates

When could duplicates be produced?
- In both the 2nd and 3rd stages, which may perform overlapping work.

How is this addressed?
- XJoin prevents duplicates with timestamps.

When is this checked?
- During processing, when trying to join two tuples.

CS561 - XJoin 24

Time Stamping: Part 1

Two fields are added to each tuple:
- Arrival TimeStamp (ATS): the time when the tuple first arrived in memory
- Departure TimeStamp (DTS): the time when the tuple was flushed to disk

[ATS, DTS] indicates when the tuple was in memory.

When were two tuples joined in the 1st stage? If tuple A's DTS lies within tuple B's [ATS, DTS], the two were in memory at the same time. Tuples that meet this overlap condition are not considered for joining in the 2nd or 3rd stage (see the sketch below).
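The overlap condition translates directly into code. A small sketch, assuming each tuple carries hypothetical "ats"/"dts" fields; the numbers are taken from the example on the next slide.

def joined_in_stage1(a, b):
    # A and B were in memory together iff one tuple's DTS falls inside
    # the other tuple's [ATS, DTS] window.
    return (b["ats"] <= a["dts"] <= b["dts"]) or (a["ats"] <= b["dts"] <= a["dts"])

A  = {"ats": 102, "dts": 234}
B1 = {"ats": 178, "dts": 198}   # arrived while A was still in memory
B2 = {"ats": 348, "dts": 601}   # arrived after A was flushed
print(joined_in_stage1(A, B1))  # True:  joined in the 1st stage
print(joined_in_stage1(A, B2))  # False: must be joined in a later stage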

CS561 - XJoin 25

Detecting Tuples Joined in the 1st Stage

Overlapping (tuples joined in the first stage):
Tuple A: ATS = 102, DTS = 234. Tuple B1: ATS = 178, DTS = 198.
B1 arrived after A and before A was flushed to disk.

Non-overlapping (tuples not joined in the first stage):
Tuple A: ATS = 102, DTS = 234. Tuple B2: ATS = 348, DTS = 601.
B2 arrived after A and after A was flushed to disk.

CS561 - XJoin 26

Time Stamping: Part 2

For each partition, keep track of:
- ProbeTS: the time when a 2nd-stage probe was done
- DTSlast: the DTS of the last tuple of the disk-resident portion at that probe

Several such probes may occur, so an ordered history of probe descriptors is kept.

Usage: all tuples with DTS up to and including DTSlast were joined in stage 2 with all tuples that were in main memory at time ProbeTS (see the sketch below).
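A sketch of the corresponding check, assuming a per-partition history of hypothetical (dts_last, probe_ts) pairs and tuples with "ats"/"dts" fields; a tuple still in memory can be treated as having dts = infinity.

def joined_in_stage2(disk_tuple, mem_tuple, history):
    # The pair was already joined by some 2nd-stage probe if the disk tuple
    # had been flushed by that probe's DTSlast and the memory tuple was in
    # memory at that probe's ProbeTS.
    for dts_last, probe_ts in history:
        on_disk_then = disk_tuple["dts"] <= dts_last
        in_memory_then = mem_tuple["ats"] <= probe_ts <= mem_tuple["dts"]
        if on_disk_then and in_memory_then:
            return True
    return False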

CS561 - XJoin 27

Detecting Tuples Joined in the 2nd Stage

[Diagram: partition 2's history list holds (DTSlast, ProbeTS) probe descriptors. For a descriptor with DTSlast = 350, all A tuples in partition 2 with DTS up to 350 (e.g. tuple A with ATS = 100, DTS = 200) were joined in stage 2 with the m-r tuples (e.g. tuple B with ATS = 500, DTS = 600) whose in-memory interval overlaps the corresponding ProbeTS.]

CS561 - XJoin 28

Experiments

- HHJ (Hybrid Hash Join)
- XJoin (with 2nd stage and with caching)
- XJoin (without 2nd stage)
- XJoin (with aggressive usage of the 2nd stage)

CS561 - XJoin 29

Case 1: Slow Network, Both Sources Are Slow

CS561 - XJoin 30

Case 1: Slow Network, Both Sources Are Slow (Bursty)

- XJoin improves the delivery time of initial answers -> interactive performance.
- The reactive background processing is an effective way to exploit intermittent delays and keep up a continued output rate.
- Shows that the 2nd stage is very useful when there is time for it.

CS561 - XJoin 31

Case 2: Fast Network, Both Sources Are Fast

CS561 - XJoin 32

Case 2: Fast Network, Both Sources Are Fast

- All XJoin variants deliver initial results earlier.
- XJoin can also deliver the overall result in a time equal to HHJ's.
- HHJ delivers the 2nd half of the result faster than XJoin.
- The 2nd stage cannot be used too aggressively when new data is arriving continuously.

CS561 - XJoin 33

Conclusion

- Can be conservative on space (small footprint)
- Can produce an initial result as early as possible
- Can hide intermittent data delays
- Can be used in conjunction with online query processing to manage data streams (limited)

CS561 - XJoin 34

How to Further Optimize XJoin?

- Resume stage 1 as soon as data arrives
- Remove no-longer-joining tuples in a timely manner
- More ...

CS561 - XJoin 35

References

Urhan, Tolga and Michael J. Franklin. "XJoin: Getting Fast Answers from Slow and Bursty Networks."

Urhan, Tolga and Michael J. Franklin. "XJoin: A Reactively-Scheduled Pipelined Join Operator." IEEE Data Engineering Bulletin, 2000.

Hellerstein, Franklin, Chandrasekaran, Deshpande, Hildrum, Madden, Raman, and Shah. "Adaptive Query Processing: Technology in Evolution." IEEE Data Engineering Bulletin, 2000.

Avnur, Ron and Joseph M. Hellerstein. "Eddies: Continuously Adaptive Query Processing."

Babu, Shivnath and Jennifer Widom. "Continuous Queries over Data Streams."

CS561 - XJoin 36

Stream: New Query Context

Challenges faced by XJoin:
- Potentially unbounded, growing join state
- Indefinite delay of some join results

Solutions:
- Exploit semantic constraints to remove no-longer-joining data in a timely manner
- Constraints: sliding windows, punctuations

CS561 - XJoin 37

Punctuation

A punctuation is a predicate on stream elements that evaluates to false for every element following the punctuation.

ID       Name    Age
9961234  Edward  17
9961235  Justin  19
9961238  Janet   18
<*, *, (0, 18]>   -- punctuation: no more tuples for students whose age is less than or equal to 18!
9961256  Anna    20
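A small Python sketch of punctuation matching, assuming punctuations are tuples of wildcards and half-open intervals as in the slide (the representation is illustrative):

def matches_punctuation(tup, punct):
    # Each punctuation field is either the wildcard "*" or an interval
    # (lo, hi]; a tuple matches if every non-wildcard field covers it.
    for value, pattern in zip(tup, punct):
        if pattern == "*":
            continue
        lo, hi = pattern
        if not (lo < value <= hi):
            return False
    return True

# The slide's punctuation <*, *, (0, 18]> covers all students aged <= 18.
punct = ("*", "*", (0, 18))
print(matches_punctuation((9961238, "Janet", 18), punct))  # True
print(matches_punctuation((9961256, "Anna", 20), punct))   # False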

CS561 - XJoin 38

An Example

Query: for each item that has at least one bid, return its bid-increase value.

Select O.item_id, Sum(B.bid_price - O.open_price)
From Open O, Bid B
Where O.item_id = B.item_id
Group by O.item_id

Open stream (item_id | seller_id | open_price | timestamp):
1080 | jsmith  | 130.00 | Nov-10-03 9:03:00
<1080, *, *, *>
1082 | melissa | 20.00  | Nov-10-03 9:10:00
<1082, *, *, *>
...

Bid stream (item_id | bidder_id | bid_price | timestamp):
1080 | pclover  | 175.00 | Nov-14-03 8:27:00
1082 | smartguy | 30.00  | Nov-14-03 8:30:00
1080 | richman  | 177.00 | Nov-14-03 8:52:00
<1080, *, *, *>    (no more bids for item 1080!)
...

[Diagram: the two streams feed a Join on item_id (Out1: item_id), followed by a Group-by on item_id computing Sum(...) (Out2: item_id, sum).]

CS561 - XJoin 39

PJoin Execution Logic

[Diagram: tuple t_a arrives on stream A and is hashed (Hash(t_a) = 1). Each stream's join state (S_a, S_b) has a memory-resident portion (hash tables) and a disk-resident portion; alongside each state sit a purge candidate pool and a punctuation set (PS_a, PS_b). The arriving tuple is inserted into its own state and probes the other stream's state.]

CS561 - XJoin 40

PJoin Execution Logic (continued)

[Diagram: punctuation p_a arrives on stream A and is hashed (Hash(p_a) = 1). It is added to the punctuation set PS_a, and tuples of stream B's state that can no longer join are moved to the purge candidate pool.]
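As a rough illustration of the purging idea: a Python sketch under the assumption that a punctuation arriving on one stream lets the other stream's no-longer-joinable tuples be moved out of the live join state. The predicate can_still_join is hypothetical; for an equi-join it would compare the tuple's join attribute against the punctuation.

def purge_on_punctuation(punct, other_state, can_still_join):
    # Tuples of the opposite stream that can no longer find join partners
    # (per the punctuation) are moved out of the live join state.
    keep, purge_candidates = [], []
    for t in other_state:
        (keep if can_still_join(t, punct) else purge_candidates).append(t)
    other_state[:] = keep        # shrink the in-memory join state in place
    return purge_candidates      # e.g. item 1080's Open tuple once <1080, *, *, *> arrives on Bid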

CS561 - XJoin 41

PJoin vs. XJoin: Memory Overhead

[Chart: number of tuples in the join states (y-axis) over time in milliseconds (x-axis), XJoin vs. PJoin. Tuple inter-arrival: 2 milliseconds; punctuation inter-arrival: 40 tuples/punctuation.]

CS561 - XJoin 42

PJoin vs. XJoin: Tuple Output Rate

[Chart: number of output tuples (y-axis) over time in milliseconds (x-axis), PJoin vs. XJoin. Tuple inter-arrival: 2 milliseconds; punctuation inter-arrival: 30 tuples/punctuation.]

CS561 - XJoin 43

Conclusion

- The memory requirement for PJoin's state is almost insignificant compared to XJoin's.
- The growth of XJoin's join state leads to increasing probe cost, which hurts the tuple output rate.
- Eager purge is the best strategy for minimizing join state.
- Lazy purge with an appropriate purge threshold provides a significant advantage in increasing the tuple output rate.