Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet, The Hebrew University of Jerusalem, Fall 2002

Post on 19-Dec-2015


1

Continuous Queries over Data Streams

Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet
The Hebrew University of Jerusalem, Fall 2002

2

Contents of the lecture

• Introduction
• Proposed Architecture of Data Stream Management System
• Research problems
• Query Optimization
• Bibliography

3

Data Streams vs. Data Sets

Data Sets:
• Updates are infrequent
• Old data is required many times
• Example: employees' personal data table

Data Streams:
• Data changes constantly (sometimes additions only)
• Mostly only the freshest data is used
• Examples: financial tickers, data feeds from sensors, network monitoring, etc.

4

Using Traditional Database

[Figure labels: User/Application, Loader, Query, Result, Query…, Result…]

5

Data Streams Paradigm

[Figure labels: User/Application, Register Query, Stream Query Processor, Result]

6

Data Streams Paradigm

[Figure labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]

7

What Is a Continuous Query?

A query which is issued once and logically runs continuously.

8

What Is a Continuous Query?

A query which is issued once and runs continuously.

Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to hardware failure.

9

What Is a Continuous Query?

A query which is issued once and runs continuously.

More examples:

Continuous queries are used to support load balancing and online automatic trading at the stock exchange.

10

Special Challenges

• Timely online answers even for rapid data streams
• Fast access to large portions of data
• Processing of multiple streams simultaneously

11

Making Things Concrete

Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end

[Figure labels: BOB, ALICE, Central Office (twice), DSMS]

12

Making Things Concrete

Database = two streams of mobile call records
    Outgoing(connectionID, caller, start, end)
    Incoming(connectionID, callee, start, end)

Query language = SQL

FROM clauses can refer to streams and/or relations

13

Query 1 (self-join)

Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = start
           AND O2.event = end)

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
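
The last remark on this slide, that a long call can be reported after 2 minutes without waiting for its end tuple, can be sketched in Python. This is only an illustration under stated assumptions, not the method of the presentation: tuples are assumed to arrive in non-decreasing time order, time is assumed to be measured in minutes, and the name long_calls is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# An Outgoing tuple: (call_ID, caller, time, event); time in minutes is an assumption.
Record = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Record],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) for each call that lasts longer than `threshold`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in outgoing:
        # The arrival time of any tuple acts as a clock: every call that is
        # still open and started more than `threshold` ago already qualifies,
        # so it is emitted now, before its end tuple has been seen.
        overdue = [cid for cid, (_, start) in open_calls.items()
                   if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Either the call was already emitted, or it was short: drop it.
            open_calls.pop(call_id, None)


records = [(1, "bob", 0.0, "start"), (2, "alice", 1.0, "start"),
           (2, "alice", 2.5, "end"), (1, "bob", 5.0, "end")]
result = list(long_calls(records))
```

In this sketch the working state holds only calls that have been open for at most 2 minutes, while the answer itself keeps growing and is therefore emitted as a stream rather than stored. A call is reported when the next tuple arrives after its 2 minutes have passed, so an idle stream delays the output.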

14

Query 2 (join)

Pair up callers and callees

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID

• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
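
One way to see where the temporary storage in this join comes from is a small symmetric hash join in Python. It is a sketch under assumptions that are not in the slides: each call contributes exactly one tuple to each stream, the two streams are merged into one tagged sequence, and the names pair_calls and Event are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical tagged input: ("O", call_ID, caller) or ("I", call_ID, callee).
Event = Tuple[str, int, str]


def pair_calls(events: Iterable[Event]) -> Iterator[Tuple[str, str]]:
    """Yield (caller, callee) as soon as both sides of a call have arrived."""
    # Unmatched tuples of each stream, keyed by call_ID.
    waiting: Dict[str, Dict[int, str]] = {"O": {}, "I": {}}
    for side, call_id, party in events:
        other = "I" if side == "O" else "O"
        if call_id in waiting[other]:
            partner = waiting[other].pop(call_id)
            # Keep the output order (caller, callee) whichever side came last.
            yield (party, partner) if side == "O" else (partner, party)
        else:
            # No partner yet: this tuple must be remembered until one arrives.
            waiting[side][call_id] = party


events = [("O", 1, "bob"), ("O", 2, "carol"), ("I", 2, "dave"), ("I", 1, "alice")]
pairs = list(pair_calls(events))
```

The waiting tables hold only tuples whose partner has not arrived yet. If the two streams are near-synchronized these tables stay small; if one stream can lag arbitrarily far behind the other, they can grow without bound.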

15

Query 3 (group-by aggregation)

Total connection time for each caller

    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = start
           AND O2.event = end)
    GROUP BY O1.caller

Cannot provide result in (append-only) stream.

Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
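
The first alternative, an output stream with updates, can be illustrated with a short Python sketch. It assumes that each emitted pair supersedes any earlier pair for the same caller, which is one possible reading of "updates" and is not spelled out in the slides; the name total_connection_time is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


def total_connection_time(outgoing: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Yield (caller, running total) every time one of the caller's calls ends."""
    starts: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    totals: Dict[str, float] = {}              # caller -> total connection time
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = (caller, time)
        elif event == "end" and call_id in starts:
            who, began = starts.pop(call_id)
            totals[who] = totals.get(who, 0.0) + (time - began)
            # An update, not an append: it replaces the caller's earlier total.
            yield who, totals[who]


records = [(1, "bob", 0.0, "start"), (1, "bob", 3.0, "end"),
           (2, "bob", 4.0, "start"), (2, "bob", 6.0, "end")]
updates = list(total_connection_time(records))
```

The second pair for bob replaces the first one, which is why the result cannot be expressed as an append-only stream. The totals dictionary corresponds to the other two alternatives: it is the answer kept in memory, and it can be read on demand.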

16

Conclusions

• Conventional DBMS technology is inadequate
• We need to reconsider all aspects of data management and processing in the presence of data streams

21

DBMS versus DSMS

DBMS                                       | DSMS
Persistent relations                       | Transient streams (and persistent relations)
One-time queries                           | Continuous queries
Random access                              | Sequential access
Access plan determined by query processor  | Unpredictable data arrival and characteristics
and physical DB design                     |
"Unbounded" disk store                     | Bounded main memory

22

Related work

Tapestry system
• Content-based filtering of email messages
• Restricted subset of SQL
• Append-only query results

Chronicle data model
• Append-only ordered sequences of tuples
• Restricted view-definition language
• Doesn't store any chronicles

Alert system
• Event-Condition-Action triggers in a conventional SQL DB
• Continuous queries over append-only "active tables"

23

Related work

Materialized Views

Materialized views are queries which need to be reevaluated whenever the database changes.

Materialized Views vs. Continuous Queries

Continuous queries:
• May stream rather than store the result
• May deal with append-only relations
• May provide approximate answers
• Processing strategy may adapt to the characteristics of the data stream

24

Architecture for continuous queries

Single stream of tuples D, single continuous query Q, and answer to the query A.
Q is issued once and operates continuously.

[Figure labels: Data Stream <A,B> <A,B> <A,B>, Continuous Query Q, Answer A?]

25

Architecture for continuous queries

We consider data streams that adhere to the relational model (i.e. streams of tuples), although many of the ideas and techniques are independent of the data model being considered.

[Figure labels: Data Stream <A,B> <A,B> <A,B>, Continuous Query Q, Answer A?]

26

Architecture for continuous queries

Scenario 1 (simplest):

Data stream D is append-only - no updates or deletions. How to handle Q?

1) Always store current answer A to Q.
   D is of unbounded size => A may be too.

2) Do not store A, but make new tuples in A available as another continuous stream.
   No need for unbounded storage for A, but may need unbounded storage to determine new tuples in A.

27

Architecture for continuous queries

Scenario 2
Input stream is append-only, but may cause updates and deletions in answer A.
=> May need to update/delete tuples in the output data stream.

Scenario 3 (most general)
Input stream D includes updates and deletions.
=> Much of the stream's data should be stored to determine the answer.

28

Architecture for continuous queries

How to solve?

1) Restrict expressiveness of Q.

2) Impose constraints on the data stream to guarantee that the answer to Q, and the amount of data needed to compute Q, are bounded.

3) Provide approximate answer.

29

Architecture for processing continuous queries

[Figure labels: Stream 1, Stream 2, ..., Stream N, Stream Query Processor, Throw, Scratch, Store, Stream]

30

Architecture for continuous queries

STREAM is a data stream containing tuples appended to A. It is an append-only stream (it shouldn't include updates/deletions).

STREAM and STORE define the current answer A.

31

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

1) t causes new tuples in A. If tuple a will remain in A forever: send a to STREAM.

2) If a should be in A, but may be removed at some moment: add a to STORE.

[Figure labels: Stream, Stream Query Processor, Throw, Scratch, Store, Stream]

32

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

3) t may cause update or deletion of answer tuples in STORE. Answer tuples may be moved from STORE to STREAM.

4) May need to save t or derived data to ensure the query result can be computed in the future: send t to SCRATCH.

[Figure labels: Stream, Stream Query Processor, Throw, Scratch, Store, Stream]

33

Architecture for continuous queries

When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:

5) t is not needed and will not be needed: send it to THROW (unless we would like to archive it).

6) As a result of t we may move data from STORE or SCRATCH to THROW.

[Figure labels: Stream, Stream Query Processor, Throw, Scratch, Store, Stream]
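
The actions listed on these slides can be tried out on a toy query. The Python sketch below is an illustration under assumptions, not the architecture's actual implementation: the query ("report each completed outgoing call with its duration"), the class Areas, and the function on_tuple are all hypothetical, and THROW is modeled as a simple counter of discarded tuples.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Record = Tuple[int, str, float, str]  # (call_ID, caller, time, event)


@dataclass
class Areas:
    """The four destinations of the architecture, as plain Python containers."""
    stream: List[tuple] = field(default_factory=list)         # STREAM: answers that stay in A forever
    store: Dict[int, tuple] = field(default_factory=dict)     # STORE: answers that may still change
    scratch: Dict[int, Record] = field(default_factory=dict)  # SCRATCH: data kept for later use
    thrown: int = 0                                            # THROW: number of discarded tuples


def on_tuple(areas: Areas, t: Record) -> None:
    """React to one new tuple t of the Outgoing stream."""
    call_id, caller, time, event = t
    if event == "start":
        # Action 4: the start tuple is needed later to compute the duration.
        areas.scratch[call_id] = t
    elif event == "end" and call_id in areas.scratch:
        start = areas.scratch.pop(call_id)
        # Action 1: a completed call never changes again, so it goes to STREAM.
        areas.stream.append((call_id, caller, time - start[2]))
        # Actions 5 and 6: the end tuple and the consumed start tuple are discarded.
        areas.thrown += 2
    else:
        # Action 5: an end without a known start cannot contribute to the answer.
        areas.thrown += 1


areas = Areas()
for t in [(1, "bob", 0.0, "start"), (2, "alice", 1.0, "start"), (1, "bob", 3.5, "end")]:
    on_tuple(areas, t)
```

For this particular query STORE stays empty, because no answer tuple is ever updated or retracted. A query whose answers can change, such as a running total per caller, would keep them in STORE instead of sending them to STREAM.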

34

Architecture for continuous queries

Scenario 1

Data stream D is append-only - no updates or deletions. Always store current answer A to Q.

• STREAM is empty
• STORE always contains A
• SCRATCH contains whatever is needed to keep the answer in STORE up to date