Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet
The Hebrew University of Jerusalem, Fall 2002
2
Contents of the lecture
Introduction
Proposed Architecture of Data Stream Management System
Research problems
Query Optimization
Bibliography
3
Data Streams vs. Data Sets
Data Sets | Data Streams
Updates infrequent | Data changed constantly (sometimes additions only)
Old data required many times | Mostly only freshest data used
Example: employees' personal data table | Examples: financial tickers, data feeds from sensors, network monitoring, etc.
4
Using Traditional Database
[Diagram labels: User/Application; Loader; Query, Result, Query …, Result …]
5
Data Streams Paradigm
[Diagram labels: User/Application; Register Query; Stream Query Processor; Result; Scratch Space (Memory and/or Disk); Data Stream Management System (DSMS)]
7
What Is a Continuous Query?
A query which is issued once and logically runs continuously.
Example: detect abnormalities in network traffic behavior in real time, and their cause, such as link congestion due to hardware failure.
More examples: continuous queries are used to support load balancing and online automatic trading at a stock exchange.
10
Special Challenges
Timely online answers even for rapid data streams
Fast access to large portions of data
Processing of multiple streams simultaneously
11
Making Things Concrete
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
[Diagram labels: BOB; ALICE; two Central Offices; DSMS]
12
Making Things Concrete
Database = two streams of mobile call records
Outgoing(connectionID, caller, start, end)
Incoming(connectionID, callee, start, end)
Query language = SQL
FROM clauses can refer to streams and/or relations
13
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
       AND O1.call_ID = O2.call_ID
       AND O1.event = start
       AND O2.event = end)
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
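The remark that Query 1 can output after 2 minutes without seeing the end event can be illustrated with a minimal Python sketch. It is not taken from the slides: the function name `long_calls`, the tuple layout, the use of minutes as the time unit, and the assumption that events arrive in time order are all illustrative. In this sketch, time only advances when a new event arrives, so a long call is reported at the first event seen after its 2-minute mark.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical layout mirroring Outgoing(call_ID, caller, time, event).
# Assumptions: time is in minutes and events arrive in time order.
Event = Tuple[str, str, float, str]


def long_calls(events: Iterable[Event], threshold: float = 2.0) -> Iterator[Tuple[str, str]]:
    """Yield (call_ID, caller) for every call that lasts longer than threshold."""
    open_calls: Dict[str, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in events:
        # Any open call older than the threshold is already a result,
        # even though its end event has not been seen yet.
        overdue = [c for c, (_, start) in open_calls.items() if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Either the call was short, or it was already reported above.
            open_calls.pop(call_id, None)


events = [
    ("c1", "bob", 0.0, "start"),
    ("c2", "alice", 1.0, "start"),
    ("c2", "alice", 2.5, "end"),   # c1 passes its 2-minute mark here and is reported
    ("c1", "bob", 5.0, "end"),
]
result = list(long_calls(events))
```

Only the currently open calls are kept in memory here; the results themselves leave as a stream rather than being stored.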
14
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage …
… unless streams are near-synchronized
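One common way to evaluate such a join over two streams is a symmetric hash join. The slides do not name a technique, so the following is only a sketch under that assumption. The merged input format `(stream name, call_ID, party)` and the function name `pair_up` are hypothetical. The two tables are never pruned, which is exactly the unbounded temporary storage noted above. A near-synchronized variant could evict old entries, but that is not shown.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Iterator, List, Tuple

# Hypothetical merged input: (stream name, call_ID, party), where party is the
# caller for "Outgoing" tuples and the callee for "Incoming" tuples.
Tagged = Tuple[str, str, str]


def pair_up(arrivals: Iterable[Tagged]) -> Iterator[Tuple[str, str]]:
    """Symmetric hash join on call_ID; yields (caller, callee) pairs as a stream."""
    callers: DefaultDict[str, List[str]] = defaultdict(list)  # call_ID -> callers seen
    callees: DefaultDict[str, List[str]] = defaultdict(list)  # call_ID -> callees seen
    for stream, call_id, party in arrivals:
        if stream == "Outgoing":
            callers[call_id].append(party)          # remember for future Incoming tuples
            for callee in callees.get(call_id, []):  # probe the other side
                yield party, callee
        else:
            callees[call_id].append(party)          # remember for future Outgoing tuples
            for caller in callers.get(call_id, []):
                yield caller, party


arrivals = [
    ("Outgoing", "c1", "bob"),
    ("Incoming", "c2", "dave"),
    ("Incoming", "c1", "alice"),
    ("Outgoing", "c2", "carol"),
]
pairs = list(pair_up(arrivals))
```

Each arriving tuple is both inserted into its own table and probed against the other, so a match is emitted as soon as both sides have arrived, whichever comes first.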
15
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
       AND O1.event = start
       AND O2.event = end)
GROUP BY O1.caller
Cannot provide result in (append-only) stream.
Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
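The first alternative, an output stream with updates, might look like the following Python sketch. The function name `total_time_updates` and the tuple layout are assumptions for illustration. Each emitted pair supersedes the previous value for the same caller, which is why the output is not an append-only answer.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Iterator, Tuple

# Hypothetical layout mirroring Outgoing(call_ID, caller, time, event).
Event = Tuple[str, str, float, str]


def total_time_updates(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Yield (caller, new total) each time a call ends; later values replace earlier ones."""
    starts: Dict[str, float] = {}                         # call_ID -> start time
    totals: DefaultDict[str, float] = defaultdict(float)  # caller -> running total
    for call_id, caller, time, event in events:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]  # an update, not a new permanent answer tuple


events = [
    ("c1", "bob", 0.0, "start"),
    ("c1", "bob", 3.0, "end"),
    ("c2", "bob", 10.0, "start"),
    ("c2", "bob", 14.0, "end"),
]
updates = list(total_time_updates(events))
```

The `totals` dictionary also covers the other two alternatives in spirit: it is the answer kept in memory, and reading it returns the current value on demand.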
16
Conclusions
Conventional DBMS technology is inadequate
We need to reconsider all aspects of data management and processing in the presence of data streams
17
DBMS versus DSMS
DBMS | DSMS
• Persistent relations | • Transient streams (and persistent relations)
• One-time queries | • Continuous queries
• Random access | • Sequential access
• Access plan determined by query processor and physical DB design | • Unpredictable data arrival and characteristics
• "Unbounded" disk store | • Bounded main memory
22
Related work
Tapestry system
Content-based filtering of email messages
Restricted subset of SQL
Append-only query results
Chronicle data model
Append-only ordered sequences of tuples
Restricted view-definition language
Doesn't store any chronicles
Alert system
Event-Condition-Action triggers in a conventional SQL DB
Continuous queries over append-only "active tables"
23
Related work
Materialized Views
Materialized views are queries which need to be reevaluated whenever the database changes.
Materialized Views vs. Continuous Queries:
Continuous queries
May stream rather than store the result
May deal with append-only relations
May provide approximate answers
Processing strategy may adapt to characteristics of the data stream
24
Architecture for continuous queries
Single stream of tuples D, single continuous query Q, and answer to the query A
Q is issued once and operates continuously
[Diagram labels: Data Stream <A,B> <A,B> <A,B>; Continuous Query Q; Answer A?]
25
Architecture for continuous queries
We consider data streams that adhere to the relational model (i.e. streams of tuples), although many of the ideas and techniques are independent of the data model being considered.
26
Architecture for continuous queries
Scenario 1 (simplest):
Data stream D is append-only: no updates or deletions. How to handle Q?
1) Always store the current answer A to Q.
   D is of unbounded size => A may be too.
2) Do not store A, but make new tuples in A available as another continuous stream.
   No need for unbounded storage for A, but may need unbounded storage to determine new tuples in A.
27
Architecture for continuous queries
Scenario 2
Input stream is append-only, but may cause updates and deletions in answer A.
=> May need to update/delete tuples in the output data stream
Scenario 3 (most general)
Input stream D includes updates and deletions.
=> Much of the stream's data should be stored to determine the answer.
28
Architecture for continuous queries
How to solve?
1) Restrict expressiveness of Q.
2) Impose constraints on the data stream to guarantee that the answer to Q, and the amount of data needed to compute Q, are bounded.
3) Provide approximate answer.
29
Architecture for processing continuous queries
[Diagram labels: Stream 1, Stream 2, ..., Stream N; Stream Query Processor; Throw; Scratch; Store; Stream]
30
Architecture for continuous queries
STREAM is a data stream containing tuples appended to A. It is an append-only stream (it shouldn't include updates/deletions).
STREAM and STORE define the current answer A.
31
Architecture for continuous queries
When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:
1) t causes new tuples in A. If tuple a will remain in A forever: send a to STREAM
2) If a should be in A, but may be removed at some moment: add a to STORE
32
Architecture for continuous queries
When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:
3) t may cause update or deletion of answer tuples in STORE. Answer tuples may be moved from STORE to STREAM
4) May need to save t or derived data to ensure the query result can be computed in future: send t to SCRATCH
33
Architecture for continuous queries
When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:
5) t is not needed and will not be needed: send it to THROW (unless we would like to archive it)
6) As a result of t we may move data from STORE or SCRATCH to THROW
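To make the four data areas concrete, here is a minimal Python sketch that routes tuples for the "calls longer than 2 minutes" query. It is an illustration under stated assumptions, not the architecture's actual implementation. The class name `LongCallQuery`, its fields, and the tuple layout are hypothetical. Unlike a version that reports a call at the 2-minute mark, this sketch decides only when the end tuple arrives. STORE stays empty because answers to this particular query are never retracted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical layout mirroring Outgoing(call_ID, caller, time, event).
Event = Tuple[str, str, float, str]


@dataclass
class LongCallQuery:
    threshold: float = 2.0
    # STREAM: answer tuples that will remain in A forever (append-only).
    stream: List[Tuple[str, str]] = field(default_factory=list)
    # STORE: answer tuples that may later be removed (unused by this query).
    store: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    # SCRATCH: input tuples kept because they are needed to compute future answers.
    scratch: Dict[str, Event] = field(default_factory=dict)
    # THROW: tuples that are no longer needed (kept in a list only for inspection).
    throw: List[Event] = field(default_factory=list)

    def on_tuple(self, t: Event) -> None:
        call_id, caller, time, event = t
        if event == "start":
            self.scratch[call_id] = t  # action 4: needed when the end tuple arrives
            return
        start = self.scratch.pop(call_id, None)
        if start is not None:
            if time - start[2] > self.threshold:
                self.stream.append((call_id, caller))  # action 1: permanent answer
            self.throw.append(start)  # action 6: move data from SCRATCH to THROW
        self.throw.append(t)          # action 5: the end tuple is never needed again


q = LongCallQuery()
for e in [
    ("c1", "bob", 0.0, "start"),
    ("c2", "alice", 1.0, "start"),
    ("c2", "alice", 2.5, "end"),
    ("c1", "bob", 5.0, "end"),
]:
    q.on_tuple(e)
```

A query whose answers can be retracted, such as a running aggregate, would instead place its answer tuples in `store` and update them there.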
34
Architecture for continuous queries
Scenario 1
Data stream D is append-only: no updates or deletions. Always store the current answer A to Q.
STREAM is empty
STORE always contains A
SCRATCH contains whatever is needed to keep the answer in STORE up to date