![Page 1: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/1.jpg)
Streaming Queries over Streaming Data
Sirish Chandrasekaran (UC Berkeley)
Michael J. Franklin (UC Berkeley)
Presented by Andy Williamson
![Page 2: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/2.jpg)
About Me
3rd Year ISYE major
Minor in Computer Science
From Austin, TX
Have visited every state but Alaska
Intern at Deloitte Consulting focusing on SAP implementation
![Page 3: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/3.jpg)
Agenda
Background/Motivation
PSoup
• Introduction
• System Overview
• Query Processing Techniques
• Implementation
• Performance
• Aggregation Queries
• Conclusions
Critique
![Page 4: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/4.jpg)
Background/Motivation
Continuous Query (CQ) systems treat queries as fixed entities and stream data over them
Previous systems only allowed streaming of either data or queries
Continuously delivering results as they are computed can be infeasible or inefficient in scenarios such as:
• Data Recharging
• Monitoring
![Page 5: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/5.jpg)
PSoup: Introduction
Query processor based on Telegraph query processing framework
Allows both data and queries to be streamed
Partially stores results to support disconnected operation and improve data throughput and response time
![Page 6: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/6.jpg)
PSoup: System Overview
User initially registers query specification with system
System returns handle which can be used to invoke results of query later
Example Query:
SELECT *
FROM Data_Stream D_s
WHERE (D_s.a < x ∧ D_s.b > y)
BEGIN(NOW - 10)
END(NOW);
Begin-End Clause allows:
• Snapshot (constant beginning and ending time)
• Landmark (constant beginning and variable ending time)
• Sliding window (variable beginning and ending time)
Limited by size of memory
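The three window types differ only in which end of the BEGIN-END clause is constant and which moves with NOW. The sketch below illustrates that classification; the `Bound` class, `window_kind`, and `resolve_window` are hypothetical names introduced here, and integer timestamps are an assumption, not something taken from PSoup.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Bound:
    """One end of a BEGIN-END clause (hypothetical representation).

    Exactly one field is expected to be set:
    absolute -> a constant timestamp
    offset   -> a value relative to NOW, e.g. -10 for NOW - 10
    """
    absolute: Optional[int] = None
    offset: Optional[int] = None

    def resolve(self, now: int) -> int:
        # A constant bound ignores NOW; a variable bound moves with it.
        return self.absolute if self.absolute is not None else now + self.offset


def window_kind(begin: Bound, end: Bound) -> str:
    """Classify a window by which of its ends are constant."""
    begin_fixed = begin.absolute is not None
    end_fixed = end.absolute is not None
    if begin_fixed and end_fixed:
        return "snapshot"
    if begin_fixed:
        return "landmark"
    if not end_fixed:
        return "sliding"
    raise ValueError("variable BEGIN with constant END is not one of the three types")


def resolve_window(begin: Bound, end: Bound, now: int) -> Tuple[int, int]:
    """Turn a BEGIN-END clause into a concrete interval at invocation time."""
    return begin.resolve(now), end.resolve(now)
```

With this representation, the example query's `BEGIN(NOW - 10) END(NOW)` becomes `Bound(offset=-10)` and `Bound(offset=0)`, which is a sliding window.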
![Page 7: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/7.jpg)
PSoup: System Overview
PSoup treats execution of query streams as a join of query and data streams
Maintains State Modules (SteMs) for queries and data
One query SteM for all queries in the system, and one data SteM for each data stream
![Page 8: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/8.jpg)
PSoup: Query Processing Techniques Overview
PSoup assigns unique queryID that it returns to the user
Client can disconnect, reconnect and execute query to obtain updated results
PSoup continuously matches data to query predicates in background and stores the results in its Results Structure
When a query is invoked, PSoup applies the appropriate input window to the Results Structure to return the current results
![Page 9: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/9.jpg)
PSoup: Query Processing Techniques
Entry of new Query specs
New queries split into two parts:
• Standing Query Clause (SQC): consists of the SELECT-FROM-WHERE clauses
• BEGIN-END clause, stored in separate WindowsTable structure
SQC inserted into Query SteM
Used to probe Data SteMs corresponding to tables in FROM clause
Resulting tuples stored in Results Structure
![Page 10: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/10.jpg)
PSoup: Query Processing Techniques
Entry of new data
New tuples assigned globally unique tupleID and physical timestamp (physicalID) based on system clock
Inserted into appropriate Data SteM
Then used to probe Query SteM to determine which SQCs it satisfies
TupleIDs and physicalIDs stored in Results Structure
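The symmetry between query entry and data entry can be sketched in a few lines: a new query probes the stored data, and a new tuple probes the stored queries. This is a minimal single-stream illustration, not PSoup's implementation. The class `MiniPSoup`, its method names, the linear scans in place of indexes, and the integer counter standing in for the system clock are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[dict], bool]


class MiniPSoup:
    """Toy single-stream model of the query/data join (hypothetical names)."""

    def __init__(self) -> None:
        self.query_stem: Dict[int, Predicate] = {}          # queryID -> SQC predicate
        self.data_stem: Dict[int, Tuple[int, dict]] = {}    # tupleID -> (physicalID, row)
        self.results: Dict[int, List[Tuple[int, int]]] = {} # queryID -> [(physicalID, tupleID)]
        self._next_query_id = 0
        self._next_tuple_id = 0
        self._clock = 0  # assumption: a counter stands in for the system clock

    def register_query(self, sqc: Predicate) -> int:
        """Insert the SQC into the Query SteM, then probe the Data SteM with it."""
        query_id = self._next_query_id
        self._next_query_id += 1
        self.query_stem[query_id] = sqc
        # Data that arrived before the query is matched here.
        self.results[query_id] = [
            (ts, tid) for tid, (ts, row) in self.data_stem.items() if sqc(row)
        ]
        return query_id

    def insert_tuple(self, row: dict) -> int:
        """Insert the tuple into the Data SteM, then probe the Query SteM with it."""
        tuple_id = self._next_tuple_id
        self._next_tuple_id += 1
        timestamp = self._clock
        self._clock += 1
        self.data_stem[tuple_id] = (timestamp, row)
        # Data that arrives after the query is matched here.
        for query_id, sqc in self.query_stem.items():
            if sqc(row):
                self.results[query_id].append((timestamp, tuple_id))
        return tuple_id
```

Because both paths write into the same results mapping, a query sees matches from tuples that arrived before it was registered as well as after.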
![Page 11: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/11.jpg)
PSoup: Query Processing Techniques
Selection Queries over a single stream
![Page 12: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/12.jpg)
PSoup: Query Processing Techniques
Join Queries Over Multiple Streams
![Page 13: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/13.jpg)
PSoup: Query Processing Techniques
Query Invocation and Result Construction
Results Structure maintains info about which tuples in Data SteM(s) satisfy which SQCs in Query SteM
For each result tuple of each query, it stores tupleID and physicalID of all constituent base tuples of result tuple
Results of a query can be accessed by its queryID
Ordered by timestamp (physicalID)
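Because results are kept in timestamp order, applying a window at invocation time reduces to two binary searches. The sketch below assumes results for one query are held as a list of `(physicalID, tupleID)` pairs sorted by physicalID and that window bounds are inclusive; the function name `invoke` and that layout are illustrative choices, not details from the paper.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def invoke(result_entries: List[Tuple[int, str]], begin: int, end: int) -> List[str]:
    """Return tupleIDs whose physicalID lies in [begin, end].

    result_entries must already be sorted by physicalID.
    """
    timestamps = [ts for ts, _ in result_entries]
    lo = bisect_left(timestamps, begin)   # first entry with ts >= begin
    hi = bisect_right(timestamps, end)    # one past the last entry with ts <= end
    return [tuple_id for _, tuple_id in result_entries[lo:hi]]
```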
![Page 14: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/14.jpg)
PSoup: Implementation
Eddy
Each tuple has a predicate attribute and an Interest List dictating where it is to be routed
Provides Stream Prefix Consistency by storing new and temporary tuples separately in New Tuple Pool (NTP) and Temporary Tuple Pool (TTP)
Begins by selecting a tuple from the NTP and then processing everything in the TTP before picking another tuple from the NTP
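The scheduling rule on this slide (drain the TTP completely before taking the next tuple from the NTP) can be sketched as a pair of nested loops. The function `run_eddy` and the `process` callback, which returns whatever temporary tuples a tuple spawns, are hypothetical stand-ins for the Eddy's routing logic.

```python
from collections import deque
from typing import Callable, Iterable, List


def run_eddy(new_tuples: Iterable[str], process: Callable[[str], List[str]]) -> List[str]:
    """Return the order in which tuples are processed.

    process(t) returns the temporary tuples produced while handling t.
    """
    ntp = deque(new_tuples)  # New Tuple Pool
    ttp = deque()            # Temporary Tuple Pool
    order = []
    while ntp:
        source = ntp.popleft()
        order.append(source)
        ttp.extend(process(source))
        # Finish all work derived from this new tuple before admitting another.
        while ttp:
            temp = ttp.popleft()
            order.append(temp)
            ttp.extend(process(temp))
    return order
```

Under this rule, every intermediate tuple derived from one arrival is fully handled before the next arrival is admitted, which is the property the slide attributes to keeping the two pools separate.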
![Page 15: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/15.jpg)
PSoup: Implementation
Data SteM
Use tree-based index for data to provide efficient access to probing queries
One red-black tree for every attribute
Maintains hash-based index over tupleIDs for fast access
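A rough picture of the Data SteM's two access paths: an ordered index per attribute for range probes and a hash index over tupleIDs for direct lookup. Python's standard library has no red-black tree, so this sketch substitutes a sorted list maintained with `bisect`, which gives the same ordered-lookup behaviour but slower inserts. The class and method names are assumptions for illustration.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


class DataSteM:
    """Toy Data SteM: ordered index per attribute plus a hash index on tupleID."""

    def __init__(self) -> None:
        self.by_id: Dict[int, dict] = {}  # hash index: tupleID -> row
        # attribute -> sorted list of (value, tupleID); stand-in for a red-black tree
        self.by_attr: Dict[str, List[Tuple[int, int]]] = defaultdict(list)

    def insert(self, tuple_id: int, row: dict) -> None:
        self.by_id[tuple_id] = row
        for attr, value in row.items():
            bisect.insort(self.by_attr[attr], (value, tuple_id))

    def less_than(self, attr: str, bound: int) -> List[int]:
        """TupleIDs with row[attr] < bound, as a probing query predicate would need."""
        index = self.by_attr[attr]
        # (bound,) sorts before every (bound, tupleID), so this cut is strict.
        cut = bisect.bisect_left(index, (bound,))
        return [tuple_id for _, tuple_id in index[:cut]]
```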
![Page 16: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/16.jpg)
PSoup: Implementation
Query SteM
Allows sharing of work between queries that have overlapping FROM clauses
Use red-black trees to index single-attribute single-relation boolean factors of a query
![Page 17: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/17.jpg)
PSoup: Implementation
Query SteM
For queries involving joins of multiple attributes, tree structure doesn’t work
Instead, a linked list called the predicateList is used
Query SteM contains an array in which each cell represents a query
At beginning of probe by a data tuple, each cell is set to the number of boolean factors in corresponding query
Every time tuple satisfies a boolean factor, cell value is decremented
At end of probe, if cell = 0, that means the data tuple satisfies the given query
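The counting scheme described above can be sketched as follows. Each query's cell starts at its number of boolean factors, every satisfied factor decrements it, and a cell that reaches zero marks a match. The function name `probe` and the representation of the predicateList as `(query index, predicate)` pairs are assumptions made for the example, and all factors are scanned linearly here rather than through the indexes.

```python
from typing import Callable, List, Tuple

# (index of the owning query, predicate over a data tuple)
BooleanFactor = Tuple[int, Callable[[dict], bool]]


def probe(row: dict, factor_counts: List[int], predicate_list: List[BooleanFactor]) -> List[int]:
    """Return the indexes of the queries that the data tuple satisfies."""
    # Reset: each cell holds the number of boolean factors of its query.
    remaining = list(factor_counts)
    for query_index, predicate in predicate_list:
        if predicate(row):
            remaining[query_index] -= 1
    # A cell at zero means every factor of that query was satisfied.
    return [q for q, left in enumerate(remaining) if left == 0]
```

For instance, with `factor_counts = [2, 1]`, query 0 needs both of its factors to hold, while query 1 needs only its single factor.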
![Page 18: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/18.jpg)
PSoup: Implementation
Results Structure
Stores metadata indicating which tuples satisfy which SQCs
Can either be accomplished by previously-mentioned bitmap or by associating a linked list of satisfactory data tuples for each query
Ordering by timestamp is simple for single-table queries
For Join queries, typically use oldest timestamp
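A small sketch of the bitmap option and of the join ordering rule. Here a Python integer serves as a growable bit array per query, with bit i set when the tuple at position i satisfies that query's SQC. The class `BitmapResults`, the position-based addressing, and the helper `join_timestamp` are illustrative assumptions.

```python
from typing import Dict, Iterable


class BitmapResults:
    """Toy bitmap Results Structure: one bit array per query."""

    def __init__(self) -> None:
        self.bits: Dict[int, int] = {}  # queryID -> int used as a bit array

    def mark(self, query_id: int, tuple_pos: int) -> None:
        """Record that the tuple at tuple_pos satisfies the query's SQC."""
        self.bits[query_id] = self.bits.get(query_id, 0) | (1 << tuple_pos)

    def satisfies(self, query_id: int, tuple_pos: int) -> bool:
        return bool((self.bits.get(query_id, 0) >> tuple_pos) & 1)


def join_timestamp(constituent_timestamps: Iterable[int]) -> int:
    """Order a join result by the oldest timestamp among its base tuples."""
    return min(constituent_timestamps)
```

The linked-list alternative trades this fixed per-tuple bit for storage proportional to the number of matches, which may be preferable when few tuples satisfy each query.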
![Page 19: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/19.jpg)
PSoup: Performance
Implemented in Java with customized versions of Eddy and SteMs
Examined performance of two versions:
• PSoup-Partial (PSoup-P): Maintains results corresponding to SQCs in Results Structure, and applies BEGIN-END clauses to retrieve current results on query invocation
• PSoup-Complete (PSoup-C): Continuously maintains results corresponding to current input window for each query in linked lists
Compared against NoMat: measurements of a system that doesn’t materialize results
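The difference between the two variants is when the window is applied: lazily at invocation (PSoup-P) or eagerly as data arrives (PSoup-C). The sketch below is a heavily simplified illustration for one query with a sliding window ending at NOW. It assumes matches arrive in timestamp order, and the class names `LazyResults` and `EagerResults` are hypothetical.

```python
from bisect import bisect_left
from collections import deque
from typing import Deque, List


class LazyResults:
    """PSoup-P style: store every match, apply the window only on invocation."""

    def __init__(self) -> None:
        self.timestamps: List[int] = []

    def on_match(self, ts: int) -> None:
        self.timestamps.append(ts)  # cheap work on the data path

    def invoke(self, now: int, width: int) -> List[int]:
        start = bisect_left(self.timestamps, now - width)
        return self.timestamps[start:]


class EagerResults:
    """PSoup-C style: keep only the current window's results at all times."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.current: Deque[int] = deque()

    def _evict(self, now: int) -> None:
        while self.current and self.current[0] < now - self.width:
            self.current.popleft()

    def on_match(self, ts: int) -> None:
        self.current.append(ts)
        self._evict(ts)  # extra work on the data path

    def invoke(self, now: int) -> List[int]:
        self._evict(now)
        return list(self.current)  # answer is already materialized
```

Both return the same answer for the same window. The lazy version does less work per arriving tuple, while the eager version does less work per invocation.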
![Page 20: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/20.jpg)
PSoup: Performance
Storage Requirements
NoMat: Storage cost = space taken to store base data streams within maximum window over which queries are supported, plus size of structures
PSoup-P: Storage cost = storage cost of NoMat + size of Results Structure (either bitarray or linked-list)
PSoup-C: Storage cost >> storage cost of PSoup-P since C always stores current results of standing queries at a given time
![Page 21: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/21.jpg)
PSoup: Performance
Experimental Setup
Varied window sizes (2^7 to 2^16) and number (1-8)/type of boolean factors
Measured response time and maximum supportable data arrival rate
Examined both P and C with and without predicate indexes
Tested scheme to remove redundancies arising from joins
Used synthetically generated query (2^7 to 2^12) and data streams
![Page 22: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/22.jpg)
PSoup: Performance
Response Time vs. Window Size
![Page 23: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/23.jpg)
PSoup: Performance
Response Time vs. # Interval Predicates
![Page 24: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/24.jpg)
PSoup: Performance
Data Arrival Rate vs. # SQCs
![Page 25: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/25.jpg)
PSoup: Performance
Summary of Results
Materializing results of queries supports higher query invocation rates
Indexing queries and lazily applying windows improves maximum data throughput
PSoup-C requires more memory
PSoup-C optimizes query invocation rate
PSoup-P optimizes data arrival rate
![Page 26: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/26.jpg)
PSoup: Performance
Removing Redundancy in Join Processing
Entry of a query specification or new data
Composite tuples in joins
![Page 27: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/27.jpg)
PSoup: Aggregation Queries
PSoup can support aggregate functions
Only possible to share data structures across queries with identical SELECT-PROJECT-JOIN clause
![Page 28: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/28.jpg)
PSoup: Conclusions
Treats data and query streams analogously
Can support queries that require access to data that arrived before and after the query
Materializes results to cut down on response time and to support disconnected operation
Enables data recharging and monitoring
Future work:
• Write data streams to disk and execute queries over them
• Transfer queries between disk and memory, allowing query execution to be scheduled
• Confront resource constraints when dealing with infinite streams
• Query browser for temporal data
![Page 29: Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson](https://reader035.vdocument.in/reader035/viewer/2022062722/56649f295503460f94c4300b/html5/thumbnails/29.jpg)
Critique
Strengths
• Very well written, easy to follow
• Clear examples, excellent explanation of performance results
• Strong method that reduces processing time with increase in interval predicates
Weaknesses
• Lacking sufficient data on storage costs
• Experimentation only tested one multiple-relation boolean factor for joins; unrealistic
• Didn’t address whether same (or similar) query could be entered twice and accidentally given two IDs