Optimizing queries in modern infrastructure: Challenges, Solutions and Open Problems
Efthymia (Efi) Tsamoura
What motivated modern infrastructure?
Introduction
“Every day, we create 2.5 × 10^18 bytes of data - 90% of the data in the world
today has been created during the last two years”, IBM 2012 report
This data comes from everywhere …
Large Hadron Collider project in CERN delivers data 40M times per second
Sloan Digital Sky Survey project collects 200 GB per night resulting in 140
terabytes of data, so far
Facebook processes more than 500 TB of data daily
We need new infrastructure to store/process this data
E.g., clouds, grids, clusters
Characteristics of modern infrastructure
Wide-scale
Autonomy
Dynamicity
Resource leasing
Adoption of services
Wide-scale
Thousands of computational and storage resources spread all over the world
For example,
PlanetLab consists of more than 1000 resources spread over ~500 sites
Facebook has 9 datacenters each one of them consisting of thousands of servers
Autonomy
Concerns grid and cloud infrastructure
“…cloud is a kind of parallel or distributed environment where autonomous
resources can be shared …”, -Professor Rajkumar Buyya
Each resource belongs to a different owner who has full control over it, e.g.,
he/she can remove the resource anytime, can define/alter the user privileges, can
define which resource portion is available for usage at any time, etc.
Dynamicity
Consequence of autonomy and multiple user utilization
The infrastructure characteristics (e.g., CPU and network load) may vary
significantly over time
Figure: observed latency measurements of a link for transferring 5MB of data between two PlanetLab hosts
during a three-week period
Resource leasing
Concerns cloud infrastructure
“…This pool of resources is typically exploited by a pay-per-use model …”
Users have the ability to lease resources only for as long as needed, based on a per
quantum pricing scheme, e.g. one hour
Examples: Amazon EC2, Windows Azure
Adoption of services
Consequence of heterogeneity
Remote resources cannot easily communicate due to heterogeneous software,
libraries or protocols
“…services are platform independent entities that can be described, published,
discovered and interconnected through novel paradigms and may have
arbitrarily complex functionality …”, -Professor Mike Papazoglou
Even data are put behind services
E.g., web services, XML and Internet protocols
Users can define complex tasks through multiple service flows
Rise of Service Management Platforms, e.g., Yahoo! Query Language, Apache
ServiceMix, Taverna, Pegasus
Types of queries that we consider
SQL queries, e.g.,
SELECT Packets P
FROM Network
WHERE P.PROTOCOL = TCP AND
      byte_test(P, 4, >32784, 0, little) = TRUE AND
      flow(P, to_server, established) = TRUE AND
      P.DESTINATION_PORT = 20031
Data workflows, e.g.,
Optimizing service plans
Optimizing pipelined service plans
The problem of optimizing queries/tasks over services has not received
appropriate attention
Service Management Platforms focus on usability and largely neglect performance,
so users manually specify the order in which services will
be called
However, modifying the service execution order may lead to significant
performance improvements
Optimizing pipelined service plans: Example
A company wants to obtain a list of email addresses of potential customers
selecting only those who have a good payment history for at least one card
and a credit rating above some threshold.
Service S1 returns a customer’s credit card number given his/her identifier
Service S2 returns a customer’s email address given his/her identifier
Service S3 returns a customer’s credit rating (parameterized by a credit
rating threshold) given his/her identifier
Service S4 returns a customer’s payment history (parameterized by a
payment history threshold) given his/her identifier
Input data are customer identifiers, which are returned by service S5
1. Each service has
• its own per-tuple processing cost and
• its own selectivity
2. Services communicate through heterogeneous links
3. Pipelined parallelism is employed
4. Constraints regarding the service call order
5. Services are allocated on host machines
Optimizing pipelined service plans: Example
Data already processed by a service is processed by subsequent services in the
plan at the same time as the former processes new input data
[Figure: tuples t1 to t5 advancing through the services of the plan at times 0 to 4]
Optimizing pipelined service plans: Example
There exist multiple service call plans, but these plans have different
completion times
In the first plan
1. We find the customer identifiers
2. Then, we find the customers’ credit card numbers given these ids
3. Then, we find the customers’ email addresses
4. Then, we find the customers having a good payment history
5. Then, we find the customers having a credit rating above a threshold
In the second plan
1. We find the customer identifiers
2. Then, we find the customers’ credit card numbers given these ids
3. Then, we find the customers having a credit rating above a threshold
given these ids
4. Then, we find the customers having a good payment history
5. Then, we find the customers’ email addresses
Optimizing pipelined service plans: Other examples
Bio-informatics workflows
This workflow takes in Entrez gene
ids then adds the string "ncbi-geneid:"
to the start of each gene id. These gene
ids are then cross-referenced to KEGG
gene ids. Each KEGG gene id is then
sent to the KEGG pathway database
and its relevant pathways returned.
-Paul Fisher, myexperiment
Optimizing pipelined service plans: State-of-the-art
State-of-the-art work either
does not consider data processing cost along with network heterogeneity (e.g.,
Srivastava et al. VLDB 2007) or
does not consider pipelined parallelism
Both assumptions have a negative impact on plan completion time
Current state-of-the-art (SoA) algorithms cannot be applied because the
optimization criterion differs: when pipelining is employed, we do not want to
minimize the total cost spent by all services to process data and transfer the
intermediate results, but the maximum cost that any single service incurs
In pipelined execution, the completion time of a plan is determined by the data
processing rate, which is determined, in turn, by the slowest service
The latter service is called the bottleneck service
Optimizing pipelined service plans: Sum cost metric vs. bottleneck cost metric
cost(S_i) is the cost that a service S_i incurs to process data and send the results to
another service
Sum cost metric: $\min \sum_{i} \mathrm{cost}(S_i)$
Bottleneck cost metric: $\min \max_{S_i} \mathrm{cost}(S_i)$
[Figure: a plan that calls services S1, S2, S3, S4 in sequence]
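The difference between the two metrics can be sketched in a few lines of Python. The cost model here is an assumption made for illustration, not taken from the slides: a service only processes the fraction of tuples that survived the services before it, and transfer costs are ignored. The service names, per-tuple costs and selectivities are hypothetical.

```python
from itertools import permutations

def service_costs(order, cost, sel):
    """Cost each service incurs per input tuple of the plan.

    Assumed model: a service only sees the tuples that survived the
    services before it; link (transfer) costs are ignored.
    """
    rate, costs = 1.0, []
    for s in order:
        costs.append(rate * cost[s])
        rate *= sel[s]
    return costs

def sum_cost(order, cost, sel):
    # Total work done by all services.
    return sum(service_costs(order, cost, sel))

def bottleneck_cost(order, cost, sel):
    # Work done by the slowest (bottleneck) service.
    return max(service_costs(order, cost, sel))

# Hypothetical per-tuple costs and selectivities for two services.
cost = {"A": 1.0, "B": 2.0}
sel = {"A": 0.9, "B": 0.1}

plans = list(permutations(cost))
best_sum = min(plans, key=lambda p: sum_cost(p, cost, sel))
best_bottleneck = min(plans, key=lambda p: bottleneck_cost(p, cost, sel))
print("sum metric prefers:", best_sum)
print("bottleneck metric prefers:", best_bottleneck)
```

With these numbers the two metrics pick opposite orderings, which is why an algorithm that minimizes the sum cost cannot simply be reused for pipelined plans.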
Optimizing pipelined service plans: Solution
Provably optimal algorithm
Experiments performed in PlanetLab EU show that it outperforms Greedy
(VLDB 2007)
It is based on the branch & bound optimization paradigm
It can explore the whole space of solutions without building complete
service plans
It is guided by two measures called ε and ε̄
ε (current partial plan): the maximum cost that a service in the current partial plan
incurs to process data and send the results to another service
ε̄ (services not in the current partial plan): the maximum cost that a service will
incur if appended to the current plan
[Figure: services S1 to S8, split into the current partial plan and the services not yet in it]
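A minimal branch and bound sketch in the spirit of the description above. It is not the algorithm from the talk: it assumes that services are filters (selectivity at most 1), ignores link costs and call-order constraints, and uses hypothetical costs and selectivities. The names `eps` and `eps_bar` mirror the two measures: `eps` is the largest cost inside the partial plan, and `eps_bar` is the largest cost any remaining service would incur if appended right now. Under these assumptions, once `eps >= eps_bar` every completion of the partial plan has bottleneck `eps`, so the plan never needs to be completed.

```python
from itertools import permutations
from math import inf

def bottleneck(order, cost, sel):
    """Bottleneck cost of a complete plan (no link costs, assumed model)."""
    rate, worst = 1.0, 0.0
    for s in order:
        worst = max(worst, rate * cost[s])
        rate *= sel[s]
    return worst

def branch_and_bound(cost, sel):
    """Return (optimal bottleneck, plan prefix that already fixes it)."""
    best = {"value": inf, "prefix": ()}

    def extend(prefix, remaining, rate, eps):
        if eps >= best["value"]:
            return  # bound: this branch cannot beat the best plan found
        # Largest cost a remaining service would incur if appended now;
        # with selectivities <= 1 it can only get cheaper further down.
        eps_bar = max((rate * cost[s] for s in remaining), default=0.0)
        if eps >= eps_bar:
            # The bottleneck is already inside the partial plan.
            best["value"], best["prefix"] = eps, prefix
            return
        for s in sorted(remaining):  # branch on the next service
            extend(prefix + (s,), remaining - {s},
                   rate * sel[s], max(eps, rate * cost[s]))

    extend((), frozenset(cost), 1.0, 0.0)
    return best["value"], best["prefix"]

# Hypothetical per-tuple costs and selectivities.
cost = {"S1": 4.0, "S2": 1.0, "S3": 3.0, "S4": 2.0}
sel = {"S1": 0.5, "S2": 0.8, "S3": 0.2, "S4": 0.9}
value, prefix = branch_and_bound(cost, sel)
print(value, prefix)
```

The services missing from the returned prefix can be appended in any order without changing the bottleneck cost.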
Optimizing multiway join pipelined service
plans
Generalization of the previous problem
One multiway join service merges the data of different sources
Application
Sensor networks (e.g., aggregate data from different sensor deployments)
Mashup platforms (e.g., IBM Mashup Center, WSO2 Mashup Server, YQL)
“A mashup, in web development, is a web page, or web application, that uses and
combines data, presentation or functionality from two or more sources to create new
services.”
Two algorithms are proposed which in many cases find the optimal plan
Improving the performance of adaptive query optimization techniques
Adaptive query optimization
Streaming data and infrastructure dynamicity render query processing a
challenging task
Modern adaptive query processing techniques employ a three phase loop,
called adaptivity loop, to adapt a plan based on runtime infrastructure and
data conditions
Monitoring: collect measurements from the environment and keep them in a
fixed-size sliding window
Analysis: derive statistics using the collected measurements and analyze the
efficiency of the produced plan
Reoptimization: build a new plan if the statistics indicate that the current one is
inefficient
Adaptive query optimization
Sliding windows: First In First Out structure
(new measurements are appended at one end, the oldest measurements are removed from the other)
Fixed size sliding windows face problems
In a small-sized window, the values of the statistics may vary
significantly among different placements of the sliding window, even
when the actual characteristics of the runtime environment do not change
When a window is large enough to keep out-of-date measurements, then
the AQP technique may “overlook” changes in the characteristics that
render the current plan inefficient
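Both problems can be reproduced with a small sketch. The window below is a plain fixed-size FIFO buffer; the measurement stream is hypothetical (values alternating around 0.5, followed by a shift to 0.9), and the window sizes are arbitrary choices made for illustration.

```python
from collections import deque

class SlidingWindow:
    """Fixed-size FIFO window over the most recent measurements."""

    def __init__(self, size):
        # deque(maxlen=...) evicts the oldest value automatically.
        self.values = deque(maxlen=size)

    def append(self, x):
        self.values.append(x)

    def mean(self):
        return sum(self.values) / len(self.values)

# Hypothetical measurements: stable around 0.5, then a real change to 0.9.
stream = [0.4, 0.6] * 20 + [0.9] * 10

small, large = SlidingWindow(3), SlidingWindow(40)
small_means = []
for x in stream:
    small.append(x)
    large.append(x)
    small_means.append(small.mean())

# Small window: the statistic moves although nothing really changed.
stable = small_means[2:40]
print("small window spread while stable:", max(stable) - min(stable))
# Large window: out-of-date values hide most of the real change.
print("after the change:", small.mean(), "vs", large.mean())
```

The small window reports a different mean at almost every step of the stable phase, while the large window still reports a value far from 0.9 ten measurements after the change.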
Adaptive query optimization
SELECT Packets P
FROM Network
WHERE P.PROTOCOL = TCP AND
      byte_test(P, 4, >32784, 0, little) = TRUE AND
      flow(P, to_server, established) = TRUE AND
      P.DESTINATION_PORT = 20031
Adaptive query optimization: Example
The query is taken from the Snort network intrusion detection application
The data packets of this example belong to the 1998 DARPA network intrusion
detection dataset
The figures show how the unconditional drop probability of each of the four
predicates above changes over a period of two weeks
The problem of finding the query plan that minimizes the total cost spent reduces to the
minimum cost predicate ordering problem
The optimal query plan is the one that orders the predicates in descending drop
probability order
For example, on Monday the best predicate ordering is
P.PROTOCOL = TCP, then flow(P, to_server, established) = TRUE, then
byte_test(P, 4, >32784, 0, little) = TRUE, then P.DESTINATION_PORT = 20031,
which yields the selected packets,
as predicate P.PROTOCOL = TCP filters out more data and so less data processing is left for
subsequent predicates
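The ordering rule can be checked with a short sketch. It assumes, for illustration only, that every predicate costs one unit per evaluated packet and that predicates drop packets independently; the predicate names and drop probabilities are hypothetical and are not the measured DARPA values.

```python
from itertools import permutations

def expected_cost(order, drop):
    """Expected number of predicate evaluations per packet.

    Assumes unit cost per predicate and independent predicates, so a
    predicate only runs on packets that survived all earlier ones.
    """
    survive, total = 1.0, 0.0
    for p in order:
        total += survive
        survive *= 1.0 - drop[p]
    return total

# Hypothetical drop probabilities for the four predicates.
drop = {"protocol": 0.7, "flow": 0.5, "byte_test": 0.3, "dest_port": 0.1}

# Order predicates by descending drop probability.
ordering = tuple(sorted(drop, key=drop.get, reverse=True))
print(ordering, expected_cost(ordering, drop))
```

When predicates have different evaluation costs or are correlated, this simple rule no longer suffices, which is the setting that correlated filter ordering algorithms such as A-greedy address.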
Relying on historic data (the packets of the last 2 weeks, 1 week, or 1 day) to collect statistics
on the predicate drop probabilities leads to orderings with approximately 30%
higher per-packet processing cost, because out-of-date statistics prevent
reoptimization
A smaller window significantly increases the runtime overhead, counterbalancing the
potential improvements in the cost of the produced plans
Adaptive query optimization: Example
The above implies that feedback must be qualified
Challenges
We do not know when changes occur
We cannot always arbitrarily enlarge a window due to space limitations (e.g., high-speed networking
applications)
Feedback qualification must incur low overhead
“The nature of data streams is totally
unpredictable”, -Professor Jennifer Widom
[Figure: adaptivity loop of Monitoring, Analysis and Reoptimization, in which Monitoring triggers Analysis only if a change is detected]
Adaptive query optimization: Solution
Novel monitoring phase
Feedback qualification is performed through adaptive window resizing, so as to discard out-of-date
measurements when a change occurs
Characteristics are modeled as random variables
Change detection is employed on the distribution followed by each random variable
If a change is detected, the out-of-date measurements collected prior to the change are
discarded and analysis is triggered; otherwise, nothing happens
When no change occurs, the measurements are “memorized” by the change detection algorithm, so they
do not have to be explicitly kept in a window
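A rough sketch of such a monitoring phase follows. The class names, the `update` interface and the naive mean-shift detector are hypothetical stand-ins chosen for illustration; any online change detector exposing the same method could be plugged in instead.

```python
class MeanShiftDetector:
    """Stand-in detector: reports a change when a measurement deviates
    from the running mean by more than `tolerance`. It keeps O(1) state,
    so past measurements are summarized rather than stored in a window."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        if self.count and abs(x - self.mean) > self.tolerance:
            # Discard everything learned before the change.
            self.count, self.mean = 1, x
            return True
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return False


class Monitor:
    """Monitoring phase: triggers analysis only when a change is detected."""

    def __init__(self, detector, analyze):
        self.detector = detector
        self.analyze = analyze

    def observe(self, x):
        if self.detector.update(x):
            self.analyze(self.detector.mean)


# Hypothetical selectivity measurements with one change at position 50.
triggered = []
monitor = Monitor(MeanShiftDetector(tolerance=0.3), triggered.append)
for x in [0.2] * 50 + [0.8] * 50:
    monitor.observe(x)
print(triggered)
```

Analysis runs once for the whole stream, right after the shift, instead of once per window placement.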
[Figure: if no change is detected, the loop stays in the monitoring phase; the sliding window (a First In First Out structure with appended and removed measurements) is shown for comparison]
Adaptive query optimization: Solution Novel monitoring phase: Characteristics
Non-intrusive to the state-of-the-art adaptivity loop
Can adopt any state-of-the-art online change detection algorithm through its plug-and-play abstraction
Finite State Machine for online change detection algorithms
It provides the means to effectively control the tradeoff between the reoptimization frequency and the
quality of the runtime plan
For example, a plan can be reoptimized only when the detected changes lead to performance
deterioration more than a predefined threshold
So far, similar ideas have been applied in systems processing data stored in dbs, but not in systems processing
streaming data, as the latter is unpredictable
Application of novel monitoring phase: A-greedy
Adaptive correlated filter ordering algorithm A-greedy on the Snort network intrusion
detection application, using the 1998 DARPA intrusion detection dataset
STREAM: The Stanford Data Stream Management System
SELECT Packets P
FROM Network
WHERE P.PROTOCOL = TCP AND
      byte_test(P, 4, >32784, 0, little) = TRUE
A-greedy algorithm to order query predicates
Application of novel monitoring phase: Vivaldi
Rationale of network coordinates (NCs): embed the network latencies into a Euclidean space
[Figure: given the known latencies (orange), find the unknown one (red)]
Experiments are performed using PlanetLab EU latency measurements
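The rationale can be illustrated with a heavily simplified, single-host sketch. It is an assumption-laden illustration rather than the full decentralized Vivaldi algorithm: three neighbors have fixed, already known 2-D coordinates, only the new host moves, the step size is constant, and all coordinates and latencies are hypothetical.

```python
from math import hypot

def place_host(neighbors, start, rounds=200, step=0.25):
    """Move one host until its distances to the neighbors match the
    measured latencies (Vivaldi-style spring update, constant step)."""
    x, y = start
    for _ in range(rounds):
        for (nx, ny), rtt in neighbors:
            dx, dy = x - nx, y - ny
            dist = hypot(dx, dy) or 1e-9  # guard against division by zero
            error = rtt - dist            # positive: too close, push away
            x += step * error * dx / dist
            y += step * error * dy / dist
    return x, y

def predict_latency(a, b):
    """Predicted latency is the Euclidean distance between coordinates."""
    return hypot(a[0] - b[0], a[1] - b[1])

# Known (measured) latencies from the new host to three neighbors.
neighbors = [((0.0, 0.0), 5.0), ((6.0, 0.0), 5.0), ((0.0, 8.0), 5.0)]
host = place_host(neighbors, start=(1.0, 1.0))

# Latency to a fourth host that was never measured directly.
unmeasured = predict_latency(host, (9.0, 12.0))
print(host, unmeasured)
```

Once every host has coordinates, the latency of any pair can be estimated without measuring it, at the price of an embedding error that changes as network conditions change.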
Online algorithm for selectivity change detection
Motivation
Network intrusion detection
Packet classification
β-CUSUM algorithm with O(1) runtime and space complexity
Based on the CUmulative SUM (CUSUM) algorithm
Assumes that the data follow a β distribution
At runtime it incrementally estimates the mean value and standard deviation of the data seen
so far
When a change occurs it employs the Nelder-Mead simplex algorithm to solve a system of
equations
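For orientation, here is a sketch of a plain two-sided CUSUM detector with incrementally maintained mean and standard deviation. It is not β-CUSUM: it makes no β-distribution assumption and involves no Nelder-Mead step. The `drift`, `threshold` and `warmup` parameters and the measurement stream are hypothetical choices.

```python
from math import sqrt

class OnlineCusum:
    """Two-sided CUSUM over standardized measurements, O(1) time and space."""

    def __init__(self, drift=0.5, threshold=5.0, warmup=10):
        self.drift, self.threshold, self.warmup = drift, threshold, warmup
        self._reset()

    def _reset(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.pos = self.neg = 0.0

    def update(self, x):
        """Feed one measurement; return True if a change is detected."""
        if self.n >= self.warmup:
            std = sqrt(self.m2 / (self.n - 1)) or 1e-9
            z = (x - self.mean) / std
            # Accumulate deviations beyond the allowed drift, in both directions.
            self.pos = max(0.0, self.pos + z - self.drift)
            self.neg = max(0.0, self.neg - z - self.drift)
            if max(self.pos, self.neg) > self.threshold:
                self._reset()  # forget the pre-change statistics
                return True
        # Welford's incremental update of mean and sum of squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return False

# Hypothetical selectivity measurements: around 0.2, then around 0.6.
data = [0.18 if i % 2 == 0 else 0.22 for i in range(40)]
data += [0.58 if i % 2 == 0 else 0.62 for i in range(40)]

detector = OnlineCusum()
detections = [i for i, x in enumerate(data) if detector.update(x)]
print(detections)
```

The detector stores only a handful of numbers regardless of how many measurements it has seen, which is the property that matters at line rate.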
β-CUSUM outperforms
ADWIN2, SIAM ICDM 2007
Martingale Test, IEEE PAMI 2010
ChangeFinder IEEE TKDE 2006
Meta-algorithm, VLDB 2004
Normal CUSUM, IEEE TNN 2008
Given packets of 40 bytes each at 40 Gbit/s speeds, a router has
less than 10 nanoseconds to process each packet (40 bytes = 320 bits, and 320 bits / 40 Gbit/s = 8 ns)
Dealing with multiple criteria
Multi-objective optimization of data flows
in a multi-cloud environment
Resources in cloud infrastructure are leased
Users have constraints regarding the monetary cost and completion time of
a query
Multi-objective optimization of data flows in a multi-cloud environment: State-of-the-art
Mariposa
Wide-area distributed database system
User tasks can be broken into stages
Each remote db site offers to execute a query stage within some delay at some cost. This (delay, cost) pair is
called a bid
For each input query, users submit a cost-delay tradeoff function u(delay) = cost
Mariposa finds which db sites will execute which query stages such that the user satisfaction is
maximized
[Figure: function u, cost ($) against delay: "I will pay u(delay) to have my task executed within time delay"]
Professor Mike Stonebraker
[Figure: a central query optimizer and remote db systems 1, 2 and 3; site bids for stage 1 and for stage 2, each plotted as cost ($) against delay]
[Figure: site bids for stage 1, site bids for stage 2, and function u, each plotted as cost ($) against delay]
One plan executes Stage 1 on the blue site and Stage 2 on the green remote site. Its
total delay and cost equal the sums of the delays and costs of the bids, respectively
Another plan executes Stage 1 on the red site and Stage 2 on the blue remote site. Its
total delay and cost equal the sums of the delays and costs of the bids, respectively
[Figure: Plans 1 to 4 plotted as points against function u; the axes are the delay that a user perceives and the cost that a user pays]
User satisfaction is defined as the vertical distance between a plan's point and function u
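A small sketch of this selection problem follows. It reads "vertical distance" as u(total delay) minus total cost, which is an interpretation of the slide, and it enumerates every combination of bids exhaustively, which is not how Mariposa itself searches. The site names echo the colors above, while the bid values and the budget function are hypothetical.

```python
from itertools import product

def best_plan(bids_per_stage, budget):
    """Pick one bid per stage so that budget(total delay) - total cost,
    taken here as the user satisfaction, is maximized."""
    best, best_satisfaction = None, float("-inf")
    for plan in product(*bids_per_stage):
        delay = sum(bid[1] for bid in plan)  # delays add up across stages
        cost = sum(bid[2] for bid in plan)   # so do costs
        satisfaction = budget(delay) - cost
        if satisfaction > best_satisfaction:
            best, best_satisfaction = plan, satisfaction
    return best, best_satisfaction

def budget(delay):
    """Hypothetical function u: the user pays less for slower answers."""
    return 100 - 2 * delay

# Hypothetical bids as (site, delay, cost).
stage1 = [("blue", 10, 20), ("red", 20, 8)]
stage2 = [("green", 10, 15), ("blue", 5, 30)]

plan, satisfaction = best_plan([stage1, stage2], budget)
print([bid[0] for bid in plan], satisfaction)
```

Exhaustive enumeration grows exponentially with the number of stages, so it only serves to make the objective concrete.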
Mariposa faces a limitation that renders it inapplicable in modern cloud
infrastructure
The capabilities of each remote db site are characterized by a single point
However, in cloud infrastructure these capabilities are better characterized by a
function
[Figure: site bids for stage 1 and stage 2, cost ($) against delay, drawn first as single points and then as functions]
A function better represents elasticity, i.e., allocate as many resources as you
wish and pay according to the number of allocated resources
[Figure: site bids for a stage, cost ($) against delay]
If we allocate, e.g., 10 resources, we have to pay $20 and the query stage
will be completed within 10 min: bid (10 min, $20)
If we allocate, e.g., 2 resources, we have to pay $10 and the query stage
will be completed within 20 min: bid (20 min, $10)
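One simple way to work with such a function is to sample it at a few allocations. The sketch below uses only the two points given above and treats them as the only available allocations, which is an assumption; the helper name `cheapest_within` is hypothetical.

```python
def cheapest_within(bid_points, deadline):
    """Cheapest (delay, cost) bid that finishes within `deadline`
    minutes, or None if no sampled allocation is fast enough."""
    feasible = [bid for bid in bid_points if bid[0] <= deadline]
    return min(feasible, key=lambda bid: bid[1]) if feasible else None

# The two sampled points: (delay in minutes, cost in dollars).
site_bids = [(10, 20), (20, 10)]

print(cheapest_within(site_bids, 15))  # only the 10-resource allocation fits
print(cheapest_within(site_bids, 20))  # the cheaper 2-resource allocation fits
print(cheapest_within(site_bids, 5))   # nothing is fast enough
```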
Multi-objective optimization of data flows in a multi-cloud environment: Solution
Optimal algorithm for convex user budget functions
Bounded-error pseudopolynomial algorithms for non-convex user budget functions
Experimental evaluation shows that the proposed solutions outperform state-of-the-art
algorithms for similar multi-criteria query allocation problems (Kllapi et al., SIGMOD
2011)
Future work
Directions for future work
Robust optimization on streaming environments
Current work on robust optimization deals with data stored in dbs
The proposed adaptivity loop must be adopted
Data correlation
Real-world data is correlated
For example, hardware failures in a cluster
State-of-the-art optimization techniques ignore possible correlations
New stream sampling estimation algorithms must be developed
Optimization accounting for multiple criteria, for example
Energy consumption and data sensing frequency in a sensor network
Energy consumption and data availability in a data warehouse
Monetary cost and data quality (e.g., freshness and completeness) in a Mashup
application
Monetary cost and task completion time in a cloud infrastructure
For more details http://delab.csd.auth.gr/~tsamoura/
E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Optimization of Decentralized Multi-way Join Queries over Pipelined Filtering Services", Computing, Springer, vol. 94, no. 12, pp. 939-972, 2012
E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Decentralized execution of linear workflows over Web Services", Future Generation Computer Systems (FGCS), Elsevier, vol. 27, no. 3, pp. 341-347, 2011
E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Optimal Service Ordering in Decentralized Queries over Web Services", International Journal of Knowledge-based Organizations (IJKBO), IGI GLOBAL, vol. 1, no. 2, pp. 1-16, 2011
E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Lifting the Burden of History in Adaptive Ordering of Pipelined Stream Filters", Proceedings of the 7th IEEE International Conference on Data Engineering, Workshop on Self Managing Database Systems (ICDE-SMDB), 2012
E. Tsamoura, A. Gounaris and Y. Manolopoulos "On the Quest of Optimal Service Ordering in Decentralized Queries", Proc. of the 29th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2010), pp. 277-278, Zurich, Switzerland, 2010
A. Gounaris, E. Tsamoura and Y. Manolopoulos, "Adaptive query processing in distributed settings", Intelligent Query Processing, Barbara Catania & Lakhmi Jain (Eds), Springer-Verlag, pp. 211–236, 2012
E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Queries over Web Services", Web Data Management Trails, Lakhmi Jain and Athena Vakali (Eds), Springer-Verlag, pp. 139-169, 2011
The research team I am working with
Data engineering laboratory, http://delab.csd.auth.gr/
Lab director, Professor Yannis Manolopoulos
Academic staff
A. Papadopoulos
A. Gounaris
K. Tsichlas
Thank you !
Questions???