Optimizing queries in modern infrastructure: Challenges, Solutions and Open Problems

Efthymia (Efi) Tsamoura

What motivated modern infrastructure?

Introduction

“Every day, we create 2.5 × 10^18 bytes of data - 90% of the data in the world today has been created during the last two years”, IBM 2012 report

This data comes from everywhere …

The Large Hadron Collider project at CERN delivers data 40M times per second

The Sloan Digital Sky Survey project collects 200 GB per night, resulting in 140 terabytes of data so far

Facebook processes more than 500 TB of data daily

We need new infrastructure to store and process this data

E.g., clouds, grids, clusters

Characteristics of modern infrastructure

Wide-scale

Autonomy

Dynamicity

Resource leasing

Adoption of services


Wide-scale

Thousands of computational and storage resources spread all over the world

For example,

PlanetLab consists of more than 1000 resources spread over ~500 sites

Facebook has 9 datacenters, each consisting of thousands of servers

Autonomy

Concerns grid and cloud infrastructure

“…cloud is a kind of parallel or distributed environment where autonomous

resources can be shared …”, -Professor Rajkumar Buyya

Each resource belongs to a different owner who has full control over it, e.g., the owner can remove the resource at any time, define or alter user privileges, and define which portion of the resource is available for use at any time

Dynamicity

Consequence of autonomy and multiple user utilization

The infrastructure characteristics (e.g., CPU and network load) may vary significantly over time

[Figure: observed latency of a link when transferring 5 MB of data between two PlanetLab hosts over a three-week period]


Resource leasing

Concerns cloud infrastructure

“…This pool of resources is typically exploited by a pay-per-use model …”

Users can lease resources only for as long as needed, based on a per-quantum pricing scheme, e.g., per hour

Examples: Amazon EC2, Windows Azure



Adoption of services


Consequence of heterogeneity

Remote resources cannot easily communicate because of heterogeneous software, libraries, or protocols

“…services are platform independent entities that can be described, published, discovered and interconnected through novel paradigms and may have arbitrarily complex functionality …”, -Professor Mike Papazoglou

Even data are put behind services

E.g., web services, XML and Internet protocols

Users can define complex tasks through multiple service flows



Rise of Service Management Platforms, e.g., Yahoo! Query Language, Apache ServiceMix, Taverna, Pegasus

Types of queries that we consider

SQL queries, e.g.,

    SELECT Packets P
    FROM Network
    WHERE P.PROTOCOL = TCP AND
          byte_test(P, 4, >32784, 0, little) = TRUE AND
          flow(P, to_server, established) = TRUE AND
          P.DESTINATION_PORT = 20031

Data workflows, e.g.,

Optimizing service plans

Optimizing pipelined service plans

The problem of optimizing queries/tasks over services has not received adequate attention

Service Management Platforms focus on usability and largely neglect performance, so users manually specify the order in which services are called

However, modifying the service execution order may lead to significant performance improvements

Optimizing pipelined service plans: Example

A company wants to obtain a list of email addresses of potential customers

selecting only those who have a good payment history for at least one card

and a credit rating above some threshold.

Service S1 returns a customer’s credit card number given the customer’s identifier

Service S2 returns a customer’s email address given the customer’s identifier

Service S3 returns a customer’s credit rating (parameterized by a credit rating threshold) given the customer’s identifier

Service S4 returns a customer’s payment history (parameterized by a payment history threshold) given the customer’s identifier

Input data are customer identifiers, which are returned by S5

1. Each service has

• its own per-tuple processing cost and

• its own selectivity

2. Services communicate through heterogeneous links

3. Pipelined parallelism is employed

4. Constraints regarding the service call order

5. Services are allocated on host machines



Data already processed by a service is processed by the subsequent services in the plan while that service processes new input data

[Figure: pipelined execution over time; at time 0 the input tuples are t5 t4 t3 t2 t1]






Multiple service call plans exist, but they have different completion times

In the first plan

1. We find the customer identifiers (S5)

2. Then, we find the customers’ credit card numbers given these ids (S1)

3. Then, we find the customers’ email addresses (S2)

4. Then, we find the customers having a good payment history (S4)

5. Then, we find the customers having a credit rating above a threshold (S3)

In the second plan

1. We find the customer identifiers (S5)

2. Then, we find the customers’ credit card numbers given these ids (S1)

3. Then, we find the customers having a credit rating above a threshold given these ids (S3)

4. Then, we find the customers having a good payment history (S4)

5. Then, we find the customers’ email addresses (S2)

Optimizing pipelined service plans: Other examples

Bio-informatics workflows

This workflow takes in Entrez gene ids then adds the string "ncbi-geneid:" to the start of each gene id. These gene ids are then cross-referenced to KEGG gene ids. Each KEGG gene id is then sent to the KEGG pathway database and its relevant pathways returned.

-Paul Fisher, myexperiment

Optimizing pipelined service plans: State-of-the-art

State-of-the-art work either

does not consider data processing cost along with network heterogeneity (e.g., Srivastava et al. VLDB 2007) or

does not consider pipelined parallelism

Both simplifications have a negative impact on plan completion time

Current state-of-the-art algorithms cannot be applied because the optimization criterion differs: when pipelining is employed we do not want to minimize the total cost spent by all services to process data and transfer the intermediate results, but the maximum cost that any single service incurs

In pipelined execution, the completion time of a plan is governed by its data processing rate, which is determined, in turn, by the slowest service

This service is called the bottleneck service

Optimizing pipelined service plans: Sum cost metric vs. bottleneck cost metric

cost(S_i) is the cost a service S_i incurs to process data and send the results to another service

Sum cost metric: $\min \sum_{i} \mathrm{cost}(S_i)$

Bottleneck cost metric: $\min \max_{i} \mathrm{cost}(S_i)$

[Figure: service plan over S1, S2, S3, S4]
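As a rough illustration of how the two metrics can disagree, the sketch below scores two orderings of three services under both metrics. The per-tuple costs, the selectivities and the helper names are hypothetical, and cost(S_i) is assumed here to be the service's per-tuple cost (processing plus sending) scaled by the fraction of the input that reaches it.

```python
def service_costs(plan, cost, selectivity):
    """Cost each service incurs per input tuple of the plan (assumed model)."""
    costs = []
    reach = 1.0  # fraction of the original input that reaches the service
    for s in plan:
        costs.append(reach * cost[s])
        reach *= selectivity[s]
    return costs


def sum_cost(plan, cost, selectivity):
    """Sum cost metric: total work done by all services."""
    return sum(service_costs(plan, cost, selectivity))


def bottleneck_cost(plan, cost, selectivity):
    """Bottleneck cost metric: work done by the slowest service."""
    return max(service_costs(plan, cost, selectivity))


# Hypothetical per-tuple costs and selectivities.
cost = {"S1": 4.0, "S2": 2.0, "S3": 6.0}
selectivity = {"S1": 0.5, "S2": 1.0, "S3": 0.25}

plan_a = ["S1", "S2", "S3"]
plan_b = ["S3", "S1", "S2"]

for name, plan in (("plan_a", plan_a), ("plan_b", plan_b)):
    print(name,
          "sum =", sum_cost(plan, cost, selectivity),
          "bottleneck =", bottleneck_cost(plan, cost, selectivity))
```

With these made-up numbers the sum metric prefers plan_b (7.25 against 8.0) while the bottleneck metric prefers plan_a (4.0 against 6.0), so the best plan under one metric need not be the best under the other.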

Optimizing pipelined service plans: Solution

Provably optimal algorithm

Experiments performed on PlanetLab EU show that it outperforms Greedy (VLDB 2007)

It is based on the branch & bound optimization paradigm

It can explore the whole space of solutions without building complete service plans

It is guided by two measures, called ε and ε̄

ε is the maximum cost that a service in the current partial plan incurs to process data and send the results to another service

ε̄ is the maximum cost that a service not in the current partial plan will incur if appended to the current plan

[Figure: services S2, S6, S1, S3, S5, S7, S8, S4, split into the current partial plan and the services not in the current partial plan]
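To make the branch & bound idea concrete, here is a minimal generic sketch for linear plans under the bottleneck metric. It is not the algorithm from the slides: it prunes only with the bottleneck of the partial plan (the role ε plays above) and does not reproduce the ε̄-based reasoning or the service call order constraints. The cost model, the numbers and all names (c, sel, t, branch_and_bound) are assumptions made for illustration.

```python
from itertools import permutations


def plan_bottleneck(plan, c, sel, t):
    """Bottleneck cost of a complete linear plan (assumed cost model).

    A service's cost is the fraction of input reaching it times
    (processing cost + selectivity * link cost to the next service).
    """
    reach, worst = 1.0, 0.0
    for k, s in enumerate(plan):
        send = sel[s] * t[s][plan[k + 1]] if k + 1 < len(plan) else 0.0
        worst = max(worst, reach * (c[s] + send))
        reach *= sel[s]
    return worst


def branch_and_bound(services, c, sel, t):
    best = {"plan": None, "cost": float("inf")}

    def extend(plan, remaining, eps, reach):
        # eps: max cost over services whose cost is final (all but the last).
        # reach: fraction of the input that reaches the last service in plan.
        if not remaining:
            total = max(eps, reach * c[plan[-1]]) if plan else 0.0
            if total < best["cost"]:
                best["plan"], best["cost"] = list(plan), total
            return
        for s in sorted(remaining):
            if plan:
                p = plan[-1]
                # Appending s fixes the sending cost of the previous service.
                new_eps = max(eps, reach * (c[p] + sel[p] * t[p][s]))
                new_reach = reach * sel[p]
            else:
                new_eps, new_reach = 0.0, 1.0
            # s will cost at least its processing share, whatever follows it.
            bound = max(new_eps, new_reach * c[s])
            if bound >= best["cost"]:
                continue  # no completion of this prefix can beat the best plan
            extend(plan + [s], remaining - {s}, new_eps, new_reach)

    extend([], frozenset(services), 0.0, 1.0)
    return best["plan"], best["cost"]


# Hypothetical services, processing costs, selectivities and link costs.
services = ["S1", "S2", "S3", "S4"]
c = {"S1": 2.0, "S2": 5.0, "S3": 1.0, "S4": 3.0}
sel = {"S1": 0.8, "S2": 0.3, "S3": 0.5, "S4": 0.9}
t = {
    "S1": {"S2": 1.0, "S3": 4.0, "S4": 2.0},
    "S2": {"S1": 1.0, "S3": 0.5, "S4": 3.0},
    "S3": {"S1": 4.0, "S2": 0.5, "S4": 1.5},
    "S4": {"S1": 2.0, "S2": 3.0, "S3": 1.5},
}

bb_plan, bb_cost = branch_and_bound(services, c, sel, t)
brute_cost = min(plan_bottleneck(p, c, sel, t) for p in permutations(services))
print(bb_plan, bb_cost, brute_cost)
```

The pruning is safe because the bottleneck of a plan can only grow as services are appended, so a prefix whose bound already reaches the best known cost is discarded without building complete plans from it.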

Optimizing multiway join pipelined service plans

Generalization of the previous problem

One multiway join service merges the data of different sources

Application

Sensor networks (e.g., aggregate data from different sensor deployments)

Mashup platforms (e.g., IBM Mashup Center, WSO2 Mashup Server, YQL)

“A mashup, in web development, is a web page, or web application, that uses and combines data, presentation or functionality from two or more sources to create new services.”

Two algorithms are proposed which in many cases find the optimal plan

Improving the performance of adaptive query optimization techniques

Adaptive query optimization

Streaming data and infrastructure dynamicity render query processing a

challenging task

Modern adaptive query processing (AQP) techniques employ a three-phase loop, called the adaptivity loop, to adapt a plan based on runtime infrastructure and data conditions

Monitoring: collect measurements from the environment and keep them in a fixed-size sliding window

Analysis: derive statistics from the collected measurements and analyze the efficiency of the current plan

Reoptimization: build a new plan if the statistics indicate that the current one is inefficient


Sliding windows: First In First Out structure (new measurements are appended at one end, the oldest measurements are removed from the other)

Fixed-size sliding windows face problems

In a small window, the values of the statistics may vary significantly among different placements of the sliding window, even when the actual characteristics of the runtime environment do not change

When a window is large enough to retain out-of-date measurements, the adaptive query processing (AQP) technique may “overlook” changes in the characteristics that render the current plan inefficient


Adaptive query optimization: Example

    SELECT Packets P
    FROM Network
    WHERE P.PROTOCOL = TCP AND
          byte_test(P, 4, >32784, 0, little) = TRUE AND
          flow(P, to_server, established) = TRUE AND
          P.DESTINATION_PORT = 20031

The query is taken from the Snort network intrusion detection application

The data packets of this example belong to the 1998 DARPA network intrusion detection dataset

The figures show how the unconditional drop probability of each of the four predicates of the query changes over a period of two weeks







The problem of finding the query plan that minimizes the total cost reduces to the minimum cost predicate ordering problem

The optimal query plan is the one that orders the predicates in descending order of drop probability

For example, on Monday the best predicate ordering is

P.PROTOCOL=TCP → flow(P, to_server, established)=TRUE → byte_test(P, 4, >32784, 0, little)=TRUE → P.DESTINATION_PORT=20031 → Selected Packets

as predicate P.PROTOCOL=TCP filters out the most data, leaving less data for the subsequent predicates to process
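The ordering rule can be checked on a small example. The sketch below computes the expected per-packet cost of a predicate ordering and compares the descending-drop-probability ordering against every permutation. The drop probabilities are made up (they are not the DARPA measurements), and the sketch assumes that predicates are independent and equally expensive to evaluate.

```python
from itertools import permutations


def expected_cost(order, drop, cost):
    """Expected per-packet cost of evaluating the predicates in this order."""
    total, alive = 0.0, 1.0  # alive: probability a packet reaches the predicate
    for p in order:
        total += alive * cost[p]
        alive *= 1.0 - drop[p]
    return total


# Hypothetical unconditional drop probabilities and unit evaluation costs.
drop = {"protocol": 0.6, "flow": 0.4, "byte_test": 0.3, "dst_port": 0.1}
cost = {p: 1.0 for p in drop}

by_drop = sorted(drop, key=drop.get, reverse=True)
cheapest = min(expected_cost(o, drop, cost) for o in permutations(drop))

print(by_drop, expected_cost(by_drop, drop, cost), cheapest)
```

When evaluation costs differ or predicates are correlated, plain descending drop probability is no longer guaranteed to be the cheapest ordering, which is the setting that correlated filter ordering algorithms address.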







Statistics on the predicate drop probabilities collected from historic data (the packets of the last 2 weeks, 1 week, or 1 day) lead to orderings with approximately 30% higher per-packet processing cost, because reoptimization is not triggered when the statistics are out of date

A smaller window significantly increases the runtime overhead, counterbalancing the potential improvements in the cost of the produced plans


The above implies that feedback must be qualified

Challenges

We do not know when changes occur

We cannot always arbitrarily enlarge a window due to space limitations (e.g., in high-speed networking applications)

Feedback qualification must incur low overhead

“The nature of data streams is totally unpredictable”, -Professor Jennifer Widom

[Figure: adaptivity loop (Monitoring, Analysis, Reoptimization) in which Monitoring proceeds to Analysis only if a change is detected]

Adaptive query optimization: Solution

Novel monitoring phase

Feedback qualification is performed through adaptive window resizing, so that out-of-date measurements are discarded when a change occurs

Characteristics are modeled as random variables

The monitoring phase employs change detection on the distribution followed by each random variable

If a change is detected, the out-of-date measurements collected prior to the change are discarded and analysis is triggered; otherwise, nothing happens

When no change occurs, the measurements are “memorized” by the change detection algorithm, so they do not have to be explicitly kept in a window
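The control flow of such a monitoring phase might look like the sketch below. The detector is a deliberately simple stand-in (it flags a change when a value deviates from the running mean by more than a threshold), not the change detection algorithm used in the presented work, and the class names, threshold and sample values are all hypothetical. The point is only that the monitor stores no window: the detector's constant-size summary "memorizes" the measurements, and it is reset when a change is flagged.

```python
class MeanShiftDetector:
    """Toy online change detector with O(1) state (illustrative only)."""

    def __init__(self, threshold, warmup=5):
        self.threshold = threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        """Return True if x signals a change; otherwise fold x into the summary."""
        if self.n >= self.warmup and abs(x - self.mean) > self.threshold:
            # Discard everything seen before the change and restart from x.
            self.n, self.mean = 1, x
            return True
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return False


class Monitor:
    """Monitoring phase: triggers analysis only when the detector flags a change."""

    def __init__(self, detector, analyze):
        self.detector = detector
        self.analyze = analyze

    def observe(self, x):
        if self.detector.update(x):
            self.analyze(self.detector.mean)
        # If no change is detected, nothing else happens.


analysis_calls = []
monitor = Monitor(MeanShiftDetector(threshold=0.3), analysis_calls.append)

# Hypothetical selectivity measurements with one abrupt change.
for value in [0.2] * 20 + [0.8] * 20:
    monitor.observe(value)

print(analysis_calls)
```

Because the detector is passed in as an object with an update method, a different online change detection algorithm can be swapped in without touching the monitor.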


Adaptive query optimization: Solution

Novel monitoring phase: Characteristics

Non-intrusive to the state-of-the-art adaptivity loop

Can adopt any state-of-the-art online change detection algorithm through its plug-and-play abstraction

[Figure: finite state machine for online change detection algorithms]

It provides the means to effectively control the tradeoff between the reoptimization frequency and the quality of the runtime plan

For example, a plan can be reoptimized only when the detected changes lead to a performance deterioration greater than a predefined threshold

So far, similar ideas have been applied in systems processing data stored in dbs, but not in systems processing streaming data, as the latter is unpredictable

Application of novel monitoring phase: A-greedy

• Adaptive correlated filter ordering algorithm A-greedy on the Snort network intrusion detection application, using the 1998 DARPA intrusion detection dataset

[Figure: STREAM, the Stanford Data Stream Management System; the A-greedy algorithm orders the predicates of the Snort query]

Application of novel monitoring phase: Vivaldi

Rationale of network coordinates (NCs): embed the network latencies into a Euclidean space

[Figure: given the known (orange) latencies, find the unknown (red) one]

Experiments are performed using PlanetLab EU latency measurements
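For readers unfamiliar with network coordinates, the following is a minimal sketch of the basic Vivaldi update with a constant timestep: each latency sample moves a node's coordinate along the line to its peer, in proportion to the difference between the measured latency and the current coordinate distance. The host positions, the timestep delta and the number of rounds are hypothetical, and the sketch leaves out the adaptive timestep and the monitoring phase discussed in the slides.

```python
import math
import random


def vivaldi_step(xi, xj, rtt, rng, delta=0.05):
    """Move coordinate xi according to one latency sample to the node at xj."""
    dx = [a - b for a, b in zip(xi, xj)]
    dist = math.hypot(*dx)
    if dist == 0.0:
        # Coincident coordinates: push in a random direction.
        dx = [rng.uniform(-1.0, 1.0) for _ in xi]
        dist = math.hypot(*dx) or 1.0
    error = rtt - dist  # positive: the nodes are too close in the embedding
    return [a + delta * error * d / dist for a, d in zip(xi, dx)]


def mean_abs_error(coords, latency):
    """Average gap between measured latencies and coordinate distances."""
    n = len(coords)
    errs = [abs(latency[i][j] - math.dist(coords[i], coords[j]))
            for i in range(n) for j in range(i + 1, n)]
    return sum(errs) / len(errs)


rng = random.Random(0)

# Hypothetical hosts; latencies are taken to be their true pairwise distances.
true_pos = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
latency = [[math.dist(p, q) for q in true_pos] for p in true_pos]

coords = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in true_pos]
error_before = mean_abs_error(coords, latency)

for _ in range(2000):
    i, j = rng.sample(range(len(coords)), 2)
    coords[i] = vivaldi_step(coords[i], coords[j], latency[i][j], rng)

error_after = mean_abs_error(coords, latency)
print(error_before, error_after)
```

Once the coordinates have settled, the latency between two hosts that never measured each other can be estimated as the distance between their coordinates.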

Online algorithm for selectivity change detection

Motivation

Network intrusion detection

Packet classification

β-CUSUM algorithm with O(1) runtime and space complexity

Based on the CUmulative SUM (CUSUM) algorithm

Assumes that the data follow a β distribution

At runtime it incrementally estimates the mean value and standard deviation of the data presented so far

When a change occurs it employs the Nelder-Mead simplex algorithm to solve a system of equations
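The sketch below shows the general shape of a constant-space CUSUM detector, using Welford's incremental estimates of the mean and standard deviation. It is a plain two-sided CUSUM on standardized values, not β-CUSUM: it makes no β-distribution assumption and has no Nelder-Mead step. The parameters k, h and warmup, the class names and the sample stream are assumptions chosen for illustration.

```python
import math


class RunningStats:
    """Welford's incremental mean and standard deviation (O(1) space)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class Cusum:
    """Two-sided CUSUM on standardized values (illustrative parameters)."""

    def __init__(self, k=0.5, h=5.0, warmup=20):
        self.k, self.h, self.warmup = k, h, warmup
        self._restart()

    def _restart(self):
        self.stats = RunningStats()
        self.pos = self.neg = 0.0

    def update(self, x):
        """Return True if x pushes a cumulative sum past the threshold h."""
        std = self.stats.std()
        if self.stats.n >= self.warmup and std > 0.0:
            z = (x - self.stats.mean) / std
            self.pos = max(0.0, self.pos + z - self.k)
            self.neg = max(0.0, self.neg - z - self.k)
            if self.pos > self.h or self.neg > self.h:
                # Forget the pre-change data and start over from x.
                self._restart()
                self.stats.add(x)
                return True
        self.stats.add(x)
        return False


def noisy(level, count):
    """Deterministic stand-in for noisy selectivity measurements."""
    return [level + (0.02 if i % 2 else -0.02) for i in range(count)]


stream = noisy(0.2, 50) + noisy(0.6, 50)
detector = Cusum()
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
print(alarms)
```

Each update touches a fixed number of variables regardless of how many values have been seen, which is the property that matters when only a few nanoseconds are available per packet.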

β-CUSUM outperforms

ADWIN2, SIAM ICDM 2007

Martingale Test, IEEE PAMI 2010

ChangeFinder IEEE TKDE 2006

Meta-algorithm, VLDB 2004

Normal CUSUM, IEEE TNN 2008

Given packets of 40 bytes each at 40 Gbit/s, a router has less than 10 nanoseconds to process each packet (40 bytes = 320 bits, and 320 bits / 40 Gbit/s = 8 ns)

Dealing with multiple criteria

Multi-objective optimization of data flows in a multi-cloud environment

Resources in cloud infrastructure are leased

Users have constraints regarding the monetary cost and the completion time of a query


Multi-objective optimization of data flows in a multi-cloud environment: State-of-the-art

Mariposa

Wide-area distributed database system

User tasks can be broken into stages

Each remote db site offers to execute a query stage within some delay at some cost. This (delay, cost) pair is called a bid

For each input query, users submit a cost-delay tradeoff function u(delay) = cost

Mariposa finds which db sites will execute which query stages such that user satisfaction is maximized

[Figure: function u plotted as cost ($) against delay]

Function u: “I will pay u(delay) to have my task executed within time delay”, Professor Mike Stonebraker

[Figure: a central query optimizer and remote db systems 1, 2 and 3; the site bids for Stage 1 and for Stage 2 are plotted as cost ($) against delay]



[Figure: site bids and function u plotted as cost ($) against delay]

A plan that executes Stage 1 on the blue site and Stage 2 on the green remote site has a total delay and a total cost equal to the sums of the delays and of the costs of the selected bids, respectively

The same holds for a plan that executes Stage 1 on the red site and Stage 2 on the blue remote site



[Figure: Plans 1 to 4 and function u plotted as cost ($) against delay]

User satisfaction is defined as the vertical distance between function u and the point given by the delay that the user perceives and the cost that the user pays
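The selection problem described above can be written down in a few lines. The sketch below enumerates every assignment of stages to sites, sums the delays and costs of the chosen bids, and keeps the assignment with the largest satisfaction, taken here to be u(total delay) minus total cost. The bids, the function u and all names are hypothetical, and exhaustive enumeration is used only for clarity; it is not how Mariposa itself searches.

```python
from itertools import product


def best_plan(bids, u):
    """Pick one site per stage so that u(total delay) - total cost is maximal.

    bids maps each stage to {site: (delay, cost)}.
    """
    stages = sorted(bids)
    best = None
    for choice in product(*(bids[s].items() for s in stages)):
        delay = sum(d for _site, (d, _c) in choice)
        cost = sum(c for _site, (_d, c) in choice)
        satisfaction = u(delay) - cost
        if best is None or satisfaction > best["satisfaction"]:
            best = {
                "assignment": {s: site for s, (site, _bid) in zip(stages, choice)},
                "delay": delay,
                "cost": cost,
                "satisfaction": satisfaction,
            }
    return best


# Hypothetical bids: (delay in minutes, cost in $) per site and stage.
bids = {
    "stage1": {"blue": (10, 20), "red": (5, 35)},
    "stage2": {"green": (15, 10), "blue": (8, 25)},
}


def u(delay):
    """Hypothetical user function: the user pays less for a slower answer."""
    return 100 - 2 * delay


plan = best_plan(bids, u)
print(plan)
```

With these numbers the cheapest-but-slowest combination (Stage 1 on blue, Stage 2 on green) wins with a satisfaction of 20, because the $30 it costs lies furthest below the $50 the user is willing to pay for a 25-minute delay.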


Multi-objective optimization of data flows in a multi-cloud environment

Mariposa faces a limitation that renders it inapplicable to modern cloud infrastructure

The capabilities of each remote db site are characterized by a single point

However, in cloud infrastructure these capabilities are better characterized by a function

[Figure: site bids for Stage 1 and Stage 2, shown as single points and as functions, plotted as cost ($) against delay]

A function better represents elasticity, i.e., allocate as many resources as you wish and pay according to the number of allocated resources

[Figure: site bids for a stage, plotted as cost ($) against delay]

If we allocate, e.g., 10 resources, we have to pay $20 and the query stage will be completed within 10 min: (10 min, $20)

If we allocate, e.g., 2 resources, we have to pay $10 and the query stage will be completed within 20 min: (20 min, $10)


Multi-objective optimization of data flows in a multi-cloud environment: Solution

Optimal algorithm for convex user budget functions

Bounded-error pseudopolynomial algorithms for non-convex user budget functions

Experimental evaluation shows that the proposed solutions outperform state-of-the-art algorithms for similar multi-criteria query allocation problems (Kllapi et al., SIGMOD 2011)

Future work

Directions for future work

Robust optimization on streaming environments

Current work on robust optimization deals with data stored in dbs

The proposed adaptivity loop must be adopted

Data correlation

Real-world data is correlated

For example, hardware failures in a cluster

State-of-the-art optimization techniques ignore possible correlations

New stream sampling estimation algorithms must be developed

Optimization accounting for multiple criteria, for example

Energy consumption and data sensing frequency in a sensor network

Energy consumption and data availability in a data warehouse

Monetary cost and data quality (e.g., freshness and completeness) in a Mashup application

Monetary cost and task completion time in a cloud infrastructure

For more details http://delab.csd.auth.gr/~tsamoura/

E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Optimization of Decentralized Multi-way Join Queries over Pipelined Filtering Services", Computing, Springer, vol. 94, no. 12, pp. 939-972, 2012

E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Decentralized execution of linear workflows over Web Services", Future Generation Computer Systems (FGCS), Elsevier, vol. 27, no. 3, pp. 341-347, 2011

E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Optimal Service Ordering in Decentralized Queries over Web Services", International Journal of Knowledge-based Organizations (IJKBO), IGI GLOBAL, vol. 1, no. 2, pp. 1-16, 2011

E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Lifting the Burden of History in Adaptive Ordering of Pipelined Stream Filters", Proceedings of the 7th IEEE International Conference on Data Engineering, Workshop on Self Managing Database Systems (ICDE-SMDB), 2012

E. Tsamoura, A. Gounaris and Y. Manolopoulos "On the Quest of Optimal Service Ordering in Decentralized Queries", Proc. of the 29th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2010), pp. 277-278, Zurich, Switzerland, 2010

A. Gounaris, E. Tsamoura and Y. Manolopoulos, "Adaptive query processing in distributed settings", Intelligent Query Processing, Barbara Catania & Lakhmi Jain (Eds), Springer-Verlag, pp. 211–236, 2012

E. Tsamoura, A. Gounaris and Y. Manolopoulos, "Queries over Web Services", Web Data Management Trails, Lakhmi Jain and Athena Vakali (Eds), Springer-Verlag, pp. 139-169, 2011

The research team I am working with

Data engineering laboratory, http://delab.csd.auth.gr/

Lab director, Professor Yannis Manolopoulos

Academic staff

A. Papadopoulos

A. Gounaris

K. Tsichlas

Thank you !

Questions???
