Querying Big Data

by Accessing Small Data

Wenfei Fan University of Edinburgh & Beihang University

Floris Geerts University of Antwerp

Yang Cao University of Edinburgh & Beihang University

Ting Deng Beihang University

Ping Lu Beihang University

Challenges introduced by big data

Traditional computational complexity theory of the past 50 years:

• The ugly: PSPACE-hard, EXPTIME-hard, … , undecidable

• The bad: NP-hard (intractable)

• The good: polynomial time computable (PTIME)

Can we still answer queries on big data with limited resources?

What happens when it comes to big data?

Using an SSD with a read rate of 6 GB/s, a linear scan of a data set $D$ would take

• 1.9 days when $D$ is of 1 PB ($10^{15}$ B)

• 5.28 years when $D$ is of 1 EB ($10^{18}$ B)

O(n) time is already beyond reach on big data in practice!
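The two scan times can be re-derived with a few lines of arithmetic. This is a minimal sketch, assuming that "6G/s" means $6 \times 10^9$ bytes per second and that a year has 365 days; the function name is illustrative.

```python
def scan_seconds(size_bytes: float, throughput_bytes_per_s: float = 6e9) -> float:
    """Time for one sequential pass over the data at a fixed read rate."""
    return size_bytes / throughput_bytes_per_s

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year

pb_days = scan_seconds(1e15) / SECONDS_PER_DAY    # 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR  # 1 EB

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Both results agree with the figures on the slide up to rounding.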

Bounded evaluability

Input: A class $L$ of queries

Question: Can we find, for any query $Q \in L$ and any (possibly big) dataset $D$, a fraction $D_Q$ of $D$ such that

$Q(D) = Q(D_Q)$, and

$D_Q$ can be identified in time determined by $Q$?

Making the cost of computing Q(D) independent of |D|!

Scales with D no matter how big D grows

[Figure: evaluating $Q$ on the small fraction $D_Q$ instead of on $D$, with $Q(D) = Q(D_Q)$]

Graph Search (Facebook)

Find me restaurants in New York my friends have been to in 2014

select rid

from friend(pid1, pid2), person(pid, name, city),

dine(pid, rid, dd, mm, yy)

where pid1 = p0 and pid2 = person.pid and

pid2 = dine.pid and city = NYC and yy = 2014

Boundedly evaluable with indices under constraints?

Facebook: 5000 friends per person

Each year has at most 366 days

Each person dines at most once per day

pid is a key for relation person

Data semantics in constraints

1.38 billion person tuples, and over 140 billion friend tuples

Build an index from pid1 to pid2 for friend(pid1, pid2)

Bounded query evaluation

Accessing 5000 + 5000 + 5000 * 366 tuples in total

Fetch 5000 pid's for friends of p0: at most 5000 friends per person

For each pid, check whether she lives in NYC: 5000 person tuples

For pid's living in NYC, find restaurants where they dined in 2014: 5000 * 366 tuples at most

A query plan under the constraints + indices


$Q(rid) = \exists p, p_1, n, c, dd, mm, yy\, (\text{friend}(p, p_1) \wedge \text{person}(p_1, n, c) \wedge \text{dine}(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \text{NYC} \wedge yy = 2014)$

In contrast to 1.38 billion person tuples, and over 140 billion friend tuples
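The three-step plan can be sketched over in-memory indices. This is only an illustration: the toy tuples, the dictionary-based indices and the function name are assumptions, not part of the talk. The counter tracks how many tuples the plan touches, which under the constraints never exceeds 5000 + 5000 + 5000 * 366 = 1,840,000, however large the relations are.

```python
from collections import defaultdict

# Toy instance; every value below is made up for illustration.
friend = [("p0", "a"), ("p0", "b"), ("x", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"), ("p0", "Me", "NYC")]
dine = [("a", "r1", 3, 5, 2014), ("a", "r2", 9, 9, 2013), ("b", "r3", 1, 1, 2014)]

# Indices matching the access constraints.
friend_idx = defaultdict(list)                      # pid1 -> [pid2]
for pid1, pid2 in friend:
    friend_idx[pid1].append(pid2)
person_idx = {pid: city for pid, _, city in person}  # pid -> city (pid is a key)
dine_idx = defaultdict(list)                        # (pid, yy) -> [rid]
for pid, rid, _, _, yy in dine:
    dine_idx[(pid, yy)].append(rid)

BOUND = 5000 + 5000 + 5000 * 366  # tuples accessed in the worst case

def restaurants(p0, city="NYC", year=2014):
    """Return (answer, number of tuples accessed) using index lookups only."""
    accessed = 0
    friends = friend_idx.get(p0, [])
    accessed += len(friends)                 # step 1: friends of p0
    result = set()
    for pid in friends:
        home = person_idx.get(pid)
        if home is not None:
            accessed += 1                    # step 2: one person tuple per friend
        if home != city:
            continue
        rids = dine_idx.get((pid, year), [])
        accessed += len(rids)                # step 3: restaurants in the given year
        result.update(rids)
    return result, accessed

answer, cost = restaurants("p0")
print(answer, cost, cost <= BOUND)
```

No relation is ever scanned after the indices are built; the cost depends on the fan-out bounds in the constraints rather than on the relation sizes.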

Overview

Formalization of bounded query plans and queries

The complexity of deciding the bounded evaluability for

– CQ (SPJ), UCQ, FO+ (SPJU), FO

Effective syntax for boundedly evaluable queries

Approximate query answering with bounded evaluability

– Bounded envelopes

– Bounded query specialization

Previous work: bounded query plans are not properly defined

We only know that bounded evaluability is undecidable for FO [PODS 2014] and in PTIME for CQ with very restricted query plans [VLDB 2014]

Boundedly evaluable queries: formulation

Access constraints to capture data semantics

On a relation schema $R$: $X \to (Y, N)$

• $X$, $Y$: sets of attributes of $R$

• for any $X$-value, there exist at most $N$ distinct $Y$-values

• Index on $X$ for $Y$: given an $X$-value, find the relevant $Y$-values

Access schema: A set of access constraints

Combining cardinality constraints and index

friend(pid1, pid2): pid1 $\to$ (pid2, 5000), i.e., at most 5000 friends per person

dine(pid, rid, dd, mm, yy): (pid, yy) $\to$ (rid, 366), i.e., each year has at most 366 days and each person dines at most once per day

person(pid, name, city): pid $\to$ (city, 1), i.e., pid is a key for person

Examples

Discovery: functional dependencies, simple aggregate queries
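Discovering a cardinality bound by a simple aggregate query can be sketched as follows: group by the $X$ attributes, count the distinct $Y$-values per group, and take the maximum. The helper names and the sample tuples are assumptions made for illustration.

```python
from collections import defaultdict

def group_y_by_x(rows, x_attrs, y_attrs):
    """Map each X-value to its set of distinct Y-values (this is also the index)."""
    groups = defaultdict(set)
    for row in rows:
        x_val = tuple(row[a] for a in x_attrs)
        y_val = tuple(row[a] for a in y_attrs)
        groups[x_val].add(y_val)
    return groups

def max_fanout(rows, x_attrs, y_attrs):
    """Smallest N such that X -> (Y, N) holds on this instance."""
    groups = group_y_by_x(rows, x_attrs, y_attrs)
    return max((len(ys) for ys in groups.values()), default=0)

def satisfies(rows, x_attrs, y_attrs, n):
    """Check the cardinality part of the access constraint X -> (Y, N)."""
    return max_fanout(rows, x_attrs, y_attrs) <= n

# Made-up dine tuples.
dine = [
    {"pid": "a", "rid": "r1", "dd": 3, "mm": 5, "yy": 2014},
    {"pid": "a", "rid": "r2", "dd": 4, "mm": 5, "yy": 2014},
    {"pid": "a", "rid": "r1", "dd": 9, "mm": 9, "yy": 2013},
    {"pid": "b", "rid": "r3", "dd": 1, "mm": 1, "yy": 2014},
]

print(max_fanout(dine, ("pid", "yy"), ("rid",)))
print(satisfies(dine, ("pid", "yy"), ("rid",), 366))
```

A bound discovered this way only describes the current instance; whether it holds in general is a matter of data semantics, as with the 366-day limit.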

Bounded plans for query Q

In the presence of access schema A

$\xi(Q, R)$: $T_1 = \delta_1, \ldots, T_n = \delta_n$, where each $\delta_i$ is one of

• $\{a\}$: a constant in query $Q$

• $\mathsf{Fetch}(X \in T_j, R, Y)$: via an access constraint $R: X \to (Y', N)$, with $Y \subseteq X \cup Y'$ and $j < i$

• $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$: projection, selection, renaming, for $j < i$

• $T_j \times T_k$, $T_j \cup T_k$, $T_j - T_k$: Cartesian product, union, set difference, for $j < i$, $k < i$

The length of $\xi(Q, R)$: bounded by an exponential in $|R|$, $|Q|$ and $|A|$ (plans beyond exponential length are not very practical)

Independent of the size of instances $D$ of $R$

Fetch data by making use of indices in A
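A plan of this shape can be written as a short straight-line program in which every step only refers to earlier steps. The sketch below is an assumption-laden toy: tables are (attribute names, set of tuples) pairs, indices are plain dictionaries, Fetch returns all of $X \cup Y'$ and leaves the restriction to $Y$ to a later projection, and the data is made up.

```python
# A table is a pair (attrs, rows): a tuple of attribute names and a set of tuples.

def const(attr, value):
    """{a}: a single constant taken from the query."""
    return (attr,), {(value,)}

def fetch(table, x_attrs, index, out_attrs):
    """Fetch(X in T_j, R, Y): look up every X-value of T_j in an index on R."""
    attrs, rows = table
    pos = [attrs.index(a) for a in x_attrs]
    fetched = set()
    for x_val in {tuple(r[i] for i in pos) for r in rows}:
        for y_val in index.get(x_val, ()):   # at most N values per X-value
            fetched.add(x_val + y_val)
    return tuple(out_attrs), fetched

def project(table, keep):
    attrs, rows = table
    pos = [attrs.index(a) for a in keep]
    return tuple(keep), {tuple(r[i] for i in pos) for r in rows}

def select(table, pred):
    attrs, rows = table
    return attrs, {r for r in rows if pred(dict(zip(attrs, r)))}

def rename(table, mapping):
    attrs, rows = table
    return tuple(mapping.get(a, a) for a in attrs), rows

def product(t1, t2):
    return t1[0] + t2[0], {x + y for x in t1[1] for y in t2[1]}

def union(t1, t2):
    return t1[0], t1[1] | t2[1]

def difference(t1, t2):
    return t1[0], t1[1] - t2[1]

# Made-up indices, one per access constraint.
friend_idx = {("p0",): {("a",), ("b",)}}                         # pid1 -> pid2
person_idx = {("a",): {("NYC",)}, ("b",): {("LA",)}}             # pid -> city
dine_idx = {("a", 2014): {("r1",)}, ("b", 2014): {("r3",)}}      # (pid, yy) -> rid

T1 = const("pid1", "p0")
T2 = fetch(T1, ("pid1",), friend_idx, ("pid1", "pid2"))
T3 = rename(project(T2, ("pid2",)), {"pid2": "pid"})
T4 = fetch(T3, ("pid",), person_idx, ("pid", "city"))
T5 = project(select(T4, lambda t: t["city"] == "NYC"), ("pid",))
T6 = product(T5, const("yy", 2014))
T7 = fetch(T6, ("pid", "yy"), dine_idx, ("pid", "yy", "rid"))
T8 = project(T7, ("rid",))
print(T8)
```

The base relations are reached only through `fetch`, so the size of each intermediate table is bounded by the constants $N$ in the access constraints and the plan length.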

Boundedly evaluable queries Q

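$Q$ has a bounded query plan $\xi(Q, R)$ under an access schema $A$

CQ: only $\{a\}$, $\mathsf{Fetch}(X \in T_j, R, Y)$, $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$, $T_j \times T_k$

UCQ: additionally $\cup$, at the end only

FO+: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$

FO: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$, $-$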

Coping with big data

Deciding bounded evaluability

The bounded evaluability problem (BEP(L))

Input: A relational schema R, an access schema A, and a query Q in a query language L

Question: Is Q boundedly evaluable under A?

When Q has a bounded query plan under A.

Undecidable for FO [PODS 2014]

Is BEP decidable for CQ? UCQ? FO+? If so, what is the complexity?

The bounded evaluability analysis is nontrivial

Example of boundedly evaluable queries

Schema: $R(A, B, C)$

Access schema $A$: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$

A CQ query:

$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$

Is $Q$ boundedly evaluable?

Yes, $Q$ is $A$-equivalent to $Q'(x, x) = R(1, 1, x) \wedge R(x, x, x)$, which is boundedly evaluable:

– $x = y = z_3$, since $R(\emptyset \to C, 1)$ allows at most one $C$-value

– $\exists z_1, z_2\, (R(1, 1, x) \wedge R(z_1, z_2, y))$ is entailed by $R(1, 1, x)$

We need to reason about $A$-equivalence and “nontrivial” variables

With indices in $A$, “nontrivial” variables are fetchable; combinations are indexed
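The $A$-equivalence claim can be sanity-checked by brute force on small instances. In this sketch the rewritten query is taken to be $R(1, 1, x) \wedge R(x, x, x)$, i.e., it is assumed to keep the atom $R(x, y, z_3)$ after substituting $x = y = z_3$; the function names and instances are made up.

```python
def q(r):
    """Q(x, y): exists x1,x2,z1,z2,z3 with R(x1,x2,x), R(z1,z2,y), R(x,y,z3), x1 = x2 = 1."""
    out = set()
    for (x1, x2, x) in r:
        if (x1, x2) != (1, 1):
            continue
        for (_, _, y) in r:
            if any(a == x and b == y for (a, b, _) in r):
                out.add((x, y))
    return out

def q_prime(r):
    """Assumed rewriting: Q'(x, x) = R(1, 1, x) and R(x, x, x)."""
    return {(x, x) for (a, b, x) in r if (a, b) == (1, 1) and (x, x, x) in r}

def satisfies_a(r, n):
    """Check R(empty -> C, 1) and R(AB -> C, N) on instance r."""
    c_values = {c for (_, _, c) in r}
    by_ab = {}
    for a, b, c in r:
        by_ab.setdefault((a, b), set()).add(c)
    return len(c_values) <= 1 and all(len(cs) <= n for cs in by_ab.values())

# Instances that satisfy A: the two queries agree on each of them.
good = [{(1, 1, 5), (5, 5, 5)}, {(1, 1, 5)}, {(1, 1, 1)}, set()]
for d in good:
    assert satisfies_a(d, n=3) and q(d) == q_prime(d)

# An instance that violates A: the queries may differ, so the
# equivalence holds only relative to the access schema.
bad = {(1, 1, 5), (2, 2, 6), (5, 6, 0)}
print(satisfies_a(bad, n=3), q(bad), q_prime(bad))
```

Agreement on a handful of instances is not a proof, but the violating instance shows why the reasoning has to be about $A$-equivalence rather than ordinary equivalence.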

The complexity of BEP

BEP is EXPSPACE-complete for CQ, UCQ and FO+

• good news: decidable

• bad news: too expensive to be practical

Can we make practical use of bounded evaluability?

Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions

Upper bound: a characterization based on A-equivalence and “nontrivial” variables for boundedly evaluable queries

Effective syntax for boundedly evaluable queries

An effective syntax for bounded CQ

A form of queries covered by an access schema $A$

• A CQ is boundedly evaluable under $A$ iff it is $A$-equivalent to a CQ covered by $A$

• All CQ queries covered by $A$ are boundedly evaluable under $A$

• It is in PTIME in $|Q|$, $|A|$ and $|R|$ to syntactically check whether a CQ is covered by $A$

A syntactic characterization of boundedly evaluable CQ

A CQ $Q$ is covered by $A$ if

• all free variables and variables that participate in “selection / join” of $Q$ are accessible via indices in $A$

• the combination of such variables in each atom $R(\bar{x})$ is indexed by a single access constraint
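The first condition, accessibility, can be approximated by a closure computation in the style of attribute closure for functional dependencies. The sketch below is a simplification and not the definition used in the talk: it treats variables equated with constants as the starting set, propagates accessibility through access constraints atom by atom, and does not check the second condition. All names are illustrative.

```python
def accessible_variables(atoms, constraints, constants):
    """Least set of variables reachable from `constants` via access constraints.

    atoms:       list of (relation, tuple of variables)
    constraints: list of (relation, X positions, Y positions)
    constants:   variables that the query equates with a constant
    """
    acc = set(constants)
    changed = True
    while changed:
        changed = False
        for rel, args in atoms:
            for c_rel, x_pos, y_pos in constraints:
                if c_rel != rel:
                    continue
                if all(args[i] in acc for i in x_pos):
                    new = {args[i] for i in y_pos} - acc
                    if new:
                        acc |= new
                        changed = True
    return acc

# The Graph Search query, with variables named as in the SQL version.
atoms = [
    ("friend", ("p", "p1")),
    ("person", ("p1", "n", "c")),
    ("dine", ("p1", "rid", "dd", "mm", "yy")),
]
constraints = [
    ("friend", (0,), (1,)),     # pid1 -> (pid2, 5000)
    ("person", (0,), (2,)),     # pid -> (city, 1)
    ("dine", (0, 4), (1,)),     # (pid, yy) -> (rid, 366)
]

acc = accessible_variables(atoms, constraints, {"p", "c", "yy"})
needed = {"rid", "p", "p1", "c", "yy"}   # free, selection and join variables
print(sorted(acc), needed <= acc)
```

The loop adds at least one variable per pass, so it stops after at most as many passes as there are variables, which fits the PTIME syntactic check mentioned earlier.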

More on covered queries

Schema: $R(A, B, C)$

Access schema $A$: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$

$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$

A query $Q$ in FO+ is covered by $A$ if for each CQ-subquery $Q_i$

• either $Q_i$ is covered by $A$,

• or for each $A$-instance $\theta(T_i)$ of $Q_i$, there exists a CQ-subquery $Q_j$ of $Q$ such that $Q_i(\theta(T_i)) \subseteq Q_j(\theta(T_i))$ and $Q_j$ is covered

It is $\Pi_2^p$-complete to decide whether a query in FO+ is covered

covered

Bounded envelopes

Bounded envelopes

What can we do if query Q in L is not boundedly evaluable under A?

Approximate query answering

$Q_L$ and $Q_U$: lower and upper envelopes of $Q$

$Q_L(D)$ and $Q_U(D)$ are not too far from $Q(D)$

We find $Q_L$ and $Q_U$ in the same language $L$ such that

$Q_L$ and $Q_U$ are boundedly evaluable under $A$, and for all instances $D$ that satisfy $A$:

– $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$, and

– $|Q(D) - Q_L(D)| \le N_L$ and $|Q_U(D) - Q(D)| \le N_U$,

where $N_L$ and $N_U$ are constants

S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997

Example bounded envelopes

Schema: $R(A, B)$

Access schema $A$: $R(A \to B, N)$

$Q(x) = \exists y, z, w\, (R(w, x) \wedge R(y, w) \wedge R(x, z) \wedge w = 1)$: not boundedly evaluable

Bounded envelopes

• Upper (relaxation): $Q_U(x) = \exists y, z\, (R(1, x) \wedge R(x, z))$

• Lower (expansion): $Q_L(x) = \exists y, z\, (R(1, x) \wedge R(y, 1) \wedge R(x, y) \wedge R(x, z))$

Bounded envelopes may not exist

$Q(x, y) = \exists w\, (R(w, x) \wedge R(y, w) \wedge w = 1)$
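The containment $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$ for the query $Q(x)$ of this example can be checked on concrete instances. The sketch evaluates the three queries naively over a binary relation given as a set of pairs; the instances are made up, and the evaluation scans the relation, so it illustrates the semantics of the envelopes rather than a bounded plan.

```python
def q(r):
    """Q(x): R(1, x), some R(y, 1), and some R(x, z)."""
    has_edge_into_1 = any(b == 1 for (_, b) in r)
    return {x for (a, x) in r
            if a == 1 and has_edge_into_1 and any(s == x for (s, _) in r)}

def q_upper(r):
    """Relaxation: drop the atom R(y, 1)."""
    return {x for (a, x) in r if a == 1 and any(s == x for (s, _) in r)}

def q_lower(r):
    """Expansion: additionally require R(x, y) for the y with R(y, 1)."""
    ys = {y for (y, b) in r if b == 1}
    return {x for (a, x) in r
            if a == 1
            and any((x, y) in r for y in ys)
            and any(s == x for (s, _) in r)}

r1 = {(1, 2), (2, 3), (3, 1), (1, 4), (4, 5)}
r2 = {(1, 2), (2, 3)}

for r in (r1, r2):
    assert q_lower(r) <= q(r) <= q_upper(r)

print(q_lower(r1), q(r1), q_upper(r1))
print(q_lower(r2), q(r2), q_upper(r2))
```

On `r1` the lower envelope misses one answer, and on `r2` the upper envelope returns one extra answer; the definition asks that such gaps stay below the constants $N_L$ and $N_U$ on every instance satisfying $A$.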

The bounded envelope problems

UEP($L$):

Input: A relational schema $R$, an access schema $A$, and a query $Q$ in a query language $L$

Question: Does $Q$ have a bounded upper envelope under $A$?

Similarly LEP($L$) for lower envelopes.

We consider covered envelopes when Q is in CQ, UCQ or FO+

Complexity bounds

For CQ, UEP and LEP are NP-complete

For UCQ, UEP is $\Pi_2^p$-complete and LEP is NP-complete

For FO+, UEP is $\Pi_2^p$-complete and LEP is DP-complete

For FO, UEP and LEP are undecidable

Bounded specialized queries

Bounded query specialization

Access schema $A$, and query $Q$ with a set $X$ of parameters (variables)

$Q(\bar{x} = \bar{c})$: $Q \wedge \bar{x} = \bar{c}$, where $\bar{x} \subseteq X$ and the valuation $\bar{c}$ is a constant tuple

– boundedly evaluable under $A$ for all valuations $\bar{c}$

Consider covered queries when Q is in CQ, UCQ or FO+

Instantiate a minimum set of parameters and make Q bounded


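$Q(p, rid) = \exists p_1, n, c, dd, mm, yy\, (\text{friend}(p, p_1) \wedge \text{person}(p_1, n, c) \wedge \text{dine}(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \text{NYC} \wedge yy = 2014)$

Bounded for all valuations $p_0$ of $p$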

The bounded specialization problem (QSP(L))

Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k

Question: Does $Q$ have a bounded specialization $Q(\bar{x} = \bar{c})$ with $|\bar{x}| \le k$?

Complexity bounds

NP-complete for CQ

$\Pi_2^p$-complete for UCQ and FO+

undecidable for FO
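The question asked by QSP can be illustrated by exhaustive search over subsets of the parameters, smallest first. The sketch below is an assumption-heavy toy: it uses a simplified accessibility closure as a stand-in for the covered-query test, takes the Graph Search query with parameters $p$, $c$ and $yy$, and requires only $rid$ and $p_1$ to become accessible. The search is exponential in the number of parameters, in line with the NP-hardness for CQ.

```python
from itertools import combinations

def accessible_variables(atoms, constraints, constants):
    """Simplified closure: variables reachable from `constants` via the constraints."""
    acc = set(constants)
    changed = True
    while changed:
        changed = False
        for rel, args in atoms:
            for c_rel, x_pos, y_pos in constraints:
                if c_rel == rel and all(args[i] in acc for i in x_pos):
                    new = {args[i] for i in y_pos} - acc
                    if new:
                        acc |= new
                        changed = True
    return acc

def minimum_specialization(atoms, constraints, parameters, needed, k):
    """Smallest set of at most k parameters whose instantiation makes `needed` accessible."""
    for size in range(k + 1):
        for chosen in combinations(sorted(parameters), size):
            if needed <= accessible_variables(atoms, constraints, chosen):
                return chosen
    return None   # no specialization with at most k parameters

atoms = [
    ("friend", ("p", "p1")),
    ("person", ("p1", "n", "c")),
    ("dine", ("p1", "rid", "dd", "mm", "yy")),
]
constraints = [
    ("friend", (0,), (1,)),     # pid1 -> (pid2, 5000)
    ("person", (0,), (2,)),     # pid -> (city, 1)
    ("dine", (0, 4), (1,)),     # (pid, yy) -> (rid, 366)
]

best = minimum_specialization(atoms, constraints, {"p", "c", "yy"}, {"rid", "p1"}, k=3)
print(best)
```

In this toy setting the city need not be instantiated, since it can be fetched through the key on person once a friend's pid is known.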

Summing up

Bounded evaluability of queries

• Challenges: querying big data is cost-prohibitive

• Bounded evaluability allows us to make big data small

• However, the bounded evaluability analysis is expensive

Nonetheless, we can make practical use of bounded evaluability

• Effective syntax: covered queries for CQ, UCQ and FO+

• Approximate query answering:

– Bounded envelopes with a constant bound

– Bounded specialization for parameterized queries

An approach to effectively querying big data

Decidability and complexity
