Querying Big Data
by Accessing Small Data
Wenfei Fan University of Edinburgh & Beihang University
Floris Geerts University of Antwerp
Yang Cao University of Edinburgh & Beihang University
Ting Deng Beihang University
Ping Lu Beihang University
Challenges introduced by big data
Traditional computational complexity theory of 50 years:
• The ugly: PSPACE-hard, EXPTIME-hard, … , undecidable
• The bad: NP-hard (intractable)
• The good: polynomial time computable (PTIME)
Can we still answer queries on big data with limited resource?
What happens when it comes to big data?
Using an SSD at 6 GB/s, a linear scan of a dataset D would take
• 1.9 days when D is of 1 PB ($10^{15}$ B)
• 5.28 years when D is of 1 EB ($10^{18}$ B)
O(n) time is already beyond reach on big data in practice!
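The scan times above follow from simple arithmetic. A minimal sketch, assuming a sustained read rate of 6 × 10^9 bytes per second and a 365.25-day year (both are assumptions made for this illustration, as is the function name):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: Julian year


def scan_seconds(size_bytes: float, bytes_per_second: float = 6e9) -> float:
    """Time for one linear scan at a sustained read rate."""
    return size_bytes / bytes_per_second


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / (SECONDS_PER_DAY * DAYS_PER_YEAR)

print(f"1 PB: {pb_days:.1f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Scaling the dataset by 1000 scales the scan time by 1000, which is why even linear-time algorithms become impractical at this size.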
Bounded evaluability
Input: A class L of queries
Question: Can we find, for any query $Q \in L$ and any (possibly big) dataset D, a fraction $D_Q$ of D such that
$Q(D) = Q(D_Q)$, and
$D_Q$ can be identified in time determined by Q?
Making the cost of computing Q(D) independent of |D|!
Scales with D no matter how big D grows
Graph Search (Facebook)
Find me restaurants in New York my friends have been to in 2014
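-- schema: friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
-- p0 is the pid of the user issuing the query
select dine.rid
from friend, person, dine
where friend.pid1 = p0 and friend.pid2 = person.pid and
friend.pid2 = dine.pid and person.city = 'NYC' and dine.yy = 2014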
Boundedly evaluable with indices under constraints?
Facebook: 5000 friends per person
Each year has at most 366 days
Each person dines at most once per day
pid is a key for relation person
Data semantics in constraints
1.38 billion person tuples, and over 140 billion friend tuples
Build an index from pid1 to pid2 for friend(pid1, pid2)
Bounded query evaluation
Accessing 5000 + 5000 + 5000 * 366 tuples in total
Fetch 5000 pid’s for friends of p0 -- 5000 friends per person
For each pid, check whether she lives in NYC – 5000 person tuples
For pid’s living in NYC, find restaurants where they dined in 2014 –
5000 * 366 tuples at most
A query plan under the constraints + indices
Find me restaurants in New York my friends have been to in 2014
$Q(rid) = \exists p, p_1, n, c, dd, mm, yy\, (friend(p, p_1) \wedge person(p_1, n, c) \wedge dine(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \text{NYC} \wedge yy = 2014)$
In contrast to 1.38 billion person tuples, and over 140 billion friend tuples
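The three-step plan above can be sketched with in-memory indices. This is only an illustration on a made-up toy instance: the names, the sample tuples, and the `restaurants` function are assumptions, and Python dictionaries stand in for whatever index structure a real system would use.

```python
from collections import defaultdict

# Toy instance; all names and values are made up for illustration.
friend = [("p0", "a"), ("p0", "b"), ("p0", "c"), ("x", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"),
          ("c", "Cat", "NYC"), ("p0", "Me", "NYC")]
dine = [("a", "r1", 1, 5, 2014), ("a", "r2", 2, 5, 2013),
        ("b", "r3", 3, 6, 2014), ("c", "r1", 4, 7, 2014),
        ("c", "r4", 9, 9, 2014)]

# Indices that back the three constraints.
friend_idx = defaultdict(list)      # pid1 -> pid2 values (at most 5000)
for pid1, pid2 in friend:
    friend_idx[pid1].append(pid2)
person_idx = {pid: city for pid, _, city in person}   # pid is a key
dine_idx = defaultdict(list)        # (pid, yy) -> rid values (at most 366)
for pid, rid, _, _, yy in dine:
    dine_idx[(pid, yy)].append(rid)


def restaurants(p0, city="NYC", year=2014):
    """Answer the query by index lookups only and count fetched tuples."""
    fetched = 0
    friends = friend_idx.get(p0, [])
    fetched += len(friends)
    local = []
    for pid in friends:
        fetched += 1                # one person tuple per friend
        if person_idx.get(pid) == city:
            local.append(pid)
    result = set()
    for pid in local:
        rids = dine_idx.get((pid, year), [])
        fetched += len(rids)
        result.update(rids)
    return result, fetched


rids, fetched = restaurants("p0")
worst_case = 5000 + 5000 + 5000 * 366   # bound implied by the constraints
print(rids, fetched, worst_case)
```

The function never scans a relation; the number of tuples it touches depends only on the bounds in the constraints, not on how many tuples `friend`, `person`, or `dine` contain.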
Overview
Formalization of bounded query plans and queries
The complexity of deciding the bounded evaluability for
– CQ (SPJ), UCQ, FO+ (SPJU), FO
Effective syntax for boundedly evaluable queries
Approximate query answering with bounded evaluability
– Bounded envelopes
– Bounded query specialization
Previous work: bounded query plans are not properly defined
We only know that bounded evaluability is undecidable for FO [PODS 2014], and in PTIME for CQ with very restricted query plans [VLDB 2014]
Boundedly evaluable queries: formulation
Access constraints to capture data semantics
On a relation schema R: $X \to (Y, N)$
• X, Y: sets of attributes of R
• for any X-value, there exist at most N distinct Y-values
• Index on X for Y: given an X-value, find the relevant Y-values
Access schema: A set of access constraints
Combining cardinality constraints and index
Examples
• friend(pid1, pid2): $pid1 \to (pid2, 5000)$ (5000 friends per person)
• dine(pid, rid, dd, mm, yy): $pid, yy \to (rid, 366)$ (each year has at most 366 days and each person dines at most once per day)
• person(pid, name, city): $pid \to (city, 1)$ (pid is a key for person)
Discovery: functional dependencies, simple aggregate queries
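To make the cardinality part of an access constraint concrete, here is a minimal sketch that checks whether a list of tuples satisfies $X \to (Y, N)$. The class and function names are hypothetical, and the index half of the constraint is not modelled here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessConstraint:
    relation: str
    x: tuple   # attribute names on the left-hand side
    y: tuple   # attribute names on the right-hand side
    n: int     # bound on distinct Y-values per X-value


def satisfies(rows, schema, c):
    """Check that every X-value has at most c.n distinct Y-values."""
    xi = [schema.index(a) for a in c.x]
    yi = [schema.index(a) for a in c.y]
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[i] for i in xi)].add(tuple(row[i] for i in yi))
    return all(len(ys) <= c.n for ys in groups.values())


person_schema = ("pid", "name", "city")
key = AccessConstraint("person", ("pid",), ("city",), 1)
ok_rows = [("a", "Ann", "NYC"), ("b", "Bob", "LA")]
bad_rows = [("a", "Ann", "NYC"), ("a", "Ann", "LA")]   # two cities for one pid
print(satisfies(ok_rows, person_schema, key), satisfies(bad_rows, person_schema, key))
```

With N = 1 the check reduces to a functional dependency test, which matches the remark that such constraints can be discovered from functional dependencies.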
Bounded plans for query Q
In the presence of access schema A
$\xi(Q, R)$: $T_1 = \delta_1, \ldots, T_n = \delta_n$, where each $\delta_i$ is one of
• $\{a\}$: a constant in query Q
• $\mathrm{Fetch}(X \in T_j, R, Y)$: via an access constraint $R: X \to (Y', N)$, with $j < i$ and $Y \subseteq X \cup Y'$ (fetch data by making use of indices in A)
• $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$: projection, selection, renaming
• $T_j \times T_k$, $T_j \cup T_k$, $T_j - T_k$: Cartesian product, union, set difference, for $j < i$, $k < i$
The length of $\xi(Q, R)$: bounded by an exponential in |R|, |Q| and |A| (plans beyond exponential length are not very practical)
Independent of the size of instances D of R
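A plan of this shape can be read as a short straight-line program. The sketch below interprets a subset of the step types (constant, fetch, selection, projection, renaming); the step encoding, the `run_plan` name, and the dictionary-based indices are all assumptions made for illustration, not a definitive implementation.

```python
def run_plan(steps, indices):
    """Evaluate T_1, ..., T_n in order; a table is (attrs, set_of_tuples)."""
    tables = []
    for step in steps:
        op = step[0]
        if op == "const":                      # {a}
            _, attr, value = step
            tables.append(((attr,), {(value,)}))
        elif op == "fetch":                    # Fetch(X in T_j, R, Y)
            _, j, rel, x_attrs, y_attrs = step
            attrs, rows = tables[j]
            pos = [attrs.index(a) for a in x_attrs]
            index = indices[(rel, x_attrs, y_attrs)]
            out = set()
            for row in rows:
                x_val = tuple(row[p] for p in pos)
                for y_val in index.get(x_val, ()):
                    out.add(x_val + y_val)
            tables.append((x_attrs + y_attrs, out))
        elif op == "select":                   # selection with a predicate
            _, j, pred = step
            attrs, rows = tables[j]
            tables.append((attrs, {r for r in rows if pred(dict(zip(attrs, r)))}))
        elif op == "project":                  # projection on some attributes
            _, j, keep = step
            attrs, rows = tables[j]
            pos = [attrs.index(a) for a in keep]
            tables.append((keep, {tuple(r[p] for p in pos) for r in rows}))
        elif op == "rename":                   # renaming of attributes
            _, j, mapping = step
            attrs, rows = tables[j]
            tables.append((tuple(mapping.get(a, a) for a in attrs), rows))
        else:
            raise ValueError(f"unsupported step: {op}")
    return tables[-1]


# Toy indices (made-up data): friends of p0, and the city of each person.
indices = {
    ("friend", ("pid1",), ("pid2",)): {("p0",): {("a",), ("b",)}},
    ("person", ("pid",), ("city",)): {("a",): {("NYC",)}, ("b",): {("LA",)}},
}
steps = [
    ("const", "pid1", "p0"),
    ("fetch", 0, "friend", ("pid1",), ("pid2",)),
    ("project", 1, ("pid2",)),
    ("rename", 2, {"pid2": "pid"}),
    ("fetch", 3, "person", ("pid",), ("city",)),
    ("select", 4, lambda t: t["city"] == "NYC"),
    ("project", 5, ("pid",)),
]
result = run_plan(steps, indices)
print(result)
```

Step positions are 0-based here, and a step can only refer to earlier tables, which mirrors the j < i condition. Data enters a plan only through constants and index lookups, so the work done is governed by the plan length and the bounds N rather than by the size of the underlying relations.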
Boundedly evaluable queries Q
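Q has a bounded query plan $\xi(Q, R)$ under an access schema A
• CQ: only $\{a\}$, $\mathrm{Fetch}(X \in T_j, R, Y)$, $\pi_Y(T_j)$, $\sigma_C(T_j)$, $\rho(T_j)$, $T_j \times T_k$
• UCQ: $\cup$ at the end only
• FO+: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$
• FO: $\{a\}$, Fetch, $\pi$, $\sigma$, $\rho$, $\times$, $\cup$, $-$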
Coping with big data
Deciding bounded evaluability
The bounded evaluability problem (BEP(L))
Input: A relational schema R, an access schema A, and a query Q in a query language L
Question: Is Q boundedly evaluable under A?
When Q has a bounded query plan under A.
Undecidable for FO [PODS 2014]
Is BEP decidable for CQ? UCQ? FO+? If so, what is the complexity?
The bounded evaluability analysis is nontrivial
Example of boundedly evaluable queries
Schema: R(A, B, C)
Access schema A: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$
A CQ query:
$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$
We need to reason about A-equivalence and “nontrivial” variables
Is Q boundedly evaluable?
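Yes, Q is A-equivalent to $Q'(x, x) = R(1, 1, x) \wedge R(x, x, x)$, which is boundedly evaluable:
– $x = y = z_3$, since $R(\emptyset \to C, 1)$ allows at most one C-value in any instance
– $\exists z_1, z_2\, (R(1, 1, x) \wedge R(z_1, z_2, y))$ is entailed by $R(1, 1, x)$
With indices in A, “nontrivial” variables are fetchable; combinations are indexed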
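The A-equivalence can be sanity-checked by brute force on a tiny domain. The sketch below enumerates every instance of R over the values {1, 2}, keeps those that satisfy the access schema, and compares Q with Q'(x, x) = R(1, 1, x) ∧ R(x, x, x), which is Q after substituting x = y = z3. The two-value domain and the choice N = 1 are assumptions made to keep the enumeration small; this is a check on small cases, not a proof.

```python
from itertools import combinations, product

DOMAIN = (1, 2)
ALL_TUPLES = list(product(DOMAIN, repeat=3))


def satisfies_A(d, n=1):
    """Empty set -> C with bound 1, and AB -> C with bound n."""
    if len({c for _, _, c in d}) > 1:
        return False
    per_ab = {}
    for a, b, c in d:
        per_ab.setdefault((a, b), set()).add(c)
    return all(len(cs) <= n for cs in per_ab.values())


def q(d):
    """Q(x, y): R(1, 1, x), R(z1, z2, y), R(x, y, z3), by full scans."""
    return {(x, y)
            for (x1, x2, x) in d if x1 == 1 and x2 == 1
            for (_, _, y) in d
            for (a, b, _) in d if a == x and b == y}


def q_prime(d):
    """Q'(x, x): R(1, 1, x) and R(x, x, x)."""
    return {(x, x) for (a, b, x) in d if a == 1 and b == 1 and (x, x, x) in d}


instances = [set(s) for k in range(len(ALL_TUPLES) + 1)
             for s in combinations(ALL_TUPLES, k)]
valid = [d for d in instances if satisfies_A(d)]
agree = all(q(d) == q_prime(d) for d in valid)
differ_without_A = any(q(d) != q_prime(d) for d in instances)
print(len(valid), agree, differ_without_A)
```

The two queries agree on every instance that satisfies A, but not on arbitrary instances, which is why the analysis has to reason about A-equivalence rather than ordinary equivalence.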
The complexity of BEP
BEP is EXPSPACE-complete for CQ, UCQ and FO+
• good news: decidable
• bad news: too expensive to be practical
Can we make practical use of bounded evaluability?
Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
Upper bound: a characterization based on A-equivalence and “nontrivial” variables for boundedly evaluable queries
Effective syntax for boundedly evaluable queries
An effective syntax for bounded CQ
A form of queries covered by an access schema A
• A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A
• All CQ queries covered by A are boundedly evaluable under A
• It is in PTIME in |Q|, |A| and |R| to syntactically check whether a CQ is covered by A
A syntactic characterization of boundedly evaluable CQ
A CQ Q is covered by A if
• all free variables and variables that participate in “selection / join” of Q are accessible via indices in A
• the combination of such variables in each atom $R(\bar{x})$ is indexed by a single access constraint
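The two conditions can be approximated by a small syntactic check. The sketch below is a simplified reading of the conditions as stated on the slide, not the exact definition of covered queries: variables are strings starting with "?", every other term is a constant, "selection / join" variables are taken to be those that occur more than once, and all function names are hypothetical.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def accessible_vars(atoms, schemas, constraints):
    """Fixpoint: variables reachable from constants via access constraints."""
    acc = set()
    changed = True
    while changed:
        changed = False
        for rel, terms in atoms:
            pos = {a: i for i, a in enumerate(schemas[rel])}
            for crel, xs, ys, _n in constraints:
                if crel != rel:
                    continue
                if all(not is_var(terms[pos[a]]) or terms[pos[a]] in acc for a in xs):
                    for a in ys:
                        t = terms[pos[a]]
                        if is_var(t) and t not in acc:
                            acc.add(t)
                            changed = True
    return acc


def needed_vars(atoms, free):
    """Free variables plus variables that occur more than once."""
    counts = {}
    for _rel, terms in atoms:
        for t in terms:
            if is_var(t):
                counts[t] = counts.get(t, 0) + 1
    return set(free) | {v for v, k in counts.items() if k > 1}


def is_covered_sketch(atoms, free, schemas, constraints):
    acc = accessible_vars(atoms, schemas, constraints)
    need = needed_vars(atoms, free)
    if not need <= acc:
        return False                      # condition 1 fails
    for rel, terms in atoms:
        attrs = schemas[rel]
        # attributes of this atom holding a constant or a needed variable
        must = {a for a, t in zip(attrs, terms) if not is_var(t) or t in need}
        ok = any(crel == rel and must <= set(xs) | set(ys)
                 and all(not is_var(t) or t in acc
                         for a, t in zip(attrs, terms) if a in xs)
                 for crel, xs, ys, _n in constraints)
        if not ok:
            return False                  # condition 2 fails for this atom
    return True


schemas = {"friend": ("pid1", "pid2"), "person": ("pid", "name", "city"),
           "dine": ("pid", "rid", "dd", "mm", "yy")}
constraints = [("friend", ("pid1",), ("pid2",), 5000),
               ("person", ("pid",), ("city",), 1),
               ("dine", ("pid", "yy"), ("rid",), 366)]
atoms = [("friend", ("p0", "?p1")),
         ("person", ("?p1", "?n", "NYC")),
         ("dine", ("?p1", "?rid", "?dd", "?mm", 2014))]
with_friend_index = is_covered_sketch(atoms, ["?rid"], schemas, constraints)
without_friend_index = is_covered_sketch(atoms, ["?rid"], schemas, constraints[1:])
print(with_friend_index, without_friend_index)
```

On the Graph Search query the check succeeds with the three constraints and fails once the friend constraint is removed, since the friend identifiers are then no longer reachable from the constant p0. The check runs in time polynomial in the sizes of the query and the access schema.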
More on covered queries
Schema: R(A, B, C)
Access schema A: $R(\emptyset \to C, 1)$, $R(AB \to C, N)$
$Q(x, y) = \exists x_1, x_2, z_1, z_2, z_3\, (R(x_1, x_2, x) \wedge R(z_1, z_2, y) \wedge R(x, y, z_3) \wedge x_1 = 1 \wedge x_2 = 1)$
It is $\Pi_2^p$-complete to decide whether a query in FO+ is covered
A query in FO+ is covered by A if for each CQ-subquery $Q_i$
• either $Q_i$ is covered by A,
• or for each A-instance $\theta(T_i)$ of $Q_i$, there exists a CQ-subquery $Q_j$ of Q such that $Q_i(\theta(T_i)) \subseteq Q_j(\theta(T_i))$ and $Q_j$ is covered
Bounded envelopes
Bounded envelopes
What can we do if query Q in L is not boundedly evaluable under A?
Approximate query answering
$Q_L$ and $Q_U$: lower and upper envelopes of Q
$Q_L(D)$ and $Q_U(D)$ are not too far from $Q(D)$
We find $Q_L$ and $Q_U$ in the same language L such that
– $Q_L$ and $Q_U$ are boundedly evaluable under A,
and for all instances D that satisfy A
– $Q_L(D) \subseteq Q(D) \subseteq Q_U(D)$, and
– $|Q(D) - Q_L(D)| \le N_L$ and $|Q_U(D) - Q(D)| \le N_U$,
where $N_L$ and $N_U$ are constants
S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997
Example bounded envelopes
Schema: R(A, B)
Access schema A: $R(A \to B, N)$
$Q(x) = \exists y, z, w\, (R(w, x) \wedge R(y, w) \wedge R(x, z) \wedge w = 1)$: not boundedly evaluable
Bounded envelopes
• Upper (relaxation): $Q_U(x) = \exists z\, (R(1, x) \wedge R(x, z))$
• Lower (expansion): $Q_L(x) = \exists y, z\, (R(1, x) \wedge R(y, 1) \wedge R(x, y) \wedge R(x, z))$
Bounded envelopes may not exist
$Q(x, y) = \exists w\, (R(w, x) \wedge R(y, w) \wedge w = 1)$
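The containment between the query and its two envelopes can be checked on a toy instance. In the sketch below, the exact query needs a scan of R for the atom R(y, 1), because the only index runs from A to B, whereas both envelopes use index lookups only. The instance values and function names are made up for illustration.

```python
from collections import defaultdict

# Toy instance of R(A, B); values are made up for illustration.
R = {(1, 2), (1, 3), (1, 4), (5, 1), (2, 5), (2, 6), (3, 7)}

index = defaultdict(set)   # index for the constraint A -> (B, N)
for a, b in R:
    index[a].add(b)
N = max(len(bs) for bs in index.values())


def q(rel):
    """Q(x): R(1, x), R(y, 1), R(x, z). R(y, 1) forces a scan of rel."""
    has_y = any(b == 1 for _, b in rel)
    return {x for w, x in rel
            if w == 1 and has_y and any(a == x for a, _ in rel)}


def q_upper(idx):
    """Relaxation: drop R(y, 1); index lookups only."""
    return {x for x in idx.get(1, ()) if idx.get(x)}


def q_lower(idx):
    """Expansion: add R(x, y), so y is reachable from x via the index."""
    return {x for x in idx.get(1, ())
            if any(1 in idx.get(y, ()) for y in idx.get(x, ()))}


exact, upper, lower = q(R), q_upper(index), q_lower(index)
print(lower, exact, upper)
```

Both envelopes only ever return values x with R(1, x), and there are at most N of those, so the gaps between the envelopes and the exact answer are bounded by the constant N regardless of the size of R.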
The bounded envelope problems
UEP(L):
Input: A relational schema R, an access schema A, and a query Q in a query language L
Question: Does Q have a bounded upper envelope under A?
Similarly LEP(L) for lower envelopes.
We consider covered envelopes when Q is in CQ, UCQ or FO+
Complexity bounds
• For CQ, UEP and LEP are NP-complete
• For UCQ, UEP is $\Pi_2^p$-complete and LEP is NP-complete
• For FO+, UEP is $\Pi_2^p$-complete and LEP is DP-complete
• For FO, UEP and LEP are undecidable
Bounded specialized queries
Bounded query specialization
Access schema A, and query Q with a set X of parameters (variables)
$Q(\bar{x} = \bar{c})$: $Q \wedge \bar{x} = \bar{c}$, where $\bar{x} \subseteq X$ and the valuation $\bar{c}$ is a constant tuple
– boundedly evaluable under A for all valuations $\bar{c}$
Consider covered queries when Q is in CQ, UCQ or FO+
Instantiate a minimum set of parameters and make Q bounded
Find me restaurants in New York my friends have been to in 2014
For all valuations $p_0$
$Q(p, rid) = \exists p_1, n, c, dd, mm, yy\, (friend(p, p_1) \wedge person(p_1, n, c) \wedge dine(p_1, rid, dd, mm, yy) \wedge p = p_0 \wedge c = \text{NYC} \wedge yy = 2014)$
The bounded specialization problem (QSP(L))
Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
Question: Does Q have a bounded specialization $Q(\bar{x} = \bar{c})$ with $|\bar{x}| \le k$?
Complexity bounds
• NP-complete for CQ
• $\Pi_2^p$-complete for UCQ and FO+
• undecidable for FO
Summing up
Bounded evaluability of queries
Challenges: querying big data is cost-prohibitive
Bounded evaluability allows us to make big data small
However, the bounded evaluability analysis is expensive
Nonetheless, we can make practical use of bounded evaluability
Effective syntax: covered queries for CQ, UCQ and FO+
Approximate query answering:
• Bounded envelopes with a constant bound
• Bounded specialization for parameterized queries
An approach to effectively querying big data
Decidability and complexity