Chapter · January 2018 · DOI: 10.1007/978-3-319-74521-3_41
A Distributed Self-adaption Cube Building
Model Based on Query Log
Meina Song, Mingkun Li, Zhuohuan Li, Haihong E, and Zhonghong Ou
Beijing University of Posts and Telecommunications, China
{mnsong, dangshazi, lizhuohuan, ehaihong, zhonghong.ou}@bupt.edu.cn
Abstract. Among the diverse distributed query and analysis engines, Kylin has gained wide adoption owing to its various strengths. Using Kylin, users can interact with Hadoop data at sub-second latency. However, it still has some disadvantages; one representative disadvantage is the exponential growth of cuboids with the growth of dimensions. In this paper, we optimize the cuboid materialization strategy of Kylin by reducing the number of cuboids, building on traditional OLAP optimization methods. We optimize the strategy in two respects. First, we propose a Lazy-Building strategy to delay the construction of nonessential cuboids and shorten cuboid initialization time. Second, we adopt a Materialized View Self-Adjusting Algorithm to eliminate cuboids that have not been used for a long period. Experimental results demonstrate the efficacy of the proposed Distributed Self-Adaption Cube Building Model: compared with the cube building model of Kylin, cube initialization speed increases by 28.5 percent and 65.8 percent of space is saved.
Keywords: Distributed OLAP, Distributed Query Processing System, Kylin,
Query Log, Materialization Strategy
1 Introduction
In the era of big data, many modern companies produce huge amounts of data in their service lines. These data are used for report analysis based on OLAP. To conduct report analysis, companies need a system that can respond to queries from thousands of data analysts at the same time, which requires high scalability, stability, accuracy, and speed. In fact, there is no widely accepted method in the distributed OLAP field. Many query engines, such as Presto [4], Impala [2], Spark SQL [14], and Elasticsearch [10], can also conduct report analysis, but they place more emphasis on general data query and analysis. In practice, Kylin [7] is the specialized and frequently used tool in the distributed OLAP field.
Kylin was originally developed by eBay and is now a project of the Apache Software Foundation. It is designed to accelerate analysis on Hadoop and allow the use of SQL-compatible tools. It provides a SQL interface and supports multidimensional analysis on Hadoop for extremely large datasets. Kylin can perform OLAP analysis over very large datasets at second or even millisecond latency, so it is frequently used in the Chinese IT industry.
The idea behind Kylin is not original. Many of its technologies have been used to accelerate analysis over the past 30 years: storing pre-calculated results, generating each level's cuboids with all possible combinations of dimensions, and calculating metrics at different levels. Essentially, Kylin extends the methods of the traditional OLAP field to the distributed setting, generating cubes on the Hadoop ecosystem.
As data grows larger, pre-calculation becomes infeasible even with powerful hardware. However, with the benefit of Hadoop's distributed computing power, calculation jobs can leverage hundreds of thousands of nodes [9]. This allows Kylin to perform these calculations in parallel and merge the final result, thereby significantly reducing the processing time.
Data cube [5] construction is the core of Kylin. It has two characteristics: first, the exponential growth of cuboids [5] with the number of dimensions; second, the large amount of I/O due to the increased number of cuboids. The cube is usually very sparse, and the growth of sparse data wastes a lot of computing time and memory space.

A full n-dimensional data cube contains 2^n cuboids [5]. However, most cuboids are never used, because most queries issued by data analysts follow a concentrated distribution. That wastes I/O and memory.
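The exponential growth is easy to see by enumerating the cuboid lattice directly. A minimal Python sketch (illustrative only, not part of Kylin):

```python
from itertools import combinations

def all_cuboids(dimensions):
    """Enumerate every cuboid (subset of dimensions) of a full data cube,
    from the base cuboid down to the empty apex cuboid."""
    dims = list(dimensions)
    for k in range(len(dims), -1, -1):
        for combo in combinations(dims, k):
            yield combo

# A full n-dimensional cube has 2^n cuboids: 16 for 4 dimensions.
cuboids = list(all_cuboids(["A", "B", "C", "D"]))
print(len(cuboids))  # 16
```

Adding a ninth dimension would double the count to 512 cuboids, which illustrates why pruning unused cuboids matters.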
In this paper, we propose a self-adaption cube building model that lazily builds cuboids and abandons useless cuboids based on the query log. It can greatly reduce cube construction time and cube size, saving I/O and memory. The paper is structured as follows. Section 2 presents the background. Section 3 introduces the design and implementation details of the self-adaption cube building model. Section 4 focuses on experimental evaluation. Finally, Section 5 discusses the Self-Adaption Cube Building Model and summarizes the paper.
2 Background
2.1 Cube Calculation Algorithm
There are several strategies for data cube materialization [11] that reduce the cost of aggregation and increase query processing efficiency, including the iceberg cube calculation algorithm [3], the condensed cube calculation algorithm [15], the shell fragment cube calculation algorithm [13], the approximate cube calculation algorithm [17], and the time-series data stream cube calculation algorithm [6]. They are all based on partial materialization [16], in which a sub-cube is selected and pre-calculated according to specific criteria. Partial materialization is a compromise among storage space, maintenance cost, and query processing efficiency.
In the process of iceberg cube calculation, only sub-cubes above a minimum threshold are aggregated and materialized. Beyer proposed the widely accepted BUC algorithm [12] for iceberg cube calculation.
According to the order of cuboid calculation, the methodologies of aggregation
calculation can be divided into two categories: top-down and bottom-up.
1. Top-Down: First, calculate the metric of the whole data cube, then perform a recursive search along each dimension. Second, check the iceberg conditions and prune branches that do not meet them. The most typical algorithm is the BUC algorithm, which performs best on sparse data cubes.
2. Bottom-Up: Starting from the base cuboid, compute high-level cuboids from low-level cuboids in the search lattice according to parent-child relationships. Typical algorithms are the Pipesort, Pipehash, Overlap, and Multiway aggregation algorithms [18].
However, Kylin does not follow the principle of partial materialization. In order to reduce unnecessary redundant calculation and shorten cube construction time, Kylin adopts a method called By Layer Cubing, a distributed version of the Pipesort algorithm, which is a bottom-up algorithm [1].
2.2 By Layer Cubing
As its name indicates, a full cube is calculated layer by layer: N dimensions, N-1 dimensions, N-2 dimensions, down to 0 dimensions. Each layer's calculation is based on its parent layer (except the first, which is based on the source data), so this algorithm needs N rounds of MapReduce running in sequence [8]. In each MapReduce job, the key is the composite of the dimensions and the value is the composite of the measures. When the mapper reads a key-value pair, it calculates its possible child cuboids: for each child cuboid, it removes one dimension from the key and outputs the new key and value to the reducer. The reducer gets the values grouped by key, aggregates the measures, and outputs the result to HDFS; one layer's MR is then finished. When all layers are finished, the cube is calculated. Fig. 1 describes the workflow.
This approach has some disadvantages:
1. It causes too much shuffling in Hadoop. The mapper does not aggregate; all records that have the same dimension values in the next layer are emitted to Hadoop and then aggregated by the combiner and reducer.
2. Many reads/writes on HDFS: each layer's cubing needs to write its output to HDFS for the next layer's MR to consume. In the end, Kylin needs another MR round to convert these output files to HBase HFiles for bulk loading. These jobs generate many intermediate files in HDFS.
All in all, the performance is not good, especially when the cube has many dimensions.
Fig. 1. By Layer Cubing. Each level of computation is a MapReduce task, and the tasks execute serially; an N-dimensional cube needs at least N MapReduce jobs.
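One round of the by-layer workflow above can be sketched as follows. This is an illustrative single-process simulation of the mapper/reducer roles, not Kylin's actual MapReduce code:

```python
from collections import defaultdict

def by_layer_round(parent_dims, parent_rows):
    """Simulate one By Layer Cubing MR round: derive every child cuboid of
    the current layer by removing one dimension from each key (mapper),
    then aggregate the measure values grouped by key (reducer)."""
    aggregated = defaultdict(int)  # (child cuboid, child key) -> measure
    for key, measure in parent_rows.items():
        for drop in range(len(parent_dims)):  # mapper: emit child key/value
            child_dims = parent_dims[:drop] + parent_dims[drop + 1:]
            child_key = key[:drop] + key[drop + 1:]
            aggregated[(child_dims, child_key)] += measure  # reducer: sum by key
    return aggregated

# Base cuboid [A B] with two rows; one round yields cuboids [A] and [B].
base = {("a1", "b1"): 2, ("a1", "b2"): 3}
layer1 = by_layer_round(("A", "B"), base)
print(layer1[(("A",), ("a1",))])  # 5
```

Note how every parent record is emitted once per child cuboid before any aggregation happens, which is exactly the shuffling cost criticized above.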
2.3 By Segment Cubing
In order to solve these shortcomings above, Kylin develops a new cube building
algorism called by segment cubing. The core idea is, each mapper calculates the
feed data block into a small cube segment (with all cuboids), and then output all
key/values to reducer; the reducer aggregates them into one big cube segment,
finishing the cubing; Fig. 2 illustrates the flow;
Fig. 2. By Segment Cubing.
Compared with By Layer Cubing, By Segment Cubing has two main differences:
1. The mapper performs pre-aggregation, which reduces the number of records the mapper outputs to Hadoop and also the number the reducer needs to aggregate.
2. One MR round can calculate all cuboids.
Based on the work above, we take advantage of both algorithms and optimize the cuboid materialization strategy.
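The by-segment idea can be sketched in a few lines: each "mapper" builds a complete local segment with all cuboids, and the "reducer" merges the segments in one round. This is an illustrative simulation, not Kylin's implementation:

```python
from collections import Counter
from itertools import combinations

def build_segment(block, n_dims):
    """Mapper: turn one data block into a small cube segment containing ALL
    cuboids, pre-aggregating locally before anything is shuffled."""
    segment = Counter()
    for row, measure in block:
        for k in range(n_dims + 1):
            for cuboid in combinations(range(n_dims), k):
                key = (cuboid, tuple(row[i] for i in cuboid))
                segment[key] += measure
    return segment

def merge_segments(segments):
    """Reducer: merge the per-block segments into one big cube segment."""
    cube = Counter()
    for seg in segments:
        cube.update(seg)
    return cube

blocks = [[(("a1", "b1"), 1)], [(("a1", "b2"), 2)]]
cube = merge_segments(build_segment(b, 2) for b in blocks)
print(cube[((0,), ("a1",))])  # 3: cuboid [A], rows pre-aggregated per block
```

Only one merge round is needed regardless of the number of dimensions, at the cost of each mapper holding a full local segment in memory.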
3 Design and Implementation
In this section, we first introduce the architecture of Self-adaption Cube Building
Model (SCBM) and the overall workflow. Then we explain cuboids Lazy-
Buliding and the cuboid spanning tree. Finally, we describe the implementation
details of the Materialized View Self-Adjusting Algorithm.
3.1 Architecture of Self-Adaption Cube Building Model
The overall architecture of the self-adaption cube building model is illustrated in Fig. 3.
Fig. 3. Architecture of Self-Adaption Cube Building Model.
The self-adaption cube building model takes a fact table [5] as the input of the overall system; usually the fact table is managed by the distributed data warehouse Hive. We first set the parameters of the cube model, such as the fields for analysis and the base cuboid level, and then build the base cuboids in a MapReduce job. After the construction of the base cuboids, the system can serve query requests: the query execution engine [7] resolves each query to find the required cuboids. If the cuboid has been generated, the query is executed; if the cuboid is missing, the lazy building module is triggered to build the cuboid using the method in Section 3.2. When the query result returns, the system records the query log and waits for the adjustment of the cube launched by the self-adaption module according to the Materialized View Self-Adjusting Algorithm explained in Section 3.3. At the same time, the system maintains a dynamic cuboid spanning tree to store the metadata of cuboids.
3.2 Cuboid Spanning Tree and Lazy-Building
Cuboid Spanning Tree. In the original By Layer Cubing, Kylin calculates the cuboids in Breadth First Search (BFS) order, which wastes memory. By contrast, the cuboid spanning tree generates cuboids in Depth First Search (DFS) order to reduce the number of cuboids that need to be cached in memory. This avoids unnecessary disk and network I/O, and the resources Kylin occupies are greatly reduced.

With the DFS order, the output of a mapper is fully sorted (except in some special cases), since the row key of a cuboid is composed of the cuboid ID and the dimension values, like [Cuboid ID + dimension values], and inside a cuboid the rows are already sorted. Since the mapper outputs are already sorted, the shuffle sort is more efficient.

In addition, DFS order is very suitable for cuboid lazy building. The cuboid spanning tree also records the metadata of cuboids in every node of the tree, which provides the basis for the selection of ancestor cuboids.
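The DFS traversal can be sketched as below. Restricting which dimension position a child may drop gives each cuboid exactly one parent, turning the lattice into a spanning tree; this is an illustrative sketch, not Kylin's implementation:

```python
def dfs_spanning_tree(cuboid, min_drop=0, order=None):
    """Visit every cuboid reachable from `cuboid` in DFS order.
    A child drops one dimension at position >= min_drop, so each of the
    2^n cuboids is visited exactly once (a spanning tree of the lattice)."""
    if order is None:
        order = []
    order.append(cuboid)
    for i in range(min_drop, len(cuboid)):
        # Child cuboid: same dimensions minus the one at position i.
        dfs_spanning_tree(cuboid[:i] + cuboid[i + 1:], i, order)
    return order

order = dfs_spanning_tree(("A", "B", "C"))
print(order[:3])  # the base cuboid and the start of its first branch
```

At any moment only the cuboids on the current root-to-leaf path need to be cached, which is the memory saving DFS buys over BFS.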
Lazy-Building. Lazy-Building is a basic concept of the model. In order to reduce the number of cuboids, we adopt a strategy of generation on demand. At the same time, we persist all cuboids on the low layers of the By Layer Cubing algorithm for higher speed and lower computational complexity of Lazy-Building. For example, suppose a cube has 4 dimensions: A, B, C, D; each mapper has 1 million source records to process; and the column cardinalities in the mapper are Card(A), Card(B), Card(C), and Card(D). Lazy-Building is demonstrated in Fig. 4.
1. The user sets a base-layer parameter in the cube model info to control the scale of the base cuboid layer. If this parameter is not set, the default value log(dimensions) + 1 is used.
2. The base cuboid building module imports data from the fact table and builds the base cuboids with the Cube Build Engine in Kylin.
3. Update the cuboid spanning tree and save the metadata.
4. A client launches a query `select avg(measure_i) from table group by C`, which hits the missing cuboid [C]. The lazy building module then receives a request to build cuboid [C].
5. The lazy building module finds a cuboid generation path according to Ancestor Cuboids Selection and builds the missing cuboid to respond to the query as soon as possible.
6. Record the path and determine whether to build all the cuboids on the path at low load, according to the Materialized View Self-Adjustment Algorithm.
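The build-on-miss flow of steps 4 and 5 can be sketched as follows; the class and function names are hypothetical, chosen only to illustrate the idea:

```python
class LazyCube:
    """Minimal sketch of Lazy-Building: a cuboid is materialized only the
    first time a query needs it (names are illustrative, not Kylin's API)."""

    def __init__(self, base_cuboids):
        self.materialized = dict(base_cuboids)  # cuboid dims -> rows

    def query(self, cuboid, build_from_ancestor):
        if cuboid not in self.materialized:  # miss: trigger lazy building
            self.materialized[cuboid] = build_from_ancestor(self.materialized)
        return self.materialized[cuboid]

# The base layer holds [A B C]; the first "group by C" query builds [C].
base = {("A", "B", "C"): {("a1", "b1", "c1"): 2, ("a1", "b2", "c1"): 3}}

def build_c(materialized):
    out = {}
    for (a, b, c), v in materialized[("A", "B", "C")].items():
        out[(c,)] = out.get((c,), 0) + v  # aggregate over A and B
    return out

cube = LazyCube(base)
print(cube.query(("C",), build_c))  # {('c1',): 5}
```

A second query for [C] would find the cuboid already materialized and skip the build entirely.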
Ancestor Cuboids Selection. When the needed cuboid is missing, we select an ancestor cuboid and a cuboid generation path. The basic principle is to choose the ancestor cuboid with the fewest measures to aggregate, which yields the minimum amount of computation and time to generate the missing cuboid. After that, we find a path P from the ancestor cuboid to the missing cuboid in compliance with the minimum-cardinality principle.

For example, in Fig. 4, to generate the missing cuboid [C] we first find all candidate cuboids: [A B C], [A C D], and [B C D]. Then we compare the sizes of the three candidates. Assuming [B C D] is selected, we generate [C] by aggregating [B C D] over dimensions B and D. This cuboid is enough to answer the query. However, for the sake of cube maintenance according to By Layer Cubing, we need to find a path from [B C D] to [C].
Fig. 4. Lazy-Building and Ancestor Cuboids Selection.
When aggregating from a parent to a child cuboid, say from the base cuboid [B C D] to the 1-dimension cuboid [C], there are two paths: [B C D] → [B C] → [C] and [B C D] → [C D] → [C]. Assume Card(D) > Card(B) and that dimension A is independent of the other dimensions. After one aggregation step, the intermediate cuboid's size will be about 1/Card(D) or 1/Card(B) of the size of the base cuboid, so the output of this step shrinks by the same factor. We therefore choose the first path: the records written from mapper to reducer are reduced to 1/Card(D) of the original size. Less output to Hadoop means less I/O and computation, so the model attains better performance.
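Both rules, pick the smallest ancestor, then drop the highest-cardinality surplus dimension first, can be sketched together. The sizes and cardinalities below are made-up illustrative numbers:

```python
def choose_ancestor(missing, sizes):
    """Among materialized cuboids that contain all dimensions of the missing
    cuboid, pick the one with the fewest rows to aggregate."""
    candidates = [c for c in sizes if set(missing) <= set(c)]
    return min(candidates, key=lambda c: sizes[c])

def min_cardinality_path(ancestor, missing, card):
    """Build the generation path by dropping surplus dimensions one at a
    time, highest cardinality first, so each step shrinks the output most."""
    path, current = [ancestor], list(ancestor)
    for dim in sorted(set(ancestor) - set(missing), key=card.get, reverse=True):
        current.remove(dim)
        path.append(tuple(current))
    return path

sizes = {("A", "B", "C"): 500, ("A", "C", "D"): 400, ("B", "C", "D"): 300}
card = {"B": 10, "C": 5, "D": 100}
anc = choose_ancestor(("C",), sizes)
print(anc)                                      # ('B', 'C', 'D')
print(min_cardinality_path(anc, ("C",), card))  # [('B','C','D'), ('B','C'), ('C',)]
```

With Card(D) > Card(B) the sketch reproduces the paper's choice: [B C D] → [B C] → [C].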
3.3 Materialized View Self-Adjustment Algorithm
The self-adaption module adjusts the cube according to the Materialized View Self-Adjusting Algorithm. This section proposes a query statistics method that takes a fixed number of queries as a statistical period and updates the corresponding query statistics each period. The method adjusts the materialized view set according to the thresholds of elimination and generation, stabilizes query efficiency, and minimizes churn of the materialized view set.
Query Statistics Method. A statistics method for queries.
Definition 1: Materialized view adjustment cycle.
The materialized view adjustment cycle can be set to a fixed number of queries; for example, every 100 queries form one materialized view adjustment cycle.
Definition 2: Average query statistics.
Since the actual queries may change over time, the query set should be adjusted accordingly. For example, a query that has not been executed for a couple of cycles should be removed from the query collection, and the corresponding materialized view should be deleted.
After many queries, the query log accumulates a certain number of query records. This paper presents a query statistics method based on the query log, described as follows. Given a query set Q = {q_1, q_2, ..., q_n} and the query log set L, scan the log file backwards from its end, determine whether each query q_i appears in the cycle T_j, and update the average count Count_avg(q_i) according to Equation 1:

    Count_avg(q_i) = α · Count_Tj(q_i) + (1 − α) · Count_avg(q_i)    (1)

In the formula, α is a weighting coefficient (a constant), and Count_Tj(q_i) is the number of times q_i appears in cycle T_j. With this method we can monitor changes in the query set Q, which greatly reduces churn of the materialized views.
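The per-cycle update of Equation 1 amounts to an exponentially weighted moving average. A small sketch, assuming the counts are kept in plain dictionaries:

```python
def update_average_counts(avg, cycle_counts, alpha=0.5):
    """End-of-cycle update of the average query statistics (Equation 1):
    new average = alpha * count in this cycle + (1 - alpha) * old average.
    Queries unseen this cycle decay; queries never averaged start from 0."""
    queries = set(avg) | set(cycle_counts)
    return {q: alpha * cycle_counts.get(q, 0) + (1 - alpha) * avg.get(q, 0)
            for q in queries}

avg = {"q1": 4.0}           # running statistics before cycle T_j
cycle = {"q1": 8, "q2": 2}  # raw counts observed during T_j
avg = update_average_counts(avg, cycle)
print(avg["q1"], avg["q2"])  # 6.0 1.0
```

Because old counts decay geometrically, a single busy or quiet cycle cannot flip a view's status, which is what keeps the materialized view set stable.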
Materialized View Self-Adjustment Algorithm. The main steps of materialized view set adjustment as the queries change are as follows:
1. Prior to adjustment, initialize the materialized view set M = {base cuboids} and the corresponding query task set Q.
2. During querying, each query is written into the query log L and the query counter is incremented.
3. Set the threshold of elimination T and the threshold of generation S, update the average query statistics every cycle, and determine whether to eliminate or materialize the corresponding views.
The pseudocode of the Materialized View Self-Adjustment Algorithm is shown in Algorithm 1.

Input: query log L; materialized view set M; query task set Q; materialized view adjustment cycle T_c; threshold of elimination T; threshold of generation S; path set P from ancestor cuboid to missing cuboid
Output: materialized view set after adjustment M

1   get the current query count value count;
2   if count mod T_c = 0 then
3       for j = 1; j <= T_c; j++ do
4           scan the log file backwards from its end;
5           update query task set Q according to L and T_c;
6       end
7   end
8   update Count_avg(q_i) according to Equation 1;
9   for each q_i in Q do
10      if Count_avg(q_i) >= S then
11          materialize the views m corresponding to q_i;
12          M.add(m);
13      end
14      else if Count_avg(q_i) <= T then
15          eliminate the views m corresponding to q_i;
16          M.delete(m);
17      end
18  end
19  return M;

Algorithm 1: Materialized View Self-Adjustment Algorithm
In the above algorithm, lines 1 to 8 scan the query log in a statistical period T_c and update the query task set Q during the scan. Lines 9 to 18 iterate over the queries in Q and determine whether to eliminate or materialize the corresponding views by comparing the thresholds with the Count_avg(q_i) calculated by Equation 1. Suppose the query task set Q contains k different queries; then the time complexity of the algorithm is O(T_c + k).
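Lines 9–18 of the algorithm reduce to a threshold sweep over the statistics. A minimal sketch, assuming one materialized view per query for illustration:

```python
def adjust_views(avg_counts, materialized, S, T):
    """One adjustment pass: materialize the view of any query whose average
    count reaches the generation threshold S, and eliminate the view of any
    query whose count has fallen to the elimination threshold T or below.
    Queries between the thresholds are left untouched."""
    for q, count in avg_counts.items():
        if count >= S:
            materialized.add(q)      # build the view (M.add in Algorithm 1)
        elif count <= T:
            materialized.discard(q)  # drop the stale view (M.delete)
    return materialized

views = adjust_views({"q1": 10.0, "q2": 0.2, "q3": 3.0},
                     materialized={"q2", "q3"}, S=5, T=1)
print(sorted(views))  # ['q1', 'q3']
```

The gap between T and S acts as a hysteresis band, so views whose counts hover near a single threshold are not repeatedly built and dropped.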
4 Experimental Evaluation
4.1 Dataset
To test performance, we use the standard weather dataset from the China Meteorological Data Network. The dataset contains 4,726,499 weather records from 2,170 distinct counties in China, covering January 1, 2011 to January 1, 2017. The original dataset is too complicated, so to better conduct the experiment we select eight dimensions (province, city, county, date, weather, wind direction, wind speed, air quality level) and two measures (maximum temperature, minimum temperature).
4.2 Evaluation Metrics
We use cube first construction time, average query time, and cube size as the evaluation metrics of our proposed method.

Cube First Construction Time refers to the base cuboid building time for the self-adaption cube building model.

Average Query Time is defined as the average query time during the materialized view adjustment cycles T_1 ~ T_30. For a cycle T_i containing n queries with response times t_1, ..., t_n, it is calculated as in Equation 2:

    t_avg(T_i) = (t_1 + t_2 + ... + t_n) / n    (2)

Cube Size refers to the disk space the whole cube takes up.
4.3 Experimental Results
We first compare the metric of cube first construction time. Because the base-layer parameter has a great impact on this metric, in order to reflect the average condition we use the default value log(dimensions) + 1. We test each model 5 times and average the results to smooth out the impact of MapReduce failures. As Table 1 shows, the time consumption of the new model is reduced by 28.5%.

Table 1. Cube First Construction Time

Model                               Test 1  Test 2  Test 3  Test 4  Test 5  Average
original Kylin cube building model  92min   83min   86min   104min  91min   91.2min
self-adaption cube building model   64min   61min   83min   57min   61min   65.2min
Fig. 5. Average query time trends in T1 ∼ T30.
For query time, we set the materialized view adjustment cycle to 50 queries and test 30 cycles, T_1 ~ T_30. We observe that the cuboid hit rate increases and the query response time improves significantly as query requests accumulate. Finally, the query efficiency of the two models is almost on a par.
Fig. 6. Average cube size trends in T1 ∼ T30.
For cube size, Fig. 6 shows that the curve tends to stabilize after fluctuating in the early phase. Finally, the space consumption of the proposed model is reduced by 65.83%.
5 Conclusion
We have presented a Distributed Self-Adaption Cube Building Model based on query logs and applied it to a weather dataset to test its performance. Our model adopts a special partial materialization strategy and can automatically adjust the cuboid set used by query requests according to the query log. Based on experimental results, the proposed model reduces cube construction time and cube size to a great extent at the expense of a tiny reduction in query efficiency. However, this model performs well only when the query distribution is relatively concentrated, so users can choose either of the two models according to their practical business query scenario. Overall, the proposed model is of great practical significance for BI tools. In the next stage, we will optimize the base cuboid generation strategy to reduce the query latency in the early phase.
6 Acknowledgement
This work is supported by the National Key Projects of the Scientific and Technical Supporting Programs of China (Grant No. 2015BAH07F01) and the Engineering Research Center of Information Networks, Ministry of Education.
7 References
1. Ying Chen, Frank Dehne, Todd Eavis, and Andrew Rau-Chaplin. Parallel rolap data cube
construction on shared-nothing multiprocessors. Distributed and parallel Databases,
15(3):219–236, 2004.
2. Impala. http://impala.apache.org/, 2017. [Online; accessed 13-April-2017].
3. Prasad M Deshpande, Rajeev Gupta, and Ashu Gupta. Distributed iceberg cubing over
ordered dimensions, March 16 2015. US Patent App. 14/658,542.
4. Presto. https://prestodb.io/, 2017. [Online; accessed 13-April-2017].
5. Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali
Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation operator
generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery,
1(1):29–53, 1997.
6. Mateusz Kalisch, Marcin Michalak, Piotr Przystałka, Marek Sikora, and Łukasz Wróbel. Outlier detection and elimination in stream data – an experimental approach. In International Joint Conference on Rough Sets, pages 416–426. Springer, 2016.
7. Kylin. http://kylin.apache.org/, 2017. [Online; accessed 13-April-2017].
8. Suan Lee, Jinho Kim, Yang-Sae Moon, and Wookey Lee. Efficient distributed parallel
top-down computation of rolap data cube using mapreduce. In International Conference on
Data Warehousing and Knowledge Discovery, pages 168–179. Springer, 2012.
9. Feng Li, M Tamer Ozsu, Gang Chen, and Beng Chin Ooi. R-store: a scalable distributed
system for supporting real-time analytics. In Data Engineering (ICDE), 2014 IEEE 30th Inter-
national Conference on, pages 40–51. IEEE, 2014.
10. Elasticsearch. https://www.elastic.co/products/elasticsearch, 2017. [Online; accessed 13-
April-2017].
11. Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed cube
materialization on holistic measures. In Data Engineering (ICDE), 2011 IEEE 27th Interna-
tional Conference on, pages 183–194. IEEE, 2011.
12. Yongge Shi and Yiqun Zhou. An improved apriori algorithm. In Granular Computing
(GrC), 2010 IEEE International Conference on, pages 759–762. IEEE, 2010.
13. Rodrigo Rocha Silva, Celso Massaki Hirata, and Joubert de Castro Lima. Computing big
data cubes with hybrid memory. Journal of Convergence Information Technology, 11(1):13,
2016.
14. Spark SQL. http://spark.apache.org/sql/, 2017. [Online; accessed 13-April-2017].
15. Wei Wang, Jianlin Feng, Hongjun Lu, and Jeffrey Xu Yu. Condensed cube: An effective
approach to reducing data cube size. In Data Engineering, 2002. Proceedings. 18th Interna-
tional Conference on, pages 155–165. IEEE, 2002.
16. Ying Xia, Ting Ting Luo, Xu Zhang, and Hae Young Bae. A parallel adaptive partial
materialization method of data cube based on genetic algorithm. 2016.
17. Dan Yin, Hong Gao, Zhaonian Zou, Jianzhong Li, and Zhipeng Cai. Approximate ice-
berg cube on heterogeneous dimensions. In International Conference on Database Systems for
Advanced Applications, pages 82–97. Springer, 2016.
18. Yihong Zhao, Prasad M Deshpande, and Jeffrey F Naughton. An array-based algorithm
for simultaneous multidimensional aggregates. In ACM SIGMOD Record, volume 26, pages
159–170. ACM, 1997.