
Enabling Self-Service BI on Document Stores

Mohamed L. Chouder, Stefano Rizzi, Rachid Chalal

LMCS, ESI, Algiers, Algeria

DISI, University of Bologna, Italy

March 21, 2017, DOLAP’17, Venice

Summary

1. Motivation
2. Goal
3. Our approach:
   1. Overview
   2. Multidimensional enrichment
   3. Querying
   4. OLAP enabling
4. Experiments
5. Conclusions


Motivation

• Over the past decade, NoSQL systems have emerged as an alternative to RDBMSs in several implementations
  ◦ key-value
  ◦ column-based
  ◦ graph-based
  ◦ document-based
• Document stores collect nested, denormalized, and hierarchical documents in the form of collections
  ◦ The documents within a collection may present a structural variety due to schema flexibility and data evolution
  ◦ Documents are self-describing and mainly encoded using the JSON data format

[Callout: schemaless]

Motivation

• The vast amounts of semi-structured data lying in document stores could be profitably integrated into enterprise BI systems, but...
  ◦ ...due to their schemaless nature, document stores are hardly accessible via direct OLAP querying
  ◦ ...the SQL-like query interfaces available (e.g., Apache Drill, Spark SQL, Presto) provide access to schemaless and nested data but do not support OLAP over document stores
• Two possible approaches for OLAP querying of document stores
  ◦ schema-on-write
  ◦ schema-on-read


Motivation

[Figure: operational data, data lake, data warehouse, OLAP]

• A schema-on-write approach forces a fixed structure in data and loads them into a DW
• A schema-on-read approach leaves data unchanged in their structure until they are accessed by the user

Goal

• Devising an interactive, schema-on-read approach for finding multidimensional structures in document stores, aimed at enabling OLAP querying in the context of self-service BI, where business users can proactively create custom reports without involving the IT department
• To obtain the information needed to discover multidimensional concepts despite the lack of structure, we look for approximate functional dependencies (AFDs)
• The approach has been implemented on top of MongoDB, one of the most popular document stores


Overview

[Figure: approach overview, with the components Document Store, C-Schema, and Md-Schema and the phases Multidimensional Enrichment, Querying, and OLAP Enabling]

• A schema is extracted from the collection and enriched with basic md knowledge to create a draft multidimensional schema
• The user formulates an md query, which is checked against data: if it is well-formed, it is executed; otherwise, the md-schema is refined
• Local portions of md hierarchies are built by mining AFDs from data
• The user applies an OLAP operator to iteratively create a new query on the collection

Example

• NBA games

along with relational data in one system [5, 17] (e.g., Oracle, IBM DB2, and PostgreSQL). Current systems do not impose a fixed relational schema to store and query data, but derive and maintain a dynamic schema to be used for schema-on-read SQL/JSON querying instead [17]. Other solutions that fall into this category are virtual adapters that expose a relational view of the data in a document store, to be used in common BI tools; an example is the MongoDB BI connector (www.mongodb.com/products/bi-connector).

The second line of solutions consists of SQL-like interfaces that offer extensions to query schemaless data persisted in document stores or as JSON files (e.g., in Hadoop). Spark SQL [3] has been designed for relational data processing of native Spark distributed datasets in addition to diverse data sources and formats (including JSON). In order to run SQL queries on schemaless data in Spark SQL, a schema is automatically inferred beforehand. Apache Drill (drill.apache.org) is a distributed SQL engine that can join data from multiple data sources, including Hadoop and NoSQL databases. It has built-in support for JSON with columnar in-memory execution. Drill is also able to dynamically discover a schema during query processing, which allows it to handle schemaless data without defining a schema upfront. Finally, Presto (prestodb.io) is a distributed SQL engine, developed at Facebook for interactive queries, that can also combine data from multiple kinds of data sources including relational and non-relational databases. When querying schemaless data, Presto tries to discover data types, but it may need a schema to be defined manually in order to run queries. All the above-mentioned engines differ in their architecture, support for ANSI SQL, and extensions to SQL to handle schemaless and nested data. These engines can connect to MongoDB, which gave us various architectural options for implementation. However, none of them supports multidimensional or OLAP querying over document stores.

3. DOCUMENT STORES

A document store holds a set of databases, each organizing the storage of documents in the form of collections in a way similar to rows and tables in relational databases. A document, also called object, consists of a set of key-value pairs (or fields) mainly encoded in JSON format (www.json.org). JSON has many variants that are used for storage and optimization purposes such as BSON (www.bsonspec.org) in MongoDB and, very recently, OSON in the Oracle DBMS [17]. Keys are always strings, while values have the following types: primitive (number, string, Boolean), object (or sub-document), and array of atomic values or objects.

From a conceptual point of view, a typical collection consists of a set of business entities connected through two kinds of relationships: nesting and references. Nesting consists of embedding objects arbitrarily within other objects, which corresponds to two types of relationships: to-one in the case of a nested object and to-many in the case of an array of nested objects. On the other hand, references are similar to foreign keys in relational databases and can be expressed manually or using a specific mechanism such as $ref in MongoDB. However, when references are used, getting data requires joins which are not supported in document stores; so, nesting is most often used instead. Since a document store has a flexible schema there are a variety of ways for data representation; nevertheless, the way data is modeled may affect performance and data retrieval patterns.

games: {
  "_id": "52f2a70eddbd75540aba7c06",
  "date": ISODate("1989-03-27"),
  "teams": [
    { "name": "Detroit Pistons",
      "abbreviation": "DET",
      "score": 90,
      "home": true,
      "won": 1,
      "results": { "points": 90, ... },
      "players": [
        { "player": "Isiah Thomas",
          "points": 30, ... },
        ...
      ]
    },
    { "name": "Dallas Mavericks",
      "abbreviation": "DAL",
      "score": 77,
      "home": false,
      "won": 0,
      "results": { "points": 77, ... },
      "players": [
        { "player": "Rolando Blackman",
          "points": 23, ... },
        ...
      ]
    }
  ]
}

(a)

games: {
  "_id": "52f2a70eddbd75540aba7c06",
  "date": ISODate("1989-03-27"),
  "hometeam": {
    "name": "Detroit Pistons",
    "abbreviation": "DET",
    "score": 90,
    "home": true,
    "won": 1,
    "results": { "points": 90, ... },
    "players": [
      { "player": "Isiah Thomas",
        "points": 30, ... },
      ...
    ]
  },
  "guestteam": {
    "name": "Dallas Mavericks",
    "abbreviation": "DAL",
    "score": 77,
    "home": false,
    "won": 0,
    "results": { "points": 77, ... },
    "players": [
      { "player": "Rolando Blackman",
        "points": 23, ... },
      ...
    ]
  }
}

(b)

Figure 1: Sample JSON document of a game

Example 1. In the scope of this paper, we focus on collections of JSON documents that logically represent the same business entities, while expecting that their structure may vary. In particular, we will use as a working example a collection that models NBA games (adapted from [26]), where a document represents one game between two teams along with players and team results. In the sample document shown in Figure 1a the main object is games, which holds the array of objects teams (to-many), which in turn holds the object results (to-one) and the array of objects players (to-many). Figure 1b shows an alternative JSON representation of the same domain; the relationship between game and teams is modeled using two separate sub-documents (home and guest team) instead of an array. Note that a simple query that computes the average score by team would be much easier to implement using the first representation and it would also perform better, since in the second representation it would require two separate queries.
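The difference between the two representations can be illustrated with a minimal in-memory sketch (plain Python, not a document-store query). The second game and the helper names `to_representation_b`, `avg_score_a`, and `avg_score_b` are invented for illustration; only the first game's values come from Figure 1.

```python
from collections import defaultdict

# Representation (a): teams stored as an array of objects.
# The second game is invented sample data, used only to have two rows per team.
games_a = [
    {"date": "1989-03-27",
     "teams": [{"name": "Detroit Pistons", "score": 90},
               {"name": "Dallas Mavericks", "score": 77}]},
    {"date": "1989-03-29",
     "teams": [{"name": "Detroit Pistons", "score": 100},
               {"name": "Chicago Bulls", "score": 96}]},
]


def to_representation_b(game):
    """Rewrite a game so that the two teams become separate sub-documents."""
    home, guest = game["teams"]
    return {"date": game["date"], "hometeam": home, "guestteam": guest}


games_b = [to_representation_b(g) for g in games_a]


def average(pairs):
    """Average the scores of (team name, score) pairs by team name."""
    totals = defaultdict(lambda: [0, 0])
    for name, score in pairs:
        totals[name][0] += score
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


def avg_score_a(games):
    # One pass: every team is reachable through the same path teams[*].
    return average((t["name"], t["score"]) for g in games for t in g["teams"])


def avg_score_b(games):
    # Two separate extractions (home and guest) that must then be merged.
    home = [(g["hometeam"]["name"], g["hometeam"]["score"]) for g in games]
    guest = [(g["guestteam"]["name"], g["guestteam"]["score"]) for g in games]
    return average(home + guest)
```

Both functions return the same averages; the point is only that representation (b) needs one extraction per sub-document before the results can be combined.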

4. APPROACH

As already discussed, our approach for enabling self-service BI on document stores includes three phases: multidimensional enrichment, querying, and OLAP enabling; these phases are discussed in detail in the following subsections.

4.1 Multidimensional Enrichment

The aim of this phase is to define a draft multidimensional schema on which the user will be able to formulate queries. To this end, we extract the schema of the input collection and enrich it with basic multidimensional knowledge. Several works have addressed schema extraction and discovery from document stores [25, 29, 14]. These works focus on the structural variety of documents within the same collection caused by their schemaless nature and by the evolution of data; for instance, [14] proposes to extract a JSON schema (www.json-schema.org) similar to a JSON document and

Multidimensional enrichment

• What: defining a draft md-schema on which the user will be able to formulate queries
• How: extract the c-schema of the input collection and enrich it with basic md knowledge based on the type of attributes and respecting the multidimensional space arrangement constraint


infer attribute and type occurrence frequencies to emphasize the structural variety. In [29], all the schema versions in a collection are extracted and stored in a repository to be used within a schema management framework. In order to have a single view of the distinct schemata in a collection, the authors propose a relaxed form of schema called skeleton that better captures the core and prominent fields and filters out infrequent ones. This would be useful for us since we rely on a single view of the collection; however, the computational cost for building skeletons is high, making them unsuitable in real-time contexts.

Since schema discovery is not the focus of this paper, we adopt a simple algorithm that builds a tree-like schema including all the fields that appear in the documents [3]. This algorithm works in a single pass over the data to infer a tree structure of fields and their corresponding types (fields with varying types are generalized to String).
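A single-pass inference of this kind might look like the following sketch. It is not the algorithm of [3]; the function names `type_name` and `infer_types`, the type labels, and the choice to map unrecognized values to string are assumptions made here for illustration.

```python
def type_name(value):
    """Map a Python value to a coarse type label (labels are an assumption)."""
    if isinstance(value, bool):  # must be tested before int: bool is an int subclass
        return "Boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "decimal"
    return "string"  # strings and anything unrecognized


def infer_types(documents):
    """Visit every document once and record one type per dotted field path."""
    types = {}

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, path + [key])
        elif isinstance(value, list):
            for item in value:  # array elements share the path of the array
                visit(item, path)
        else:
            name = ".".join(path)
            observed = type_name(value)
            previous = types.get(name)
            if previous is None:
                types[name] = observed
            elif previous != observed:
                types[name] = "string"  # varying types are generalized to string

    for document in documents:
        visit(document, [])
    return types
```

For instance, a field that holds a number in one document and text in another ends up typed as string, while fields missing from some documents are still included in the result.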

The tree-like collection schema that will be the input for the subsequent steps of our approach is defined as follows.

Definition 1 (Collection Schema). Given a collection $D$, its schema (briefly, c-schema) is a tree $S = (K, E)$ where:

• $K$ is the union of the sets of keys appearing in the documents of $D$;

• $r \in K$ is the distinguished root node and is labelled with the collection name;

• $E$ is a set of arcs that includes (i) an arc $r \to k$ for each key/value pair in the root $r$, and (ii) an arc $k_1 \to k_2$ iff $k_2$ appears as a key in an array of nested objects having key $k_1$.

Figure 2 shows the c-schema to which the document of Figure 1a conforms. A path from the root key to a leaf key is called an attribute. In denoting attributes we use the dot notation omitting the root key (since a single collection is involved) but including the keys corresponding to nested objects; so, for instance, the c-schema in Figure 2 contains attributes id, date, teams.name, teams.results.points, teams.players.player, teams.players.points, etc. The depth of an attribute $a$ is the number of arcs included in its path, and is denoted by $depth(a)$ (e.g., $depth(id) = 1$); the set of attributes at depth $\delta$ is denoted by $Attrs(\delta)$.
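The dot notation and the depth of an attribute can be made concrete with a small sketch. It assumes, following Definition 1, that only arrays of nested objects add an arc, while plain nested objects merely extend the dotted name; the function name `attribute_depths` and the trimmed sample document are introduced here for illustration.

```python
def attribute_depths(document):
    """Return {attribute name in dot notation: depth} for one document."""
    depths = {}

    def visit(value, path, depth):
        if isinstance(value, dict):
            # Nested object (to-one): keys extend the name, no new arc.
            for key, child in value.items():
                visit(child, path + [key], depth)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Array of nested objects (to-many): its keys are one arc deeper.
            for item in value:
                for key, child in item.items():
                    visit(child, path + [key], depth + 1)
        else:
            depths[".".join(path)] = depth

    for key, child in document.items():
        visit(child, [key], 1)  # arcs from the root give depth 1
    return depths


# Trimmed version of the Figure 1a document (elided fields left out).
game = {
    "id": "52f2a70eddbd75540aba7c06",
    "date": "1989-03-27",
    "teams": [
        {"name": "Detroit Pistons", "score": 90,
         "results": {"points": 90},
         "players": [{"player": "Isiah Thomas", "points": 30}]},
    ],
}
depths = attribute_depths(game)
```

With this sample, `id` and `date` sit at depth 1, `teams.name`, `teams.score`, and `teams.results.points` at depth 2, and the `teams.players` attributes at depth 3.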

games
  id
  date
  teams
    name
    abbreviation
    home
    score
    results.points
    ...
    players
      player
      points
      ...

Figure 2: NBA games c-schema

After the c-schema $S$ of a collection has been defined, a corresponding multidimensional schema has to be derived from it. Since measures at different granularities can possibly be included in the documents (e.g., in our example, points at the player's level and score at the team level), the definition we provide relates each single measure to a specific set of dimensions.

Definition 2 (Md-Schema). A multidimensional schema (briefly, md-schema) is a triple $\mathcal{M} = (H, M, g)$ where:

• $H$ is a finite set of hierarchies; each hierarchy $h_i \in H$ is associated to a set $L_i$ of categorical levels and a roll-up partial order $\succeq_i$ of $L_i$. Each element $G \in 2^L$, where $L = \bigcup_i L_i$ and at most one level in $G$ is taken from each hierarchy $h_i$, is called a group-by set of $H$.

• $M$ is a finite set of numerical measures.

• $g$ is a function relating each measure in $M$ to its dimensions, i.e., to the set of levels that determine its granularity: $g : M \to \mathcal{G}$, where $\mathcal{G}$ is the set of all group-by sets of $H$.

At first, a draft md-schema $\mathcal{M}_{draft}$ is built from $S$ by tentatively labelling all attributes of types date, string, and Boolean as dimensions (i.e., as levels of different hierarchies), and all attributes of types integer and decimal as measures. Date dimensions are then decomposed into different categorical levels giving rise to standard temporal hierarchies (e.g., date $\succeq$ month $\succeq$ year). Note that no further attempt to discover hierarchies is made at this stage, since discovering the FDs involving all attributes would be computationally too expensive. As a consequence, in $\mathcal{M}_{draft}$ two levels $l$ and $l'$ might be erroneously modeled as two separate dimensions, while they should be actually part of the same hierarchy because $l \succeq l'$.

The user can contribute to this step by manually changing the label of some attributes, since in some cases a numeric attribute can be used as a dimension, and a non-numeric attribute can be used as a measure. The first situation has no impact on the following steps, since a group-by set can also include numeric attributes. Conversely, the second situation requires that the values of the non-numeric attribute are transformed into numerical values to be supported by aggregation functions.

To complete the definition of $\mathcal{M}_{draft}$, the granularity mapping $g$ must be built. We recall from [22] that an md-schema should comply with the multidimensional space arrangement constraint, stating that each instance of a measure is related to one instance of each dimension. In the c-schema $S$, attributes in $Attrs(\delta)$ are related by to-many multiplicity to those in $Attrs(\delta + 1)$, so a measure cannot be related to levels having a higher depth. Therefore, $g(m)$ is set by connecting each measure $m$ to the group-by set $G$ that contains all levels $l \in L$ such that $depth(m) \geq depth(l)$.
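The labelling rule and the granularity mapping can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `draft_md_schema`, the type labels, and the hand-written attribute table (types and depths for the attributes of Figure 2) are introduced here, and the decomposition of dates into temporal levels is left out.

```python
DIMENSION_TYPES = {"date", "string", "Boolean"}
MEASURE_TYPES = {"integer", "decimal"}


def draft_md_schema(attributes):
    """Build draft levels, measures, and the granularity mapping g.

    `attributes` maps each attribute name to a (type, depth) pair.
    """
    levels = {a for a, (t, _) in attributes.items() if t in DIMENSION_TYPES}
    measures = {a for a, (t, _) in attributes.items() if t in MEASURE_TYPES}
    # g(m): all levels l such that depth(m) >= depth(l).
    g = {
        m: {l for l in levels if attributes[l][1] <= attributes[m][1]}
        for m in measures
    }
    return levels, measures, g


# Types and depths written by hand for the attributes of the NBA c-schema.
nba_attributes = {
    "id": ("string", 1),
    "date": ("date", 1),
    "teams.name": ("string", 2),
    "teams.abbreviation": ("string", 2),
    "teams.home": ("Boolean", 2),
    "teams.score": ("integer", 2),
    "teams.results.points": ("integer", 2),
    "teams.players.player": ("string", 3),
    "teams.players.points": ("integer", 3),
}
levels, measures, g = draft_md_schema(nba_attributes)
```

Here `g["teams.score"]` contains the five levels of depth 1 or 2, while `teams.players.player` only appears in the group-by set of the player-level measure.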

Example 2. In this example and in the following ones we will refer to the representation in Figure 1a. Attribute teams.score is numerical, so it is assumed to be a measure. It is $depth(teams.score) = 2$, therefore teams.score is tentatively associated in $\mathcal{M}_{draft}$ with group-by set $G = \{teams.name, teams.abbreviation, teams.home, id, date\}$. All the measures that have the same depth of teams.score

[Figure: from the c-schema to two draft md-schemas, both named Game]

• Game: measures teams.score, teams.results.points; dimensions teams.name, teams.abbrev., teams.home, id, date
• Game: measure teams.players.points; dimensions teams.players.player, teams.name, teams.abbrev., teams.home, id, date

Querying

• What: supporting a non-expert user in formulating a well-formed md query including a group-by set, a set of measures, and an aggregation function
  ◦ $q = (G_q, M_q, \Sigma_q)$, for instance
    $G_q$ = {teams.name, date, id}
    $M_q$ = {teams.score}
    $\Sigma_q$ = Sum
• How: check the user query for base integrity and summarization integrity...
  ◦ check measures to ensure disjointness
  ◦ check group-by set to ensure completeness and base integrity
  ◦ check aggregation function to ensure summarization
• ...and possibly recommend alternative measures/group-by
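Since query execution is delegated to MongoDB, one way to picture a query $q$ is as an aggregation pipeline. The sketch below only builds the pipeline as plain Python data, using the standard `$unwind` and `$group` stages. The helper `build_pipeline`, its parameters, and the aliasing of dotted names with underscores are assumptions, and the slides do not say how the actual translation is done.

```python
# Aggregation functions mapped to MongoDB accumulator operators.
ACCUMULATORS = {"Sum": "$sum", "Avg": "$avg", "Min": "$min", "Max": "$max"}


def build_pipeline(group_by, measures, aggregation, unwind_paths):
    """Translate q = (G_q, M_q, Sigma_q) into an aggregation pipeline.

    `unwind_paths` lists the array fields to flatten before grouping
    (hypothetical parameter; it would normally be derived from the c-schema).
    """
    def alias(attribute):
        # Output field names cannot contain dots, so replace them.
        return attribute.replace(".", "_")

    operator = ACCUMULATORS[aggregation]
    pipeline = [{"$unwind": "$" + path} for path in unwind_paths]
    group = {"_id": {alias(level): "$" + level for level in group_by}}
    for measure in measures:
        group[alias(measure)] = {operator: "$" + measure}
    pipeline.append({"$group": group})
    return pipeline


pipeline = build_pipeline(
    group_by=["teams.name", "date", "id"],
    measures=["teams.score"],
    aggregation="Sum",
    unwind_paths=["teams"],
)
```

The resulting list could be passed to a collection's `aggregate` method by a MongoDB driver; the sketch stops before that step so that it runs without a database.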


[Figure: draft md-schema Game, with teams.score, teams.results.points, teams.name, teams.abbrev., teams.home, id, date]

Querying

• Check group-by set to ensure base integrity, i.e., that the levels in $G_q$ are mutually orthogonal
  ◦ query $q$ is valid only if, for each pair of levels in $G_q$, there is no -to-one relationship between them
  ◦ data are accessed to retrieve approximate FDs between the group-by levels
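A base integrity check along these lines can be sketched on flattened rows. The sketch assumes the ratio-based strength used for AFDs in the OLAP enabling phase (distinct values of the left-hand side over distinct value pairs); the function names, the default threshold, and the sample rows are invented for illustration.

```python
from itertools import permutations


def strength(rows, lhs, rhs):
    """Ratio of distinct lhs values to distinct (lhs, rhs) value pairs."""
    left = {row[lhs] for row in rows}
    both = {(row[lhs], row[rhs]) for row in rows}
    return len(left) / len(both)


def base_integrity_violations(rows, group_by, epsilon=1.0):
    """Return the ordered pairs (l, l') of group-by levels where l -> l' holds."""
    return [
        (l, r)
        for l, r in permutations(group_by, 2)
        if strength(rows, l, r) >= epsilon
    ]


# Invented rows, one per (game, team) after flattening the teams array.
rows = [
    {"id": "g1", "date": "1989-03-27", "teams.name": "Detroit Pistons"},
    {"id": "g1", "date": "1989-03-27", "teams.name": "Dallas Mavericks"},
    {"id": "g2", "date": "1989-03-29", "teams.name": "Detroit Pistons"},
    {"id": "g2", "date": "1989-03-29", "teams.name": "Chicago Bulls"},
    {"id": "g3", "date": "1989-03-27", "teams.name": "Chicago Bulls"},
    {"id": "g3", "date": "1989-03-27", "teams.name": "Boston Celtics"},
]
violations = base_integrity_violations(rows, ["teams.name", "date", "id"])
```

On this sample the only dependency found is id → date, so a group-by set containing both id and date would not be mutually orthogonal.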


[Figure: md-schema Game (teams.score, teams.results.won, teams.name, teams.abbrev., teams.home, id, date) shown twice; in the second version date is drawn along the same hierarchy as id]

OLAP enabling

• What: refine the md-schema by discovering local portions of md hierarchies
• How: mine (simple) approximate FDs to discover roll-ups and drill-downs for the attributes in the group-by set


[Figure: md-schema Game (teams.score, teams.results.won, teams.name, teams.abbrev., teams.home, id, date) shown twice; in the second version teams.abbrev. is drawn along the same hierarchy as teams.name]

OLAP enabling

• Discover roll-ups and drill-downs for the attributes in the group-by set
  ◦ Given two levels $l$ and $l'$, let $strength(l, l')$ denote the ratio between the number of unique values of $l$ and the number of unique values of $l\,l'$ (i.e., of the pair of levels taken together)
  ◦ AFD $l \to l'$ holds if $strength(l, l') \geq \varepsilon$, where $\varepsilon$ is a user-defined threshold
  ◦ Discovering possible roll-ups for $l \in G_q$ requires mining the AFDs of type $l \to l'$, where $l'$ is not in $G_q$ and is not deeper than $l$ in the c-schema
  ◦ To avoid useless checks, the AFDs whose right-hand side $l'$ has higher cardinality than $l$ are not checked, since they clearly cannot hold
  ◦ The AFDs are checked until one is found to hold; then, the md-schema is updated by placing these levels in the same hierarchy
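The roll-up discovery steps above can be sketched as follows, again on flattened rows. The function names, the sample rows, and the threshold value are assumptions; the caller is expected to pass only candidates that are outside $G_q$ and not deeper than the level being examined.

```python
def distinct(rows, *attributes):
    """Set of distinct value tuples taken by the given attributes."""
    return {tuple(row[a] for a in attributes) for row in rows}


def strength(rows, level, other):
    """Distinct values of `level` over distinct (level, other) value pairs."""
    return len(distinct(rows, level)) / len(distinct(rows, level, other))


def find_roll_up(rows, level, candidates, epsilon):
    """Return the first candidate l' such that the AFD level -> l' holds."""
    cardinality = len(distinct(rows, level))
    for candidate in candidates:
        # Pruning: a right-hand side with more values than `level` is skipped.
        if len(distinct(rows, candidate)) > cardinality:
            continue
        if strength(rows, level, candidate) >= epsilon:
            return candidate  # stop at the first AFD that holds
    return None


# Invented rows, one per (game, team).
rows = [
    {"teams.name": "Detroit Pistons", "teams.abbreviation": "DET", "teams.home": True},
    {"teams.name": "Dallas Mavericks", "teams.abbreviation": "DAL", "teams.home": False},
    {"teams.name": "Detroit Pistons", "teams.abbreviation": "DET", "teams.home": False},
    {"teams.name": "Chicago Bulls", "teams.abbreviation": "CHI", "teams.home": True},
]
roll_up = find_roll_up(
    rows, "teams.name", ["teams.home", "teams.abbreviation"], epsilon=0.9
)
```

In this sample teams.home is checked first and rejected (strength 0.75), after which teams.abbreviation is accepted, so the two team levels would be placed in the same hierarchy.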


Experiments

• Two real-world datasets:
  ◦ Games
    ◦ 32K nested documents representing NBA games in the period 1985-2013
    ◦ each document contains 46 attributes; 40 of them are numeric
  ◦ DBLP
    ◦ 5.4M flat documents scraped from DBLP in XML format and converted into JSON
    ◦ documents have common attributes (title, author, year, etc.) and uncommon ones (journal, booktitle, etc.)
• Query execution delegated to MongoDB


Experiments – Querying


Experiments – OLAP enabling


Conclusions

• The experiments we conducted show that the performance of our approach is in line with the requirements of real-time user interaction
• Future work
  ◦ optimize the algorithms proposed to improve performance
  ◦ take evolution of data and schemas into account to increase effectiveness (temporal FDs?)