

Integrating MapReduce and RDBMSs

Natalie Gruska, Patrick Martin

School of Computing

Queen’s University

Kingston, ON, Canada

Abstract

Data processing needs are changing with the ever increasing amounts of both structured and unstructured data. While the processing of structured data typically relies on the well-developed field of relational database management systems (RDBMSs), MapReduce is a programming model developed to cope with processing immense amounts of unstructured data. MapReduce, however, offers features and advantages that can be exploited to process structured data. Several database vendors and researchers have already turned to MapReduce to aid in processing relational data, thus requiring integration of MapReduce and RDBMS technologies. In this paper, we provide a taxonomy to characterize several existing integration methods. Further, we take a detailed look at DBInputFormat, which is an interface between Hadoop's MapReduce and a relational database. The challenges posed by such an interface are identified and we provide suggestions for improvement.

1 Introduction

Businesses are collecting more information than ever, and the amount of data being kept is increasing dramatically. With these changes, new processing models are being sought. MapReduce [9] is one such framework designed to process large amounts of data in parallel. A MapReduce system consists of a cluster of nodes with which processing is done in parallel. What makes such a system so highly parallelizable is the fashion in which data analysis tasks are defined, namely in terms of map and reduce functions. A map function creates key/record pairs out of the input and the reduce function aggregates all the records for a specific key. If the input data is spread over many machines, this allows each node to perform its part of the processing independent of the other nodes. A key characteristic of MapReduce systems is their scalability, which stems in part from the processing model itself, but also from the fine-grained fault tolerance integrated in these systems. If a node fails while processing a MapReduce job, another can take over its part of the job without having to restart the job entirely. Hadoop [2] is an open-source implementation of such a MapReduce system.

Copyright © 2010 Natalie Gruska and Patrick Martin. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

Traditionally, MapReduce systems have been used to analyze semi- or unstructured data. For structured data, relational database management systems (RDBMSs) are readily available and well developed. However, processing needs for structured data are also changing, both scale-wise and application-wise. Modern databases are already efficient at processing data, but MapReduce has properties that can be exploited to enhance or aid processing. Several researchers and database vendors have taken the approach of integrating MapReduce and RDBMS technology. We examine these approaches and determine a set of different integration types, each designed for a different purpose.

MapReduce systems are sometimes viewed as primitive replacements for RDBMSs since they also process data, but in a very brute-force manner. We argue, however, that the two systems are complementary and not competitors.


The main contribution of our work is a classification and characterization of current MapReduce and RDBMS integration technologies. Furthermore, we investigate an implementation of one particular type of integration, Hadoop's DBInputFormat [1], point out the challenges this interface poses and provide suggestions for improvement. The remainder of this paper is structured as follows. Section 2 explains the MapReduce processing model in more detail. In Section 3 we present several applications in which MapReduce is useful when processing relational data. Section 4 discusses the MapReduce-RDBMS controversy and points out the advantages and disadvantages of both systems. Our taxonomy of different integration types is presented in Section 5. Sections 6 and 7 present our evaluation of Hadoop's DBInputFormat. In Section 8 we discuss the challenges of integrating MapReduce and RDBMSs and provide suggestions for future research.

2 The MapReduce Processing Model

Figure 1: Data flow of the MapReduce processing model.

MapReduce was introduced by Dean and Ghemawat [9]. A MapReduce job consists of two phases: a map phase and a reduce phase. These phases and the overall data flow of a MapReduce job are illustrated in Figure 1. In the map phase, several map tasks each process part of the input. Processing consists of creating a set of key/value pairs out of the input. A sample map task performing a word count is shown in Figure 2. In this case, the keys of the key/value pairs are words and the value is the number of times the word appeared in the input for this map task. The reduce phase consists of combining sets of tuples with the same key. Each reduce task receives all the key/value pairs for a certain key and combines these in some way. For instance, in Figure 2, the values of all the tuples with the key dog are added together to produce the final count of how many times the word appeared in the input.

Figure 2: Sample map and reduce tasks executing a word count.
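In Hadoop, such a word count can be written as the following map and reduce functions. This is a minimal sketch against the org.apache.hadoop.mapred API of the Hadoop 0.20 line used in our experiments, not code taken from the figure itself; for simplicity, the map function here emits a count of one per occurrence rather than pre-aggregating per task.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: emit a (word, 1) pair for every word in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce: sum all the counts collected for one word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }
}

The framework groups the pairs emitted by the map tasks by key and feeds each group to a reduce task, exactly as depicted in Figure 2; the per-task pre-aggregation shown in the figure can be obtained by also registering the reduce class as a combiner.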

3 MapReduce and Relational Data

There are several scenarios where it is advantageous to use MapReduce to process relational data, and thus interoperation between an RDBMS and a MapReduce system is necessary.

MapReduce is used to process large amounts of unstructured or ad-hoc data. Occasionally the processing of this data requires, or could be aided by, structured data stored in an RDBMS. It is often the case that the ad-hoc data is only used a small number of times before being discarded, and hence it is not worth the effort of loading the data into a database. In this situation it is more efficient to extract small amounts of relevant information out of the database and use it in the MapReduce job processing the ad-hoc data. For instance, consider an internet company that maintains a database containing customer profiles with information such as customer name, address, and gender. The company would like to analyze or predict user behaviour by examining large amounts of ad-hoc data such as click-stream information. This analysis might benefit from the incorporation of some of the profile information, such as address or gender, to determine more accurate predictions.

Another application of relational data in a MapReduce environment is advanced processing. Advanced processing refers to data processing that is outside an RDBMS's scope. This includes tasks such as machine learning and graph analysis, for which the data would be taken out of the database anyway. It is important to emphasize that advanced processing is not meant to compete with the database's processing abilities, but is outside the database's capabilities. In this category we also include complex user-defined functions that may be executable within the database, but only in a very limited and inefficient fashion. MapReduce can be used to perform tasks such as machine learning and graph analysis. Chu et al. [7] present a general framework for using MapReduce for machine learning algorithms. Their approach describes how algorithms such as support vector machines, k-means and neural networks can be processed using MapReduce. Rather than being based on using multiple machines, their approach is based on the idea of using multiple cores as the processing nodes, but the underlying concept is the same. Another advanced processing task is graph analysis. Cohen [8] explores the idea of using MapReduce to perform graph algorithmic tasks such as determining vertex degrees and identifying trusses. Trusses are subgraphs of high connectivity which can be very useful when analyzing social networking data. If MapReduce and RDBMS technology is integrated, RDBMSs will benefit from this additional processing power.

Further, as suggested by Stonebraker et al. [16], a MapReduce system can act as an extract-transform-load (ETL) system, complementing traditional database technology. Thus, MapReduce can be used to extract data from relational databases, process/transform it and load it into a data warehouse. Again, these are tasks that traditional databases are not designed to do.

4 MapReduce vs. RDBMS

As a processing model, MapReduce has caused much debate among academics and industry professionals [10, 14, 16]. MapReduce has many avid followers but also many critics, mainly from the database community. The main argument against MapReduce is that it seems to be a step backwards from modern database systems. MapReduce is a very brute-force approach without the optimizing and indexing capabilities of modern database systems.

There have been several publications comparing MapReduce and parallel DBMSs. Pavlo et al. published a study comparing Hadoop's MapReduce with two parallel DBMSs [14]. Their experiments showed that both databases were significantly slower at loading data than Hadoop. The data loading time turned out to be one of the databases' main disadvantages, as they were much faster at data processing tasks such as grep tasks, aggregation tasks and join tasks. The only task for which the databases did not significantly outperform Hadoop was a user-defined function (UDF) task.

Dean et al. [10] point out some advantages that a MapReduce system has over a parallel database system. Firstly, MapReduce has very fine-grained fault tolerance, meaning that if a node fails only a small part of the whole job needs to be restarted. Furthermore, MapReduce is storage system independent, and complicated transformations can be easier to express in MapReduce than in SQL.

In contrast to the aforementioned work, we examine the integration between the two systems, not the rivalry. There are cases in which MapReduce is more suitable, and cases in which a parallel database excels; each system has its strengths and its weaknesses. Therefore an integration of the two systems is needed. We examine several possible integration methods.

5 Integration Taxonomy

The key to understanding the use of relational data in a MapReduce environment is to view the two systems as symbiotic rather than as competitors. Relational database and MapReduce systems each have their own strengths that can be combined to produce powerful systems. Several database vendors, researchers and a MapReduce provider have already taken steps towards combining the strengths of the two systems. While combining the two systems has advantages, the way in which they are integrated directly influences which advantages dominate. Different types of integration are geared towards different purposes. Identifying and classifying the different types of integration helps to clarify why this integration is necessary and what kind of integration best suits a certain situation. It also helps to identify areas in which current systems are lacking and further research is required, and which types of systems are most promising. In examining the current technologies we identify three types of integration: MapReduce Dominant, RDBMS Dominant, and Loosely-Coupled. An overview of the key characteristics of each type is shown in Figure 3.

5.1 MapReduce Dominant

Systems that are classified as MapReduce Dominant (MRD) are MapReduce systems with relational database technology added. Since MapReduce is the primary technology, most of MapReduce's properties, such as automatic parallelization and fine-grained fault tolerance, are retained. As such, MRD systems are aimed at processing very large amounts of both structured and unstructured data. It is not uncommon for modern businesses to have data warehouses as large as petabytes. The processing of data at this scale needs to be done in parallel to be efficient. However, there are problems with scaling traditional parallel relational databases to this level. Failures become very common and a cluster of homogeneous machines is usually required. A MapReduce system is capable of dealing with both of these issues. Nevertheless, MapReduce by itself lacks many of the features a database system provides, such as query optimization. Integrating relational database technology into a MapReduce system allows it to leverage such features.

To the best of our knowledge, HadoopDB [5] is the only current implementation of an MRD system. HadoopDB is a hybrid of Hadoop's MapReduce and the PostgreSQL database system. Hadoop acts as the coordination and communication layer, and individual database systems are part of the storage layer. Each node in the Hadoop cluster runs an instance of a database system in addition to the Hadoop distributed file system. When HadoopDB receives an SQL query, it is translated into a MapReduce job which is then broken into queries for the individual databases. At the database level, the optimizer and indexing capabilities can then be exploited to process the individual queries. However, only the queries that get pushed to the database layer are optimized by database query optimizers. Optimization of the MapReduce job as a whole relies on more primitive methods. Also, because of the integration of databases, MRD systems suffer from longer data loading times than regular MapReduce systems. Because of the large scale at which MapReduce is meant to operate, these systems are not suitable for everyday relational database tasks at smaller scales; they are not meant to replace relational databases, but rather to analyze massive quantities of structured data.

HadoopDB is easily scalable since the set of machines in the Hadoop cluster does not need to be homogeneous and node failures are dealt with in a fine-grained fashion. Further, this type of system is suitable for semi-structured and unstructured data as well as structured data, since it retains all of Hadoop's capabilities.

5.2 RDBMS Dominant

While MRD provides scale and fault tolerance to the processing of massive amounts of data, RDBMS Dominant (DBD) focuses on extending a database system's capabilities in terms of its processing abilities. Extended processing abilities can be helpful when doing tasks such as click-stream sessionization [11]. A parallel database is already like a Hadoop cluster in that there are many nodes which can be used in parallel to process data. DBD systems are not necessarily MapReduce systems in the conventional sense, but have properties that enable MapReduce-like computations. The DBD category consists of relational database management systems that have MapReduce functionality. RDBMSs are very efficient at regular database tasks; however, when it comes to user-defined functions (UDFs), their abilities are lacking. Many applications are difficult to express in SQL, and thus UDFs are needed. However, traditional UDF frameworks are very restrictive in that they are not relation-in relation-out operators and do not parallelize well. In general, DBD approaches attempt to enhance, redefine or expand the UDF framework in parallel databases to support MapReduce-like computations. This is a common approach among data warehouse vendors.

Figure 3: Overview of the different types of MapReduce-RDBMS integrations.

Aster Data has developed an approach to user-defined functions called SQL/MapReduce (SQL/MR) [11]. SQL/MR is designed to overcome the limitations of UDFs by enabling parallel computation of procedural functions across hundreds of servers. Functions defined with this framework are polymorphic, inherently parallelizable and composable, meaning their input and output behaviour is equivalent to an SQL subquery. Hence, they can easily be integrated into the processing pipeline. The functions themselves are defined in a fashion similar to MapReduce's map and reduce functions.

Greenplum has also developed an RDBMS that incorporates MapReduce processing [12]. At its core, the system consists of a single parallel data-flow engine that processes both SQL and MapReduce jobs. Unifying MapReduce and RDBMS functionality in this way allows for the integration of MapReduce and SQL code. SQL queries can use the output from MapReduce jobs as virtual tables, and MapReduce jobs can use SQL queries as input.

Chen et al. [6] take a more general approach to enhancing processing abilities. Since regular UDFs are not relation-in relation-out operators, they cannot be composed with other relational operators. Another significant drawback is that in many systems, for UDFs to be efficient they must be coded with complex interactions with the internal data structures of the DBMS. This leads either to very inefficient UDFs or to a need for developers with deep knowledge of the DBMS. Chen et al. solve this problem through a conceptual extension to the SQL language: relation-valued functions (RVFs). RVFs wrap complex applications which can then be integrated into query processing. They also propose RVF-shells to be responsible for the interactions with the database so that RVF developers are shielded from DBMS internal details. Although this method can be used to do MapReduce-like computations, it is more powerful in that it is not restricted to these computations.

5.3 Loosely-Coupled

Loosely-coupled integration is a category of approaches in which a MapReduce system and an RDBMS interact with each other but remain separate. This approach is motivated by its simplicity, its flexibility and its ability to make each system's capabilities available to the other while keeping the systems isolated. Vertica [4] uses this kind of interfacing in its communication with Hadoop. Hadoop also provides an interface for interacting with any database that supports a JDBC [3] connection. There are several advantages to this kind of approach. It can be used to transfer data between databases, since the input and output databases of a MapReduce job can differ; this makes the MapReduce system useful as an ETL system. Also, no new technology or hardware is required, as long as a database and a MapReduce cluster are already present. Because the systems are independent of each other, they can be configured individually and the size of the MapReduce cluster is flexible. This makes running the MapReduce cluster in a cloud environment with flexible resources possible. As for the scale, fault tolerance and processing abilities of the individual systems, these attributes are not changed by interfacing the two systems. The interface simply provides an opportunity for using a MapReduce system to process data that is present in a database, or for writing the result of a MapReduce job to a database.

On the negative side, there is no true integration of SQL and MapReduce, so a programmer using a Loosely-Coupled system needs to be able to write and understand both SQL queries and MapReduce programs. Domain-specific languages built on top of MapReduce, such as Pig [13], Hive [17] and Sawzall [15], do exist. Although several of these languages are based on SQL, they do not incorporate the traditional relational data model used in relational databases. Rather, each language comes with its own, usually more simplistic, definition of a data model.

Another drawback of a Loosely-Coupled approach is that the systems have no knowledge of each other, which makes optimizing certain MapReduce jobs and queries difficult. The way in which data is accessed and transferred between the two systems has a significant effect on the efficiency of a Loosely-Coupled approach. Since this is the most readily available approach, an evaluation and discussion of a specific interface is provided in Section 7.

6 Hadoop's DBInputFormat

With version 0.19, Hadoop added a DBInputFormat [1] component to its distribution. This feature allows users to easily use relational data as input for their MapReduce jobs.

The following sections outline where this DBInputFormat fits with respect to specifying a MapReduce job and how it can be used.

6.1 A Hadoop MapReduce Job

To define a MapReduce job, the user is required to specify several components. First, the semantics of the job are defined through a map function and a reduce function, which the user usually programs. Furthermore, it is necessary to specify an input format and an output format for the job. The input format defines where the data for the job comes from, how it is broken into records (key/value pairs as input to the map tasks) and how the data is split among the different map tasks, which also determines how many map tasks are required. DBInputFormat is an example of such an input format. DBInputFormat is a class provided by Hadoop that performs all the above mentioned tasks in order to properly read data from a relational database. The details of DBInputFormat are presented in Section 6.2. A minimal job definition is sketched below.
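The following driver, reusing the WordCount classes from the sketch in Section 2, shows where each of these components is specified. The input and output paths are illustrative, and the driver again uses the old org.apache.hadoop.mapred API.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountJob.class);
    conf.setJobName("wordcount");

    // Semantics of the job: the user-programmed map and reduce functions.
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Input format: where the data comes from and how it is broken into
    // records and splits. DBInputFormat could be substituted here.
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("input"));

    // Output format: where and how the job's results are written.
    conf.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    JobClient.runJob(conf);
  }
}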

6.2 DBInputFormat Specifics

Hadoop's DBInputFormat works with any database that supports JDBC. This means the database can be located anywhere and can have any configuration, as long as Hadoop is able to access it using JDBC.

There are two ways to specify exactly which data is required from the database:

Selection based specification. This method requires the user to specify a single table, the attributes to select, filter conditions and an attribute by which the results are ordered. This type of specification does not require the user to write any SQL since Hadoop generates the SQL queries from the user's specifications. Hence it is more difficult for the user to make errors using this kind of input, but the types of queries that can be produced are constrained.

Query based specification. This method provides the user with more flexibility than the selection based specification since the user may specify an arbitrary SQL query as long as it contains an order-by clause. (Nothing prevents a query without an order-by clause from being executed, but the results are then not guaranteed to be accurate.) In addition to the actual query, the user is required to specify a second query which returns the number of results of the first query. Both specification styles are sketched below.
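As a rough sketch against the org.apache.hadoop.mapred.lib.db API, the two styles look as follows. The EmployeeRecord class, which would implement DBWritable, and the connection details are hypothetical.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

public class InputSetup {
  // Assumes EmployeeRecord implements Writable and DBWritable.

  static void useSelectionBased(JobConf conf) {
    // Tell Hadoop how to reach the database over JDBC.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/test", "user", "password");
    conf.setInputFormat(DBInputFormat.class);
    // Hadoop generates the SQL from a table name, filter conditions,
    // an ordering attribute and a field list.
    DBInputFormat.setInput(conf, EmployeeRecord.class,
        "Employees",   // table
        null,          // filter conditions (none here)
        "hobby",       // attribute by which results are ordered
        "hobby");      // attributes to select
  }

  static void useQueryBased(JobConf conf) {
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/test", "user", "password");
    conf.setInputFormat(DBInputFormat.class);
    // The user supplies the query itself (with an order-by clause)
    // plus a second query returning the number of results.
    DBInputFormat.setInput(conf, EmployeeRecord.class,
        "SELECT hobby FROM Employees ORDER BY hobby",
        "SELECT COUNT(*) FROM Employees");
  }
}

A real job would call one of these two methods, not both; the selection based call above corresponds to the experiment described in Section 7.3.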

7 Experimenting with DBInputFormat

The goals of our experiments with Hadoop's DBInputFormat were the following:

• To determine how Hadoop reads data from the database; what queries it sends to the database and how it splits the data among different map tasks.

• To compare the efficiency of reading data from a database to reading data from files contained in the Hadoop Distributed File System (HDFS).

• To determine the effect of using a complex query to retrieve data from the database.

7.1 Setup

The setup for our experiments consisted of a single machine running Hadoop 0.20.1 and MySQL 5.1.45. Although this simple setup cannot express Hadoop's true processing power, we believe it suitable for our purposes since we are examining the behaviour of Hadoop and its interaction with the database, and not evaluating computing power or performance. The experiments were performed on a system with a 2.26 GHz Intel Core Duo processor and 4 GB RAM. Hadoop 0.20.1 was installed in pseudo-distributed mode. Since the setup consisted of a single machine, each of the Hadoop components, such as the coordinating node and the worker node, ran as a process.

7.2 Data Setup

The MySQL database was set up to contain two sets of tables: Employees with 500,000 records, and EmployeesLarge with 5,000,000 records. Each of the records contained the attributes name, salary and hobby, with the hobby attribute being uniformly distributed among eight different values.

To compare Hadoop's performance using input from a database to its performance using input from text files, files that were equivalent to the contents of the database tables were needed. We created text files for each database table in two fashions: single file and multiple files. In the single file method all the records are in a single file. In the multiple file method the records are evenly distributed among 10 files. Within the text files a record is represented by a line on which the attribute values are separated by commas. These files are then loaded into Hadoop's file system.

7.3 Produced Queries

To determine the kind of queries that Hadoop sends to the database when using DBInputFormat, we ran a MapReduce job using just the hobby attribute from the Employees table as input and logged the queries sent to the database system. The specification of the input was done using the selection based method (see Section 6.2). The following queries were logged:

SELECT COUNT(*) FROM Employees

SELECT hobby FROM Employees
ORDER BY hobby LIMIT 250000 OFFSET 0

SELECT hobby FROM Employees
ORDER BY hobby LIMIT 250000 OFFSET 250000

Hadoop sent three very similar queries to the database. The first query returns the total number of results. Hadoop requires this number to know how many input records there will be so that they can be split across appropriately many map tasks. In this case, two map tasks were assigned, which are responsible for the latter two queries. Each map task executes the query and selects only the subset of results it was assigned. Other MapReduce jobs followed this same pattern of queries: an initial query to retrieve the number of results followed by one query per map task. The problem with using LIMIT and OFFSET to split the results is that the whole result set is calculated for every map task. The split computation itself is simple, as the sketch below illustrates.
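The splitting logic behind these queries amounts to dividing the result count evenly among the map tasks. The following is a simplified paraphrase of this logic, not DBInputFormat's actual source:

// Derive one (LIMIT, OFFSET) query per map task from the result count.
// 'count' comes from the initial SELECT COUNT(*) query and
// 'numMapTasks' from the job configuration.
static String[] splitQueries(String baseQuery, long count, int numMapTasks) {
  String[] queries = new String[numMapTasks];
  long chunk = count / numMapTasks;
  for (int i = 0; i < numMapTasks; i++) {
    long offset = i * chunk;
    // The last split absorbs any remainder of the division.
    long limit = (i == numMapTasks - 1) ? count - offset : chunk;
    queries[i] = baseQuery + " LIMIT " + limit + " OFFSET " + offset;
  }
  return queries;
}

With count = 500,000 and two map tasks this yields exactly the two LIMIT/OFFSET queries logged above. Note that each of these queries still forces the database to evaluate the full ordered result before discarding everything outside its window.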

The significance of the order-by clause can be seen in these queries, since this clause ensures that the results are ordered the same way every time the query is executed. Such a consistent order is required so that the map tasks receive disjoint and covering subsets of the query result.

There are several significant disadvantages to this method of retrieving records from the database:

• Same query is executed many times. Executing the same query multiple times has a negative effect in that every map task has to wait for the whole query to complete. Also, these queries are sent to the database system in parallel, meaning it has more work to do and will likely take longer to process the queries. For a small number of map tasks this effect may not always be significant, but if there are hundreds of map tasks (the scale at which MapReduce systems are meant to be used) the effect could be dramatic.

• Database cannot change during job execution. The same query is executed multiple times during the same MapReduce job. Since the partitioning of the results among different map tasks depends on the order of the results, if the set of results changes during the execution of the MapReduce job, its output will be inaccurate. Therefore, the results of the query need to be kept consistent in some way. All changes to the database that could affect the query result could be blocked during the execution of the MapReduce job, or timestamp-based filtering could be applied to the query results.

7.4 Reading Data

To compare the efficiency of reading data using the DBInputFormat to reading data from files contained in the HDFS, we executed equivalent MapReduce jobs several times, varying the input source (see Section 7.2 for specifics on the input data). We then compared the total execution time of the jobs.

Figure 4: Execution times of a MapReduce job using different input sources.

Figure 4 shows the results of these experiments. We ran the Hadoop job on two different sizes of input: 500,000 records and 5,000,000 records. For the smaller input, the job that read data from a single file and the job that read data from the database took an almost equal amount of time. However, the job that read data from 10 files was significantly slower. This is likely due to the fact that Hadoop initiates at least one map task for each input file. In this case, this resulted in 10 map tasks, and the overhead of running the map tasks greatly outweighed the small amount of data processed by each task. The numbers on top of the bars in Figure 4 indicate the number of map tasks assigned to that job.

For the larger input size of 5,000,000 records, the overhead of running 10 map tasks is no longer as significant since each task has more data to process. Hence, the execution time of the job reading from 10 files comes very close to that of the job reading from a single file. Noteworthy is that the 10 file job completes faster than the database job. We ran an additional experiment for the larger input size (Database* in Figure 4). Since Hadoop only assigned 2 map tasks to the job reading data from the database, and the file reading jobs had at least 4, we thought it might increase the database job's performance if the number of map tasks was increased. However, as seen in Figure 4, increasing the number of map tasks had a drastically negative effect. The decrease in performance can be explained by the fact that for each additional map task the query to the database is executed an additional time; increasing the number of map tasks thus increases the number of queries that need to be run, which puts a strain on the system.

7.5 Complex Queries

To test the effect of using the result of a complex query as input for a Hadoop job, additional tables were added to the database. In addition to the tables described in Section 7.2, two more tables, Pets and PetsLarge, containing 200,000 and 2,000,000 records respectively, were created. Each record in the tables represented a pet owned by an employee and consisted of the attributes id, owner (which referred to a record in either Employees or EmployeesLarge) and type. There were five different types, which were uniformly distributed.

One argument for using input from a DBMS in MapReduce jobs is that this allows the database to do some preprocessing. However, for preprocessing to be effective, the queries sent to the database must be more complex than a simple projection query, as experimented with in Section 7.4. To measure the effect of using a complex query as input we performed further experiments. The query for these experiments involved a dependent subquery, which makes it more complex to process. Figure 5 shows the execution time for Hadoop jobs using the results from this query as input under variation of input size and number of map tasks. For both input sizes, the job using 4 map tasks took longer to complete than the job using 2 map tasks. The difference was much more prominent for the larger input size. This effect can again be explained by the fact that the query is executed once for each map task. The more significant difference between 2 and 4 map tasks for the larger input is due to the longer processing time of the query. Running just the query in MySQL without Hadoop took around 4 seconds for 500,000 records and 44 seconds for 5,000,000. Hence the advantage of the database's preprocessing becomes less valuable as the number of map tasks increases.

Figure 5: The execution times of a MapReduce job using results from a complex SQL query as input.

8 Challenges

8.1 DBInputFormat

Hadoop's implementation of DBInputFormat highlights some of the challenges of using a loosely-coupled approach to link RDBMSs and MapReduce. Firstly, the input data, which is defined through a query, needs to be partitioned so that multiple map tasks can work on the data concurrently. DBInputFormat performs this split in a very primitive way: by executing the query once for each map task and then selecting a specific subset of the results. This is a waste of resources since, theoretically, the result of the query needs to be calculated only once. One option would be to split the query in such a way that each map task only executes a subset of the query and only gets the results it is to process, instead of executing the whole query and selecting a subset of the result. However, splitting arbitrary queries in such a way is a very complex problem. Vertica solves this problem in its interface with Hadoop by allowing the user to specify a list of different selection conditions. Each map task then runs a query with one of these conditions. The list of conditions can also be the result of a query. This implementation pushes the splitting of the input to the user. While potentially resulting in a more efficient execution, this method is highly susceptible to data skew; to be efficient, the user must have knowledge of what the data looks like. For instance, a user could choose a set of query conditions for which almost all the results fall into the same category and are thus in the same data split and processed by the same map task. To avoid this, and to take advantage of the database's advanced optimizing capabilities and system tables, the database itself would have to split the query.

Furthermore, there is the challenge of data in the database being modified while the MapReduce job is executing, so that when the query is executed at different times there is a different result set. This problem could be solved by simply executing the query once externally, saving the results into a text file and manually loading this file into the Hadoop file system. This file can then be used as input for the MapReduce job instead of directly retrieving the data from a database. While likely being more efficient, this method of dealing with the problem is more difficult and hands-on, and defeats the intention of integrating the two systems. Another option would be to create a view (a temporary database table) containing the query result, from which subsequent queries can select parts. This will still result in the database being queried multiple times; however, the query result is only calculated once. The view can be removed once the MapReduce job is finished. A sketch of this option is given below. Perhaps an even more efficient option would be to have an operator in the query processing pipeline that already performs the partitioning of the query result and stores it in multiple views. The subsequent queries can then each read one of the views. This approach, however, would involve modifying the database implementation.
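As a minimal sketch of this option in plain JDBC: since MySQL views are not materialized, the "view" is realized here as a table created from the query, so that the result really is computed only once; all names are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MaterializedInput {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver"); // register the JDBC driver
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/test", "user", "password");
    Statement stmt = conn.createStatement();

    // Compute the job's input once, before any map task starts.
    stmt.executeUpdate("CREATE TABLE JobInput AS "
        + "SELECT hobby FROM Employees ORDER BY hobby");

    // Each map task would then fetch only its own slice, e.g.
    //   SELECT hobby FROM JobInput ORDER BY hobby
    //   LIMIT 250000 OFFSET 0
    // The original query is no longer re-evaluated per task, and
    // concurrent updates to Employees cannot change the job's input.

    // Clean up once the MapReduce job has finished.
    stmt.executeUpdate("DROP TABLE JobInput");
    stmt.close();
    conn.close();
  }
}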

8.2 General

As each type of integration has its own purpose, it also has its own set of challenges. For MapReduce Dominant (MRD) systems, which are mainly used to process extremely large amounts of structured and unstructured data, one challenge is optimizing MapReduce jobs to take full advantage of the databases at each node. HadoopDB implements some of this optimization, but there is much room for improvement. Also very important is the partitioning and distribution of data to the different nodes. In parallel database systems, relations are often moved from one node to another while processing a query; for MRD systems this is not possible. Thus, much care must go into how the data is partitioned.

RDBMS Dominant (DBD) systems enhance the processing capabilities of parallel databases. These systems also pose several challenges. In particular, advanced optimization techniques for the MapReduce jobs are needed: optimization of the job as part of a query plan, and of the job itself. Further, the main disadvantage of DBD systems compared to MRD systems is that they do not scale as well. If a finer-grained fault tolerance can be introduced to these systems, so that when a node fails the whole query does not have to be restarted, then this disadvantage can be overcome.

Loosely-Coupled systems are useful as ETL systems and in cases when a MapReduce system and an RDBMS already exist and need to be interfaced with each other. As discussed in Section 8.1, the way in which data is transferred from one system to another is critical to performance and requires future research. Improvements in this interface are necessary for it to be applicable at a large scale. Further, languages to integrate relational data and SQL into MapReduce jobs are lacking; for instance, a language, or an extension to an existing language, that can be compiled into Hadoop jobs using DBInputFormat as input.

For all types of systems, data loading and placement are key to efficiency, as they are in traditional database systems. Data loading is one of the main drawbacks to using relational database systems since it takes a significant amount of time. Work on parallelizing this loading can make it more efficient. How the data is partitioned and what indexes are created is also very important and relevant to all types of systems.

9 Summary

Although MapReduce was originally created to cope with large amounts of unstructured data, there are advantages in exploiting it to process relational data, such as scale and additional processing capabilities. Thus, RDBMSs and MapReduce should not be regarded as rival systems, but rather as complementary. Exactly which advantages an integration of the two systems has to offer depends strongly on the type of integration. We have examined existing integrations and classified them into three categories: MapReduce Dominant, RDBMS Dominant, and Loosely-Coupled. We have found that one specific implementation of the Loosely-Coupled approach, Hadoop's DBInputFormat, has much room for improvement. Currently, the manner in which data is extracted from a relational database is unnecessarily repetitive and hence wasteful of resources. Combining MapReduce and RDBMS technologies has the potential to create very powerful systems. However, each type of MapReduce-RDBMS integration also has its own set of challenges that need to be addressed by future research.

About the Authors

Natalie Gruska is an MSc candidate in the School of Computing at Queen's University. She holds a BSc from Saarland University in Germany. Her research interests include database system performance, autonomic computing systems and cloud computing. She can be reached at [email protected].

Patrick Martin is a Professor in the School of Computing at Queen's University. He holds a BSc from the University of Toronto, an MSc from Queen's University and a PhD from the University of Toronto. He is also a Faculty Fellow with the IBM Centre for Advanced Studies in Toronto, ON. His research interests include database system performance, Web services and autonomic computing systems. He can be reached at [email protected].

References

[1] DBInputFormat. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html.

[2] Hadoop. http://hadoop.apache.org/.

[3] JDBC. http://java.sun.com/javase/technologies/database/.

[4] Vertica. http://www.vertica.com/.

[5] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB'09: Proceedings of the 2009 International Conference on VLDB, August 2009.

[6] Qiming Chen, Andy Therber, Meichun Hsu, Hans Zeller, Bin Zhang, and Ren Wu. Efficiently support MapReduce-like computation models inside parallel DBMS. In IDEAS '09: Proceedings of the 2009 International Database Engineering and Applications Symposium, pages 43–53, New York, NY, USA, 2009. ACM.

[7] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for machine learning on multicore. In Proceedings of the Neural Information Processing Systems Conference, pages 281–288. MIT Press, 2006.

[8] Jonathan Cohen. Graph Twiddling in a MapReduce World. Computing in Science and Eng., 11(4):29–41, 2009.

[9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 137–150, Berkeley, CA, USA, 2004. USENIX Association.

[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72–77, 2010.

[11] Eric Friedman, Peter Pawlowski, and John Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. VLDB'09: Proceedings of the 2009 International Conference on VLDB, 2(2):1402–1413, 2009.

[12] Greenplum. A Unified Engine for RDBMS and MapReduce. www.greenplum.com/technology/mapreduce/.

[13] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110, New York, NY, USA, 2008. ACM.

[14] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 165–178, New York, NY, USA, 2009. ACM.

[15] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4):277–298, 2005.

[16] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64–71, 2010.

[17] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. VLDB'09: Proceedings of the 2009 International Conference on VLDB, 2(2):1626–1629, 2009.
