Map Reduce and Parallel DBMS: Friends or Foes?

#Technology Map Reduce and Parallel DBMS: Friends or Foes? Reference: Communications of ACM: Jan 2010 Vol 53, No. 1

Upload: saurav-basu

Post on 29-Jun-2015

TRANSCRIPT

Page 1: Map reduce and parallel dbms: Friends or Foes?

#Technology

Map Reduce and Parallel DBMS: Friends or Foes?

Reference: Communications of ACM: Jan 2010 Vol 53, No. 1

Page 2: Map reduce and parallel dbms: Friends or Foes?

Organization

1. Question: Are Map Reduce systems making parallel DBMSs seem like legacy systems?
2. Map Reduce Paradigm
3. Parallel Database Systems (Query Execution)
4. Mapping Parallel DBMS onto Map Reduce
5. Possible Applications
6. DBMS Sweet Spot
7. Architectural Differences
8. Learning from each other
9. Conclusion

Page 3: Map reduce and parallel dbms: Friends or Foes?

Map Reduce Paradigm

• large numbers of processors working in parallel to solve computing problems.

• low-end commodity servers.

• proliferation of programming tools

• simple model to express sophisticated distributed programs.

Page 4: Map reduce and parallel dbms: Friends or Foes?

• possible to write almost any parallel-processing task as either a set of database queries or a set of MR jobs

• Questions: – Which is better? – How to benchmark?

Page 5: Map reduce and parallel dbms: Friends or Foes?

Parallel Database System

Salient Points:

• horizontal partitioning of relational tables

• partitioned execution of SQL queries

• cluster of commodity servers with separate CPU, memory & disks.

Page 6: Map reduce and parallel dbms: Friends or Foes?

Parallel DBMS Paradigm

• provide a high-level programming environment

• inherently parallelizable

• commercial systems from a dozen vendors.

• complexity of parallelizing queries is abstracted from the programmer

Page 7: Map reduce and parallel dbms: Friends or Foes?

Partitioning Strategies

• Hash / Range: under hash partitioning, a hash function is applied to one or more attributes of each row to determine the target node and disk where the row should be stored

• Round Robin: target node = row number modulo the number of machines
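The two placement rules above can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS implements partitioning: the function names, the sample rows, and the use of `zlib.crc32` as the hash function are all assumptions made for the example.

```python
import zlib

def hash_partition(row, key_attrs, num_nodes):
    """Pick a target node by hashing one or more attributes of the row."""
    # crc32 is an assumption: chosen only because it is deterministic
    # across runs. A real system would use its own hash function.
    key = "|".join(str(row[a]) for a in key_attrs).encode("utf-8")
    return zlib.crc32(key) % num_nodes

def round_robin_partition(row_number, num_nodes):
    """Target node = row number modulo the number of machines."""
    return row_number % num_nodes

# Hypothetical sample data: (custId, amount) rows of a Sales table.
rows = [{"custId": c, "amount": a}
        for c, a in [(7, 10.0), (3, 5.5), (7, 2.5), (9, 1.0)]]
NUM_NODES = 3

for i, row in enumerate(rows):
    print(i, row,
          "hash ->", hash_partition(row, ["custId"], NUM_NODES),
          "round-robin ->", round_robin_partition(i, NUM_NODES))
```

Note that hash partitioning sends both rows with custId 7 to the same node, while round robin spreads them by arrival order only.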

Page 8: Map reduce and parallel dbms: Friends or Foes?

Partitioned Query Execution

• SQL operators: selection, aggregation, join, projection, and update

• Example 1 [Selection]:

Page 9: Map reduce and parallel dbms: Friends or Foes?

• Example 2 [Aggregation]

Sales table: round-robin partitioned

Page 10: Map reduce and parallel dbms: Friends or Foes?

• Example 3 [Join]:

Sales table: round-robin partitioned

Customers table: hash partitioned on Customer.custId attribute

Page 11: Map reduce and parallel dbms: Friends or Foes?
Page 12: Map reduce and parallel dbms: Friends or Foes?

• Parallel DBMSs automatically manage partitioning strategies. Example:

• If Sales and Customers are each hash-partitioned on their custId attribute, the query optimizer will recognize that the two tables are both hash-partitioned on the joining attributes and omit the shuffle operator from the compiled query plan.

• If both tables are round-robin partitioned, then the optimizer will insert shuffle operators for both tables so tuples that join with one another end up on the same node.
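The optimizer's decision described above can be sketched as a small rule: a table needs a shuffle unless it is already hash-partitioned on the join attribute. The `Partitioning` class and `tables_needing_shuffle` function are hypothetical names for illustration. The sketch also assumes that all hash-partitioned tables share the same hash function and node count, which a real optimizer would have to verify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Partitioning:
    scheme: str                      # "hash" or "round_robin"
    attribute: Optional[str] = None  # partitioning attribute for "hash"

def tables_needing_shuffle(join_attr: str,
                           layouts: Dict[str, Partitioning]) -> List[str]:
    """Return the tables that must be repartitioned before joining on join_attr."""
    return [name for name, p in layouts.items()
            if not (p.scheme == "hash" and p.attribute == join_attr)]

both_hash = {"Sales": Partitioning("hash", "custId"),
             "Customers": Partitioning("hash", "custId")}
both_round_robin = {"Sales": Partitioning("round_robin"),
                    "Customers": Partitioning("round_robin")}
mixed = {"Sales": Partitioning("round_robin"),
         "Customers": Partitioning("hash", "custId")}

print(tables_needing_shuffle("custId", both_hash))         # no shuffle needed
print(tables_needing_shuffle("custId", both_round_robin))  # shuffle both tables
print(tables_needing_shuffle("custId", mixed))             # shuffle Sales only
```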

Page 13: Map reduce and parallel dbms: Friends or Foes?

Mapping Parallel DBMSs onto MapReduce

• An MR program consists of only two functions, Map and Reduce, written by a user to process key/value data pairs

• Input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster.

Page 14: Map reduce and parallel dbms: Friends or Foes?

• Map & Reduce operations: SQL aggregates augmented with UDFs and user-defined aggregates provide DBMS users the same MR-style reduce functionality.

• The reshuffle between the Map and Reduce tasks is equivalent to a GROUP BY.

• parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL).
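The equivalence claimed above can be shown on a toy example: a Map function that emits (custId, amount) pairs, a shuffle that groups values by key, and a Reduce function that sums them give the same answer as a SQL GROUP BY. This is a single-process sketch; the sample data, function names, and the use of the standard-library `sqlite3` module as a stand-in for a parallel DBMS are assumptions for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical Sales rows: (custId, amount)
sales = [(7, 10.0), (3, 5.5), (7, 2.5), (9, 1.0)]

def map_fn(record):
    """Map: emit one (key, value) pair per input record."""
    cust_id, amount = record
    yield cust_id, amount

def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return sorted(reduce_fn(k, v) for k, v in groups.items())

def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sales (custId INTEGER, amount REAL)")
    conn.executemany("INSERT INTO Sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT custId, SUM(amount) FROM Sales "
        "GROUP BY custId ORDER BY custId").fetchall()
    conn.close()
    return rows

mr_result = run_mapreduce(sales)
sql_result = run_sql(sales)
print(mr_result)
print(sql_result)
```

Both calls return the same per-customer totals; the SQL version states what is wanted, while the MR version spells out how to compute it.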

Page 15: Map reduce and parallel dbms: Friends or Foes?

MR: Applications

• ETL & read-once data sets:
– parse & clean data, then feed it to downstream DBs.

• Complex analytics:
– multiple passes required, complex dataflows.

• Semi-structured data:
– key-value pairs, where the number of attributes present in any given record varies (schema not required)

• Quick & dirty analysis:
– quick startup time, one-off analysis

• Limited budget:
– open source

Page 16: Map reduce and parallel dbms: Friends or Foes?

DBMS “Sweet Spot” [Benchmarks]

Page 17: Map reduce and parallel dbms: Friends or Foes?

• Grep: scan through a data set of 100-byte records looking for a three-character pattern. Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value. The search pattern is found only in the last 90 bytes, once in every 10,000 records.

• Weblog: calculate the total ad revenue generated for each visited IP address from the logs.

• Join: consists of two subtasks that perform a complex calculation on the two data sets. In the first part of the task, each system must find the IP address that generated the most revenue within a particular date range in the user visits. Once these intermediate records are generated, the system must then calculate the average PageRank of all pages visited during this interval.
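To make the Grep task concrete, here is a small single-machine sketch of the record layout and the scan. It is not the benchmark's actual data generator: the pattern "XYZ", the lowercase-letter alphabet, the fixed seed, and the 20,000-record data set are assumptions chosen so the example runs quickly and deterministically.

```python
import random
import string

KEY_LEN, VALUE_LEN = 10, 90
PATTERN = "XYZ"  # hypothetical three-character search pattern

def make_record(i, rng, plant):
    """Build one 100-byte record: 10-byte unique key + 90-byte random value."""
    key = str(i).zfill(KEY_LEN)
    # Lowercase-only values cannot contain the uppercase pattern by accident.
    value = "".join(rng.choices(string.ascii_lowercase, k=VALUE_LEN))
    if plant:
        pos = rng.randrange(VALUE_LEN - len(PATTERN) + 1)
        value = value[:pos] + PATTERN + value[pos + len(PATTERN):]
    return key + value

def grep(records, pattern):
    """Scan every record, searching only the 90-byte value part."""
    return [r for r in records if pattern in r[KEY_LEN:]]

rng = random.Random(0)
# Plant the pattern once in every 10,000 records.
records = [make_record(i, rng, plant=(i % 10_000 == 0)) for i in range(20_000)]
matches = grep(records, PATTERN)
print(len(matches), "matching records")
```

Because so few records match, the task is dominated by the cost of reading and scanning the input rather than by producing output.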

Page 18: Map reduce and parallel dbms: Friends or Foes?

Architectural Differences: Map Reduce vs. DBMS

Record Parsing

• Map Reduce: the burden of parsing the fields of each record falls on user code. Each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type.

• DBMS: records are parsed when the data is initially loaded. The DBMS storage manager carefully lays out records so that attributes can be directly addressed at runtime in their most efficient storage representation.

Compression

• Map Reduce: Hadoop often executed slower when compression was used on its input files.

• DBMS: carefully tuned compression algorithms ensure that the cost of decompressing tuples does not offset the performance gains from the reduced I/O cost of reading compressed data.

Page 19: Map reduce and parallel dbms: Friends or Foes?

Map Reduce vs. DBMS

Pipelining

• Map Reduce: the producer writes intermediate results to local data structures, and the consumer subsequently “pulls” the data. Check-pointing results gives greater fault tolerance at the expense of runtime performance.

• DBMS: the query plan is distributed to nodes at execution time. Each operator in the plan must send data to the next operator, regardless of whether that operator is running on the same or a different node. Qualifying data is “pushed” by the first operator to the second operator. Data is streamed from producer to consumer; intermediate data is never written to disk; the resulting “back-pressure” in the runtime system will stall the producer before it has a chance to overrun the consumer.
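The two dataflow styles can be contrasted in a short sketch. In the pipelined version a bounded queue stands in for the DBMS runtime: when the queue is full, `put` blocks, which is the back-pressure that stalls the producer. In the materialized version the producer finishes and stores all its output before the consumer reads it. The operators (a selection feeding a sum), the buffer size, and the in-memory list standing in for local storage are assumptions for illustration.

```python
import queue
import threading

SENTINEL = object()  # marks end of stream

def producer(out_q, rows):
    """Selection operator: push qualifying rows to the next operator."""
    for row in rows:
        if row % 2 == 0:
            out_q.put(row)  # blocks while the queue is full: back-pressure
    out_q.put(SENTINEL)

def consumer(in_q, result):
    """Aggregation operator: consume rows as they arrive."""
    total = 0
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        total += item
    result.append(total)

def run_pipelined(rows, buffer_size=4):
    """Push style: data streams through a small bounded buffer."""
    q = queue.Queue(maxsize=buffer_size)
    result = []
    threads = [threading.Thread(target=producer, args=(q, rows)),
               threading.Thread(target=consumer, args=(q, result))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

def run_materialized(rows):
    """Pull style: the full intermediate result is stored, then read back."""
    intermediate = [r for r in rows if r % 2 == 0]  # stands in for local storage
    return sum(intermediate)

print(run_pipelined(range(1000)), run_materialized(range(1000)))
```

Both produce the same answer. The difference is that the materialized intermediate result survives a consumer failure, while the pipelined version never holds more than `buffer_size` rows at once.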

Page 20: Map reduce and parallel dbms: Friends or Foes?

Map Reduce vs. DBMS

Scheduling

• Map Reduce: each task in an MR system is scheduled on processing nodes one storage block at a time. Such runtime work scheduling at the granularity of storage blocks is much more expensive than the DBMS compile-time scheduling. Argument in its favor: it allows the MR scheduler to adapt to workload skew and performance differences between nodes.

• DBMS: each node knows exactly what it must do and when it must do it according to the distributed query plan. Because the operations are known in advance, the system is able to optimize the execution plan to minimize data transmission between nodes.

Column Stores

• Map Reduce: Hadoop/HDFS are row stores. Slower for aggregate queries on the same attribute.

• DBMS: Vertica is a column store; the system reads only the attributes necessary for solving the user query. This limited need for reading data represents a considerable performance advantage over traditional, row-stored databases for aggregation queries.
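The column-store advantage for aggregation can be illustrated by counting how many attribute values each layout has to touch. This is a toy in-memory model, not how Vertica or HDFS store data; the sample rows and function names are assumptions.

```python
# Hypothetical Sales rows with three attributes each.
rows = [
    {"custId": 7, "region": "EU", "amount": 10.0},
    {"custId": 3, "region": "US", "amount": 5.5},
    {"custId": 7, "region": "EU", "amount": 2.5},
    {"custId": 9, "region": "AS", "amount": 1.0},
]

def to_columns(row_data):
    """Convert a row layout into one list per attribute (column layout)."""
    return {attr: [r[attr] for r in row_data] for attr in row_data[0]}

def sum_row_store(row_data, attr):
    """Row store: every record is read in full to reach one attribute."""
    total, values_read = 0.0, 0
    for record in row_data:
        values_read += len(record)  # all attributes of the record are read
        total += record[attr]
    return total, values_read

def sum_column_store(columns, attr):
    """Column store: only the requested attribute's column is read."""
    column = columns[attr]
    return sum(column), len(column)

columns = to_columns(rows)
print(sum_row_store(rows, "amount"))        # total and values touched
print(sum_column_store(columns, "amount"))  # same total, fewer values touched
```

With three attributes per record the row layout touches three times as many values for the same sum; the gap grows with the number of attributes the query does not need.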

Page 21: Map reduce and parallel dbms: Friends or Foes?

Learning from Each Other

Map Reduce:

• MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

• Higher-level languages are invariably a good idea for any data-processing system. Work on higher-level interfaces on top of MR/Hadoop should be accelerated (Hive, Pig, Scope, Dryad/Linq).

Parallel DBMS:

• Commercial DBMS products must move toward one-button installs, automatic tuning that works correctly, better Web sites with example code, better query generators, and better documentation.

• DBMSs cannot deal with in situ data. Though some database systems (such as PostgreSQL, DB2, and SQL Server) have capabilities in this area, further flexibility is needed.

Page 22: Map reduce and parallel dbms: Friends or Foes?

Conclusions:

• Parallel DBMSs excel at efficient querying of large data sets.

• MR style systems excel at complex analytics & ETL tasks

• The two approaches are complementary.

• Complex analytical problems require the capabilities provided by both systems.

Page 23: Map reduce and parallel dbms: Friends or Foes?

Thanks

Saurav Basu