
Page 1: Distributed data processing on the Cloud –Lecture 6 Joins

Distributed data processing on the Cloud – Lecture 6

Joins with MapReduce and

MapReduce limitations

Satish Srirama

Some material adapted from slides by Jimmy Lin, 2015 (licensed under the Creative Commons Attribution 3.0 License)

Page 2: Distributed data processing on the Cloud –Lecture 6 Joins

Outline

• Structured data

– Processing relational data with MapReduce

• MapReduce limitations

19.10.2018 Satish Srirama 2/41

Page 3: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Databases

• A relational database is comprised of tables

• Each table represents a relation = collection of

tuples (rows)

• Each tuple consists of multiple fields

(columns)


Page 4: Distributed data processing on the Cloud –Lecture 6 Joins

Basic SQL Commands - Queries

https://mariadb.com/kb/en/library/basic-sql-statements/


Page 5: Distributed data processing on the Cloud –Lecture 6 Joins

Design Pattern: Secondary Sorting

• MapReduce sorts input to reducers by key

– Values are arbitrarily ordered

• What if we want to sort the values too?

– E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…

• Use case:

– We are monitoring different sensors

– Data: (timestamp, sensor id, sensor value)

– Map outputs

sensorid -> (timestamp, sValue)

– How to see the activity of each sensor over time?


Page 6: Distributed data processing on the Cloud –Lecture 6 Joins

Secondary Sorting: Solutions

• Solution 1:

– Buffer values in memory, then sort

– Why is this a bad idea?

• Solution 2:

– “Value-to-key conversion” design pattern: form a composite intermediate key, (k, v1)

– Let execution framework do the sorting

– Preserve state across multiple key-value pairs to handle processing

– Anything else we need to do?


Page 7: Distributed data processing on the Cloud –Lecture 6 Joins

Value-to-Key Conversion

Before: k → (v1, r), (v4, r), (v8, r), (v3, r)… (values arrive in arbitrary order)

After: (k, v1) → (v1, r); (k, v3) → (v3, r); (k, v4) → (v4, r); (k, v8) → (v8, r) (values arrive in sorted order)

Process by preserving state across multiple keys. Remember to partition correctly!
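The pattern can be sketched as follows. Hadoop jobs would be Java classes with a custom partitioner and comparator; this is only a minimal Python simulation of the shuffle, using the sensor use case from the previous slide, with all names invented:

```python
# Illustrative sketch (not Hadoop code): the "value-to-key conversion"
# pattern for secondary sort. Records are (timestamp, sensor_id, value);
# we want each sensor's readings to reach the reducer in time order.

def map_fn(record):
    ts, sensor_id, value = record
    # Composite key (sensor_id, ts): the framework sorts by the whole
    # key, so values for one sensor arrive time-ordered.
    return ((sensor_id, ts), value)

def shuffle(pairs):
    # Stand-in for the MapReduce sort phase. A real job also needs a
    # custom partitioner so all keys sharing a sensor_id land on the
    # same reducer.
    return sorted(pairs, key=lambda kv: kv[0])

records = [(3, "s1", 0.9), (1, "s1", 0.2), (2, "s2", 0.5), (2, "s1", 0.7)]
sorted_pairs = shuffle(map_fn(r) for r in records)
# The reducer now sees s1's values in timestamp order: 0.2, 0.7, 0.9
s1_values = [v for (sid, ts), v in sorted_pairs if sid == "s1"]
```

Note that the reducer must itself track when the sensor id changes, since each composite key is distinct.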


Page 8: Distributed data processing on the Cloud –Lecture 6 Joins

Working Scenario

• Two tables:

– User demographics (gender, age, income, etc.)

– User page visits (URL, time spent, etc.)

• Analyses we might want to perform:

– Statistics on demographic characteristics

– Statistics on page visits

– Statistics on page visits by URL

– Statistics on page visits by demographic characteristic

– …


Page 9: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Algebra

• Primitives

– Projection (π)

– Selection (σ)

– Cartesian product (×)

– Set union (∪)

– Set difference (−)

– Rename (ρ)

• Other operations

– Join (⋈)

– Group by… aggregation

– …


Page 10: Distributed data processing on the Cloud –Lecture 6 Joins

Projection

[Figure: projection π maps each tuple R1…R5 to a narrower tuple R1…R5 containing only the selected attributes]

Page 11: Distributed data processing on the Cloud –Lecture 6 Joins

Projection in MapReduce

• Easy!

– Map over tuples, emit new tuples with the appropriate attributes

– No reducers, unless for regrouping or resorting tuples

– Alternatively: perform in reducer, after some other processing

• Basically limited by HDFS streaming speeds

– Speed of encoding/decoding tuples becomes important

– Take advantage of compression when available

– Semistructured data? No problem!


Page 12: Distributed data processing on the Cloud –Lecture 6 Joins

Selection

[Figure: selection σ filters tuples R1…R5, keeping only R1 and R3]

Page 13: Distributed data processing on the Cloud –Lecture 6 Joins

Selection in MapReduce

• Easy!

– Map over tuples, emit only tuples that meet criteria

– No reducers, unless for regrouping or resorting tuples

– Alternatively: perform in reducer, after some other processing

• Basically limited by HDFS streaming speeds

– Speed of encoding/decoding tuples becomes important

– Take advantage of compression when available

– Semistructured data? No problem!
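Selection and projection are both map-only operations, as the last two slides describe. A minimal Python sketch (not Hadoop code; the "visits" tuples and all names are invented for illustration):

```python
# Sketch of the two map-only relational operators on made-up
# "visits" tuples of the form (url, user_id, time). No reducer needed.

def project(tup, indices):
    # Projection π: emit a new tuple with only the chosen attributes.
    return tuple(tup[i] for i in indices)

def select(tup, predicate):
    # Selection σ: emit the tuple only if it meets the criteria.
    return tup if predicate(tup) else None

visits = [("a.com", 1, 10), ("b.com", 2, 90), ("a.com", 3, 40)]
# σ(time > 30) followed by π(url, time), as a pipeline of map calls:
out = [project(t, (0, 2)) for t in visits if select(t, lambda v: v[2] > 30)]
```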


Page 14: Distributed data processing on the Cloud –Lecture 6 Joins

Group by… Aggregation

• Example: What is the average time spent per URL?

• In SQL:

– SELECT url, AVG(time) FROM visits GROUP BY url

• In MapReduce:

– Map over tuples, emit time, keyed by url

– Framework automatically groups values by keys

– Compute average in reducer

– Optimize with combiners
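The steps above can be sketched in plain Python (a simulation, not Hadoop code). One subtlety the combiner optimization forces: a combiner cannot emit a bare average, because an average of partial averages is generally wrong; it must emit partial (sum, count) pairs that the reducer merges:

```python
# Sketch of the GROUP BY average. The same merge logic serves as both
# combiner (per-mapper partial sums) and reducer (final sums).
from collections import defaultdict

def map_fn(visit):            # visit = (url, time)
    url, time = visit
    return (url, (time, 1))   # keyed by url, value = (sum, count)

def combine_or_reduce(pairs):
    acc = defaultdict(lambda: (0, 0))
    for url, (s, c) in pairs:
        ps, pc = acc[url]
        acc[url] = (ps + s, pc + c)
    return acc

visits = [("a.com", 10), ("b.com", 30), ("a.com", 20)]
grouped = combine_or_reduce(map_fn(v) for v in visits)
averages = {url: s / c for url, (s, c) in grouped.items()}
# averages == {"a.com": 15.0, "b.com": 30.0}
```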


Page 15: Distributed data processing on the Cloud –Lecture 6 Joins

Relational Joins

[Figure: tuples of R (R1…R4) paired with matching tuples of S (S1…S4) by join key: R1–S2, R2–S4, R3–S1, R4–S3]

Page 16: Distributed data processing on the Cloud –Lecture 6 Joins

Working Scenario – visited again

• Two tables:

– User demographics (userid, gender, age, income, etc.)

– User page visits (userid, URL, time spent, etc.)

• Analyses we might want to perform:

– Statistics on demographic characteristics

– Statistics on page visits

– Statistics on page visits by URL

– Statistics on page visits by demographic characteristic

– …


Page 17: Distributed data processing on the Cloud –Lecture 6 Joins

Types of Relationships

One-to-One One-to-Many Many-to-Many


Page 18: Distributed data processing on the Cloud –Lecture 6 Joins

Join Algorithms in MapReduce

• Reduce-side join

• Map-side join

• In-memory join

– Striped variant

– Memcached variant


Page 19: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join

• Basic idea: group by join key

– Map over both sets of tuples

– Emit tuple as value with join key as the intermediate key

– Execution framework brings together tuples sharing the same key

– Perform actual join in reducer

– Similar to a “sort-merge join” in database terminology

• Two variants

– 1-to-1 joins

– 1-to-many and many-to-many joins
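The basic idea above can be sketched in plain Python (a simulation, not Hadoop code; the demographics/visits tables and all field names are invented, following the working scenario). Each tuple is tagged with its source relation so the reducer can tell R-tuples from S-tuples:

```python
# Sketch of a reduce-side join: tag each tuple with its source, key
# everything by the join key, then cross R- with S-tuples per key.
from collections import defaultdict

demographics = [(1, "F", 25), (2, "M", 40)]          # (userid, gender, age)
visits = [(1, "a.com"), (1, "b.com"), (2, "c.com")]  # (userid, url)

def map_both():
    for t in demographics:
        yield (t[0], ("R", t))   # join key = userid, tag = source table
    for t in visits:
        yield (t[0], ("S", t))

def reduce_join(tagged):
    groups = defaultdict(list)   # stand-in for the shuffle's grouping
    for key, val in tagged:
        groups[key].append(val)
    joined = []
    for key, vals in groups.items():
        rs = [t for tag, t in vals if tag == "R"]
        ss = [t for tag, t in vals if tag == "S"]
        for r in rs:             # cross R-tuples with S-tuples
            for s in ss:
                joined.append(r + s)
    return joined

result = reduce_join(map_both())   # three joined rows for this data
```

In a real job the reducer receives R- and S-values in arbitrary order (or sorted, with value-to-key conversion) and must buffer accordingly.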


Page 20: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: 1-to-1

[Figure: map emits each of R1, R4, S2, S3 keyed by its join key; after the shuffle, the reducer receives the matching R- and S-tuple under the same key]

Note: there is no guarantee whether the R-tuple or the S-tuple arrives first

Page 21: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: 1-to-many

[Figure: map emits R1 and the S-tuples S2, S3, S9 under the same join key; the reducer receives R1 together with all matching S-tuples and produces R1–S2, R1–S3, R1–S9]

Page 22: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: V-to-K Conversion

[Figure: with value-to-key conversion, the reducer sees each R-tuple first (R1, then R4), followed by its S-tuples (S2, S3, S9 and S3, S7). On each new key: hold the R-tuple in memory, then cross it with the S-records that follow]

Page 23: Distributed data processing on the Cloud –Lecture 6 Joins

Reduce-side Join: many-to-many

[Figure: for a many-to-many join, all R-tuples for a key (R1, R5, R8) must be held in memory and crossed with each arriving S-tuple (S2, S3, S9)]

Page 24: Distributed data processing on the Cloud –Lecture 6 Joins

Problems with Reduce-side join

• In summary, the many-to-many case is still left with the scalability issue: all R-tuples for a key must fit in reducer memory

• Another big problem:

– The basic idea behind the reduce-side join is to

repartition the two datasets by the join key

– This requires shuffling both the datasets across

the network


Page 25: Distributed data processing on the Cloud –Lecture 6 Joins

Map-side Join: Basic Idea

Assume two datasets are sorted by the join key:

R1

R2

R3

R4

S1

S2

S3

S4

A sequential scan through both datasets performs the join (called a “merge join” in database terminology)


Page 26: Distributed data processing on the Cloud –Lecture 6 Joins

Map-side Join: Parallel Scans

• If datasets are sorted by join key, join can be accomplished by a scan over both datasets

• How can we accomplish this in parallel?

– Partition and sort both datasets in the same manner

– Example:

• Suppose R and S were both divided into ten files, partitioned in the same manner by the join key

• We simply need to merge-join the first file of R with the first file of S, the second file of R with the second of S, etc.

• In MapReduce:

– Map over one dataset, read from the corresponding partition of the other

– No reducers necessary (unless to repartition or resort)

• Consistently partitioned datasets: realistic to expect?

– Yes. MapReduce jobs are mostly part of a workflow, so an earlier job can produce consistently partitioned, sorted output
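The merge-join scan over one pair of matching partitions can be sketched in plain Python (a simulation, not Hadoop code; data and names invented, 1-to-1 keys assumed for brevity):

```python
# Sketch of a map-side merge join: each "mapper" scans its R partition
# and the corresponding S partition in lockstep. Both inputs must
# already be sorted by the join key.

def merge_join(r_part, s_part):
    out, i, j = [], 0, 0
    while i < len(r_part) and j < len(s_part):
        rk, sk = r_part[i][0], s_part[j][0]
        if rk == sk:
            out.append(r_part[i] + s_part[j])   # matching keys: join
            i += 1
            j += 1
        elif rk < sk:
            i += 1                              # advance the lagging side
        else:
            j += 1
    return out

r0 = [(1, "F"), (3, "M"), (4, "F")]   # partition 0 of R, sorted by key
s0 = [(1, "a.com"), (4, "b.com")]     # matching partition 0 of S
joined = merge_join(r0, s0)
# joined == [(1, "F", 1, "a.com"), (4, "F", 4, "b.com")]
```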


Page 27: Distributed data processing on the Cloud –Lecture 6 Joins

In-Memory Join

• Basic idea: load one dataset into memory, stream over other dataset

– Works if R << S and R fits into memory

– Called a “hash join” in database terminology

• MapReduce implementation

– Distribute R to all nodes

– Map over S, each mapper loads R in memory, hashed by join key

– For every tuple in S, look up join key in R

– No reducers, unless for regrouping or resorting tuples
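The steps above amount to a hash join performed in each mapper. A minimal Python sketch (a simulation, not Hadoop code; in Hadoop, R would be distributed to nodes via the distributed cache; tables and names invented):

```python
# Sketch of the in-memory ("hash") join: the small dataset R is
# broadcast to every mapper and loaded into a hash table keyed by the
# join key; each mapper then streams over its split of S and probes.

def build_hash(r_tuples):
    table = {}
    for t in r_tuples:                  # join key = first field (userid)
        table.setdefault(t[0], []).append(t)
    return table

def probe(s_split, table):
    out = []
    for s in s_split:
        for r in table.get(s[0], []):   # look up the join key in R
            out.append(r + s)
        # S-tuples with no match in R are dropped (inner join)
    return out

R = [(1, "F", 25), (2, "M", 40)]
S = [(1, "a.com"), (2, "c.com"), (9, "d.com")]
joined = probe(S, build_hash(R))        # two matches; userid 9 dropped
```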


Page 28: Distributed data processing on the Cloud –Lecture 6 Joins

In-Memory Join: Variants

• Striped variant:

– What if R is too big to fit into memory?

– Divide R into R1, R2, R3, … s.t. each Rn fits into memory

– Perform in-memory join: ∀n, Rn ⋈ S

– Take the union of all join results

• Memcached join:

– Load R into memcached

– Replace in-memory hash lookup with memcached lookup

– Memcached capacity >> RAM of individual node

– Memcached scales out with cluster

– Memcached is fast (basically, speed of network)

– Batch requests to balance the latency costs


Page 29: Distributed data processing on the Cloud –Lecture 6 Joins

Which join to use?

• In-memory join > map-side join > reduce-side join

– Why?

• Limitations of each?

– In-memory join: memory

– Map-side join: sort order and partitioning

– Reduce-side join: general purpose


Page 30: Distributed data processing on the Cloud –Lecture 6 Joins

Processing Relational Data: Summary

• MapReduce algorithms for processing relational data:

– Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce

– Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer

– Multiple strategies for relational joins

• Complex operations require multiple MapReduce jobs

– Example: top ten URLs in terms of average time spent

– Opportunities for automatic optimization
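The top-ten example can be sketched as a two-job chain in plain Python (a simulation, not Hadoop code; in Hadoop each step would be a separate job reading the previous job's HDFS output; data and names invented):

```python
# Sketch of a multi-job computation: job 1 computes the average time
# per URL (as on the earlier group-by slide); job 2 keeps the top n.
from collections import defaultdict
import heapq

def job1_avg_per_url(visits):                 # visits = (url, time)
    acc = defaultdict(lambda: (0, 0))
    for url, t in visits:
        s, c = acc[url]
        acc[url] = (s + t, c + 1)
    return {url: s / c for url, (s, c) in acc.items()}

def job2_top_n(averages, n=10):
    # In a real job, each mapper emits its local top n and a single
    # reducer merges them; heapq.nlargest plays that role here.
    return heapq.nlargest(n, averages.items(), key=lambda kv: kv[1])

visits = [("a.com", 10), ("b.com", 50), ("a.com", 30), ("c.com", 5)]
top = job2_top_n(job1_avg_per_url(visits), n=2)
# top == [("b.com", 50.0), ("a.com", 20.0)]
```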


Page 31: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with MapReduce

• Java verbosity

– Writing low-level MapReduce code is slow

– Need a lot of expertise to optimize MapReduce code

– Prototyping takes significant time and requires manual compilation

– A lot of custom code is required

• Even for simple functions such as sum and avg, you need to write reduce methods

– Hard to manage more complex MapReduce job chains

• Pig is a high-level language on top of Hadoop MapReduce (next lecture)

– Similar to declarative SQL

• Easier to get started

– In comparison to Hadoop MapReduce:

• 5% of the code

• 5% of the time


Page 32: Distributed data processing on the Cloud –Lecture 6 Joins

Adapting Algorithms to MapReduce

• Designed a classification of how algorithms can be adapted to MapReduce [Srirama et al, FGCS 2012]

– Algorithm → single MapReduce job

• Monte Carlo, RSA breaking

– Algorithm → n MapReduce jobs

• CLustering LARge Applications - CLARA

– Each iteration in algorithm → single MapReduce job

• K-medoids (Clustering)

– Each iteration in algorithm → n MapReduce jobs

• Conjugate Gradient

• Especially applicable to Hadoop MapReduce

S. N. Srirama, P. Jakovits, E. Vainikko: Adapting Scientific Computing Problems to Clouds using MapReduce, Future Generation Computer Systems, ISSN: 0167-739X, 28(1):184-192, 2012. Elsevier. DOI: 10.1016/j.future.2011.05.025.

Page 33: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with Hadoop MapReduce

• It is designed and suitable for:

– Data processing tasks

– Embarrassingly parallel tasks

• Has serious issues with iterative algorithms

– Long “start up” and “clean up” times (~17 seconds)

– No way to keep important data in memory between MapReduce job executions

– At each iteration, all data is read from HDFS again and written back to HDFS at the end

– Results in a significant overhead in every iteration


Page 34: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 1

• Restructuring algorithms into non-iterative versions

– Can we change class 3 & 4 algorithms into class 1 or 2?

• Iterative k-medoids clustering algorithm in MapReduce

• Map

– Find the closest medoid and assign the object to the cluster of that medoid

– Input: (cluster id, object)

– Output: (new cluster id, object)

• Reduce

– Find which object is the most central and assign it as the new medoid of the cluster

– Input: (cluster id, (list of all objects in the cluster))

– Output: (cluster id, new medoid)
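One such iteration can be sketched in plain Python (a simulation, not Hadoop code; 1-D points with absolute distance keep the example small, and all names are invented):

```python
# Sketch of one k-medoids iteration as a single MapReduce job: map
# reassigns each object to its nearest medoid, reduce picks the most
# central object of each cluster as the new medoid.
from collections import defaultdict

def map_assign(obj, medoids):
    # Output: (new cluster id, object); cluster id = index of medoid.
    cid = min(range(len(medoids)), key=lambda i: abs(obj - medoids[i]))
    return (cid, obj)

def reduce_new_medoid(objects):
    # The most central object minimizes total distance to the others.
    return min(objects, key=lambda o: sum(abs(o - p) for p in objects))

def one_iteration(points, medoids):
    clusters = defaultdict(list)     # stand-in for the shuffle
    for p in points:
        cid, obj = map_assign(p, medoids)
        clusters[cid].append(obj)
    return [reduce_new_medoid(objs) for cid, objs in sorted(clusters.items())]

points = [1, 2, 3, 10, 11, 12]
new_medoids = one_iteration(points, medoids=[1, 12])
# new_medoids == [2, 11]
```

The driver program would repeat this job until the medoids stop changing, which is exactly where Hadoop's per-job startup and HDFS I/O overheads bite.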


Page 35: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches – 1 - continued

• CLARA was designed to make the PAM algorithm more scalable [Kaufman and Rousseeuw (1990)]

– CLARA applies PAM on random samples of the original dataset

• CLARA in MapReduce

– The algorithm can be reduced to 2 MapReduce Jobs

• MapReduce job I (finding the candidate sets)

– Map: assign a random key to each point

• Input: < key, point >

• Output: < random key, point >

– Reduce: pick the first S points and run k-medoids on them

• Output: < key, k-medoids(S points) >

– The result of k-medoids() is a set of k medoids

Page 36: Distributed data processing on the Cloud –Lecture 6 Joins

CLARA in MapReduce - continued

• MapReduce job II (measuring the quality over the whole dataset)

– Map: find each point's distance from its closest medoid, for each candidate set

• Input: < key, point >

• Output: < candidate set key, distance >

– Reduce: sum the distances

• This produces C different sums, one for each candidate set

• The candidate set whose medoids have the minimum sum decides the clusters

[Figure: comparison of k-medoids and CLARA clustering results]

Page 37: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 2

• Using alternative MapReduce frameworks that

are designed to handle iterative algorithms

– In memory processing

– Spark (Lecture 8)


Page 38: Distributed data processing on the Cloud –Lecture 6 Joins

Alternative Approaches - 3

• Alternative distributed computing models

such as Bulk Synchronous Parallel model [Valiant,

1990] [Jakovits et al, HPCS 2013]

• Available BSP frameworks:

– Graph processing: jPregel & Giraph (Lecture 11)


Page 39: Distributed data processing on the Cloud –Lecture 6 Joins

This week in lab…

• You’ll try Joins in MapReduce


Page 40: Distributed data processing on the Cloud –Lecture 6 Joins

Next Lecture

• Higher level scripting languages for distributed

data processing - Pig


Page 41: Distributed data processing on the Cloud –Lecture 6 Joins

References

• Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce. http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

• S. N. Srirama, P. Jakovits, E. Vainikko: Adapting Scientific Computing Problems to Clouds using MapReduce, Future Generation Computer Systems, ISSN: 0167-739X, 28(1):184-192, 2012. Elsevier.

• P. Jakovits, S. N. Srirama: Evaluating MapReduce Frameworks for Iterative Scientific Computing Applications, The 2014 (12th) International Conference on High Performance Computing & Simulation (HPCS 2014), Bologna, Italy, July 21-25, 2014, pp. 226-233. IEEE.


Page 42: Distributed data processing on the Cloud –Lecture 6 Joins

THANK YOU

[email protected]


Page 43: Distributed data processing on the Cloud –Lecture 6 Joins

Issues with MapReduce

• Java verbosity

• Hadoop task startup time

• Not good for iterative algorithms

– Writing low level MapReduce code is slow

– Need a lot of expertise to optimize MapReduce code

– Prototyping requires compilation

– A lot of custom code is required

– Hard to manage more complex MapReduce job chains
