lecture 3 cs492 special topics in computer science distributed algorithms and systems

6
Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

Upload: rosamund-nelson

Post on 12-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

Lecture 3

CS492 Special Topics in Computer ScienceDistributed Algorithms and Systems

Page 2: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

2Fall 2008 CS492

MapReduce

Simplified data processing on large clusters A programming model and an associated implementation

for processing and generating large data sets Inspired by “map” and “reduce” primitives present in

Lisp and many other functional languages Simple and powerful interface

that enables automatic paralization and distribution of large-scale computations

Page 3: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

3Fall 2008 CS492

Map and Reduce

Map map <k1, v1> list (k2, v2) example

k1 = document, v1 = contents k2 = words, v2 = 1

Reduce reduce (k2, list(v2)) list (v2)

k2 = words, v2 = 1 list (v2) = total count of v2 (or k2)

Page 4: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

4Fall 2008 CS492

MapReduce Execution Overview

Page 5: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

5Fall 2008 CS492

Examples

Distributed grep Count of URL access frequency Reverse web-link graph Term-vector per host

Page 6: Lecture 3 CS492 Special Topics in Computer Science Distributed Algorithms and Systems

6Fall 2008 CS492

Pig platform for analyzing large data sets that consists of a

high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets

Dryad Distributed data-parallel programs from sequential build-

ing blocks (EuroSys 2007)