Post on 17-Mar-2020

Page 1: Running MapReduce in Dockerpapaggel/courses/eecs... · Running MapReduce in Docker EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca

Running MapReduce in Docker

EECS 4415

Big Data Systems

Tilemachos Pechlivanoglou

tipech@eecs.yorku.ca

Last week’s WordCount example

mapper.py    reducer.py
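The code listings from this slide did not survive in the transcript. As a reminder of the idea, here is a minimal sketch of WordCount. It is not the course's actual mapper.py and reducer.py, and the names `map_words`, `reduce_counts` and `word_count` are hypothetical. In a real streaming job the two halves live in separate scripts that read standard input and print tab-separated lines, and Hadoop does the sorting between them.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer: sum the counts of consecutive pairs that share a word.

    Like a streaming reducer, this relies on its input being sorted by key.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Local stand-in for the job: map, sort by key (the shuffle), reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


# Print the result the way a streaming reducer would: word<TAB>count.
for word, total in word_count(["a b a", "b c"]):
    print(f"{word}\t{total}")
```

Running the logic locally like this, before submitting anything to Hadoop, is a cheap way to catch bugs in the mapper or reducer.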

System representation

[Diagram: PC, Docker, HDFS]

docker run -it eecsyorku/eecs4415

PC: 🗀 .../Documents/A2 (mapper.py, reducer.py, input.txt)
Docker: 🗀 /app

docker run -it -v $PWD:/app eecsyorku/eecs4415

PC: 🗀 .../Documents/A2 (mapper.py, reducer.py, input.txt)
Docker: 🗀 /app (mapper.py, reducer.py, input.txt)

docker run -it -p 9870:9870 -p 8088:8088 \
    -v $PWD:/app eecsyorku/eecs4415

PC: 🗀 .../Documents/A2 (mapper.py, reducer.py, input.txt)
Docker: 🗀 /app (mapper.py, reducer.py, input.txt)
HDFS
Browser: localhost:<port> (9870, 8088)

Web UI

hdfs dfs -ls /

Docker: 🗀 /app (mapper.py, reducer.py, input.txt)
HDFS: 🗀 / (<empty>)

hdfs dfs -put ./input.txt /

Docker: 🗀 /app (mapper.py, reducer.py, input.txt)
HDFS: 🗀 / (input.txt)

hdfs dfs -mkdir /test

Docker: 🗀 /app (mapper.py, reducer.py, input.txt)
HDFS: 🗀 / (input.txt, 🗀 test)

hdfs dfs -rm -r /test

Docker: 🗀 /app (mapper.py, reducer.py, input.txt)
HDFS: 🗀 / (input.txt)

Running MapReduce

hadoop jar /usr/hadoop-3.0.0/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar \
    -file ./mapper.py \
    -mapper ./mapper.py \
    -file ./reducer.py \
    -reducer ./reducer.py \
    -input /input.txt \
    -output /output
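To make the pairing of flags explicit, the same invocation can be assembled from Python. This is only a sketch: `streaming_command` is a hypothetical helper, the jar path is copied from the slide and assumes the course image's Hadoop 3.0.0 layout, and nothing is executed here.

```python
import shlex

# Path copied from the slide; it assumes the course image's Hadoop 3.0.0 layout.
STREAMING_JAR = "/usr/hadoop-3.0.0/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar"


def streaming_command(mapper, reducer, input_path, output_path):
    """Return the hadoop streaming invocation as an argument list (hypothetical helper)."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-file", mapper, "-mapper", mapper,      # ship the script, then name it as the mapper
        "-file", reducer, "-reducer", reducer,   # same for the reducer
        "-input", input_path,                    # HDFS path to read
        "-output", output_path,                  # HDFS directory to create; must not exist yet
    ]


command = streaming_command("./mapper.py", "./reducer.py", "/input.txt", "/output")
print(shlex.join(command))
```

Inside the container, where `hadoop` is on the PATH, the list could be passed to `subprocess.run(command, check=True)`. Because the job refuses to start if the output directory already exists, remove a previous `/output` with `hdfs dfs -rm -r /output` before rerunning.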

hdfs dfs -get /output/part*

Docker: 🗀 /app (mapper.py, reducer.py, input.txt, part-00000)
HDFS: 🗀 / (input.txt, 🗀 output (_SUCCESS, part-00000))

Success!

Contents of part-00000:
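The screenshot of the file contents is not part of this transcript. Assuming the reducer prints one `word<TAB>count` pair per line, which is the usual shape of streaming WordCount output, the downloaded file could be read back roughly as follows. `parse_counts` is a hypothetical helper and the sample lines are made up.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a {word: count} dictionary."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        counts[word] = int(count)
    return counts


# Made-up sample lines, only to show the expected shape of the file.
sample = ["docker\t2\n", "hadoop\t3\n"]
print(parse_counts(sample))
```

With the real file, the call would be `parse_counts(open("part-00000"))`, run from the directory the file was downloaded to.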

Dealing with imports

● For Python’s externally imported packages (nltk, sklearn):
    ○ the program will run properly outside Hadoop, but will fail inside it, often without a clear error message
    ○ the packages need to be shipped to Hadoop along with the job

● To ship them, compress them into a zip archive:
    ○ zip -r nltk_sklearn.zip nltk sklearn
    ○ mv nltk_sklearn.zip /path/to/where/your/mapper/will/be/nltk_sklearn.mod
    ○ hadoop … -file ./nltk_sklearn.mod

● And import them manually:
    ○ import zipimport
    ○ importer = zipimport.zipimporter('nltk_sklearn.mod')
    ○ sklearn = importer.load_module('sklearn')
    ○ nltk = importer.load_module('nltk')
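The zip-import mechanism can be tried locally before involving Hadoop. The sketch below builds a throwaway archive containing a made-up package (`toy_pkg`, a stand-in for a real one) and imports from it. Since `load_module` has been deprecated since Python 3.10, the sketch swaps in a different route: it puts the archive on `sys.path`, which relies on the same zipimport machinery.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive holding one tiny pure-Python package.
# "toy_pkg" is a made-up stand-in for a real package such as nltk.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "toy_pkg.mod")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("toy_pkg/__init__.py", "def shout(text):\n    return text.upper()\n")

# Putting the archive on sys.path lets the normal import machinery
# (which uses zipimport under the hood) find packages inside it.
sys.path.insert(0, archive)
toy_pkg = importlib.import_module("toy_pkg")

print(toy_pkg.shout("imported from a zip"))
```

The file extension does not matter to the importer, which is why renaming the archive to `.mod` works. Zip imports only cover pure-Python modules (`.py` and `.pyc` files), so packages that depend on compiled extension modules may not load this way.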

Thank you!

Links to check:
https://zettadatanet.wordpress.com/2015/04/04/a-hands-on-introduction-to-mapreduce-in-python/
https://afourtech.com/guide-docker-commands-examples/
https://medium.com/@rrfd/your-first-map-reduce-using-hadoop-with-python-and-osx-ca3b6f3dfe78
https://hadoop.apache.org/docs/r1.2.1/streaming.html