
How to program MapReduce jobs in Hadoop with R Group 8 João Rosa, Mario Almeida, Alex Pérez


Page 1: Hadoop with R MapReduce jobs in How to program - Jordi  · PDF fileHow to program MapReduce jobs in Hadoop with R Group 8 João Rosa, Mario Almeida, Alex Pérez

How to program MapReduce jobs in Hadoop with R

Group 8

João Rosa, Mario Almeida, Alex Pérez


Index

● Introduction

● Hadoop

● MapReduce

● R

● Why and how?

● Possible uses? Business opportunities?

● Conclusion

● Questions

● References


Nowadays, we have lots of data. BIG DATA!


If we need to analyse this we have a problem...


...but, if we need to analyse this we have a BIG DATA problem!


A possible solution!

+

How can we analyse this BIG DATA?


The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The project includes these subprojects:


Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.


● Highly fault-tolerant against hardware failure

● Designed to be deployed on low-cost hardware

● Streaming data access

● Large data sets

● Portability across heterogeneous hardware and software platforms


Supports distributed computing on large data sets on clusters of computers

Process large amounts of raw data


Map + Reduce
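To make the two phases concrete, here is a minimal single-machine sketch in base R. No Hadoop is involved: the sample lines and the names map_fn and reduce_fn are made up for illustration, and the code only imitates the map, shuffle and reduce steps with lapply, split and vapply.

```r
# Toy, single-machine imitation of the MapReduce phases in base R.
# Nothing here touches Hadoop; the input lines are invented for illustration.
lines <- c("big data is big", "r is for data")

# Map: emit one (word, 1) pair per word in every input line.
map_fn <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(lines, map_fn))

# Shuffle: group the emitted values by key.
grouped <- split(pairs$value, pairs$key)

# Reduce: collapse the values of each key into a single count.
reduce_fn <- function(values) sum(values)
counts <- vapply(grouped, reduce_fn, integer(1))

# Counts per word: big = 2, data = 2, for = 1, is = 2, r = 1
print(counts)
```

On a real cluster the map and reduce functions would run in parallel on different nodes, and the framework would do the shuffle between them.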


R is the language of Pirates!!!

Rrrrr


What is R?

It's a language and environment for statistical computing and graphics!


What is R?

2 million analysts!

Quantitative finance!

Google, Facebook and LinkedIn!


Why R?

● Current analytic solutions are costly!

● New methods for analyzing complex datasets!


Why Hadoop with R?

"Easiest, most productive, most elegant way to write map reduce jobs."


Why Hadoop with R?

● One to two orders of magnitude less code than Java


Why Hadoop with R?

Readable, reusable and extensible language.


Why Hadoop with R?

To give R analysts a way to access the map-reduce programming paradigm using big data sets.


How to use Hadoop with R?

RHadoop = rmr + rhdfs + rhbase

● rhdfs - functions providing file management of the HDFS from within R (via rJava).

● rhbase - functions providing database management for the HBase distributed database from within R (via Thrift).

● rmr - functions providing Hadoop MapReduce functionality in R.
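As a rough idea of what a job looks like, here is a small sketch. It assumes the rmr2 package (the successor of the rmr package named above) is installed, and it uses rmr2's local backend so that no cluster is needed. The function name square_job is hypothetical, and the job is a toy example rather than anything from the slides.

```r
# Sketch only: assumes the rmr2 package is installed; it is not part of base R.
square_job <- function(values) {
  # Run everything in-process instead of on a cluster (rmr2's local backend).
  rmr2::rmr.options(backend = "local")
  # Write the input to where the MapReduce job can read it.
  input <- rmr2::to.dfs(values)
  # Map-only job: emit the pair (v, v^2) for the input values.
  output <- rmr2::mapreduce(
    input = input,
    map = function(k, v) rmr2::keyval(v, v^2)
  )
  # Read the resulting key/value pairs back into the R session.
  rmr2::from.dfs(output)
}

if (requireNamespace("rmr2", quietly = TRUE)) {
  result <- square_job(1:10)
  str(result)
} else {
  message("rmr2 is not installed; skipping the example job")
}
```

Switching the backend option to "hadoop" is how the same R code would be pointed at a real cluster, assuming Hadoop is set up on the machine.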


Business opportunities?

xkcd.com


Conclusions

● Productivity vs efficiency

● Wide variety of statistical and graphical techniques

● Business orientation


Questions?


References

http://hadoop.apache.org/ - Apache Hadoop project
http://www.r-bloggers.com/how-to-program-mapreduce-jobs-in-hadoop-with-r/ - teacher's page
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf - MapReduce
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial - MapReduce in R tutorial
http://www.inside-r.org/r-doc/base/lapply - R lapply
http://www.inside-r.org/r-doc/base/tapply - R tapply
http://www.revolutionanalytics.com/what-is-open-source-r/ - What is R?
http://www.r-project.org/ - What is R? Official page
http://en.wikipedia.org/wiki/R_(programming_language) - R wiki
http://www.johndcook.com/R_language_for_programmers.html - R programming for those coming from other languages
http://www.revolutionanalytics.com/why-revolution-r/whitepapers/r-is-hot.php - why R is hot


Pictures

We tried to use CC pictures; below are their respective links:

http://www.flickr.com/photos/nanagyei/4880468290 - pig pirates
http://www.flickr.com/photos/timypenburg/5328226108 - maths and pen
http://www.flickr.com/photos/48481327 - graduate
http://s0.geograph.org.uk/geophotos/01/53/43/1534341_7dc47500.jpg - store
http://www.flickr.com/photos/dizfunk/3066153143/ - nerd
http://geekithawaii.com/wp-content/uploads/2011/01/7562581_l.jpg - sky
http://www.robweir.com/blog/wp-content/uploads/2011/01/numbers.jpg - numbers
http://delightfulchildrensbooks.files.wordpress.com/2011/02/read-around-the-world.jpg - children

Others:

http://www.xkcd.com