How to program MapReduce jobs in Hadoop with R
Group 8
João Rosa, Mario Almeida, Alex Pérez
Index
● Introduction
● Hadoop
● MapReduce
● R
● Why and how?
● Possible uses? Business opportunities?
● Conclusion
● Questions
● References
Nowadays, we have lots of data. BIG DATA!
If we need to analyse this we have a problem...
...but, if we need to analyse this we have a BIG DATA problem!
A possible solution!
Hadoop + R
How can we analyse this BIG DATA?
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
The project includes these subprojects:
Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
● Highly fault-tolerant against hardware failure
● Designed to be deployed on low-cost hardware
● Streaming data access
● Large data sets
● Portability across heterogeneous hardware and software platforms
Supports distributed computing on large data sets on clusters of computers
Process large amounts of raw data
Map + Reduce
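The Map + Reduce idea can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions chosen for illustration, and on a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Run the three phases one after another on in-memory data.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Made-up sample input, standing in for lines of a large file in HDFS.
sample_lines = ["big data big problem", "big data solution"]
counts = word_count(sample_lines)
print(counts)
```

The programmer only writes the map and reduce steps; grouping by key and distributing the work are what the framework contributes.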
R is the language of Pirates!!!
Rrrrr
What is R?
It's a language and environment for statistical computing and graphics!
● 2 million analysts!
● Quantitative finance!
● Google, Facebook and LinkedIn!
Why R?
● Current analytic solutions are costly!
● New methods for analyzing complex datasets!
Why Hadoop with R?
● "Easiest, most productive, most elegant way to write map reduce jobs."
● One to two orders of magnitude less code than Java
● Readable, reusable and extensible language
● Gives R analysts access to the map-reduce programming paradigm on big data sets
How to use Hadoop with R?
RHadoop = rmr + rhdfs + rhbase
● rhdfs - functions providing file management of the HDFS from within R (rJava).
● rhbase - functions providing database management for the HBase distributed database from within R (Thrift).
● rmr - functions providing Hadoop MapReduce functionality in R.
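To show what "MapReduce functionality" as a function call can look like, here is a small hedged sketch of a job expressed as one call that takes a map function and a reduce function. It is written in Python and runs locally, so it is not the rmr API: `local_mapreduce` and the sample rows are hypothetical, and no Hadoop cluster is involved.

```python
from collections import defaultdict

def local_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical local stand-in for a MapReduce job (no Hadoop involved)."""
    grouped = defaultdict(list)
    for record in records:
        # map_fn returns an iterable of (key, value) pairs for one record.
        for key, value in map_fn(record):
            grouped[key].append(value)
    # reduce_fn receives one key and the list of all values emitted for it.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Made-up sample data: (department, salary) rows.
rows = [("sales", 100.0), ("sales", 300.0), ("it", 250.0)]

mean_by_dept = local_mapreduce(
    rows,
    map_fn=lambda row: [(row[0], row[1])],
    reduce_fn=lambda key, values: sum(values) / len(values),
)
print(mean_by_dept)
```

This is the same kind of group-wise aggregation an analyst would do in memory with R's tapply; the appeal of the approach is that the same logic, written as a map and a reduce function, can be handed to Hadoop to run over data that does not fit on one machine.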
Business opportunities?
xkcd.com
Conclusions
● Productivity vs Efficiency
● Wide variety of statistical and graphical techniques
● Business orientation
Questions?
References
http://hadoop.apache.org/ - Apache Hadoop project
http://www.r-bloggers.com/how-to-program-mapreduce-jobs-in-hadoop-with-r/ - teacher's page
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf - MapReduce
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial - MapReduce in R tutorial
http://www.inside-r.org/r-doc/base/lapply - R lapply
http://www.inside-r.org/r-doc/base/tapply - R tapply
http://www.revolutionanalytics.com/what-is-open-source-r/ - What is R?
http://www.r-project.org/ - What is R? Official page
http://en.wikipedia.org/wiki/R_(programming_language) - R wiki
http://www.johndcook.com/R_language_for_programmers.html - R programming for those coming from other languages
http://www.revolutionanalytics.com/why-revolution-r/whitepapers/r-is-hot.php - why R is hot
Pictures
We tried to use CC pictures; below are their respective links:
http://www.flickr.com/photos/nanagyei/4880468290 - pig pirates
http://www.flickr.com/photos/timypenburg/5328226108 - maths and pen
http://www.flickr.com/photos/48481327 - graduate
http://s0.geograph.org.uk/geophotos/01/53/43/1534341_7dc47500.jpg - store
http://www.flickr.com/photos/dizfunk/3066153143/ - nerd
http://geekithawaii.com/wp-content/uploads/2011/01/7562581_l.jpg - sky
http://www.robweir.com/blog/wp-content/uploads/2011/01/numbers.jpg - numbers
http://delightfulchildrensbooks.files.wordpress.com/2011/02/read-around-the-world.jpg - children
Others:
http://www.xkcd.com