map reduce, hadoop & pig
A Hadoop, MapReduce and Pig summary
MAP/REDUCE, HADOOP & PIG
Data mining applied to the enterprise
DEFINITIONS
Data mining is the process of extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
A Framework is a re-usable design for a software system (or subsystem). A software framework may include support programs, code libraries, a scripting language, or other software to help develop and glue together the different components of a software project. Various parts of the framework may be exposed through an API.
MAP/REDUCE
Framework for processing huge datasets on certain kinds of distributable problems using a large number of computers.
MapReduce provides: automatic parallelization & distribution, fault tolerance, I/O scheduling, status monitoring
Use cases: document clustering, machine learning, inverted index construction
Was used to completely regenerate Google's index of the World Wide Web
MAP/REDUCE
"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve.
Defined with respect to data structured in (key, value) pairs
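The two steps can be illustrated with a minimal single-process sketch in Python. It is not a distributed implementation; the names `map_words`, `reduce_counts` and `run_map_reduce` are hypothetical and chosen only for this example, which counts words across two tiny documents expressed as (key, value) pairs.

```python
from collections import defaultdict

def map_words(doc_id, text):
    """Map: turn one (key, value) input pair into intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: combine all values emitted for one key into a single result."""
    return word, sum(counts)

def run_map_reduce(inputs, map_fn, reduce_fn):
    # Group intermediate values by key (the step between map and reduce).
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Apply reduce once per distinct key, in sorted key order.
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

documents = [("doc1", "pig hadoop pig"), ("doc2", "hadoop hive")]
word_counts = run_map_reduce(documents, map_words, reduce_counts)
print(word_counts)  # [('hadoop', 2), ('hive', 1), ('pig', 2)]
```

In a real cluster the map calls and the reduce calls would run on different worker nodes; here they run in one loop so the data flow stays visible.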
MAP/REDUCE – DATA FLOW SECTIONS
Input reader
Map function
Partition function
Compare function
Reduce function
Output writer
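A rough way to see how the six sections fit together is to write each one as a small Python function and chain them. This is only a sketch under simplifying assumptions: everything runs in one process, the partition rule (sum of character codes modulo the number of reducers) is a toy rule invented for the example, and all function names are hypothetical.

```python
from collections import defaultdict
from functools import cmp_to_key

def input_reader(raw_text):
    """Input reader: split raw input into (key, value) records, here (line number, line)."""
    return list(enumerate(raw_text.splitlines()))

def map_function(line_number, line):
    """Map function: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def partition_function(key, num_reducers):
    """Partition function: pick the reducer that receives this key (toy rule)."""
    return sum(ord(ch) for ch in key) % num_reducers

def compare_function(a, b):
    """Compare function: order in which keys reach a reducer (ascending here)."""
    return (a > b) - (a < b)

def reduce_function(key, values):
    """Reduce function: fold all values for one key into one output record."""
    return key, sum(values)

def output_writer(records):
    """Output writer: serialise final records, one 'key<TAB>value' line each."""
    return "\n".join(f"{key}\t{value}" for key, value in records)

def run_job(raw_text, num_reducers=2):
    # One bucket of key -> values per (simulated) reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line_number, line in input_reader(raw_text):
        for key, value in map_function(line_number, line):
            partitions[partition_function(key, num_reducers)][key].append(value)
    results = []
    for bucket in partitions:
        # Each reducer sees its keys in the order given by the compare function.
        for key in sorted(bucket, key=cmp_to_key(compare_function)):
            results.append(reduce_function(key, bucket[key]))
    return output_writer(results)

job_output = run_job("a b a\nc b a")
print(job_output)
```

With two reducers, "b" lands in the first partition and "a" and "c" in the second, so the output lists b, then a, then c: keys are sorted within a partition, not globally.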
MAP/REDUCE -> HADOOP
Google calls it…     Hadoop equivalent…
MapReduce            Hadoop MapReduce
GFS                  HDFS
Sawzall              Hive, Pig
BigTable             HBase
Chubby               ZooKeeper
HADOOP
Java Map/Reduce implementation
Framework that schedules tasks, monitors them, and re-executes the failed ones.
Single master JobTracker
Several slave TaskTrackers, one per node
Hadoop DFS (not explicitly required)
Add-ons: Hive (developed at Facebook), Pig (developed at Yahoo!)
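The idea of a master that re-executes failed tasks can be sketched in a few lines of Python. This is a toy simulation, not Hadoop's scheduler or its API: `run_with_retries` and `flaky_worker` are hypothetical names, and the "failure" is simulated by raising an exception on the first attempt of one task.

```python
def run_with_retries(tasks, worker, max_attempts=3):
    """Toy 'master': run each task on a worker and re-execute it on failure.

    tasks: dict of task_id -> input data
    worker: callable(task_id, data, attempt) that may raise RuntimeError
    Returns (results, attempts_used).
    """
    results, attempts_used = {}, {}
    for task_id, data in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts_used[task_id] = attempt
            try:
                results[task_id] = worker(task_id, data, attempt)
                break  # task succeeded, stop retrying
            except RuntimeError:
                continue  # simulated worker failure: schedule the task again
        else:
            # Every attempt failed, so the whole job is reported as failed.
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts_used

def flaky_worker(task_id, data, attempt):
    # Simulate a node that dies the first time it runs task "t2".
    if task_id == "t2" and attempt == 1:
        raise RuntimeError("worker lost")
    return sum(data)

results, attempts_used = run_with_retries({"t1": [1, 2], "t2": [3, 4]}, flaky_worker)
print(results)        # {'t1': 3, 't2': 7}
print(attempts_used)  # {'t1': 1, 't2': 2}
```

The caller only sees the final results; the retry of "t2" is invisible apart from the attempt count, which is the point of having the framework handle failures.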
HADOOP EXAMPLE
A program that takes web server access log files and counts the number of hits in each minute slot over a week
Differentiate input & output phases: Map & Reduce
Map phase: Access log files
Reduce phase: Key set + iterator over each key subset
HADOOP EXAMPLE
[Figure: data flow diagram with Input, Map, Reduce and Output stages]
HADOOP EXAMPLE – MAP PHASE
HADOOP EXAMPLE – REDUCE PHASE
HADOOP EXAMPLE – MAIN CODE
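The code from the original slides is not reproduced in this transcript. As a stand-in, the following Python sketch (not the deck's Java, and not the Hadoop API) shows what the three parts of the example could look like: a map function that turns each access log line into a (minute slot, 1) pair, a reduce function that receives a key plus an iterator over its values, and a main routine that wires them together. It assumes Common Log Format timestamps such as `[10/Oct/2010:13:55:36 +0000]`; the function names and the sample lines are invented for illustration.

```python
import re
from collections import defaultdict

# Assumed input: Common Log Format lines with a "[day/month/year:HH:MM:SS zone]" field.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def map_hit(line):
    """Map phase: emit (minute_slot, 1) for every parsable access log line."""
    match = TIMESTAMP.search(line)
    if match:
        date_time = match.group(1).split()[0]      # drop the time zone
        minute_slot = date_time.rsplit(":", 1)[0]  # drop the seconds
        yield minute_slot, 1

def reduce_hits(minute_slot, counts):
    """Reduce phase: receives one key and an iterator over its values."""
    return minute_slot, sum(counts)

def count_hits_per_minute(lines):
    """Main code: wire the two phases together in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for minute_slot, one in map_hit(line):
            grouped[minute_slot].append(one)
    return dict(reduce_hits(slot, iter(ones)) for slot, ones in grouped.items())

# Invented sample lines, for illustration only.
sample_log = [
    '10.0.0.1 - - [10/Oct/2010:13:55:36 +0000] "GET / HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2010:13:55:59 +0000] "GET /a HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2010:13:56:01 +0000] "GET /b HTTP/1.0" 404 128',
]
hits = count_hits_per_minute(sample_log)
print(hits)  # {'10/Oct/2010:13:55': 2, '10/Oct/2010:13:56': 1}
```

Lines without a bracketed timestamp are skipped silently by the map function; a production job would more likely count or log them.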
PROBLEMS
Hadoop Map/Reduce is very powerful, but…
Requires a Java Programmer
User has to reinvent the wheel every time common functionality is needed (join, filter, etc.)
Harder to write, harder to maintain
Optimization is left to the user
PIG
Platform for analyzing large data sets
High-level language + infrastructure (compiler)
Pig Latin: a data flow language rather than procedural or declarative
Ease of programming
Optimization opportunities
Extensibility
PIG - ADVANTAGES
Increases productivity. In one test, 10 lines of Pig Latin ≈ 200 lines of Java. What took 4 hours to write in Java took 15 minutes in Pig Latin.
Opens the system to non-Java programmers.
Provides common operations like join, group, filter, sort.
PIG – HOW IT WORKS
PIG EXAMPLE
Start a terminal and run:
$ cd /usr/share/cloudera/pig/
$ bin/pig -x local
You should see a prompt like: grunt>
PIG – EXAMPLE - AGGREGATION
Let's count the number of times each user appears in a given data set.
log = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';
Results:
002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
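For readers who do not know Pig Latin, the same aggregation can be sketched in plain Python. It assumes the log is tab-separated with three fields per line (Pig's default load format); the function name and the sample rows are invented for this example and are not taken from excite-small.log.

```python
from collections import Counter

def count_queries_per_user(lines):
    """Mirror LOAD ... AS (user, timestamp, query), GROUP BY user, COUNT."""
    counts = Counter()
    for line in lines:
        # Split into exactly three fields, like the (user, timestamp, query) schema.
        user, timestamp, query = line.rstrip("\n").split("\t", 2)
        counts[user] += 1
    return counts

# Invented sample rows, for illustration only.
sample_rows = [
    "userA\t970916105432\tpig latin\n",
    "userB\t970916105500\thadoop\n",
    "userA\t970916105610\tmap reduce\n",
]
user_counts = count_queries_per_user(sample_rows)
print(sorted(user_counts.items()))  # [('userA', 2), ('userB', 1)]
```

The Python version runs on one machine and holds all counts in memory, whereas Pig compiles the four-line script into Map/Reduce jobs that can run across a cluster.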
PIG
Supports several functions: aggregation, grouping, filtering, ordering, joins & anti-joins, cogrouping (grouping generalization)
Several data types:
Scalar: int, long, double, chararray, bytearray
Complex: Maps, Tuples, Bags
PIG - COMMANDS
Pig Command      What it does
load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
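The difference between cogroup and join is easier to see on a small example. The Python sketch below models the semantics only, with records as tuples keyed on their first field; `cogroup` and `join` here are hypothetical helper functions, not Pig's implementation, and the sample data is invented.

```python
from collections import defaultdict

def cogroup(left, right):
    """Collect records sharing a key from two inputs: key -> (left bag, right bag)."""
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[record[0]][0].append(record)
    for record in right:
        groups[record[0]][1].append(record)
    return dict(groups)

def join(left, right):
    """Inner join on the first field, built by flattening the cogroup result."""
    return [l + r
            for left_bag, right_bag in cogroup(left, right).values()
            for l in left_bag
            for r in right_bag]

# Invented sample inputs, for illustration only.
users = [("u1", "Ann"), ("u2", "Bob")]
queries = [("u1", "pig"), ("u1", "hive"), ("u3", "hbase")]

grouped = cogroup(users, queries)
joined = join(users, queries)
print(grouped["u1"])  # ([('u1', 'Ann')], [('u1', 'pig'), ('u1', 'hive')])
print(joined)         # [('u1', 'Ann', 'u1', 'pig'), ('u1', 'Ann', 'u1', 'hive')]
```

Cogroup keeps keys that appear in only one input (u2 and u3 get an empty bag on one side), while the inner join drops them. Filtering the cogroup result for an empty right bag would give an anti-join.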
POSSIBLE APPLICATIONS AT VLEX
Faster and improved, parallelized document indexing
Targeted advertisement
Recommendation system
Trending topics
Better search tools (Search assist)
QUESTIONS?