Brief introduction to Hadoop, Dremel, Pig, FlumeJava and Cascading
A Brief Discussion on: Hadoop MapReduce, Pig, FlumeJava, Cascading & Dremel
Presented By: Somnath Mazumdar
29th Nov 2011
MapReduce
→ Based on Google's MapReduce programming framework
→ File system: GFS for Google's MapReduce, HDFS for Hadoop
→ Language: Google's MapReduce is written in C++, Hadoop is written in Java
→ Basic functions: Map and Reduce, inspired by similar primitives in LISP and other languages
Why should we use it?
• Automatic parallelization and distribution
• Fault tolerance
• I/O scheduling
• Status and monitoring
Map function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate key/value pairs
Syntax:
map (in_key, in_value) -> list(out_key, inter_value)
Reduce function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values
Syntax:
reduce (out_key, list(inter_value)) -> list(out_value)
Programming Model (word count example)
Input (HDFS):
Hello World, Bye World!
Hello MapReduce, Goodbye to MapReduce.
Welcome to UCD, Goodbye to UCD.
Map Phase (intermediate result):
M1: (Hello, 1) (Bye, 1) (World, 1) (World, 1)
M2: (Hello, 1) (to, 1) (Goodbye, 1) (MapReduce, 1) (MapReduce, 1)
M3: (Welcome, 1) (to, 1) (to, 1) (Goodbye, 1) (UCD, 1) (UCD, 1)
Reduce Phase (output to HDFS):
R1: (Hello, 2) (Bye, 1) (Welcome, 1) (to, 3)
R2: (World, 2) (UCD, 2) (Goodbye, 2) (MapReduce, 2)
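The word count above can be reproduced with a small single-process sketch. This is only an illustration of the map, shuffle and reduce steps in plain Python, not Hadoop code; the names `map_fn`, `shuffle`, `reduce_fn` and `word_count` are made up for this example, and a real cluster would run the map and reduce calls in parallel on different machines.

```python
import re
from collections import defaultdict


def map_fn(_in_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in one line."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", line)]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(out_key, inter_values):
    """Reduce: merge all intermediate values of one key into a single count."""
    return [(out_key, sum(inter_values))]


def word_count(lines):
    intermediate = []
    for offset, line in enumerate(lines):  # each line stands in for one map task
        intermediate.extend(map_fn(offset, line))
    result = {}
    for key, values in shuffle(intermediate).items():
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result


LINES = [
    "Hello World, Bye World!",
    "Hello MapReduce, Goodbye to MapReduce.",
    "Welcome to UCD, Goodbye to UCD.",
]
counts = word_count(LINES)
print(counts)
```

The resulting counts match the R1 and R2 outputs on the slide (for example `to` is counted three times).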
MapReduce Applications:
(1) Distributed grep and distributed sort
(2) Web link-graph reversal
(3) Web access log stats
(4) Document clustering
(5) Machine learning, and so on...
To know more:
→ MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
→ Hadoop: The Definitive Guide, O'Reilly Media
PIG
→ Pig was first developed at Yahoo! Research around 2006 and later moved to the Apache Software Foundation.
→ Pig is a data flow programming environment for processing large files, based on MapReduce / Hadoop.
→ High-level platform for creating MapReduce programs used with Hadoop and HDFS.
→ Apache library that interprets scripts written in Pig Latin and runs them on a Hadoop cluster.
At Yahoo!, 40% of all Hadoop jobs are run with Pig.
PIG Workflow:
First step: Load the input data.
Second step: Manipulate the data with operations such as filter, foreach, distinct, or any user-defined function.
Third step: Group the data.
Final stage: Write the data to the DFS, or repeat the steps if another dataset arrives.
Scripts written in Pig Latin ---[Pig Library/Engine]---> Hadoop-ready jobs
Take-away point: do more with data, not with functions.
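The load / manipulate / group / store workflow can be mimicked in a few lines of plain Python. This is only an analogy to show the shape of the data flow: it is not Pig Latin and does not use any Pig API, and the function names and the `user,url` sample data are invented for the illustration.

```python
from collections import defaultdict


def load(lines):
    """Step 1: parse 'user,url' text lines into (user, url) tuples."""
    return [tuple(line.split(",")) for line in lines]


def filter_rows(rows, predicate):
    """Step 2a: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def distinct(rows):
    """Step 2b: drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def group_by(rows, key_index):
    """Step 3: group the rows by one of their fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups


def store(groups):
    """Final stage: format one 'key<TAB>count' line per group (stand-in for a DFS write)."""
    return [f"{key}\t{len(rows)}" for key, rows in sorted(groups.items())]


raw = [
    "alice,pig.apache.org",
    "bob,hadoop.apache.org",
    "alice,pig.apache.org",   # duplicate visit
    "carol,",                 # missing url, filtered out
    "bob,pig.apache.org",
]
visits = distinct(filter_rows(load(raw), lambda row: row[1] != ""))
output = store(group_by(visits, 0))
print(output)
```

Each step takes a whole dataset and returns a new one, which is the point of the take-away line: the script describes what happens to the data, and the engine decides how to run it.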
Cascading
Query API and query planner for defining, sharing, and executing data processing workflows.
Supports creating and executing complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.).
Originally authored by Chris Wensel (founder of Concurrent, Inc.).
What does it offer?
• Data Processing API (core)
• Process Planner
• Process Scheduler
How to use it?
1. Install Hadoop.
2. Deploy the Hadoop job .jar, which must contain the Cascading .jars.
Cascading: ‘Source-Pipe-Sink’
How does it work?
Source: data is captured from sources.
Pipes: created independently of the data they will process; supports the reusable ‘pipes’ concept.
Sinks: results are stored in output files or ‘sinks’.
The Data Processing API provides the Source-Pipe-Sink mechanism.
Once a pipe assembly is tied to data sources and sinks, it is called a ‘flow’ (Topological Scheduler). Flows can be grouped into a ‘cascade’ (CascadeConnector class), and the process scheduler ensures that a given flow does not execute until all its dependencies are satisfied.
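The scheduling rule (a flow runs only after every flow it depends on has finished) is a topological ordering. A minimal sketch in Python, assuming Python 3.9+ for the standard `graphlib` module; this is not Cascading code, and `run_cascade` plus the four flow names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_cascade(flows, dependencies):
    """Run every flow once, never before the flows it depends on.

    flows:        dict of flow name -> zero-argument callable
    dependencies: dict of flow name -> set of flow names that must finish first
    """
    graph = {name: dependencies.get(name, set()) for name in flows}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        flows[name]()
        order.append(name)
    return order


log = []
flows = {
    "report": lambda: log.append("report"),
    "count_by_user": lambda: log.append("count_by_user"),
    "join_profiles": lambda: log.append("join_profiles"),
    "parse_logs": lambda: log.append("parse_logs"),
}
dependencies = {
    "count_by_user": {"parse_logs"},
    "join_profiles": {"parse_logs"},
    "report": {"count_by_user", "join_profiles"},
}
order = run_cascade(flows, dependencies)
print(order)
```

Here `parse_logs` always runs first and `report` last; the two middle flows have no dependency on each other, so a real scheduler would be free to run them concurrently.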
Cascading pipe assembly ---[MR Job Planner]---> graph of dependent MapReduce jobs.
Also provides external data interfaces for data...
It efficiently supports splits, joins, grouping, and sorting.
Usage: log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, etc.
Cascading was cited by SD Times in 2011 as one of the top five most powerful Hadoop projects.
FlumeJava
Java library API that makes it easy to develop, test, and run efficient data-parallel pipelines.
Born in May 2009 @ Google Lab.
The library is built around classes that represent immutable parallel collections.
FlumeJava:
1. Abstracts how data is represented: as an in-memory data structure or as a file.
2. Abstracts away implementation details, such as whether an operation runs as a local loop or as a remote MR job.
3. Implements parallel operations using deferred evaluation.
FlumeJava: how does it work?
Step 1: Invoke the parallel operation.
Step 2: The operation does not run immediately. Instead:
2.1. Record the operation and its arguments.
2.2. Save them in an internal execution plan graph structure.
2.3. Construct the execution plan for the whole computation.
Step 3: Optimize the execution plan.
Step 4: Execute it.
Faster than a typical MR pipeline with the same logical structure, and easier to write.
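The four steps can be sketched with a toy deferred collection in Python. This is not the FlumeJava API: `DeferredCollection` and its methods are hypothetical, the plan is a simple list rather than a graph, and the only optimization shown is fusing adjacent element-wise steps, loosely in the spirit of the operation fusion FlumeJava performs.

```python
class DeferredCollection:
    """Toy deferred collection: operations are recorded, not executed."""

    def __init__(self, source, plan=()):
        self._source = list(source)
        self._plan = tuple(plan)  # recorded ("map" | "filter", function) steps

    def map(self, fn):
        # Steps 1-2: invoking the operation only records it in the plan.
        return DeferredCollection(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        return DeferredCollection(self._source, self._plan + (("filter", predicate),))

    def optimized_plan(self):
        """Step 3: fuse consecutive map steps into a single map step."""
        fused = []
        for kind, fn in self._plan:
            if kind == "map" and fused and fused[-1][0] == "map":
                previous = fused[-1][1]
                fused[-1] = ("map", lambda x, f=previous, g=fn: g(f(x)))
            else:
                fused.append((kind, fn))
        return fused

    def run(self):
        """Step 4: execute the optimized plan over the source data."""
        data = self._source
        for kind, fn in self.optimized_plan():
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def increment(x):
    calls.append(x)  # lets us observe when the function really runs
    return x + 1


pipeline = (
    DeferredCollection([1, 2, 3, 4])
    .map(increment)
    .map(lambda x: x * 10)
    .filter(lambda x: x > 20)
)
calls_before_run = list(calls)          # still empty: nothing has executed yet
plan_sizes = (len(pipeline._plan), len(pipeline.optimized_plan()))
result = pipeline.run()
print(calls_before_run, plan_sizes, result)
```

Because nothing runs until `run()` is called, the library sees the whole pipeline at once and can rewrite it (here, three recorded steps become two) before doing any work.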
FlumeJava Data Model:
PCollection<T>: the central class, an immutable bag of elements of type T.
Can be unordered (a collection, which is more efficient) or ordered (a sequence).
PTable<K, V>: the second central class.
Immutable multi-map with keys of class K and values of class V.
Operators:
parallelDo(): the core parallel primitive, applied to a PCollection<T>
groupByKey(): applied to a PTable<K, V>
combineValues(): applied to a PTable<K, Collection<V>>
flatten(): logical view of multiple PCollections as one PCollection
join()
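To get a feel for how these operators compose, here is a rough Python analogy in which plain lists stand in for PCollection and lists of (key, value) pairs stand in for PTable. The function names mirror the operators but are not the FlumeJava API; everything here runs eagerly and sequentially, and the temperature readings are invented sample data.

```python
from collections import defaultdict
from functools import reduce


def parallel_do(collection, fn):
    """Apply fn to every element; fn returns a list of zero or more outputs."""
    return [out for element in collection for out in fn(element)]


def group_by_key(table):
    """[(k, v), ...] -> [(k, [v, ...]), ...], one entry per distinct key."""
    grouped = defaultdict(list)
    for key, value in table:
        grouped[key].append(value)
    return list(grouped.items())


def combine_values(grouped_table, combiner):
    """[(k, [v, ...]), ...] -> [(k, combined_v), ...] using a binary combiner."""
    return [(key, reduce(combiner, values)) for key, values in grouped_table]


def flatten(*collections):
    """Treat several collections as one (copied here; a logical view in FlumeJava)."""
    return [element for collection in collections for element in collection]


def parse_reading(text):
    """'city:temperature' -> [(city, temperature)]"""
    city, temperature = text.split(":")
    return [(city, int(temperature))]


day1 = ["dublin:12", "cork:14"]
day2 = ["dublin:15", "cork:11", "galway:9"]

readings = parallel_do(flatten(day1, day2), parse_reading)
warmest = dict(combine_values(group_by_key(readings), max))
print(warmest)
```

The chain parallel_do, group_by_key, combine_values has the same shape as one map / shuffle / reduce round.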
Dremel
A distributed system for interactive analysis of large datasets, in use at Google since 2006.
Provides a custom, scalable data management solution built over shared clusters of commodity machines.
Three features / key aspects:
1. Storage format: column-striped storage representation for non-relational nested data (lossless representation).
Why nested?
It backs a platform-neutral, extensible mechanism for serializing structured data at Google.
What is the main aim?
Store all values of a given field consecutively to improve retrieval efficiency.
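A very simplified Python sketch of column striping: each leaf field of a nested record, addressed by its path, gets its own list of values. It assumes every record has the same fields and no repeated or optional fields; Dremel's real format additionally stores repetition and definition levels so that arbitrary nested records can be rebuilt losslessly, which is left out here. The function names and sample records are invented.

```python
def flatten_record(record, prefix=""):
    """Yield (field_path, value) for every leaf field of one nested record."""
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            yield from flatten_record(value, path)
        else:
            yield path, value


def stripe(records):
    """Store all values of a given field consecutively: {field_path: [values]}."""
    columns = {}
    for record in records:
        for path, value in flatten_record(record):
            columns.setdefault(path, []).append(value)
    return columns


records = [
    {"doc_id": 10, "name": {"url": "a.example", "language": {"code": "en-us"}}},
    {"doc_id": 20, "name": {"url": "b.example", "language": {"code": "en-gb"}}},
    {"doc_id": 30, "name": {"url": "c.example", "language": {"code": "en-us"}}},
]
columns = stripe(records)

# A query on one field reads only that column and skips the others.
en_us_docs = columns["name.language.code"].count("en-us")
print(sorted(columns), en_us_docs)
```

The payoff is that a query touching one field out of many reads a single contiguous list instead of every full record.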
2. Query language: provides a high-level, SQL-like language to express ad hoc queries.
It is efficiently implementable on columnar nested storage.
Fields are referenced using path expressions.
Supports nested subqueries, inter- and intra-record aggregation, joins, etc.
3. Execution: multi-level serving tree concept (as in a distributed search engine).
Several queries can execute simultaneously.
The query dispatcher schedules queries based on priorities and balances the load.
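The serving tree idea can be sketched in Python as a recursive aggregation: leaves scan their own slice of the data and return partial results, and each level above merges the partial results of its children. This is a single-process toy, not Dremel's implementation; the tree layout, the `country` and `bytes` fields, and the function names are assumptions made for the example, and it only handles a COUNT plus SUM style query.

```python
def scan_leaf(rows, predicate):
    """Leaf server: scan the local rows and return a partial (count, sum)."""
    matched = [row["bytes"] for row in rows if predicate(row)]
    return len(matched), sum(matched)


def merge(partials):
    """Intermediate or root server: combine the partial results of the children."""
    partials = list(partials)
    return sum(c for c, _ in partials), sum(s for _, s in partials)


def serve(node, predicate):
    """A node is either a list of rows (leaf) or {'children': [...]} (inner node)."""
    if isinstance(node, list):
        return scan_leaf(node, predicate)
    return merge(serve(child, predicate) for child in node["children"])


tree = {
    "children": [
        {"children": [
            [{"country": "IE", "bytes": 10}, {"country": "US", "bytes": 5}],
            [{"country": "IE", "bytes": 20}],
        ]},
        {"children": [
            [{"country": "US", "bytes": 7}, {"country": "US", "bytes": 1}],
            [{"country": "IE", "bytes": 30}, {"country": "IE", "bytes": 40}],
        ]},
    ]
}

# Roughly: SELECT COUNT(*), SUM(bytes) WHERE country = 'IE'
count, total = serve(tree, lambda row: row["country"] == "IE")
print(count, total)
```

Only small partial aggregates travel up the tree, so the root never has to see the raw rows.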
I am lost... Are MR and Dremel the same?
Take-away point: Dremel complements MapReduce-based computing.
Features                   MapReduce (aka MR)                            Dremel
Birth year & place         Since 2004 @ Google lab                       Since 2006 @ Google lab
Type                       Distributed & parallel programming framework  Distributed interactive ad hoc query system
Scalable & fault tolerant  Yes                                           Yes
Data processing            Record oriented                               Column oriented
Batch processing           Yes                                           No
In situ processing         No                                            Yes