brief introduction on hadoop,dremel, pig, flumejava and cassandra

A Brief Discussion on: Hadoop MapReduce, Pig, FlumeJava, Cascading & Dremel

Presented By: Somnath Mazumdar

29th Nov 2011

MapReduce

- Hadoop MapReduce is based on Google's MapReduce programming framework.
- File system: GFS for Google's MapReduce, HDFS for Hadoop.
- Language: Google's MapReduce is written in C++, whereas Hadoop is written in Java.
- Basic functions: Map and Reduce, inspired by similar primitives in LISP and other languages.

Why should we use it?

- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

MapReduce

Map function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate pairs

Syntax: map (key, value) -> list(key, inter_value)

Reduce function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values

Syntax: reduce (out_key, list(inter_value)) -> list(out_value)
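The two signatures above can be sketched in a few lines of Python. This is only an in-memory illustration of the programming model, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this example, and the "shuffle" is a plain dictionary.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(key, value) -> list of (key, inter_value): one (word, 1) per word."""
    return [(word, 1) for word in value.split()]


def reduce_fn(out_key, inter_values):
    """reduce(out_key, list(inter_value)) -> list(out_value): sum the counts."""
    return [sum(inter_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver for the two user functions."""
    intermediate = defaultdict(list)
    # Map phase: apply map_fn to every input key/value pair.
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            # Shuffle: collect all intermediate values of one key together.
            intermediate[inter_key].append(inter_value)
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
```

A real framework runs many map and reduce calls in parallel on different machines; the user still writes only the two functions.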

Programming Model

Input (read from HDFS):
Hello World, Bye World!
Hello MapReduce, Goodbye to MapReduce.
Welcome to UCD, Goodbye to UCD.

Map phase, intermediate results:
M1: (Hello, 1) (Bye, 1) (World, 1) (World, 1)
M2: (Hello, 1) (to, 1) (Goodbye, 1) (MapReduce, 1) (MapReduce, 1)
M3: (Welcome, 1) (to, 1) (to, 1) (Goodbye, 1) (UCD, 1) (UCD, 1)

Reduce phase, results (written to HDFS):
R1: (Hello, 2) (Bye, 1) (Welcome, 1) (to, 3)
R2: (World, 2) (UCD, 2) (Goodbye, 2) (MapReduce, 2)
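The word-count example above can be replayed as a small simulation with three map tasks and two reduce tasks. This is a sketch under assumptions: the tokenizer and the `partition` function (CRC32 modulo the number of reducers) are choices made for this illustration, so the split of words between R1 and R2 will generally differ from the slide, while the merged counts are the same.

```python
import re
import zlib
from collections import defaultdict

LINES = [
    "Hello World, Bye World!",
    "Hello MapReduce, Goodbye to MapReduce.",
    "Welcome to UCD, Goodbye to UCD.",
]
NUM_REDUCERS = 2


def map_task(line):
    """One map task per input line: emit (word, 1) for every word."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", line)]


def partition(key, num_reducers):
    """Assumed partitioner: a deterministic hash decides which reducer gets a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_task(pairs):
    """Group the pairs by key and sum the values of each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


# Map phase (M1, M2, M3).
map_outputs = [map_task(line) for line in LINES]

# Shuffle: every occurrence of a key goes to the same reducer.
reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for pairs in map_outputs:
    for key, value in pairs:
        reducer_inputs[partition(key, NUM_REDUCERS)].append((key, value))

# Reduce phase (R1, R2), then merge the per-reducer outputs for display.
reducer_outputs = [reduce_task(pairs) for pairs in reducer_inputs]
word_counts = {}
for output in reducer_outputs:
    word_counts.update(output)
```

Because each key is routed to exactly one reducer, the reducer outputs never overlap and can be concatenated without any further merging.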

MapReduce Applications:
(1) Distributed grep & distributed sort
(2) Web link-graph reversal
(3) Web access log stats
(4) Document clustering
(5) Machine learning, and so on...

To know more:
- MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
- Hadoop: The Definitive Guide, O'Reilly Media

Pig

- Pig was first developed at Yahoo Research around 2006 and later moved to the Apache Software Foundation.
- Pig is a data flow programming environment for processing large files, based on MapReduce / Hadoop.
- It is a high-level platform for creating MapReduce programs used with Hadoop and HDFS.
- It is an Apache library that interprets scripts written in Pig Latin and runs them on a Hadoop cluster.

At Yahoo!, 40% of all Hadoop jobs are run with Pig.

Pig Workflow:

First step: Load the input data.
Second step: Manipulate the data with operations such as filter, foreach, distinct, or any user-defined function.
Third step: Group the data.
Final step: Write the data to the DFS, or repeat the steps if another dataset arrives.
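The four workflow steps can be mimicked in plain Python to show the shape of such a data flow. This is not Pig Latin and not the Pig API: the sample data, the helper names (`load`, `filter_rows`, `foreach`, `group_by`, `store`) and the in-memory "DFS" string are all invented for this sketch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: user, action, count.
RAW = "alice,search,3\nbob,click,1\nalice,click,5\ncarol,search,2\n"


def load(text):
    """Step 1: load the input data into tuples."""
    return [(user, action, int(n)) for user, action, n in csv.reader(io.StringIO(text))]


def filter_rows(rows, predicate):
    """Step 2a: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def foreach(rows, fn):
    """Step 2b: transform every row (here used as a projection)."""
    return [fn(row) for row in rows]


def group_by(rows, key_fn):
    """Step 3: group the rows by a key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return groups


def store(totals):
    """Final step: write tab-separated results (to a string, standing in for the DFS)."""
    out = io.StringIO()
    for user in sorted(totals):
        out.write(f"{user}\t{totals[user]}\n")
    return out.getvalue()


records = load(RAW)
kept = filter_rows(records, lambda r: r[2] >= 2)
projected = foreach(kept, lambda r: (r[0], r[2]))
grouped = group_by(projected, lambda r: r[0])
totals = {user: sum(n for _, n in rows) for user, rows in grouped.items()}
output = store(totals)
```

In Pig, each of these steps would be one statement of a script, and the engine would turn the whole script into Hadoop jobs.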

Scripts written in Pig Latin ---(Pig library/engine)---> Hadoop-ready jobs

Take-away point: Do more with data, not with functions.

Cascading

A query API and query planner for defining, sharing, and executing data processing workflows.

Supports creating and executing complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.).

Originally authored by Chris Wensel (founder of Concurrent, Inc.).

What does it offer?
- Data Processing API (core)
- Process Planner
- Process Scheduler

How to use it?
1. Install Hadoop.
2. Deploy a Hadoop job .jar, which must contain the Cascading .jars.

Cascading: ‘Source-Pipe-Sink’

How does it work?
Sources: Data is captured from sources.
Pipes: Pipes are created independently of the data they will process, which makes them reusable.
Sinks: Results are stored in output files, or ‘sinks’.

The Data Processing API provides the Source-Pipe-Sink mechanism.

Once a pipe assembly is tied to data sources and sinks, it is called a ‘flow’ (Topological Scheduler). Flows can be grouped into a ‘cascade’ (CascadeConnector class), and the process scheduler ensures that a given flow does not execute until all its dependencies are satisfied.
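The relationship between pipes, flows and cascades can be illustrated with a toy model. None of this is Cascading's Java API: the `Flow` class, `run_cascade` and the dictionary used as storage are hypothetical, and the dependency rule (a flow depends on whichever flow writes its source) is an assumption made for this sketch. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


# Pipes: plain functions, defined without reference to any particular data.
def upper_pipe(records):
    return [record.upper() for record in records]


def dedupe_pipe(records):
    return sorted(set(records))


class Flow:
    """A pipe tied to a named source and a named sink (hypothetical class)."""

    def __init__(self, name, source, pipe, sink):
        self.name, self.source, self.pipe, self.sink = name, source, pipe, sink

    def run(self, store):
        store[self.sink] = self.pipe(store[self.source])


def run_cascade(flows, store):
    """Run the flows so that no flow starts before the flow producing its source."""
    by_name = {flow.name: flow for flow in flows}
    producer = {flow.sink: flow.name for flow in flows}
    # Map each flow to the set of flows it depends on.
    graph = {
        flow.name: {producer[flow.source]} if flow.source in producer else set()
        for flow in flows
    }
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        by_name[name].run(store)
    return order


store = {"raw": ["b", "a", "b"]}
flows = [
    Flow("dedupe", "upper_out", dedupe_pipe, "final"),  # listed first, must run second
    Flow("upper", "raw", upper_pipe, "upper_out"),
]
order = run_cascade(flows, store)
```

The same `upper_pipe` could be reused in another flow with a different source and sink, which is the point of keeping pipes independent of the data.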

Cascading pipe assembly ---(MR job planner)---> graph of dependent MapReduce jobs.

Also provides External Data Interfaces for data...

It efficiently supports splits, joins, grouping, and sorting.

Uses: log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, etc.

Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011.

FlumeJava

A Java library API that makes it easy to develop, test, and run efficient data-parallel pipelines.

Born in May 2009 at Google.

The library is built around a few classes that represent immutable parallel collections.

FlumeJava:

1. Abstracts away how data is represented: as an in-memory data structure or as a file.

2. Abstracts away implementation details, such as whether an operation runs as a local loop or as a remote MR job.

3. Implements parallel operations using deferred evaluation.

FlumeJava

How does it work?

Step 1: The program invokes a parallel operation.

Step 2: The operation does not run immediately. Instead, FlumeJava:

2.1. Records the operation and its arguments.

2.2. Saves them in an internal execution plan graph structure.

2.3. Constructs the execution plan for the whole computation.

Step 3: Optimizes the execution plan.

Step 4: Executes the optimized plan.

The result is faster than a typical MR pipeline with the same logical structure, and easier to write.
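The record, optimize, execute cycle above can be sketched in Python. This is a toy, not FlumeJava: `DeferredCollection`, `optimize` and `run` are hypothetical names, the plan is a flat list rather than a graph, each operation maps one element to one element, and the only optimization shown is fusing consecutive `parallel_do` steps into a single pass.

```python
class DeferredCollection:
    """Records operations instead of running them (hypothetical class)."""

    def __init__(self, source, plan=()):
        self.source = source
        self.plan = tuple(plan)

    def parallel_do(self, fn):
        # Steps 1-2: record the operation and its argument; nothing executes here.
        return DeferredCollection(self.source, self.plan + (("parallel_do", fn),))


def optimize(plan):
    """Step 3: fuse runs of consecutive parallel_do steps into one step."""
    fused = []
    for op, fn in plan:
        if fused and fused[-1][0] == "parallel_do" and op == "parallel_do":
            previous = fused.pop()[1]
            fused.append(("parallel_do", lambda x, f=previous, g=fn: g(f(x))))
        else:
            fused.append((op, fn))
    return fused


def run(collection):
    """Step 4: execute the optimized plan; also report how many passes it took."""
    plan = optimize(collection.plan)
    data = list(collection.source)
    for _op, fn in plan:
        data = [fn(x) for x in data]
    return data, len(plan)


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return 2 * x


def increment(x):
    return x + 1


pipeline = DeferredCollection([1, 2, 3]).parallel_do(double).parallel_do(increment)
calls_before_run = list(calls)  # still empty: building the pipeline ran nothing
result, passes = run(pipeline)
```

Two logical operations end up as a single pass over the data, which is the kind of saving that makes the optimized pipeline cheaper than running each stage as its own job.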

FlumeJava Data Model:

PCollection<T>: the central class, an immutable bag of elements of type T.

It can be unordered (a collection, which is more efficient) or ordered (a sequence).

PTable<K, V>: the second central class.

An immutable multi-map with keys of class K and values of class V.

Operators:

parallelDo(): the core parallel primitive, applied to each element of a PCollection<T>

groupByKey(): converts a PTable<K, V> into a PTable<K, Collection<V>>

combineValues(): takes a PTable<K, Collection<V>> and an associative combining function, and returns a PTable<K, V>

flatten(): a logical view of multiple PCollections as one PCollection

join()
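To make the operators concrete, here is a rough Python analogue that uses plain lists in place of PCollection and lists of pairs in place of PTable. The function names mirror the operators, but the signatures are assumptions for illustration only, and everything runs eagerly in one process rather than in parallel.

```python
from collections import defaultdict
from functools import reduce


def parallel_do(pcollection, do_fn):
    """Apply do_fn to each element; do_fn may emit zero or more outputs."""
    return [out for element in pcollection for out in do_fn(element)]


def group_by_key(ptable):
    """Turn (key, value) pairs into (key, [values]) pairs."""
    grouped = defaultdict(list)
    for key, value in ptable:
        grouped[key].append(value)
    return list(grouped.items())


def combine_values(grouped, combine_fn):
    """Collapse each key's values with an associative combining function."""
    return [(key, reduce(combine_fn, values)) for key, values in grouped]


def flatten(*pcollections):
    """View several collections as one."""
    return [element for pc in pcollections for element in pc]


lines = flatten(["a b"], ["b c b"])
pairs = parallel_do(lines, lambda line: [(word, 1) for word in line.split()])
counts = combine_values(group_by_key(pairs), lambda x, y: x + y)
```

Chaining the three primitives this way reproduces the map, shuffle and reduce stages of a word count.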

Dremel

A distributed system for interactive analysis of large datasets, in use at Google since 2006.

Provides a custom, scalable data management solution built over shared clusters of commodity machines.

Three features / key aspects:

1. Storage format: a column-striped storage representation for non-relational nested data (lossless representation).

Why nested?

The nested data model backs a platform-neutral, extensible mechanism for serializing structured data at Google (Protocol Buffers).

What is the main aim?

Store all values of a given field consecutively to improve retrieval efficiency.
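The idea of storing all values of a field consecutively can be shown with flat records. This sketch deliberately ignores the hard part of Dremel's format, namely nested and repeated fields, and the record layout and function names are invented for the example.

```python
# Hypothetical flat records; real Dremel data is nested.
records = [
    {"doc_id": 10, "lang": "en", "size": 120},
    {"doc_id": 20, "lang": "fr", "size": 80},
    {"doc_id": 30, "lang": "en", "size": 200},
]


def to_columns(rows):
    """Column striping: one list per field, values stored consecutively."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def to_rows(columns):
    """Reassemble the original records, showing the layout is lossless."""
    fields = list(columns)
    return [dict(zip(fields, values)) for values in zip(*(columns[f] for f in fields))]


columns = to_columns(records)

# A query that needs only one field touches only that column.
total_size = sum(columns["size"])
```

A query over `size` never reads `doc_id` or `lang`, which is where the retrieval efficiency comes from.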

Dremel

2. Query language: Provides a high-level, SQL-like language to express ad hoc queries.

It is designed to be efficiently implementable on columnar nested storage.

Fields are referenced using path expressions.

Supports nested subqueries, inter-record and intra-record aggregation, joins, etc.

3. Execution: Uses the multi-level serving tree concept (borrowed from distributed search engines).

Several queries can execute simultaneously.

The query dispatcher schedules queries based on their priorities and balances the load.
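A multi-level serving tree can be pictured as partial aggregation at each level. The sketch below answers something like "count the records per language": the tablets, the fan-out of 2 and the function names are assumptions for illustration, and a real system would run the leaves on separate machines.

```python
from collections import Counter

# Hypothetical tablets, one per leaf server, holding a single "lang" column.
LEAF_TABLETS = [
    ["en", "fr", "en"],
    ["de", "en"],
    ["fr", "fr"],
    ["en"],
]


def leaf_query(tablet):
    """Leaf server: scan the local tablet and return a partial count."""
    return Counter(tablet)


def merge(partials):
    """Intermediate or root server: combine the partial results of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def serving_tree(tablets, fanout=2):
    """Aggregate level by level until a single result remains at the root."""
    level = [leaf_query(tablet) for tablet in tablets]
    levels = 1
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        levels += 1
    return level[0], levels


result, depth = serving_tree(LEAF_TABLETS)
```

Each server only ever merges the small partial results of its children, so no single machine has to see all the raw data.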

I am lost... Are MR and Dremel the same?

Take-away point: Dremel complements MapReduce-based computing.

Features                  | MapReduce (aka MR)                           | Dremel
Birth year & place        | Since 2004 @ Google lab                      | Since 2006 @ Google lab
Type                      | Distributed & parallel programming framework | Distributed interactive ad hoc query system
Scalable & fault tolerant | Yes                                          | Yes
Data processing           | Record oriented                              | Column oriented
Batch processing          | Yes                                          | No
In situ processing        | No                                           | Yes
