Hadoop: A New Way to Store and Analyze Data
Presented By: Harsha Jain, CSE – IV Year Student
Topics Covered
• What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• Hadoop Common
• HDFS
• Hadoop MapReduce
• Installation & Execution
• Demo of installation
• Hadoop Community
What is Hadoop?
• Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search engine projects.
• It is an open-source project administered by the Apache Software Foundation.
• Hadoop consists of two key services:
  a. Reliable data storage using the Hadoop Distributed File System (HDFS).
  b. High-performance parallel data processing using a technique called MapReduce.
• Hadoop runs large-scale, high-performance processing jobs reliably in spite of system changes or failures.
Why Hadoop?
• Need to process 100 TB datasets
• On 1 node:
  – scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
  – scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework
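A quick check of the arithmetic behind those figures: 100 TB ÷ 50 MB/s = 2,000,000 seconds ≈ 23 days on a single node; spread evenly across 1,000 nodes, that is 2,000 seconds ≈ 33 minutes.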
Where and When to Use Hadoop
Where:
• Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling)
• Highly parallel, data-intensive distributed applications
• Very large production deployments (GRID)

When:
• You process lots of unstructured data
• Your processing can easily be made parallel
• Running batch jobs is acceptable
• You have access to lots of cheap hardware
Benefits of Hadoop
• Hadoop is designed to run on cheap commodity hardware
• It automatically handles data replication and node failure
• It does the hard work – you can focus on processing data
• Cost savings, with efficient and reliable data processing
How Hadoop Works
• Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
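To make the paradigm concrete, here is a minimal, self-contained sketch of map, shuffle, and reduce in plain Java. This is not the Hadoop API; the class name and input data are illustrative, and the "shuffle" is simulated locally in memory:

    import java.util.*;
    import java.util.stream.*;

    public class MapReduceShape {

        // map: one input record -> zero or more (word, 1) pairs
        static Stream<Map.Entry<String, Integer>> map(String line) {
            return Arrays.stream(line.split("\\s+"))
                         .map(w -> Map.entry(w, 1));
        }

        // reduce: (key, all values for that key) -> one aggregated value
        static int reduce(String key, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return sum;
        }

        public static void main(String[] args) {
            List<String> input = List.of("the quick fox", "the lazy dog");

            // "shuffle": group the emitted pairs by key, as the framework would
            Map<String, List<Integer>> grouped = input.stream()
                .flatMap(MapReduceShape::map)
                .collect(Collectors.groupingBy(
                    Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

            grouped.forEach((k, v) -> System.out.println(k + "\t" + reduce(k, v)));
        }
    }

In the real framework, the map calls, the grouping, and the reduce calls run on different cluster nodes, and any failed fragment of work is simply re-executed elsewhere.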
Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop consists of:
• Hadoop Common*: the common utilities that support the other Hadoop subprojects.
• HDFS*: a distributed file system that provides high-throughput access to application data.
• MapReduce*: a software framework for distributed processing of large data sets on compute clusters.

Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS sits the MapReduce engine, which consists of JobTrackers and TaskTrackers.
* This presentation focuses primarily on the Hadoop architecture and related subprojects.
Data Flow
[Diagram: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL]
Hadoop Common
• Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
HDFS
• Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• Replication and data locality are central to its design (see the sketch below).
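As an illustration, a minimal sketch of writing and reading a file through Hadoop's FileSystem API. Configuration and FileSystem come from Hadoop Common; the path and data here are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Configuration picks up the cluster's site files; FileSystem.get
            // returns HDFS when the default file system points at a name node.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/hello.txt");   // hypothetical path
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");           // HDFS replicates the blocks
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());     // read the data back
            }
        }
    }

The application only sees a file system; block placement, replication, and recovery from failed data nodes happen underneath this API.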
HDFS Architecture
[Architecture diagram]
Hadoop MapReduce
• The Map-Reduce programming model
  – Framework for distributed processing of large data sets
  – Pluggable user code runs in a generic framework
• A common design pattern in data processing:
  cat * | grep | sort    | unique -c | cat > file
  input | map  | shuffle | reduce    | output
• Natural for:
  – Log processing
  – Web search indexing
  – Ad-hoc queries
MapReduce Implementation
1. Input files split (M splits)
2. Assign master & workers
3. Map tasks
4. Write intermediate data to disk (R regions)
5. Intermediate data read & sorted
6. Reduce tasks
7. Return
MapReduce Cluster Implementation
[Diagram: input files → splits 0–4 → M map tasks → intermediate files → R reduce tasks → output files 0–1]

• Several map or reduce tasks can run on a single computer
• Each intermediate file is divided into R partitions by a partitioning function (see the sketch below)
• Each reduce task corresponds to one partition
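For reference, Hadoop's default partitioning function hashes the key modulo R. The following standalone sketch mirrors that logic (the class and method names here are illustrative, not the Hadoop API):

    public class PartitionSketch {
        // Hash the key, mask off the sign bit so the result is non-negative,
        // and take it modulo the number of reduce tasks (R).
        static int partitionFor(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // With R = 4, every occurrence of "hadoop" lands in the same
            // partition, and therefore at the same reduce task.
            System.out.println(partitionFor("hadoop", 4));
        }
    }

Because the function is deterministic, all intermediate pairs sharing a key are routed to one reducer, which is what makes the per-key aggregation in the reduce phase possible.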
Example of MapReduce: Word Count
• Read text files and count how often words occur.
  o The input is text files
  o The output is a text file: each line contains word, tab, count
• Map: produce pairs of (word, 1)
• Reduce: for each word, sum up the counts.
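A sketch of this job against the Hadoop MapReduce API, closely following the well-known Apache WordCount tutorial (class names and paths are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);      // emit (word, 1)
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);        // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, it could be run along the lines of bin/hadoop jar wordcount.jar WordCount <input> <output>, after copying the input files into HDFS with bin/hadoop fs -put.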
Let's Go…
Installation:
• Requirements: Linux, Java 1.6, sshd, rsync
• Configure SSH for password-free authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name node
• Start all the daemon processes

Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with the relevant args
• Monitor tasks via the web interface (optional)
• Examine the output when the job is complete
Demo Video for installation
Hadoop Community
Hadoop Users:
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM

Major Contributors:
• Apache
• Cloudera
• Yahoo!
References
• Apache Hadoop (http://hadoop.apache.org)
• Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
• Free Search by Doug Cutting (http://cutting.wordpress.com)
• Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
• Cloudera – Apache Hadoop for the Enterprise (http://www.cloudera.com)