Intro to Apache™ Hadoop®
A Brown Bag Session at EAI Technologies
by Sufi Nawaz

DESCRIPTION

A presentation I compiled for a weekly brown bag session held at EAI Technologies.

TRANSCRIPT

Page 1: Intro to Apache Hadoop

Intro to Apache™ Hadoop®
A Brown Bag Session at EAI Technologies

by Sufi Nawaz

Page 2: Intro to Apache Hadoop

What is this Hadoop you speak of?

"Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware."

- Wikipedia

Doug Cutting (Creator)

Page 3: Intro to Apache Hadoop

More about Hadoop

● It is a highly scalable, fault-tolerant, distributed compute and storage platform.

● Based on Google's GFS and MapReduce papers.

● Brings computation to the data, not the other way around.

● Created by Doug Cutting and Mike Cafarella in 2005.

● Originally developed to support distribution for the Nutch search engine project.

Page 4: Intro to Apache Hadoop

Why use Hadoop?

● Process lots of data - petabytes even
● Distributed processing
● Uses simple programming models
● Scalable - add new nodes simply
● Cost effective - uses commodity hardware
● Flexible - Hadoop is schema-less and can absorb any kind of data
● Fault tolerant - failed jobs are redistributed, and data is recovered through replication

Page 5: Intro to Apache Hadoop

When to use Hadoop and not?

Good for:
● Indexing Data
● Log Analysis
● Image Manipulation
● Sorting Large Scale Data
● Data Mining

Bad for:
● Real-time processing
● Processing-intensive tasks with little data

Page 6: Intro to Apache Hadoop

Hadoop Modules

- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce

Page 7: Intro to Apache Hadoop

Hadoop Distributed File System (HDFS)

Page 8: Intro to Apache Hadoop

Hadoop Distributed File System

Apache HDFS is the primary distributed storage component used by applications under the Apache Hadoop project.

Apache HDFS can serve as a stand-alone distributed file system as well.

Page 9: Intro to Apache Hadoop

Hadoop Distributed File System

A single Namenode maintains the directory tree, manages the namespace, and regulates access to files by clients. It holds the metadata - the list of files, blocks, and datanodes - entirely in memory.

Datanodes store and manage the data blocks as local files on servers throughout the rest of the cluster. Each datanode reports to the Namenode with periodic heartbeats.

Page 10: Intro to Apache Hadoop

Hadoop Distributed File System

Page 11: Intro to Apache Hadoop

Hadoop Distributed File System

What is HDFS bad for?
● Low-latency data access. HDFS trades low latency for high data throughput.
● Lots of small files, since the default block size is 64 MB. Many small files increase the memory requirements of the Namenode (see the rough arithmetic below).
● Multiple writers and arbitrary file modification.
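
To see why small files hurt, a rough back-of-the-envelope calculation, assuming the commonly cited figure of about 150 bytes of Namenode heap per file, directory, or block object (the exact number varies by version):

  1 GB in one file of 64 MB blocks: 1 file + 16 blocks = 17 objects ≈ 2.5 KB of heap
  1 GB as 1,024 files of 1 MB each:  1,024 files + 1,024 blocks = 2,048 objects ≈ 300 KB of heap

Scaled to tens of millions of small files, the Namenode's memory, not the cluster's disk, becomes the limit.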

Page 12: Intro to Apache Hadoop

Hadoop Distributed File System

Anatomy of a write (see the client-side sketch below):
● DFSOutputStream splits the data into packets.
● Packets are written into an internal data queue.
● DataStreamer asks the Namenode for a list of datanodes and consumes the internal data queue.
● The Namenode returns a list of datanodes that form the write pipeline.
● A second internal queue (the ack queue) holds packets waiting to be acknowledged.
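
To make the write path concrete, here is a minimal client-side sketch in Java; the cluster address and file path are illustrative assumptions, not from the slides. FileSystem.create() returns an FSDataOutputStream, which wraps the DFSOutputStream described above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Illustrative address; in practice this comes from core-site.xml.
          conf.set("fs.defaultFS", "hdfs://namenode:8020");
          FileSystem fs = FileSystem.get(conf);

          // create() returns an FSDataOutputStream; under the hood the
          // DFSOutputStream splits writes into packets and streams them
          // down the datanode pipeline chosen by the namenode.
          Path path = new Path("/tmp/hello.txt"); // hypothetical path
          try (FSDataOutputStream out = fs.create(path)) {
              out.writeUTF("Hello, HDFS!");
          } // close() flushes remaining packets and waits for acks
      }
  }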

Page 13: Intro to Apache Hadoop

Hadoop Distributed File System

Anatomy of a read (see the client-side sketch below):
● The Namenode returns the locations of a file's blocks.
● The datanode list is sorted by proximity to the client.
● FSDataInputStream wraps DFSInputStream, which manages datanode and namenode I/O.
● read() is called repeatedly on the datanode until the end of the block is reached.
● DFSInputStream then finds the best datanode for the next block.
● All of this happens transparently to the client.
● The client calls close() when it finishes reading.
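
A matching read-side sketch, again with a hypothetical path. FileSystem.open() returns the FSDataInputStream described above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());

          // open() returns an FSDataInputStream wrapping a DFSInputStream,
          // which asks the namenode for block locations and reads each
          // block from the nearest datanode, as described above.
          try (FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"))) {
              IOUtils.copyBytes(in, System.out, 4096, false); // stream to stdout
          }
      }
  }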

Page 14: Intro to Apache Hadoop

Hadoop Distributed File System

Accessibility (example commands below):
● DFS Shell
● DFS Admin
● Browser Interface
● Mountable HDFS
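
For concreteness, a few representative commands; the file names are hypothetical, and the hadoop fs / hadoop dfsadmin entry points are the standard ones from Hadoop releases of this era:

  hadoop fs -ls /                    # DFS Shell: list the root directory
  hadoop fs -put local.txt /tmp/     # copy a local file into HDFS
  hadoop fs -cat /tmp/local.txt      # print a file's contents
  hadoop dfsadmin -report            # DFS Admin: capacity and datanode status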

Page 15: Intro to Apache Hadoop

MapReduce

Page 16: Intro to Apache Hadoop

MapReduce

Page 17: Intro to Apache Hadoop

MapReduce

Main Components:
● JobClient
● JobTracker
● TaskTracker

Page 18: Intro to Apache Hadoop

MapReduce

JobTracker (Master)
● Single JobTracker per cluster
● Schedules Map and Reduce tasks for TaskTrackers
● Monitors tasks and keeps track of TaskTracker status
● Re-executes tasks on failure

TaskTracker (Slave)
● A single TaskTracker per node (so multiple per cluster)
● Runs Map and Reduce tasks (see the WordCount sketch below)
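
To ground the JobTracker/TaskTracker flow, here is the canonical WordCount program, a sketch using the standard org.apache.hadoop.mapreduce API; this is the stock Hadoop tutorial example rather than code from the slides. The map step emits (word, 1) pairs and the reduce step sums them:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: emit (word, 1) for every word in the input line
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(Object key, Text value, Context ctx)
                  throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+")) {
                  if (!token.isEmpty()) {
                      word.set(token);
                      ctx.write(word, ONE);
                  }
              }
          }
      }

      // Reduce: sum the counts for each word
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) sum += v.get();
              ctx.write(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenMapper.class);
          job.setCombinerClass(SumReducer.class); // local pre-aggregation on map side
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }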

Page 19: Intro to Apache Hadoop

Who uses Hadoop?

● Yahoo!
○ Supports research for Ad Systems and Web Search

● Facebook
○ 2 major clusters (1100 + 300 machines, 8 cores each)
○ Heavy users of both the streaming and Java APIs
○ Have developed a FUSE implementation on HDFS

● eBay
○ 532-node cluster (8 × 532 cores, 5.3 PB)

● Hulu
○ 13-machine cluster (8 cores/machine, 4 TB/machine)
○ Log storage and analysis

● Many more
○ http://wiki.apache.org/hadoop/PoweredBy

Page 20: Intro to Apache Hadoop

Where can I find resources?

● Hadoop Docs
○ http://hadoop.apache.org/docs/current/

● Mailing Lists
○ http://hadoop.apache.org/mailing_lists.html

● White papers from Cloudera, Intel, Dell, etc.
● Hadoop in 20 Pages (http://blog.imaginea.com/hadoop-a-short-guide/)
● Yahoo! Developer Network Hadoop Tutorial
● Google Search Engine (!)

Page 21: Intro to Apache Hadoop

Some Additional Info

● Hadoop Streaming
○ Run MapReduce with any language that supports standard I/O, e.g. Ruby or Python (see the example invocation after this list).

● Hadoop Distributed Cache
○ Distributes files specified for a job to every node across the cluster so that tasks can read them locally.

● Hadoop Security
○ Secure Hadoop with Kerberos.

● Hadoop Federation
○ Splits the namespace across multiple independent NameNodes; together with NameNode High Availability (HA), it addresses the NameNode as a single point of failure.
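
A typical streaming invocation, sketched under the assumption of a Hadoop 1.x layout (the streaming jar's location varies by version, and the input/output paths are hypothetical). Standard Unix tools stand in for the mapper and reducer here, so any executable that reads stdin and writes stdout would work the same way:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /user/demo/in \
      -output /user/demo/out \
      -mapper /bin/cat \
      -reducer /usr/bin/wc

The identity mapper passes records through unchanged, and wc tallies the lines, words, and bytes it receives on the reduce side.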