
Page 1: Hadoop

Nguyen Thanh Hai, Portal team

August 2012

Page 2: Hadoop

Agenda

- Meet Hadoop
  - History
  - Data!
  - Data Storage and Analysis
  - What Hadoop is Not

- The Hadoop Distributed File System
  - HDFS Concept
  - Architecture
  - Goals
  - Command Line User Interface

- MapReduce
  - Overview
  - How MapReduce Works

- Practice
  - Demo
  - Discussion

Page 3: Hadoop

Meet Hadoop

- History
- Data!
- Data Storage and Analysis
- What Hadoop is Not

Page 4: Hadoop

History

Page 5: Hadoop

History

- Hadoop got its start in Nutch, where a few developers were attempting to build an open source web search engine and were having trouble managing computations running on even a handful of computers.

- Once Google published its GFS and MapReduce papers, the route became clear: Google had devised systems that solved precisely the problems they were having with Nutch. So two of them started, half-time, to try to re-create these systems as a part of Nutch.

- Around that time, Yahoo! got interested and quickly put together a team. They split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

Page 6: Hadoop

Data! We live in the data age

Page 7: Hadoop

Data! We live in the data age

Page 8: Hadoop

Data Storage and Analysis

- While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s.

- At 100 MB/s it takes roughly 10,000 seconds, nearly three hours, to read all the data on a single one-terabyte drive, and writing is even slower.

Page 9: Hadoop

Data Storage and Analysis

The obvious way:

- Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all of it in under two minutes (see the sketch below).

- Using only one hundredth of each disk may seem wasteful, but we can store one hundred datasets, each one terabyte in size, and provide shared access to them.
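To make the arithmetic concrete, here is a minimal sketch in Java, added for this write-up using the figures from the slides (it is not code from the original deck):

    public class DriveReadTime {
        public static void main(String[] args) {
            double dataMb = 1_000_000.0;  // one terabyte, expressed in MB
            double speedMbPerSec = 100.0; // transfer speed of a modern drive (slide figure)
            double oneDriveSec = dataMb / speedMbPerSec; // 10,000 s, nearly three hours
            double hundredDrivesSec = oneDriveSec / 100; // 100 s, under two minutes
            System.out.printf("1 drive:    %.0f s (%.1f h)%n", oneDriveSec, oneDriveSec / 3600);
            System.out.printf("100 drives: %.0f s%n", hundredDrivesSec);
        }
    }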

Page 10: Hadoop

Data Storage and Analysis

The problems to solve:

- The first: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is replication: the system keeps redundant copies of the data so that in the event of failure, another copy is available.

- The second: most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

With Hadoop:

Hadoop provides a reliable shared storage and analysis system: the storage is provided by HDFS and the analysis by MapReduce.

Page 11: Hadoop

What Hadoop is Not

- It is not a substitute for a database. Hadoop stores data in files and does not index them. If you want to find something, you have to run a MapReduce job over all the data. This takes time, and means that you cannot use Hadoop directly as a substitute for a database. Where Hadoop works is where the data is too big for a database: with very large datasets, the cost of regenerating indexes is so high that you cannot easily index changing data.

- MapReduce is not always the best algorithm. MapReduce is a profound idea: taking a simple functional programming operation and applying it, in parallel, to gigabytes or terabytes of data. But there is a price for that parallelism: each MR operation must be independent of all the others. If you need to know everything that has gone before, you have a problem.

- Hadoop and MapReduce are not a place to learn Java programming

- Hadoop is not an ideal place to learn networking error messages

- Hadoop clusters are not a place to learn Unix/Linux system administration

Page 12: Hadoop

The Hadoop Distributed File System

- HDFS Concept
- Architecture
- Goals
- Command Line User Interface

Page 13: Hadoop

HDFS concept

Block:

- A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks. Disk blocks are normally 512 bytes.

- HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage (see the sketch below).
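As an illustration (not from the original slides), a small Java sketch that asks HDFS for a file's block size and computes how many blocks the file occupies; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockSize {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/hai/data.log")); // hypothetical file
            long blockSize = status.getBlockSize(); // 64 MB by default
            long blocks = (status.getLen() + blockSize - 1) / blockSize; // ceiling division
            System.out.println(status.getLen() + " bytes -> " + blocks
                    + " block(s) of " + blockSize + " bytes");
        }
    }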

Page 14: Hadoop

HDFS Concept

NameNode and DataNodes:

- A Hadoop cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers).

- The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. It executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

- DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or by the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing, as the sketch below illustrates.
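As a companion sketch (again an illustration, not deck code), the following asks the NameNode for a file's metadata and prints which DataNodes hold each block; the path is hypothetical:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/hai/data.log")); // hypothetical file
            // The block-to-DataNode mapping is metadata served by the NameNode
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts())); // DataNodes holding replicas
            }
        }
    }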

Page 15: Hadoop

Architecture

Page 16: Hadoop

Architecture

Page 17: Hadoop

HDFS Goals

- Hardware Failure: An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the filesystem's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

- Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

- “Moving Computation is Cheaper than Moving Data”: A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data is huge, as it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Page 18: Hadoop

Command Line User Interface
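The slide walks through the HDFS shell. For reference, a few representative commands (the paths are only examples):

    hadoop fs -mkdir /user/hai/input              # create a directory in HDFS
    hadoop fs -put access.log /user/hai/input     # copy a local file into HDFS
    hadoop fs -ls /user/hai/input                 # list a directory
    hadoop fs -cat /user/hai/input/access.log     # print a file's contents
    hadoop fs -get /user/hai/input/access.log .   # copy a file back to the local filesystem
    hadoop fs -rm /user/hai/input/access.log      # delete a file
    hadoop fsck /user/hai -files -blocks          # report files, blocks, and replication health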

Page 19: Hadoop

MapReduce

- Overview
- How MapReduce Works

Page 20: Hadoop

Overview

- Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

- A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of a job are stored in the filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

- The MapReduce framework consists of a single master JobTracker and one worker TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the workers, monitoring them, and re-executing failed tasks; the workers execute the tasks as directed by the master. The classic word-count example below makes this concrete.
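Here is the classic word-count example in Java, a sketch added to ground the overview (it is not code from the deck): the mapper emits (word, 1) for every word of its input split, the framework groups the pairs by key, and the reducer sums the counts.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE); // each map call is independent of all others
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get(); // values were grouped and sorted by the framework
                }
                context.write(key, new IntWritable(sum)); // (word, total count)
            }
        }
    }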

Page 21: Hadoop

How MapReduce Works

Page 22: Hadoop

How MapReduce Works
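The diagrams on these two slides trace a job from submission through map and reduce tasks. As a complement, here is a minimal driver that wires the word-count classes above into a job and submits it (a sketch against the classic MRv1 API; the input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");   // under MRv1 the job goes to the JobTracker
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // each input split becomes a map task
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would be run with something like: hadoop jar wordcount.jar WordCountDriver /user/hai/input /user/hai/output (paths illustrative).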

Page 23: Hadoop

Practice

- Demo
- Discussion