Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability

Upload: edureka

Post on 27-Jan-2015

Category: Technology


TRANSCRIPT

Page 1: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 1 www.edureka.in/hadoop

Page 2: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 2

Hello there! My name is Annie.

Let me test your Hadoop 1.x knowledge.

Annie’s Introduction

Page 3: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 3

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Can you store 1 billion files in a Hadoop 1.x cluster?
- Yes
- No

Annie’s Question

Page 4: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 4

No. Even if you have hundreds of DataNodes in the cluster, the NameNode keeps all of its metadata in memory. Because Hadoop 1.x has a single NameNode, the entire cluster is limited to roughly 50-100 million files.

Annie’s Answer
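
To make the memory limit concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of NameNode heap per namespace object (a file or a block), which is a commonly quoted rule of thumb and not a figure from these slides; the function name and the one-block-per-file default are also illustrative assumptions.

```python
# Back-of-the-envelope NameNode heap estimate.
# ASSUMPTION: ~150 bytes of heap per namespace object (file or block).
# This is a commonly quoted rule of thumb, not a figure from the slides.
BYTES_PER_OBJECT = 150


def namenode_heap_gib(num_files: int, blocks_per_file: int = 1) -> float:
    """Return the estimated heap (in GiB) needed to hold the metadata."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT / 2**30


for files in (100_000_000, 1_000_000_000):
    print(f"{files:>13,} files -> ~{namenode_heap_gib(files):.0f} GiB of heap")
```

Under these assumptions, 100 million single-block files already need a heap in the tens of GiB, and 1 billion files need ten times that, which is why a single NameNode cannot simply grow with the number of DataNodes.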

Page 5: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 5

A Hadoop 1.x cluster can have multiple HDFS namespaces.
- True
- False

Annie’s Question

Page 6: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 6

False. Hadoop 1.x supports only a single namespace, managed by a single NameNode.

Annie’s Answer

Page 7: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 7

Which of the following are significant disadvantages of Hadoop 1.0?
- The NameNode is a single point of failure
- Too much burden on the JobTracker

Annie’s Question

Page 8: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 8

Both. The NameNode is a single point of failure, and the JobTracker is overburdened.

Annie’s Answer

Page 9: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 9

Can you use the hundreds of DataNodes in a Hadoop 1.x cluster for any processing other than MapReduce?
- Yes
- No

Annie’s Question

Page 10: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 10

No. Hadoop 1.x dedicates all DataNode resources to map and reduce slots, leaving little or no room for any other workload.

Annie’s Answer
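
The effect of fixed slots can be illustrated with a toy model. The class name and the slot counts below are hypothetical, chosen only to show how a slot of one type sits idle when only tasks of the other type are waiting.

```python
# Toy model of one Hadoop 1.x worker with statically configured slots.
# The slot counts are hypothetical examples, not values from the slides.
from dataclasses import dataclass


@dataclass
class TaskTrackerSlots:
    map_slots: int
    reduce_slots: int

    def idle_slots(self, pending_maps: int, pending_reduces: int) -> int:
        """Slots left unused because a slot only accepts its own task type."""
        used_maps = min(self.map_slots, pending_maps)
        used_reduces = min(self.reduce_slots, pending_reduces)
        return (self.map_slots - used_maps) + (self.reduce_slots - used_reduces)


node = TaskTrackerSlots(map_slots=8, reduce_slots=4)
# 20 map tasks are waiting, but the reduce slots cannot run them.
print(node.idle_slots(pending_maps=20, pending_reduces=0))
```

In this sketch the reduce slots stay empty even though map work is queued, and nothing other than map or reduce tasks can use them.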

Page 11: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 11

Can you use Hadoop for real-time processing?
- Yes
- No

Annie’s Question

Page 12: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 12

No. Hadoop is designed and developed for massively parallel batch processing.

Annie’s Answer

Page 13: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Limitations of Hadoop 1.x

No horizontal scalability of NameNode

Does not support NameNode High Availability

Overburdened JobTracker

Not possible to run Non-MapReduce Big Data Applications on HDFS

Does not support Multi-tenancy

Page 14: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 14

Hadoop 1.x – In Summary

[Diagram: a Client talks to the HDFS and MapReduce masters, the NameNode (with a Secondary NameNode) and the Job Tracker. Each worker node runs a DataNode, which holds data blocks, and a Task Tracker, which runs Map and Reduce tasks.]

Page 15: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 15

Hadoop 1.x - Challenges

Problem: NameNode – No Horizontal Scalability
Description: Single NameNode and single namespace, limited by NameNode RAM.

Problem: NameNode – No High Availability (HA)
Description: The NameNode is a single point of failure; after a failure it needs manual recovery using the Secondary NameNode.

Problem: Job Tracker – Overburdened
Description: Spends a significant portion of its time and effort managing the life cycle of applications.

Problem: MRv1 – Only Map and Reduce tasks
Description: The humongous data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing.

Page 16: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 16

NameNode – Scale and HA

NameNode - No High Availability

NameNode - No Horizontal Scale

[Diagram: the Client gets block locations from the single NameNode, which holds the namespace (NS) and does block management, and reads data from the DataNodes.]

Page 17: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 17

NameNode – Single Point of Failure

Secondary NameNode:
• "Not a hot standby" for the NameNode
• Connects to the NameNode every hour*
• Housekeeping, backup of NameNode metadata
• Saved metadata can be used to rebuild a failed NameNode

[Diagram: the NameNode, a single point of failure, passes its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
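
The housekeeping done by the Secondary NameNode is usually described as a checkpoint: the edit log is replayed on top of the last saved image of the namespace to produce a new image. The sketch below is a toy version of that idea; the dict-based image and the `(op, path)` edit tuples are illustrative stand-ins, not the real HDFS on-disk formats.

```python
# Toy checkpoint: replay an edit log on top of the last saved image.
# The dict "image" and (op, path) tuples are illustrative assumptions,
# not the real FSImage or edit log formats.
def checkpoint(fsimage: dict, edits: list) -> dict:
    merged = dict(fsimage)  # leave the previous checkpoint untouched
    for op, path in edits:
        if op == "create":
            merged[path] = {"blocks": []}
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return merged


fsimage = {"/data/a.txt": {"blocks": []}}
edits = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
new_image = checkpoint(fsimage, edits)
print(sorted(new_image))
```

Because the saved image is only as fresh as the last checkpoint, a NameNode rebuilt from it can lose recent edits, which is one reason the Secondary NameNode is not a hot standby.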

Page 18: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 18

Job Tracker – Overburdened

CPU
• Spends a very significant portion of its time and effort managing the life cycle of applications

Network
• A single listener thread communicates with thousands of Map and Reduce jobs

[Diagram: one Job Tracker serving many Task Trackers.]

Page 19: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 19

MRv1 – Unpredictability in Large Clusters

As the cluster size grows and reaches 4,000 nodes:

Cascading Failures

DataNode failures result in a serious deterioration of overall cluster performance, because the attempts to re-replicate data overload the live nodes and flood the network.

Multi-tenancy

As clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, and they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.

Page 20: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Unutilized Data in HDFS

Terabytes and Petabytes of data in HDFS can only be used for MapReduce processing

Page 21: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Introducing Hadoop 2.0

Feature: HDFS Federation
Hadoop 1.x: One NameNode and one namespace
Hadoop 2.0: Multiple NameNodes and namespaces

Feature: NameNode High Availability
Hadoop 1.x: Not present
Hadoop 2.0: Highly available

Feature: YARN - Processing Control and Multi-tenancy
Hadoop 1.x: JobTracker, TaskTracker
Hadoop 2.0: Resource Manager, Node Manager, App Master, Capacity Scheduler

Other important Hadoop 2.0 features:
• HDFS snapshots
• NFSv3 access to data in HDFS
• Support for running Hadoop on MS Windows
• Binary compatibility for MapReduce applications built on Hadoop 1.0
• A substantial amount of integration testing with the rest of the projects in the Hadoop ecosystem (such as Pig and Hive)

Page 22: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 22

Hadoop 2.0 Cluster Architecture - Federation

[Diagram: Hadoop 1.0 versus Hadoop 2.0. In Hadoop 1.0, a single NameNode holds the namespace (NS) and block management, on top of block storage provided by the DataNodes. In Hadoop 2.0, NameNodes NN-1 ... NN-k ... NN-n each manage their own namespace (NS1 ... NSk ... NSn) and their own block pool (Pool 1 ... Pool k ... Pool n), while DataNodes 1 ... m act as common storage for all block pools.]

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html

Page 23: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 23

Annie’s Question

How does HDFS Federation help HDFS scale horizontally?
A) It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) It provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.

Page 24: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 24 www.edureka.in/hadoop

Annie’s Answer

(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.

Page 25: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 25

Annie’s Question

You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?

Page 26: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 26

Annie’s Answer

The ‘put’ will fail. Neither namespace manages the /accounting directory, so you will get an IOException with a “No such file or directory” error.
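
The behaviour in this answer can be sketched as a client-side lookup from path prefix to NameNode. This is a toy model that assumes a mount-table style of resolution; the NameNode addresses and the `resolve` helper are hypothetical placeholders, not a real Hadoop API.

```python
# Toy client-side mount table for a federated cluster.
# The NameNode addresses are hypothetical placeholders.
MOUNT_TABLE = {
    "/marketing": "hdfs://nn-marketing:8020",
    "/finance": "hdfs://nn-finance:8020",
}


def resolve(path: str) -> str:
    """Return the NameNode URI responsible for `path`, or raise IOError."""
    for mount_point, namenode in MOUNT_TABLE.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            return namenode
    raise IOError(f"No such file or directory: {path}")


print(resolve("/finance/q1/report.csv"))
try:
    resolve("/accounting/ledger.csv")
except IOError as err:
    print("put failed:", err)
```

A path under /finance resolves to the NameNode that owns that namespace, while /accounting matches no entry and the operation fails before any NameNode is contacted.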

Page 27: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 27

Hadoop 2.0 Cluster Architecture - HA

HDFS: NameNode High Availability
• Active NameNode: all namespace edits are logged to shared NFS storage (shared edit logs); single writer (fencing)
• Standby NameNode: reads the edit logs and applies them to its own namespace

YARN: Next Generation MapReduce
• Resource Manager
• Each worker node runs a Data Node and a Node Manager, which hosts Containers and App Masters

[Diagram: a Client, the Active and Standby NameNodes with the shared edit logs, the Resource Manager, and several worker nodes.]

HDFS HIGH AVAILABILITY

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
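
The single-writer rule and the standby's log tailing can be sketched with a toy model. The slide only says "single writer (fencing)"; modelling fencing with an epoch number is an assumption made here for illustration, and the class and method names are hypothetical.

```python
# Toy model of a shared edit log with a single writer.
# ASSUMPTION: fencing is modelled with an epoch number for illustration;
# the slides only state "single writer (fencing)".
class SharedEditLog:
    def __init__(self):
        self.epoch = 0   # epoch of the NameNode currently allowed to write
        self.edits = []

    def become_writer(self) -> int:
        """Called on failover: a new epoch fences off the previous writer."""
        self.epoch += 1
        return self.epoch

    def append(self, writer_epoch: int, edit: str) -> None:
        if writer_epoch != self.epoch:
            raise PermissionError("fenced: a newer NameNode is active")
        self.edits.append(edit)


class Standby:
    def __init__(self):
        self.namespace = []
        self.applied = 0

    def tail(self, log: SharedEditLog) -> None:
        """Read edits not yet seen and apply them to the local namespace."""
        self.namespace.extend(log.edits[self.applied:])
        self.applied = len(log.edits)


log, standby = SharedEditLog(), Standby()
active_epoch = log.become_writer()
log.append(active_epoch, "mkdir /finance")
standby.tail(log)

new_epoch = log.become_writer()  # failover: the standby takes over
try:
    log.append(active_epoch, "mkdir /stale")  # the old active is rejected
except PermissionError as err:
    print(err)
print(standby.namespace)
```

The point of the sketch is that the standby already holds an up-to-date namespace when it takes over, and that a stale active NameNode can no longer corrupt the shared log.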

Page 28: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 28

Hadoop 2.0 Cluster Architecture - HA

[Diagram: two panels. "NameNode recovery in Hadoop 1.0": a NameNode and a Secondary NameNode, with metadata (FSImage and edit logs); a failed NameNode is recovered manually using the Secondary NameNode. "High Availability in Hadoop 2.0": an Active NameNode and a Standby NameNode, with automatic failover to the Standby NameNode.]

Page 29: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 29

Annie’s Question

NameNode HA was developed to overcome which of the following disadvantages of Hadoop 1.0?
a) Single point of failure of the NameNode
b) Too much burden on the JobTracker

Page 30: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 30

Annie’s Answer

Single Point of Failure of NameNode.

Page 31: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

YARN and Hadoop Ecosystem

[Diagram: two ecosystem stacks side by side. Without YARN: Apache Oozie (Workflow), Pig Latin (Data Analysis), Hive (DW System), the MapReduce Framework and HBase on top of HDFS (Hadoop Distributed File System). With YARN: the same components plus YARN (Cluster Resource Management) and other YARN frameworks (MPI, Giraph).]

YARN adds a more general interface to run non-MapReduce jobs (such as graph processing) within the Hadoop framework.

Page 32: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 32

YARN – Moving beyond MapReduce

• BATCH (MapReduce)
• INTERACTIVE (Text)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search) (Weave..)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 33: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 33

Multi-tenancy - Capacity Scheduler

• Organizes jobs into queues
• Queue shares are percentages of the cluster
• FIFO scheduling within each queue
• Data locality-aware scheduling

Hierarchical Queues: to manage resources within an organization.

Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.

Security: to safeguard applications from other users.

Elasticity: resources are available to queues in a predictable and elastic manner.

Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.

Operability: runtime configuration of queues.

Resource-based scheduling: if needed, applications can request more resources than the default.
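
How percentage shares and hierarchical queues combine can be shown with a small calculation. The queue names, the percentages and the cluster size below are hypothetical, and the function is a sketch of the arithmetic only, not of the scheduler itself.

```python
# Toy capacity calculation for hierarchical queues.
# Queue names, percentages and cluster size are hypothetical.
QUEUES = {
    "root": {"share": 100, "children": ["marketing", "finance"]},
    "marketing": {"share": 60, "children": []},
    "finance": {"share": 40, "children": ["reports", "adhoc"]},
    "reports": {"share": 75, "children": []},
    "adhoc": {"share": 25, "children": []},
}


def guaranteed(queue: str, parent_capacity: float, out: dict) -> dict:
    """Absolute capacity = the parent's capacity times the queue's share."""
    capacity = parent_capacity * QUEUES[queue]["share"] / 100
    out[queue] = capacity
    for child in QUEUES[queue]["children"]:
        guaranteed(child, capacity, out)
    return out


# Treat the cluster as 1000 units of capacity (for example, containers).
capacities = guaranteed("root", parent_capacity=1000, out={})
for name, cap in capacities.items():
    print(f"{name:<10} {cap:6.1f}")
```

Each child queue's guarantee is a fraction of its parent's guarantee, so the shares of the leaf queues add up to the whole cluster.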

Page 34: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 34

Annie’s Question

YARN was developed to overcome which of the following disadvantages of the Hadoop 1.0 MapReduce framework?
a) Single point of failure of the NameNode
b) Too much burden on the JobTracker

Page 35: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 35

Annie’s Answer

Too much burden on Job Tracker.

Page 36: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 36

Hadoop 2.0 – In Summary

[Diagram: a Client talks to the masters. On the HDFS side (distributed data storage, NameNode High Availability): the Active NameNode and the Standby NameNode, connected through shared edit logs or a Journal Node. On the YARN side (distributed data processing, Next Generation MapReduce): the Resource Manager, made up of the Scheduler and the Applications Manager (AsM). Each slave runs a DataNode and a Node Manager, which hosts Containers and App Masters.]

Page 37: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 37

Can you use Hadoop 2.0 for real-time processing?
- Yes
- No

Annie’s Question

Page 38: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 38

No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.

Annie’s Answer

Page 39: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 39 www.edureka.in/hadoop

What about Real-time Processing?

Hadoop is good for batch, but how do I process Big Data in real time?

Page 40: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Slide 40 www.edureka.in/hadoop

Storm is coming….

APACHE STORM

The Real-time Hadoop

• Continuous computation system

Distributed, Reliable, Fault-tolerant, Scalable and Robust

• Suitable for Big Data processing
• Guarantees no data loss

Programming Language agnostic

• JSON-based protocol for Ruby, Python, etc.

Use case

• Stream processing
• Distributed RPC
• Continuous computation

Page 41: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Hadoop Vs. Storm

Differences

Hadoop: Fundamentally a batch processing system
Storm: Real-time processing; processes unterminated streams of data (e.g. Twitter feeds) as the data arrives

Hadoop: MapReduce jobs run to completion
Storm: Topologies (computation graphs) run forever

Hadoop: Stateful nodes
Storm: Stateless nodes

Similarities (Hadoop and Storm alike)

• Scalable
• Guarantees no data loss
• Open source
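
The difference between a job that runs to completion and a computation that keeps running can be sketched in plain Python. The example uses no Hadoop or Storm APIs; the function names and the word-count task are illustrative assumptions.

```python
# Batch versus streaming word count in plain Python.
# No Hadoop or Storm APIs are used; this only contrasts the two styles.
from collections import Counter
from typing import Iterable, Iterator


def batch_word_count(lines: list) -> Counter:
    """Batch style: the whole input exists up front and the job finishes."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def streaming_word_count(stream: Iterable) -> Iterator:
    """Stream style: emit an updated result for every item as it arrives."""
    counts = Counter()
    for line in stream:  # on an unbounded stream this loop never ends
        counts.update(line.split())
        yield dict(counts)


data = ["storm hadoop", "storm"]
print(batch_word_count(data))
for snapshot in streaming_word_count(iter(data)):
    print(snapshot)
```

The batch function returns one final answer, whereas the streaming generator produces a fresh snapshot after every input and would continue for as long as data keeps arriving.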

Page 42: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Storm Use Cases

Data Normalization
• Groupon uses Storm to build real-time data integration systems.

Analytics
• Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners.
• Flipboard uses Storm across a wide range of services, from content search to real-time analytics to generating custom magazine feeds.

Log processing
• Alibaba uses Storm to process application logs and data changes in databases to supply real-time data stats for data apps.
• NaviSite uses Storm in its server log monitoring and auditing system.

Page 43: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

Thank You

See you in the next class