An Introduction to Apache Hadoop

Post on 10-May-2015

DESCRIPTION

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.

TRANSCRIPT

Introduction of Apache Hadoop

Presenter: Prem Chand Mali, Mindfire Solutions

Date: 30/01/2014

About Me

• SCJP/OCJP - Oracle Certified Java Programmer

• MCP: 70-480 - Specialist certification in HTML5 with JavaScript and CSS3 Exam

Skills: Java, Swing, Spring, Hibernate, JavaFX, jQuery, PrototypeJS, ExtJS.

Connect Me : https://www.facebook.com/prem.c.mali http://www.linkedin.com/in/premmali https://twitter.com/prem_mali https://plus.google.com/106150245941317924019/about/p/pub

Contact Me : premchandm@mindfiresolutions.com / prem.c.mali@gmail.com mfsi_premchandm

Agenda

History

What is Apache Hadoop

Why Apache Hadoop

HDFS

MapReduce

Q & A

History

• Nutch crawler-based search

• GFS and MapReduce papers published.

• Yahoo! hired Doug Cutting and gave him a dedicated team.

What is Apache Hadoop?

• Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications. It supports running applications on large clusters of commodity hardware.

• Hadoop is designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be handled automatically in software by the framework.

• Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.

What is Apache Hadoop?

• The Apache Hadoop framework is composed of the following modules:

– Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

– Hadoop MapReduce - a programming model for large-scale data processing.

– Hadoop Common - contains libraries and utilities needed by other Hadoop modules.

– Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.

Why Apache Hadoop?

• State of data

– 90% of data was generated in the past three years.

– Types of data: unstructured, semi-structured, relational

– The relational world can handle gigabytes of data.

• Distributed

• Scalable

• Flexible

• Fault tolerant

• Intelligent

HDFS

• HDFS is the primary distributed storage used by Hadoop applications. It consists of the following two types of components:

– NameNode

– DataNode

• HDFS is well suited for distributed storage and distributed processing using commodity hardware.

• Hadoop supports shell-like commands to interact with HDFS directly (for example, hadoop fs -ls, hadoop fs -put and hadoop fs -get).
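The split between the two HDFS component types can be illustrated with a small in-memory sketch. This is not the Hadoop API: the class names mirror the slide, but every method, the 4-byte block size, the replication factor of 2 and the round-robin placement are assumptions chosen to keep the example tiny (real HDFS uses much larger blocks and rack-aware placement). It only shows the idea that the NameNode holds metadata, DataNodes hold block contents, and a read can survive the loss of one machine.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores raw block contents; knows nothing about file names."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes
    alive: bool = True


class NameNode:
    """Keeps only metadata: which blocks form a file and where they live."""

    def __init__(self, datanodes, block_size=4, replication=2):
        if replication > len(datanodes):
            raise ValueError("replication cannot exceed the number of DataNodes")
        self.datanodes = datanodes
        self.block_size = block_size    # toy value (assumption)
        self.replication = replication  # toy value (assumption)
        self.files = {}                 # path -> [(block_id, [DataNode, ...])]

    def put(self, path, data):
        """Split data into blocks and store each block on several DataNodes."""
        entries = []
        n = len(self.datanodes)
        for i, start in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{path}#{i}"
            chunk = data[start:start + self.block_size]
            # Round-robin placement (assumption): replicas of one block
            # always land on distinct nodes.
            targets = [self.datanodes[(i + r) % n] for r in range(self.replication)]
            for node in targets:
                node.blocks[block_id] = chunk
            entries.append((block_id, targets))
        self.files[path] = entries

    def get(self, path):
        """Reassemble a file, reading each block from any live replica."""
        parts = []
        for block_id, holders in self.files[path]:
            live = [node for node in holders if node.alive]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(live[0].blocks[block_id])
        return b"".join(parts)


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    namenode = NameNode(nodes)
    namenode.put("/logs/a.txt", b"hello hadoop")
    nodes[0].alive = False  # simulate a machine failure
    print(namenode.get("/logs/a.txt"))
```

In the demo at the bottom, the file is still readable after one DataNode is marked as failed, because every block has a second copy on another node.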

MapReduce

• MapReduce is a combination of the following three things:

– Map

– Shuffle

– Reduce

• It does its job through the JobTracker and TaskTrackers.
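The three phases can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop's MapReduce API: the function names are hypothetical, and the distribution of work across machines by the JobTracker and TaskTrackers is not modeled.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Each call to map_phase depends only on its own line and each call to reduce_phase depends only on its own key, which is what would let a cluster run many of them in parallel on different machines.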

Question and Answer

Thank you

www.mindfiresolutions.com

https://www.facebook.com/MindfireSolutions

http://www.linkedin.com/company/mindfire-solutions

http://twitter.com/mindfires
