An Introduction to Apache Hadoop

Introduction to Apache Hadoop

Presenter: Prem Chand Mali, Mindfire Solutions

Date: 30/01/2014

About Me

• SCJP/OCJP - Oracle Certified Java Programmer
• MCP: 70-480 - Specialist certification in HTML5 with JavaScript and CSS3 Exam

Skills: Java, Swing, Spring, Hibernate, JavaFX, jQuery, PrototypeJS, ExtJS.

Connect Me:
https://www.facebook.com/prem.c.mali
http://www.linkedin.com/in/premmali
https://twitter.com/prem_mali
https://plus.google.com/106150245941317924019/about/p/pub

Contact Me : premchandm@mindfiresolutions.com / prem.c.mali@gmail.com mfsi_premchandm

Agenda

History

What is Apache Hadoop

Why Apache Hadoop

MapReduce

History

• Nutch: crawler-based search.
• Google's GFS and MapReduce papers were published.
• Yahoo! hired Doug Cutting and gave him a dedicated team.

What is Apache Hadoop ?

• Apache Hadoop is an open-source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It supports running applications on large clusters of commodity hardware.

• Hadoop is designed with the fundamental assumption that hardware failures (of individual machines, or of racks of machines) are common and should therefore be handled automatically in software by the framework.

• Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.

What is Apache Hadoop ?

• The Apache Hadoop framework is composed of the following modules:

– Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
– Hadoop MapReduce - a programming model for large-scale data processing.
– Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
– Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

Why Apache Hadoop ?

• State of Data
– About 90% of the world's data was generated in the past three years.
– Type of data
   • Unstructured
   • Semi-structured
   • Relational
– The relational world can handle gigabytes (GB) of data.
• Distributed
• Scalable
• Flexible
• Fault tolerant
• Intelligent

HDFS

• HDFS is the primary distributed storage used by Hadoop applications. It consists of the following two types of components:

– NameNode
– DataNode

• HDFS is well suited for distributed storage and distributed processing using commodity hardware.

• Hadoop supports shell-like commands to interact with HDFS directly.
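
To make the NameNode and DataNode roles listed above more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class names, the tiny block size, the replication factor of 2 and the round-robin placement are all illustrative choices, not how HDFS is actually implemented. It shows the idea that the NameNode keeps only metadata, the DataNodes hold replicated blocks, and a read can still succeed after one machine fails.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 8     # bytes; tiny for illustration (real HDFS blocks are many megabytes)
REPLICATION = 2    # copies kept of each block (illustrative value)


@dataclass
class DataNode:
    """Stores the actual block contents."""
    name: str
    alive: bool = True
    blocks: dict = field(default_factory=dict)   # block_id -> bytes


class NameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}       # path -> [block_id, ...]
        self.locations = {}   # block_id -> [DataNode, ...]

    def put(self, path, data):
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{i // BLOCK_SIZE}"
            # Round-robin placement: an assumption of this sketch, not HDFS's policy.
            start = len(self.locations)
            targets = [self.datanodes[(start + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            self.locations[block_id] = targets
            block_ids.append(block_id)
        self.files[path] = block_ids

    def read(self, path):
        out = b""
        for block_id in self.files[path]:
            # Use the first replica that sits on a live node.
            replica = next(n for n in self.locations[block_id] if n.alive)
            out += replica.blocks[block_id]
        return out


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.put("/demo.txt", b"hello distributed storage")
nodes[0].alive = False            # simulate a failed machine
recovered = nn.read("/demo.txt")
print(recovered)
```

On a real cluster the same kind of interaction happens through the shell-like commands, for example `hadoop fs -ls /` to list a directory or `hadoop fs -put` to copy a local file into HDFS.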

MapReduce

• MapReduce is a combination of the following three things:

– Map
– Shuffle
– Reduce

• It does its job through the JobTracker and TaskTrackers.
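
The three phases above can be sketched with a word count, the usual introductory MapReduce example. This is a single-process Python illustration, not Hadoop code: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical, and nothing here involves a JobTracker or TaskTracker. In Hadoop, the framework distributes the map and reduce tasks across the cluster and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both phases can run in parallel on different machines, which is the property the model is built around.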

MapReduce

Question and Answer

Thank you

www.mindfiresolutions.com

https://www.facebook.com/MindfireSolutions

http://www.linkedin.com/company/mindfire-solutions

http://twitter.com/mindfires
