Hadoop and Big Data

Big Data and Hadoop Essentials

Upload: yukti-kaura

Post on 27-Dec-2015

DESCRIPTION

Introduction to Big Data

TRANSCRIPT

Page 1: Hadoop and Big Data

Big Data and Hadoop Essentials

Page 2: Hadoop and Big Data

Agenda

• Brief History in Time

• How Big is Big Data?

• Why Hadoop?

• Hadoop Architecture

• Map Reduce Algorithm Exemplified

• Hadoop Ecosystem

• Demo

Page 3: Hadoop and Big Data

Brief History in Time

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but more systems of computers.

—Grace Hopper, American Computer Scientist

Page 4: Hadoop and Big Data

How Big is Big Data?

Page 5: Hadoop and Big Data

How Big is Big Data?

Page 6: Hadoop and Big Data

How Big is Big Data?

Page 7: Hadoop and Big Data

Why Hadoop?

Page 8: Hadoop and Big Data

The Problem

Page 9: Hadoop and Big Data

Challenges in managing Big Data

BIG DATA

Volume

Big Data arrives at large scale, measured in terabytes (TB) and even petabytes (PB).

Records, transactions, tables, files

Veracity

Quality, consistency, reliability and provenance of data

Good, bad, undefined, inconsistent, incomplete

Variety

Big Data extends beyond structured data to include semi-structured and unstructured data of every kind.

Text, logs, XML, audio, video, streams, flat files

Velocity

Data flows in continuously; it is time-sensitive and often arrives as streams.

Batch, real time, streams, historic

Page 10: Hadoop and Big Data

Hadoop evolved to overcome Big Data challenges

• Cost Effective – Commodity hardware

• Big Cluster – on the order of 1000 nodes, providing both storage and processing

• Parallel Processing – MapReduce

• Big Storage – Storage per node × number of nodes / RF (replication factor)

• Failover Mechanism – Automatic failover

• Data Distribution

• Moving code to data

• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)

• Scalable
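
The Big Storage bullet can be read as a back-of-the-envelope capacity formula. Below is a minimal sketch, assuming that "memory per node" means disk storage per node and that RF is the HDFS replication factor (3 by default). The function name and the 4 TB-per-node figure are illustrative and do not come from the slides.

```python
def usable_capacity_tb(storage_per_node_tb, node_count, replication_factor=3):
    """Estimate usable cluster storage in TB.

    Every block is stored replication_factor times, so usable space is
    the raw capacity divided by the replication factor.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    raw_capacity_tb = storage_per_node_tb * node_count
    return raw_capacity_tb / replication_factor


# Hypothetical cluster: 1000 nodes with 4 TB of disk each, RF = 3.
print(round(usable_capacity_tb(4, 1000), 1))  # 1333.3
```

The estimate ignores space reserved for the operating system and for intermediate job output, so the real usable capacity would be somewhat lower.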

Page 11: Hadoop and Big Data

What Exactly is Hadoop?

Page 12: Hadoop and Big Data

What’s in a name?

Page 13: Hadoop and Big Data

Hadoop Vendors

Page 14: Hadoop and Big Data

Who uses Hadoop?

Page 15: Hadoop and Big Data

What is Hadoop used for?

Page 16: Hadoop and Big Data

Stop and Ponder

• Is Hadoop an alternative to RDBMS?

• Hadoop is not replacing the traditional data systems used for building analytic applications (the RDBMS, EDW and MPP systems) but rather complements them, and it works well together with an RDBMS.

• Hadoop is being used to distill large quantities of data into something more manageable.

Page 17: Hadoop and Big Data

Stop and Ponder

• But isn’t Coherence distributed too? Why Hadoop?

Coherence is a market-leading in-memory data grid. Hadoop works well for large processing operations that involve many terabytes of data and can run as batch jobs. For use cases where processing must be closer to real time and the data volumes are smaller, Coherence is a better choice than HDFS for storing the data.

Page 18: Hadoop and Big Data

Hadoop vs. RDBMS

            RDBMS                   MapReduce
Data size   Gigabytes               Petabytes
Access      Interactive and batch   Batch
Structure   Fixed schema            Unstructured schema
Language    SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity   High                    Low
Scaling     Nonlinear               Linear
Updates     Read and write          Write once, read many times
Latency     Low                     High

Page 19: Hadoop and Big Data

Using Hadoop in Enterprise

Page 20: Hadoop and Big Data

Hadoop Architecture

• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

HDFS

Map Reduce

Hadoop

Page 21: Hadoop and Big Data

Hadoop Distributed File System (HDFS)

Page 22: Hadoop and Big Data

HDFS Architecture (Master-Slave)

Secondary Name Node

Master Book Keeper

Slave(s)

Periodic checkpoint

Data Block

Page 23: Hadoop and Big Data

The CORE

CLIENT Data Analytics Jobs

Map Reduce

Data Storage Jobs

HDFS

MASTER

SLAVE

= HDFS

Page 24: Hadoop and Big Data

Hadoop Ecosystem

Page 25: Hadoop and Big Data

Map Reduce Algorithm Exemplified!

Calculate the yearly average temperature per state.

Page 26: Hadoop and Big Data

Step 1: Group the city average temperatures by state.

Page 27: Hadoop and Big Data

Step 2: We don’t really care about the city names, so we will discard those and keep only the state names and city temperatures.

Page 28: Hadoop and Big Data

Step 3: We get a list of average temperatures for each state.

Page 29: Hadoop and Big Data

Step 4: All we have to do is calculate the average temperature for each state.

That was Map/Reduce!

Page 30: Hadoop and Big Data

Let’s do it again…

• Map/Reduce has 3 stages: Map, Shuffle and Reduce.

• The Shuffle part is done automatically by Hadoop; you just need to implement the Map and Reduce parts.

• You get input data as <Key,Value> for the Map part.

• In this example, the Key is the city name, and the Value is the set of attributes: state and city yearly average temperature.

Page 31: Hadoop and Big Data

• Since you want to regroup your temperatures by state, you drop the city name: the state becomes the Key and the temperature becomes the Value.
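
The Map step described above can be sketched in plain Python. This illustrates the idea only and is not the Hadoop Java API; the function name, the city and state labels and the temperature values are made up for the example.

```python
def map_city_record(city, attributes):
    """Map: drop the city name and emit (state, temperature)."""
    state, temperature = attributes
    return state, temperature


# Hypothetical input: key = city name, value = (state, yearly average temperature).
records = [
    ("CityA1", ("StateA", 10.0)),
    ("CityA2", ("StateA", 14.0)),
    ("CityB1", ("StateB", 20.0)),
]

mapped = [map_city_record(city, attributes) for city, attributes in records]
print(mapped)  # [('StateA', 10.0), ('StateA', 14.0), ('StateB', 20.0)]
```

Each record is handled independently of the others, which is what lets a framework run the Map step in parallel across many nodes.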

Page 32: Hadoop and Big Data

Shuffle

• Next, the shuffle task runs on the output of the Map task. It groups all the values by Key, so each Key ends up with a List<Value>.
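
Hadoop performs the shuffle itself, but its effect can be imitated in a few lines of Python. This is a sketch of the grouping behaviour only, under the assumption that the Map output is a list of (state, temperature) pairs; the function name and sample values are hypothetical.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group every value under its key, imitating the step between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical Map output: (state, temperature) pairs.
map_output = [("StateA", 10.0), ("StateB", 20.0), ("StateA", 14.0)]
print(shuffle(map_output))  # {'StateA': [10.0, 14.0], 'StateB': [20.0]}
```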

Page 33: Hadoop and Big Data

Reduce

• The Reduce task applies the actual logic to the data; in our case, it calculates the yearly average temperature for each state.

• That is what we get as the final output.
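
The Reduce step for this example might look like the following Python sketch. It assumes the shuffle has already produced one list of temperatures per state; the function name and the sample values are illustrative, not taken from the slides.

```python
def reduce_state_average(state, temperatures):
    """Reduce: collapse a state's list of temperatures into one average."""
    return state, sum(temperatures) / len(temperatures)


# Hypothetical Shuffle output: state -> list of city temperatures.
grouped = {"StateA": [10.0, 14.0], "StateB": [20.0]}

averages = dict(reduce_state_average(s, temps) for s, temps in grouped.items())
print(averages)  # {'StateA': 12.0, 'StateB': 20.0}
```

The sketch gives every city equal weight, which matches the walkthrough above.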

Page 34: Hadoop and Big Data

Hadoop AppStore

Page 35: Hadoop and Big Data

Ecosystem Matrix

Page 36: Hadoop and Big Data

Pig and Hive in the Hadoop Ecosystem

Page 37: Hadoop and Big Data

Hadoop Ecosystem Development

Page 38: Hadoop and Big Data

Demo

Page 39: Hadoop and Big Data

References

• http://hadoop.apache.org/

• http://hadoop.apache.org/hive/

• Hadoop in Action (http://www.manning.com/lam/)

• Hadoop: The Definitive Guide, 2nd ed. (http://oreilly.com/catalog/0636920010388)

• Yahoo! Hadoop blog (http://developer.yahoo.net/blogs/hadoop/)

• Cloudera (http://www.cloudera.com/)

Page 40: Hadoop and Big Data

Q & A

Page 41: Hadoop and Big Data

Thank You