Let’s Hadoop
1. WHAT’S THE BIG DEAL WITH BIG DATA?
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Gartner predicts 800% data growth over the next 5 years.
Big Data opens the door to a new approach to engaging customers and making decisions
2. BIG DATA: WHAT ARE THE CHALLENGES?
How can we capture and deliver data to the right people in real time?
How can we understand and use big data when it comes in a variety of forms?
How can we store and analyze the data given its size and the computational capacity required?
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. Example: processing a 100 TB dataset
• On 1 node, scanning @ 50 MB/s: ~23 days
• On a 1000-node cluster, scanning @ 50 MB/s: ~33 min
Hardware problems: data must be processed and combined from multiple disks. Traditional systems can’t scale, are not reliable, and are expensive.
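The cluster arithmetic above can be checked with a few lines of Java. This is a back-of-the-envelope illustration only; the 50 MB/s scan rate and perfect 1000-way parallelism are the slide’s simplifying assumptions.

```java
// Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per drive.
public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 100e12;      // 100 TB
        double rateBytesPerSec = 50e6;     // 50 MB/s per drive
        double oneNodeSecs = datasetBytes / rateBytesPerSec;
        double clusterSecs = oneNodeSecs / 1000; // 1000 nodes scanning in parallel
        System.out.printf("1 node:      %.0f days%n", oneNodeSecs / 86400);  // ~23 days
        System.out.printf("1000 nodes:  %.0f min%n", clusterSecs / 60);      // ~33 min
    }
}
```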
3. WHAT TECHNOLOGIES SUPPORT BIG DATA?
Scale-out everything:
• Storage
• Compute
4. WHAT MAKES HADOOP DIFFERENT?
■ Accessible: Hadoop runs on large clusters of commodity machines or on cloud services such as Amazon EC2.
■ Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
■ Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple: Hadoop allows users to quickly write efficient parallel code.
■ Data locality: move the computation to the data.
■ Replication: replicate data across servers to cope with unreliable storage and servers.
5. IS HADOOP A ONE-STOP SOLUTION?
Not good for...
• Real-time processing
• Small datasets
• Algorithms that require large temporary space
• Problems that are CPU bound and have lots of cross-talk
Good for...
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
• Framework written in Java
• Designed to solve problems that involve analyzing large data (petabytes)
• Programming model based on Google’s MapReduce
• Infrastructure based on Google’s distributed file system (GFS)
Hadoop consists of two core components:
• The Hadoop Distributed File System (HDFS): a distributed file system
• MapReduce: distributed processing on compute clusters
NameNode
Manages the file system namespace (metadata) and regulates access to files by clients.
The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.

DataNode
Manages storage attached to the node on which it runs.
A DataNode serves read and write requests and performs block creation, deletion, and replication upon request from the NameNode.
There are many DataNodes, typically one per physical node.
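As an illustration of the NameNode’s bookkeeping, the sketch below splits a file into fixed-size blocks and assigns replicas to DataNodes. This is not HDFS code: the 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement is a toy stand-in for HDFS’s rack-aware placement policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of NameNode-style block placement (not actual HDFS code).
public class BlockPlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, a common HDFS default
    static final int REPLICATION = 3;                  // common HDFS default replication factor

    // For each block of the file, pick the DataNodes that hold a replica.
    static List<List<String>> place(long fileSize, List<String> dataNodes) {
        int numBlocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling division
        List<List<String>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add(dataNodes.get((b + r) % dataNodes.size())); // naive round robin
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        // A 300 MB file needs 3 blocks of 128 MB (the last one partially full).
        place(300L * 1024 * 1024, nodes).forEach(System.out::println);
    }
}
```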
Large-Scale Data Processing
o Want to use 1000s of CPUs
o But don’t want the hassle of managing them
The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates
MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that node. A job consists of two phases:
o Map
o Reduce
In the map phase, the mapper reads input data in the form of key/value pairs and emits intermediate key/value pairs.
The reducer processes all output from the mappers, arrives at the final output as final key/value pairs, and writes them to HDFS.
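The two phases can be sketched in plain Java with the classic word-count example. This simulation stands in for the Hadoop Mapper/Reducer API: the map step emits (word, 1) pairs, a grouping step plays the role of the framework’s shuffle and sort, and the reduce step sums the counts for each key.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java simulation of the map and reduce phases for word counting.
// Illustrative only: a real Hadoop job implements the Mapper and Reducer
// classes of the Hadoop API rather than these standalone methods.
public class WordCountSim {

    // Map phase: split each input line (the value) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Reduce phase: sum all intermediate values for one key.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // Shuffle/sort: group intermediate pairs by key, as the framework would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());

        // Prints each word with its count, e.g. "the" appears twice.
        grouped.forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(word, counts)));
    }
}
```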
There are two types of nodes that control the job execution process:
o JobTracker
o TaskTrackers
The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run tasks and send progress reports to the JobTracker. The JobTracker typically runs on the NameNode machine, and each DataNode runs a TaskTracker.