Let’s Hadoop

Posted on 06-Jul-2020

TRANSCRIPT

Page 1:

Let’s Hadoop

Page 2:

1. WHAT’S THE BIG DEAL WITH BIG DATA?

Page 3:

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Gartner predicts 800% data growth over the next five years.

Big Data opens the door to a new approach to engaging customers and making decisions

Page 4:

2. BIG DATA: WHAT ARE THE CHALLENGES?

Page 5:

How can we capture and deliver data to the right people in real time?

How can we understand and use big data when it comes in a variety of forms?

How can we store and analyze the data given its size and the computational capacity required?

While the storage capacity of hard drives has increased massively over the years, access speeds (the rate at which data can be read from a drive) have not kept up. Example: suppose we need to process a 100 TB dataset.

• On 1 node, scanning at 50 MB/s: ~23 days
• On a 1,000-node cluster, scanning at 50 MB/s: ~33 minutes

Hardware problems: data must be processed and combined from multiple disks, and the more machines involved, the more frequent the failures. Traditional systems can’t scale, are unreliable, and are expensive.
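The arithmetic behind those two numbers is easy to check; a quick sketch (decimal units, 1 TB = 10^6 MB):

```python
# Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per drive.
DATASET_TB = 100
SCAN_MB_PER_S = 50

total_mb = DATASET_TB * 1_000_000                       # 100 TB in MB
one_node_days = total_mb / SCAN_MB_PER_S / 86_400       # one drive, sequential
cluster_minutes = total_mb / SCAN_MB_PER_S / 1000 / 60  # 1,000 drives in parallel

print(f"1 node:      {one_node_days:.0f} days")      # ~23 days
print(f"1,000 nodes: {cluster_minutes:.0f} minutes") # ~33 minutes
```

This is exactly the case Hadoop is built for: spread the data over many disks and scan them all at once.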

Page 6:

3. WHAT TECHNOLOGIES SUPPORT BIG DATA?

Page 7:

Scale-out everything:
• Storage
• Compute

Page 8:

4. WHAT MAKES HADOOP DIFFERENT?

Page 9:

■ Accessible—Hadoop runs on large clusters of commodity machines or in the cloud (e.g., Amazon EC2).
■ Robust—Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.
■ Data locality—move the computation to the data rather than the data to the computation.
■ Replication—use replication across servers to cope with unreliable storage and servers.

Page 10:

5. IS HADOOP A ONE-STOP SOLUTION?

Page 11:

Not good for...
• Real-time processing
• Small datasets
• Algorithms that require large temporary space
• Problems that are CPU-bound with lots of cross-talk between nodes

Good for....

Page 12:

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.

• Framework written in Java
• Designed to solve problems that involve analyzing large data sets (petabytes)
• Programming model based on Google’s MapReduce
• Infrastructure based on Google’s distributed file system (GFS)

Hadoop consists of two core components:
• The Hadoop Distributed File System (HDFS) - a distributed file system for storage
• MapReduce - distributed processing on compute clusters

Page 13:

NameNode
Manages the file system namespace (metadata) and regulates access to files by clients. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.

DataNode
Manages the storage attached to the node on which it runs. DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode. There are many DataNodes, typically one per physical node.
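To make the NameNode/DataNode split concrete, here is a toy Python model (not Hadoop code) of a NameNode’s block map: a file is cut into blocks and each block is assigned replicas on DataNodes. The 64 MB block size and replication factor of 3 mirror classic HDFS defaults; the round-robin placement and all node names are made-up simplifications of HDFS’s real rack-aware policy.

```python
# Toy model: how a NameNode might record which DataNodes hold each block replica.
import itertools

BLOCK_SIZE_MB = 64   # classic HDFS default block size
REPLICATION = 3      # classic HDFS default replication factor
datanodes = [f"datanode-{i}" for i in range(1, 6)]

def place_blocks(file_size_mb):
    """Return NameNode-style metadata: block id -> list of DataNodes."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    rotation = itertools.cycle(datanodes)         # naive round-robin placement
    return {
        f"blk_{i}": [next(rotation) for _ in range(REPLICATION)]
        for i in range(n_blocks)
    }

metadata = place_blocks(200)   # a 200 MB file splits into 4 blocks
print(len(metadata))           # 4
print(metadata["blk_0"])       # ['datanode-1', 'datanode-2', 'datanode-3']
```

The point of the model: only this small metadata map lives on the NameNode, while the block contents themselves live on the DataNodes.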

Page 14:

Large-scale data processing:
o Want to use 1000s of CPUs
o But don’t want the hassle of managing them

The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates

MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node. It consists of two phases:
o Map
o Reduce

Page 15:

Page 16:

In the map phase, the mapper reads input data in the form of key/value pairs and emits intermediate key/value pairs.

In the reduce phase, the reducer processes all output from the mappers, grouped by key, and writes the final key/value pairs to HDFS.
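The key/value flow described above can be simulated in a few lines of ordinary Python. This is only a single-process sketch of the word-count pattern; a real Hadoop job would express the same mapper and reducer against the Java MapReduce API.

```python
# Single-process simulation of the MapReduce key/value flow:
# map emits (word, 1) pairs, a "shuffle" groups pairs by key,
# and reduce sums each group.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1          # emit intermediate key/value pairs

def reducer(word, counts):
    return word, sum(counts)           # one final key/value pair per key

lines = ["Hadoop stores data", "Hadoop processes data"]

# Shuffle: group all mapper output by key.
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result)   # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In Hadoop the shuffle step is done by the framework between the two phases; the user supplies only the mapper and reducer.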

There are two types of nodes that control the job execution process:
o JobTracker
o TaskTrackers

The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run tasks and send progress reports back to the JobTracker. The JobTracker typically runs on the NameNode machine; a TaskTracker runs on each DataNode.
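A toy sketch of that division of labor, again in plain Python: the names and the round-robin task assignment are purely illustrative, and real Hadoop adds heartbeats, data locality, and slot accounting on top of this idea.

```python
# Toy JobTracker/TaskTracker model: the JobTracker hands tasks to
# TaskTrackers and collects their progress reports.
class TaskTracker:
    def __init__(self, name):
        self.name = name
        self.done = []

    def run(self, task):
        self.done.append(task)                 # pretend to execute the task
        return (self.name, task, "COMPLETE")   # progress report

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks):
        reports = []
        for i, task in enumerate(tasks):
            tracker = self.trackers[i % len(self.trackers)]  # naive round-robin
            reports.append(tracker.run(task))
        return reports

trackers = [TaskTracker("tt-1"), TaskTracker("tt-2")]
reports = JobTracker(trackers).run_job(["map-0", "map-1", "map-2", "reduce-0"])
print(reports[0])   # ('tt-1', 'map-0', 'COMPLETE')
```

The essential shape survives in real Hadoop: one coordinator schedules work, many workers execute it and report back.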

Page 17:

Page 18: