
Page 1: CS525 : Big  Data  Analytics


CS525: Big Data Analytics

MapReduce Computing Paradigm & Apache Hadoop Open Source

Fall 2013

Elke A. Rundensteiner

Page 2: CS525 : Big  Data  Analytics

Large-Scale Data Analytics

Many enterprises turn to the Hadoop computing paradigm for big data applications.

Database systems vs. Hadoop/MapReduce:

• Database systems
  – Performance (indexing, tuning, data organization techniques)
  – Advanced features: full query support, clever optimizers, views and security, data consistency, ...
  – Focus on read + write, concurrency, correctness, convenience, high-level access

• Hadoop/MapReduce
  – Scalability (petabytes of data, thousands of machines)
  – Flexibility in accepting all data formats (no schema)
  – Commodity, inexpensive hardware
  – Efficient fault-tolerance support

Page 3: CS525 : Big  Data  Analytics

What is Hadoop?

• Hadoop is a simple software framework for distributed processing of large datasets across huge clusters of (commodity-hardware) computers:
  – Large datasets: terabytes or petabytes of data
  – Large clusters: hundreds or thousands of nodes

• Open-source implementation of Google's MapReduce
• Simple programming model: MapReduce
• Simple data model: flexible for any data

Page 4: CS525 : Big  Data  Analytics

Hadoop Framework

• Two main layers:
  – Distributed file system (HDFS)
  – Execution engine (MapReduce)

• Hadoop is designed as a master-slave, shared-nothing architecture

Page 5: CS525 : Big  Data  Analytics

Key Ideas of Hadoop

• Automatic parallelization & distribution
  – Hidden from the end-user

• Fault tolerance and automatic recovery
  – Failed nodes/tasks recover automatically

• Simple programming abstraction
  – Users provide only two functions, “map” and “reduce”

Page 6: CS525 : Big  Data  Analytics

Who Uses Hadoop?

• Google: invented the MapReduce computing paradigm
• Yahoo: developed Hadoop, the open-source implementation of MapReduce
• Integrators: IBM, Microsoft, Oracle, Greenplum
• Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn
• Many others ...

Page 7: CS525 : Big  Data  Analytics

Hadoop Distributed File System (HDFS)

• Centralized namenode
  – Maintains metadata about files

• Many datanodes (1000s)
  – Store the actual data
  – Files are divided into blocks (64 MB each)
  – Each block is replicated N times (default N = 3)

[Figure: a file F is divided into blocks 1-5 of 64 MB each, which are distributed and replicated across the datanodes]
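As a concrete illustration of blocks and replication, here is a minimal sketch of a client writing a file to HDFS through the Java FileSystem API. This is a hedged example, not from the slides: the namenode address and file path are hypothetical, and the property names assume a Hadoop 2.x-style configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        conf.set("dfs.blocksize", "67108864");   // 64 MB blocks, as on the slide
        conf.set("dfs.replication", "3");        // each block stored on 3 datanodes

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/fileF.txt");   // illustrative path

        // The client streams bytes; HDFS cuts them into 64 MB blocks, the namenode
        // records which datanodes hold each block, and every block is replicated.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("example content\n");
        }
        fs.close();
    }
}
```

In practice the block size and replication factor are usually configured once per cluster (hdfs-site.xml) rather than per client as shown here.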

Page 8: CS525 : Big  Data  Analytics

HDFS File System Properties

• Large space: an HDFS instance may consist of thousands of server machines for storage

• Replication: each data block is replicated

• Failure: failure is the norm rather than the exception

• Fault tolerance: automated detection of faults and recovery

Page 9: CS525 : Big  Data  Analytics

Map-Reduce Execution Engine (Example: Color Count)

[Figure: input blocks on HDFS feed the map tasks; each map task produces (k, v) pairs such as (color, 1); shuffle & sorting based on k groups the pairs by color; each reduce task consumes (k, [v]) such as (color, [1,1,1,1,1,1,...]) and produces (k', v') such as (color, 100)]

Users only provide the “Map” and “Reduce” functions
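To make the figure's data flow concrete, the following self-contained sketch (plain Java, no Hadoop dependency; the input data and class name are made up for illustration) mimics the three stages of the color count example: map emits (color, 1), shuffle & sort groups the values by key, and reduce sums them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColorCountFlow {
    public static void main(String[] args) {
        // "Input blocks": each record is one colored item.
        List<String> input = List.of("red", "blue", "red", "green", "blue", "red");

        // Map phase: produce (k, v) pairs such as (red, 1).
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String color : input) {
            mapped.add(Map.entry(color, 1));
        }

        // Shuffle & sorting based on k: group all values by key -> (k, [1, 1, ...]).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce phase: consume (k, [v]) and produce (k', v'), here the total per color.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}
```

In Hadoop the same three stages run in parallel across the cluster, with the shuffle moving the grouped pairs from mapper nodes to reducer nodes.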

Page 10: CS525 : Big  Data  Analytics

MapReduce Engine

• Job Tracker is the master node (it runs with the namenode)
  – Receives the user's job
  – Decides how many tasks will run (number of mappers)
  – Decides where to run each mapper (locality)

• Example: this file has 5 blocks → run 5 map tasks
• Run the task reading block “1” on Node 1 or Node 3, the nodes holding a replica of that block

[Figure: the file's blocks replicated across Node 1, Node 2, and Node 3]

Page 11: CS525 : Big  Data  Analytics

MapReduce Engine

• Task Tracker is the slave node (it runs on each datanode)
  – Receives tasks from the Job Tracker
  – Runs each task to completion (either a map or a reduce task)
  – Communicates with the Job Tracker to report its progress

Example: one MapReduce job consists of 4 map tasks and 3 reduce tasks
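To tie the Job Tracker / Task Tracker slides to code, here is a hedged driver sketch against the Hadoop Java API. It uses Hadoop's built-in TokenCounterMapper and IntSumReducer library classes so that it stays self-contained; the input and output paths are taken from the command line. The number of map tasks is not set here, since the framework derives it from the input splits (by default roughly one per HDFS block), while the number of reduce tasks is chosen explicitly.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class JobDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "token count");     // illustrative job name

        job.setJarByClass(JobDriverSketch.class);
        job.setMapperClass(TokenCounterMapper.class);        // built-in (token, 1) mapper
        job.setReducerClass(IntSumReducer.class);             // built-in summing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Map tasks: one per input split, and each is preferably scheduled on a
        // node that holds a replica of its block (locality).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Reduce tasks: chosen by the developer, e.g. 3 as in the example above.
        job.setNumReduceTasks(3);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```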

Page 12: CS525 : Big  Data  Analytics

About Key-Value Pairs

• Developer provides the Mapper and Reducer functions
• Developer decides what is the key and what is the value
• Developer must follow the key-value pair interface

• Mappers:
  – Consume <key, value> pairs
  – Produce <key, value> pairs

• Shuffling and sorting:
  – Groups all pairs with the same key from all mappers,
  – sorts them, and passes them to a certain reducer
  – in the form of <key, <list of values>>

• Reducers:
  – Consume <key, <list of values>>
  – Produce <key, value>
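In the Hadoop Java API this key-value contract looks roughly like the following skeletons; the concrete type parameters and class names are illustrative, not prescribed by the slides.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: consumes <key, value> pairs and produces <key, value> pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit zero or more output pairs per input record via context.write(k, v).
    }
}

// Reducer: consumes <key, <list of values>> and produces <key, value> pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Aggregate the grouped values and emit a result via context.write(k, v).
    }
}
```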

Page 13: CS525 : Big  Data  Analytics


MapReduce Phases

Page 14: CS525 : Big  Data  Analytics

Another Example: Word Count

• Job: count the occurrences of each word in a data set

[Figure: the job runs as a set of map tasks followed by a set of reduce tasks]
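As a sketch of the map and reduce logic, the classes below closely follow Hadoop's standard WordCount example: the mapper tokenizes each line and emits (word, 1), and the reducer sums the counts per word. They would be wired into a driver like the one sketched earlier, e.g. job.setMapperClass(WordCount.TokenizerMapper.class).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: for each input line, emit (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the 1s collected for each word and emit (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```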

Page 15: CS525 : Big  Data  Analytics

Summary: Hadoop vs. Typical DB

Computing model
  – Distributed DBs: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
  – Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data model
  – Distributed DBs: structured data with a known schema; read/write mode
  – Hadoop: any data format; read-only mode

Cost model
  – Distributed DBs: expensive servers
  – Hadoop: cheap commodity machines

Fault tolerance
  – Distributed DBs: failures are rare; recovery mechanisms
  – Hadoop: failures are common over thousands of machines; simple fault tolerance

Key characteristics
  – Distributed DBs: efficiency, powerful optimizations
  – Hadoop: scalability, flexibility, fault tolerance