Hadoop MapReduce

Bui Quang Duy @ Septeni Technology

Hanoi 2014/01

Outline

- Introduction
- Hadoop
- Hadoop Architecture
- HDFS
- PYXIS & Hadoop

What is Septeni Technology?

- Operating in Vietnam since March 2013
- 45 employees in total
- Aiming to become the No. 1 ad technology center in Asia

What is MapReduce?

- A programming model for distributing a task across multiple nodes
- Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
- Features of MapReduce:
  - Fault tolerance
  - Status and monitoring tools
  - A clean abstraction for programmers
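To make the programming model concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the names `map_words`, `reduce_counts`, and `run_job` are hypothetical, and the grouping step that a real cluster performs across nodes is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Run map, group the pairs by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```

The programmer writes only the two small functions; splitting the input, grouping by key, and recovering from failures are left to the framework, which is the "clean abstraction" the slide refers to.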

MapReduce Execution Overview

[Figure: the user program forks a master and several workers. The master assigns map and reduce tasks. In the map phase, workers read the input file splits (Split 1 to Split 5, ...) and write intermediate key/value pairs to local disk. In the reduce phase, workers fetch those pairs by remote read and write the output files (Output file 1, Output file 2).]
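The flow labelled in the execution overview (splits, map workers, local writes, remote reads, one output file per reduce worker) can be sketched in a single Python process. This is an illustration under stated assumptions, not the framework's implementation: the two reducers mirror the two output files in the figure, the function names are hypothetical, and a CRC32-based partition function stands in for whatever partitioner a real job would use.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumption: two reduce workers, one output file each


def partition(key, num_reducers=NUM_REDUCERS):
    """Choose the reduce worker for a key; crc32 is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_split(split):
    """Map phase for one split: one bucket of (key, value) pairs per reducer."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets  # stands in for intermediate files on the worker's local disk


def reduce_partition(pairs):
    """Reduce phase for one partition: sum the values of each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(splits):
    """Run one map task per split, then one reduce task per partition."""
    local_outputs = [map_split(split) for split in splits]
    outputs = []
    for r in range(NUM_REDUCERS):
        # "Remote read": reducer r collects bucket r from every map task.
        pairs = [pair for buckets in local_outputs for pair in buckets[r]]
        outputs.append(reduce_partition(pairs))
    return outputs


if __name__ == "__main__":
    for index, output in enumerate(run(["a b a", "b c"]), start=1):
        print(f"Output file {index}: {output}")
```

Because every occurrence of a key is routed to the same partition, each reducer can compute final totals for its keys without consulting the others.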

Hadoop

- Open-source implementation of MapReduce from the Apache Software Foundation
- Created by Doug Cutting
- Derived from Google's MapReduce and Google File System (GFS) papers
- Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
- It enables applications to work with thousands of computationally independent computers and petabytes of data

Hadoop Components

- HDFS (storage): self-healing, high-bandwidth clustered storage
- MapReduce (processing): fault-tolerant distributed processing

Hadoop Architecture

[Figure: a Secondary Namenode, a Namenode with the JobTracker, and three Data nodes. Each Data node runs a TaskTracker that executes Map and Reduce tasks.]

Dataflow in Hadoop

- Map tasks write their output to local disk
- Output is available after the map task has completed
- Reduce tasks write their output to HDFS
- Once a job is finished, the next job's map tasks can be scheduled and will read their input from HDFS
- Therefore, fault tolerance is simple: re-run tasks on failure
- No consumers see partial operator output
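The last two points (re-run on failure, no partial output visible) can be illustrated with a single-machine analogy in Python. This is a sketch, not Hadoop's actual commit mechanism: `run_task` and `flaky_task` are hypothetical names, and a temporary file plus an atomic rename stands in for "output becomes available only after the task has completed".

```python
import os
import tempfile


def run_task(task_fn, output_path, max_attempts=3):
    """Run task_fn until it succeeds; publish its output only when complete.

    Returns the number of the attempt that succeeded.
    """
    out_dir = os.path.dirname(os.path.abspath(output_path))
    for attempt in range(1, max_attempts + 1):
        fd, tmp_path = tempfile.mkstemp(dir=out_dir)
        try:
            with os.fdopen(fd, "w") as tmp:
                task_fn(tmp, attempt)  # may raise part-way through
            os.replace(tmp_path, output_path)  # atomic rename: all or nothing
            return attempt
        except Exception:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)  # discard the partial output, then retry
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def flaky_task(out, attempt):
    """Hypothetical task that fails on its first attempt after a partial write."""
    out.write("partial\n")
    if attempt == 1:
        raise IOError("simulated worker failure")
    out.write("complete\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "part-00000")
        print("succeeded on attempt", run_task(flaky_task, path))
```

Since a failed attempt leaves nothing behind at `output_path`, re-running the task is always safe, which is what keeps the recovery logic simple.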

HDFS Basics

- HDFS is a filesystem written in Java
- Sits on top of a native filesystem
- Provides redundant storage for massive amounts of data
- Uses commodity devices

HDFS Data

- Data is split into blocks and stored on multiple nodes in the cluster
- Each block is usually 64 MB or 128 MB
- Each block is replicated multiple times
- Replicas are stored on different data nodes
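The arithmetic behind these points can be worked through with a short Python sketch. The replication factor of 3 is an assumption (the slide only says "multiple times"), `place_blocks` is a hypothetical helper, and the round-robin placement is for illustration only; it is not the placement policy HDFS itself uses.

```python
import math

BLOCK_SIZE_MB = 128  # the slide gives 64 MB or 128 MB as typical sizes
REPLICATION = 3      # assumption: the slide only says "multiple times"


def place_blocks(file_size_mb, data_nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} with each block's replicas on distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Hypothetical round-robin policy: consecutive nodes are always distinct.
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    for block, nodes in place_blocks(300, ["dn1", "dn2", "dn3", "dn4"]).items():
        print(f"block {block}: {nodes}")
```

Under these assumptions a 300 MB file occupies three blocks (two full 128 MB blocks and one 44 MB block), and with three replicas each the cluster stores nine block copies in total, so losing any single data node leaves at least two copies of every block.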

What is PYXIS?

PYXIS is a one-stop service for ad management, measurement, and optimization of online ads, specialized in Facebook. It is the only system in the world approved by Facebook as both a PMD (ad management) and an MMP (measurement).

Specialized in mobile and LTV maximization.

Main Features

- Massive ad creation
- Graphical reporting
- Auto optimization
- Mobile measurement
- LTV maximization
- Auto bidding

Automated optimization: auto bidding and reallocating

Data source

- Ad information (targeting segment and ad creative)
- Delivery data (impressions, clicks, cost, …)
- Action data (likes, installs, billing, LTV)

Summarize & Analyze

- PYXIS receives massive data every hour
- It summarizes and integrates all data into optimized units

Tuning Campaign & Ad

- Bid price and budget are changed based on the unit data
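The "summarize into units" step has the same shape as a MapReduce aggregation: key each record by its unit and sum the metrics. The slides do not describe how PYXIS implements this, so the Python sketch below is purely illustrative. The record fields, the choice of (campaign, ad) as the unit, and the name `summarize_by_unit` are all assumptions.

```python
from collections import defaultdict

METRICS = ("impressions", "clicks", "cost", "installs")  # assumed field names


def summarize_by_unit(records):
    """Sum delivery and action metrics per (campaign, ad) unit."""
    totals = defaultdict(lambda: {metric: 0 for metric in METRICS})
    for record in records:
        unit = totals[(record["campaign"], record["ad"])]
        for metric in METRICS:
            unit[metric] += record.get(metric, 0)  # missing metrics count as 0
    return dict(totals)


if __name__ == "__main__":
    hourly_records = [
        {"campaign": "c1", "ad": "a1", "impressions": 1000, "clicks": 20, "cost": 1.5},
        {"campaign": "c1", "ad": "a1", "impressions": 500, "clicks": 5, "cost": 2.5},
        {"campaign": "c1", "ad": "a1", "installs": 2},
        {"campaign": "c1", "ad": "a2", "impressions": 300, "clicks": 3, "cost": 0.5},
    ]
    for unit, totals in summarize_by_unit(hourly_records).items():
        print(unit, totals)
```

Delivery records and action records for the same unit arrive separately and are merged by the shared key, which is what lets a later step adjust bid price and budget from one combined row per unit.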

The end!
