Hadoop Tech Share

Dylan Valerio

Agenda

• Overview

• Demo

• Applications

• Configuration

Data!

• NYSE = 1 TB/day (10^12)

• Facebook = 10B photos = 1 PB (10^15)

• Ancestry.com = 2.5 PB

• Large Hadron Collider = 15 PB/year

Hadoop: The Definitive Guide

Storage and Analysis

• Storage capacity has increased, but disk I/O speed has not kept pace.

• Disk failure is a given at this scale.

• Analysis often needs to read a large portion of the data set.

• Network bandwidth is a limiting factor, especially for big data.
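
The gap between capacity and I/O speed can be made concrete with some rough arithmetic. The sketch below assumes a 1 TB drive and a sustained read rate of 100 MB/s; both are round illustrative numbers, not measurements.

```python
# Back-of-the-envelope scan times. Drive size and transfer rate are assumed
# round numbers chosen for illustration.
DISK_SIZE_MB = 1_000_000      # 1 TB of data, in MB (assumed)
TRANSFER_MB_PER_S = 100       # sustained read rate of one drive (assumed)


def read_time_seconds(size_mb, rate_mb_per_s, drives=1):
    """Time to scan size_mb when it is split evenly over `drives` disks read in parallel."""
    return size_mb / (rate_mb_per_s * drives)


single = read_time_seconds(DISK_SIZE_MB, TRANSFER_MB_PER_S)
parallel = read_time_seconds(DISK_SIZE_MB, TRANSFER_MB_PER_S, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions, spreading the same data over 100 drives turns a scan of several hours into a couple of minutes, which is the motivation for reading from many disks at once.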


Hadoop

• Distributed computing framework to process large amounts of data.

– Accessible – data is replicated on large clusters of commodity machines

– Robust – assume disk failures

– Scalable – add nodes

– Simple – simple MapReduce code

Hadoop in Action

HDFS

• Name Node – master of the HDFS. Directs I/O. “The Index”

– Secondary Name Node – takes periodic snapshots of the Name Node's metadata; a recovery aid, not a live failover

• Data Node – where actual data is stored and replicated

Hadoop in Action
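
The division of labor between the Name Node and the Data Nodes can be illustrated with a toy in-memory model. This is only a sketch: `ToyNameNode`, the 4-byte block size, and the round-robin placement are invented for illustration, and in real HDFS the client streams blocks to the Data Nodes directly while the Name Node only keeps the metadata.

```python
import itertools

BLOCK_SIZE = 4     # bytes per block; tiny on purpose (real HDFS blocks are 64 MB or more)
REPLICATION = 3    # copies kept of each block (3 is the usual HDFS default)


class ToyNameNode:
    """Keeps "the index": which blocks make up a file and where each replica lives."""

    def __init__(self, datanodes):
        self.datanodes = datanodes      # node name -> {block_id: bytes}
        self.index = {}                 # filename -> [(block_id, [node names])]
        # Round-robin placement; needs at least REPLICATION data nodes.
        self._next_node = itertools.cycle(sorted(datanodes))

    def put(self, filename, data):
        entries = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{offset // BLOCK_SIZE}"
            targets = [next(self._next_node) for _ in range(REPLICATION)]
            for node in targets:
                self.datanodes[node][block_id] = data[offset:offset + BLOCK_SIZE]
            entries.append((block_id, targets))
        self.index[filename] = entries

    def get(self, filename, dead=()):
        """Reassemble a file, skipping any data nodes listed in `dead`."""
        out = b""
        for block_id, nodes in self.index[filename]:
            live = [n for n in nodes if n not in dead]
            if not live:
                raise IOError(f"all replicas of {block_id} are lost")
            out += self.datanodes[live[0]][block_id]
        return out


nodes = {f"dn{i}": {} for i in range(1, 5)}   # four empty data nodes
nn = ToyNameNode(nodes)
nn.put("log.txt", b"hello hadoop")
restored = nn.get("log.txt", dead={"dn1", "dn2"})
print(restored)
```

Because every block has three replicas on different nodes, the file is still readable here with two of the four toy data nodes marked as dead.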

MapReduce

• MapReduce (MR) is a programming framework that processes data as key-value pairs.

• The mapper processes input records and emits intermediate key-value pairs; the reducer aggregates the values for each key.

• Mappers and reducers do not directly communicate with each other.


Huy Vo, NYU
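
The map, shuffle, and reduce steps can be mimicked in a few lines of plain Python. This is a single-process sketch, not Hadoop code; `run_mapreduce` is a hypothetical helper that stands in for the framework, and word count is used as the example job.

```python
from collections import defaultdict


def mapper(_key, line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all the 1s that were emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    # Map phase, with the shuffle folded in: group every emitted value by key.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            intermediate[k].append(v)
    # Reduce phase: each key is handed to the reducer once, in sorted key order.
    results = []
    for k in sorted(intermediate):
        results.extend(reducer(k, intermediate[k]))
    return results


lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = dict(run_mapreduce(enumerate(lines), mapper, reducer))
print(counts)
```

The mapper and reducer never call each other; they only exchange data through the grouped intermediate pairs, which is what lets a real cluster run them on different machines.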

Jobs

• Map tasks and reduce tasks are assigned throughout the cluster.

• The Job Tracker manages the status of the job(s).

• Each Task Tracker manages the tasks assigned to it.


Architecture

Hadoop in Action

Demo

• Word Count – the “Hello World” of MapReduce.

• Distributed grep – Sampling Pattern

• Top Child Star – Summarization Pattern
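
The demo code itself is not reproduced in these slides. As a rough idea of the distributed grep example, the sketch below treats it as a map-only job: each map task scans its own input split and emits only the matching lines. The splits and the pattern are made-up sample data.

```python
import re


def grep_mapper(pattern, split):
    """Map task: emit only the lines of one input split that match `pattern`."""
    regex = re.compile(pattern)
    for line in split:
        if regex.search(line):
            yield line


# Each inner list stands in for one input split handled by a separate map task.
splits = [
    ["alpha error 1", "beta ok"],
    ["gamma ok", "delta error 2", "epsilon error 3"],
]

# No reducer is needed: the job output is just the concatenated mapper output.
matches = [line for split in splits for line in grep_mapper(r"error \d", split)]
print(matches)
```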

Bit of History

• Doug Cutting and the Apache Lucene Team

• Google File System (2003) and MapReduce (2004)

• Cutting joined Yahoo! (2006).

• Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster (2008).


Hadoop Stack

• Core – I/O, serialization, Java RPC

• Avro – data serialization

• MapReduce

• HDFS

• Pig – higher-level data-flow language for exploring large data sets on HDFS and MapReduce clusters

• HBase – distributed column-oriented DB

• ZooKeeper – distributed coordination service

• Hive – distributed data warehouse + SQL-like query

• Chukwa – data collection and reports

• Mahout – collection of ML algorithms that run on Hadoop clusters


Configuration Checklist

• Rack management

• Java Installation

• Hadoop download and shell environment tweaking

• SSH + VIM

• Default configuration files:

– core-site.xml, hdfs-site.xml, mapred-site.xml

• Formatting the HDFS

• start-all.sh
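
To give an idea of what goes into the three site files, the sketch below renders a minimal single-node configuration. The property names are the ones used by Hadoop 1.x style setups; the host name and port numbers are assumptions and should be adapted to the actual cluster. The script only builds the XML text and does not write any files.

```python
import xml.etree.ElementTree as ET

# Assumed single-node values; the host and ports are placeholders to adjust.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Build a <configuration> document with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


rendered = {fname: render_site_xml(props) for fname, props in SITE_FILES.items()}
for fname, xml_text in rendered.items():
    print(fname)
    print(xml_text)
```

A replication factor of 1 only makes sense on a single machine; a multi-node cluster would normally keep a higher value.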

Hadoop Shell Commands

• hadoop fs -ls

• hadoop jar <jar file> <main class> <input params>

Web-Based Cluster UI

• localhost:50070 – DFS administration (Name Node)

• localhost:50030 – Job administration (Job Tracker)

Hadoop for other languages

• Hadoop Streaming uses Unix standard streams.

– So you can use bash scripts, Ruby, Python, etc.

• Hadoop Pipes is a C++ interface to MapReduce.
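
A streaming job exchanges plain text lines over standard input and output, conventionally as tab-separated key and value. The sketch below shows a word-count mapper and reducer written that way in Python. The function names and the `main` entry point are hypothetical; the local dry run at the end uses `sorted()` in place of Hadoop's shuffle-and-sort step.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for a script file: role is 'map' or 'reduce'."""
    step = map_lines if role == "map" else reduce_lines
    for out in step(stdin):
        stdout.write(out + "\n")


# Local dry run without a cluster.
mapped = sorted(map_lines(["b a", "a c a"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

On a cluster, the same two steps would typically live in script files that are passed to the streaming jar through its -mapper and -reducer options.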

Benefits to AC Technologies Discussion

Report Generation

• Suppose we have HBase for:

– High Availability: Distributed DB

– Partition Tolerance: Auto-Sharding

– Scalability: Horizontal Scaling

• Then, common scenarios will be:

– Service Management & Monitoring:

• Partitioning by month

• Binning by functional category

• Sampling by file status

– Harvest DB

• Top harvested files per day, per site
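
As a rough sketch of the reporting scenarios above, the snippet below partitions records by month and finds the top harvested file per day and site. The record layout (date, site, file name) and all sample values are hypothetical, and plain in-memory lists stand in for HBase.

```python
from collections import Counter, defaultdict

# Hypothetical harvest records: (ISO date, site, file name).
records = [
    ("2013-01-05", "siteA", "a.pdf"),
    ("2013-01-05", "siteA", "a.pdf"),
    ("2013-01-05", "siteA", "b.pdf"),
    ("2013-01-05", "siteB", "c.pdf"),
    ("2013-02-01", "siteA", "b.pdf"),
]

# Partitioning by month: the first 7 characters of the date are "YYYY-MM".
by_month = defaultdict(list)
for record in records:
    by_month[record[0][:7]].append(record)


def mapper(record):
    """Key each harvested file by (day, site)."""
    day, site, filename = record
    yield (day, site), filename


def reducer(key, filenames, n=1):
    """Keep the n most frequently harvested files for one (day, site) key."""
    yield key, Counter(filenames).most_common(n)


groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

top_files = {k: v for key in sorted(groups) for k, v in reducer(key, groups[key])}
print(top_files)
```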

Log Mining

• Suppose we have a common repository for all log files (Zenoss)

– Exception counting (WARN – FATAL level)

– Info-level reporting
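
Exception counting maps naturally onto MapReduce: the mapper emits the log level of each line and the reducer sums per level. The sketch below assumes log4j-style level names and a made-up line layout; the actual format of the log repository is not given here.

```python
import re
from collections import Counter

LEVELS = ("WARN", "ERROR", "FATAL")          # WARN and above (assumed level names)
LEVEL_RE = re.compile(r"\b(" + "|".join(LEVELS) + r")\b")


def mapper(line):
    """Emit (level, 1) for a line at WARN level or above; skip everything else."""
    match = LEVEL_RE.search(line)
    if match:
        yield match.group(1), 1


def count_levels(lines):
    """Reduce step: sum the 1s emitted for each level."""
    totals = Counter()
    for line in lines:
        for level, one in mapper(line):
            totals[level] += one
    return totals


# Made-up sample lines for illustration.
sample = [
    "2013-01-05 10:00:01 INFO  harvest started",
    "2013-01-05 10:00:02 WARN  slow response",
    "2013-01-05 10:00:03 ERROR connection reset",
    "2013-01-05 10:00:04 WARN  retrying",
    "2013-01-05 10:00:05 FATAL out of memory",
]
level_counts = count_levels(sample)
print(dict(level_counts))
```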

Analytics-Driven Decision Making

• Application Influence Mining through Akamai Logs

– Prevalent and isolated applications over the whole client base

– Prevalent and isolated applications over a single organization

– Projections of Application Influence over time
