Dylan Valerio Hadoop Tech Share

Upload: joshua-zabala

Post on 06-Jul-2015


Page 1: Hadoop tech share

Dylan Valerio

Hadoop Tech Share

Page 2: Hadoop tech share

Agenda

• Overview

• Demo

• Applications

• Configuration

Page 3: Hadoop tech share

Data!

• NYSE = 1 TB/day (10^12)

• Facebook = 10B photos = 1 PB (10^15)

• Ancestry.com = 2.5 PB

• Large Hadron Collider = 15 PB/year

Hadoop: The Definitive Guide

Page 4: Hadoop tech share

Storage and Analysis

• Storage capacity has increased, but I/O throughput has not kept pace.

• Disk failure.

• Analysis needs a large chunk of data.

• Network bandwidth is limited, especially for big data.

Hadoop: The Definitive Guide
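
The gap between capacity and I/O can be made concrete with a little arithmetic. The sketch below uses the 1 TB figure from the Data slide and an assumed sustained read rate of 100 MB/s per drive (an illustrative number, not a measurement from this talk) to compare one drive against 100 drives reading in parallel.

```python
# Illustrative figures only: the transfer rate is an assumption, not a benchmark.
DATASET_BYTES = 10**12                  # 1 TB, the NYSE daily volume quoted earlier
TRANSFER_BYTES_PER_SEC = 100 * 10**6    # assumed 100 MB/s sustained read per drive


def read_time_seconds(total_bytes, drives=1, rate=TRANSFER_BYTES_PER_SEC):
    """Time to scan total_bytes when the data is split evenly across `drives`."""
    return total_bytes / (drives * rate)


single = read_time_seconds(DATASET_BYTES)
parallel = read_time_seconds(DATASET_BYTES, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions a full scan drops from hours to minutes, which is the motivation for spreading data over many disks and reading them at once.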

Page 5: Hadoop tech share

Hadoop

• Distributed computing framework to process large amounts of data.

– Accessible – data is replicated on large clusters of commodity machines

– Robust – assume disk failures

– Scalable – add nodes

– Simple – simple MapReduce code

Hadoop in Action

Page 6: Hadoop tech share

HDFS

• Name Node – master of the HDFS. Directs I/O. “The Index”

– Secondary Name Node – periodically checkpoints the Name Node's metadata (a snapshot for recovery, not a hot standby)

• Data Node – where actual data is stored and replicated

Hadoop In Action
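
To illustrate the "index" idea, here is a toy model of the Name Node's bookkeeping: a file is cut into fixed-size blocks and each block is mapped to the Data Nodes holding a replica. The 64 MB block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for the sketch; real HDFS placement is rack-aware and more involved.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size (64 MB)
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy index: block id -> list of Data Nodes holding a replica."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many Data Nodes as replicas")
    n_blocks = -(-file_size // block_size)   # ceiling division
    ring = itertools.cycle(data_nodes)       # simple round-robin, NOT rack-aware
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(n_blocks)
    }


# Hypothetical cluster of four Data Nodes storing a 200 MB file.
index = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in index.items():
    print(block_id, nodes)
```

The Name Node only holds this mapping; the bytes themselves live on the Data Nodes, so losing one node still leaves other replicas of every block.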

Page 7: Hadoop tech share

MapReduce

• MR is a programming framework that processes data by keys and values.

• The mapper turns input records into intermediate key-value pairs, while the reducer aggregates the values collected for each key.

• Mappers and reducers do not directly communicate with each other.

Hadoop: The Definitive Guide
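
The key-value flow described on this slide can be simulated in a few lines of plain Python. This is a single-process sketch, not Hadoop code: the function names are made up for illustration, and the grouping step stands in for the shuffle that the framework performs between mappers and reducers.

```python
from collections import defaultdict


def mapper(line):
    """Emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Aggregate all values that were emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


print(run_job(["the quick brown fox", "the lazy dog", "The fox"]))
```

Note that `mapper` and `reducer` never call each other; they only meet through the grouped intermediate pairs, which mirrors the point that mappers and reducers do not communicate directly.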

Page 8: Hadoop tech share

Huy Vo, NYU

Page 9: Hadoop tech share

Jobs

• Map tasks and reduce tasks are assigned throughout the cluster.

• Job Tracker manages the status of the job(s).

• Each Task Tracker manages the tasks assigned to it.

Hadoop: The Definitive Guide

Page 10: Hadoop tech share

Architecture

Hadoop in Action

Page 11: Hadoop tech share

Demo

• Word Count – the “Hello World” of MapReduce.

• Distributed GREP – Sampling Pattern

• Top Child Star – Summarization Pattern
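
The demo code itself is not part of this transcript. As a rough idea of what a distributed grep looks like, the sketch below shows a map-only filter: each mapper scans its own slice of the input and emits only the matching lines, so no reducer is needed. The function name and the sample lines are hypothetical.

```python
import re


def grep_mapper(lines, pattern):
    """Emit only the input lines that match the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


# Hypothetical input split handled by one mapper.
split = ["ERROR disk full", "INFO all good", "ERROR timeout", "error in lowercase"]
for match in grep_mapper(split, r"ERROR"):
    print(match)
```

Because every split is filtered independently, the job scales by simply adding mappers.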

Page 12: Hadoop tech share

Bit of History

• Doug Cutting and the Apache Lucene Team

• Google File System (2003) and MapReduce (2004)

• Cutting joined Yahoo! (2006).

• Yahoo announced its search index was being processed by a 10,000-core Hadoop cluster. (2008)

Hadoop: The Definitive Guide

Page 13: Hadoop tech share

Hadoop Stack

• Core – I/O, serialization, Java RPC

• Avro – data serialization

• MapReduce

• HDFS

• Pig – higher-level data-flow language for exploring large datasets on HDFS & MR clusters

• HBase – distributed column-oriented DB

• ZooKeeper – distributed coordination service

• Hive – distributed data warehouse + SQL-like query

• Chukwa – data collection and reports

• Mahout – collection of ML algorithms for HDFS clusters

Hadoop: The Definitive Guide

Page 14: Hadoop tech share
Page 15: Hadoop tech share

Configuration Checklist

• Rack management

• Java Installation

• Hadoop download and shell environment tweaking

• SSH + VIM

• Default configuration files:

– core-site.xml, hdfs-site.xml, mapred-site.xml

• Formatting the HDFS

• start-all.sh

Page 16: Hadoop tech share

Hadoop Shell Commands

• hadoop fs -ls

• hadoop jar <jar file> <main class> <input params>

Page 17: Hadoop tech share

Web-Based Cluster UI

• localhost:50070 – DFS administration (Name Node)

• localhost:50030 – Job administration (Job Tracker)

Page 18: Hadoop tech share

Hadoop for other languages

• Hadoop streaming uses Unix standard streams.

– So you can use bash scripts, Ruby, Python, etc.

• Hadoop Pipes is a C++ interface to MR.
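
A streaming mapper and reducer are just programs that read lines from standard input and write tab-separated key-value lines to standard output. The following is a minimal word-count sketch under that contract; the function names and the "map"/"reduce" role argument are hypothetical conventions, and the demo drives the functions with in-memory streams instead of a real cluster.

```python
import io
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stream_in, stream_out):
    """Apply the chosen step to an input stream and write results line by line."""
    step = map_lines if role == "map" else reduce_lines
    for out in step(stream_in):
        stream_out.write(out + "\n")


# Local simulation of: input | mapper | sort | reducer
mapped = io.StringIO()
run("map", io.StringIO("a b a\nb c\n"), mapped)
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
run("reduce", shuffled, reduced)
print(reduced.getvalue(), end="")
```

In an actual job the script would call `run` with `sys.stdin` and `sys.stdout`, and would be handed to the streaming jar through its `-mapper` and `-reducer` options; the framework performs the sort between the two steps.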

Page 19: Hadoop tech share

Benefits to AC Technologies Discussion

Page 20: Hadoop tech share

Report Generation

• Suppose we have HBase for:

– High Availability: Distributed DB

– Partition Tolerance: Auto-Sharding

– Scalability: Horizontal Scaling

• Then, common scenarios will be:

– Service Management & Monitoring:

• Partitioning by month

• Binning by functional category

• Sampling by file status

– Harvest DB

• Top harvested files per day, per site
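
Partitioning and binning both come down to choosing a key for each record and grouping on it. A minimal sketch of that idea, assuming records with `day`, `category`, and `status` fields (the field names and sample values are invented for illustration and do not come from any real schema):

```python
from collections import defaultdict
from datetime import date


def group_by(records, key_fn):
    """Group records under the key returned by key_fn (a partition or bin label)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


# Hypothetical records: field names and values are invented for illustration.
records = [
    {"file": "a.log", "day": date(2015, 6, 30), "category": "billing", "status": "ok"},
    {"file": "b.log", "day": date(2015, 7, 1), "category": "search", "status": "failed"},
    {"file": "c.log", "day": date(2015, 7, 6), "category": "billing", "status": "ok"},
]

by_month = group_by(records, lambda r: r["day"].strftime("%Y-%m"))
by_category = group_by(records, lambda r: r["category"])

for month, rows in sorted(by_month.items()):
    print(month, [r["file"] for r in rows])
```

In a MapReduce job the same key function would live in the mapper (or a custom partitioner), and each group would be handled by a reducer.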

Page 21: Hadoop tech share

Log Mining

• Suppose we have a common repository for all log files (Zenoss)

– Exception counting (WARN – FATAL level)

– Info-level reporting
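
Exception counting fits the word-count shape: the mapper emits the log level as the key, and the reducer sums the ones. The sketch below assumes log lines of the form "date time LEVEL message" and reads "WARN – FATAL" as the levels WARN, ERROR, and FATAL; both are assumptions, since the actual log format is not given here.

```python
import re
from collections import Counter

LEVELS = ("WARN", "ERROR", "FATAL")   # assumed reading of "WARN - FATAL"
LEVEL_RE = re.compile(r"\b(" + "|".join(LEVELS) + r")\b")


def map_level(line):
    """Emit (level, 1) for the first WARN/ERROR/FATAL token found in the line."""
    match = LEVEL_RE.search(line)
    if match:
        yield match.group(1), 1


def count_levels(lines):
    """Sum the emitted pairs per level, as a reducer would."""
    counts = Counter()
    for line in lines:
        for level, one in map_level(line):
            counts[level] += one
    return counts


# Hypothetical log lines in the assumed format.
sample = [
    "2015-07-06 12:00:01 INFO started",
    "2015-07-06 12:00:02 WARN slow disk",
    "2015-07-06 12:00:03 ERROR timeout",
    "2015-07-06 12:00:04 WARN retry",
    "2015-07-06 12:00:05 FATAL out of memory",
]
print(dict(count_levels(sample)))
```

A stricter version would parse the level from a fixed column so that a message which merely mentions "ERROR" is not miscounted.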

Page 22: Hadoop tech share

Analytics-Driven Decision Making

• Application Influence Mining through Akamai Logs

– Prevalent and isolated applications over the whole client base

– Prevalent and isolated applications over a single organization

– Projections of Application Influence over time