Hadoop Tech Share
Dylan Valerio
Agenda
• Overview
• Demo
• Applications
• Configuration
Data!
• NYSE = 1 TB/day (10^12 bytes)
• FB = 10B photos = 1 PB (10^15 bytes)
• Ancestry.com = 2.5 PB
• Large Hadron Collider = 15 PB/year
Hadoop: The Definitive Guide
Storage and Analysis
• Storage capacity has increased, but I/O speed has not kept pace.
• Disk failure becomes routine once many disks are in use.
• Analysis needs to read a large chunk of the data.
• Network bandwidth is limited, especially for big data.
Hadoop: The Definitive Guide
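A rough back-of-the-envelope calculation illustrates the I/O point above. The figures are assumptions chosen for illustration (1 TB of data, 100 MB/s sustained read per drive, 100 drives read in parallel), not measurements.

```python
# Illustrative figures only (assumed, not measured).
DATA_BYTES = 10**12          # 1 TB of data
TRANSFER_RATE = 100 * 10**6  # 100 MB/s sustained read per drive

def read_time_seconds(data_bytes, rate_bytes_per_s, drives=1):
    """Time to read the data when it is split evenly across `drives` disks."""
    return data_bytes / (rate_bytes_per_s * drives)

single = read_time_seconds(DATA_BYTES, TRANSFER_RATE)
parallel = read_time_seconds(DATA_BYTES, TRANSFER_RATE, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions, spreading the data over many disks turns hours into minutes, which is the motivation for reading from many machines at once.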
Hadoop
• Distributed computing framework to process large amounts of data.
– Accessible – data is replicated on large clusters of commodity machines
– Robust – assume disk failures
– Scalable – add nodes
– Simple – simple MapReduce code
Hadoop in Action
HDFS
• Name Node – master of the HDFS. Directs I/O. “The Index”
– Secondary Name Node – periodically checkpoints the Name Node's metadata; it is not a live backup
• Data Node – where actual data is stored and replicated
Hadoop in Action
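The split between "the index" and the stored data can be sketched with a toy model. Everything here is hypothetical and simplified: the class name ToyNameNode, the tiny block size, the replication factor of 3, and the round-robin placement are assumptions for illustration. Real HDFS placement is rack-aware and considerably more involved.

```python
import itertools

BLOCK_SIZE = 64   # bytes here for readability; real HDFS blocks are tens of MB
REPLICATION = 3   # assumed replication factor

class ToyNameNode:
    """Holds metadata only: which blocks form a file and where replicas live."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.index = {}  # file name -> list of (block id, [data node names])
        self._next_node = itertools.cycle(range(len(self.data_nodes)))

    def add_file(self, name, size):
        n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        blocks = []
        for i in range(n_blocks):
            start = next(self._next_node)
            # Place each replica on a different node (needs >= REPLICATION nodes).
            replicas = [self.data_nodes[(start + r) % len(self.data_nodes)]
                        for r in range(REPLICATION)]
            blocks.append((f"{name}#blk{i}", replicas))
        self.index[name] = blocks
        return blocks

    def locate(self, name):
        """A client asks the name node where to read; data comes from data nodes."""
        return self.index[name]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
layout = nn.add_file("logs.txt", size=150)
for block_id, replicas in layout:
    print(block_id, "->", replicas)
```

Losing any single data node in this model still leaves two replicas of every block, which is the sense in which the design assumes disk failures.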
MapReduce
• MapReduce (MR) is a programming model that processes data as key-value pairs.
• The mapper turns input records into intermediate key-value pairs, while the reducer aggregates the values for each key.
• Mappers and reducers do not directly communicate with each other.
Hadoop: The Definitive Guide
Huy Vo, NYU
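The map, group-by-key, and reduce phases can be sketched in plain Python without a cluster. This is a single-process illustration, not the Hadoop API; the function names (mapper, shuffle, reducer, run_job) are made up for the example.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)

def run_job(lines):
    intermediate = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(intermediate).items())

counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

Note that mapper and reducer never call each other. The grouping step in between is the only link, which is what lets the framework run them on different machines.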
Jobs
• Map tasks and reduce tasks are assigned throughout the cluster.
• The Job Tracker manages the status of the job(s).
• Each Task Tracker manages the tasks assigned to it.
Hadoop: The Definitive Guide
Architecture
Hadoop in Action
Demo
• Word Count – the “Hello World” of MapReduce.
• Distributed GREP – Sampling Pattern
• Top Child Star – Summarization Pattern
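The second and third demos can be sketched locally as follows. The demo data is not part of this transcript, so the log lines and the names in the second list are hypothetical stand-ins, and the helper names are invented for the example.

```python
import re
from collections import Counter

def grep_mapper(line, pattern):
    """Map-only job: emit the line if it matches, otherwise emit nothing."""
    if re.search(pattern, line):
        yield line

def top_n(records, n):
    """Count occurrences per key, then keep the n most frequent keys."""
    return Counter(records).most_common(n)

lines = [
    "INFO  job started",
    "ERROR disk failure on node7",
    "INFO  job finished",
    "ERROR timeout on node3",
]
matches = [out for line in lines for out in grep_mapper(line, r"^ERROR")]
print(matches)

# Hypothetical records, one name per appearance.
appearances = ["alice", "bob", "alice", "carol", "alice", "bob"]
print(top_n(appearances, 2))
```

The grep job needs no reducer at all, since each line is kept or dropped independently. The top-N job needs a counting step first, so it has the same shape as word count followed by a selection.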
Bit of History
• Doug Cutting and the Apache Lucene Team
• Google File System (2003) and MapReduce (2004)
• Cutting joined Yahoo! (2006).
• Yahoo announced its search index was being processed by a 10,000-core Hadoop cluster. (2008)
Hadoop: The Definitive Guide
Hadoop Stack
• Core – I/O, serialization, Java RPC
• Avro – data serialization
• MapReduce
• HDFS
• Pig – higher-level data flow language for exploring data on HDFS & MR clusters
• HBase – distributed column-oriented DB
• ZooKeeper – distributed coordination service
• Hive – distributed data warehouse + SQL-like query
• Chukwa – data collection and reports
• Mahout – collection of ML algorithms for HDFS clusters
Hadoop: The Definitive Guide
Configuration Checklist
• Rack management
• Java Installation
• Hadoop download and shell environment tweaking
• SSH + VIM
• Default configuration files:
– core-site.xml, hdfs-site.xml, mapred-site.xml
• Formatting the HDFS
• start-all.sh
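As a sketch of what the three site files typically hold for a single-machine setup, the snippet below renders one property per file. The property names are the Hadoop 1.x-era ones (later releases renamed some, for example fs.default.name became fs.defaultFS); the host names, ports, and replication value are assumptions for a local test install, not settings taken from this talk.

```python
import xml.etree.ElementTree as ET

# Assumed values for a pseudo-distributed (single machine) install.
SETTINGS = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}

def render_site_xml(properties):
    """Build a <configuration> document with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

rendered = {fname: render_site_xml(props) for fname, props in SETTINGS.items()}
for fname, xml_text in rendered.items():
    print(f"--- {fname}\n{xml_text}")
```

The snippet only prints the XML. Writing the files into the Hadoop conf directory is left out on purpose, since the location depends on the installation.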
Hadoop Shell Commands
• hadoop fs -ls
• hadoop jar <jar file> <main class> <input params>
Web-Based Cluster UI
• localhost:50070 – DFS administration (Name Node)
• localhost:50030 – Job administration (Job Tracker)
Hadoop for other languages
• Hadoop Streaming uses Unix standard streams (stdin and stdout) to pass data to and from your program.
– So you can use bash scripts, Ruby, Python, etc.
• Hadoop Pipes is a C++ interface to MR.
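A streaming word count can be sketched as two line-oriented functions. In a real streaming job each would live in its own script, reading sys.stdin and printing to stdout; here they take and return lists of lines so the pipeline can be tried locally. The function names are invented for the example.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: one 'word<TAB>1' output line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for: cat input | mapper | sort | reducer
mapped = list(map_lines(["a b a", "b a"]))
reduced = list(reduce_lines(sorted(mapped)))
print(reduced)
```

The sort between the two steps plays the role of the framework's shuffle. The reducer relies on it, because groupby only merges keys that are next to each other.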
Benefits to AC Technologies Discussion
Report Generation
• Suppose we have HBase for:
– High Availability: Distributed DB
– Partition Tolerance: Auto-Sharding
– Scalability: Horizontal Scaling
• Then, common scenarios will be:
– Service Management & Monitoring:
• Partitioning by month
• Binning by functional category
• Sampling by file status
– Harvest DB
• Top harvested files per day, per site
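Two of these scenarios, partitioning by month and top harvested files per day and site, can be sketched on a handful of records. The record layout (day, site, file name) and all values are hypothetical; the real Harvest DB schema is not described in this talk.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical harvest records: (day, site, file name).
records = [
    (date(2013, 1, 5), "site-a", "report.pdf"),
    (date(2013, 1, 5), "site-a", "report.pdf"),
    (date(2013, 1, 5), "site-a", "notes.txt"),
    (date(2013, 1, 5), "site-b", "notes.txt"),
    (date(2013, 2, 1), "site-a", "notes.txt"),
]

def partition_by_month(rows):
    """Partitioning: route each record to a (year, month) bucket."""
    parts = defaultdict(list)
    for row in rows:
        parts[(row[0].year, row[0].month)].append(row)
    return parts

def top_files(rows, n=1):
    """Top-N per (day, site): group on a composite key, then count names."""
    counts = defaultdict(Counter)
    for day, site, name in rows:
        counts[(day, site)][name] += 1
    return {key: c.most_common(n) for key, c in counts.items()}

by_month = partition_by_month(records)
top = top_files(records)
print({month: len(rows) for month, rows in by_month.items()})
print(top[(date(2013, 1, 5), "site-a")])
```

In a MapReduce job the composite (day, site) key would be the mapper's output key, so each reducer call sees all files for one day and one site.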
Log Mining
• Suppose we have a common repository for all log files (Zenoss)
– Exception counting (WARN – FATAL level)
– Info-level reporting
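Exception counting has the same shape as word count, with the log level as the key. The sketch below assumes log4j-style level names and a "timestamp LEVEL message" line layout; the sample lines are invented, and the actual format in the shared repository may differ.

```python
import re
from collections import Counter

# log4j-style severities from WARN up to FATAL.
SEVERE = ("WARN", "ERROR", "FATAL")
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")

def level_of(line):
    """Return the first log level found in the line, or None."""
    match = LEVEL_RE.search(line)
    return match.group(1) if match else None

def count_severe(lines):
    """Count only lines whose level is WARN or above."""
    return Counter(lvl for lvl in map(level_of, lines) if lvl in SEVERE)

sample = [
    "2013-01-05 10:00:01 INFO  Harvest started",
    "2013-01-05 10:00:07 WARN  Retrying connection",
    "2013-01-05 10:00:09 ERROR Connection refused",
    "2013-01-05 10:00:12 WARN  Retrying connection",
    "2013-01-05 10:00:30 FATAL Out of memory",
]
severe_counts = count_severe(sample)
print(severe_counts)
```

Info-level reporting would use the same mapper with a different filter, for example keeping only INFO lines and keying on the message instead of the level.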
Analytics-Driven Decision Making
• Application Influence Mining through Akamai Logs
– Prevalent and isolated applications over the whole client base
– Prevalent and isolated applications over a single organization
– Projections of Application Influence over time