Introduction to MapReduce and Hadoop


Page 1:

Introduction to MapReduce and Hadoop

Team 3:
Md Liakat Ali
Abdulaziz Altowayan
Andreea Cotoranu
Stephanie Haughton
Gene Locklear
Leslie Meadows

Page 2:

Contents

• Hadoop
• MapReduce
• Download Links
• Install and Tutorials

Page 3:

What is Hadoop?
• A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
o MapReduce – offline computing engine
o HDFS – Hadoop Distributed File System
o HBase (pre-alpha) – online data access
• Here's what makes it especially useful:
o Scalable: it can reliably store and process petabytes.
o Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
o Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
o Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.

Page 4:

What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.

Page 5:

Example Applications and Organizations using Hadoop

• A9.com (Amazon): used to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.

• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2×4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search.

• AOL: used for a variety of things, ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, giving a total of 37 TB of HDFS capacity.

• Facebook: used to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.

• FOX Interactive Media: 3 × 20-machine cluster (8 cores/machine, 2 TB/machine storage); 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining, and machine learning.

• University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.

Page 6:

More Hadoop Applications

• Adknowledge: to build the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.

• Contextweb: to store ad-serving logs and use them as a source for ad optimization, analytics, reporting, and machine learning; 23-machine cluster with 184 cores and about 35 TB of raw storage; each (commodity) node has 8 cores, 8 GB RAM, and 1.7 TB of storage.

• Cornell University Web Lab: generating web graphs on 100 nodes (dual 2.4 GHz Xeon processors, 2 GB RAM, 72 GB hard drive).

• NetSeer: up to 1,000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving, and log analysis.

• The New York Times: large-scale image conversions; EC2 used to run Hadoop on a large virtual cluster.

• Powerset / Microsoft: natural language search; up to 400 instances on Amazon EC2; data storage in Amazon S3.

Page 7:

What is MapReduce?
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting, and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
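In the standard formulation of the model, the two functions have these types; the framework groups all intermediate values sharing a key and hands each group to one Reduce() call:

    map:    (k1, v1)       -> list(k2, v2)
    reduce: (k2, list(v2)) -> list(v2)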

Page 8:

What is MapReduce?
• Pioneered by Google
– Processes 20 PB of data per day
• Sort/merge-based distributed computing
• Initially intended for Google's internal search/indexing application, but now used extensively by many organizations
• Popularized by the open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, …

Page 9:

What is MapReduce used for?

• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection

• In research:
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Bioinformatics (Maryland)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)

Page 10:

MapReduce Goals

1. Scalability to large data volumes:
– Scan 100 TB on 1 node @ 50 MB/s = 24 days
– Scan on a 1,000-node cluster = 35 minutes

2. Cost-efficiency:
– Commodity nodes (cheap, but unreliable)
– Commodity network
– Automatic fault tolerance (fewer admins)
– Easy to use (fewer programmers)
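As a rough check of the arithmetic behind those numbers: 100 TB is about 10^14 bytes, and at 50 MB/s a single node scans 5 × 10^7 bytes per second, so a full scan takes roughly 2 × 10^6 seconds, about 23 days; spread across 1,000 nodes that drops to about 2,000 seconds, roughly 33 minutes, matching the slide's rounded figures once some coordination overhead is allowed for.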

Page 11:

Hadoop Components

• Distributed file system (DFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
o is highly fault-tolerant and is designed to be deployed on low-cost hardware;
o provides high-throughput access to application data and is suitable for applications that have large data sets;
o is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.

Page 12:

Hadoop Distributed File System

• Files split into 128 MB blocks
• Blocks replicated across several datanodes (usually 3)
• Namenode stores metadata (file names, locations, etc.)
• Optimized for large files, sequential reads
• Files are append-only
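To make the model concrete, here is a minimal sketch of talking to HDFS through Hadoop's Java FileSystem API; the namenode address and file path are illustrative assumptions, and in practice fs.defaultFS comes from core-site.xml rather than code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally read from core-site.xml; hardcoded here for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates each one.
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
          out.writeBytes("hello hdfs\n");
        }

        // Read it back as a sequential scan, the access pattern HDFS favors.
        try (BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(path)))) {
          System.out.println(in.readLine());
        }
      }
    }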

Page 13:

Hadoop Components

• MapReduce framework
– Executes user jobs specified as "map" and "reduce" functions
– Manages work distribution and fault tolerance

Page 14:

MapReduce Programming Model

Page 15:

Example: Word Count
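The slide's code is not reproduced in this transcript; the sketch below is the classic word-count pair of functions written against Hadoop's org.apache.hadoop.mapreduce API, with class names chosen for illustration:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // map: for each input line, emit (word, 1) for every token
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // reduce: sum the counts gathered for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }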

Page 16:

Word Count Execution
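The slide's execution diagram is not reproduced here; as an illustration, a two-line input would flow through the stages like this:

    input:   "the quick brown fox" / "the fox ate"
    map:     (the,1) (quick,1) (brown,1) (fox,1) / (the,1) (fox,1) (ate,1)
    shuffle: ate->[1]  brown->[1]  fox->[1,1]  quick->[1]  the->[1,1]
    reduce:  ate 1, brown 1, fox 2, quick 1, the 2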

Page 17:

An Optimization: The Combiner
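The slide's diagram is not reproduced here. A combiner is an optional map-side "mini-reduce" that pre-aggregates intermediate pairs before they cross the network; for word count it can collapse many (word, 1) pairs from one map task into a single (word, n), which is safe because addition is associative and commutative. A driver sketch follows the next slide.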

Page 18:

Word Count with Combiner
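As a sketch of how the combiner is wired in, the driver below reuses the reducer class as the combiner, assuming the WordCount classes from the earlier sketch; the job name and input/output paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner runs map-side, pre-summing (word, 1) pairs before
        // the shuffle; the reducer class can serve as the combiner here
        // because its summing operation is associative and commutative.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }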

Page 19:

Sample applications: Search
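The slide image is not reproduced here. The standard search example is distributed grep: map emits a line whenever it matches the supplied pattern, and reduce is the identity function that simply copies the intermediate data to the output.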

Page 20:

Sample applications: Sort
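The slide image is not reproduced here. The standard sort example leans on the framework itself: map extracts the key from each record, the shuffle phase sorts by key, and reduce emits the records unchanged.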

Page 21:

Apache HBase

o HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java.
o It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS.
o It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data).
o HBase now serves several data-driven websites, including Facebook's Messaging Platform.
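To give a feel for the data model, here is a minimal sketch of storing and fetching one sparse cell with HBase's Java client API; the table, column family, and row names are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {
          // Store one sparse cell: row "user1", family "m", qualifier "subject".
          Put put = new Put(Bytes.toBytes("user1"));
          put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("subject"),
              Bytes.toBytes("hello"));
          table.put(put);

          // Read the cell back by row key.
          Result result = table.get(new Get(Bytes.toBytes("user1")));
          byte[] value = result.getValue(Bytes.toBytes("m"),
              Bytes.toBytes("subject"));
          System.out.println(Bytes.toString(value));
        }
      }
    }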

Page 22:

Get Started with Hadoop Installation

• Download links
• Sandbox
• How to download and install
• Tutorials

http://hortonworks.com/

Page 23:

Hadoop – Download

Running Hadoop on Ubuntu Linux (single-node cluster):
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Sandbox:
http://hortonworks.com/products/hortonworks-sandbox/#install

Some other links:
http://hadoop-skills.com/setup-hadoop/
https://hadoop.apache.org/

Page 24:

Sandbox

• The easiest way to get started with Enterprise Hadoop.
• The Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials to guide you through the basics of Hadoop.
• Includes the Hortonworks Data Platform in an easy-to-use form. We can add our own datasets and connect it to our existing tools and applications.
• We can test new functionality with the Sandbox before we put it into production: simply, easily, and safely.

Page 25:

Download and Install

o Install any one of the following on your host machine:
1. VirtualBox
2. VMware Fusion
3. Hyper-V
o Oracle VirtualBox version 4.2 or later (https://www.virtualbox.org/wiki/Downloads)
o Hortonworks Sandbox virtual appliance for VirtualBox: download the correct virtual appliance file for your environment.

http://hortonworks.com/products/hortonworks-sandbox/#install

Page 26:

Download and Install

o CPU - a 64-bit machine with a multi-core CPU that supports virtualization.
o BIOS - enabled for virtualization support.
o RAM - at least 4 GB of RAM.
o Browsers - Chrome 25+, IE 9+ (the Sandbox will not run on IE 10), Safari 6+.
o The complete installation guide can be found at:
http://hortonworks.com/wp-content/uploads/2015/07/Import_on_Vbox_7_20_2015.pdf