![Page 1: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/1.jpg)
Will Y Lin
Hadoop Product Familyand Ecosystem
![Page 2: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/2.jpg)
Agenda
• What is BigData?
• What is the problem?
• Hadoop– Introduction to Hadoop– Hadoop components– What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
![Page 3: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/3.jpg)
What is BigData?
A set of files A database A single file
![Page 4: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/4.jpg)
Big data Expands on 4 fronts
Velocity
Volume
Variety
Veracity
MB GB TB PBbatch
periodic
near Real-Time
Real-Time
http://whatis.techtarget.com/definition/3Vs
![Page 5: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/5.jpg)
The Data-Driven World
• Modern systems have to deal with far more data than was the case in the past – Organizations are generating huge amounts of data– That data has inherent value, and cannot be discarded
• Examples: – Yahoo – over 170PB of data – Facebook – over 30PB of data – eBay – over 5PB of data
• Many organizations are generating data at a rate of terabytes per day
![Page 6: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/6.jpg)
What is the problem
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the computing power of a single machine– Faster processor, more RAM
• Distributed systems evolved to allow developers to use multiple machines for a single job– At compute time, data is copied to the compute nodes
![Page 7: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/7.jpg)
What is the problem
• Getting the data to the processors becomes the bottleneck
• Quick calculation – Typical disk data transfer rate:
• 75MB/sec – Time taken to transfer 100GB of data
to the processor:
• approx. 22 minutes!
![Page 8: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/8.jpg)
What is the problem
• Failure of a component may cost a lot
• What we need when job fail?– May result in a graceful degradation of application performance,
but entire system does not completely fail– Should not result in the loss of any data– Would not affect the outcome of the job
![Page 9: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/9.jpg)
Hadoop Solutions
The most common problems Hadoop can solve
![Page 10: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/10.jpg)
Threat Analysis/Trade Surveillance
• Challenge: – Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved• Like looking for a needle in a haystack
• Solution with Hadoop: – Parallel processing over huge datasets – Pattern recognition to identify anomalies
• – i.e., threats
• Typical Industry: – Security, Financial Services
![Page 11: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/11.jpg)
Recommendation Engine
• Challenge: – Using user data to predict which products to recommend
• Solution with Hadoop: – Batch processing framework
• Allow execution in in parallel over large datasets
– Collaborative filtering• Collecting ‘taste’ information from many users• Utilizing information to predict what similar users like
• Typical Industry – ISP, Advertising
![Page 12: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/12.jpg)
Walmart Case
Revenue ?
FridayFriday
BeerBeerDiapersDiapers
![Page 13: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/13.jpg)
Hadoop!
![Page 14: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/14.jpg)
• Apache Hadoop project– inspired by Google's MapReduce and Google File System
papers.
• Open sourced , flexible and available architecture for large scale computation and data processing on a network of commodity hardware
• Open Source Software + Hardware Commodity– IT Costs Reduction
– inspired by
![Page 15: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/15.jpg)
Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
![Page 16: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/16.jpg)
Hadoop Components
• Hadoop consists of two core components– The Hadoop Distributed File System (HDFS) – MapReduce Software Framework
• There are many other projects based around core Hadoop– Often referred to as the ‘Hadoop Ecosystem’ – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 17: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/17.jpg)
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Two roles in HDFS– Namenode: Record metadata– Datanode: Store data
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 18: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/18.jpg)
How Files Are Stored: Example
• NameNode holds metadata for the data files
• DataNodes hold the actual blocks • Each block is replicated three
times on the cluster
![Page 19: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/19.jpg)
HDFS: Points To Note
• When a client application wants to read a file:
• It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
• It then communicates directly with the DataNodes to read the data
![Page 20: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/20.jpg)
Hadoop Components: MapReduce
• MapReduce is a method for distributing a task across multiple nodes
• It works like a Unix pipeline:– cat input | grep | sort | uniq -c | cat > output– Input | Map | Shuffle & Sort | Reduce | Output
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 21: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/21.jpg)
Features of MapReduce
• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the ‘housekeeping’ away from the developer – Developer can concentrate simply on writing the Map and
Reduce functions
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 22: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/22.jpg)
Example : word count
• Word count is challenging over massive amounts of data– Using a single compute node would be too time-consuming – Number of unique words can easily exceed the RAM
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• More nodes, more faster
![Page 23: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/23.jpg)
Word Count Example
Key: offsetValue: line
Key: wordValue: count
Key: wordValue: sum of count
0:The cat sat on the mat22:The aardvark sat on the sofa
![Page 24: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/24.jpg)
The Hadoop Ecosystems
![Page 25: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/25.jpg)
Growing Hadoop Ecosystem
• The term ‘Hadoop’ is taken to be the combination of HDFS and MapReduce
• There are numerous other projects surrounding Hadoop– Typically referred to as the ‘Hadoop Ecosystem’
• Zookeeper• Hive and Pig• HBase• Flume • Other Ecosystem Projects
– Sqoop– Oozie– Hue– Mahout
![Page 26: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/26.jpg)
The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
![Page 27: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/27.jpg)
Relation Map
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 28: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/28.jpg)
Zookeeper – Coordination Framework
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 29: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/29.jpg)
What is ZooKeeper
• A centralized service for maintaining – Configuration information– Providing distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data– Status information– Configuration– Location information
![Page 30: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/30.jpg)
Why use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
![Page 31: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/31.jpg)
ZooKeeper Architecture
– All servers store a copy of the data (in memory)
– A leader is elected at startup
– 2 roles – leader and follower
• Followers service clients, all updates go through leader
• Update responses are sent when a majority of servers have persisted the change
– HA support
![Page 32: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/32.jpg)
Hbase – Column NoSQL DB
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 33: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/33.jpg)
Structured-data vs Raw-data
![Page 34: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/34.jpg)
I – Inspired by
• Apache open source project
• Inspired from Google Big Table
• Non-relational, distributed database written in Java
• Coordinated by Zookeeper
![Page 35: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/35.jpg)
Row & Column Oriented
![Page 36: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/36.jpg)
Hbase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
![Page 37: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/37.jpg)
Architecture
• Master Server (HMaster)– Assigns regions to regionservers– Monitors the health of regionservers
• RegionServers– Contain regions and handle client read/write request
![Page 38: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/38.jpg)
Hbase – workflow
![Page 39: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/39.jpg)
When to use HBase
• Need random, low latency access to the data
• Application has a variable schema where each row is slightly different
• Add columns
• Most of columns are NULL in each row
![Page 40: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/40.jpg)
Flume / Sqoop – Data Integration Framework
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 41: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/41.jpg)
What’s the problem for data collection
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own collection path
![Page 42: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/42.jpg)
(and how can it help?)
• A distributed data collection service
• It efficiently collecting, aggregating, and moving large amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats
![Page 43: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/43.jpg)
Flume: High-Level Overview
• Logical Node
• Source
• Sink
![Page 44: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/44.jpg)
Architecture
• basic diagram– one master control multiple node
![Page 45: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/45.jpg)
Architecture
• multiple master control multiple node
![Page 46: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/46.jpg)
An example flow
![Page 47: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/47.jpg)
Flume / Sqoop – Data Integration Framework
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 48: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/48.jpg)
Sqoop
• Easy, parallel database import/export
• What you want do?– Insert data from RDBMS to HDFS– Export data from HDFS back into RDBMS
![Page 49: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/49.jpg)
What is Sqoop
• A suite of tools that connect Hadoop and database systems
• Import tables from databases into HDFS for deep analysis
• Export MapReduce results back to a database for presentation to end-users
• Provides the ability to import from SQL databases straight into your Hive data warehouse
![Page 50: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/50.jpg)
How Sqoop helps
• The Problem– Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)– Easy import of data from many databases to HDFS– Generate code for use in MapReduce applications
![Page 51: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/51.jpg)
Sqoop - import process
![Page 52: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/52.jpg)
Sqoop - export process
• Exports are performed in parallel using MapReduce
![Page 53: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/53.jpg)
Why Sqoop
• JDBC-based implementation– Works with many popular database vendors
• Auto-generation of tedious user-side code– Write MapReduce applications to work with your data, faster
• Integration with Hive– Allows you to stay in a SQL-based environment
![Page 54: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/54.jpg)
Sqoop - JOB
• Job management options
• E.g sqoop job –create myjob –import –connect xxxxxxx--table mytable
![Page 55: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/55.jpg)
Pig / Hive – Analytical Language
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 56: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/56.jpg)
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce– Hive was initially developed at Facebook, Pig at Yahoo!
![Page 57: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/57.jpg)
Hive – Developed by
• What is Hive? – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop– MapRuduce for execution– HDFS for storage
• Hive Query Language– Basic-SQL : Select, From, Join, Group-By– Equi-Join, Muti-Table Insert, Multi-Group-By– Batch query
SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
![Page 58: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/58.jpg)
Pig
• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug A = load ‘a.txt’ as (id, name, age, ...)B = load ‘b.txt’ as (id, address, ...)C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
– Initiated by
![Page 59: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/59.jpg)
Hive vs. Pig
Hive Pig
Language HiveQL (SQL-like) Pig Latin, a scripting language
Schema Table definitions that are stored in a metastore
A schema is optionally defined at runtime
Programmait Access JDBC, ODBC PigServer
![Page 60: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/60.jpg)
• Input
• For the given sample input the map emits
• the reduce just sums up the values
Hello World Bye WorldHello Hadoop Goodbye Hadoop
< Hello, 1> < World, 1> < Bye, 1> < World, 1>< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
WordCount Example
![Page 61: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/61.jpg)
WordCount Example In MapReducepublic class WordCount {public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {String line = value.toString();StringTokenizer tokenizer = new StringTokenizer(line);while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());context.write(word, one);
}}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {int sum = 0;for (IntWritable val : values) {
sum += val.get();}context.write(key, new IntWritable(sum));
}}
public static void main(String[] args) throws Exception {Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");job.setOutputKeyClass(Text.class);job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);}
}
![Page 62: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/62.jpg)
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
![Page 63: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/63.jpg)
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH ’wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT count(*) FROM wordcount GROUP BY token;
![Page 64: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/64.jpg)
Oozie – Job Workflow & Scheduling
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 65: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/65.jpg)
What is ?
• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
Job 1
Job 3
Job 2
Job 4 Job 5
![Page 66: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/66.jpg)
Why
• Why use Oozie instead of just cascading a jobs one after another
• Major flexibility– Start, Stop, Suspend, and re-run jobs
• Oozie allows you to restart from a failure– You can tell Oozie to restart a job from a specific node in the
graph or to skip specific failed nodes
![Page 67: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/67.jpg)
High Level Architecture
• Web Service API
• database store :– Workflow definitions– Currently running workflow instances, including instance states
and variables
Oozie
Hadoop/Pig/HDFS
DB
WS
API
Tomcatweb-app
![Page 68: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/68.jpg)
How it triggered
• Time– Execute your workflow every 15 minutes
• Time and Data– Materialize your workflow every hour, but only run them when
the input data is ready.
00:15 00:30 00:45 01:00
01:00 02:00 03:00 04:00
HadoopInput Data Exists?
![Page 69: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/69.jpg)
Exeample Workflow
![Page 70: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/70.jpg)
Oozie use criteria
• Need Launch, control, and monitor jobs from your Java Apps– Java Client API/Command Line Interface
• Need control jobs from anywhere– Web Service API
• Have jobs that you need to run every hour, day, week
• Need receive notification when a job done– Email when a job is complete
![Page 71: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/71.jpg)
Hue – Web Console
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 72: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/72.jpg)
Hue – developed by
• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI library
![Page 73: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/73.jpg)
Hue
• HUE comes with a suite of applications– File Browser : Browse HDFS; change permissions and
ownership; upload, download, view and edit files.– Job Browser : View jobs, tasks, counters, logs, etc.– Beeswax : Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.
![Page 74: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/74.jpg)
Hue: File Browser UI
![Page 75: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/75.jpg)
Hue: Beewax UI
![Page 76: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/76.jpg)
Mahout – Data Mining
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 77: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/77.jpg)
What is
• Machine-learning tool
• Distributed and scalable machine learning algorithms on the Hadoop platform
• Building intelligent applications easier and faster
![Page 78: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/78.jpg)
Why
• Current state of ML libraries– Lack Community– Lack Documentation and Examples– Lack Scalability– Are Research oriented
![Page 79: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/79.jpg)
Mahout – scale
• Scale to large datasets– Hadoop MapReduce implementations that scales linearly with
data
• Scalable to support your business case– Mahout is distributed under a commercially friendly Apache
Software license
• Scalable community– Vibrant, responsive and diverse
![Page 80: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/80.jpg)
Mahout – four use cases
• Mahout machine learning algorithms– Recommendation mining : takes users’ behavior and find items
said specified user might like– Clustering : takes e.g. text documents and groups them based
on related document topics– Classification : learns from existing categorized documents what
specific category documents look like and is able to assign unlabeled documents to appropriate category
– Frequent item set mining : takes a set of item groups (e.g. terms in query session, shopping cart content) and identifies, which individual items typically appear together
![Page 81: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/81.jpg)
Use case Example
• Predict what the user likes based on– His/Her historical behavior– Aggregate behavior of people similar to him
![Page 82: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/82.jpg)
Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoopecosystem
![Page 83: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/83.jpg)
Recap – Hadoop Ecosystem
MapReduce Runtime(Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zookeeper
(Coordination)
Hbase(Column NoSQL DB)
Sqoop/Flume(Data integration)
Oozie(Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue(Web Console)
Mahout(Data Mining)
![Page 84: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/84.jpg)
Questions?
![Page 85: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2](https://reader033.vdocument.in/reader033/viewer/2022052822/554a5046b4c9054b328b474e/html5/thumbnails/85.jpg)
Thank you !