ICT@PSU 308-471 Data Warehousing and Data Mining 1 of 44
M4: Introduction to Big Data and Hadoop
The only way to do great work is to love what you do. -- Steve Jobs --
Worapot Jakkhupan, PhD | worapot.j@psu.ac.th | Room BSC.0406/7
Information and Communication Technology Programme, Faculty of Science, PSU
Sources: These slides are modified from the Big Data Course by IMC Institute
• http://thanachart.org/2015/05/11
• https://www.dropbox.com/s/i2ubtth2d0swzlf/Bigdata-Intro-IMC.pdf?dl=0
• https://www.dropbox.com/s/kl6g5o8roqhr0j9/Intro-Hadoop.pdf?dl=0
Outline
• Basic concepts of Big Data
• Hadoop ecosystem
  • Comparison between Hadoop 1 and Hadoop 2
• Hadoop core components and cluster
  • MapReduce, HDFS, YARN
• Supporting components
  • Hive, Pig, Flume, Sqoop, Mahout, etc.
• Microsoft HDInsight
  • Set up and case study
Data!
• Computer-generated data
  • Application server logs (web sites, games)
  • Sensor data (weather, water, smart grids)
  • Images/videos (traffic, security cameras)
• Human-generated data
  • Twitter "Firehose" (50 million tweets/day, 1,400% growth per year)
  • Blogs/reviews/emails/pictures
• Examples of big data
  • The New York Stock Exchange generates about 1 TB of new trade data per day.
  • A commercial aircraft generates 3 GB of flight sensor data in 1 hour.
  • Vodafone generates 3 TB of Call Detail Records (CDRs) per day.
  • Between 2009 and 2014, the total number of U.S. online banking households will increase from 54 million to 66 million.
[Infographic: Data are created every minute]
http://www.adweek.com/socialtimes/infographic-what-happens-in-just-one-internet-minute/197386
Big data
• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
• Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
• Big data "size" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data.
• Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.
• Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing.
Characteristics of Big Data
• Volume: The quantity of generated data. The size of the data determines its value and potential, and whether it can actually be considered big data at all; the name 'big data' itself refers to size.
• Variety: The type and nature of the data. Knowing the variety of the content helps the people who analyze the data to use it effectively.
• Velocity: The speed at which the data is generated and processed to meet the demands and challenges of growth and development.
• Variability: The inconsistency the data can show at times, which can hamper the process of handling and managing it effectively.
• Veracity: The quality of captured data, which can vary greatly. Accurate analysis depends on the veracity of source data.
• Complexity: Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey.
Big Data
Source: http://www.datasciencecentral.com/
Why is Big Data Hard (and Getting Harder)?
• Unconstrained growth
• Current systems don’t scale
https://emergingtechblog.emc.com/converged-infrastructure-big-data-storage-analytics/
http://softwarestrategiesblog.com/tag/software/
Why is Big Data Hard (and Getting Harder)?
• Changing data requirements
• Faster response times on fresher data
• Sampling is not good enough, and history is important
• Increasing complexity of analytics
• Users demand inexpensive experimentation
Big Data
Data Sources Technologies Analytics
The Need for new technology
• Size of data is too big
• Much of the data is unstructured
• RDBMSs are inadequate
What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
• A scalable fault-tolerant distributed system for data storage and processing
• Completely written in Java; open source and distributed under the Apache license
Hadoop
• Hadoop is not...
  • ...a relational database!
  • ...an online transaction processing (OLTP) system!
  • ...a structured data store of any kind!
  • ...a replacement for existing data warehouse systems!
  • ...a file system!
• Hadoop is not for all types of work
  • Not good for processing transactions
  • Not good when the work cannot be parallelized
  • Not good for low-latency data access
  • Not good for processing lots of small files
  • Not good for intensive calculation with little data
Hadoop creation history
Hadoop 1.x Core Components
• MapReduce (job scheduling/execution system)
  • MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
• Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• HDFS (Hadoop Distributed File System)
  • HDFS is the default storage for the Hadoop cluster. The data is distributed and replicated over multiple machines.
  • HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
MapReduce
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• A framework provided by Hadoop to process large amounts of data across a cluster of machines in parallel
• MapReduce programs are inherently parallel.
• Technology from Google
• Consists of map and reduce functions
• TaskTracker / JobTracker
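The map and reduce functions just named can be sketched in plain Python. This is a toy, single-machine illustration of the programming model only, not Hadoop's actual Java API; the shuffle step here stands in for the sort-and-group phase the framework performs between map and reduce:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate pairs by key, as the framework would."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data needs hadoop", "hadoop stores big data"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
print(result)  # {'big': 2, 'data': 2, 'hadoop': 2, 'needs': 1, 'stores': 1}
```

On a real cluster the map and reduce calls run as tasks on many nodes, and the framework handles partitioning, sorting, and fault tolerance.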
HDFS
• Default storage for the Hadoop cluster
• Data is distributed and replicated over multiple machines
• HDFS is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
• Access to data files is handled in a streaming manner, meaning that applications or commands are executed directly using the MapReduce processing model
• Designed to handle very large files with streaming data access patterns
• NameNode/DataNode
• Master/slave architecture (1 master, 'n' slaves)
• Large block size (64 MB by default, but configurable), with blocks spread across all the nodes
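As a back-of-the-envelope illustration of the default 64 MB block size and replication factor of 3, the sketch below (the function name is invented for illustration) computes how the 1 TB/day NYSE example from earlier in the module would land on HDFS:

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size in Hadoop 1.x (configurable)
REPLICATION = 3      # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    """Return (block count, replicated storage in MB) for one file.

    HDFS splits a file into fixed-size blocks and stores each block on
    REPLICATION different DataNodes; the last block only occupies the
    bytes it actually holds.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

# 1 TB of new trade data per day (1,048,576 MB):
print(hdfs_footprint(1_048_576))  # (16384, 3145728) -> 16,384 blocks, ~3 TB raw
```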
http://www.glennklockwood.com/data-intensive/hadoop/overview.html
[Figure: Master and Slave]
[Figure: Hadoop 1.x job tracker. Source: Hadoop, Shashwat Shriparv]
Hadoop 1.x Limitations
• Maximum cluster size ~ 4,000 nodes
• Maximum concurrent tasks – 40,000
• JobTracker bottleneck – resource management, job scheduling and monitoring
• Only has one namespace for managing HDFS
• Map and Reduce slots are static
• The only type of job that can run is MapReduce
Hadoop 2
http://www.slideshare.net/uweseiler/hadoop-2-going-beyond-mapreduce
Hadoop 1.x vs. Hadoop 2
Source: Edureka
Hadoop 2
• Potentially up to 10,000 nodes per cluster
• Supports federation (multiple clusters)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• Backward and forward compatible with MRv1
• Any application can integrate with Hadoop
• Beyond Java
YARN: Multi-tenancy
• No more JobTracker / TaskTracker
• Supports MR and non-MR jobs running on the same cluster
YARN Components
• Resource Manager
  • A pluggable scheduler, primarily limited to scheduling
  • ApplicationsManager: manages user jobs on the cluster
• Node Manager
  • Manages users' jobs and workflow on a given node
  • Launches and monitors containers
• Application Master
  • A user-job life-cycle manager
Source: Apache Hadoop™ YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2
[Figure: Hadoop 2 Ecosystem]
Hadoop 2 Ecosystem
• HDFS: Distributed redundant file system.
• MapReduce: Parallel computation on server cluster
• HBase: Column-oriented database scaling to billions of rows.
• Flume: Collection and import of log and event data
• Hive: Data warehouse with SQL-like access
• Pig: A high-level language for expressing data analysis programs.
• Sqoop: Imports data from relational databases
• Oozie: Hadoop workflow.
• Mahout: Machine Learning
• Ambari: Cluster management
• R Connectors: Connects with R
Hadoop Distribution
Big Data on Cloud Roadmap
• Step 1: Build the business case
• Step 2: Assess your Big Data application workloads
• Step 3: Develop a technical approach for deploying and managing Big Data in the cloud
• Step 4: Address governance, security, privacy, and risk
• Step 5: Deploy, integrate, and operationalize your cloud-based Big Data infrastructure
Source : Deploying Big Data Analytics Applications to the Cloud: Roadmap for Success: CSCS
Hadoop as a Service
• Amazon Elastic MapReduce
• Rackspace Cloud Big Data Platform
• Qubole
• Google Cloud Platform
• IBM Bluemix: Analytic on Hadoop
• Microsoft Azure HDInsight
Hadoop in Azure HDInsight
• Big-data analysis and processing in the cloud
• Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability.
• HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.
• Hadoop often refers to the entire Hadoop ecosystem of components, which includes Storm and HBase clusters, as well as other technologies under the Hadoop umbrella.
Hadoop ecosystem on HDInsight
• HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.
• It includes implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on.
• HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
Linux and Windows clusters
• Azure HDInsight deploys and provisions Hadoop clusters in the cloud, using either Linux or Windows as the underlying OS.
  • HDInsight on Linux: A Hadoop cluster on Ubuntu. Use this if you are familiar with Linux or Unix, are migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux.
  • HDInsight on Windows: A Hadoop cluster on Windows Server. Use this if you are familiar with Windows, are migrating from an existing Windows-based Hadoop solution, or want to use .NET or other Windows-only technologies on the cluster.
The following table compares the two:

Category      | Hadoop on Linux                      | Hadoop on Windows
------------- | ------------------------------------ | -------------------------------------
Cluster OS    | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2
Cluster Type  | Hadoop, HBase, Storm                 | Hadoop, HBase, Storm
Deployment    | Azure preview portal, Azure CLI,     | Azure portal, Azure preview portal,
              | Azure PowerShell                     | Azure CLI, Azure PowerShell,
              |                                      | HDInsight .NET SDK
Cluster UI    | Ambari                               | Cluster Dashboard
Remote Access | Secure Shell (SSH), REST API,        | Remote Desktop Protocol (RDP),
              | ODBC, JDBC                           | REST API, ODBC, JDBC
Hadoop, HBase, Storm
• HDInsight provides cluster configurations for Hadoop, HBase, or Storm. Or, you can customize clusters with script actions.
• Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
• HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns.
• Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight.
Sqoop
• Apache Sqoop is a tool that transfers bulk data between Hadoop and structured data stores such as relational databases, as efficiently as possible
• http://sqoop.apache.org/
Pig
• Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets by using a simple scripting language called Pig Latin.
• Pig translates the Pig Latin scripts so they’ll run within Hadoop.
• You can create User Defined Functions (UDFs) to extend Pig Latin.
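As an illustration only, the sketch below expresses in plain Python the kind of filter/group/count pipeline a short Pig Latin script describes; the Pig Latin in the comments and the MovieLens-style fields are assumptions for illustration, not code from the course materials:

```python
from collections import Counter

# A hypothetical Pig Latin script this sketch mirrors:
#   ratings = LOAD 'u.data' AS (user:int, movie:int, rating:int);
#   good    = FILTER ratings BY rating >= 4;
#   grouped = GROUP good BY movie;
#   counts  = FOREACH grouped GENERATE group, COUNT(good);

ratings = [(1, 50, 5), (2, 50, 4), (3, 50, 2), (1, 172, 5)]  # (user, movie, rating)
good = [(u, m, r) for (u, m, r) in ratings if r >= 4]         # FILTER
counts = Counter(m for (_, m, _) in good)                     # GROUP + COUNT
print(dict(counts))  # {50: 2, 172: 1}
```

On a cluster, Pig would compile each of these steps into MapReduce stages and run them over the full dataset in HDFS.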
Hive
• Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL.
• Hive, like Pig, is an abstraction on top of MapReduce.
• When run, Hive translates queries into a series of MapReduce jobs.
• Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data.
• For unstructured data, Pig is the better choice.
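Because HiveQL stays close to standard SQL, the shape of a Hive query can be illustrated with SQLite from the Python standard library; the table and column names below are invented for illustration, and on Hive the same SELECT would be compiled into MapReduce jobs rather than executed locally:

```python
import sqlite3

# The GROUP BY query below is valid in both SQLite and HiveQL; only the
# execution engine differs (local B-tree scan here, MapReduce jobs on Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, volume INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("IBM", 100), ("IBM", 300), ("MSFT", 200)])
rows = conn.execute(
    "SELECT symbol, SUM(volume) FROM trades GROUP BY symbol ORDER BY symbol"
).fetchall()
print(rows)  # [('IBM', 400), ('MSFT', 200)]
```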
Mahout
• Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop.
• Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
Assignment
1. Following these instructions, create an HDInsight Hadoop cluster with 2 head nodes and 2 worker nodes.
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/
2. Following these instructions, upload the MovieLens dataset to the blob storage in HDFS. (I recommend using the Hadoop CLI.)
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
   • http://grouplens.org/datasets/movielens/
3. Answer the questions in the lab guidance and submit it in LMS.
4. After finishing the lab, please delete the HDInsight cluster to save your credits.
[Screenshot notes: 2 worker nodes; choose the cheapest pricing tier for both head and worker nodes; make sure that the price is $1.28 per hour.]