ICT@PSU 308-471 Data Warehousing and Data Mining 1 of 44
M4: Introduction to Big Data and Hadoop
The only way to do great work is to love what you do. -- Steve Jobs --
Worapot Jakkhupan, PhD | worapot.j@psu.ac.th | Room BSC.0406/7
Information and Communication Technology Programme, Faculty of Science, PSU
Sources: These slides are modified from the Big Data Course by IMC Institute
• http://thanachart.org/2015/05/11
• https://www.dropbox.com/s/i2ubtth2d0swzlf/Bigdata-Intro-IMC.pdf?dl=0
• https://www.dropbox.com/s/kl6g5o8roqhr0j9/Intro-Hadoop.pdf?dl=0
Outline
• Basic concepts of Big Data
• Hadoop ecosystem
  • Comparison between Hadoop 1 and Hadoop 2
• Hadoop core components and cluster
  • MapReduce, HDFS, YARN
• Supporting components
  • Hive, Pig, Flume, Sqoop, Mahout, etc.
• Microsoft HDInsight
  • Set up and case study
Data!
• Computer-generated data
  • Application server logs (web sites, games)
  • Sensor data (weather, water, smart grids)
  • Images/videos (traffic, security cameras)
• Human-generated data
  • Twitter "Firehose" (50 million tweets/day, 1,400% growth per year)
  • Blogs/reviews/emails/pictures
• Examples of big data
  • The New York Stock Exchange generates about 1 TB of new trade data per day.
  • A commercial aircraft generates 3 GB of flight sensor data in 1 hour.
  • Vodafone generates 3 TB of Call Detail Records (CDRs) per day.
  • Between 2009 and 2014, the total number of U.S. online banking households will increase from 54 million to 66 million.
[Infographic: Data are created every minute]
http://www.adweek.com/socialtimes/infographic-what-happens-in-just-one-internet-minute/197386
Big data
• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
• Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
• Big data "size" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data.
• Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.
• Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing.
Characteristics of Big Data
• Volume: The quantity of generated data. The size of the data determines its value and potential, and whether it can actually be considered big data at all; the name 'big data' itself refers to size.
• Variety: The type and nature of the data. Knowing the variety of the content helps the people who analyze the data to use it effectively.
• Velocity: The speed at which the data is generated and processed to meet the demands and challenges of growth and development.
• Variability: The inconsistency the data can show at times, which can hamper the process of handling and managing it effectively.
• Veracity: The quality of captured data, which can vary greatly. Accurate analysis depends on the veracity of source data.
• Complexity: Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey.
Big Data
Source: http://www.datasciencecentral.com/
Why is Big Data Hard (and Getting Harder)?
• Unconstrained growth
• Current systems don’t scale
https://emergingtechblog.emc.com/converged-infrastructure-big-data-storage-analytics/
http://softwarestrategiesblog.com/tag/software/
Why is Big Data Hard (and Getting Harder)?
• Changing data requirements
• Faster response times on fresher data
• Sampling is not good enough, and history is important
• Increasing complexity of analytics
• Users demand inexpensive experimentation
Big Data
Data Sources Technologies Analytics
The Need for new technology
• Size of data is too big
• Much of the data is unstructured
• RDBMSs are inadequate
What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
• A scalable fault-tolerant distributed system for data storage and processing
• Completely written in Java; open source and distributed under the Apache license
Hadoop
• Hadoop is not...
  • ...a relational database!
  • ...an online transaction processing (OLTP) system!
  • ...a structured data store of any kind!
  • ...a replacement for existing data warehouse systems!
  • ...a file system!
• Hadoop is not for all types of work
  • Not good for processing transactions
  • Not good when the work cannot be parallelized
  • Not good for low-latency data access
  • Not good for processing lots of small files
  • Not good for intensive calculation with little data
Hadoop creation history
Hadoop 1.x Core Components
• MapReduce (job scheduling/execution system)
  • MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
• Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• HDFS (Hadoop Distributed File System)
  • HDFS is the default storage for the Hadoop cluster. The data is distributed and replicated over multiple machines.
  • HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
MapReduce
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• A framework provided by Hadoop to process large amounts of data across a cluster of machines in parallel
• MapReduce programs are inherently parallel.
• Technology from Google
• Consists of map and reduce functions
• TaskTracker / JobTracker
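The map and reduce functions just named can be sketched in plain Python. This is a toy, single-machine illustration of the programming model only, not Hadoop's actual Java API; the shuffle step here stands in for the sort-and-group phase the framework performs between map and reduce:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate pairs by key, as the framework would."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data needs hadoop", "hadoop stores big data"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
print(result)  # {'big': 2, 'data': 2, 'hadoop': 2, 'needs': 1, 'stores': 1}
```

On a real cluster the map and reduce calls run as tasks on many nodes, and the framework handles partitioning, sorting, and fault tolerance.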
HDFS
• Default storage for the Hadoop cluster
• Data is distributed and replicated over multiple machines
• HDFS is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
• Access to data files is handled in a streaming manner, meaning that applications or commands are executed directly using the MapReduce processing model
• Designed to handle very large files with streaming data access patterns
• NameNode/DataNode
• Master/slave architecture (1 master, 'n' slaves)
• Large block size (64 MB by default, but configurable), with blocks spread across all the nodes
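As a back-of-the-envelope illustration of the default 64 MB block size and replication factor of 3, the sketch below (the function name is invented for illustration) computes how the 1 TB/day NYSE example from earlier in the module would land on HDFS:

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size in Hadoop 1.x (configurable)
REPLICATION = 3      # HDFS default replication factor

def hdfs_footprint(file_size_mb):
    """Return (block count, replicated storage in MB) for one file.

    HDFS splits a file into fixed-size blocks and stores each block on
    REPLICATION different DataNodes; the last block only occupies the
    bytes it actually holds.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

# 1 TB of new trade data per day (1,048,576 MB):
print(hdfs_footprint(1_048_576))  # (16384, 3145728) -> 16,384 blocks, ~3 TB raw
```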
http://www.glennklockwood.com/data-intensive/hadoop/overview.html
[Figure: Master and Slave]
[Figure: Hadoop 1.x job tracker. Source: Hadoop, Shashwat Shriparv]
Hadoop 1.x Limitations
• Maximum cluster size ~ 4,000 nodes
• Maximum concurrent tasks – 40,000
• JobTracker bottleneck – resource management, job scheduling and monitoring
• Only has one namespace for managing HDFS
• Map and Reduce slots are static
• The only type of job that can run is MapReduce
Hadoop 2
http://www.slideshare.net/uweseiler/hadoop-2-going-beyond-mapreduce
Hadoop 1.x vs. Hadoop 2
Source: Edureka
Hadoop 2
• Potentially up to 10,000 nodes per cluster
• Supports federation (multiple clusters)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• Backward and forward compatible with MRv1
• Any application can integrate with Hadoop
• Beyond Java
YARN: Multi-tenancy
• No more JobTracker / TaskTracker
• Supports MR and non-MR jobs running on the same cluster
YARN Components
• Resource Manager
  • A pluggable scheduler, primarily limited to scheduling
  • ApplicationsManager: manages user jobs on the cluster
• Node Manager
  • Manages users' jobs and workflow on a given node
  • Launches and monitors containers
• Application Master
  • A user-job life-cycle manager
Source: Apache Hadoop™ YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2
[Figure: Hadoop 2 Ecosystem]
Hadoop 2 Ecosystem
• HDFS: Distributed redundant file system.
• MapReduce: Parallel computation on server cluster
• HBase: Column-oriented database scaling to billions of rows.
• Flume: Collection and import of log and event data
• Hive: Data warehouse with SQL-like access
• Pig: A high-level language for expressing data analysis programs.
• Sqoop: Imports data from relational databases
• Oozie: Hadoop workflow.
• Mahout: Machine Learning
• Ambari: Cluster management
• R Connectors: Connects with R
Hadoop Distribution
Big Data on Cloud Roadmap
• Step 1: Build the business case
• Step 2: Assess your Big Data application workloads
• Step 3: Develop a technical approach for deploying and managing Big Data in the cloud
• Step 4: Address governance, security, privacy, and risk
• Step 5: Deploy, integrate, and operationalize your cloud-based Big Data infrastructure
Source : Deploying Big Data Analytics Applications to the Cloud: Roadmap for Success: CSCS
Hadoop as a Service
• Amazon Elastic MapReduce
• Rackspace Cloud Big Data Platform
• Qubole
• Google Cloud Platform
• IBM Bluemix: Analytic on Hadoop
• Microsoft Azure HDInsight
Hadoop in Azure HDInsight
• Big-data analysis and processing in the cloud
• Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability.
• HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.
• Hadoop often refers to the entire Hadoop ecosystem of components, which includes Storm and HBase clusters, as well as other technologies under the Hadoop umbrella.
Hadoop ecosystem on HDInsight
• HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.
• It includes implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on.
• HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
Linux and Windows clusters
• Azure HDInsight deploys and provisions Hadoop clusters in the cloud, using either Linux or Windows as the underlying OS.
  • HDInsight on Linux: A Hadoop cluster on Ubuntu. Use this if you are familiar with Linux or Unix, are migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux.
  • HDInsight on Windows: A Hadoop cluster on Windows Server. Use this if you are familiar with Windows, are migrating from an existing Windows-based Hadoop solution, or want to use .NET or other Windows-only technologies on the cluster.
The following table compares the two:

Category      | Hadoop on Linux                      | Hadoop on Windows
------------- | ------------------------------------ | -------------------------------------
Cluster OS    | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2
Cluster Type  | Hadoop, HBase, Storm                 | Hadoop, HBase, Storm
Deployment    | Azure preview portal, Azure CLI,     | Azure portal, Azure preview portal,
              | Azure PowerShell                     | Azure CLI, Azure PowerShell,
              |                                      | HDInsight .NET SDK
Cluster UI    | Ambari                               | Cluster Dashboard
Remote Access | Secure Shell (SSH), REST API,        | Remote Desktop Protocol (RDP),
              | ODBC, JDBC                           | REST API, ODBC, JDBC
Hadoop, HBase, Storm
• HDInsight provides cluster configurations for Hadoop, HBase, or Storm. Or, you can customize clusters with script actions.
• Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
• HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns.
• Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight.
Sqoop
• Apache Sqoop is a tool that transfers bulk data between Hadoop and structured data stores such as relational databases, as efficiently as possible
• http://sqoop.apache.org/
Pig
• Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets by using a simple scripting language called Pig Latin.
• Pig translates the Pig Latin scripts so they’ll run within Hadoop.
• You can create User Defined Functions (UDFs) to extend Pig Latin.
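As an illustration only, the sketch below expresses in plain Python the kind of filter/group/count pipeline a short Pig Latin script describes; the Pig Latin in the comments and the MovieLens-style fields are assumptions for illustration, not code from the course materials:

```python
from collections import Counter

# A hypothetical Pig Latin script this sketch mirrors:
#   ratings = LOAD 'u.data' AS (user:int, movie:int, rating:int);
#   good    = FILTER ratings BY rating >= 4;
#   grouped = GROUP good BY movie;
#   counts  = FOREACH grouped GENERATE group, COUNT(good);

ratings = [(1, 50, 5), (2, 50, 4), (3, 50, 2), (1, 172, 5)]  # (user, movie, rating)
good = [(u, m, r) for (u, m, r) in ratings if r >= 4]         # FILTER
counts = Counter(m for (_, m, _) in good)                     # GROUP + COUNT
print(dict(counts))  # {50: 2, 172: 1}
```

On a cluster, Pig would compile each of these steps into MapReduce stages and run them over the full dataset in HDFS.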
Hive
• Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL.
• Hive, like Pig, is an abstraction on top of MapReduce.
• When run, Hive translates queries into a series of MapReduce jobs.
• Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data.
• For unstructured data, Pig is the better choice.
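Because HiveQL stays close to standard SQL, the shape of a Hive query can be illustrated with SQLite from the Python standard library; the table and column names below are invented for illustration, and on Hive the same SELECT would be compiled into MapReduce jobs rather than executed locally:

```python
import sqlite3

# The GROUP BY query below is valid in both SQLite and HiveQL; only the
# execution engine differs (local B-tree scan here, MapReduce jobs on Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, volume INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("IBM", 100), ("IBM", 300), ("MSFT", 200)])
rows = conn.execute(
    "SELECT symbol, SUM(volume) FROM trades GROUP BY symbol ORDER BY symbol"
).fetchall()
print(rows)  # [('IBM', 400), ('MSFT', 200)]
```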
Mahout
• Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop.
• Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
Assignment
1. Following these instructions, create an HDInsight Hadoop cluster with 2 head nodes and 2 worker nodes.
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/
2. Following these instructions, upload the MovieLens dataset to the blob storage in HDFS. (I recommend using the Hadoop CLI.)
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
   • http://grouplens.org/datasets/movielens/
3. Answer the questions in the lab guidance and submit it in LMS.
4. After finishing the lab, please delete the HDInsight cluster to save your credits.
[Screenshot notes: 2 worker nodes; choose the cheapest pricing tier for both head and worker nodes; make sure that the price is $1.28 per hour.]