
Page 1: 308471 ch4 intro-big_data-hadoop


M4: Introduction to Big Data and Hadoop

The only way to do great work is to love what you do. -- Steve Jobs --

Worapot Jakkhupan, PhD
[email protected]
Room BSC.0406/7

Information and Communication Technology Programme, Faculty of Science, PSU

Page 2: 308471 ch4 intro-big_data-hadoop


These slides are modified from the Big Data course by the IMC Institute.

Sources:

• http://thanachart.org/2015/05/11
• https://www.dropbox.com/s/i2ubtth2d0swzlf/Bigdata-Intro-IMC.pdf?dl=0
• https://www.dropbox.com/s/kl6g5o8roqhr0j9/Intro-Hadoop.pdf?dl=0

Page 3: 308471 ch4 intro-big_data-hadoop


Outline

• Basic concepts of Big Data
• Hadoop Ecosystem
  • Comparison between Hadoop 1 and Hadoop 2
• Hadoop Core Components and Cluster
  • MapReduce, HDFS, YARN
• Supporting components
  • Hive, Pig, Flume, Sqoop, Mahout, etc.
• Microsoft HDInsight
  • Set-up and case study

Page 4: 308471 ch4 intro-big_data-hadoop


Data!

• Computer-generated data
  • Application server logs (web sites, games)
  • Sensor data (weather, water, smart grids)
  • Images/videos (traffic, security cameras)

• Human-generated data
  • Twitter "Firehose" (50 million tweets/day, 1,400% growth per year)
  • Blogs/Reviews/Emails/Pictures

• Examples of Big Data
  • The New York Stock Exchange generates about 1 TB of new trade data per day.
  • A commercial aircraft generates 3 GB of flight sensor data per hour.
  • Vodafone generates 3 TB of Call Detail Records (CDRs) per day.
  • Between 2009 and 2014, the total number of U.S. online banking households will increase from 54 million to 66 million.

Page 5: 308471 ch4 intro-big_data-hadoop


Infographic: Data are created every minute.
Source: http://www.adweek.com/socialtimes/infographic-what-happens-in-just-one-internet-minute/197386

Page 6: 308471 ch4 intro-big_data-hadoop


Big data

• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
• Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.
• Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
• Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of massive scale.
• Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing.

Page 7: 308471 ch4 intro-big_data-hadoop


Characteristics of Big Data

• Volume: The quantity of generated data. The size of the data determines its value and potential, and whether it can actually be considered big data at all. The name 'big data' itself contains a term related to size, hence this characteristic.
• Variety: The type and nature of the content, which analysts must understand in order to use the data effectively.
• Velocity: The speed at which the data is generated and processed to meet demand.
• Variability: The inconsistency the data can show at times, which can hamper the process of handling and managing the data effectively.
• Veracity: The quality of captured data, which can vary greatly. Accurate analysis depends on the veracity of the source data.
• Complexity: Data management can be very complex, especially when large volumes of data come from multiple sources. Data must be linked, connected, and correlated so users can grasp the information the data is supposed to convey.

Page 8: 308471 ch4 intro-big_data-hadoop


Big Data

Source: http://www.datasciencecentral.com/

Page 9: 308471 ch4 intro-big_data-hadoop


Why is Big Data Hard (and Getting Harder)?

• Unconstrained growth

• Current systems don’t scale

https://emergingtechblog.emc.com/converged-infrastructure-big-data-storage-analytics/

http://softwarestrategiesblog.com/tag/software/

Page 10: 308471 ch4 intro-big_data-hadoop


Why is Big Data Hard (and Getting Harder)?

• Changing data requirements

• Faster response times on fresher data

• Sampling is not good enough, and history is important

• Increasing complexity of analytics

• Users demand inexpensive experimentation

Page 11: 308471 ch4 intro-big_data-hadoop


Big Data

• Data Sources
• Technologies
• Analytics

Page 12: 308471 ch4 intro-big_data-hadoop


The Need for New Technology

• The size of the data is too big

• Much of the data is unstructured

• RDBMSs are inadequate

Page 13: 308471 ch4 intro-big_data-hadoop


What is Hadoop?

• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
• A scalable, fault-tolerant distributed system for data storage and processing.
• Completely written in Java; open source and distributed under the Apache license.

Page 14: 308471 ch4 intro-big_data-hadoop


Hadoop

• Hadoop is not...
  • ...a relational database!
  • ...an online transaction processing (OLTP) system!
  • ...a structured data store of any kind!
  • ...a replacement for existing data warehouse systems!
  • ...a file system!

• Hadoop is not for all types of work:
  • Not good for processing transactions
  • Not good when the work cannot be parallelized
  • Not good for low-latency data access
  • Not good for processing lots of small files
  • Not good for intensive calculation on little data

Page 15: 308471 ch4 intro-big_data-hadoop


Hadoop creation history

Page 16: 308471 ch4 intro-big_data-hadoop


Hadoop 1.x Core Components

• MapReduce (job scheduling/execution system)
  • MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
  • Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

• HDFS (Hadoop Distributed File System)
  • HDFS is the default storage for a Hadoop cluster. The data is distributed and replicated over multiple machines.
  • HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

Page 17: 308471 ch4 intro-big_data-hadoop


MapReduce

• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• A framework provided by Hadoop to process large amounts of data across a cluster of machines in a parallel manner.
• MapReduce programs are inherently parallel.
• Technology from Google.
• Consists of a map function and a reduce function; a minimal word-count sketch is shown below.
• TaskTracker / JobTracker
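
The canonical example is word count: map emits a (word, 1) pair for every token, and reduce sums the pairs per word. A minimal sketch against the Hadoop 2 Java MapReduce API, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: tokenize each input line and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // emit (word, 1)
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}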

Page 18: 308471 ch4 intro-big_data-hadoop


HDFS

• Default storage for the Hadoop cluster.
• Data is distributed and replicated over multiple machines.
• HDFS is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
• Access to data files is handled in a streaming manner, meaning that applications or commands are executed directly using the MapReduce processing model.
• Designed to handle very large files with streaming data access patterns.
• NameNode / DataNode
• Master/slave architecture (1 master, 'n' slaves)
• Files are split into large blocks (64 MB by default, but configurable) spread across all the nodes.
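
As a concrete sketch, applications reach HDFS through the org.apache.hadoop.fs.FileSystem API. The paths below are hypothetical placeholders, and the snippet assumes the cluster's core-site.xml is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copy a local file into HDFS and list the target directory.
public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // connects to the default FS (HDFS)

    Path local = new Path("/tmp/sample.txt");         // hypothetical local file
    Path remote = new Path("/user/demo/sample.txt");  // hypothetical HDFS path
    fs.copyFromLocalFile(local, remote);

    // HDFS splits the file into blocks and replicates each block across
    // DataNodes; the NameNode keeps only the metadata consulted here.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }
    fs.close();
  }
}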

Page 19: 308471 ch4 intro-big_data-hadoop


http://www.glennklockwood.com/data-intensive/hadoop/overview.html

Page 20: 308471 ch4 intro-big_data-hadoop


Master and Slave

Page 21: 308471 ch4 intro-big_data-hadoop


Hadoop 1.x = JobTracker

Source: Hadoop, Shashwat Shriparv

Page 22: 308471 ch4 intro-big_data-hadoop


Hadoop 1.x Limitations

• Maximum cluster size ~ 4,000 nodes

• Maximum concurrent tasks – 40,000

• JobTracker bottleneck – resource management, job scheduling and monitoring

• Only has one namespace for managing HDFS

• Map and Reduce slots are static

• The only type of job that can run is MapReduce

Page 23: 308471 ch4 intro-big_data-hadoop


Hadoop 2

http://www.slideshare.net/uweseiler/hadoop-2-going-beyond-mapreduce

Page 24: 308471 ch4 intro-big_data-hadoop


Hadoop 1.x vs. Hadoop 2

Source: Edureka

Page 25: 308471 ch4 intro-big_data-hadoop


Hadoop 2

• Potentially up to 10,000 nodes per cluster

• Supports federation (multiple clusters)

• Supports multiple namespaces for managing HDFS

• Efficient cluster utilization (YARN)

• Backward and forward compatible with MRv1

• Any application can integrate with Hadoop

• Beyond Java

Page 26: 308471 ch4 intro-big_data-hadoop


YARN: Multi-tenancy

• No more JobTracker / TaskTracker

• Supports MR and non-MR applications running on the same cluster

Page 27: 308471 ch4 intro-big_data-hadoop


YARN Components

• Resource Manager
  • A pluggable Scheduler, primarily limited to scheduling
  • An ApplicationsManager, which manages user jobs on the cluster

• Node Manager
  • Manages containers and users' jobs and workflow on a given node

• Application Master
  • A per-application user job life-cycle manager

Page 28: 308471 ch4 intro-big_data-hadoop


Source: Apache Hadoop™ YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2

Page 29: 308471 ch4 intro-big_data-hadoop


Hadoop 2 Ecosystem

Page 30: 308471 ch4 intro-big_data-hadoop


Hadoop 2 Ecosystem

• HDFS: Distributed, redundant file system
• MapReduce: Parallel computation on a server cluster
• HBase: Column-oriented database scaling to billions of rows
• Flume: Collection and import of log and event data
• Hive: Data warehouse with SQL-like access
• Pig: A high-level language for expressing data analysis programs
• Sqoop: Imports data from relational databases
• Oozie: Hadoop workflow scheduler
• Mahout: Machine learning
• Ambari: Cluster management
• R Connectors: Connect Hadoop with R

Page 31: 308471 ch4 intro-big_data-hadoop


Hadoop Distribution

Page 32: 308471 ch4 intro-big_data-hadoop


Big Data on Cloud Roadmap

• Step 1: Build the business case

• Step 2: Assess your Big Data application workloads

• Step 3: Develop a technical approach for deploying and managing Big Data in the cloud

• Step 4: Address governance, security, privacy, and risk

• Step 5: Deploy, integrate, and operationalize your cloud-based Big Data infrastructure

Source: Deploying Big Data Analytics Applications to the Cloud: Roadmap for Success, Cloud Standards Customer Council (CSCC)

Page 33: 308471 ch4 intro-big_data-hadoop


Hadoop as a Service

• Amazon Elastic MapReduce (EMR)

• Rackspace Cloud Big Data Platform

• Qubole

• Google Cloud Platform

• IBM Bluemix: Analytics on Hadoop

• Microsoft Azure HDInsight

Page 34: 308471 ch4 intro-big_data-hadoop


Hadoop in Azure HDInsight

• Big-data analysis and processing in the cloud

• Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability.

• HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.

• Hadoop often refers to the entire Hadoop ecosystem of components, which includes Storm and HBase clusters, as well as other technologies under the Hadoop umbrella.

Page 35: 308471 ch4 intro-big_data-hadoop


Hadoop ecosystem on HDInsight

• HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.

• It includes implementations of Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on.

• HDInsight also integrates with business intelligence (BI) tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.

Page 36: 308471 ch4 intro-big_data-hadoop


Linux and Windows clusters

• Azure HDInsight deploys and provisions Hadoop clusters in the cloud, using either Linux or Windows as the underlying OS.
  • HDInsight on Linux: A Hadoop cluster on Ubuntu. Use this if you are familiar with Linux or Unix, are migrating from an existing Linux-based Hadoop solution, or want easy integration with Hadoop ecosystem components built for Linux.
  • HDInsight on Windows: A Hadoop cluster on Windows Server. Use this if you are familiar with Windows, are migrating from an existing Windows-based Hadoop solution, or want to use .NET or other Windows-only technologies on the cluster.

Page 37: 308471 ch4 intro-big_data-hadoop


The following table compares the two:

Category       | Hadoop on Linux                       | Hadoop on Windows
Cluster OS     | Ubuntu 12.04 Long Term Support (LTS)  | Windows Server 2012 R2
Cluster Type   | Hadoop, HBase, Storm                  | Hadoop, HBase, Storm
Deployment     | Azure preview portal, Azure CLI,      | Azure portal, Azure preview portal,
               | Azure PowerShell                      | Azure CLI, Azure PowerShell,
               |                                       | HDInsight .NET SDK
Cluster UI     | Ambari                                | Cluster Dashboard
Remote Access  | Secure Shell (SSH), REST API,         | Remote Desktop Protocol (RDP),
               | ODBC, JDBC                            | REST API, ODBC, JDBC

Page 38: 308471 ch4 intro-big_data-hadoop


Hadoop, HBase, Storm

• HDInsight provides cluster configurations for Hadoop, HBase, or Storm. Or, you can customize clusters with script actions.
• Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
• HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns.
• Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight.

Page 39: 308471 ch4 intro-big_data-hadoop


Sqoop

• Apache Sqoop is a tool that transfers bulk data between Hadoop and structured data stores, such as relational databases, as efficiently as possible.

• http://sqoop.apache.org/
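
As a hedged sketch of what a typical import looks like, the same arguments accepted by the sqoop import command line can be passed to Sqoop's Java entry point; the JDBC URL, username, table, and target directory below are hypothetical placeholders:

import org.apache.sqoop.Sqoop;

// Import one relational table into HDFS, equivalent to "sqoop import ...".
// Assumes the Hadoop configuration and the JDBC driver are on the classpath.
public class SqoopImportExample {
  public static void main(String[] args) {
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales", // hypothetical source database
        "--username", "demo",
        "--table", "orders",                      // hypothetical table
        "--target-dir", "/user/demo/orders",      // hypothetical HDFS directory
        "--num-mappers", "4"                      // parallel map tasks doing the copy
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}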

Page 40: 308471 ch4 intro-big_data-hadoop


Pig

• Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets by using a simple scripting language called Pig Latin.

• Pig translates the Pig Latin scripts so they’ll run within Hadoop.

• You can create User Defined Functions (UDFs) to extend Pig Latin.
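
A minimal word-count sketch in Pig Latin, driven from Java via the PigServer API; the input path and output directory are hypothetical, and the same four statements could instead live in a standalone .pig script:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Each registerQuery adds one Pig Latin statement; store() triggers
// translation of the whole plan into MapReduce jobs.
public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    pig.registerQuery("lines = LOAD '/user/demo/input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "/user/demo/wordcount-out");
  }
}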

Page 41: 308471 ch4 intro-big_data-hadoop


Hive

• Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage by using a SQL-like language called HiveQL.
• Hive, like Pig, is an abstraction on top of MapReduce.
• When run, Hive translates queries into a series of MapReduce jobs.
• Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data.
• For unstructured data, Pig is the better choice.
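
As a sketch, HiveQL can be submitted through the standard HiveServer2 JDBC driver; the host, credentials, and the words table here are hypothetical placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Run one HiveQL query against HiveServer2 and print the results.
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();

    // Hive translates this query into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
    }
    con.close();
  }
}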

Page 42: 308471 ch4 intro-big_data-hadoop


Mahout

• Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop.

• Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
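
For illustration, a minimal user-based recommender sketch with Mahout's Taste API (Mahout 0.x). The ratings.csv file name and its userID,itemID,rating layout are assumptions, matching what a reformatted MovieLens ratings file would look like; this sketch runs locally, while Mahout also provides Hadoop-based implementations of many algorithms:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Recommend items to a user based on the 10 most similar users.
public class MovieRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // hypothetical file
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top-3 item recommendations for user 1.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}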

Page 43: 308471 ch4 intro-big_data-hadoop


Assignment

1. Following this instruction, create an HDInsight Hadoop cluster with 2 head nodes and 2 worker nodes.
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/

2. Following this instruction, upload a MovieLens dataset to the blob storage in HDFS. (I recommend using the Hadoop CLI.)
   • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/
   • http://grouplens.org/datasets/movielens/

3. Answer the questions in the lab guidance, and submit it in LMS.

4. After finishing the lab, please delete the HDInsight cluster to save your credits.

Page 44: 308471 ch4 intro-big_data-hadoop


• 2 worker nodes
• Choose the cheapest pricing tier (both head and worker nodes)
• Make sure that the price is $1.28 per hour