
Page 1: Introduction to Hadoop, Hive, an d Apache Sparkteaching.csse.uwa.edu.au/units/CITS3402/lectures/CITS3402-Spark.pdf · Why Hadoop? • Hadoop is a platorm for storage and processinghuge

Introduction to Hadoop, Hive, and Apache Spark

Concepts and Tools

September 2018


Outline

• Overview
• MapReduce Framework
• HDFS Framework
• Hadoop Cluster Mechanisms
• Relevant Technologies
  – Hive, Sqoop, Pig and NoSQL
• Apache Spark

What and Why?

How?


Overview of Hadoop


Why Hadoop?

• Hadoop is a platform for storing and processing huge datasets distributed across clusters of commodity machines.
• Two core components of Hadoop:
  – Processing engine (traditionally MapReduce, more recently Spark)
  – HDFS (Hadoop Distributed File System)


Why Hadoop (cont.)?

• Hadoop addresses “big data” challenges.
• “Big data” creates substantial business value today.
• Various industries face “big data” challenges.

Without an efficient data processing approach, the data cannot create business value.
  – Many industries end up creating large amounts of data that they are unable to gain any insight from.

*http://wikibon.org/


Big Data!!

• What is “big data”?
• One SKA Survey will generate a data product of 4 EB.
• The DINGO uv grid dataset is ~4 PB.
• General requirement for SKA Phase 1:
  – Initial bare minimum 600 PB
  – Annual increase of at least 1 EB per year



Core Components of Hadoop

• MapReduce/Spark
  – An efficient programming framework for processing parallelizable problems across huge datasets using a large number of commodity machines.

• HDFS
  – A distributed file system designed to efficiently allocate data across multiple commodity machines, and to provide self-healing functions when some of them go down.

               Commodity machine    Supercomputer
Performance    Low                  High
Cost           Low                  High
Availability   Readily available    Hard to obtain


MapReduce Framework

• Map:
  – Extract something of interest from each chunk of records.
• Reduce:
  – Aggregate the intermediate outputs from the Map process.
• Map and Reduce are instantiated differently for different problems.

General framework


MapReduce Framework

• Inputs and outputs of Mappers and Reducers are key-value pairs <k,v>.

• Programmers must code according to the MapReduce model:
  – Specify the Map method
  – Specify the Reduce method
  – Define the intermediate outputs in <k,v> format.
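The three steps a programmer specifies can be sketched with a word count. This is a minimal single-process illustration in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and the grouping step that the framework normally performs between Map and Reduce is written out by hand.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a <word, 1> pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all intermediate values that share a key."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["Hadoop stores data", "Spark processes data"])
print(counts)
```

Every intermediate and final value here is a <k,v> pair, which is the contract the slide describes; on a real cluster the map calls and the reduce calls would run on different machines.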


HDFS Framework

• Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system for Hadoop.
  – Data storage infrastructure of a Hadoop cluster
  – Hadoop ≈ Processing engine (MapReduce or Spark) + HDFS

• Specifically designed to work with MapReduce/Spark.

• Major assumptions:
  – Large data sets.
  – Hardware failure.
  – Write once, read many.


HDFS Framework

• Key features of HDFS:
  – Fault tolerance: automatically and seamlessly recover from failures
  – Data replication: to provide redundancy
  – Load balancing: place data intelligently for maximum efficiency and utilization
  – Scalability: add servers to increase capacity
  – “Moving computations is cheaper than moving data.”
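The replication and fault-tolerance features can be illustrated with a toy placement model. This is a sketch under simplifying assumptions: blocks are assigned to DataNodes round-robin, whereas real HDFS uses a rack-aware placement policy, and the block size, node names and helper functions are hypothetical values chosen for the example.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def blocks_lost(placement, failed_node):
    """Blocks left without any replica after a single DataNode fails."""
    return [b for b, nodes in placement.items() if set(nodes) <= {failed_node}]


blocks = split_into_blocks(file_size=300, block_size=128)   # toy sizes
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
print(blocks)
print(placement)
print(blocks_lost(placement, "dn2"))
```

Because every block lives on several distinct nodes, losing one node leaves at least one copy of each block, which is what lets the file system re-replicate and recover without user intervention.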


HDFS Framework

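• Components of HDFS:
  – DataNodes
    • Store the data with optimized redundancy.
  – JournalNodes
    • In a high-availability setup, hold the shared edit log that keeps the active and standby NameNodes in sync.
  – NameNode(s)
    • Manages the DataNodes.
    • Secondary NameNode – periodically checkpoints the namespace metadata; automatic failover relies on a standby NameNode rather than on the Secondary NameNode.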


Hadoop vs RDBMS

• Many businesses are turning from RDBMS to Hadoop-based systems for data management.

• In short, if a business needs to process and analyze large-scale, real-time data, choose Hadoop. Otherwise, staying with an RDBMS is still a wise choice.

              Hadoop-based                              RDBMS
Data format   Structured & unstructured                 Mostly structured
Scalability   Very high                                 Limited
Speed         Fast for large-scale data                 Very fast for small-to-medium size data
Analytics     Powerful analytical tools for big data    Some limited built-in analytics


Hadoop vs Other Distributed Systems

• Common challenges in distributed systems
  – Component failure
    • Individual computer nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space.
  – Network congestion
    • Data may not arrive at a particular point in time.
  – Communication failure
    • Multiple implementations or versions of client software may speak slightly different protocols from one another.
  – Security
    • Data may be corrupted, or maliciously or improperly transmitted.
  – Failover modes
    • In Hadoop, failover is automated.


Hadoop vs Other Distributed Systems

• Hadoop
  – Uses an efficient programming model.
  – Efficient, automatic distribution of data and work across machines.
  – Good at handling component failure and network congestion problems.
  – Weak on security issues. (Although…)


Hadoop Cluster Architecture


Cloudera

• A platform that integrates many Hadoop-based products and services.


Hadoop Architecture (1)

• Hadoop has a master/slave architecture.
• Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
  – These are the masters.
• The rest of the machines in the cluster act as both DataNode and TaskTracker.
  – These are the slaves.


Hadoop Architecture (2)

• NameNode (master)
  – Manages the file system namespace.
  – Executes file system namespace operations like opening, closing, and renaming files and directories.
  – Determines the mapping of data chunks to DataNodes.
  – Monitors DataNodes by receiving heartbeats.

• DataNodes (slaves)
  – Manage storage attached to the nodes that they run on.
  – Serve read and write requests from the file system’s clients.
  – Perform block creation, deletion, and replication upon instruction from the NameNode.


Hadoop Architecture (3)

• JobTracker (master)
  – Receives jobs from the client.
  – Talks to the NameNode to determine the location of the data.
  – Manages and schedules the entire job.
  – Splits and assigns tasks to slaves (TaskTrackers).
  – Monitors the slave nodes by receiving heartbeats.

• TaskTrackers (slaves)
  – Manage individual tasks assigned by the JobTracker, including Map operations and Reduce operations.
  – Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept.
  – Send out heartbeat messages to the JobTracker to signal that they are still alive.
  – Notify the JobTracker when a task succeeds or fails.
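Heartbeat-based monitoring can be sketched as a table of "last seen" times held by the master. The class below is a hypothetical illustration, not Hadoop code: the name `HeartbeatMonitor`, the timeout value, and the use of plain numbers as timestamps are all assumptions made to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Master-side bookkeeping of the last heartbeat received per worker."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # worker name -> time of most recent heartbeat

    def heartbeat(self, worker, now):
        """Record that `worker` reported in at time `now`."""
        self.last_seen[worker] = now

    def dead_workers(self, now):
        """Workers whose last heartbeat is older than the timeout."""
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("tasktracker-1", now=0)
monitor.heartbeat("tasktracker-2", now=0)
monitor.heartbeat("tasktracker-1", now=8)   # tasktracker-2 stays silent
print(monitor.dead_workers(now=12))
```

A master that finds a worker in the dead list would reschedule that worker's tasks elsewhere; the same pattern applies to the NameNode watching its DataNodes.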


Hadoop Architecture

• Example 1

[Figure: cluster diagram in which the NameNode and the JobTracker are labelled as masters]


Cluster service layout


Hadoop Resource Management

• Apache YARN (Yet Another Resource Negotiator)
• Decouples resource management from data processing requirements
• Provides resources for any processing framework compatible with Hadoop
• Resource Manager – dedicated scheduler, resides on one of the service nodes
• Node Manager daemons on each of the worker nodes in the cluster


YARN

Yet Another Resource Negotiator

• Scheduler
• Resource Manager


Zookeeper

• Zookeeper: a cluster management tool that supports coordination between nodes in a distributed system.
  – When designing a Hadoop-based application, a lot of coordination work needs to be considered. Writing this functionality is difficult.

• Zookeeper provides services that can be used to develop distributed applications.

• Zookeeper provides services such as:
  – Configuration management
  – Synchronization
  – Group services
  – Leader election
  – Etc.
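As an illustration of one of these services, leader election is often described in terms of each participant receiving an increasing sequence number, with the lowest live number acting as leader. The sketch below simulates that idea in plain Python; it is not the Zookeeper client API, and `ElectionGroup` and its methods are hypothetical names.

```python
class ElectionGroup:
    """In-memory stand-in for a set of sequentially numbered members."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> member name

    def join(self, name):
        """Register a member and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Remove a member, e.g. because its session ended."""
        self._members.pop(seq, None)

    def leader(self):
        """The member holding the lowest sequence number, or None."""
        if not self._members:
            return None
        return self._members[min(self._members)]


group = ElectionGroup()
seq_a = group.join("node-a")
seq_b = group.join("node-b")
first_leader = group.leader()
group.leave(seq_a)             # the leader disappears
second_leader = group.leader()
print(first_leader, second_leader)
```

The point of delegating this to a coordination service is that every node sees the same membership list, so they all agree on who the leader is without custom protocol code.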


Hadoop evolution


Relevant Technologies


Technologies relevant to Hadoop

[Figure: diagram of related technologies; the surviving labels are Zookeeper and Pig]


Hadoop Ecosystem


Hive

• Hive: a data warehousing application based on Hadoop.
  – Query language is HiveQL, which looks similar to SQL.
  – Translates HiveQL into MapReduce or Spark jobs.
  – Stores and manages data on HDFS.
  – Can be used as an interface for HBase, MongoDB, Cassandra etc.


Hive – a data warehouse for HDFS

• Simply put, Hive is a metadata layer on HDFS datasets
• Different table types
  – Internal or external tables
• Different table formats
  – Sequence, text, Parquet, ORC, RCFile
• Compression
  – Compression codecs: snappy, gzip, zlib
• What Hive gives us
  – SQL, partitioning, indexes


Hive Partitioning
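A minimal sketch of the idea behind partitioning, written in plain Python rather than HiveQL. Hive stores each partition of a table in its own subdirectory named `column=value`, so a query that filters on the partition column only needs to read the matching directories. The warehouse path, the column names and the helper functions below are hypothetical.

```python
def partition_path(table_dir, **partition_values):
    """Build the directory of one partition, e.g. <table>/year=2018/month=1."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_dir] + parts)


def prune(paths, column, value):
    """Keep only the partition directories that match column=value."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


paths = [
    partition_path("/warehouse/housing", year=year, month=month)
    for year in (2017, 2018)
    for month in (1, 2)
]
selected = prune(paths, "year", 2018)
print(selected)
```

Here a filter on `year` cuts the directories to scan from four to two; on a real table the saving is proportional to how much of the data the filter excludes.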


Sqoop

• Provides a simple interface for importing data straight from a relational database into Hadoop.


NoSQL

• HDFS: an append-only file system
  – A file once created, written, and closed need not be changed.
  – To modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.
  – Not efficient for random read/write.
  – Use a relational database? Not scalable.

• Solution: NoSQL
  – Stands for Not Only SQL.
  – A class of non-relational data storage systems.
  – Usually do not require a predefined table schema.
  – Scale horizontally (as opposed to vertically).


NoSQL

• NoSQL data store models:
  – Document store
  – Wide-column store
  – Key-value store
  – Graph store

• NoSQL examples:
  – HBase
  – Cassandra
  – MongoDB
  – CouchDB
  – Redis
  – Riak
  – Neo4J
  – ….


HBase

• HBase: the Hadoop Database.
  – Good integration with Hadoop.
  – A datastore on HDFS that supports random read and write.
  – A distributed database modeled after Google BigTable.
  – Best fit for very large Hadoop projects.


Pig

• A high-level platform for creating MapReduce programs used in Hadoop.

• Translates scripts into efficient sequences of one or more MapReduce jobs.

• Executes the MapReduce jobs.


Need for High-Level Languages

• Hadoop is great for large data processing!
  – But writing Mappers and Reducers in Java for everything is verbose and slow.
• Solution: develop higher-level data processing languages, on later processing engines.
  – Hive: HiveQL is like SQL.
  – Pig: Pig Latin is similar to Perl.
  – Use Python!


Apache Spark


Apache Spark Background

• Many of the aforementioned Big Data technologies (HBase, Hive, Pig, Mahout, etc.) are not integrated with each other.

• This can lead to reduced performance and integration difficulties.

• However, Apache Spark is a state-of-the-art Big Data technology that integrates many of the core functions from each of these technologies under one framework.

• The biggest advantage Spark offers over MapReduce is in-memory processing.


Apache Spark Background

• Apache Spark is a fast and general engine for large-scale data processing built upon distributed file systems.
  – The most common is the Hadoop Distributed File System (HDFS).

• Claims to be up to 100 times faster than MapReduce, and supports Java, Python, and Scala APIs.

• Spark is good for distributed computing tasks, and can handle batch, interactive, and real-time data within a single framework.

• Spark can also be run independently of Hadoop.


Resilient Distributed Datasets

• The core abstraction for working with data
• Spark automatically distributes the data across the cluster and parallelizes the operations
• An RDD is simply a distributed collection of objects
• RDDs are split into partitions, which can be computed on different cluster nodes
• RDDs can contain any type of Python, Java or Scala object, including user-defined classes
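The partitioning idea can be sketched without a cluster. The code below is a plain-Python stand-in, not PySpark: `parallelize`, `map_partitions` and `reduce_partitions` are illustrative functions written for this example that only mimic how a collection is cut into partitions, transformed partition by partition, and then combined.

```python
from functools import reduce


def parallelize(data, num_partitions):
    """Split a collection into num_partitions roughly equal slices."""
    data = list(data)
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partitions(partitions, func):
    """Apply func to every element; each partition could run on its own node."""
    return [[func(x) for x in part] for part in partitions]


def reduce_partitions(partitions, func):
    """Reduce inside each partition first, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


parts = parallelize(range(1, 11), 3)
squares = map_partitions(parts, lambda x: x * x)
total = reduce_partitions(squares, lambda a, b: a + b)
print(parts)
print(total)
```

Because each partition is processed independently, the only data that has to move between nodes is the small partial result per partition, which is why the combining function should be associative.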


Aside - Spark and Machine Learning

• Why it’s important!
• Libraries available (Python on Spark)
  – MLlib
  – astroML
  – Astropy
  – Theano
  – TensorFlow
  – The usual suspects (numpy, scipy)
  – More…


Spark Deployment Options

• Standalone − Spark sits on top of HDFS. Spark and MapReduce run side by side to cover all jobs.

• Hadoop YARN − Spark runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.

• Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.


Spark on YARN

• Resource management, scheduling and security are controlled by YARN

• Each Spark executor runs as a YARN container

• Spark vs MapReduce
  – MapReduce schedules a container and starts a JVM for each task
  – Spark hosts multiple tasks within the same container


Spark on YARN (continued)

• Each application has an ApplicationMaster process
• The ApplicationMaster requests resources from the Resource Manager
• When resources are allocated, it instructs the NodeManagers to start containers on its behalf.
• Deployment modes
  – Cluster: the driver runs inside the ApplicationMaster on a cluster host chosen by YARN
  – Client: the driver runs on the host where the job is submitted


Spark Components

• Regardless of deployment, Spark provides four standard libraries.
  – Spark SQL – allows for SQL-like queries of data
  – Spark Streaming – allows real-time processing of data
  – GraphX – allows graph analytics
  – MLlib – provides machine learning tools.


Spark Components – Spark SQL

– Spark SQL introduces a new data abstraction called SchemaRDD (renamed DataFrame from Spark 1.3 onward), which provides support for structured and semi-structured data. Consider the examples below, written against the early SchemaRDD API.

– From Hive:

    from pyspark.sql import HiveContext

    # sc is the SparkContext (predefined in the pyspark shell)
    c = HiveContext(sc)
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()

– From JSON:

    c.jsonFile("tweets.json").registerAsTable("tweets")
    c.sql("select text, user.name from tweets")


• Hadoop is powerful. But where do we find so many commodity machines?


Amazon Elastic MapReduce

• Setting up Hadoop clusters on the cloud.
• Amazon Elastic MapReduce (EMR).
  – Powered by Hadoop.
  – Uses EC2 instances as virtual servers for the master and slave nodes.
• Key features:
  – No need to do server maintenance.
  – Resizable clusters.
  – Hadoop application support including HBase, Pig, Hive etc.
  – Easy to use, monitor, and manage.


Spark Components – MLlib

• MLlib (Machine Learning Library) is a distributed machine learning framework on top of Spark.

• Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

• Spark MLlib provides a variety of classic machine learning algorithms.


Spark Components – MLlib Algorithms

• Classification – logistic regression, linear SVM, Naïve Bayes, classification tree

• Regression – Generalized Linear Models (GLMs), Regression tree

• Collaborative filtering – Alternating Least Squares (ALS), Non-negative Matrix Factorization (NMF)

• Clustering – k-means

• Decomposition – SVD, PCA

• Optimization – stochastic gradient descent, L-BFGS


Interfaces!

• How do we access Hadoop and Spark?
• CLIs exist for
  – Hive
  – Pig
  – Spark
  – PySpark
  – spark-submit


Better interfaces!

• Web interfaces – we will be using:
  – Hue for Hive and Pig;
  – Jupyter for PySpark; others exist as well, e.g. Apache Zeppelin is also worth a look.
  – And the Cloudera Management Server:
    • Visual representation of the entire cluster
    • Status of every service within the cluster
    • Extensive monitoring and configuration options


Lab – set up our environment

• Start the Docker image
• Start the Jupyter notebook server
• Explore Hue and Jupyter
• Start the Cloudera Management Server


Lab – Set up the housing data set

• Import the housing data set into HDFS
• Create an external table definition
• Create an internal table from the external table
• Compare the two tables.


Lab – Use PySpark to create an RDD

• We’ll use a Jupyter notebook to demonstrate this.
• Open the nb1-rdd-create notebook in Jupyter
• Run the notebook
• As an optional exercise, save the file as a Hive table and use Spark SQL to create the RDD


References

• These articles are good for learning Hadoop.
  – http://developer.yahoo.com/hadoop/tutorial/
  – https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
  – http://www.michael-noll.com/tutorials/
  – http://www.slideshare.net/cloudera/tokyo-nosqlslidesonly
  – http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html


Spark Components – Spark Streaming

• Spark Streaming leverages Spark’s fast scheduling ability to perform streaming analytics.
  – Chops up the live stream into batches of X seconds
  – Spark treats each data batch as a Resilient Distributed Dataset (RDD) and processes it using RDD operations
  – The processed results of the RDD operations are returned in batches
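The micro-batching steps can be sketched in plain Python. This is an illustration of the idea only, not the Spark Streaming API: events are assumed to be `(timestamp, value)` tuples, `batch_seconds` plays the role of X, and both helper functions are hypothetical. Intervals with no events are simply skipped here.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


def process_stream(events, batch_seconds, operation):
    """Apply one batch operation to each micro-batch; results come per batch."""
    return [operation(batch) for batch in micro_batches(events, batch_seconds)]


events = [(0.5, 3), (1.2, 4), (2.1, 5), (3.9, 1), (4.0, 7)]
batches = micro_batches(events, batch_seconds=2)
sums = process_stream(events, batch_seconds=2, operation=sum)
print(batches)
print(sums)
```

Treating the stream as a sequence of small batches is what allows the same batch operations to be reused for streaming data, at the cost of a latency of roughly one batch interval.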


Lab – Flume to capture streaming data


Spark Components - GraphX


Spark Components – GraphX Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Community Detection
  – Triangle Counting
  – K-core Decomposition
  – K-Truss
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
• Classification
  – Neural Networks
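To make one of the listed graph algorithms concrete, here is a small iterative PageRank in plain Python. It is a sketch for intuition and does not use the GraphX API; the adjacency-dict input format, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping each node to its out-links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / count
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
for node in sorted(ranks):
    print(node, round(ranks[node], 3))
```

Each round only needs a node's current rank and its out-links, so the update maps naturally onto a graph partitioned across machines, which is the setting graph-processing libraries target.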


Resources for Apache Spark

• Spark has a variety of free resources you can learn from.
  – Big Data University - http://bigdatauniversity.com/courses/spark-fundamentals/
  – Founders of Spark, Databricks - https://databricks.com/
  – Apache Spark download - http://spark.apache.org/
  – Apache Spark set up tutorial - http://www.tutorialspoint.com/apache_spark/