introduction to apache hadoopais-grid-2013.jinr.ru/docs/23/5-hadoop_introduction.pdf ·...

MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP


Page 1: INTRODUCTION TO APACHE HADOOPais-grid-2013.jinr.ru/docs/23/5-Hadoop_introduction.pdf · INTRODUCTION TO APACHE HADOOP . AGENDA • Introduction to Big Data • Introduction to Hadoop

MATTHIAS BRÄGER, CERN GS-ASE

INTRODUCTION TO APACHE HADOOP


AGENDA

•  Introduction to Big Data
•  Introduction to Hadoop
•  HDFS file system
•  Map/Reduce framework
•  Hadoop utilities
•  Summary


BIG DATA FACTS

•  In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? Answer: 2 days

•  90% of the world’s data was created in the last (how many years)? Answer: 2 years

•  What is 1024 petabytes also known as? Answer: 1 exabyte


DATA IS GETTING BIGGER

•  Rapid growth of global data from 2009 to 2020: from 1 to 35 zettabytes (1)
•  70% of the data is generated by individuals (1)
•  Global mobile data traffic will surpass 10 exabytes in 2016 (2)
•  The number of mobile-connected devices exceeded the world's population in 2012: 7 billion (2)
•  Every minute on the Internet (3): 100,000 tweets on Twitter and 240,000 pieces of content shared on Facebook

Sources: (1) CSC report “Big Data Growth Infographic”, (2) Cisco Visual Networking Index 2011-2016, (3) Intel


DATA EXPLOSION COMPOUNDS CHALLENGES

80% of the effort involved in dealing with data is cleaning it up in the first place (1)

(1) O'Reilly Media


BIG DATA INCLUDES ALL TYPES OF DATA


Structured
•  Pre-defined schema
•  Example: relational database systems

Semi-structured
•  Inconsistent structure
•  Cannot be stored in rows and tables in a typical database
•  Examples: logs, tweets, sensor feeds

Unstructured
•  Lacks structure, or parts of it lack structure
•  Examples: free-form text, reports, customer feedback forms


EXAMPLE PREDICTING TRENDS AND PREPARING FOR FUTURE DEMANDS

Big data analytics combines enterprise data with other relevant information

Web Browsing Patterns

Movie Releases

Social Media Sentiments

Enterprise Data

Gaming Industry Advertising Buys


to create a predictive model of trends.


A NEW SOLUTION

•  Hadoop = HDFS + Map/Reduce

•  HDFS provides storage
•  MapReduce provides analysis


THE HADOOP APPROACH

Distribute large amounts of data across thousands of commodity hardware nodes
•  Process data in parallel
•  Replicate data across the cluster for reliability

Analysis moved to data
•  Avoids data copy

Scanning of data
•  Avoids random seek
•  Easiest way to process


A NEW PARADIGM

•  Process data locally
•  Reduce dependence on bandwidth
•  Expect failure
•  Handle failover elegantly
•  Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
•  Reduce elapsed seek time
•  Place no conditions on the structure of the data


HADOOP OVERVIEW

[Diagram: the Hadoop stack, showing Hive, Chukwa, and Pig; MapReduce and HBase; HDFS on top of four disks; and ZooKeeper and Sqoop alongside]


HADOOP COMPONENTS (1/2)

Essentials:
•  HDFS - a scalable, high-performance distributed file system.
•  MapReduce - a Java-based framework that provides job tracking, node management, and an application container for mappers and reducers.

Frameworks:
•  Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
•  Hive - a structured data warehousing infrastructure that provides mechanisms for storage; data extraction, transformation, and loading (ETL); and a SQL-like language for querying and analysis.
•  HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.


HADOOP COMPONENTS (2/2)

Utilities:
•  Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
•  Sqoop - a tool for importing data stored in relational databases into Hadoop or Hive, and exporting it back, using MapReduce tools and standard JDBC drivers.
•  ZooKeeper - a distributed application management tool used for managing the nodes in a Hadoop computational network.


HDFS: HADOOP DISTRIBUTED FILE SYSTEM


WHAT IS HDFS?

•  A scalable, high-performance distributed file system
•  Primary storage system for Hadoop
•  Fast and reliable
•  Designed for consistency
•  Presents a single view of multiple physical disks or file systems
•  Deployed only on Linux


HDFS CHARACTERISTICS

•  Persistent
•  Replicated
•  Linearly scalable
•  Applications stream reads sequentially, often from very large files
•  Optimized for read performance; avoids random disk seeks
•  Write once and read many times
•  Data stored in blocks, distributed over many nodes
•  Block sizes often range from 128 MB to 1 GB
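As a rough illustration of the block layout described above, the sketch below splits a file into fixed-size blocks and counts how much raw storage the replicas occupy. The function name `plan_blocks` is hypothetical, the 128 MB block size is simply the lower end of the range above, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slides do not state a value).

```python
MB = 1024 * 1024

def plan_blocks(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, size of the last block, total bytes stored).

    Hypothetical helper: it only does the arithmetic, it does not talk to HDFS.
    """
    if file_size_bytes <= 0:
        return 0, 0, 0
    # Integer ceiling division: the last block may be only partly filled.
    num_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes
    # Every block is stored `replication` times across the cluster.
    stored = file_size_bytes * replication
    return num_blocks, last_block, stored

blocks, last, stored = plan_blocks(1000 * MB)
print(blocks, last // MB, stored // MB)
```

Under these assumptions a 1000 MB file occupies eight blocks, the last one only partly filled, and three times its size in raw disk space.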


HDFS ARCHITECTURE

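[Diagram: a NameNode, which holds the metadata and the block map, and a Secondary NameNode sit above three DataNodes. The DataNodes hold blocks BL1, BL2, BL6, BL7; BL1, BL2, BL3, BL6; and BL1, BL7, BL8, BL9, so copies of a block appear on more than one node.]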


HDFS COMPONENTS

NameNode
•  Manages DataNodes
•  Keeps metadata for all nodes & blocks

DataNodes
•  Manage block reads/writes for HDFS
•  Manage block replication
•  Live on racks (rack-aware data organization)

Client
•  Talks to the NameNode first, then directly to the DataNodes
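The division of labour above can be sketched with a small in-memory model: the NameNode keeps only metadata, the DataNodes keep the bytes, and the client asks the NameNode where the blocks are before reading them from the DataNodes. The classes `NameNode` and `DataNode`, the function `read_file`, the round-robin placement, and the replication factor of 2 are all illustrative assumptions; none of this is the Hadoop API.

```python
class DataNode:
    """Holds block contents; in real HDFS these live on local disks."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps metadata only: which blocks form a file and where replicas live."""
    def __init__(self):
        self.file_blocks = {}       # path -> ordered list of block ids
        self.block_map = {}         # block id -> DataNodes holding a replica

    def add_file(self, path, chunks, datanodes, replication=2):
        ids = []
        for i, chunk in enumerate(chunks):
            block_id = f"{path}#BL{i + 1}"
            # Simplified round-robin placement; real placement is rack-aware.
            holders = [datanodes[(i + r) % len(datanodes)]
                       for r in range(replication)]
            for node in holders:
                node.blocks[block_id] = chunk
            self.block_map[block_id] = holders
            ids.append(block_id)
        self.file_blocks[path] = ids

    def locate(self, path):
        return [(b, self.block_map[b]) for b in self.file_blocks[path]]


def read_file(namenode, path, failed=()):
    """Client: get block locations from the NameNode, then read from DataNodes."""
    data = b""
    for block_id, holders in namenode.locate(path):
        # Use the first replica whose node is not marked as failed.
        node = next(n for n in holders if n.name not in failed)
        data += node.read_block(block_id)
    return data


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode()
nn.add_file("/logs/a.txt", [b"hello ", b"hadoop ", b"world"], nodes)
print(read_file(nn, "/logs/a.txt"))
print(read_file(nn, "/logs/a.txt", failed={"dn1"}))
```

Because every block has a second replica on another node, the read still succeeds when one DataNode is treated as failed.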


HDFS VS HBASE

HDFS
•  Distributed file system that is well suited for the storage of large files
•  It is NOT a general-purpose file system!
•  HDFS does not work well with fewer than 5 DataNodes

HBASE
•  Built on top of HDFS
•  Suitable for hundreds of millions or billions of rows
•  Should not be used for tables with only a few thousand or a few million rows
•  More a “Data Store” than a “Data Base”
•  RDBMS apps cannot be "ported" to HBase by simply changing a JDBC driver!


MAP/REDUCE: HOW DOES IT WORK?


WHAT IS MAP/REDUCE? (1/2)

•  A framework written in Java
•  Big Data analytics and processing
•  Node-local computation
•  Parallel processes
•  Handles node fail-over
•  It all started when Google needed a way to:
   •  Determine which web sites to provide for searches
   •  Do page ranking


WHAT IS MAP/REDUCE? (2/2)

•  “Map” applies a function to every member of the dataset and returns a list of results
•  “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
•  Very large datasets are divided into large subsets called splits
•  Separates business logic from multi-processing logic
•  MapReduce framework developers focus on process dispatching, locking, and logic flow
•  App developers focus on implementing the business logic without worrying about infrastructure or scalability issues
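The separation between framework logic and business logic can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop: `run_mapreduce` stands in for the framework, while `word_mapper` and `word_reducer` are the only parts an application developer would write. All three names and the sample input lines are assumptions made for the illustration.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Toy framework: map every split, group values by key, reduce each group."""
    groups = defaultdict(list)
    for split in splits:                  # on a cluster, splits run in parallel
        for key, value in mapper(split):
            groups[key].append(value)     # "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}

# Business logic only: count how often each word occurs.
def word_mapper(line):
    for token in line.split():
        word = token.strip('.,!?"').lower()
        if word:
            yield word, 1

def word_reducer(word, counts):
    return sum(counts)

splits = ["Hi, John!", "John was here.", "Was John here?"]
counts = run_mapreduce(splits, word_mapper, word_reducer)
print(counts)
```

Swapping in a different mapper and reducer changes the analysis without touching `run_mapreduce`, which is the point of the split in responsibilities.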


HOW MAP/REDUCE WORKS


[Diagram: BigData → Map → Reduce → Result. Input lines such as “John was ..” and “Hi, John!” are mapped to pairs like (“John”, 1), which are reduced to (“John”, 3).]


MAP/REDUCE EXAMPLE (1/2)

Find maximum temperature for each city out of 5 files:

Contents of one input file:
Toronto, 20
Dubna, 25
Geneva, 22
Rome, 32
Toronto, 4
Rome, 38
Geneva, 18

Mapper task result:
(Toronto, 20) (Dubna, 25) (Geneva, 22) (Rome, 38)

Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Dubna, 27) (Geneva, 32) (Rome, 37)
(Toronto, 32) (Dubna, 20) (Geneva, 20) (Rome, 33)
(Toronto, 22) (Dubna, 19) (Geneva, 33) (Rome, 31)
(Toronto, 31) (Dubna, 22) (Geneva, 19) (Rome, 30)


MAP/REDUCE EXAMPLE (2/2)

•  All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city

Final Result:

(Toronto, 32) (Dubna, 27) (Geneva, 33) (Rome, 38)
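The temperature example can be replayed in plain Python as a check on the arithmetic. This is a sketch, not Hadoop code: `map_max` and `reduce_max` are hypothetical names, and the grouping of the sixteen intermediate pairs into four mapper outputs is an assumption (the final result does not depend on how they are grouped).

```python
def map_max(lines):
    """Mapper: emit one (city, maximum temperature) pair per city in one file."""
    best = {}
    for line in lines:
        city, temp = line.split(",")
        city, temp = city.strip(), int(temp)
        best[city] = max(temp, best.get(city, temp))
    return list(best.items())

def reduce_max(mapper_outputs):
    """Reducer: combine all mapper outputs into a single maximum per city."""
    result = {}
    for pairs in mapper_outputs:
        for city, temp in pairs:
            result[city] = max(temp, result.get(city, temp))
    return result

first_file = ["Toronto, 20", "Dubna, 25", "Geneva, 22", "Rome, 32",
              "Toronto, 4", "Rome, 38", "Geneva, 18"]

# Intermediate results of the other four mapper tasks (grouping assumed).
other_mappers = [
    [("Toronto", 18), ("Dubna", 27), ("Geneva", 32), ("Rome", 37)],
    [("Toronto", 32), ("Dubna", 20), ("Geneva", 20), ("Rome", 33)],
    [("Toronto", 22), ("Dubna", 19), ("Geneva", 33), ("Rome", 31)],
    [("Toronto", 31), ("Dubna", 22), ("Geneva", 19), ("Rome", 30)],
]

first_result = map_max(first_file)
final = reduce_max([first_result] + other_mappers)
print(first_result)
print(final)
```

The reducer only ever sees one small pair per city per file, which is why the intermediate data stays tiny even when the input files are large.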


PIG: A HADOOP SCRIPTING LANGUAGE


WHAT IS PIG?

•  A high-level data-flow language (Pig Latin) and execution framework for parallel computation
•  Pig is made of two main components:
   •  A SQL-like data processing language called Pig Latin
   •  A compiler that compiles and runs Pig Latin scripts
•  Pig Latin provides:
   •  Ease of programming. Trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
   •  Optimization opportunities. Permits the system to optimize execution of tasks automatically, allowing the user to focus on semantics rather than efficiency.
   •  Extensibility. Users can create their own functions


THE ORIGINS

•  Pig was created by Yahoo! to make it easier to analyze the data in HDFS without the complexities of writing a traditional MapReduce program.
•  With Pig, it is possible to develop MapReduce jobs with a few lines of Pig Latin


PIG IN THE ECOSYSTEM

•  Pig runs on Hadoop utilizing both HDFS and MapReduce
•  By default, Pig reads and writes files from HDFS
•  Pig stores intermediate data among MapReduce jobs

[Diagram: Pig layered above MapReduce, HDFS, and HBase]


RUNNING PIG

•  A Pig Latin script executes in three modes:
   1.  MapReduce: the code executes as a MapReduce application on a Hadoop cluster (default mode)
   2.  Local: the code executes locally in a single JVM using a local text file (for development purposes)
   3.  Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell


PIG EXAMPLES UNION

grunt> a = LOAD 'A' USING PigStorage(',') AS (a1:int, a2:int, a3:int);
grunt> b = LOAD 'B' USING PigStorage(',') AS (b1:int, b2:int, b3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)
grunt> DUMP b;
(0,5,2)
(1,7,8)
grunt> c = UNION a, b AS (c1:int, c2:int, c3:int);
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)


PIG EXAMPLES SPLIT

grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)


PIG EXAMPLES FOREACH

grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> mult = FOREACH c GENERATE c2, c2 * c3;
grunt> DUMP mult;
(1,2)
(5,10)
(3,12)
(7,56)
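For readers more familiar with general-purpose languages, the three operators shown in the Pig examples (UNION, SPLIT, FOREACH) behave roughly like the list operations below. This is only an analogy in plain Python on in-memory tuples; it says nothing about how Pig executes them on a cluster, and the output is sorted here because a union carries no inherent ordering.

```python
# The two input relations from the UNION example, as lists of tuples.
a = [(0, 1, 2), (1, 3, 4)]
b = [(0, 5, 2), (1, 7, 8)]

# UNION a, b: all tuples of both relations, in no particular order.
c = a + b

# SPLIT c INTO d IF $0 == 0, e IF $0 == 1: route tuples by their first field.
d = [t for t in c if t[0] == 0]
e = [t for t in c if t[0] == 1]

# FOREACH c GENERATE c2, c2 * c3: project and compute one row per input tuple.
mult = [(c2, c2 * c3) for (_c1, c2, c3) in c]

print(sorted(c))
print(d, e)
print(sorted(mult))
```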


EXAMPLE OF A PIG SCRIPT

•  Find the top 10 URLs for users between 18 and 25

-- Load the users and keep only those aged 18 to 25
Users = LOAD 'users' AS (name, age);
FilteredUsers = FILTER Users BY age >= 18 AND age <= 25;
-- Load the page visits and join them to the filtered users
Pages = LOAD 'pages' AS (user, url);
JoinResult = JOIN FilteredUsers BY name, Pages BY user;
-- Count the clicks per URL
Grouped = GROUP JoinResult BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(JoinResult) AS clicks;
-- Keep the ten most-clicked URLs
Sorted = ORDER Summed BY clicks DESC;
Top10 = LIMIT Sorted 10;
STORE Top10 INTO 'top10sites';
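To make the data flow of the script easier to follow, here is the same pipeline as a small Python function over in-memory lists. It is a sketch under assumptions: `top_urls` is a hypothetical name, the sample `users` and `pages` rows are invented, and user names are assumed to be unique (a real JOIN would repeat rows for duplicate names).

```python
from collections import Counter

def top_urls(users, pages, min_age=18, max_age=25, limit=10):
    """Return the most-visited URLs among users in the given age range."""
    # FILTER Users BY age >= 18 AND age <= 25
    wanted = {name for name, age in users if min_age <= age <= max_age}
    # JOIN on user name, GROUP BY url, COUNT the visits per group
    clicks = Counter(url for user, url in pages if user in wanted)
    # ORDER BY clicks DESC, then LIMIT
    return clicks.most_common(limit)

# Invented sample data, for illustration only.
users = [("ann", 19), ("bob", 31), ("cat", 25), ("dan", 17)]
pages = [("ann", "a.com"), ("ann", "b.com"), ("bob", "a.com"),
         ("cat", "a.com"), ("cat", "c.com"), ("dan", "b.com"),
         ("cat", "a.com")]

print(top_urls(users, pages))
```

Each Pig statement corresponds to roughly one step here; the difference is that Pig compiles those steps into MapReduce jobs instead of running them in one process.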


HIVE: A DATA WAREHOUSE SYSTEM FOR HADOOP


WHAT IS HIVE?

•  Hive is a data warehouse system for Hadoop
•  Facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
•  Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL


HIVEQL EXAMPLE

•  The underlying table www_access consists of three fields: ip, url, and time.

Number of records:

SELECT COUNT(1) FROM www_access;

Number of unique IPs that accessed the top page:

SELECT COUNT(distinct v['ip']) FROM www_access WHERE v['url']='/';
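The two queries can be tried without a Hadoop cluster by running equivalent SQL against an in-memory SQLite database. This is a stand-in, not Hive: SQLite replaces the Hive engine, the table uses plain `ip`, `url`, and `time` columns instead of the map-style `v['ip']` access shown above, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE www_access (ip TEXT, url TEXT, time INTEGER)")

# Invented access-log rows, for illustration only.
rows = [
    ("10.0.0.1", "/", 1),
    ("10.0.0.1", "/", 2),
    ("10.0.0.2", "/", 3),
    ("10.0.0.3", "/about", 4),
]
conn.executemany("INSERT INTO www_access VALUES (?, ?, ?)", rows)

# Number of records.
total = conn.execute("SELECT COUNT(1) FROM www_access").fetchone()[0]

# Number of unique IPs that accessed the top page.
unique_ips = conn.execute(
    "SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/'"
).fetchone()[0]

print(total, unique_ips)
```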


SUMMARY: WHAT DID WE LEARN?


TO TAKE AWAY

•  Data is getting bigger and more complex to handle
•  Hadoop = HDFS + Map/Reduce
•  Will Hadoop replace relational databases? No!


QUESTIONS? THANK YOU FOR YOUR ATTENTION!