introduction to apache hadoopais-grid-2013.jinr.ru/docs/23/5-hadoop_introduction.pdf ·...

MATTHIAS BRÄGER, CERN GS-ASE

INTRODUCTION TO APACHE HADOOP

AGENDA

•  Introduction to Big Data
•  Introduction to Hadoop
•  HDFS file system
•  Map/Reduce framework
•  Hadoop utilities
•  Summary

BIG DATA FACTS

•  In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003?
   2 days

•  90% of the world's data was created in the last (how many years)?
   2 years

•  What is 1024 petabytes also known as?
   1 exabyte
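The last answer can be checked with a few lines of arithmetic. This is a minimal sketch that assumes binary prefixes (each unit is 1024 times the previous one), which is what the factor 1024 in the question implies; the constant and function names are made up for illustration.

```python
# Binary prefixes (assumption): every step up is a factor of 1024.
PETABYTE = 1024 ** 5   # bytes
EXABYTE = 1024 ** 6    # bytes
ZETTABYTE = 1024 ** 7  # bytes


def petabytes_to_exabytes(petabytes: float) -> float:
    """Convert petabytes to exabytes using binary prefixes."""
    return petabytes * PETABYTE / EXABYTE


print(petabytes_to_exabytes(1024))  # 1.0
```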

DATA IS GETTING BIGGER

•  Rapid growth of global data from 2009 to 2020: from 1 to 35 zettabytes (1)
•  70% of the data is generated by individuals (1)
•  Global mobile data traffic will surpass 10 exabytes in 2016 (2)
•  The number of mobile-connected devices exceeded the world's population in 2012: 7 billion (2)
•  Every minute on the Internet (3):
   •  100,000 Twitter tweets
   •  240,000 pieces of shared Facebook content

(1) CSC Report "big data growth infographic", (2) Cisco Visual Networking Index 2011-2016, (3) Intel

DATA EXPLOSION COMPOUNDS CHALLENGES

80% of the effort involved in dealing with data is cleaning it up in the first place (1)

(1) O'Reilly Media

BIG DATA INCLUDES ALL TYPES OF DATA

Structured
•  Pre-defined schema
•  Example: relational database systems

Semi-structured
•  Inconsistent structure
•  Cannot be stored in rows and tables in a typical database
•  Examples: logs, tweets, sensor feeds

Unstructured
•  Lacks structure, or parts of it lack structure
•  Examples: free-form text, reports, customer feedback forms

EXAMPLE PREDICTING TRENDS AND PREPARING FOR FUTURE DEMANDS

Big data analytics combines enterprise data with other relevant information to create a predictive model of trends.

•  Enterprise Data
•  Web Browsing Patterns
•  Movie Releases
•  Social Media Sentiments
•  Gaming Industry Advertising Buys

A NEW SOLUTION

•  Hadoop = HDFS + Map/Reduce
•  HDFS provides storage
•  MapReduce provides analysis

THE HADOOP APPROACH

Distribute large amounts of data across thousands of commodity hardware nodes
•  Process data in parallel
•  Replicate data across the cluster for reliability

Analysis moved to data
•  Avoids data copy

Scanning of data
•  Avoids random seeks
•  Easiest way to process

A NEW PARADIGM

•  Process data locally
•  Reduce dependence on bandwidth
•  Expect failure
•  Handle failover elegantly
•  Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
•  Reduce elapsed seek time
•  Place no conditions on the structure of the data

HADOOP OVERVIEW

[Diagram: the Hadoop stack with its components: HDFS (on top of the disks), MapReduce, HBase, Hive, Chukwa, Pig, Sqoop, and ZooKeeper]

HADOOP COMPONENTS (1/2)

Essentials:
•  HDFS - a scalable, high-performance distributed file system.
•  MapReduce - a Java-based framework providing job tracking, node management, and an application container for mappers and reducers.

Frameworks:
•  Chukwa - a data collection system for monitoring, displaying, and analyzing logs from large distributed systems.
•  Hive - a structured data warehousing infrastructure that provides mechanisms for storage, data extraction, transformation, and loading (ETL), and a SQL-like language for querying and analysis.
•  HBase - a column-oriented (NoSQL) database designed for real-time storage, retrieval, and search of very large tables (billions of rows/millions of columns) running atop HDFS.

HADOOP COMPONENTS (2/2)

Utilities:
•  Pig - a set of tools for programmatic flat-file data analysis that provides a programming language, data transformation, and parallelized processing.
•  Sqoop - a tool for importing data stored in relational databases into Hadoop or Hive, and exporting it back, using MapReduce tools and standard JDBC drivers.
•  ZooKeeper - a distributed application management tool used for managing the nodes in a Hadoop computational network.

HDFS: HADOOP DISTRIBUTED FILE SYSTEM

WHAT IS HDFS?

•  A scalable, high-performance distributed file system
•  Primary storage system for Hadoop
•  Fast and reliable
•  Designed for consistency
•  Presents a single view of multiple physical disks or file systems
•  Deployed only on Linux

HDFS CHARACTERISTICS

•  Persistent
•  Replicated
•  Linearly scalable
•  Applications sequentially stream reads
•  Often from very large files
•  Optimized for read performance
•  Avoids random disk seeks
•  Write once and read many times
•  Data stored in blocks
•  Distributed over many nodes
•  Block sizes often range from 128MB to 1GB
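To make the block idea concrete, here is a minimal Python sketch of how a file could be cut into fixed-size blocks and how each block could be copied to several nodes. The replication factor of 3 and the round-robin placement are assumptions chosen for illustration; real HDFS uses a rack-aware placement policy, and the function and node names below are hypothetical.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the low end of the range quoted above
REPLICATION = 3                 # assumed number of copies per block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of every block of a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Toy round-robin placement: block i is stored on `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(1024 ** 3)  # a 1 GB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4", "dn5"])
print(len(blocks))   # 8
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

Because every block lives on several nodes, losing one node never removes the only copy of a block.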

HDFS ARCHITECTURE

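[Diagram: the NameNode holds the metadata and the block map and is accompanied by a Secondary NameNode. Three DataNodes store the replicated blocks:]

DataNode 1: BL1, BL2, BL7, BL6
DataNode 2: BL1, BL6, BL2, BL3
DataNode 3: BL1, BL8, BL9, BL7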

HDFS COMPONENTS

NameNode
•  Manages DataNodes
•  Keeps metadata for all nodes and blocks

DataNodes
•  Manage block reads/writes for HDFS
•  Manage block replication
•  Live on racks (rack-aware data organization)

Client
•  Talks directly to the NameNode, then to the DataNodes
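The division of labour between the three components can be sketched as a toy in-memory model. This is not the HDFS API: the classes `NameNode` and `DataNode`, the function `client_read`, and the file path are hypothetical names used only to show that the NameNode serves metadata while the block contents come from the DataNodes.

```python
from typing import Dict, List, Tuple


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def write_block(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: the blocks of each file and where their replicas live."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}
        self.block_map: Dict[str, List[DataNode]] = {}

    def register(self, path: str, block_id: str, replicas: List[DataNode]) -> None:
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_map[block_id] = replicas

    def locate(self, path: str) -> List[Tuple[str, List[DataNode]]]:
        return [(b, self.block_map[b]) for b in self.file_blocks[path]]


def client_read(namenode: NameNode, path: str) -> bytes:
    """Ask the NameNode for block locations, then read each block from a DataNode."""
    data = b""
    for block_id, replicas in namenode.locate(path):
        for node in replicas:
            if block_id in node.blocks:  # first replica that still has the block
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no replica available for {block_id}")
    return data


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
nn = NameNode()
for block_id, payload, replicas in [("BL1", b"hello ", [dn1, dn2]),
                                    ("BL2", b"hadoop", [dn2, dn3])]:
    for node in replicas:
        node.write_block(block_id, payload)
    nn.register("/demo.txt", block_id, replicas)

del dn1.blocks["BL1"]  # simulate a lost replica; the read falls back to dn2
print(client_read(nn, "/demo.txt"))  # b'hello hadoop'
```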

HDFS VS HBASE

HDFS
•  Distributed file system that is well suited for the storage of large files
•  It is NOT a general purpose file system!
•  HDFS does not work well with fewer than 5 DataNodes

HBASE
•  Built on top of HDFS
•  Suitable for hundreds of millions or billions of rows
•  Should not be used for tables with only a few thousand or a few million rows
•  More a “Data Store” than “Data Base”
•  RDBMS apps cannot be "ported" to HBase by simply changing a JDBC driver!

MAP/REDUCE: HOW DOES IT WORK?

WHAT IS MAP/REDUCE? (1/2)

•  A framework written in Java
•  Big Data analytics and processing
•  Node-local computation
•  Parallel processes
•  Handles node fail-over
•  It all started when Google needed a way to:
   •  Determine which web sites to provide for searches
   •  Do page ranking

WHAT IS MAP/REDUCE? (2/2)

•  “Map” is applied to all the members of the dataset and returns a list of results
•  “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
•  Very large datasets are divided into large subsets called splits
•  Separates business logic from multi-processing logic
•  MapReduce framework developers focus on process dispatching, locking, and logic flow
•  App developers focus on implementing the business logic without worrying about infrastructure or scalability issues

HOW MAP/REDUCE WORKS

[Diagram: Big Data input such as “John was ..” and “Hi, John!” passes through Map, which emits pairs like (“John”, 1); Reduce combines them into the result, e.g. (“John”, 3)]
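The flow in the diagram can be imitated in plain Python. This is a single-process sketch of the idea, not Hadoop's Java API: the function names are made up, and the third input line is a hypothetical addition so that the count for "John" reaches 3 as in the diagram.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word found in one input line."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", line)]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


# The first two lines come from the slide; the third is a hypothetical extra line.
lines = ["John was ..", "Hi, John!", "Where is John?"]
mapped = [pair for line in lines for pair in map_words(line)]
result = dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
print(result["John"])  # 3
```

In a real cluster the map calls run on the nodes that hold the input splits, and only the small (key, value) pairs travel over the network.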

MAP/REDUCE EXAMPLE (1/2)

Find maximum temperature for each city out of 5 files:

Content of one of the files:
Toronto, 20
Dubna, 25
Geneva, 22
Rome, 32
Toronto, 4
Rome, 38
Geneva, 18

Mapper task result:
(Toronto, 20) (Dubna, 25) (Geneva, 22) (Rome, 38)

Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Dubna, 27) (Geneva, 32) (Rome, 37)
(Toronto, 32) (Dubna, 20) (Geneva, 20) (Rome, 33)
(Toronto, 22) (Dubna, 19) (Geneva, 33) (Rome, 31)
(Toronto, 31) (Dubna, 22) (Geneva, 19) (Rome, 30)

MAP/REDUCE EXAMPLE (2/2)

•  All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city

Final Result:

(Toronto, 32) (Dubna, 27) (Geneva, 33) (Rome, 38)
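The same example can be replayed in a few lines of Python. This is a single-process sketch with made-up function names, not Hadoop code: `map_max` plays the mapper for the one file shown, and the intermediate pairs of the other four mappers are taken from the slide as a flat list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_max(lines: List[str]) -> List[Tuple[str, int]]:
    """Map: parse 'City, temperature' records of one file and emit the maximum per city."""
    best: Dict[str, int] = {}
    for line in lines:
        city, temp = line.split(",")
        city, value = city.strip(), int(temp)
        best[city] = max(value, best.get(city, value))
    return list(best.items())


def reduce_max(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: combine the intermediate pairs into one maximum per city."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for city, temp in pairs:
        grouped[city].append(temp)
    return {city: max(temps) for city, temps in grouped.items()}


file1 = ["Toronto, 20", "Dubna, 25", "Geneva, 22", "Rome, 32",
         "Toronto, 4", "Rome, 38", "Geneva, 18"]

# Intermediate results of the other four mapper tasks, as listed on the slide.
others = [("Toronto", 18), ("Dubna", 27), ("Geneva", 32), ("Rome", 37),
          ("Toronto", 32), ("Dubna", 20), ("Geneva", 20), ("Rome", 33),
          ("Toronto", 22), ("Dubna", 19), ("Geneva", 33), ("Rome", 31),
          ("Toronto", 31), ("Dubna", 22), ("Geneva", 19), ("Rome", 30)]

final = reduce_max(map_max(file1) + others)
print(final)  # {'Toronto': 32, 'Dubna': 27, 'Geneva': 33, 'Rome': 38}
```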

PIG: A HADOOP SCRIPTING LANGUAGE

WHAT IS PIG?

•  A high-level data-flow language (Pig Latin) and execution framework for parallel computation
•  Pig is made of two main components:
   •  A SQL-like data processing language called Pig Latin
   •  A compiler that compiles and runs Pig Latin scripts
•  Pig Latin provides:
   •  Ease of programming. Trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
   •  Optimization opportunities. Permits the system to optimize execution of tasks automatically, allowing the user to focus on semantics rather than efficiency.
   •  Extensibility. Users can create their own functions

THE ORIGINS

•  Pig was created by Yahoo! to make it easier to analyze the data in HDFS without the complexities of writing a traditional MapReduce program.
•  With Pig, it is possible to develop MapReduce jobs with a few lines of Pig Latin

PIG IN THE ECOSYSTEM

•  Pig runs on Hadoop utilizing both HDFS and MapReduce
•  By default, Pig reads and writes files from HDFS
•  Pig stores intermediate data between MapReduce jobs

[Diagram: Pig shown together with MapReduce, HDFS and HBase]

RUNNING PIG

•  A Pig Latin script executes in three modes:
   1.  MapReduce: the code executes as a MapReduce application on a Hadoop cluster (default mode)
   2.  Local: the code executes locally in a single JVM using a local text file (for development purposes)
   3.  Interactive: Pig commands are entered manually at a command prompt known as the Grunt shell

PIG EXAMPLES UNION

grunt> a = LOAD 'A' USING PigStorage(',') AS (a1:int, a2:int, a3:int);
grunt> b = LOAD 'B' USING PigStorage(',') AS (b1:int, b2:int, b3:int);
grunt> DUMP a;
(0,1,2)
(1,3,4)
grunt> DUMP b;
(0,5,2)
(1,7,8)
grunt> c = UNION a, b AS (c1:int, c2:int, c3:int);
grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)

PIG EXAMPLES SPLIT

grunt> SPLIT c INTO d IF $0 == 0, e IF $0 == 1;
grunt> DUMP d;
(0,1,2)
(0,5,2)
grunt> DUMP e;
(1,3,4)
(1,7,8)

PIG EXAMPLES FOREACH

grunt> DUMP c;
(0,1,2)
(0,5,2)
(1,3,4)
(1,7,8)
grunt> mult = FOREACH c GENERATE c2, c2 * c3;
grunt> DUMP mult;
(1,2)
(5,10)
(3,12)
(7,56)
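For readers who do not know Pig Latin, the three operators used in these examples behave roughly like the following plain Python. This is only an analogy on in-memory lists of tuples, with variable names borrowed from the Grunt session; Pig's UNION does not guarantee tuple order, so the results are sorted before printing.

```python
# Relations a and b as lists of tuples, matching the DUMP output above.
a = [(0, 1, 2), (1, 3, 4)]
b = [(0, 5, 2), (1, 7, 8)]

# UNION: concatenate both relations.
c = a + b

# SPLIT: route every tuple to the relation whose condition on field $0 holds.
d = [t for t in c if t[0] == 0]
e = [t for t in c if t[0] == 1]

# FOREACH ... GENERATE c2, c2 * c3: compute new fields for every tuple.
mult = [(c2, c2 * c3) for (_c1, c2, c3) in c]

print(sorted(c))     # [(0, 1, 2), (0, 5, 2), (1, 3, 4), (1, 7, 8)]
print(d, e)          # [(0, 1, 2), (0, 5, 2)] [(1, 3, 4), (1, 7, 8)]
print(sorted(mult))  # [(1, 2), (3, 12), (5, 10), (7, 56)]
```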

EXAMPLE OF A PIG SCRIPT

•  Find the top 10 URLs for users between 18 and 25

-- Load the users and keep only those aged 18 to 25
Users = LOAD 'users' AS (name, age);
FilteredUsers = FILTER Users BY age >= 18 AND age <= 25;
-- Join the page visits to the filtered users
Pages = LOAD 'pages' AS (user, url);
JoinResult = JOIN FilteredUsers BY name, Pages BY user;
-- Count the clicks per URL and keep the 10 most visited
Grouped = GROUP JoinResult BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(JoinResult) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top10 = LIMIT Sorted 10;
STORE Top10 INTO 'top10sites';
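The data flow of this script (filter, join, group, count, order, limit) can be traced with a small Python sketch. The sample records and the function name `top_urls` are hypothetical, and the join is simplified by assuming that user names are unique.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 18), ("dave", 25)]
pages = [("alice", "/home"), ("alice", "/news"), ("bob", "/home"),
         ("carol", "/home"), ("dave", "/news"), ("dave", "/home"),
         ("erin", "/shop")]


def top_urls(users: List[Tuple[str, int]], pages: List[Tuple[str, str]],
             low: int = 18, high: int = 25, limit: int = 10) -> List[Tuple[str, int]]:
    """Return the `limit` most visited URLs among users aged `low` to `high`."""
    eligible = {name for name, age in users if low <= age <= high}  # FILTER
    joined = [url for user, url in pages if user in eligible]       # JOIN (inner)
    clicks = Counter(joined)                                        # GROUP + COUNT
    return clicks.most_common(limit)                                # ORDER DESC + LIMIT


print(top_urls(users, pages))  # [('/home', 3), ('/news', 2)]
```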

HIVE: A DATA WAREHOUSE SYSTEM FOR HADOOP

WHAT IS HIVE?

•  Hive is a data warehouse system for Hadoop
•  It facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
•  Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

HIVEQL EXAMPLE

•  The underlying table www_access consists of three fields: ip, url, and time.

Number of records:
SELECT COUNT(1) FROM www_access;

Number of unique IPs that accessed the top page:
SELECT COUNT(DISTINCT v['ip']) FROM www_access WHERE v['url'] = '/';
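To try the two queries without a Hadoop cluster, the logic can be reproduced with Python's built-in sqlite3 module. This is a sketch under clear assumptions: SQLite is not Hive, the table uses plain ip, url and time columns instead of the map column v, and the sample rows are invented.

```python
import sqlite3

# Invented sample rows: (ip, url, time).
rows = [
    ("10.0.0.1", "/", 1),
    ("10.0.0.2", "/", 2),
    ("10.0.0.1", "/", 3),
    ("10.0.0.3", "/about", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE www_access (ip TEXT, url TEXT, time INTEGER)")
conn.executemany("INSERT INTO www_access VALUES (?, ?, ?)", rows)

# Number of records.
total = conn.execute("SELECT COUNT(1) FROM www_access").fetchone()[0]

# Number of unique IPs that accessed the top page.
unique_ips = conn.execute(
    "SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/'"
).fetchone()[0]

print(total, unique_ips)  # 4 2
```

The difference in practice is scale: Hive compiles such a query into jobs that scan files in HDFS, while SQLite reads a single local database.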

SUMMARY: WHAT DID WE LEARN?

TO TAKE AWAY

•  Data is getting bigger and more complex to handle
•  Hadoop = HDFS + Map/Reduce
•  Will Hadoop replace relational databases? No!

QUESTIONS?

THANK YOU FOR YOUR ATTENTION!
