Hadoop: Introduction
Wojciech Langiewicz, Wrocław Java User Group, 2014

TRANSCRIPT

Page 1

Hadoop: Introduction

Wojciech Langiewicz, Wrocław Java User Group, 2014

Page 2

About me

● Working with Hadoop and Hadoop-related technologies for the last 4 years

● Deployed 2 large clusters; the bigger one had almost 0.5 PB of total storage

● Currently working as a consultant / freelancer in Java and Hadoop

● On-site Hadoop trainings from time to time

● In the meantime, working on Android apps

Page 3

Agenda

● Big Data

● Hadoop

● MapReduce basics

● Hadoop processing framework – MapReduce on YARN

● Hadoop storage system – HDFS

● Using SQL on Hadoop with Hive

● Connecting Hadoop with RDBMS using Sqoop

● Examples of real Hadoop architectures

Page 4

Big Data from a technological perspective

● Huge amount of data

● Data collection

● Data processing

● Hardware limitations

● System reliability:

– Partial failures

– Data recoverability

– Consistency

– Scalability

Page 5

Approaches to the Big Data problem

● Vertical scaling

● Horizontal scaling

● Moving data to processing

● Moving processing close to data

Page 6

Hadoop - motivations

● Data won't fit on one machine

● More machines → higher chance of failure

● Disk scans are faster than seeks

● Batch vs real time processing

● Data processing won't fit on one machine

● Move computation close to data

Page 7

Hadoop properties

● Linear scalability

● Distributed

● (Almost) shared-nothing architecture

● Whole ecosystem of tools and techniques

● Unstructured data

● Raw data analysis

● Transparent data compression

● Replication at its core

● Self-managing (replication, master election, etc.)

● Easy to use

● Massive parallel processing

Page 8

Hadoop Architecture

● “Lower” layer: HDFS – data storage and retrieval system

● “Higher” layer: MapReduce – execution engine that relies on HDFS

● Please note that there are other systems that rely on HDFS for data storage, but won't be covered in this presentation

Page 9

MapReduce basics

● Batch processing system

● Handles many distributed systems problems

● Automatic parallelization and distribution

● Fault tolerance

● Job status and monitoring

● Borrows from functional programming

● Based on Google's work: MapReduce: Simplified Data Processing on Large Clusters

Page 10

Word Count pseudo code

  def map(String key, String value)
      foreach word in value:
          emit(word, 1);

  def reduce(String key, int[] values)
      int result = 0;
      foreach val in values:
          result += val;
      emit(key, result);
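
In real Hadoop Java code this is essentially the classic WordCount example from the Hadoop documentation (shown here lightly commented; input and output paths come from the command line):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // map: for every word in the input line, emit (word, 1)
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // reduce: sum all counts emitted for the same word
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }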

Page 11

Word Count Example

Source: http://xiaochongzhang.me/blog/?p=338

Page 12

Hadoop MapReduce Architecture

[Diagram: a Client submits a job to the Job Tracker, which distributes Map and Reduce tasks across multiple Task Trackers.]

Page 13

What can be expressed as MapReduce?

● grep (see the sketch after this list)

● sort

● SQL operators, for example:

– GROUP BY

– DISTINCT

– JOIN

● Recommending friends

● Inverting web indexes (building an inverted index)

● And many more
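
For instance, grep fits a map-only job: set the number of reducers to 0 and the map output goes straight to HDFS. A minimal sketch assuming the standard Hadoop Java API; the grep.pattern configuration key is made up for this example:

  import java.io.IOException;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Map-only "grep": emit every input line that matches a configured regex.
  public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      // "grep.pattern" is a hypothetical key set by the driver
      pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (pattern.matcher(value.toString()).find()) {
        context.write(value, NullWritable.get());
      }
    }
  }

The driver wires it up with job.setMapperClass(GrepMapper.class) and job.setNumReduceTasks(0).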

Page 14

HDFS – Hadoop Distributed File System

● Optimized for streaming access (prefers throughput over latency, no caching)

● Built-in replication

● One master server storing all metadata (Name Node)

● Multiple slaves that store data and report to master (Data Nodes)

● JBOD optimized

● Works better on a moderate number of large files than on many small files

● Based on Google's work: The Google File System
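
As an illustration of the client side, a minimal sketch that reads a file from HDFS through the FileSystem API (the path is hypothetical; the Name Node address comes from fs.defaultFS in the cluster's core-site.xml):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
      // Picks up fs.defaultFS (the Name Node address) from the classpath config
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path path = new Path("/user/wojtek/logs/part-00000"); // hypothetical path
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
      }
    }
  }

Under the hood the client asks the Name Node for block locations, then streams the blocks directly from the Data Nodes.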

Page 15

HDFS design

Page 16

HDFS limitations

● No file updates

● Name Node as SPOF in basic configurations

● Limited security

● Inefficient at handling lots of small files

● No way to provide global synchronization or shared mutable state (this can be an advantage)

Page 17

HDFS + MapReduce: Simplified Architecture

[Diagram: the Master Node runs the Name Node and the Job Tracker; each Slave Node runs a Data Node and a Task Tracker. A real setup will include a few more boxes, omitted here for simplicity.]

Page 18

Hive

● “Data warehousing for Hadoop”

● SQL interface to HDFS files (language is called HiveQL)

● SQL is translated into multiple MR jobs that are executed in order

● Doesn't support UPDATE

● Powerful and easy-to-use UDF mechanism:

  add jar /home/hive/my-udfs.jar;
  create temporary function my_lower as 'com.example.Lower';
  select my_lower(username) from users;
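
The UDF itself is a small Java class with an evaluate() method; a sketch matching the registration above, modeled on the classic lower-casing example from the Hive documentation:

  package com.example;

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  // Hive resolves my_lower(...) to the evaluate() method of this class
  public final class Lower extends UDF {
    public Text evaluate(final Text s) {
      if (s == null) {
        return null;
      }
      return new Text(s.toString().toLowerCase());
    }
  }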

Page 19

Hive components

● Shell – similar to the MySQL shell

● Driver – responsible for executing jobs

● Compiler – translates SQL into MR jobs

● Execution engine – manages jobs and job stages (one SQL query is usually translated into multiple MR jobs)

● Metastore – schema, location in HDFS, data format

● JDBC interface – allows for any JDBC compatible client to connect
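
For example, a minimal JDBC client (a sketch assuming HiveServer2 on its default port 10000; the host name and credentials are hypothetical, the users table comes from the examples on the next slide):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
      try (Connection con = DriverManager.getConnection(
               "jdbc:hive2://hadoop-master:10000/default", "hive", "");
           Statement stmt = con.createStatement();
           ResultSet rs = stmt.executeQuery(
               "SELECT user_id, age FROM users LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getLong("user_id") + "\t" + rs.getInt("age"));
        }
      }
    }
  }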

Page 20

Hive examples 1/2

● CREATE TABLE page_view (
    view_time INT, user_id BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING);

● CREATE TABLE users (user_id BIGINT, age INT);

● SELECT * FROM page_view LIMIT 10;

● SELECT
    user_id,
    COUNT(*) AS c
  FROM page_view
  WHERE view_time > 10
  GROUP BY user_id;

Page 21

Hive examples 2/2

● CREATE TABLE page_views_age AS
  SELECT
    pv.page_url,
    u.age,
    COUNT(*) AS count
  FROM page_view pv
  JOIN users u ON (u.user_id = pv.user_id)
  GROUP BY pv.page_url, u.age;

Page 22

Hive best practices 1/2

● Use partitions, especially on date columns

● Compress where possible

● JOIN optimization: hive.auto.convert.join=true

● Improve parallelism: hive.exec.parallel=true
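
To illustrate partitioning and the settings above, a short HiveQL sketch (the logs table and its columns are hypothetical):

  -- Partition log data by day, so queries filtering on dt
  -- scan only the partitions they need
  CREATE TABLE logs (user_id BIGINT, url STRING)
  PARTITIONED BY (dt STRING);

  -- Session-level settings from the list above
  SET hive.auto.convert.join=true;
  SET hive.exec.parallel=true;

  -- Reads a single partition instead of the whole table
  SELECT COUNT(*) FROM logs WHERE dt = '2014-05-01';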

Page 23

Hive best practices 2/2

● Avoid: SELECT COUNT(DISTINCT user_id) FROM logs;
  – forces all rows through a single reducer

● Prefer: SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM logs) t;
  – the DISTINCT is computed in parallel across reducers, and only the small result is counted

image source: http://www.slideshare.net/oom65/optimize-hivequeriespptx

Page 24

Sqoop

● SQL to Hadoop import/export tool

● Runs a MapReduce job that interacts with the target database via JDBC

● Can work with almost all JDBC databases

● Can “natively” import and export Hive tables

● Import supports:

– Full databases

– Full tables

– Query results

● Export can update/append data to SQL tables

Page 25

Sqoop examples

● sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

● sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import

● sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /user/hive/warehouse/exportingtable
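
A hedged sketch of an export that updates existing rows instead of only appending, using Sqoop's update flags (the key column id is an assumption, not from the original slides):

  sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /user/hive/warehouse/exportingtable \
    --update-key id --update-mode allowinsert

With --update-mode allowinsert, rows matching the key are updated and the remaining rows are inserted (an "upsert").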

Page 26

Hadoop problems

● Relatively hard to set up – Linux knowledge required

● Hard to find logs – multiple directories on each server

● Name Node can be a SPOF if configured incorrectly

● Not real time – jobs take some setup/warm-up time (other projects try to address that)

● Performance benefits are not visible until you exceed 3–5 servers

● Hard to convince people to use it from the start in some projects (Hive via JDBC can help here)

● Relatively complicated configuration management

Page 27

Hadoop ecosystem

● HBase – BigTable-style database

● Spark – fast in-memory processing engine

● Flume – log collection

● Impala – low-latency SQL query engine (similar goals to Hive, but interactive)

● HUE – web console for Hive (think MySQL Workbench / phpMyAdmin) + user permissions

● Oozie – job scheduling, orchestration, dependencies, etc.

Page 28

Use case examples

● Generic production snapshot updates

– Using asynchronous mechanisms

– Using a more synchronous approach

● Friends/product recommendations

Page 29

Hadoop use case example: snapshots

● Log collection, aggregation

● Periodic batch jobs (hourly, daily)

● Jobs integrate collected logs and production data

● Results from batch jobs feed production system

● Hadoop jobs generate reports for business users

Page 30

Hadoop pipeline – feedback loop

[Diagram: production systems X and Y generate logs and send them to RabbitMQ; in an integration step, multiple Rabbit consumers write the logs to HDFS; daily MapReduce jobs combine the collected logs with the current "snapshots" and produce updated "snapshots"; the results are loaded into an RDBMS that stores the models and feeds the production systems, closing the loop.]

Page 31

Feedback loop using Sqoop

[Diagram: a Hadoop MR job pulls data from the RDBMS with "sqoop import"; daily jobs process it on HDFS + MR; "sqoop export" pushes the results back to the RDBMS that stores data for the production system.]

Page 32

Agenda

● Big Data

● Hadoop

● MapReduce basics

● Hadoop processing framework – MapReduce on YARN

● Hadoop storage system – HDFS

● Using SQL on Hadoop with Hive

● Connecting Hadoop with RDBMS using Sqoop

● Examples of real Hadoop architectures

Page 33

How to recommend friends – PYMK 1/5

● Database of users

– CREATE TABLE users (id INT);

● Each user has a list of friends (assume integers)

– CREATE TABLE friends (user1 INT, user2 INT);

● For simplicity: relationship is always bidirectional

● Possible to do in SQL (run on an RDBMS or on Hive):

● SELECT users.id, new_friend, COUNT(*) AS common_friends
  FROM users
  JOIN friends f1 …
  JOIN friends f2 …
  …
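
One possible way to complete the elided query (a sketch, not necessarily the original slide's version; it assumes the friends table stores both directions of every friendship, as stated above):

  SELECT users.id, f2.user2 AS new_friend, COUNT(*) AS common_friends
  FROM users
  JOIN friends f1 ON (users.id = f1.user1)
  JOIN friends f2 ON (f1.user2 = f2.user1)
  LEFT OUTER JOIN friends direct
    ON (direct.user1 = users.id AND direct.user2 = f2.user2)
  WHERE f2.user2 != users.id   -- don't recommend users to themselves
    AND direct.user1 IS NULL   -- skip pairs that are already friends
  GROUP BY users.id, f2.user2;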

Page 34

PYMK 2/5: Example

Friend lists (user: friends):

  0: 1,2,3
  1: 3
  2: 1,4,5
  3: 0,1
  4: 5
  5: 2,4

We expect to see the following recommendations: (1,3), (0,4), (0,5)

[Diagram: the friendship graph for users 0–5.]

Page 35

PYMK 3/5

● For each user, emit pairs for all of their friends

– Example: user X has friends 1, 5, 6; we emit: (1,5), (1,6), (5,6)

● Sort all pairs by first user

● Eliminate direct friendships: if 5 & 6 are friends, remove that pair

● Sort all pairs by frequency

● Group by each user in the pair

Page 36

PYMK 4/5 mapper

  // user: integer, friends: integer list
  function map(user, friends):
      for i = 0 to friends.length-1:
          emit(user, (1, friends[i]))  // direct friends

          for j = i+1 to friends.length-1:
              // indirect friends
              emit(friends[i], (2, friends[j]))
              emit(friends[j], (2, friends[i]))

Page 37

PYMK 5/5 reducer

  // user: integer, rlist: list of pairs (path_length, rfriend)
  function reduce(user, rlist):

      recommended = new Map()  // candidate -> number of common friends
      direct = new Set()       // existing friends, never recommend

      for (path_length, rfriend) in rlist:
          if path_length == 1:  // direct friends
              direct.add(rfriend)
          if path_length == 2:  // candidates to recommend
              recommended.incrementOrAdd(rfriend)

      for rfriend in direct:
          recommended.remove(rfriend)

      recommend_list = recommended.toList()
      recommend_list.sortBy(_._2)

      emit(user, recommend_list.toString())
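
The same mapper and reducer as a Java sketch on the Hadoop API (assumptions not on the original slides: input lines look like "user<TAB>friend1,friend2,...", each friendship appears in both users' lists, and the path length plus candidate id are packed into a Text value):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class Pymk {

    // Input line (assumption): "user<TAB>friend1,friend2,..."
    public static class PymkMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");
        long user = Long.parseLong(parts[0]);
        String[] friends = parts[1].split(",");
        for (int i = 0; i < friends.length; i++) {
          // path of length 1: user and friends[i] already know each other
          ctx.write(new LongWritable(user), new Text("1," + friends[i]));
          for (int j = i + 1; j < friends.length; j++) {
            // path of length 2: friends[i] and friends[j] share "user"
            ctx.write(new LongWritable(Long.parseLong(friends[i])),
                      new Text("2," + friends[j]));
            ctx.write(new LongWritable(Long.parseLong(friends[j])),
                      new Text("2," + friends[i]));
          }
        }
      }
    }

    public static class PymkReducer
        extends Reducer<LongWritable, Text, LongWritable, Text> {
      public void reduce(LongWritable user, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        Map<Long, Integer> common = new HashMap<Long, Integer>(); // candidate -> #common friends
        Set<Long> direct = new HashSet<Long>();                   // existing friends
        for (Text value : values) {
          String[] parts = value.toString().split(",");
          long friend = Long.parseLong(parts[1]);
          if ("1".equals(parts[0])) {
            direct.add(friend);
          } else {
            Integer c = common.get(friend);
            common.put(friend, c == null ? 1 : c + 1);
          }
        }
        common.keySet().removeAll(direct); // never recommend existing friends

        List<Map.Entry<Long, Integer>> ranked =
            new ArrayList<Map.Entry<Long, Integer>>(common.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<Long, Integer>>() {
          public int compare(Map.Entry<Long, Integer> a, Map.Entry<Long, Integer> b) {
            return b.getValue() - a.getValue(); // most common friends first
          }
        });
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, Integer> e : ranked) {
          out.append(e.getKey()).append('(').append(e.getValue()).append(") ");
        }
        ctx.write(user, new Text(out.toString().trim()));
      }
    }
  }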

Page 38

Additional sources

● Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

● Programming Hive: http://shop.oreilly.com/product/0636920023555.do

● Cloudera Quick Start VM: http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html

● Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920021773.do

Page 39

Thanks! Time for questions.