
Introduction to Apache Pig and Hive

Pelle Jakovits

30 September, 2014, Tartu


Outline

• Why Pig or Hive instead of MapReduce
• Apache Pig
  – Pig Latin language
  – Examples
  – Architecture
• Hive
  – Hive Query Language
  – Examples
• Disadvantages of scripting languages
• Pig vs Hive


You already know MapReduce

• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for huge-scale data processing
• Provides:
  – A distributed file system
  – High scalability
  – Automatic parallelisation
  – Automatic fault recovery
    • Data is replicated
    • Failed tasks are re-executed on other nodes


Is MapReduce enough?

• Hadoop MapReduce is one of the most widely used frameworks for large-scale data processing
• However:
  – Writing low-level MapReduce code is slow
  – A lot of expertise is needed to optimize MapReduce code
  – Prototyping is slow
  – A lot of custom code is required, even for the simplest tasks
  – More complex MapReduce job chains are hard to manage


Apache Pig

• A data flow framework on top of Hadoop MapReduce
  – Retains all of its advantages
  – And some of its disadvantages
• Modelled as a scripting language
  – Fast prototyping
• Uses the Pig Latin language
  – Similar to declarative SQL
  – Easier to get started with
• Pig Latin statements are automatically translated into MapReduce jobs


Pig workflow

(Diagram: a Pig Latin script is parsed, optimised and compiled into a sequence of MapReduce jobs that run on the Hadoop cluster.)


Advantages of Pig

• Easy to program
  – 5% of the code, 5% of the time of raw MapReduce
• Self-optimizing
  – Pig Latin statement optimizations
  – Generated MapReduce code optimizations
• Can manage more complex data flows
  – Easy to use and join multiple separate inputs, transformations and outputs
• Extensible
  – Can be extended with User Defined Functions (UDFs) to provide more functionality


Running Pig

• Local mode
  – Everything installed locally on one machine
• Distributed mode
  – Everything runs in a MapReduce cluster
• Interactive mode
  – Grunt shell
• Batch mode
  – Pig scripts


Pig Latin

• Write complex MapReduce transformations using a much simpler scripting language

• Not quite SQL, but similar

• Lazy evaluation

• Compiling is hidden from the user


Pig Latin Example

I = LOAD '/mydata/images' USING ImageParser() AS (id, image);
F = FOREACH I GENERATE id, detectFaces(image);
STORE F INTO '/mydata/faces';

• Input and output are HDFS folders or files
  – /mydata/images
  – /mydata/faces
• I and F are relations
• The right-hand side of each statement contains Pig expressions


Relations, Bags, Tuples, Fields

• Relation
  – Similar to a table in a relational database
  – Consists of a bag
  – Can have nested relations
• Bag
  – A collection of unordered tuples
• Tuple
  – An ordered set of fields
  – Similar to a row in a relational database
  – Can contain any number of fields; does not have to match other tuples
• Field
  – A 'piece' of data (see the sketch below)
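A minimal sketch of how these concepts fit together, assuming the tab-delimited 'student' file used on the later GROUP BY slide; DESCRIBE prints the nested schema, with the bag in curly braces and each tuple in parentheses:

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);  -- a relation of tuples
B = GROUP A BY age;                                          -- each output tuple holds a group key and a bag
DESCRIBE B;
-- B: {group: int, A: {(name: chararray, age: int, gpa: float)}}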


Fields

• A field consists of either:
  – Data atoms – int, long, float, double, chararray, boolean, datetime, etc.
  – Complex data – bag, map, tuple
• Assigning types to fields:
  – A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• Referencing fields:
  – By position – $0, $1, $2
  – By name – assigned by user-defined schemas (see the sketch below)
    • A = LOAD 'in.txt' AS (age, name, occupation);
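A short sketch (hypothetical input file) contrasting the two ways of referencing fields:

A = LOAD 'in.txt' AS (age, name, occupation);
B = FOREACH A GENERATE $1, age;  -- $1 is the name field (by position), age is referenced by schema name
DUMP B;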


Complex data types

• Looking into complex, nested data

– client.$0

– author.age
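A minimal sketch of projecting into nested data; the file, schema and field names are hypothetical:

A = LOAD 'books.txt' AS (title:chararray, author:tuple(name:chararray, age:int));
B = FOREACH A GENERATE author.$0, author.age;  -- the nested name field by position and the age field by name
DUMP B;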


Loading and storing data

• LOAD
  – A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
  – The user defines the data loader and delimiters
• STORE
  – STORE A INTO 'output_1.txt' USING PigStorage(',');
  – STORE B INTO 'output_2.txt' USING PigStorage('*');
• Other data loaders
  – BinStorage
  – PigDump
  – TextLoader
  – Or create a custom one


FOREACH … GENERATE

• General data transformation statement

• Used to:

– Change the structure of data

– Apply functions to data

– Flatten complex data to remove nesting

• X = FOREACH C GENERATE FLATTEN (A.(a1,a2)), FLATTEN(B.$1);
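A small worked sketch of FLATTEN, reusing the 'student' data from the GROUP BY slide; FLATTEN un-nests the bag produced by GROUP (output order may vary):

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;                             -- (18, {(John,18,4.0),(Joe,18,3.8)}), ...
X = FOREACH B GENERATE group, FLATTEN(A.name);  -- one flat (age, name) tuple per student
DUMP X;
-- (18,John)
-- (18,Joe)
-- (19,Mary)
-- (20,Bill)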


GROUP ... BY

• A = load 'student' AS (name:chararray, age:int, gpa:float);

• DUMP A;
  – (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• DUMP B;
  – (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})

– (19, {(Mary, 19, 3.8F)})

– (20, {(Bill, 20, 3.9F)})


JOIN

• A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

• B = LOAD 'data2' AS (b1:int,b2:int);

• X = JOIN A BY a1, B BY b1;


DUMP A; (1,2,3) (4,2,1)

DUMP B; (1,3) (2,7) (4,6)

DUMP X; (1,2,3,1,3)(4,2,1,4,6)


Union

• A = LOAD 'data' AS (a1:int, a2:int, a3:int);

• B = LOAD 'data' AS (b1:int, b2:int);

• X = UNION A, B;


DUMP A; (1,2,3)(4,2,1)

DUMP B; (2,4) (8,9)

DUMP X; (1,2,3)(4,2,1) (2,4) (8,9)


Functions

• SAMPLE

– A = LOAD 'data' AS (f1:int,f2:int,f3:int);

– X = SAMPLE A 0.01;

– X will contain 1% of tuples in A

• FILTER

– A = LOAD 'data' AS (a1:int, a2:int, a3:int);

– X = FILTER A BY a3 == 3;


Functions

• DISTINCT – removes duplicate tuples

– X = DISTINCT A;

• LIMIT – limits the number of output tuples

– X = LIMIT B 3;

• SPLIT – partitions a relation into two or more relations

– SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);


Pig Example

• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);

• DUMP A;

– (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• C = FOREACH B GENERATE group, AVG(A.gpa); – the expected result is sketched below
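For the student data above, DUMPing C should yield one (age, average GPA) tuple per group, e.g. age 18 averages 4.0 and 3.8 to 3.9 (AVG returns a double, so the exact formatting may differ):

DUMP C;
-- (18,3.9)
-- (19,3.8)
-- (20,3.9)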


Hive

• Data warehousing on top of Hadoop.

• Designed to enable:
  – easy data summarization
  – ad-hoc querying
  – analysis of large volumes of data

• HiveQL statements are automatically translated into MapReduce jobs


Advantages of Hive

• Higher level query language

– Simplifies working with large amounts of data

• Lower learning curve than Pig or MapReduce

– HiveQL is much closer to SQL than Pig

– Less trial and error than Pig


Running Hive

• We will look at this more closely in the practice session, but you can run Hive from:
  – The Hive web interface
  – The Hive shell
    • $HIVE_HOME/bin/hive for an interactive shell
    • Or run queries directly: $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
  – JDBC – Java Database Connectivity
    • "jdbc:hive://host:port/dbname"
  – It is also possible to use Hive directly from Python, C, C++, PHP


HiveQL

• The Hive query language provides the basic SQL-like operations (a combined example follows the list):
  – Filter rows from a table using a WHERE clause
  – Select certain columns from a table using a SELECT clause
  – Perform equi-joins between two tables
  – Evaluate aggregations on multiple "group by" columns of a table
  – Store the results of a query into another table
  – Download the contents of a table to a local directory
  – Store the results of a query in a Hadoop DFS directory
  – Manage tables and partitions (create, drop and alter)
  – Plug in custom scripts in a language of choice (for map/reduce)
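A compact sketch combining several of these operations — filtering with WHERE, selecting columns, grouping and storing the result into another table; the target table url_counts is hypothetical and assumed to exist with a matching schema:

INSERT OVERWRITE TABLE url_counts
SELECT pv.page_url, count(*) AS views
FROM page_view pv
WHERE pv.country = 'US'
GROUP BY pv.page_url;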


Data units

• Databases: namespaces that separate tables and other data units from naming conflicts
• Tables:
  – Homogeneous units of data which share the same schema
  – Consist of columns as specified by that schema
• Partitions:
  – Each table can have one or more partition keys which determine how the data is stored
  – Partitions allow the user to efficiently identify the rows that satisfy certain criteria
  – It is the user's job to guarantee the relationship between partition name and data!
  – Partition columns are virtual columns; they are not part of the data itself but are derived on load
• Buckets (or clusters):
  – Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table
  – For example, the page_views table may be bucketed by userid to sample the data (see the sketch below)
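A hedged sketch of how partition keys and buckets are declared together in Hive DDL; the table and column choices are illustrative:

CREATE TABLE page_views(userid BIGINT, page_url STRING, ip STRING)
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;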


Complex types

• Structs
  – Elements within the type are accessed using DOT (.) notation
  – For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
• Maps (key-value tuples)
  – Elements are accessed using ['element name'] notation
  – For example, in a map M containing a mapping from 'group' -> gid, the gid value can be accessed as M['group']
• Arrays (indexable lists)
  – All elements in an array must be of the same type
  – Elements are accessed using [n] notation, where n is a zero-based index into the array
  – For example, for an array A with elements ['a', 'b', 'c'], A[1] returns 'b' (a combined sketch follows below)
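A combined sketch (hypothetical table) declaring the three complex types and accessing them in a query:

CREATE TABLE users_complex(
    name    STRING,
    props   MAP<STRING, INT>,
    emails  ARRAY<STRING>,
    address STRUCT<street:STRING, city:STRING>
);

SELECT name, props['group'], emails[0], address.city
FROM users_complex;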


Create Table

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;


Load Data

• There are multiple ways to load data into Hive tables.

• The user can create an external table that points to a specified location within HDFS.

• The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information.

• Once this is done, the user can transform the data and insert them into any other Hive table.
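A hedged sketch of the external-table approach described above, pointing Hive at the staging directory used on the next slide; the column list and delimiter are illustrative:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING, ip STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';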


Load example

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
    PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';


Loading and storing data

• Loading data directly:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
INTO TABLE page_view
PARTITION(dt='2008-06-08', country='US');


Storing locally

• To write the output to a local disk, for example to load it into an Excel spreadsheet later:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'

SELECT pv_gender_sum.*

FROM pv_gender_sum;


INSERT

INSERT OVERWRITE TABLE xyz_com_page_views

SELECT page_views.*

FROM page_views

WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND

page_views.referrer_url like '%xyz.com';


Multiple table/file inserts

• The output can be sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities).

• If along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

• The first INSERT clause sends the results of the first GROUP BY to a Hive table, while the second one sends the results to a Hadoop DFS file.


JOIN

• Supports LEFT OUTER, RIGHT OUTER and FULL OUTER joins
• Can join more than two tables at once
• It is best to put the largest table on the rightmost side of the join to get the best performance

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';


Aggregations

INSERT OVERWRITE TABLE pv_gender_agg

SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)

FROM pv_users

GROUP BY pv_users.gender;


Union

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid);


Running custom MapReduce scripts

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script.py'
    AS dt, uid
    CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script.py'
    AS date, count;


Map Example

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    # Convert the unix timestamp into a weekday number (1 = Monday ... 7 = Sunday)
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print ','.join([userid, str(weekday)])


Using UDFs in Hive


• create temporary function my_lower as 'com.example.hive.udf.Lower';

• hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
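Before the temporary function can be registered, the jar containing the compiled UDF class (such as the Lower class on the next slide) normally has to be added to the session first; the path below is hypothetical:

hive> add jar /path/to/hive-udf-examples.jar;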


Java UDF

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}


Pig vs Hive

                   | Pig                      | Hive
Purpose            | Data transformation      | Ad-hoc querying
Language           | Something similar to SQL | SQL-like
Difficulty         | Medium (trial-and-error) | Low to medium
Schemas            | Yes (implicit)           | Yes (explicit)
Join (distributed) | Yes                      | Yes
Shell              | Yes                      | Yes
Streaming          | Yes                      | Yes
Web interface      | No                       | Yes
Partitions         | No                       | Yes
UDFs               | Yes                      | Yes


Pig vs Hive

• SQL might not be the perfect language for expressing data transformation commands

• Pig

– Mainly for data transformations and processing

– Unstructured data

• Hive

– Mainly for warehousing and querying data

– Structured data


Pig & Hive disadvantages

• Slow start-up and clean-up of MapReduce jobs

– It takes time for Hadoop to schedule MR jobs

• Not suitable for interactive OLAP Analytics

– When results are expected in < 1 sec

• Complex applications may require many UDFs
  – Pig loses its simplicity over MapReduce


Pig & Hive disadvantages

• Updating data is complicated

– Mainly because of using HDFS

– Can add records

– Can overwrite partitions

• No real-time access to data
  – Use other means like HBase or Impala

• High latency


More Related Hadoop projects

• HBase
  – Open-source distributed database on top of HDFS
  – HBase tables only use a single key
  – Tuned for real-time access to data
• Cloudera Impala™
  – Simplified, real-time queries over HDFS
  – Bypasses job scheduling and removes everything else that makes MapReduce slow


Big Picture

• Store large amounts of data to HDFS

• Process raw data: Pig

• Build schema using Hive

• Data queries: Hive
• Real-time access
  – access to data with HBase
  – real-time queries with Impala


Is Apache Hadoop enough?

• Why use Hadoop for large scale data processing?

– It is becoming a de facto standard in Big Data

  – Collaboration among top companies instead of vendor tool lock-in
    • Amazon, Apache, Facebook, Yahoo! and others all contribute to open-source Hadoop
  – There are tools for everything from setting up a Hadoop cluster in minutes and importing data from relational databases to setting up workflows of MapReduce, Pig and Hive jobs


That's All

• This week's practice session
  – Processing data with Pig and Hive
  – A similar data processing exercise as in the previous two weeks
• Next week's lecture
  – Data processing in Spark
