big data on public cloud using cloudera on gogrid & amazon emr

Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1

Big Data on Public CloudUsing Cloudera on GoGrid

& Amazon EMR

Hands On WorkshopOctober 2014

Dr.Thanachart NumnondaIMC Institute

thanachart@imcinstitute.com

Modifiy from Original Version by Danairat T.Certified Java Programmer, TOGAF – Silver

danairat@gmail.com

Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop

Hands-On: Running Hadoop

Running this lab using Cloudera Live

Using Cloudera Live on GoGrid

Alternatively Using Cloudera VM

Starting Hue on Cloudera

Viewing HDFS

Hands-On: Importing/Exporting Data to HDFS

Importing Data to Hadoop

Download War and Peace Full Text

www.gutenberg.org/ebooks/2600

$hadoop fs -mkdir input

$hadoop fs -mkdir output

$hadoop fs -copyFromLocal Downloads/pg2600.txt input

Review file in Hadoop HDFS

[hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt

List HDFS File

Read HDFS File

Retrieve HDFS File to Local File System

Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

[hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt

Review file in Hadoop HDFS using File Browse

Review file in Hadoop HDFS using Hue

Hadoop Port Numbers

Daemon Default Port

Configuration Parameter in conf/*-site.xml

HDFS Namenode 50070 dfs.http.address

Datanodes 50075 dfs.datanode.http.address

Secondarynamenode 50090 dfs.secondary.http.address

MR JobTracker 50030 mapred.job.tracker.http.address

Tasktrackers 50060 mapred.task.tracker.http.address

Removing data from HDFS using Shell Command

hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt

Deleted hdfs://localhost:54310/input/pg2600.txt

hdadmin@localhost detach]$

Lecture: Understanding Map Reduce Processing

Client

Name Node Job Tracker

Data Node

Task Tracker

Data Node

Task Tracker

Data Node

Task Tracker

Map Reduce

High Level Architecture of MapReduce

Before MapReduce…

● Large scale data processing was difficult!– Managing hundreds or thousands of processors– Managing parallelization and distribution– I/O Scheduling– Status and monitoring– Fault/crash tolerance

● MapReduce provides all of these, easily!

Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html

MapReduce Overview

● What is it?– Programming model used by Google– A combination of the Map and Reduce models with an

associated implementation– Used for processing and generating large data sets

MapReduce Overview

● How does it solve our previously mentioned problems?– MapReduce is highly scalable and can be used across many

computers.– Many small machines can be used to process jobs that

normally could not be processed by a large machine.

MapReduce Framework

Source: www.bigdatauniversity.com

How does the MapReduce work?

Output in a list of (Key, List of Values)

in the intermediate file

Sorting

Partitioning

Output in a list of (Key, Value)

InputSplit

RecordReader

RecordWriter

How does the MapReduce work?

Output in a list of (Key, List of Values)

Sorting

Partitioning

Output in a list of (Key, Value)

InputSplit

RecordReader

RecordWriter

Map Abstraction

● Inputs a key/value pair– Key is a reference to the input value– Value is the data set on which to operate

● Evaluation– Function defined by user– Applies to every value in value input

● Might need to parse input● Produces a new list of key/value pairs

– Can be different type from input pair

Reduce Abstraction

● Starts with intermediate Key / Value pairs● Ends with finalized Key / Value pairs

● Starting pairs are sorted by key● Iterator supplies the values for a given key to the

Reduce function.

Reduce Abstraction

● Typically a function that:– Starts with a large number of key/value pairs

● One key/value for each word in all files being greped (including multiple entries for the same word)

– Ends with very few key/value pairs● One key/value for each unique word across all the files with

the number of instances summed into this entry● Broken up so a given worker works with input of the

same key.

How Map and Reduce Work Together

● Map returns information● Reduces accepts information● Reduce applies a user defined function to reduce the

amount of data

Other Applications

● Yahoo!– Webmap application uses Hadoop to create a database of

information on all known webpages● Facebook

– Hive data center uses Hadoop to provide business statistics to application developers and advertisers

● Rackspace– Analyzes sever log files and usage data using Hadoop

MapReduce Framework

map: (K1, V1) -> list(K2, V2))

reduce: (K2, list(V2)) -> list(K3, V3)

Hands-On: Writing you own Map Reduce Program

Wordcount (HelloWord in Hadoop)1. package org.myorg;

3. import java.io.IOException; 4. import java.util.*;

6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.conf.*; 8. import org.apache.hadoop.io.*; 9. import org.apache.hadoop.mapred.*; 10. import org.apache.hadoop.util.*;

12. public class WordCount {

14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

15. private final static IntWritable one = new IntWritable(1); 16. private Text word = new Text();

18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

19. String line = value.toString(); 20. StringTokenizer tokenizer = new StringTokenizer(line); 21. while (tokenizer.hasMoreTokens()) { 22. word.set(tokenizer.nextToken()); 23. output.collect(word, one); 24. } 25. } 26. }

Wordcount (HelloWord in Hadoop)

28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

30. int sum = 0; 31. while (values.hasNext()) { 32. sum += values.next().get(); 33. } 34. output.collect(key, new IntWritable(sum)); 35. } 36. }

Wordcount (HelloWord in Hadoop)

38. public static void main(String[] args) throws Exception { 39. JobConf conf = new JobConf(WordCount.class); 40. conf.setJobName("wordcount");

42. conf.setOutputKeyClass(Text.class); 43. conf.setOutputValueClass(IntWritable.class);

45. conf.setMapperClass(Map.class); 46. 47. conf.setReducerClass(Reduce.class);

49. conf.setInputFormat(TextInputFormat.class); 50. conf.setOutputFormat(TextOutputFormat.class);

52. FileInputFormat.setInputPaths(conf, new Path(args[0])); 53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));

55. JobClient.runJob(conf); 57. } 58. }

Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime

Environment

Packaging Map Reduce Program

Assuming we have already downloaded wordcount.jar into the folder /Downloads:

Reviewing MapReduce Job in Hue

Reviewing MapReduce Output Result

Running MapReduce Job Using Oozie: Select Java

Hands-On: Running WordCount.jar on Amazon EMR

Architecture Overview of Amazon EMR

Creating an AWS account

Signing up for the necessary services

● Simple Storage Service (S3)● Elastic Compute Cloud (EC2)● Elastic MapReduce (EMR)

Caution! This costs real money!

Creating Amazon S3 bucket

Upload .jar file and input file to Amazon S3

1. Select <yourbucket> in Amazon S3 service

2. Create folder : applications

3. Upload wordcount.jar to the applications folder

4. Create another folder: input

5. Upload input_test.txt to the input folder

Create a new cluster in EMR

Assign a cluster name and log location

Assign hardware configuration

Select Custom JAR

Input JAR Location and Arguments

Select Create Cluster

Monitoring the cluster

View Result from the S3 bucket

LectureUnderstanding Hive

IntroductionA Petabyte Scale Data Warehouse Using Hadoop

Hive is developed by Facebook, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).

Hive Metastore

● Store Hive metadata

● Configurations

– Embedded: in-process metastore, in-process database

– Local: in-process metastore, out-of-process database

– Remote: out-of-process metastore,out-of-process database

Hive Schema-On-Read

● Faster loads into the database (simply copy or move)

● Slower queries

● Flexibility – multiple schemas for the same data

HiveQL

● Hive Query Language● SQL dialect● No support for:

– UPDATE, DELETE

– Transactions

– Indexes

– HAVING clause in SELECT

– Updateable or materialized views

– Srored procedure

Hive Tables

● Managed- CREATE TABLE

– LOAD- File moved into Hive's data warehouse directory

– DROP- Both data and metadata are deleted.

● External- CREATE EXTERNAL TABLE

– LOAD- No file moved

– DROP- Only metadata deleted

– Use when sharing data between Hive and Hadoop applications

or you want to use multiple schema on the same data

Running Hive

Hive Shell

● Interactive

hive● Script

hive -f myscript● Inline

hive -e 'SELECT * FROM mytable'

Hive.apache.org

System Architecture and Components

• Metastore: To store the meta data.• Query compiler and execution engine: To convert SQL queries to a

sequence of map/reduce jobs that are then executed on Hadoop.• SerDe and ObjectInspectors: Programmable interfaces and

implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.

• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).

• Clients: Command line client similar to Mysql command line.

hive.apache.org

Architecture Overview

Hive CLIQueriesBrowsing

Map Reduce

MetaStore

Thrift API

SerDeThrift Jute JSON..

Execution

Hive QL

Parser

Planner

Hive.apache.org

Sample HiveQL

The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query

SELECT * FROM t where t.c = 'xyz'

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

SELECT t1.c1, count(1) from t1 group by t1.c1

Hive.apache.org

Hands-On: Creating Table and Retrieving Data using Hive

Running Hive from terminal

Starting Hive

hive> quit;

Quit from Hive

Starting Hive Editor from Hue

Scroll Downthe web page

Creating Hive Table

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Time taken: 4.069 seconds

hive (default)> show tables;

test_tbl

hive (default)> describe test_tbl;

id int

country string

hive (default)>

See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html

Using Hue Query Editor

Reviewing Hive Table in HDFS

Alter and Drop Hive Table

hive (default)> alter table test_tbl add columns (remarks STRING);

hive (default)> describe test_tbl;

id int

country string

remarks string

hive (default)> drop table test_tbl;

See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html

Loading Data to Hive Table

$ hive

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Creating Hive table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data.csv

Copying file: file:/tmp/test_tbl_data.csv

Loading data to table default.test_tbl

hive (default)>

Loading data to Hive table

Querying Data from Hive Table

hive (default)> select * from test_tbl;

62 Indonesia

63 Philippines

65 Singapore

66 Thailand

hive (default)>

Insert Overwriting the Hive Table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data_updated.csv

Copying file: file:/tmp/test_tbl_data_updated.csv

Loading data to table default.test_tbl

Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl

hive (default)>

LectureUnderstanding Pig

IntroductionA high-level platform for creating MapReduce programs Using Hadoop

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

Pig Components

● Two Compnents● Language (Pig Latin)● Compiler

● Two Execution Environments● Local

pig -x local● Distributed

pig -x mapreduce

Hive.apache.org

Running Pig

● Script

pig myscript● Command line (Grunt)

pig● Embedded

Writing a java program

Hive.apache.org

Pig Latin

Hive.apache.org

Pig Execution Stages

Hive.apache.orgSource Introduction to Apache Hadoop-Pig: PrashantKommireddi

Why Pig?

● Makes writing Hadoop jobs easier● 5% of the code, 5% of the time● You don't need to be a programmer to write Pig scripts

● Provide major functionality required for DatawareHouse and Analytics● Load, Filter, Join, Group By, Order, Transform

● User can write custom UDFs (User Defined Function)

Hive.apache.orgSource Introduction to Apache Hadoop-Pig: PrashantKommireddi

Hive.apache.org

Hands-On: Running a Pig script

Starting Pig Command Line

[hdadmin@localhost ~]$ pig -x local

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log

2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found

2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

grunt>

Starting Pig from Hue

countryFilter.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);B = FILTER A BY gni > 2000;C = ORDER B BY gni;dump C;

#Preparing Data

Download hdi-data.csv

#Edit Your Script

[hdadmin@localhost ~]$ cd Downloads/

[hdadmin@localhost ~]$ vi countryFilter.pig

Writing a Pig Script

[hdadmin@localhost ~]$ cd Downloads

[hdadmin@localhost ~]$ pig -x local

grunt > run countryFilter.pig

(150,Cameroon,0.482,51,5,10,2031)

(126,Kyrgyzstan,0.615,67,9,12,2036)

(156,Nigeria,0.459,51,5,8,2069)

(154,Yemen,0.462,65,2,8,2213)

(138,Lao People's Democratic Republic,0.524,67,4,9,2242)

(153,Papua New Guinea,0.466,62,4,5,2271)

(165,Djibouti,0.43,57,3,5,2335)

(129,Nicaragua,0.589,74,5,10,2430)

(145,Pakistan,0.504,65,4,6,2550)

Running a Pig Script

Lecture: Understanding Sqoop

Introduction

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

• Imports individual tables or entire databases to files in HDFS

• Generates Java classes to allow you to interact with your imported data

• Provides the ability to import from SQL databases straight into your Hive data warehouse

See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

Architecture Overview

Hive.apache.org

Hands-On: Loading Data from DBMS to Hadoop HDFS

Loading Data into MySQL DB

[root@localhost ~]# service mysqld start

Starting mysqld: [ OK ]

[root@localhost ~]# mysql

mysql> create database countrydb;

Query OK, 1 row affected (0.00 sec)

mysql> use countrydb;

Database changed

mysql> create table country_tbl(id INT, country VARCHAR(100));

Query OK, 0 rows affected (0.02 sec)

mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE country_tbl FIELDS terminated by ',' LINES TERMINATED BY '\r\n';

Query OK, 237 rows affected, 4 warnings (0.00 sec)

Records: 237 Deleted: 0 Skipped: 0 Warnings: 0

Loading Data into MySQL DB

mysql> select * from country_tbl;

+------+------------------------------+

| id | country |

+------+------------------------------+

| 93 | Afghanistan |

| 355 | Albania |

| 213 | Algeria |

| 1684 | AmericanSamoa |

| 376 | Andorra |

+------+------------------------------+

237 rows in set (0.00 sec)

mysql>

Testing data query from MySQL DB

Installing DB driver for Sqoop

[hdadmin@localhost conf]$ su -

Password:

[root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/

[root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar

[root@localhost ~]# exit

Importing data from MySQL to Hive Table

[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --hive-import --hive-table country_tbl -m 1

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Warning: $HADOOP_HOME is deprecated.

Enter password: <enter here>13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override

13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.

13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation

13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1

13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/..

Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar

13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql.

13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct

13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.

13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)

13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl

13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001

13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0%

13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0%

13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001

13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18

13/03/21 18:08:07 INFO mapred.JobClient: Job Counters

13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154

13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0

13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1

13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0

13/03/21 18:08:07 INFO mapred.JobClient: File Output Format Counters

13/03/21 18:08:07 INFO mapred.JobClient: Bytes Written=3252

13/03/21 18:08:07 INFO mapred.JobClient: FileSystemCounters

13/03/21 18:08:07 INFO mapred.JobClient: HDFS_BYTES_READ=87

13/03/21 18:08:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=30018

13/03/21 18:08:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3252

13/03/21 18:08:07 INFO mapred.JobClient: File Input Format Counters

13/03/21 18:08:07 INFO mapred.JobClient: Bytes Read=0

13/03/21 18:08:07 INFO mapred.JobClient: Map-Reduce Framework

13/03/21 18:08:07 INFO mapred.JobClient: Map input records=235

13/03/21 18:08:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=37916672

13/03/21 18:08:07 INFO mapred.JobClient: Spilled Records=0

13/03/21 18:08:07 INFO mapred.JobClient: CPU time spent (ms)=210

13/03/21 18:08:07 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928

13/03/21 18:08:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=375439360

13/03/21 18:08:07 INFO mapred.JobClient: Map output records=235

13/03/21 18:08:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=87

13/03/21 18:08:07 INFO mapreduce.ImportJobBase: Transferred 3.1758 KB in 22.965 seconds (141.6067 bytes/sec)

13/03/21 18:08:07 INFO mapreduce.ImportJobBase: Retrieved 235 records.

13/03/21 18:08:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1

13/03/21 18:08:07 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hdadmin/country_tbl/_logs

13/03/21 18:08:07 INFO hive.HiveImport: Loading uploaded data into Hive

13/03/21 18:08:08 INFO hive.HiveImport: Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties

13/03/21 18:08:08 INFO hive.HiveImport: Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211808_1458043622.txt

13/03/21 18:08:12 INFO hive.HiveImport: OK

13/03/21 18:08:12 INFO hive.HiveImport: Time taken: 4.079 seconds

13/03/21 18:08:13 INFO hive.HiveImport: Loading data to table default.country_tbl

13/03/21 18:08:13 INFO hive.HiveImport: OK

13/03/21 18:08:13 INFO hive.HiveImport: Time taken: 0.253 seconds

13/03/21 18:08:13 INFO hive.HiveImport: Hive import complete.

13/03/21 18:08:13 INFO hive.HiveImport: Export directory is empty, removing it.

[hdadmin@localhost ~]$

Reviewing data from Hive Table

[hdadmin@localhost ~]$ hive

Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties

Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt

hive (default)> show tables;

country_tbl

test_tbl

hive (default)> select * from country_tbl;

93 Afghanistan

355 Albania

hive (default)> quit;

[hdadmin@localhost ~]$

Reviewing HDFS Database Table files

Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse

LectureUnderstanding HBase

IntroductionAn open source, non-relational, distributed database

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

HBase Features

● Hadoop database modelled after Google's Bigtab;e● Column oriented data store, known as Hadoop Database● Support random realtime CRUD operations (unlike

HDFS)● No SQL Database● Opensource, written in Java● Run on a cluster of commodity hardware

Hive.apache.org

When to use Hbase?

● When you need high volume data to be stored ● Un-structured data● Sparse data● Column-oriented data● Versioned data (same data template, captured at various

time, time-elapse data)● When you need high scalability

Hive.apache.org

Which one to use?

● HDFS● Only append dataset (no random write)● Read the whole dataset (no random read)

● HBase● Need random write and/or read● Has thousands of operation per second on TB+ of data

● RDBMS● Data fits on one big node● Need full transaction support● Need real-time query capabilities

Hive.apache.org

HBase Components

Hive.apache.org

● Region● Row of table are stores

● Region Server● Hosts the tables

● Master● Coordinating the Region

Servers● ZooKeeper● HDFS● API

● The Java Client API

HBase Shell Commands

Hive.apache.org

Hands-On: Running HBase

Starting HBase shell

[hdadmin@localhost ~]$ start-hbase.sh

starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out

[hdadmin@localhost ~]$ jps

3064 TaskTracker

2836 SecondaryNameNode

2588 NameNode

3513 Jps

3327 HMaster

2938 JobTracker

2707 DataNode

[hdadmin@localhost ~]$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013

hbase(main):001:0>

Create a table and insert data in HBase

hbase(main):009:0> create 'test', 'cf'

0 row(s) in 1.0830 seconds

hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'

hbase(main):011:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1375363287644, value=val1

hbase(main):002:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1375363287644, value=val1

Using Data Browsers in Hue for HBase

Recommendation to Further Study

Thank you

www.imcinstitute.comwww.facebook.com/imcinstitute

big data on public cloud using cloudera on gogrid & amazon emr

Technology

cloudera user group sf - cloudera manager: apis &...

microgroove (gogrid customer) presentation at cloud connect...

discography - amazon s3...sing, sing, sing (prima) n emr...

apache atlas reference - cloudera · cloudera, cloudera...

gogrid — scaling your internet business

xen summit gogrid and xen - april 2010 v4

cloudera impala

centerity service pack for cloudera€¦ · oozie [roles...

cloudera godatafest deploying cloudera in the cloud

discography - amazon s3 · take five n° emr brass band emr...

discography · take five n° emr brass band emr 3619 emr...

discography - alle-noten.de · concerto n° 1 trumpet,...

building the enterprise data lake with cloudera & cisco ·...

agile development at gogrid with pallet and jclouds

discography · wind band (arrangements) (fortsetzung -...

discography - s3.eu-central-1.amazonaws.com · concert band...

discography · (vangelis) n° emr blasorchester concert...

cloudera user group chicago - cloudera manager: apis &...

discography - edrmartin.com file1’52 2’39 2’31 2’08...

16069 romance strs - alle-noten.dehejre kati (hubay) n° emr...