
Page 1: HBase and  Bigtable  Storage

HBase and Bigtable Storage

Xiaoming Gao
Judy Qiu

Hui Li

Page 2: HBase and  Bigtable  Storage

Outline

• HBase and Bigtable Storage
• HBase Use Cases
• Load a CSV file into an HBase table with MapReduce
• Demo: Search Engine System with MapReduce Technologies (Hadoop/HDFS/HBase/Pig)

Page 3: HBase and  Bigtable  Storage

HBase Introduction

• HBase is an open source, distributed, sorted map modeled after Google’s BigTable

• HBase is built on Hadoop:
  – Fault tolerance
  – Scalability
  – Batch processing with MapReduce

• HBase uses HDFS for storage

Page 4: HBase and  Bigtable  Storage

HBase Cluster Architecture

• Tables split into regions and served by region servers
• Regions vertically divided by column families into "stores"
• Stores saved as files on HDFS

Page 5: HBase and  Bigtable  Storage

Data Model: A Big Sorted Map

• A big sorted map, not a relational database: no SQL
• Tables consist of rows, each of which has a primary key (row key)
• Each row has any number of columns:
  SortedMap<rowKey, List<SortedMap<column, List<(value, timestamp)>>>>
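To make the sorted-map model concrete, here is a minimal Java sketch (not from the original slides) using the classic HTable client API that also appears in the MapReduce examples later in this deck; the table name "demo", column family "cf", and qualifier "name" are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SortedMapDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");   // hypothetical table with column family "cf"

    // Write: row key -> (column family, qualifier) -> timestamped value
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Tome"));
    table.put(put);

    // Read it back: the newest version of the cell is returned by default
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));
    table.close();
  }
}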

Page 6: HBase and  Bigtable  Storage

HBase vs. RDBMS

• Data layout: row-oriented (RDBMS) vs. column-family-oriented (HBase)
• Indexes: on rows and columns (RDBMS) vs. on the row key (HBase)
• Hardware requirement: large arrays of fast, expensive disks (RDBMS) vs. designed for commodity hardware (HBase)
• Max data size: TBs (RDBMS) vs. ~1 PB (HBase)
• Read/write throughput: thousands of queries/second (RDBMS) vs. millions of queries/second (HBase)
• Query language: SQL with joins and grouping (RDBMS) vs. Get/Put/Scan (HBase)
• Ease of use: relational data modeling, easy to learn (RDBMS) vs. a sorted map with a significant learning curve, though communities and tools are growing (HBase)

Page 7: HBase and  Bigtable  Storage

When to Use HBase

• Dataset scale
  – Indexing huge numbers of web pages on the internet, or genome data
  – Need to mine large social media data sets
• Read/write scale
  – Reads and writes are distributed as tables are distributed across nodes
  – Writes are extremely fast and require no index updates
• Batch analysis
  – Massive and convoluted SQL queries can be executed in parallel via MapReduce jobs

Page 8: HBase and  Bigtable  Storage

Use Cases:

• Facebook Analytics
  – Real-time counters of URLs shared and preferred links
• Twitter
  – 25 TB of messages every month
• Mozilla
  – Stores crash reports, 2.5 million per day

Page 9: HBase and  Bigtable  Storage

Programming with HBase

1. HBase shell
   – Scan, List, Create
2. Native Java API
   – Get(byte[] row, byte[] column, long ts, int version)
3. Non-Java clients
   – Thrift server (Ruby, C++, PHP)
   – REST server
4. HBase MapReduce API
   – TableInputFormat/TableOutputFormat for MapReduce
5. High-level interfaces
   – Pig, Hive

Page 10: HBase and  Bigtable  Storage

Hands-on HBase MapReduce Programming

• HBase MapReduce API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
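As a minimal sketch of what these imports are for (an assumed example, not part of the original slides): a TableMapper reads an HBase table as (row key, Result) pairs, and TableMapReduceUtil wires it into a job. The table name "mytable" is hypothetical.

public static class CellCountMapper extends TableMapper<ImmutableBytesWritable, IntWritable> {
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
      throws IOException, InterruptedException {
    // emit (row key, number of cells stored in that row)
    context.write(rowKey, new IntWritable(row.size()));
  }
}

// inside a job-setup method such as configureJob(...):
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob("mytable", scan, CellCountMapper.class,
    ImmutableBytesWritable.class, IntWritable.class, job);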

Page 11: HBase and  Bigtable  Storage

Hands-on: load CSV file into HBase table with MapReduce

• Main entry point of the program

public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Wrong number of arguments: " + otherArgs.length);
    System.err.println("Usage: <csv file> <hbase table name>");
    System.exit(-1);
  } // end if
  Job job = configureJob(conf, otherArgs);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
} // main

Page 12: HBase and  Bigtable  Storage

Hands-on: load CSV file into HBase table with MapReduce

• Configure the HBase MapReduce job

public static Job configureJob(Configuration conf, String[] args) throws IOException {
  Path inputPath = new Path(args[0]);
  String tableName = args[1];
  Job job = new Job(conf, NAME + "_" + tableName);
  job.setJarByClass(Uploader.class);
  FileInputFormat.setInputPaths(job, inputPath);
  // TextInputFormat delivers one CSV line per map() call, matching the (LongWritable, Text) mapper below
  job.setInputFormatClass(TextInputFormat.class);
  job.setMapperClass(Uploader.class);
  // Puts are written directly from the mapper, so no reduce phase is needed
  TableMapReduceUtil.initTableReducerJob(tableName, null, job);
  job.setNumReduceTasks(0);
  return job;
} // configureJob

Page 13: HBase and  Bigtable  Storage

Hands-on: load CSV file into HBase table with MapReduce

• The map function

public void map(LongWritable key, Text line, Context context) throws IOException {
  // Input is a CSV file; each map() call receives a single line, keyed by its offset in the file
  // Each line is comma-delimited: row,family,qualifier,value
  String[] values = line.toString().split(",");
  if (values.length != 4) {
    return;
  }
  byte[] row = Bytes.toBytes(values[0]);
  byte[] family = Bytes.toBytes(values[1]);
  byte[] qualifier = Bytes.toBytes(values[2]);
  byte[] value = Bytes.toBytes(values[3]);
  Put put = new Put(row);
  put.add(family, qualifier, value);
  try {
    context.write(new ImmutableBytesWritable(row), put);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  if (++count % checkpoint == 0) {
    context.setStatus("Emitting Put " + count);
  }
} // map

Page 14: HBase and  Bigtable  Storage

Hands-on: load CSV file into HBase table with MapReduce

• Steps to run program

1. Create the HBase table with the specified data schema (see the shell sketch below)
2. Compile the program with Ant
3. Run the program:
   bin/hadoop org.apache.hadoop.hbase.mapreduce.CSV2HBase input.csv "test"
4. Check the inserted records in the HBase table
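As a sketch of steps 1 and 4 (assuming the CSV rows reference a column family named "f1"; adjust to your own schema), the HBase shell can create the table and verify the load:

$ hbase shell
hbase> create 'test', 'f1'
# ... run the MapReduce job from step 3 ...
hbase> scan 'test'
hbase> count 'test'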

Page 15: HBase and  Bigtable  Storage

Extension: write output to HBase table

public static Job configureJob(Configuration conf, String[] args) throws IOException {
  String tableName = args[0];
  String columnFamily = args[1];
  conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(new Scan()));
  conf.set(TableInputFormat.INPUT_TABLE, tableName);
  conf.set("index.tablename", tableName);
  conf.set("index.familyname", columnFamily);
  String[] fields = new String[args.length - 2];
  for (int i = 0; i < fields.length; i++) {
    fields[i] = args[i + 2];
  }
  conf.setStrings("index.fields", fields);
  Job job = new Job(conf, tableName);
  job.setJarByClass(IndexBuilder.class);
  job.setMapperClass(Map.class);
  job.setNumReduceTasks(0);
  job.setInputFormatClass(TableInputFormat.class);
  job.setOutputFormatClass(MultiTableOutputFormat.class);
  return job;
}

Page 16: HBase and  Bigtable  Storage

Extension: write output to HBase table

public static class Map extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Writable> {
  private byte[] family;
  private HashMap<byte[], ImmutableBytesWritable> indexes;

  protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
      throws IOException, InterruptedException {
    for (java.util.Map.Entry<byte[], ImmutableBytesWritable> index : indexes.entrySet()) {
      byte[] qualifier = index.getKey();
      ImmutableBytesWritable tableName = index.getValue();
      byte[] value = result.getValue(family, qualifier);
      if (value != null) {
        // index row key = cell value; the indexed column stores the original row key
        Put put = new Put(value);
        put.add(INDEX_COLUMN, INDEX_QUALIFIER, rowKey.get());
        context.write(tableName, put);
      } // if
    } // for
  } // map
  // family and indexes are initialized in setup() from the job configuration (sketched below)
}
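The slide omits the setup() method that initializes family and indexes; a sketch consistent with the configuration keys set in configureJob above (index.tablename, index.familyname, index.fields) could look like the following. The "<tableName>-<field>" naming of the per-field index tables is an assumption for illustration.

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  Configuration configuration = context.getConfiguration();
  String tableName = configuration.get("index.tablename");
  String familyName = configuration.get("index.familyname");
  String[] fields = configuration.getStrings("index.fields");

  family = Bytes.toBytes(familyName);
  indexes = new HashMap<byte[], ImmutableBytesWritable>();
  for (String field : fields) {
    // one target index table per indexed qualifier, e.g. "<tableName>-<field>" (assumed naming)
    indexes.put(Bytes.toBytes(field),
        new ImmutableBytesWritable(Bytes.toBytes(tableName + "-" + field)));
  }
}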

Page 17: HBase and  Bigtable  Storage

Big Data Challenge

Mega 10^6

Giga 10^9

Tera 10^12

Peta 10^15

Page 18: HBase and  Bigtable  Storage

Search Engine System with MapReduce Technologies

1. Search Engine System for Summer School2. To give an example of how to use MapReduce

technologies to solve big data challenge.3. Using Hadoop/HDFS/HBase/Pig4. Indexed 656K web pages (540MB in size)

selected from Clueweb09 data set.5. Calculate ranking values for 2 million web

sites.

Page 19: HBase and  Bigtable  Storage

Architecture for SESSS

[Architecture diagram] Components: a Web UI served by an Apache server with a PHP script on the Salsa Portal; Hive/Pig scripts; a Thrift client talking to a Thrift server in front of HBase; HBase tables (1. inverted index table, 2. page rank table) on a Hadoop cluster on FutureGrid; an Inverted Indexing System built on Apache Lucene; and a Ranking System driven by a Pig script.

Page 20: HBase and  Bigtable  Storage

Demo Search Engine System for Summer School

build-index-demo.exe (build index with HBase)pagerank-demo.exe (compute page rank with Pig)http://salsahpc.indiana.edu/sesss/index.php

Page 21: HBase and  Bigtable  Storage

High Level Language: Pig Latin

Hui Li Judy Qiu

Some material adapted from slides by Adam Kawa, presented at the 3rd meeting of WHUG, June 21, 2012

Page 22: HBase and  Bigtable  Storage

What is Pig

• A framework for analyzing large unstructured and semi-structured data on top of Hadoop
  – The Pig Engine parses and compiles Pig Latin scripts into MapReduce jobs that run on top of Hadoop
  – Pig Latin is a simple but powerful data flow language, similar to scripting languages

Page 23: HBase and  Bigtable  Storage

Motivation for Using Pig

• Faster development
  – Fewer lines of code (writing MapReduce becomes like writing SQL queries)
  – Code re-use (Pig libraries, Piggy Bank)
• One test: find the top 5 most frequent words
  – 10 lines of Pig Latin vs. 200 lines of Java
  – 15 minutes in Pig Latin vs. 4 hours in Java

[Bar charts comparing Pig Latin and Java for the word-count test: lines of code and development time in minutes, scale 0-300]

Page 24: HBase and  Bigtable  Storage

Word Count using MapReduce

Page 25: HBase and  Bigtable  Storage

Word Count using Pig

Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group AS word, COUNT(Words) AS count;
Results = ORDER Counts BY count DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO 'output/top5words';

Page 26: HBase and  Bigtable  Storage

Pig Tutorial

• Basic Pig knowledge (Word Count)
  – Pig data types
  – Pig operations
  – How to run Pig scripts
• Advanced Pig features (Kmeans clustering)
  – Embedding Pig within Python
  – User defined functions

Page 27: HBase and  Bigtable  Storage

Pig Data Types

• Concepts: fields, tuples, bags, relations
  – A field is a piece of data
  – A tuple is an ordered set of fields
  – A bag is a collection of tuples
  – A relation is a bag
• Simple types
  – int, long, float, double, boolean, null, chararray, bytearray
• Complex types
  – Tuple: like a row in a database
    (0002576169, Tome, 21, "Male")
  – Data Bag: like a table or view in a database
    {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female"), ...}

Page 28: HBase and  Bigtable  Storage

Pig Operations

• Loading data
  – LOAD loads input data
  – Lines = LOAD 'input/access.log' AS (line: chararray);
• Projection
  – FOREACH ... GENERATE ... (similar to SELECT)
  – Takes a set of expressions and applies them to every record
• Grouping
  – GROUP collects together records with the same key
• Dump/Store
  – DUMP displays results on screen; STORE saves results to the file system
• Aggregation
  – AVG, COUNT, COUNT_STAR, MAX, MIN, SUM

Page 29: HBase and  Bigtable  Storage

How to Run Pig Latin Scripts

• Local mode
  – The local host and local file system are used
  – Neither Hadoop nor HDFS is required
  – Useful for prototyping and debugging
• MapReduce mode
  – Runs on a Hadoop cluster and HDFS
• Batch mode: run a script directly
  – pig -x local my_pig_script.pig
  – pig -x mapreduce my_pig_script.pig
• Interactive mode: use the Grunt shell to run statements
  – grunt> Lines = LOAD '/input/input.txt' AS (line: chararray);
  – grunt> Unique = DISTINCT Lines;
  – grunt> DUMP Unique;

Page 30: HBase and  Bigtable  Storage

Hands-on: Word Count using Pig Latin

1. cd pigtutorial/pig-hands-on/
2. tar -xf pig-wordcount.tar
3. cd pig-wordcount
4. pig -x local
5. grunt> Lines = LOAD 'input.txt' AS (line: chararray);
6. grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
7. grunt> Groups = GROUP Words BY word;
8. grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
9. grunt> DUMP counts;

Page 31: HBase and  Bigtable  Storage

Sample: Kmeans using Pig Latin

A method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Assignment step: Assign each observation to the cluster with the closest mean

Update step: Calculate the new means to be the centroid of the observations in the cluster.
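In symbols (standard k-means, written out here to make the two steps precise; the notation is not from the original slides):

Assignment step: $S_j^{(t)} = \{\, x_i : \lVert x_i - \mu_j^{(t)} \rVert \le \lVert x_i - \mu_{j'}^{(t)} \rVert \ \text{for all } j' \,\}$

Update step: $\mu_j^{(t+1)} = \frac{1}{\lvert S_j^{(t)} \rvert} \sum_{x_i \in S_j^{(t)}} x_i$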

Reference: http://en.wikipedia.org/wiki/K-means_clustering

Page 32: HBase and  Bigtable  Storage

Kmeans Using Pig Latin

PC = Pig.compile("""
    register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = LOAD 'student.txt' AS (name:chararray, age:int, gpa:double);
    centroided = FOREACH raw GENERATE gpa, find_centroid(gpa) AS centroid;
    grouped = GROUP centroided BY centroid;
    result = FOREACH grouped GENERATE group, AVG(centroided.gpa);
    STORE result INTO 'output';
""")

Page 33: HBase and  Bigtable  Storage

Kmeans Using Pig Latin

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids for this iteration and the distance moved since the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ...

Page 34: HBase and  Bigtable  Storage

Embedding Python scripts with Pig Statements

• Pig does not support flow control statements: if/else, while loops, for loops, etc.
• The Pig embedding API can leverage all language features provided by Python, including control flow:
  – Loops and exit criteria
  – Similar to database embedding APIs
  – Easier parameter passing
• JavaScript is available as well
• The framework is extensible: any JVM implementation of a language could be integrated

Page 35: HBase and  Bigtable  Storage

User Defined Function

• What is a UDF?
  – A way to perform an operation on a field or fields
  – Called from within a Pig script
  – Currently all written in Java
• Why use a UDF?
  – You need to do more than grouping or filtering
  – Filtering itself is actually a UDF
  – You may be more comfortable in Java land than in SQL / Pig Latin
• Example from the Kmeans script: the FindCentroid UDF is registered and defined with
  P = Pig.compile("""register udf.jar
  DEFINE find_centroid FindCentroid('$centroids'); ...
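For illustration only (the actual FindCentroid source is not shown in these slides), a minimal Pig UDF in Java extends EvalFunc and implements exec(); this hypothetical example upper-cases a chararray field:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // return null for empty or null input rather than failing the job
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

Once packaged into udf.jar, it is registered and called from Pig Latin just like FindCentroid above: register udf.jar; B = FOREACH A GENERATE Upper(name);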

Page 36: HBase and  Bigtable  Storage

Hands-on Run Pig Latin Kmeans

1. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
2. hadoop dfs -copyFromLocal input.txt ./input.txt
3. pig -x mapreduce kmeans.py
4. pig -x local kmeans.py

Page 37: HBase and  Bigtable  Storage

Hands-on Run Pig Latin Kmeans

2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';

Input(s):
Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"

Output(s):
Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"

last centroids: [0.371927835052, 1.22406743491, 2.24162171881, 3.40173705722]

Page 38: HBase and  Bigtable  Storage

References:
1. http://pig.apache.org (Pig official site)
2. http://en.wikipedia.org/wiki/K-means_clustering
3. Pig documentation: http://pig.apache.org/docs/r0.9.0
4. Papers: http://wiki.apache.org/pig/PigTalksPapers
5. http://en.wikipedia.org/wiki/Pig_Latin
6. Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012

• Questions?

Page 39: HBase and  Bigtable  Storage

Acknowledgement