scaling up - polo club of data sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...mar 07,...

46
Scaling Up HBase, Hive, Pegasus CSE 6242 A / CS 4803 DVA Mar 7, 2013 Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Upload: others

Post on 13-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Scaling UpHBase, Hive, Pegasus

CSE 6242 A / CS 4803 DVAMar 7, 2013

Duen Horng (Polo) ChauGeorgia Tech

Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Page 2: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Lecture based on these two books.

2

http://goo.gl/YNCWN http://goo.gl/svzTV

Page 3: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Introduction to Pig

Showed you a simple example run using Grunt (interactive shell)

• Find maximum temperature by year

• Easy to program, understand and maintain• You write a few lines of Pig Latin, Pig turns them into

MapReduce jobs

• Illustrate command creates sample dataset (from your full data)• Helps you debug your Pig Latin

Started with HBase

Last Time...

3

Page 4: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Built on top of HDFS

Supports real-time read/write random access

Scale to very large datasets, many machines

Not relational, does NOT support SQL (“NoSQL” = “not only SQL”)

Supports billions of rows, millions of columns

Written in Java; can work with many languages

Where does HBase come from?4

http://hbase.apache.org

Page 5: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

HBase’s “history”Hadoop & HDFS based on...

• 2003 Google File System (GFS) paper • 2004 Google MapReduce paper

HBase based on ...

• 2006 Google Bigtable paper

5

Not designed for random access

This “fixes” that

Page 6: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

How does HBase work?Column-oriented

Column is the most basic unit (instead of row)

• Columns form a row• A column can have multiple versions, each

version stored in a cellRows form a table

• Row key locates a row• Rows sorted by row key lexicographically

6

Page 7: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Row key is uniqueThink of row key as the “index” of the table

• You look up a row using its row keyOnly one “index” per table (via row key)

HBase does not have built-in support for multiple indices; support enabled via extensions

7

Page 8: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Rows sorted lexicographically (=alphabetically)

8

hbase(main):001:0> scan 'table1'ROW COLUMN+CELLrow-1 column=cf1:, timestamp=1297073325971 ... row-10 column=cf1:, timestamp=1297073337383 ... row-11 column=cf1:, timestamp=1297073340493 ... row-2 column=cf1:, timestamp=1297073329851 ... row-22 column=cf1:, timestamp=1297073344482 ... row-3 column=cf1:, timestamp=1297073333504 ... row-abc column=cf1:, timestamp=1297073349875 ... 7 row(s) in 0.1100 seconds

“row-10” comes before “row-2”. How to fix?

Page 9: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Rows sorted lexicographically (=alphabetically)

8

hbase(main):001:0> scan 'table1'ROW COLUMN+CELLrow-1 column=cf1:, timestamp=1297073325971 ... row-10 column=cf1:, timestamp=1297073337383 ... row-11 column=cf1:, timestamp=1297073340493 ... row-2 column=cf1:, timestamp=1297073329851 ... row-22 column=cf1:, timestamp=1297073344482 ... row-3 column=cf1:, timestamp=1297073333504 ... row-abc column=cf1:, timestamp=1297073349875 ... 7 row(s) in 0.1100 seconds

“row-10” comes before “row-2”. How to fix?

Pad “row-2” with a “0”.i.e., “row-02”

Page 10: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Columns grouped into column families

Column family is a new concept from HBase

• Why? Helps with organization, understanding, optimization, etc.

• In details...• Columns in the same family stored in same file

called HFile (inspired by Google’s SSTable = large map whose keys are sorted)

• Apply compression on the whole family• ...

9

Page 11: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

More on column family, columnColumn family

• Each table only supports a few families (e.g., <10)• Due to shortcomings in implementation

• Family name must be printable• Should be defined when table is created

• Shouldn’t be changed oftenEach column referenced as “family:qualifier”

• Can have millions of columns• Values can be anything that’s arbitrarily long

10

Page 12: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Cell ValueTimestamped

• Implicitly by system• Or set explicitly by user

Let you store multiple versions of a value

• = values over timeValues stored in decreasing time order

• Most recent value can be read first11

Page 13: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Time-oriented view of a row

12

Page 14: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Concise way to describe all these?

HBase data model (= Bigtable’s model)

• Sparse, distributed, persistent, multidimensional map• Indexed by row key + column key + timestamp

13

(Table, RowKey, Family, Column, Timestamp) ! Value

Page 15: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

... and the geeky way

14

(Table, RowKey, Family, Column, Timestamp) ! Value

SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>

Page 16: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

An exerciseHow would you use HBase to create a webtable store snapshots of every webpage on the planet, over time?

Also want to store each webpage’s incoming and outgoing links, metadata (e.g., language), etc.

15

Page 17: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Details: How does HBase scale up storage & balance load?

Automatically divide contiguous ranges of rows into regions

Start with one region, split into two when getting too large

16

Page 18: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Details: How does HBase scale up storage & balance load?

17

Page 19: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

How to use HBaseInteractive shell

• Will show you an example, locally (on your computer, without using HDFS)

Programmatically

• e.g., via Java, C++, Python, etc.

18

Page 20: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example, using interactive shell

19

Start HBase

Start Interactive Shell

Check HBase is running

Page 21: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Create table, add values

20

Page 22: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Scan (show all cell values)

21

Page 23: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Get (look up a row)

22

Can also look up a particular cell value, with a certain timestamp, etc.

Page 24: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Delete a value

23

Page 25: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Disable & drop table

24

Page 26: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

RDBMS vs HBaseRDBMS (=Relational Database Management System)

• MySQL, Oracle, SQLite, Teradata, etc.• Really great for many applications

• Ensure strong data consistency, integrity • Supports transactions (ACID guarantees)• ...

25

Page 27: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

RDBMS vs HBaseHow are they different?

• Hbase when you don’t know the structure/schema• HBase supports sparse data (many columns, most values are

not there)

• Use hbase if you only work with a small number of columns• Relational databases good for getting “whole” rows

• HBase: Multiple versions of data

• RDBMS support multiple indices, minimize duplications• Generally a lot cheaper to deploy HBase, for same size of data

(petabytes)

26

Page 28: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Advanced topics to learn aboutOther ways to get, put, delete... (e.g., programmatically via Java)

• Doing them in batchMaintaining your cluster

• Configurations, specs for “master” and “slaves”?• Administrating cluster• Monitoring cluster’s health

Key design

• bad keys can decrease performanceIntegrating with MapReduce

More...27

Page 29: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

HiveUse SQL to run queries on large datasets

Developed at Facebook

Similar to Pig, Hive runs on your computer

• You write HiveQL (Hive’s query language), which gets converted into MapReduce jobs

28

http://hive.apache.org

Page 30: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: starting Hive

29

Page 31: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: create table, load data

30

Specify that data file is tab-separated

This data file will be copied to Hive’s internal data directoryOverwrite old file

Page 32: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: Query

31

So simple and boring! Or is it?

Page 33: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Same thing done with Pig

32

records = LOAD 'input/ ncdc/ micro-tab/ sample.txt' AS (year:chararray, temperature:int, quality:int);

filtered_records = FILTER records BY temperature != 9999 AND (quality = = 0 OR quality = = 1 OR quality = = 4 OR quality = = 5 OR quality = = 9);

grouped_records = GROUP filtered_records BY year;

max_temp = FOREACH grouped_records GENERATE group, MAX( filtered_records.temperature);

DUMP max_temp;

Page 34: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Graph Mining using Hadoop

33

Page 36: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Generalized Iterated Matrix Vector Multiplication (GIM-V)

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos.

ICDM 2009, Miami, Florida, USA. Best Application Paper (runner-up).

35

Page 37: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Generalized Iterated Matrix Vector Multiplication (GIMV)

• PageRank• proximity (RWR)• Diameter• Connected components• (eigenvectors, • Belief Prop. • … )

Matrix – vectorMultiplication

(iterated)

details

36

Page 38: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components – 4 observations:

Size

Count

37

Page 39: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

1) 10K x largerthan next

38

Page 40: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

2) ~0.7B singleton nodes

39

Page 41: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

3) SLOPE!

40

Page 42: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count300-size

cmptX 500.Why?1100-size cmpt

X 65.Why?

4) Spikes!

41

Page 43: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

4) Spikes!

41

Page 44: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

suspiciousfinancial-advice sites

(not existing now)

42

Page 45: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

Example: GIM-V At Work• Connected Components

Size

Count

42

Page 46: Scaling Up - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2013spring/lectures/...Mar 07, 2013  · • HBase supports sparse data (many columns, most values are not there)

GIM-V At Work• Connected Components over Time

• LinkedIn: 7.5M nodes and 58M edges

Stable tail slopeafter the gelling

point

43