

HCatalog Table Management for Hadoop

Page 1

Alan F. Gates

@alanfgates


Who Am I?

Page 2

• HCatalog committer and mentor

• Co-founder of Hortonworks

• Tech lead for Data team at Hortonworks

• Pig committer and PMC Member

• Member of Apache Software Foundation and Incubator PMC

• Author of Programming Pig from O’Reilly


Users: Data Sharing is Hard

Page 3

Photo Credit: totalAldo via Flickr

This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.

Joe: "Bob, I need today's data."

Bob: "Ok."

Joe: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command."

Bob: "Dude, we need HCatalog."


Pig Example

Page 4

raw = load '/rawevents/20100819/data' using MyLoader()
    as (ts:long, user:chararray, url:chararray);
botless = filter raw by NotABot(user);
store botless into '/processedevents/20100819/data';

Consumers of processedevents must be manually informed by the producer that data is available, or must poll HDFS (this is bad).

raw = load 'rawevents' using HCatLoader();
botless = filter raw by date == '20100819' and NotABot(user);
store botless into 'processedevents' using HCatStorer('date=20100819');

Consumers of processedevents will be notified by HCatalog via JMS that data is available; they can then start their jobs.
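
Not from the deck: below is a minimal sketch of what a processedevents consumer listening for that JMS notification might look like. Only the javax.jms API is standard here; the ActiveMQ broker, its URL, the topic name, and the message contents are assumptions and depend on how the HCatalog installation is configured.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory; // assumed broker; any JMS provider works

public class ProcessedEventsListener {
  public static void main(String[] args) throws Exception {
    // Hypothetical broker URL and topic name; use whatever your installation publishes to.
    ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616");
    Connection conn = factory.createConnection();
    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
    Topic topic = session.createTopic("hcat.events.processedevents");

    // React to each notification by kicking off the downstream job (stubbed out here).
    session.createConsumer(topic).setMessageListener(new MessageListener() {
      public void onMessage(Message msg) {
        System.out.println("New data available: " + msg);
        // launch the downstream Pig/Hive/MapReduce job here
      }
    });

    conn.start();
    Thread.sleep(Long.MAX_VALUE); // keep the listener alive
  }
}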


Developers:

Each tool requires its own Translator

Page 5

[Diagram: Pig, Hive, and MapReduce each need their own adapter for each storage format. To read RCFile and a custom format, Pig needs the HiveColumnarLoader and a custom loader, Hive needs the ColumnarSerDe and a custom SerDe, and MapReduce needs the RCFileInputFormat and a custom InputFormat.]


Developers:

Each tool requires its own Translator

Page 6

[Diagram: With HCatalog in the middle, Pig goes through HCatLoader, MapReduce goes through HCatInputFormat, and Hive talks to HCatalog directly; underneath, HCatalog uses the ColumnarSerDe or a custom SerDe to read RCFile or the custom format.]
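
Not from the deck: a minimal sketch of the MapReduce side of this picture, written against the HCatalog 0.4-era API (org.apache.hcatalog.*). The job names a database and table instead of a path and an InputFormat; the table name, column position, and partition filter below are made up for illustration, and exact class and method names may differ between releases.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class CountEventsByUser {

  // Each input row arrives as an HCatRecord; no InputFormat, SerDe, or
  // file-layout details appear in user code.
  public static class EventMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord row, Context ctx)
        throws IOException, InterruptedException {
      // Position 1 is assumed to be the "user" column for this sketch;
      // a real job would look it up from the table schema.
      ctx.write(new Text(String.valueOf(row.get(1))), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "count events by user");
    job.setJarByClass(CountEventsByUser.class);
    job.setMapperClass(EventMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Read the rawevents table from the default database; the third
    // argument is an optional partition filter (illustrative here).
    HCatInputFormat.setInput(job,
        InputJobInfo.create("default", "rawevents", "date=\"20100819\""));
    job.setInputFormatClass(HCatInputFormat.class);

    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It runs like any other MapReduce job; the only HCatalog-specific pieces are the setInput call and the HCatRecord value type.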


Ops: Can I Ever Change the Data Format?

Page 7

• Data format can be changed (e.g. CSV to JSON)

– no need to reformat existing data

– new data will be written in the new format

– HCatalog can read across the format changes

– Users' programs will not see the difference

• Data location can be changed without affecting user programs

• New columns can be added via alter table add column

– no need to reformat existing data

– fields not present in old data will get a null when reading old data


Relationship to Hive

Page 8

• If you are a Hive user, you can transition to HCatalog with no metadata modifications (true starting with version 0.4)

– Uses Hive’s SerDes for data translation (starting in 0.4)

– Uses Hive SQL for data definition (DDL)

• Provides a different security implementation; security is delegated to the underlying storage

– Currently integrated with HDFS

– Integration with HBase in progress

– In the near future you will be able to choose to use the Hive authorization model if you wish


Current Status

Page 9

• Release 0.4 (expected out next month)

– Hive/Pig/MapReduce integration

– Support for any data format with a SerDe (Text, Sequence, RCFile, JSON SerDes included)

– Notification via JMS

– Basic HBase integration


Current Development

Page 10

• Improving HBase integration

– new HBase security features

– repeatable read for HBase tables

• REST API for HCatalog

– Databases, tables, partitions are objects

– PUT to create or modify objects

– GET to retrieve information about objects

– DELETE to drop objects

– Submitted documents and results encoded in JSON
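
Not from the deck: a rough sketch of what that PUT/GET/DELETE pattern could look like from Java once the API ships. The host, port, and URL path below are placeholders invented for illustration, not actual HCatalog endpoints; only the java.net classes are standard.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HCatRestSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint: retrieve a table description as JSON.
    URL url = new URL("http://hcat-server:9999/hcatalog/v1/tables/rawevents");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");                   // PUT would create/modify, DELETE would drop
    conn.setRequestProperty("Accept", "application/json");

    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);                     // JSON description of the object
    }
    in.close();
    conn.disconnect();
  }
}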


Future Directions

Page 11


Storing Semi-/Unstructured Data

Page 12

Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users' as
      (name:chararray, zip:chararray);
  B = foreach A generate name, zip;


Storing Semi-/Unstructured Data

Page 13

Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users';
  B = foreach A generate name, zip;


Data Lifecycle Management

Page 15

• Replication

• Cleaning

• Compaction

• Archiving


Partitions in Different Storage

Page 16

[Diagram: the table's latest partitions are stored in HBase, its historical partitions in HDFS. HBase: data can be streamed in and is available for read almost instantly. HDFS: data must be loaded in batches, but scans roughly 10x faster than HBase.]


Partitions in Different Storage

Page 17

[Diagram: the same HBase/HDFS trade-offs as before, but the user now sees a single table holding all of the data, with HCatalog tracking which partitions live in which store.]


Federation with MPP Data Stores

Page 18

[Diagram: HCatalog sits between Hadoop and an MPP store, connected to the MPP store through a HiveStorageHandler and web services, so that "select * from HadoopTable;" and "select * from MPPTable;" both go through HCatalog regardless of where each table lives.]


Storing More Metadata

Page 19

• HCatalog currently stores metadata in an RDBMS

– Most people use MySQL

• Table-level statistics can currently be stored

• Would like to store partition statistics

– But this could overwhelm the RDBMS

• Would like to store user-generated metadata for partitions

– But this could overwhelm the RDBMS

• Could we store this data in HBase instead?

• Could we store all the metadata in HBase instead?


Thank You

Page 20 © Hortonworks Inc. 2012


Other Resources

Page 21 © Hortonworks Inc. 2012

• Next Webcast: Extending Hadoop beyond MapReduce

– March 7

– 10am Pacific/1pm Eastern

– http://hortonworks.com/webinars/

• Hadoop Summit

– June 13-14

– San Jose, California

– Call for Papers Deadline extended

– Hadoopsummit.org

• Hadoop Training and Certification

– Developing Solutions Using Apache Hadoop

– Administering Apache Hadoop

– http://hortonworks.com/training/