Introduction to Apache Accumulo
Boulder/Denver BigData Meetup - March 21, 2012
Jared Winick (@jaredwinick)

Upload: jared-winick

Post on 27-Jan-2015

DESCRIPTION

Presented at the Boulder/Denver BigData Meetup on March 21, 2012

TRANSCRIPT

Page 1: Introduction to Apache Accumulo

Introduction to Apache Accumulo

Boulder/Denver BigData Meetup - March 21, 2012

Jared Winick

@jaredwinick

Page 2: Introduction to Apache Accumulo

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

Page 3: Introduction to Apache Accumulo

http://yourmotivational.com/uploads/8604.jpg

Page 4: Introduction to Apache Accumulo

Annotation Added

Jeff Dean: Designs, Lessons and Advice from Building Large Distributed Systems

http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Page 5: Introduction to Apache Accumulo

Enables interactive access to…

Trillions of records

Petabytes of indexed data

across 100s-1000s of servers

Page 6: Introduction to Apache Accumulo

Short Accumulo History Lesson

http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/

Page 7: Introduction to Apache Accumulo

2006

Page 8: Introduction to Apache Accumulo

2008

http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg

Page 9: Introduction to Apache Accumulo

2011

Page 10: Introduction to Apache Accumulo

2012

Page 11: Introduction to Apache Accumulo

Uses of BigTable and Kin

(BigTable)

• Google Analytics [1]

• Crawl [1]

• AppEngine Datastore [2]

• Many more [1]

(HBase)

• Messages [3, 4, 6]

• Insights [5, 6]

(Accumulo)

• ???

(Cassandra)

• Rainbird (realtime analytics) [7]

1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

2.) http://code.google.com/appengine/articles/storage_breakdown.html

3.) http://www.facebook.com/note.php?note_id=454991608919

4.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf

5.) http://www.facebook.com/note.php?note_id=10150103900258920

6.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf

7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011

Page 12: Introduction to Apache Accumulo

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

Page 13: Introduction to Apache Accumulo

Multi-dimension Key

Key: Row ID | Column (Family, Qualifier, Visibility) | Timestamp

Value

http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html

Page 14: Introduction to Apache Accumulo

Keys Sorted Lexicographically

Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp

Everything is a byte[] except the Timestamp, which is a long
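The sort order on this slide can be sketched in a few lines of Python. This is only an illustration, not Accumulo code: the `Key` tuple and `sort_key` helper are hypothetical names, the integer timestamps are stand-ins for the dates shown on the next slide, and the older version of Alice's age is made up to show versioning. One detail the slide does not state: Accumulo sorts timestamps in descending order, so the newest version of a cell comes first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for an Accumulo key."""
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int  # the only non-byte[] field


def sort_key(k: Key):
    # bytes compare lexicographically; the timestamp is negated so that
    # newer versions of the same cell sort before older ones.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


entries = [
    Key(b"Bob", b"properties", b"phone", b"private", 20110301),
    Key(b"Alice", b"purchases", b"Xbox", b"public", 20110201),
    Key(b"Alice", b"properties", b"phone", b"private", 20110201),
    Key(b"Alice", b"properties", b"age", b"public", 20110101),  # made-up older version
    Key(b"Alice", b"properties", b"age", b"public", 20110301),
]

ordered = sorted(entries, key=sort_key)
for k in ordered:
    print(k)
```

All of Alice's entries come before Bob's, and the two versions of Alice's age sit next to each other with the newer one first.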

Page 15: Introduction to Apache Accumulo

Physical Layout

Row ID | Col Fam    | Col Qual | Col Vis | Time       | Value
Alice  | properties | age      | public  | March 2011 | 31
Alice  | properties | phone    | private | Feb 2011   | 555-1234
Alice  | purchases  | Xbox     | public  | Feb 2011   | $299
Bob    | properties | phone    | private | March 2011 | 555-4321
Bob    | purchases  | iPhone   | public  | Feb 2011   | $399

The columns Row ID through Time make up the Key; the last column is the Value.

Page 16: Introduction to Apache Accumulo

Queries

• By exact Key or range of Keys

• Data is always returned in sorted order

Query Requirements Drive Data Model Design
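A range query over sorted keys can be sketched with a binary search. This is a toy model, not the Accumulo client API: the in-memory `table`, the `scan` function, and the tuple keys are assumptions made for illustration, using the rows from the Physical Layout slide.

```python
import bisect

# Sorted (key, value) pairs; a key here is (row, family, qualifier).
table = sorted([
    (("Alice", "properties", "age"), "31"),
    (("Alice", "properties", "phone"), "555-1234"),
    (("Alice", "purchases", "Xbox"), "$299"),
    (("Bob", "properties", "phone"), "555-4321"),
    (("Bob", "purchases", "iPhone"), "$399"),
])
keys = [k for k, _ in table]


def scan(start, end):
    """Return entries with start <= key < end, already in sorted order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return table[lo:hi]


# Everything in Alice's "properties" column family.
alice_properties = scan(("Alice", "properties"), ("Alice", "properties\x00"))

# Everything in Bob's row.
bob_row = scan(("Bob",), ("Bob\x00",))

print(alice_properties)
print(bob_row)
```

Because only key ranges can be scanned efficiently, the fields a query filters on have to appear at the front of the key, which is why the query requirements drive the data model.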

Page 17: Introduction to Apache Accumulo

http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html

Page 18: Introduction to Apache Accumulo

Accumulo and the systems around it:

• Hadoop HDFS: Storage

• Zookeeper: Configuration/State

• Hadoop MapReduce: Analytics

• Clients: Read/Write

Page 19: Introduction to Apache Accumulo

[Diagram] Accumulo: a Master and several Tablet Servers. Hadoop HDFS: a Name Node and several Data Nodes. A Table is divided into Tablets, which are hosted on the Tablet Servers.

Page 20: Introduction to Apache Accumulo

[Diagram] Accumulo (a Master and several Tablet Servers) on top of Hadoop HDFS (a Name Node and several Data Nodes), with a Table divided into Tablets.

Tablet Server Failure

1.) Detect Failure

2.) Reassign

Page 21: Introduction to Apache Accumulo

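Writes

[Diagram] The Client sends a write to a Tablet on a Tablet Server. The write is (1) appended to the Write-Ahead Log (WAL) and then (2) inserted into the MemTable. The Tablet Server sits on top of the Hadoop HDFS Data Nodes.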

Page 22: Introduction to Apache Accumulo

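Writes

[Diagram] The Client sends a write to a Tablet on a Tablet Server. The write is (1) appended to the Write-Ahead Log (WAL) and then (2) inserted into the MemTable. Later, (3) the MemTable is written out as File 1 on the Hadoop HDFS Data Nodes.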
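The three numbered steps of the write path can be sketched as a small Python class. This is a simplified model under stated assumptions, not Accumulo's implementation: the `Tablet` class, the `flush_threshold` parameter, and clearing the log after a flush are all illustrative choices, and plain lists stand in for HDFS files.

```python
class Tablet:
    """Toy model of a tablet's write path (hypothetical, for illustration)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the Write-Ahead Log
        self.memtable = {}   # in-memory key -> value map
        self.files = []      # each "file" is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))   # 1.) log the mutation first
        self.memtable[key] = value      # 2.) then apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                # 3.) write the MemTable to a file

    def flush(self):
        # The MemTable is written out in sorted key order as one file.
        self.files.append(sorted(self.memtable.items()))
        self.memtable.clear()
        # Simplification: flushed entries no longer need the log for recovery.
        self.wal.clear()


tablet = Tablet(flush_threshold=3)
for key, value in [("c", "3"), ("a", "1"), ("b", "2"), ("d", "4")]:
    tablet.write(key, value)

print(tablet.files)     # one sorted file holding a, b, c
print(tablet.memtable)  # d is still only in memory (and in the log)
```

Logging before touching the MemTable means an acknowledged write can be replayed if the Tablet Server fails before the flush.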

Page 23: Introduction to Apache Accumulo

Compactions

Minor: The process of flushing a MemTable of a Tablet to a single file in HDFS

Major: The process of combining multiple files into a single file
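Since every file is already sorted, a major compaction is essentially a k-way merge. The sketch below assumes each file is a sorted list of (key, value) pairs and that, when the same key appears in several files, the entry from the newest file wins; the function name and that tie-breaking rule are assumptions for illustration, not Accumulo's exact behavior.

```python
import heapq


def major_compact(files):
    """Merge sorted files (oldest first) into one sorted file."""
    # Tag entries with the negated file index so that, for equal keys,
    # the entry from the newest file is seen first by the merge.
    tagged = (
        [(key, -index, value) for key, value in entries]
        for index, entries in enumerate(files)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue  # an entry for this key from a newer file was already kept
        merged.append((key, value))
    return merged


older = [("a", "1"), ("c", "3")]
newer = [("a", "9"), ("b", "2")]
compacted = major_compact([older, newer])
print(compacted)
```

The merge streams through the inputs once, so the cost is proportional to the total number of entries rather than requiring a full re-sort.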

Page 24: Introduction to Apache Accumulo

Tablet Splits

• Tablets are split when they reach a max size

• Always split on row boundary

• Master assigns a split Tablet to another Tablet server (no data is moved!)
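Splitting on a row boundary means all entries of one row stay in the same Tablet. A minimal sketch, assuming entries are sorted ((row, column), value) pairs and that the split row is chosen near the middle entry; `split_tablet` is a hypothetical helper, and the real split-point selection in Accumulo may differ.

```python
def split_tablet(entries):
    """Split sorted ((row, column), value) entries at a row boundary.

    Returns (split_row, left, right), where left holds rows <= split_row,
    or None if the tablet contains fewer than two distinct rows.
    """
    rows = [row for (row, _), _ in entries]
    distinct = sorted(set(rows))
    if len(distinct) < 2:
        return None  # a single row is never divided
    middle_row = rows[len(rows) // 2]
    # The split row is the last row of the left half, so it cannot be
    # the final row of the tablet.
    split_row = middle_row if middle_row != distinct[-1] else distinct[-2]
    left = [e for e in entries if e[0][0] <= split_row]
    right = [e for e in entries if e[0][0] > split_row]
    return split_row, left, right


entries = [
    (("Alice", "properties:age"), "31"),
    (("Alice", "properties:phone"), "555-1234"),
    (("Alice", "purchases:Xbox"), "$299"),
    (("Bob", "properties:phone"), "555-4321"),
    (("Bob", "purchases:iPhone"), "$399"),
]
split_row, left, right = split_tablet(entries)
print(split_row, len(left), len(right))
```

In this model the split only produces two key ranges; the underlying files stay where they are in HDFS, which matches the slide's point that no data is moved.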

Page 25: Introduction to Apache Accumulo

Reads

[Diagram] The Client reads from a Tablet on an Accumulo Tablet Server. The result is a Merge of the Tablet's MemTable and its files.

Page 26: Introduction to Apache Accumulo

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

Page 27: Introduction to Apache Accumulo

http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif

Iterators: Server-side programming

Page 28: Introduction to Apache Accumulo

Iterators

Can be run at:

• Scan Time

• Minor Compaction

• Major Compaction

Can do things like:

• Aggregation (Combiners)

• Age-Off

• Filtering (access control)

• Transformation

Push Processing to the Data
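The idea of stacking iterators over a sorted stream can be sketched with Python generators. This is an analogy, not Accumulo's Java iterator interface: `age_off`, `summing_combiner`, and the (key, timestamp, value) entry shape are assumptions made for illustration.

```python
from itertools import groupby


def age_off(entries, min_timestamp):
    """Filter: drop entries older than min_timestamp."""
    for key, timestamp, value in entries:
        if timestamp >= min_timestamp:
            yield key, timestamp, value


def summing_combiner(entries):
    """Aggregation: collapse all versions of a key into one summed entry."""
    for key, group in groupby(entries, key=lambda e: e[0]):
        versions = list(group)
        newest = max(ts for _, ts, _ in versions)
        total = sum(value for _, _, value in versions)
        yield key, newest, total


# Entries arrive sorted by key, newest version first.
entries = [
    ("breakfast", 5, 2),
    ("breakfast", 3, 1),
    ("lunch", 4, 6),
    ("lunch", 1, 4),
]

# Iterators stack: the combiner consumes the output of the age-off filter.
result = list(summing_combiner(age_off(entries, min_timestamp=2)))
print(result)
```

Running this logic on the server at scan or compaction time means the client receives one summed entry per key instead of every raw version.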

Page 29: Introduction to Apache Accumulo

Accumulo /əˈkjuːmjʊˌloʊ/

1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing

Page 30: Introduction to Apache Accumulo

Access Control

• Every key-value has a visibility label

• Label is defined with boolean operators

• Label is arbitrary and ad-hoc

• Authorizations presented at scan time

• Data is filtered out automatically by system-level Iterator

Example labels: Public; Private | Admin; Finance | (HR & Manager)

Page 31: Introduction to Apache Accumulo

Access Control – Typical Architecture

[Diagram] Components: Web Server, Enterprise Identity Management, and Accumulo, with a Trusted Zone boundary marked.

1.) Pass Credentials

2.) Lookup User

3.) Return Authorizations

4.) Proxy Authorization

5.) Return Visible Data

6.) Return Data

Page 32: Introduction to Apache Accumulo

Access Control – Typical Architecture

[Diagram] Components: Bob, Web Server, Enterprise Identity Management, and Accumulo, with a Trusted Zone boundary marked.

1.) PKI Cert

2.) Lookup Bob

3.) Auths: [SECRET, UNCLASSIFIED, PROJECT X, PROJECT Y]

4.) Proxy Bob’s Auths

5.) Return [6, 8]

6.) Return [6, 8]

Data in Accumulo (visibility label, value):

SECRET&PROJECT X, 6

SECRET&PROJECT Y, 8

SECRET&PROJECT Z, 3
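The filtering in this example can be reproduced with a small label evaluator. This is a sketch, not Accumulo's visibility parser: `tokenize` and `is_visible` are hypothetical helpers, terms are allowed to contain spaces so that the slide's "PROJECT X" works unquoted, and letting `&` bind tighter than `|` when no parentheses are given is a simplifying assumption.

```python
import re


def tokenize(label):
    """Split a label into terms and the operators & | ( )."""
    parts = (p.strip() for p in re.split(r"([&|()])", label))
    return [p for p in parts if p]


def is_visible(label, auths):
    """Return True if the authorizations satisfy the visibility label."""
    tokens = tokenize(label)
    if not tokens:
        return True  # assumption: unlabeled data is visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always consume the right-hand side
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            pos += 1  # skip the closing ")"
            return result
        return token in auths

    return parse_or()


bob_auths = {"SECRET", "UNCLASSIFIED", "PROJECT X", "PROJECT Y"}
entries = [
    ("SECRET&PROJECT X", 6),
    ("SECRET&PROJECT Y", 8),
    ("SECRET&PROJECT Z", 3),
]
visible = [value for label, value in entries if is_visible(label, bob_auths)]
print(visible)
```

With Bob's authorizations from step 3, the entry labeled SECRET&PROJECT Z is filtered out and the remaining values are 6 and 8, matching steps 5 and 6 on the slide.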

Page 33: Introduction to Apache Accumulo

Demo

Page 34: Introduction to Apache Accumulo

Application Requirements

Build an application to analyze trends in Twitter messages.

• Query for word/phrase and view real-time activity in a time series graph

• View at different time ranges (1 day, 7 days, 30 days, etc.)

• Allow multiple query terms to compare activity (e.g. breakfast, lunch)

• Automatically extract daily trends for the user

Page 35: Introduction to Apache Accumulo

Demo Setup/Data

• Twitter Streaming API

• Only messages with US country codes

• 1,2,3-grams built

• Data since Dec 24 – Live

• Running on an average workstation, 1 SATA disk, 6 GB memory

• 72 GB, 2.6 billion entries and counting

Page 36: Introduction to Apache Accumulo
Page 37: Introduction to Apache Accumulo

Data Model

• Tweets table

– Row ID: n-gram

– Column Family: Date Granularity (DAY, HOUR)

– Column Qual: Date Value

– Value: Count

– SummingCombiner (Iterator) used to update Count

Row ID    | Col Fam | Col Qual   | Value
breakfast | DAY     | 20120318   | 31
breakfast | DAY     | 20120319   | 56
…         | …       | …          | …
lunch     | HOUR    | 2012031801 | 3
lunch     | HOUR    | 2012031802 | 4
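How a tweet turns into entries in this table can be sketched as follows. This is not the demo's actual ingest code: the whitespace tokenization, the `ngrams` and `tweet_mutations` helpers, and the sample tweets are assumptions, and a `Counter` stands in for the server-side SummingCombiner.

```python
from collections import Counter


def ngrams(text, max_n=3):
    """Yield the 1-, 2-, and 3-grams of a message (naive whitespace split)."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def tweet_mutations(text, day, hour):
    """Yield ((row, family, qualifier), 1) for each n-gram and granularity."""
    for gram in ngrams(text):
        yield (gram, "DAY", day), 1
        yield (gram, "HOUR", day + hour), 1


# The Counter plays the role of the SummingCombiner: writers only ever
# insert a count of 1, and the counts are summed per key.
table = Counter()
tweets = [
    ("Lunch time", "20120318", "01"),
    ("late lunch", "20120318", "02"),
]
for text, day, hour in tweets:
    for key, count in tweet_mutations(text, day, hour):
        table[key] += count

for key in sorted(table):
    print(key, table[key])
```

Because the combiner does the summing, writers never need to read the current count before updating it.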

Page 38: Introduction to Apache Accumulo

Data Model

• Trends table

– Row ID: (Date Granularity + Date Value)

– Column Family: (Integer.MAX_VALUE - trendScore)

– Column Qual: n-gram

– Value: []

Row ID       | Col Fam    | Col Qual    | Value
DAY:20120318 | 2147483145 | church      |
DAY:20120318 | 2147483316 | hangover    |
…            | …          | …           | …
DAY:20120319 | 2147476521 | the broncos |
DAY:20120319 | 2147477704 | tim tebow   |
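Storing Integer.MAX_VALUE minus the score makes the highest-scoring n-gram sort first within a row, so the top trends are simply the first entries of a scan. A small sketch of that arithmetic: the `trend_key` helper and the zero-padded decimal string encoding are assumptions, and the scores 502 and 331 are back-computed from the column families shown on the slide.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


def trend_key(granularity, date_value, trend_score, ngram):
    """Build a (row, family, qualifier) key for the trends table."""
    row = f"{granularity}:{date_value}"
    # Zero-pad to 10 digits so string order matches numeric order.
    family = str(INT_MAX - trend_score).zfill(10)
    return (row, family, ngram)


keys = sorted([
    trend_key("DAY", "20120318", 331, "hangover"),
    trend_key("DAY", "20120318", 502, "church"),
])
for key in keys:
    print(key)

# The top trend for the day is the first key in the row.
top_ngram = keys[0][2]
```

The higher score (502) yields the smaller column family, so "church" sorts ahead of "hangover" without any sorting at query time.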

Page 39: Introduction to Apache Accumulo

MapReduce Analytics

• Utilize MapReduce for building trends

• AccumuloInputFormat reads from tweets table

• AccumuloOutputFormat writes to trends table

• AccumuloStorage LoadFunc for Pig available on GitHub

Page 40: Introduction to Apache Accumulo
Page 41: Introduction to Apache Accumulo

Summary

•Accumulo exploits locality to enable interactive access to huge data sets while adding cell-level access control and server-side programming

•Nothing in life is free. Accumulo comes with the complexity and responsibility of managing a distributed system and designing indexes on your data

Page 42: Introduction to Apache Accumulo

References

• Documentation, Mailing Lists, Links: http://incubator.apache.org/accumulo/

• HBase Shootout: http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012

• Trendulo: https://github.com/jaredwinick/trendulo