accumulo meetup 20130109

Post on 27-Jan-2015

111 Views

Category:

Technology

4 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Securely explore your data

APPACHE ACCUMULO

Adam Fuchs and John Vines

sqrrl data, inc.

January 9, 2013

Securely explore your data

APACHE ACCUMULO

Sorted, Distributed Key/Value Store

Based on Google’s Big Table Design

Built on Top of Apache Hadoop and Apache Zookeeper

Augments and Integrates With the Hadoop ecosystem

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 2

TODAY’S TALK

Overview of the Accumulo Project

Accumulo Design

Table Design Strategy

Live Demonstration

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 3

ACCUMULO TIMELINE

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 4

2005 2006 2007 2008 2009 2010 2011 2012 2013

Google publishes

Bigtable paper

NSA begins development of

Accumulo

NSA open sources

Accumulo into incubation at

Apache

sqrrl is founded

First sqrrl release planned

Google Publishes Papers:

GFS (2003)Map Reduce (2004)

Accumulo becomes a top-level Apache

project

ACCUMULO’S STRENGTHS

Apache Accumulo excels at:

- Security

Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use

Mandatory access control keeps your data safe

- Scalability

Proven reliability and performance at the multi-petabyte scale

High-performance parallel I/O library

- Adaptability

Flexible schema support to quickly ingest new data sources

Sorted key/value paradigm supports a multitude of search and analysis applications

Server-side programming framework “iterator trees” support best-in-class aggregation, filtering, and complex query semantics

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 5

BASIC SCHEMA

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 6

An Accumulo key is a 5-tuple, consisting of:- Row: Controls Atomicity- Column Family: Controls Locality- Column Qualifier: Controls Uniqueness- Visibility Label: Controls Access- Timestamp: Controls Versioning

Keys are sorted:- Hierarchically: Row first, then column family, and so on.

- Lexicographically: Compare first byte, then second, and so on.

Accumulo stores sorted key/value pairs (entries).

Values are byte arrays.

KEY/VALUE EXAMPLES

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 7

Row Col. Fam. Col. Qual. Visibility Timestamp Value

Jane Doe Friends John Doe JD 20121130

Jane Doe Phone Number 555-1212 20090115

John Doe Friends Jane Doe JD 20121201

John Doe Notes PCP PCP_JD 20120912 Patient suffers from an acute …

John Doe Test Results Cholesterol JD|PCP_JD 20120912 183

John Doe Test Results Mental Health JD|PSYCH_JD 20120801 Pass

John Doe Test Results Mental Health PSYCH_JD 20120801 Crazy!

John Doe Test Results X-Ray JD|PHYS_JD 20120513 1010110110100…

VISIBILITY SYNTAX & SEMANTICS

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 8

TABLET ORGANIZATION

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 9

Root Tablet-∞ to ∞

Metadata Tablet 1-∞ to

“Encyclopedia:Ocelot”

Data Tablet-∞ : thing

Data Tabletthing : ∞

Data Tablet-∞ : Ocelot

Data TabletOcelot : Yak

Data TabletYak : ∞

Data Tablet-∞ to ∞

Metadata Tablet 2 “Encyclopedia:Ocelot” to ∞

Well-Known Location

(zookeeper)

Table: Adam’s Table Table: Encyclopedia Table: Foo

Collections of entries from tables

Tables are partitioned into Tablets

Metadata tablets hold info about other tablets, forming a 3-level hierarchy

A Tablet is a unit of work for a Tablet Server

ACCUMULO ARCHITECTURE

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 10

Tablet Server

Tablet

Tablet Server

Tablet

Tablet Server

Tablet

Application

Zookeeper

Zookeeper

Zookeeper

Master

HDFS

Read/Write

Store/Replicate

Assign/Balance

Application

Application

GarbageCollector

ScanDelete

Delegate Authority,

Configs

Delegate Authority,

Configs

TABLET DATA FLOW

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 11

In-Memory Map

Write AheadLog

(For Recovery)

Sorted, Indexed

File

Sorted, Indexed

FileSorted, Indexed

File

Tablet

ReadsIterator

TreeMinor

Compaction

Merging / Major Compaction

Iterator Tree

Writes Iterator Tree

Scan

ITERATOR FRAMEWORK

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 12

Iterator Operations:

- File Reads- Block Caching- Merging- Deletion- Isolation- Locality Groups- Range Selection- Column Selection- Cell-level Security- Versioning- Filtering- Aggregation- Partitioned Joins

CLIENT API

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 13

Instancenew ZooKeeperInstance(...)

new MockInstance()

Connector

getConnector(auth info...)

Scanner BatchScanner BatchWriter

createScanner(...) createBatchScanner(...) createBatchWriter(...)

TableOperations

InstanceOperations

SecurityOperations

Map.Entry

Key Value

Mutation

Range

IteratorOption

iterator()addMutation(...)

Authorizations

TABLE DESIGN

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 14

Graphs

Document-distributed indexing

Multi-dimensional index

Custom index

No built-in secondary indices

Sort Order Index

Basic design pattern: forward and inverted index tables

Additional table design patterns

Table:

Row:

Column Family:

Column Qualifier:

Value:

Forward Index

<UUID>

<Type>

<Field>

<Term>

Inverted Index

<Term>

<Type> + <Field>

<UUID>

<Digest of Event>

DEMO TIME!

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 15

OPEN SOURCE PROJECT

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 16

Apache Software Foundation project since October 2011

site: http://accumulo.apache.org

jira: https://issues.apache.org/jira/browse/ACCUMULO

lists: http://accumulo.apache.org/mailing_list.html

CURRENT CONTRIBUTORS

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 17

CONTACT

© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 18

Adam FuchsCTO

adam@sqrrl.com

John VinesDirector of Ecosystems

john@sqrrl.com

sqrrl data, Inc.www.sqrrl.com

@sqrrl_incinfo@sqrrl.com

top related