apache accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a...

25
Accumulo Adam Fuchs What is Accumulo? How can I use Accumulo? Who is involved in the Accumulo community? Where is Accumulo going? Apache Accumulo Adam Fuchs National Security Agency Computer and Information Sciences Research Group July 17, 2012

Upload: others

Post on 08-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Apache Accumulo

Adam Fuchs

National Security AgencyComputer and Information Sciences Research Group

July 17, 2012

Page 2: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Design Drivers

Analysis of big data is central to our customers’ requirements, in which thestrongest drivers are:

Scalability: The ability to do twice the work at only (about) twice the cost.

Adaptability: The ability to rapidly evolve the analytical tools available inan operational environment, building upon and enhancing existingcapabilities.

From these directives we can derive the following requirements:

Simplicity in the overall architecture to encourage collaboration andameliorate learning curve.

Generic design patterns to store and organize data whose format we don’tcontrol.

Generic discovery analytics to retrieve and visualize generic data.

Solutions for common sub-problems, such as multi-level security andenforcement of legal restrictions, built into the infrastructure.

Page 3: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Optimization

... is a secondary concern, given:

hundreds of evolving applications,

hundreds of changing data sources,

non-trivial data volumes,

many complicated interactions.

Instead, we need a generic platform that is cheap, simple, scalable, secure, andadaptable, with pretty good performance.

Page 4: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Growth of Accumulo

Page 5: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Key/Value Structure

An Accumulo Key is a 5-tuple, including:

Row: controls Atomicity

Column Family: controls Locality

Column Qualifier: controls Uniqueness

Visibility: controls Access (unique to Accumulo)

Timestamp: controls Versioning

Sample Entries

Row : Col. Fam. : Col. Qual. : Visibility : Timestamp ⇒ ValueAdam : Favorites : Food : (Public) : 20090801 ⇒ SushiAdam : Favorites : Programming Language : (Private) : 20090830 ⇒ JavaAdam : Favorites : Programming Language : (Private) : 20070725 ⇒ C++Adam : Friends : Bob : (Public) : 20110601 ⇒Adam : Friends : Joe : (Private) : 20110601 ⇒

Page 6: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Visibility Label Syntax and Semantics

Document LabelsDoc1 : (Federation)Doc2 : (Klingon|Vulcan)Doc3 : (Federation&Human&Vulcan)

Doc4 : (Federation&(Human|Vulcan))

User Authorization SetsCptKirk : {Federation,Human}

MrSpock : {Federation,Human,Vulcan}

Syntax

WORD ⇒ [a-zA-Z0-9 ]+CLAUSE ⇒ AND

⇒ ORAND ⇒ AND & AND

⇒ (CLAUSE)⇒ WORD

OR ⇒ OR | OR⇒ (CLAUSE)⇒ WORD

Semantics(T ⇒ τ) ∧ (τ ∈ A)

(T,A) |= trueterm

(T ⇒ T1 & T2) ∧ ((T1,A) |= true) ∧ ((T2,A) |= true)

(T,A) |= trueand

(T ⇒ T1 | T2) ∧ (((T1,A) |= true) ∨ ((T2,A) |= true))

(T,A) |= trueor

(T ⇒ (T1)) ∧ (T1 |= true)

(T,A) |= trueparen

Page 7: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Tablets

Collections ofkey/value pairsform Tables

Tables arepartitioned intoTablets

Metadata tabletshold info aboutother tablets,forming athree-levelhierarchy

A Tablet is a unitof work for aTablet Server

Page 8: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Distributed Processes

Page 9: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Tablet Server Composition

Quick and loose definitions:

Table: A map of keys to values with one global sort order among keys.Tablet: A row range within a Table.Tablet Server: The mechanism that hosts Tablets, providing the primaryfunctionality of Bigtable or Accumulo.

Tablet servers have several primary functions:

1 Hosting RPCs (read, write, etc.)

2 Managing resources (RAM, CPU, File I/O, etc.)

3 Scheduling background tasks (compactions, caching, etc.)

4 Handling key/value pairs

Category 4 is almost entirely accomplished through the Iterator framework.

Page 10: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Tablet Server Data Flow

Iterator Uses

File Reads

Block Caching

Merging

Deletion

Isolation

Locality Groups

Range Selection

Column Selection

Cell-level Security

Versioning

Filtering

Aggregation

Partitioned Joins

Page 11: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

The Perils of Distributed Computing

Dealing with failures is hard!

Operations like table creation are logically atomic, but consist of multipleoperations on distributed systems.

Resource locking (via mutex, semaphores, etc.) provides some sanity.

Distributed systems have many complicated failure modes: clients, master,tablet servers, and dependent systems can all go offline periodically.

Who is responsible for unlocking locks when any component can fail?

How do we know it’s safe to unlock a lock?

Page 12: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Accumulo Testing Procedures

Testing Frameworks

Unit: Verify correct functioning ofeach module separately

System: Perform correctness andperformance tests on a smallrunning instance

Load/Scale: Generate high loadsat scale and measure performanceand correctness

Random Walk: Randomly,repeatedly, and concurrentlyexecute a variety of test modulesrepresentative of user activity onan instance at scale

Simulation: Evaluate the model togauge expected performance

Other ConsiderationsScoping tests to includeserver-side code, client-side code,dependent processes, etc.

Code coverage vs. path coverage

Static vs. dynamic analysis

Simulating failures of distributedcomponents

Strange failure modes (oftenhardware/physics-related)

Page 13: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Fault-Tolerant Executor

If a process dies, previously submitted operations continueto execute on restart.

FATE serializes every task in Zookeeper before execution.

The Master process uses FATE to execute table operationsand administrative actions.

FATE eliminates the single point of failure.

Page 14: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Verified State Models

State models used formany internal functions

Explicit-state modelchecking provescorrectness

Page 15: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Page 16: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Page 17: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Event Table with Inverted Index

Page 18: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Inverted Index Flow

Page 19: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Multidimensional Index

See also: http://en.wikipedia.org/wiki/Geohash

Page 20: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Graph Table

Page 21: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

The “shard” Table

Page 22: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Committers, Contributors, and Community

Accumulo-Related Companies

42six

Accumulo Data

Berico

Booz Allen Hamilton

CyberPoint

Data Tactics

Eclectic Consulting

Invertix

KEYW

PDI

Peterson Technologies

Potomac Fusion

Praxis

SAIC

sqrrl

SRA

SW Complete

Tetra Concepts

TexelTek

Your name here!

Page 23: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

User Base

Page 24: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Features in the Pipeline

Block stats indexing

Transient block indexing

Pluggable Authentication and Authorization

HDFS-based write-ahead log

Multiple namenode/volume support

Integration with cluster management systems

Web-integrated shell

Page 25: Apache Accumulopeople.apache.org/~afuchs/slides/project_accumulo.pdf · 2012. 7. 17. · execute a variety of test modules representative of user activity on an instance at scale

Accumulo

Adam Fuchs

What isAccumulo?

How can I useAccumulo?

Who isinvolved in theAccumulocommunity?

Where isAccumulogoing?

Theoretical Projects and Challenges

Coprocessors

Multi-row Transactions

Improved Iterator Framework

Statistics API

Custom Sort Order

Multi-Data Center Replication

Other Suggestions?