accumulo meetup 20130109
Securely explore your data
APACHE ACCUMULO
Adam Fuchs and John Vines
sqrrl data, inc.
January 9, 2013
APACHE ACCUMULO
Sorted, Distributed Key/Value Store
Based on Google’s Bigtable Design
Built on Top of Apache Hadoop and Apache Zookeeper
Augments and Integrates With the Hadoop ecosystem
© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential 2
TODAY’S TALK
Overview of the Accumulo Project
Accumulo Design
Table Design Strategy
Live Demonstration
ACCUMULO TIMELINE
Timeline (axis: 2005 to 2013), events in order:
- Google publishes papers: GFS (2003), MapReduce (2004)
- Google publishes Bigtable paper
- NSA begins development of Accumulo
- NSA open sources Accumulo into incubation at Apache
- Accumulo becomes a top-level Apache project
- sqrrl is founded
- First sqrrl release planned
ACCUMULO’S STRENGTHS
Apache Accumulo excels at:
- Security
  - Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  - Mandatory access control keeps your data safe
- Scalability
  - Proven reliability and performance at the multi-petabyte scale
  - High-performance parallel I/O library
- Adaptability
  - Flexible schema support to quickly ingest new data sources
  - Sorted key/value paradigm supports a multitude of search and analysis applications
  - Server-side programming framework (“iterator trees”) supports best-in-class aggregation, filtering, and complex query semantics
BASIC SCHEMA
An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Locality
- Column Qualifier: Controls Uniqueness
- Visibility Label: Controls Access
- Timestamp: Controls Versioning
Keys are sorted:
- Hierarchically: Row first, then column family, and so on.
- Lexicographically: Compare first byte, then second, and so on.
Accumulo stores sorted key/value pairs (entries).
Values are byte arrays.
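The sort order described above can be sketched in a few lines of Python. This is a toy model, not the Accumulo `Key` class: the `Key` tuple, `sort_key`, and `sorted_entries` are hypothetical names, and the sample entries are illustrative. One detail goes beyond the slide and is flagged in the code: timestamps are treated as sorting newest first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Toy model of the 5-tuple key (not the Accumulo Key class)."""
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int

    def sort_key(self):
        # Hierarchical: row first, then family, qualifier, visibility.
        # Lexicographic: Python compares bytes objects byte by byte.
        # Assumption beyond the slide: newer timestamps sort first,
        # so the timestamp is negated.
        return (self.row, self.family, self.qualifier,
                self.visibility, -self.timestamp)


def sorted_entries(entries):
    """Return (key, value) pairs in key order."""
    return sorted(entries, key=lambda kv: kv[0].sort_key())


# Illustrative entries; values are byte arrays.
entries = [
    (Key(b"John Doe", b"Friends", b"Jane Doe", b"JD", 20121201), b""),
    (Key(b"Jane Doe", b"Phone Number", b"555-1212", b"", 20090115), b""),
    (Key(b"Jane Doe", b"Friends", b"John Doe", b"JD", 20121130), b""),
    (Key(b"Jane Doe", b"Friends", b"John Doe", b"JD", 20121215), b""),
]

for key, value in sorted_entries(entries):
    print(key.row, key.family, key.qualifier, key.timestamp)
```

Because the row is compared first, all entries for one row end up adjacent, which is what makes range scans over a row cheap.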
KEY/VALUE EXAMPLES
Row       Col. Fam.     Col. Qual.     Visibility    Timestamp  Value
Jane Doe  Friends       John Doe       JD            20121130
Jane Doe  Phone Number  555-1212                     20090115
John Doe  Friends       Jane Doe       JD            20121201
John Doe  Notes         PCP            PCP_JD        20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD     20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD   20120801   Pass
John Doe  Test Results  Mental Health  PSYCH_JD      20120801   Crazy!
John Doe  Test Results  X-Ray          JD|PHYS_JD    20120513   1010110110100…
VISIBILITY SYNTAX & SEMANTICS
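Visibility labels such as `JD|PCP_JD` in the example entries are boolean expressions over authorization tokens. The sketch below is one possible way to evaluate them, not Accumulo's implementation. The function names are hypothetical. The `&` operator, parentheses, the rule that `&` and `|` may not be mixed without parentheses, and the treatment of an empty label as readable by everyone are assumptions based on Accumulo's visibility expression syntax rather than on the slide text.

```python
import re

# Tokens: authorization names, operators, and parentheses.
_TOKEN = re.compile(r"[A-Za-z0-9_\-:./]+|[&|()]")


def tokenize(expression):
    tokens = _TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError(f"unexpected character in {expression!r}")
    return tokens


def is_visible(expression, authorizations):
    """Return True if a reader holding `authorizations` may see the entry."""
    if expression == "":
        return True  # assumption: an empty label is readable by everyone
    tokens = tokenize(expression)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected {token!r}")
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    visible = parse_expression()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return visible


# A reader holding only PCP_JD can see the cholesterol result
# but not the entry labeled PSYCH_JD.
print(is_visible("JD|PCP_JD", {"PCP_JD"}))
print(is_visible("PSYCH_JD", {"PCP_JD"}))
```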
TABLET ORGANIZATION
Well-Known Location (zookeeper)
  Root Tablet: -∞ to ∞
    Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Table: Adam’s Table
  Data Tablet: -∞ to thing
  Data Tablet: thing to ∞
Table: Encyclopedia
  Data Tablet: -∞ to Ocelot
  Data Tablet: Ocelot to Yak
  Data Tablet: Yak to ∞
Table: Foo
  Data Tablet: -∞ to ∞
Collections of entries from tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server
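Since tablets partition a table's sorted row space, finding the tablet for a row amounts to a binary search over the split points. The sketch below illustrates the idea with the Encyclopedia splits from the diagram. The function names are hypothetical, and the boundary rule (a row equal to a split point belongs to the tablet ending at that split) is an assumption stated in the code.

```python
from bisect import bisect_left

# Split points for the "Encyclopedia" table: three tablets.
ENCYCLOPEDIA_SPLITS = ["Ocelot", "Yak"]


def tablet_ranges(splits):
    """Turn sorted split points into (start, end) pairs; None means infinity."""
    bounds = [None] + list(splits) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def locate_tablet(splits, row):
    """Return the index of the tablet whose range contains `row`.

    Assumption: a tablet covers (start, end], so a row equal to a split
    point belongs to the tablet that ends at that split.
    """
    return bisect_left(splits, row)


ranges = tablet_ranges(ENCYCLOPEDIA_SPLITS)
for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", ranges[locate_tablet(ENCYCLOPEDIA_SPLITS, row)])

# The same search applies one level up: metadata tablets are split on
# keys such as "Encyclopedia:Ocelot" (key format taken from the diagram).
METADATA_SPLITS = ["Encyclopedia:Ocelot"]
print(locate_tablet(METADATA_SPLITS, "Encyclopedia:Panda"))
```

Repeating the lookup at the root, metadata, and data levels gives the three-level hierarchy described above.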
ACCUMULO ARCHITECTURE
[Diagram: three Applications, three Tablet Servers (each hosting a Tablet), three Zookeeper nodes, a Master, a Garbage Collector, and HDFS. Edge labels: Read/Write; Store/Replicate; Assign/Balance; Scan; Delete; Delegate Authority, Configs (twice).]
TABLET DATA FLOW
[Diagram: data flow inside a Tablet. Components: In-Memory Map; Write Ahead Log (For Recovery); three Sorted, Indexed Files. Flow labels: Writes; Reads; Scan; Minor Compaction; Merging / Major Compaction. An Iterator Tree label appears three times in the diagram.]
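The components named on this slide can be tied together in a small model. This is a minimal sketch under assumptions, not Accumulo code: the `Tablet` class, its method names, and the `flush_threshold` parameter are hypothetical, and the flow (log and buffer writes, flush to a sorted file on minor compaction, merge files on major compaction, merge everything on scan) follows the usual Bigtable-style design.

```python
import heapq


class Tablet:
    """Toy model of a tablet's write and read paths."""

    def __init__(self, flush_threshold=3):
        self.write_ahead_log = []  # append-only record used for recovery
        self.memory_map = {}       # in-memory map of recent writes
        self.files = []            # each file is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.write_ahead_log.append((key, value))  # log first, then apply
        self.memory_map[key] = value
        if len(self.memory_map) >= self.flush_threshold:
            self.minor_compaction()

    def minor_compaction(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory_map:
            self.files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []  # flushed data no longer needs the log

    def major_compaction(self):
        """Merge all sorted files into one."""
        if self.files:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        """Merge every file and the in-memory map into one sorted stream."""
        sources = self.files + [sorted(self.memory_map.items())]
        return list(self._merge(sources))

    @staticmethod
    def _merge(sources):
        # Later sources are newer. Tag entries with a negated age so that,
        # for equal keys, the newest value comes first and wins.
        tagged = [
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        ]
        last_key = object()  # sentinel that equals no real key
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                yield key, value
                last_key = key


tablet = Tablet(flush_threshold=2)
for key, value in [("b", "1"), ("a", "1"), ("b", "2")]:
    tablet.write(key, value)
print(tablet.scan())
```

Every source is already sorted, so both scans and compactions are streaming merges rather than re-sorts.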
ITERATOR FRAMEWORK
Iterator Operations:
- File Reads
- Block Caching
- Merging
- Deletion
- Isolation
- Locality Groups
- Range Selection
- Column Selection
- Cell-level Security
- Versioning
- Filtering
- Aggregation
- Partitioned Joins
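Several of the operations listed above compose naturally as a stack of stages over a sorted stream. The sketch below uses Python generators to illustrate three of them (versioning, filtering, aggregation). It is only an analogy for the idea of an iterator tree, not Accumulo's iterator interface. All function names and the entry layout `(row, column, timestamp, value)` are hypothetical.

```python
from itertools import groupby


def versioning(entries, max_versions=1):
    """Keep only the newest `max_versions` entries per (row, column).

    Assumes entries for the same cell arrive newest first.
    """
    for _, group in groupby(entries, key=lambda e: (e[0], e[1])):
        for count, entry in enumerate(group):
            if count < max_versions:
                yield entry


def filtering(entries, predicate):
    """Drop entries that fail the predicate."""
    return (entry for entry in entries if predicate(entry))


def summing(entries):
    """Aggregate integer values per row."""
    for row, group in groupby(entries, key=lambda e: e[0]):
        yield row, sum(e[3] for e in group)


# Entries sorted by row, column, then newest timestamp first.
entries = [
    ("page1", "clicks", 1, 4),
    ("page1", "views", 3, 10),
    ("page1", "views", 2, 7),
    ("page2", "views", 5, 1),
]

stack = summing(filtering(versioning(entries), lambda e: e[1] == "views"))
print(list(stack))
```

Each stage consumes and produces a sorted stream, so stages can be reordered or added without buffering the whole data set.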
CLIENT API
Instance: new ZooKeeperInstance(...) or new MockInstance()
Connector: getConnector(auth info...)
From a Connector:
- createScanner(...): Scanner
- createBatchScanner(...): BatchScanner
- createBatchWriter(...): BatchWriter
- TableOperations, InstanceOperations, SecurityOperations
Related types: Map.Entry (Key, Value), Mutation, Range, IteratorOption, Authorizations
Methods shown: iterator(), addMutation(...)
TABLE DESIGN
No built-in secondary indices
Sort Order Index
Basic design pattern: forward and inverted index tables

Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type> + <Field>
Column Qualifier:  <Field>         <UUID>
Value:             <Term>          <Digest of Event>

Additional table design patterns:
- Graphs
- Document-distributed indexing
- Multi-dimensional index
- Custom index
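The forward and inverted index pattern can be sketched as a function that turns one event into entries for both tables. This is an illustration under assumptions: the event dictionary shape, the function names, the literal `+` joining type and field, and the short SHA-1 hash standing in for `<Digest of Event>` are all hypothetical choices, not part of the original design.

```python
import hashlib


def digest(event):
    """Hypothetical stand-in for <Digest of Event>: a short hash of the fields."""
    text = "|".join(f"{k}={v}" for k, v in sorted(event["fields"].items()))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def index_event(event):
    """Return (forward_entries, inverted_entries) for one event.

    Each entry is (row, column family, column qualifier, value).
    """
    forward, inverted = [], []
    summary = digest(event)
    for field, term in event["fields"].items():
        # Forward index: row = UUID, so one event's fields sit together.
        forward.append((event["uuid"], event["type"], field, term))
        # Inverted index: row = term, so one term's events sit together.
        inverted.append((term, f'{event["type"]}+{field}', event["uuid"], summary))
    return forward, inverted


def lookup(inverted_table, term):
    """Find UUIDs of events containing `term` by reading one inverted-index row."""
    return sorted({qualifier for row, _, qualifier, _ in inverted_table if row == term})


events = [
    {"uuid": "u1", "type": "person", "fields": {"name": "jane", "city": "boston"}},
    {"uuid": "u2", "type": "person", "fields": {"name": "john", "city": "boston"}},
]

forward_table, inverted_table = [], []
for event in events:
    fwd, inv = index_event(event)
    forward_table.extend(fwd)
    inverted_table.extend(inv)
inverted_table.sort()  # the store keeps entries sorted by key

print(lookup(inverted_table, "boston"))
```

A term query reads one row of the inverted index to collect UUIDs, then fetches the full records from the forward index by UUID.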
DEMO TIME!
OPEN SOURCE PROJECT
Apache Software Foundation project since October 2011
site: http://accumulo.apache.org
jira: https://issues.apache.org/jira/browse/ACCUMULO
lists: http://accumulo.apache.org/mailing_list.html
CURRENT CONTRIBUTORS
CONTACT
Adam Fuchs, CTO
John Vines, Director of Ecosystems
sqrrl data, Inc.
www.sqrrl.com