sqrrl Secure. Scale. Adapt
[email protected] | @sqrrl_inc | 617.902.0784 sqrrl data, INC., All Rights Reserved
Adam Fuchs, CTO sqrrl data, INC.
February 21, 2013
Today’s Talk
1. An overview of what matters for big data analysis.
2. An in-depth discussion of Accumulo technology.
3. Approaches to sorted key-value table design.
4. An overview of Sqrrl Enterprise.
The Value of Sqrrl and Accumulo
Security
Adaptivity
Scalability
sqrrl | Accumulo Timeline
2003-2004: Google publishes the GFS (2003) and MapReduce (2004) papers
2006: Google publishes the Bigtable paper
2008: NSA begins development of Accumulo
2011: NSA open sources Accumulo
2012: Accumulo becomes a top-level Apache project; sqrrl is founded
2013: First sqrrl release planned
Apache Accumulo
• Sorted, distributed key/value store
• Based on Google’s Bigtable design
• Built on top of Apache Hadoop and Apache ZooKeeper
• Augments and integrates with the Hadoop ecosystem
• Originally developed at the National Security Agency, now an Apache Software Foundation project
Accumulo’s Strengths
Apache Accumulo excels at:
• Security
  – Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  – Identity and access management and encryption tie into enterprise security standards
• Scalability
  – Proven reliability and performance at the multi-petabyte scale
  – High-performance parallel I/O library
• Adaptivity
  – Flexible schema support to quickly ingest new data sources
  – Sorted key/value paradigm supports a multitude of search and analysis applications
  – Server-side programming framework (“iterator trees”) supports best-in-class aggregation, filtering, and complex query semantics
Big Data Lessons Learned
• Start small, but design for scalability
  – One application first, then grow to hundreds
  – One gigabyte first, then grow to petabytes
• Iterative schema refinement
  – Initially, let the data define the schema
  – Refine the schema in bulk as you better understand the data
  – Middle ground between flat files and complete ontologies
• Discovery analytics as application building blocks
  – Universal search: structured and unstructured data, low latency
  – Basic statistics: aggregations of query results, parallelized, low latency, to support big picture analysis
  – Graphs: scalable graph analytics for analyzing how everything is connected
• Data-centric security
  – Separate modeling of security and analysis
  – Simplifies multi-tenancy and application accreditation
Data-Centric Security
Definition: A form of security in which data carries with it the elements of provenance that are required to make policy decisions on its releasability.
• Separate data modeling for security and analysis
• Reusability of applications across security domains
• Distributed development of ingest and query applications
• Supported by Accumulo’s cell-level security
Security Policy Evaluation: Traditional Security
[Figure: Traditional security. Stages: Original Records (insert and model security), Transformed Records (transform data and model security), Query Mechanism with access checks against an Enterprise Security Server. Caption: filters 1000s per second.]
Security Policy Evaluation: Data-Centric Security
[Figure: Data-centric security. Stages: Original Records (insert and model security), Transformed Records (transform data and preserve labels), Query Mechanism with access checks against an Enterprise Security Server. Caption: filters billions per second.]
Visibility Syntax & Semantics
Basic Schema
Accumulo stores sorted key/value pairs (entries).
An Accumulo key is a 5-tuple, consisting of:
• Row: controls atomicity
• Column Family: controls locality
• Column Qualifier: controls uniqueness
• Visibility Label: controls access
• Timestamp: controls versioning
Keys are sorted:
• Hierarchically: row first, then column family, and so on.
• Lexicographically: compare first byte, then second, and so on.
Values are byte arrays.
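The sort order described on this slide can be sketched in a few lines of Python. This is an illustrative model only, not Accumulo code: the `Key` tuple and the `sort_key` helper are hypothetical names, the second "John Doe / Friends" version is made up to show versioning, and the sketch assumes that the newest timestamp sorts first among otherwise-equal keys.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for an Accumulo key (a 5-tuple)."""
    row: bytes
    column_family: bytes
    column_qualifier: bytes
    visibility: bytes
    timestamp: int


def sort_key(k: Key):
    # Hierarchical: compare row, then family, qualifier, visibility.
    # Lexicographic: Python compares bytes objects byte by byte.
    # Assumption: newest timestamp first, so the timestamp is negated.
    return (k.row, k.column_family, k.column_qualifier, k.visibility, -k.timestamp)


keys = [
    Key(b"John Doe", b"Friends", b"Jane Doe", b"JD", 20121201),
    Key(b"Jane Doe", b"Phone Number", b"555-1212", b"", 20090115),
    Key(b"John Doe", b"Friends", b"Jane Doe", b"JD", 20121130),  # made-up older version
    Key(b"Jane Doe", b"Friends", b"John Doe", b"JD", 20121130),
]
ordered = sorted(keys, key=sort_key)

for k in ordered:
    print(k.row.decode(), k.column_family.decode(), k.timestamp)
```

Because all of a row's entries end up adjacent in this ordering, a scan over one row is a contiguous range read, which is why the choice of row is so important in table design.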
Key/Value Examples
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
Jane Doe  Friends       John Doe       JD           20121130
Jane Doe  Phone Number  555-1212                    20090115
John Doe  Friends       Jane Doe       JD           20121201
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  Mental Health  PSYCH_JD     20120801   Crazy!
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
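Visibility labels such as JD|PCP_JD are boolean expressions over authorization tokens. The sketch below is a minimal stand-alone evaluator, not Accumulo's implementation: `is_visible` is a hypothetical helper, the token grammar is simplified to letters, digits, and underscores, and it assumes that `|` means OR, `&` means AND, an empty label is visible to everyone, and mixing `&` with `|` requires parentheses. The tokens A, B, and C in the last lines are invented for illustration.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def is_visible(expression: str, authorizations: set) -> bool:
    """Evaluate a visibility expression against a set of authorization tokens."""
    if expression == "":
        return True  # assumption: an empty label is visible to everyone
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError(f"unexpected character in {expression!r}")
    pos = 0

    def parse_term() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in "&|)":
            raise ValueError(f"unexpected {tok!r}")
        return tok in authorizations

    def parse_expr() -> bool:
        nonlocal pos
        result = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in "&|":
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    value = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return value


# A reader holding only PCP_JD sees the cholesterol result but not the PSYCH_JD entry.
print(is_visible("JD|PCP_JD", {"PCP_JD"}))
print(is_visible("PSYCH_JD", {"PCP_JD"}))
print(is_visible("(A&B)|C", {"A", "B"}))
```

Since the label travels with each entry, the same check can run wherever the data is read, which is the point of the data-centric model.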
Tablet Organization
• Collections of entries form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about other tablets, forming a 3-level hierarchy
• A Tablet is a unit of work for a Tablet Server
[Figure: tablet hierarchy, reached from a well-known location (ZooKeeper)]
Level 1, Root Tablet: -∞ to ∞
Level 2, Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
Level 2, Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Level 3, Table: Adam’s Table
  Data Tablet -∞ : thing
  Data Tablet thing : ∞
Level 3, Table: Encyclopedia
  Data Tablet -∞ : Ocelot
  Data Tablet Ocelot : Yak
  Data Tablet Yak : ∞
Level 3, Table: Foo
  Data Tablet -∞ to ∞
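Partitioning a sorted table into tablets amounts to keeping a sorted list of split points and finding where a row falls among them. The sketch below illustrates that idea with the Encyclopedia splits from the figure; `find_tablet` is a hypothetical helper, and it assumes each tablet's end row is inclusive and its start row exclusive.

```python
from bisect import bisect_left


def find_tablet(split_points, row):
    """Return (start, end) of the tablet that holds `row`.

    Assumes each tablet covers (previous split, split], so the end row is
    inclusive. None stands for -infinity (start) or +infinity (end).
    """
    i = bisect_left(split_points, row)
    start = split_points[i - 1] if i > 0 else None
    end = split_points[i] if i < len(split_points) else None
    return (start, end)


encyclopedia_splits = ["Ocelot", "Yak"]  # split points taken from the figure

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", find_tablet(encyclopedia_splits, row))
```

The metadata tablets apply the same idea one level up: their entries are themselves sorted, so locating the tablet for a row is a short chain of range lookups.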
Accumulo Architecture
[Figure: component diagram. Components: Applications, Tablet Servers (each hosting Tablets), Master, ZooKeeper, HDFS, Garbage Collector. Edge labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority and Configs, Scan, Delete.]
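Tablet Data Flow
[Figure: data flow within a tablet. Writes go to the Write Ahead Log (for recovery) and the In-Memory Map. Minor compaction writes the In-Memory Map out as a Sorted, Indexed File; merging / major compaction combines Sorted, Indexed Files. Reads and scans draw on the In-Memory Map and the Sorted, Indexed Files. An Iterator Tree is applied on the scan, minor compaction, and major compaction paths.]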
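The data flow on this slide can be approximated with a toy model. The sketch below is not Accumulo's implementation: the `Tablet` class, its `memory_limit` parameter, and the sequence counter are all hypothetical, and discarding the log after a flush is a simplification. It shows writes landing in a log and an in-memory map, minor compaction flushing the map to a sorted file, major compaction merging files, and a scan merging every source.

```python
import heapq


class Tablet:
    """Toy model of the tablet data flow; not Accumulo's implementation."""

    def __init__(self, memory_limit=2):
        self.write_ahead_log = []   # appended on every write, kept for recovery
        self.in_memory_map = {}     # key -> (sequence number, value)
        self.sorted_files = []      # each file is a sorted list of (key, seq, value)
        self.memory_limit = memory_limit  # hypothetical flush threshold
        self.seq = 0

    def write(self, key, value):
        self.seq += 1
        self.write_ahead_log.append((key, self.seq, value))
        self.in_memory_map[key] = (self.seq, value)
        if len(self.in_memory_map) >= self.memory_limit:
            self.minor_compact()

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if not self.in_memory_map:
            return
        flushed = sorted((k, s, v) for k, (s, v) in self.in_memory_map.items())
        self.sorted_files.append(flushed)
        self.in_memory_map = {}
        self.write_ahead_log.clear()  # simplification: flushed data needs no log

    def major_compact(self):
        # Merge all sorted files into one, keeping the newest value per key.
        merged = list(self._merged(self.sorted_files))
        self.sorted_files = [merged] if merged else []

    def _merged(self, sources):
        last_key = object()
        # Order by key, then newest sequence number first.
        for key, seq, value in heapq.merge(*sources, key=lambda e: (e[0], -e[1])):
            if key != last_key:
                yield (key, seq, value)
                last_key = key

    def scan(self):
        memory = sorted((k, s, v) for k, (s, v) in self.in_memory_map.items())
        for key, _seq, value in self._merged(self.sorted_files + [memory]):
            yield key, value


tablet = Tablet(memory_limit=2)
tablet.write("b", "1")
tablet.write("a", "2")   # reaches the limit, triggers a minor compaction
tablet.write("b", "3")   # newer value for "b", still only in memory
print(list(tablet.scan()))
```

Reads never modify files in this model; a scan is always a merge over immutable sorted files plus the current in-memory map.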
Iterator Framework
Iterator operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
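Several of the operations listed above compose naturally as a stack of iterators over sorted streams. The sketch below models four of them (merging, versioning, filtering, aggregation) as Python generators. It is an analogy, not the Accumulo iterator interface: all function names are hypothetical, the sample entries are invented, and entries are simplified to (row, column, timestamp, value) tuples.

```python
import heapq
from itertools import groupby


def merging_iterator(*sources):
    # Merge several sorted sources into one sorted stream.
    # Entries are (row, column, timestamp, value); newest timestamp sorts first.
    return heapq.merge(*sources, key=lambda e: (e[0], e[1], -e[2]))


def versioning_iterator(source, max_versions=1):
    # Keep only the newest `max_versions` entries for each (row, column).
    for _cell, versions in groupby(source, key=lambda e: (e[0], e[1])):
        for i, entry in enumerate(versions):
            if i < max_versions:
                yield entry


def filtering_iterator(source, predicate):
    # Pass through only the entries that satisfy the predicate.
    return (entry for entry in source if predicate(entry))


def summing_iterator(source):
    # Aggregate: sum the values of each row into one (row, total) pair.
    for row, entries in groupby(source, key=lambda e: e[0]):
        yield row, sum(e[3] for e in entries)


# Two sorted "files" with made-up data.
file_a = [("r1", "clicks", 2, 5), ("r2", "clicks", 1, 7)]
file_b = [("r1", "clicks", 3, 9), ("r1", "views", 1, 4)]

stack = summing_iterator(
    filtering_iterator(
        versioning_iterator(merging_iterator(file_a, file_b)),
        lambda e: e[1] == "clicks",
    )
)
result = list(stack)
print(result)
```

Each layer consumes a sorted stream and produces a sorted stream, so layers can be stacked in any order that makes sense and the work stays on the server next to the data.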
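Client API
[Figure: class diagram]
Instance: new ZooKeeperInstance(...) or new MockInstance()
  getConnector(auth info...) returns a Connector
Connector
  createScanner(...) returns a Scanner
  createBatchScanner(...) returns a BatchScanner
  createBatchWriter(...) returns a BatchWriter
  also provides TableOperations, InstanceOperations, SecurityOperations
Scanner, BatchScanner
  configured with Range, IteratorOption, Authorizations
  iterator() yields Map.Entry of Key and Value
BatchWriter
  addMutation(...) accepts a Mutation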
Table Design
• No built-in secondary indices
• Sort order ⇔ index
• Balance between ingest and query
• Avoid introducing bottlenecks
• Preserve cell-level security and scalability
Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type> + <Field>
Column Qualifier:  <Field>         <UUID>
Value:             <Term>          <Digest of Event>
Forward and Inverted Indexes
Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type>
Column Qualifier:  <Field>         <UUID> + <Field>
Value:             <Term>          <Digest of Event>
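The two layouts can be tried out on a small in-memory example. The sketch below follows the second layout (inverted qualifier = UUID + Field) and uses a sorted Python list in place of a table. Everything beyond the layout itself is an assumption made for illustration: the helper names, the null-byte separator, the sample events, and the choice of a truncated SHA-1 over the fields as the "digest of event".

```python
from bisect import bisect_left
import hashlib

SEP = "\x00"  # assumed separator inside the composite column qualifier


def index_entries(uuid, event_type, fields):
    """Return (forward, inverted) entries as (row, family, qualifier, value)."""
    # Assumption: the digest is a short hash of the event's fields.
    digest = hashlib.sha1(repr(sorted(fields.items())).encode()).hexdigest()[:8]
    forward = [(uuid, event_type, field, term) for field, term in fields.items()]
    inverted = [(term, event_type, uuid + SEP + field, digest)
                for field, term in fields.items()]
    return forward, inverted


def lookup(sorted_inverted, term):
    """Range-scan the sorted inverted index for one term; return matching UUIDs."""
    i = bisect_left(sorted_inverted, (term,))
    uuids = []
    while i < len(sorted_inverted) and sorted_inverted[i][0] == term:
        uuids.append(sorted_inverted[i][2].split(SEP)[0])
        i += 1
    return uuids


events = [  # made-up sample events
    ("u1", "person", {"name": "jane", "city": "boston"}),
    ("u2", "person", {"name": "john", "city": "boston"}),
]
forward_table, inverted_table = [], []
for uuid, event_type, fields in events:
    fwd, inv = index_entries(uuid, event_type, fields)
    forward_table.extend(fwd)
    inverted_table.extend(inv)
forward_table.sort()   # sorted order stands in for the table's sort order
inverted_table.sort()

print(lookup(inverted_table, "boston"))
```

Because the term is the row in the inverted table, a term lookup is a single contiguous range scan, which is the sense in which sort order serves as the index.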
Sqrrl Enterprise
sqrrl extends Accumulo with:
• Discovery analytics
  – Real-time search across structured and unstructured data
  – Graph analysis primitives to link across datasets
  – Distributed computation of ad-hoc statistics
  – Online statistics indexing for real-time report generation
  – Custom indexing to support your applications
• Enterprise integration
  – Support for standard query languages, including Lucene and a subset of SQL
  – I/O support for hierarchical documents, including JSON
  – Identity and Access Management with existing Kerberos, AD, LDAP, and other installations
  – Polyglot service layer, including Python, Ruby, C, and other languages
sqrrl analytics Architecture
[Figure: layered architecture, top to bottom]
Applications
Analytics APIs: Search, Statistics, Graph, Geospatial, SQL, Machine Learning, Custom Extensions
Security & Access Controls: IAM, Encryption, DAM, Secure Code
Data Integration: ETL, Hadoop
Accumulo
Adaptability
• Flexible Schema:
  – Generic hierarchical structured/unstructured document store (JSON, XML, etc.)
  – Graph store (entities, relationships)
  – Schema stats for iterative refinement
• Core Discovery Analytics:
  – Structured/unstructured text search (Lucene)
  – Big-picture analysis, aggregation (SQL)
  – Graph search (frontier expansion)
  – Additional scalable indexes (Geo, Image, Video ...)
Lightweight Apps
The future of Big Data innovation is Apps, built on:
• Universal Search
• Schema-less Statistics
• Graphs
• Intuitive Languages
• Secure, Scalable, and Adaptable platforms
Schema Discovery
Search + Graph Analysis
Schema-less Statistics
Contact
sqrrl data, Inc.
275 Third St.
Cambridge, MA 02142
(617) 902-0784
www.sqrrl.com
@sqrrl_inc