sqrrl and Accumulo
Adam Fuchs and John Vines
Our Big-Data Perspective
How Accumulo Works
Implications for Applications
Accumulo in Production
Contacts
Completing the Big Data Ecosystem: sqrrl and Accumulo
Adam Fuchs and John Vines
sqrrl data INC.
August 3, 2012
Design Drivers
Analysis of big data is central to our customers’ requirements, in which the strongest drivers are:
Scalability: The ability to do twice the work at only (about) twice the cost.
Adaptability: The ability to rapidly evolve the analytical tools available in an operational environment, building upon and enhancing existing capabilities.
Security: Getting all of the above without giving up secrecy and assurance properties.
From these directives we can derive the following requirements:
Data-Centric Security to reduce coordination needed between application developers and data providers.
Simplicity in the overall architecture to encourage participation and ameliorate the learning curve.
Generic design patterns to store and organize data whose format we don’t control.
Generic discovery analytics to retrieve and visualize generic data.
Optimization
... is a secondary concern, given:
hundreds of evolving applications,
hundreds of changing data sources,
petabytes/exabytes of data,
many complicated interactions.
Instead, we need a generic platform that is cheap, simple, scalable, secure, and adaptable, with pretty good performance.
Key/Value Structure
An Accumulo Key is a 5-tuple, including:
Row: controls Atomicity
Column Family: controls Locality
Column Qualifier: controls Uniqueness
Visibility: controls Access (unique to Accumulo)
Timestamp: controls Versioning
Sample Entries
Row : Col. Fam. : Col. Qual. : Visibility : Timestamp ⇒ Value
Adam : Favorites : Food : (Public) : 20090801 ⇒ Sushi
Adam : Favorites : Programming Language : (Private) : 20090830 ⇒ Java
Adam : Favorites : Programming Language : (Private) : 20070725 ⇒ C++
Adam : Friends : Bob : (Public) : 20110601 ⇒
Adam : Friends : Joe : (Private) : 20110601 ⇒
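The 5-tuple key and its sort order can be sketched in a few lines of Python. This is only an illustration, not Accumulo's API: the `Key` tuple, `sort_key`, and `latest_versions` are hypothetical names, and plain string comparison stands in for Accumulo's byte-wise comparison. It assumes the usual ordering, ascending on row, column family, column qualifier, and visibility, and descending on timestamp so that the newest version of a cell comes first.

```python
from collections import namedtuple

# Hypothetical stand-in for an Accumulo key; field names follow the slide.
Key = namedtuple("Key", ["row", "family", "qualifier", "visibility", "timestamp"])


def sort_key(key):
    # Ascending on the first four fields, descending on timestamp.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# The sample entries from the slide, deliberately inserted out of order.
entries = {
    Key("Adam", "Friends", "Joe", "Private", 20110601): "",
    Key("Adam", "Favorites", "Programming Language", "Private", 20070725): "C++",
    Key("Adam", "Favorites", "Food", "Public", 20090801): "Sushi",
    Key("Adam", "Friends", "Bob", "Public", 20110601): "",
    Key("Adam", "Favorites", "Programming Language", "Private", 20090830): "Java",
}

ordered = sorted(entries, key=sort_key)


def latest_versions(sorted_keys):
    """Keep only the first (newest) key seen for each cell."""
    seen = set()
    newest = []
    for key in sorted_keys:
        cell = key[:4]  # everything except the timestamp
        if cell not in seen:
            seen.add(cell)
            newest.append(key)
    return newest


for key in latest_versions(ordered):
    print(key, "=>", entries[key])
```

Because the newer timestamp sorts first, keeping one version per cell retains Java and drops C++.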
Visibility Label Syntax and Semantics
Document Labels
Doc1 : (Federation)
Doc2 : (Klingon|Vulcan)
Doc3 : (Federation&Human&Vulcan)
Doc4 : (Federation&(Human|Vulcan))
User Authorization Sets
CptKirk : {Federation,Human}
MrSpock : {Federation,Human,Vulcan}
Syntax
WORD ⇒ [a-zA-Z0-9_]+
CLAUSE ⇒ AND
       ⇒ OR
AND ⇒ AND & AND
    ⇒ (CLAUSE)
    ⇒ WORD
OR ⇒ OR | OR
   ⇒ (CLAUSE)
   ⇒ WORD
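Semantics
\[
\frac{(T \Rightarrow \tau) \land (\tau \in A)}{(T, A) \models \mathrm{true}} \quad \text{(term)}
\]
\[
\frac{(T \Rightarrow T_1 \,\&\, T_2) \land ((T_1, A) \models \mathrm{true}) \land ((T_2, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(and)}
\]
\[
\frac{(T \Rightarrow T_1 \mid T_2) \land \bigl(((T_1, A) \models \mathrm{true}) \lor ((T_2, A) \models \mathrm{true})\bigr)}{(T, A) \models \mathrm{true}} \quad \text{(or)}
\]
\[
\frac{(T \Rightarrow (T_1)) \land ((T_1, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(paren)}
\]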
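The syntax and semantics above can be exercised with a small evaluator. The following is a minimal sketch in Python, not Accumulo's own implementation: the function names (`tokenize`, `parse`, `evaluate`, `visible`) are made up for illustration, and words are assumed to consist of letters, digits, and underscores. As in the grammar, `&` and `|` may only be mixed when parentheses separate them.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def tokenize(label):
    tokens = TOKEN.findall(label)
    if "".join(tokens) != label:
        raise ValueError("invalid character in label: %r" % label)
    return tokens


def parse_term(tokens, pos):
    """Parse a WORD or a parenthesised clause; return (tree, next position)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of label")
    token = tokens[pos]
    if token == "(":
        tree, pos = parse_clause(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, pos + 1
    if token in ("&", "|", ")"):
        raise ValueError("unexpected token: %r" % token)
    return token, pos + 1


def parse_clause(tokens, pos):
    """Parse terms joined by one operator kind (all '&' or all '|')."""
    terms, operator = [], None
    while True:
        term, pos = parse_term(tokens, pos)
        terms.append(term)
        if pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is not None and tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            operator = tokens[pos]
            pos += 1
        else:
            break
    return (terms[0] if operator is None else (operator, terms)), pos


def parse(label):
    tokens = tokenize(label)
    tree, pos = parse_clause(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token: %r" % tokens[pos])
    return tree


def evaluate(tree, authorizations):
    if isinstance(tree, str):  # term rule: the word must be in A
        return tree in authorizations
    operator, terms = tree
    results = (evaluate(term, authorizations) for term in terms)
    return all(results) if operator == "&" else any(results)


def visible(label, authorizations):
    return evaluate(parse(label), authorizations)


kirk = {"Federation", "Human"}
spock = {"Federation", "Human", "Vulcan"}
for doc in ["(Federation)", "(Klingon|Vulcan)",
            "(Federation&Human&Vulcan)", "(Federation&(Human|Vulcan))"]:
    print(doc, "CptKirk:", visible(doc, kirk), "MrSpock:", visible(doc, spock))
```

With the authorization sets from the slide, MrSpock can read all four documents, while CptKirk can read only Doc1 and Doc4.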
Tablets
Collections of key/value pairs form Tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a three-level hierarchy
A Tablet is a unit of work for a Tablet Server
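Partitioning a sorted table into tablets amounts to a search over split points. The sketch below is illustrative only: `tablet_for_row` and the split points are hypothetical, and it assumes each tablet covers a row range that excludes the previous split point and includes its own end row.

```python
from bisect import bisect_left


def tablet_for_row(split_points, row):
    """Return the index of the tablet whose row range contains `row`.

    With sorted split points s0 < s1 < ..., tablet 0 holds rows <= s0,
    tablet i holds rows in (s[i-1], s[i]], and the last tablet holds the rest.
    """
    return bisect_left(split_points, row)


# Hypothetical split points for a table keyed by lowercase names.
splits = ["g", "n", "t"]

for row in ["adam", "g", "joe", "zed"]:
    print(row, "-> tablet", tablet_for_row(splits, row))
```

A lookup of this kind is what the metadata tablets make possible: given a row, find the tablet (and so the tablet server) responsible for it.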
Distributed Processes
Tablet Server Composition
Quick and loose definitions:
Table: A map of keys to values with one global sort order among keys.
Tablet: A row range within a Table.
Tablet Server: The mechanism that hosts Tablets, providing the primary functionality of Bigtable or Accumulo.
Tablet servers have several primary functions:
1 Hosting RPCs (read, write, etc.)
2 Managing resources (RAM, CPU, File I/O, etc.)
3 Scheduling background tasks (compactions, caching, etc.)
4 Handling key/value pairs
Category 4 is almost entirely accomplished through the Iterator framework.
Tablet Server Data Flow
Iterator Uses
File Reads
Block Caching
Merging
Deletion
Isolation
Locality Groups
Range Selection
Column Selection
Cell-level Security
Versioning
Filtering
Aggregation
Partitioned Joins
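Several of the uses listed above (merging, versioning, filtering) compose naturally as a stack of iterators over sorted key/value streams. The following sketch uses Python generators as a stand-in; it does not reproduce Accumulo's iterator interface, and the simplified `(row, column, timestamp)` keys and all function names are assumptions made for illustration.

```python
import heapq


def sort_order(key):
    row, column, timestamp = key
    return (row, column, -timestamp)  # newest version of a cell first


def merging_iterator(*sources):
    """Merge several already-sorted (key, value) streams into one."""
    return heapq.merge(*sources, key=lambda entry: sort_order(entry[0]))


def versioning_iterator(source, max_versions=1):
    """Pass through at most `max_versions` entries per (row, column) cell."""
    current_cell, count = None, 0
    for key, value in source:
        cell = key[:2]
        if cell != current_cell:
            current_cell, count = cell, 0
        count += 1
        if count <= max_versions:
            yield key, value


def filtering_iterator(source, predicate):
    """Pass through only the entries accepted by `predicate`."""
    for key, value in source:
        if predicate(key, value):
            yield key, value


# Two sorted sources, e.g. an in-memory map and a file (hypothetical data).
in_memory = [(("adam", "lang", 20090830), "Java")]
on_disk = [
    (("adam", "food", 20090801), "Sushi"),
    (("adam", "lang", 20070725), "C++"),
    (("bob", "food", 20100101), "Pizza"),
]

stack = filtering_iterator(
    versioning_iterator(merging_iterator(in_memory, on_disk)),
    lambda key, value: key[0] == "adam",
)
result = list(stack)
print(result)
```

Each layer consumes a sorted stream and produces a sorted stream, which is what lets the layers be stacked in any combination.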
The Perils of Distributed Computing
Dealing with failures is hard!
Operations like table creation are logically atomic, but consist of multiple operations on distributed systems.
Resource locking (via mutex, semaphores, etc.) provides some sanity.
Distributed systems have many complicated failure modes: clients, master, tablet servers, and dependent systems can all go offline periodically.
Who is responsible for unlocking locks when any component can fail?
How do we know it’s safe to unlock a lock?
Accumulo Testing Procedures
Testing Frameworks
Unit: Verify correct functioning of each module separately
System: Perform correctness and performance tests on a small running instance
Load/Scale: Generate high loads at scale and measure performance and correctness
Random Walk: Randomly, repeatedly, and concurrently execute a variety of test modules representative of user activity on an instance at scale
Simulation: Evaluate the model to gauge expected performance
Other Considerations
Scoping tests to include server-side code, client-side code, dependent processes, etc.
Code coverage vs. path coverage
Static vs. dynamic analysis
Simulating failures of distributed components
Strange failure modes (often hardware/physics-related)
Adampotence
Idempotent: f(f(x)) = f(x)
Adampotent: f(f′(x)) = f(x), where f′(x) denotes partial execution of f(x)
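To make the definition concrete, here is a minimal sketch of an adampotent operation in Python. Everything in it is hypothetical (the `create_table` steps, the `/tables/` path, the state layout); the point is only that rerunning the operation after a partial execution yields the same state as one clean run, because every step is safe to repeat.

```python
class SimulatedCrash(Exception):
    pass


def empty_state():
    return {"names": set(), "dirs": set(), "online": set()}


def create_table(state, name, crash_after=None):
    """Run the steps of a made-up table creation against `state`.

    If `crash_after` is given, raise after that many steps have completed,
    which models the partial execution f'(x).
    """
    steps = [
        lambda: state["names"].add(name),             # reserve the name
        lambda: state["dirs"].add("/tables/" + name),  # create storage
        lambda: state["online"].add(name),            # bring the table online
    ]
    for completed, step in enumerate(steps):
        if crash_after is not None and completed == crash_after:
            raise SimulatedCrash("died after %d step(s)" % completed)
        step()
    return state


# f(x): one clean run.
clean = create_table(empty_state(), "events")

# f(f'(x)): a run that dies partway, followed by a full rerun.
recovered = empty_state()
try:
    create_table(recovered, "events", crash_after=1)
except SimulatedCrash:
    pass
create_table(recovered, "events")

print(recovered == clean)
```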
Fault-Tolerant Executor
If a process dies, previously submitted operations continue to execute on restart.
FATE serializes every task in Zookeeper before execution.
The Master process uses FATE to execute table operations and administrative actions.
FATE eliminates the single point of failure.
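The idea of recording a task durably before running it can be sketched as follows. This is not FATE's actual code or API: `DurableStore` is an in-memory stand-in for ZooKeeper, and `Executor`, its methods, and the step names are hypothetical. The sketch assumes each step is safe to repeat, since a crash could occur between running a step and recording its completion.

```python
class DurableStore:
    """In-memory stand-in for the durable store (ZooKeeper in the slides)."""

    def __init__(self):
        self.tasks = {}  # task id -> {"steps": [...], "next": index}


class Executor:
    def __init__(self, store, actions):
        self.store = store
        self.actions = actions  # step name -> callable taking the task id

    def submit(self, task_id, step_names):
        # The task is recorded in the store before any step executes.
        self.store.tasks[task_id] = {"steps": list(step_names), "next": 0}

    def run(self, crash_before=None):
        """Resume every unfinished task from its recorded position."""
        for task_id, task in self.store.tasks.items():
            while task["next"] < len(task["steps"]):
                name = task["steps"][task["next"]]
                if name == crash_before:
                    raise RuntimeError("simulated process death")
                self.actions[name](task_id)
                task["next"] += 1  # record progress after the step


def demo():
    log = []
    actions = {
        name: (lambda task_id, name=name: log.append((task_id, name)))
        for name in ("reserve", "mkdir", "online")
    }
    store = DurableStore()

    first = Executor(store, actions)
    first.submit("create-events", ["reserve", "mkdir", "online"])
    try:
        first.run(crash_before="mkdir")  # the first process dies partway
    except RuntimeError:
        pass

    Executor(store, actions).run()  # a restarted process picks up the rest
    return log, store


log, store = demo()
print(log)
```

The second executor knows nothing about the first beyond what the store holds, which is why the submitted operation still completes after the restart.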
Verified State Models
State models used for many internal functions
Explicit-state model checking proves correctness
Event Table with Inverted Index
Inverted Index Flow
Multidimensional Index
See also: http://en.wikipedia.org/wiki/Geohash
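Geohash-style indexes work by interleaving the bits of two coordinates into one sortable value. The sketch below shows only that interleaving step, under assumptions: coordinates are already non-negative integers, and `interleave` and `row_key` are hypothetical helper names, not part of Accumulo or of the layout shown on the slide.

```python
def interleave(x, y, bits=16):
    """Interleave the low `bits` bits of x and y, with x in the higher position."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def row_key(x, y, bits=16):
    """Fixed-width hex string, so lexicographic order matches numeric order."""
    width = (2 * bits + 3) // 4
    return format(interleave(x, y, bits), "0%dx" % width)


print(row_key(3, 0, bits=2), row_key(0, 3, bits=2), row_key(3, 3, bits=2))
```

Points that are close in both dimensions tend to share a key prefix, so a range scan over the sorted keys covers a compact region.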
Graph Table
The “shard” Table
Appstores
Reduced barrier to entry
Faster app development
More innovation!
Security Model Supported Innovation
[Figure: grid of App 1 to App 4 against Role 1 to Role 4, with a legend distinguishing role-centric security from data-centric security]
Data-Centric Security
No in-app security necessary
Apps can be used by anyone
Adding users is easy
Maximizing Utility
Application developers need underlying design patterns exposed via an API.
Examples include SQL, SPARQL, JAQL, D4M, and many more.
What are the right ways to expose Accumulo’s technology and techniques?
Analytic Primitives
Information Retrieval
Online Statistics
Graph Analytics
Questions/Contact Info
Adam Fuchs
Cofounder and CTO
sqrrl data, [email protected]
John Vines
Cofounder and Director of Ecosystems
sqrrl data, [email protected]
General Info Email: [email protected]: www.sqrrl.co