Completing the Big Data Ecosystem: sqrrl and Accumulo

Adam Fuchs and John Vines

sqrrl data INC.

August 3, 2012

Source: people.apache.org/~afuchs/slides/mit_20120803.pdf

Outline

Our Big-Data Perspective

How Accumulo Works

Implications for Applications

Accumulo in Production

Contacts

Design Drivers

Analysis of big data is central to our customers’ requirements, in which the strongest drivers are:

Scalability: The ability to do twice the work at only (about) twice the cost.

Adaptability: The ability to rapidly evolve the analytical tools available in an operational environment, building upon and enhancing existing capabilities.

Security: Getting all of the above without giving up secrecy and assurance properties.

From these directives we can derive the following requirements:

Data-Centric Security to reduce coordination needed between application developers and data providers.

Simplicity in the overall architecture to encourage participation and ease the learning curve.

Generic design patterns to store and organize data whose format we don’t control.

Generic discovery analytics to retrieve and visualize generic data.

Optimization

... is a secondary concern, given:

hundreds of evolving applications,

hundreds of changing data sources,

petabytes/exabytes of data,

many complicated interactions.

Instead, we need a generic platform that is cheap, simple, scalable, secure, and adaptable, with pretty good performance.

Key/Value Structure

An Accumulo Key is a 5-tuple, including:

Row: controls Atomicity

Column Family: controls Locality

Column Qualifier: controls Uniqueness

Visibility: controls Access (unique to Accumulo)

Timestamp: controls Versioning

Sample Entries

Row  : Col. Fam. : Col. Qual.           : Visibility : Timestamp ⇒ Value
Adam : Favorites : Food                 : (Public)   : 20090801  ⇒ Sushi
Adam : Favorites : Programming Language : (Private)  : 20090830  ⇒ Java
Adam : Favorites : Programming Language : (Private)  : 20070725  ⇒ C++
Adam : Friends   : Bob                  : (Public)   : 20110601  ⇒
Adam : Friends   : Joe                  : (Private)  : 20110601  ⇒
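The 5-tuple and its sort order can be sketched in a few lines of Python. This is an illustration only: the `Key` class and `sort_key` method are hypothetical names rather than the Accumulo client API, and plain string comparison stands in for the byte-wise comparison a real store would use. The sketch assumes keys sort ascending on the first four fields and descending on timestamp, so the newest version of a cell comes first.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Key:
    """Illustrative stand-in for an Accumulo key (not the real client class)."""
    row: str
    column_family: str
    column_qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self) -> Tuple[str, str, str, str, int]:
        # Ascending on the first four fields; the timestamp is negated so
        # that the newest version of a cell sorts first.
        return (self.row, self.column_family, self.column_qualifier,
                self.visibility, -self.timestamp)


# The sample entries from the slide, deliberately out of order.
entries: List[Tuple[Key, str]] = [
    (Key("Adam", "Favorites", "Programming Language", "Private", 20070725), "C++"),
    (Key("Adam", "Friends", "Joe", "Private", 20110601), ""),
    (Key("Adam", "Favorites", "Food", "Public", 20090801), "Sushi"),
    (Key("Adam", "Friends", "Bob", "Public", 20110601), ""),
    (Key("Adam", "Favorites", "Programming Language", "Private", 20090830), "Java"),
]

sorted_entries = sorted(entries, key=lambda kv: kv[0].sort_key())
sorted_values = [value for _, value in sorted_entries]
```

Sorting reproduces the order of the sample entries: Food, then the two Programming Language versions with Java (the newer timestamp) ahead of C++, then the Friends entries.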

Visibility Label Syntax and Semantics

Document Labels

Doc1 : (Federation)
Doc2 : (Klingon|Vulcan)
Doc3 : (Federation&Human&Vulcan)
Doc4 : (Federation&(Human|Vulcan))

User Authorization Sets

CptKirk : {Federation,Human}
MrSpock : {Federation,Human,Vulcan}

Syntax

WORD   ⇒ [a-zA-Z0-9_]+

CLAUSE ⇒ AND
       ⇒ OR

AND    ⇒ AND & AND
       ⇒ (CLAUSE)
       ⇒ WORD

OR     ⇒ OR | OR
       ⇒ (CLAUSE)
       ⇒ WORD

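Semantics

Here T is a label expression, τ is a single word, and A is a user's authorization set. Each rule reads: if the premise above the line holds, then the label T is satisfied by A.

\[
\frac{(T \Rightarrow \tau) \land (\tau \in A)}{(T, A) \models \mathrm{true}} \quad \text{(term)}
\]

\[
\frac{(T \Rightarrow T_1 \,\&\, T_2) \land ((T_1, A) \models \mathrm{true}) \land ((T_2, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(and)}
\]

\[
\frac{(T \Rightarrow T_1 \mid T_2) \land \bigl(((T_1, A) \models \mathrm{true}) \lor ((T_2, A) \models \mathrm{true})\bigr)}{(T, A) \models \mathrm{true}} \quad \text{(or)}
\]

\[
\frac{(T \Rightarrow (T_1)) \land ((T_1, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(paren)}
\]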
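The grammar and the four rules translate fairly directly into a small recursive-descent evaluator. The sketch below is a hypothetical illustration, not Accumulo's own parser: `tokenize` and `is_visible` are made-up names, and the word pattern is assumed to be letters, digits, and underscore. Following the grammar, mixing `&` and `|` at the same level without parentheses is rejected.

```python
import re
from typing import List, Set

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(label: str) -> List[str]:
    """Split a label into words, operators, and parentheses."""
    tokens, pos = [], 0
    label = label.strip()
    while pos < len(label):
        match = TOKEN.match(label, pos)
        if match is None:
            raise ValueError(f"bad character at offset {pos} in {label!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label: str, auths: Set[str]) -> bool:
    """Return True if the authorization set satisfies the label."""
    tokens = tokenize(label)
    pos = 0

    def term() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of label")
        tok = tokens[pos]
        pos += 1
        if tok == "(":                       # paren rule
            result = clause()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in auths                  # term rule

    def clause() -> bool:
        nonlocal pos
        result = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = term()                     # always parse the right-hand side
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    result = clause()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return result


labels = ["(Federation)", "(Klingon|Vulcan)",
          "(Federation&Human&Vulcan)", "(Federation&(Human|Vulcan))"]
kirk = {"Federation", "Human"}
spock = {"Federation", "Human", "Vulcan"}
kirk_sees = [is_visible(label, kirk) for label in labels]
spock_sees = [is_visible(label, spock) for label in labels]
```

With the slide's examples, CptKirk can read Doc1 and Doc4 but not Doc2 or Doc3, while MrSpock can read all four documents.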

Tablets

Collections of key/value pairs form Tables

Tables are partitioned into Tablets

Metadata tablets hold info about other tablets, forming a three-level hierarchy

A Tablet is a unit of work for a Tablet Server
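Partitioning a sorted table into tablets amounts to a lookup over sorted split points. The sketch below is a simplified illustration with hypothetical names (`tablet_index`, `splits`); it assumes each tablet covers the rows after the previous split point up to and including its own, with the last tablet unbounded above.

```python
from bisect import bisect_left
from typing import List


def tablet_index(split_points: List[str], row: str) -> int:
    """Return the index of the tablet whose row range contains `row`.

    `split_points` must be sorted. Tablet i covers rows in
    (split_points[i-1], split_points[i]]; the last tablet has no upper bound.
    """
    return bisect_left(split_points, row)


splits = ["g", "n", "t"]   # hypothetical split points, giving 4 tablets
placements = {row: tablet_index(splits, row)
              for row in ["adam", "g", "h", "n", "zed"]}
```

A metadata table can use the same lookup one level up: its rows describe user tablets, and a root entry describes the metadata tablets, which gives the three levels mentioned above.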

Distributed Processes

Tablet Server Composition

Quick and loose definitions:

Table: A map of keys to values with one global sort order among keys.

Tablet: A row range within a Table.

Tablet Server: The mechanism that hosts Tablets, providing the primary functionality of Bigtable or Accumulo.

Tablet servers have several primary functions:

1. Hosting RPCs (read, write, etc.)

2. Managing resources (RAM, CPU, File I/O, etc.)

3. Scheduling background tasks (compactions, caching, etc.)

4. Handling key/value pairs

Category 4 is almost entirely accomplished through the Iterator framework.

Tablet Server Data Flow

Iterator Uses

File Reads

Block Caching

Merging

Deletion

Isolation

Locality Groups

Range Selection

Column Selection

Cell-level Security

Versioning

Filtering

Aggregation

Partitioned Joins
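Several of these uses compose naturally as a stack of sorted-stream transformers. The sketch below shows three of them (merging, cell-level security, versioning) as Python generators. It is an illustration of the idea, not the Accumulo Iterator API: the entry layout, the function names, the single-word visibility labels, and the order in which the stages are stacked are all assumptions made for brevity.

```python
import heapq
from typing import Iterable, Iterator, List, Set, Tuple

# (row, column, visibility, timestamp, value): a deliberately simplified entry.
Entry = Tuple[str, str, str, int, str]


def sort_key(e: Entry):
    # Newest timestamp first within a cell.
    return (e[0], e[1], e[2], -e[3])


def merged(*sources: Iterable[Entry]) -> Iterator[Entry]:
    """Merging: combine several already-sorted sources into one sorted stream."""
    return heapq.merge(*sources, key=sort_key)


def visible_to(entries: Iterable[Entry], auths: Set[str]) -> Iterator[Entry]:
    """Cell-level security, reduced to single-word labels for brevity."""
    return (e for e in entries if e[2] in auths)


def newest_version_only(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Versioning: keep the first (newest) entry for each cell."""
    previous = None
    for e in entries:
        cell = e[:3]
        if cell != previous:
            yield e
        previous = cell


def scan(sources: List[List[Entry]], auths: Set[str]) -> List[Entry]:
    """Stack the three stages: merge, then filter, then version."""
    return list(newest_version_only(visible_to(merged(*sources), auths)))


in_memory = [("Adam", "Favorites:Programming Language", "Private", 20090830, "Java")]
on_disk = [("Adam", "Favorites:Food", "Public", 20090801, "Sushi"),
           ("Adam", "Favorites:Programming Language", "Private", 20070725, "C++")]

all_values = [e[4] for e in scan([in_memory, on_disk], {"Public", "Private"})]
public_values = [e[4] for e in scan([in_memory, on_disk], {"Public"})]
```

Because every stage consumes and produces a sorted stream, new behavior such as filtering or aggregation can be added by inserting another generator into the stack without touching the others.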

The Perils of Distributed Computing

Dealing with failures is hard!

Operations like table creation are logically atomic, but consist of multiple operations on distributed systems.

Resource locking (via mutexes, semaphores, etc.) provides some sanity.

Distributed systems have many complicated failure modes: clients, master, tablet servers, and dependent systems can all go offline periodically.

Who is responsible for unlocking locks when any component can fail?

How do we know it’s safe to unlock a lock?

Accumulo Testing Procedures

Testing Frameworks

Unit: Verify correct functioning of each module separately

System: Perform correctness and performance tests on a small running instance

Load/Scale: Generate high loads at scale and measure performance and correctness

Random Walk: Randomly, repeatedly, and concurrently execute a variety of test modules representative of user activity on an instance at scale

Simulation: Evaluate the model to gauge expected performance

Other Considerations

Scoping tests to include server-side code, client-side code, dependent processes, etc.

Code coverage vs. path coverage

Static vs. dynamic analysis

Simulating failures of distributed components

Strange failure modes (often hardware/physics-related)

Adampotence

Idempotent: f(f(x)) = f(x)

Adampotent: f(f′(x)) = f(x), where f′(x) denotes partial execution of f(x)
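One way to read the adampotent property is that re-running an operation after it died partway through gives the same result as one clean run. The sketch below illustrates this with a hypothetical three-step `create_table` whose steps are each safe to redo. The step names, the `Crash` exception, and the dict standing in for shared state are all invented for the example.

```python
from typing import Dict, Optional


class Crash(Exception):
    """Simulated process failure."""


def create_table(state: Dict[str, str], name: str,
                 crash_after: Optional[int] = None) -> Dict[str, str]:
    """Hypothetical multi-step operation; every step tolerates being redone."""
    steps = [
        ("id/" + name, "t1"),             # reserve a table id
        ("dir/" + name, "created"),       # create the table directory
        ("meta/" + name, "registered"),   # register the first tablet
    ]
    for done, (key, value) in enumerate(steps):
        if crash_after is not None and done == crash_after:
            raise Crash(f"died after {done} step(s)")
        state.setdefault(key, value)      # no-op if an earlier run did this
    return state


clean = create_table({}, "events")        # f(x): one uninterrupted run

partial: Dict[str, str] = {}
try:
    create_table(partial, "events", crash_after=2)   # f'(x): partial execution
except Crash:
    pass
partial_snapshot = dict(partial)

recovered = create_table(partial, "events")          # f(f'(x))
```

Here `recovered` equals `clean` even though the first attempt stopped after two of the three steps.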

Fault-Tolerant Executor

If a process dies, previously submitted operations continue to execute on restart.

FATE serializes every task in ZooKeeper before execution.

The Master process uses FATE to execute table operations and administrative actions.

FATE eliminates the single point of failure.
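The persist-then-execute idea can be sketched with an in-memory dict standing in for ZooKeeper. This is a toy model under stated assumptions, not FATE's actual design or API: `store`, `submit`, `run_pending`, the transaction id, and the step list are all hypothetical, and a crash is simulated by raising an exception.

```python
import json
from typing import Callable, Dict, List

# Stand-in for ZooKeeper: a dict that outlives any single "process".
store: Dict[str, str] = {}
effects: List[str] = []          # observable side effects of the steps

STEPS: Dict[str, List[Callable[[str], None]]] = {
    "create_table": [
        lambda table: effects.append(f"reserve id for {table}"),
        lambda table: effects.append(f"make directory for {table}"),
        lambda table: effects.append(f"register tablet for {table}"),
    ],
}


def submit(tx_id: str, op: str, arg: str) -> None:
    """Persist the operation before any of it runs."""
    store[tx_id] = json.dumps({"op": op, "arg": arg, "next_step": 0})


def run_pending(fail_at_step: int = -1) -> None:
    """Execute (or resume) every persisted operation.

    `fail_at_step` simulates the process dying just before that step.
    """
    for tx_id in sorted(store):
        tx = json.loads(store[tx_id])
        steps = STEPS[tx["op"]]
        while tx["next_step"] < len(steps):
            if tx["next_step"] == fail_at_step:
                raise RuntimeError("simulated crash")
            steps[tx["next_step"]](tx["arg"])
            tx["next_step"] += 1
            store[tx_id] = json.dumps(tx)    # record progress durably
        del store[tx_id]                     # operation complete


submit("tx-0001", "create_table", "events")
try:
    run_pending(fail_at_step=1)              # first process dies after step 0
except RuntimeError:
    pass
pending_after_crash = len(store)
run_pending()                                # a restarted process resumes
```

In this toy model a crash between running a step and recording its progress would cause that step to run again on restart, which is why each step needs to tolerate re-execution after a partial run.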

Verified State Models

State models used for many internal functions

Explicit-state model checking proves correctness

Event Table with Inverted Index

Inverted Index Flow
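The diagram for this slide did not survive in the transcript, so the following is only a generic sketch of how an event table and an inverted index are commonly paired, not a reconstruction of the slide. Two dicts stand in for the two tables, and all names (`event_table`, `index_table`, `ingest`, `query`) are hypothetical. Ingest writes the event plus one index entry per field value; a query intersects the index entries and then fetches the matching events.

```python
from collections import defaultdict
from typing import Dict, List, Set

event_table: Dict[str, Dict[str, str]] = {}            # event id -> fields
index_table: Dict[str, Set[str]] = defaultdict(set)    # "field=value" -> event ids


def ingest(event_id: str, fields: Dict[str, str]) -> None:
    """Write the event row and one index entry per field value."""
    event_table[event_id] = fields
    for field, value in fields.items():
        index_table[f"{field}={value}"].add(event_id)


def query(**criteria: str) -> List[Dict[str, str]]:
    """Return events matching every field=value criterion."""
    id_sets = [index_table.get(f"{f}={v}", set()) for f, v in criteria.items()]
    matching = set.intersection(*id_sets) if id_sets else set()
    return [event_table[i] for i in sorted(matching)]


ingest("e1", {"user": "adam", "action": "login"})
ingest("e2", {"user": "adam", "action": "logout"})
ingest("e3", {"user": "bob", "action": "login"})

adam_logins = query(user="adam", action="login")
all_logins = query(action="login")
```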

Multidimensional Index

See also: http://en.wikipedia.org/wiki/Geohash
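Geohash works by interleaving the bits of two coordinates so that nearby points tend to share a key prefix. The sketch below shows that interleaving for two non-negative integers. It is a generic Z-order illustration with hypothetical function names, not the Geohash encoding itself and not necessarily the scheme pictured on the slide.

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)   # x bits land on odd positions
        z |= ((y >> i) & 1) << (2 * i)       # y bits land on even positions
    return z


def row_key(x: int, y: int, bits: int = 16) -> str:
    """Fixed-width hex so lexicographic order matches numeric order."""
    width = (2 * bits + 3) // 4
    return format(interleave(x, y, bits), f"0{width}x")


key_a = row_key(3, 5, bits=4)
key_b = row_key(2, 4, bits=4)
```

Used as a row, such a key lets a single sorted table answer many two-dimensional range queries with a small number of row-range scans.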

Graph Table

The “shard” Table

Appstores

Reduced barrier to entry

Faster app development

More innovation!

Security Model Supported Innovation

[Figure: a grid of Apps 1-4 against Roles 1-4, with a legend contrasting role-centric security and data-centric security.]

Data-Centric Security

No in-app security necessary

Apps can be used by anyone

Adding users is easy

Maximizing Utility

Application developers need underlying design patterns exposed via an API.

Examples include SQL, SPARQL, JAQL, D4M, and many more.

What are the right ways to expose Accumulo’s technology and techniques?

Analytic Primitives

Information Retrieval

Online Statistics

Graph Analytics

Questions/Contact Info

Adam Fuchs

Cofounder and CTO
sqrrl data, [email protected]

John Vines

Cofounder and Director of Ecosystems
sqrrl data, [email protected]

General Info Email: [email protected]: www.sqrrl.co