completing the big data ecosystem: sqrrl and...

Post on 19-Mar-2020

sqrrl and Accumulo

Adam Fuchs and John Vines

Our Big-Data Perspective

How Accumulo Works

Implications for Applications

Accumulo in Production

Contacts

Completing the Big Data Ecosystem: sqrrl and Accumulo

Adam Fuchs and John Vines

sqrrl data INC.

August 3, 2012

Design Drivers

Analysis of big data is central to our customers’ requirements, in which the strongest drivers are:

Scalability: The ability to do twice the work at only (about) twice the cost.

Adaptability: The ability to rapidly evolve the analytical tools available in an operational environment, building upon and enhancing existing capabilities.

Security: Getting all of the above without giving up secrecy and assurance properties.

From these directives we can derive the following requirements:

Data-Centric Security to reduce the coordination needed between application developers and data providers.

Simplicity in the overall architecture to encourage participation and ease the learning curve.

Generic design patterns to store and organize data whose format we don’t control.

Generic discovery analytics to retrieve and visualize generic data.

Optimization

... is a secondary concern, given:

hundreds of evolving applications,

hundreds of changing data sources,

petabytes/exabytes of data,

many complicated interactions.

Instead, we need a generic platform that is cheap, simple, scalable, secure, and adaptable, with pretty good performance.

Key/Value Structure

An Accumulo Key is a 5-tuple, including:

Row: controls Atomicity

Column Family: controls Locality

Column Qualifier: controls Uniqueness

Visibility: controls Access (unique to Accumulo)

Timestamp: controls Versioning

Sample Entries

Row : Col. Fam. : Col. Qual. : Visibility : Timestamp ⇒ Value
Adam : Favorites : Food : (Public) : 20090801 ⇒ Sushi
Adam : Favorites : Programming Language : (Private) : 20090830 ⇒ Java
Adam : Favorites : Programming Language : (Private) : 20070725 ⇒ C++
Adam : Friends : Bob : (Public) : 20110601 ⇒
Adam : Friends : Joe : (Private) : 20110601 ⇒
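To make the role of each key field concrete, here is a minimal Python sketch of the sample entries held in key order. It assumes the usual Accumulo ordering (ascending on row, column family, column qualifier, and visibility, then descending on timestamp so the newest version comes first). The `Key` class, `sort_key`, and `latest_versions` are illustrative names, not Accumulo API, and Python string comparison stands in for Accumulo's byte-wise comparison.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(key: Key):
    # Ascending on the first four fields, descending on timestamp,
    # so the newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# The sample entries from the slide, deliberately out of order.
entries = {
    Key("Adam", "Favorites", "Programming Language", "Private", 20070725): "C++",
    Key("Adam", "Friends", "Joe", "Private", 20110601): "",
    Key("Adam", "Favorites", "Food", "Public", 20090801): "Sushi",
    Key("Adam", "Favorites", "Programming Language", "Private", 20090830): "Java",
    Key("Adam", "Friends", "Bob", "Public", 20110601): "",
}

ordered = sorted(entries, key=sort_key)


def latest_versions(sorted_keys):
    """Keep only the first (newest) key seen for each cell."""
    seen = set()
    for key in sorted_keys:
        cell = key[:4]  # everything except the timestamp
        if cell not in seen:
            seen.add(cell)
            yield key


current = [entries[k] for k in latest_versions(ordered)]
```

With one version kept per cell, the older "C++" entry drops out while "Java" remains.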

Visibility Label Syntax and Semantics

Document Labels
Doc1 : (Federation)
Doc2 : (Klingon|Vulcan)
Doc3 : (Federation&Human&Vulcan)
Doc4 : (Federation&(Human|Vulcan))

User Authorization Sets
CptKirk : {Federation,Human}
MrSpock : {Federation,Human,Vulcan}

Syntax

WORD ⇒ [a-zA-Z0-9_]+

CLAUSE ⇒ AND
       ⇒ OR

AND ⇒ AND & AND
    ⇒ (CLAUSE)
    ⇒ WORD

OR ⇒ OR | OR
   ⇒ (CLAUSE)
   ⇒ WORD

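Semantics

\[ \frac{(T \Rightarrow \tau) \wedge (\tau \in A)}{(T, A) \models \mathrm{true}} \quad \text{(term)} \]

\[ \frac{(T \Rightarrow T_1 \,\&\, T_2) \wedge ((T_1, A) \models \mathrm{true}) \wedge ((T_2, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(and)} \]

\[ \frac{(T \Rightarrow T_1 \mid T_2) \wedge (((T_1, A) \models \mathrm{true}) \vee ((T_2, A) \models \mathrm{true}))}{(T, A) \models \mathrm{true}} \quad \text{(or)} \]

\[ \frac{(T \Rightarrow (T_1)) \wedge ((T_1, A) \models \mathrm{true})}{(T, A) \models \mathrm{true}} \quad \text{(paren)} \]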
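The grammar and rules above can be exercised with a small evaluator. This is a sketch rather than Accumulo's own implementation: `tokenize` and `visible` are hypothetical names, words are assumed to be alphanumerics plus underscore, and, following the grammar, mixing `&` and `|` at the same level without parentheses is rejected. An empty label is treated as visible to everyone.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(label):
    tokens, pos = [], 0
    label = label.strip()
    while pos < len(label):
        match = TOKEN.match(label, pos)
        if not match:
            raise ValueError(f"bad character at offset {pos} in {label!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def visible(label, auths):
    """Return True if the authorization set `auths` satisfies `label`."""
    tokens = tokenize(label)
    if not tokens:
        return True  # an empty label places no restriction on readers
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of label")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = clause()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result  # paren rule
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in auths  # term rule

    def clause():
        nonlocal pos
        result = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = term()  # always parse the right side, then combine
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    result = clause()
    if pos != len(tokens):
        raise ValueError("trailing tokens in label")
    return result


DOCS = {
    "Doc1": "(Federation)",
    "Doc2": "(Klingon|Vulcan)",
    "Doc3": "(Federation&Human&Vulcan)",
    "Doc4": "(Federation&(Human|Vulcan))",
}
USERS = {
    "CptKirk": {"Federation", "Human"},
    "MrSpock": {"Federation", "Human", "Vulcan"},
}


def readable(user):
    return [doc for doc, label in DOCS.items() if visible(label, USERS[user])]
```

Under these rules CptKirk can read Doc1 and Doc4, while MrSpock can read all four documents.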

Tablets

Collections of key/value pairs form Tables

Tables are partitioned into Tablets

Metadata tablets hold info about other tablets, forming a three-level hierarchy

A Tablet is a unit of work for a Tablet Server
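As a rough illustration of how a table is partitioned, the sketch below maps a row to the tablet that covers it using a sorted list of split points. It assumes each tablet covers a row range that excludes the previous split point and includes its own end row; `tablet_for_row` and the split points are hypothetical and are not taken from the slides.

```python
from bisect import bisect_left


def tablet_for_row(split_points, row):
    """Return the index of the tablet whose row range contains `row`.

    `split_points` must be sorted. Tablet i covers rows greater than
    split_points[i - 1] and less than or equal to split_points[i];
    the last tablet covers everything after the final split point.
    """
    return bisect_left(split_points, row)


# Three split points give four tablets:
# (-inf, "g"], ("g", "n"], ("n", "t"], ("t", +inf)
splits = ["g", "n", "t"]
```

A real client would find the hosting tablet server by looking up the row in the metadata tablets; here the sorted split list stands in for that lookup.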

Distributed Processes

Tablet Server Composition

Quick and loose definitions:

Table: A map of keys to values with one global sort order among keys.

Tablet: A row range within a Table.

Tablet Server: The mechanism that hosts Tablets, providing the primary functionality of Bigtable or Accumulo.

Tablet servers have several primary functions:

1 Hosting RPCs (read, write, etc.)

2 Managing resources (RAM, CPU, File I/O, etc.)

3 Scheduling background tasks (compactions, caching, etc.)

4 Handling key/value pairs

Function 4 is accomplished almost entirely through the Iterator framework.

Tablet Server Data Flow

Iterator Uses

File Reads

Block Caching

Merging

Deletion

Isolation

Locality Groups

Range Selection

Column Selection

Cell-level Security

Versioning

Filtering

Aggregation

Partitioned Joins
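Several of the uses listed above (merging, cell-level security, versioning) can be pictured as a stack of iterators over sorted key/value streams. The sketch below uses Python generators as an analogy; it is not Accumulo's iterator interface, the function names are invented, the stacking order is illustrative, and visibility is simplified to a single-word label that must appear in the reader's authorization set.

```python
import heapq


def sort_key(entry):
    row, family, qualifier, visibility, timestamp, _value = entry
    # Newest timestamp first within a cell.
    return (row, family, qualifier, visibility, -timestamp)


def merging(*sources):
    """Merge several already-sorted sources into one sorted stream."""
    return heapq.merge(*sources, key=sort_key)


def visibility_filter(source, authorizations):
    """Drop entries whose (single-word) label the reader does not hold."""
    for entry in source:
        if entry[3] in authorizations:
            yield entry


def versioning(source):
    """Keep only the newest entry for each cell."""
    previous = None
    for entry in source:
        cell = entry[:4]
        if cell != previous:
            previous = cell
            yield entry


# Two sorted sources, standing in for an in-memory map and a file.
in_memory = [
    ("Adam", "Favorites", "Programming Language", "Private", 20090830, "Java"),
]
on_disk = [
    ("Adam", "Favorites", "Food", "Public", 20090801, "Sushi"),
    ("Adam", "Favorites", "Programming Language", "Private", 20070725, "C++"),
    ("Adam", "Friends", "Bob", "Public", 20110601, ""),
]


def scan(authorizations):
    stack = versioning(
        visibility_filter(merging(in_memory, on_disk), authorizations)
    )
    return [(entry[2], entry[5]) for entry in stack]
```

Each layer only sees the sorted stream produced by the layer beneath it, which is what lets new behavior be added by inserting another iterator into the stack.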

The Perils of Distributed Computing

Dealing with failures is hard!

Operations like table creation are logically atomic, but consist of multiple operations on distributed systems.

Resource locking (via mutex, semaphores, etc.) provides some sanity.

Distributed systems have many complicated failure modes: clients, master, tablet servers, and dependent systems can all go offline periodically.

Who is responsible for unlocking locks when any component can fail?

How do we know it’s safe to unlock a lock?

Accumulo Testing Procedures

Testing Frameworks

Unit: Verify correct functioning of each module separately

System: Perform correctness and performance tests on a small running instance

Load/Scale: Generate high loads at scale and measure performance and correctness

Random Walk: Randomly, repeatedly, and concurrently execute a variety of test modules representative of user activity on an instance at scale

Simulation: Evaluate the model to gauge expected performance
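The random walk idea can be sketched in a few lines: pick test actions at random, apply each to the system under test and to a simple reference model, and check that the two agree. Everything here is hypothetical (`ToyTable`, `random_walk`, the three actions), and the sketch is single-threaded, so it leaves out the concurrency and scale that the real tests depend on.

```python
import random


class ToyTable:
    """Append-only log with delete markers, standing in for the instance under test."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))  # None acts as a delete marker

    def snapshot(self):
        state = {}
        for key, value in self.log:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state


def random_walk(steps, seed):
    """Run `steps` random actions and return the names of the actions taken."""
    rng = random.Random(seed)  # seeded so a failing walk can be replayed
    table, model = ToyTable(), {}

    def write():
        key, value = rng.randrange(20), rng.randrange(1000)
        table.write(key, value)
        model[key] = value

    def delete():
        key = rng.randrange(20)
        table.delete(key)
        model.pop(key, None)

    def verify():
        assert table.snapshot() == model, "table diverged from model"

    actions = [write, delete, verify]
    history = []
    for _ in range(steps):
        action = rng.choice(actions)
        action()
        history.append(action.__name__)
    verify()
    return history
```

Seeding the generator is a design choice worth keeping in any version of this: a walk that exposes a bug can then be rerun exactly.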

Other Considerations

Scoping tests to include server-side code, client-side code, dependent processes, etc.

Code coverage vs. path coverage

Static vs. dynamic analysis

Simulating failures of distributed components

Strange failure modes (often hardware/physics-related)

Adampotence

Idempotent: f(f(x)) = f(x)

Adampotent: f(f′(x)) = f(x), where f′(x) denotes partial execution of f(x)
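One way to read the adampotent property is as a requirement on multi-step operations: rerunning the whole operation after a crash partway through must give the same result as a clean run. The sketch below uses an invented `create_table` with four invented steps (not Accumulo's actual table-creation steps) and models a crash only between steps, not in the middle of one.

```python
def create_table(cluster, name, stop_after=None):
    """Hypothetical multi-step operation on a dict-based `cluster`.

    Each step checks or overwrites its own effect, so rerunning the whole
    operation after a partial run converges to the same final state.
    `stop_after=k` simulates a crash after the k-th step.
    """

    def reserve_id():
        ids = cluster.setdefault("table_ids", {})
        if name not in ids:  # do not hand out a second id on rerun
            ids[name] = len(ids) + 1

    def create_directory():
        path = f"/tables/{cluster['table_ids'][name]}"
        cluster.setdefault("directories", set()).add(path)

    def add_metadata():
        cluster.setdefault("metadata", {})[name] = {"tablets": ["default"]}

    def mark_online():
        cluster.setdefault("online", set()).add(name)

    steps = [reserve_id, create_directory, add_metadata, mark_online]
    for count, step in enumerate(steps, start=1):
        step()
        if stop_after is not None and count == stop_after:
            return cluster  # partial execution, f'(x)
    return cluster


def is_adampotent(name="events"):
    """Check f(f'(x)) == f(x) for every crash point between steps."""
    clean = create_table({}, name)
    for crash_point in (1, 2, 3):
        partial = create_table({}, name, stop_after=crash_point)
        if create_table(partial, name) != clean:
            return False
    return True
```

Note that `reserve_id` is the step that needs care: without the membership check, a rerun would allocate a second id and the property would fail.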

Fault-Tolerant Executor

If a process dies, previously submitted operations continue to execute on restart.

FATE serializes every task in Zookeeper before execution.

The Master process uses FATE to execute table operations and administrative actions.

FATE eliminates the single point of failure.
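The persist-before-execute pattern described above can be sketched as follows. This is not FATE's actual API: `DurableStore` is an in-memory stand-in for ZooKeeper, and `submit`, `run_pending`, `Crash`, and the step names are all invented for illustration. Progress is recorded after each step, so a step may run again after a crash and should therefore be safe to repeat.

```python
import json


class DurableStore:
    """In-memory stand-in for ZooKeeper; it outlives a simulated process crash."""

    def __init__(self):
        self.nodes = {}

    def put(self, path, data):
        self.nodes[path] = json.dumps(data)

    def get(self, path):
        return json.loads(self.nodes[path])

    def delete(self, path):
        del self.nodes[path]

    def children(self):
        return sorted(self.nodes)


class Crash(Exception):
    """Raised to simulate the executing process dying."""


def submit(store, txid, step_names, arg):
    # Serialize the whole operation before any step runs.
    store.put(txid, {"steps": step_names, "next": 0, "arg": arg})


def run_pending(store, world, steps, crash_before=None):
    """Run every stored operation from its recorded position."""
    for txid in store.children():
        record = store.get(txid)
        while record["next"] < len(record["steps"]):
            name = record["steps"][record["next"]]
            if name == crash_before:
                raise Crash(name)
            steps[name](world, record["arg"])
            record["next"] += 1
            store.put(txid, record)  # persist progress after each step
        store.delete(txid)  # operation complete


# Hypothetical steps; each is safe to repeat.
STEPS = {
    "reserve_name": lambda world, table: world["names"].add(table),
    "write_metadata": lambda world, table: world["metadata"].add(table),
    "bring_online": lambda world, table: world["online"].add(table),
}


def demo():
    store = DurableStore()
    world = {"names": set(), "metadata": set(), "online": set()}
    submit(store, "tx-1", list(STEPS), "events")
    try:
        run_pending(store, world, STEPS, crash_before="write_metadata")
    except Crash:
        pass  # the process died; the store still holds tx-1
    resumed_from = store.get("tx-1")["next"]
    run_pending(store, world, STEPS)  # a restarted process picks up the work
    return resumed_from, world, store.children()
```

Because the operation lives in the store rather than in the memory of the process that accepted it, any process that can read the store can finish it.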

Verified State Models

State models used for many internal functions

Explicit-state model checking proves correctness
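Explicit-state model checking amounts to enumerating every reachable state of a model and testing an invariant in each one. The sketch below shows the technique on a toy tablet-assignment model with two servers. The model, its transitions, and the invariant are assumptions made for illustration and are not Accumulo's actual state models.

```python
from collections import deque


def check(initial, successors, invariant):
    """Breadth-first search over all reachable states.

    Returns None if the invariant holds everywhere, otherwise the
    shortest path from `initial` to a violating state.
    """
    parents = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None


SERVERS = (0, 1)
# State: (server the tablet is assigned to or None, per-server "hosting" flags)
INITIAL = (None, (False, False))


def _with(hosting, server, value):
    updated = list(hosting)
    updated[server] = value
    return tuple(updated)


def successors(state, allow_blind_reassign=False):
    assigned, hosting = state
    for s in SERVERS:
        if assigned is None and not any(hosting):
            yield (s, hosting)  # master assigns an idle tablet
        if assigned == s and not hosting[s]:
            yield (assigned, _with(hosting, s, True))  # server loads it
        if hosting[s]:
            # server unloads; the assignment is cleared if it was this server's
            yield (None if assigned == s else assigned, _with(hosting, s, False))
        if allow_blind_reassign and assigned is not None and assigned != s:
            yield (s, hosting)  # bug: reassign without waiting for an unload


def at_most_one_host(state):
    return sum(state[1]) <= 1


ok = check(INITIAL, successors, at_most_one_host)
counterexample = check(
    INITIAL, lambda s: successors(s, allow_blind_reassign=True), at_most_one_host
)
```

For the careful model the search finds no violation; with blind reassignment enabled it returns a trace ending in a state where both servers host the tablet.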

Event Table with Inverted Index

Inverted Index Flow

Multidimensional Index

See also: http://en.wikipedia.org/wiki/Geohash
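The idea behind a Geohash-style index is to interleave the bits of two coordinates into one sortable value, so that points close together tend to share a key prefix. The sketch below is a plain Z-order (Morton) interleaving and not the Geohash base-32 encoding itself; `interleave`, `quantize`, and `row_key` are hypothetical names, and the 16-bit resolution is an arbitrary choice.

```python
def interleave(x, y, bits=16):
    """Z-order code: alternate the bits of x and y, with x first in each pair."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code


def quantize(value, low, high, bits=16):
    """Map a value in [low, high] to an integer cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    index = int((value - low) / (high - low) * cells)
    return min(index, cells - 1)  # keep value == high inside the last cell


def row_key(lat, lon, bits=16):
    """Fixed-width hex key, so string order matches numeric order."""
    code = interleave(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )
    return format(code, f"0{2 * bits // 4}x")
```

Used as a row, such a key lets a bounding-box query be answered with a small number of range scans, at the cost of some false positives near cell boundaries that the client or a filter must discard.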

Graph Table

The “shard” Table

Appstores

Reduced barrier to entry

Faster app development

More innovation!

Security Model Supported Innovation

[Figure: Apps 1-4 shown against Roles 1-4, with a legend distinguishing role-centric security from data-centric security.]

Data-Centric Security

No in-app security necessary

Apps can be used by anyone

Adding users is easy

Maximizing Utility

Application developers need underlying design patterns exposed via an API.

Examples include SQL, SPARQL, JAQL, D4M, and many more.

What are the right ways to expose Accumulo’s technology and techniques?

Analytic Primitives

Information Retrieval

Online Statistics

Graph Analytics

Questions/Contact Info

Adam Fuchs
Cofounder and CTO
sqrrl data, INC.
adam@sqrrl.co

John Vines
Cofounder and Director of Ecosystems
sqrrl data, INC.
john@sqrrl.co

General Info
Email: info@sqrrl.co
Website: www.sqrrl.co
