accumulo summit 2015: using d4m for rapid prototyping of analytics for apache accumulo [frameworks]

Post on 15-Jul-2015

385 Views

Category:

Technology

8 Downloads

Preview:

Click to see full reader

TRANSCRIPT

D4M and Apache Accumulo

Vijay Gadepally, Lauren Edwards,

Dylan Hutchison, Jeremy Kepner

Accumulo SummitCollege Park, MD

April 29, 2015

This work is sponsored by the Assistant Secretary of Defense for Research

and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,

interpretations, recommendations and conclusions are those of the authors

and are not necessarily endorsed by the United States Government

Accumulo Summit

VNG - 2

Giving away the punch line

• D4M is a popular open source software tool that

connects scientists with Big Data technologies

• D4M-Accumulo binding provides high performance

connectivity to Apache Accumulo for quick analytic

prototyping

• Graphulo: Implement GraphBLAS server-side

iterators and operators on Accumulo tables

Accumulo Summit

VNG - 3

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Accumulo Summit

VNG - 4

Common Big Data Challenge

CommandersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Data

Users

Gap

2000 2005 2010 2015 & Beyond

Rapidly increasing

- Data volume

- Data velocity

- Data variety

- Data veracity (security)

Accumulo Summit

VNG - 5

Common Big Data Architecture

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Accumulo Summit

VNG - 6

Common Big Data Architecture- Data Volume: Cloud Computing -

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Operators

MIT

SuperCloud

Enterprise Cloud

Big Data Cloud Database Cloud

Compute Cloud

MIT SuperCloud merges four clouds

Accumulo Summit

VNG - 7

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Lincoln benchmarking

validated Accumulo performance

Common Big Data Architecture- Data Velocity: Accumulo Database -

Accumulo Summit

VNG - 8

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

D4M demonstrated a

universal approach to diverse data

columnsro

ws

Σ

raw

Common Big Data Architecture- Data Variety: D4M Schema -

intel reports, DNA, health records, publication

citations, web logs, social media, building alarms,

cyber, … all handled by a common 4 table schema

Accumulo Summit

VNG - 9

Common Big Data Architecture- Data Veracity: Security Tools-

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Using cryptography to protect

sensitive data-Verifiable Query Results-

-Computing on Masked Data-

Big Data Cloud

Masked

Query

Plaintext

Query

Encrypt

CMD

Masked

Analytic

Result

Decrypt

Plaintext

Analytic

Result

Accumulo Summit

VNG - 10

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Accumulo Summit

VNG - 11

High Level Language: D4Mhttp://d4m.mit.edu

Accumulo

Distributed Database

Query:Alice

Bob

Cathy

David

Earl

Associative ArraysNumerical Computing Environment

D4MDynamic

Distributed

Dimensional

Data ModelA

C

DE

B

A D4M query returns a sparse

matrix or a graph…

…for statistical signal processing

or graph analysis in MATLAB

D4M binds associative arrays to databases, enabling rapid

prototyping of data-intensive cloud analytics and visualization

Accumulo Summit

VNG - 12

What is D4M?

• The Dynamic Distributed Dimensional Data Model:

– Support for mathematical foundation – associative arrays

– Schema to represent most unstructured data as associative arrays

– Software tools to connect with variety of databases such as Apache Accumulo, SciDB, mySQL, PostgreSQL, …

• Software tools currently implemented in MATLAB/Octave, and Julia (v1)

• Connect to databases via JDBC (relational), SHIM (SciDB) or custom Java API (Accumulo)

Accumulo Summit

VNG - 13

• Key innovation: mathematical closure

– All associative array operations return associative arrays

• Enables composable mathematical operations

A + B A - B A & B A|B A*B

• Enables composable query operations via array indexing

A('alice bob ',:) A('alice ',:) A('al* ',:)

A('alice : bob ',:) A(1:2,:) A == 47.0

• Simple to implement in a library in programming environments

with: 1st class support of 2D arrays, operator overloading,

sparse linear algebra

Mathematical Foundation: Associative Arrays

• Complex queries with ~50x less effort than Java/SQL

• Naturally leads to high performance parallel implementation

• Need a schema to convert arbitrary data to associative array

Accumulo Summit

VNG - 14

D4M Data Schema

• A structure described in a language supported by the database management system.

• Use D4M schema to represent heterogeneous data types in common data format

– Schema converts structured or unstructured raw text to a tuple representation supported by Accumulo:

• Usually use a 4 table representation

– The Edge Table, the Transpose Table, Degree Table, Raw Table

33659254179712 2013-05-20 21:21:42 20798128

kiefpief web 3b77caf94bfc81fe I am

sending love to Oklahoma. And actually -- to everyone who

may need it. You are loved. And you are not alone.

Promise. #PrayforOklahoma

33660010027264 2013-05-20 21:54:56 35.99894978 -

78.90660222 -8783842.7781526 4300476.86376416

22435220 RyanBLeslie Twitter for iPad348803787

bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe:

The devastation in Oklahoma is

D4M

Schema

(33659254179712, time|2013-05-20 21:21:42, 1)

(33659254179712, user|kiefpief, 1)

(33659254179712, text, Sending love to OK #PrayforOklahoma)

(33659254179712, word|Sending, 1)

(33660010027264, time|2013-05-20 21:54:56, 1)

(33660010027264, lat|-78.90660222, 1 )

(33660010027264, lon|35.99894978, 1)

(33660010027264, user|RyanBLeslie, 1)

(33660010027264, RT|@HaydenBigCntry , 1)

(33660010027264, word|Oklahoma, 1)

Accumulo Summit

VNG - 15

4 Table D4M Schema

row_num col1 col2 col3

001 row1col1 row1col2 word1 word2 word3

002 row2col1 row2col2 word2 word3

003 … … word1 word3

col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3

row_num|001 1 1 1 1 1

row_num|002 1 1 1 1

row_num|003 1 1

col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3

Degree 1 1 1 1 2 2 3

row_num|001 row_num|002 row_num|003

col1|row1col1 1

col1|row2col1

col2|row1col2 1 1

col2|row2col2 1

col3|word1 1 1

col3|word2 1 1

col3|word3 1 1

Tedge

TedgeDeg

TedgeT

text

row_num|001 word1 word2 word3

row_num|002 word2 word3

row_num|003 word1 word3

TedgeTxt

Accumulo Summit

VNG - 16

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Accumulo Summit

VNG - 17

D4M Software Library

• Associative Array representation works very well as an interface among databases.

• D4M currently implemented in languages with first class support of sparse matrices:

– MATLAB

– GNU Octave

– Julia (in progress)

• Implemented in ~2000 lines of MATLAB code

Download D4M

Source from

d4m.mit.edu

d4m_api.zip

matlab_src/

d4m_api_java.jar

libext.zip

dependency JARs

Accumulo Summit

VNG - 18

D4M: What a user sees

(row, col, val)

Matlab strings

d4m

Matlab API

d4m_api_java

Java API

Accumulo

Java APIAccumulo

Table

% D4M Associative Array APIrow = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';

A = Assoc(row,col,val,@min);

% D4M Accumulo APIDB = DBserver(’zoohost.edu:2181', 'Accumulo',

'instance', 'user', 'password');

T = DB('Table'); % Create table if doesn't exist.

put(T,A); % Put associative array in T.

Aret = T(:,:); % Scan all of T.

Accumulo Summit

VNG - 19

D4M: What a developer sees

Type Matlab/Julia File Java Class Use

Table

management

DBcreate.m D4mDbTableOperations Create table

@DBserver/ls.m D4mDbInfo List tables

@DBtable/nnz.m D4mDbTableOperationsNumber of entries in table,

summed from table's tablets

DBdelete.m D4mDbTableOperations Delete table

Write DBinsert.m D4mDbInsert Insert

Scan@DBtable/DBtable.m D4mDataSearch Create query holder

@DBtable/subsref.m D4mDataSearch Do query, possibly holding batches

@DBtable/close.m D4mDataSearch Reset query

Delete@DBtable/deleteTriple.m AccumuloDelete Delete entries

@DBtable/deleteAssoc.m AccumuloDelete Delete entries

Iterators@DBtable/ColCombiner.m D4mDbTableOperations List table iterators

@DBtable/addColCombiner.m D4mDbTableOperations Add all-scope table iterator

@DBtable/deleteColCombiner.m D4mDbTableOperations Remove iterator

Splits

@DBtable/Splits.m D4mDbTableOperationsReturn splits, number of entries in each

tablet, tablet server addresses

@DBtable/addSplits.m D4mDbTableOperations Add new table split

@DBtable/putSplits.m D4mDbTableOperations Replace table splits, merging old splits

@DBtable/mergeSplits.m D4mDbTableOperations Remove splits by merging tablets

• Source code released and available!

Accumulo Summit

VNG - 20

D4M Write

More details on Batched Insert – 500 kB by default

• putNumBytes() controls #entries to insert in one batch, on MATLAB side

• Independent batches: each creates, flushes and closes separate

BatchWriters

• Guarantee BatchWriters correctly closed

• No need to maintain BatchWriter lifecycle in MATLAB

• 30 ms maximum latency before flushing

• 50 Write threads

• 1 MB maximum memory on BatchWriter, plenty for default batch size

KeyValue

Assoc

Val

Row ID

Assoc

Row

Column Timestamp

FamilyputColumn

Family()

QualifierAssoc

Col

Visibilityput

Security()

Accumulo Summit

VNG - 21

D4M Scan Example

1. Translate Matlab queries into ranges for BatchScanner

T(:,:) %Scan all

T('r1;r5;:;r7;', :) %Scan given row ranges

T(:, 'c1;') %Use fetchColumn(), or row scan

Transpose table

T('r5;:;r9;', 'c1;:;c3;') %Complicated; break into simpler

queries

2. Hold state of Scanner iterator as state of MATLAB object

T_it = Iterator(T, 'elements', 1e5); % 100k entry batch size

A = T_it(:,:); % Initial query

while nnz(A) % While there is another batch

handleBatch(A);

A = T_it(); % Get next batch

end

Accumulo Summit

VNG - 22

Parallel Accumulo Access

Sample script writing files to Accumulo in parallel:

T = DB('Tedge','TedgeT');

myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));

for i = myFiles

fname = ['data/' num2str(i)]; % Create filename.

load([fname '.A.mat']); % Load file data.

put(T,num2str(A)); % Insert to Accumulo.

end

Run on 4 local processors: eval(pRUN('Script',4,{}));

• D4M + pMATLAB gives rise to high performance

Accumulo Summit

VNG - 23

Accumulo Scaling on MIT SuperCloud

• Scales linearly with ingest processes, server nodes, and data sizeS

erv

er n

od

es

Accumulo Summit

VNG - 24

115,000,000 inserts per second

• Using supercomputing techniques allows peak insert to be achieve

within seconds of launch

1M edge

Graph500

graph

43K

43B edges in

5 minutes

Accumulo Summit

VNG - 25

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Accumulo Summit

VNG - 26

D4M Twitter Demo

• August 24, 2014: Earthquake in Northern California

• Tweets from August 24-25

• Using D4M for:

– Exploration

– Analytics

– Visualization

Accumulo Summit

VNG - 27

Set Table Bindings

Accumulo Summit

VNG - 28

Query Tweets

Accumulo Summit

VNG - 29

Find Common Locations

Accumulo Summit

VNG - 30

Filter Tweets

Accumulo Summit

VNG - 31

Query for Full Tweets

Accumulo Summit

VNG - 32

Load Stopwords

Accumulo Summit

VNG - 33

Remove Stopwords

Accumulo Summit

VNG - 34

Find Co-Occurring Words

Accumulo Summit

VNG - 35

Remove Diagonal

Accumulo Summit

VNG - 36

See Words Most Used Together

Accumulo Summit

VNG - 37

Display on Map

Accumulo Summit

VNG - 38

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Accumulo Summit

VNG - 39

Summary

• D4M is a popular software tool that connects

scientists with Big Data technologies

• D4M-Accumulo binding provides high performance

connectivity to Apache Accumulo for quick analytic

prototyping

• Current research expands this connection to support

high performance graph analytics

Accumulo Summit

VNG - 40

• Graphulo: Implement GraphBLAS server-side iterators and operators on

Accumulo tables

• Use case: Queued analytics = Localized within a neighborhood

• Aim for Accumulo Contrib

• Released:

– Design Document

• Upcoming:

– Beta version of tools in

late May/early June

• Future:

– Scalability

– Schemas

– More example algorithms

G R A P H U L O

http://graphulo.mit.edu

Graphulo:

Contact Dylan Hutchison if you have any thoughts!

dhutchis@mit.edu

Accumulo Summit

VNG - 41

Acknowledgements

• Bill Arcand

• Bill Bergeron

• David Bestor

• Chansup Byun

• Matt Hubbell

• Jeremy Kepner

• Jake Bolewski

• Pete Michaleas

• Julie Mullen

• Andy Prout

• Albert Reuther

• Tony Rosa

• Charles Yee

• Dylan Hutchison

And many more …

Accumulo Summit

VNG - 42

Thank you!

• Contact:

– Vijay Gadepally (vijayg@ll.mit.edu)

– Lauren Edwards (lauren.edwards@ll.mit.edu)

– Jeremy Kepner (kepner@ll.mit.edu)

top related