transforming big data with d4m

D4M-1

Jeremy Kepner

MIT Lincoln Laboratory

3 October 2012

Transforming Big Data with D4M

This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

D4M-2

• Nicholas Arcolano

• Michelle Beard

• Bob Bond

• Josh Haines

• Matthew Schmidt

• Ben Miller

• Benjamin O’Gwynn

• Tamara Yu

• Bill Arcand

• Bill Bergeron

Acknowledgements

• David Bestor• Chansup Byun• Matt Hubbell• Pete Michaleas• Julie Mullen• Andy Prout• Albert Reuther• Tony Rosa• Charles Yee• Dylan Hutchinson

D4M-3

• Introduction

• Theory

• Results

• Summary

Outline

D4M-4

• Cross-Mission Challenge: Detection of subtle patterns in massivemulti-source noisy datasets

Example Applications of Graph Analytics

Cyber

• Graphs represent communication patterns of

computers on a network

• 1,000,000s – 1,000,000,000s network events

• GOAL: Detect cyber attacks or malicious software

Social

• Graphs represent relationships between

individuals or documents

• 10,000s – 10,000,000s individual and interactions

• GOAL: Identify hidden social networks

• Graphs represent entities and

relationships detected through multi-

INT sources• 1,000s – 1,000,000s tracks

and locations• GOAL: Identify anomalous

patterns of life

ISR

D4M-5

Four Ecosystems Dominate Cloud Computing

Enterprise

Big Data DBMS

Big Compute

• Each ecosystem is at the center of a multi-$B market• Pros/cons of each are numerous; diverging hardware/software• Some missions can exist wholly in one ecosystem; some can’t

- Interactive- On-demand- Elastic

- High performance- Parallel Languages- Scientific computing

- Java- Map/Reduce- Easy admin

- Indexing- Search- Security

D4M-6

• LLGrid MapReduce provides map/reduce interface in a big compute environment

• D4M provides an interactive parallel scientific computing environment to databases

LLGridEnterprise

Big Data DBMS

- Interactive- On-demand- Elastic

- High performance- Parallel Languages- Scientific computing

- Java- Map/Reduce- Easy admin

- Indexing- Search- Security

Big Compute

MapReduce

Four Ecosystems Dominate Cloud Computing

D4M-7

Big Data + Big Compute ChallengeDatabase Worldview

“It’s the data!”Delivering data is the end

Supercomputing Worldview“It’s the computer!”

Delivering data is the start

Shared Compute

Separate DataSeparate Compute

Shared Data

• Database and supercomputing views are fundamentally different• Have never coexisted; do not know how to coexist• Big Data “Analytics” are forcing them together• Current standard practice duplicates hardware and data

D4M-8

Big Data + Big Compute Stack

High Level Composable API: D4M (“Databases for Matlab”)

Weak Signatures,Noisy Data,Dynamics

Novel Analytics for:Text, Cyber, Bio

Interactive Super-computing

High Performance Computing: LLGrid + Hadoop

Distributed Database/ Distributed File System

Distributed Database: Accumulo (triple store)

A

C

E

B

Array Algebra

• Combining Big Compute and Big Data enables entirely new domains

D4M-9

High Level Language: D4Mhttp://www.mit.edu/~kepner/D4M

Distributed Database

Query:AliceBobCathyDavidEarl

Associative ArraysNumerical Computing Environment

D4MDynamic Distributed Dimensional Data Model A

C

DE

B

A D4M query returns a sparse matrix or a graph…

…for statistical signal processing or graph analysis in MATLAB

D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization

D4M-10

• Introduction

• Theory– Associate Arrays– Incidence Matrix

• Results

• Summary

Outline

D4M-11

What are Spreadsheets and Big Tables?

Spreadsheets

Big Tables

• Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?)

• Big Tables (Google, Amazon, …) store most of the analyzed data in the world (Exabytes?)

• Simultaneous diverse data: strings, dates, integers, reals, …• Simultaneous diverse uses: matrices, functions, hash tables, databases, …• No formal mathematical basis; Zero papers in AMA or SIAM

D4M-12

D4M Key Concept:Associative Arrays Unify Four Abstractions

• Extends associative arrays to 2D and mixed data typesA('alice ','bob ') = 'cited '

or A('alice ','bob ') = 47.0

• Key innovation: 2D is 1-to-1 with triple store('alice ','bob ','cited ')

or ('alice ','bob ',47.0)

x ATxAT

alice

bob

alice

carl

bob

carlcited

cited

D4M-13

• Key innovation: mathematical closure– All associative array operations return associative arrays

• Enables composable mathematical operations

A + B A - B A & B A|B A*B

• Enables composable query operations via array indexingA('alice bob ',:) A('alice ',:) A('al* ',:)

A('alice : bob ',:) A(1:2,:) A == 47.0

• Simple to implement in a library (~2000 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra

Composable Associative Arrays

• Complex queries with ~50x less effort than Java/SQL• Naturally leads to high performance parallel implementation

D4M-14

Associative Array Definitions

• Keys and values are from the infinite strict totally ordered set S

• Associative array A(k) : Sd S, k=(k1,…,kd), is a partial function from d keys (typically 2) to 1 value, where

A(ki) = vi and otherwise

• Binary operations on associative arrays A3 = A1 A2, where = f() or f(), have the properties

– If A1(ki) = v1 and A2(ki) = v2, then A3(ki) is

v1 f() v2 = f(v1,v2) or v1 f() v2 = f(v1,v2)

– If A1(ki) = v or and A2(ki) = or v, then A3(ki) is

v f() = v or v f() = • High level usage dictated by these definitions• Deeper algebraic properties set by the collision function f()• Frequent switching between “algebras” (how spreadsheets are used)

D4M-15

• Associative arrays can be constructed from a few definitions

• Similar to linear algebra, but applicable to a wider range of data

• Key questions– Which linear algebra properties do apply to associative arrays (intuitive)– Which linear algebra properties do not apply to associative arrays

(watch out)– Which associative array properties do not apply to linear algebra (new)

Theory Questions

LinearAlgebra

watch out

AssociativeArrays

intuitivenew

D4M-16

• Book: “Graph Algorithms in the Language of Linear Algebra”• Editors: Kepner (MIT-LL) and Gilbert (UCSB)• Contributors:

– Bader (Ga Tech)– Bliss (MIT-LL)– Bond (MIT-LL)– Dunlavy (Sandia)– Faloutsos (CMU)– Fineman (CMU)– Gilbert (USCB)– Heitsch (Ga Tech)– Hendrickson (Sandia)– Kegelmeyer (Sandia)– Kepner (MIT-LL)– Kolda (Sandia)– Leskovec (CMU)– Madduri (Ga Tech)– Mohindra (MIT-LL)– Nguyen (MIT)– Radar (MIT-LL)– Reinhardt (Microsoft)– Robinson (MIT-LL)– Shah (USCB)

References

D4M-17

• Introduction

• Theory– Associate Arrays– Incidence Matrix

• Results

• Summary

Outline

D4M-18

Digraphs are Black & White

D4M-19

The World is Color

Artist: Ann Pibal; Painting: “XCRS”

D4M-20

5 Edge Colors


BlueSilverGreenOrangePink

D4M-21

20 Vertices


V2

V1

V3

V4

V5

V6

V7

V8

V9

V10

V11

V13

V14V12

V15

V16

V17

V18

V19

V20

D4M-22

1 Isolated Standard Edge


P4

D4M-23

12 Multi Edges


B1,S

1,G

1,O

1,O

2,P1

B2,S

2,G

2,O

3,O

4,P2

D4M-24

18 Hyper Edges


B1,S

1,G

1,O

1,O

2,P1

B2,S

2,G

2,O

3,O

4,P2

O5

P3B1

,S1,

G1,

O1,

O2,

P1

B2,S

2,G

2,O

3,O

4,P2

P5

P6

P7

P8

D4M-25

27 Edge Orderings


O5

P3B1

,S1,

G1,

O1,

O2,

P1

B2,S

2,G

2,O

3,O

4,P2

P5

P6

P7

P8

O5 < P3,P6,P7,P8O5 < B1,S1,G1,O1,O2,P1O5 < B2,S2,G2,O3,O4,P2 < P7,P8

D4M-26

52 Standard Multi Edges


O5x5

P3x3(B

1,S1

,G1,

O1,

O2,

P1)x

2

(B2,

S2,G

2,O

3,O

4,P2

)x4P5x2

P6x2

P7x2

P8x2

D4M-27

Summary Observations


• Standard edge representation fragments hyper edges– Information is lost

• Digraph representation compresses multi-edges– Information is lost

• Matrix representation drops edge labels– Information is lost

• Standard graph representation drops edge order– Information is lost

• Need edge representation that preserves information

D4M-28

Solution: Incidence Matrix


Edge Color Order V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20

B1 Blue 2 1 1 1

S1 Silver 2 1 1 1

G1 Green 2 1 1 1

O1 Orange 2 1 1 1

O2 Orange 2 1 1 1

P1 Pink 2 1 1 1

B2 Blue 2 1 1 1 1 1

S2 Silver 2 1 1 1 1 1

G2 Green 2 1 1 1 1 1

O3 Orange 2 1 1 1 1 1

O4 Orange 2 1 1 1 1 1

P2 Pink 2 1 1 1 1 1

O5 Orange 1 1 1 1 1 1 1

P3 Pink 2 1 1 1 1

P4 Pink 2 1 1

P5 Pink 2 1 1 1

P6 Pink 2 1 1 1

P7 Pink 3 1 1 1

P8 Pink 3 1 1 1

D4M-29

• Introduction

• Theory

• Results– Network monitoring example– Bioinformatics example

• Summary

Outline

D4M-30

Graph Construction Using D4M:Explode Schema


Raw Data

CSV Files

Assoc.Arrays

log_id src_ip server_ip001 128.0.0.1 208.29.69.138

002 192.168.1.2 157.166.255.18

003 128.0.0.1 74.125.224.72

Dense Table

src_ip|128.0.0.1 src_ip|192.168.1.2 server_ip|157.166.255.18 server_ip|208.29.69.138 server_ip|74.125.224.72

log_id|001 1 0 0 1 0

log_id|002 0 1 1 0 0

log_id|003 1 0 0 0 1

Exploded Table

Use as row indices

Create columns for each unique

type/value pair

D4M-35

• Graphs can be constructed with minimal effort using D4M queries and associative array algebra



Raw Data

CSV Files

Assoc.Arrays



Associative Array AlgebraG = data(:,’src_ip|*’).’ * data(:,’server_ip|*’);

Adj(G);

D4M-36

Accumulo Ingestion Scalability StudyLLGrid MapReduce With A Python Application

Data #1: 5 GB of 200 files Data #2:

30 GB of 1000 files

4 Mil e/s

Accumulo Database: 1 Master + 7 Tablet servers

D4M-37

• Introduction

• Theory

• Results– Network monitoring example– Bioinformatics example

• Summary

Outline

D4M-38

Relative Cost per DNA Sequence

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012

High Volume Sequencer

Big Data

Energy Efficient

PortableSequencer

D4M-39

Example Disease OutbreakMay-July 2011 - Virulent E. Coli Outbreak Germany

Sequencing and crowd source analysis showed promising potential -> Still too slow

Conclusions: Identification of E. Coli source too late to have substantial impact on illnesses Publishing sequence data allowed for broad community to fully characterize pathogen

DNASequencereleased

Outbreakidentified

SpanishCucumbersimplicated

SproutsIdentified

Deaths

www.rki.de EHEC final report

diarrheakidney

D4M-40

RNA Reference Set Collected Sample

sequence word (10mer)

refe

renc

ese

quen

ce ID

unkn

own

sequ

ence

IDA1 A2

A1 A2'

refe

renc

eba

cter

ia

unkn

own

bact

eria

sequence word (10mer)

refe

renc

ese

quen

ce ID

unknown sequence ID

Sequence Matching Graph Sparse Matrix Multiply in D4M

• Associative arrays provide a natural framework for sequence matching

D4M-41

Database Automatically ComputesReference 10mer Distribution

• Using 10mer distribution can quickly select reference 10mers that maximally differentiate sample sequences and eliminate most 10mers

50% 5% 0.5%

D4M-42

Leveraging “Big Data” Technologies for High Speed Sequence Matching

• High performance triple store database trades computations for lookups• Used Apache Accumulo database to accelerate comparison by 100x• Used Lincoln D4M software to reduce code size by 100x

100 10000 100000010

100

1000

10000

code volume (lines)

run

time

(sec

onds

)D4M

D4M +Triple Store

BLAST

100x

fas

ter

100x smaller

D4M-43

• Big data is found across a wide range of areas– Document analysis– Computer network analysis– DNA Sequencing

• Currently there is a gap in big data analysis tools for algorithm developers

• D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation

Summary

transforming big data with d4m

Documents

coexistbig data analytics

big compute environmentd4m

llgrid mapreduce

s tracks

s individual

patterns of coordination

application of graph

graph analytics proclivity