Upload: domenic-campbell

Post on 24-Dec-2015

D4M – Signal Processing On Databases
42 Sydney St, Artarmon NSW 2064, Australia

Virtualnation

Starting with Big Data

• Why care?
• In your reach: big data and big compute on a budget
• Start with data and apply math
• D4M with Accumulo: new technology from MIT and NSA that claims to
  • require 100x less code, and
  • be 100x faster than other approaches
• Fundamentally mathematical analysis for big data
• Lift the lid.

Understand the world through data and math

• How do you want to understand the world?
• IT approaches have evolved from a past where IT was expensive and controlled by the few
• Problems were modeled and constrained not only to fit onto limited computers but also to fit in with the politics of the enterprise
• If you could observe without built-in constraints and preconceived bias, how would you approach computing?
• Understand through the scientific method: data and math

The Primordial Web (92)

[Diagram: Client, Server and Database. Labels: Browser (html), Server (http), Language, Database (sql). Arrows: http put, http get, SQL, data. Also shown: Gopher.]

• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work just to view data.
• Won’t catch on.

The Modern Web

[Diagram: Client, Server and Database. Labels: Game (data), Server (http), Language, Database (triples). Arrows: http put, http get, java, data.]

• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data.
• Great view. Massive data.

Future Web?

[Diagram: Client, Server and Database. Labels: Game (data), Server (http), Language, Database (triples). Arrows: http put, http get, java, data.]

• Game GUI! Fileserver for files! D4M for analysis! Triples for data!
• A little work to view a lot of data. Securely.
• Great view. Massive data.

Big Data and Big Compute on a budget

• ~$9K server with 256GB RAM, 32 CPU cores and 1.7TB SSD
• ~$26K storage server with 270TB
• $199 4TB USB drive
• ZFS / SmartOS as a free virtualization technology
• ~68TB: the entire transactional corpus of a $45B Australian retailer
• How big are your possible data sets?
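To put the figures above side by side, a back-of-envelope calculation can help. The sketch below only uses the approximate prices and capacities quoted on the slide; it ignores redundancy, power and networking, and the variable names are illustrative.

```python
import math

# Approximate prices and capacities as quoted on the slide.
storage_server = {"cost_usd": 26_000, "capacity_tb": 270}
usb_drive = {"cost_usd": 199, "capacity_tb": 4}
corpus_tb = 68  # the retailer's transactional corpus


def cost_per_tb(item):
    """Raw cost per terabyte, with no allowance for redundancy."""
    return item["cost_usd"] / item["capacity_tb"]


server_rate = cost_per_tb(storage_server)
usb_rate = cost_per_tb(usb_drive)

# How many 4TB USB drives would hold the whole corpus?
drives_needed = math.ceil(corpus_tb / usb_drive["capacity_tb"])

# What share of the storage server would the corpus occupy?
corpus_fraction = corpus_tb / storage_server["capacity_tb"]

print(f"Storage server: ${server_rate:.0f}/TB")
print(f"USB drive:      ${usb_rate:.0f}/TB")
print(f"USB drives needed for the corpus: {drives_needed}")
print(f"Share of storage server used:     {corpus_fraction:.0%}")
```

On these numbers the whole corpus fits on a single storage server with room to spare.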

Apache Accumulo

NSA’s implementation of Google’s Bigtable design, now a top-level Apache project

Cell-level security to support privacy and need-to-know

Supports large-scale processing of sparse matrices…

Packaged into a secure production configuration
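As a rough illustration of what cell-level security means, the toy below attaches a set of required labels to each cell and filters a scan by the reader's authorizations. This is not the Accumulo API: Accumulo's real visibility labels are boolean expressions (with & and |), while this sketch only handles the case where all listed labels are required. The class, function and data names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # every label here is needed to read the cell


def scan(cells, authorizations):
    """Return only the cells the reader is cleared to see."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


# Hypothetical data: one row whose cells carry different visibility.
cells = [
    Cell("customer-001", "name", "A. Smith", frozenset({"staff"})),
    Cell("customer-001", "card_number", "4111-xxxx", frozenset({"staff", "payments"})),
    Cell("customer-001", "postcode", "2064", frozenset()),
]

analyst_view = scan(cells, {"staff"})
payments_view = scan(cells, {"staff", "payments"})
public_view = scan(cells, set())

for name, view in [("analyst", analyst_view), ("payments", payments_view), ("public", public_view)]:
    print(name, [c.column for c in view])
```

The point of the design is that the filter is applied per cell, so two readers scanning the same row can see different columns.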

Parallel Warehouse Scale Computer

[Diagram: memory hierarchy alongside a parallel architecture.
Memory hierarchy levels: Registers, Cache, Local Memory, Remote Memory, Disk.
Units of memory moved between levels: Instruction Operands, Blocks, Messages, Pages.
Parallel architecture: multiple nodes, each with CPU, RAM and disk, connected by a Network Switch.
Memory hierarchy implications: Bandwidth, Latency, Capacity, Programmability (each axis marked "High"; SSD also marked).]

See http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Starting with Big Data

• It is now cheap to collect all data forever.
• Unconstrained approach to data acquisition
• No analysis or modeling up front
• Much of it involves graph analytics

Cyber

• GOAL: Detect cyber attacks or malicious software

Social

• GOAL: Identify hidden social networks

ISR

• GOAL: Identify anomalous patterns of life

D4M - Signal Processing on Databases

High Level Composable API: D4M (“Databases for Matlab”)

Weak Signatures, Noisy Data, Dynamics

Novel Analytics for: Text, Cyber, Bio

Interactive Super-computing

High Performance Computing: Cluster + Hadoop

Distributed Database / Distributed File System

Distributed Database: Accumulo/HBase (triple store)

Detection Theory

Matlab Demo - Reuters Corpus V1 (NIST)

810,000 Reuters news items

The demonstration picked 70,000 items and found 13,000 entities

A is a 70K x 13K associative array with 500K entries.
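To make the shape of A concrete, here is a minimal sketch of a document-by-entity associative array held as nested dictionaries, with an entity co-occurrence count of the kind such an array supports. The documents and entity names are invented stand-ins, not Reuters data, and this is plain Python rather than the D4M Matlab library. The last line works out how sparse the quoted 70K x 13K array is.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (document, entity) pairs standing in for extracted entities.
pairs = [
    ("doc1", "person/alice"), ("doc1", "place/sydney"),
    ("doc2", "person/alice"), ("doc2", "org/acme"),
    ("doc3", "place/sydney"), ("doc3", "org/acme"), ("doc3", "person/alice"),
]

# A[doc][entity] = 1; only non-empty entries are stored.
A = defaultdict(dict)
for doc, entity in pairs:
    A[doc][entity] = 1


def cooccurrence(A):
    """Count documents shared by each entity pair (off-diagonal part of A' * A)."""
    counts = defaultdict(int)
    for entities in A.values():
        for e1, e2 in combinations(sorted(entities), 2):
            counts[(e1, e2)] += 1
    return dict(counts)


C = cooccurrence(A)
for pair, n in sorted(C.items()):
    print(pair, n)

# Density of the array quoted on the slide: 500K entries in 70K x 13K cells.
density = 500_000 / (70_000 * 13_000)
print(f"density = {density:.4%}")
```

With roughly 0.05% of cells filled, storing only the non-empty entries is what makes an array of this size practical.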

D4M demonstrations

7 Universal Constructs for Analytics

Multi-Dimensional Associative Array

Universal Exploded Schema
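The slide gives only the title, so the following is a sketch of the exploded schema as it is commonly described for D4M: each field/value pair of a record becomes its own column key, and the stored value is simply 1. The record contents, the "|" separator and the function names are assumptions made for illustration.

```python
# Hypothetical dense records keyed by row ID.
records = {
    "log1": {"src_ip": "10.0.0.1", "domain": "example.com"},
    "log2": {"src_ip": "10.0.0.2", "domain": "example.com"},
}


def explode(records, sep="|"):
    """Turn each field/value pair into a (row, 'field|value', 1) triple."""
    triples = []
    for row_key, fields in records.items():
        for field, value in fields.items():
            triples.append((row_key, f"{field}{sep}{value}", 1))
    return triples


def rows_with(triples, column):
    """Find every row that has an entry in the given exploded column."""
    return sorted(row for row, col, _ in triples if col == column)


triples = explode(records)
for t in triples:
    print(t)

print(rows_with(triples, "domain|example.com"))
```

Because every value is now part of a column name, a lookup on any field value becomes a column lookup, with no per-field index to design up front.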

D4M Stores Giant Sparse Matrices in the Accumulo Triple Store Database

[Diagram labels: Triple Store (Distributed Database); Query: T(:,ggaatctgcc); Associative Arrays (Numerical Computing Environment); D4M (Dynamic Distributed Dimensional Data Model). The query result is drawn as a graph with nodes A, B, C, D, E.]

A D4M query returns a sparse matrix or graph from a triple store…

…for statistical signal processing or graph analysis in Matlab

Triple stores are high-performance distributed databases for heterogeneous data
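A column query such as T(:,ggaatctgcc) can be pictured with a small in-memory triple store. The sketch below is a toy in plain Python, not the D4M or Accumulo API. Keeping a second, transposed index so that column lookups are as cheap as row lookups is an assumption made here for illustration, and the sequence IDs and the second 10-mer are invented.

```python
from collections import defaultdict


class ToyTripleStore:
    """Stores (row, column, value) triples, indexed by row and by column."""

    def __init__(self):
        self.by_row = defaultdict(dict)
        self.by_col = defaultdict(dict)  # transposed index for column queries

    def put(self, row, col, val):
        self.by_row[row][col] = val
        self.by_col[col][row] = val

    def column(self, col):
        """Analogue of T(:, col): every row with an entry in `col`."""
        return dict(self.by_col.get(col, {}))

    def row(self, row):
        """Analogue of T(row, :): every column with an entry in `row`."""
        return dict(self.by_row.get(row, {}))


T = ToyTripleStore()
# Hypothetical sequence IDs mapped to the 10-mers they contain.
T.put("seqA", "ggaatctgcc", 1)
T.put("seqA", "ttgacctgaa", 1)
T.put("seqB", "ttgacctgaa", 1)
T.put("seqC", "ggaatctgcc", 1)

matches = T.column("ggaatctgcc")
print(matches)
```

The result is itself a small sparse structure (row keys mapped to values), which is what lets it be handed on to matrix or graph analysis.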

Big Data for High Speed Sequence Matching