Accumulo Summit 2014: Accumulo with Distributed SQL Queries

Copyright © 2014 by Argyle Data Inc. All Rights Reserved.

Upload: accumulo-summit

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

SQL queries are often the #1 requested feature of key/value stores. Argyle will present our integration of Accumulo with Facebook’s PrestoDB distributed query engine. We will discuss:

· Data locality between PrestoDB and Accumulo

· Predicate pushdown for row keys

· Leveraging a secondary index for column-based queries

The talk will include a live demonstration of big data benchmark queries.

TRANSCRIPT

Page 1: Accumulo Summit 2014: Accumulo with Distributed SQL queries

Real-Time Risk Analytics at Network Speed and Hadoop Scale

When Minutes Means Millions

Page 2

Agenda

• About Argyle

• Use Cases we are Focusing on

• Case Study

• Architecture

• Deep Packet Inspection

• SQL on Accumulo

Page 3

Argyle Data

History

• Founded 2009

• Venture backed

• 25+ employees

• Headquartered in San Mateo, CA

Vertical Markets

• Mobile Communications, Financial Services, eCommerce, Federal

• Alliance program vertical market ISV app providers

Page 4

Argyle Data – Our Story

• Every Enterprise App

– Will be re-written in a better Data Driven way

• Data Driven Apps

– Will be Real-Time, Network Speed and Hadoop Scale

• Proven Stack for Data Driven apps

Page 5

Pattern for Real-Time Risk Applications

Minutes Means Millions

Risk App / Same Common Pattern / Customer

• Real-Time Call-Data

– Non-Invasive Network Packet Ingestion – Call Data

– Millions of Mixed Inserts/Reads/Second

– Real-Time Analytics – Fast and Fresh

• Real-Time SMS-Data

– Non-Invasive Network Packet Ingestion – SMS Data

– Millions of Mixed Inserts/Reads/Second

– Real-Time Analytics – Fast and Fresh

• Real-Time Operational Data

– Non-Invasive Packet/Log File Ingestion - Text

– Millions of Mixed Inserts/Reads/Second

– Real-Time Analytics – Fast and Fresh

Page 6

Real-Time Fraud Detection

• Situation

– Wangiri Fraud – Missed Call

– Multi-Billion Dollar Fraud

– Next Day Call Data Record Analysis

• Solution

– Real-Time Network DPI

– Real-Time Analytics and Detection

• Scale

– Ingest All Live Call Data for Whole Country

– Non-Intrusive Tap – 10Gb/s to 100Gb/s

• Benefit

– Detect IRSF Callback Fraud in Minutes

– Data Packet Lake for Multiple Apps

Page 7

Stack Shift

Old world architecture

• 24 Hour ETL/DB Process

• In-Memory Analytics

• Patchwork Quilt Systems

• App Transaction, Log Files

• Application Data Silos

• Complex Rules

• Complex App Dev

New world architecture

• Real-Time

• Petabyte Scale Analytics

• Single Hadoop Stack

• Network Packet Ingestion

• Network Packet Data Lake

• Machine Learning at Scale

• As Simple as Splunk

“62% Moving to Hadoop Infrastructure” - Gartner

Page 8

ArgyleDB

Enabling Data Driven Risk Apps at Network Speed and Hadoop Scale

• Ingestion

– Network Packet Ingestion

– Deep Packet Inspection

– Storage Optimization

• Universal Schema

• Query

– Distributed SQL Optimization

• Machine Learning

Diagram labels: Ingest, Query, Search, Graph, Machine Learning

Page 9

Deep Packet Inspection

A Sea of Protocols

Page 10

Presto + Hive

Architecture

Page 11

Presto + Accumulo

From K/V to SQL

Page 12

Parallel Architecture / Data Locality

Collocate Presto-Accumulo Workers and Accumulo Nodes
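
The collocation idea on this slide can be illustrated with a simple scheduling rule: hand each scan split to a worker running on the same host as the tablet server that holds the data, and fall back to round-robin otherwise. What follows is a minimal Python sketch of that rule only, not the Presto or Accumulo API. The function `assign_splits`, the `(split_id, host)` tuple shape, and the host names are all hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def assign_splits(splits, workers):
    """Assign each split to a worker on the same host when one exists.

    splits:  list of (split_id, tablet_server_host) pairs (hypothetical shape)
    workers: list of worker host names
    """
    if not workers:
        raise ValueError("at least one worker is required")
    worker_set = set(workers)
    # Round-robin over workers for splits that have no local worker.
    fallback = cycle(sorted(worker_set))
    assignment = defaultdict(list)
    for split_id, host in splits:
        target = host if host in worker_set else next(fallback)
        assignment[target].append(split_id)
    return dict(assignment)


# Illustrative input: node9 hosts a tablet but runs no worker.
splits = [("t1", "node1"), ("t2", "node2"), ("t3", "node1"), ("t4", "node9")]
plan = assign_splits(splits, ["node1", "node2"])
```

With workers collocated on every Accumulo node, the fallback path is never taken and every scan reads from a local tablet server.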

Page 13

Accumulo KV to Presto data model mapping

Schema-less to Schema-full

• Accumulo is schema-less

• Presto expects a predefined schema for tables

• Table definitions in ZooKeeper

• Each Presto table mapped to an Accumulo table

• Each Presto column mapped to an Accumulo column family + column qualifier

• Use column definition to detect data type and deserialize from byte[]
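
The mapping described in these bullets can be sketched roughly as follows. This is an illustrative Python model, not the connector's code: the `TABLE_DEF` layout, the column family and qualifier names, and the byte encodings (UTF-8 strings, big-endian 8-byte integers and doubles) are assumptions made for the example. The slides do not specify how the definition is stored in ZooKeeper or how values are serialized.

```python
import struct

# Hypothetical table definition, standing in for what might be kept in ZooKeeper.
TABLE_DEF = {
    "accumulo_table": "Table1",
    "columns": {
        "firstname": {"family": "name", "qualifier": "first", "type": "VARCHAR"},
        "lastname": {"family": "name", "qualifier": "last", "type": "VARCHAR"},
        "age": {"family": "attrs", "qualifier": "age", "type": "BIGINT"},
    },
}

# Assumed encodings: one decoder per declared column type.
DECODERS = {
    "VARCHAR": lambda b: b.decode("utf-8"),
    "BIGINT": lambda b: struct.unpack(">q", b)[0],
    "DOUBLE": lambda b: struct.unpack(">d", b)[0],
}


def to_row(table_def, entries):
    """Build one typed row from (family, qualifier, value_bytes) entries."""
    by_cell = {
        (col["family"], col["qualifier"]): (name, col["type"])
        for name, col in table_def["columns"].items()
    }
    row = {name: None for name in table_def["columns"]}  # missing cells stay None
    for family, qualifier, value in entries:
        mapped = by_cell.get((family, qualifier))
        if mapped is None:
            continue  # cell is not part of the declared schema
        name, sql_type = mapped
        row[name] = DECODERS[sql_type](value)
    return row


entries = [
    ("name", "first", b"Joe"),
    ("attrs", "age", struct.pack(">q", 42)),
    ("other", "x", b"ignored"),
]
row = to_row(TABLE_DEF, entries)
```

Cells that the schema does not declare are skipped, and declared columns with no cell come back as `None`, which is one plausible way to reconcile a schema-less store with a fixed table definition.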

Page 14

Secondary Index

Or how to make it columnar

• Presto works well with Columnar storage

• Presto fetches individual columns, not rows

• We considered Accumulo Locality Groups

• But we decided to use a separate index table

Page 15

Secondary Index Table

Presto Worker

Table1_index

Table1

Page 16

Secondary Index Table

Prefixed with a byte for sharding data (to prevent “burning kindle”)

Table1

Key                Column      Value
<shard_byte1>123   Firstname   Joe
<shard_byte1>123   Lastname    Smith

Table1_index

Key                  Value
<shard_byte2>Joe     <shard_byte1>123
<shard_byte3>Smith   <shard_byte1>123
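
The layout above can be modelled in a few lines to show how a column-value query goes through the index table and then back to the data table. This is a rough in-memory sketch using Python dicts in place of Accumulo tables. The shard function (CRC32 modulo `NUM_SHARDS`), the shard count, and the helper names are assumptions, since the slides do not say how the shard byte is derived. The index here also keeps a set of row keys per entry so that several rows can share a value, which simplifies the single key/value pairs shown on the slide.

```python
import zlib

NUM_SHARDS = 16  # assumed shard count; not given in the slides


def sharded(raw: str) -> bytes:
    """Prefix a key with one shard byte derived from its content (assumed scheme)."""
    data = raw.encode("utf-8")
    return bytes([zlib.crc32(data) % NUM_SHARDS]) + data


def put(table, index, record_id, columns):
    """Write a record to the data table and one index entry per column value."""
    row_key = sharded(record_id)
    for column, value in columns.items():
        table.setdefault(row_key, {})[column] = value
        index.setdefault(sharded(value), set()).add(row_key)


def lookup(table, index, column, value):
    """Answer `WHERE column = value` via the index, then fetch the rows."""
    rows = []
    for row_key in sorted(index.get(sharded(value), ())):
        record = table[row_key]
        # The index is keyed by value only, so confirm the column matches.
        if record.get(column) == value:
            rows.append(record)
    return rows


table, index = {}, {}
put(table, index, "123", {"Firstname": "Joe", "Lastname": "Smith"})
put(table, index, "456", {"Firstname": "Joe", "Lastname": "Jones"})
matches = lookup(table, index, "Lastname", "Smith")
```

Because the shard byte depends on the key content, consecutive record IDs or similar values spread across different key ranges instead of piling onto a single tablet.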

Page 17

REAL-TIME RISK ANALYTICS AT NETWORK SPEED AND HADOOP SCALE

When Minutes Means Millions