big data analysis patterns with hadoop, mahout and solr

Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013

whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com

Announcements
• Next ATLHUG Meeting - Sept. 26
– How Google Does Big Data

Wednesday – MapR Data Warehouse Offload Roadshow

MapR Upcoming Training
• MapR M7 & HBase for Developers on August 27 in Campbell, CA
• MapR M7 & HBase for Developers on Sept 17 in Reston, VA
• MapR M5 for Administrators on Oct 3 in Campbell, CA

BIG DATA

Big Data is not new! But the tools are.

The Good News in Big Data:

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems

The Challenge: So Many Solutions!

What solutions fit your business problem?

For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?

Ask a Different Question

It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?

Picking the Best Solution

Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.

Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as:
• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world

Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.

Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.

Apache Drill

• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce

Storm
• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
– Message Queues
– Worker Logic

“The Hadoop of Realtime”

Titan
• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elasticsearch
• Faunus
– Batch processing of large graphs
• Fulgora
– Graph traversals on subset
– In-memory

Using the Answers to Guide Your Choices

For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

Big Data Decision Tree

How big is your data?
• <10 GB
• mid
• >200 GB

What size queries?
• Single element at a time
• One pass over 100%
• Multiple passes over big chunks

Big storage, Streaming

Response time?
• < 100 s (human scale)
• throughput, not response
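As a rough illustration, the three questions in the tree can be written as a small classifier. This is only a sketch: the thresholds (<10 GB, >200 GB, <100 s) come from the slide, but the `Workload` class, the `classify` function, and the scope names are hypothetical, and the sketch deliberately stops at labelling the branches rather than recommending a tool.

```python
from dataclasses import dataclass
from typing import Optional

GB = 10**9

# Hypothetical names for the three query-size branches on the slide.
QUERY_SCOPES = {
    "single": "single element at a time",
    "one_pass": "one pass over 100%",
    "multi_pass": "multiple passes over big chunks",
}


@dataclass
class Workload:
    stored_bytes: int                      # how large is the data to be stored?
    query_scope: str                       # key into QUERY_SCOPES
    max_response_seconds: Optional[float]  # None: throughput matters, not response


def classify(w: Workload) -> dict:
    """Label a workload along the three questions of the decision tree."""
    if w.stored_bytes < 10 * GB:
        size = "small (<10 GB)"
    elif w.stored_bytes > 200 * GB:
        size = "big (>200 GB)"
    else:
        size = "mid"

    if w.query_scope not in QUERY_SCOPES:
        raise ValueError(f"unknown query scope: {w.query_scope!r}")

    if w.max_response_seconds is not None and w.max_response_seconds < 100:
        response = "human scale (<100 s)"
    else:
        response = "throughput, not response"

    return {"size": size, "queries": QUERY_SCOPES[w.query_scope], "response": response}


batch_job = Workload(stored_bytes=500 * GB, query_scope="multi_pass",
                     max_response_seconds=None)
print(classify(batch_job))
```

A nightly batch job over 500 GB lands in the "big", "multiple passes", "throughput" branches, while an interactive lookup over a few gigabytes lands in the opposite corner of the tree.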

Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value

Business Value

Telecommunications Giant

ETL Offload

• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

Data Shape

Telecommunications

Techniques

Analytics
ETL

Telecommunications

Techniques

ETL (Hadoop)
Analytics (Teradata)

Telecommunications

Business Value

Telecommunications

Credit Card Issuer

• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations

Data Shape

Credit Card Issuer

History matrix

One row per user

One column per thing

A Recommendation Engine with Mahout and Solr/Lucene

Techniques

Credit Card Issuer

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
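The step from the history matrix (one row per user, one column per thing) to the cooccurrence matrix (one row and column per thing) can be sketched in a few lines. This is a plain-Python illustration, not Mahout's API: the `cooccurrence` function and the sample users and items are hypothetical, and it uses raw counts where a production system would typically weight or filter them.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count, for each pair of items, how many users have both.

    histories: dict mapping user -> set of items in that user's history
               (one row of the history matrix).
    Returns a nested dict counts[a][b], symmetric in a and b.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # Each unordered pair in a user's row adds one to both matrix cells.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Hypothetical purchase histories.
histories = {
    "u1": {"coffee", "books"},
    "u2": {"coffee", "books", "fuel"},
    "u3": {"fuel", "coffee"},
}
counts = cooccurrence(histories)
print(dict(counts["coffee"]))
```

Here "coffee" cooccurs with "books" for two users and with "fuel" for two users, while "books" and "fuel" cooccur only once.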

Techniques

Credit Card Issuer

Cooccurrence matrix can also be implemented as a search index
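One way to picture this: each item becomes a document whose indexed field lists the items it cooccurs with, and a user's history becomes the query. The sketch below stands in for a search engine with a plain-Python inverted index; `build_index`, `recommend`, the overlap-count scoring, and the sample data are all assumptions for illustration and do not use Solr's API or its relevance scoring.

```python
def build_index(indicators):
    """Build an inverted index from indicator term -> items that list it.

    indicators: dict mapping item -> set of items it cooccurs with
                (one row of the cooccurrence matrix, kept as terms).
    """
    index = {}
    for item, terms in indicators.items():
        for term in terms:
            index.setdefault(term, set()).add(item)
    return index


def recommend(index, user_history, k=3):
    """Query the index with a user's history; score = number of matching terms."""
    scores = {}
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:  # do not recommend what the user already has
                scores[item] = scores.get(item, 0) + 1
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


# Hypothetical indicator rows.
indicators = {
    "books": {"coffee"},
    "coffee": {"books", "fuel"},
    "fuel": {"coffee", "tires"},
    "tires": {"fuel"},
}
index = build_index(indicators)
print(recommend(index, {"coffee", "tires"}))
```

With a history of "coffee" and "tires", "fuel" matches both terms and ranks ahead of "books", which matches only one. Recommendation then becomes an ordinary search query against the index.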

Techniques

Credit Card Issuer

[Diagram labels: Solr Indexers; Solr indexing; Cooccurrence (Mahout); Item meta-data; Index shards; Complete history]

Techniques

20 Hrs 3 Hrs

Credit Card Issuer

[Diagram labels: Solr Indexers; Solr search; Web tier; Item meta-data; Index shards; User history]

Techniques

8 Hrs 3 Min

Credit Card Issuer

Techniques

[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Recommendation Engine Results (Mahout); Presentation Data Store; Hadoop; Export (4 hrs); Import (4 hrs)]

Credit Card Issuer

Techniques

[Diagram labels: Purchase History; Merchant Information; Merchant Offers; Recommendation Engine Results (Mahout); Recommendation Search Index (Solr); Hadoop; Index Update (3 min)]

Credit Card Issuer

Business Value

Credit Card Issuer

Idle Alerts

Waste & Recycling Leader

• Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries

Data Shape
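To make "arriving quickly" concrete, the figures above work out to 4,000 position events per second. The sketch below does that arithmetic and adds one possible idle check over a single truck's recent pings. Only the truck count and interval come from the slide; `distance_m`, `is_idle`, the 25 m radius, and the 300 s window are hypothetical choices, and the landfill boundaries are not used here.

```python
import math

TRUCKS = 20_000
INTERVAL_S = 5

events_per_second = TRUCKS / INTERVAL_S        # 20,000 / 5 = 4,000 events per second
events_per_day = events_per_second * 86_400    # 345,600,000 events per day


def distance_m(a, b):
    """Approximate ground distance in metres between two (lat, lon) points
    using an equirectangular projection (adequate over short distances)."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6_371_000 * math.hypot(x, y)


def is_idle(pings, radius_m=25.0, min_idle_s=300):
    """Return True if the truck stayed within radius_m of its latest position
    for at least min_idle_s.

    pings: list of (timestamp_s, lat, lon) tuples, oldest first.
    """
    if not pings:
        return False
    t_last, lat, lon = pings[-1]
    # Walk backwards in time from the latest ping.
    for t, la, lo in reversed(pings):
        if distance_m((la, lo), (lat, lon)) > radius_m:
            return False  # the truck moved before the idle window was filled
        if t_last - t >= min_idle_s:
            return True   # stationary for the whole window
    return False          # not enough history to decide


parked = [(t, 33.7490, -84.3880) for t in range(0, 305, INTERVAL_S)]
print(events_per_second, events_per_day, is_idle(parked))
```

In a streaming system a check like this would run per truck as each ping arrives, which is the kind of continuous, low-latency work the deck assigns to the realtime side.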

Techniques

[Diagram labels: Truck Geolocation; Realtime Stream Computation (Storm); Batch Computation (MapReduce); Immediate Alerts; Tax Reduction Reporting; Hadoop Storage; Shortest Path Graph Algorithm (Titan); Route Optimization]
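The shortest-path step behind route optimization can be illustrated with Dijkstra's algorithm. This is a plain-Python sketch of the algorithm itself, not Titan's API; the `shortest_path` function, the road graph, and its travel costs are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm over non-negative edge weights.

    graph: dict mapping node -> dict of neighbour -> travel cost.
    Returns (total_cost, path); (inf, []) if the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    visited = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale queue entry
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return float("inf"), []

    # Rebuild the route by following predecessor links back to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical road segments with travel costs in minutes.
roads = {
    "depot": {"a": 4, "b": 1},
    "b": {"a": 2, "landfill": 7},
    "a": {"landfill": 3},
}
print(shortest_path(roads, "depot", "landfill"))
```

The cheapest route goes depot, b, a, landfill at a cost of 6, beating both the direct depot-to-a road and the b-to-landfill road.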

Business Value

Social Engagement Application

Beverage Company

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal

Data Shape

Consumer Activity Graph

[Graph node labels: Wal*Mart.com; Dollar General; Ebay Motors; Toys R Us; StubHub; Shopping.com; Sam’s]

Techniques

[Diagram labels: Property Graph (Titan); Key/Value Store (MapR M7); Social Activity Stream; Graph Traversal (Faunus/Fulgora)]

Business Value

Fraud Detection
Data Lake

• Anti-Money Laundering
• Consumer Transactions

Data Sources

Techniques

[Diagram labels: Anti-Money Laundering System; Consumer Transactions System]

Techniques

[Diagram labels: Consumer Transactions; Data Lake (Hadoop); Suspicious Events; Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis; Analyst]

Business Value

Machine Learning
Search Relevance

DNA Matching

• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)

Data Sources

Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching

Business Value

Traffic Analytics

• Inrix Road Segment Data
– Avg Speed / minute / segment
– Reference Speeds
• Road Segment Geolocation Data

Data Sources

Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
– Alternate Routes
• Predictive Congestion Analysis
– Growth & Term Assumptions
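The slides do not spell out the bottleneck detection algorithm. One simple reading, given that the data holds an average speed and a reference speed per segment, is to flag segments whose average speed falls well below the reference. The sketch below assumes exactly that; the `bottlenecks` function, the 0.6 ratio threshold, and the segment identifiers are hypothetical.

```python
def bottlenecks(segments, threshold=0.6):
    """Flag road segments whose average speed is well below the reference speed.

    segments: dict mapping segment id -> (avg_speed, reference_speed),
              both in the same unit.
    threshold: flag a segment when avg_speed / reference_speed is below this ratio.
    Returns the flagged segment ids in sorted order.
    """
    flagged = []
    for segment_id, (avg_speed, reference_speed) in segments.items():
        if reference_speed > 0 and avg_speed / reference_speed < threshold:
            flagged.append(segment_id)
    return sorted(flagged)


# Hypothetical one-minute readings in mph: (average speed, reference speed).
readings = {
    "I-75N-12": (22.0, 60.0),
    "I-85S-03": (58.0, 65.0),
    "GA-400-07": (30.0, 55.0),
}
print(bottlenecks(readings))
```

Two of the three sample segments run at under 60% of their reference speed and are flagged. Comparing flagged segments across successive minutes would be a natural input to the time offset correlations listed above.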

Business Value

Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
– Hadoop, Storm, Solr, MPP, Visualizations
• Increase Revenue
• Decrease Costs

Questions?
