osc2012: big data using open source: netapp project - technical

Open source Big Data case study: Building a platform for remote device support at NetApp (Part II – Technical)

Copyright © 2012 Accenture. All rights reserved.

Topics

Big Data Perspective

Case Study: NetApp AutoSupport

Technology Primer

Design Overview


Big Data

The concept is disruptive. The technology is disruptive. And markets and clients are being impacted.

1 Wordle for Credit Suisse, Does Size Matter Only?, September 2011


Shifts in Data and Analytics

Data Explosion
• Unstructured data is doubling every 3 months
• 2011 saw 47% growth overall
• By 2015, the number of networked devices will be 2x the global population

Data-led Innovation
• De-coupling data from applications
• Disparate external data shaping context
• Cost-effective mobilization of massive-scale data

Monetization
• Growth of enterprise data monetization services
• Large retailers monetizing their own data to provide insights to suppliers

Social Media
• Growing market for scrubbed, aggregate data from social media and blogs
• Greater focus on data that provides insight into a customer’s digital persona

Technology
• Commodity-priced storage and compute
• Emergence of open source and big data technologies solving production problems at scale

Data Mobilization
• Novel approaches to analyzing unstructured data, creating a shorter time from data to insight
• Shift towards data consumption in multiple environments (business apps, mobile, social)

The changing landscape and required winning strategies are creating shifts within Big Data collection and analytics


The Big Data Approach

Culture

• Data-driven decision making
• Experimentation and continuous improvement with academic rigor
• End-to-end ownership of services
• Sharing of platform, tools and code

Treat data as a strategic asset; seek to maximize its value to the organization

Invest in common services, data platforms and tools

Rapidly prototype, deliver, and measure value-added data services, evolve over time




Client Context

NetApp, Inc.

• Industry: Data storage, data management

• 77% of Fortune 500 companies are customers

• Creator of Data ONTAP: industry-leading storage OS


AutoSupport

• Secure automated “call-home” service
• Catch issues before they become critical
• System monitoring and alerting
• RMA requests without customer action
• Faster incident management

[Diagram: Storage Devices → AutoSupport Messages → AutoSupport Data Warehouse]


Business Challenges

• Increase in response times / lower availability for services
• Incoming data volume doubling every 16 months
• Proliferation of ad hoc datamarts and point solutions
• Unable to analyze full AutoSupport contents efficiently

[Diagram: existing AutoSupport environment. ASUP messages in file storage feed multiple parsers (light, core, Xterra, ad hoc), custom ETL jobs, rules modules, data warehouses (DW 2, DW 3), ad hoc databases, SAP CRM, and BI / analytics tools]

[Chart: AutoSupport Flat-File Storage Requirement, Jan-05 to Jan-16. Series: Total Usage (TB), Projected Total Usage (TB). Vertical axis: 0 to 3,500. Annotation: “Doubles”]
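The 16-month doubling period quoted on this slide implies exponential growth of the form V(t) = V0 · 2^(t/16), with t in months. A minimal sketch of that arithmetic follows; the function name and the 100 TB starting volume are illustrative assumptions and do not come from the deck.

```python
def projected_volume(initial_tb: float, months: float,
                     doubling_months: float = 16.0) -> float:
    """Volume after `months`, assuming steady doubling every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical starting volume, not a NetApp figure
    for months in (0, 16, 32, 48):
        print(f"{months:>2} months: {projected_volume(start_tb, months):.0f} TB")
```

Under this assumption, storage grows eightfold over four years, which is why a platform that scales out on commodity hardware is attractive here.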


Solution Design Goals

• Improve system response times and data availability
• Expose common data services for consumption across business units
• Standardize key business metrics into common rules repository
• Lower operational costs as ecosystem continues to scale
• Provide more granular analytical capabilities

Improve data access and technology cost effectiveness and performance.


Role of Open Source

Platform is composed of open source technologies purpose-built for large-scale storage, processing and analysis

1 Actual Big Data Solution Blueprint for a hybrid deployment




Technology Primer – Hadoop

Hadoop Distributed Filesystem (HDFS)
• Divides files into smaller “blocks”, stored across machines
• Automated replication, fault tolerance

Hadoop MapReduce
• Parallel processing for large datasets across machines
• Breaks job into tasks, using a simple map() and reduce() paradigm for data flows
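The HDFS bullets above (block splitting and replication) can be illustrated with a toy model. This is a sketch only: the tiny block size, the node names, and the round-robin placement are assumptions made for readability, and real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]], failed_node: str) -> List[int]:
    """Blocks that still have at least one replica after `failed_node` is lost."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
    print(placement)
    print(readable_blocks(placement, failed_node="node3"))
```

Because every block lives on several machines, losing a single node leaves all blocks readable, which is the fault-tolerance property the slide refers to.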


Technology Primer – MapReduce

MapReduce (Simple Example – Word Count)

Input: One fish, two fish, red fish, blue fish.

Map phase output: <one,1> <fish,1> <two,1> <fish,1> <red,1> <fish,1> <blue,1> <fish,1>

After shuffle and reduce: <one,1> <two,1> <red,1> <blue,1> <fish,4>

Map(key, value)

Reduce(key, List<value> values)
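The word-count flow on this slide can be reproduced in a few lines of plain Python. This is an in-memory sketch of the map, shuffle, and reduce steps, not Hadoop code; the function names are illustrative, and lower-casing the input is an assumption taken from the slide's `<one,1>` output.

```python
import re
from collections import defaultdict


def map_fn(_key, line):
    """Map(key, value): emit <word, 1> for every word in the line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(word, counts):
    """Reduce(key, List<value> values): sum the counts for one word."""
    return word, sum(counts)


def word_count(lines):
    mapped = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]
    return dict(reduce_fn(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["One fish,", "two fish,", "red fish,", "blue fish."]))
```

In a real cluster each input line could be handled by a different map task and each key by a different reduce task; the grouping step is what lets the four `<fish,1>` pairs meet in one place.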


Technology Primer – NoSQL

• “Not only” SQL

• Catch-all term for various non-relational database systems

• Typical areas of differentiation
  • Data model semantics (e.g. Database, Document, Key-Value)
  • CAP trade-offs (Consistency, Availability, Partition-Tolerance)
  • Scale-out architecture (e.g. Sharding, Distributed hash)
  • Query language

Examples: HBase, Cassandra, MongoDB, Neo4j, etc.
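To make the "distributed hash" style of scale-out concrete, here is a generic consistent-hashing sketch. It is not tied to any of the systems listed above; the class name, the virtual-node count, and the use of MD5 as the hash are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring: a key belongs to the first node point after its hash."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    for key in ("sys-001", "sys-002", "sys-003"):
        print(key, "->", ring.node_for(key))
```

The design point is that adding a shard only moves the keys that the new shard takes over; keys owned by the other shards stay where they are, which keeps rebalancing cheap as the cluster grows.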




Data Pipeline Overview

[Diagram: Incoming Messages → Ingestion → Core Data Processing → Data Service Interface / Ad hoc analytics / ETL]


Data Ingestion

Technologies
• Apache Flume, Apache Hadoop, Drools BRMS, JMS

Capabilities
• Handle dynamic data volumes
• Normalization of disparate file formats
• Real-time aggregation of documents
• JMS alerts for critical messages

[Diagram: documents from the front-end HTTP/SMTP gateway enter through a Flume client and pass through Flume agents arranged in a routing tier, a parsing tier, and an aggregation & sink tier. Aggregated files are written to HDFS; a rules engine issues JMS notifications]
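The three ingestion tiers (routing, parsing, aggregation) plus the alerting rule can be sketched as a small in-memory pipeline. This does not use the Flume, Drools, or JMS APIs; the two input formats, the field names, and the "critical" severity rule are hypothetical stand-ins chosen to show the flow.

```python
import json
from collections import defaultdict

FIELDS = ("system_id", "severity", "body")


def parse_json(payload):
    """Parsing tier: normalize a JSON document into the common record shape."""
    record = json.loads(payload)
    return {name: record[name] for name in FIELDS}


def parse_kv(payload):
    """Parsing tier: normalize a 'key=value;key=value' document."""
    fields = dict(item.split("=", 1) for item in payload.split(";"))
    return {name: fields[name] for name in FIELDS}


PARSERS = {"json": parse_json, "kv": parse_kv}


def route(doc):
    """Routing tier: choose a parser from the document's declared format."""
    return PARSERS[doc["format"]]


def ingest(docs, alert):
    """Run documents through routing, parsing, rule check, and aggregation."""
    aggregated = defaultdict(list)
    for doc in docs:
        record = route(doc)(doc["payload"])
        if record["severity"] == "critical":  # stand-in for a rules-engine check
            alert(record)                     # stand-in for a JMS notification
        aggregated[record["system_id"]].append(record)
    return dict(aggregated)
```

A usage example: `ingest(docs, alerts.append)` with a list `alerts` collects the critical records while the return value holds all records grouped per system, ready to be written out as aggregated files.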


Core Data Processing

Technologies
• MapReduce, HBase, Solr, Avro

Capabilities
• Parallel processing for increased throughput
• Efficient storage of complex data objects in Avro

[Diagram: documents gathered from Flume sit in HDFS. Map tasks parse text contents; Reduce tasks transform and derive data objects. Derived objects are written to the data stores: HBase (primary storage), Solr (search indexes), Hive (data warehouse)]
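The processing flow in this diagram (parse in the map step, derive objects in the reduce step, then write to a primary store and a search index) can be sketched in memory. Plain dictionaries stand in for HBase and Solr here, and the "key: value" document layout and field names are hypothetical; no real client library is used.

```python
from collections import defaultdict


def map_parse(doc):
    """Map: parse 'key: value' lines and emit (system_id, (field, value)) pairs."""
    fields = {}
    for line in doc.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    system_id = fields.pop("system")
    for field, value in fields.items():
        yield system_id, (field, value)


def reduce_derive(system_id, pairs):
    """Reduce: merge all fields seen for one system into a derived object."""
    derived = {"system_id": system_id}
    derived.update(pairs)  # later values overwrite earlier ones
    return derived


def run_job(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_parse(doc):
            grouped[key].append(value)

    primary_store = {}               # stand-in for primary storage, keyed by system
    search_index = defaultdict(set)  # stand-in for a search index: (field, value) -> systems
    for system_id, pairs in grouped.items():
        derived = reduce_derive(system_id, pairs)
        primary_store[system_id] = derived
        for field, value in derived.items():
            if field != "system_id":
                search_index[(field, value)].add(system_id)
    return primary_store, search_index
```

Keeping the lookup-by-key store separate from the inverted index mirrors the split in the diagram: one store answers "give me everything for this system", the other answers "which systems match this attribute".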


Data Services

Technologies

• Apache HBase, Solr, Tomcat

Capabilities

• Unified web services API for end users
• Support for complex queries and searches across multiple dimensions with Solr
• Access both raw and derived content for a given system


Analytics / ETL

Technologies

• Apache Hive, Pig, Datameer (Ad hoc analytics)

• Pentaho (ETL / Data Integration)

Capabilities

• Analytical environment for both business analysts and “power users”
• Hive or Pig as higher level query languages
• Datameer for analytics with a spreadsheet UI
• ETL through Pentaho MapReduce (runs Pentaho ETL server inside of a MapReduce job)


Successes and Challenges

Successes

• Web service interface contracts simplified integration with user tools and allowed for flexibility in internal implementation
• Open source core allowed for rapid iteration
• Met or exceeded all SLAs using commodity hardware, significantly driving down costs

Challenges

• Monitoring a large distributed system requires discipline and a strong operations team
• Shared storage systems and Big Data technologies don’t always play well together
• “Schemaless” systems can become a headache to maintain, especially with complex data models


Thank you

Jonathan Bender Consultant, Accenture Technology Labs

[email protected]
