osc2012: big data using open source: netapp project - technical

23
Open source Big Data case study: Building a platform for remote device support at NetApp (Part II Technical)

Upload: accenture-the-netherlands

Post on 20-Aug-2015

716 views

Category:

Documents


7 download

TRANSCRIPT

Open source Big Data case study: Building a

platform for remote device support at NetApp

(Part II – Technical)

Copyright © 2012 Accenture All rights reserved. 2

Topics

Big Data Perspective

Case Study: NetApp AutoSupport

Technology Primer

Design Overview

Copyright © 2012 Accenture All rights reserved. 3

Big Data

The concept is disruptive. The technology is disruptive. And, markets and

clients are being impacted.

1 Wordle for Credit Suisse, Does Size Matter Only?, September 2011

Copyright © 2012 Accenture All rights reserved. 4

Shifts in Data and Analytics

Data-led Innovation

Data Explosion • Unstructured data is doubling

every 3 months

• 2011 saw 47% growth overall

• By 2015, number of networked

devices will be 2x global

population

• De-coupling data from

applications

• Disparate external data shaping

context

• Cost effective mobilization of

massive scale data

Monetization • Growth of enterprise data

monetization services

• Large retailers monetizing own

data to provide insights to

suppliers

Social Media • Growing market for scrubbed,

aggregate data from social

media and blogs

• Greater focus on data that

provides insight in a customer’s

digital persona

Technology

• Commodity priced storage and

compute

• Emergence of open source and

big data technologies solving

production problems at scale

Data Mobilization • Novel approaches to analyze

unstructured data creating

shorter time from data to insight

• Shift towards data consumption

in multiple environments

(business apps, mobile, social)

The changing landscape and required winning strategies are creating shifts

within Big Data collection and analytics

Copyright © 2012 Accenture All rights reserved. 5

The Big Data Approach

Culture

• Data-driven decision making

• Experimentation and

continuous improvement with

academic rigor

• End-to-end ownership of

services

• Sharing of platform, tools and

code

Treat data as a strategic asset, seek to maximize it’s value to the organization

Invest in common services, data platforms and tools

Rapidly prototype, deliver, and measure value-added data services, evolve over time

Copyright © 2012 Accenture All rights reserved. 6

Topics

Big Data Perspective

Case Study: NetApp AutoSupport

Technology Primer

Design Overview

Copyright © 2012 Accenture All rights reserved. 7

Client Context

NetApp, Inc.

• Industry: Data storage, data management

• 77% Fortune 500 companies are customers

• Creator of Data ONTAP: industry leading storage OS

Copyright © 2012 Accenture All rights reserved. 8

• Secure automated “call-home” service

• Catch issues before they become critical

• System monitoring and alerting

• RMA requests without customer action

• Faster incident management

AutoSupport

AutoSupport

Messages AutoSupport

Data Warehouse Storage Devices

Copyright © 2012 Accenture All rights reserved. 9

Business Challenges

• Increase in response times / lower

availability for services

• Incoming data volume doubling every 16

months

• Proliferation of ad hoc datamarts and

point solutions

• Unable to analyze full AutoSupport

contents efficiently File StorageASUP

Messages

SAP CRM

ODS

DSS

MyASUP STOReBI Analytics & MiningASUP Tools

PWillows

BI

Light

Parser

Core

Parser

Xterra DB

Parser

DW 2 Adhoc DB’s

PMBTA

Adhoc

Parsers

DRM DMHDDSAP CRMSTAGE

PNOW GEO

Transform

Presentation

Interface

Rules

Integrate

RulesRules

Rules

Stage

Extract

Source

Xterra

Parser Parser

Custom ETLCustom ETL Custom ETLCustom ETL

eB

I

DW 3

Java Interface Jasper

Stored Proc

Loader

RulesRules

Rules

Rest InterfaceCRM Module

Rules Module

Xterra File

Various

RulesRules

Rules

0

500

1000

1500

2000

2500

3000

3500

Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15 Jan-16

AutoSupport Flat-File Storage Requirement

Total Usage (tb)

Projected Total Usage (tb)

Doubles

Copyright © 2012 Accenture All rights reserved. 10

Solution Design Goals

• Improve system response times

and data availability

• Expose common data services for

consumption across business units

• Standardize key business metrics

into common rules repository

• Lower operational costs as

ecosystem continues to scale

• Provide more granular analytical

capabilities

Improve data access and technology cost effectiveness and performance.

Copyright © 2012 Accenture All rights reserved. 11

Role of Open Source

Platform is composed of open source technologies purpose-built for large-scale

storage, processing and analysis

1 Actual Big Data Solution Blueprint for a hybrid deployment

Copyright © 2012 Accenture All rights reserved. 12

Topics

Big Data Perspective

Case Study: NetApp AutoSupport

Technology Primer

Design Overview

Copyright © 2012 Accenture All rights reserved. 13

Technology Primer – Hadoop

Hadoop Distributed Filesystem

(HDFS) • Divides files into smaller “blocks”,

stored across machines

• Automated replication, fault tolerance

Hadoop MapReduce • Parallel processing for large datasets

across machines

• Breaks job into tasks, using a simple map()

and reduce() paradigm for data flows

Copyright © 2012 Accenture All rights reserved. 14

Technology Primer – MapReduce

MapReduce

(Simple Example – Word Count)

One fish,

two fish,

red fish,

blue fish.

m

m

m

m

r

r

r

<one,1>

<fish,1>

<two,1>

<fish,1>

<red,1>

<fish,1>

<blue,1>

<fish,1>

<one,1>

<two,1>

<red,1>

<blue,1>

<fish,4>

Input

Map Phase Shuffle Phase

Map(key,value)

Reduce(key, List<value> values)

Copyright © 2012 Accenture All rights reserved. 15

Technology Primer – NoSQL

• “Not only” SQL

• Catch-all term for various non-relational database systems

• Typical areas of differentation

• Data model semantics • eg. Database, Document, Key-Value

• CAP trade-offs • Consistency, Availability, Partition-Tolerance

• Scale-out architecture • eg. Sharding, Distributed hash

• Query language

Examples: HBase, Cassandra, mongoDB, Neo4j, etc.

Copyright © 2012 Accenture All rights reserved. 16

Topics

Big Data Perspective

Case Study: NetApp AutoSupport

Technology Primer

Design Overview

Copyright © 2012 Accenture All rights reserved. 17

Data Pipeline Overview

Incoming Messages

Ingestion Core Data

Processing

Data Service

Interface

Ad hoc

analytics

ETL

Copyright © 2012 Accenture All rights reserved. 18

Data Ingestion

Flume

client

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Flume

agent

Technologies

• Apache Flume, Apache Hadoop, Drools BRMS, JMS

Capabilities

• Handle dynamic data volumes

• Normalization of disparate file formats

• Real-time aggregation of documents

• JMS alerts for critical messages

HDFS

Routing tier

Parsing tier Aggregation & sink tier

Documents from

Front End HTTP/SMTP

Gateway Aggregated files

JMS

Notifications

Rules

Engine

Copyright © 2012 Accenture All rights reserved. 19

HDFS

Core Data Processing

Map

Map

Map

Reduce

Reduce

HBase

Solr Parse text

contents Transform and derive data objects

Technologies

• MapReduce, HBase, Solr, Avro

Capabilities

• Parallel processing for increased throughput

• Efficient storage of complex data objects in Avro

Primary storage

Search indexes

Write derived objects to

data stores

Hive

Documents gathered

from Flume

Data warehouse

Copyright © 2012 Accenture All rights reserved. 20

Data Services

Technologies

• Apache HBase, Solr, Tomcat

Capabilities

• Unified web services API for end

users

• Support for complex queries and

searches across multiple dimensions

with Solr

• Access both raw and derived content

for a given system

Copyright © 2012 Accenture All rights reserved. 21

Analytics / ETL

Technologies

• Apache Hive, Pig, Datameer (Ad hoc analytics)

• Pentaho (ETL / Data Integration)

Capabilities

• Analytical environment for both business analysts and “power

users”

• Hive or Pig as higher level query languages

• Datameer for analytics with a spreadsheet UI

• ETL through Pentaho MapReduce • (runs Pentaho ETL server inside of a MapReduce Job)

Copyright © 2012 Accenture All rights reserved. 22

Successes and Challenges

Successes

• Web service interface contracts simplified integration with

user tools, allowed for flexibility in internal implementation

• Open source core allowed rapid for rapid iteration

• Met or exceeded all SLAs using commodity hardware,

significantly driving down costs

Challenges

• Monitoring a large distributed system requires discipline and

a strong operations team

• Shared storage systems and Big Data technologies don’t

always play well together

• “Schemaless” systems can become a headache to

maintain, especially with complex data models

Copyright © 2012 Accenture All rights reserved. 23

Thank you

Jonathan Bender Consultant, Accenture Technology Labs

[email protected]