advanced analytics and big data (august 2014)


Upload: thomas-w-dinsmore

Post on 26-Jan-2015


DESCRIPTION

Summary of currently available options for advanced analytics in Hadoop

TRANSCRIPT

Page 1: Advanced Analytics and Big Data (August 2014)


Advanced Analytics with Big Data

Thomas W. Dinsmore

Page 2: Advanced Analytics and Big Data (August 2014)

Advanced Analytics with Big Data

• What do we mean by “Big Data”?

• Do we need to use all of the data?

• What analytics can run inside Big Data platforms?


Page 3: Advanced Analytics and Big Data (August 2014)

Big Data

• Data that cannot be efficiently handled in a relational database

• The three Vs:

• Volume

• Variety

• Velocity


Page 4: Advanced Analytics and Big Data (August 2014)

Big Data Platforms

• Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc.

• Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc.

• NoSQL/NewSQL: Cassandra, MongoDB, MemSQL

• Streaming engines: InfoSphere Streams

Convergence: federated SQL engines (e.g., Pivotal HAWQ)

Page 5: Advanced Analytics and Big Data (August 2014)

Analytics Platform

For aggregate models, you can simply sample the data and work offline.
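One way to draw such a sample without loading the full dataset is reservoir sampling, sketched below. This is an illustrative technique, not something the slides prescribe; the function name `reservoir_sample` and the toy record stream are assumptions made for the example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw up to k records uniformly at random from a stream of unknown length.

    Only k records are held in memory at any time, so the stream can be
    far larger than the machine doing the offline analysis.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    # Hypothetical stream standing in for rows read from a big data platform.
    rows = range(1_000_000)
    sample = reservoir_sample(rows, k=1_000)
    print(f"sampled {len(sample)} rows, sample mean = {sum(sample) / len(sample):.0f}")
```

An aggregate estimate such as a mean or a regression fit can then be computed on the sample with any desktop tool.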

Page 6: Advanced Analytics and Big Data (August 2014)

However, for some use cases you may need to use all of the data:

• Anomaly Detection

• Affinity Analysis

• Microsegmentation

• Social Network Analysis

• Collaborative Filtering
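One reason sampling can fail for a use case like anomaly detection is that rare records may not appear in the sample at all. The sketch below computes the chance that a simple random sample contains none of the rare records; the population size, anomaly count, and sample size are made-up numbers for illustration, and the function name is hypothetical.

```python
def prob_sample_misses_all(population: int, rare: int, sample_size: int) -> float:
    """Probability that a simple random sample (without replacement)
    contains none of the `rare` records in a population.

    This is the hypergeometric probability C(N-K, n) / C(N, n), computed
    as a running product to avoid very large intermediate integers.
    """
    if sample_size > population - rare:
        # The sample is too large to avoid every rare record.
        return 0.0
    p = 1.0
    for i in range(sample_size):
        p *= (population - rare - i) / (population - i)
    return p


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 records, 50 anomalies, 1% sample.
    p_miss = prob_sample_misses_all(1_000_000, 50, 10_000)
    print(f"chance the sample contains no anomalies: {p_miss:.2f}")
```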

Page 7: Advanced Analytics and Big Data (August 2014)

For other use cases, using all of the data is worth extra time and effort:

• Catastrophic Risk Modeling

• Modeling with Fine-grained Behavioral Data

Page 8: Advanced Analytics and Big Data (August 2014)

[Diagram: six HDFS nodes, Data]

Most legacy analytic packages can read HDFS files.

Page 9: Advanced Analytics and Big Data (August 2014)

[Diagram: six HDFS nodes, MapReduce, Data]

Some tools also provide pass-through capabilities.

Page 10: Advanced Analytics and Big Data (August 2014)

[Diagram: six HDFS nodes, MapReduce]

Advantages

• Co-exists with other applications

• Integrated workload management

• Simplified administration

Disadvantages

• MapReduce latency

Several tools translate user requests to MapReduce. This eliminates data movement and co-exists well with other applications.
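To make the translation step concrete, here is a minimal single-process sketch of how a request such as "average amount by region" could be expressed as a map phase and a reduce phase. It only imitates the MapReduce pattern in plain Python; the field names `region` and `amount` and the helper `run_job` are assumptions for illustration, not the API of any tool named in the deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


def map_phase(record: Record) -> Iterator[Tuple[str, Tuple[float, int]]]:
    """Emit (key, (partial sum, partial count)) for one input record."""
    yield str(record["region"]), (float(record["amount"]), 1)


def reduce_phase(key: str, values: List[Tuple[float, int]]) -> Tuple[str, float]:
    """Combine all partial sums and counts for one key into an average."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def run_job(records: Iterable[Record]) -> Dict[str, float]:
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)  # the "shuffle" step
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    rows = [
        {"region": "east", "amount": 10.0},
        {"region": "east", "amount": 20.0},
        {"region": "west", "amount": 5.0},
    ]
    print(run_job(rows))
```

On a real cluster the map and reduce functions run where the data lives, which is why this approach avoids data movement; the per-job startup cost is the latency disadvantage listed above.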

Page 11: Advanced Analytics and Big Data (August 2014)

[Diagram: YARN, with HDFS and MapReduce on each of six nodes]

Advantages

• Easy to adapt legacy apps

• Isolates analytic workload

Disadvantages

• Data moves within the cluster

• Requires YARN

YARN (*) makes it possible to bypass MapReduce and run analytics in memory on dedicated nodes.

(*) Yet Another Resource Negotiator

Page 12: Advanced Analytics and Big Data (August 2014)

[Diagram: YARN, with HDFS and MapReduce on each node]

Advantages

• Lowest latency

Disadvantages

• Upgrade every node

• Requires YARN

Distributing in-memory analytics across the Hadoop cluster minimizes internal data movement.
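The idea of keeping computation next to the data can be sketched as follows: each node summarizes its own partition in memory, and only the small summaries travel across the network to be merged. This is a generic illustration under assumed names (`Summary`, `summarize`), not the design of any specific product in this deck, and the sum-of-squares variance is chosen for brevity rather than numerical robustness.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Summary:
    """Sufficient statistics for mean and variance of one partition."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other: "Summary") -> "Summary":
        # Merging is just addition, so summaries can be combined in any order.
        return Summary(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        return self.total_sq / self.n - self.mean ** 2


def summarize(partition: Iterable[float]) -> Summary:
    """Runs locally on the node that holds the partition."""
    s = Summary()
    for x in partition:
        s.n += 1
        s.total += x
        s.total_sq += x * x
    return s


if __name__ == "__main__":
    # Hypothetical partitions, one per node.
    partitions: List[List[float]] = [[1.0, 2.0], [3.0, 4.0, 5.0]]
    combined = Summary()
    for part in partitions:
        combined = combined.merge(summarize(part))  # only three numbers move
    print(combined.mean, combined.variance)
```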

Page 13: Advanced Analytics and Big Data (August 2014)


Open Source Projects

Page 14: Advanced Analytics and Big Data (August 2014)

Apache Mahout

• Apache incubator project (2007)

• Machine learning library

• Included in most distributions

• Thin acceptance, few contributors

• Diverse architecture

  • Single-node

  • MapReduce

  • New algos run on Spark

• Recently cleaned up

Page 15: Advanced Analytics and Big Data (August 2014)

Apache Giraph

• Apache top-level project

• Runs in MapReduce

• Dedicated graph engine

• Used by Facebook, few others

• Dead in the water

  • No presence in leading distros

  • No significant commercial support

  • No releases in 13 months

  • No recent code commits on Git

Page 16: Advanced Analytics and Big Data (August 2014)

GraphLab

• Carnegie Mellon project (2009)

• Distributed in-memory engine:

  • Primarily graph analysis

  • Selected machine learning algos

• Interface from Java, JavaScript, Python

• GraphLab Inc provides commercial support (2013, $6.75MM)

• Independent distribution, or through Pivotal

• Minimal development effort in the past six months

Page 17: Advanced Analytics and Big Data (August 2014)

0xdata H2O

• Vendor-driven open source project

• 0xdata sells support, customization

• Distributed in-memory prediction engine

• Multiple deployment options:

  • Standalone (with HDFS)

  • Over YARN

  • In MapReduce

• Claims 2,000+ users

• 4 public references

• Used by a leading P&C insurer

• Java, R, Python and Scala interfaces

Page 18: Advanced Analytics and Big Data (August 2014)

Apache Spark

• Top-level Apache project (2/14)

• Release 1.0.2 (8/14)

• Distributed in-memory analytics

  • Machine learning

  • Graph analytics

  • Streaming analytics

  • Fast SQL

• Compatible with Hadoop storage

• Integrated with YARN

• Scala, Python, Java interfaces (+SparkR)

• Growing ecosystem

• Supported in leading Hadoop distributions

Page 19: Advanced Analytics and Big Data (August 2014)

Analytic Features

Columns: 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0.2 | GraphLab 2.2

Prediction +++ + +++

Dimension Reduction + +++ + +

Clustering + +++ + +++

Collaborative Filtering +++ + +++

Text Analytics +++ +++

Matrix Operations + +++ +

Graph Analysis + + +++

Page 20: Advanced Analytics and Big Data (August 2014)

Summary: Open Source

• Giraph appears to be dead in the water

• Mahout may be recovering from roadkill status

• GraphLab outperforms Spark GraphX today in graph analytics

• 0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface

• Spark catching up fast

  • More resources and distribution

  • Integrated platform for ML and graph analysis

Page 21: Advanced Analytics and Big Data (August 2014)


Commercial Software

Page 22: Advanced Analytics and Big Data (August 2014)

Alpine

• Business user interface

• Collaboration environment

• Broad library of techniques

• Strong cloud offering

• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum

• Push-down MapReduce

• Certified on Spark

• Small but growing customer base

Page 23: Advanced Analytics and Big Data (August 2014)

IBM SPSS Analytics Server

• Introduced 2013

• Serves as “back end” for SPSS Modeler

• Uses push-down MapReduce

• Limited analytic feature set

• IBM supports on multiple Hadoop distros

• Customer acceptance unknown

Page 24: Advanced Analytics and Big Data (August 2014)

Revolution Analytics ScaleR

• ScaleR library of distributed statistics, machine learning functions

• Tools to distribute arbitrary R functions

• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC

• Hadoop edition uses MapReduce push-down

• Tools simplify installation in large clusters

• R interface

• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces

Page 25: Advanced Analytics and Big Data (August 2014)

Skytree Server

• Georgia Tech’s FastLab project, repurposed as commercial software

• Distributed machine learning platform

• Very opaque about technical details

• User interface is an API

• Co-located in Hadoop under YARN

• Just certified by Hortonworks

• Customer acceptance unknown

  • No new public references in a year

  • Used by leading credit card company

Page 26: Advanced Analytics and Big Data (August 2014)

SAS High-Performance Analytics

• Distributed in-memory analytics

• Designed to run in special-purpose appliances (2011)

• Repurposed to run in Hadoop (2013)

• Co-exists poorly: cannot run SAS and MapReduce at the same time

• Reads entire dataset into memory

• Uses MPI to communicate among nodes

• Requires upgrades from standard Hadoop infrastructure

• Customer acceptance unknown

  • No public references

  • Generic success stories missing from Strata presentations

Page 27: Advanced Analytics and Big Data (August 2014)

SAS LASR Server

• SAS’ “other” distributed in-memory platform

• Back end for several end-user products

  • SAS Visual Analytics (2012)

  • SAS Visual Statistics (New)

  • SAS In-Memory Statistics for Hadoop (New)

• Recently added statistics and machine learning

• Does not read raw HDFS; must be transformed to proprietary SASHDAT

• Like HPA, reads entire dataset into memory

  • 16-core, 256GB node can load 75GB table

• Runs DS2 programs, not legacy SAS programs

• Fast, but with limited feature set

• SAS claims 1,400 “sites” for Visual Analytics

  • Many of those are standalone boxes
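The sizing figure quoted on this slide (a 16-core, 256GB node loading a 75GB table) implies a simple back-of-the-envelope calculation, sketched below. The linear scaling across nodes is an assumption made for illustration only; real capacity depends on the workload and on details the slide does not give, and the function names are hypothetical.

```python
import math

# Figures taken from the slide: a 256 GB node can load a 75 GB table.
NODE_MEMORY_GB = 256
LOADABLE_GB_PER_NODE = 75


def loadable_fraction() -> float:
    """Share of node memory available for table data, per the slide's figures."""
    return LOADABLE_GB_PER_NODE / NODE_MEMORY_GB


def nodes_needed(table_gb: float) -> int:
    """Nodes required to hold a table entirely in memory.

    Assumption (not from the slide): capacity scales linearly with node count.
    """
    return math.ceil(table_gb / LOADABLE_GB_PER_NODE)


if __name__ == "__main__":
    print(f"loadable share of node memory: {loadable_fraction():.0%}")
    print(f"nodes for a 1,000 GB table: {nodes_needed(1000)}")
```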


Page 28: Advanced Analytics and Big Data (August 2014)

Summary: Commercial

• Alpine’s interface is compelling to business users

• IBM Analytics Server is a good first release

• Revolution R Enterprise (RRE) ScaleR appeals to R users, plays well in Hadoop sandbox

• Skytree Server: strong in prediction

• SAS: why two competing memory-centric architectures?

Page 29: Advanced Analytics and Big Data (August 2014)

Progress

• Spark: maturing blindingly fast

  • Rapidly expanding library of analytic features

  • Growing developer community, ecosystem

• Commercial: from zero to many

Page 30: Advanced Analytics and Big Data (August 2014)

Interesting Questions

• Will Mahout get a second wind?

• Will Spark MLlib displace 0xdata?

• Will Spark GraphX catch up to GraphLab?

• Can Spark Streaming compete with Storm and commercial entrants?

• How quickly will customers adopt memory-centric architecture for analytics?

• What will Alpine and MicroStrategy do with Spark?

• When will SAS announce a reference customer for HPA/LASR in Hadoop?

Page 31: Advanced Analytics and Big Data (August 2014)

Questions


Page 32: Advanced Analytics and Big Data (August 2014)

Thank You


Page 33: Advanced Analytics and Big Data (August 2014)


Advanced Analytics with Big Data

Thomas W. Dinsmore