advanced analytics and big data (august 2014)


Advanced Analytics with Big Data

Thomas W. Dinsmore


• What do we mean by “Big Data”?

• Do we need to use all of the data?

• What analytics can run inside Big Data platforms?


Big Data

• Data that cannot be efficiently handled in a relational database

• The three Vs:

• Volume

• Variety

• Velocity


Big Data Platforms

• Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc.

• Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc.

• NoSQL/NewSQL: Cassandra, Mongo, MemSQL

• Streaming engines: Infosphere Streams


Convergence: federated SQL engines (e.g., Pivotal HAWQ)


Analytics Platform

For aggregate models, you can simply sample the data and work offline.
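One way to draw such a sample from a data set too large to hold in memory is reservoir sampling. The sketch below is a minimal illustration under that assumption, not a method the slides prescribe; the function name, sample size, and seed are all illustrative.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Works in a single pass and never holds more than k records in memory,
    so the source can be a file or a stream of unknown length.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Illustrative use: estimate a mean offline from a 1% sample.
sample = reservoir_sample(range(100_000), 1_000, seed=7)
estimated_mean = sum(sample) / len(sample)
```

The estimate from the sample approximates the full-data mean, which is the sense in which aggregate models tolerate sampling.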


• Anomaly Detection

• Affinity Analysis

• Microsegmentation

• Social Network Analysis

• Collaborative Filtering

However, for some use cases you may need to use all of the data.
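A short calculation suggests why sampling can fail for a use case such as anomaly detection. The sketch assumes each record is kept independently with the same probability; the 1% rate and the count of 50 anomalies are made-up figures for illustration.

```python
def prob_sample_misses_all(rate, n_rare):
    """Chance that sampling each record with probability `rate`
    keeps none of the n_rare rare records."""
    return (1.0 - rate) ** n_rare


def expected_rare_in_sample(rate, n_rare):
    """Expected number of rare records that survive sampling."""
    return rate * n_rare


# Hypothetical case: 50 anomalous records, 1% sample.
p_miss = prob_sample_misses_all(0.01, 50)      # roughly 0.605
kept = expected_rare_in_sample(0.01, 50)       # 0.5 records on average
```

Under these assumptions the sample contains no anomalies at all about 60% of the time, so there is nothing to model.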


• Catastrophic Risk Modeling

• Modeling with Fine-grained Behavioral Data

For other use cases, using all of the data is worth extra time and effort.


[Diagram: six HDFS nodes; Data]

Most legacy analytic packages can read HDFS files.



[Diagram: MapReduce; Data]

Some tools also provide pass-through capabilities.



MapReduce

Advantages

• Co-exists w/ other applications

• Integrated workload management

• Simplified administration

Disadvantages

• MapReduce latency

Several tools translate user requests to MapReduce. This eliminates data movement and co-exists well with other applications.
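As a rough illustration of what such a translation produces, the sketch below expresses one user request (mean amount per segment) as a map step and a reduce step, then runs them with a tiny in-process driver. It is plain Python standing in for a Hadoop job; the record fields `segment` and `amount` and all function names are hypothetical, and no specific vendor's translator is being described.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, (partial_sum, partial_count)) for one input record."""
    yield record["segment"], (record["amount"], 1)


def reduce_phase(key, values):
    """Combine the partial sums and counts for one key into a mean."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def run_job(records, mapper, reducer):
    """Toy driver: map every record, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input rows; on a cluster these would live in HDFS blocks.
rows = [
    {"segment": "a", "amount": 10.0},
    {"segment": "a", "amount": 20.0},
    {"segment": "b", "amount": 5.0},
]
means = run_job(rows, map_phase, reduce_phase)
```

Because the map step runs where each block is stored, only the small (sum, count) pairs would need to travel, which is the sense in which data movement is avoided.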


[Diagram: YARN; six nodes, each with HDFS, Map, and Reduce]

Advantages

• Easy to adapt legacy apps

• Isolates analytic workload

Disadvantages

• Data moves within the cluster

• Requires YARN

YARN (*) makes it possible to bypass MapReduce and run analytics in memory on dedicated nodes.

(*) Yet Another Resource Negotiator


[Diagram: YARN; seven nodes, each with HDFS, Map, and Reduce]

Advantages

• Lowest latency

Disadvantages

• Upgrade every node

• Requires YARN

Distributing in-memory analytics across the Hadoop cluster minimizes internal data movement.
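One way to see why data movement stays small: each node can reduce its local rows to a handful of summary numbers, and only those summaries cross the network. The sketch below fits a simple least-squares line this way. It is a single-process illustration with hypothetical function names, where each inner list stands in for the rows held on one node; it does not describe any particular product's internals.

```python
from functools import reduce


def partial_stats(rows):
    """Summarize one node's (x, y) rows as (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Merge two nodes' summaries by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(stats):
    """Ordinary least squares for y = intercept + slope * x from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Two hypothetical nodes, each holding part of the data for y = 1 + 2x.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
merged = reduce(combine, (partial_stats(p) for p in partitions))
intercept, slope = fit_line(merged)
```

Each node ships five numbers regardless of how many rows it holds, so the raw data never leaves the node that stores it.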


Open Source Projects

Apache Mahout

• Apache incubator project (2007)

• Machine learning library

• Included in most distributions

• Thin acceptance, few contributors

• Diverse architecture

• Single-node

• MapReduce

• New algos run on Spark

• Recently cleaned up


Apache Giraph

• Apache top-level project

• Runs in MapReduce

• Dedicated graph engine

• Used by Facebook, few others

• Dead in the water

• No presence in leading distros

• No significant commercial support

• No releases in 13 months

• No recent code commits on Git


GraphLab

• Carnegie Mellon project (2009)

• Distributed in-memory engine:

• Primarily graph analysis

• Selected machine learning algos

• Interface from Java, JavaScript, Python

• GraphLab Inc provides commercial support (2013, $6.75MM)

• Independent distribution, or through Pivotal

• Minimal development effort over the past six months


0xdata H2O

• Vendor-driven open source project

• 0xdata sells support, customization

• Distributed in-memory prediction engine

• Multiple deployment options:

• Standalone (with HDFS)

• Over YARN

• In MapReduce

• Claims 2,000+ users

• 4 public references

• Used by a leading P&C insurer

• Java, R, Python and Scala interfaces


Apache Spark

• Top-level Apache project (2/14)

• Release 1.0.2 (8/14)

• Distributed in-memory analytics

• Machine learning

• Graph analytics

• Streaming analytics

• Fast SQL

• Compatible with Hadoop storage

• Integrated with YARN

• Scala, Python, Java interfaces (+SparkR)

• Growing ecosystem

• Supported in leading Hadoop distributions


Analytic Features


Products compared: 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0.2 | GraphLab 2.2

Prediction +++ + +++

Dimension Reduction + +++ + +

Clustering + +++ + +++

Collaborative Filtering +++ + +++

Text Analytics +++ +++

Matrix Operations + +++ +

Graph Analysis + + +++

Summary: Open Source

• Giraph appears to be dead in the water

• Mahout may be recovering from roadkill status

• GraphLab outperforms Spark GraphX today in graph analytics

• 0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface

• Spark catching up fast

• More resources and distribution

• Integrated platform for ML and graph analysis


Commercial Software

Alpine

• Business user interface

• Collaboration environment

• Broad library of techniques

• Strong cloud offering

• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum

• Push-down MapReduce

• Certified on Spark

• Small but growing customer base


IBM SPSS Analytics Server

• Introduced 2013

• Serves as “back end” for SPSS Modeler

• Uses push-down MapReduce

• Limited analytic feature set

• IBM supports on multiple Hadoop distros

• Customer acceptance unknown


Revolution Analytics ScaleR

• ScaleR library of distributed statistics, machine learning functions

• Tools to distribute arbitrary R functions

• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC

• Hadoop edition uses MapReduce push-down

• Tools simplify installation in large clusters

• R interface

• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces


Skytree Server

• Georgia Tech’s FastLab project, repurposed as commercial software

• Distributed machine learning platform

• Very opaque about technical details

• User interface is an API

• Co-located in Hadoop under YARN

• Just certified by Hortonworks


• No new public references in a year

• Used by leading credit card company


SAS High-Performance Analytics

• Distributed in-memory analytics

• Designed to run in special-purpose appliances (2011)

• Repurposed to run in Hadoop (2013)

• Co-exists poorly: cannot run SAS and MapReduce at the same time

• Reads entire dataset into memory

• Uses MPI to communicate among nodes

• Requires upgrades from standard Hadoop infrastructure


• No public references

• Generic success stories missing from Strata presos


SAS LASR Server

• SAS’ “other” distributed in-memory platform

• Back end for several end-user products

• SAS Visual Analytics (2012)

• SAS Visual Statistics (New)

• SAS In-Memory Statistics for Hadoop (New)

• Recently added statistics and machine learning

• Does not read raw HDFS; data must be transformed to proprietary SASHDAT

• Like HPA, reads entire dataset into memory

• 16-core, 256GB node can load a 75GB table

• Runs DS2 programs, not legacy SAS programs

• Fast, but with limited feature set

• SAS claims 1,400 “sites” for Visual Analytics

• Many of those are standalone boxes
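The sizing figure quoted above for LASR (a 16-core, 256GB node loading a 75GB table) can be turned into a back-of-envelope estimate. The sketch assumes, purely for illustration, that loadable capacity scales linearly with the number of identical nodes; the slides do not state this, and the 1TB table is a made-up example.

```python
import math

NODE_RAM_GB = 256        # from the slide
LOADABLE_TABLE_GB = 75   # from the slide


def usable_fraction(node_ram_gb=NODE_RAM_GB, loadable_gb=LOADABLE_TABLE_GB):
    """Share of a node's RAM that ends up holding table data."""
    return loadable_gb / node_ram_gb


def nodes_needed(table_gb, loadable_per_node_gb=LOADABLE_TABLE_GB):
    """Nodes required for a table, ASSUMING capacity scales linearly."""
    return math.ceil(table_gb / loadable_per_node_gb)


fraction = usable_fraction()   # a little under 30% of RAM
nodes = nodes_needed(1000)     # hypothetical 1TB table
```

On these assumptions a 1TB table would call for 14 such nodes, since less than a third of each node's memory holds data.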


Summary: Commercial

• Alpine’s interface is compelling to business users

• IBM Analytics Server is a good first release

• RRE ScaleR appeals to R users, plays well in Hadoop sandbox

• Skytree Server: strong in prediction

• SAS: why two competing memory-centric architectures?


Progress

• Spark: blindingly fast maturity

• Rapidly expanding library of analytic features

• Growing developer community, ecosystem

• Commercial: from zero to many


Interesting Questions

• Will Mahout get a second wind?

• Will Spark MLLib displace 0xdata?

• Will Spark GraphX catch up to GraphLab?

• Can Spark Streaming compete with Storm and commercial entrants?

• How quickly will customers adopt memory-centric architecture for analytics?

• What will Alpine and MicroStrategy do with Spark?

• When will SAS announce a reference customer for HPA/LASR in Hadoop?


Questions


Thank You
