Advanced Analytics and Big Data (August 2014)
Summary of currently available options for advanced analytics in Hadoop
1
Advanced Analytics with Big Data
Thomas W. Dinsmore
Advanced Analytics with Big Data
• What do we mean by “Big Data”?
• Do we need to use all of the data?
• What analytics can run inside Big Data platforms?
2
Big Data
• Data that cannot be efficiently handled in a relational database
• The three Vs:
• Volume
• Variety
• Velocity
3
Big Data Platforms
• Hadoop ecosystem: MapReduce, Hive, Impala, Spark, etc.
• Appliances: Teradata, IBM PureData, Pivotal, Oracle BDA, Vertica, ParAccel/Redshift, etc.
• NoSQL/NewSQL: Cassandra, MongoDB, MemSQL
• Streaming engines: IBM InfoSphere Streams
4
Convergence: federated SQL engines (e.g., Pivotal HAWQ)
6
Analytics Platform
For aggregate models, you can simply
sample the data and work offline.
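The sampling approach can be sketched roughly as follows. This is a minimal illustration, not a description of any particular product: the data is synthetic, and the function name `reservoir_sample` and all parameters are assumptions made for the example. It draws a uniform sample in a single pass (reservoir sampling) and compares an aggregate computed on the sample with the same aggregate on the full data.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Hypothetical stand-in for a large table: synthetic purchase amounts.
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

sample = reservoir_sample(population, k=10_000)
full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"full mean   = {full_mean:.2f}")
print(f"sample mean = {sample_mean:.2f}")
```

For an aggregate such as a mean, the 5% sample should land very close to the full-data value, which is why offline work on a sample is often good enough for this class of model.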
7
• Anomaly Detection
• Affinity Analysis
• Microsegmentation
• Social Network Analysis
• Collaborative Filtering
However, for some use cases you may
need to use all of the data.
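One way to see why sampling can fail for use cases like anomaly detection is to compute the chance that a random sample contains none of the rare records. The sketch below assumes simple random sampling without replacement; the function name `prob_sample_misses_all` and the record counts are hypothetical.

```python
def prob_sample_misses_all(total, rare, sample_size):
    """Probability that a simple random sample (without replacement)
    contains none of the `rare` records out of `total` records."""
    p = 1.0
    for i in range(rare):
        remaining = total - sample_size - i
        if remaining <= 0:
            # The sample is too large to avoid every rare record.
            return 0.0
        # Chance that the i-th rare record falls outside the sample,
        # given that the previous rare records did too.
        p *= remaining / (total - i)
    return p


# Hypothetical case: 20 anomalies in one million records, 1% sample.
p_miss = prob_sample_misses_all(total=1_000_000, rare=20, sample_size=10_000)
expected_in_sample = 20 * 10_000 / 1_000_000

print(f"P(sample contains no anomalies) = {p_miss:.3f}")
print(f"expected anomalies in sample    = {expected_in_sample:.1f}")
```

Under these assumed numbers the sample misses every anomaly most of the time, so a model built on it has almost nothing to learn from.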
8
• Catastrophic Risk Modeling
• Modeling with Fine-grained Behavioral Data
For other use cases, using all of the
data is worth extra time and effort.
9
[Diagram: six HDFS nodes; label: Data]
Most legacy analytic packages can read
HDFS files.
10
[Diagram: six HDFS nodes; labels: MapReduce, Data]
Some tools also provide
pass-through capabilities.
11
[Diagram: six HDFS nodes; label: MapReduce]
Advantages
• Co-exists with other applications
• Integrated workload management
• Simplified administration
Disadvantages
• MapReduce latency
Several tools translate user requests to MapReduce. This eliminates data
movement and co-exists well with other applications.
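As a rough illustration of what "translating a user request to MapReduce" means, the sketch below expresses a request for the mean amount per segment as a map phase, a shuffle, and a reduce phase. It runs in a single Python process and is not the code generated by any tool named in this deck; the record layout and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, (partial_sum, count)) for one input record."""
    segment, amount = record
    yield segment, (amount, 1)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the partial sums and counts for one key into a mean."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def mean_by_segment(records):
    """User-level request: mean amount per segment."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


# Hypothetical input records: (segment, amount).
records = [("a", 10.0), ("b", 4.0), ("a", 20.0), ("b", 6.0), ("c", 7.0)]
result = mean_by_segment(records)
print(result)
```

In a real cluster the map and reduce functions run where the data blocks live, which is why this approach avoids moving data out of Hadoop; the cost is the latency of scheduling each MapReduce job.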
12
[Diagram: YARN; six nodes, each with HDFS, Map and Reduce]
Advantages
• Easy to adapt legacy apps
• Isolates analytic workload
Disadvantages
• Data moves within the cluster
• Requires YARN
YARN (Yet Another Resource Negotiator) makes it possible to bypass MapReduce and run
analytics in memory on dedicated nodes.
13
[Diagram: YARN; seven nodes, each with HDFS, Map and Reduce]
Advantages
• Lowest latency
Disadvantages
• Upgrade every node
• Requires YARN
Distributing in-memory analytics across the Hadoop cluster
minimizes internal data movement.
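A minimal sketch of why distributing the computation keeps data movement small: each node summarizes its own partition, and only the small summaries travel across the network. The example fits a simple linear regression from per-partition sufficient statistics. The partitions, function names, and the choice of model are assumptions for illustration, not how any specific engine is implemented.

```python
from functools import reduce


def local_stats(partition):
    """Computed on each node: five numbers summarizing its (x, y) rows."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two summaries; only these tuples move between nodes."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Least-squares slope and intercept from the merged summary."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data spread over three nodes; every row lies on y = 2x + 1.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]

merged = reduce(merge, (local_stats(p) for p in partitions))
slope, intercept = fit_line(merged)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The raw rows never leave their partition; each node ships five numbers regardless of how many rows it holds.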
14
Open Source Projects
Apache Mahout
• Apache incubator project (2007)
• Machine learning library
• Included in most distributions
• Thin acceptance, few contributors
• Diverse architecture:
  • Single-node
  • MapReduce
  • New algorithms run on Spark
• Recently cleaned up
15
Apache Giraph
• Apache top-level project
• Runs in MapReduce
• Dedicated graph engine
• Used by Facebook, few others
• Dead in the water:
  • No presence in leading distros
  • No significant commercial support
  • No releases in 13 months
  • No recent code commits on Git
16
GraphLab
• Carnegie Mellon project (2009)
• Distributed in-memory engine:
  • Primarily graph analysis
  • Selected machine learning algorithms
• Interface from Java, JavaScript, Python
• GraphLab Inc. provides commercial support (2013, $6.75MM in funding)
• Independent distribution, or through Pivotal
• Minimal development effort over the past six months
17
0xdata H2O
• Vendor-driven open source project
• 0xdata sells support, customization
• Distributed in-memory prediction engine
• Multiple deployment options:
  • Standalone (with HDFS)
  • Over YARN
  • In MapReduce
• Claims 2,000+ users
• 4 public references
• Used by a leading P&C insurer
• Java, R, Python and Scala interfaces
18
Apache Spark
• Top-level Apache project (2/14)
• Release 1.0.2 (8/14)
• Distributed in-memory analytics:
  • Machine learning
  • Graph analytics
  • Streaming analytics
  • Fast SQL
• Compatible with Hadoop storage
• Integrated with YARN
• Scala, Python, Java interfaces (+SparkR)
• Growing ecosystem
• Supported in leading Hadoop distributions
19
Analytic Features
22
Products compared: 0xdata H2O 2.2; Apache Giraph 1.1; Apache Mahout 0.9; Apache Spark 1.0.2; GraphLab 2.2
Prediction +++ + +++
Dimension Reduction + +++ + +
Clustering + +++ + +++
Collaborative Filtering +++ + +++
Text Analytics +++ +++
Matrix Operations + +++ +
Graph Analysis + + +++
Summary: Open Source
• Giraph appears to be dead in the water
• Mahout may be recovering from roadkill status
• GraphLab outperforms Spark GraphX today in graph analytics
• 0xdata H2O currently has more machine learning features than Spark MLlib and a better R interface
• Spark catching up fast:
  • More resources and distribution
  • Integrated platform for ML and graph analysis
23
24
Commercial Software
Alpine
• Business user interface
• Collaboration environment
• Broad library of techniques
• Strong cloud offering
• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
• Push-down MapReduce
• Certified on Spark
• Small but growing customer base
25
IBM SPSS Analytic Server
• Introduced 2013
• Serves as “back end” for SPSS Modeler
• Uses push-down MapReduce
• Limited analytic feature set
• IBM supports on multiple Hadoop distros
• Customer acceptance unknown
26
Revolution Analytics ScaleR
• ScaleR library of distributed statistics, machine learning functions
• Tools to distribute arbitrary R functions
• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
• Hadoop edition uses MapReduce push-down
• Tools simplify installation in large clusters
• R interface
• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
27
Skytree Server
• Georgia Tech’s FastLab project, repurposed as commercial software
• Distributed machine learning platform
• Very opaque about technical details
• User interface is an API
• Co-located in Hadoop under YARN
• Just certified by Hortonworks
• Customer acceptance unknown:
  • No new public references in a year
  • Used by leading credit card company
28
SAS High-Performance Analytics
• Distributed in-memory analytics
• Designed to run in special-purpose appliances (2011)
• Repurposed to run in Hadoop (2013)
• Co-exists poorly: cannot run SAS and MapReduce at the same time
• Reads entire dataset into memory
• Uses MPI to communicate among nodes
• Requires upgrades from standard Hadoop infrastructure
• Customer acceptance unknown:
  • No public references
  • Generic success stories missing from Strata presentations
29
SAS LASR Server
• SAS’ “other” distributed in-memory platform
• Back end for several end-user products:
  • SAS Visual Analytics (2012)
  • SAS Visual Statistics (New)
  • SAS In-Memory Statistics for Hadoop (New)
• Recently added statistics and machine learning
• Does not read raw HDFS; data must be transformed to proprietary SASHDAT
• Like HPA, reads entire dataset into memory:
  • 16-core, 256GB node can load 75GB table
• Runs DS2 programs, not legacy SAS programs
• Fast, but with limited feature set
• SAS claims 1,400 “sites” for Visual Analytics:
  • Many of those are standalone boxes
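The memory figure on this slide (a 256GB node loading a 75GB table) lends itself to a back-of-the-envelope sizing estimate. The sketch below assumes, purely for illustration, that loadable capacity scales linearly with the number of identical nodes; the slide does not state this, and the function names are hypothetical.

```python
from math import ceil

# Figures taken from the slide.
NODE_RAM_GB = 256
LOADABLE_TABLE_GB = 75


def usable_fraction():
    """Share of a node's RAM available for table data, per the slide's figures."""
    return LOADABLE_TABLE_GB / NODE_RAM_GB


def nodes_needed(table_gb):
    """Rough node count for a table, ASSUMING capacity scales linearly
    with identical nodes (an assumption, not a vendor statement)."""
    return ceil(table_gb / LOADABLE_TABLE_GB)


print(f"usable fraction of RAM: {usable_fraction():.0%}")
print(f"nodes for a 1 TB table (1,000 GB): {nodes_needed(1000)}")
```

Because the entire dataset must fit in memory, hardware requirements grow directly with table size under this assumption.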
30
Summary: Commercial
• Alpine’s interface is compelling to business users
• IBM SPSS Analytic Server is a good first release
• Revolution R Enterprise (RRE) ScaleR appeals to R users, plays well in Hadoop sandbox
• Skytree Server: strong in prediction
• SAS: why two competing memory-centric architectures?
31
Progress
• Spark: blindingly fast maturity
• Rapidly expanding library of analytic features
• Growing developer community, ecosystem
• Commercial: from zero to many
32
Interesting Questions
• Will Mahout get a second wind?
• Will Spark MLlib displace 0xdata?
• Will Spark GraphX catch up to GraphLab?
• Can Spark Streaming compete with Storm and commercial entrants?
• How quickly will customers adopt memory-centric architecture for analytics?
• What will Alpine and MicroStrategy do with Spark?
• When will SAS announce a reference customer for HPA/LASR in Hadoop?
33
Questions
34
Thank You