TRANSCRIPT
ANALYZING ANALYTICS
Rajesh Bordawekar
1
IBM Thomas J. Watson Research Center
www.adms-conf.org/Analytics/tutorial.pdf
The process of using mathematical formulations for identifying, extracting, processing, and integrating information from raw data, and then applying it to solve a problem.
ANALYTICS: A DEFINITION
2
Understand and characterize execution flow of analytics workloads across multiple application domains:
Characterization targeted towards computer architects, compiler developers, system optimizers, and parallel programmers
Explore different layers from user to hardware to identify common design and computational patterns: Algorithms, Data Structures, Data Types, Operators
Define a set of Exemplars that capture key algorithmic, functional, and runtime characteristics
Investigate opportunities for parallelism and acceleration for key analytics exemplars via software, architectural, and systems solutions
GOALS
3
Analytics Workloads (30 mins)
Analytics Exemplars: 1 (60 mins)
Break
Analytics Exemplars: 2 (45 mins)
Accelerating Analytics (45 mins)
OUTLINE
4
“A Survey of Business Analytics Models…”, Bordawekar et al., IBM Research TR RC25186
“Analyzing Analytics”, Bordawekar et al., IBM Research TR RC25317; also in SIGMOD Record, Dec. 2013
Analytics Workloads
5
ANALYTICS AT YOUR SERVICE
6
Prescription
Prediction
Reporting
Simulation
Pattern Matching
Recommendation
Alerting
Quantitative Analysis
ANALYTICS FUNCTIONAL GOALS
7
Descriptive and Inferential Statistics
Learning: Supervised, Unsupervised, and Reinforcement
Structured and Unstructured Data Analysis
Modeling and Simulation
Optimization
ANALYTICS PROBLEM TYPES
8
CLASSIFICATION OF ANALYTICAL WORKLOADS
9
A problem type can be used for multiple functions.
An analytics workload consists of a pipeline with multiple stand-alone analytics components, each with a different focus
In Watson: Question Analysis, Query Decomposition, Hypothesis Generation and Scoring, Answer ranking, etc.
Each component has distinct functional and runtime goals, and different implementation stacks partitioned into solution, library and kernel components.
Solution: End-user focused and customized to satisfy user’s functional goals (e.g., Answer ranking in Watson)
Library: Designed to be portable and applicable across multiple solutions (e.g., DeepQA UIMA Infrastructure that powers Watson)
Kernel: Architectural and system specific implementations of library functions using specific data structures and kernels (e.g., sparse matrix vector multiplication)
ANALYTICS AT WORK
10
ANALYTICS WORKLOAD EXAMPLE
11
Watson (DeepQA) Question-Answer System
Figure: the DeepQA pipeline: Question → Question Analysis → Query Decomposition → Hypothesis Generation → Soft Filtering → Hypothesis and Evidence Scoring → Synthesis → Final Merging and Ranking → Answer and Confidence, drawing on Answer Sources, Evidence Sources, and Trained Models.
ANALYTICS FUNCTIONAL FLOW
Figure: functional flow from a system architecture through functional and runtime goals, analytical discipline, problem types, processes, models, algorithms, and kernels, mapped onto the Solution / Library / Kernel implementation stack.
12
An analytics solution uses an analytic type based on its target functional and runtime goals
An analytic type has multiple models; each, in turn, can use one or more algorithms
Based on the problem formulation and goals, the chosen algorithm can use specific data structures and kernels
Given the variety of algorithmic and system alternatives, it is difficult to make the right choices to address specific performance issues
From a systems perspective, it is important to identify common core computational and runtime patterns
WHY EXEMPLARS?
13
Regression Analysis
Clustering
Nearest Neighbor Search
Association Rule Mining/Recommender Systems
Neural Networks
Support Vector Machines
Decision Tree Learning
ANALYTICS EXEMPLARS
Time Series Processing
Text Analytics
Monte Carlo Methods
Mathematical Programming
Online Analytical Processing (OLAP)
Graph Analytics
14
A set of widely used analytical models that capture key computational and runtime data access patterns across different analytics workloads.
Statistics: Regression
Data Mining: Clustering, Nearest neighbor search, Association rule mining
Machine Learning: Neural networks, Decision tree learning, Support vector machines, Recommender Systems
Simulation: Monte Carlo algorithms
Optimization: Mathematical programming
Data Analysis: Time series processing, Text analytics, Online Analytical Processing, Graph analytics
ANALYTICS EXEMPLARS
15
Analytical Domains
1. Regression Analysis
2. Decision Trees
3. Cluster Analysis
4. Time Series Processing
5. Text Mining
6. Neural Networks
7. Association Rule Mining/Recommender Systems
8. Support Vector Machines
9. Social Network Analysis
2013 REXER ANALYTICS SURVEY
Top Analytics Algorithms in Practice
16
www.rexeranalytics.com
EXEMPLAR CHARACTERISTICS

ANALYTIC EXEMPLAR             | PROBLEM TYPE               | FUNCTIONAL GOALS
Regression Analysis           | Inferential Statistics     | Prediction
Clustering                    | Unsupervised Learning      | Reporting, Prediction
Nearest Neighbor Search       | Unsupervised Learning      | Prediction, Recommendation
Association Rule Mining       | Unsupervised Learning      | Recommendation
Neural Networks               | Supervised Learning        | Pattern matching, Prediction
Support Vector Machines       | Supervised Learning        | Pattern matching, Prediction
Decision Tree Learning        | Supervised Learning        | Recommendation, Prediction
Time Series Processing        | Unstructured Data Analysis | Pattern matching, Alerting
Text Analytics                | Unstructured Data Analysis | Pattern matching, Reporting
Monte Carlo Methods           | Modeling and Simulation    | Simulation
Mathematical Programming      | Optimization               | Prescription
On-line Analytical Processing | Structured Data Analysis   | Reporting, Prediction
Graph Analytics               | Unstructured Data Analysis | Pattern matching, Recommendation

An Exemplar can address one or more functional goals.
17
Analytics Exemplars
18
13 Exemplars
Focus on high-level computational and runtime characteristics, not mathematical formulations
For each exemplar:
General usage
Basic idea
Key algorithms
Important data structures, functions, and operations
ORGANIZATION
19
Classical statistical technique for modeling relationship between a dependent variable and one or more independent variables
Primary uses include prediction, forecasting, and discovering relationships between dependent and independent variables
Can be viewed as example of supervised learning which uses independent variables for training
Key application domains: Economics, Psychology, Social Sciences, Marketing, Health-care.. (Rated the most used analytics model by Rexer Analytics in 2013)
REGRESSION ANALYSIS
Introduction
20
Any regression model relates a dependent variable Y to a regression function f of independent variables (regressors) X and unknown regression parameters β:

    Y ≈ f(X, β)

Quality of prediction depends on the amount of information about X
For k unknown parameters β, regression analysis is possible only if N ≥ k, where N is the number of data points of the form (Y, X)
REGRESSION ANALYSIS
Basic Idea
21
ILLUSTRATION: LINEAR REGRESSION
22
Figure: linear regression of University GPA (dependent variable, y) on High School GPA (independent variable, x), fitted by the line y = a0 + a1*x with intercept a0.
The dependent variable can be approximated as a linear combination of the regressors and a disturbance term ε
The linear regression model can be formulated as:

    y = Xβ + ε

Any solution for the linear regression aims to infer the values of the regression parameters β
Ordinary Least Squares (OLS) estimation minimizes the sum of squared residuals
Generalized Least Squares (GLS) estimation minimizes the squared Mahalanobis length of the residual vector
REGRESSION ANALYSIS
Linear Regression
23
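A minimal OLS sketch with NumPy's least-squares solver; the GPA numbers are hypothetical, echoing the illustration above:

    import numpy as np

    x = np.array([2.8, 3.1, 3.4, 3.6, 3.9])   # high school GPA (regressor)
    y = np.array([2.5, 2.9, 3.2, 3.3, 3.8])   # university GPA (dependent variable)

    # Design matrix with an intercept column, modeling y = a0 + a1*x + error
    X = np.column_stack([np.ones_like(x), x])

    # OLS: minimize the sum of squared residuals ||y - X b||^2
    (a0, a1), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"y = {a0:.3f} + {a1:.3f} * x")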
The dependent variables are related to the regressor variables via a relationship that is non-linear in one or more unknown parameters β. The model has the following form:

    y = f(X, β) + ε

The most common nonlinear function is the exponential decay or growth model, e.g., y = a·e^(b·x)
Other non-linear functions include logarithmic, trigonometric, power, and Gaussian
Estimation of regression parameters is done by minimizing a suitable goodness-of-fit expression with respect to β
Minimize the sum of squared residuals using a non-linear least-squares method
Use linear regression to solve a linearized form of the non-linear function
REGRESSION ANALYSIS
Non-linear Regression
24
Logistic regression predicts the probability of occurrence of an event by fitting data to a logistic function: f(z) = 1 / (1 + e^(−z)), where z is a linear combination of the regressors
Regression parameters can then be estimated by the maximum likelihood method for a linear model of its inverse-logit function
Probit regression predicts the binary outcome of an event (Y is a binary variable). The Probit model can be specified as Pr(Y = 1 | X) = Φ(X·β), where Pr is the probability and Φ is the CDF of the standard normal distribution
Regression parameters can then be estimated using the maximum likelihood method.
REGRESSION ANALYSIS
Logistic and Probit Regression
25
Key Data Structures
Sparse Matrices and Vectors
Key Types
Double-precision Float and Complex
Key Operations
Matrix computations (Inversion, LU Decomposition, Transpose, and Factorization), Least-squares estimation, Logistic function
REGRESSION ANALYSIS
26
Key Data Structures, Types and Operations
A process of grouping together entities from an ensemble into classes of entities that are similar in some sense. Also referred to as data segmentation.
An example of unsupervised learning
Key application domains
Market segmentation analysis
Gene Sequence Analysis
Medical Imaging
Document clustering based on semantic information
CLUSTERING
Introduction
27
Any clustering algorithm identifies and exploits relevant similarities in the underlying potentially disparate data sources
Similarity can use either geometric distance-based metric or conceptual relationships in data
Input data: Potentially noisy, different types (e.g., binary, interval-based, categorical, ordinal, etc.), can have high dimensions, large data sizes
Two broad classes of algorithms: Parametric and Non-parametric
Parametric (model-based) solutions assume an underlying probability distribution and fit clusters to the data accordingly
Non-parametric solution clusters based on spatial properties (e.g., distance or density)
Algorithms also differ depending on the dimensionality of the input data sets
CLUSTERING
Basic Idea
28
CLUSTERING
K-Means Clustering
29
An iterative partitioning method that constructs k non-overlapping partitions from n objects, where each partition represents a cluster.
Cluster similarity is measured as the mean value of objects, which can be viewed as the cluster’s centroid
Iterative steps
Choose k objects and mark them as the cluster centroids
Assign each remaining object to one of the clusters based on the distance of the object from the centroid.
The new mean of each cluster is calculated, and objects are relocated to the cluster whose mean is nearest
The iterative relocation process continues until the convergence criterion is met (see the sketch below)
For categorical data (whose means cannot be calculated), modes are used as the similarity measure
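To make the relocation loop concrete, here is a minimal k-means sketch in NumPy (illustrative only: random initialization, and it assumes no cluster becomes empty):

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: choose k objects as the initial centroids
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            # Step 2: assign every object to the nearest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each cluster mean as the new centroid
            new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
            if np.allclose(new, centroids):   # convergence criterion met
                break
            centroids = new
        return centroids, labels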
ILLUSTRATION: K-MEANS CLUSTERING
30
Figure: k-means clustering of a point set with K=3.
CLUSTERING
Hierarchical Clustering
31
Hierarchical clustering groups data objects into a tree of clusters based on distance-based models, using two approaches:
Agglomerative: bottom-up approach that creates increasingly large clusters
Divisive: top-down approach that starts with a single cluster and subdivides it into smaller pieces until termination conditions are met
BIRCH: Two-stage agglomerative clustering algorithm designed to operate on very large datasets
The first phase builds in-memory clusters that summarize the input data using statistical clustering features
In the second phase, the summarized clusters are reclustered using an iterative partitioning method
CLUSTERING
EM Clustering
32
Model-based approach that extends the k-means partitioning algorithm: assumes that underlying data is a mixture of k probability distributions
Aims to estimate parameters of the underlying probability distributions
The most common form of EM learns a mixture of Gaussian distributions
Starts with an initial estimate of the parameters of the mixture model. Randomly selects k items as initial clusters
(Expectation) Assign each item to a cluster according to a weight that denotes its probability of membership.
(Maximization) The item weights are re-scored based on the new model distribution, and the parameters are then updated to maximize the likelihood of the distributions.
Key Data Structures
Trees, Graphs, Matrices, Queries
Key Types
Double-,Single-precision Floats
Key Operations
Matrix computations, Distance Functions
CLUSTERING
33
Key Data Structures, Types and Operations
An optimization problem for finding points from an ensemble that are closest in some defined proximity
The definition of proximity varies from domain to domain, and is usually formulated using a metric function (e.g., Euclidean distance for spatial proximity)
Applied to both clustering and classification scenarios
Well-known applications include online media providers (e.g., Netflix) suggesting movies or songs to match the taste of a particular user
Other applications include analyzing multimedia data for copyright violations, robotics, and drug discovery.
NEAREST NEIGHBOR SEARCH
Introduction
34
Given a set S of n points in some metric space (X, d), the problem is to preprocess S so that, given any query point p ∈ X, one can efficiently find a point q ∈ S that minimizes d(p, q)
Several variations of the problem exist, based on the input data dimensionality and size, the metric used for proximity calculations, and the result cardinality (e.g., top-k or all-pairs).
Commonly used metrics for proximity calculations: Euclidean distance for low-dimensional data and Hamming distance for high-dimensional data
For very high-dimensional data, an approximate version of the search algorithm is preferred for efficiency
NEAREST NEIGHBOR SEARCH
Basic Idea
35
NEAREST NEIGHBOR SEARCH
K-d Trees
36
Addresses the precise nearest-neighbor problem for low-dimensional datasets
A K-d tree is a binary tree that organizes k-dimensional data using recursive hyperplane decomposition
Each record is represented by a k-element vector of real values. The tree is constructed by recursively selecting one of the k dimensions as the discriminatory dimension and partitioning the dataset according to a certain partition value. Each leaf points to one or more records at that location
At query time, the tree is traversed recursively. At every level, value of the discriminatory dimension is used to choose the path.
When the traversal reaches a leaf, a list of records is returned as the nearest neighbor candidates
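As a usage sketch, SciPy ships a K-d tree implementation; the random 3-dimensional points here are purely illustrative:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(1000, 3)              # 1000 points in 3 dimensions
    tree = cKDTree(points)                        # built via recursive hyperplane decomposition
    dist, idx = tree.query([0.5, 0.5, 0.5], k=1)  # nearest neighbor of the query point
    print(idx, dist)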
ILLUSTRATION: K-D TREE
37
Reference: The Stony Brook Algorithm Repository www.cs.sunysb.edu/~algorith/files/kd-trees.shtml
Figure: an input point set and the corresponding K-d tree.
NEAREST NEIGHBOR SEARCH
Approx. Nearest Neighbor
38
Uses hierarchical space decomposition for solving the approximate nearest neighbor problem for low-dimensional data
Represents points in d-dimensional space using a balanced box-decomposition (BBD) tree.
Each node is associated with a cell, which represents points in either a d-dimensional rectangle or a region corresponding to the set-theoretic difference of two nested rectangles.
Each leaf cell is associated with a single point; the leaves of the tree span the entire space.
During the query process, for a given query point q, the node with the minimum distance to the query point is selected for recursive traversal towards the leaves. The search terminates when the distance to the current leaf falls below a bound.
NEAREST NEIGHBOR SEARCH
Locality-sensitive Hashing
39
Locality-sensitive hashing (LSH) algorithms are designed for solving the approximate nearest neighbor problem for very high-dimensional datasets
Hash points using several hash functions to ensure that for each function, the probability of collision is much higher for points that are close to each other than for those that are far apart.
For a query point, one can determine its nearest neighbors by hashing the query point, and retrieving the points in the bucket containing that point.
LSH relies on a family of hash functions with the property that if two points are close, they hash to the same bucket with high probability; if they are far apart, they hash to the same bucket with low probability.
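A sketch of one such family, random-hyperplane hashing for cosine similarity (the slides do not name a specific family, so this choice is an assumption). Points that fall on the same side of most hyperplanes receive the same bucket key:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, nbits = 128, 16
    planes = rng.normal(size=(nbits, dim))   # one random hyperplane per hash bit

    def lsh_bucket(x):
        # Each bit records which side of a hyperplane the point falls on
        bits = (planes @ x) > 0
        return bits.astype(int).tobytes()     # bucket key

    # Points hashed to the same bucket become candidate near neighbors
    data = rng.normal(size=(1000, dim))
    buckets = {}
    for i, x in enumerate(data):
        buckets.setdefault(lsh_bucket(x), []).append(i)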
Key Data Structures
Higher-dimensional data structures (e.g., K-D, BBD trees), Hash tables, Matrices
Key Types
Double- and Single-precision Floats
Key Operations
Matrix computations (e.g., Singular Value Decomposition), Hashing, Distance function calculations
NEAREST NEIGHBOR SEARCH
40
Key Data Structures, Types and Operations
A key data mining method used for discovering co-occurrence relations between variables
First proposed for identifying relationships between items purchased in retail stores: process known as market-basket analysis
Amazon’s “people who bought item x also bought items y and z”
Association rule mining has been applied to more complex data patterns such as sequences, trees, and graphs, with applications to bio-informatics, intrusion detection, and web-usage analysis
ASSOCIATION RULE MINING
41
Introduction
Formally, association rule mining processes a set D of transactions, where each transaction is a set of items.
Association rule: An implication of the form X ⇒ Y, where X is a set of some items and Y is an item not present in X.
An association rule X ⇒ Y in the transaction set D has support s if s% of transactions in D contain X ∪ {Y}
An association rule X ⇒ Y in the transaction set D has confidence c if c% of the transactions in D that contain X also contain Y.
The problem of association rule mining is to generate all association rules that have support and confidence greater than the user-specified support and confidence levels, minsup and minconf.
ASSOCIATION RULE MINING
42
Basic Idea
The association rule problem consists of two subproblems
Find all sets of items (itemsets) that have transaction support above minsup. These itemsets are called frequent itemsets.
Use the frequent itemsets to discover the desired rules: for every frequent itemset l, find all non-empty subsets a of l, and emit the rule a ⇒ (l − a) when it meets the minconf and minsup thresholds.
Use downward closure of itemsets: all subsets of frequent sets are frequent.
Implementations differ according to:
traversal strategies to identify candidate itemsets: depth-first or breadth-first search
computing the support values: direct counting vs. intersection
ASSOCIATION RULE MINING
43
Basic Algorithm
Both approaches use breadth-first traversal for itemset identification
Apriori exploits downward closure property to iteratively compute large candidate itemsets.
Makes multiple passes over raw data and uses a priori knowledge of infrequent item sets to reduce creation of unnecessary candidate itemsets.
The Partition algorithm is similar to Apriori, but requires only two passes over the underlying dataset.
Partitions the intermediate datasets into non-overlapping partitions.
ASSOCIATION RULE MINING
44
Apriori and Partition
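A toy Apriori sketch over a hypothetical transaction set; candidate generation relies on downward closure, and support counting is done naively for brevity:

    transactions = [{"bread", "milk"}, {"bread", "beer"},
                    {"bread", "milk", "beer"}, {"milk"}]
    minsup = 2   # absolute support threshold

    def apriori(transactions, minsup):
        # Level 1: frequent single items
        items = {i for t in transactions for i in t}
        freq = [{frozenset([i]) for i in items
                 if sum(i in t for t in transactions) >= minsup}]
        k = 2
        while freq[-1]:
            # Join frequent (k-1)-itemsets; downward closure guarantees every
            # frequent k-itemset is a union of frequent (k-1)-itemsets
            cands = {a | b for a in freq[-1] for b in freq[-1] if len(a | b) == k}
            freq.append({c for c in cands
                         if sum(c <= t for t in transactions) >= minsup})
            k += 1
        return [s for level in freq for s in level]

    print(apriori(transactions, minsup))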
These approaches use depth-first traversal for itemset identification and all of them use algorithmic optimizations for improving performance
FP-Growth uses a novel prefix-tree-like data structure (FP-tree) to compactly store candidate itemsets. This tree is traversed to compute larger itemsets.
Both Eclat and MaxClique view the itemset space under the downward closure property as a lattice and use lattice properties to prune candidate itemsets
Eclat uses equivalence classes and MaxClique uses maximal cliques in a hypergraph to identify potential maximal itemsets.
ASSOCIATION RULE MINING
45
FP-Growth, Eclat and MaxClique
Key Data Structures
Trees, Linked Lists, Arrays, Sets, Hash tables
Key Types
Integers, Single-precision Floats
Key Operations
Set union, intersection, Tree traversals, Hashing
ASSOCIATION RULE MINING
46
Key Data Structures, Types and Operations
Generate meaningful recommendations to a collection of users for items that might interest them, e.g., suggestions for books, movies, songs..
Uses a rating matrix whose cell r(x, y) represents the rating of user “x” for item “y”. The goal is to predict r(a, i) for an active user “a” and item “i”
Based on collective information on user ratings or content/profile information. Does not use association rules.
Three common approaches
Collaborative filtering uses past ratings of all users collectively
Content-based recommendation recommends items similar in content to items user has liked in past
Hybrid approaches that combine collaborative- and content-based approaches
RECOMMENDER SYSTEMS
47
Basic Idea
Neighborhood-based (memory-based) approaches choose a subset of users based on similarity to the active user and predict ratings as a weighted combination of their ratings (can be extended to match items to a user’s rated items)
Most common similarity measure is the Pearson correlation coefficient.
Similarity measures can be refined to reduce the impact of widely loved/disliked items using inverse user frequency
Model-based approaches provide recommendations by estimating parameters of statistical models for user ratings.
Latent factor/matrix factorization approaches aim to detect an underlying hidden lower-dimensional structure
Weighted non-negative matrix factorization aims to detect additive components of a user’s ratings
RECOMMENDER SYSTEMS
48
Collaborative Filtering
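A sketch of neighborhood-based prediction with Pearson similarity; the rating matrix is made up, and 0 marks an unrated item:

    import numpy as np

    R = np.array([[5., 3., 0., 1.],   # rows: users, columns: items
                  [4., 0., 0., 1.],
                  [1., 1., 5., 4.],
                  [0., 1., 5., 4.]])

    def pearson(u, v):
        mask = (u > 0) & (v > 0)      # co-rated items only
        if mask.sum() < 2:
            return 0.0
        return np.corrcoef(u[mask], v[mask])[0, 1]

    def predict(R, a, i):
        # Weighted combination of other users' mean-centered ratings of item i
        mean_a = R[a][R[a] > 0].mean()
        num = den = 0.0
        for u in range(len(R)):
            if u == a or R[u, i] == 0:
                continue
            w = pearson(R[a], R[u])
            num += w * (R[u, i] - R[u][R[u] > 0].mean())
            den += abs(w)
        return mean_a if den == 0 else mean_a + num / den

    print(predict(R, a=0, i=2))       # predicted rating of user 0 for item 2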
Exploits personal profile information of the user or information about the items: demographic or item genre information
Information Retrieval approaches analyze textual content
The user’s preferences are viewed as a query, and unrated documents are scored by their relevance/similarity to this query
Recommendation as a classification problem with item attributes as features and ratings as classes
Use Naive Bayes Classifier, Nearest-neighbor, Decision Trees or Neural Networks
RECOMMENDER SYSTEMS
49
Content-based Filtering
Merge ranked lists of recommendations produced by content-based and collaborative filtering methods
Content-boosted collaborative filtering uses content information to generate a full pseudo-rating matrix, with predictions calculated using a Naive Bayes classifier
Use content-based profiles, rather than co-rated items, to find similar users for the neighborhood-based collaborative filtering
Classification-based hybrid approaches
RECOMMENDER SYSTEMS
50
Hybrid Approaches
Key Data Structures
Sparse, dense matrices, Vectors
Key Types
Integers, Single-precision Floats
Key Operations
Matrix factorization, Matrix-Vector multiplications
RECOMMENDER SYSTEMS
51
Key Data Structures, Types and Operations
A system inspired by the biological network of neurons in the brain, which uses a connectionist information-processing model
Implemented as a system of interconnected simple computing nodes operating asynchronously in parallel whose function is determined by network structure, connection strengths, and processing executed on the computing nodes
A massively distributed system that acquires knowledge through a learning process and stores it using inter-neuron connection strengths, represented using synaptic weights
Key goals: Model complex relationships between input and output to infer results for novel inputs or find patterns in data. A neural network is usually designed for one of the following tasks:
Function approximation: regression analysis, prediction, time-series processing
Data processing and mining: filtering, clustering, classification of patterns and sequences
Decision making/Inferencing: systems control, robotics
Cognitive modeling: simulating and understanding neural activities
NEURAL NETWORKS
52
Introduction
A neuron receives a vector input x⃗, which is weighted and accumulated into a scalar value, the weighted sum Σ_i w_i·x_i, before transmitting to the receiver neuron.
The weighted sum is called the propagation function, and the set of weights represents the information storage of a neural network.
The neuron’s output is a non-linear activation function of the weighted sum: y = f(Σ_i w_i·x_i)
Multiple scalar outputs from different neurons in turn form a vector input for a neuron
NEURAL NETWORKS
53
Basic Idea
Neural networks are classified by
Underlying network topology
Feed-forward networks: Layers of neurons with connections to any of the next layers
Feedback networks: Can have cyclic connections
Completely linked networks: Symmetric connections between all neurons
Type of learning algorithm used
Unsupervised learning: no separate learning phase
Reinforcement learning: feedback on the response to the training data
Supervised Learning: training set includes input patterns and correct results
Type of input data: Categorical or Quantitative (numerical measurements)
NEURAL NETWORKS
54
Types of Neural Networks
A perceptron is the simplest feed-forward network with one input neuron layer connected to one or more trainable weight layers
Single-level perceptron (SLP) with an input neuron layer and only one trainable weight layer
Multi-level perceptron (MLP) has n variable weight layers and n+1 neuron layers, the first layer being the input layer. MLPs are usually trained using a supervised learning algorithm called back-propagation.
Forward propagation of the input values and backward propagation of the resulting errors (deltas)
The delta errors are used to update the weights
Radial Basis Function (RBF) Network: a three-layer network that uses a radial basis function: a real-valued function whose value depends only on the norm, usually the Euclidean distance
NEURAL NETWORKS
55
Perceptron Networks
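A minimal SLP sketch trained with the classic perceptron learning rule on the linearly separable AND function (the data and learning rate are illustrative):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])              # logical AND

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(20):                     # a few epochs suffice here
        for xi, ti in zip(X, y):
            out = 1 if w @ xi + b > 0 else 0   # step activation on the weighted sum
            w += lr * (ti - out) * xi          # adjust weights by the error
            b += lr * (ti - out)

    print([1 if w @ xi + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]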
Recurrent neural network is a class of networks where connections between neurons form a directed graph
Such networks can influence themselves by means of recurrent connections, e.g., by feeding outputs back into subsequent computational steps. They are dynamic systems with varying temporal behavior
Most common recurrent networks
Fully-connected network: Each neuron is connected to every other, with time-varying real-valued activations, and each connection has a modifiable real-valued weight
Hopfield network: Connections are symmetric
Elman and Jordan networks: Both are three-layer networks with an additional set of context units in the input layer. In Elman networks, the hidden layer is connected to the context layer with unit-weight edges; in Jordan networks, the output layer is connected to the context layer with unit-weight edges
Can be trained with either supervised or reinforcement learning
NEURAL NETWORKS
56
Recurrent Networks
ILLUSTRATION: NEURAL NETWORKS
57
Single-level Perceptron
Multi-level Perceptron
Recurrent Networks
“Networks of Artificial Neurons, Single-level Perceptrons”, J. Bullinaria
Class of neural networks focused on clustering or classification of the input datasets
Vector Quantization (VQ)
Unsupervised density estimators. Each neuron acts as a cluster whose center is defined by its codebook vector.
Uses a learning algorithm that finds the codebook vector closest to a training sample and updates it based on the learning rate. The VQ learning rule is an approximation of the K-Means algorithm
Self-organized Maps
Define a mapping from a set of potentially high-dimensional data onto a regular two-dimensional grid.
Each node is associated with a model (or a codebook), and a data item is mapped to the node most similar to it under some metric. The model is then updated using a learning algorithm that employs a neighborhood smoothing function around the data item. Upon convergence, the grid matches the similarity graph of the data items.
Learning Vector Quantization (LVQ)
Supervised version of Vector Quantization. The codebook vector of a neuron is assigned to one of the target classes.
Each element of the training sample is classified by finding the nearest codebook vector and is assigned to its class. On actual datasets, an LVQ network behaves like the nearest-neighbor search algorithm
NEURAL NETWORKS
58
Kohonen Neural Networks
Key Data Structures
Dense Matrices, Vectors
Key Data Types
Double-precision and Complex data
Key Operations
Matrix computations (e.g., Matrix multiplication, Inversion, Factorization), Activation functions (e.g., logistic, gaussian, RBF)
NEURAL NETWORKS
59
Key Data Structures, Types and Operations
A family of supervised learning methods, primarily used for classification and regression analysis
Originally designed for pattern recognition applications
Applications in a wide spectrum of domains:
bio-informatics (gene classification)
Medical Imaging (brain fMRI processing)
Text analytics (Text classification)
Time-series prediction (traffic modeling)
Financial modeling (stock market prediction)
SUPPORT VECTOR MACHINES
60
Introduction
Support Vector Machines (SVMs) aim to produce a model, based on the training data, that predicts the target values of test data given only the test data attributes.
Each training set item contains one target value and several attributes (features)
Each data point is mapped to a high-dimensional space. SVM constructs a set of hyperplanes to partition the space. A hyperplane is a set of points whose dot-product with a vector is constant.
The optimal hyperplane is the one that provides maximum margin separation between the data points using a subset of training data called support vectors.
Uses a combination of statistical learning along with optimization techniques.
SUPPORT VECTOR MACHINES
61
Basic Idea
Two broad classes based on whether data is linearly separable or not
If the data is linearly separable, one can partition data into 2 classes using a hyperplane
The SVM aims to find a separating hyperplane with the maximum distance from the training sample
This problem is formulated as an optimization problem, and can be solved using either quadratic or linear programming approaches
If the data is not linearly separable, kernel functions are used to map data to a higher dimensional space.
Examples of kernel functions include linear, polynomial, Radial Basis Function (RBF), and Sigmoid
Kernel functions compute inner-product in a high-dimensional feature space to determine the separating hyperplane.
SUPPORT VECTOR MACHINES
62
Core Algorithms
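As a usage sketch, scikit-learn's SVC exposes these kernels directly; the toy points below are made up:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
                  [2, 2], [2, 3], [3, 2], [3, 3]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = SVC(kernel="rbf", C=1.0)      # the kernel maps data to a higher-dimensional space
    clf.fit(X, y)
    print(clf.support_vectors_)          # the support vectors found
    print(clf.predict([[2.5, 2.5]]))     # -> [1]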
ILLUSTRATION: SUPPORT VECTOR MACHINES
63
Figure: candidate partitioning hyperplanes, the maximum-margin hyperplane, and the support vectors that define it.
Used for text classification in a very high-dimensional space, using distinct words as features and word counts as feature values
SVMs can analyze text documents natively using their constituent strings: the more substrings two documents have in common, the more similar they are.
Substrings need not be contiguous; the degree of contiguity is reflected in the weight
The string kernel maps strings to a feature vector indexed by all possible tuples of characters
An entry in the feature vector is non-zero if the corresponding tuple occurs as a substring; the weight reflects the frequency, contiguity, and length of the substring
The inner product of the feature vectors for two strings gives the sum over all common sub-sequences, weighted based on their frequency and length.
SUPPORT VECTOR MACHINES
64
String Kernel Methods
Key Data Structures
Sparse Matrices, Vectors
Key Data Types
Double-precision Floats
Key Operations
Matrix computations (e.g., Factorization, Matrix-Vector, Matrix-Matrix Multiplication), Kernel Functions (e.g., Linear, Sigmoid)
SUPPORT VECTOR MACHINES
65
Key Data Structures, Types and Operations
Decision tree learning is a class of supervised learning algorithms that use a tree-based model to represent decisions and their possible consequences
Encoding of all possible outcomes for a given problem scenario annotated with their conditional probabilities
Can be used for classification to predict values of categorical variables (e.g., yes or no) or continuous variables (e.g., amount of money a customer is willing to spend)
Important domains include marketing, fraud detection, medical diagnostics, manufacturing/production
DECISION TREE LEARNING
66
Introduction
Supervised learning uses a set of training class-labeled samples, where each sample is an n-dimensional feature vector associated with a class-label attribute (either categorical or continuous).
The learning phase generates a decision tree whose internal nodes represent conjunctions of features and whose leaves represent classifications.
Decision trees use an iterative top-down algorithm to split the data set, using a feature attribute at every step as the splitting parameter
Algorithms use heuristics called attribute selection measures or splitting rules to select the feature predicates. The most popular are:
Information Gain: Choose the attribute that yields the largest reduction in entropy (tends to favor attributes with many distinct values)
Information Gain Ratio: Choose the attribute that produces a good classification while normalizing away the bias toward many-valued attributes
Gini Index: Choose the attribute that most reduces the inequality (impurity) of the resulting distribution
DECISION TREE LEARNING
67
Basic Idea
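A sketch of the information-gain measure for a single categorical attribute; the weather-style feature and labels are hypothetical:

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(feature, labels):
        # Entropy before the split minus the weighted entropy after splitting
        # on each distinct value of the feature
        after = sum((feature == v).mean() * entropy(labels[feature == v])
                    for v in np.unique(feature))
        return entropy(labels) - after

    feature = np.array(["sunny", "sunny", "rain", "rain", "rain"])
    labels  = np.array(["no", "no", "yes", "yes", "no"])
    print(information_gain(feature, labels))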
Iterative Dichotomizer (ID3) and C4.5 are two decision tree algorithms that use entropy-based attribute selection measures.
Both employ a greedy approach to build a decision tree in a top-down recursive divide-and-conquer manner
Both algorithms require 3 parameters for building the tree: training set, set of attributes of the training vectors and the attribute selection heuristic
ID3 uses information gain measure for attribute selection, while the C4.5 uses the information gain ratio
Both suffer from the problem of overfitting noisy data. To address it, the tree is pruned to remove the least reliable branches
DECISION TREE LEARNING
68
ID3/C4.5
Classification and Regression Trees (CART) is a family of non-parametric recursive tree-building algorithms for predicting continuous dependent variables (regression) or categorical dependent variables (classification).
CART builds binary trees via top-down recursive partitioning of the dataset using different splitting criteria
Uses the least-squares deviation criterion for continuous variables and the Gini index for categorical variables
Uses a post-pruning approach to address over-fitting, based on a measure called cost-complexity: a function of the number of leaves and the error rate (the percentage of misclassified vectors)
DECISION TREE LEARNING
69
CART
The Chi-square Automatic Interaction Detector (CHAID) also uses a recursive tree-building process, creating a wide tree with multiple branches (potentially different at different levels). It can represent multiple categories well.
Internally, CHAID uses categorical variables. It first prepares the data by binning continuous values into categories.
The algorithm then uses the Pearson chi-square test (for categorical variables) and the F test (for continuous variables) to determine the statistical independence/significance of the data.
This information is used to determine the branching factor at internal tree nodes
If the significance level is above a certain threshold, new branches are created; otherwise, branches are merged
DECISION TREE LEARNING
70
CHAID
Key Data Structures
Binary and multi-way trees, vectors
Key Types
Double-precision floats, Integers
Key Operations
Tree traversals via dynamic programming, recursion
DECISION TREE LEARNING
71
Key Data Structures, Types and Operations
A time series is a sequence of observations reported according to the time of their outcome
Time-series analysis achieves one of two goals, both of which require the underlying pattern to be identified and modeled:
Analysis: Understand basic characteristics of the observed data
Forecasting: Model the observed data set and apply it for predicting future values based on known past values
Examples of time series data are prevalent in everyday life:
stock prices, weather reports, biometric data, utility consumption data,..
other domains include geology, social sciences, control systems, economics..
TIME SERIES PROCESSING
Introduction
72
Time-series data is not independent and identically distributed: its dispersion can vary in time, it can have cyclic components, and it is often governed by a trend.
Time-series data exhibits temporal ordering: observations closer in time are more related than observations taken further apart. An observation at a given time can be derived from past observations.
A time series can be viewed as a sequence of random variables y_t that can be individually decomposed into four components:

    y_t = T_t + Z_t + S_t + R_t

The trend T_t is a monotone function of time t
Z_t and S_t reflect long- and short-term non-random cyclic influences (Seasonality)
R_t represents random noise
Time series analysis involves identifying trends and seasonalities, and can be carried out in either the time or frequency domain.
TIME SERIES PROCESSING
Basic Idea
73
TIME SERIES PROCESSING
Trend Analysis
74
Trend Analysis captures the trend component of the time series
If the data is noisy, it needs to be smoothed before analyzing any trends
Smoothing involves some form of local averaging such that the irregular components of the individual observations cancel each other.
Simple or weighted average of n surrounding elements
For random errors, least-squares or exponential smoothing is applied
After smoothing, the monotonic (increasing or decreasing) component of the time series can be represented using linear or non-linear functions
TIME SERIES PROCESSING
Seasonality Analysis
75
The seasonality component captures the cyclic fluctuations in the data.
Seasonality can be measured by evaluating dependences between elements of a time series separated by a distance, or lag, k.
In time-domain analysis, auto-correlations and auto-covariances are the most commonly used measures of inter-dependence between time-series elements
High auto-correlation values at lag positions that are multiples of k expose a repeated pattern
Auto-correlation values for consecutive lags are inter-dependent: they suffer from serial dependencies
Partial auto-correlation calculations for a lag of 1 can be used to remove serial dependencies
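A sketch of the sample auto-correlation at lag k; the synthetic series below has period 12, so the value at lag 12 should be high and at lag 5 low:

    import numpy as np

    def autocorr(y, k):
        y = y - y.mean()
        return (y[:-k] * y[k:]).sum() / (y * y).sum()

    t = np.arange(200)
    y = np.sin(2 * np.pi * t / 12) + 0.1 * np.random.randn(200)   # period 12 plus noise
    print(autocorr(y, 12), autocorr(y, 5))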
TIME SERIES PROCESSING
Spectral Analysis
76
The frequency-domain (spectral) analysis of a time series aims to decompose the original time series into its cyclic components and compute their frequencies
A time series is represented using its harmonic components via two periodic sinusoidal functions
The problem can be viewed as linear multiple regression whose aim is to identify parameters (regression coefficients) that express the impact of different components
Computationally, spectral decomposition can be done using Fourier Transformations, usually implemented via Fast Fourier Transform (FFT).
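A minimal FFT sketch that recovers the dominant period of a synthetic series:

    import numpy as np

    n = 240
    t = np.arange(n)
    y = 3 * np.sin(2 * np.pi * t / 12)          # one cycle every 12 samples

    spectrum = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(n, d=1.0)            # frequencies in cycles per sample
    peak = freqs[spectrum.argmax()]
    print(1 / peak)                               # the recovered period (12)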
TIME SERIES PROCESSING
ARIMA
77
The auto-regressive integrated moving average (ARIMA) is widely used for understanding and forecasting stochastic processes represented in a time series data. This model is a combination of three different time-domain models.
Auto-regressive Model: A time series is viewed as a composition of random error and a linear combination of prior observations.
Stationary Model: The stationary model assumes that the time series has constant mean, variance, and auto-correlation over time.
Moving Average Model: The moving average model captures the white-noise error
Once the parameters are estimated, they can be used for predicting future values of the time series
Key Data Structures
Vectors, Sparse Matrices
Key Data Types
Integers, Double- and Single-precision Floats
Key Operations
Smoothing functions (e.g., moving average, least-squares, exponential), covariance calculations, FFT
TIME SERIES PROCESSING
78
Key Data Structures, Types and Operations
Text analytics covers computational approaches that process structured and unstructured text data to extract and present innate information.
Operates on a corpus of text documents, potentially in multiple languages and with noise
Goals of text analytics: derive new information from data, find patterns across datasets, and separate relevant contextual information from noise.
Text analytics is different from information retrieval, which aims to extract already known information from data
Multi-disciplinary field that uses techniques from statistics, natural language processing, linguistics, artificial intelligence, and data mining
Extensive use in daily life: web searches, email filtering, help-desk communications, online advertisements, etc.
TEXT ANALYTICS
79
Introduction
Main goals of text analytics: pre-process, categorize, classify, and summarize the input text corpus.
Two key phases:
Pre-processing: clean the raw text data and prepare for further analysis
Parsing, Stemming, Whitespace elimination, Stopword removal, Synonym identification..
Data structure initialization: Most common data structure is the term-document matrix.
Text is processed as a bag of words (token order irrelevant)
Element (i, j) represents a weighting of term j for document i. Common weightings include binary weights to denote inclusion/exclusion, and inverse frequency weighting, which gives more weight to less frequent terms
The term-document matrix is sparse, and is processed in compressed format.
TEXT ANALYTICS
80
Basic Idea
Count-based analysis: Views word frequencies as a measure of importance. These can then be used for computing associations between different terms via the frequencies of their co-occurrences.
Text clustering: Enables (semi-)automatic categorization of text documents based on a certain similarity measure. The term-document matrix can be viewed as a high-dimensional representation of the text, and clustered using measures such as metric or cosine distances.
Text classification: Organizes text documents into pre-defined groups (unlike clustering). Uses metrics similar to those in the clustering approach.
Semantic analysis: Extracts and represents knowledge from a text corpus, without any prior information.
Sentiment and Topic analysis: Sentiment analysis aims to detect the tone of a document by applying natural language processing techniques to the text data. Topic analysis uses various transformations of the term-document matrix to identify hot topics in the input text corpus.
TEXT ANALYTICS
81
Key Operations
An example of a supervised text classifier. Each document is associated with a class, and the target class is determined using the document’s words.
The Naive Bayes classifier uses the naive Bayes learning method to generate the classifier. It uses the training set to estimate the parameters of the generating model.
The naive Bayes method uses document words as training samples and assumes that they are mutually independent. Two models are used to generate the training data
In the multi-variate Bernoulli event model, a document is represented by a binary vector indicating which words occur or do not occur in the document
In the multinomial model, the document is an ordered sequence of word events and uses word frequency information.
Time complexity of the naive Bayes classifier is linear for both testing and training.
TEXT ANALYTICS
82
Naive Bayes Classifier
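A usage sketch of the multinomial model with scikit-learn; the tiny corpus and labels are made up:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["cheap pills buy now", "meeting agenda attached",
            "buy cheap watches", "project status meeting"]
    labels = ["spam", "ham", "spam", "ham"]

    vec = CountVectorizer().fit(docs)            # word-frequency (multinomial) features
    clf = MultinomialNB().fit(vec.transform(docs), labels)
    print(clf.predict(vec.transform(["cheap meeting pills"])))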
Latent semantic analysis (LSA) is a method for extracting innate semantic meaning, as approximated via the contextual usage of words.
Infers relations between different words, words and passages, words and documents, different documents, etc.
Uses a term-document matrix representation of the input text corpus
Columns can be passages or documents
Matrix values can be either tf-idf (term frequency-inverse document frequency), where the weight is inversely proportional to the document frequency, or the logarithm of the term frequency
The key operation in LSA is a reduced-rank singular value decomposition of the term-document matrix, creating a low-rank approximation using the top k singular values
Term-document analysis can then be carried out using a dot-product or cosine-similarity metric on the resulting row and column vectors
TEXT ANALYTICS
83
Latent Semantic Analysis
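A rank-k LSA sketch via truncated SVD in NumPy, on a hypothetical 5-term by 4-document count matrix:

    import numpy as np

    A = np.array([[2., 0., 1., 0.],   # rows: terms, columns: documents
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 3., 1.],
                  [1., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                           # keep the top-k singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # low-rank approximation of A

    # Document-document cosine similarity in the k-dimensional latent space
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1)
    print((docs @ docs.T / np.outer(norms, norms)).round(2))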
A family of unsupervised learning algorithms that view an object via a parts-based additive representation that uses only non-negative weights.
NNMF is suited for classification problems in text analysis, such as sentiment (topic) analysis
A document can be viewed as a set of words combined with their occurrences. NNMF can automatically discover a hidden classification using the topics as a classifier, where each topic is characterized by a set of words.
Core operation: Given a non-negative matrix V(w, d), find non-negative matrix factors W(w, c) and H(c, d) such that V ≈ W·H
Elements of H indicate which documents belong to which cluster, and elements of W indicate the degree to which a word belongs to a cluster
TEXT ANALYTICS
84
Non-negative Matrix Factorization
Key Data structures
Inverted indexes, Strings (character arrays), Sparse Matrices, Lists
Key Types
Characters, Strings, Integers, Double-precision Floats
Key Operations
String Operations (e.g., parsing, substring matching), Similarity and Scoring (e.g., distance computations), Matrix computations (e.g., Factorization), Set union and intersection
TEXT ANALYTICS
85
Key Data Structures, Types and Operations
A class of algorithms that employ repeated statistical sampling to compute approximate solutions to quantitative problems.
Model applications with inherent uncertainty, e.g., pricing of various financial instruments
Simulating systems with multiple degrees of freedom, e.g., simulating behaviors of different materials
Solving deterministic problems with infeasible computational requirements, e.g., solving high-dimension definite integrals
Applications include financial engineering, molecular modeling, process engineering, computational numerical analysis
MONTE CARLO METHODS
86
Introduction
For a problem, the Monte Carlo (MC) approach repeatedly generates independent, identically distributed random variables from the same distribution as the problem, and then uses a deterministic or stochastic model to compute the solution.
In the simplest formulation, the MC approach uses statistical sampling to compute a numerical integral
Standard error of a MC estimation decreases with the square root of the sample size
Standard error is independent of the dimensionality of the integral
Amount of work does not increase exponentially in the number of dimensions
To improve estimation quality, a large number of samples are needed
Variance reduction methods such as importance sampling are used to improve efficacy of the approach
MONTE CARLO METHODS
87
Basic Idea
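A minimal sketch of the idea on a toy problem: estimating the integral of x² over [0, 1] (exact value 1/3) by statistical sampling, where the error shrinks roughly as 1/√n:

    import numpy as np

    rng = np.random.default_rng(42)      # pseudo-random number generator
    for n in (100, 10_000, 1_000_000):
        x = rng.random(n)                # i.i.d. samples from the uniform distribution
        print(n, (x ** 2).mean())        # the sample mean approximates the integral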
Key steps in any Monte Carlo approach
Identify the probability distribution that mimics the problem under consideration (e.g., normal distribution for option pricing)
Generate samples from the probability distribution function using a pseudo-random number generator
Pass the sample values through a deterministic or stochastic model to get the final result
Irrespective of the type of approach, all MC implementations require good pseudo-random number generators
MONTE CARLO METHODS
88
Methodology
A class of computational algorithms that can generate a sequence of numbers that mimic random numbers (PRNGs)
Require a seed number as its initial state
Generated numbers repeat after a certain time (period). The length of the seed bounds the period: for an n-bit seed the period is always less than 2^n
Most algorithms use bit manipulation and shuffling (e.g., multiply-with-carry, xor-shift), combined with recurrence strategies over relatively prime numbers, to generate pseudo-random numbers
MC methods require PRNGs with large periods
The Mersenne Twister generates integers with uniform distribution, with a period chosen to be one of the Mersenne prime numbers. For example, the most commonly used variant, MT19937, generates 32-bit pseudo-random integers over [0, 2^32 − 1] with a period of 2^19937 − 1
MONTE CARLO METHODS
89
Pseudo-Random Number Generators
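Since xor-shift generators are mentioned above, here is a sketch of Marsaglia's 32-bit xorshift; its period of 2^32 − 1 is far too short for serious MC work, which is why large-period generators such as MT19937 are preferred:

    def xorshift32(seed):
        # Bit shifts and XORs over a nonzero 32-bit state; period 2^32 - 1
        x = seed & 0xFFFFFFFF
        while True:
            x ^= (x << 13) & 0xFFFFFFFF
            x ^= x >> 17
            x ^= (x << 5) & 0xFFFFFFFF
            yield x

    gen = xorshift32(2463534242)
    print([next(gen) for _ in range(3)])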
Key Data Structures
Bit-vectors, Vectors
Key Data Types
Integers, Bit representations
Key Operations
Bit manipulations (e.g., bit shifting, shuffling), modulo function
MONTE CARLO METHODS
90
Key Data Structures, Types and Operations
In mathematical programming or optimization, one seeks to find an optimal solution for a problem defined by its constraints using a mathematical formulation
A solution aims to minimize or maximize an objective function of real or integer variables, subject to constraints on the variables
Applied to cases where a closed-form solution is not easily found and one has to settle for the best available solution.
Forms the cornerstone for operations research and related disciplines like industrial engineering, social sciences, economics..
Applied to domains such as scheduling, manufacturing, supply-chain,..
MATHEMATICAL PROGRAMMING
91
Introduction
A mathematical program is an optimization problem of the form: maximize f(x) subject to x ∈ X, g(x) ≤ 0, h(x) = 0, where X is the domain of the functions f, g, and h, which map into real spaces.
The relations g(x) ≤ 0 and h(x) = 0 are called constraints
The function f is called the objective or cost function, and the domain X of the objective function is called the search space
A point x is feasible if it satisfies the constraints; feasible points are also called candidate solutions. A point x* is optimal if it is feasible and the value of the objective function is no worse than that of any other feasible point x
The problem can be framed as either a maximization or a minimization problem
In practice, different approaches are classified based on the properties of the objective function, the constraints, and the candidate solutions.
MATHEMATICAL PROGRAMMING
92
Basic Idea
A special case of convex programming where the objective function f is both linear and convex, and the associated constraints are specified using only linear inequalities. Canonically, linear programs are expressed in matrix form as: maximize c^T·x subject to Ax ≤ b, x ≥ 0
The set of constraints, Ax ≤ b, forms a convex polytope, and any solution traverses its vertices to find a point (candidate solution) where the objective function has the maximum (minimum) value, if such a point exists.
Traditional Approaches:
Simplex method: constructs a feasible solution at a polytope vertex and then traverses the polytope edges to search for an optimized solution. Exponential worst-case complexity; in practice, performs much better
Ellipsoid method: Average-case polynomial algorithm that uses an iterative approach to generate sequences of ellipsoids (practical performance is closer to the worst-case scenario)
Interior point method: Uses a set of feasible points lying in the interior of the polytope, via projection. Polynomial time in both worst and average cases.
Barrier approaches: Aim to minimize the traversal trajectory using a logarithmic barrier function
MATHEMATICAL PROGRAMMING
93
Linear Programming
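A usage sketch with SciPy's linprog, which minimizes, so the objective is negated to maximize; the coefficients are made up:

    from scipy.optimize import linprog

    c = [-3, -5]                     # maximize 3x + 5y  ==  minimize -3x - 5y
    A = [[1, 0], [0, 2], [3, 2]]     # constraint matrix for Ax <= b
    b = [4, 12, 18]

    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)           # optimal vertex and objective value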
ILLUSTRATION: LINEAR PROGRAMMING
94
Figure: edge traversal of a polytope defined by a set of inequalities (2 variables, 5 inequalities), showing the feasible space and the optimal solution.
A linear programming formulation where the unknown variables are integers. If only some of the variables are integers, the problem is called Mixed-Integer Programming (MIP).
IP formulations are generally NP-Hard, and are solved using two basic approaches: cutting plane and branch-and-bound.
The cutting plane approach first solves the “relaxed linear” form of the problem. If the solution is integer, then we are done. Else, the problem is reformulated using this solution and the process is repeated.
Branch-and-bound is a general class of search techniques that enumerate feasible solutions and prune them using upper and lower bounds of the cost function being optimized.
The branch-and-cut approach first solves the relaxed linear form of the problem using the cutting plane method; the solution space is then pruned using the branch-and-bound approach.
MATHEMATICAL PROGRAMMING
95
Integer Programming (IP)
Combinatorial optimizations cover methods that aim to optimize a cost function by selecting a subset of objects from the input set
Path traversals, Flow/circulation, cliques, packing, scheduling,..
Unlike LP problems, the feasibility space is not convex. Two key approaches:
Approximation algorithms that find a solution provably close to the optimal in polynomial time.
Suited for NP-Hard optimization problems like bin-packing.
Heuristics that search the feasibility space in a reasonable time to compute potentially sub-optimal solutions.
Greedy, Simulated Annealing, Search algorithms with backtracking,..
MATHEMATICAL PROGRAMMING
96
Combinatorial Programming
Constraint satisfaction/optimization problems (CSPs) cover problems with a constant objective function and a set of constraints that impose conditions on the solution variables.
Occur in scenarios that require solutions from combinatorial, logic programming or artificial intelligence
A CSP solution is an assignment to every variable such that every constraint is satisfied.
Boolean satisfiability, graph coloring, resource allocation and scheduling
Most CSP algorithms use depth-first search to traverse the search tree over different alternative solutions
Tree traversal heuristics can exploit problem structure and perform search in any order. Most algorithms use depth-first search with backtracking
The most widely used CSP search algorithm, A*, uses a best-first strategy with a cost function to choose the next search node to traverse.
MATHEMATICAL PROGRAMMING
97
Constraint Optimization
Key Data Structures
Sparse and Hyper-sparse Matrices, Trees
Key Data Types
Double-precision (and higher) floats
Key Operations
Sparse matrix computations (e.g., matrix-vector multiplication, factorization), Tree traversals
MATHEMATICAL PROGRAMMING
98
Key Data Structures, Types and Operations
Online Analytical Processing (OLAP) is the key business intelligence (BI) technology for solving decision support problems like reporting, financial planning, budgeting/forecasting
Broad class of analytics techniques that process historical data using a logical multi-dimensional data model
OLAP usually works on data warehouses: very large, subject-oriented, integrated, time-varying, historical collections of data
OLAP queries involve complex operations (e.g., aggregation or grouping) over large datasets
ON-LINE ANALYTICAL PROCESSING
99
Introduction
OLAP queries evaluate relationships within the underlying data over multiple attributes or dimensions: attributes can be independent or related via parent-child relationships
OLAP data model views data as multi-dimensional cubes
A cube is organized around a central theme, e.g., car sales
Theme captured by one or more numeric measures or facts (e.g., number of cars sold)
Measures are associated with a set of independent dimensions that provide the context. Each measure value is associated with a unique combination of dimension values; thus a measure value can be viewed as an entry in a cell of a multi-dimensional cube.
Each dimension can be further characterized by attributes related via hierarchical parent-child relationships (e.g., the time dimension has the hierarchy year, month, day, …). A dimension can be associated with multiple hierarchies.
The parent-child hierarchy enforces the order of summarization via aggregation: measure values associated with parents are computed via aggregation of measures of its children.
ON-LINE ANALYTICAL PROCESSING
100
Logical Model
Two main goals:
Reporting: Organize dimensions and perform computations on corresponding measures.
Presentation: Select dimensions and measure from the original or computed forms of data, and prepare them for display
Functional classification: what-now (post-mortem analysis), what-if (prediction), and what-next (forecasting)
Key operators
Group-by: Collates the measures per the unique values of the specified dimensions
Slice and dice: Reduces the dimensionality of the data by projecting it onto a subset of dimensions for selected values of the remaining dimensions
Pivoting: Re-orients the original cube to visualize the data using new relationships
Rollup and Drill-down: Support aggregation across hierarchies over one or more dimensions. Rollup computes totals by aggregating sub-totals at increasingly coarse granularity; drill-down decomposes a summarized value into the contributions of its child hierarchies and dimensions.
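A minimal Python sketch of the group-by and rollup operators over a toy (year, month, sales) fact stream, assuming a single time hierarchy; the names and data are hypothetical.

```python
# Group-by followed by a one-level rollup over a (year, month, sales)
# fact stream; hypothetical toy data.
from collections import defaultdict

facts = [("2013", "Jan", 10), ("2013", "Feb", 20), ("2014", "Jan", 5)]

# Group-by (year, month): collate measures per unique dimension values.
monthly = defaultdict(int)
for year, month, sales in facts:
    monthly[(year, month)] += sales

# Rollup: aggregate monthly sub-totals up the time hierarchy to years.
yearly = defaultdict(int)
for (year, _month), total in monthly.items():
    yearly[year] += total

print(dict(monthly))  # {('2013', 'Jan'): 10, ('2013', 'Feb'): 20, ('2014', 'Jan'): 5}
print(dict(yearly))   # {'2013': 30, '2014': 5}
```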
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
101
OLAP Queries
Relational OLAP (ROLAP)
Data stored as records in relational tables and queried using SQL-based OLAP queries
Two types of tables: fact tables for storing measures and dimension tables for storing hierarchical dimensions. Tables are linked via referential relationships; materialized views store aggregated values.
The most common approach, the star schema, uses a single fact table and multiple dimension tables, one per dimension.
Multi-dimensional OLAP (MOLAP)
Data stored and processed as multi-dimensional arrays backed by specialized data structures
OLAP queries use languages that can express data access patterns via the multi-dimensional array model; for example, Microsoft's MDX provides explicit syntactic support for navigating dimensions
Hybrid OLAP (HOLAP)
Uses a combination of multi-dimensional and relational strategies
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
102
OLAP Server Implementations
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
103
Key Data structures
Relational tables, Prefix trees, Linear arrays, Hash tables, Key-value pairs
Key Data Types
Strings, Integers, Double Precision Floats
Key Operations
Joins (hash and sort), Grouping, Multi-attribute ordering, OLAP operators such as CUBE, ROLLUP, DRILLDOWN, Aggregation operators (e.g., Sum, Min, Max)
Key Data Structures, Types and Operations
Graphs and related data structures (e.g., trees and directed-acyclic graphs (DAGs)) form the fundamental tools used for expressing and analyzing relationships between entities.
Relationships modeled by graphs include
Association, Hierarchies, Sequences, Positions and Paths
Applications in a diverse array of application domains
Biology, Chemistry, Pharmacology, Linguistics, Economics, Operations Research, etc.
GRAPH ANALYTICS
Analytics Exemplars
104
Introduction
A graph G = (V, E) is described by a set of vertices V and a set of edges E.
The vertices or edges can be weighted, and edges can be either directed or undirected. Graph vertices can have additional information such as color.
Graph analytics refers to a class of techniques that either use graph models to solve a problem (e.g., traveling salesperson) or analyze and exploit the inherent graph structure of a problem.
Broad classification into three overlapping categories:
Structural algorithms, Traversal algorithms, and Pattern-matching algorithms
GRAPH ANALYTICS
Analytics Exemplars
105
Basic Idea
Structural algorithms, commonly known as network analysis algorithms, analyze symmetric and asymmetric relationships between networked entities by exploring the structure of the underlying graph.
Two types of networks:
Small-world networks: the distance between two randomly chosen vertices grows proportionally to the logarithm of the total number of vertices.
Scale-free networks: the distribution of vertex degrees follows a power law. Many small-world networks exhibit scale-free properties, but the reverse is not true.
Network analysis algorithms understand and exploit the inherent structural properties of a graph: order, size, degree, distance, diameter, chromatic number, centrality (degree, closeness, betweenness, eigenvector).
Key example: the eigenvector centrality algorithm used for ranking linked web pages (e.g., the PageRank algorithm) and for analyzing fMRI brain scans.
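A minimal power-iteration sketch of PageRank-style eigenvector centrality, assuming every vertex has at least one out-link (dangling pages are omitted for brevity); the toy web graph is hypothetical.

```python
# Power-iteration sketch of PageRank-style eigenvector centrality on an
# adjacency-list graph; assumes no dangling vertices (hypothetical data).

def pagerank(out_links, damping=0.85, iters=50):
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        new_rank = {v: (1.0 - damping) / n for v in out_links}
        for v, targets in out_links.items():
            share = damping * rank[v] / len(targets)  # split rank evenly
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# A tiny 3-page web: page "a" is linked by both other pages.
web = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(pagerank(web))  # "a" receives the highest rank
```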
GRAPH ANALYTICS
Analytics Exemplars
106
Structural Algorithms
Operate on graphs that capture either the structure of an underlying physical network (e.g., roads, pipes) or an abstract model of a problem
Four classes of traversal problems:
Route problems: optimize path lengths under different constraints
Flow problems: optimize material flow over a network represented by underlying directed graph
Coloring problems: assign labels to graph vertices such that given constraints are satisfied
Searching problems: find a solution by traversing vertices that encode problem states
Notoriously difficult to solve exactly: many are NP-complete, so heuristics are used extensively.
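Breadth-first search, the basic building block behind many of these traversal problems, can be sketched in a few lines; the adjacency list below is hypothetical.

```python
# Minimal breadth-first search over an adjacency list, returning hop
# distances from a source vertex (hypothetical toy data).
from collections import deque

def bfs_distances(adjacency, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            if n not in dist:            # visit each vertex only once
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_distances(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```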
GRAPH ANALYTICS
Analytics Exemplars
107
Traversal Algorithms
Focus on finding different patterns in an input graph: cycles, cliques, sub-graphs with certain properties, and network motifs.
The generalized combinatorial problems of enumerating or identifying structural patterns are NP-complete; pattern-matching algorithms therefore either rely on heuristics or solve constrained versions in polynomial time (one such polynomial special case, triangle counting, is sketched after the applications list below).
Most use traversal-based solutions, but some pattern-matching algorithms use matrix algorithms (e.g., spectral clustering).
Key applications:
Clique determination for analyzing social networks, financial networks
Motif identification to diagnose diseases like Alzheimer’s or Schizophrenia
Epidemiological analysis to understand spread of diseases like AIDS or SARS
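The triangle-counting sketch promised above: counting 3-cliques is a constrained pattern-matching problem that stays polynomial even though general clique detection is NP-complete. The graph data is hypothetical and the enumeration is deliberately naive.

```python
# Naive triangle (3-clique) counting over an undirected graph stored
# with both edge directions; hypothetical data, O(V^3) enumeration.
from itertools import combinations

def count_triangles(adjacency):
    edges = {(u, v) for u in adjacency for v in adjacency[u]}
    count = 0
    for a, b, c in combinations(sorted(adjacency), 3):
        # A triangle requires all three pairwise edges to be present.
        if {(a, b), (b, c), (a, c)} <= edges:
            count += 1
    return count

g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(count_triangles(g))  # 1 (the triangle 0-1-2)
```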
GRAPH ANALYTICS
Analytics Exemplars
108
Pattern-matching Algorithms
Key Data Structures
Adjacency/Incidence list, Sparse Matrix, Trees, Queues, Lists
Key Types
Integers, Double- and Single-precision Floats
Key Operations
Tree and Graph Traversals (Depth-first, Breadth-first search,..), Matrix computations (e.g., Eigenvalue solvers, Sparse Matrix-Vector Multiplication, Non-negative Matrix Factorization)
GRAPH ANALYTICS
Analytics Exemplars
109
Key Data Structures, Types and Operations
Accelerating Analytics
110
Analytics workloads often consist of one or more components with different functional goals, runtime requirements, algorithms, and data features
The same analytical model can be used by different applications in varying environments
Linear programming used for scheduling algorithms or as a kernel in Support Vector Machines
A model can use different algorithms under different contexts
A clustering model can use either a hierarchical or a k-means algorithm; logistic regression can be used for regression or for classification
An analytical algorithm can use different implementations of a kernel depending on the runtime constraints
In practice, systems are integrating databases, analytics, and high-performance computing operations
Analytics-specific optimizations must co-exist or should be integrated with other system features
ANALYTICS WORKLOADS ISSUES
Accelerating Analytics
111
COMPUTATIONAL PATTERNS OF ANALYTICS WORKLOADS
Accelerating Analytics
112
Key computational patterns
Linear algebraic formulations over vectors and sparse matrices
Operations on higher-dimensional data structures
Operations on sets, strings, and hierarchically structured data
Probabilistic algorithms
Common Operations
Matrix factorizations, Matrix-Matrix and Matrix-Vector Multiplications, Transpose, Eigensolvers
Tree/Graph traversals, Hash table queries
Dynamic programming, Greedy Algorithms
Set union/intersection, Grouping/ordering of multi-dimensional elements
BEHAVIOR OF ANALYTICS WORKLOADS: COMPUTATIONAL CHARACTERISTICS
Accelerating Analytics
113
Key data types
Integers, Double-precision and Complex floats, Strings
Common data structures
Sparse matrices, Vectors, Higher-dimensional trees, Prefix trees, Relational tables, Graphs, Hash tables, Bit vectors, Inverse indexes, Adjacency/incidence lists, Lists, Queues
Common functions and operations
Numerical (e.g., log, sqrt, sine, cosine) and bit-level (e.g., shift, mask) operations
Distance and scoring functions (e.g., Euclidean, Hamming, Minkowski; see the sketch after this list)
Statistical functions (e.g., Gaussian, Logistic, Spline)
Smoothing and String functions
Matrix Operations: Factorization, Multiplication, Linear Solvers
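The distance functions referenced in this list are small enough to sketch directly; their inner loops are typical targets for SIMD acceleration. A minimal, illustrative version:

```python
# Two of the distance functions named above, in plain Python.
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(hamming("karolin", "kathrin"))      # 3
```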
BEHAVIOR OF ANALYTICS WORKLOADS: COMPUTATIONAL CHARACTERISTICS
Accelerating Analytics
114
RUNTIME CHARACTERISTICS OF ANALYTICS WORKLOADS
Accelerating Analytics
115
Most workloads are read-only
Exhibit both iterative and non-iterative execution methodologies
Most operate in batch mode
Time-series processing has real-time constraints
Operate on different types and sizes of data
Input data can be purely in-memory or stored on disks
Can be stored in relational tables, files or streams
Can be structured (e.g., relational tables, matrices), semi-structured (e.g., graphs) or unstructured (text)
Output size is usually smaller than the input size
Exceptions: association rule mining and OLAP
BEHAVIOR OF ANALYTICS WORKLOADS: RUNTIME CHARACTERISTICS
Accelerating Analytics
116
Purely Compute-bound Workloads
Mathematical Programming and Monte Carlo Methods (see the Monte Carlo sketch after this classification)
Compute-bound and Network/Memory-bound
Time-series Processing
Compute-bound (in-memory), I/O-bound (disk-based)
Text Analytics, Regression Analysis, Clustering, Nearest Neighbor, Neural Networks, Support Vector Machines, Recommender Systems
Memory-bound (in-memory), I/O-bound (disk-based)
OLAP, Graph Analytics, Text Analytics, Decision Tree Learning, Association Rule Mining
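As promised above, a minimal Monte Carlo sketch (estimating pi) illustrates why such methods are purely compute-bound and embarrassingly parallel; the sample count is arbitrary.

```python
# Monte Carlo estimation of pi: sample points in the unit square and
# count how many land inside the quarter unit circle.
import random

def estimate_pi(samples):
    hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
               for _ in range(samples))
    return 4.0 * hits / samples

print(estimate_pi(1_000_000))  # approximately 3.14
```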
ANALYTICS WORKLOADS: PERFORMANCE CLASSIFICATION
Accelerating Analytics
117
Updatable global execution state
Non-contiguous mostly read-only data access
Gather-scatter
Inter-iteration dependencies
Collective reductions (e.g., aggregation; see the sketch below)
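The collective-reduction pattern can be sketched with a process pool; the 4-way partitioning below is an arbitrary choice and partial_sum is a hypothetical helper.

```python
# Collective reduction (aggregation) sketch: partition the data,
# compute partial sums in parallel workers, then combine the results.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]       # 4-way partition
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))  # final reduction
    print(total)  # 499999500000
```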
ANALYTICS WORKLOADS: KEY PARALLEL EXECUTION PATTERNS
Accelerating Analytics
118
Due to their irregular computation patterns, shared-address-space parallelization is better suited for analytics
Key exceptions: time-series processing, Monte Carlo simulation, and text analytics can be parallelized using distributed-memory approaches
OpenMP-style loop parallelization is not widely useful
For parallelizing out-of-core workloads, GraphLab/Spark-like approaches are more applicable than MapReduce because of the workloads' iterative nature
MapReduce is very effective for parallelizing some text analytics workloads
MapReduce is not suited for implementing numerical linear algebraic kernels
Analytics workloads are more amenable to specific functional acceleration
PARALLELIZATION OF ANALYTICS WORKLOADS
Accelerating Analytics
119
ACCELERATION AND PARALLELIZATION OPPORTUNITIES
Accelerating Analytics
120
Foundational accelerators
Compute Accelerators: SIMD, GPU, FPGA, ASIC
Memory Accelerators: Gather-scatter accelerator, GPU Texture memory
Network Accelerators: Packet evaluation/routing acceleration
I/O Accelerators: Active Storage, Solid State Drives
Used to construct higher-order accelerators
ANALYTICS ACCELERATORS
Accelerating Analytics
121
Functional Accelerators
Pattern Matching/Regex (FPGA)
Compression/Decompression, Encoding/decoding (FPGA, GPU, and SIMD)
Random Number Generators (FPGA, GPU, and SIMD)
Distance Metric Calculations (SIMD and GPU)
Smoothing Functions (SIMD and GPU)
Aggregation (SIMD and GPU)
FFT, Sort, Histogram,… (SIMD and GPU)
Matrix Computing (SIMD and GPU)
CLASSIFYING ANALYTICS ACCELERATORS (1)
Data Structure Accelerators
Hash tables
Bloom Filters
K-D, R, ANN Trees
Inverse Index (posting lists)
Prefix/suffix Trees
Key-value Pair
Graph Traversals (BFS, DFS)
Accelerating Analytics
122
System Accelerators
Garbage Collection Acceleration
Virtual Memory Compression
Software-defined Networking
Visualization
MapReduce Shuffle Accelerators
CLASSIFYING ANALYTICS ACCELERATORS (2)
Workload Accelerators
XML Parsing/XPath/XQuery Accelerators
Network Accelerator for Intrusion Detection, Routing etc.
Data Warehousing Acceleration
Image Tagging and Classification Acceleration
Speech Processing Acceleration
Bio-informatics Acceleration
Accelerating Analytics
123
Memory accelerators for random memory accesses
Support for scatter/gather, indirection based accesses
Application-specific memories: Two-dimensional memories, Composite types (as supported by GPU Texture Memories)
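A software analogue of the gather/scatter pattern this wish-list item targets, sketched with NumPy fancy indexing; the arrays are hypothetical.

```python
# Gather and scatter via indirection, the access pattern the memory
# accelerators above would support in hardware (hypothetical data).
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([3, 0, 0, 2])

gathered = data[idx]                 # gather: out[i] = data[idx[i]]
print(gathered)                      # [40. 10. 10. 30.]

out = np.zeros(4)
np.add.at(out, idx, gathered)        # scatter-add: out[idx[i]] += vals[i]
print(out)                           # [20.  0. 30. 40.]
```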
Extensions to SIMD
Wider SIMD, with newer intrinsic functions (e.g., distance metric or similarity calculations)
Support for higher-order data structures (e.g., arrays)
Analytics-specific function accelerators
Random Number Generators
String Processing Acceleration (e.g., Sub-string matching, Multiple string comparison)
OLAP Aggregation and Grouping Accelerator
Support for “Matrix Acceleration”
Unified accelerator for improving computing and memory accesses
Designed for 2- and 3-dimensional dense and sparse arrays
ARCHITECTURAL WISH-LIST FOR ON-CHIP ANALYTICS ACCELERATORS
Accelerating Analytics
124
HPC AND ANALYTICS: SIMILARITIES AND DIFFERENCES
Accelerating Analytics
125
Both HPC and Analytics applications use mathematical formulations to solve the problem at hand: both make extensive use of matrix-based linear algebraic kernels
Both HPC and Analytics require extensive visualization
HPC applications usually have a single-domain focus (e.g., seismic processing). Most HPC applications can be viewed as information-extraction processes.
Traditional HPC applications are compute-bound for in-core processing
Single workflow to address domain-specific functional and runtime goals
Most analytics applications have a multi-domain focus (e.g., retail analysis). Most analytics applications can be viewed as information-integration processes.
Several independent workflows with potentially different domain-specific functional and runtime goals
May not be compute-bound for in-core processing
DATA MANAGEMENT AND ANALYTICS: SIMILARITIES AND DIFFERENCES
Accelerating Analytics
126
Business Intelligence (Data Warehousing/OLAP) shared across two domains
Data management over streaming data has similarities with time-series processing
Unstructured/semi-structured data processing (e.g., XML, RDF) share algorithms with text analytics and graph analytics
Relational data is used as the primary source for different analytics applications
In-memory data warehousing data layout and data structures applicable to analytics workloads: compressed columnar storage, spatial data structures
Transactional processing is completely distinct from analytical applications (analytics requires no isolation modes or two-phase commit)
Transaction-oriented disk layouts and index structures (e.g., B+ trees) are not useful for analytics applications
An ideal integrated system needs to support the following capabilities:
Balanced support for computation, memory system, networking, and I/O
Integrated with the data-centric ecosystem: transactional databases, data warehouses, text repositories, streams
System software focus on information integration, not just on computational performance
Should support scalable data structures like hash tables, trees, linked lists, and bit vectors (in addition to conventional matrices)
Should support different data layout and access strategies
A traditional scalable HPC system may not be a good analytics system. However, a well-balanced, flexible, integrated analytics system can serve as a scalable HPC system.
INTEGRATING HPC, ANALYTICS AND DATA MANAGEMENT: SYSTEMS ISSUES
Accelerating Analytics
127
BRINGING IT ALL TOGETHER
Accelerating Analytics
128
[Figure: diagram positioning analytics workloads across the overlapping Analytics, HPC, and Data Management domains; labels include Graph Analysis, Visualization, OLAP, RDF, Transaction Processing, Streams, Text Analytics, Regression Analysis, Modeling/Simulation, Mathematical Programming, Time-Series Processing, Clustering, Rule Mining, Nearest-neighbor Search, Support Vector Machines, Neural Networks, and Decision-tree Learning.]
SUMMARY
129
Analytics is an exciting new field that uses mathematical formulations to solve problems in diverse domains
An analytics workload usually consists of multiple components, each with its distinct runtime and functional goals
Analytics workloads usually employ one of a set of key models (exemplars)
Multiple exemplars share many computational and runtime patterns
Such patterns can be used to identify acceleration and parallelization opportunities that apply across multiple workloads
THANKS!
130
Rexer Analytics Survey (www.rexeranalytics.com)
T. Davenport, J. Harris, and R. Morrison, Analytics at Work: Smarter Decisions, Better Results, Harvard Business School Press, 2010
G. Shmueli, N. Patel, P. Bruce, Data Mining for Business Intelligence, John Wiley and Sons, 2010
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006
A. Rajaraman and J. Ullman, Mining of Massive Datasets, Cambridge University Press, 2010
P. Melville, V. Sindhwani, Recommender Systems, Encyclopedia of Machine Learning, 2010
Andrew Ng, Machine Learning Course, Coursera
FURTHER INFORMATION
131