TRANSCRIPT
ANALYZING ANALYTICS
Rajesh Bordawekar
1
IBM Thomas J. Watson Research Center
www.adms-conf.org/Analytics/tutorial.pdf
The process of using mathematical formulations for identifying, extracting, processing, and integrating information from raw data, and then applying it to solve a problem.
ANALYTICS: A DEFINITION
2
Understand and characterize execution flow of analytics workloads across multiple application domains:
Characterization targeted towards computer architects, compiler developers, system optimizers, and parallel programmers
Explore different layers from user to hardware to identify common design and computational patterns: Algorithms, Data Structures, Data Types, Operators
Define a set of Exemplars that capture key algorithmic, functional, and runtime characteristics
Investigate opportunities for parallelism and acceleration for key analytics exemplars via software, architectural, and systems solutions
GOALS
3
Analytics Workloads (30 mins)
Analytics Exemplars: 1 (60 mins)
Break
Analytics Exemplars: 2 (45 mins)
Accelerating Analytics (45 mins)
OUTLINE
4
“A Survey of Business Analytics Models…”, Bordawekar et al., IBM Research TR RC25186
“Analyzing Analytics”, Bordawekar et al., IBM Research TR RC25317; also in SIGMOD Record, Dec. 2013
Analytics Workloads
5
ANALYTICS AT YOUR SERVICE
6
Prescription
Prediction
Reporting
Simulation
Pattern Matching
Recommendation
Alerting
Quantitative Analysis
ANALYTICS FUNCTIONAL GOALS
7
Descriptive and Inferential Statistics
Learning: Supervised, Unsupervised, and Reinforcement
Structured and Unstructured Data Analysis
Modeling and Simulation
Optimization
ANALYTICS PROBLEM TYPES
8
CLASSIFICATION OF ANALYTICAL WORKLOADS
9
A problem type can be used for multiple functions.
An analytics workload consists of a pipeline with multiple stand-alone analytics components, each with a different focus
In Watson: Question Analysis, Query Decomposition, Hypothesis Generation and Scoring, Answer ranking, etc.
Each component has distinct functional and runtime goals, and different implementation stacks partitioned into solution, library and kernel components.
Solution: End-user focused and customized to satisfy user’s functional goals (e.g., Answer ranking in Watson)
Library: Designed to be portable and applicable across multiple solutions (e.g., DeepQA UIMA Infrastructure that powers Watson)
Kernel: Architectural and system specific implementations of library functions using specific data structures and kernels (e.g., sparse matrix vector multiplication)
ANALYTICS AT WORK
10
ANALYTICS WORKLOAD EXAMPLE
11
Watson (DeepQA) Question-Answer System
Figure: the DeepQA pipeline: Question → Question Analysis → Query Decomposition → Hypothesis Generation → Soft Filtering → Hypothesis and Evidence Scoring → Synthesis → Final Merging and Ranking → Answer and Confidence, drawing on Answer Sources, Evidence Sources, and Trained Models.
ANALYTICS FUNCTIONAL FLOW
Figure: functional flow from a system architecture through functional and runtime goals, analytical discipline, problem types, processes, models, algorithms, and kernels, mapped onto the Solution / Library / Kernel implementation stack.
12
An analytics solution uses an analytic type based on its target functional and runtime goals
An analytic type has multiple models; each, in turn, can use one or more algorithms
Based on the problem formulation and goals, the chosen algorithm can use specific data structures and kernels
Given the variety of algorithmic and system alternatives, it is difficult to make the right choices to address specific performance issues
From a systems perspective, it is important to identify common core computational and runtime patterns
WHY EXEMPLARS?
13
Regression Analysis
Clustering
Nearest Neighbor Search
Association Rule Mining/Recommender Systems
Neural Networks
Support Vector Machines
Decision Tree Learning
ANALYTICS EXEMPLARS
Time Series Processing
Text Analytics
Monte Carlo Methods
Mathematical Programming
Online Analytical Processing (OLAP)
Graph Analytics
14
A set of widely used analytical models that capture key computational and runtime data access patterns across different analytics workloads.
Statistics: Regression
Data Mining: Clustering, Nearest neighbor search, Association rule mining
Machine Learning: Neural networks, Decision tree learning, Support vector machines, Recommender Systems
Simulation: Monte Carlo algorithms
Optimization: Mathematical programming
Data Analysis: Time series processing, Text analytics, Online Analytical Processing, Graph analytics
ANALYTICS EXEMPLARS
15
Analytical Domains
1. Regression Analysis
2. Decision Trees
3. Cluster Analysis
4. Time Series Processing
5. Text Mining
6. Neural Networks
7. Association Rule Mining/Recommender Systems
8. Support Vector Machines
9. Social Network Analysis
2013 REXER ANALYTICS SURVEY
Top Analytics Algorithms in Practice
16
www.rexeranalytics.com
EXEMPLAR CHARACTERISTICS

ANALYTIC EXEMPLAR             | PROBLEM TYPE               | FUNCTIONAL GOALS
Regression Analysis           | Inferential Statistics     | Prediction
Clustering                    | Unsupervised Learning      | Reporting, Prediction
Nearest Neighbor Search       | Unsupervised Learning      | Prediction, Recommendation
Association Rule Mining       | Unsupervised Learning      | Recommendation
Neural Networks               | Supervised Learning        | Pattern matching, Prediction
Support Vector Machines       | Supervised Learning        | Pattern matching, Prediction
Decision Tree Learning        | Supervised Learning        | Recommendation, Prediction
Time Series Processing        | Unstructured Data Analysis | Pattern matching, Alerting
Text Analytics                | Unstructured Data Analysis | Pattern matching, Reporting
Monte Carlo Methods           | Modeling and Simulation    | Simulation
Mathematical Programming      | Optimization               | Prescription
On-line Analytical Processing | Structured Data Analysis   | Reporting, Prediction
Graph Analytics               | Unstructured Data Analysis | Pattern matching, Recommendation

An Exemplar can address one or more functional goals.
17
Analytics Exemplars
18
13 Exemplars
Focus on high-level computational and runtime characteristics, not mathematical formulations
For each exemplar:
General usage
Basic idea
Key algorithms
Important data structures, functions, and operations
ORGANIZATION
19
Classical statistical technique for modeling relationship between a dependent variable and one or more independent variables
Primary uses include prediction, forecasting, and discovering relationships between dependent and independent variables
Can be viewed as example of supervised learning which uses independent variables for training
Key application domains: Economics, Psychology, Social Sciences, Marketing, Health-care.. (Rated the most used analytics model by Rexer Analytics in 2013)
REGRESSION ANALYSIS
Introduction
20
Any regression model relates a dependent variable Y to a regression function f of independent variables (regressors) X and unknown regression parameters β:

    Y ≈ f(X, β)

Quality of prediction depends on the amount of information about X
For k unknown parameters β, regression analysis is possible only if N ≥ k, where N is the number of data points of the form (Y, X)
REGRESSION ANALYSIS
Basic Idea
21
ILLUSTRATION: LINEAR REGRESSION
22
Figure: linear regression of University GPA (dependent variable, y) on High School GPA (independent variable, x), fitted by the line y = a0 + a1*x with intercept a0.
The dependent variable can be approximated as a linear combination of the regressors and a disturbance term ε
The linear regression model can be formulated as:

    y = Xβ + ε

Any solution for the linear regression aims to infer the values of the regression parameters β
Ordinary Least Squares (OLS) estimation minimizes the sum of squared residuals
Generalized Least Squares (GLS) estimation minimizes the squared Mahalanobis length of the residual vector
REGRESSION ANALYSIS
Linear Regression
23
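A minimal OLS sketch with NumPy's least-squares solver; the GPA numbers are hypothetical, echoing the illustration above:

    import numpy as np

    x = np.array([2.8, 3.1, 3.4, 3.6, 3.9])   # high school GPA (regressor)
    y = np.array([2.5, 2.9, 3.2, 3.3, 3.8])   # university GPA (dependent variable)

    # Design matrix with an intercept column, modeling y = a0 + a1*x + error
    X = np.column_stack([np.ones_like(x), x])

    # OLS: minimize the sum of squared residuals ||y - X b||^2
    (a0, a1), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"y = {a0:.3f} + {a1:.3f} * x")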
The dependent variables are related to the regressor variables via a relationship that is non-linear in one or more unknown parameters β. The model has the following form:

    y = f(X, β) + ε

The most common nonlinear function is the exponential decay or growth model, e.g., y = a·e^(b·x)
Other non-linear functions include logarithmic, trigonometric, power, and Gaussian
Estimation of regression parameters is done by minimizing a suitable goodness-of-fit expression with respect to β
Minimize the sum of squared residuals using a non-linear least-squares method
Use linear regression to solve a linearized form of the non-linear function
REGRESSION ANALYSIS
Non-linear Regression
24
Logistic regression predicts the probability of occurrence of an event by fitting data to a logistic function: f(z) = 1 / (1 + e^(−z)), where z is a linear combination of the regressors
Regression parameters can then be estimated by the maximum likelihood method for a linear model of its inverse-logit function
Probit regression predicts the binary outcome of an event (Y is a binary variable). The Probit model can be specified as Pr(Y = 1 | X) = Φ(X·β), where Pr is the probability and Φ is the CDF of the standard normal distribution
Regression parameters can then be estimated using the maximum likelihood method.
REGRESSION ANALYSIS
Logistic and Probit Regression
25
Key Data Structures
Sparse Matrices and Vectors
Key Types
Double-precision Float and Complex
Key Operations
Matrix computations (Inversion, LU Decomposition, Transpose, and Factorization), Least-squares estimation, Logistic function
REGRESSION ANALYSIS
26
Key Data Structures, Types and Operations
A process of grouping together entities from an ensemble into classes of entities that are similar in some sense. Also referred to as data segmentation.
An example of unsupervised learning
Key application domains
Market segmentation analysis
Gene Sequence Analysis
Medical Imaging
Document clustering based on semantic information
CLUSTERING
Introduction
27
Any clustering algorithm identifies and exploits relevant similarities in the underlying potentially disparate data sources
Similarity can use either geometric distance-based metric or conceptual relationships in data
Input data: Potentially noisy, different types (e.g., binary, interval-based, categorical, ordinal, etc.), can have high dimensions, large data sizes
Two broad classes of algorithms: Parametric and Non-parametric
Parametric (model-based) solutions assume an underlying probability distribution and fit clusters to the data accordingly
Non-parametric solution clusters based on spatial properties (e.g., distance or density)
Algorithms also differ depending on the dimensionality of the input data sets
CLUSTERING
Basic Idea
28
CLUSTERING
K-Means Clustering
29
An iterative partitioning method that constructs k non-overlapping partitions from n objects, where each partition represents a cluster.
Cluster similarity is measured as the mean value of objects, which can be viewed as the cluster’s centroid
Iterative steps
Choose k objects and mark them as the cluster centroids
Assign each remaining object to one of the clusters based on the distance of the object from the centroid.
The new mean of each cluster is calculated, and objects are relocated to the cluster whose mean is nearest
The iterative relocation process continues until the convergence criterion is met (see the sketch below)
For categorical data (whose means cannot be calculated), modes are used as the similarity measure
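To make the relocation loop concrete, here is a minimal k-means sketch in NumPy (illustrative only: random initialization, and it assumes no cluster becomes empty):

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: choose k objects as the initial centroids
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            # Step 2: assign every object to the nearest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each cluster mean as the new centroid
            new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
            if np.allclose(new, centroids):   # convergence criterion met
                break
            centroids = new
        return centroids, labels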
ILLUSTRATION: K-MEANS CLUSTERING
30
Figure: k-means clustering of a point set with K=3.
CLUSTERING
Hierarchical Clustering
31
Hierarchical clustering groups data objects into a tree of clusters based on distance-based models, using two approaches:
Agglomerative: bottom-up approach that creates increasingly large clusters
Divisive: top-down approach that starts with a single cluster and subdivides it into smaller pieces until termination conditions are met
BIRCH: Two-stage agglomerative clustering algorithm designed to operate on very large datasets
The first phase builds in-memory clusters that summarize the input data using statistical clustering features
In the second phase, the summarized clusters are reclustered using an iterative partitioning method
CLUSTERING
EM Clustering
32
Model-based approach that extends the k-means partitioning algorithm: assumes that underlying data is a mixture of k probability distributions
Aims to estimate parameters of the underlying probability distributions
The most common form of EM learns a mixture of Gaussian distributions
Starts with an initial estimate of the parameters of the mixture model. Randomly selects k items as initial clusters
(Expectation) Assign each item to a cluster according to a weight that denotes its probability of membership.
(Maximization) The item weights are re-scored based on the new model distribution, and the parameters are then updated to maximize the likelihood of the distributions.
Key Data Structures
Trees, Graphs, Matrices, Queries
Key Types
Double-,Single-precision Floats
Key Operations
Matrix computations, Distance Functions
CLUSTERING
33
Key Data Structures, Types and Operations
An optimization problem for finding points from an ensemble that are closest in some defined proximity
The definition of proximity varies from domain to domain, and is usually formulated using a metric function (e.g., Euclidean distance for spatial proximity)
Applied to both clustering and classification scenarios
Well-known applications include online media providers (e.g., Netflix) suggesting movies or songs to match the taste of a particular user
Other applications include analyzing multimedia data for copyright violations, robotics, and drug discovery.
NEAREST NEIGHBOR SEARCH
Introduction
34
Given a set S of n points in some metric space (X, d), the problem is to preprocess S so that, given any query point p ∈ X, one can efficiently find a point q ∈ S that minimizes d(p, q)
Several variations of the problem exist, based on the input data dimensionality and size, the metric used for proximity calculations, and the result cardinality (e.g., top-k or all-pairs).
Commonly used metrics for proximity calculations: Euclidean distance for low-dimensional data and Hamming distance for high-dimensional data
For very high-dimensional data, an approximate version of the search algorithm is preferred for efficiency
NEAREST NEIGHBOR SEARCH
Basic Idea
35
NEAREST NEIGHBOR SEARCH
K-d Trees
36
Addresses the precise nearest-neighbor problem for low-dimensional datasets
A K-d tree is a binary tree that organizes k-dimensional data using recursive hyperplane decomposition
Each record is represented by a k-element vector of real values. The tree is constructed by recursively selecting one of the k dimensions as the discriminatory dimension and partitioning the dataset according to a certain partition value. Each leaf points to one or more records at that location
At query time, the tree is traversed recursively. At every level, value of the discriminatory dimension is used to choose the path.
When the traversal reaches a leaf, a list of records is returned as the nearest neighbor candidates
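As a usage sketch, SciPy ships a K-d tree implementation; the random 3-dimensional points here are purely illustrative:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(1000, 3)              # 1000 points in 3 dimensions
    tree = cKDTree(points)                        # built via recursive hyperplane decomposition
    dist, idx = tree.query([0.5, 0.5, 0.5], k=1)  # nearest neighbor of the query point
    print(idx, dist)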
ILLUSTRATION: K-D TREE
37
Reference: The Stony Brook Algorithm Repository www.cs.sunysb.edu/~algorith/files/kd-trees.shtml
Figure: an input point set and the corresponding K-d tree.
NEAREST NEIGHBOR SEARCH
Approx. Nearest Neighbor
38
Uses hierarchical space decomposition for solving the approximate nearest neighbor problem for low-dimensional data
Represents points in d-dimensional space using a balanced box-decomposition (BBD) tree.
Each node is associated with a cell, which represents points in either a d-dimensional rectangle or a region corresponding to the set-theoretic difference of two nested rectangles.
Each leaf cell is associated with a single point; the leaves of the tree span the entire space.
During the query process, for a given query point q, the node with the minimum distance to the query point is selected for recursive traversal towards the leaves. The search terminates when the distance to the current leaf falls below a bound.
NEAREST NEIGHBOR SEARCH
Locality-sensitive Hashing
39
Locality-sensitive hashing (LSH) algorithms are designed for solving the approximate nearest neighbor problem for very high-dimensional datasets
Hash points using several hash functions to ensure that for each function, the probability of collision is much higher for points that are close to each other than for those that are far apart.
For a query point, one can determine its nearest neighbors by hashing the query point, and retrieving the points in the bucket containing that point.
LSH relies on a family of hash functions with the property that if two points are close, they hash to the same bucket with high probability; if they are far apart, they hash to the same bucket with low probability.
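A sketch of one such family, random-hyperplane hashing for cosine similarity (the slides do not name a specific family, so this choice is an assumption). Points that fall on the same side of most hyperplanes receive the same bucket key:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, nbits = 128, 16
    planes = rng.normal(size=(nbits, dim))   # one random hyperplane per hash bit

    def lsh_bucket(x):
        # Each bit records which side of a hyperplane the point falls on
        bits = (planes @ x) > 0
        return bits.astype(int).tobytes()     # bucket key

    # Points hashed to the same bucket become candidate near neighbors
    data = rng.normal(size=(1000, dim))
    buckets = {}
    for i, x in enumerate(data):
        buckets.setdefault(lsh_bucket(x), []).append(i)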
Key Data Structures
Higher-dimensional data structures (e.g., K-D, BBD trees), Hash tables, Matrices
Key Types
Double- and Single-precision Floats
Key Operations
Matrix computations (e.g., Singular Value Decomposition), Hashing, Distance function calculations
NEAREST NEIGHBOR SEARCH
40
Key Data Structures, Types and Operations
A key data mining method used for discovering co-occurrence relations between variables
First proposed for identifying relationships between items purchased in retail stores: process known as market-basket analysis
Amazon’s “people who bought item x also bought items y and z”
Association rule mining has been applied to more complex data patterns such as sequences, trees, and graphs, with applications to bio-informatics, intrusion detection, and web-usage analysis
ASSOCIATION RULE MINING
41
Introduction
Formally, association rule mining processes a set D of transactions, where each transaction is a set of items.
Association rule: An implication of the form X ⇒ Y, where X is a set of some items and Y is an item not present in X.
An association rule X ⇒ Y in the transaction set D has support s if s% of transactions in D contain X ∪ {Y}
An association rule X ⇒ Y in the transaction set D has confidence c if c% of the transactions in D that contain X also contain Y.
The problem of association rule mining is to generate all association rules that have support and confidence greater than the user-specified support and confidence levels, minsup and minconf.
ASSOCIATION RULE MINING
42
Basic Idea
The association rule problem consists of two subproblems
Find all sets of items (itemsets) that have transaction support above minsup. These itemsets are called frequent itemsets.
Use the frequent itemsets to discover the desired rules: for every frequent itemset l, find all non-empty subsets a of l, and emit the rule a ⇒ (l − a) when it meets the minconf and minsup thresholds.
Use downward closure of itemsets: all subsets of frequent sets are frequent.
Implementations differ according to:
traversal strategies to identify candidate itemsets: depth-first or breadth-first search
computing the support values: direct counting vs. intersection
ASSOCIATION RULE MINING
43
Basic Algorithm
Both approaches use breadth-first traversal for itemset identification
Apriori exploits downward closure property to iteratively compute large candidate itemsets.
Makes multiple passes over raw data and uses a priori knowledge of infrequent item sets to reduce creation of unnecessary candidate itemsets.
The Partition algorithm is similar to Apriori, but requires only two passes over the underlying dataset.
Partitions the intermediate datasets into non-overlapping partitions.
ASSOCIATION RULE MINING
44
Apriori and Partition
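A toy Apriori sketch over a hypothetical transaction set; candidate generation relies on downward closure, and support counting is done naively for brevity:

    transactions = [{"bread", "milk"}, {"bread", "beer"},
                    {"bread", "milk", "beer"}, {"milk"}]
    minsup = 2   # absolute support threshold

    def apriori(transactions, minsup):
        # Level 1: frequent single items
        items = {i for t in transactions for i in t}
        freq = [{frozenset([i]) for i in items
                 if sum(i in t for t in transactions) >= minsup}]
        k = 2
        while freq[-1]:
            # Join frequent (k-1)-itemsets; downward closure guarantees every
            # frequent k-itemset is a union of frequent (k-1)-itemsets
            cands = {a | b for a in freq[-1] for b in freq[-1] if len(a | b) == k}
            freq.append({c for c in cands
                         if sum(c <= t for t in transactions) >= minsup})
            k += 1
        return [s for level in freq for s in level]

    print(apriori(transactions, minsup))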
These approaches use depth-first traversal for itemset identification and all of them use algorithmic optimizations for improving performance
FP-Growth uses a novel prefix-tree-like data structure (FP-tree) to compactly store candidate itemsets. This tree is traversed to compute larger itemsets.
Both Eclat and MaxClique view the itemset space under the downward closure property as a lattice and use lattice properties to prune candidate itemsets
Eclat uses equivalence classes and MaxClique uses maximal cliques in a hypergraph to identify potential maximal itemsets.
ASSOCIATION RULE MINING
45
FP-Growth, Eclat and MaxClique
Key Data Structures
Trees, Linked Lists, Arrays, Sets, Hash tables
Key Types
Integers, Single-precision Floats
Key Operations
Set union, intersection, Tree traversals, Hashing
ASSOCIATION RULE MINING
46
Key Data Structures, Types and Operations
Generate meaningful recommendations to a collection of users for items that might interest them, e.g., suggestions for books, movies, songs..
Uses a rating matrix whose cell r(x, y) represents the rating of user “x” for item “y”. The goal is to predict r(a, i) for an active user “a” and item “i”
Based on collective information on user ratings or content/profile information. Does not use association rules.
Three common approaches
Collaborative filtering uses past ratings of all users collectively
Content-based recommendation recommends items similar in content to items user has liked in past
Hybrid approaches that combine collaborative- and content-based approaches
RECOMMENDER SYSTEMS
47
Basic Idea
Neighborhood-based (memory-based) approaches choose a subset of users based on similarity to the active user and predict ratings as a weighted combination of their ratings (can be extended to match items to a user’s rated items)
Most common similarity measure is the Pearson correlation coefficient.
Similarity measures can be refined to reduce the impact of widely loved/disliked items using inverse user frequency
Model-based approaches provide recommendations by estimating parameters of statistical models for user ratings.
Latent factor/matrix factorization approaches aim to detect an underlying hidden lower-dimensional structure
Weighted non-negative matrix factorization aims to detect additive components of a user’s ratings
RECOMMENDER SYSTEMS
48
Collaborative Filtering
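A sketch of neighborhood-based prediction with Pearson similarity; the rating matrix is made up, and 0 marks an unrated item:

    import numpy as np

    R = np.array([[5., 3., 0., 1.],   # rows: users, columns: items
                  [4., 0., 0., 1.],
                  [1., 1., 5., 4.],
                  [0., 1., 5., 4.]])

    def pearson(u, v):
        mask = (u > 0) & (v > 0)      # co-rated items only
        if mask.sum() < 2:
            return 0.0
        return np.corrcoef(u[mask], v[mask])[0, 1]

    def predict(R, a, i):
        # Weighted combination of other users' mean-centered ratings of item i
        mean_a = R[a][R[a] > 0].mean()
        num = den = 0.0
        for u in range(len(R)):
            if u == a or R[u, i] == 0:
                continue
            w = pearson(R[a], R[u])
            num += w * (R[u, i] - R[u][R[u] > 0].mean())
            den += abs(w)
        return mean_a if den == 0 else mean_a + num / den

    print(predict(R, a=0, i=2))       # predicted rating of user 0 for item 2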
Exploits personal profile information of the user or information about the items: demographic or item genre information
Information Retrieval approaches analyze textual content
The user’s preferences are viewed as a query, and unrated documents are scored by their relevance/similarity to this query
Recommendation as a classification problem with item attributes as features and ratings as classes
Use Naive Bayes Classifier, Nearest-neighbor, Decision Trees or Neural Networks
RECOMMENDER SYSTEMS
49
Content-based Filtering
Merge ranked lists of recommendations produced by content-based and collaborative filtering methods
Content-boosted collaborative filtering uses content information to generate a full pseudo-rating matrix, with predictions calculated using a Naive Bayes classifier
Use content-based profiles, rather than co-rated items, to find similar users for the neighborhood-based collaborative filtering
Classification-based hybrid approaches
RECOMMENDER SYSTEMS
50
Hybrid Approaches
Key Data Structures
Sparse, dense matrices, Vectors
Key Types
Integers, Single-precision Floats
Key Operations
Matrix factorization, Matrix-Vector multiplications
RECOMMENDER SYSTEMS
51
Key Data Structures, Types and Operations
A system inspired by the biological network of neurons in the brain, which uses a connectionist information-processing model
Implemented as a system of interconnected simple computing nodes operating asynchronously in parallel whose function is determined by network structure, connection strengths, and processing executed on the computing nodes
A massively distributed system that acquires knowledge through a learning process and stores it using inter-neuron connection strengths, represented using synaptic weights
Key goals: Model complex relationships between input and output to infer results for novel inputs or find patterns in data. A neural network is usually designed for one of the following tasks:
Function approximation: regression analysis, prediction, time-series processing
Data processing and mining: filtering, clustering, classification of patterns and sequences
Decision making/Inferencing: systems control, robotics
Cognitive modeling: simulating and understanding neural activities
NEURAL NETWORKS
52
Introduction
A neuron receives a vector input x⃗, which is weighted and accumulated into a scalar value, the weighted sum Σ_i w_i·x_i, before transmitting to the receiver neuron.
The weighted sum is called the propagation function, and the set of weights represents the information storage of a neural network.
The neuron’s output is a non-linear activation function of the weighted sum: y = f(Σ_i w_i·x_i)
Multiple scalar outputs from different neurons in turn form a vector input for a neuron
NEURAL NETWORKS
53
Basic Idea
Neural networks are classified by
Underlying network topology
Feed-forward networks: Layers of neurons with connections to any of the next layers
Feedback networks: Can have cyclic connections
Completely linked networks: Symmetric connections between all neurons
Type of learning algorithm used
Unsupervised learning: no separate learning phase
Reinforcement learning: feedback on the response to the training data
Supervised Learning: training set includes input patterns and correct results
Type of input data: Categorical or Quantitative (numerical measurements)
NEURAL NETWORKS
54
Types of Neural Networks
A perceptron is the simplest feed-forward network with one input neuron layer connected to one or more trainable weight layers
Single-level perceptron (SLP) with an input neuron layer and only one trainable weight layer
Multi-level perceptron (MLP) has n variable weight layers and n+1 neuron layers, the first layer being the input layer. MLPs are usually trained using a supervised learning algorithm called back-propagation.
Forward propagation of the input values and backward propagation of the resulting errors (deltas)
The delta errors are used to update the weights
Radial Basis Function (RBF) Network: a three-layer network that uses a radial basis function: a real-valued function whose value depends only on the norm, usually the Euclidean distance
NEURAL NETWORKS
55
Perceptron Networks
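A minimal SLP sketch trained with the classic perceptron learning rule on the linearly separable AND function (the data and learning rate are illustrative):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])              # logical AND

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(20):                     # a few epochs suffice here
        for xi, ti in zip(X, y):
            out = 1 if w @ xi + b > 0 else 0   # step activation on the weighted sum
            w += lr * (ti - out) * xi          # adjust weights by the error
            b += lr * (ti - out)

    print([1 if w @ xi + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]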
Recurrent neural network is a class of networks where connections between neurons form a directed graph
Such networks can influence themselves by means of recurrent connections, e.g., by feeding outputs back into subsequent computational steps. They are dynamic systems with varying temporal behavior
Most common recurrent networks
Fully-connected network: Each neuron is connected to every other, with time-varying real-valued activations, and each connection has a modifiable real-valued weight
Hopfield network: Connections are symmetric
Elman and Jordan networks: Both are three-layer networks with an additional set of context units in the input layer. In Elman networks, the hidden layer is connected to the context layer with unit-weight edges; in Jordan networks, the output layer is connected to the context layer with unit-weight edges
Can be trained with either supervised or reinforcement learning
NEURAL NETWORKS
56
Recurrent Networks
ILLUSTRATION: NEURAL NETWORKS
57
Single-level Perceptron
Multi-level Perceptron
Recurrent Networks
“Networks of Artificial Neurons, Single-level Perceptrons”, J. Bullinaria
Class of neural networks focused on clustering or classification of the input datasets
Vector Quantization (VQ)
Unsupervised density estimators. Each neuron acts as a cluster whose center is defined by its codebook vector.
Uses a learning algorithm that finds the codebook vector closest to a training sample and updates it based on the learning rate. The VQ learning rule is an approximation of the K-Means algorithm
Self-organized Maps
Define a mapping from a set of potentially high-dimensional data onto a regular two-dimensional grid.
Each node is associated with a model (or a codebook), and a data item is mapped to the node most similar to it under some metric. The model is then updated using a learning algorithm that employs a neighborhood smoothing function around the data item. Upon convergence, the grid matches the similarity graph of the data items.
Learning Vector Quantization (LVQ)
Supervised version of Vector Quantization. The codebook vector of a neuron is assigned to one of the target classes.
Each element of the training sample is classified by finding the nearest codebook vector and is assigned to its class. On actual datasets, an LVQ network behaves like the nearest-neighbor search algorithm
NEURAL NETWORKS
58
Kohonen Neural Networks
Key Data Structures
Dense Matrices, Vectors
Key Data Types
Double-precision and Complex data
Key Operations
Matrix computations (e.g., Matrix multiplication, Inversion, Factorization), Activation functions (e.g., logistic, gaussian, RBF)
NEURAL NETWORKS
59
Key Data Structures, Types and Operations
A family of supervised learning methods, primarily used for classification and regression analysis
Originally designed for pattern recognition applications
Applications in a wide spectrum of domains:
bio-informatics (gene classification)
Medical Imaging (brain fMRI processing)
Text analytics (Text classification)
Time-series prediction (traffic modeling)
Financial modeling (stock market prediction)
SUPPORT VECTOR MACHINES
60
Introduction
Support Vector Machines (SVMs) aim to produce a model, based on the training data, that predicts the target values of test data given only the test data attributes.
Each training set item contains one target value and several attributes (features)
Each data point is mapped to a high-dimensional space. SVM constructs a set of hyperplanes to partition the space. A hyperplane is a set of points whose dot-product with a vector is constant.
The optimal hyperplane is the one that provides maximum margin separation between the data points using a subset of training data called support vectors.
Uses a combination of statistical learning along with optimization techniques.
SUPPORT VECTOR MACHINES
61
Basic Idea
Two broad classes based on whether data is linearly separable or not
If the data is linearly separable, one can partition data into 2 classes using a hyperplane
The SVM aims to find a separating hyperplane with the maximum distance from the training sample
This problem is formulated as an optimization problem, and can be solved using either quadratic or linear programming approaches
If the data is not linearly separable, kernel functions are used to map data to a higher dimensional space.
Examples of kernel functions include linear, polynomial, Radial Basis Function (RBF), and Sigmoid
Kernel functions compute inner-product in a high-dimensional feature space to determine the separating hyperplane.
SUPPORT VECTOR MACHINES
62
Core Algorithms
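As a usage sketch, scikit-learn's SVC exposes these kernels directly; the toy points below are made up:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
                  [2, 2], [2, 3], [3, 2], [3, 3]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = SVC(kernel="rbf", C=1.0)      # the kernel maps data to a higher-dimensional space
    clf.fit(X, y)
    print(clf.support_vectors_)          # the support vectors found
    print(clf.predict([[2.5, 2.5]]))     # -> [1]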
ILLUSTRATION: SUPPORT VECTOR MACHINES
63
Figure: candidate partitioning hyperplanes, the maximum-margin hyperplane, and the support vectors that define it.
Used for text classification in a very high-dimensional space, using distinct words as features and word counts as feature values
SVMs can analyze text documents natively using their constituent strings: the more substrings two documents have in common, the more similar they are.
Substrings need not be contiguous; the degree of contiguity is reflected in the weight
The string kernel maps strings to a feature vector indexed by all possible tuples of characters
An entry in the feature vector is non-zero if the corresponding tuple occurs as a substring; the weight reflects the frequency, contiguity, and length of the substring
The inner product of the feature vectors for two strings gives the sum over all common sub-sequences, weighted based on their frequency and length.
SUPPORT VECTOR MACHINES
64
String Kernel Methods
Key Data Structures
Sparse Matrices, Vectors
Key Data Types
Double-precision Floats
Key Operations
Matrix computations (e.g., Factorization, Matrix-Vector, Matrix-Matrix Multiplication), Kernel Functions (e.g., Linear, Sigmoid)
SUPPORT VECTOR MACHINES
65
Key Data Structures, Types and Operations
Decision tree learning is a class of supervised learning algorithms that use a tree-based model to represent decisions and their possible consequences
Encoding of all possible outcomes for a given problem scenario annotated with their conditional probabilities
Can be used for classification to predict values of categorical variables (e.g., yes or no) or continuous variables (e.g., amount of money a customer is willing to spend)
Important domains include marketing, fraud detection, medical diagnostics, manufacturing/production
DECISION TREE LEARNING
66
Introduction
Supervised learning uses a set of training class-labeled samples, where each sample is an n-dimensional feature vector associated with a class-label attribute (either categorical or continuous).
The learning phase generates a decision tree whose internal nodes represent conjunctions of features and whose leaves represent classifications.
Decision trees use an iterative top-down algorithm to split the data set, using a feature attribute at every step as the splitting parameter
Algorithms use heuristics called attribute selection measures or splitting rules to select the feature predicates. The most popular are:
Information Gain: Choose the attribute that yields the largest reduction in entropy (tends to favor attributes with many distinct values)
Information Gain Ratio: Choose the attribute that produces a good classification while normalizing away the bias toward many-valued attributes
Gini Index: Choose the attribute that most reduces the inequality (impurity) of the resulting distribution
DECISION TREE LEARNING
67
Basic Idea
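A sketch of the information-gain measure for a single categorical attribute; the weather-style feature and labels are hypothetical:

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(feature, labels):
        # Entropy before the split minus the weighted entropy after splitting
        # on each distinct value of the feature
        after = sum((feature == v).mean() * entropy(labels[feature == v])
                    for v in np.unique(feature))
        return entropy(labels) - after

    feature = np.array(["sunny", "sunny", "rain", "rain", "rain"])
    labels  = np.array(["no", "no", "yes", "yes", "no"])
    print(information_gain(feature, labels))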
Iterative Dichotomizer (ID3) and C4.5 are two decision tree algorithms that use entropy-based attribute selection measures.
Both employ a greedy approach to build a decision tree in a top-down recursive divide-and-conquer manner
Both algorithms require 3 parameters for building the tree: training set, set of attributes of the training vectors and the attribute selection heuristic
ID3 uses information gain measure for attribute selection, while the C4.5 uses the information gain ratio
Both suffer from the problem of overfitting noisy data. To address it, the tree is pruned to remove the least reliable branches
DECISION TREE LEARNING
68
ID3/C4.5
Classification and Regression Trees (CART) is a family of non-parametric recursive tree-building algorithms for predicting continuous dependent variables (regression) or categorical dependent variables (classification).
CART builds binary trees via top-down recursive partitioning of the dataset using different splitting criteria
Uses the least-squares deviation criterion for continuous variables and the Gini index for categorical variables
Uses a post-pruning approach to address over-fitting, based on a measure called cost-complexity: a function of the number of leaves and the error rate (the percentage of misclassified vectors)
DECISION TREE LEARNING
69
CART
The Chi-square Automatic Interaction Detector (CHAID) also uses a recursive tree-building process, creating a wide tree with multiple branches (potentially different at different levels). It can represent multiple categories well.
Internally, CHAID uses categorical variables. It first prepares the data by binning continuous values into categories.
The algorithm then uses the Pearson chi-square test (for categorical variables) and the F test (for continuous variables) to determine the statistical independence/significance of the data.
This information is used to determine the branching factor at internal tree nodes
If the significance level is above a certain threshold, new branches are created; otherwise, branches are merged
DECISION TREE LEARNING
70
CHAID
Key Data Structures
Binary and multi-way trees, vectors
Key Types
Double-precision floats, Integers
Key Operations
Tree traversals via dynamic programming, recursion
DECISION TREE LEARNING
71
Key Data Structures, Types and Operations
A time series is a sequence of observations reported according to the time of their outcome
Time-series analysis achieves one of two goals, both of which require the underlying pattern to be identified and modeled:
Analysis: Understand basic characteristics of the observed data
Forecasting: Model the observed data set and apply it for predicting future values based on known past values
Examples of time series data are prevalent in everyday life:
stock prices, weather reports, biometric data, utility consumption data,..
other domains include geology, social sciences, control systems, economics..
TIME SERIES PROCESSING
Introduction
72
Time-series data is not independent and identically distributed: its dispersion can vary in time, it can have cyclic components, and it is often governed by a trend.
Time-series data exhibits temporal ordering: observations closer in time are more related than observations taken further apart. An observation at a given time can be derived from past observations.
A time series can be viewed as a sequence of random variables y_t that can be individually decomposed into four components:

    y_t = T_t + Z_t + S_t + R_t

The trend T_t is a monotone function of time t
Z_t and S_t reflect long- and short-term non-random cyclic influences (Seasonality)
R_t represents random noise
Time series analysis involves identifying trends and seasonalities, and can be carried out in either the time or frequency domain.
TIME SERIES PROCESSING
Basic Idea
73
TIME SERIES PROCESSING
Trend Analysis
74
Trend Analysis captures the trend component of the time series
If the data is noisy, it needs to be smoothed before analyzing any trends
Smoothing involves some form of local averaging such that the irregular components of the individual observations cancel each other.
Simple or weighted average of n surrounding elements
For random errors, least-squares or exponential smoothing is applied
After smoothing, the monotonic (increasing or decreasing) component of the time series can be represented using linear or non-linear functions
TIME SERIES PROCESSING
Seasonality Analysis
75
The seasonality component captures the cyclic fluctuations in the data.
Seasonality can be measured by evaluating dependences between elements of a time series separated by a distance, or lag, k.
In time-domain analysis, auto-correlations and auto-covariances are the most commonly used measures of inter-dependence between time-series elements
High auto-correlation values at lag positions that are multiples of k expose a repeated pattern
Auto-correlation values for consecutive lags are inter-dependent: they suffer from serial dependencies
Partial auto-correlation calculations for a lag of 1 can be used to remove serial dependencies
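A sketch of the sample auto-correlation at lag k; the synthetic series below has period 12, so the value at lag 12 should be high and at lag 5 low:

    import numpy as np

    def autocorr(y, k):
        y = y - y.mean()
        return (y[:-k] * y[k:]).sum() / (y * y).sum()

    t = np.arange(200)
    y = np.sin(2 * np.pi * t / 12) + 0.1 * np.random.randn(200)   # period 12 plus noise
    print(autocorr(y, 12), autocorr(y, 5))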
TIME SERIES PROCESSING
Spectral Analysis
76
The frequency-domain (spectral) analysis of a time series aims to decompose the original time series into its cyclic components and compute their frequencies
A time series is represented using its harmonic components via two periodic sinusoidal functions
The problem can be viewed as linear multiple regression whose aim is to identify parameters (regression coefficients) that express the impact of different components
Computationally, spectral decomposition can be done using Fourier Transformations, usually implemented via Fast Fourier Transform (FFT).
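A minimal FFT sketch that recovers the dominant period of a synthetic series:

    import numpy as np

    n = 240
    t = np.arange(n)
    y = 3 * np.sin(2 * np.pi * t / 12)          # one cycle every 12 samples

    spectrum = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(n, d=1.0)            # frequencies in cycles per sample
    peak = freqs[spectrum.argmax()]
    print(1 / peak)                               # the recovered period (12)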
TIME SERIES PROCESSING
ARIMA
77
The auto-regressive integrated moving average (ARIMA) is widely used for understanding and forecasting stochastic processes represented in a time series data. This model is a combination of three different time-domain models.
Auto-regressive Model: A time series is viewed as a composition of random error and a linear combination of prior observations.
Stationary Model: The stationary model assumes that the time series has constant mean, variance, and auto-correlation over time.
Moving Average Model: The moving average model captures the white-noise error
Once the parameters are estimated, they can be used for predicting future values of the time series
Key Data Structures
Vectors, Sparse Matrices
Key Data Types
Integers, Double- and Single-precision Floats
Key Operations
Smoothing functions (e.g., moving average, least-squares, exponential), covariance calculations, FFT
TIME SERIES PROCESSING
78
Key Data Structures, Types and Operations
Text analytics covers computational approaches that process structured and unstructured text data to extract and present innate information.
Operates on a corpus of text documents, potentially in multiple languages and with noise
Goals of text analytics: derive new information from data, find patterns across datasets, and separate relevant contextual information from noise.
Text analytics is different from information retrieval, which aims to extract already known information from data
Multi-disciplinary field that uses techniques from statistics, natural language processing, linguistics, artificial intelligence, and data mining
Extensive use in daily life: web searches, email filtering, help-desk communications, online advertisements, etc.
TEXT ANALYTICS
79
Introduction
Main goals of text analytics: pre-process, categorize, classify, and summarize the input text corpus.
Two key phases:
Pre-processing: clean the raw text data and prepare for further analysis
Parsing, Stemming, Whitespace elimination, Stopword removal, Synonym identification..
Data structure initialization: Most common data structure is the term-document matrix.
Text is processed as a bag of words (token order irrelevant)
Element (i, j) represents a weighting of term j for document i. Common weightings include binary weights to denote inclusion/exclusion, and inverse frequency weighting, which gives more weight to less frequent terms
The term-document matrix is sparse, and is processed in compressed format.
TEXT ANALYTICS
80
Basic Idea
Count-based analysis: Views word frequencies as a measure of importance. These can then be used for computing associations between different terms via the frequencies of their co-occurrences.
Text clustering: Enables (semi-)automatic categorization of text documents based on a certain similarity measure. The term-document matrix can be viewed as a high-dimensional representation of the text, and clustered using measures such as metric or cosine distances.
Text classification: Organizes text documents into pre-defined groups (unlike clustering). Uses metrics similar to those in the clustering approach.
Semantic analysis: Extracts and represents knowledge from a text corpus, without any prior information.
Sentiment and Topic analysis: Sentiment analysis aims to detect the tone of a document by applying natural language processing techniques to the text data. Topic analysis uses various transformations of the term-document matrix to identify hot topics in the input text corpus.
TEXT ANALYTICS
81
Key Operations
An example of a supervised text classifier. Each document is associated with a class, and the target class is determined using the document’s words.
The Naive Bayes classifier uses the naive Bayes learning method to generate the classifier. It uses the training set to estimate the parameters of the generating model.
The naive Bayes method uses document words as training samples and assumes that they are mutually independent. Two models are used to generate the training data
In the multi-variate Bernoulli event model, a document is represented by a binary vector indicating which words occur or do not occur in the document
In the multinomial model, the document is an ordered sequence of word events and uses word frequency information.
Time complexity of the naive Bayes classifier is linear for both testing and training.
TEXT ANALYTICS
82
Naive Bayes Classifier
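A usage sketch of the multinomial model with scikit-learn; the tiny corpus and labels are made up:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["cheap pills buy now", "meeting agenda attached",
            "buy cheap watches", "project status meeting"]
    labels = ["spam", "ham", "spam", "ham"]

    vec = CountVectorizer().fit(docs)            # word-frequency (multinomial) features
    clf = MultinomialNB().fit(vec.transform(docs), labels)
    print(clf.predict(vec.transform(["cheap meeting pills"])))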
Latent semantic analysis (LSA) is a method for extracting innate semantic meaning, as approximated via the contextual usage of words.
Infers relations between different words, words and passages, words and documents, different documents, etc.
Uses a term-document matrix representation of the input text corpus
Columns can be passages or documents
Matrix values can be either tf-idf (term frequency-inverse document frequency), where the weight is inversely proportional to the document frequency, or the logarithm of the term frequency
The key operation in LSA is a reduced-rank singular value decomposition of the term-document matrix, creating a low-rank approximation using the top k singular values
Term-document analysis can then be carried out using a dot-product or cosine-similarity metric on the resulting row and column vectors
TEXT ANALYTICS
83
Latent Semantic Analysis
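A rank-k LSA sketch via truncated SVD in NumPy, on a hypothetical 5-term by 4-document count matrix:

    import numpy as np

    A = np.array([[2., 0., 1., 0.],   # rows: terms, columns: documents
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 3., 1.],
                  [1., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                           # keep the top-k singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # low-rank approximation of A

    # Document-document cosine similarity in the k-dimensional latent space
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1)
    print((docs @ docs.T / np.outer(norms, norms)).round(2))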
A family of unsupervised learning algorithms that view an object via a parts-based additive representation that uses only non-negative weights.
NNMF is suited for classification problems in text analysis, such as sentiment (topic) analysis
A document can be viewed as a set of words combined with their occurrences. NNMF can automatically discover a hidden classification using the topics as a classifier, where each topic is characterized by a set of words.
Core operation: Given a non-negative matrix V(w, d), find non-negative matrix factors W(w, c) and H(c, d) such that V ≈ W·H
Elements of H indicate which documents belong to which cluster, and elements of W indicate the degree to which a word belongs to a cluster
TEXT ANALYTICS
84
Non-negative Matrix Factorization
Key Data structures
Inverted indexes, Strings (character arrays), Sparse Matrices, Lists
Key Types
Characters, Strings, Integers, Double-precision Floats
Key Operations
String Operations (e.g., parsing, substring matching), Similarity and Scoring (e.g., distance computations), Matrix computations (e.g., Factorization), Set union and intersection
TEXT ANALYTICS
85
Key Data Structures, Types and Operations
A class of algorithms that employ repeated statistical sampling to compute approximate solutions to quantitative problems.
Model applications with inherent uncertainty, e.g., pricing of various financial instruments
Simulating systems with multiple degrees of freedom, e.g., simulating behaviors of different materials
Solving deterministic problems with infeasible computational requirements, e.g., solving high-dimension definite integrals
Applications include financial engineering, molecular modeling, process engineering, computational numerical analysis
MONTE CARLO METHODS
86
Introduction
For a problem, the Monte Carlo (MC) approach repeatedly generates independent, identically distributed random variables from the same distribution as the problem, and then uses a deterministic or stochastic model to compute the solution.
In the simplest formulation, the MC approach uses statistical sampling to compute a numerical integral
Standard error of a MC estimation decreases with the square root of the sample size
Standard error is independent of the dimensionality of the integral
Amount of work does not increase exponentially in the number of dimensions
To improve estimation quality, a large number of samples are needed
Variance reduction methods such as importance sampling are used to improve efficacy of the approach
MONTE CARLO METHODS
87
Basic Idea
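A minimal sketch of the idea on a toy problem: estimating the integral of x² over [0, 1] (exact value 1/3) by statistical sampling, where the error shrinks roughly as 1/√n:

    import numpy as np

    rng = np.random.default_rng(42)      # pseudo-random number generator
    for n in (100, 10_000, 1_000_000):
        x = rng.random(n)                # i.i.d. samples from the uniform distribution
        print(n, (x ** 2).mean())        # the sample mean approximates the integral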
Key steps in any Monte Carlo approach
Identify the probability distribution that mimics the problem under consideration (e.g., normal distribution for option pricing)
Generate samples from the probability distribution function using a pseudo-random number generator
Pass the sample values through a deterministic or stochastic model to get the final result
Irrespective of the type of approach, all MC implementations require good pseudo-random number generators
MONTE CARLO METHODS
88
Methodology
A class of computational algorithms that can generate a sequence of numbers that mimic random numbers (PRNGs)
Require a seed number as its initial state
Generated numbers repeat after a certain time (period). The length of the seed bounds the period: for an n-bit seed the period is always less than 2^n
Most algorithms use bit manipulation and shuffling (e.g., multiply-with-carry, xor-shift), combined with recurrence strategies over relatively prime numbers, to generate pseudo-random numbers
MC methods require PRNGs with large periods
The Mersenne Twister generates integers with uniform distribution, with a period chosen to be one of the Mersenne prime numbers. For example, the most commonly used variant, MT19937, generates 32-bit pseudo-random integers over [0, 2^32 − 1] with a period of 2^19937 − 1
MONTE CARLO METHODS
89
Pseudo-Random Number Generators
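Since xor-shift generators are mentioned above, here is a sketch of Marsaglia's 32-bit xorshift; its period of 2^32 − 1 is far too short for serious MC work, which is why large-period generators such as MT19937 are preferred:

    def xorshift32(seed):
        # Bit shifts and XORs over a nonzero 32-bit state; period 2^32 - 1
        x = seed & 0xFFFFFFFF
        while True:
            x ^= (x << 13) & 0xFFFFFFFF
            x ^= x >> 17
            x ^= (x << 5) & 0xFFFFFFFF
            yield x

    gen = xorshift32(2463534242)
    print([next(gen) for _ in range(3)])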
Key Data Structures
Bit-vectors, Vectors
Key Data Types
Integers, Bit representations
Key Operations
Bit manipulations (e.g., bit shifting, shuffling), modulo function
MONTE CARLO METHODS
90
Key Data Structures, Types and Operations
In mathematical programming or optimization, one seeks to find an optimal solution for a problem defined by its constraints using a mathematical formulation
A solution aims to minimize or maximize an objective function of real or integer variables, subject to constraints on the variables
Applied to cases where a closed-form solution is not easily found and one has to settle for the best available solution.
Forms the cornerstone for operations research and related disciplines like industrial engineering, social sciences, economics..
Applied to domains such as scheduling, manufacturing, supply-chain,..
MATHEMATICAL PROGRAMMING
91
Introduction
A mathematical program is an optimization problem of the form: maximize f(x) subject to x ∈ X, g(x) ≤ 0, h(x) = 0, where X is the domain of the functions f, g, and h, which map into real spaces.
The relations g(x) ≤ 0 and h(x) = 0 are called constraints
The function f is called the objective or cost function, and the domain X of the objective function is called the search space
A point x is feasible if it satisfies the constraints; feasible points are also called candidate solutions. A point x* is optimal if it is feasible and the value of the objective function is no worse than that of any other feasible point x
The problem can be framed as either a maximization or a minimization problem
In practice, different approaches are classified based on the properties of the objective function, the constraints, and the candidate solutions.
MATHEMATICAL PROGRAMMING
92
Basic Idea
A special case of convex programming where the objective function f is both linear and convex, and the associated constraints are specified using only linear inequalities. Canonically, linear programs are expressed in matrix form as: maximize c^T·x subject to Ax ≤ b, x ≥ 0
The set of constraints, Ax ≤ b, forms a convex polytope, and any solution traverses its vertices to find a point (candidate solution) where the objective function has the maximum (minimum) value, if such a point exists.
Traditional Approaches:
Simplex method: constructs a feasible solution at a polytope vertex and then traverses the polytope edges to search for an optimized solution. Exponential worst-case complexity; in practice, performs much better
Ellipsoid method: Average-case polynomial algorithm that uses an iterative approach to generate sequences of ellipsoids (practical performance is closer to the worst-case scenario)
Interior point method: Uses a set of feasible points lying in the interior of the polytope, via projection. Polynomial time in both worst and average cases.
Barrier approaches: Aim to minimize the traversal trajectory using a logarithmic barrier function
MATHEMATICAL PROGRAMMING
93
Linear Programming
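A usage sketch with SciPy's linprog, which minimizes, so the objective is negated to maximize; the coefficients are made up:

    from scipy.optimize import linprog

    c = [-3, -5]                     # maximize 3x + 5y  ==  minimize -3x - 5y
    A = [[1, 0], [0, 2], [3, 2]]     # constraint matrix for Ax <= b
    b = [4, 12, 18]

    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)           # optimal vertex and objective value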
ILLUSTRATION: LINEAR PROGRAMMING
94
Figure: edge traversal of a polytope defined by a set of inequalities (2 variables, 5 inequalities), showing the feasible space and the optimal solution.
A linear programming formulation where the unknown variables are integers. If only some of the variables are integers, the problem is called Mixed-Integer Programming (MIP).
IP formulations are generally NP-Hard, and are solved using two basic approaches: cutting plane and branch-and-bound.
The cutting plane approach first solves the “relaxed linear” form of the problem. If the solution is integer, then we are done. Else, the problem is reformulated using this solution and the process is repeated.
Branch-and-bound is a general class of search techniques that enumerate feasible solutions and prune them using upper and lower bounds of the cost function being optimized.
The branch-and-cut approach first solves the relaxed linear form of the problem using the cutting plane method; the solution space is then pruned using the branch-and-bound approach.
MATHEMATICAL PROGRAMMING
95
Integer Programming (IP)
Combinatorial optimizations cover methods that aim to optimize a cost function by selecting a subset of objects from the input set
Path traversals, Flow/circulation, cliques, packing, scheduling,..
Unlike LP problems, the feasibility space is not convex. Two key approaches:
Approximation algorithms that find a solution provably close to the optimal in polynomial time.
Suited for NP-Hard optimization problems like bin-packing.
Heuristics that search the feasibility space in a reasonable time to compute potentially sub-optimal solutions.
Greedy, Simulated Annealing, Search algorithms with backtracking,..
MATHEMATICAL PROGRAMMING
96
Combinatorial Programming
Constraint satisfaction/optimization problems (CSPs) cover problems with a constant objective function and a set of constraints that impose conditions on the solution variables.
Occur in scenarios that require solutions from combinatorial, logic programming or artificial intelligence
A CSP solution is an assignment to every variable such that every constraint is satisfied.
Boolean satisfiability, graph coloring, resource allocation and scheduling
Most CSP algorithms use depth-first search to traverse the search tree over different alternative solutions
Tree traversal heuristics can exploit problem structure and perform search in any order. Most algorithms use depth-first search with backtracking
The most widely used CSP search algorithm, A*, uses a best-first strategy with a cost function to choose the next search node to traverse.
MATHEMATICAL PROGRAMMING
97
Constraint Optimization
Key Data Structures
Sparse and Hyper-sparse Matrices, Trees
Key Data Types
Double-precision (and higher) floats
Key Operations
Sparse matrix computations (e.g., matrix-vector multiplication, factorization), Tree traversals
MATHEMATICAL PROGRAMMING
98
Key Data Structures, Types and Operations
Online Analytical Processing (OLAP) is the key business intelligence (BI) technology for solving decision support problems like reporting, financial planning, budgeting/forecasting
Broad class of analytics techniques that process historical data using a logical multi-dimensional data model
OLAP usually works on data warehouses: very large, subject-oriented, integrated, time-varying, historical collections of data
OLAP queries involve complex operations (e.g., aggregation or grouping) over large datasets
ON-LINE ANALYTICAL PROCESSING
99
Introduction
OLAP queries evaluate relationships within the underlying data over multiple attributes or dimensions: attributes can be independent or related via parent-child relationships
OLAP data model views data as multi-dimensional cubes
A cube is organized around a central theme, e.g., car sales
Theme captured by one or more numeric measures or facts (e.g., number of cars sold)
Measures are associated with a set of independent dimensions that provide the context. Each measure value is associated with a unique combination of dimension values; thus a measure value can be viewed as an entry in a cell of a multi-dimensional cube.
Each dimension can be further characterized by attributes related via hierarchical parent-child relationships (e.g., the time dimension has the hierarchy year, month, day, …). A dimension can be associated with multiple hierarchies.
The parent-child hierarchy enforces the order of summarization via aggregation: measure values associated with parents are computed via aggregation of measures of its children.
ON-LINE ANALYTICAL PROCESSING
100
Logical Model
Two main goals:
Reporting: Organize dimensions and perform computations on corresponding measures.
Presentation: Select dimensions and measure from the original or computed forms of data, and prepare them for display
Functional classification: what-now (post-mortem analysis), what-if (prediction), and what-next (forecasting)
Key operators
Group-by: Collates the measures per the unique values of the specified dimensions
Slice and dice: Reduces the dimensionality of the data by projecting it onto a subset of dimensions for selected values of the remaining dimensions
Pivoting: Re-orients the original cube to visualize the data using new relationships
Rollup and Drill-down: Support aggregation across hierarchies over one or more dimensions. Rollup computes totals by aggregating sub-totals at increasingly coarse granularity; drill-down decomposes a summarized value into the contributions of its child hierarchies and dimensions.
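A minimal Python sketch of the group-by and rollup operators over a toy (year, month, sales) fact stream, assuming a single time hierarchy; the names and data are hypothetical.

```python
# Group-by followed by a one-level rollup over a (year, month, sales)
# fact stream; hypothetical toy data.
from collections import defaultdict

facts = [("2013", "Jan", 10), ("2013", "Feb", 20), ("2014", "Jan", 5)]

# Group-by (year, month): collate measures per unique dimension values.
monthly = defaultdict(int)
for year, month, sales in facts:
    monthly[(year, month)] += sales

# Rollup: aggregate monthly sub-totals up the time hierarchy to years.
yearly = defaultdict(int)
for (year, _month), total in monthly.items():
    yearly[year] += total

print(dict(monthly))  # {('2013', 'Jan'): 10, ('2013', 'Feb'): 20, ('2014', 'Jan'): 5}
print(dict(yearly))   # {'2013': 30, '2014': 5}
```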
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
101
OLAP Queries
Relational OLAP (ROLAP)
Data stored as records in relational tables and queried using SQL-based OLAP queries
Two types of tables: fact tables for storing measures and dimension tables for storing hierarchical dimensions. Tables are linked via referential relationships; materialized views store aggregated values.
The most common approach, the star schema, uses a single fact table and multiple dimension tables, one per dimension.
Multi-dimensional OLAP (MOLAP)
Data stored and processed as multi-dimensional arrays backed by specialized data structures
OLAP queries use languages that can express data access patterns via the multi-dimensional array model; for example, Microsoft's MDX provides explicit syntactic support for navigating dimensions
Hybrid OLAP (HOLAP)
Uses a combination of multi-dimensional and relational strategies
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
102
OLAP Server Implementations
ON-LINE ANALYTICAL PROCESSING
Analytics Exemplars
103
Key Data structures
Relational tables, Prefix trees, Linear arrays, Hash tables, Key-value pairs
Key Data Types
Strings, Integers, Double Precision Floats
Key Operations
Joins (hash and sort), Grouping, Multi-attribute ordering, OLAP operators such as CUBE, ROLLUP, DRILLDOWN, Aggregation operators (e.g., Sum, Min, Max)
Key Data Structures, Types and Operations
Graphs and related data structures (e.g., trees and directed-acyclic graphs (DAGs)) form the fundamental tools used for expressing and analyzing relationships between entities.
Relationships modeled by graphs include
Association, Hierarchies, Sequences, Positions and Paths
Applications in a diverse array of application domains
Biology, Chemistry, Pharmacology, Linguistics, Economics, Operations Research, etc.
GRAPH ANALYTICS
Analytics Exemplars
104
Introduction
A graph G = (V, E) is described by a set of vertices V and a set of edges E.
The vertices or edges can be weighted, and edges can be either directed or undirected. Graph vertices can have additional information such as color.
Graph analytics refers to a class of techniques that either use graph models to solve a problem (e.g., traveling salesperson) or analyze and exploit the inherent graph structure of a problem.
Broad classification into three overlapping categories:
Structural algorithms, Traversal algorithms, and Pattern-matching algorithms
GRAPH ANALYTICS
Analytics Exemplars
105
Basic Idea
Structural algorithms, commonly known as network analysis algorithms, analyze symmetric and asymmetric relationships between networked entities by exploring the structure of the underlying graph.
Two types of networks:
Small-world networks: the distance between two randomly chosen vertices grows proportionally to the logarithm of the total number of vertices.
Scale-free networks: the distribution of vertex degrees follows a power law. Many small-world networks exhibit scale-free properties, but the reverse is not true.
Network analysis algorithms understand and exploit the inherent structural properties of a graph: order, size, degree, distance, diameter, chromatic number, centrality (degree, closeness, betweenness, eigenvector).
Key example: the eigenvector centrality algorithm used for ranking linked web pages (e.g., the PageRank algorithm) and for analyzing fMRI brain scans.
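A minimal power-iteration sketch of PageRank-style eigenvector centrality, assuming every vertex has at least one out-link (dangling pages are omitted for brevity); the toy web graph is hypothetical.

```python
# Power-iteration sketch of PageRank-style eigenvector centrality on an
# adjacency-list graph; assumes no dangling vertices (hypothetical data).

def pagerank(out_links, damping=0.85, iters=50):
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        new_rank = {v: (1.0 - damping) / n for v in out_links}
        for v, targets in out_links.items():
            share = damping * rank[v] / len(targets)  # split rank evenly
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# A tiny 3-page web: page "a" is linked by both other pages.
web = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(pagerank(web))  # "a" receives the highest rank
```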
GRAPH ANALYTICS
Analytics Exemplars
106
Structural Algorithms
Operate on graphs that capture either the structure of an underlying physical network (e.g., roads, pipes) or an abstract model of a problem
Four classes of traversal problems:
Route problems: optimize path lengths under different constraints
Flow problems: optimize material flow over a network represented by underlying directed graph
Coloring problems: assign labels to graph vertices such that given constraints are satisfied
Searching problems: find a solution by traversing vertices that encode problem states
Notoriously difficult to solve exactly: many are NP-complete, so heuristics are used extensively.
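Breadth-first search, the basic building block behind many of these traversal problems, can be sketched in a few lines; the adjacency list below is hypothetical.

```python
# Minimal breadth-first search over an adjacency list, returning hop
# distances from a source vertex (hypothetical toy data).
from collections import deque

def bfs_distances(adjacency, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            if n not in dist:            # visit each vertex only once
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_distances(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```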
GRAPH ANALYTICS
Analytics Exemplars
107
Traversal Algorithms
Focus on finding different patterns in an input graph: cycles, cliques, sub-graphs with certain properties, and network motifs.
The generalized combinatorial problems of enumerating or identifying structural patterns are NP-complete; pattern-matching algorithms therefore either rely on heuristics or solve constrained versions in polynomial time (one such polynomial special case, triangle counting, is sketched after the applications list below).
Most use traversal-based solutions, but some pattern-matching algorithms use matrix algorithms (e.g., spectral clustering).
Key applications:
Clique determination for analyzing social networks, financial networks
Motif identification to diagnose diseases like Alzheimer’s or Schizophrenia
Epidemiological analysis to understand spread of diseases like AIDS or SARS
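The triangle-counting sketch promised above: counting 3-cliques is a constrained pattern-matching problem that stays polynomial even though general clique detection is NP-complete. The graph data is hypothetical and the enumeration is deliberately naive.

```python
# Naive triangle (3-clique) counting over an undirected graph stored
# with both edge directions; hypothetical data, O(V^3) enumeration.
from itertools import combinations

def count_triangles(adjacency):
    edges = {(u, v) for u in adjacency for v in adjacency[u]}
    count = 0
    for a, b, c in combinations(sorted(adjacency), 3):
        # A triangle requires all three pairwise edges to be present.
        if {(a, b), (b, c), (a, c)} <= edges:
            count += 1
    return count

g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(count_triangles(g))  # 1 (the triangle 0-1-2)
```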
GRAPH ANALYTICS
Analytics Exemplars
108
Pattern-matching Algorithms
Key Data Structures
Adjacency/Incidence list, Sparse Matrix, Trees, Queues, Lists
Key Types
Integers, Double- and Single-precision Floats
Key Operations
Tree and Graph Traversals (Depth-first, Breadth-first search,..), Matrix computations (e.g., Eigenvalue solvers, Sparse Matrix-Vector Multiplication, Non-negative Matrix Factorization)
GRAPH ANALYTICS
Analytics Exemplars
109
Key Data Structures, Types and Operations
Accelerating Analytics
110
Analytics workloads often consist of one or more components with different functional goals, runtime requirements, algorithms, and data features
The same analytical model can be used by different applications in varying environments
Linear programming used for scheduling algorithms or as a kernel in Support Vector Machines
A model can use different algorithms under different contexts
A clustering model can use either a hierarchical or a k-means algorithm; logistic regression can be used for regression or for classification
An analytical algorithm can use different implementations of a kernel depending on the runtime constraints
In practice, systems are integrating databases, analytics, and high-performance computing operations
Analytics-specific optimizations must co-exist or should be integrated with other system features
ANALYTICS WORKLOADS ISSUES
Accelerating Analytics
111
COMPUTATIONAL PATTERNS OF ANALYTICS WORKLOADS
Accelerating Analytics
112
Key computational patterns
Linear algebraic formulations over vectors and sparse matrices
Operations on higher-dimensional data structures
Operations on sets, strings, and hierarchically structured data
Probabilistic algorithms
Common Operations
Matrix factorizations, Matrix-Matrix and Matrix-Vector Multiplications, Transpose, Eigensolvers
Tree/Graph traversals, Hash table queries
Dynamic programming, Greedy Algorithms
Set union/intersection, Grouping/ordering of multi-dimensional elements
BEHAVIOR OF ANALYTICS WORKLOADS: COMPUTATIONAL CHARACTERISTICS
Accelerating Analytics
113
Key data types
Integers, Double-precision and Complex floats, Strings
Common data structures
Sparse matrices, Vectors, Higher-dimensional trees, Prefix trees, Relational tables, Graphs, Hash tables, Bit vectors, Inverse indexes, Adjacency/incidence lists, Lists, Queues
Common functions and operations
Numerical (e.g., log, sqrt, sine, cosine) and bit-level (e.g., shift, mask) operations
Distance and scoring functions (e.g., Euclidean, Hamming, Minkowski; see the sketch after this list)
Statistical functions (e.g., Gaussian, Logistic, Spline)
Smoothing and String functions
Matrix Operations: Factorization, Multiplication, Linear Solvers
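The distance functions referenced in this list are small enough to sketch directly; their inner loops are typical targets for SIMD acceleration. A minimal, illustrative version:

```python
# Two of the distance functions named above, in plain Python.
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(hamming("karolin", "kathrin"))      # 3
```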
BEHAVIOR OF ANALYTICS WORKLOADS: COMPUTATIONAL CHARACTERISTICS
Accelerating Analytics
114
RUNTIME CHARACTERISTICS OF ANALYTICS WORKLOADS
Accelerating Analytics
115
Most workloads are read-only
Exhibit both iterative and non-iterative execution methodologies
Most operate in batch mode
Time-series processing has real-time constraints
Operate on different types and sizes of data
Input data can be purely in-memory or stored on disks
Can be stored in relational tables, files or streams
Can be structured (e.g., relational tables, matrices), semi-structured (e.g., graphs) or unstructured (text)
Output size is usually smaller than the input size
Exceptions: association rule mining and OLAP
BEHAVIOR OF ANALYTICS WORKLOADS: RUNTIME CHARACTERISTICS
Accelerating Analytics
116
Purely Compute-bound Workloads
Mathematical Programming and Monte Carlo Methods (see the Monte Carlo sketch after this classification)
Compute-bound and Network/Memory-bound
Time-series Processing
Compute-bound (in-memory), I/O-bound (disk-based)
Text Analytics, Regression Analysis, Clustering, Nearest Neighbor, Neural Networks, Support Vector Machines, Recommender Systems
Memory-bound (in-memory), I/O-bound (disk-based)
OLAP, Graph Analytics, Text Analytics, Decision Tree Learning, Association Rule Mining
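As promised above, a minimal Monte Carlo sketch (estimating pi) illustrates why such methods are purely compute-bound and embarrassingly parallel; the sample count is arbitrary.

```python
# Monte Carlo estimation of pi: sample points in the unit square and
# count how many land inside the quarter unit circle.
import random

def estimate_pi(samples):
    hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
               for _ in range(samples))
    return 4.0 * hits / samples

print(estimate_pi(1_000_000))  # approximately 3.14
```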
ANALYTICS WORKLOADS: PERFORMANCE CLASSIFICATION
Accelerating Analytics
117
Updatable global execution state
Non-contiguous mostly read-only data access
Gather-scatter
Inter-iteration dependencies
Collective reductions (e.g., aggregation; see the sketch below)
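The collective-reduction pattern can be sketched with a process pool; the 4-way partitioning below is an arbitrary choice and partial_sum is a hypothetical helper.

```python
# Collective reduction (aggregation) sketch: partition the data,
# compute partial sums in parallel workers, then combine the results.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]       # 4-way partition
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))  # final reduction
    print(total)  # 499999500000
```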
ANALYTICS WORKLOADS: KEY PARALLEL EXECUTION PATTERNS
Accelerating Analytics
118
Due to their irregular computation patterns, shared-address-space parallelization is better suited for analytics
Key exceptions: time-series processing, Monte Carlo simulation, and text analytics can be parallelized using distributed-memory approaches
OpenMP-style loop parallelization is not widely useful
For parallelizing out-of-core workloads, GraphLab/Spark-like approaches are more applicable than MapReduce because of the workloads' iterative nature
MapReduce is very effective for parallelizing some text analytics workloads
MapReduce is not suited for implementing numerical linear algebraic kernels
Analytics workloads are more amenable to specific functional acceleration
PARALLELIZATION OF ANALYTICS WORKLOADS
Accelerating Analytics
119
ACCELERATION AND PARALLELIZATION OPPORTUNITIES
Accelerating Analytics
120
Foundational accelerators
Compute Accelerators: SIMD, GPU, FPGA, ASIC
Memory Accelerators: Gather-scatter accelerator, GPU Texture memory
Network Accelerators: Packet evaluation/routing acceleration
I/O Accelerators: Active Storage, Solid State Drives
Used to construct higher-order accelerators
ANALYTICS ACCELERATORS
Accelerating Analytics
121
Functional Accelerators
Pattern Matching/Regex (FPGA)
Compression/Decompression, Encoding/decoding (FPGA, GPU, and SIMD)
Random Number Generators (FPGA, GPU, and SIMD)
Distance Metric Calculations (SIMD and GPU)
Smoothing Functions (SIMD and GPU)
Aggregation (SIMD and GPU)
FFT, Sort, Histogram,… (SIMD and GPU)
Matrix Computing (SIMD and GPU)
CLASSIFYING ANALYTICS ACCELERATORS (1)
Data Structure Accelerators
Hash tables
Bloom Filters
K-D, R, ANN Trees
Inverse Index (posting lists)
Prefix/suffix Trees
Key-value Pair
Graph Traversals (BFS, DFS)
Accelerating Analytics
122
System Accelerators
Garbage Collection Acceleration
Virtual Memory Compression
Software-defined Networking
Visualization
MapReduce Shuffle Accelerators
CLASSIFYING ANALYTICS ACCELERATORS (2)
Workload Accelerators
XML Parsing/XPath/XQuery Accelerators
Network Accelerator for Intrusion Detection, Routing etc.
Data Warehousing Acceleration
Image Tagging and Classification Acceleration
Speech Processing Acceleration
Bio-informatics Acceleration
Accelerating Analytics
123
Memory accelerators for random memory accesses
Support for scatter/gather, indirection based accesses
Application-specific memories: Two-dimensional memories, Composite types (as supported by GPU Texture Memories)
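A software analogue of the gather/scatter pattern this wish-list item targets, sketched with NumPy fancy indexing; the arrays are hypothetical.

```python
# Gather and scatter via indirection, the access pattern the memory
# accelerators above would support in hardware (hypothetical data).
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([3, 0, 0, 2])

gathered = data[idx]                 # gather: out[i] = data[idx[i]]
print(gathered)                      # [40. 10. 10. 30.]

out = np.zeros(4)
np.add.at(out, idx, gathered)        # scatter-add: out[idx[i]] += vals[i]
print(out)                           # [20.  0. 30. 40.]
```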
Extensions to SIMD
Wider SIMD, with newer intrinsic functions (e.g., distance metric or similarity calculations)
Support for higher-order data structures (e.g., arrays)
Analytics-specific function accelerators
Random Number Generators
String Processing Acceleration (e.g., Sub-string matching, Multiple string comparison)
OLAP Aggregation and Grouping Accelerator
Support for “Matrix Acceleration”
Unified accelerator for improving computing and memory accesses
Designed for 2- and 3-dimensional dense and sparse arrays
ARCHITECTURAL WISH-LIST FOR ON-CHIP ANALYTICS ACCELERATORS
Accelerating Analytics
124
HPC AND ANALYTICS: SIMILARITIES AND DIFFERENCES
Accelerating Analytics
125
Both HPC and Analytics applications use mathematical formulations to solve the problem at hand: both make extensive use of matrix-based linear algebraic kernels
Both HPC and Analytics require extensive visualization
HPC applications usually have a single-domain focus (e.g., seismic processing). Most HPC applications can be viewed as information-extraction processes.
Traditional HPC applications are compute-bound for in-core processing
Single workflow to address domain-specific functional and runtime goals
Most analytics applications have a multi-domain focus (e.g., retail analysis). Most analytics applications can be viewed as information-integration processes.
Several independent workflows with potentially different domain-specific functional and runtime goals
May not be compute-bound for in-core processing
DATA MANAGEMENT AND ANALYTICS: SIMILARITIES AND DIFFERENCES
Accelerating Analytics
126
Business Intelligence (Data Warehousing/OLAP) shared across two domains
Data management over streaming data has similarities with time-series processing
Unstructured/semi-structured data processing (e.g., XML, RDF) share algorithms with text analytics and graph analytics
Relational data is used as the primary source for different analytics applications
In-memory data warehousing data layout and data structures applicable to analytics workloads: compressed columnar storage, spatial data structures
Transactional processing is completely distinct from analytical applications (analytics requires no isolation modes or two-phase commit)
Transaction-oriented disk layouts and index structures (e.g., B+ trees) are not useful for analytics applications
An ideal integrated system needs to support the following capabilities:
Balanced support for computation, memory system, networking, and I/O
Integrated with the data-centric ecosystem: transactional databases, data warehouses, text repositories, streams
System software focus on information integration, not just on computational performance
Should support scalable data structures like hash tables, trees, linked lists, and bit vectors (in addition to conventional matrices)
Should support different data layout and access strategies
A traditional scalable HPC system may not be a good analytics system. However, a well-balanced, flexible, integrated analytics system can serve as a scalable HPC system.
INTEGRATING HPC, ANALYTICS AND DATA MANAGEMENT: SYSTEMS ISSUES
Accelerating Analytics
127
BRINGING IT ALL TOGETHER
Accelerating Analytics
128
[Figure: diagram positioning analytics workloads across the overlapping Analytics, HPC, and Data Management domains; labels include Graph Analysis, Visualization, OLAP, RDF, Transaction Processing, Streams, Text Analytics, Regression Analysis, Modeling/Simulation, Mathematical Programming, Time-Series Processing, Clustering, Rule Mining, Nearest-neighbor Search, Support Vector Machines, Neural Networks, and Decision-tree Learning.]
SUMMARY
129
Analytics is an exciting new field that uses mathematical formulations to solve problems in diverse domains
An analytics workload usually consists of multiple components, each with its distinct runtime and functional goals
Analytics workloads usually employ one of a set of key models (exemplars)
Multiple exemplars share many computational and runtime patterns
Such patterns can be used to identify acceleration and parallelization opportunities that apply across multiple workloads
THANKS!
130
Rexer Analytics Survey (www.rexeranalytics.com)
T. Davenport, J. Harris, and R. Morrison, Analytics at Work: Smarter Decisions, Better Results, Harvard Business School Press, 2010
G. Shmueli, N. Patel, P. Bruce, Data Mining for Business Intelligence, John Wiley and Sons, 2010
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006
A. Rajaraman and J. Ullman, Mining of Massive Datasets, Cambridge University Press, 2010
P. Melville, V. Sindhwani, Recommender Systems, Encyclopedia of Machine Learning, 2010
Andrew Ng, Machine Learning Course, Coursera
FURTHER INFORMATION
131