data mining - the big picture!


Data Mining
The big picture!

Khalid M. Salama, Ph.D.
Microsoft Business Intelligence
Hitachi Consulting UK

We Make it Happen. Better.

Copyright 2015 Hitachi Consulting

Outline

Context

Data Mining Tasks, Techniques, and Applications

Knowledge Discovery Process

Screenshots

Concluding Remarks


Business Intelligence as a Context

Business Intelligence: a broad category of concepts, methods, tools and techniques for collecting, storing, managing, analysing and sharing data to support and improve decision making.

Data Mining is the subset of these concepts, methods, tools and techniques concerned with automatically extracting hidden, useful patterns from data.

Examples:
CRM: Customer Segmentation, Profiling, etc.
Finance, Banking & Insurance: Fraud Detection, Credit Scoring, Stock Market, etc.
Medicine/Health Care: Disease Development, Diagnosis, Best Treatments, etc.
Telecommunication: Churn Analysis, Network Fault Isolation, etc.
Retail: Cross-selling, Targeted Marketing, Propensity Modelling, etc.

revealing the mystery


Terms and Significance

Data Mining: an interdisciplinary subfield of computer science; the computational process of discovering patterns in datasets. Related term: Knowledge Discovery in Databases (KDD).

Data Science: the extraction of knowledge from volumes of data; a continuation of the fields of data mining and predictive analytics.

Machine Learning: a subfield of computer science that evolved from the study of pattern recognition and computational learning theory.

Predictive Analytics: a variety of statistical techniques from modelling, machine learning, and data mining that analyse current and historical facts to make predictions about the future.

Big Data: a broad term for data sets so large or complex that traditional data processing applications are inadequate.

bringing order to buzzword chaos


Data Mining in a nutshell

Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

[Diagram: Data Mining at the intersection of Machine Learning, Statistics, Artificial Intelligence, Databases, and Other Technologies]

Other Related Technologies:
Visualization
Big Data
High Performance Computing
Cloud Computing
Others


Knowledge Discovery in Databases (KDD)
or data science, if you like!

Cross Industry Standard Process for Data Mining (CRISP-DM):
1. Understanding the Business
2. Understanding the Data
3. Preparing the Data
4. Modelling
5. Evaluation - Interpretation
6. Deployment

[Diagram: the six steps arranged around the Data]


Data Mining Taxonomy
A 10,000 foot view

Level                      Example
Learning Paradigms         Supervised Learning
Mining Tasks               Classification
Modelling Techniques       Decision Trees
Measures                   Information Gain
Heuristic Search Methods   Greedy Recursive Partitioning


Learning Paradigms
Data as the teacher, machine as the student

Supervised Learning
Labelled data = data + output (predictable, target, response, class) variable.
Learn the relationship between data and output.

Unsupervised Learning
Unlabelled data.
Learn associations, similarities, groups, etc.

Semi-supervised Learning
Partially labelled data.

Online/Active Learning
Real-time learning on data streams.

Reinforcement Learning
Game theory, control theory, simulation-based optimization, operations research, robotics, etc.


Data Mining Tasks
only the genuine ones!

Important Terms:

Learning Paradigms: Supervised, Unsupervised, Semi-supervised, Others (Reinforcement Learning, Active Learning, etc.)

Analytics Types: Descriptive (Exploratory), Predictive, Prescriptive (Decisive)

Application Fields:

Text Mining

Information Retrieval

(Social) Web Mining

Speech Recognition

Image Recognition

Anomaly Detection

State Transition Analysis

Collaborative Filtering (Recommender Systems)


Classification Learning
my favourite data mining task!

Data Mining Tasks:
Classification
Regression
Clustering
Association Rules Analysis
Similarity Analysis
Probabilistic Inference
Time Series Analysis

Target Class Type:
Binary vs. Multi-class
Multi-label
Hierarchical Class

Classification Applications:
Targeted Advertising
Churn Analysis
Fraud Detection
OCR
Sentiment Analysis
Predictive Maintenance
Document Classification
Protein Function Prediction
Medical Support Systems

Input: Labelled cases (nominal labels).
Process: Learn the relationships between the input variables and the target class.
Output: A model that is used to predict the class of unlabelled cases (+ probability).

Labelled cases (Training Set):

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      no     Don't
sunny      hot          high      yes    Don't
overcast   hot          high      no     OK
rain       mild         high      no     OK
rain       cool         normal    no     OK
rain       cool         normal    yes    Don't
overcast   cool         normal    yes    OK
sunny      mild         normal    no     Don't
sunny      cool         normal    no     OK
rain       mild         normal    no     OK
sunny      mild         normal    yes    OK
overcast   mild         high      yes    OK
overcast   hot          normal    no     OK
rain       mild         high      yes    Don't

[Diagram: Labelled cases (Training Set) -> Classification Algorithm -> Model (Classifier); Unlabelled (new) Case -> Model (Classifier) -> OK]
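One way a classification algorithm can choose its first split on the training set above is information gain, the measure the taxonomy slide pairs with decision trees. The following is a minimal sketch rather than the deck's own implementation: the function and variable names are illustrative assumptions, and the cases are copied from the table as listed.

```python
from collections import Counter
from math import log2

# The 14 labelled cases as listed in the training set; the last field is the class.
COLUMNS = ("outlook", "temperature", "humidity", "windy")
CASES = [
    ("sunny", "hot", "high", "no", "Don't"),
    ("sunny", "hot", "high", "yes", "Don't"),
    ("overcast", "hot", "high", "no", "OK"),
    ("rain", "mild", "high", "no", "OK"),
    ("rain", "cool", "normal", "no", "OK"),
    ("rain", "cool", "normal", "yes", "Don't"),
    ("overcast", "cool", "normal", "yes", "OK"),
    ("sunny", "mild", "normal", "no", "Don't"),
    ("sunny", "cool", "normal", "no", "OK"),
    ("rain", "mild", "normal", "no", "OK"),
    ("sunny", "mild", "normal", "yes", "OK"),
    ("overcast", "mild", "high", "yes", "OK"),
    ("overcast", "hot", "normal", "no", "OK"),
    ("rain", "mild", "high", "yes", "Don't"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(cases, column_index):
    """Class entropy minus the weighted class entropy after splitting on one attribute."""
    labels = [case[-1] for case in cases]
    partitions = {}
    for case in cases:
        partitions.setdefault(case[column_index], []).append(case[-1])
    remainder = sum(len(part) / len(cases) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Score every attribute and pick the one a greedy tree builder would split on first.
gains = {name: information_gain(CASES, i) for i, name in enumerate(COLUMNS)}
best_attribute = max(gains, key=gains.get)
```

On these 14 cases `best_attribute` comes out as `outlook`; a greedy recursive partitioning procedure would split on it and then repeat the same scoring inside each partition.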


Classification Learning
classification modelling techniques

Classification Techniques (with examples):
Decision Trees: Forests/Jungles
Classification Rules: Ordered List/Unordered Set
Linear Discriminant Analysis: Logistic Regression
Artificial Neural Networks: Feed-forward Multilayer Perceptron
Instance-based Learning: Nearest-neighbour classifiers
Probabilistic Graphical Models: Bayesian Network Classifiers
Support Vector Machines: Kernel Methods
Gaussian Process: Non-parametric Methods
Ensemble Methods: Bagging/Boosting/Stacking

Advanced Classification Tasks:
Multi-label Classification
Hierarchical Classification

Classification rules (ordered list):
IF .. AND .. AND .. THEN A
ELSE IF .. AND .. THEN C
ELSE IF .. AND .. THEN B
....
ELSE C
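An ordered rule list of the IF .. AND .. THEN .. ELSE form is evaluated top to bottom, and the first rule whose conditions all hold decides the class. A minimal sketch follows; the specific rules, attribute names, and default class are hypothetical and only illustrate the structure.

```python
# Hypothetical ordered rule list: (conditions, predicted class), checked in order.
RULES = [
    ({"outlook": "overcast"}, "OK"),
    ({"outlook": "sunny", "humidity": "high"}, "Don't"),
    ({"outlook": "rain", "windy": "yes"}, "Don't"),
]
DEFAULT_CLASS = "OK"  # the final ELSE branch


def classify(case, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose conditions all match the case."""
    for conditions, predicted in rules:
        if all(case.get(attribute) == value for attribute, value in conditions.items()):
            return predicted
    return default


new_case = {"outlook": "sunny", "humidity": "high", "windy": "no"}
prediction = classify(new_case)
```

Because the list is ordered, moving a rule up or down can change the prediction; an unordered rule set would instead need a conflict-resolution step when several rules fire.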


Regression Analysis
the most classical ML task

Regression Applications:
Credit Scoring
Survival Analysis
Risk Estimation
Value Evaluation

Regression Techniques:
Simple vs. Multi-variate
Generalized LM
Local Models - Splines
Trees - ANN - GP

Related Concepts:
Parameter Estimation
Regularization
Model Selection


Cluster Analysis

Input: cases without a specific target class.
Process: find groups such that the distance within each group is minimized and the distance between groups is maximized.
Output: case-cluster assignment (membership).

Clustering Techniques:
Exclusive vs. Overlapping: K-Means vs. Fuzzy K-Means, EM
Partitioned vs. Hierarchical: K-Means vs. Agglomerative/Divisive
Center-based vs. Density-based: K-Means vs. DBSCAN
Complete vs. Partial

Cluster Quality:
Minimize intra-distance/linkage (Cohesion)
Maximize inter-distance/linkage (Separation)
Number of Clusters: rather a means to an end
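To make the cohesion and separation idea concrete, here is a minimal K-Means sketch (an exclusive, partitioned, center-based technique). The points, the hand-picked initial centres, and all function names are assumptions chosen for illustration; a real run would normally try several initialisations.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centres, iterations=10):
    """Alternate between assigning cases to the nearest centre and moving the centres."""
    centres = list(initial_centres)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each case joins its nearest centre.
        assignment = [
            min(range(len(centres)), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, assignment


def cohesion(points, centres, assignment):
    """Total within-cluster squared distance (lower means tighter clusters)."""
    return sum(squared_distance(p, centres[a]) for p, a in zip(points, assignment))


# Hypothetical two-dimensional cases forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centres, assignment = k_means(points, initial_centres=[points[0], points[3]])
within = cohesion(points, centres, assignment)
between = squared_distance(centres[0], centres[1])  # separation of the two centres
```

Here `assignment` is the case-cluster membership named as the output above, while `within` and `between` are simple stand-ins for cohesion and separation.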

Clustering Applications:
Customer Segmentation
Outlier Detection
Topic Grouping
Profiling
Summarisation
Mixture of Models


Association Rule Analysis
discovery of interesting relationships

Association Rules Applications:
Market Basket Analysis
Text Mining - Sentiment Analysis
Graph/Link Analysis

Rule Measures:
Support & Confidence
Interestingness
Lift & Chi-Squared
Jaccard & Kulczynski
Kappa & Conviction

Related Issues:
Negative Item Sets
Quantitative Items
Sequential Patterns
Item Sets Compression
Redundancy-Aware Patterns
Colossal Item Sets & Scalability
Pattern Fusion

      a     b     c     d     e
T1    yes   no    yes   yes   no
T2    yes   no    no    yes   no
T3    no    yes   no    no    yes
...   ...   ...   ...

Basket Data:
T1 {a,c,d}
T2 {a,d}
T3 {b,e}
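The first few rule measures can be computed directly from the three example baskets. This is a small sketch under the usual textbook definitions of support, confidence, and lift; the function names are illustrative, and rules whose antecedent or consequent never occurs are not handled.

```python
# Basket data from the slide: transaction id -> set of items.
BASKETS = {"T1": {"a", "c", "d"}, "T2": {"a", "d"}, "T3": {"b", "e"}}


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in the item set."""
    itemset = set(itemset)
    return sum(itemset <= items for items in baskets.values()) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets=BASKETS):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


# Rule {a} -> {d}: both items appear together in T1 and T2.
rule_support = support({"a", "d"})
rule_confidence = confidence({"a"}, {"d"})
rule_lift = lift({"a"}, {"d"})
```

For the rule {a} -> {d}, support is 2/3, confidence is 1.0, and lift is 1.5, so on this tiny sample buying a makes d more likely than its base rate.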

Similarity Analysis
a.k.a. instance-based learning

Similarity Matching Applications:
Case-based Reasoning
Lazy Classification
Record Matching
Outlier Detection
Search Engines

Attribute Proximity Measures:
Edit-based: Levenshtein and Jaro-Winkler distance.
Token-based: Jaccard, Shannon, and Cosine Similarity.
Sequence-based: Longest Common Subsequence.
Phonetic-based: Soundex and Metaphone.
Numeric-based: Euclidean distance.

           Att-1   Att-2   ...   Att-m
Case i     Vi,1    Vi,2    ...   Vi,m
Case j     Vj,1    Vj,2    ...   Vj,m
Weights    W1      W2      ...   Wm

$$\mathrm{Similarity}(i,j) = W_1 \cdot \mathrm{Sim}(V_{i,1}, V_{j,1}) + W_2 \cdot \mathrm{Sim}(V_{i,2}, V_{j,2}) + \dots + W_m \cdot \mathrm{Sim}(V_{i,m}, V_{j,m})$$

Input: A set of (labelled/unlabelled) cases + subject case.
Process: find a set of similar cases to the subject case.
Output: similar cases (nearest neighbours).

Proximity Measure: Distance vs. Similarity
Weighting: User Input vs. Automatic Optimisation
Neighbours: Distance-based (Threshold) vs. Top K
Classification / Regression: Voting / Average, Weighted Voting / Weighted Average, Kernel Methods (Gaussian Kernel)
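The weighted per-attribute similarity can be sketched with one proximity measure per attribute type. The example below is illustrative only: the two cases, the weights, the value range used to scale the numeric attribute, and the way each distance is turned into a 0..1 similarity are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Edit-based similarity in [0, 1]: 1 minus the length-normalised edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a, b):
    """Token-based similarity: shared words divided by all distinct words."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return 1.0 if not union else len(tokens_a & tokens_b) / len(union)


def numeric_similarity(a, b, value_range):
    """Numeric similarity: 1 minus the absolute difference scaled by an assumed range."""
    return 1.0 - min(abs(a - b) / value_range, 1.0)


def weighted_similarity(case_i, case_j, measures, weights):
    """Sum over attributes of weight * similarity of the two attribute values."""
    return sum(w * sim(vi, vj) for vi, vj, sim, w in zip(case_i, case_j, measures, weights))


# Hypothetical records: (name, job title, age), with user-supplied weights.
measures = [edit_similarity, jaccard_similarity,
            lambda a, b: numeric_similarity(a, b, value_range=100.0)]
weights = [0.5, 0.3, 0.2]
case_i = ("Jon Smith", "data mining consultant", 34)
case_j = ("John Smith", "data science consultant", 36)
score = weighted_similarity(case_i, case_j, measures, weights)
```

Ranking all stored cases by `score` against a subject case and keeping the top K (or those above a threshold) gives the nearest neighbours described above.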


Probability Estimation and Inference
the doctrine of chances

Input: A set of (labelled/unlabelled) cases.
Process: learn the structure/parameters of the variable dependency relationships.
Output: A Probabilistic Graphical Model.

Probabilistic Graphical Models:
Directed Acyclic Graphs: Bayesian Networks (classifiers), Dynamic Bayesian Networks, Markov Blankets
Directed Cyclic Graphs: Markov Chains, (Hidden) Markov Models
Undirected Graphs: Factor Graphs, Dependency Networks, Markov Random Fields

Learning:
Structure (variable-dependency relationships)
Parameters (quantification of the relationships)

Inferencing:
Exact inference and the junction tree
MCMC
Variational methods and EM

Probabilistic Inference Applications:
ML Framework
Diagnostic Systems
State Transition Analysis


Time Series Analysis
history tends to repeat itself

Input: a sequence of evenly-spaced numerical data.
Process: learn a function that describes the current value with respect to the previous ones.
Output: Time Series Model (describe/forecast).

Components:
Trend: Overall upward, downward, or stationary pattern.
Cyclical: Repeating upward or downward movements.
Seasonal: Regular pattern of up and down fluctuations.
Irregular: Unsystematic, residual fluctuations (random).

Techniques:
Regression.
(Weighted) Moving Average.
Exponential Smoothing.
Auto-regressive (STL, ARMA, ARIMA, etc.).
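Two of the simpler techniques, the (weighted) moving average and exponential smoothing, fit in a few lines. This is a minimal sketch with hypothetical demand figures; the function names and the choice of `alpha` are assumptions.

```python
def moving_average(series, window, weights=None):
    """(Weighted) moving average over a sliding window; weights default to equal."""
    weights = weights or [1.0] * window
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, series[end - window:end])) / total
        for end in range(window, len(series) + 1)
    ]


def exponential_smoothing(series, alpha):
    """Each smoothed value blends the new observation with the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical evenly-spaced monthly demand.
demand = [120, 130, 125, 140, 150, 145]
trend_estimate = moving_average(demand, window=3)
recent_weighted = moving_average(demand, window=3, weights=[1, 2, 3])  # newest counts most
smoothed = exponential_smoothing(demand, alpha=0.5)
next_forecast = smoothed[-1]  # simple one-step-ahead forecast
```

A larger window or a smaller `alpha` smooths more aggressively, at the cost of reacting more slowly to genuine changes in trend.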

Time Series Applications:
Stock Market
Supply/Demand
Financial Applications
Signal Processing


Knowledge Discovery in Databases (KDD)
the virtuous cycle of data science



Step 1 - Understanding the Business

"The formulation of a problem is often more essential than its solution" - Albert Einstein

Ways to answer Data Analysis questions:

Query/Report: How many new customers bought my service this month? How many renewed? How many left?

Complex Query/Report: What are the top selling products by region in Online sales? How does that compare to store sales? (Multi-dimensional Analysis/Visualisation)

Calculations/KPIs: Is my business going well? Are we meeting our targets?

What-if Analysis: Based on last year's sales, what will the revenue be if we increase the price of product X by 1% and decrease the price of product Y by 2%? (budgeting/planning)

Statistical Analysis: What are the most important factors that impact the energy consumption in our facilities? (dependency/correlation)

Hypothesis Testing: Is there a significant improvement among the group of people who took the new drug, compared with the placebo group? (experimental studies/market research)

Data Mining: Which customers are most likely to respond to our new advertising campaign? (predictive analytics)



Step 1 - Understanding the Business
from business problems to analytic tasks

The more specific the question, the better!
Bad Question: What opportunities do we have to save energy?
Good Question: Which buildings exhibit a different energy usage pattern with respect to building type, temperature, size, and number of occupants?

The better acquainted users are with their business data (facts), the more specific and sophisticated the questions they will ask (BI Maturity).




TDWI Maturity Model


Step 1 - Understanding the Business
from business problems to analytics tasks

A business problem can be decomposed into multiple business questions, each of which can be mapped to a different analytics technique or data mining task.

Example 1: Microsoft How-old.net
What are the distinct objects in the picture? -> Clustering
For each object, is it a face or not? -> Classification
What is the estimated age for each identified face? -> Regression

Example 2: Churn Analysis and Targeted Offering
Which customers are likely to terminate the contract this month? -> Classification
Which service package is a customer likely to purchase if given an incentive? -> Classification
How much will this customer use the service? -> Regression
What will be the expected utility of targeting this customer? -> Calculation

Example 3: Planning
What will the demand for each item be next year, per region? -> Time Series
What will the revenue be under this pricing scheme? -> What-if





Step 2 - Understanding the Data
what is data?

Data are values of qualitative or quantitative variables, belonging to a set of items.

Variables:
Numerical
Categorical (Nominal, Ordinal)
Special (Identifier, Time Index)

What should data look like:
One row for each case
Columns represent attributes

Id        Att-1    Att-2    ..   Att-M
Case 1    V(1,1)   V(1,2)
Case 2    V(2,1)   V(2,2)
Case N                           V(N,M)

What does data really look like:
Transactional (normalised) data
Ordered data
Sequence data (DNA)
Time-based data (temporal auto-correlated)
Spatial data (spatial auto-correlated)
Graph-based data
Free-form Text
Image/Video (sequence of images)
Audio


Step 2 - Understanding the Data
exploratory data analysis

Answering the following questions:
What is the available data?
Do we need to acquire other data? (Publicly available/Buy data)
What is the nature of the dataset? (Data profiling)
  Number of cases
  Number of attributes
  Missing values (sparsity)
  Numerical variables (min, max, mean, median, std. dev., outliers)
  Categorical variables (cardinality, frequencies, mode value)
  Correlations between numerical variables
  Statistical dependency between categorical variables
  Statistical variance (numerical vs. categorical variables)
  Inconsistencies (based on business rules)

Should lead to:
Identifying the data pre-processing operations needed.
Suggesting the model to be used.
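A first pass at data profiling needs little more than the standard library. The sketch below covers the per-variable summaries from the checklist; the column values are hypothetical, `None` is assumed to mark a missing value, and the function names are illustrative.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Summary of a numerical variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }


def profile_categorical(values):
    """Summary of a categorical variable: cardinality, frequencies, mode."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "cardinality": len(counts),
        "frequencies": dict(counts),
        "mode": counts.most_common(1)[0][0],
    }


# Hypothetical columns from a customer table.
ages = [34, 45, None, 29, 52]
regions = ["North", "South", "North", None, "East"]
age_profile = profile_numeric(ages)
region_profile = profile_categorical(regions)
```

Profiles like these usually point straight at the preparation work: a high `missing` count suggests imputation or elimination, and a wide `min`/`max` range suggests clipping or scaling.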


Step 3 - Preparing the Data
good luck is a residue of preparation

Feature Engineering: Building the dataset.

Feature Construction: fabricating a set of (possibly) useful features.

Example
- Input: Sales Transactions (Customer, Product, Orders)
- Objective: Customer Segmentation
- Features: Days First Purchase, Days Last Purchase, Avg. Days between 2 Purchases, Last 3 Months Total Spending, Last 6 Months Total Spending, Promotion Responsiveness, New Product Responsiveness, Avg. Purchased Product Price, Web Usage Information, Demographics, Geographic, Economic Indices, Date Indicators, etc.

Feature Selection: selecting the most effective subset of the available features (Filter vs. Wrapper).

Feature Extraction: constructing a new set of independent (uncorrelated) features from the existing feature set using a mathematical transformation, e.g. Principal Component Analysis (PCA), Factor Analysis (FA), etc.



Step 3 - Preparing the Data
garbage in, garbage out

Variable Type Conversion:
Numerical to Categorical (Discretisation): Equal Width/Equal Size/Supervised.
Categorical to Numerical: One-hot/Relative Counts.

Variable Tuning:
Missing Values: Eliminate/Estimate.
Clipping Extreme Values: Fix/Remove.
Scaling: Normalisation/Standardisation.

Row Processing:
Aggregation
Removing Duplicates
Instance Selection (Data Reduction)
Sampling/Partitioning
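Several of the conversions and tuning steps listed above are one-liners once written down. The sketch below shows equal-width discretisation, one-hot encoding, clipping, min-max normalisation, and z-score standardisation; function names and sample values are assumptions, and real projects would typically rely on a library implementation.

```python
import statistics


def equal_width_bins(values, bins):
    """Discretisation: map each number to the index of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid a zero width when all values are equal
    return [min(int((v - low) / width), bins - 1) for v in values]


def one_hot(values):
    """Categorical to numerical: one 0/1 column per distinct category (sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def clip(values, lower, upper):
    """Clipping: pull extreme values back to the given bounds."""
    return [min(max(v, lower), upper) for v in values]


def normalise(values):
    """Min-max scaling to the range [0, 1]."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0
    return [(v - low) / span for v in values]


def standardise(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / (stdev or 1.0) for v in values]


# Hypothetical spending figures and customer segments.
spending = [0, 5, 10]
binned = equal_width_bins(spending, bins=2)
encoded = one_hot(["gold", "silver", "gold"])
scaled = normalise(spending)
```

Normalisation keeps values in a fixed range, which suits distance-based methods, while standardisation is less distorted by a single extreme value.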



Step 4 - Modelling
If you interrogate the data, it will confess

Overall Procedure:
sets = Split(dataset, ratio);
train = sets[0]; test = sets[1];
model = Build(algorithm, train, preproc, param);
Visualize(model);
quality = Evaluate(model, test, measure);

Always Build Multiple Models:
Using different approaches.
Using different algorithms.
Using different parameters (parameter sweeping).
Using different dataset representations.

Empirical Evaluation for Model Selection
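The overall procedure (split, build several models, evaluate each on held-out data, select empirically) can be sketched end to end with two deliberately simple "algorithms". Everything below is illustrative: the dataset is hypothetical, and `split`, `build_majority`, `build_one_rule`, and `evaluate` are stand-in names, not functions of any particular tool.

```python
import random
from collections import Counter


def split(dataset, ratio, seed=0):
    """Shuffle the cases and cut them into a training and a test set."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


def build_majority(train):
    """Baseline model: always predict the most frequent training class."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority


def build_one_rule(train, attribute):
    """Model that predicts the most common class seen for one attribute's value."""
    fallback = build_majority(train)
    by_value = {}
    for features, label in train:
        by_value.setdefault(features[attribute], Counter())[label] += 1
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda features: table.get(features[attribute], fallback(features))


def evaluate(model, test):
    """Accuracy: fraction of test cases whose class is predicted correctly."""
    return sum(model(features) == label for features, label in test) / len(test)


# Hypothetical labelled cases: ((outlook, humidity), class).
DATASET = [
    (("sunny", "high"), "Don't"), (("sunny", "normal"), "OK"),
    (("overcast", "high"), "OK"), (("overcast", "normal"), "OK"),
    (("rain", "high"), "Don't"), (("rain", "normal"), "OK"),
    (("sunny", "high"), "Don't"), (("overcast", "high"), "OK"),
    (("rain", "normal"), "OK"), (("sunny", "normal"), "OK"),
]

train, test = split(DATASET, ratio=0.7)
CANDIDATES = {
    "majority class": build_majority,
    "one rule on attribute 0": lambda rows: build_one_rule(rows, 0),
    "one rule on attribute 1": lambda rows: build_one_rule(rows, 1),
}
results = {name: evaluate(build(train), test) for name, build in CANDIDATES.items()}
best = max(results, key=results.get)
```

With only three test cases the accuracies are very noisy; the point is the loop over candidates, which is where different approaches, algorithms, parameters, and dataset representations would be plugged in.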


Step 5 - Evaluation and Interpretation
Performance Quality Aspects

Model Predictive Effectiveness:
Predictive Accuracy

Model Comprehensibility:
Interpretability
Insights
Model acceptance
Legal explanation (Justifiability): Credit Denial, Medical Decisions

Algorithm Efficiency:
Scalability/running time
User input parameters


Step 5 - Evaluation and Interpretation
all models are wrong, but some are useful

Predictive Models: Predictive Effectiveness (accuracy?)

Considerations:
Class Imbalance
Misclassification Cost (Expected Utility)
Single Class Focus (Hit Rate vs. False Alarms)

Measures:
Confusion Matrix
Accuracy (Micro vs. Macro)
Precision, Recall, Sensitivity, Specificity, F-Measure, etc.
Area Under Curve, Lift Chart, Profit/Cost Chart, etc.
QLF, BIR, etc. (Probabilistic Classification/Regression)

Methods:
Hold-out
k-fold Cross Validation
Leave-one-out

Descriptive Models: It is up to you!

Confusion Matrix:

                      Actual Positive   Actual Negative
Predicted Positive    TP                FP
Predicted Negative    FN                TN
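Most of the listed measures follow directly from the four confusion matrix cells, and k-fold cross validation is just a way of rotating which cases are held out. The sketch below assumes the standard definitions; the counts and function names are hypothetical.

```python
def classification_measures(tp, fp, fn, tn):
    """Common measures derived from the four confusion matrix cells."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }


def k_fold_indices(n_cases, k):
    """Yield (train_indices, test_indices); every case is tested exactly once."""
    folds = [list(range(start, n_cases, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield train, test


# Hypothetical counts for a binary classifier evaluated on 100 cases.
measures = classification_measures(tp=40, fp=10, fn=20, tn=30)
folds = list(k_fold_indices(n_cases=10, k=5))
```

Setting `k` equal to the number of cases turns `k_fold_indices` into leave-one-out, and a single split corresponds to the hold-out method.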

Predictive Quality Measures:
Accuracy (Micro vs. Macro)
Precision vs. Recall
Sensitivity vs. Specificity
Kappa
Lift
Odds
QLF, CE, BIR
AUC, lift, cost charts

Evaluation Methods:
Hold-out
k-fold Cross Validation


Step 6 - Deployment
data mining in action!

Demo

Tools & Technologies:
MS Azure ML
MS Analysis Services
Infer.NET
WEKA (Java)
R Statistics (caret, rattle)
Python (mlpy, scikit-learn)
OpenML
C/C++ - Matlab
SAS
SPSS
RapidMiner
Apache Mahout

Dataset Repository:
UCI - KDD
data.gov.uk
GapMinder


Screenshot: Decision Trees (Microsoft Analysis Services)

Screenshot: Cluster Analysis (Microsoft Analysis Services)

Screenshot: Association Rules Analysis (Microsoft Analysis Services)

Screenshot: Time Series (Microsoft Analysis Services)

Screenshot: ML Experiment (Microsoft Azure Machine Learning)

Screenshot: ML Web Services (Microsoft Azure Machine Learning)

Screenshot: Probabilistic Models (Microsoft Infer.NET)

Screenshot: Classification Rules (Java - WEKA)

Screenshot: Text Mining (R Statistics)

Screenshot: Regression Models (R Statistics)


Concluding Remarks
a few takeaways

Understand the business problem first, please!

Use the appropriate tool/technique that best suits the business problem, not the other way around.

Start by solving simple business problems first, before moving to complex ones (BI Insight Maturity Journey).

Spend some time exploring and understanding the data.

Incorporate domain knowledge in your analysis (avoid reinventing the wheel!).

Data preparation is very important for building effective models.

Data mining is an experimental/ iterative process (not ideal for fixed-price projects!).

Try to tackle the business problem with different analytic approaches.

It is clever to solve complex problems with simple techniques.


My Background
Applying Ant Colony Optimisation (ACO) in Building Classification Models

Honorary Research Fellow, School of Computing, University of Kent.
Ph.D. Computer Science, University of Kent, Canterbury, UK.
M.Sc. Computer Science, The American University in Cairo, Egypt.

20+ published journal and conference papers, focusing on: classification rules induction, decision trees construction, Bayesian classification modelling, data reduction, instance-based learning, and evolving neural networks.

Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Intelligent Data Analysis, Applied Soft Computing, and Memetic Computing.

Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, and INNS-BigData.

