analytics in a big data world · 2014-04-22 · fair lending compliance: ... chapter 3 predictive...

26

Upload: others

Post on 05-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 2: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 3: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Analytics in a Big Data World

Page 4: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior‐level

managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

Activity‐Based Management for Financial Institutions: Driving Bottom‐

Line Results by Brent Bahnub

Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian

Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst

Branded! How Retailers Engage Consumers with Social Media and Mobil-

ity by Bernie Brennan and Lori Schafer

Business Analytics for Customer Intelligence by Gert Laursen

Business Analytics for Managers: Taking Business Intelligence beyond

Reporting by Gert Laursen and Jesper Thorlund

The Business Forecasting Deal: Exposing Bad Practices and Providing

Practical Solutions by Michael Gilliland

Business Intelligence Applied: Implementing an Effective Information and

Communications Technology Infrastructure by Michael Gendron

Business Intelligence in the Cloud: Strategic Implementation Guide by

Michael S. Gendron

Business Intelligence Success Factors: Tools for Aligning Your Business in

the Global Economy by Olivia Parr Rud

CIO Best Practices: Enabling Strategic Value with Information Technology,

second edition by Joe Stenzel

Connecting Organizational Silos: Taking Knowledge Flow Management to

the Next Level with Social Media by Frank Leistner

Credit Risk Assessment: The New Lending System for Borrowers, Lenders,

and Investors by Clark Abrahams and Mingyuan Zhang

Page 5: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Credit Risk Scorecards: Developing and Implementing Intelligent Credit

Scoring by Naeem Siddiqi

The Data Asset: How Smart Companies Govern Their Data for Business

Success by Tony Fisher

Delivering Business Analytics: Practical Guidelines for Best Practice by

Evan Stubbs

Demand‐Driven Forecasting: A Structured Approach to Forecasting, Sec-

ond Edition by Charles Chase

Demand‐Driven Inventory Optimization and Replenishment: Creating a

More Effi cient Supply Chain by Robert A. Davis

The Executive’s Guide to Enterprise Social Media Strategy: How Social Net-

works Are Radically Transforming Your Business by David Thomas and

Mike Barlow

Economic and Business Forecasting: Analyzing and Interpreting Econo-

metric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah

Watt, and Sam Bullard

Executive’s Guide to Solvency II by David Buckham, Jason Wahl, andI

Stuart Rose

Fair Lending Compliance: Intelligence and Implications for Credit Risk

Management by Clark R. Abrahams and Mingyuan Zhangt

Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide

to Fundamental Concepts and Practical Applications by Robert Rowan

Health Analytics: Gaining the Insights to Transform Health Care by Jason

Burke

Heuristics in Analytics: A Practical Perspective of What Infl uences Our

Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeilld

Human Capital Analytics: How to Harness the Potential of Your Organiza-

tion’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz‐enz t

Implement, Improve and Expand Your Statewide Longitudinal Data Sys-

tem: Creating a Culture of Data in Education by Jamie McQuiggan and

Armistead Sapp

Information Revolution: Using the Information Evolution Model to Grow

Your Business by Jim Davis, Gloria J. Miller, and Allan Russell

Page 6: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Markt

Brown

Manufacturing Best Practices: Optimizing Productivity and Product Qual-

ity by Bobby Hull

Marketing Automation: Practical Steps to More Effective Direct Marketing

by Jeff LeSueur

Mastering Organizational Knowledge Flow: How to Make Knowledge

Sharing Work by Frank Leistnerk

The New Know: Innovation Powered by Analytics by Thornton May

Performance Management: Integrating Strategy Execution, Methodologies,

Risk, and Analytics by Gary Cokins

Predictive Business Analytics: Forward‐Looking Capabilities to Improve

Business Performance by Lawrence Maisel and Gary Cokins

Retail Analytics: The Secret Weapon by Emmett Cox

Social Network Analysis in Telecommunications by Carlos Andre Reis

Pinheiro

Statistical Thinking: Improving Business Performance, second edition by

Roger W. Hoerl and Ronald D. Snee

Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data

Streams with Advanced Analytics by Bill Franks

Too Big to Ignore: The Business Case for Big Data by Phil Simon

The Value of Business Analytics: Identifying the Path to Profi tability by

Evan Stubbs

Visual Six Sigma: Making Data Analysis Lean by Ian Cox, Marie A.

Gaudard, Philip J. Ramsey, Mia L. Stephens, and Leo Wright

Win with Advanced Business Analytics: Creating Business Value from

Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www

.wiley.com .

Page 7: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Analytics in a Big Data World

The Essential Guide to Data Science and Its Applications

Bart Baesens

Page 8: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Cover image: ©iStockphoto/vlastosCover design: Wiley

Copyright © 2014 by Bart Baesens. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through paymentof the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifi cally disclaim any implied warranties of merchantability or fi tness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consultwith a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profi t or any other commercial damages, including butnot limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:Baesens, Bart. Analytics in a big data world : the essential guide to data science and its applications / Bart Baesens. 1 online resource. — (Wiley & SAS business series) Description based on print version record and CIP data provided by publisher; resource not viewed. ISBN 978-1-118-89271-8 (ebk); ISBN 978-1-118-89274-9 (ebk);ISBN 978-1-118-89270-1 (cloth) 1. Big data. 2. Management—Statistical methods. 3. Management—Data processing. 4. Decision making—Data processing. I. Title. HD30.215658.4’038 dc23 2014004728

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Page 9: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

To my wonderful wife, Katrien, and my kids,Ann-Sophie, Victor, and Hannelore. To my parents and parents-in-law.

Page 10: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 11: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

ix

Contents

Preface xiii

Acknowledgments xv

Chapter 1 Big Data and Analytics 1

Example Applications 2

Basic Nomenclature 4

Analytics Process Model 4

Job Profi les Involved 6

Analytics 7

Analytical Model Requirements 9

Notes 10

Chapter 2 Data Collection, Sampling,

and Preprocessing 13

Types of Data Sources 13

Sampling 15

Types of Data Elements 17

Visual Data Exploration and Exploratory

Statistical Analysis 17

Missing Values 19

Outlier Detection and Treatment 20

Standardizing Data 24

Categorization 24

Weights of Evidence Coding 28

Variable Selection 29

Page 12: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

x ▸ CONTENTS

Segmentation 32

Notes 33

Chapter 3 Predictive Analytics 35

Target Defi nition 35

Linear Regression 38

Logistic Regression 39

Decision Trees 42

Neural Networks 48

Support Vector Machines 58

Ensemble Methods 64

Multiclass Classifi cation Techniques 67

Evaluating Predictive Models 71

Notes 84

Chapter 4 Descriptive Analytics 87

Association Rules 87

Sequence Rules 94

Segmentation 95

Notes 104

Chapter 5 Survival Analysis 105

Survival Analysis Measurements 106

Kaplan Meier Analysis 109

Parametric Survival Analysis 111

Proportional Hazards Regression 114

Extensions of Survival Analysis Models 116

Evaluating Survival Analysis Models 117

Notes 117

Chapter 6 Social Network Analytics 119

Social Network Defi nitions 119

Social Network Metrics 121

Social Network Learning 123

Relational Neighbor Classifi er 124

Page 13: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

C O N T E N T S ◂ xi

Probabilistic Relational Neighbor Classifi er 125

Relational Logistic Regression 126

Collective Inferencing 128

Egonets 129

Bigraphs 130

Notes 132

Chapter 7 Analytics: Putting It All to Work 133

Backtesting Analytical Models 134

Benchmarking 146

Data Quality 149

Software 153

Privacy 155

Model Design and Documentation 158

Corporate Governance 159

Notes 159

Chapter 8 Example Applications 161

Credit Risk Modeling 161

Fraud Detection 165

Net Lift Response Modeling 168

Churn Prediction 172

Recommender Systems 176

Web Analytics 185

Social Media Analytics 195

Business Process Analytics 204

Notes 220

About the Author 223

Index 225

Page 14: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 15: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

xiii

Preface

Companies are being fl ooded with tsunamis of data collected in a

multichannel business environment, leaving an untapped poten-

tial for analytics to better understand, manage, and strategically

exploit the complex dynamics of customer behavior. In this book, we

will discuss how analytics can be used to create strategic leverage and

identify new business opportunities.

The focus of this book is not on the mathematics or theory, but on

the practical application. Formulas and equations will only be included

when absolutely needed from a practitioner’s perspective. It is also not

our aim to provide exhaustive coverage of all analytical techniques

previously developed, but rather to cover the ones that really provide

added value in a business setting.

The book is written in a condensed, focused way because it is tar-

geted at the business professional. A reader’s prerequisite knowledge

should consist of some basic exposure to descriptive statistics (e.g.,

mean, standard deviation, correlation, confi dence intervals, hypothesis

testing), data handling (using, for example, Microsoft Excel, SQL, etc.),

and data visualization (e.g., bar plots, pie charts, histograms, scatter

plots). Throughout the book, many examples of real‐life case studies

will be included in areas such as risk management, fraud detection,

customer relationship management, web analytics, and so forth. The

author will also integrate both his research and consulting experience

throughout the various chapters. The book is aimed at senior data ana-

lysts, consultants, analytics practitioners, and PhD researchers starting

to explore the fi eld.

Chapter 1 discusses big data and analytics. It starts with some

example application areas, followed by an overview of the analytics

process model and job profi les involved, and concludes by discussing

key analytic model requirements. Chapter 2 provides an overview of

Page 16: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

xiv ▸ PREFACE

data collection, sampling, and preprocessing. Data is the key ingredi-

ent to any analytical exercise, hence the importance of this chapter.

It discusses sampling, types of data elements, visual data exploration

and exploratory statistical analysis, missing values, outlier detection

and treatment, standardizing data, categorization, weights of evidence

coding, variable selection, and segmentation. Chapter 3 discusses pre-

dictive analytics. It starts with an overview of the target defi nition

and then continues to discuss various analytics techniques such as

linear regression, logistic regression, decision trees, neural networks,

support vector machines, and ensemble methods (bagging, boost-

ing, random forests). In addition, multiclass classifi cation techniques

are covered, such as multiclass logistic regression, multiclass deci-

sion trees, multiclass neural networks, and multiclass support vector

machines. The chapter concludes by discussing the evaluation of pre-

dictive models. Chapter 4 covers descriptive analytics. First, association

rules are discussed that aim at discovering intratransaction patterns.

This is followed by a section on sequence rules that aim at discovering

intertransaction patterns. Segmentation techniques are also covered.

Chapter 5 introduces survival analysis. The chapter starts by introduc-

ing some key survival analysis measurements. This is followed by a

discussion of Kaplan Meier analysis, parametric survival analysis, and

proportional hazards regression. The chapter concludes by discussing

various extensions and evaluation of survival analysis models. Chap-

ter 6 covers social network analytics. The chapter starts by discussing

example social network applications. Next, social network defi nitions

and metrics are given. This is followed by a discussion on social network

learning. The relational neighbor classifi er and its probabilistic variant

together with relational logistic regression are covered next. The chap-

ter ends by discussing egonets and bigraphs. Chapter 7 provides an

overview of key activities to be considered when putting analytics to

work. It starts with a recapitulation of the analytic model requirements

and then continues with a discussion of backtesting, benchmarking,

data quality, software, privacy, model design and documentation, and

corporate governance. Chapter 8 concludes the book by discussing var-

ious example applications such as credit risk modeling, fraud detection,

net lift response modeling, churn prediction, recommender systems,

web analytics, social media analytics, and business process analytics.

Page 17: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

xv

Acknowledgments

I would like to acknowledge all my colleagues who contributed to

this text: Seppe vanden Broucke, Alex Seret, Thomas Verbraken,

Aimée Backiel, Véronique Van Vlasselaer, Helen Moges, and Barbara

Dergent.

Page 18: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 19: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

Analytics in a Big Data World

Page 20: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision
Page 21: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

1

C H A P T E R 1 Big Data and Analytics

Data are everywhere. IBM projects that every day we generate 2.5

quintillion bytes of data.1 In relative terms, this means 90 percent

of the data in the world has been created in the last two years.

Gartner projects that by 2015, 85 percent of Fortune 500 organizations

will be unable to exploit big data for competitive advantage and about

4.4 million jobs will be created around big data. 2 Although these esti-

mates should not be interpreted in an absolute sense, they are a strong

indication of the ubiquity of big data and the strong need for analytical

skills and resources because, as the data piles up, managing and analyz-

ing these data resources in the most optimal way become critical suc-

cess factors in creating competitive advantage and strategic leverage.

Figure 1.1 shows the results of a KDnuggets 3 poll conducted dur-

ing April 2013 about the largest data sets analyzed. The total number

of respondents was 322 and the numbers per category are indicated

between brackets. The median was estimated to be in the 40 to 50 giga-

byte (GB) range, which was about double the median answer for a simi-

lar poll run in 2012 (20 to 40 GB). This clearly shows the quick increase

in size of data that analysts are working on. A further regional break-

down of the poll showed that U.S. data miners lead other regions in big

data, with about 28% of them working with terabyte (TB) size databases.

A main obstacle to fully harnessing the power of big data using ana-

lytics is the lack of skilled resources and “data scientist” talent required to

Page 22: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

2 ▸ ANALYTICS IN A B IG DATA WORLD

exploit big data. In another poll ran by KDnuggets in July 2013, a strong

need emerged for analytics/big data/data mining/data science educa-

tion.4 It is the purpose of this book to try and fi ll this gap by providing a

concise and focused overview of analytics for the business practitioner.

EXAMPLE APPLICATIONS

Analytics is everywhere and strongly embedded into our daily lives. As I

am writing this part, I was the subject of various analytical models today.

When I checked my physical mailbox this morning, I found a catalogue

sent to me most probably as a result of a response modeling analytical

exercise that indicated that, given my characteristics and previous pur-

chase behavior, I am likely to buy one or more products from it. Today,

I was the subject of a behavioral scoring model of my fi nancial institu-

tion. This is a model that will look at, among other things, my check-

ing account balance from the past 12 months and my credit payments

during that period, together with other kinds of information available

to my bank, to predict whether I will default on my loan during the

next year. My bank needs to know this for provisioning purposes. Also

today, my telephone services provider analyzed my calling behavior

Figure 1.1 Results from a KDnuggets Poll about Largest Data Sets Analyzed Source: www.kdnuggets.com/polls/2013/largest‐dataset‐analyzed‐data‐mined‐2013.html.

Less than 1 MB (12) 3.7%

1.1 to 10 MB (8) 2.5%

11 to 100 MB (14) 4.3%

101 MB to 1 GB (50) 15.5%

1.1 to 10 GB (59)18%

11 to 100 GB (52) 16%

101 GB to 1 TB(59) 18%

1.1 to 10 TB (39) 12%

11 to 100 TB (15) 4.7%

101 TB to 1 PB (6) 1.9%

1.1 to 10 PB (2) 0.6%

11 to 100 PB (0) 0%

Over 100 PB (6) 1.9%

Page 23: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

B I G D A T A A N D A N A L Y T I C S ◂ 3

and my account information to predict whether I will churn during the

next three months. As I logged on to my Facebook page, the social ads

appearing there were based on analyzing all information (posts, pictures,

my friends and their behavior, etc.) available to Facebook. My Twitter

posts will be analyzed (possibly in real time) by social media analytics to

understand both the subject of my tweets and the sentiment of them.

As I checked out in the supermarket, my loyalty card was scanned fi rst,

followed by all my purchases. This will be used by my supermarket to

analyze my market basket, which will help it decide on product bun-

dling, next best offer, improving shelf organization, and so forth. As I

made the payment with my credit card, my credit card provider used

a fraud detection model to see whether it was a legitimate transaction.

When I receive my credit card statement later, it will be accompanied by

various vouchers that are the result of an analytical customer segmenta-

tion exercise to better understand my expense behavior.

To summarize, the relevance, importance, and impact of analytics

are now bigger than ever before and, given that more and more data

are being collected and that there is strategic value in knowing what

is hidden in data, analytics will continue to grow. Without claiming to

be exhaustive, Table 1.1 presents some examples of how analytics is

applied in various settings.

Table 1.1 Example Analytics Applications

Marketing

Risk

Management Government Web Logistics Other

Response

modeling

Credit risk

modeling

Tax avoidance Web analytics Demand

forecasting

Text

analytics

Net lift

modeling

Market risk

modeling

Social

security fraud

Social media

analytics

Supply chain

analytics

Business

process

analytics

Retention

modeling

Operational

risk modeling

Money

laundering

Multivariate

testing

Market basket

analysis

Fraud

detection

Terrorism

detection

Recommender

systems

Customer

segmentation

Page 24: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

4 ▸ ANALYTICS IN A B IG DATA WORLD

It is the purpose of this book to discuss the underlying techniques

and key challenges to work out the applications shown in Table 1.1

using analytics. Some of these applications will be discussed in further

detail in Chapter 8 .

BASIC NOMENCLATURE

In order to start doing analytics, some basic vocabulary needs to be

defi ned. A fi rst important concept here concerns the basic unit of anal-

ysis. Customers can be considered from various perspectives. Customer

lifetime value (CLV) can be measured for either individual customers

or at the household level. Another alternative is to look at account

behavior. For example, consider a credit scoring exercise for which

the aim is to predict whether the applicant will default on a particular

mortgage loan account. The analysis can also be done at the transac-

tion level. For example, in insurance fraud detection, one usually per-

forms the analysis at insurance claim level. Also, in web analytics, the

basic unit of analysis is usually a web visit or session.

It is also important to note that customers can play different roles.

For example, parents can buy goods for their kids, such that there is

a clear distinction between the payer and the end user. In a banking

setting, a customer can be primary account owner, secondary account

owner, main debtor of the credit, codebtor, guarantor, and so on. It

is very important to clearly distinguish between those different roles

when defi ning and/or aggregating data for the analytics exercise.

Finally, in case of predictive analytics, the target variable needs to

be appropriately defi ned. For example, when is a customer considered

to be a churner or not, a fraudster or not, a responder or not, or how

should the CLV be appropriately defi ned?

ANALYTICS PROCESS MODEL

Figure 1.2 gives a high‐level overview of the analytics process model. 5

As a fi rst step, a thorough defi nition of the business problem to be

solved with analytics is needed. Next, all source data need to be identi-

fi ed that could be of potential interest. This is a very important step, as

data is the key ingredient to any analytical exercise and the selection of

Page 25: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

B I G D A T A A N D A N A L Y T I C S ◂ 5

data will have a deterministic impact on the analytical models that will

be built in a subsequent step. All data will then be gathered in a stag-

ing area, which could be, for example, a data mart or data warehouse.

Some basic exploratory analysis can be considered here using, for

example, online analytical processing (OLAP) facilities for multidimen-

sional data analysis (e.g., roll‐up, drill down, slicing and dicing). This

will be followed by a data cleaning step to get rid of all inconsistencies,

such as missing values, outliers, and duplicate data. Additional trans-

formations may also be considered, such as binning, alphanumeric to

numeric coding, geographical aggregation, and so forth. In the analyt-

ics step, an analytical model will be estimated on the preprocessed and

transformed data. Different types of analytics can be considered here

(e.g., to do churn prediction, fraud detection, customer segmentation,

market basket analysis). Finally, once the model has been built, it will

be interpreted and evaluated by the business experts. Usually, many

trivial patterns will be detected by the model. For example, in a market

basket analysis setting, one may fi nd that spaghetti and spaghetti sauce

are often purchased together. These patterns are interesting because

they provide some validation of the model. But of course, the key issue

here is to fi nd the unexpected yet interesting and actionable patterns

(sometimes also referred to as knowledge diamonds ) that can provide

added value in the business setting. Once the analytical model has

been appropriately validated and approved, it can be put into produc-

tion as an analytics application (e.g., decision support system, scoring

engine). It is important to consider here how to represent the model

output in a user‐friendly way, how to integrate it with other applica-

tions (e.g., campaign management tools, risk engines), and how to

make sure the analytical model can be appropriately monitored and

backtested on an ongoing basis.

It is important to note that the process model outlined in Fig-

ure  1.2 is iterative in nature, in the sense that one may have to go back

to previous steps during the exercise. For example, during the analyt-

ics step, the need for additional data may be identifi ed, which may

necessitate additional cleaning, transformation, and so forth. Also, the

most time consuming step is the data selection and preprocessing step;

this usually takes around 80% of the total efforts needed to build an

analytical model.

Page 26: Analytics in a Big Data World · 2014-04-22 · Fair Lending Compliance: ... Chapter 3 Predictive Analytics 35 Target Defi nition 35 Linear Regression 38 Logistic Regression 39 Decision

6 ▸ ANALYTICS IN A B IG DATA WORLD

JOB PROFILES INVOLVED

Analytics is essentially a multidisciplinary exercise in which many

different job profi les need to collaborate together. In what follows, we

will discuss the most important job profi les.

The database or data warehouse administrator (DBA) is aware of

all the data available within the fi rm, the storage details, and the data

defi nitions. Hence, the DBA plays a crucial role in feeding the analyti-

cal modeling exercise with its key ingredient, which is data. Because

analytics is an iterative exercise, the DBA may continue to play an

important role as the modeling exercise proceeds.

Another very important profi le is the business expert. This could,

for example, be a credit portfolio manager, fraud detection expert,

brand manager, or e‐commerce manager. This person has extensive

business experience and business common sense, which is very valu-

able. It is precisely this knowledge that will help to steer the analytical

modeling exercise and interpret its key fi ndings. A key challenge here

is that much of the expert knowledge is tacit and may be hard to elicit

at the start of the modeling exercise.

Legal experts are becoming more and more important given that

not all data can be used in an analytical model because of privacy,

Figure 1.2 The Analytics Process Model

Understandingwhat data isneeded for theapplication

Data Cleaning

Interpretation and Evaluation

DataTransformation(binning, alpha tonumeric, etc.)

Analytics

DataSelection

SourceData

AnalyticsApplication

PreprocessedData

TransformedData

Patterns

Data MiningMart

Dumps of Operational Data