TRANSCRIPT
Analysis of Big Data
Tim Miller, Sr. Analytics Consultant – Teradata
Alexander Kolovos, Ph.D., Advanced Analytics Software Engineer – Teradata
March 28, 2017
2 © 2017 Teradata
Tim Miller, Senior Analytics Consultant, Teradata Corporation
• Expertise in advanced analytic software, systems and methodologies.
• Principal engineer for the first commercial in-database data mining system, Teradata Warehouse Miner.
• Consultant to Teradata analytic partners (SAS, SPSS, etc.) and customers.
• Retired youth basketball, football, baseball, softball, soccer, etc. coach
Alexander Kolovos, Ph.D., Advanced Analytics Software Engineer, Teradata Corporation
• Expertise in analytical methodologies and platforms
• Specialization in space-time data and stochastic predictive analysis
• Ph.D. in Sciences and Engineering
• 4 years at Teradata (Analytics Engineer); 6 years at SAS (spatial software expert)
• Loves language, theater, & rock music
Your Presenters
• Big Data: Brief review
• Storage and Availability: Information in your hands. Wanna keep it where?
• Analysis
– Exploratory and summary. Is this enough?
– Can pretty pictures tell stories?
– Recreate the world around you
– Tools and tricks of the trade
• Walking the walk: The foundation, the strategy, the tools
• Managing the big story
– Model management
– Application management
Overview
The last half-century marked precipitous technological advances that gradually:
provided unprecedented computing power to speed up calculations that were previously time-consuming or even time-prohibitive
enabled progressively increasing monitoring, recording, and storing of empirical information, specifically focusing on data measurements
Big Data: Brief Review
Schematic depiction of Moore’s Law (computing power doubles roughly every two years, up to limits imposed by the laws of physics)
A more sober approach appears to have followed an initial frenzy about
the availability of large volumes of information. According to Hortonworks
Inc., a developer of the Hadoop platform:
Big data describes the realization of greater business intelligence
by storing, processing, and analyzing data that was previously ignored or siloed
due to the limitations of traditional data management technologies.
Big Data: Brief Review
Big Data: Brief Review
Big Data is the 3 Vs:
Variety: Structured and unstructured data
Volume: Tera- (10^12), peta- (10^15), and even exabytes (10^18) of data
Velocity: Data flows into your organization at an increasing rate
Big Data brings forward the issue of scaling:
Solve problems old or new, trivial or elaborate
in entirely new frameworks characterized by increasing data sizes
Face new challenges: Conceive and apply appropriate methodologies
Maintain competitive performance
Engineer effective hardware architectures
Generate suitable software solutions
→ Example: Handle matrix inversions for increasingly large matrix sizes
→ Example: Deal with sparse data in very large dimensions
Big Data: Brief Review
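The second example above, sparse data in very large dimensions, is typically handled by storing only the nonzero entries. A minimal sketch in Python, using a plain dict as the sparse container (a stand-in for dedicated structures such as compressed sparse matrices):

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Only nonzero entries are stored, so memory and work scale with the
    number of nonzeros, not with the (possibly huge) nominal dimension.
    """
    # Iterate over the smaller dict for efficiency.
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[idx] for idx, val in u.items() if idx in v)

# Two vectors of nominal dimension 10**9, but with only a few nonzeros.
u = {0: 2.0, 500_000_000: 3.0}
v = {0: 4.0, 999_999_999: 1.0}
print(sparse_dot(u, v))  # 8.0
```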
So you have this dazzling amount of information. Where will you keep it?
Storing Your Information
Nowadays, the majority of options can be summarized as follows:
Storing Your Information
Locally: Computer / Drive
Locally: Company Server
Cloud: Remote Server
Cloud: Remote Server is [typically] somebody else’s computer!
No matter whether it is called Dropbox, Box, iCloud, AWS, Samsung Cloud, etc., it is what it is:
somebody else’s computer.
Privacy concerns and data safety:
Huge topics in the era of Big Data!
If your data are not in your hands, then where are they?
Cloud servers can be hardware located anywhere in the world.
Data can be stored in multiple copies, possibly in different locations, too,
to prevent loss in case of hardware or network failures
Availability and Safekeeping
Example:
Amazon Web Services
Global Infrastructure
What steps can be taken to protect your data?
Formal legislation
Data encryption, safety protocols, restricted access
Availability and Safekeeping
A very sensitive topic in the nascent stages of new technology.
Technology offers great business opportunities, but…
Caution needed to prevent putting the cart before the horse
Assume your data is kept …safe …somewhere. What comes next?
Now What?
In academic environments:
Use data to answer scientific questions
about a phenomenon or study an attribute
of interest.
The Next Step
In a business context:
Gain insight into a problem to understand the market,
optimize operations, increase profit, etc.; this is commonly
expressed as aiming to increase Business Intelligence.
Big Data analysis is conceptually similar to any other data analysis.
First step: Perform preliminary exploratory analysis
Obtain data
― Make records available within a data processing environment
Data may be accessed at storage location or brought over locally
― Ensure all analysis-relevant datasets are present
Such as unprocessed / raw data, different contributing data collections.
Preliminary Data Exploration
[Figure: a desktop sends a processing request to a server. A decision tree classifies credit risk by testing Income > $40K, Debt < 10% of Income, and Debt = 0%, with leaves labeled Good and Bad Credit Risks.]
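The decision tree in this slide’s figure can be read as a handful of nested rules. A sketch in Python, under one plausible reading of the branches (the thresholds come from the figure; the exact branch layout is an assumption):

```python
def credit_risk(income, debt):
    """Classify a credit applicant with the slide's decision tree.

    One plausible reading of the figure: higher earners are good risks
    if debt stays under 10% of income; lower earners only with no debt.
    """
    if income > 40_000:
        return "Good" if debt < 0.10 * income else "Bad"
    return "Good" if debt == 0 else "Bad"

print(credit_risk(50_000, 2_000))  # Good
print(credit_risk(30_000, 5_000))  # Bad
```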
Inspect data
― For missing values and errors
Check for type mismatch
Missing values: remove the record? Assign a default?
― Additional quality checks
Check for out-of-bound values (Ex.: remove negative records from a strictly positive variable)
Transform variables as needed (Ex.: isolate the street number from an address string)
Derive secondary variables as needed (Ex.: time duration from a date range)
Isolate and handle extremes, if it makes sense to do so
Preliminary Data Exploration
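The inspection steps above can be sketched in Python on a list of record dicts. The field names (`amount`, `age`) and the cleaning policy are hypothetical illustrations, not a prescribed procedure:

```python
def clean_records(records, default_age=None):
    """Apply simple quality checks to a list of record dicts.

    Drops records whose 'amount' is missing or negative (out-of-bounds
    for a positive variable); for a missing 'age', either assigns a
    default or drops the record.
    """
    cleaned = []
    for rec in records:
        # Out-of-bounds check: 'amount' must be present and non-negative.
        if rec.get("amount") is None or rec["amount"] < 0:
            continue
        # Missing value: assign a default, or drop the record.
        if rec.get("age") is None:
            if default_age is None:
                continue
            rec = {**rec, "age": default_age}
        cleaned.append(rec)
    return cleaned

raw = [{"amount": 10.0, "age": 34},
       {"amount": -5.0, "age": 40},   # negative amount: removed
       {"amount": 7.5, "age": None}]  # missing age: defaulted
print(clean_records(raw, default_age=30))
```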
Possibly run summary calculations
― For example, compute minimum, maximum, average, etc.
Insight from charts, plots, maps
― Visualization can be a huge aid to cognitive understanding
Preliminary Data Exploration
Scale matters in Big Data. Visualization can be all the more elaborate and
important when exploring complexity and multiple dimensions in data.
Preliminary Data Exploration
…At the end of the day, are summary statistics and pretty pictures enough?
Most typically, one needs deeper insight to help drive decision-making.
Very often, one may want to
predict a variable
― Examples: Assess sales volume; project the number of passengers for an airline
select between multiple options when taking action
― Examples: Accept or reject a transaction? Has a threshold been exceeded?
classify or rank a series of items
― Examples: How can a business distribute retail stores? Which customers are loyal?
To provide answers to similar problems, one must seek to
understand behavior of a target variable as a function of the data features.
Transition To Data Workbench
One starts with the output of the first exploratory step, that is the so-called
Analytics Data Set (ADS). The ADS contains all the features (variables) that
are assessed to be relevant to the target variable.
In each problem, one ultimately seeks to build a sufficiently accurate
representation of reality to enable inference about the target variable.
Transition To Data Workbench
This is done in the next step, otherwise known as
Second step: Data modeling
A model is a representation of reality.
Data modeling serves to discover and establish associations among
the data that describe the target variable accurately.
How do you fit a model to data?
Keep in mind the underdetermination thesis: in principle, there may exist an
infinite number of curves that fit a given dataset.
How do you fit a model to data?
A theoretical model can eliminate the bulk of possibilities and provide us
with a few meaningful ones.
We distinguish two main data modeling categories: prediction and clustering.
Prediction: A model is trained by data to develop data-driven behavior
― Training data: the subset used to train and validate the model’s behavior
― Testing/scoring data: the held-out subset on which the model yields predictions
― Also known as: Supervised learning
Data Modeling: Prediction
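The training/testing split described above can be sketched in a few lines of Python; the 70/30 split fraction and the fixed seed are illustrative choices:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Split rows into training and testing subsets.

    Shuffles with a fixed seed for reproducibility, then slices off the
    requested fraction as the testing set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]  # (train, test)

train, test = train_test_split(range(100), test_fraction=0.3)
print(len(train), len(test))  # 70 30
```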
Prediction: Commonly used methodologies include:
― Regression: Family of techniques suitable for a variety of tasks; some types are
― Linear: Suitable for continuous variable prediction
― Logistic: Suitable for prediction of binary outcome or class variables (discrete values)
― k-Nearest Neighbor: Prediction based on averaged response from k “nearest” samples
― Ridge: adds a regularization penalty to stabilize ill-posed problems and curb overfitting
― Neural Networks: Parametric, layered models inspired by how neurons work
― Decision Trees: Node-based algorithm for classification and regression. Each node makes a
decision by using a condition on one of the input features.
― Random Forests: An ensemble of decision trees. Each tree performs a partial
analysis; the forest then averages the trees’ answers.
Data Modeling: Prediction
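As an illustration of the simplest entry in the list, linear regression for a continuous variable with one feature can be fit by ordinary least squares. A minimal pure-Python sketch:

```python
def fit_line(xs, ys):
    """Ordinary least squares for simple linear regression.

    Returns (slope, intercept) minimizing the squared prediction error,
    using the closed-form covariance/variance formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Noise-free data on the line y = 2x + 1 recovers the coefficients.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```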
Clustering: Associate and categorize data in groups (clusters) on the basis
of specified group characteristics
― The model can be used to split a data set into a desired number of clusters
― Also known as: Unsupervised learning
Data Modeling: Clustering
Clustering: Commonly used methodologies include:
― K-Means Algorithm: Data separated into specified number of clusters around
center points called centroids. Objective is to minimize each cluster’s data
distance from the cluster centroid. Done by minimizing a criterion called inertia.
― Hierarchical Clustering: Family of clustering algorithms. Objective is to build
nested clusters in the form of a dendrogram tree. The tree root is a unique cluster
that gathers all the samples; the leaves are clusters with a single sample.
― Variations / Combinations of the above that include
― Spectral Clustering: Performs a low-dimensional embedding of the affinity matrix between
samples; then applies K-Means.
― Interactive Clustering: First uses K-Means to create small clusters; then applies hierarchical
clustering to each one of those.
Data Modeling: Clustering
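The K-Means procedure described above (assign each point to its nearest centroid, move each centroid to its cluster mean, repeat until stable) can be sketched in pure Python; the sample points and the random centroid initialization are illustrative:

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Minimal K-Means on a list of coordinate tuples.

    Alternates the assignment step (nearest centroid by squared
    distance) and the update step (centroid = cluster mean) until the
    centroids stop moving. Returns (centroids, clusters).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        updated = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centroids[j] for j, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
```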
Aside from core statistical techniques adopted for data modeling…
Big Data analysis also gave rise to some very popular concepts in the field:
Machine Learning: ML is actually a Computer Science domain,
very similar to Computational Statistics. It focuses on using and combining statistical
methodologies to solve problems through the use of computers alone.
Artificial Intelligence: A neighboring concept, concerned with intelligence exhibited by machines.
Deep Learning: A class of ML algorithms
― Cascading multiple nonlinear processing units for feature transformation and/or
extraction. Based on unsupervised learning of multiple levels of data features.
Hierarchical representation: Higher level features derived from lower level ones.
― Multiple level approach: Different levels of abstraction; a hierarchy of concepts.
Data Modeling: Contemporary Trends
Often enough, specific methodologies are recommended over
others for particular business tasks.
Data Modeling: Business Applications
Business Issue / Data Mining Analytical Approaches
Customer Segmentation: Clustering, Factor Analysis, Ranking and Tiering
Propensity to Buy: Induction Trees, Logistic Regression, Neural Nets
Attrition: Induction Trees, Logistic Regression, Neural Nets
Lifetime Value: Net Present Value, Structural Equation Modeling
Purchase Sequence: Association/Affinity and Sequence Analysis, Time Series
Sales Forecasting: Time Series, Neural Nets, Linear Regression
Customer Acquisition and Prospecting: Induction Trees, Logistic Regression, Neural Nets
Profitability Analysis: Activity-based Costing, Process-based Costing
Campaign Effectiveness Assessment: Neural Nets, Rule Induction, Logistic Regression, Discriminant Analysis
Big Data Analysis Setup: Foundation
Everything we saw to this point requires a core framework to be built on...
Hardware, including:
Servers
Storage devices
Network infrastructure
Connectors
Computing resources
Software
for the engineering foundation to operate on
to enable data transfers
for hardware setup
for communication between hardware components and for processing requests
Big Data Analysis Setup: Architecture
Everything we saw to this point requires an architecture strategy.
Example: A very simple strategy:
I can do everything from the comfort of my multi-core computer!
Often in practice, one of the following computing architectures is adopted:
[Figure: three common computing architectures (side-by-side, distributed computing, in-database), showing PC clients, databases, and a server or cluster of servers exchanging data extracts, requests, and results.]
[Figure: the same credit-risk decision tree (Income > $40K, Debt < 10% of Income, Debt = 0%; Good/Bad Credit Risks) processed under two setups. Desktop-and-server analytic architecture: sample data and a processing request travel between desktop and server before results return. In-database analytic architecture: the analysis runs where the data lives and only results move, yielding exponential performance improvement.]
In-Database architecture may carry significant speed advantages.
In business applications, the architecture strategy further implies how data
are accessed to enable lower cost and higher speed in decision-making.
[Figure: fragmented vs. integrated data approach. Left: operational systems feed decision makers directly. Right: an Integrated Data Warehouse (IDW) sits between operational systems and decision makers.]
Big Data Analysis Setup: Tools
Everything we saw to this point requires programming and software.
Programming languages and software packages provide the building
blocks for implementation of suitable algorithms and analysis with Big Data
Analytical frameworks and interfaces provide creative programming
platforms that facilitate data analysis and understanding.
Big Data Analysis Setup: People
Above all, everything we saw to this point requires people at the helm.
A sign of the times? The industry no longer needs just a mathematician, a statistician, a programmer, or a general scientist.
The new superstar in the age of Big Data is a…
Data Scientist: coined for a person with a most versatile skill set who performs all-in-one tasks such as
handling datasets of any size computationally
possessing statistical prowess and modeling skills
understanding and programming relational data storage (databases)
solving problems, visualizing results, and communicating well
Data analysis yields an optimally fit model for prediction.
For an individual or a small team, this is where a Big Data analysis might
come to completion. In an industrial/commercial setting, however:
Data refresh: Customer data might be in constant flow.
― Deliverable must account for streaming data flows and continuous model use
In Production: Model Management Aspects
[Figure: a toolchain spanning languages, math & stats, data mining, business intelligence, and applications, serving engineers, data scientists, business analysts, marketing, front-line workers, operational systems, customers/partners, and executives.]
Scoring, validation, and model health: Model must be kept current.
― With time, data characteristics might change and prediction may deteriorate.
Mechanisms should perform quality checks and support model retraining.
In Production: Model Management Aspects
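A crude sketch of such a model-health check in Python: flag retraining when the live data’s mean drifts several training standard deviations from the training mean. The 3.0 threshold and the sample values are illustrative choices, not a standard:

```python
def drift_score(train_values, live_values):
    """Distance of the live mean from the training mean, in units of
    the training standard deviation. Large scores suggest the data
    characteristics have changed and the model may need retraining."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = (sum((x - mean) ** 2 for x in train_values) / n) ** 0.5
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - mean) / std

train = [10, 11, 9, 10, 10, 12, 8, 10]
print(drift_score(train, [10, 11, 9, 10]) < 1.0)   # True: data looks stable
print(drift_score(train, [19, 21, 20, 22]) > 3.0)  # True: drift, retrain
```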
Promotion: Model may need to circulate across an organization.
― Different teams might require training to learn about a model
Champion model: Analysis may indicate multiple prevailing solutions.
― Different teams might need to select their own champion model for usage from among
challenger models. In addition, this might be a repeatable, periodic process.
In Production: Model Management Aspects
In industrial/commercial settings, a successful model could be converted
into an application for broader adoption. Topics pertaining to this case are:
Application development: Fitting a new application into an existing ecosystem
― Effort to retain model functionality and provide users with control over
parameters and variables to specify
Deployment and Extensions: Distributing,
maintaining and extending the application
In Production: Application Management Aspects
[Figure: a conceptual analytical framework. Data ingest (stream and batch, including IoT streams and edge systems) feeds data stores under version control and workflow management. A Data Science Lab hosts the data science workbench (algorithms, tools, feature engineering, analytics ADS) and a modeling analytic data set repository with ADS metadata. Production Analytics spans model management (data refresh, scoring/validation, promotion, health statistics, champion & challenger) and application management (application deployment and extensions, data integration interface, data profiling, lifecycle dashboard, provisioning, common scoring library), backed by a scoring data set repository with scoring metadata.]
Review: Flow Overview For Big Data Analysis
What do you want to do with Big Data? A conceptual analytical framework