TRANSCRIPT
Analysis of Big Data
Tim Miller, Sr. Analytics Consultant – Teradata
Alexander Kolovos, Ph.D., Advanced Analytics Software Engineer – Teradata
March 28, 2017
2 © 2017 Teradata
Tim Miller, Senior Analytics Consultant, Teradata Corporation
• Expertise in advanced analytic software, systems and methodologies.
• Principal engineer for the first commercial in-database data mining system, Teradata Warehouse Miner.
• Consultant to Teradata analytic partners (SAS, SPSS, etc.) and customers.
• Retired youth basketball, football, baseball, softball, soccer, etc. coach
Alexander Kolovos, Ph.D., Advanced Analytics Software Engineer, Teradata Corporation
• Expertise in analytical methodologies and platforms
• Specialization in space-time data and stochastic predictive analysis
• Ph.D. in Sciences and Engineering
• 4 years at Teradata (Analytics Engineer); 6 years at SAS (spatial software expert)
• Loves language, theater, & rock music
Your Presenters
• Big Data: Brief review
• Storage and Availability: Information in your hands. Wanna keep it where?
• Analysis
– Exploratory and summary. Is this enough?
– Can pretty pictures tell stories?
– Recreate the world around you
– Tools and tricks of the trade
• Walking the walk: The foundation, the strategy, the tools
• Managing the big story
– Model management
– Application management
Overview
The last half-century marked precipitous technological advances that gradually:
provided unprecedented computing power to speed up calculations that were previously time-consuming or even time-prohibitive
enabled progressively increasing monitoring, recording, and storing of empirical information, specifically focusing on data measurements
Big Data: Brief Review
Schematic depiction of Moore’s Law (computing power doubles roughly every two years, up to limits imposed by the laws of physics)
A more sober approach appears to have followed an initial frenzy about
the availability of large volumes of information. According to Hortonworks
Inc., a developer of the Hadoop platform:
Big data describes the realization of greater business intelligence
by storing, processing, and analyzing data that was previously ignored or siloed
due to the limitations of traditional data management technologies.
Big Data: Brief Review
Big Data: Brief Review
Big Data is the 3 Vs:
Variety: Structured and unstructured data
Volume: Tera- (10^12), peta- (10^15), and even exabytes (10^18) of data
Velocity: Data flows into your organization at an increasing rate
Big Data brings forward the issue of scaling:
Solve problems old or new, trivial or elaborate
in entirely new frameworks characterized by increasing data sizes
Face new challenges: Conceive and apply appropriate methodologies
Maintain competitive performance
Engineer effective hardware architectures
Generate suitable software solutions
→ Example: Handle matrix inversions for increasingly large matrix sizes
→ Example: Deal with sparse data in very large dimensions
Big Data: Brief Review
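The second example above, sparse data in very large dimensions, is typically handled by storing only the nonzero entries. A minimal sketch in Python, using a plain dict as the sparse container (a stand-in for dedicated structures such as compressed sparse matrices):

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Only nonzero entries are stored, so memory and work scale with the
    number of nonzeros, not with the (possibly huge) nominal dimension.
    """
    # Iterate over the smaller dict for efficiency.
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[idx] for idx, val in u.items() if idx in v)

# Two vectors of nominal dimension 10**9, but with only a few nonzeros.
u = {0: 2.0, 500_000_000: 3.0}
v = {0: 4.0, 999_999_999: 1.0}
print(sparse_dot(u, v))  # 8.0
```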
So you have this dazzling amount of information. Where will you keep it?
Storing Your Information
Nowadays, the majority of options can be summarized as follows:
Storing Your Information
Locally: Computer / Drive
Locally: Company Server
Cloud: Remote Server
Cloud: Remote Server is [typically] somebody else’s computer!
No matter whether it is called Dropbox, Box, iCloud, AWS, Samsung Cloud, etc., it is what it is:
somebody else’s computer.
Privacy concerns and data safety:
Huge topics in the era of Big Data!
If your data are not in your hands, then where are they?
Cloud servers can be hardware located anywhere in the world.
Data can be stored in multiple copies, possibly in different locations, too,
to prevent loss in case of hardware or network failures
Availability and Safekeeping
Example:
Amazon Web Services
Global Infrastructure
What steps can be taken to protect your data?
Formal legislation
Data encryption, safety protocols, restricted access
Availability and Safekeeping
A very sensitive topic in the nascent stages of new technology.
Technology offers great business opportunities, but…
Caution needed to prevent putting the cart before the horse
Assume your data is kept …safe …somewhere. What comes next?
Now What?
In academic environments:
Use data to answer scientific questions
about a phenomenon or study an attribute
of interest.
The Next Step
In a business context:
Gain insight into a problem to understand the market,
optimize operations, increase profit, etc.; this is commonly
expressed as aiming to increase Business Intelligence.
Big Data analysis is conceptually similar to any other data analysis.
First step: Perform preliminary exploratory analysis
Obtain data
― Make records available within a data processing environment
Data may be accessed at storage location or brought over locally
― Ensure all analysis-relevant datasets are present
Such as unprocessed / raw data, different contributing data collections.
Preliminary Data Exploration
[Figure: a desktop sends a processing request to a server. A decision tree classifies credit risk by testing Income > $40K, Debt < 10% of Income, and Debt = 0%, with leaves labeled Good and Bad Credit Risks.]
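The decision tree in this slide’s figure can be read as a handful of nested rules. A sketch in Python, under one plausible reading of the branches (the thresholds come from the figure; the exact branch layout is an assumption):

```python
def credit_risk(income, debt):
    """Classify a credit applicant with the slide's decision tree.

    One plausible reading of the figure: higher earners are good risks
    if debt stays under 10% of income; lower earners only with no debt.
    """
    if income > 40_000:
        return "Good" if debt < 0.10 * income else "Bad"
    return "Good" if debt == 0 else "Bad"

print(credit_risk(50_000, 2_000))  # Good
print(credit_risk(30_000, 5_000))  # Bad
```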
Inspect data
― For missing values and errors
Check for type mismatch
Missing values: remove the record? Assign a default?
― Additional quality checks
Check for out-of-bound values (Ex.: remove negative records from a strictly positive variable)
Transform variables as needed (Ex.: isolate the street number from an address string)
Derive secondary variables as needed (Ex.: time duration from a date range)
Isolate and handle extremes, if it makes sense to do so
Preliminary Data Exploration
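The inspection steps above can be sketched in Python on a list of record dicts. The field names (`amount`, `age`) and the cleaning policy are hypothetical illustrations, not a prescribed procedure:

```python
def clean_records(records, default_age=None):
    """Apply simple quality checks to a list of record dicts.

    Drops records whose 'amount' is missing or negative (out-of-bounds
    for a positive variable); for a missing 'age', either assigns a
    default or drops the record.
    """
    cleaned = []
    for rec in records:
        # Out-of-bounds check: 'amount' must be present and non-negative.
        if rec.get("amount") is None or rec["amount"] < 0:
            continue
        # Missing value: assign a default, or drop the record.
        if rec.get("age") is None:
            if default_age is None:
                continue
            rec = {**rec, "age": default_age}
        cleaned.append(rec)
    return cleaned

raw = [{"amount": 10.0, "age": 34},
       {"amount": -5.0, "age": 40},   # negative amount: removed
       {"amount": 7.5, "age": None}]  # missing age: defaulted
print(clean_records(raw, default_age=30))
```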
Possibly run summary calculations
― For example, compute minimum, maximum, average, etc.
Insight from charts, plots, maps
― Visualization can be a huge aid to cognitive understanding
Preliminary Data Exploration
Scale matters in Big Data. Visualization can be all the more elaborate and
important when exploring complexity and multiple dimensions in data.
Preliminary Data Exploration
…At the end of the day, are summary statistics and pretty pictures enough?
Most typically, one needs deeper insight to help drive decision-making.
Very often, one may want to
predict a variable
― Examples: Assess sales volume; project the number of passengers for an airline
select between multiple options when taking action
― Examples: Accept or reject a transaction? Has a threshold been exceeded?
classify or rank a series of items
― Examples: How can a business distribute retail stores? Which customers are loyal?
To provide answers to similar problems, one must seek to
understand behavior of a target variable as a function of the data features.
Transition To Data Workbench
One starts with the output of the first exploratory step, that is the so-called
Analytics Data Set (ADS). The ADS contains all the features (variables) that
are assessed to be relevant to the target variable.
In each problem, one ultimately seeks to build a sufficiently accurate
representation of reality to enable inference about the target variable.
Transition To Data Workbench
This is done in the next step, otherwise known as
Second step: Data modeling
A model is a representation of reality.
Data modeling serves to discover and establish associations among
the data that describe the target variable accurately.
How do you fit a model to data?
Keep in mind the underdetermination thesis: in principle, there may exist an
infinite number of curves that fit a given dataset.
How do you fit a model to data?
A theoretical model can eliminate the bulk of possibilities and provide us
with a few meaningful ones.
We distinguish two main data modeling categories: prediction and clustering.
Prediction: A model is trained by data to develop data-driven behavior
― Training data: the subset used to train and validate the model’s behavior
― Testing/scoring data: the held-out subset on which the model yields predictions
― Also known as: Supervised learning
Data Modeling: Prediction
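The training/testing split described above can be sketched in a few lines of Python; the 70/30 split fraction and the fixed seed are illustrative choices:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Split rows into training and testing subsets.

    Shuffles with a fixed seed for reproducibility, then slices off the
    requested fraction as the testing set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]  # (train, test)

train, test = train_test_split(range(100), test_fraction=0.3)
print(len(train), len(test))  # 70 30
```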
Prediction: Commonly used methodologies include:
― Regression: Family of techniques suitable for a variety of tasks; some types are
― Linear: Suitable for continuous variable prediction
― Logistic: Suitable for prediction of binary outcome or class variables (discrete values)
― k-Nearest Neighbor: Prediction based on averaged response from k “nearest” samples
― Ridge: adds a regularization penalty to stabilize ill-posed problems and curb overfitting
― Neural Networks: Parametric, layered models inspired by how neurons work
― Decision Trees: Node-based algorithm for classification and regression. Each node makes a
decision by using a condition on one of the input features.
― Random Forests: An ensemble of decision trees. Each tree performs a partial
analysis; the forest then averages the trees’ answers.
Data Modeling: Prediction
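As an illustration of the simplest entry in the list, linear regression for a continuous variable with one feature can be fit by ordinary least squares. A minimal pure-Python sketch:

```python
def fit_line(xs, ys):
    """Ordinary least squares for simple linear regression.

    Returns (slope, intercept) minimizing the squared prediction error,
    using the closed-form covariance/variance formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Noise-free data on the line y = 2x + 1 recovers the coefficients.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```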
Clustering: Associate and categorize data in groups (clusters) on the basis
of specified group characteristics
― The model can be used to split a data set into a desired number of clusters
― Also known as: Unsupervised learning
Data Modeling: Clustering
Clustering: Commonly used methodologies include:
― K-Means Algorithm: Data separated into specified number of clusters around
center points called centroids. Objective is to minimize each cluster’s data
distance from the cluster centroid. Done by minimizing a criterion called inertia.
― Hierarchical Clustering: Family of clustering algorithms. Objective is to build
nested clusters in the form of a dendrogram tree. The tree root is a unique cluster
that gathers all the samples; the leaves are clusters with a single sample.
― Variations / Combinations of the above that include
― Spectral Clustering: Performs a low-dimensional embedding of the affinity matrix between
samples; then applies K-Means.
― Interactive Clustering: First uses K-Means to create small clusters; then applies hierarchical
clustering to each one of those.
Data Modeling: Clustering
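The K-Means procedure described above (assign each point to its nearest centroid, move each centroid to its cluster mean, repeat until stable) can be sketched in pure Python; the sample points and the random centroid initialization are illustrative:

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Minimal K-Means on a list of coordinate tuples.

    Alternates the assignment step (nearest centroid by squared
    distance) and the update step (centroid = cluster mean) until the
    centroids stop moving. Returns (centroids, clusters).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        updated = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centroids[j] for j, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
```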
Aside from core statistical techniques adopted for data modeling…
Big Data analysis also gave rise to some very popular concepts in the field:
Machine Learning: ML is actually a Computer Science domain,
very similar to Computational Statistics. It focuses on using and combining statistical
methodologies to solve problems through the use of computers alone.
Artificial Intelligence: A neighboring concept, concerned with intelligence exhibited by machines.
Deep Learning: A class of ML algorithms
― Cascading multiple nonlinear processing units for feature transformation and/or
extraction. Based on unsupervised learning of multiple levels of data features.
Hierarchical representation: Higher level features derived from lower level ones.
― Multiple level approach: Different levels of abstraction; a hierarchy of concepts.
Data Modeling: Contemporary Trends
Often enough, specific methodologies are recommended over
others for particular business tasks.
Data Modeling: Business Applications
Business Issue / Data Mining Analytical Approaches
Customer Segmentation: Clustering, Factor Analysis, Ranking and Tiering
Propensity to Buy: Induction Trees, Logistic Regression, Neural Nets
Attrition: Induction Trees, Logistic Regression, Neural Nets
Lifetime Value: Net Present Value, Structural Equation Modeling
Purchase Sequence: Association/Affinity and Sequence Analysis, Time Series
Sales Forecasting: Time Series, Neural Nets, Linear Regression
Customer Acquisition and Prospecting: Induction Trees, Logistic Regression, Neural Nets
Profitability Analysis: Activity-based Costing, Process-based Costing
Campaign Effectiveness Assessment: Neural Nets, Rule Induction, Logistic Regression, Discriminant Analysis
Big Data Analysis Setup: Foundation
Everything we saw to this point requires a core framework to be built on...
Hardware, including:
Servers
Storage devices
Network infrastructure
Connectors
Computing resources
Software
for the engineering foundation to operate on
to enable data transfers
for hardware setup
for communication between hardware components and for processing requests
Big Data Analysis Setup: Architecture
Everything we saw to this point requires an architecture strategy.
Example: A very simple strategy:
I can do everything from the comfort of my multi-core computer!
Often in practice, one of the following computing architectures is adopted:
[Figure: three common computing architectures (side-by-side, distributed computing, in-database), showing PC clients, databases, and a server or cluster of servers exchanging data extracts, requests, and results.]
[Figure: the same credit-risk decision tree (Income > $40K, Debt < 10% of Income, Debt = 0%; Good/Bad Credit Risks) processed under two setups. Desktop-and-server analytic architecture: sample data and a processing request travel between desktop and server before results return. In-database analytic architecture: the analysis runs where the data lives and only results move, yielding exponential performance improvement.]
In-Database architecture may carry significant speed advantages.
In business applications, the architecture strategy further implies how data
are accessed to enable lower cost and higher speed in decision-making.
[Figure: fragmented vs. integrated data approach. Left: operational systems feed decision makers directly. Right: an Integrated Data Warehouse (IDW) sits between operational systems and decision makers.]
Big Data Analysis Setup: Tools
Everything we saw to this point requires programming and software.
Programming languages and software packages provide the building
blocks for implementation of suitable algorithms and analysis with Big Data
Analytical frameworks and interfaces provide creative programming
platforms that facilitate data analysis and understanding.
Big Data Analysis Setup: People
Above all, everything we saw to this point requires people at the helm.
A sign of the times? The industry no longer needs just a mathematician, a statistician, a programmer, or a general scientist.
The new superstar in the age of Big Data is a…
Data Scientist: coined for a person with a most versatile skill set who performs all-in-one tasks such as
handling datasets of any size computationally
possessing statistical prowess and modeling skills
understanding and programming relational data storage (databases)
solving problems, visualizing results, and communicating well
Data analysis yields an optimally fit model for prediction.
For an individual or a small team, this is where a Big Data analysis might
come to completion. In an industrial/commercial setting, however:
Data refresh: Customer data might be in constant flow.
― Deliverable must account for streaming data flows and continuous model use
In Production: Model Management Aspects
[Figure: a toolchain spanning languages, math & stats, data mining, business intelligence, and applications, serving engineers, data scientists, business analysts, marketing, front-line workers, operational systems, customers/partners, and executives.]
Scoring, validation, and model health: Model must be kept current.
― With time, data characteristics might change and prediction may deteriorate.
Mechanisms should perform quality checks and support model retraining.
In Production: Model Management Aspects
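A crude sketch of such a model-health check in Python: flag retraining when the live data’s mean drifts several training standard deviations from the training mean. The 3.0 threshold and the sample values are illustrative choices, not a standard:

```python
def drift_score(train_values, live_values):
    """Distance of the live mean from the training mean, in units of
    the training standard deviation. Large scores suggest the data
    characteristics have changed and the model may need retraining."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = (sum((x - mean) ** 2 for x in train_values) / n) ** 0.5
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - mean) / std

train = [10, 11, 9, 10, 10, 12, 8, 10]
print(drift_score(train, [10, 11, 9, 10]) < 1.0)   # True: data looks stable
print(drift_score(train, [19, 21, 20, 22]) > 3.0)  # True: drift, retrain
```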
Promotion: Model may need to circulate across an organization.
― Different teams might require training to learn about a model
Champion model: Analysis may indicate multiple prevailing solutions.
― Different teams might need to select their own champion model for usage from among
challenger models. In addition, this might be a repeatable, periodic process.
In Production: Model Management Aspects
In industrial/commercial settings, a successful model could be converted
into an application for broader adoption. Topics pertaining to this case are:
Application development: Fitting a new application into an existing ecosystem
― Effort to retain model functionality and provide users with control over
parameters and variables to specify
Deployment and Extensions: Distributing,
maintaining and extending the application
In Production: Application Management Aspects
[Figure: a conceptual analytical framework. Data ingest (stream and batch, including IoT streams and edge systems) feeds data stores under version control and workflow management. A Data Science Lab hosts the data science workbench (algorithms, tools, feature engineering, analytics ADS) and a modeling analytic data set repository with ADS metadata. Production Analytics spans model management (data refresh, scoring/validation, promotion, health statistics, champion & challenger) and application management (application deployment and extensions, data integration interface, data profiling, lifecycle dashboard, provisioning, common scoring library), backed by a scoring data set repository with scoring metadata.]
Review: Flow Overview For Big Data Analysis
What do you want to do with Big Data? A conceptual analytical framework